Confidence sets for model selection


HANNAY, Mark

Abstract

At first glance the goals of model selection might seem clear. Out of a set of possible models, we want to select the "best" or a subset of "best" models. This notion of "best" however is not well defined, since it obviously depends on the initial goals of the selection. In order to study the uncertainty in model selection, we introduce a new definition of a model, where the models are no longer defined through zero and non-zero components but through irrelevant and relevant components. Then, inspired by confidence intervals for estimated parameters, we propose a method to build confidence sets for model selection in a parametric setting, i.e. create sets of models within which the true model is included with a certain confidence. This allows us to perform inference on model selection. We discuss the computational challenges with such a method, how to find p-values (for the model) and consistency in model selection, and through a data set and a simulation study we show the implications of this new method.

HANNAY, Mark. Confidence sets for model selection. Doctoral thesis: Univ. Genève, 2016, no. GSEM 31

URN : urn:nbn:ch:unige-864673

DOI : 10.13097/archive-ouverte/unige:86467

Available at:

http://archive-ouverte.unige.ch/unige:86467

Disclaimer: layout of this document may differ from the published version.


Confidence sets for model selection

by

Mark Hannay

A thesis submitted to the

Geneva School of Economics and Management, University of Geneva, Switzerland,

in fulfillment of the requirements for the degree of PhD in Statistics

Members of the thesis committee:

Prof. Eva Cantoni, Chair, University of Geneva
Prof. Elvezio Ronchetti, Adviser, University of Geneva
Prof. Werner Stahel, Eidgenössische Technische Hochschule Zürich

Thesis No. 31
August 2016


Acknowledgements

First of all, I would like to express my gratitude to my adviser Prof. Elvezio Ronchetti. Over these last 3 years, he has always had time for me and has consistently supported me in my endeavours, for which I am immensely thankful.

Secondly, I would like to thank the chair of my thesis committee Prof. Eva Cantoni, for her scrupulous study of the manuscript and her insightful comments on the dissertation. During my time at the University of Geneva, I have had the good fortune of having my office next to hers and have therefore also benefited from her encouraging and positive attitude.

My sincere thanks also go to Prof. Werner Stahel, who was not only another valuable member of the thesis committee but is someone I consider a dear friend. His constant questioning not only improved the text but also forced me to rethink many passages and ultimately allowed me to make surprising links between different fields.

Last but not least, I would like to thank my wife Digisha Hannay for putting up with all those long nights of me typing away on the laptop and for all those moments of me with my head in the clouds. You have been incredibly supportive and caring, which are only two of your many beautiful attributes.


Résumé

At first sight, the goal of model selection is clear. Among a set of possible models, we want to choose the "best" model or a subset of "best" models. This notion of "best" model is not precise, since it depends on the initial goals of the selection.

To study the variability of model selection, we introduce a new definition of a model. With the help of this new definition, it is possible to create a confidence set for model selection in the setting of parametric statistics, i.e. we can create a set of models that includes the true model with a certain confidence. This can be interpreted as an extension of the notion of a confidence interval. With this set it is then possible to perform statistical inference on model selection.

In addition, we propose an algorithm for this new method, which allows us to find p-values (for a model), and we show the consistency of model selection. Finally, using a data set and a simulation study, we show the implications of this new method.


Contents

Acknowledgements
Abstract
Résumé
Introduction

1 Univariate model selection
1.1 Likelihood approach
1.1.1 Log-likelihood
1.1.2 The dilemma of the zero component
1.1.3 Relevance
1.1.4 Maximum likelihood for relevance
1.1.5 Signed relevance
1.1.6 Maximum likelihood for signed relevance
1.2 Bayesian approach
1.2.1 Posterior model probability
1.2.2 Spike and slab priors
1.2.3 Point-normal priors
1.2.4 Choice of nuisance parameter
1.2.5 Relevance
1.2.6 Signed relevance
1.3 Uncertainty in model selection
1.3.1 Relevance
1.3.2 Signed relevance
1.4 Conclusion

2 Testing theory for multiparameter hypotheses
2.1 Set-up
2.2 Two hypotheses
2.2.1 Rejectable hypotheses
2.2.2 The d-test: definition and properties
2.3 Component-wise testing
2.3.1 Testing relevance and irrelevance
2.3.2 Choice of relevance threshold in linear models
2.3.3 Application to Prostate Cancer Data
2.4 Group testing
2.4.1 Irrelevance testing
2.4.2 Relevance testing
2.4.3 Application on U.S. states data 1990-91
2.5 Intersection testing
2.6 Conclusion

3 Confidence sets
3.1 Set-up
3.2 Model selection
3.2.1 Model examples
3.2.2 Maximum likelihood model estimator
3.3 Uncertainty in model selection
3.4 Links to existing literature and introduction of descriptive methods
3.4.1 Methods for relevance
3.4.2 Methods for signed relevance
3.5 Numerical results
3.5.1 Case: p = 2
3.5.2 Case: p = 3
3.5.3 Case: p = 4
3.6 Data examples
3.6.1 Prostate cancer data
3.6.2 Diabetes data
3.7 Conclusion

4 Computational considerations
4.1 Distance to linearly constrained sets
4.1.1 Regularization
4.1.2 Path search and algorithm
4.2 Distance to complement of linearly constrained set
4.3 Distances to closed balls in group testing
4.3.1 Regularization
4.3.2 Path search and algorithm
4.4 Distance to complement of closed balls in group testing
4.5 Recovering the critical values
4.5.1 Linearly constrained sets
4.5.2 Complements of linearly constrained sets
4.6 Model set function
4.6.1 Exhaustive search
4.6.2 Recovering the critical values
4.7 Conclusion

Conclusion

A Technical arguments for Univariate normal models
B Proofs: Testing theory for multiparameter hypotheses
C Proofs: Confidence sets
D Proofs: Computational considerations


To Digisha.


Introduction

Model selection is the procedure of selecting the "best" model amongst a set of possible models based on a given data set. There are many reasons for wishing to do this. For instance, one may want to reduce the complexity of the model, or one may want to select a less complex model, which can lead to better predictions of future observations. Our main interest will be to select the true model, for purposes of inference.

We restrict our work to model selection in a parametric setting, which gains from the added structure with respect to the more general framework, while still encompassing many important cases. More specifically, in a parametric setting, selecting a model usually entails finding the zero components of the parameter of interest, which we denote as $\theta$, based on a data set with $n$ observations, i.e. $\{X_i\}_{i=1}^n$. We first introduce the notion of a model in this particular case, then briefly discuss some of the most common methods of model selection, and finally mention some of the existing literature on uncertainty in model selection. After this short introduction to the subject, we provide an outline.

Parametric models

Throughout, we assume the parameter of interest $\theta$ to have $p$ components. A model can then be represented by a set $A \subset \{1, \dots, p\}$, which represents the zero components of the parameter of interest. Thus, for $A \subset \{1, \dots, p\}$, we first define a naive type of model:

$$M_A : \theta_{[A]} = 0_{[A]},$$

where $0 \in \mathbb{R}^p$ is the zero vector. Consequently, because of the hierarchy in this particular structure, if $M_A$ is a correct model then so is $M_{\tilde{A}}$ for any $\tilde{A} \subset A$. Thus, in this context, selecting the true models would require first finding the biggest $A$ such that $M_A$ is true, and then providing a set with all the models $M_{\tilde{A}}$ for any $\tilde{A} \subset A$. This selection is redundant, since it only relies on a single model, which is why we propose to work with the following definition of a model for a fixed $A$ instead:

$$M_A : \theta_{[A]} = 0_{[A]} \text{ and } \theta_{[i]} \neq 0 \text{ for all } i \in A^c. \tag{1}$$

This definition implies that only a single model, among all the possible models, can be correct, and so selecting the true model is now well-defined.
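The difference between the naive definition and definition (1) can be made concrete with a small enumeration. The following is a minimal sketch, not part of the thesis: the function names `zero_set`, `naive_true_models` and `strict_true_models` and the example vector `theta` are hypothetical, and components are indexed from 1 as in the text.

```python
from itertools import combinations


def zero_set(theta):
    """Return the set A of 1-based indices whose component is exactly zero."""
    return frozenset(i for i, value in enumerate(theta, start=1) if value == 0)


def naive_true_models(theta):
    """All sets A for which the naive model theta[A] = 0 holds.

    Under the naive definition every subset of the zero set is a correct model.
    """
    zeros = sorted(zero_set(theta))
    return [
        frozenset(subset)
        for size in range(len(zeros) + 1)
        for subset in combinations(zeros, size)
    ]


def strict_true_models(theta):
    """All sets A for which definition (1) holds: zeros on A, non-zeros elsewhere."""
    return [zero_set(theta)]


# Hypothetical parameter with p = 4 components, two of which are zero.
theta = (0.0, 1.5, 0.0, -0.2)
naive = naive_true_models(theta)
strict = strict_true_models(theta)
print(len(naive), "correct models under the naive definition")
print(len(strict), "correct model under definition (1)")
```

With two zero components the naive definition admits every subset of the zero set as a correct model, whereas definition (1) singles out exactly one.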

Common methods for model selection

Most of the common methods for model selection are based on the log-likelihood function. Therefore, we begin by introducing this notion. Let $f_\theta$ be a density function for observations $X_i$, which we assume to be independent and identically distributed (i.i.d.). This leads to the log-likelihood function, defined as:

$$l(\theta) = \log\left(\prod_{i=1}^{n} f_\theta(X_i)\right) = \sum_{i=1}^{n} \log\left(f_\theta(X_i)\right).$$

This can be used to define estimators for each model, by maximizing the likelihood constrained to feasible points within a given model. More specifically, for a model $M_A$ we define:

$$\hat{\theta}_{M_A} = \arg\max_{\theta_{[A]} = 0_{[A]}} l(\theta).$$

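One common method for model selection is based on testing between nested models. More specifically, for $A_1 \subset A_2$, $M_{A_2}$ is said to be nested in $M_{A_1}$. Moreover, for $k = |A_2| - |A_1|$ and under mild conditions, when the constraint $\theta_{[A_2]} = 0_{[A_2]}$ holds we asymptotically have:

$$2l(\hat{\theta}_{M_{A_1}}) - 2l(\hat{\theta}_{M_{A_2}}) \sim \chi^2_k.$$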

Consequently, one can test between nested models. This allows for stepwise procedures, where one starts either at the full, respectively the null, model and tests backwards, respectively forwards, until a stopping rule is satisfied. See Bendel and Afifi, (1977) for details on the stopping rule.
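The nested test can be illustrated in the simplest setting. The sketch below is an assumption-laden example rather than the procedure of the thesis: it takes i.i.d. normal observations with known standard deviation, compares the unrestricted mean with the nested model of zero mean (so $k = 1$), and computes the statistic as twice the log-likelihood of the larger model minus twice that of the nested one. The names `log_likelihood`, `lr_statistic`, `chi2_1_pvalue` and the data values are hypothetical.

```python
import math


def log_likelihood(theta, data, sigma):
    """Normal log-likelihood with known standard deviation sigma."""
    n = len(data)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - theta) ** 2 for x in data) / (2 * sigma ** 2))


def lr_statistic(data, sigma):
    """2 l(theta_hat) - 2 l(0): larger model minus the nested model theta = 0."""
    theta_hat = sum(data) / len(data)  # unrestricted maximum likelihood estimate
    return (2 * log_likelihood(theta_hat, data, sigma)
            - 2 * log_likelihood(0.0, data, sigma))


def chi2_1_pvalue(stat):
    """P(chi2_1 > stat) = P(|Z| > sqrt(stat)) for a standard normal Z."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical data, used only to exercise the functions.
data = [0.3, -0.1, 0.8, 0.5, 0.2, 0.6, -0.4, 0.9]
stat = lr_statistic(data, sigma=1.0)
p_value = chi2_1_pvalue(stat)
print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

In this one-parameter case the statistic reduces to $n\hat{\theta}_n^2/\sigma^2$, which is why the p-value can be obtained from the standard normal distribution without a chi-square routine.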

The stepwise procedure has a major drawback, namely that we can only compare nested models along the testing path. Information theory offers an elegant response to this problem. To each model $M_A$ we can associate a value:

$$-2l(\hat{\theta}_{M_A}) + k(n, |A|).$$

The idea is that $k(n, |A|)$ should penalize bigger models. Since a larger $|A|$ means more zero components and hence a smaller model, $k(n, |A|)$ is monotone decreasing in $|A|$. One can now even compare non-nested models. The two main information criteria are the AIC criterion (Akaike, 1974) and the BIC criterion (Schwarz, 1978).

More recently, regularization methods have been proposed for model selection. These methods benefit from convexity and do not require computing values for each model. For instance, Tibshirani, (1996) defined the lasso estimator as:

$$\hat{\theta} = \arg\min_{\theta} \; -2l(\theta) + \lambda \sum_{i=1}^{p} |\theta_{[i]}|, \quad \text{where } \lambda \geq 0.$$

It has been well established by Bühlmann and Van De Geer, (2011) that, under mild conditions on the sparsity of $\theta$ and on the covariance matrix, selecting the non zero components of the lasso estimator for specific $\lambda$ performs well in variable selection. However, it is generally not consistent in model selection, prompting Zou, (2006) to penalize the components differently, which leads to the adaptive lasso. This new type of penalization method is then consistent in model selection.

Further common methods are model selection through cross validation (Shao, (1993) and Zhang, (1993)) and Bayesian model selection (Mitchell and Beauchamp, (1988) and Chipman et al., (2001)). The cross validation methods have asymptotic properties similar to those of information criteria, while the Bayesian method is based on model priors and parameter priors, with properties depending on the chosen priors. We will introduce Bayesian model selection more thoroughly in Chapter 1.


Uncertainty in model selection

The previous methods are commonly used as ”point estimates” of models, which on their own do not account for the uncertainty in model selection.

The cross validation methods can produce some notion of uncertainty in model selection, since for each model one has a standard error of the estimated error, as is done by Cantoni et al., (2007). However, the theory of uncertainty in model selection using these techniques has not been fully studied yet.

As previously mentioned, under some mild conditions, the Lasso can be used to build a set of variables, within which the true non zero components are included with a certain probability. Unfortunately, this set will generally also include zero components, even asymptotically. The adaptive lasso, which is consistent in model selection, does not solve this problem, since no notion of uncertainty for the adaptive step has been derived.

Bayesian methods offer the most striking response to the quantification of uncertainty in model selection by providing credibility sets. These sets rely on priors, though, which can make many practitioners uneasy. Once again, we shall investigate these properties in further detail in Chapter 1.

Fortunately, if we impose a threshold for the true non-zero components, we can recover the exact set of non zero variables asymptotically by many methods, including the Lasso or even the Zcut method described by Ishwaran and Rao, (2005). Both methods provide lists of variables with statistical guarantees. Unfortunately they do not provide lists of models, which is what would characterize uncertainty in model selection. To that end Hansen et al., (2011) introduced model confidence sets, which include the "best" model with a given confidence. More recently Ferrari and Yang, (2015) introduced a method of selecting a set of models in the linear model based on F-tests. Yet, mostly because of the definition of models used in these papers, their sets of models do not collapse asymptotically to the set containing only the true model.

Outline

Although it is widely used, we show in Chapter 1 that the definition of a model in Equation (1) poses a dilemma. This dilemma is caused by the focus on finding the exact zero components. We argue that we should instead focus on relevance, which leads to a new definition of a model. Furthermore, following the ideas of Hansen et al., (2011), we introduce confidence sets, which allow us to quantify the uncertainty in the selection, similarly to confidence intervals for estimation.

The new method we propose for model selection is based on testing theory. Therefore, in Chapter 2 we focus on testing theory in a general framework. We provide a method, based on the squared distance to the model of interest, to decide whether to reject or accept a model. Before discussing model selection, we then describe some interesting applications of this new testing method built on squared distances between models and apply it to a real data example.

In Chapter 3, we extend the confidence sets in the univariate case to the general case. The extension is based on aggregating all the individual decisions on accepting or rejecting models from Chapter 2. Furthermore, we supply multiple examples of possible models, which are simple extensions of existing notions of models and can be dealt with efficiently as long as $p$, the dimension of the parameter of interest, isn't too big. After first providing a simulation study to better understand the small sample properties of the new method, we use our newly developed method on two already widely studied data sets, yielding new interpretations.

Finally in Chapter 4, we focus on the computational methods used for testing and recovering the critical values. This particular chapter also adds some insight into how our proposed method reduces to well known methods in the standard cases. Most of the proposed computational methods are motivated by methods used in the growing field of regularization and rely heavily on convexity.


Chapter 1

Univariate model selection

In order to motivate our work, we start with the simplest model selection problem, which consists in choosing between two models defined by a single parameter. Let $M_0 : \theta = 0$ and $M_1 : \theta \neq 0$, so that $M_0$ and $M_1$ are different models conforming to the definition of a model in Equation (1), and let $X_i \sim N(\theta, \sigma^2)$ i.i.d. The goal is to determine which model is correct based on the data $\{X_i\}_{i=1}^n$.

We first present a frequentist approach and then a Bayesian approach to this problem. We show there are conceptual problems in both approaches, leading to new definitions of models based on relevance, irrelevance and signed relevance. By using extensions of the existing methods, we can study the decision problem related to these new models. In the last part, we propose a method to quantify the uncertainty in model selection for this particular framework.

Throughout this chapter, the statistical decisions are made with respect to the maximum likelihood estimator, $\hat{\theta}_n$. By using the distributional assumptions we have that $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N(\theta, \sigma^2/n)$.

1.1 Likelihood approach

1.1.1 Log-likelihood

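For given data $\{X_i\}_{i=1}^n$, under the assumption of normality and independence, we have that the log-likelihood function for known $\sigma$ is equal to:

$$l_n(\theta) = \log\left(\prod_{i=1}^{n} f_\theta(X_i)\right) = -n\log\left(\sqrt{2\pi}\,\sigma\right) - \sum_{i=1}^{n} \frac{1}{2\sigma^2}(X_i - \theta)^2,$$

where $f_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)$ is the density function.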

By definition $\hat{\theta}_n$ maximizes $l_n(\cdot)$ over $\mathbb{R}$ and we have that $\hat{\theta}_n \neq 0$ with probability 1. This implies that $\hat{\theta}_n$ is the maximum likelihood estimator over all possible values in $M_1$ with probability 1. On the other hand, since 0 is the only value in $M_0$, 0 is the maximum likelihood estimator over all possible values in $M_0$.

Furthermore, we have $l_n(0) < l_n(\hat{\theta}_n)$ with probability 1. This means that if we were to pick the model with the highest likelihood, we would always pick $M_1$, even in the case where $M_0$ is the true model.

Information theory provides an elegant solution to this problem by focusing on the study of penalized likelihoods. More specifically, there exist penalties $k_{0;n}$ and $k_{1;n}$, such that we associate the values $-2l_n(0) + k_{0;n}$ and $-2l_n(\hat{\theta}_n) + k_{1;n}$ to the models $M_0$ and $M_1$ respectively. We then select $M_0$ if $-2l_n(0) + k_{0;n} \leq -2l_n(\hat{\theta}_n) + k_{1;n}$, otherwise we select $M_1$. In fact if $k_{0;n} < k_{1;n}$, there is a strictly positive probability of selecting $M_0$. The most common choices for the penalties are $k_{i;n} = 2i$, the AIC criterion (Akaike, 1974), and $k_{i;n} = \log(n)\,i$, the BIC criterion (Schwarz, 1978).

1.1.2 The dilemma of the zero component

It is well known that the AIC criterion isn’t consistent in model selection, while the BIC criterion is. We revisit these known facts in our example and present a dilemma one is faced with when using these methods or indeed any other method.

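If one uses the AIC criterion to select the model, it can easily be shown that one selects $M_0$ if and only if $\hat{\theta}_n^2 \leq \frac{2}{n}\sigma^2$. If one uses the BIC criterion to select the model, one selects $M_0$ if and only if $\hat{\theta}_n^2 \leq \frac{\log(n)}{n}\sigma^2$. This means that for both criteria, the decision depends solely on $\hat{\theta}_n$. Therefore we define the decision functions $M_{AIC}, M_{BIC} : \mathbb{R} \to \{M_0, M_1\}$ as:

$$M_{AIC}(x) = \begin{cases} M_0 & \text{if } x^2 \leq \frac{2}{n}\sigma^2 \\ M_1 & \text{otherwise} \end{cases}, \qquad M_{BIC}(x) = \begin{cases} M_0 & \text{if } x^2 \leq \frac{\log(n)}{n}\sigma^2 \\ M_1 & \text{otherwise} \end{cases}.$$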
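The two decision rules translate directly into code. The following is a minimal sketch under the assumptions of this chapter (known $\sigma$, decision based only on the sample mean); the function names `select_aic` and `select_bic` and the string labels are hypothetical.

```python
import math


def select_aic(theta_hat, n, sigma):
    """AIC rule: keep M0 when theta_hat^2 <= 2 * sigma^2 / n."""
    return "M0" if theta_hat ** 2 <= 2 * sigma ** 2 / n else "M1"


def select_bic(theta_hat, n, sigma):
    """BIC rule: keep M0 when theta_hat^2 <= log(n) * sigma^2 / n."""
    return "M0" if theta_hat ** 2 <= math.log(n) * sigma ** 2 / n else "M1"


# Hypothetical estimate that falls between the two thresholds for n = 100.
theta_hat, n, sigma = 0.18, 100, 1.0
print("AIC selects", select_aic(theta_hat, n, sigma))
print("BIC selects", select_bic(theta_hat, n, sigma))
```

For $n \geq 8$ we have $\log(n) > 2$, so the BIC threshold is the wider of the two and an estimate can be declared zero by BIC while AIC keeps it in the model.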

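Let $P_{\tilde{\theta}}(A)$ be the probability of $A$ under the assumption that $\theta = \tilde{\theta}$. We now use this notation to study the probabilities of selecting $M_0$ in different cases. In fact it follows from the previous observations that for any $\theta \in \mathbb{R}$:

$$P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_0\right) = P\left(\left(Z + \frac{\sqrt{n}}{\sigma}\theta\right)^2 \leq 2\right),$$

$$P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_0\right) = P\left(\left(Z + \frac{\sqrt{n}}{\sigma}\theta\right)^2 \leq \log(n)\right),$$

where $Z \sim N(0, 1)$.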

First we assume $M_0$ to be true, i.e. $\theta = 0$. The probability of selecting the wrong model with the AIC criterion, i.e. $P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_1\right)$, is equal to $P(Z^2 > 2) \approx 0.157$, which is strictly positive. Therefore, in this case, the AIC criterion is inconsistent, since even asymptotically we have a non negligible probability of selecting the wrong model. On the other hand, the probability of selecting the wrong model with the BIC criterion, i.e. $P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_1\right)$, is equal to $P(Z^2 > \log(n))$. This probability tends to 0 asymptotically, which implies $M_0$ is selected with probability 1 asymptotically. One can therefore argue that, in the case $M_0$ is true, the BIC criterion may be preferable to the AIC criterion, since it is asymptotically consistent while the other one isn't.

We now assume $M_1$ to be true. The maximum probability of selecting the wrong model with the AIC criterion, while still assuming $M_1$ to be true, i.e. $\sup_{\theta \neq 0} P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_0\right)$, is equal to $P(Z^2 \leq 2) \approx 0.843$, which is strictly positive. The worst case probability of selecting the wrong model is thus large. The BIC criterion performs even worse in this case, since the worst case probability of selecting the wrong model, $\sup_{\theta \neq 0} P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_0\right)$, is equal to $P(Z^2 \leq \log(n))$. Thus, the probability of making the wrong decision tends to 1 for $n \to \infty$.
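These error probabilities are easy to evaluate numerically. The sketch below assumes the normal set-up of this chapter and writes the event $(Z + \sqrt{n}\theta/\sigma)^2 \leq c$ as an interval for $Z$; the names `norm_cdf` and `prob_select_m0` are hypothetical helpers, and the standard normal distribution function is obtained from `math.erf`.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_select_m0(theta, n, sigma, threshold):
    """P((Z + sqrt(n) * theta / sigma)^2 <= threshold).

    Use threshold = 2 for the AIC rule and threshold = log(n) for the BIC rule.
    """
    shift = math.sqrt(n) * theta / sigma
    root = math.sqrt(threshold)
    return norm_cdf(root - shift) - norm_cdf(-root - shift)


# Error under M0 (theta = 0): probability of selecting M1.
aic_error_m0 = 1.0 - prob_select_m0(0.0, 50, 1.0, 2.0)
bic_errors_m0 = [1.0 - prob_select_m0(0.0, n, 1.0, math.log(n))
                 for n in (10, 100, 10000)]

# Near-worst case under M1 (theta tiny but non-zero): probability of selecting M0.
bic_errors_m1 = [prob_select_m0(1e-9, n, 1.0, math.log(n))
                 for n in (10, 100, 10000)]

print(f"AIC error under M0: {aic_error_m0:.3f}")
print("BIC errors under M0:", [round(p, 4) for p in bic_errors_m0])
print("BIC near-worst-case errors under M1:", [round(p, 4) for p in bic_errors_m1])
```

The first list shrinks with $n$ while the second grows towards 1, which is the numerical face of the trade-off between consistency under $M_0$ and worst case behaviour under $M_1$.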

This leads to a dilemma, since even though the BIC criterion is consistent for a fixed $\theta$, the probability of making the wrong decision can be arbitrarily close to 1. The AIC criterion, on the other hand, is not consistent in model selection, but at least its probability of selecting the wrong model is bounded away from 1. We conclude that it is impossible, based on $\hat{\theta}_n$, to be uniformly consistent in model selection in the case $M_1$ is true, while also being consistent in the case $M_0$ is true. This is because the distributional differences between $\hat{\theta}_n$ under the assumption that $M_0$ is true and under the assumption that $M_1$ is true can be arbitrarily small.

1.1.3 Relevance

The problem in the previous subsection comes from the fact that if $|\theta|$ is very small, one would need huge amounts of observations to detect a signal. This remains true for other notable model selection techniques, such as the Lasso (Tibshirani, (1996), Zou, (2006) and Bühlmann and Van De Geer, (2011)), cross validation methods (Shao, (1993) and Zhang, (1993)) or even testing theory (Lehmann and Romano, 2006).

All is not doom and despair though, since in many applications one might not be interested in showing that $\theta = 0$. Before even testing this hypothesis, one could imagine that the practitioner knows this hypothesis to be false or at least highly unlikely, as was noted by Jones and Tukey, (2000). Furthermore, many practitioners are not strictly interested in significance testing but in discovering practical significance, which we denote as relevance.

For instance in bioequivalence testing (Schuirmann, 1987), one is interested in testing the hypothesis $|\theta| \leq \delta$. This leads to the following new definitions of models, where for some fixed $\delta > 0$ we define:

$$M_0^{(\delta)} : \theta \in [-\delta, \delta], \tag{1.1}$$

$$M_1^{(\delta)} : \theta \in [-\delta, \delta]^c.$$

The model $M_0^{(\delta)}$ represents the case where $\theta$ may not be zero, but is irrelevant, while $M_1^{(\delta)}$ represents the case where $\theta$ is relevant. We believe this definition not only helps develop the theory, as we will see below, but makes practical sense, too.

1.1.4 Maximum likelihood for relevance

Due to the definitions of $M_0^{(\delta)}$ and $M_1^{(\delta)}$ in Equations (1.1), the likelihood is no longer necessarily maximized for values within the domain of $M_1^{(\delta)}$. This means that a simple maximum likelihood method can be used to select the model. In fact, clearly

$$\sup_{|\theta| \leq \delta} l_n(\theta) \geq \sup_{|\theta| \geq \delta} l_n(\theta) \iff \hat{\theta}_n \in [-\delta, \delta].$$

Therefore, we can define the decision function based on the maximum likelihood, which once again depends entirely on $\hat{\theta}_n$. We denote this function as $M_{Ml} : \mathbb{R} \to \left\{M_0^{(\delta)}, M_1^{(\delta)}\right\}$:

$$M_{Ml}(x) = \begin{cases} M_0^{(\delta)} & \text{if } |x| \leq \delta \\ M_1^{(\delta)} & \text{otherwise} \end{cases}.$$

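We now study the probabilities of selecting $M_0^{(\delta)}$ and $M_1^{(\delta)}$ as functions of $\theta$. In fact it follows directly from the definition of $M_{Ml}$ that for any $\theta \in \mathbb{R}$:

$$P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n \in [-\delta, \delta]\right) = P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right), \tag{1.2}$$

$$P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n \notin [-\delta, \delta]\right) = 1 - P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

where $Z \sim N(0, 1)$.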
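The relevance rule and its selection probability from Equation (1.2) can be sketched as follows. This is an illustrative example under the normal assumptions of this chapter; `norm_cdf`, `select_relevance` and `prob_select_irrelevant` are hypothetical names, and the numerical values are chosen only for demonstration.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def select_relevance(theta_hat, delta):
    """Maximum likelihood rule: irrelevant model M0 when |theta_hat| <= delta."""
    return "M0" if abs(theta_hat) <= delta else "M1"


def prob_select_irrelevant(theta, n, sigma, delta):
    """P(-delta <= theta + sigma / sqrt(n) * Z <= delta), as in Equation (1.2)."""
    scale = sigma / math.sqrt(n)
    return norm_cdf((delta - theta) / scale) - norm_cdf((-delta - theta) / scale)


delta, sigma = 0.5, 1.0
for theta in (0.0, 0.4, 0.5, 0.6, 1.0):
    p0 = prob_select_irrelevant(theta, n=100, sigma=sigma, delta=delta)
    print(f"theta = {theta:.1f}: P(select M0) = {p0:.4f}")
```

The printed probabilities decrease as $|\theta|$ grows and sit at about one half when $\theta$ lies exactly on the boundary $\delta$.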

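Thus, $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ is a strictly monotone decreasing function in $|\theta|$, whereas $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right)$ is strictly increasing in $|\theta|$. This implies that we recover the following expressions for the worst covering probability:

$$\inf_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\inf_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right).$$

By using the fact that $P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P\left(-2\delta\frac{\sqrt{n}}{\sigma} \leq Z \leq 0\right)$ and $\delta > 0$, we recover that $\lim_{n \to \infty} P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P(Z \leq 0) = \frac{1}{2}$. Thus, we find the following asymptotic results:

$$\lim_{n \to \infty} \inf_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = \frac{1}{2},$$

$$\lim_{n \to \infty} \inf_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - \frac{1}{2} = \frac{1}{2}.$$

Similarly, we recover the following expressions for the worst case probability of selecting the wrong model:

$$\sup_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

which also both tend to $\frac{1}{2}$ as $n \to \infty$.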

Therefore, this method doesn't suffer from the same problem as the BIC criterion, where it was possible to select the wrong model with a probability arbitrarily close to 1. This solves one part of the dilemma.

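Furthermore, by using the expression for $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ in Equation (1.2), we can show that the method achieves near uniform consistency. More specifically, for any $0 < \epsilon < \delta$, we have:

$$\begin{aligned}
\lim_{n \to \infty} \inf_{|\theta| \leq \delta - \epsilon} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) &= \lim_{n \to \infty} P\left(-\delta \leq \delta - \epsilon + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) \\
&= \lim_{n \to \infty} P\left(-(2\delta - \epsilon)\frac{\sqrt{n}}{\sigma} \leq Z \leq \epsilon\frac{\sqrt{n}}{\sigma}\right) = 1, \\
\lim_{n \to \infty} \inf_{|\theta| \geq \delta + \epsilon} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) &= 1 - \lim_{n \to \infty} P\left(-\delta \leq \delta + \epsilon + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) \\
&= 1 - \lim_{n \to \infty} P\left(-(2\delta + \epsilon)\frac{\sqrt{n}}{\sigma} \leq Z \leq -\epsilon\frac{\sqrt{n}}{\sigma}\right) \\
&= 1 - 0 = 1.
\end{aligned}$$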
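Near uniform consistency can also be read as a sample size statement: once $\theta$ is kept at least $\epsilon$ away from $\pm\delta$, the worst case error drops below any target for $n$ large enough. The following is a minimal sketch of that reading, assuming the normal set-up of this chapter; `norm_cdf`, `worst_case_error`, `min_sample_size` and the chosen values of $\delta$, $\epsilon$, $\sigma$ and $\alpha$ are hypothetical and not taken from the thesis.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def worst_case_error(n, delta, eps, sigma):
    """Largest probability of a wrong selection over |theta| <= delta - eps
    and |theta| >= delta + eps, attained at the two boundary values."""
    root = math.sqrt(n) / sigma
    # theta = delta - eps: wrongly selecting the relevant model.
    miss_irrelevant = 1.0 - (norm_cdf(eps * root)
                             - norm_cdf(-(2 * delta - eps) * root))
    # theta = delta + eps: wrongly selecting the irrelevant model.
    miss_relevant = (norm_cdf(-eps * root)
                     - norm_cdf(-(2 * delta + eps) * root))
    return max(miss_irrelevant, miss_relevant)


def min_sample_size(delta, eps, sigma, alpha, n_max=10**6):
    """Smallest n whose worst case error is at most alpha (None if not found)."""
    for n in range(1, n_max + 1):
        if worst_case_error(n, delta, eps, sigma) <= alpha:
            return n
    return None


n_needed = min_sample_size(delta=0.5, eps=0.1, sigma=1.0, alpha=0.05)
print("smallest n with worst case error below 5%:", n_needed)
```

The required $n$ grows roughly like $(\sigma/\epsilon)^2$, which makes explicit why no finite sample can control the error when $\epsilon$ is allowed to shrink to zero.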

Therefore if the true θ is not too close to ±δ we have uniform consistency in model selection. This implies that setting a threshold for relevance, namely δ >0, also offers a solution to the other part of the dilemma previously discussed.

Consequently, within the relevance framework, the maximum likelihood estimator is uniformly consistent in model selection for θ not too close to ±δ, while at the same time avoiding the case of selecting the wrong model with an arbitrarily high probability.

1.1.5 Signed relevance

In many situations, one may not only be interested in the relevance of the parameter of interest but also in its sign or direction. In fact, Kaiser, (1960) criticized the traditional two-sided test, i.e. the case of zero component testing, by stating: "to find a 'significant' effect and not be able to decide in which direction this difference or effect lies, seems a sterile way to do business". This prompted him to suggest using two one-sided tests instead of one two-sided test.

Translating this to our case, it entails separating $M_1^{(\delta)}$ into a model for negatively relevant $\theta$ and another for positively relevant $\theta$. More specifically, $M_1^{(\delta)}$ can be separated into $M_-^{(\delta)}$ and $M_+^{(\delta)}$, in which case for a fixed $\delta > 0$ there are 3 models:

$$M_-^{(\delta)} : \theta < -\delta, \tag{1.3}$$

$$M_0^{(\delta)} : -\delta \leq \theta \leq \delta,$$

$$M_+^{(\delta)} : \theta > \delta,$$

where $M_0^{(\delta)}$ is still the model of irrelevant $\theta$.

1.1.6 Maximum likelihood for signed relevance

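Once again, we can use the maximum likelihood estimator, $\hat{\theta}_n$, to select between the models $M_-^{(\delta)}$, $M_0^{(\delta)}$ and $M_+^{(\delta)}$, defined in Equations (1.3), where we fix $\delta > 0$. The corresponding decision function $M_{Ml\pm} : \mathbb{R} \to \left\{M_-^{(\delta)}, M_0^{(\delta)}, M_+^{(\delta)}\right\}$ is defined as:

$$M_{Ml\pm}(x) = \begin{cases} M_-^{(\delta)} & \text{if } x < -\delta \\ M_0^{(\delta)} & \text{if } -\delta \leq x \leq \delta \\ M_+^{(\delta)} & \text{if } x > \delta \end{cases}.$$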

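We now study the probabilities of selecting $M_-^{(\delta)}$, $M_0^{(\delta)}$ and $M_+^{(\delta)}$ based on $\theta$. In fact, just as in Subsection 1.1.4, for any $\theta \in \mathbb{R}$ we recover:

$$P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n < -\delta\right) = P\left(\theta + \frac{\sigma}{\sqrt{n}} Z < -\delta\right), \tag{1.4}$$

$$P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P_\theta\left(-\delta \leq \hat{\theta}_n \leq \delta\right) = P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n > \delta\right) = P\left(\theta + \frac{\sigma}{\sqrt{n}} Z > \delta\right),$$

where $Z \sim N(0, 1)$.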
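The three-way rule and the probabilities in Equations (1.4) can be sketched in the same way as for plain relevance. The example below is illustrative only and assumes the normal set-up of this chapter; `norm_cdf`, `select_signed`, `signed_selection_probs` and the labels "M-", "M0", "M+" are hypothetical names.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def select_signed(theta_hat, delta):
    """Maximum likelihood rule for signed relevance."""
    if theta_hat < -delta:
        return "M-"
    if theta_hat > delta:
        return "M+"
    return "M0"


def signed_selection_probs(theta, n, sigma, delta):
    """Selection probabilities of M-, M0 and M+ for a given true theta."""
    scale = sigma / math.sqrt(n)
    p_minus = norm_cdf((-delta - theta) / scale)       # P(theta_hat < -delta)
    p_plus = 1.0 - norm_cdf((delta - theta) / scale)   # P(theta_hat > delta)
    p_zero = 1.0 - p_minus - p_plus                    # P(-delta <= theta_hat <= delta)
    return {"M-": p_minus, "M0": p_zero, "M+": p_plus}


probs = signed_selection_probs(theta=0.5, n=400, sigma=1.0, delta=0.5)
print({model: round(p, 4) for model, p in probs.items()})
```

With the true $\theta$ placed exactly at $\delta$, the positive model is selected with probability one half, in line with the worst covering probability derived for $M_+^{(\delta)}$.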

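Thus, $P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ is a strictly monotone decreasing function in $|\theta|$, while $P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right)$ is strictly decreasing in $\theta$ and $P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right)$ is strictly increasing in $\theta$. This implies that we recover the following analytical expressions for the worst covering probabilities:

$$\inf_{\theta < -\delta} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P\left(-\delta + \frac{\sigma}{\sqrt{n}} Z < -\delta\right) = \frac{1}{2},$$

$$\inf_{-\delta \leq \theta \leq \delta} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\inf_{\theta > \delta} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P\left(\delta + \frac{\sigma}{\sqrt{n}} Z > \delta\right) = \frac{1}{2}.$$

By using the fact that $P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P\left(-2\delta\frac{\sqrt{n}}{\sigma} \leq Z \leq 0\right)$ and $\delta > 0$, we recover that $\lim_{n \to \infty} P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P(Z \leq 0) = \frac{1}{2}$. Similarly, we recover the following expressions for the worst case probability of selecting the wrong model:

$$\sup_{\theta \geq -\delta} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P\left(-\delta + \frac{\sigma}{\sqrt{n}} Z \leq -\delta\right) = \frac{1}{2},$$

$$\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\sup_{\theta \leq \delta} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P\left(\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z\right) = \frac{1}{2},$$

where $\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ tends to $\frac{1}{2}$ as $n \to \infty$.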

Therefore, just as the method described in Subsection 1.1.4, this method doesn’t suffer from the same problem as the BIC criterion, where it was possible to select the wrong model with a probability arbitrarily close to 1.

Furthermore, by using the expressions for the probabilities in Equations (1.4), we can show that the method achieves near uniform consistency, in the same way as in the relevance framework.


1.2 Bayesian approach

Ultimately we are interested in quantifying the uncertainty in model selection. Much work has already been done on this topic within the Bayesian framework. Interestingly Bayesian model selection, as is described in Mitchell and Beauchamp, (1988) and in Chipman et al., (2001), can both provide a method to select a model and provide a credibility set of models. We apply this methodology to our particular case and discuss some of the implications.

1.2.1 Posterior model probability

In a general Bayesian framework, to each model $M_0$ and $M_1$ we associate the distributions $\pi_0$ and $\pi_1$ to the parameter space. Additionally, let $f_n(\cdot \mid \theta)$ denote the density of $\hat{\theta}_n$ for a given $\theta$. The marginal densities for $\hat{\theta}_n$ are:

$$p_n(\cdot \mid M_k) = \int f_n(\cdot \mid \theta)\,\pi_k(\theta)\,d\theta, \tag{1.5}$$

where $k \in \{0, 1\}$.

Furthermore, let $P(M_0)$, $P(M_1) = 1 - P(M_0)$ be the probabilities of the observations being generated by the models $M_0$, $M_1$ respectively. As a result, by a standard application of Bayes' theorem, we recover the posterior model probabilities for $k \in \{0, 1\}$:

$$P\left(M_k \mid \hat{\theta}_n\right) = \frac{p_n\left(\hat{\theta}_n \mid M_k\right) P(M_k)}{p_n\left(\hat{\theta}_n \mid M_0\right) P(M_0) + p_n\left(\hat{\theta}_n \mid M_1\right) P(M_1)}. \tag{1.6}$$

One can then select the model with the higher posterior model probability. Consequently, we can define the Bayesian decision function $M_{Bayes} : \mathbb{R} \to \{M_0, M_1\}$ by:

$$M_{Bayes}(x) = \arg\max_{\tilde{M} \in \{M_0, M_1\}} P\left(\tilde{M} \mid x\right). \tag{1.7}$$

1.2.2 Spike and slab priors

Concerning the choice of the priors, a common approach, as suggested by Mitchell and Beauchamp (1988), is to use the point and uniform priors, otherwise known as the spike and slab priors. More specifically we set $\pi_0(t) = \delta_0(t)$, a degenerate distribution with all its mass at 0, and $\pi_1(t) = \frac{1}{2t_0} 1_{|t| \le t_0}$, a uniform distribution with $t_0 > 0$.

By the choices of the priors, we have $P_{\pi_0}(\theta = 0) = 1$ and $P_{\pi_1}(\theta \neq 0) = 1$. This means that under the assumption that $\theta$ follows the distribution $\pi_0$, the model $M_0$ is true with probability 1. Likewise, if $\theta$ follows the distribution $\pi_1$, the model $M_1$ is true with probability 1.

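Furthermore, by plugging the above priors into Equation (1.5), it can be shown (details in Appendix A) that in this case the marginal likelihoods have the following simple expressions:

$$p_n(\hat\theta_n \mid M_0) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{\hat\theta_n^2}{2\sigma^2/n}\right), \qquad p_n(\hat\theta_n \mid M_1) = \int_{-t_0}^{t_0} \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{(\hat\theta_n - t)^2}{2\sigma^2/n}\right) \frac{1}{2t_0}\, dt.$$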

Plugging these expressions into Equation (1.6), we obtain a rather simple expression for the posterior model probabilities. More specifically (details in Appendix A), the posterior model probabilities are determined by:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \int_{-t_0}^{t_0} \exp\left(-\frac{t^2}{2\sigma^2/n} + \frac{\hat\theta_n t}{\sigma^2/n}\right) \frac{1}{2t_0}\, dt \cdot P(M_0)^{-1} P(M_1)\right)^{-1} = 1 - P(M_1 \mid \hat\theta_n).$$

These probabilities are exact under the prior assumptions. Due to the definition of M0, the distribution π0 needs no justification since it is the unique distribution within the domain of M0. The choice of π1 is more arbitrary.
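The spike and slab posterior can be evaluated numerically without quadrature, since the integral defining $p_n(\hat\theta_n \mid M_1)$ is a difference of two normal distribution function values. The following is a minimal sketch under that observation; the function names and argument order are illustrative assumptions, not notation from the thesis.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, written via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def spike_slab_posterior_m0(theta_hat, sigma, n, t0, prior_m0=0.5):
    """P(M0 | theta_hat) under the point and uniform (spike and slab) priors.

    Assumes theta_hat ~ N(theta, sigma^2 / n), pi_0 a point mass at 0 and
    pi_1 uniform on [-t0, t0].
    """
    s = sigma / math.sqrt(n)  # standard deviation of theta_hat
    # Marginal density under M0: the N(0, s^2) density at theta_hat.
    p0 = math.exp(-0.5 * (theta_hat / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)
    # Marginal density under M1: the normal density integrated over
    # t in [-t0, t0], weighted by the uniform density 1 / (2 t0).
    p1 = (normal_cdf((t0 - theta_hat) / s)
          - normal_cdf((-t0 - theta_hat) / s)) / (2.0 * t0)
    # Bayes' theorem as in Equation (1.6).
    return p0 * prior_m0 / (p0 * prior_m0 + p1 * (1.0 - prior_m0))
```

For estimates that lie extremely far from both 0 and the interval $[-t_0, t_0]$, both marginal densities can underflow to zero in floating point; the sketch does not guard against that case.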

1.2.3 Point-normal priors

Another possible choice of priors is found in Chipman et al. (2001), where they prefer to use the point-normal formulation. More specifically we set $\pi_0(t) = \delta_0(t)$, a degenerate distribution with all its mass at 0, and $\pi_1(t) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left(-\frac{t^2}{2\tau^2}\right)$, a normal distribution with mean 0 and a scale parameter $\tau > 0$.

Just as for the spike and slab priors, we have $P_{\pi_0}(\theta = 0) = 1$ and $P_{\pi_1}(\theta \neq 0) = 1$.

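Now, by plugging the above priors into Equation (1.5), it can be shown (details in Appendix A) that the marginal likelihoods are equal to:

$$p_n(\hat\theta_n \mid M_0) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{\hat\theta_n^2}{2\sigma^2/n}\right), \qquad p_n(\hat\theta_n \mid M_1) = \frac{1}{\sqrt{2\pi}\sqrt{\tau^2 + \sigma^2/n}} \exp\left(-\frac{\hat\theta_n^2}{2(\tau^2 + \sigma^2/n)}\right).$$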

Once again, plugging these expressions into Equation (1.6) gives an analytical expression for the posterior model probabilities. More specifically (details in Appendix A), the posterior model probabilities are determined by:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{\hat\theta_n^2}{2\sigma^2/n} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right) P(M_0)^{-1} P(M_1)\right)^{-1} = 1 - P(M_1 \mid \hat\theta_n).$$

These probabilities are exact under the prior assumptions, requiring the knowledge of τ.
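Because the point-normal posterior is available in closed form, it is straightforward to evaluate. The sketch below is one possible transcription of that expression; the function name, the argument order and the default prior probability are illustrative assumptions.

```python
import math


def point_normal_posterior_m0(theta_hat, sigma, n, tau, prior_m0=0.5):
    """P(M0 | theta_hat) under the point-normal priors.

    Assumes theta_hat ~ N(theta, sigma^2 / n), pi_0 a point mass at 0 and
    pi_1 the N(0, tau^2) distribution.
    """
    s2 = sigma ** 2 / n                   # variance of theta_hat given theta
    shrink = tau ** 2 / (tau ** 2 + s2)   # tau^2 / (tau^2 + sigma^2 / n)
    # Ratio of marginal likelihoods p_n(theta_hat | M1) / p_n(theta_hat | M0).
    marginal_ratio = math.sqrt(s2 / (tau ** 2 + s2)) * math.exp(
        theta_hat ** 2 / (2.0 * s2) * shrink
    )
    prior_odds = (1.0 - prior_m0) / prior_m0  # P(M1) / P(M0)
    return 1.0 / (1.0 + marginal_ratio * prior_odds)
```

The exponential can overflow when $\hat\theta_n^2 / (\sigma^2/n)$ is very large; a more careful version would work on the log scale, which is omitted here for brevity.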

1.2.4 Choice of nuisance parameter

For both the spike and slab and the point-normal priors, there remain nuisance parameters to specify. Namely, we need to specify $t_0$ for the spike and slab prior and $\tau$ for the point-normal prior. Here we focus on the point-normal prior to show how important the choice of the nuisance parameter is. Not wanting to favour one model over another, we set $P(M_0) = P(M_1) = \frac{1}{2}$ in the rest of the subsection.

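Under the assumption that $M_0$ is true, we have $\hat\theta_n = \frac{\sigma}{\sqrt{n}} Z$ where $Z \sim N(0,1)$. It therefore follows that under this assumption:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right)^{-1}.$$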

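Consequently, allowing $\tau$ to depend on $n$, if $\lim_{n \to \infty} \frac{\sigma^2/n}{\tau^2 + \sigma^2/n} = 0$ then $P(M_0 \mid \hat\theta_n)$ converges in probability to 1 under the assumption that $M_0$ is true. On the other hand, if we allow the scale parameter $\tau$ to decay at the rate of $1/\sqrt{n}$ by setting $\tau = a \cdot \sigma/\sqrt{n}$, we have:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \sqrt{\frac{1}{a^2 + 1}} \exp\left(\frac{Z^2}{2} \cdot \frac{a^2}{a^2 + 1}\right)\right)^{-1}.$$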

In such a case, the probability of choosing $M_1$ converges to $P\left(P(M_0 \mid \hat\theta_n) < \frac{1}{2}\right) > 0$ as $n \to \infty$, which is the same problem we encountered using the AIC criterion.
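To make this limiting probability concrete, note that with $P(M_0) = P(M_1) = \frac{1}{2}$ and $\tau = a \cdot \sigma/\sqrt{n}$, the event $P(M_0 \mid \hat\theta_n) < \frac{1}{2}$ is equivalent to $Z^2 > \frac{a^2 + 1}{a^2} \log(a^2 + 1)$, which follows by rearranging the posterior expression for this choice of $\tau$. The sketch below evaluates the corresponding two-sided normal tail probability; the function names are illustrative assumptions.

```python
import math


def critical_z(a):
    """Threshold c such that M1 is selected exactly when |Z| > c.

    Obtained by solving sqrt(1/(a^2+1)) * exp(Z^2/2 * a^2/(a^2+1)) > 1
    for |Z|, assuming equal prior probabilities and tau = a * sigma / sqrt(n).
    """
    if a <= 0:
        raise ValueError("a must be positive")
    a2 = a * a
    return math.sqrt((a2 + 1.0) * math.log(a2 + 1.0) / a2)


def prob_select_m1_under_m0(a):
    """P(|Z| > c) for Z ~ N(0, 1), i.e. the chance of wrongly picking M1."""
    # For a standard normal Z, P(|Z| > c) = erfc(c / sqrt(2)).
    return math.erfc(critical_z(a) / math.sqrt(2.0))
```

The result depends on $a$ only, not on $n$ or $\sigma$, and is strictly positive for every $a > 0$, which is the point made in the text.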

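In the case we assume $M_1$ to be true, the choice of $\tau$ can be just as crucial. Assume the true density of $\theta$ is equal to $\pi(x) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left(-\frac{x^2}{2\tau^2}\right)$. We have $\hat\theta_n = \sqrt{\tau^2 + \sigma^2/n} \cdot Z$, where $Z \sim N(0,1)$. By using the previous prior we recover:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot \frac{\tau^2 + \sigma^2/n}{\sigma^2/n} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right)^{-1}.$$

Let $a > 0$ be such that $\tau = a \cdot \sigma/\sqrt{n}$; we then have:

$$P(M_0 \mid \hat\theta_n) = \left(1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot (a^2 + 1) \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right)^{-1} \ge \left(1 + \sqrt{\frac{\sigma^2}{n\tau^2 + \sigma^2}} \exp\left(\frac{Z^2}{2} \cdot (a^2 + 1)\right)\right)^{-1}.$$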

Therefore, in the case where $\tau$ is badly specified, it is possible for $P(M_0 \mid \hat\theta_n)$ to be arbitrarily close to 1 in probability. For instance, if $a$, $\sigma$ and $\tau$ are kept constant while $n \to \infty$, then $P(M_0 \mid \hat\theta_n)$ converges in probability to 1. This points to the same problem the BIC criterion had in Subsection 1.1.2. Unfortunately, the exact same problem arises for the $t_0$ parameter in the spike and slab prior.

1.2.5 Relevance

The problem described in the previous subsection can be avoided by focusing on the notion of practical significance. Chipman et al. (2001) introduced the idea of working with models $M_0(\delta)$ and $M_1(\delta)$ as in Subsection 1.1.3 by defining $\delta$ as the "threshold of practical significance", i.e. the "threshold of relevance". In this subsection we therefore focus on the models $M_0(\delta)$ and $M_1(\delta)$ for a given $\delta > 0$. The definitions of the marginal densities, the posterior model probabilities and the Bayesian decision function from Equations (1.5), (1.6) and (1.7) translate directly to this setting, by replacing $M_0$ and $M_1$ by $M_0(\delta)$ and $M_1(\delta)$. Within this setting, we now propose two different types of priors and discuss some of their properties and implications.
