A Prediction Divergence Criterion for Model Selection


Preprint

Reference


GUERRIER, Stéphane, VICTORIA-FESER, Maria-Pia. A Prediction Divergence Criterion for Model Selection.

Available at:

http://archive-ouverte.unige.ch/unige:24187


Stéphane Guerrier and Maria-Pia Victoria-Feser

Research Center for Statistics & HEC Genève, University of Geneva, Switzerland

Abstract: In this paper, we propose a new criterion for selecting between nested models. We suppose that the correct model is one (or near one) of the available models and construct a criterion based on the Bregman divergence between the out-of-sample prediction of the smaller model and the in-sample prediction of the larger model. This criterion, the prediction divergence criterion (PDC), differs from commonly used criteria such as the AIC, BIC and $C_p$ in that, in a sequential approach, it directly considers the prediction divergence between two models, rather than the difference between one of those criteria evaluated at two different models. We derive an estimator for the PDC (PDCE) using Efron's (2004) parametric covariance penalty approach. For the linear model and smoothing splines, we show that the PDCE, applied to a suitable sequence of nested models that we formalize, selects the correct model with probability 1 as the sample size tends to infinity. In finite samples, we compare the performance of our criterion to the other ones as well as to the lasso, and find that it outperforms the other criteria in terms of prediction error in sparse situations.

Keywords and phrases: Goodness-of-fit, Linear predictors, Stepwise selection, Bregman divergence, Covariance penalty, AIC, BIC, lasso.

1. Introduction

Model selection is an important and challenging problem in statistics. It becomes unavoidable in more and more applications that combine incomplete theoretical knowledge about the phenomenon under investigation with large amounts of available information, as in medicine, biology, economics, etc. Very often model selection is about choosing, among a set of predictors, the subset that best explains or predicts a response variable. We suppose here that the set of predictors contains the correct subset, and we will focus on linear models.

In practice, model selection consists in computing a criterion associated either with each potential model or with a suitable sequence of potential models, and choosing the one(s) that optimize the criterion. Many criteria have been proposed; the most popular ones include Mallows' $C_p$ (Mallows, 1973), based on prediction error, Akaike's Information Criterion (AIC) (Akaike, 1974), based on the Kullback-Leibler divergence between the candidate model and the true one, and the Bayesian Information Criterion (BIC) (Schwarz, 1978). These criteria, and others, are in fact goodness-of-fit measures. Since we suppose that the true model is included in the set of potential models, a suitable property for a criterion is consistency (see e.g. McQuarrie and Tsai, 1998), i.e. the probability of selecting the correct model tends to one as the sample size tends to infinity. The BIC is consistent, while the AIC and $C_p$ are not. The latter are actually asymptotically efficient, a concept related to prediction accuracy. Another property that can be used to compare criteria is the signal-to-noise (STN) ratio (see e.g. McQuarrie and Tsai, 1998); a criterion with a weak STN ratio will tend to choose models that overfit, while a criterion with a strong STN ratio will tend to choose models that underfit.

With finite samples, estimates of these criteria are compared to find the model with the optimal (minimum) value of the estimated criterion. In particular, in a sequential approach, differences in estimated criteria are considered and the search stops when this difference becomes negative. More precisely, suppose one uses criterion $C$ on model $M^j$ nested in $M^k$. In practice, one considers the decision rule $\hat{R}(M^j, M^k) = \hat{C}(M^j) - \hat{C}(M^k)$: if it is negative, the optimal model is chosen to be $M^j$; if not, $M^k$ is compared to a larger model. Hence $\hat{R}(M^j, M^k)$ is implicitly considered as a suitable estimate of the difference in goodness-of-fit of models $M^j$ and $M^k$.
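The sequential rule just described can be sketched in a few lines. This is only an illustration: the function name `sequential_select` and the criterion values are hypothetical, and the models are assumed to be listed in order of increasing size.

```python
def sequential_select(criteria):
    """Return the 0-based position of the model chosen by the sequential rule.

    criteria[j] is the estimated criterion of the j-th model in a nested
    sequence ordered by increasing size (hypothetical input).
    """
    j = 0
    while j + 1 < len(criteria):
        # Decision rule: R-hat(M^j, M^k) = C-hat(M^j) - C-hat(M^k).
        r_hat = criteria[j] - criteria[j + 1]
        if r_hat < 0:
            # The smaller model has the lower estimated criterion: stop.
            return j
        j += 1
    # No negative difference was found: the largest model is retained.
    return j


# Hypothetical estimated criteria for five nested models.
values = [12.4, 9.1, 7.8, 8.3, 7.0]
chosen = sequential_select(values)
```

With these values the search stops at the third model, although the fifth has the smallest criterion overall, which illustrates the difference between a sequential and a global search.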

In this paper, we propose a criterion that directly compares the goodness-of-fit of two competing models, one nested in the other. It is based on a divergence measure between predictions under each model. The criterion is optimized, in a suitable sequence of nested models, when the divergence is minimal, leading to the choice of the smallest model. We call the resulting criterion the Prediction Divergence Criterion (PDC). We propose an estimator of it and, for linear regression and smoothing splines, we show that the PDC is consistent. We also compare its STN ratio to those of the BIC and the $C_p$ and find that it is larger than that of the $C_p$ but close to it, so that it does not share the BIC's tendency to choose models that underfit.

To fix ideas, consider the regression model and a sample $Y = (Y_i)_{i=1,\ldots,n}$ of response variables, together with $X$, a non-random $n \times p$ full rank matrix of inputs, and a $p$-dimensional vector of (true) regression slopes $\beta$. Suppose also that the $Y_i \mid x_i$, where $x_i$ represents the $i$th row of the matrix $X$, are independent and have the same family of distributions, typically the $\mathcal{N}(\mu_i, \sigma^2)$, with $\mu_i = E[Y_i]$. Given a sample of observations $y = (y_i)_{i=1,\ldots,n}$, one can assess the goodness-of-fit of model $M^j$, corresponding to, say, the regression model with $\beta_j$ a subset of $\beta$ of size $j$, using a prediction function $m_j(Y \mid X, \beta_j) = (m_j(Y_i \mid x_i, \beta_j))_{i=1,\ldots,n}$. Sample predictions can be computed using $m_j(y_i \mid x_i, \hat\beta_j)$, where $\hat\beta_j$ is an estimator of $\beta_j$, and for notational simplicity we will use $m_j(y_i) := m_j(y_i \mid x_i, \hat\beta_j)$. The prediction function $m_j(\cdot)$ is compared to other responses (not in $y$) also generated from the same family of distributions, say $Y_i^0$, typically using the expectation of the squared difference between the true response and the sample estimated prediction, i.e.

$$\frac{1}{n}\sum_{i=1}^{n} E\left[E_0\left[(Y_i^0 - m(Y_i))^2 \mid Y\right]\right],$$

where $E[\cdot]$ and $E_0[\cdot]$ denote expectations under the distribution of $Y_i \mid x_i$ and $Y_i^0 \mid x_i$. This criterion is a special case of

$$\frac{1}{n}\sum_{i=1}^{n} E\left[E_0\left[Q(Y_i^0, m(Y_i)) \mid Y\right]\right],$$

where $Q(\cdot)$ is a positive measure of prediction error. The "best model" is then chosen as the one minimizing (an estimator of) this quantity, either using a sequential approach (not all potential models are assessed, but only a suitable sequence), or a global approach (all potential models are assessed). As will be detailed in Section 2, the PDC instead directly compares $m_j(Y^0)$ and $m_{j+1}(Y)$, with model $M^j$ nested within model $M^{j+1}$, by means of a class of divergence functions.

To compute an estimator of the PDC, we follow Efron's (2004) parametric covariance penalty approach, which leads to a class of estimators that includes, among others, the AIC and the $C_p$. Efron (2004) has shown that these estimators offer substantially better accuracy than their nonparametric counterparts, assuming that the model is believable. In finite samples, through simulation studies, we find that, compared to other standard methods, the PDC's covariance penalty estimator (PDCE) not only performs better in terms of the probability of choosing the correct model, but also has a mean squared error of prediction (as defined in (4.1) below and denoted by $\mathrm{PE}_y$) and a mean squared error of estimation (as defined in (4.2) and denoted by $\mathrm{MSE}_\beta$) that are at least as good as those of the other methods across different settings, in particular in sparse situations. As an illustrative example, consider the linear model

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$$

with

$$\beta = (0,1,0,1,0,1,0,1,0,1,\underbrace{0,\ldots,0}_{50}) \tag{1.1}$$

and $\sigma_\varepsilon = 1$. Suppose also that the pairwise correlation between $X_j$ and $X_k$ is arbitrarily chosen to be $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.2% and to a signal-to-noise ratio for the slope coefficients of about 7.4. In Table 2 we present the results of a simulation study comparing the performance of our PDCE (within a stepwise selection algorithm that will be explained later) with the lasso, using the R function lars of the lars package with the shrinkage factor chosen by minimizing the $C_p$ statistic, with stepwise forward selection using the AIC criterion, and with stepwise forward selection using the BIC criterion. We also compute the least squares (LS) estimator on the complete model as a benchmark and consider as selected model the one having only the significant variables. The performance criteria used for comparing the methods are presented in Table 1 and measured on 500 simulated samples under the correct model. For the training and test samples, we chose $n = 70$.
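The quoted theoretical $R^2$ and signal-to-noise ratio can be reproduced with a short calculation. The sketch below assumes unit-variance predictors with correlation matrix $\Sigma_{jk} = 0.5^{|j-k|}$, and takes the signal-to-noise ratio to be $\beta^T\Sigma\beta/\sigma_\varepsilon^2$ and the theoretical $R^2$ to be $\beta^T\Sigma\beta/(\beta^T\Sigma\beta + \sigma_\varepsilon^2)$. These definitions are assumptions made for illustration (they are not stated in the text), and the function names are hypothetical.

```python
def ar1_correlation(p, rho):
    """Correlation matrix with entries rho^|j-k| (assumed predictor covariance)."""
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]


def signal_variance(beta, sigma_x):
    """Quadratic form beta^T Sigma beta, the variance of X beta for a random row of X."""
    p = len(beta)
    return sum(beta[j] * sigma_x[j][k] * beta[k]
               for j in range(p) for k in range(p))


# Parameter vector (1.1): ten alternating entries followed by 50 zeros.
beta = [0, 1] * 5 + [0] * 50
sigma2_eps = 1.0

sigma_x = ar1_correlation(len(beta), 0.5)
signal = signal_variance(beta, sigma_x)

snr = signal / sigma2_eps                  # assumed definition of the STN of the slopes
r_squared = signal / (signal + sigma2_eps)  # assumed definition of the theoretical R^2

print(round(snr, 1), round(100 * r_squared, 1))
```

Under these assumptions the two printed values agree with the figures quoted above (about 7.4 and 88.2%).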

While a more extensive simulation study is provided in Section 4, Table 2 clearly reveals the advantage of our PDCE, not only in the probability of selecting the correct model, but also in prediction and estimation error. Figure 1 presents the $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ distributions and reveals even more clearly the advantage of the PDCE over the other methods, at least in the settings considered here.

Table 1. Model selection evaluation criteria

Criterion       Description
correct [%]     Proportion of times the correct model is selected.
included [%]    Proportion of times the correct model is nested within the selected model.
true+           Average number of selected significant variables (true positives).
false+          Average number of selected non-significant variables (false positives).
NbReg           Average number of regressors in the selected model.
Med(PEy)        Median of PEy (computed on test samples).
Med(MSEβ)       Median of MSEβ (computed on test samples).

Table 2. Evaluation criteria as explained in Table 1 for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), stepwise forward PDCE (PDCE) and the lasso (lasso), based on 500 simulated samples under the correct model, with $n = 70$ for the training and test samples. The true parameter vector is given in (1.1) and $\sigma = 1$; the pairwise correlation between $X_j$ and $X_k$ was arbitrarily chosen to be $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.2% and to a signal-to-noise ratio for the slope coefficients of about 7.4. The numbers in parentheses in the columns Med(PEy) and Med(MSEβ) are the corresponding standard errors estimated by the bootstrap with $B = 500$ resamples. Bold figures are for the best observed criteria (before rounding).

Method   Med(PEy)        Med(MSEβ)       correct [%]   included [%]   true+   false+   NbReg
LS       7.390 (0.170)   11.36 (0.247)    0.0          100            5       55       60
AIC      4.394 (0.157)   5.570 (0.391)    0.0          100            5.0     33.9     38.9
BIC      1.426 (0.016)   0.504 (0.031)    5.4          100            5.0     5.35     10.4
PDCE     1.092 (0.006)   0.093 (0.005)   75.2          99.8           5.0     0.31     5.31
lasso    1.324 (0.012)   0.352 (0.011)    1.0          99.6           5.0     13.9     18.9

The paper is organized as follows. In Section 2 we present the PDC, as well as the PDCE, a consistent estimator. As divergence measures, instead of Efron's (1986) $q$-class, we consider the closely related class of Bregman divergences (Bregman, 1967). In Section 3, we consider the cases of the linear regression model and smoothing splines, for which we also show the consistency of the selection procedure based on the PDCE. In Section 4 we perform an extensive simulation study comparing the performance of the PDCE to other well accepted model selection procedures, for the linear regression model, smoothing splines and also for autoregressive processes. Finally, Section 5 concludes.

Fig 1. Boxplots of $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the stepwise forward PDCE (PDCE) and the lasso (lasso), based on 500 simulated samples under the correct model, with $n = 70$ for the training and test samples. The true parameter vector is given in (1.1) and $\sigma = 1$; the pairwise correlation between $X_j$ and $X_k$ was arbitrarily chosen to be $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.2% and to a signal-to-noise ratio for the slope coefficients of about 7.4.

2. Prediction divergence criterion

Consider a random variable $Y$ distributed according to model $F_\theta$, possibly conditionally on a set of fixed covariates $x = [X_1 \ldots X_p]$. We observe a random sample $Y = (Y_i)_{i=1,\ldots,n}$ supposedly generated from $F_\theta$, possibly together with a non-random $n \times p$ full rank matrix of inputs $X$. Given a prediction function $m(Y)$ that depends on the chosen model, Efron (1986) uses a function $Q(x, y)$ based on $q$-class error measures to define a prediction error measure between the in-sample prediction and an out-of-sample predicted variable. The $q$-class error measure is given by

$$Q(x, y) = q(y) + \dot{q}(y)(x - y) - q(x)$$

where $\dot{q}(y)$ is the derivative of $q$ evaluated at $y$. The particular choice $q(x) = x(1 - x)$ gives the squared loss function $Q(x, y) = (x - y)^2$. The prediction error measure is quantified by the (out-of-sample) expected prediction error

$$\mathrm{EPErr} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{EPErr}_i \quad \text{where} \quad \mathrm{EPErr}_i = E\left[E_0\left[Q(Y_i^0, m(Y_i)) \mid Y\right]\right] \tag{2.1}$$

with $Y^0 = (Y_i^0)_{i=1,\ldots,n}$ a random variable distributed as $Y$, and where, as throughout this paper, $E[\cdot]$, respectively $E_0[\cdot]$, denote expectations under the distribution of $Y_i \mid x_i$, respectively $Y_i^0 \mid x_i$, i.e. the correct model. The expectations are, depending on the context, simple or multiple.
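As a check of the squared-loss special case of the $q$-class measure defined above, take $q(x) = x(1-x)$, so that $\dot{q}(y) = 1 - 2y$. Substituting into the definition of $Q(x, y)$ gives

$$\begin{aligned}
Q(x, y) &= y(1-y) + (1-2y)(x-y) - x(1-x) \\
&= y - y^2 + x - y - 2xy + 2y^2 - x + x^2 \\
&= x^2 - 2xy + y^2 = (x-y)^2 .
\end{aligned}$$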

In the special case when the distribution of $Y$ is replaced by the empirical distribution $y$, one gets the in-sample error,

$$\mathrm{ISErr} = \frac{1}{n}\sum_{i=1}^{n} E_0\left[Q(Y_i^0, m(y_i)) \mid y\right].$$

A training or apparent error can be simply computed as the average loss over the training sample $y$, i.e.

$$\mathrm{AErr} = \frac{1}{n}\sum_{i=1}^{n} Q(y_i, m(y_i)).$$

Actually, AErr is an optimistic estimate of EPErr because the same data are used to fit the prediction rule $m(\cdot)$ and to assess its error. Efron (2004) has shown that

$$\mathrm{EPErr}_i = E\left[Q(Y_i, m(Y_i))\right] + \mathrm{cov}\left(\dot{q}(m(Y_i)), Y_i\right).$$

Hence, an estimator of EPErr is obtained as

$$\widehat{\mathrm{EPErr}} = \frac{1}{n}\sum_{i=1}^{n} \widehat{\mathrm{EPErr}}_i \quad \text{where} \quad \widehat{\mathrm{EPErr}}_i = Q(y_i, m(y_i)) + \widehat{\mathrm{cov}}\left(\dot{q}(m(Y_i)), Y_i\right) \tag{2.2}$$

where, depending on the distribution of $Y_i \mid x_i$, $\widehat{\mathrm{cov}}$ is obtained either analytically up to a value of $\theta$, the model's parameters, which is then replaced by $\hat\theta$, or by resampling methods (see e.g. Efron, 2004).

In practice, $\widehat{\mathrm{EPErr}}$ is computed for two models, say $M^j$ nested in $M^k$, and the difference in $\widehat{\mathrm{EPErr}}$ is used for selection. We propose here instead another criterion that is aimed at directly measuring a prediction divergence between the two models, namely the PDC. This criterion compares the out-of-sample prediction computed in the smaller model, $m_j(Y^0)$, to the in-sample prediction in the larger model, $m_k(Y)$, quantified by a divergence function $Q(\cdot)$, namely

$$\mathrm{PDC}_{j,k} = E\left[E_0\left[Q\left(m_j(Y^0), m_k(Y)\right) \mid Y\right]\right] \tag{2.3}$$

where the expectation is multidimensional. If the smaller model were correct, the prediction should in theory be perfect; if this is not the case, the additional elements in the larger model create differences in the predictions and therefore should be accounted for. The prediction error function $Q(\cdot)$ is here a multidimensional equivalent of the (scalar) prediction error function $Q(\cdot)$ in (2.1). We propose to use for $Q(\cdot)$ the Bregman divergence, which is the multivariate equivalent of Efron's $q$-class (see Bregman, 1967, for more details). The Bregman divergence encompasses squared error, relative entropy, logistic loss and Mahalanobis distance.

The Bregman divergence between two equi-dimensional vectors $x$ and $y$ is defined as

$$Q(x, y) = \psi(x) - \psi(y) - (x - y)^T \nabla\psi(y) \tag{2.4}$$

where $\psi(\cdot)$ is a scalar-valued function and $\nabla\psi(y)$ represents the gradient vector of $\psi(\cdot)$ evaluated at $y$. The function $\psi(\cdot)$ is strictly convex and differentiable. For example, the squared loss function $Q(x, y) = \|x - y\|_2^2$ is obtained when $\psi(z) = z^T z$.
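Definition (2.4) translates directly into code. The following is a minimal sketch, with hypothetical function names, that evaluates the Bregman divergence for two generators: $\psi(z) = z^T z$, which gives the squared loss, and the negative entropy $\psi(z) = \sum_i z_i \log z_i$, which gives the relative entropy when both vectors are probability vectors (a standard fact, used here as an extra illustration).

```python
import math


def bregman(x, y, psi, grad_psi):
    """Bregman divergence Q(x, y) = psi(x) - psi(y) - (x - y)^T grad psi(y), as in (2.4)."""
    g = grad_psi(y)
    return psi(x) - psi(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))


# Generator psi(z) = z^T z and its gradient 2z: yields the squared loss.
def psi_square(z):
    return sum(v * v for v in z)


def grad_square(z):
    return [2.0 * v for v in z]


# Negative entropy generator (entries must be positive): yields relative entropy.
def psi_entropy(z):
    return sum(v * math.log(v) for v in z)


def grad_entropy(z):
    return [math.log(v) + 1.0 for v in z]


x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_square = bregman(x, y, psi_square, grad_square)      # equals ||x - y||^2
d_entropy = bregman(x, y, psi_entropy, grad_entropy)   # equals sum x_i log(x_i / y_i)
```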

In a suitable sequence of nested models, with increasing complexity and such that model $M^j$ is nested in model $M^{j+1}$, supposing we have a consistent estimator of $\mathrm{PDC}_{j,j+1}$, say $\widehat{\mathrm{PDC}}_{j,j+1}$, then $\widehat{\mathrm{PDC}}_{j,j+1}$ is expected to be minimal when $j = q < b$, where $b$ is the number of potential sequentially nested models and $M^q$ denotes the correct (or closest to the correct) underlying model. Indeed, while $j < q$, we expect $\widehat{\mathrm{PDC}}_{j,j+1}$ to be relatively large, since model $M^j$ is missing some elements of model $M^q$ which are included in model $M^{j+1}$. This is also true with $\widehat{\mathrm{PDC}}_{j,j+k}$, $k > 0$, or $\widehat{\mathrm{PDC}}_{j-k,j}$, $k > 1$. On the other hand, if $j \geq q$, $\widehat{\mathrm{PDC}}_{j,j+1}$ (or indeed $\widehat{\mathrm{PDC}}_{j,j+k}$, $k > 0$) is relatively small compared to when $j < q$, because both models include the correct one. Among all models $j \geq q$, $\widehat{\mathrm{PDC}}_{j,j+k}$ should be minimized at $j = q$ and $k = 1$, since $\widehat{\mathrm{PDC}}_{q,q+1}$ compares the prediction of the correct model with the least overfitted one. In the case of the linear regression model, we show in Section 3 that $\mathrm{PDC}_{j,j+1} = E[\widehat{\mathrm{PDC}}_{j,j+1}]$, for the squared loss function $Q(\cdot)$, is minimized at $j = q$.

To estimate (2.3) we follow the approach used by Efron (2004) for the covariance penalty criterion. First, we construct a naive estimator of the quantity of interest, called the Apparent Prediction Divergence Criterion and denoted by $\mathrm{APDC}_{j,k}$. Second, we correct the bias of $\mathrm{APDC}_{j,k}$ appropriately. As a naive estimator of (2.3) we simply propose $\mathrm{APDC}_{j,k} = Q(m_j(y), m_k(y))$. Intuitively, if $\mathrm{APDC}_{j,k}$ is "large", this should indicate that there exists a "large" difference in terms of "predictability" between model $M^j$ and model $M^k$, and thus that the larger model (i.e. model $M^k$) should be preferred. However, $\mathrm{APDC}_{j,k}$, similarly to AErr, is an unreliable estimator of the quantity of interest, as the same data are used to fit the models and to assess their difference in terms of predictions. In the spirit of Efron's optimism, we let $\Delta_{j,k}$ represent the expected bias of $\mathrm{APDC}_{j,k}$, i.e.

$$\Delta_{j,k} = \mathrm{PDC}_{j,k} - E\left[Q(m_j(Y), m_k(Y))\right] \tag{2.5}$$

The following theorem allows us to relate $\Delta_{j,k}$ to a quantity that can be consistently estimated. The proof is given in Appendix A.

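Theorem 1. Assuming one has 2 models at hand, say $M^j$ nested in $M^k$, and assuming further that one has a valid Bregman divergence (2.4) based on $\psi(\cdot)$, then

$$\Delta_{j,k} = \mathrm{PDC}_{j,k} - E\left[Q(m_j(Y), m_k(Y))\right] = \mathrm{tr}\left\{\mathrm{cov}\left[m_j(Y), \nabla\psi(m_k(Y))\right]\right\}$$

with $\mathrm{PDC}_{j,k}$ given in (2.3).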

Therefore, a consistent estimator of (2.3), the PDCE, is obtained as

$$\widehat{\mathrm{PDC}}_{j,k} = \mathrm{PDCE}_{j,k} = Q(m_j(y), m_k(y)) + \mathrm{tr}\left\{\widehat{\mathrm{cov}}\left[m_j(y), \nabla\psi(m_k(y))\right]\right\} \tag{2.6}$$

where $\widehat{\mathrm{cov}}$ is a consistent estimator of the covariance. Depending on the distribution of $Y$, $\widehat{\mathrm{cov}}$ is obtained either analytically up to a value of $\theta$, which is then replaced by a consistent estimator $\hat\theta$, or by resampling methods (see e.g. Efron, 2004).

Hence, assuming that there exist $b$ competing nested models (and the largest model is not the correct one) for describing the behavior of $Y$, we propose to choose the model $M^j$ as the one satisfying

$$\underset{j=1,\ldots,b-1}{\operatorname{argmin}} \; \mathrm{PDCE}_{j,j+1} \tag{2.7}$$

If a clear sequence of competing nested models does not exist, one can build one prior to applying the selection rule (2.7). This will be illustrated when treating the linear regression model in Section 3.1.

3. Linear models

In this Section, we consider in turn the linear regression model and smoothing splines and show that the probability of choosing the correct model using the PDCE as in (2.7) converges to one as the sample size increases. We treat both models separately.

3.1. The linear regression model

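Let us consider the following linear regression model

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$$

where we assume that $X$ is a non-random $n \times p$ full rank matrix. A classical estimator for $\beta$ is the LS estimator $\hat\beta = (X^T X)^{-1} X^T Y$, such that a (linear) prediction function is obtained as

$$m(Y) = X\hat\beta = X(X^T X)^{-1} X^T Y = SY$$

where $S$ is an $n \times n$ matrix not depending on $Y$. Let us consider a squared loss function defined by $\psi(x) = x^T x$. The apparent prediction divergence criterion between model $M^j$ with $p_j$ parameters nested in model $M^k$ with $p_k > p_j$ covariates is then $\mathrm{APDC}_{j,k} = \|m_j(y) - m_k(y)\|_2^2 = \|S^{(j)} y - S^{(k)} y\|_2^2$, with $S^{(l)}$ the hat matrix of model $M^l$, $l = j, k$. Using Theorem 1, the expected bias of $\mathrm{APDC}_{j,k}$ can be expressed as

$$\Delta_{j,k} = \mathrm{tr}\left\{\mathrm{cov}\left[m_j(Y), \nabla\psi(m_k(Y))\right]\right\} = 2\,\mathrm{tr}\left\{\mathrm{cov}\left[S^{(j)} Y, S^{(k)} Y\right]\right\} = 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S^{(j)} S^{(k)}\right) = 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S^{(j)}\right) = 2\sigma_\varepsilon^2\, p_j \tag{3.1}$$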

A suitable estimator is obtained by replacing $\sigma_\varepsilon^2$ by $\hat\sigma_\varepsilon^2$, a consistent estimator, such as the LS estimator at the full model.
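The step $\mathrm{tr}(S^{(j)} S^{(k)}) = \mathrm{tr}(S^{(j)})$ in (3.1) relies on the nesting of the two models. A short justification, writing $X^{(j)}$ and $X^{(k)}$ for the design matrices of $M^j$ and $M^k$ (notation introduced here for illustration): since the columns of $X^{(j)}$ are among those of $X^{(k)}$, the projection $S^{(k)}$ leaves them unchanged, so that

$$\begin{aligned}
S^{(k)} X^{(j)} = X^{(j)} \;&\Rightarrow\; S^{(k)} S^{(j)} = S^{(j)} \;\Rightarrow\; S^{(j)} S^{(k)} = \left(S^{(k)} S^{(j)}\right)^T = S^{(j)}, \\
\mathrm{tr}\left(S^{(j)}\right) &= \mathrm{tr}\left(\left(X^{(j)T} X^{(j)}\right)^{-1} X^{(j)T} X^{(j)}\right) = \mathrm{tr}\left(I_{p_j}\right) = p_j ,
\end{aligned}$$

where the transpose step uses the symmetry of both hat matrices.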

In order to apply rule (2.7), one needs a sequence of competing nested models that includes the correct one. In practice, this sequence might not always exist naturally, and we have instead a set of $p$ predictors which can generate a very large number of nested sequences. We propose here an algorithm for finding this sequence, and will show in Theorem 2 below that, using this algorithm together with rule (2.7), the correct model is selected with probability going to 1 as $n \to \infty$.

We suppose that a "null" model $M^0$, of say size $0 \leq r < p$, is available, which represents the smallest possible model, nested in the correct one. Such a model is typically the model $Y = \varepsilon$ (in that case, $r = 0$). From $M^0$, we choose model $M^1$ that includes a single additional predictor chosen among the $p - r$ available ones, such that the PDCE between the null model and the model with the chosen predictor is maximal. Then, starting from $M^1$, one applies the same rule again until a sequence of $p - r$ nested models is obtained. Let $\mathrm{PDCE}^{(k)}_{j,j+1}$ denote the PDCE between $M^j$ and $M^{j+1}$, with the latter containing the additional predictor $k$ among the $p - j$ available ones; then at model $M^j$, the rule is

$$\underset{k=1,\ldots,p-j}{\operatorname{argmax}} \; \mathrm{PDCE}^{(k)}_{j,j+1} \tag{3.2}$$

Indeed, by choosing the model corresponding to the maximal estimated PDC, one chooses the model whose additional covariate most reduces the prediction error. We note that for the squared loss function, since at model $M^j$ the bias correction (3.1) is the same for all $\mathrm{PDCE}^{(k)}_{j,j+1}$ in (3.2), the latter is equivalent to

$$\underset{k=1,\ldots,p-j}{\operatorname{argmax}} \; \mathrm{APDCE}^{(k)}_{j,j+1} \tag{3.3}$$

with $\mathrm{APDCE}^{(k)}_{j,j+1}$ defined similarly to $\mathrm{PDCE}^{(k)}_{j,j+1}$. The following theorem shows that using rule (3.3) to build a sequence of nested models and, given this sequence, the selection procedure defined by (2.7), is consistent. The proof is given in Appendix C and requires the results of 3 lemmas given in Appendix B.
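The two-stage procedure described above (build the sequence with rule (3.3), then select with rule (2.7)) can be sketched with the squared loss as follows. This is an illustration under stated assumptions, not the authors' implementation: the null model is taken to be empty ($r = 0$, no intercept), $\sigma_\varepsilon^2$ is estimated by the LS residual variance of the full model, the minimum in (2.7) is taken over $j = 0, \ldots, p-1$, and the design is assumed full rank. The function names and the toy data are hypothetical. The sketch uses the fact that, for nested LS fits, $\|S^{(j)} y - S^{(j+1)} y\|_2^2 = (q^T y)^2 / (q^T q)$, where $q$ is the added column orthogonalized against the columns already selected.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pdce_forward_selection(columns, y):
    """Order predictors with rule (3.3), then pick the model size with rule (2.7)."""
    n, p = len(y), len(columns)
    resid = [list(c) for c in columns]   # columns orthogonalized against selected ones
    remaining = list(range(p))
    order, apdc = [], []
    while remaining:
        # Apparent divergence gained by adding each remaining candidate.
        gain = {k: dot(resid[k], y) ** 2 / dot(resid[k], resid[k]) for k in remaining}
        best = max(remaining, key=gain.get)          # rule (3.3)
        order.append(best)
        apdc.append(gain[best])
        remaining.remove(best)
        q = resid[best]
        qq = dot(q, q)
        for k in remaining:                          # Gram-Schmidt update
            c = dot(resid[k], q) / qq
            resid[k] = [a - c * b for a, b in zip(resid[k], q)]
    # Residual variance of the full model (assumed estimator of sigma^2).
    sigma2 = (dot(y, y) - sum(apdc)) / (n - p)
    # PDCE_{j,j+1} = APDC_{j,j+1} + 2 sigma^2 p_j with p_j = j, following (3.1).
    pdce = [apdc[j] + 2.0 * sigma2 * j for j in range(p)]
    size = min(range(p), key=pdce.__getitem__)       # rule (2.7)
    return order, pdce, sorted(order[:size])


def hadamard(m):
    """Rows of a Sylvester Hadamard matrix (mutually orthogonal +/-1 vectors)."""
    h = [[1]]
    while len(h) < m:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    return h


# Toy data: y is built from known components on an orthogonal basis.
h = hadamard(8)
coef = [2, 3, 0.1, -0.05, 0.5, -0.3, 0.2, 0.1]
y = [sum(c * h[r][i] for r, c in enumerate(coef)) for i in range(8)]
# Predictors: two relevant, correlated columns and two nearly irrelevant ones.
cols = [h[0], [a + b for a, b in zip(h[0], h[1])], h[2], h[3]]
order, pdce, selected = pdce_forward_selection(cols, y)
```

In this toy example the two relevant predictors enter the sequence first and the rule (2.7) retains exactly those two.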

Theorem 2. Consider the following linear model

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$$

where $X$ is a non-random $n \times p$ full rank matrix and $\beta = (\beta_j)_{j=1,\ldots,p} \in \mathbb{R}^p$. Let $J = \{j : \beta_j \neq 0, \; j = 1, \ldots, p\}$ be the set of indices of the non-zero components of $\beta$ and let $q = \mathrm{card}(J)$ denote the number of elements in $J$. Let further $L$ be the set of indices of the elements of $\beta$ selected by the PDCE with a squared loss function (i.e. the solution of (2.7) obtained after the columns of $X$ are reorganized according to the iterative process with rule (3.3)). Then, assuming $1 \leq q < p$, we have asymptotically that $J = L$.

Hence, given a null model and a complete model that is not equal to the correct one, using (3.3) and (2.7) with a squared loss function is a consistent selection procedure. The constraint that the complete model is not equal to the correct one is in practice not really difficult to satisfy, since one can always add a covariate that is generated randomly.

Consistency is an important property if one supposes that the correct model is among the potential ones. To then differentiate consistent criteria, one can compute the STN ratio (see e.g. McQuarrie and Tsai, 1998). In a stepwise selection procedure, selection is performed on the basis of the difference between the estimates of a criterion at two consecutive models, so that a sensible measure is given by the STN ratio of the difference and not of the criterion itself. Suppose we have three competing models, $M^1$ nested in $M^2$ nested in $M^3$, with respectively $r_1$, $r_1 + r_2$ and $r_1 + r_2 + r_3$ covariates. Consider the differences in criterion for respectively the $C_p$, the AIC, the BIC and the PDCE; in Theorem 3 we derive the STN ratio of these differences. The proof is provided in Appendix D.

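Theorem 3. Consider the following three competing models

$$\begin{aligned}
M^1 &: \; Y = X^{(1)}\beta_1 + \varepsilon^{\star} \\
M^2 &: \; Y = X^{(1)}\beta_1 + X^{(2)}\beta_2 + \varepsilon^{0} \\
M^3 &: \; Y = X^{(1)}\beta_1 + X^{(2)}\beta_2 + X^{(3)}\beta_3 + \varepsilon
\end{aligned}$$

where $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ are respectively $n \times r_1$, $n \times r_2$ and $n \times r_3$ matrices and the dimensions of $\beta_1$, $\beta_2$ and $\beta_3$ are defined accordingly. We assume that the matrix $[X^{(1)} \; X^{(2)} \; X^{(3)}]$ is a non-random $n \times (r_1 + r_2 + r_3)$ full rank matrix. Let $Y \sim \mathcal{N}(X\beta, \sigma_\varepsilon^2 I)$ where $\sigma_\varepsilon^2$ is known. Let $C_{p,l}$, $\mathrm{AIC}_l$ and $\mathrm{BIC}_l$, $l = 1, 2, 3$, be respectively the $C_p$, AIC and BIC for models $M^l$, and let $\mathrm{PDCE}_{j,j+1}$, $j = 1, 2$, be the PDCE with squared loss function for the comparison of models $M^j$ and $M^{j+1}$. Define the following criterion differences for respectively the $C_p$, AIC, BIC and PDCE:

$$\begin{aligned}
\Delta C_{p\,2,1} &= C_{p,2} - C_{p,1} \\
\Delta \mathrm{AIC}_{2,1} &= \mathrm{AIC}_2 - \mathrm{AIC}_1 \\
\Delta \mathrm{BIC}_{2,1} &= \mathrm{BIC}_2 - \mathrm{BIC}_1 \\
\Delta \mathrm{PDCE}_{3,1} &= \mathrm{PDCE}_{2,3} - \mathrm{PDCE}_{1,2}
\end{aligned}$$

Let also $\lambda = \dfrac{d^T d}{\sigma_\varepsilon^2}$ and $\rho = \dfrac{h^T h}{\sigma_\varepsilon^2}$, with $d = \left(S^{(12)} - S^{(1)}\right) X\beta$, $h = \left(S^{(123)} - S^{(12)}\right) X\beta$, $S^{(12)} = X^{(12)} \left(X^{(12)T} X^{(12)}\right)^{-1} X^{(12)T}$, $S^{(123)} = X^{(123)} \left(X^{(123)T} X^{(123)}\right)^{-1} X^{(123)T}$, $X^{(12)} = [X^{(1)} \; X^{(2)}]$ and $X^{(123)} = [X^{(1)} \; X^{(2)} \; X^{(3)}]$. Then

$$\mathrm{STN}\left(\Delta C_{p\,2,1}\right) = \frac{\sqrt{2}}{2}\, \frac{r_2 - \lambda}{\sqrt{r_2 + 2\lambda}} \tag{3.4}$$

$$\mathrm{STN}\left(\Delta \mathrm{AIC}_{2,1}\right) = \frac{\sqrt{2}}{2}\, \frac{r_2 - \lambda}{\sqrt{r_2 + 2\lambda}} \tag{3.5}$$

$$\mathrm{STN}\left(\Delta \mathrm{BIC}_{2,1}\right) = \frac{\sqrt{2}}{2}\, \frac{\log(n) - (r_2 + \lambda)}{\sqrt{r_2 + 2\lambda}} \tag{3.6}$$

$$\mathrm{STN}\left(\Delta \mathrm{PDCE}_{3,1}\right) = \frac{\sqrt{2}}{2}\, \frac{r_2 + r_3 + \rho - \lambda}{\sqrt{r_2 + r_3 + 2(\rho + \lambda)}} \tag{3.7}$$

To illustrate the results of Theorem 3, we computed the STN for the $C_p$ (or AIC), the BIC and the PDCE to compare the following models: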

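$$\begin{aligned}
M^1 &: \; Y_i = \alpha + \varepsilon_i \\
M^2 &: \; Y_i = \alpha + \beta x_i + \varepsilon_i \\
M^3 &: \; Y_i = \alpha + \beta x_i + \gamma w_i + \varepsilon_i
\end{aligned}$$

where $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon^2$ having a known value of 1. The $n = 100$ values of $w_i$ are generated from a $\mathcal{N}(0, 1)$ and the values of $x_i$ are equidistant in $[0, 1.3]$. We consider for $\beta$ 100 different values equally spaced between 0 and 1.3 and set $\alpha = 1$ and $\gamma = 0$. Therefore, when $\beta = 0$ the true model is $M^1$, while for $\beta > 0$ the true model is $M^2$.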

The left panel of Figure 2 shows the STN ratio of the $C_p$ (which in this case is equivalent to the STN ratio of the AIC), the BIC and the PDCE for selection between $M^1$ and $M^2$, as a function of $\beta$. For any value of $\beta \in [0, 1.3]$ the BIC has the largest STN while the $C_p$ has the lowest. The STN of the PDCE is nearer in behavior to that of the $C_p$. As can be seen in the right panel of Figure 2, a criterion with a high STN ratio tends to choose the smallest model, i.e. $M^1$: when $\beta > 0$, the BIC has a smaller probability of selecting the correct model $M^2$, while the PDCE has a larger probability. The behavior of the PDCE is hence a compromise between the behavior of the BIC and that of the $C_p$: it is consistent like the BIC, but with a larger probability of selecting the larger model when it is correct.
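The STN ratios of this illustration can be evaluated directly from (3.4), (3.6) and (3.7). The sketch below is an illustration with hypothetical function names. It uses $r_2 = r_3 = 1$ and the fact that, in this setting, $d = \beta(x - \bar{x} e_n)$, so that $\lambda = \beta^2 \sum_i (x_i - \bar{x})^2 / \sigma_\varepsilon^2$, while $\rho = 0$ because $\gamma = 0$; these two simplifications are derived here from the definitions of $d$ and $h$ rather than quoted from the text. The BIC formula is transcribed from (3.6) as printed.

```python
import math


def stn_cp(r2, lam):
    """STN ratio (3.4) of the Cp (or AIC) difference."""
    return (r2 - lam) / math.sqrt(2.0 * (r2 + 2.0 * lam))


def stn_bic(n, r2, lam):
    """STN ratio of the BIC difference, transcribed from (3.6); here r2 = 1."""
    return (math.log(n) - (r2 + lam)) / math.sqrt(2.0 * (r2 + 2.0 * lam))


def stn_pdce(r2, r3, lam, rho):
    """STN ratio (3.7) of the PDCE difference."""
    return (r2 + r3 + rho - lam) / math.sqrt(2.0 * (r2 + r3 + 2.0 * (rho + lam)))


n = 100
x = [1.3 * i / (n - 1) for i in range(n)]      # equidistant values in [0, 1.3]
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)


def noncentrality(beta, sigma2=1.0):
    """lambda = d^T d / sigma^2 with d = beta * (x - mean(x))."""
    return beta ** 2 * sxx / sigma2


rows = []
for beta in (0.0, 0.2, 0.65, 1.3):
    lam = noncentrality(beta)
    rows.append((beta, stn_cp(1, lam), stn_bic(n, 1, lam), stn_pdce(1, 1, lam, 0.0)))

for beta, cp, bic, pdce in rows:
    print(f"beta={beta:4.2f}  Cp/AIC={cp:7.3f}  BIC={bic:7.3f}  PDCE={pdce:7.3f}")
```

At each of these values of $\beta$ the BIC has the largest ratio and the $C_p$ the smallest, with the PDCE in between, in line with the ordering described above.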

3.2. Smoothing Splines

Fig 2. (Left panel) STN ratio of $C_p$ (or AIC), BIC and PDCE for selection between $M^1$ and $M^2$, as a function of $\beta$. (Right panel) Percentage selection of model $M^2$ using the $C_p$ (or AIC), BIC and PDCE, as a function of $\beta$, based on $10^4$ simulated samples.

We consider here smoothing splines as an extension of the linear model to account for possible nonlinearity of the relationship between the regressor and the response. Indeed, the simple linear model ($M^1$)

$$Y_i = \alpha + \gamma z_i + \varepsilon_i^{\star}, \qquad i = 1, \ldots, n \tag{3.8}$$

is compared to the nonlinear model ($M^2$)

$$Y_i = \alpha + f(z_i) + \varepsilon_i, \qquad i = 1, \ldots, n \tag{3.9}$$

where $f(z)$ is a smoothing spline. Let also $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. It is estimated, for a given spline, by means of the (sample) penalized residual sum of squares

$$\underset{f}{\operatorname{argmin}} \; \sum_{i=1}^{n} \{y_i - f(z_i)\}^2 + \lambda \int \{f''(t)\}^2 \, dt$$

The solution depends on a smoothing parameter $\lambda$ that can be chosen automatically by minimizing a criterion such as the Generalized Cross-Validation (GCV)

$$\mathrm{GCV}\left(\hat{f}_\lambda(y)\right) = \frac{1}{n}\sum_{i=1}^{n} \left\{\frac{y_i - \hat{y}_i}{1 - S(\lambda)_{ii}}\right\}^2 \tag{3.10}$$

where $S(\lambda)_{ii}$ are the diagonal elements of $S(\lambda)$ in $\hat\alpha e_n + \hat{f}_\lambda(y) = S(\lambda) y$, with $S(\lambda)$ a non-random matrix (see e.g. Hastie et al., 2009) and $e_n$ an $n$-vector of ones.

Actually, there is a monotonic relationship between the smoothing parameter $\lambda$ and the effective degrees of freedom of the smoothing spline through $df = \mathrm{tr}(S(\lambda))$. This means that $\lambda$ can alternatively be determined through the specification of $df$, which also represents the degree of complexity of the model. Since evaluating the added value of relaxing the hypothesis of linearity by comparing (3.8) and (3.9) using the PDCE requires a third, more complex model, we use this relationship to build the latter.

Let $m_1(Y) = e_n\hat\alpha + z\hat\gamma = SY$ with $\hat\gamma$ the LS estimator of $\gamma$ in (3.8), and $m_2(Y) = e_n\hat\alpha + \hat{f}_{\lambda_2}(Y) = S(\lambda_2) Y$ with $\lambda_2$ determined using the GCV on model (3.9). Let also $m_3(Y) = S(\lambda_3) Y$ with $\lambda_3$ such that $\mathrm{tr}(S(\lambda_3)) > \mathrm{tr}(S(\lambda_2))$, associated to model $M^3$. For example, one can choose the same increase of complexity as between the linear and nonlinear model, i.e. $\mathrm{tr}(S(\lambda_3)) - 2 = 2\left(\mathrm{tr}(S(\lambda_2)) - 2\right)$.

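We have that for a squared loss function $\psi(x) = x^T x$

$$\mathrm{tr}\left\{\mathrm{cov}\left[m_1(Y), \nabla\psi(m_2(Y))\right]\right\} = 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S S(\lambda_2)\right) = 4\sigma_\varepsilon^2 \tag{3.11}$$

and

$$\mathrm{tr}\left\{\mathrm{cov}\left[m_2(Y), \nabla\psi(m_3(Y))\right]\right\} = 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S(\lambda_2) S(\lambda_3)\right) = 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S(\lambda_2)\right)$$

Hence,

$$\begin{aligned}
\mathrm{PDCE}_{1,2} &= \|Sy - S(\lambda_2) y\|_2^2 + 4\sigma_\varepsilon^2 \\
\mathrm{PDCE}_{2,3} &= \|S(\lambda_2) y - S(\lambda_3) y\|_2^2 + 2\sigma_\varepsilon^2\, \mathrm{tr}\left(S(\lambda_2)\right)
\end{aligned} \tag{3.12}$$

with $\sigma_\varepsilon^2$ replaced by a consistent estimator (we chose the residual variance estimate of $M^2$). The simple linear regression model can be extended to the multiple linear regression model by replacing $\alpha$ by $X\beta$, containing the intercept part and a set of $p - 1$ covariates. In that case, (3.11) becomes $2p\sigma_\varepsilon^2$. Given a sample, the selection rule consists in choosing the linear model $M^1$ if $\mathrm{PDCE}_{1,2} < \mathrm{PDCE}_{2,3}$ and the nonlinear one $M^2$ otherwise. In the following theorem, we show that this selection procedure is consistent in choosing between the linear and nonlinear model. The proof is given in Appendix E.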
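Once the three fits are available, the selection rule based on (3.12) is a simple comparison. The sketch below assumes the fitted vectors $Sy$, $S(\lambda_2)y$ and $S(\lambda_3)y$, the effective degrees of freedom $\mathrm{tr}(S(\lambda_2))$ and an estimate of $\sigma_\varepsilon^2$ have already been computed elsewhere (for instance with a spline fitting routine); the function name and the toy numbers are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def choose_linear_or_spline(fit1, fit2, fit3, df2, sigma2, p=2):
    """Return (chosen model, PDCE_{1,2}, PDCE_{2,3}) following (3.12).

    fit1, fit2, fit3: fitted values S y, S(lambda_2) y and S(lambda_3) y.
    df2: tr(S(lambda_2)); sigma2: estimate of the error variance.
    p: number of parameters of the linear model (2 for the simple model).
    """
    pdce_12 = squared_distance(fit1, fit2) + 2.0 * p * sigma2
    pdce_23 = squared_distance(fit2, fit3) + 2.0 * sigma2 * df2
    chosen = 1 if pdce_12 < pdce_23 else 2
    return chosen, pdce_12, pdce_23


# Toy fitted values: the spline fits barely depart from the linear fit.
linear_fit = [1.0, 2.0, 3.0, 4.0]
spline_fit = [1.1, 1.9, 3.1, 3.9]
richer_fit = [1.15, 1.85, 3.1, 3.95]
case_a = choose_linear_or_spline(linear_fit, spline_fit, richer_fit, df2=3.0, sigma2=0.5)

# Toy fitted values: the spline fit departs markedly from the linear fit.
spline_fit_b = [2.0, 1.0, 4.0, 3.0]
richer_fit_b = [2.1, 0.9, 4.1, 2.9]
case_b = choose_linear_or_spline(linear_fit, spline_fit_b, richer_fit_b, df2=3.0, sigma2=0.5)
```

In the first case the linear model is retained, in the second the nonlinear one.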

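Theorem 4. Consider the two competing models:

$$\begin{aligned}
M^1 &: \; Y_i = x_i\beta + z_i\gamma + \varepsilon_i^{(1)} \\
M^2 &: \; Y_i = x_i\beta + f_{\lambda_2}(z_i) + \varepsilon_i^{(2)}
\end{aligned}$$

where $f(\cdot)$ is a smoothing spline. Let $S$ and $S(\lambda_2)$ be the projection matrices of respectively $M^1$ and $M^2$, with $\mathrm{tr}(S) = p < \mathrm{tr}(S(\lambda_2))$. Let also

$$M^3 : \; Y_i = x_i\beta + f_{\lambda_3}(z_i) + \varepsilon_i$$

with $\mathrm{tr}(S(\lambda_3)) > \mathrm{tr}(S(\lambda_2))$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. We assume that the design matrices associated to models $M^1$ to $M^3$ are full rank so that the projection matrices $S$, $S(\lambda_2)$ and $S(\lambda_3)$ exist. Let $m = 1$ if $M^1$ is the correct model and $m = 2$ if $M^2$ is the correct model. Let also $\hat{m} = 1$ if $\mathrm{PDCE}_{1,2} < \mathrm{PDCE}_{2,3}$ (defined in (3.12)) and $\hat{m} = 2$ otherwise. We then have that asymptotically $\hat{m} = m$.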

4. Simulation Study

In this section we perform several simulation studies to compare the performance of the PDCE with other selection methods. The latter include the lasso, using the R function lars of the lars package with the shrinkage factor chosen by minimizing the $C_p$ statistic, stepwise forward selection using the AIC criterion, and stepwise forward selection using the BIC criterion. As a benchmark, we also consider the LS estimator on the full model. The performance is measured by means of the indicators provided in Table 1 as well as the distribution (boxplot) of $\mathrm{PE}_y$ and $\mathrm{MSE}_\beta$ computed on 500 simulated samples under the correct model.

We will consider the models for which the theoretical aspects have been developed in Section 3, as well as autoregressive (AR) models. Although the theory in Section 3 does not apply exactly to the latter, the simulation study will demonstrate that the PDCE nevertheless has a very good performance.

Across simulation studies, we will manipulate the level of “sparsity”, the level of correlation among the covariates, the ratio of the sample size and number of covariates, as well as the signal to noise ratio of the slopes.

4.1. Linear regression model

The mean squared error of prediction of the method is evaluated on the test sample using the squared loss function, i.e.

$$\mathrm{PE}_y = \frac{1}{n_{\text{test}}}\,\|y_{\text{test}} - X_{\text{test}}\hat{\beta}\|_2^2 \qquad (4.1)$$

The estimation precision is assessed using the average mean squared error of estimation (estimated on the training set only), i.e.

$$\mathrm{MSE}_\beta = \|\hat{\beta} - \beta\|_2^2 \qquad (4.2)$$

The boxplots and the medians of both $\mathrm{PE}_y$ and $\mathrm{MSE}_\beta$ are then used to compare the methods. We also compute the standard deviation of the medians by parametric bootstrap based on $B = 500$ replications.
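The two indicators in (4.1) and (4.2) can be computed as in the following minimal sketch. The function and variable names are hypothetical.

```python
import numpy as np

def prediction_error(y_test, X_test, beta_hat):
    """PE_y of (4.1): mean squared prediction error on the test sample."""
    resid = y_test - X_test @ beta_hat
    return float(resid @ resid) / len(y_test)

def mse_beta(beta_hat, beta):
    """MSE_beta of (4.2): squared Euclidean distance to the true coefficients."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(d @ d)
```

A method that sets a coefficient to zero contributes the squared true value of that coefficient to $\mathrm{MSE}_\beta$, so both under- and over-fitting are penalized.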

Simulation 1. In the first situation we consider a linear model

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$$

with $\beta = (1,0,0,0,0,2,1,0,0,0,0,2)$ and $\sigma = 1$. We also arbitrarily set the pairwise correlation between $X_j$ and $X_k$ to $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of about 67.3% and to a signal-to-noise ratio for the significant slope coefficients of roughly 2.1. We choose $n = 20$ for the training and $n^\star = 200$ for the test samples. The performance of the PDCE compared to other selection methods using the indicators of Table 1 is given in Table 3. As in the settings of Table 2, the performance of the PDCE is quite good, especially in terms of mean squared error of prediction and probability of selecting the correct model. Figure 3 shows this feature even more clearly.
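A data-generating step of this kind can be sketched as follows. The sketch assumes Gaussian covariates with zero mean and unit variance and no intercept, none of which is stated in the text. The function names are hypothetical.

```python
import numpy as np

def ar1_corr(p, rho):
    """Correlation matrix with entries rho^{|j - k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n, beta, rho, sigma, rng):
    """Draw X with correlated Gaussian columns (assumption) and y = X beta + eps."""
    beta = np.asarray(beta, dtype=float)
    Sigma = ar1_corr(len(beta), rho)
    X = rng.multivariate_normal(np.zeros(len(beta)), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

# Setting of simulation 1: n = 20 training and 200 test observations.
rng = np.random.default_rng(1)
beta = [1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 2]
X_train, y_train = simulate(20, beta, rho=0.5, sigma=1.0, rng=rng)
X_test, y_test = simulate(200, beta, rho=0.5, sigma=1.0, rng=rng)
```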

Simulation 2. In the second simulation, we consider

$$\beta = (1, \underbrace{0, \ldots, 0}_{19}) \qquad (4.3)$$

[Figure 3: boxplot panels for $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$]

Fig 3. Boxplots of $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 1), with $n = 20$ for the training and $n^\star = 200$ for the test samples. The true parameter vector is $\beta = (1,0,0,0,0,2,1,0,0,0,0,2)$ and $\sigma = 1$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of about 67.3% and to a signal-to-noise ratio for the significant slope parameters of roughly 2.1.

Method   Med(PE_y)        Med(MSE_β)
LS       1.891 (0.032)    1.417 (0.057)    0       100     2       8       10
AIC      1.466 (0.022)    0.551 (0.033)    13.6    98.8    1.99    2.35    4.34
BIC      1.313 (0.019)    0.350 (0.025)    34.2    97.6    1.98    1.26    3.24
PDCE     1.168 (0.018)    0.166 (0.015)    55.4    95.6    1.95    0.62    2.57
lasso    1.296 (0.020)    0.318 (0.020)    14.2    98.6    1.99    2.63    4.62

Table 3

Evaluation criteria as explained in Table 1 for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 1), with $n = 20$ for the training and $n^\star = 200$ for the test samples. The true parameter vector is $\beta = (1,0,0,0,0,2,1,0,0,0,0,2)$ and $\sigma = 1$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of about 67.3% and to a signal-to-noise ratio for the significant slope parameters of roughly 2.1. The numbers in parentheses for the columns Med(PE_y) and Med(MSE_β) are the corresponding standard errors estimated by bootstrap with $B = 500$ resamples. Bold figures are for the best observed criteria (before rounding).

and $\sigma = 2$. The pairwise correlations between $X_j$ and $X_k$ are arbitrarily set to $\mathrm{corr}(X_j, X_k) = 0$, $j \neq k$. This situation corresponds to a theoretical $R^2$ of 33.33% and to a signal-to-noise ratio for the significant slope parameter of 0.5. We choose $n = 30$ for the training and $n^\star = 300$ for the test samples. Table 4 and Figure 4 present the results of this simulation. In this “sparse” scenario the performance of the PDCE is very good, especially in terms of $\mathrm{MSE}_\beta$, which is about 3 times lower than that of the lasso and about 10 times lower than that of the BIC. Moreover, the PDCE selects the correct model 57.6% of the time, while the lasso and the BIC select this model only 30.4% and 18.0% of the time, respectively.

[Figure 4: boxplot panels for $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$]

Fig 4. Boxplots of $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 2), with $n = 30$ for the training and $n^\star = 300$ for the test samples. The true parameter vector is given in (4.3) and $\sigma = 2$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0$, $j \neq k$. This situation corresponds to a theoretical $R^2$ of 33.33% and to a signal-to-noise ratio for the significant slope parameter of 0.5.

Method   Med(PE_y)        Med(MSE_β)
LS       7.569 (0.177)    5.504 (0.207)    0       100     1       19      20
AIC      4.038 (0.095)    1.956 (0.119)    1.2     100     1       6.84    7.84
BIC      2.775 (0.037)    0.731 (0.058)    18.0    100     1       2.64    3.64
PDCE     2.187 (0.025)    0.079 (0.020)    57.6    99.6    1       0.69    1.69
lasso    2.240 (0.022)    0.213 (0.015)    30.4    100     1       3.85    4.85

Table 4

Evaluation criteria as explained in Table 1 for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 2), with $n = 30$ for the training and $n^\star = 300$ for the test samples. The true parameter vector is given in (4.3) and $\sigma = 2$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0$, $j \neq k$. This situation corresponds to a theoretical $R^2$ of 33.33% and to a signal-to-noise ratio for the significant slope parameter of 0.5. The numbers in parentheses for the columns Med(PE_y) and Med(MSE_β) are the corresponding standard errors estimated by bootstrap with $B = 500$ resamples. Bold figures are for the best observed criteria (before rounding).

Simulation 3. In the third simulation, we consider the case where

$$\beta = (0, 0.4, 0, 0.4, 0, 0.4)$$

and $\sigma = 1$. The pairwise correlations between $X_j$ and $X_k$ are set to $\mathrm{corr}(X_j, X_k) = 0.8^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of roughly 50% and to a signal-to-noise ratio for the significant slope parameters of about 1. We choose $n = 100$ for the training and $n^\star = 1000$ for the test samples. Table 5 and Figure 5 present the results of this simulation. In this relatively “dense” scenario the lasso performs best in terms of $\mathrm{PE}_y$ and of $\mathrm{MSE}_\beta$. The performances of the PDCE, AIC and BIC are very similar.

[Figure 5: boxplot panels for $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$]

Fig 5. Boxplots of $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 3), with $n = 100$ for the training and $n^\star = 1000$ for the test samples. The true parameter vector is $\beta = (0, 0.4, 0, 0.4, 0, 0.4)$ and $\sigma = 1$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.8^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of roughly 50% and to a signal-to-noise ratio for the significant slope parameters of about 1.

Method   Med(PE_y)        Med(MSE_β)
LS       1.054 (0.003)    0.180 (0.008)    0       100     3       3       6
AIC      1.053 (0.004)    0.169 (0.021)    43.0    65.8    2.59    0.63    3.21
BIC      1.063 (0.005)    0.184 (0.042)    48.6    50.6    2.37    0.30    2.67
PDCE     1.056 (0.005)    0.172 (0.033)    47.2    55.8    2.42    0.40    2.82
lasso    1.046 (0.004)    0.124 (0.007)    15.4    90.4    2.90    1.43    4.33

Table 5

Evaluation criteria as explained in Table 1 for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 3), with $n = 100$ for the training and $n^\star = 1000$ for the test samples. The true parameter vector is $\beta = (0, 0.4, 0, 0.4, 0, 0.4)$ and $\sigma = 1$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.8^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of roughly 50% and to a signal-to-noise ratio for the significant slope parameters of about 1. The numbers in parentheses for the columns Med(PE_y) and Med(MSE_β) are the corresponding standard errors estimated by bootstrap with $B = 500$ resamples. Bold figures are for the best observed criteria (before rounding).

[Figure 6: boxplot panels for $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$]

Fig 6. Boxplots of $\mathrm{MSE}_\beta$ and $\mathrm{PE}_y$ for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 4), with $n = 100$ for the training and $n^\star = 1000$ for the test samples. The true parameter vector is given in (4.4) and $\sigma = 2$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.5% and to a signal-to-noise ratio for the significant slope parameters of about 7.7.

Simulation 4. In the fourth simulation, we consider the case where

$$\beta = (2,0,1,2,0,1,\underbrace{0, \ldots, 0}_{40},2,0,1,2,0,1) \qquad (4.4)$$

and $\sigma = 2$. The pairwise correlation between $X_j$ and $X_k$ is arbitrarily set to $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.5% and to a signal-to-noise ratio for the significant slope parameters of about 7.7. We choose $n = 100$ for the training and $n^\star = 1000$ for the test samples. Table 6 and Figure 6 present the results of this simulation.
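The theoretical $R^2$ values quoted for simulations 3 and 4 can be reproduced with a short computation. The sketch assumes the definition $R^2 = \beta^\top \Sigma \beta / (\beta^\top \Sigma \beta + \sigma^2)$, with $\Sigma$ the correlation matrix of unit-variance covariates. This definition is not spelled out in the text and is an assumption, as are the function names.

```python
import numpy as np

def ar1_corr(p, rho):
    """Correlation matrix with entries rho^{|j - k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def theoretical_r2(beta, rho, sigma):
    """Assumed definition: signal variance over total variance."""
    beta = np.asarray(beta, dtype=float)
    signal = beta @ ar1_corr(len(beta), rho) @ beta
    return float(signal / (signal + sigma ** 2))

block = [2, 0, 1, 2, 0, 1]
beta_sim3 = [0, 0.4, 0, 0.4, 0, 0.4]
beta_sim4 = block + [0] * 40 + block                      # vector in (4.4)
r2_sim3 = theoretical_r2(beta_sim3, rho=0.8, sigma=1.0)   # about 0.505
r2_sim4 = theoretical_r2(beta_sim4, rho=0.5, sigma=2.0)   # about 0.885
```

Under this assumed definition the two values agree with the "roughly 50%" and "approximately 88.5%" reported for these settings.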

In this “sparse” scenario the performance of the PDCE is much better than that of its competitors in terms of $\mathrm{PE}_y$, of $\mathrm{MSE}_\beta$ and of the proportion of times the correct model is selected. Moreover, the PDCE produces models that contain on average 8.23 regressors, which is very close to the true model dimension (i.e. 8), while its competitors select about 10 regressors for the BIC and about 20 regressors for the AIC and the lasso.

Simulations 1 to 4 indicate that the PDCE produces sparse solutions and that this criterion performs best in such settings. In more “dense” situations (as in simulation 3) the lasso performs better than the AIC, BIC and PDCE, but these three criteria perform approximately equally well. However, a closer look at Table 5 reveals that the PDCE seems to perform slightly better than the BIC (at least in terms of $\mathrm{PE}_y$ and of $\mathrm{MSE}_\beta$) by selecting larger models. In “sparse” situations (e.g. simulation 2 or 4) the PDCE achieves better model selection performance than the BIC by selecting less complex models. This indicates that the PDCE can be seen as an improved version of the BIC. This feature will also be observed later with other models.

Method   Med(PE_y)        Med(MSE_β)
LS       8.251 (0.065)    7.012 (0.14)     0       100     8       44      52
AIC      6.349 (0.058)    3.418 (0.122)    0       98.0    7.98    12.4    20.38
BIC      4.864 (0.037)    1.017 (0.050)    17.4    91.0    7.89    2.14    10.03
PDCE     4.610 (0.034)    0.662 (0.021)    41.2    78.2    7.62    0.61    8.23
lasso    5.111 (0.026)    1.210 (0.023)    0       98.8    7.99    12.23   20.21

Table 6

Evaluation criteria as explained in Table 1 for the full model with the LS estimator (LS), stepwise forward AIC (AIC), stepwise forward BIC (BIC), the PDCE (PDCE) and the lasso (lasso) based on 500 simulated samples under the correct model (as presented in simulation 4), with $n = 100$ for the training and $n^\star = 1000$ for the test samples. The true parameter vector is given in (4.4) and $\sigma = 2$, the pairwise correlation between $X_j$ and $X_k$ is $\mathrm{corr}(X_j, X_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.5% and to a signal-to-noise ratio for the significant slope parameters of about 7.7. The numbers in parentheses for the columns Med(PE_y) and Med(MSE_β) are the corresponding standard errors estimated by bootstrap with $B = 500$ resamples. Bold figures are for the best observed criteria (before rounding).

4.2. Smoothing Splines

In this section, we consider a univariate setting and compare the performance of the PDCE to the $C_p$. We also consider, as a criterion, choosing the model that minimizes the GCV in (3.10). The simulation setting is very similar to the one in Greven and Kneib (2010), who compare, in the context of linear mixed models, the performance of the AIC and various versions of the cAIC (see Vaida and Blanchard (2005) and Greven and Kneib (2010) for details). Two models are compared:

$$M^1:\; Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i^{(1)}$$

$$M^2:\; Y_i = f_\lambda(x_i) + \varepsilon_i^{(2)} \qquad (4.5)$$

where the function $f_\lambda(\cdot)$ is a smoothing spline with smoothing parameter $\lambda$. The GCV is used to select the smoothing parameter $\lambda$. In order to use the PDCE to select $M^1$ or $M^2$, we need to define a third model $M^3$, i.e.

$$M^3:\; Y_i = f_{\lambda^\star}(x_i) + \varepsilon_i^{(3)}$$

with $\lambda^\star$ such that

$$\mathrm{tr}(S_3) = \mathrm{tr}(S_1) + 2\,[\mathrm{tr}(S_2) - \mathrm{tr}(S_1)] = 2\,\mathrm{tr}(S_2) - \mathrm{tr}(S_1)$$

and with $S_l$ denoting the projection matrix of model $M^l$, $l = 1, 2, 3$. Therefore, $\lambda^\star$ is chosen such that the increase in complexity between $M^1$ and $M^2$ is the same as between $M^2$ and $M^3$. This satisfies the conditions of Theorem 4.
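One way to find a smoothing parameter that meets this trace condition is a one-dimensional search, since the trace of a penalized smoother decreases as the penalty grows. The sketch below uses a penalized truncated-line basis as a simple stand-in for a smoothing spline. The basis, the knot placement, the value $\mathrm{tr}(S_2) = 5$ (which would come from the GCV choice in practice) and all names are assumptions.

```python
import numpy as np

def smoother_matrix(x, knots, lam):
    """Penalized truncated-line smoother (stand-in for a smoothing spline)."""
    B = np.column_stack([np.ones_like(x), x]
                        + [np.maximum(x - k, 0.0) for k in knots])
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))  # penalize knot terms only
    return B @ np.linalg.solve(B.T @ B + lam * D, B.T)

def lambda_for_trace(x, knots, target, lo=1e-6, hi=1e6, iters=100):
    """Bisection on the log scale; tr(S(lam)) decreases as lam grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if np.trace(smoother_matrix(x, knots, mid)) > target:
            lo = mid  # fit too flexible: increase the penalty
        else:
            hi = mid  # fit too smooth: decrease the penalty
    return np.sqrt(lo * hi)

x = np.linspace(0.0, 1.0, 100)
knots = np.linspace(0.05, 0.95, 20)
tr_S1 = 2.0                       # linear model M1: intercept and slope
tr_S2 = 5.0                       # hypothetical complexity of M2
lam2 = lambda_for_trace(x, knots, tr_S2)
tr_S3 = 2.0 * tr_S2 - tr_S1       # trace condition for M3
lam_star = lambda_for_trace(x, knots, tr_S3)
S3 = smoother_matrix(x, knots, lam_star)
```

Because $M^3$ must be more complex than $M^2$, the resulting `lam_star` is smaller than `lam2`.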