Fast Robust Model Selection in Large Datasets

DUPUIS, Debbie, VICTORIA-FESER, Maria-Pia

DUPUIS, Debbie, VICTORIA-FESER, Maria-Pia. Fast Robust Model Selection in Large Datasets. Journal of the American Statistical Association, 2011, vol. 106, p. 203-212

DOI : 10.1198/jasa.2011.tm09650

Available at:

http://archive-ouverte.unige.ch/unige:12736

Disclaimer: layout of this document may differ from the published version.

Fast Robust Model Selection in Large Datasets ^∗

Debbie J. Dupuis and Maria-Pia Victoria-Feser, HEC Montréal and HEC Genève

Fourth Revision October 29, 2010

Abstract

Large datasets are more and more common in many research fields. In particular, in the linear regression context, it is often the case that a huge number of potential covariates are available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done in a forward selection procedure that includes the selection of the variable to enter, the decision to retain it or stop the selection and estimation of the augmented model. Least squares plus t-tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this paper, we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Since simply replacing the classical statistical criteria by robust ones is not computationally possible, we develop simplified robust estimators, selection criteria and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t-statistic that we compare to a false discovery rate adjusted level. We carry out a simulation study to show the good performance of our approach. We also analyze two datasets and show that the results obtained by our method outperform those from robust LARS and random forests.

Supplemental materials are available online; see http link.

Keywords: Linear regression, multicollinearity, M-estimator, robust t-test, partial correlation, false discovery rate, LARS, random forests.

∗The first author acknowledges the support of the Natural Sciences and Engineering Research Council of Canada. The second author acknowledges the support of the Swiss National Science Foundation (grant no 100012-109532). The authors thank Dr. Susanne Ragg from the Center for Computational Diagnostics at Indiana University for providing access to the protein data as well as Dr. Olga Vitek of the Department of Statistics and the Department of Computer Science at Purdue University for her advice on the analysis of these data. The authors also thank the Editor, the Associate Editor and two referees for comments that greatly improved the presentation.

1 Introduction

Datasets with millions of observations and a huge number of variables are now quite common, especially in finance, management, computer sciences, health sciences, etc. Performing statistical inference with such datasets requires the development of new techniques because otherwise the calculations of the necessary statistics are impossible. One group of techniques is concerned with the improvement of the efficiency of algorithms so as to circumvent computer memory problems. Another group of techniques seeks to develop new statistical tools that can achieve the same goal, i.e. make the calculations feasible. In this paper, we are concerned with the latter.

When the number p of variables is very large, good statistical practice requires that this number be reduced by means of proper statistical criteria in order to better understand the phenomenon under investigation. However, standard statistical techniques which often take into account all the variables simultaneously cannot be used when p is too large, either because of the curse of dimensionality or because they involve computations of statistical criteria on all combinations of subsets of variables.

For the linear regression problem, simple forward selection procedures that are computationally feasible exist and can be readily used. They involve the estimation of an initial model, to which one or a group of potential explanatory variables is added according to a given criterion, usually based on a test statistic, until the criterion is no longer satisfied. Recently, criteria that take into account the false discovery rate (FDR) have flourished; see e.g. Benjamini and Gavrilov (2009), Gavrilov, Benjamini, and Sarkar (2009), Wu, Boos, and Stefanski (2007), Abramovich, Benjamini, Donoho, and Johnstone (2006), Benjamini, Krieger, and Yekutieli (2006), Foster and Stine (2004), etc. These methods are all based on the classical least squares (LS) estimator for the linear regression model, but the criterion can be adapted to other estimators.

Basically one needs a computationally fast method to select the best explanatory variables at each stage of the forward selection procedure and a test statistic for testing the significance of the regression coefficient of the selected explanatory variable(s). The p-value is then compared to an adaptive threshold related to the number of explanatory variables already in the model.

In this paper, we are concerned with the use of such procedures but based on robust estimators, and hence robust test statistics, of the linear regression model. As shown in Ronchetti and Staudte (1994), spurious model deviations such as outliers can lead to a completely different, and suboptimal, selected model when a nonrobust criterion, like Mallows’ Cp, is used. This happens because under slight data contamination the estimated model parameters and consequently the model choice criterion can be seriously biased. The consequence is that decisions are taken at the wrong level and this has an effect on the empirical FDR as illustrated in simulations in §4. The problem is not new and several robust model selection procedures have been proposed.

Global criteria such as the AIC (Akaike 1973), the BIC (Schwarz 1978) and Mallows’ Cp (Mallows 1973) have been extended to the robust case by Ronchetti (1982) and Müller and Welsh (2005) for the AIC, Machado (1993) for the BIC and Ronchetti and Staudte (1994) for Mallows’ Cp, respectively. Ronchetti, Field, and Blanchard (1997) develop a robust criterion based on cross-validation (CV). All these criteria are very computer intensive when all the possible models are evaluated, which means that only forward selection procedures can be used in large datasets. Moreover, and even worse, the available robust estimators are simply impossible to compute at the full model if p is too large, and they become useless in a forward selection procedure as the number of explanatory variables increases in the selected model. For example, the robust AIC used in Maronna, Martin, and Yohai (2006), p. 151, in conjunction with a stepwise selection takes too long, or is even impossible, to compute with our data sets.

In the first, we seek to predict the abundance of a protein on the basis of p = 159 measured covariate values for n = 231 observations. In the second, we seek to explain the average price asked for a housing unit given p = 1325 covariate values for n = 13,970 observations. None of the available robust estimators with good efficiency and breakdown properties can be used without some modification or simplification.

In this paper, we propose a fast robust forward selection (FRFS) procedure based on fast-to-compute robust estimators and robust significance test statistics. We follow the procedure proposed in Foster and Stine (2004), based on partial correlations for the selection criterion and the t-test for the stopping criterion with FDR correction, and consider new robust criteria instead of the classical ones. Namely, we propose M-estimators as weighted LS, choose the weights appropriately for consistency and robustness considerations and use the weights to build weighted partial correlations and significance test statistics. The full algorithm is presented in §2, while we study analytically in §3, and through a simulation study in §4, the properties of consistency of the proposed robust selection procedure in terms of estimation bias and model selection ability within the general framework of correlated explanatory variables. We also provide a subsampling algorithm for the improvement of computational speed when there are huge amounts of observations.

Finally, when analyzing the large real datasets in §5, we compare our approach with alternative methods for model selection in linear regression. More specifically, we consider the least angle regression (LARS) of Efron, Hastie, Johnstone, and Tibshirani (2004), the robust LARS (RLARS) of Khan, Van Aelst, and Zamar (2007b) as well as the more general random forests of Breiman (2001).

2 Fast Robust Model Selection

In this section, we present a detailed algorithm for FRFS. The theoretical justifications are left for §3. The main building block of the selection procedure is the weighted M-estimator which is now presented.

2.1 Weighted M-estimator

Consider M-estimators (Huber 1964, 1967) as estimators for the parameters $\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T$ of the regression model with response variable $y$ and explanatory variables $x = [1\; x_1 \ldots x_p]^T$ such that $y \mid x \sim N(x^T\beta; \sigma^2)$. The M-estimators that we propose are weighted LS estimators defined implicitly through

$$\sum_{i=1}^{n} w_i(r_i;c)\, r_i\, x_i = 0 \tag{1}$$

for a given sample of $n$ observations, with the weights $w_i(r_i;c)$ chosen such that $w_i(r_i;c)\, r_i\, x_i$ is at least bounded, possibly redescending for a high breakdown point; $c$ is a tuning constant that controls the relationship between efficiency and robustness, and $r_i = (y_i - x_i^T\beta)/\sigma$ are the standardized residuals. We choose Tukey’s redescending biweight weights given by

$$w_i(r_i;c) = \begin{cases} \left(\left(\dfrac{r_i}{c}\right)^2 - 1\right)^2 & \text{if } |r_i| \le c, \\ 0 & \text{if } |r_i| > c. \end{cases} \tag{2}$$

The residuals ri in (1) depend on a scale σ which needs to be estimated and we provide some estimators below. Throughout the paper, we choose c = 4.685 which corresponds to an efficiency level of 95% for the robust estimator compared to the LS estimator at the normal model. Note that LS estimates are obtained when c→ ∞.
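As a small illustration of the weight function (2), the following Python sketch evaluates Tukey's biweight weights for a few standardized residuals. The function name `tukey_weight` is ours and is not taken from the paper; the default `c = 4.685` is the tuning constant stated above.

```python
def tukey_weight(r, c=4.685):
    """Tukey biweight weight from (2): ((r/c)^2 - 1)^2 on [-c, c], 0 outside."""
    if abs(r) > c:
        return 0.0
    return ((r / c) ** 2 - 1.0) ** 2


if __name__ == "__main__":
    # Small residuals keep a weight near 1; residuals beyond c are discarded.
    for r in (0.0, 1.0, 3.0, 4.685, 10.0):
        print(f"r = {r:6.3f}   weight = {tukey_weight(r):.4f}")
```

With a very large `c` every weight approaches 1, which corresponds to the LS limit mentioned above.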

Weights in (1) are not fixed, but depend on $\beta$ through the residual $r_i$ and hence finding the solution of (1) is computationally intensive in high dimensions. To lessen the computational complexity, we propose to compute weights based on a different estimator for $\beta$, which we call $\hat\beta^0$, that can be computed in a first step. Then, we propose as estimator $\hat\beta^1$ of $\beta$ the solution of a slightly modified version of (1), namely

$$\sum_{i=1}^{n} w_i(r_i^0;c)\,(y_i - x_i^T\beta)\, x_i = 0 \tag{3}$$

with $r_i^0 = (y_i - x_i^T\hat\beta^0)/\hat\sigma^0$ where $\hat\sigma^0 = 1.483\,\mathrm{med}\,|\tilde r_i^0 - \mathrm{med}(\tilde r_i^0)|$, the median absolute deviation (MAD) of the residuals $\tilde r_i^0 = y_i - x_i^T\hat\beta^0$, is used for the scale.
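To make (3) concrete, here is a minimal Python sketch for the special case of one covariate plus intercept, where the weighted LS solution has a closed form. It is not the authors' implementation: the helper names (`mad`, `one_step`) and the toy data are ours, and the starting value `start` stands in for the initial estimator $\hat\beta^0$, which in the paper is built from marginal MM-fits.

```python
from statistics import median


def tukey_weight(r, c=4.685):
    """Tukey biweight weight from (2)."""
    return ((r / c) ** 2 - 1.0) ** 2 if abs(r) <= c else 0.0


def mad(values):
    """1.483 times the median absolute deviation about the median."""
    values = list(values)
    m = median(values)
    return 1.483 * median(abs(v - m) for v in values)


def one_step(x, y, start, c=4.685):
    """One weighted LS step solving (3) for the model y = b0 + b1 * x.

    `start` = (b0, b1) is an initial robust fit (a hypothetical input here).
    Returns the updated coefficients and the weights that were used.
    """
    raw = [yi - start[0] - start[1] * xi for xi, yi in zip(x, y)]
    scale = mad(raw)                                   # MAD of raw residuals
    w = [tukey_weight(r / scale, c) for r in raw]      # weights held fixed
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return (ybar - b1 * xbar, b1), w


if __name__ == "__main__":
    x = [float(i) for i in range(20)]
    y = [1.0 + 2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
    y[5] += 100.0                                      # one gross outlier
    beta, w = one_step(x, y, start=(1.0, 2.0))
    print("coefficients:", beta, " weight of outlier:", w[5])
```

Because the weights are computed once from the initial residuals, the step reduces to an ordinary weighted least squares fit, which is what makes it cheap.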

To compute $\hat\beta^0$, we consider robust slope estimators $\hat\beta_1, \ldots, \hat\beta_p$ together with intercept estimators $\hat\beta_{01}, \ldots, \hat\beta_{0p}$ for the $p$ marginal models $y = \beta_{01} + x_1\beta_1 + \varepsilon_1, \ldots, y = \beta_{0p} + x_p\beta_p + \varepsilon_p$. These marginal estimators are defined implicitly through (1) at the marginal models. The scale estimates $\tilde\sigma_j$ are first computed separately with a high breakdown S-estimator (Rousseeuw and Yohai 1984). The $\hat\beta_j$, $j = 1, \ldots, p$, with $\sigma_j$ replaced by $\tilde\sigma_j$, are called MM-estimators and enjoy good breakdown properties, see e.g. Maronna, Martin, and Yohai (2006), §5.5. Note that when the covariates $x_j$, $j = 1, \ldots, p$, are not orthogonal, $\hat\beta_j$ is not a consistent estimator of $\beta_j$ where $\beta_j$ corresponds to the model $y = x^T\beta + \varepsilon$, and we discuss the effect of the induced bias in §3. From the marginal slope and intercept estimators, we build weights $w_{ij}$, $\forall j = 1, \ldots, p$, using (2) computed at the residuals $r_{ij} = (y_i - \hat\beta_{0j} - x_{ij}\hat\beta_j)/\hat\sigma_j$ with $\hat\sigma_j = \mathrm{MAD}(y_i - \hat\beta_{0j} - x_{ij}\hat\beta_j)$. Now, considering the weighted regressor matrices $X^w_0 = [\,1 \;\; \sqrt{w_{i1}}\,x_{i1} \;\ldots\; \sqrt{w_{ip}}\,x_{ip}\,]$, $i = 1, \ldots, n$, and $X^{w^2}_0 = [\,1 \;\; w_{i1}x_{i1} \;\ldots\; w_{ip}x_{ip}\,]$, $i = 1, \ldots, n$, we define, for $p < n$, the initial estimator for the model $y = x^T\beta + \varepsilon$ as

$$\hat\beta^0 = \left((X^w_0)^T X^w_0\right)^{-1} (X^{w^2}_0)^T y$$

where $y$ is the $n$-vector of responses. Our corresponding residual scale estimate is $\hat\sigma^0 = \mathrm{MAD}(y_i - x_i^T\hat\beta^0)$.
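The matrix form of the initial estimator can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the marginal weights `W` are taken as given (in the paper they come from marginal MM-fits, which are not reproduced here), and `solve` and `initial_estimator` are our own helper names.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def initial_estimator(X, y, W):
    """Initial estimator: ((Xw0)^T Xw0)^{-1} (Xw2_0)^T y.

    X: n rows of p covariate values; W: n rows of p marginal weights w_ij.
    """
    # Rows of the sqrt-weighted and weighted regressor matrices, with intercept.
    xw = [[1.0] + [math.sqrt(w) * v for w, v in zip(wr, xr)] for xr, wr in zip(X, W)]
    xw2 = [[1.0] + [w * v for w, v in zip(wr, xr)] for xr, wr in zip(X, W)]
    q = len(xw[0])
    gram = [[sum(r[a] * r[b] for r in xw) for b in range(q)] for a in range(q)]
    rhs = [sum(r[a] * yi for r, yi in zip(xw2, y)) for a in range(q)]
    return solve(gram, rhs)


if __name__ == "__main__":
    X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0], [5.0, 2.0]]
    y = [1.0 + 2.0 * a - 3.0 * b for a, b in X]
    W = [[1.0, 1.0] for _ in X]          # all weights 1 reduces to plain LS
    print(initial_estimator(X, y, W))
```

With all weights equal to 1 both matrices coincide with the usual design matrix, so the expression reduces to the LS estimator, which gives a quick sanity check of the construction.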

One might be tempted to use the univariate $\hat\beta_j$ and their corresponding scale estimates as initial estimators. Although computational time is increased, we favor $\hat\beta^0$ for two reasons: as discussed in §3.1 and shown in the supplemental material in http link, $\hat\beta^0$ has a smaller bias than an estimator built from the univariate $\hat\beta_j$, and weights based on residuals standardized by the MAD performed better in the simulation studies.

Using Tukey’s weights $w_i^0$ computed in (2) with residuals $r_i^0 = (y_i - x_i^T\hat\beta^0)/\hat\sigma^0$, our weighted M-estimator can be written as

$$\hat\beta^1 = \left((X^w)^T X^w\right)^{-1} (X^w)^T y^w$$

with $X^w = \mathrm{diag}(\sqrt{w_i^0}) \cdot X$ and $y^w = \mathrm{diag}(\sqrt{w_i^0}) \cdot y$.

The marginal weights $w_{ij}$, $\forall j = 1, \ldots, p$, play a central role in our first FRFS procedure, which we accordingly call FRFS-Marginal. The choice of marginal weights over full-model weights is revisited in §2.3 where a modified version of the algorithm, FRFS-Full, is proposed.

2.2 FRFS-Marginal Algorithm

Suppose we have a model including $S$ explanatory variables plus intercept, $x_S$, to which we consider adding one of the remaining explanatory variables, say $z_j$. We have $\hat\beta^1_S$, the M-estimator (3) for the sub-model with $x_S$, with corresponding weights $w^0_{iS}$, $\forall i$, and scale estimate $\hat\sigma^2 = \mathrm{MAD}(y_i - x_{Si}^T\hat\beta^1_S)$, as well as the weights $w_{ij}$, $\forall i$, of the MM-estimator for the simple linear regression model with $z_j$ and intercept.

Step 1 For all explanatory variables $z_j$ not yet included in the model, compute $z_j^w = \mathrm{diag}(\sqrt{w_{ij}})\, z_j$ and $y_j^w = \mathrm{diag}(\sqrt{w_{ij}})\, y$, where $z_j$ and $y$ are the $n$-vectors containing the observations for the corresponding variable, as well as the weighted partial correlations

$$\phi^w_{YZ|X} = \left(z_j^{wT} z_j^w - z_j^{wT} H_S\, z_j^w\right)^{-1} z_j^{wT} (I - H_S)\, y_j^w \tag{4}$$

with $H_S = X_S^w [X_S^{wT} X_S^w]^{-1} X_S^{wT}$, where $X_S^w = \mathrm{diag}(\sqrt{w^0_{iS}}) \cdot X_S$ and $X_S$ is the $n \times (S+1)$ design matrix for the sub-model with $x_S$ and the intercept.

Step 2 Select the variable with the largest weighted partial correlation (4) in absolute value, say zj.

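Step 3 Compute the test statistic

$$T_c^2 = \frac{1}{\hat\sigma^2}\, \frac{n}{\sum_{i=1}^{n} w_{ij}}\, e_c\, y_j^{wT} z_j^w\, \phi^w_{YZ|X} \tag{5}$$

with

$$e_c = \frac{\left(\int_{-c}^{c} \left(5r^4/c^4 - 6r^2/c^2 + 1\right) d\Phi(r)\right)^2}{\int_{-c}^{c} r^2 \left(r^2/c^2 - 1\right)^4 d\Phi(r)} \tag{6}$$

and p-value $= 2(1 - \Phi(|T_c|))$.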

Step 4 For a given test level $\alpha$, compare the p-value to the FDR corrected level $\alpha_{FDR} = \alpha(S+1)/p$ and stop if p-value $> \alpha_{FDR}$. The use of $(S+1)/p$ is shown to control FDR in Benjamini and Hochberg (1995).

Step 5 Compute the initial slope estimator $\hat\beta^0_{[S+1]}$ on the augmented model using the marginal weights $w_{ij}$ as

$$\hat\beta^0_{[S+1]} = \left((X^w_{0[S+1]})^T X^w_{0[S+1]}\right)^{-1} (X^{w^2}_{0[S+1]})^T y \tag{7}$$

with $X^w_{0[S+1]} = \left[\,1_n \;\; \sqrt{w_{il}}\,x_{il}\,\right]_{i=1,\ldots,n,\; l \in S^+}$ and $X^{w^2}_{0[S+1]} = \left[\,1_n \;\; w_{il}\,x_{il}\,\right]_{i=1,\ldots,n,\; l \in S^+}$, $S^+$ being the set of integers for the indices of the explanatory variables in $x_{[S+1]} = [x_S \; z_j]$. Get the weights $w_i^0$ with corresponding residual scale estimate $\hat\sigma^0 = \mathrm{MAD}(y_i - x_{[S+1]i}^T \hat\beta^0_{[S+1]})$.

Step 6 Compute the M-estimator (3) on the augmented model as

$$\hat\beta^1_{[S+1]} = \left((X^w_{[S+1]})^T X^w_{[S+1]}\right)^{-1} (X^w_{[S+1]})^T y^w$$

with $X^w_{[S+1]} = \mathrm{diag}(\sqrt{w_i^0}) \cdot X_{[S+1]}$ and $y^w = \mathrm{diag}(\sqrt{w_i^0}) \cdot y$, and return to Step 1.

Notes:
1. We start the search by considering for entry in the model the variable with the largest test statistic in (5).
2. Equation (4) is the weighted partial correlation between $y_j^w$ and $z_j^w$ given $X^w$. Note that $z_j^{wT}(I - H_S)\, y_j^w$ is $n-1$ times the estimated partial covariance between $y_j^w$ and $z_j^w$ given $X^w$ and $z_j^{wT} z_j^w - z_j^{wT} H_S\, z_j^w$ is the partial variance of $z_j^w$ given $X^w$.
3. Equation (6) is the efficiency as a function of the tuning parameter $c$ of the M-estimator in (1) with weights (2) (see e.g. Heritier, Cantoni, Copt, and Victoria-Feser 2009, p. 57).
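The scalar quantities used in Steps 3 and 4 can be checked numerically. The sketch below evaluates the efficiency (6) with a simple composite Simpson rule, converts a test statistic into the two-sided normal p-value, and applies the FDR-corrected stopping rule. All function names are ours; the integration scheme is an assumption made for illustration, not the method used by the authors.

```python
import math


def normal_pdf(r):
    """Standard normal density."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def efficiency(c):
    """Efficiency e_c in (6) of the biweight M-estimator at the normal model."""
    num = simpson(lambda r: (5 * r**4 / c**4 - 6 * r**2 / c**2 + 1) * normal_pdf(r), -c, c)
    den = simpson(lambda r: r**2 * (r**2 / c**2 - 1) ** 4 * normal_pdf(r), -c, c)
    return num**2 / den


def p_value(t):
    """Two-sided normal p-value 2 * (1 - Phi(|t|))."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def fdr_level(alpha, n_selected, p):
    """FDR-corrected level alpha * (S + 1) / p of Step 4."""
    return alpha * (n_selected + 1) / p


def keep_candidate(t, alpha, n_selected, p):
    """True if the candidate enters the model; False means stop (Step 4)."""
    return p_value(t) <= fdr_level(alpha, n_selected, p)


if __name__ == "__main__":
    print("e_c at c = 4.685:", round(efficiency(4.685), 4))
    print("enter with T = 6, S = 0, p = 100:", keep_candidate(6.0, 0.05, 0, 100))
    print("enter with T = 2, S = 0, p = 100:", keep_candidate(2.0, 0.05, 0, 100))
```

Evaluating `efficiency(4.685)` is a direct check of the statement that this tuning constant corresponds to about 95% efficiency at the normal model.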

When both the number of regressors and the sample size are very large, it may be advantageous, from a computational perspective, to take a subsampling approach. That is, we will choose candidate regressors to include in the model based on a computation of partial correlations on random subsamples of the data. We propose the following. Select a subsample of size $n_n$ among the $n$ observations and compute weighted partial correlations following (4) using only the $n_n$ observations in the subsample. For the variable with the largest absolute weighted partial correlation in the subsample, compute its weighted partial correlation using all the data. Repeat the process $n_s$ times and select in Step 2 the variable, among the $n_s$ candidates, that yielded the largest absolute weighted partial correlation in the full sample.
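The subsampling scheme just described can be sketched generically in Python. The callback `score(j, rows)` is hypothetical: it is assumed to return the weighted partial correlation (4) of candidate `j` computed on the observations indexed by `rows`. In the demo a plain Pearson correlation is used as a stand-in for (4), purely to keep the example self-contained.

```python
import math
import random


def pearson(u, v):
    """Plain sample correlation (a stand-in for the weighted version in (4))."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def subsample_select(candidates, n, score, n_n, n_s, seed=0):
    """Pick a candidate using n_s random subsamples of size n_n.

    Each subsample nominates its best candidate; nominees are then compared
    on the full sample and the best one is returned.
    """
    rng = random.Random(seed)
    all_rows = list(range(n))
    full_scores = {}
    for _ in range(n_s):
        rows = rng.sample(all_rows, n_n)
        winner = max(candidates, key=lambda j: abs(score(j, rows)))
        if winner not in full_scores:          # full-sample score computed once
            full_scores[winner] = abs(score(winner, all_rows))
    return max(full_scores, key=full_scores.get)


def make_demo_score(n=300, seed=1):
    """Toy data: column 0 drives y, columns 1 to 3 are pure noise."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
    y = [5.0 * a + rng.gauss(0, 1) for a in cols[0]]

    def score(j, rows):
        return pearson([cols[j][i] for i in rows], [y[i] for i in rows])

    return score


if __name__ == "__main__":
    score = make_demo_score()
    print("selected candidate:", subsample_select(range(4), 300, score, n_n=50, n_s=10))
```

The expensive full-sample computation is done only for the nominees, at most $n_s$ of them, rather than for every remaining candidate.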

2.3 FRFS-Full Algorithm for Heavily Correlated Covariates

The robust selection procedure based on weighted partial correlations using marginal weights presented in §2.2 allows fast robust estimation and model selection in high dimensions. The cost for this dramatic gain in computational time is the bias of the marginal MM-estimators and weights when the covariates are collinear. Indeed, it is discussed in §3.1 and shown in the supplemental material in http link that the bias of both the marginal MM-estimators and the weighted M-estimator depends on the correlation structure between the covariates, and that it is larger for the former. Consequently, when collinearity is high among the covariates, the information provided by the weighted partial correlations (4) that use the marginal weights can lead to a wrong selection or even to too short a selection, i.e. the stopping criterion is met after too few steps. An alternative approach that is more time consuming but still feasible in high dimensions is to compute the weighted M-estimator for all potential models, i.e. those augmented by one variable at a time, and then use the associated weights, instead of the marginal weights, for the computation of the weighted partial correlations.

We hence propose to modify the FRFS-Marginal algorithm in §2.2 by including:

Step 0 For all explanatory variables $z_j$ not yet included in the model, get the weights $w^0_{i,S+1}$ of the M-estimator (3) computed at the augmented model $x_{[S+1]} = [x_S \; z_j]$.

and replacing the marginal weights $w_{ij}$ in (4) and (5) by $w^0_{i,S+1}$. Note that, as in the classical setting, this amounts to selecting the covariate with the largest robust significance test statistic (5) in absolute value (see §3.2).

Finally, when both the number of regressors and the sample size are very large, proceed analogously as in FRFS-Marginal by computing (5) for a subsample of size $n_n$. For the variable with the largest test statistic in absolute value, compute (5) using all the data. Repeat the process $n_s$ times and select the variable with the largest test statistic in the full sample.

3 Statistical Properties

3.1 Properties of $\hat\beta^1$

In this section, we consider in turn the robustness and the consistency properties of $\hat\beta^1$. For the robustness properties, it is well known that an estimator of the regression slopes is robust if large residuals and large leverage values in the covariates are downweighted. The weighting scheme of $\hat\beta^1$ is based on residuals from an initial robust fit ($\hat\beta^0$) which in turn is based on robust fits of the univariate models ($\hat\beta_j$). In a worst case scenario, it is possible that extreme residuals for the full model are not visible in the univariate regressions and might not be correctly downweighted by the robust estimators of the univariate regressions. However, $\hat\beta^1$ downweights extreme residuals with respect to the full model, so that it is expected that “missed” extreme residuals for the univariate regressions will be somewhat downweighted at the full model by $\hat\beta^1$. Moreover, $\hat\beta^0$ is a coordinate-wise robust estimator and Alqallaf, Van Aelst, Yohai, and Zamar (2009) show through the computation of a generalized version of the influence function of Hampel (1968, 1974) and different contamination schemes in the multivariate normal (MVN) setting, that coordinate-wise robust estimators can be less sensitive to extreme observations when they occur independently at the univariate level.

The consistency of the regression model estimator is also important, since biased slope estimates can lead to wrongly selected covariates in the FRFS. Our $\hat\beta^1$ depends on the $\hat\beta_j$ which are biased estimators of the slopes if the covariates are not orthogonal. In http link, we compute the asymptotic bias of the slope estimator under the hypothesis that the response variable and the covariates are centered and $n^{-1}\sum_i \prod_{j=1}^{p} x_{ij}^{r_j} = 0$ if $\sum r_j$ is odd. This hypothesis is for example satisfied in the multivariate normal case. We find that the bias depends on high moments (i.e. up to $\sum r_j = 6$) and is not easily interpretable. It is computed at the uncontaminated model; the effect of model contamination on the possible bias of the estimator is discussed through its robustness properties above. The bias is of course equal to zero if the covariates are orthogonal. However, we also show that the bias can be reduced, even to zero, if the weighted M-estimator serves as an initial estimator of another weighted M-estimator in an iterative fashion. Indeed, $\hat\beta^1$ defined through (3) is actually a one-step M-estimator of an iterative procedure to compute the M-estimator defined implicitly through (1) with $\hat\beta^0$ as initial solution. $\hat\beta^1$ could be iterated further to get, say, $\hat\beta^k$ computed at the updated weights based on the residuals $r_i^{(1)} = (y_i - x_i^T\hat\beta^{(1)})/\hat\sigma^{(1)}, \ldots, r_i^{(k-1)} = (y_i - x_i^T\hat\beta^{(k-1)})/\hat\sigma^{(k-1)}$.
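The iteration described above, in which each weighted M-estimator serves as the starting value for the next, can be sketched for a single covariate plus intercept. This is an illustrative iteratively reweighted LS loop under our own assumptions: the scale is re-estimated by the MAD at every step, a simple convergence tolerance is added, and the helper names and toy data are not from the paper.

```python
from statistics import median


def tukey_weight(r, c=4.685):
    """Tukey biweight weight from (2)."""
    return ((r / c) ** 2 - 1.0) ** 2 if abs(r) <= c else 0.0


def mad(values):
    """1.483 times the median absolute deviation about the median."""
    values = list(values)
    m = median(values)
    return 1.483 * median(abs(v - m) for v in values)


def weighted_ls(x, y, w):
    """Weighted LS fit of y = b0 + b1 * x for fixed weights w."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return (ybar - b1 * xbar, b1)


def iterate_m_estimator(x, y, start, max_steps=50, tol=1e-10, c=4.685):
    """Repeat the one-step estimator, updating weights from the latest fit."""
    beta = start
    steps = 0
    for steps in range(1, max_steps + 1):
        raw = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]
        scale = mad(raw)
        w = [tukey_weight(r / scale, c) for r in raw]
        new = weighted_ls(x, y, w)
        change = max(abs(new[0] - beta[0]), abs(new[1] - beta[1]))
        beta = new
        if change < tol:
            break
    return beta, steps


if __name__ == "__main__":
    x = [float(i) for i in range(20)]
    y = [1.0 + 2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
    y[5] += 100.0                          # one gross outlier
    beta, steps = iterate_m_estimator(x, y, start=(0.0, 0.0))
    print("coefficients:", beta, " steps used:", steps)
```

Setting `max_steps=1` recovers the one-step estimator; further steps move the fit toward the solution of (1), at a computational cost that grows with the number of iterations.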

3.2 Weighted partial correlations

In this section, we show that the robust weighted partial correlations (4) arise as a natural information criterion used in significance testing. In the standard robust theory for regression models, significance can be tested using a robust Wald-type test statistic (Heritier and Ronchetti 1994) or z-statistic given generally by

$$z_\psi\text{-ratio} = \frac{\hat\beta_j}{SE(\hat\beta_j)} \tag{8}$$

with $SE(\hat\beta_j)^2 = \mathrm{var}(\hat\beta)_{jj}$ and, for an M-estimator defined through (1) in which $w_i(r_i;c)\, r_i := \psi(r_i, c)$, $\mathrm{var}(\hat\beta) = n^{-1} M(\psi,\Phi)^{-1} Q(\psi,\Phi) M(\psi,\Phi)^{-1}$ with $M(\psi,\Phi) = -n^{-1}\sum_{i=1}^{n} \int (\partial/\partial\beta^T)\, \psi(r;c)\, x_i\, d\Phi(r)$ and $Q(\psi,\Phi) = n^{-1}\sum_{i=1}^{n} \int \psi(r;c)^2\, x_i x_i^T\, d\Phi(r)$.

The variance $\mathrm{var}(\hat\beta)_{jj}$ can be estimated using

$$\widehat{SE}(\hat\beta_j)^2 = \frac{\hat\sigma^2}{n} \left[\left(\frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i\, x_i x_i^T\right)^{-1}\right]_{jj} e_c^{-1} \tag{9}$$

(see e.g. Maronna et al. 2006, p. 140) where the weights $w_i$ are the robust weights corresponding to the M-estimator on the model with $x$ and $e_c$ its efficiency. Using (8) instead of the standard t-test guarantees that the level $\alpha$ at which the significance test is performed remains stable under model contamination. Indeed, $\alpha$ is specified through $P(|T| > t_\alpha) = \alpha$ using the asymptotic distribution of the test statistic $T$ and corresponding quantile $t_\alpha$ under the hypothesis that the data upon which the test statistic is computed are generated exactly by the postulated model. If this is not the case, then the asymptotic distribution is not correct and consequently neither is the level; for a more formal treatment and examples, see e.g. Heritier et al. (2009), §2.4.
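For a model with an intercept and a single covariate, the matrix in (9) is 2×2 and its inverse is available in closed form, which gives a compact way to see how the estimate behaves. The following sketch is our own worked special case (function names are hypothetical); with all weights equal to 1 and $e_c = 1$ it reduces to the classical LS standard error of the slope.

```python
import math


def slope_se(x, w, sigma, e_c):
    """Square root of (9) for the slope of y = b0 + b1 * x.

    The weighted moment matrix is [[1, m1], [m1, m2]], whose inverse has
    slope entry 1 / (m2 - m1^2).
    """
    n = len(x)
    sw = sum(w)
    m1 = sum(wi * xi for wi, xi in zip(w, x)) / sw
    m2 = sum(wi * xi * xi for wi, xi in zip(w, x)) / sw
    inv_jj = 1.0 / (m2 - m1 * m1)
    return math.sqrt(sigma**2 / n * inv_jj / e_c)


def z_ratio(b1, x, w, sigma, e_c):
    """z-ratio (8) for the slope, using the estimated standard error."""
    return b1 / slope_se(x, w, sigma, e_c)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    ones = [1.0] * len(x)
    # Classical check: sigma / sqrt(sum((x - mean)^2)) = 2 / sqrt(10).
    print(slope_se(x, ones, sigma=2.0, e_c=1.0), 2.0 / math.sqrt(10.0))
    # Robust version: efficiency below 1 inflates the standard error.
    print(slope_se(x, ones, sigma=2.0, e_c=0.95))
```

The factor $e_c^{-1}$ makes the robust standard error slightly larger than its classical counterpart when there is no contamination, which is the source of the small loss of power discussed in the text.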

The price to pay for a robust test statistic is however a relatively small loss of power when there is no model contamination. This loss is quantified in the simulation study of §4 when we compare the performance of classical and robust selection procedures under no model contamination.

In order to be able to compute a robust test statistic for $H_0: \beta_j = 0$ in a FRFS in high dimensions, we need to adapt the $z_\psi$-ratio so that it can be computed rapidly with the information available at the current forward selection step. In http link, we show that the $z_\psi$-ratio for testing the significance of $\beta_j$ in the model with $[x_S \; z_j]$ can be reasonably approximated by $T_c$ in (5). Moreover, in the orthogonal case, we have

$$T_c^2 = \frac{1}{\hat\sigma^2}\, \frac{n}{\sum_{i=1}^{n} w_{ij}}\, e_c \left(e_S^{wT} e_S^w - e_{S+1}^{wT} e_{S+1}^w\right)$$

with $e_S^w = y_j^w - X_S^w \hat\beta^1_S$ the weighted residuals for the model with the current $S$ covariates and $e_{S+1}^w$ those for the augmented model with $S+1$ covariates. Hence, the predictor with the largest weighted partial correlation in absolute value offers the greatest reduction in the weighted residual sum of squares and is selected in a robust forward stepwise regression, analogously to the approach followed by Foster and Stine (2004) in the classical setting. Finally, if the weighted M-estimator in (3) is fully iterated, and the corresponding weights are used in (5), then we have $z_\psi$-ratio $= T_c$.

4 Simulation Study

We carry out a simulation study to assess the effectiveness of the model selection approaches outlined above and to showcase the strengths of our proposed robust weighted M-estimator. First, we create a linear model

$$y = X_1 + X_2 + \ldots + X_k + \sigma\varepsilon \tag{10}$$

where $X_1, X_2, \ldots, X_k$ are MVN with $E(X_i) = 0$, $\mathrm{Var}(X_i) = 1$, and $\mathrm{corr}(X_i, X_j) = \rho$, $i \neq j$, $i, j = 1, \ldots, k$, and $\varepsilon$ an independent standard normal variable. We choose $\rho$ to produce a range of theoretical $R^2 = (\mathrm{Var}(y) - \sigma^2)/\mathrm{Var}(y)$ values for (10) and $\sigma$ to give t values for our target regressors of about 6 under normality as in Ronchetti et al. (1997). The covariates $X_1, \ldots, X_k$ are our $k$ target covariates. Let $e_{k+1}, \ldots, e_p$ be independent standard normal variables and use the first $2k$ to give the $2k$ covariates

$$X_{k+1} = X_1 + \lambda e_{k+1}, \quad X_{k+2} = X_1 + \lambda e_{k+2},$$
$$X_{k+3} = X_2 + \lambda e_{k+3}, \quad X_{k+4} = X_2 + \lambda e_{k+4},$$
$$\vdots$$
$$X_{3k-1} = X_k + \lambda e_{3k-1}, \quad X_{3k} = X_k + \lambda e_{3k};$$

and the final $p - 3k$ to give the $p - 3k$ covariates

$$X_i = e_i, \quad i = 3k+1, \ldots, p.$$

Variables $X_{k+1}, \ldots, X_{3k}$ are noise covariates that are correlated with our target covariates, and variables $X_{3k+1}, \ldots, X_p$ are independent noise covariates.

We consider samples without and with contamination. Samples with no contamination are generated using $\varepsilon \sim N(0,1)$. To allow for 5% outliers, we generate using $\varepsilon \sim 95\%\,N(0,1) + 5\%\,N(30,1)$. These contaminated cases also have high leverage X-values: $X_1, \ldots, X_k \sim$ MVN as before, except $\mathrm{Var}(X_i) = 5$, $i = 1, \ldots, k$. This represents the most difficult contamination scheme: large residuals at high leverage points.

We choose $\lambda = 3.18$ so that $\mathrm{corr}(X_1, X_{k+1}) = \mathrm{corr}(X_1, X_{k+2}) = \mathrm{corr}(X_2, X_{k+3}) = \ldots = \mathrm{corr}(X_k, X_{3k}) = 0.3$. First, we consider a small number of regressors and a small sample size so as to compare our proposals with another robust selection method. We consider a sample of size $n = 200$, $p = 10$ candidate regressors with $k = 3$ target covariates. Using $\alpha = 0.05$, we consider our FRFS with weights on the marginal and full models, i.e. FRFS-Marginal and FRFS-Full, robust AIC as implemented in step.lmRob of the robust package for R, and LS which is equivalent to FRFS with all weights set to 1. We consider both low and high signal-to-noise ratios.
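The simulation design can be sketched in Python as follows. Several choices here are our assumptions rather than details given in the text: equicorrelated targets are generated with a one-factor construction (valid for $\rho \ge 0$), each observation is contaminated independently with probability equal to the outlier rate, and the contaminated targets are rescaled by $\sqrt{5}$ before the correlated noise covariates are formed. Two small helpers check the arithmetic: for unit-variance $X_1$, $\mathrm{corr}(X_1, X_1 + \lambda e) = 1/\sqrt{1+\lambda^2}$, and under (10) $\mathrm{Var}(y) = k + k(k-1)\rho + \sigma^2$.

```python
import math
import random


def theoretical_r2(k, rho, sigma):
    """R^2 = (Var(y) - sigma^2) / Var(y) for model (10) with unit slopes."""
    signal = k + k * (k - 1) * rho
    return signal / (signal + sigma**2)


def noise_correlation(lam):
    """corr(X_1, X_1 + lam * e) for unit-variance X_1 and independent e."""
    return 1.0 / math.sqrt(1.0 + lam**2)


def sample_corr(u, v):
    """Plain sample correlation, used only to check the simulated data."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def simulate(n, p, k, rho, sigma, lam=3.18, outlier_rate=0.0, seed=0):
    """Generate (X, y): k targets, 2k correlated noise, p - 3k independent noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        outlier = rng.random() < outlier_rate
        common = rng.gauss(0, 1)                      # shared factor gives corr = rho
        targets = [math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0, 1)
                   for _ in range(k)]
        if outlier:                                   # high leverage: variance 5
            targets = [math.sqrt(5.0) * t for t in targets]
        correlated = [targets[j // 2] + lam * rng.gauss(0, 1) for j in range(2 * k)]
        independent = [rng.gauss(0, 1) for _ in range(p - 3 * k)]
        eps = rng.gauss(30.0, 1.0) if outlier else rng.gauss(0.0, 1.0)
        X.append(targets + correlated + independent)
        y.append(sum(targets) + sigma * eps)
    return X, y


if __name__ == "__main__":
    print("corr implied by lambda = 3.18:", round(noise_correlation(3.18), 4))
    sigma = math.sqrt(3 * (1 - 0.35) / 0.35)          # sigma giving R^2 = 0.35 when k = 3, rho = 0
    print("sigma:", round(sigma, 3), " R^2:", round(theoretical_r2(3, 0.0, sigma), 3))
    X, y = simulate(n=200, p=10, k=3, rho=0.0, sigma=sigma, outlier_rate=0.05)
    print("rows:", len(X), " columns:", len(X[0]))
```

Inverting the $R^2$ formula, as done for `sigma` in the demo, is only an arithmetic consequence of the stated definition and is not a value reported in the paper.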

Results are in Table 1. Entries in the top panel give the percentage of runs falling into each category. The category “Correct” means that the correct model was chosen and is the key measure of performance. “Extra” means that a model was chosen for which the true model is a proper subset. “Missing 1” means that the model chosen differed from the true model only in that it was missing one of the target covariates; “Missing 2” is defined analogously. The Monte Carlo standard deviation of entries with value around 90% is 2.1%, and the Monte Carlo standard deviation of all entries is bounded by 3.5%. We also report the empirical FDR: the proportion of noise covariates among selected covariates, see Benjamini and Hochberg (1995).

Table 1: Model selection results. Simulated data, as described in §4, have n = 200 observations with p = 10 potential regressors, including k = 3 target regressors with corr(X1, X2) = corr(X1, X3) = corr(X2, X3) = ρ = 0 (R² = 0.35) and ρ = 0.76 (R² = 0.80), and corr(X1, Xk+1) = corr(X1, Xk+2) = corr(X2, Xk+3) = . . . = corr(Xk, X3k) = 0.3 in both cases. Methods are r1 for FRFS-Marginal, r2 for FRFS-Full, LS for FRFS with all weights set to 1 and r3 for backward selection based on robust AIC and robust linear regression with high breakdown point and high efficiency regression as implemented in step.lmRob of the robust package for R. Table entries are % of cases in categories listed in first column. Empirical FDR appears in the last row. Mean execution times for the r1, r2, r3, and LS methods are ∼ 0.6, 0.9, 3 and 0.1 s, respectively. Results are based on 200 simulations for each configuration.

| | R² = 0.35, no outliers | | | | R² = 0.35, 5% outliers | | | | R² = 0.80, no outliers | | | | R² = 0.80, 5% outliers | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r1 | r2 | r3 | LS | r1 | r2 | r3 | LS | r1 | r2 | r3 | LS | r1 | r2 | r3 | LS |
| %Correct | 88 | 87 | 30 | 91.5 | 84.5 | 83.5 | 27 | 0 | 91 | 87 | 30 | 91.5 | 89.5 | 83 | 27 | 2 |
| %Extra | 12 | 13 | 70 | 8.5 | 15.5 | 16.5 | 73 | 0 | 9 | 13 | 70 | 8.5 | 10 | 16.5 | 73 | 0 |
| %Missing 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0.5 | 0.5 | 0 | 8 |
| %Missing 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 59.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 73 |
| %Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29.5 | 0 | 0 | 0 | 0 | 0 | 2.0 | 0 | 17 |
| %FDR | 3.4 | 3.6 | 26 | 2.2 | 4.1 | 4.4 | 26 | 25 | 2.6 | 3.7 | 26 | 2.2 | 2.5 | 4.4 | 26 | 14 |

Both FRFS algorithms perform very well in terms of proportion of correctly selected models: (1) performing as well as LS when there are no outliers, i.e. there is very little loss of power, and (2) outperforming LS when there are outliers. The backward selection using robust AIC performs poorly in all cases; while it often yields a model which includes the correct model, the final model also contains many noise covariates. The classical LS approach fails miserably in the presence of outliers and rarely chooses the correct model. The empirical FDR brings another insight. It is stable across all data configurations for the FRFS-Full (r2) with a value of around 4%. For the FRFS-Marginal, there is a small drop in the FDR when the target covariates are heavily correlated (R² = 0.80). This is likely due to the impact of multicollinearity on the bias of the M-estimator and/or on the use of marginal weights in (4). It is for this reason that we also proposed the FRFS-Full approach. For LS, the FDR is reasonable without data contamination, without being better than both our robust methods, but reaches large values with data contamination. The FDR for backward selection using robust AIC is too large in all settings.

Now, we consider a sample of size n = 1000, p = 100 candidate regressors with k = 5 target covariates. Robust AIC is not a viable approach here as selection takes over 20 hours on one sample. Results for the other methods are in Table 2.

Computations are carried out in R and the largest contributor to the larger time for the robust analysis is the time required to compute all p marginal robust models using lmRob to obtain the required starting values. The work that follows is only slightly more computationally expensive when we use the FRFS-Marginal approach with the obvious gains shown in Table 2. In terms of correctly selected models and empirical FDR, the conclusions on the performance comparison between the three methods are the same as with the smaller scale simulation. Both robust methods are very stable across settings, with a relative performance loss for the FRFS-Marginal in the highly correlated case (R² ≈ 0.80). LS performs miserably with data contamination, never choosing the correct model or a larger model of which the correct model is a subset. It does manage however to maintain a respectable FDR in the R² ≈ 0.80 case, often choosing a one-target-variable model. On average, the robust estimators assign weights below 0.20 to only 0.6% of the observations when there are no outliers, and 6.4% of the observations when there are 5% outliers.

Table 2: Forward selection results. As in Table 1, except n = 1000, p = 100, k = 5, ρ = 0.10 (R² ≈ 0.20) and ρ = 0.85 (R² ≈ 0.80). Subsampling approach uses ns samples of size nn and is applied to r1, r2 and LS methods in middle and bottom panels. Last row of each panel shows the mean execution time in seconds.

| | R² ≈ 0.20, no outliers | | | R² ≈ 0.20, 5% outliers | | | R² ≈ 0.80, no outliers | | | R² ≈ 0.80, 5% outliers | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r1 | r2 | LS | r1 | r2 | LS | r1 | r2 | LS | r1 | r2 | LS |
| **no subsampling** | | | | | | | | | | | | |
| %Correct | 79 | 78 | 80 | 71 | 66 | 0 | 90.5 | 78 | 80 | 85.5 | 67.5 | 0 |
| %Extra | 21 | 22 | 19.5 | 22 | 25 | 0 | 8.5 | 21 | 19.5 | 10 | 27 | 0 |
| %Missing 1 | 0 | 0 | 0.5 | 6.5 | 5.5 | 0.5 | 1 | 1 | 0.5 | 4.5 | 5 | 0.5 |
| %Missing 2 | 0 | 0 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0.5 |
| %Missing 3 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 6 |
| %Other | 0 | 0 | 0 | 0.5 | 3.5 | 92 | 0 | 0 | 0 | 0 | 0.5 | 93 |
| %FDR | 4.1 | 4.4 | 3.7 | 4.2 | 6.0 | 28 | 1.5 | 4.1 | 3.7 | 1.7 | 5.4 | 4.5 |
| time | 28 | 54 | 8.7 | 28 | 53 | 2.6 | 28 | 56 | 8.7 | 28 | 55 | 2.8 |
| **subsampling with nn = 100, ns = 10** | | | | | | | | | | | | |
| %Correct | 72 | 50.5 | 65 | 72 | 45 | 0 | 91.5 | 58.5 | 64 | 91 | 55 | 0 |
| %Extra | 9 | 9 | 7 | 8 | 10 | 0 | 7.5 | 7 | 6 | 4.5 | 9.5 | 0 |
| %Missing 1 | 17 | 33.5 | 24 | 18 | 32 | 1 | 1 | 33 | 29 | 4.5 | 34 | 0.5 |
| %Missing 2 | 1 | 5 | 3 | 1 | 8.5 | 1.5 | 0 | 1 | 0.5 | 0 | 1 | 0.5 |
| %Missing 3 | 0 | 0.5 | 0.5 | 0 | 1.5 | 5 | 0 | 0 | 0 | 0 | 0 | 5 |
| %Other | 1 | 1.5 | 0.5 | 1 | 3 | 92.5 | 0 | 0.5 | 0.5 | 0 | 0.5 | 94 |
| %FDR | 1.7 | 2.0 | 1.3 | 1.7 | 2.5 | 27 | 1.3 | 1.3 | 1.2 | 0.8 | 1.7 | 2.3 |
| time | 22 | 59 | 3.2 | 23 | 60 | 1.4 | 23 | 59 | 3.2 | 22 | 60 | 1.3 |
| **subsampling with nn = 100, ns = 50** | | | | | | | | | | | | |
| %Correct | 80 | 78 | 80.5 | 73.5 | 65 | 0 | 90.5 | 78 | 80.5 | 86.5 | 68.5 | 0 |
| %Extra | 20 | 20 | 18 | 19.5 | 23.5 | 0 | 8.5 | 19 | 18 | 9 | 23 | 0 |
| %Missing 1 | 0 | 1.5 | 1 | 6.5 | 8 | 1 | 1 | 3 | 1.5 | 4.5 | 8 | 0.5 |
| %Missing 2 | 0 | 0 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0.5 |
| %Missing 3 | 0 | 0 | 0 | 0 | 0 | 6.5 | 0 | 0 | 0 | 0 | 0 | 6 |
| %Other | 0 | 0.5 | 0.5 | 0.5 | 3.5 | 91 | 0 | 0 | 0 | 0 | 0.5 | 93 |
| %FDR | 3.9 | 4.1 | 3.6 | 3.8 | 5.7 | 28 | 1.5 | 3.6 | 3.5 | 1.5 | 4.6 | 3.6 |
| time | 30 | 248 | 12 | 30 | 244 | 3.2 | 31 | 255 | 12 | 30 | 250 | 3.3 |

Moreover, the results clearly show that subsampling can work just as well as considering the full sample, provided that an adequate number of subsamples is taken, and that forward selection is sufficient. What the times in Table 2 do not show is that subsampling is considerably faster when p and n are very large, and is sometimes the only option: in the census data example analyzed in §5, we cannot proceed without subsampling on most machines because of a lack of resident memory.

As mentioned in §1, an alternative approach for model selection is given by the popular least angle regression (LARS) of Efron, Hastie, Johnstone, and Tibshirani (2004), an extremely efficient algorithm for computing the entire LASSO (Tibshirani 1996) path. LASSO shrinks regression coefficients by imposing a penalty on their size. Making the penalty sufficiently large necessarily causes some of the coefficients to be exactly zero, thereby providing a form of model selection. Khan et al. (2007b) propose a robust alternative, namely RLARS, which replaces the non-robust building blocks of LARS (mean, variance and correlation) by their robust counterparts. Both methods can deal with large p but provide only an order of entry into the model for the explanatory variables. Model selection based on minimizing K-fold CV mean squared prediction error for augmented models suggested by LARS is however readily available in the lars package for R.
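To illustrate how a penalty on coefficient size produces exact zeros, the following Python sketch uses the special case of an orthonormal design (X'X = I). In that case, with the criterion written as ½‖y − Xb‖² + λ‖b‖₁, the LASSO solution is the soft-thresholded LS coefficient. The orthonormality assumption, the parametrization of the penalty and the numerical values are chosen for illustration only; this is not the LARS algorithm itself.

```python
def soft_threshold(b, lam):
    """Shrink b toward zero by lam; values within [-lam, lam] become exactly 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


# Hypothetical LS coefficients under an orthonormal design.
ls_coefs = [2.5, -0.8, 0.3, -0.1]

for lam in (0.0, 0.5, 1.0):
    lasso_coefs = [soft_threshold(b, lam) for b in ls_coefs]
    kept = [j for j, b in enumerate(lasso_coefs) if b != 0.0]
    print(f"lambda={lam}: coefficients={lasso_coefs}, variables kept={kept}")
```

As the penalty grows, the set of nonzero coefficients shrinks, and the order in which variables drop out is what gives the path its model-selection interpretation.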

Running RLARS, using the brlars R code made available on the authors’ website, on one simulated sample takes about 9.5 minutes, and this yields only the suggested order of entry of the variables, not where to stop. Our FRFS-Marginal selects a model in less than 30 seconds, and the FRFS-Full in less than 60 seconds, see Table 2.

Note that throughout the paper, our time comparisons are not intrinsic comparisons, but rather comparisons of available implementations of the methods only.

In http link, the results of a simulation study with 10% data contamination show that both robust approaches maintain some level of robustness but present signs of breakdown. Our very difficult data generating setting is not favorable to any model selection procedure and it is not reasonable to expect contaminated data in amounts exceeding 5-7% in most real datasets. We are confident that our robust approaches can be used in many applications and this is illustrated with the real datasets in §5.

Finally, it should be noted that there exist alternative robust methods such as the robust boosting of Lutz et al. (2008) and the robust forward and stepwise selection approach based on robust correlation estimates, an approach similar to RLARS, of Khan, Van Aelst, and Zamar (2007a), but these methods are not compared to our methods since code is not available.

5 Examples

In this section, we analyze two real data sets. In each case, the explanatory variables have been centered and scaled to have mean equal to zero and standard deviation equal to one. We also scale the dependent variable to have standard deviation equal to one. Where computational times remain reasonable, selected models are compared using the median absolute prediction error (MAPE), as measured by 10-fold CV.

That is, we split the data into 10 roughly equal-sized parts. For the k^th part, we carry out model selection using the other nine parts of the data and calculate the MAPE of the chosen model when predicting the k^th part of the data. We do this for k = 1, . . . ,10 and report the mean of the 10 estimates of the MAPE. For all methods, the data were split in the same way. We also report the standard error (SE) of the MAPE estimates.
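The cross-validation scheme just described can be sketched as follows in Python. This is a minimal sketch, not the code used for the analyses (which were run in R): `select_and_fit` is a hypothetical user-supplied callable standing in for any of the selection methods, the random split and its seed are illustrative, and the SE is assumed to be the sample standard deviation of the K fold-wise MAPE values divided by √K.

```python
import math
import random
import statistics


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 once and deal the indices into k roughly equal parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mape(x, y, select_and_fit, k=10, seed=0):
    """Mean and SE of the fold-wise median absolute prediction error.

    select_and_fit(x_train, y_train) is a hypothetical callable that runs the
    whole selection procedure on the training rows and returns a function
    mapping one row of x to a prediction.
    """
    mapes = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = select_and_fit([x[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        abs_err = [abs(y[i] - predict(x[i])) for i in test_idx]
        mapes.append(statistics.median(abs_err))
    se = statistics.stdev(mapes) / math.sqrt(k)  # assumed SE formula
    return statistics.mean(mapes), se
```

Using the same `seed` for every method reproduces the requirement that the data be split in the same way for all methods. Note that selection is repeated inside each fold, so the held-out part never influences which variables are chosen.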

In both our examples, there is no a priori reason to believe that a linear regression model is the best approach. As usual, it remains the easiest route in the hope of obtaining some interpretability of the data. If simply prediction, and not interpretation, is sought then we can turn to random forests (Breiman 2001). We compare our predictions with those of random forests for completeness.

5.1 Protein data

We analyze data from a clinical investigation of patients with cardiovascular disease conducted by the Medizinische Klinik und Deutsches Herzzentrum der Technischen Universität in Munich and the Center for Computational Diagnostics at Indiana University Purdue University. See Clough et al. (2009) and Patil et al. (2007) for details.

Among the measures taken on the participants is the abundance of a series of proteins, and of particular interest is the link between the abundance of each protein and a series of clinical variables. We consider here the abundance of the protein Fibrinogen beta chain that we seek to explain by means of 19 potential variables and their 171 interactions. Some regressors are binary, however, and over the n = 231 observations collected, some potential regressors yield a column of zeros, or are exactly collinear with one column or with a linear combination of several columns, and must be removed. After removal of problematic interactions, we are left with 140 interactions and a total of p = 159 regressors. These data are available at http link. We carry out classical and robust forward selection as described in §2 using α = 0.10. As n is only 231 we do not need to resort to subsampling. Also, as the chosen models remain small, we consider further iterating our one-step weighted M-estimator to examine the impact of the bias. Results appear in Table 3.
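The regressor counts above follow from simple combinatorics, and the construction of the design can be sketched in Python as below. The helper names are hypothetical and the cleaning step is deliberately simplified: it drops only constant columns and exact copies of earlier columns. Detecting a column that is a scalar multiple or a linear combination of other columns, as was needed for these data, would require a rank check (for example through a QR decomposition), which is not shown.

```python
from itertools import combinations
from math import comb

print(comb(19, 2))   # 171 pairwise interactions among 19 variables
print(19 + 140)      # 159 regressors once 31 problematic interactions are dropped


def add_interactions(rows):
    """Append all pairwise products x_j * x_k (j < k) to each row."""
    p = len(rows[0])
    pairs = list(combinations(range(p), 2))
    return [list(r) + [r[j] * r[k] for j, k in pairs] for r in rows]


def drop_degenerate_columns(rows):
    """Return indices of columns that are neither constant nor exact copies
    of an earlier kept column (simplified cleaning, see the text above)."""
    seen, keep = set(), []
    for j in range(len(rows[0])):
        col = tuple(r[j] for r in rows)
        if len(set(col)) > 1 and col not in seen:
            seen.add(col)
            keep.append(j)
    return keep
```

With binary covariates, products of two indicators are frequently all zero or identical to one of the factors, which is why a sizeable share of the 171 interactions has to be discarded.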

First, we consider FRFS-Marginal, FRFS-Full and FRFS with all weights set to 1 (LS). All three approaches pick up the importance of X1X3, but the robust approaches choose X2X11 over X3X11, the second variable in the classically chosen model. The correlation between X2 and X3 is only 0.05; however, the correlation between X2X11 and X3X11 is 0.86, so these models differ only mildly, although robust selection yields slightly smaller MAPE. Estimates for our two approaches using three-step estimates include more variables; results are the same for any greater number of steps. Most notably, the FRFS-Full approach adds X6X11 and X2X19 to the X1X3 variable. This reduces the estimate of σ and the MAPE value. Note that while X6X11 and X3X11 are heavily correlated (0.93), the correlation between X3X11 and X2X19 is -0.05, and between X1X3 and X2X19 is -0.13, so X2X19 is adding new information to the model that is useful as both σ̂ and MAPE values are reduced. The LS estimate of σ is inflated due to the presence of four outliers, patients 116, 206, 159 and 21, which had the largest abundances of protein, i.e. three standard deviations larger than the mean abundance, see Figure 1.

Table 3: Estimates (SE) for models selected by forward selection with FDR stopping for the protein data. Methods are FRFS-Marginal, FRFS-Full and FRFS with all weights equal to 1 (LS). For robust estimators, one-step and three(or more)-steps weighted M-estimators are considered. X1 is age, X2 is gender, X3 is nicotine, X6 is cholesterol, X11 is an indicator of unstable angina, X19 is the glomerular filtration rate, an overall index of kidney function. The p-values are: ∗ < 10⁻³, ∗∗ < 10⁻⁴, ∗∗∗ < 10⁻⁵. In FRFS-Full with 3+ steps, 4 of 231 observations, patients 206 (w = 0), 21 (w = 0.10), 159 (w = 0.13) and 116 (w = 0.15), had robust weights w less than 0.20. MAPE is measured following 10-fold CV as described in the text. Last row shows CPU time in seconds. MAPE for random forests is 0.64 (0.05).

                    FRFS-Marginal                        FRFS-Full                     LS
           One-step          3+ steps          One-step          3+ steps

int        53 (0.06)         53 (0.05)         53 (0.06)         53 (0.05)         53 (0.06)
X1X3       0.22 (0.06)∗      0.26 (0.05)∗∗∗    0.22 (0.06)∗      0.24 (0.05)∗∗∗    0.26 (0.06)∗∗
X2X11      0.25 (0.06)∗∗                       0.25 (0.06)∗∗
X3X11                                                                              0.21 (0.06)∗
X6X11                        0.23 (0.05)∗∗                       0.23 (0.05)∗∗
X2X19                                                            -0.17 (0.05)∗

σ̂          0.83              0.80              0.83              0.78              0.94
MAPE       0.60 (0.06)       0.61 (0.05)       0.62 (0.06)       0.60 (0.05)       0.65 (0.05)
time(s)    6.4               6.4               10.6              13.8              1.0

The robust model chosen using the FRFS-Marginal approach also outperforms the LS model, but as was our experience in the simulation study, it is more parsimonious than the model chosen by the FRFS-Full approach which provides an additional predictor and better predictions. Note that with p = 159 regressors an all-possible-subsets approach, whether robust or non-robust, is not possible. We tried model selection following RLARS. The latter, as implemented in the brlars function provided by the authors, does not run to completion. RLARS proceeds by sampling with replacement and given the 14 binary covariates and their numerous interactions, subsamples lead to singularity in the computation of the Huber correlations and the function terminates in error. We can carry out model selection by minimizing 10-fold CV mean squared prediction error for LARS using the function cv.lars in the R package lars. The selected model contains no regressors. It is hypothesized that the outliers and/or multicollinearity in the data baffle LARS; however, only 5% of the 159*158/2 = 12561 correlations are greater than 0.66 and only 2% are greater than 0.90.
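The multicollinearity summary quoted above (the share of the p(p − 1)/2 pairwise correlations exceeding a threshold) can be computed along the following lines. This Python sketch uses the classical Pearson correlation and compares signed values with the threshold, as the text is worded; whether absolute values were used in the original computation is not stated, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

print(159 * 158 // 2)   # 12561 distinct pairs among 159 regressors


def pearson(u, v):
    """Classical correlation of two equal-length columns (non-constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def share_above(columns, threshold):
    """Number of column pairs and the fraction with correlation > threshold."""
    pairs = list(combinations(range(len(columns)), 2))
    large = sum(1 for j, k in pairs
                if pearson(columns[j], columns[k]) > threshold)
    return len(pairs), large / len(pairs)
```

The function expects the data column by column and assumes that constant columns have already been removed, since their correlation is undefined.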

Figure 1: Three-step FRFS-Full and LS standardized residual analysis of models in Table 3 for the protein data.

[Figure 1 has two panels, ROBUST and LS, each plotting standardized residuals against fitted values; observations 21, 116, 159 and 206 are labeled in both panels.]


It is quite straightforward to obtain predictions using random forests as implemented in the randomForest function of the randomForest package for R. MAPE using the classical, both robust approaches, and random forests also appear in Table 3. While random forests outperform the classical approach, they do not do as well as either robust approach.

5.2 Census data

We analyze data constructed from the 1990 US Census. The original data were assembled by Data for Evaluating Learning in Valid Experiments (DELVE) at the University of Toronto and may be downloaded at http://www.cs.toronto.edu/~delve/data/census-house/desc.html. There are 22,784 observations and 139 variables in the original data. The goal is to explain the average price asked for a housing unit. Some of the other 138 variables are collinear, e.g. one variable is % female in sector and another is % male in sector, and some deletion is required to carry out the analysis.

Also, many of the covariates are all zeros, or almost all zeros, and a few are essentially categories of the average price. After removing the latter and cleaning the covariates, we are left with n = 13,970 observations and 51 first-order covariates. We then include all possible interactions except one which caused collinearity. The 51 first-order covariates, along with the R code to produce the 1325 centered and scaled regressors, are available at http link.

With n = 13,970 and p = 1325, the subsampling approach outlined in §2 and tested in §4 has to be used on most computers. As we had access to a high performance machine, a 2-CPU machine of type AMD Opteron™ Processor 250 running at 2390.073 MHz under the Linux/x86_64-unknown-linux co4.2-gnu operating system with 16 GB of RAM and 33 GB of swap memory, we can report some results with and without subsampling. We report subsampling results for (nn, ns) = (200, 10) but results are qualitatively similar for many other schemes. Table 4 gives the CPU running times for each approach. Some MAPE are shown in Table 5. The cv.lars procedure settles on a 208-variable model in a running time of 148 minutes. The brlars procedure had not returned any results after more than two weeks of running time.

Table 4: Number of variables entered in model [execution times in minutes] for forward selection with FDR stopping for census data. Computations are on the high performance machine described in the text. Results for intermediate 10- and 20-variable models are included for purposes of comparison. Subsampling with (nn, ns) = (200, 10). f: final model selected. Both FRFS-Full and LS on the full sample enter 151 variables; only 42 variables are common to both final models.

       FRFS-Marginal                   FRFS-Full                          LS
subsampling    full sample     subsampling    full sample        subsampling    full sample

9^f [43]       16^f [1072]     10 [81]        10 [395]           10 [11]        10 [547]
                               20 [168]       20 [947]           20 [24]        20 [1157]
                               64^f [1367]    151^f [67,988]     62^f [97]      151^f [8752]

Table 5: MAPE (SE) of different variable selection methods for census data as evaluated by 10-fold CV. ∗ selection carried out using subsampling with (nn, ns) = (200, 10).

Method                                                MAPE

Mean-only                                             0.530 (0.004)
LS∗ (FRFS with all weights equal to 1)                0.223 (0.004)
FRFS-Full∗ with maximum 10 variables in model         0.215 (0.003)
FRFS-Full∗ with maximum 20 variables in model         0.201 (0.002)
FRFS-Full∗                                            0.190 (0.003)
LARS                                                  0.216 (0.003)

A description of some of the regressors selected by FRFS-Full with subsampling appears in Table 6. Only 14 of the 151 variables chosen by LS, and 19 of the 208 variables retained by LARS, are among the 64 variables selected by FRFS-Full with subsampling. It is perhaps not surprising that non-robust methods perform poorly here as 8.3% of observations get downweighted below 0.01 by our robust estimator.

We note that 11.8% of the observations are assigned weights less than 0.20. This amount of hard downweighting is more than that in our simulated data with 5% outliers in §4, but less than that in our simulated data with 10% outliers, see http link.

With these data we do not seek a ‘true’ model, but rather to best explain the average asking price in a sector with very few regressors. As there is near multicollinearity, many models should perform equivalently. The FRFS-Full approach outperforms LS and LARS w.r.t. MAPE. A model consisting of the first 10 variables selected by FRFS-Full outperforms the final 62-variable LS model w.r.t. MAPE, and has a shorter running time: 81 minutes versus 97 minutes, respectively. Given the added-value of variables beyond the first 10 or 20 selected robustly and data collecting costs, the practitioner might retain one of the smaller models even though they do not yield the selected FDR.

Table 6: Description of first-order covariates that appear in at least four variables selected by FRFS-Full with subsampling (nn = 200, ns = 10). • indicates that the first-order covariate is selected. # is the number of times the first-order covariate appears in selected first- or second-order variables. Description of other covariates is in http link.

Name Description #

P6.4 %-tage Asian or Pacific Islander (asian) 13

P11.1 %-tage (0:11] years old 6

P11.3 %-tage [25:64] years old 6

P14.7 %-tage females married, not separated 4

P16.1• %-tage of HH-lds with 1 person 6

H5.2 %-tage of vacant HU for sale only 4

H5.4 %-tage of vacant HU for seasonal, recreational or occasional use 4

H9.4 %-tage of ownOcc HU with asian HH-lder 9

H9.9 %-tage of rentOcc HU with asian HH-lder 8

H12.2 Average age of HH-lder in rentOcc HU’s 4

H15.1 Average number of rooms in a ownOcc HU 5

H15.2• Average number of rooms in a rentOcc HU 5

Finally, we put these data to random forests and get predictions in 410 minutes. Subsequently running the CV we find a MAPE of 0.191 (0.002). This is clearly better than the prediction power of LS and LARS, but only the same as that of FRFS-Full which has the advantage of yielding an interpretable model.

6 Concluding remarks

Although outliers are prevalent in very large data sets, existing robust methods for model fitting and selection are not computationally feasible in high dimensions. We present a simplified robust estimator that is fast and effective, allowing for much needed robust inference in high dimensions.

Note that all our computations were done in R and the code for our proposed approach is not optimized for speed, but rather for ease of use. Computing times reported should thus not be examined in the absolute, but rather on a comparative basis for the methods considered. Increased computational efficiency is a topic for future research.

Also note that we chose to scale predictors using non-robust mean and standard deviation. More robust location and scale estimators can be used, but we found that use of the median and the MAD led to a poorer performance of the FRFS-Marginal approach while causing no change in the performance of the FRFS-Full approach.
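The two scaling options mentioned above can be contrasted with a short Python sketch. It is illustrative only: the function names are hypothetical, the sample (n − 1) standard deviation is assumed for the classical version, and the MAD is multiplied by the usual consistency constant 1.4826 so that it estimates the standard deviation under normality; whether that constant was applied in our comparison is not specified here.

```python
import statistics


def standardize(values):
    """Classical scaling: subtract the mean, divide by the standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


def robust_standardize(values):
    """Robust scaling: subtract the median, divide by the normalized MAD.

    Fails (division by zero) when more than half the values are equal,
    as can happen with binary covariates.
    """
    med = statistics.median(values)
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]


x = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
print(standardize(x))
print(robust_standardize(x))
```

With classical scaling the outlier inflates the standard deviation and ends up with a modest standardized value, whereas with median and MAD scaling it stays far from the bulk of the data. The zero-MAD case noted in the docstring is one practical obstacle to robust scaling when many covariates are binary.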

Finally, note that we used an MM-estimator for the marginal regression models with one explanatory variable to get the starting weights, but one could also consider different and possibly simpler robust estimators. A redescending ψ-function is recommended in high dimensions, but for the marginal models, a non-redescending ψ-function, such as Huber’s (Huber 1964) ψ-function, could in principle be used. In that case, the bias of the one-step M-estimator would be different.
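The difference between the two families of ψ-functions can be made concrete with the sketch below, written in Python for illustration. Huber's ψ is as in Huber (1964); Tukey's biweight is used here merely as a common example of a redescending ψ and is not necessarily the function defined in §2. The tuning constants 1.345 and 4.685 are the usual choices giving 95% efficiency under normal errors, and the implied weight is taken as w(r) = ψ(r)/r.

```python
def psi_huber(r, c=1.345):
    """Huber psi: linear in the middle, constant (never zero) in the tails."""
    return max(-c, min(c, r))


def psi_biweight(r, c=4.685):
    """Tukey biweight psi: redescends to exactly zero for |r| > c."""
    if abs(r) > c:
        return 0.0
    return r * (1.0 - (r / c) ** 2) ** 2


def weight(psi, r):
    """Implied robustness weight w(r) = psi(r) / r, with w(0) = 1."""
    return 1.0 if r == 0 else psi(r) / r


for r in (0.5, 2.0, 5.0, 10.0):   # standardized residuals
    print(r, round(weight(psi_huber, r), 3), round(weight(psi_biweight, r), 3))
```

A gross outlier keeps a small but positive weight under Huber's ψ, while a redescending ψ assigns it a weight of exactly zero, which is the behavior behind the zero weight reported for one observation in Table 3.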


Supplemental Materials

Technical Details: More detailed presentation of technical results, additional simulation results and covariate description. (PDF format)

Data Sets: Data sets used for analyses in §5. (.csv format with header)

R Code: R code to build the 1325 centered and scaled regressors analysed in §5.2 from the 51 covariates in the provided data set. (ASCII)

References

Abramovich, F., Y. Benjamini, D. Donoho, and I. Johnstone (2006). Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics 34, 584–653.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Proceedings of the Second International Symposium on Information Theory, Budapest, pp. 267–281. Akademiai Kiado.

Alqallaf, F., S. Van Aelst, V. J. Yohai, and R. H. Zamar (2009). Propagation of outliers in multivariate data. The Annals of Statistics 37, 311–331.

Benjamini, Y. and Y. Gavrilov (2009). A simple forward selection procedure based on false discovery rate control. Annals of Applied Statistics 3, 179–198.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.

Benjamini, Y., A. Krieger, and D. Yekutieli (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93, 491–507.

Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.

Clough, T., M. Key, I. Ott, S. Ragg, G. Schadow, and O. Vitek (2009). Protein quantification in label-free LC-MS experiments. Journal of Proteome Research 8(11), 5275–5284.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression (with discussion). Annals of Statistics 32, 407–499.

Foster, D. and R. Stine (2004). Variable selection in data mining: Building a predictive model for bankruptcy. Journal of the American Statistical Association 99, 303–313.

Gavrilov, Y., Y. Benjamini, and S. Sarkar (2009). An adaptive step-down procedure with proven FDR control under independence. Annals of Statistics 37, 619–629.

Hampel, F. R. (1968). Contribution to the Theory of Robust Estimation. Ph.D. thesis, University of California, Berkeley.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.

Heritier, S., E. Cantoni, S. Copt, and M. Victoria-Feser (2009). Robust Methods in Biostatistics. Wiley.

Heritier, S. and E. Ronchetti (1994). Robust bounded-influence tests in general parametric models. Journal of the American Statistical Association 89(427), 897–904.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35, 73–101.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under non-standard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 221–233.

Khan, J. A., S. Van Aelst, and R. H. Zamar (2007a). Building a robust linear model with forward selection and stepwise procedures. Computational Statistics and Data Analysis 52, 239–248.

Khan, J. A., S. Van Aelst, and R. H. Zamar (2007b). Robust linear model selection based on least angle regression. Journal of the American Statistical Association 102, 1289–1299.

Lutz, R. W., M. Kalisch, and P. Buehlmann (2008). Computational Statistics and Data Analysis 52, 3331–3341.

Machado, J. A. F. (1993). Robust model selection and M-estimation. Econometric Theory 9, 478–493.

Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–675.

Maronna, R. A., R. D. Martin, and V. J. Yohai (2006). Robust Statistics: Theory and Methods. Chichester, West Sussex, UK: Wiley.

Müller, S. and A. H. Welsh (2005). Outlier robust model selection in linear regression. Journal of the American Statistical Association 100, 1297–1310.

Patil, S., R. E. Higgs, J. E. Brandt, M. D. Knierman, V. Gelfanova, J. P. Butler, A. M. Downing, J. Dorocke, R. A. Dean, W. Z. Potter, D. Michelson, A. X. Pan, S. S. Jhee, and J. E. Hale (2007). Identifying pharmacodynamic protein markers of centrally active drugs in humans: a pilot study in a novel clinical model. Journal of Proteome Research 6(3), 955–966.

Ronchetti, E. (1982). Robust Testing in Linear Models: The Infinitesimal Approach. Ph.D. thesis, ETH, Zurich, Switzerland.

Ronchetti, E., C. Field, and W. Blanchard (1997). Robust linear model selection by cross-validation. Journal of the American Statistical Association 92, 1017–1023.

Ronchetti, E. and R. G. Staudte (1994). A robust version of Mallows’s Cp. Journal of the American Statistical Association 89, 550–559.

Rousseeuw, P. J. and V. J. Yohai (1984). Robust regression by means of S-estimators. In J. Franke, W. Härdle, and R. D. Martin (Eds.), Robust and Nonlinear Time Series Analysis, pp. 256–272. New York: Springer-Verlag.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

Wu, Y., D. D. Boos, and L. A. Stefanski (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association 102, 235–243.
