Influence functions for penalized M-estimators
AVELLA MEDINA, Marco Andrés
AVELLA MEDINA, Marco Andrés. Influence functions for penalized M-estimators. 2014
Available at:
http://archive-ouverte.unige.ch/unige:35319
Influence functions for penalized M-estimators
Marco Avella-Medina
Research Center for Statistics University of Geneva, Switzerland
March 2014
Abstract
We study the local robustness properties of general penalized M-estimators via the influence function. More precisely, we propose a framework that allows us to define rigorously the influence function as the limiting influence function of a sequence of approximating estimators. Our approach can deal with nondifferentiable penalized M-estimators and a diverging number of parameters. We show that it can be used to characterize the robustness properties of a wide range of sparse estimators and we derive its form for general penalized M-estimators including lasso and adaptive lasso type estimators. We prove that our influence function is equivalent to a derivative in the sense of distribution theory.
1 Introduction
Sparse models have become very popular in recent years. Since their introduction in the linear model (Breiman [1995], Tibshirani [1996]), many extensions and algorithms have been proposed. See Tibshirani [2011] for a retrospective. Asymptotic properties of lasso-type estimators have been studied in the fixed dimensional parameter case (Knight and Fu [2000], Fan and Li [2001], Zou [2006]), as well as in the high dimensional set up where the number of parameters is allowed to grow at an even faster rate than the sample size (Bühlmann and Van De Geer [2011]).
Given the increasing importance that sparsity inducing penalties play in modern statistics, the need for a clear understanding of the robustness properties of these types of procedures is evident. Robust statistics develops a theoretical framework that allows us to take into account that the models used for fitting the data are only idealized approximations of reality. It provides methods that still give reliable results when slight deviations from the stochastic assumptions on the model occur. Book-length expositions can be found in Huber [1981] and its 2nd edition by Huber and Ronchetti [2009], Hampel et al. [1986] and Maronna et al. [2006]. One of the main lines of research in the robustness literature was opened by Hampel, who considered local robustness, i.e. the impact of moderate distributional deviations from ideal models on a statistical procedure. In this setting the quantities of interest are viewed as functionals of the underlying distribution. Typically their linear approximation is studied to assess the behavior of the estimators in a neighborhood of the model. In this approach the influence function plays a crucial role in describing the local stability of the functional analyzed.
Although some authors have suggested sparse estimators that limit the impact of contamination in the data (e.g. Sardy et al. [2001], Wang et al. [2007a], Li et al. [2011] and Lozano and Meinshausen [2013] among many others), the robustness properties of these procedures are not well understood. All these procedures rely on the intuition that a loss function that defines robust estimates in the well understood unpenalized fixed dimensional M-estimator should also define a robust estimator when it is penalized by a deterministic function. To the best of our knowledge, only Alfons et al. [2013] and Wang et al. [2013] established formal robustness properties for their estimators. Both papers computed finite sample breakdown points. The latter also gives influence functions, although in a limited set up without developing a rigorous framework. Further details can be found in Section 4.
The goal of this article is to characterize the robustness properties of a wide class of penalized M-estimators, which covers most of the existing proposals, by deriving their influence function. This requires developing a new framework. Indeed, the typical tools used to derive the influence function of M-estimators suffer from two main problems when considering penalized M-estimators. They are not well suited for a scenario where the number of parameters is allowed to increase with the sample size, and they cannot handle nondifferentiable penalty functions, which are necessary for achieving sparsity (Fan and Li [2001]).
Our work provides a number of contributions to the existing literature. First, we introduce an influence function defined through a sequence of approximating estimators and show that it is uniquely defined for penalized M-estimators and two-stage penalized M-estimators. The former class covers lasso and group lasso type estimators. The latter includes adaptive lasso type estimators. We compute the influence function of all these important examples. The result is valid when the number of parameters diverges with the sample size; in particular the result holds for high dimensional problems. Second, we show that the two main features of the influence functions for M-estimators are also valid for the influence function of penalized M-estimators, i.e. (a) it allows us to assess the relative influence of individual observations toward the value of an estimate; (b) it allows an immediate and simple, informal assessment of the asymptotic properties of an estimate (Huber [1981], p. 14-15). Third, we show that our limiting influence function can be viewed as a distributional derivative in the sense of Schwartz [1959]. This opens the door for further research exploiting the tools of distribution theory, which to the best of our knowledge have essentially not been used in the statistical literature previously. Finally, a key step in our theoretical argument is the innovative use of Berge's maximum theorem. This is a powerful tool that could have more applications in statistics.
The rest of the article is organized as follows. Section 2 introduces the general framework and provides the main results regarding our influence function. In Section 3 we compute the influence function for a wide class of penalized M-estimators. Section 4 introduces distributional derivatives for the computation of influence functions. Section 5 concludes with some remarks. All the proofs are given at the end of the article and auxiliary results are provided in the Appendix.
2 Penalized M-estimators and Influence Functions
We consider parametric estimators obtained as the minimizers of regularized problems of the form
$$\Lambda_\lambda(\theta; F, p) = E_F[L(Z, \theta)] + p(\theta; \lambda) \tag{1}$$
with respect to the parameter $\theta \in \Theta \subset \mathbb{R}^d$, where $L$ is a continuous loss function, $p(\cdot;\lambda)$ a continuous penalty function with regularization parameter $\lambda$, $Z \in \mathbb{R}^k$ and $F$ the distribution function of $Z$. We denote the resulting penalized M-estimators by $T(F) = \theta_{F,\lambda} = \theta^*$. Important examples are given in Section 3.
For M-estimators the standard argument for showing the existence and deriving the form of the influence function is to use an appropriate implicit function theorem. For (1), this typically requires the existence of two derivatives with respect to the parameter $\theta$. However it is well known that a penalty function has to be singular at the origin in order to achieve sparsity (Fan and Li [2001]). Therefore new tools that can deal with nondifferentiable penalty functions are required to derive influence functions of many modern penalized estimators. We propose to define the influence function as the limit of the influence functions of a sequence of differentiable penalized M-estimators that converge to the penalized M-estimator of interest.
2.1 Influence Function
We will require the following set of conditions for the derivation of our theoretical results:
(C1) The set $\Theta \subset \mathbb{R}^d$ is compact.
(C2) $E_F[L(Z, \theta)]$ has two derivatives with respect to $\theta$, denoted by $E_F[\psi(Z, \theta)]$ and $E_F[\dot\psi(Z, \theta)]$ respectively.
(C3) The functions $\psi(z, \theta)$ and $E_F[\dot\psi(Z, \theta)]$ are continuous at $\theta^*$.
(C4) Let $\{p_m\}_{m\ge 1}$ and $\{p'_m\}_{m\ge 1}$ be two sequences in $C^\infty(\Theta) = \{f : \Theta \to \mathbb{R} \mid f \text{ continuous and infinitely differentiable}\}$ converging to $p$ in the Sobolev space $W^{2,2}(\Theta)$ when $m \to \infty$.
For a function $h : \mathbb{R}^d \to \mathbb{R}$, we denote by $\nabla h$ and $\nabla^2 h$ its first two derivatives and by $\nabla_x h$ its derivative with respect to the variable $x$. Recall that the influence function (Hampel [1974], Hampel et al. [1986]) of a functional $T(F)$ is a special Gâteaux derivative given by
$$IF(z; F, T) = \lim_{\epsilon \to 0} \frac{T(F_\epsilon) - T(F)}{\epsilon},$$
where $F_\epsilon = (1-\epsilon)F + \epsilon\Delta_z$ and $\Delta_z$ is the probability distribution that assigns mass 1 at the point $z$ and 0 elsewhere. It has the heuristic interpretation of describing the effect of an infinitesimal contamination at the point $z$ on the estimate, standardized by the mass of contamination. When the penalty function is sufficiently smooth, a standard argument establishes the existence and the form of the influence function of the minimizer of (1).
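Lemma 1 : Assume (C1) and (C2). Let $p : \Theta \to \mathbb{R}$ be twice differentiable and $S := E_F[\dot\psi(Z, \theta^*)] + \nabla^2 p(\theta^*; \lambda)$ be invertible. Then the influence function of $T(F)$ exists for all $z \in \mathbb{R}^k$ and we have
$$IF(z; F, T) = -S^{-1}\left(\psi(z, \theta^*) + \nabla p(\theta^*; \lambda)\right).$$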
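As an illustrative check of Lemma 1, and not part of the original argument, the sketch below takes a smooth ridge-type penalty $p(\theta;\lambda) = \lambda\theta^2/2$ with squared-error loss $L = (y - x\theta)^2/2$ and a scalar parameter. The four-point distribution `F`, the value of `lam` and the contamination point `z` are arbitrary assumptions chosen only so that the closed-form expression can be compared with a finite-difference version of the Gâteaux derivative.

```python
import numpy as np

# Hypothetical discrete model F: four support points (x, y) with equal mass.
x = np.array([-1.0, 0.5, 1.0, 2.0])
y = np.array([-0.8, 0.7, 1.1, 1.9])
w = np.full(4, 0.25)
lam = 0.3  # regularization parameter (arbitrary choice)

def ridge_functional(weights, xs, ys):
    """Minimizer of E_F[(Y - X theta)^2 / 2] + lam * theta^2 / 2."""
    return np.sum(weights * xs * ys) / (np.sum(weights * xs**2) + lam)

def if_lemma1(z):
    """-S^{-1} (psi(z, theta*) + grad p(theta*; lam)) for this toy problem."""
    x0, y0 = z
    theta = ridge_functional(w, x, y)
    psi = -(y0 - x0 * theta) * x0      # derivative of the loss in theta
    S = np.sum(w * x**2) + lam         # E_F[psi_dot] + second derivative of p
    return -(psi + lam * theta) / S

def if_finite_difference(z, eps=1e-6):
    """(T(F_eps) - T(F)) / eps with F_eps = (1 - eps) F + eps Delta_z."""
    x0, y0 = z
    w_eps = np.append((1.0 - eps) * w, eps)
    t_eps = ridge_functional(w_eps, np.append(x, x0), np.append(y, y0))
    return (t_eps - ridge_functional(w, x, y)) / eps

z = (3.0, -2.0)                        # contamination point (arbitrary)
analytic = if_lemma1(z)
numeric = if_finite_difference(z)
```

Under these assumptions the two values agree up to the finite-difference error, and the analytic expression grows without bound in `y0` because $\psi$ is unbounded for the squared-error loss.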
It follows that, just as for M-estimators, a bounded derivative of the loss function is required in order to obtain bounded influence estimators in the penalized setting. When the penalty function in (1) is not differentiable the conditions of Lemma 1 do not hold. We therefore propose to study the limiting form of the influence function of penalized M-estimators obtained using smooth penalty functions $p_m$ such that $\lim_{m\to\infty} p_m = p$. Such penalized M-estimators, denoted by $T(F; p_m)$, are defined as the global minimizers of
$$\Lambda_\lambda(\theta; F, p_m) = E_F[L(Z, \theta)] + p_m(\theta; \lambda). \tag{2}$$
Let $IF_{p_m}(z; F, T)$ be the influence function of $T(F; p_m)$. We define the influence function of $T(F)$ as
$$IF(z; F, T) := \lim_{m\to\infty} IF_{p_m}(z; F, T). \tag{3}$$
The following lemma states the uniqueness of the limiting estimator $T(F)$.
The use of Berge’s maximum theorem (Berge [1997]) is a key step in our proof.
It constitutes an innovative tool in the statistical literature. The lemma is crucial for the uniqueness argument of the limiting influence function, stated in Proposition 1. While completing this article, the author noticed that Machado [1993] has also used Berge’s maximum theorem for the derivation of qualitative robustness for model selection criteria based on M-estimators.
Lemma 2 : Under (C1), (C2) and (C4) we have $\lim_m T(F; p_m) = \lim_m T(F; p'_m) = T(F)$.
Remark 1 : We could extend the uniqueness argument of Lemma 2 to local minimizers if the selection step given at the end of the proof (see Section 6) can be extended to local minima. This could be easily done, for example, if there is a class of approximating penalty functions $\{p_m\}$ that generate at least the same number of local minima as $p$ for $m > m_0$.
Proposition 1 : Let $T(F; p_m)$ satisfy the conditions of Lemma 1 and assume (C1)-(C4). Then the limiting influence function defined in (3) does not depend on the choice of $p_m$.
2.2 Two-stage penalized M-estimators
We can extend the previous results to a class of two-stage penalized M-estimators. Important examples of such estimators are studied and discussed in Section 3.2. The following set up can be viewed as a direct extension of the framework provided by Zhelonkin et al. [2012]. Let $F$ be the distribution function of $Z = (Z^{(1)}, Z^{(2)})$ and let $\theta = (\theta_1, \theta_2)$ be a vector defining the arguments of the first and second stages, with $\theta_1 \in \Theta_1 \subset \mathbb{R}^{d_1}$, $\theta_2 \in \Theta_2 \subset \mathbb{R}^{d_2}$ and $d = d_1 + d_2$. We consider penalized M-estimators $(\theta_1^*, \theta_2^*) = (S(F), T(F))$ defined by
$$\theta_1^* = \operatorname*{argmin}_{\theta_1} \left\{ E_F[L^{(1)}(Z^{(1)}, \theta_1)] + p^{(1)}(\theta_1; \gamma) \right\} \tag{4}$$
$$\theta_2^* = \operatorname*{argmin}_{\theta_2} \left\{ E_F[L^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1^*)] + p^{(2)}(\theta_2, \theta_1^*; \lambda) \right\} \tag{5}$$
where $L^{(i)}$ and $p^{(i)}$ denote respectively the loss and penalty functions in the $i$th stage. For the theoretical argument we adapt conditions (C2)-(C4) to this set up in the following straightforward way:
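(C2') $E_F[L^{(1)}(Z^{(1)}, \theta_1)]$ has two derivatives with respect to $\theta_1$, denoted by $E_F[\psi^{(1)}(Z^{(1)}, \theta_1)]$ and $E_F[\dot\psi^{(1)}(Z^{(1)}, \theta_1)]$ respectively. $E_F[L^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ has two derivatives with respect to $\theta_2$, denoted by $E_F[\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ and $E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ respectively. Furthermore $E_F[\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ is differentiable with respect to $\theta_1$.
(C3') The functions $\psi^{(1)}(z^{(1)}, \theta_1)$ and $\psi^{(2)}(z^{(2)}, \theta_2, z^{(1)}, \theta_1)$ are continuous at $\theta_1^*$, and $E_F[\dot\psi^{(1)}(Z^{(1)}, \theta_1)]$, $\nabla_{\theta_1} E_F[\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ and $E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)]$ are continuous at $(\theta_1^*, \theta_2^*)$.
(C4') For $i = 1, 2$, let $\{p^{(i)}_{m_i}\}$ and $\{p^{(i)\prime}_{m_i}\}$ be two sequences in $C^\infty(\Theta_i)$ converging to $p^{(i)}$ in the Sobolev space $W^{2,2}(\mathbb{R}^{d_i})$ when $m_i \to \infty$. Furthermore $\{\nabla_{\theta_2} p^{(2)}_m\}$ and $\{\nabla_{\theta_2} p^{(2)\prime}_m\}$ are differentiable with respect to $\theta_1$.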
We first provide the influence function of (5) for sufficiently smooth penalty functions.
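Lemma 3 : Denote by $\theta^* = (S(F), T(F))$ the estimators defined by (4)-(5). Assume (C1), (C2') and let $p^{(i)} : \Theta_i \to \mathbb{R}$ be twice differentiable with respect to $\theta_i$ and $\nabla_{\theta_2} p^{(2)}$ be differentiable with respect to $\theta_1$. Let also $S := E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2^*, Z^{(1)}, \theta_1^*)] + \nabla_{\theta_2,\theta_2} p^{(2)}(\theta_2^*, \theta_1^*; \lambda)$ be invertible. Then the influence function of $T(F)$ exists for all $z \in \mathbb{R}^k$ and we have:
$$IF(z; F, T) = -S^{-1}\Big\{ \psi^{(2)}(Z^{(2)}, \theta_2^*, Z^{(1)}, \theta_1^*) + \nabla_{\theta_2} p^{(2)}(\theta_2^*, \theta_1^*; \lambda) + \big( \nabla_{\theta_2} E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)] + \nabla_{\theta_2,\theta_1} p^{(2)}(\theta_2^*, \theta_1^*; \lambda) \big)\, IF(z^{(1)}; F, S) \Big\},$$
where $IF(z^{(1)}; F, S)$ has the form given in Lemma 1.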
Unsurprisingly, bounded-influence estimators are obtained only by taking loss functions with bounded derivatives in both stages. The expression obtained is very similar to the one derived in the unpenalized set up in Zhelonkin et al. [2012]. We are now ready to state the uniqueness of the limiting two-stage estimator (5) and its influence function.
Lemma 4 : Under (C1), (C2') and (C4') we have $\lim_m T(F; p^{(2)}_m) = \lim_m T(F; p^{(2)\prime}_m) = T(F)$.
Proposition 2 : Let $p_m = (p^{(1)}_m, p^{(2)}_m)$, let $T(F; p_m)$ satisfy the conditions of Lemma 1 and assume (C1), (C2')-(C4'). Then the limiting influence function (3) of (5) does not depend on the choice of $p_m$.
3 Examples
We are now ready to derive the influence function of some general penalized M-estimators using approximating sequences. We consider different classes of sparsity inducing penalty functions in the analysis. Without loss of generality we will assume that the tuning parameter $\lambda$ is such that the resulting estimators are sparse. More specifically we consider that $\theta_{F,\lambda} = \theta^* = (\theta_1^{*T}, \theta_2^{*T})^T$ with $\theta_1^* \in \mathbb{R}^s$, $s < d$ and $\theta_2^* = 0$. Note that (1) is a population version of the empirical problem encountered in practice, where $F$ is replaced by its empirical version $\hat F$, i.e. the distribution that assigns mass $1/n$ to each observation $(x_i, y_i)$, $i = 1, \dots, n$. In a set up where the number of parameters is allowed to grow with the sample size, the true underlying distribution generating the data is $F_n$ and its empirical version $\hat F_n$. For notational convenience we will write $F$ instead of $F_n$ as our result holds in both cases, as well as for their empirical counterparts. In particular our results hold in the high dimensional framework where $d > n$.
3.1 Lasso and group lasso type penalties
The following proposition gives the form of the influence function of estimators that arise when considering a class of general penalty functions. It covers as special cases convex penalties such as the lasso (Tibshirani [1996]) and nonconvex penalties such as the scad (Fan and Li [2001]).
Proposition 3 : Denote by $\theta^* = T(F)$ the penalized M-estimators obtained as the minimizer of (1) with penalty functions of the form $p_\lambda(\theta) = \sum_{j=1}^{d} p_{\lambda,j}(|\theta_j|)$, where $p_{\lambda,j}(\cdot)$ are differentiable functions. Then under (C1)-(C3) the influence function (3) of $T(F)$ has the form
$$IF(z; F, T) = -S^{-1}\left(\psi(z, \theta^*) + \phi_\lambda(\theta^*)\right),$$
where $S^{-1} = \mathrm{blockdiag}\{(M_{11} + P_\lambda)^{-1}, 0\}$, $M_{11} = E_F[\dot\psi_{11}(Z, T(F))]$, $P_\lambda$ is a diagonal matrix with diagonal elements $p''_{\lambda,j}(|\theta_j^*|)$ and $\phi_\lambda(\theta^*)$ is a $d$ dimensional vector with components $p'_{\lambda,j}(|\theta_j^*|)\,\mathrm{sign}(\theta_j^*)$ for $j = 1, \dots, d$.
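The limiting construction behind Proposition 3 can be illustrated numerically in a deliberately simplified case. The sketch below assumes a scalar parameter, the squared-error loss $L = (y - x\theta)^2/2$ and the smooth penalty $\lambda s_m(\theta)$, where $s_m$ is the approximation of $|t|$ used in the proof of Proposition 3, so that $s_m'(t) = \tanh(tm/2)$. The population moments `A` $= E_F[XY]$ and `C` $= E_F[X^2]$, the value of `lam` and the point `z` are hypothetical inputs. For each $m$ it solves the first-order condition by bisection and evaluates the influence function of Lemma 1.

```python
import numpy as np

def solve_smooth_lasso(A, C, lam, m, iters=200):
    """Root of g(theta) = -A + C*theta + lam*tanh(m*theta/2), found by bisection.

    g is increasing and changes sign between 0 and A/C, so that interval brackets the root.
    """
    lo, hi = min(0.0, A / C), max(0.0, A / C)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g = -A + C * mid + lam * np.tanh(0.5 * m * mid)
        if g > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def if_smooth(z, A, C, lam, m):
    """Influence function of Lemma 1 for the smooth penalty lam * s_m(theta)."""
    x0, y0 = z
    theta = solve_smooth_lasso(A, C, lam, m)
    th = np.tanh(0.5 * m * theta)
    psi = -(y0 - x0 * theta) * x0                 # derivative of the loss
    grad_p = lam * th                             # lam * s_m'(theta)
    hess_p = lam * 0.5 * m * (1.0 - th**2)        # lam * s_m''(theta)
    return -(psi + grad_p) / (C + hess_p)

z, C, lam = (3.0, -2.0), 1.0, 0.5
ms = [1e1, 1e2, 1e4, 1e6]
if_active = [if_smooth(z, 1.5, C, lam, m) for m in ms]    # |A| > lam: theta* = 1
if_inactive = [if_smooth(z, 0.2, C, lam, m) for m in ms]  # |A| < lam: theta* = 0
```

In this toy setting the active case approaches $-(\psi(z,\theta^*) + \lambda\,\mathrm{sign}(\theta^*))/M_{11}$, while the inactive case shrinks toward 0 because $s_m''$ diverges near the origin, in line with the zero block of $S^{-1}$.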
We now give the form of the influence function of penalized M-estimators achieving sparsity for grouped variables via group lasso type of penalties (e.g. Yuan and Lin [2006], Wang et al. [2007b], Huang et al. [2009]).
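Proposition 4 : Denote by $\theta^* = T(F)$ the penalized M-estimators obtained as the minimizer of (1) with group penalty functions of the form $p_\lambda(\theta) = \sum_{g=1}^{G} p_{\lambda,g}(\|\theta_{(g)}\|_2)$, where $p_{\lambda,g}(\cdot)$ are differentiable functions, $\theta = (\theta_{(1)}, \dots, \theta_{(G)})$ and each $\theta_{(g)}$ is a subvector of $\theta$ corresponding to the $g$th group of variables. Then under (C1)-(C3) the influence function (3) of $T(F)$ has the form
$$IF(z; F, T) = -S^{-1}\left(\psi(z, \theta^*) + \phi_\lambda(\theta^*)\right),$$
where $S^{-1} = \mathrm{blockdiag}\{(M_{11} + P_\lambda)^{-1}, 0\}$, $M_{11} = E_F[\dot\psi_{11}(Z, T(F))]$ and $P_\lambda$ is a block diagonal matrix with blocks $p''_{\lambda,g}(\|\theta_{(g)}\|_2)\,(\theta_{(g)}\theta_{(g)}^T - \|\theta_{(g)}\|_2 I)/\|\theta_{(g)}\|_2^3$, where
$$\phi_\lambda(\theta^*) = \begin{cases} p'_{\lambda,g}(\|\theta^*_{(g)}\|)\,\theta_j^*/\|\theta^*_{(g)}\|_2 & \text{if } \theta^*_{(g)} \neq 0 \\ 0 & \text{if } \theta^*_{(g)} = 0 \end{cases}$$
and $p'_{\lambda,g}(t)$ and $p''_{\lambda,g}(t)$ are the first two derivatives of $p_{\lambda,g}(t)$ for $t > 0$.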
It is clear from Proposition 3 and Proposition 4 that a bounded $\psi$ function is necessary for a penalized M-estimator to have a bounded limiting influence function. The fact that the influence function has some zero components is rather surprising. Further discussion can be found in the next two sections.
3.2 Adaptive lasso type penalties
The adaptive lasso of Zou [2006] is a popular two stage procedure that improves on the results of the lasso by ensuring the oracle properties of Fan and Li [2001]. The tools developed in Section 2.2 allow us to derive the influence function of adaptive lasso type estimators.
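Proposition 5 : Let $\theta$ be the $d$ dimensional parameter of interest and let $\theta^{(0)} = S(F)$ be an initial estimate of $\theta$, with $s^{(0)}$ non zero components, defined by (4). For $j = 1, \dots, d$ and some function $w$, define the weights $w_j = w(|\theta_j^{(0)}|)$. Denote by $\theta^* = T(F)$ the penalized M-estimators obtained as the minimizer of (5) with penalty function of the form $p(\theta, \theta^{(0)}; \lambda) = \lambda \sum_{j=1}^{d} w_j|\theta_j|$ and loss function $L(Z, \theta, \theta^{(0)})$. Then under (C1), (C2') and (C3') the influence function (3) of $T(F)$ has the form
$$IF(z; F, T) = -S^{-1}\left(\psi(z, \theta^*) + \phi_\lambda(\theta^*, \theta^{(0)}) + \varphi_\lambda(\theta^*, \theta^{(0)})\, IF(z; F, S)\right),$$
where $S^{-1} = \mathrm{blockdiag}\{M_{11}^{-1}, 0\}$, $M_{11} = E_F[\dot\psi_{11}(Z, T(F))]$, $\phi_\lambda(\theta^*, \theta^{(0)})$ is a $d$ dimensional vector with components $\lambda w'(|\theta_j^{(0)}|)\,\mathrm{sign}(\theta_j^{(0)})\,\mathrm{sign}(\theta_j^*)$ for $j = 1, \dots, d$, with $w'$ denoting the derivative of $w$, and $\varphi_\lambda(\theta^*, \theta^{(0)}) = \mathrm{blockdiag}\{H_\gamma, 0\}$ where $H_\gamma = \{\nabla_{\theta_k^{(0)}} \phi_{\lambda,j}(\theta^*, \theta^{(0)})\}_{j,k=1}^{s^{(0)}}$.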
The form of $IF(z; F, T)$ depends on the choice of the initial estimator. A bounded influence estimator $T(F)$ can only be obtained by taking a bounded influence initial estimator $S(F)$ and choosing a loss function defining a bounded $\psi$ function in the second stage. Among the nonsparse initial estimates proposed in the literature, Zou [2006] proposed to use maximum likelihood estimates for the fixed parameter case where $d$ does not vary with $n$. In the high dimensional set up where $d > n$, Huang et al. [2008] proposed to use an initial zero consistent estimate, e.g. marginal least squares. For those cases the influence function of $S(F)$ is simply proportional to the score function as they are well known M-estimators. Lasso estimators have also been proposed as initial estimates. See for instance Fan et al. [2014] and Avella-Medina and Ronchetti [2014], in the context of high dimensional penalized likelihood and robust quasilikelihood estimation respectively. Note that for $w_j = 1/|\theta_j^{(0)}|$ the usual convention is that for $\theta_j^{(0)} = 0$ we define $w_j = \infty$ and $\infty \cdot 0 = 0$. Hence for this choice of $w_j$, a coefficient set to zero in the first step will never appear in $T(F)$.
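The weight convention just described can be sketched in code. The example assumes a separable toy second stage, $\sum_j (z_j - \theta_j)^2/2 + \lambda w_j|\theta_j|$, whose minimizer is coordinatewise soft-thresholding at $\lambda w_j$; this simplification, the function names and the numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def adaptive_weights(theta0):
    """w_j = 1 / |theta0_j|, with w_j = inf when the initial estimate is zero."""
    theta0 = np.asarray(theta0, dtype=float)
    w = np.full(theta0.shape, np.inf)
    nonzero = theta0 != 0.0
    w[nonzero] = 1.0 / np.abs(theta0[nonzero])
    return w

def adaptive_soft_threshold(z, theta0, lam):
    """Minimizer of sum_j (z_j - theta_j)^2 / 2 + lam * w_j * |theta_j| (toy problem)."""
    z = np.asarray(z, dtype=float)
    thresholds = lam * adaptive_weights(theta0)
    # An infinite threshold gives max(-inf, 0) = 0, which implements inf * 0 = 0.
    return np.sign(z) * np.maximum(np.abs(z) - thresholds, 0.0)

z = np.array([2.0, 0.3, -1.0, 0.8])        # hypothetical unpenalized second-stage fit
theta0 = np.array([1.5, 0.0, -0.5, 0.1])   # hypothetical first-stage estimate
theta = adaptive_soft_threshold(z, theta0, lam=0.2)
```

With these inputs the second coefficient stays at zero because its first-stage estimate is zero, and the fourth is removed because its small initial value produces a large weight.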
4 Connections to distribution theory
As mentioned in the introduction, Wang et al. [2013] studied the robustness properties of their estimator by calculating its finite sample breakdown point and influence function. It can be seen that their derivation of the influence function (Theorem 3) could easily be extended to more general loss functions. By the argument given above in Section 3, it also extends to a diverging number of parameters set up. In their proof they implicitly use distributions in the sense of Schwartz [1959], since they require the first two derivatives of the absolute value function. Indeed they use $\mathrm{sign}(x)$ as first derivative of $|x|$ and the Dirac delta function $\delta(x)$ as its second derivative. These derivatives are justified by the theory of distributions. However, working explicitly with the Dirac delta function understood informally as
$$\delta(x) = \begin{cases} +\infty & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}$$
and inverting a matrix containing such an expression is not fully satisfactory from a formal mathematical standpoint. Interestingly, the expression obtained in Theorem 3 in Wang et al. [2013] is the same as the one we give in Proposition 3. This suggests that a more careful treatment of the problem with a rigorous use of differentiation in the sense of distribution theory will yield the same influence function.
We establish such a result in a very simple set up. Assume an orthogonal linear model, i.e.
$$y_i = x_i^T\theta + \epsilon_i, \quad i = 1, \dots, n,$$
where $\epsilon_i$ are zero mean errors, $y_i$ the responses and $x_i$ the covariates with $X^TX = I$ and $X = (x_1, \dots, x_n)^T$. We consider the penalized least squares problem
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i^T\theta)^2 + \sum_{j=1}^{d} p_\lambda(|\theta_j|) \tag{6}$$
where $p_\lambda(\cdot)$ is a penalty function and $\lambda$ the regularization parameter. For the lasso and scad penalties, the resulting estimators have an explicit solution that allows for easy computations of distributional derivatives.
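Lemma 5 : Let $\theta^{(0)} = T^{(0)}(F)$ denote the least squares estimator of $\theta$. Then the influence function of the minimizer $\theta^* = T(F)$ of (6) exists as a distributional derivative for the lasso and scad penalties, and an explicit computation yields
$$IF(z; F, T) = \begin{cases} 0 & \text{if } |\theta_j^{(0)}| \le \lambda \\ \psi_j(z, \theta^*) + \lambda\,\mathrm{sign}(\theta_j^*) & \text{otherwise} \end{cases}$$
and
$$IF(z; F, T) = \begin{cases} 0 & \text{if } |\theta_j^{(0)}| \le 2\lambda \\ \left(\psi_j(z, \theta^*) + p'_\lambda(|\theta_j^*|)\,\mathrm{sign}(\theta_j^*)\right) / \left(1 + p''_\lambda(|\theta_j^*|)\right) & \text{otherwise} \end{cases}$$
respectively, where $z = (y, x)$, $\psi_j(z, \theta) = (y - x^T\theta^*)x_j$, and $p'_\lambda(t)$ and $p''_\lambda(t)$ denote the first two derivatives of $p_\lambda(t)$ for $t > 0$.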
At first glance the theory of distributions seems to provide a natural and rigorous way of tackling nondifferentiable penalties. However the theory suffers from at least two major drawbacks for the purposes of deriving influence functions as in Lemma 5. The product of two distributions cannot be consistently defined in general. This makes the manipulation of distributions delicate. Furthermore, to the best of our knowledge there is no implicit function theorem that would easily allow us to derive the form of the derivative of an implicit function. Therefore extending the proof of Lemma 5 to more general problems does not look obvious. We can however show that the influence functions derived in Section 3 using the limiting influence function can be viewed as distributional derivatives. Before giving this result in Proposition 6, we need an intermediate result concerning the continuity of $T(F_\epsilon)$ with respect to $\epsilon$ that is interesting in its own right. Its proof uses Berge's maximum theorem and is similar to the proof of Lemma 2.
Lemma 6 : Under (C1) and (C2), the penalized M-estimator $T(F_\epsilon)$ resulting from the minimization of (1) is continuous with respect to $\epsilon \in (-\varepsilon, \varepsilon)$, $\varepsilon > 0$.
Proposition 6 : Under the assumptions of Proposition 1, the influence function (3) of the minimizer $T(F)$ of (1) is the distributional derivative of $T(F_\epsilon)$ with respect to $\epsilon$ evaluated at 0.
Note that with Proposition 6 at hand, Lemma 5 becomes a direct consequence of it. We present Lemma 5 and give its proof through direct calculations, as it constitutes one rare example where distributional derivatives can easily and explicitly be computed for a penalized M-estimator.
5 Discussion
We introduced the idea of calculating the influence function of penalized M-estimators with the help of a sequence of approximating estimators. In Section 2 we justified the validity of such an approach. In Section 4 we showed that it is equivalent to computing a distributional derivative. In Section 3 we derived the limiting influence functions of general penalized M-estimators, arising when choosing some classes of sparsity inducing penalty functions. The three families of penalty functions analyzed cover the most prominent examples studied in the literature. These derivations are easy to compute, fairly intuitive and give explicit solutions.
Let us now discuss two main properties of the influence function. First, since by definition the influence function is essentially an ordinary derivative with respect to $\epsilon$, if a statistical functional $T(F)$ is sufficiently regular, a von Mises expansion (Taylor expansion, Mises [1947]) yields
$$T(G) \approx T(F) + \int IF(z; F, T)\, d(G - F)(z). \tag{7}$$
Substituting in this expression the empirical distribution $\hat F$ for $G$, we obtain
$$\sqrt{n}\left(T(\hat F) - T(F)\right) \approx \frac{1}{\sqrt{n}}\sum_{i=1}^{n} IF(z_i; F, T)$$
because $\int IF(z; F, T)\, dF(z) = 0$. Then by the central limit theorem we have that $\sqrt{n}\left(T(\hat F) - T(F)\right)$ is asymptotically normal with mean 0 and variance $V(F, T) = \int IF(z; F, T)\, IF^T(z; F, T)\, dF(z)$. A rigorous general argument can be found in Huber [1981]. For M-estimators, the conditions for the Fréchet differentiability of Clarke [1986] guarantee the validity of the von Mises expansion and imply good robustness properties as discussed in Bednarski [1993]. Using these formulas and the influence functions derived in Section 3, we see that for penalized M-estimators, the approximation (7) leads to the expression
$$\sqrt{n}(\hat\theta - \theta^*) \to_d N(0, V(F, T)),$$
where $V(F, T)$ is block diagonal with 0 entries for the elements corresponding to $\theta_2$. All the analysis in Section 3 was developed for a fixed $\lambda$. If we assume that there is a true underlying set of parameters $\theta_0$, this implies in general that $\theta^* \neq \theta_0$. Note however that if we work instead with a $\lambda_n$ tending to zero at an appropriate rate, the heuristic asymptotic distribution would match the oracle properties (Fan and Li [2001]) for the class of lasso type estimators satisfying such properties. Examples of this type of estimators were derived for instance in Fan and Lv [2010] and Avella-Medina and Ronchetti [2014]. The von Mises expansion can therefore be used to assess informally the asymptotic properties of an important class of penalized M-estimators. A more careful study of this phenomenon is required for a better understanding of the conditions under which it holds. Still, as pointed out in Hampel et al. [1986], p.85: “. . . it is usually easier to verify the asymptotic normality in another way instead of trying to assess the necessary regularity conditions to make this approach rigorous”.
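The two facts used in this heuristic, that the influence function integrates to zero under $F$ and that $V(F, T)$ is its second moment, can be checked on a small example. The sketch below assumes a scalar ridge-penalized least squares functional with loss $(y - x\theta)^2/2$ and penalty $\lambda\theta^2/2$ on a hypothetical four-point distribution; the data and `lam` are arbitrary and serve only as an illustration of the formulas.

```python
import numpy as np

# Hypothetical discrete model F (equal mass on four points) and penalty level.
x = np.array([-1.0, 0.5, 1.0, 2.0])
y = np.array([-0.8, 0.7, 1.1, 1.9])
w = np.full(4, 0.25)
lam = 0.3

# Ridge-penalized functional: minimizer of E_F[(Y - X theta)^2 / 2] + lam * theta^2 / 2.
S = np.sum(w * x**2) + lam
theta_star = np.sum(w * x * y) / S

# Influence function of Lemma 1 evaluated at every support point of F.
psi = -(y - x * theta_star) * x
if_values = -(psi + lam * theta_star) / S

mean_if = np.sum(w * if_values)   # integral of IF dF, expected to vanish
V = np.sum(w * if_values**2)      # heuristic asymptotic variance V(F, T)
```

Here `mean_if` is zero up to rounding because $\theta^*$ solves the penalized first-order condition, and `V` is the variance that the von Mises heuristic attributes to $\sqrt{n}(T(\hat F) - T(F))$ in this toy model.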
Second, it allows for an easy assessment of the relative influence of individual observations on the value of an estimate. If it is unbounded, a single outlier could cause trouble. Considering the approximation (7) over an $\epsilon$ neighborhood of the model, we see that the influence function can be used to linearize the asymptotic bias in a neighborhood of the ideal model. Therefore a bounded influence function implies a bounded approximate bias. Just as in the unpenalized case, a bounded $\psi$ is needed for an estimator to have a bounded influence function. This reflects the intuition that the sources of local instability for M-estimators and their penalized counterparts should be the same. As shown in this article, in both situations, the local robustness properties are a direct result of the form and boundedness of the derivative of the loss function. Extensive simulations illustrating the impact of deviations from the stochastic assumptions on penalized M-estimators can be found, among many others, in Sardy et al. [2001], Li et al. [2011], Wang et al. [2013], Lozano and Meinshausen [2013] and Avella-Medina and Ronchetti [2014].
6 Proofs
Proof of Lemma 1
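The estimator $T(F_\epsilon) = \theta_{F_\epsilon,\lambda}$ is obtained as a solution of
$$H(\epsilon, \theta) := E_{F_\epsilon}[\psi(Z, \theta)] + \nabla p(\theta; \lambda) = 0.$$
Note that $H$ has partial derivatives with respect to $\epsilon$ and $\theta$ at $(0, \theta_{F,\lambda})$. Therefore the claimed result follows immediately from the implicit function theorem.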
Proof of Lemma 2
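Let $b(p) = \inf_{\theta\in\Theta}\Lambda_\lambda(\theta; F, p)$ and $f(\theta, p) = b(p) - \Lambda_\lambda(\theta; F, p)$. The mapping $\Gamma : W^{2,2} \mapsto \Theta$, $\Gamma p = \{\theta \mid \theta \in \Theta, f(\theta, p) \le 0\}$ is closed by construction (Berge [1997], Example, p.111). Therefore $\Gamma p$ is compact for any $p$, which implies that $\Gamma$ is continuous (Berge [1997], Example p.109). From Berge's maximum theorem, $M(p) = \max\{-E_F[L(Z, \theta)] \mid \theta \in \Gamma p\}$ is continuous in $W^{2,2}$ and the mapping $\phi p = \{\theta \mid \theta \in \Gamma p, E_F[L(Z, \theta)] = M(p)\}$ is upper semi-continuous from $W^{2,2}$ to $\Theta$. Define $\theta_m = \{\theta \mid \theta \in \phi p_m, E_F[L(Z, \theta)] = \sup M(p_m)\}$ and $\theta^* = \{\theta \mid \theta \in \phi p, E_F[L(Z, \theta)] = \sup M(p)\}$. Then from Lemma 7 and Lemma 8 in the Appendix we have that $\lim_m \theta_m = \theta^* \in \Gamma p = \lim_m \Gamma p_m$.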
Proof of Proposition 1
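For ease of notation we will write $\psi_p = \psi(Z; T(F; p_m))$, $S_p = E_F[\dot\psi(Z; T(F; p_m))] + \nabla^2 p_m(T(F; p_m))$, $H = F - \Delta_z$ and $IF(p_m) = IF_{p_m}(z; F, T)$, with $\psi_{p'}$ and $S_{p'}$ defined analogously for $p'_m$. By Lemma 1
$$\begin{aligned} IF(p'_m) - IF(p_m) &= S_p^{-1}E_H[\psi_p] - S_{p'}^{-1}E_H[\psi_{p'}] \\ &= S_{p'}^{-1}E_H[\psi_p - \psi_{p'}] + S_p^{-1}(S_p - S_{p'})S_p^{-1}E_H[\psi_p] + o(\|S_p - S_{p'}\|). \end{aligned}$$
By Lemma 2 we have $\lim_m T(F; p'_m) = \lim_m T(F; p_m) = T(F)$. Therefore (C3) and (C4) guarantee that $E_H[\psi_p - \psi_{p'}] \to 0$ and $\|S_p - S_{p'}\| \to 0$. Hence
$$\lim_m \left( IF(p_m) - IF(p'_m) \right) = 0.$$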
Proof of Lemma 3
The existence of $IF(z^{(1)}; F, S)$ follows from Lemma 1. The estimator $T(F_\epsilon)$, with $T(F) = \theta_2^*$ at $\epsilon = 0$, is obtained as a solution of
$$H(\epsilon, \theta_2) := E_{F_\epsilon}[\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1^*)] + \nabla_{\theta_2} p^{(2)}(\theta_2, \theta_1^*; \lambda) = 0.$$
Since $H$ has partial derivatives with respect to $\epsilon$ and $\theta$ at $(0, \theta_2^*)$, the existence of $IF(z; F, T)$ follows from the implicit function theorem. The claimed expression is obtained by slightly modifying the computations provided in Zhelonkin et al. [2012] for unpenalized two-stage M-estimators.
Proof of Lemma 4
For a given first stage estimator the proof is similar to the one given for Lemma 2.
Proof of Proposition 2
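For ease of notation we will write $(\theta_1^m, \theta_2^m) = (S(F; p^{(1)}_m), T(F; p^{(2)}_m))$, $\psi_p = \psi(z^{(2)}, \theta_2^m, z^{(1)}, \theta_1^m)$, $S_p = E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2^m, Z^{(1)}, \theta_1^m)] + \nabla_{\theta_2,\theta_2} p^{(2)}_m(T(F; p^{(2)}_m))$, $H = F - \Delta_z$, $IF(p^{(1)}_m) = IF_{p^{(1)}_m}(z_1; F, S)$ and $IF(p^{(2)}_m) = IF_{p^{(2)}_m}(z; F, T)$. By Lemma 3
$$\begin{aligned} IF(p^{(2)\prime}_m) - IF(p^{(2)}_m) &= S_p^{-1}\left\{ E_H[\psi_p] + \left(\nabla_{\theta_2}E_F[\psi_p] + \nabla_{\theta_2,\theta_1}p^{(2)}\right) IF(z^{(1)}; F, S) \right\} \\ &\quad - S_{p'}^{-1}\left\{ E_H[\psi_{p'}] + \left(\nabla_{\theta_2}E_F[\psi_{p'}] + \nabla_{\theta_2,\theta_1}p^{(2)\prime}\right) IF(z^{(1)}; F, S) \right\} \\ &= T_1 + T_2\, IF(z^{(1)}; F, S). \end{aligned}$$
By the arguments given in Proposition 1, $T_1 \to 0$. In particular $\|S_p - S_{p'}\| \to 0$. Furthermore from Lemma 4 we have $\lim_m T(F; p^{(2)}_m) = \lim_m T(F; p^{(2)\prime}_m) = T(F)$. Therefore (C3') and (C4') guarantee that
$$\lim_m \left( IF(p^{(2)\prime}_m) - IF(p^{(2)}_m) \right) = 0.$$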
Proof of Proposition 3
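From Proposition 1 it suffices to show that the limiting influence function of a smooth approximation of the problem has the desired form. One possible infinitely differentiable approximation for the absolute value is
$$s_m(t) = \frac{2}{m}\log(e^{tm} + 1) - t \underset{m\to\infty}{\longrightarrow} |t|.$$
Its first two derivatives have the form
$$s'_m(t) := \frac{2e^{tm}}{e^{tm} + 1} - 1 \underset{m\to\infty}{\longrightarrow} \mathrm{sign}(t) = \begin{cases} -1 & \text{if } t < 0 \\ +1 & \text{if } t > 0 \\ 0 & \text{otherwise} \end{cases}$$
and
$$s''_m(t) = \frac{2me^{tm}}{(e^{tm} + 1)^2} \underset{m\to\infty}{\longrightarrow} \begin{cases} 0 & \text{if } t \neq 0 \\ +\infty & \text{otherwise.} \end{cases}$$
Defining $p_m(\theta) = \sum_{j=1}^{d} p_{\lambda,j}(s_m(\theta_j))$, from Lemma 1 and the notation of Proposition 1 we have
$$IF_{p_m}(z; F, T) = -S_{p_m}^{-1}\left(\psi(z; T(F; p_m)) + \nabla p_m(T(F; p_m))\right).$$
Remember that for a partitioned matrix $A$, if all the necessary inverses exist, the elements of $A^{-1}$ are
$$A^{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}, \quad A^{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}, \quad A^{12} = -A^{11}A_{12}A_{22}^{-1}, \quad A^{21} = -A_{22}^{-1}A_{21}A^{11}.$$
Hence, bearing in mind this formula and the previous limits, it is easily seen that $IF(z; F, T) = \lim_m IF_{p_m}(z; F, T)$ has the claimed form.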
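The behavior of the smooth approximation $s_m$ used in this proof can be inspected numerically. The sketch below is an illustration only: the grid, the value of `m` and the function names are arbitrary choices, and the derivatives are written through the identities $2e^{tm}/(e^{tm}+1) - 1 = \tanh(tm/2)$ and $2me^{tm}/(e^{tm}+1)^2 = (m/2)(1 - \tanh^2(tm/2))$ to avoid overflow for large $tm$.

```python
import numpy as np

def s_m(t, m):
    """(2/m) * log(exp(t*m) + 1) - t, computed with logaddexp for stability."""
    return (2.0 / m) * np.logaddexp(0.0, t * m) - t

def s_m_prime(t, m):
    """First derivative 2 e^{tm} / (e^{tm} + 1) - 1, equal to tanh(t*m/2)."""
    return np.tanh(0.5 * t * m)

def s_m_second(t, m):
    """Second derivative 2 m e^{tm} / (e^{tm} + 1)^2, equal to (m/2)(1 - tanh^2(t*m/2))."""
    return 0.5 * m * (1.0 - np.tanh(0.5 * t * m) ** 2)

t = np.linspace(-2.0, 2.0, 401)
m = 1000.0
# s_m(t) - |t| = (2/m) * log(1 + exp(-|t| m)), so the gap is at most 2 log(2) / m.
max_abs_error = np.max(np.abs(s_m(t, m) - np.abs(t)))
```

The second derivative at the origin equals $m/2$, which is the numerical counterpart of the $+\infty$ limit at $t = 0$, while away from the origin it decays to 0.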
Proof of Proposition 4
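The difficulty of deriving the influence function for group lasso type of penalties comes from the fact that $\sqrt{|t|}$ is not differentiable. As for the lasso penalty we will approximate the group lasso penalty by a differentiable one guaranteeing the conditions of Proposition 1. We approximate $\sqrt{t}$ by $\sqrt{s_m(t)}$, $t \ge 0$, where $s_m(t)$ is the same function used in Proposition 3. This yields
$$\left(\sqrt{s_m(t)}\right)' = \frac{e^{tm} - 1}{(e^{tm} + 1)\, s_m^{1/2}(t)} \underset{m\to\infty}{\longrightarrow} \begin{cases} t^{-1/2} & \text{if } t > 0 \\ \infty & \text{for } t \downarrow 0 \end{cases}$$
and
$$\left(\sqrt{s_m(t)}\right)'' = \frac{2me^{tm}}{(e^{tm} + 1)^2 s_m^{1/2}(t)} - \frac{(e^{tm} - 1)^2}{2(e^{tm} + 1)^2 s_m^{3/2}(t)} \underset{m\to\infty}{\longrightarrow} \begin{cases} -t^{-3/2} & \text{if } t > 0 \\ -\infty & \text{for } t \downarrow 0. \end{cases}$$
Note that after some simplifications
$$\left[\left(\sqrt{s_m(t)}\right)''\right]^{-1}\left(\sqrt{s_m(t)}\right)' = \frac{s_m(t)(e^{tm} + 1)(e^{tm} - 1)}{4me^{tm}s_m(t) - (e^{tm} - 1)^2} \underset{m\to\infty}{\longrightarrow} -t, \quad t \ge 0. \tag{8}$$
For the penalty $p_m(\theta) = \sum_{g=1}^{G} p_{\lambda,g}\left(\sqrt{s_m(\|\theta_{(g)}\|_2^2)}\right)$, using the formula for the inverse of partitioned matrices, we have
$$S_{p_m}^{-1} = \begin{bmatrix} E_F[\dot\psi_{11}(Z, T(F))] + \nabla^2 p_{1m}(T_1(F)) & E_F[\dot\psi_{12}(Z, T(F))] \\ E_F[\dot\psi_{21}(Z, T(F))] & E_F[\dot\psi_{22}(Z, T(F))] + \nabla^2 p_{2m}(T_2(F)) \end{bmatrix}^{-1} \underset{m\to\infty}{\longrightarrow} \begin{bmatrix} (M_{11} + P_\lambda)^{-1} & 0 \\ 0 & 0 \end{bmatrix} = S^{-1}. \tag{9}$$
From Lemma 1 we have
$$IF_{p_m}(z; F, T) = -\left(E_F[\dot\psi(Z; T(F; p_m))] + \nabla^2 p_m(T(F; p_m))\right)^{-1}\left(\psi(z; T(F; p_m)) + \nabla p_{\lambda,m}(T(F; p_m))\right). \tag{10}$$
Finally, using (8)-(9) when taking the limit of (10) as $m \to \infty$ yields the claimed result.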
Proof of Proposition 5
From Lemma 3 and Proposition 2 we know that we only need to take a sequence of approximating influence functions. First note that $\nabla_{\theta^{(0)}} E_F[\psi(Z,T(F))] = 0$ and $\nabla_{\theta_j,\theta_j} w_j|\theta_j| = 0$ for $\theta_j \neq 0$. Then an argument similar to the one given in Proposition 3 completes the proof.
Proof of Lemma 5
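In the orthogonal linear model the lasso and scad estimators have the explicit solutions
$$\hat\theta_j^{lasso} = \operatorname{sign}(\theta_j^{(0)})\big(|\theta_j^{(0)}| - \lambda\big)_+$$
and
$$\hat\theta_j^{scad} = \begin{cases} \operatorname{sign}(\theta_j^{(0)})\big(|\theta_j^{(0)}| - \lambda\big)_+ & \text{if } |\theta_j^{(0)}| < 2\lambda,\\ \big\{(a-1)\theta_j^{(0)} - \operatorname{sign}(\theta_j^{(0)})a\lambda\big\}/(a-2) & \text{if } 2\lambda \leq |\theta_j^{(0)}| \leq a\lambda,\\ \theta_j^{(0)} & \text{otherwise,} \end{cases}$$
where $(x)_+ = \max\{0,x\}$ and $a>2$. The functional form of the least squares estimator is very simple and the $j$th coefficient at $F_\epsilon$ is
$$\theta_j^{(0)}(\epsilon) = f(\epsilon) = \int X_jY\,dF_\epsilon = A + \epsilon(B-A),$$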
where $A = E_F[X_jY]$ and $B = x_jy$. Recall that a distribution is a continuous linear functional on the space $\mathcal{D}$ of test functions, i.e. the space of infinitely differentiable functions with compact support. The Dirac delta function, for instance, is the linear functional defined by $\langle\delta,\phi\rangle = \phi(0)$, $\phi\in\mathcal{D}$. The distributional derivative of a function $g:\mathbb{R}\to\mathbb{R}$ is defined as
$$\langle g',\phi\rangle = -\langle g,\phi'\rangle = -\int_\Omega g(x)\phi'(x)\,dx,$$
where $\Omega$ is the support of $\phi$ and $\phi'$ denotes the derivative of $\phi$. More details on distributions can be found in the Appendix.
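Let us first compute the influence function of the lasso. The distributional derivative of $\hat\theta_j^{lasso}$ is by definition
$$\langle\nabla\hat\theta_j^{lasso},\phi\rangle = -\int \operatorname{sign}(f(\epsilon))\big(|f(\epsilon)|-\lambda\big)_+\,\phi'(\epsilon)\,d\epsilon. \tag{11}$$
In order to obtain an expression for the influence function of the lasso estimator, it suffices to integrate (11) over a small interval $[0,\varepsilon]$ because we are only interested in the local behavior of the distributional derivative. Clearly for $f(0)=0$, $\lambda>0$ and $\varepsilon$ small enough
$$\langle\nabla\hat\theta_j^{lasso},\phi\rangle = 0. \tag{12}$$
When $f(0)>0$ and $\varepsilon>0$ is the smallest value such that $f(\varepsilon)=0$, we have
$$\begin{aligned} \langle\nabla\hat\theta_j^{lasso},\phi\rangle &= -\int_{[0,\varepsilon]} (f(\epsilon)-\lambda)\phi'(\epsilon)\,d\epsilon = -\phi(\epsilon)(f(\epsilon)-\lambda)\Big|_0^{\varepsilon} + \int_{[0,\varepsilon]} f'(\epsilon)\phi(\epsilon)\,d\epsilon \\ &= -\phi(\varepsilon)(f(\varepsilon)-\lambda) + \phi(0)(f(0)-\lambda) - (A-B)\int_{[0,\varepsilon]}\phi(\epsilon)\,d\epsilon. \end{aligned}$$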
Hence for $\epsilon_0 \in [0,\varepsilon]$ we obtain
∇θˆlassoj (0) =−(A−B)−φ(0)(f(0)−λ) δ(0)−δ(0)
. (13) An analogous argument yields for f0(0)<0 and0 small enough
∇θˆlassoj (0) =−(A−B) +φ(0)(f(0) +λ) δ(0)−δ(0)
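. (14) Noticing that $A = -\lambda\operatorname{sign}(\hat\theta_j^{lasso})$ when $\hat\theta_j^{lasso}\neq 0$, we conclude from (12)-(14) that
$$\nabla\hat\theta_j^{lasso}(0) = \begin{cases} 0 & \text{if } |\theta_j^{(0)}| \leq \lambda,\\ \psi(z,\hat\theta_j^{lasso}) + \lambda\operatorname{sign}(\hat\theta_j^{lasso}) & \text{otherwise.} \end{cases}$$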
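The two regimes of this derivative can be checked numerically. The sketch below is an illustration that is not part of the original proof: the helper names `soft_threshold` and `lasso_path_derivative` and the numerical values of $A$, $B$ and $\lambda$ are ours, and NumPy is assumed. It follows the path $f(\epsilon) = A + \epsilon(B-A)$ of the least squares coefficient, applies the soft-thresholding rule, and approximates the derivative at $\epsilon = 0$ by a forward difference. It only verifies that the slope equals $B-A$ when $|A|>\lambda$ and $0$ when $|A|<\lambda$; the identification of $B-A$ with the expression in terms of $\psi$ is the one made in the text and is not re-derived here.

```python
import numpy as np

def soft_threshold(u, lam):
    """Lasso solution in the orthogonal design: sign(u) (|u| - lam)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_path_derivative(A, B, lam, h=1e-6):
    """Forward difference at eps = 0 of eps -> soft_threshold(A + eps (B - A), lam)."""
    f = lambda eps: A + eps * (B - A)   # least squares coefficient under contamination
    return float((soft_threshold(f(h), lam) - soft_threshold(f(0.0), lam)) / h)

lam = 0.5
print(lasso_path_derivative(2.0, 5.0, lam))    # |A| > lam: slope B - A
print(lasso_path_derivative(-2.0, 5.0, lam))   # A < -lam: slope B - A as well
print(lasso_path_derivative(0.2, 5.0, lam))    # |A| < lam: the estimate stays at zero
```

Because the thresholding rule is piecewise linear, the forward difference is exact up to rounding as long as $f(h)$ stays in the same regime as $f(0)$.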
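Let us now turn to the derivation of the influence function of the scad estimator. It is clear from the form of $\hat\theta_j^{scad}$ that $\nabla\hat\theta_j^{scad}(0)$ has the same form as $\nabla\hat\theta_j^{lasso}(0)$ when $|\theta_j^{(0)}| < 2\lambda$. It is also obvious that $\nabla\hat\theta_j^{scad}(0) = \psi(z,\hat\theta_j^{scad})$ when $|\theta_j^{(0)}| > a\lambda$. We therefore only have to consider the case $2\lambda \leq |\theta_j^{(0)}| \leq a\lambda$. Suppose that $f(0)>0$ and denote by $\varepsilon$ the smallest positive real number such that $f(\varepsilon)=0$. Then
$$\begin{aligned} \langle\nabla\hat\theta_j^{scad},\phi\rangle &= -\int_{[0,\varepsilon]} \frac{(a-1)f(\epsilon) - a\lambda}{a-2}\,\phi'(\epsilon)\,d\epsilon \\ &= -\frac{a-1}{a-2}f(\epsilon)\phi(\epsilon)\Big|_0^{\varepsilon} + \frac{a-1}{a-2}\int_{[0,\varepsilon]} f'(\epsilon)\phi(\epsilon)\,d\epsilon - \frac{a\lambda}{a-2}\phi(\epsilon)\Big|_0^{\varepsilon} \\ &= \frac{a-1}{a-2}(A-B)\int_{[0,\varepsilon]}\phi(\epsilon)\,d\epsilon + \frac{a-1}{a-2}\big\{f(0)\phi(0) - f(\varepsilon)\phi(\varepsilon)\big\} + \frac{a\lambda}{a-2}\big\{\phi(0)-\phi(\varepsilon)\big\}. \end{aligned}$$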
Hence for sufficiently small $\epsilon_0$ we have
∇θˆjscad(0) =a−1
a−2(A−B) + a−1
a−2f(0)φ(0) δ(0)−δ(0) + aλ
a−2φ(0) δ(0)−δ(0)
and thus
$$\nabla\hat\theta_j^{scad}(0) = \frac{a-1}{a-2}\big\{\psi(z,\hat\theta_j^{scad}) + p_\lambda'(\hat\theta_j^{scad})\big\}.$$
The same expression is obtained when $f(0)<0$. Recall that for $t>0$ the scad penalty is defined through its derivative
$$p_\lambda'(t) = \lambda\Big\{ I(t\leq\lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t>\lambda) \Big\}.$$
Hence for $2\lambda \leq t \leq a\lambda$ we have $1 + p_\lambda''(t) = 1 - 1/(a-1) = (a-2)/(a-1)$.
This completes the proof.
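The factor $(a-1)/(a-2)$ in the middle regime can also be checked numerically. The following sketch is an illustration under our own assumptions: the helper names `scad_threshold` and `scad_path_derivative` are hypothetical, the default $a = 3.7$ is the value commonly used following Fan and Li [2001], the numerical values of $A$, $B$ and $\lambda$ are arbitrary, and NumPy is assumed. It implements the explicit scad solution of the orthogonal design, checks that the three pieces join continuously at $2\lambda$ and $a\lambda$, and approximates the derivative along $f(\epsilon) = A + \epsilon(B-A)$ at $\epsilon = 0$ by a forward difference.

```python
import numpy as np

def scad_threshold(u, lam, a=3.7):
    """Explicit scad solution in the orthogonal design (three regimes in |u|)."""
    u = np.asarray(u, dtype=float)
    absu = np.abs(u)
    soft = np.sign(u) * np.maximum(absu - lam, 0.0)            # |u| < 2 lam
    middle = ((a - 1) * u - np.sign(u) * a * lam) / (a - 2)    # 2 lam <= |u| <= a lam
    return np.where(absu < 2 * lam, soft, np.where(absu <= a * lam, middle, u))

def scad_path_derivative(A, B, lam, a=3.7, h=1e-6):
    """Forward difference at eps = 0 of eps -> scad_threshold(A + eps (B - A))."""
    f = lambda eps: A + eps * (B - A)
    return float((scad_threshold(f(h), lam, a) - scad_threshold(f(0.0), lam, a)) / h)

lam, a = 1.0, 3.7
# Continuity of the rule at the two breakpoints 2 lam and a lam
left_at_2lam = float(np.sign(2 * lam) * max(2 * lam - lam, 0.0))
print(left_at_2lam, float(scad_threshold(2 * lam, lam, a)))
print(a * lam, float(scad_threshold(a * lam, lam, a)))

# Slopes in the three regimes for B - A = 2
print(scad_path_derivative(1.5, 3.5, lam, a))   # soft-thresholding regime: B - A
print(scad_path_derivative(3.0, 5.0, lam, a))   # middle regime: (a-1)/(a-2) (B - A)
print(scad_path_derivative(4.0, 6.0, lam, a))   # no shrinkage: B - A
```

The middle slope is $B-A$ divided by $1 + p_\lambda''(t) = (a-2)/(a-1)$, which is the quantity computed at the end of the proof.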
Proof of Lemma 6
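The idea of the proof is similar to the one given in Lemma 2, but using the variant of Berge's maximum theorem given in the Appendix.
Let $\epsilon\in(-\varepsilon,\varepsilon)$ for $\varepsilon>0$, $b(\epsilon) = \inf_{\theta\in\Theta}\Lambda_\lambda(\theta;F_\epsilon,p)$ and $f(\theta,\epsilon) = b(\epsilon) - \Lambda_\lambda(\theta;F_\epsilon,p)$. The mapping $\Gamma:(-\varepsilon,\varepsilon)\to\Theta$, $\Gamma_\epsilon = \{\theta \mid \theta\in\Theta,\ f(\theta,\epsilon)\leq 0\}$, is closed by construction (Berge [1997], Example, p. 111). Therefore $\Gamma_\epsilon$ is compact for any $\epsilon$, which implies that $\Gamma$ is continuous (Berge [1997], Example, p. 109). From Proposition 7 in the Appendix, $M(\epsilon) = \max\{-E_{F_\epsilon}[L(Z,\theta)] \mid \theta\in\Gamma_\epsilon\}$ is continuous in $(-\varepsilon,\varepsilon)$ and the mapping $\phi_\epsilon = \{\theta \mid \theta\in\Gamma_\epsilon,\ E_{F_\epsilon}[L(Z,\theta)] = M(\epsilon)\}$ is upper semi-continuous from $(-\varepsilon,\varepsilon)$ to $\Theta$. Let $\{\epsilon_m\}_{m\geq 1}$ be a sequence converging to $\epsilon\in(-\varepsilon,\varepsilon)$ as $m\to\infty$, and define $\theta_m = \{\theta \mid \theta\in\phi_{\epsilon_m},\ E_{F_{\epsilon_m}}[L(Z,\theta)] = \sup M(\epsilon_m)\}$ and $\theta^* = \{\theta \mid \theta\in\phi_\epsilon,\ E_{F_\epsilon}[L(Z,\theta)] = \sup M(\epsilon)\}$. Then from Lemma 7 and Lemma 8 in the Appendix we have that $\lim_m\theta_m = \theta^* \in \Gamma_\epsilon = \lim_m\Gamma_{\epsilon_m}$.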
Proof of Proposition 6
By Lemma 6, $T(F_\epsilon)$ is a continuous function of $\epsilon$ in a neighborhood of 0. Therefore $T(F_\epsilon)$ is integrable and the result follows from Lemma 10 in the Appendix.
Appendix
For completeness, we recall some definitions and results from Berge [1997]
and distribution theory (Schwartz [1959]).
Auxiliary results for Berge’s maximum theorem
This material is needed for the proofs of Lemma 2, Lemma 4 and Lemma 6.
Let $\Gamma$ be a mapping of the topological space $X$ to the topological space $Y$. We say that $\Gamma$ is lower semi-continuous at $x_0\in X$ if for each open set $O$ meeting $\Gamma x_0$ there is a neighborhood $N(x_0)$ such that $x\in N(x_0) \Rightarrow \Gamma x\cap O \neq \emptyset$. We say that $\Gamma$ is upper semi-continuous at $x_0\in X$ if for each open set $O$ containing $\Gamma x_0$ there is a neighborhood $N(x_0)$ such that $x\in N(x_0) \Rightarrow \Gamma x\subset O$. If $\Gamma$ is both lower and upper semi-continuous at $x_0$ we say that $\Gamma$ is continuous at $x_0$.
If $\Gamma$ is lower semi-continuous at each point of $X$ it is called lower semi-continuous in $X$. We say that $\Gamma$ is upper semi-continuous in $X$ if it is upper semi-continuous at each point of $X$ and if also $\Gamma x$ is compact for each $x\in X$. If $\Gamma$ is both lower and upper semi-continuous in $X$, then it is called continuous in $X$.
Berge's maximum theorem : (Berge [1997], p. 116) If $h$ is a continuous function in $Y$ and $\Gamma$ is a continuous mapping of $X$ to $Y$ such that, for each $x$, $\Gamma x\neq\emptyset$, then the function $M$ defined by $M(x) = \max\{h(y) \mid y\in\Gamma x\}$ is continuous in $X$ and the mapping $\phi$ defined by $\phi x = \{y \mid y\in\Gamma x,\ h(y) = M(x)\}$ is an upper semi-continuous mapping of $X$ into $Y$.
Lemma 7 : (Berge [1997], Theorem 4, p. 111) If $\Gamma$ is a closed mapping, then
$$x_n\to x_0,\quad y_n\to y_0,\quad \forall n: y_n\in\Gamma x_n \;\Longrightarrow\; y_0\in\Gamma x_0.$$
Lemma 8 : (Berge [1997], Theorem 6, p. 112) Every upper semi-continuous mapping is closed.
Lemma 9 : (Berge [1997], Theorem 7, p. 112) If $\Gamma_1$ is a closed mapping of $X$ into $Y$ and $\Gamma_2$ is an upper semi-continuous mapping of $X$ into $Y$, then the mapping $\Gamma = \Gamma_1\cap\Gamma_2$ is upper semi-continuous.
We provide a variant of Berge's maximum theorem. Its proof is similar to that of the original theorem and is given for completeness.
Proposition 7 : If $h$ is a continuous function in $X\times Y$ and $\Gamma$ is a continuous mapping of $X$ to $Y$ such that, for each $x$, $\Gamma x\neq\emptyset$, then the function $M$ defined by $M(x) = \max\{h(x,y) \mid y\in\Gamma x\}$ is continuous in $X$ and the mapping $\phi$ defined by $\phi x = \{y \mid y\in\Gamma x,\ h(x,y) = M(x)\}$ is an upper semi-continuous mapping of $X$ into $Y$.
Proof : The function $h$ is continuous in $X\times Y$ and so $M$ is a continuous function. Furthermore the mapping $\Delta$ given by
$$\Delta x = \{y \mid M(x) - h(x,y) \leq 0\}$$
is closed (Berge [1997], Example, p. 111). Hence by Lemma 9, $\phi = \Gamma\cap\Delta$ is upper semi-continuous.
Auxiliary results from distribution theory
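Define $C_c^\infty(\Omega) = \{f:\Omega\to\mathbb{R} \mid f\in C^\infty(\Omega) \text{ with compact support}\}$, where $\Omega$ is an open set in $\mathbb{R}^d$.
Definition : A sequence $\{\phi_n\}_{n\geq 1}$ of functions $\phi_n\in C_c^\infty(\Omega)$ converges to $\phi\in C_c^\infty(\Omega)$ in the sense of test functions if:
(a) there exists $\Omega'\Subset\Omega$, i.e. $\Omega'$ compactly contained in $\Omega$, such that $\operatorname{supp}\phi_n\subset\Omega'$ for every $n\in\mathbb{N}$;
(b) $\partial^\alpha\phi_n\to\partial^\alpha\phi$ as $n\to\infty$ uniformly on $\Omega$ for every multi-index $\alpha\in\mathbb{N}^d$.
The topological vector space $\mathcal{D}(\Omega)$ consists of $C_c^\infty(\Omega)$ equipped with the topology that corresponds to convergence in the sense of test functions. A linear functional on $\mathcal{D}(\Omega)$ is a linear map $T:\mathcal{D}(\Omega)\to\mathbb{R}$. The value of $T$ acting on a test function $\phi$ is denoted by $\langle T,\phi\rangle$. Therefore if $T$ is linear we have
$$\langle T,\lambda\phi+\mu\psi\rangle = \lambda\langle T,\phi\rangle + \mu\langle T,\psi\rangle, \quad \text{for all } \lambda,\mu\in\mathbb{R} \text{ and } \phi,\psi\in\mathcal{D}(\Omega).$$
A functional $T$ is continuous if $\phi_n\to\phi$ in the sense of test functions implies that $\langle T,\phi_n\rangle\to\langle T,\phi\rangle$ in $\mathbb{R}$.
Definition : A distribution on $\Omega$ is a continuous linear functional $T:\mathcal{D}(\Omega)\to\mathbb{R}$.
A sequence of distributions $\{T_n\}$ converges to $T$ if $\langle T_n,\phi\rangle\to\langle T,\phi\rangle$ for every $\phi\in\mathcal{D}(\Omega)$. The topological vector space $\mathcal{D}'(\Omega)$ consists of the distributions on $\Omega$ equipped with the topology corresponding to this notion of convergence. An important example of a distribution is the Dirac delta function supported at $x\in\Omega$, which is the distribution $\delta_x:\mathcal{D}(\Omega)\to\mathbb{R}$ defined by $\langle\delta_x,\phi\rangle = \phi(x)$.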
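Definition : For $1\leq i\leq d$, the $i$th partial derivative of a distribution $T\in\mathcal{D}'(\Omega)$ is the distribution $\partial_iT\in\mathcal{D}'(\Omega)$ defined by $\langle\partial_iT,\phi\rangle = -\langle T,\partial_i\phi\rangle$ for all $\phi\in\mathcal{D}(\Omega)$. For $\alpha\in\mathbb{N}^d$, the derivative $\partial^\alpha T\in\mathcal{D}'(\Omega)$ of order $|\alpha|$ is defined by $\langle\partial^\alpha T,\phi\rangle = (-1)^{|\alpha|}\langle T,\partial^\alpha\phi\rangle$ for all $\phi\in\mathcal{D}(\Omega)$.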
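A standard example of this definition is the step function $H = I(x>0)$, whose distributional derivative is the Dirac delta at 0, since $-\int H\phi'\,dx = \phi(0)$. The sketch below checks this identity numerically. It is our own illustration: the test function `bump`, its centre `c = 0.3`, the grid and the Riemann-sum quadrature are assumptions made for the example, and NumPy is assumed.

```python
import numpy as np

def bump(x, c=0.3):
    """Test function exp(-1 / (1 - (x - c)^2)) on |x - c| < 1, zero elsewhere."""
    u = np.atleast_1d(np.asarray(x, dtype=float)) - c
    out = np.zeros_like(u)
    inside = np.abs(u) < 1
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def bump_prime(x, c=0.3):
    """Derivative of bump: bump(x) * (-2 (x - c)) / (1 - (x - c)^2)^2 inside the support."""
    u = np.atleast_1d(np.asarray(x, dtype=float)) - c
    out = np.zeros_like(u)
    inside = np.abs(u) < 1
    ui = u[inside]
    out[inside] = np.exp(-1.0 / (1.0 - ui ** 2)) * (-2.0 * ui) / (1.0 - ui ** 2) ** 2
    return out

x = np.linspace(-2.0, 2.0, 400001)      # grid covering the support [-0.7, 1.3]
dx = x[1] - x[0]
heaviside = (x > 0).astype(float)

lhs = -np.sum(heaviside * bump_prime(x)) * dx   # <H', phi> = -<H, phi'>
rhs = bump(0.0)[0]                              # <delta_0, phi> = phi(0)
print(lhs, rhs)
```

The two numbers agree up to the quadrature error, which illustrates that a jump in a function appears as a point mass in its distributional derivative.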
We recall that for $1\leq p<\infty$, the space $L^p(\Omega)$ consists of all $p$-integrable functions, i.e. the Lebesgue measurable functions $f:\Omega\to\mathbb{R}$ such that $\int_\Omega|f|^p\,dx<\infty$. Equipped with the norm $\|f\|_p = (\int_\Omega|f|^p\,dx)^{1/p}$, they constitute a Banach space. Note that any (locally) integrable function $f$ defines a regular distribution $T_f\in\mathcal{D}'(\Omega)$ by $\langle T_f,\phi\rangle = \int_\Omega f\phi\,dx$. Typically the function $f$ and the distribution $T_f$ are regarded as equivalent.
Lemma 10 : Let $f\in L^1(\Omega)$ and $\{f_n\}$ be a sequence of functions in $C^\infty(\Omega)$ such that $f_n\to f$ and $\partial^\alpha f_n\to g$ in $L^1(\Omega)$. Then $f$ has distributional derivative given by $g = \partial^\alpha f\in L^1(\Omega)$.
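Proof : Note that since $f_n\to f$ in $L^1(\Omega)$ and $\phi\in C_c^\infty(\Omega)$, we have
$$\int_\Omega f_n\phi\,dx \xrightarrow[n\to\infty]{} \int_\Omega f\phi\,dx,$$
because for $K = \operatorname{supp}\phi$
$$\Big|\int_\Omega f_n\phi\,dx - \int_\Omega f\phi\,dx\Big| = \Big|\int_K (f_n-f)\phi\,dx\Big| \leq \sup_K|\phi| \int_K|f_n-f|\,dx \to 0.$$
Hence, for every $\phi\in C_c^\infty(\Omega)$, the convergence of $f_n$ and $\partial^\alpha f_n$ implies that
$$\int_\Omega f\,\partial^\alpha\phi\,dx = \lim_{n\to\infty}\int_\Omega f_n\,\partial^\alpha\phi\,dx = (-1)^{|\alpha|}\lim_{n\to\infty}\int_\Omega \partial^\alpha f_n\,\phi\,dx = (-1)^{|\alpha|}\int_\Omega g\phi\,dx.$$
Therefore the distributional derivative of $f$ is $\partial^\alpha f = g$.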
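As an illustration of how Lemma 10 can be used, take $\Omega = (-1,1)$, $f(x) = |x|$ and the smooth functions $f_n(x) = (2/n)\log(e^{nx}+1) - x$, with $f_n'(x) = \tanh(nx/2)$. The sketch below is our own example and not part of the paper: the grid and the Riemann-sum approximation of the $L^1$ norm are assumptions, and NumPy is assumed. It checks numerically that $f_n\to|x|$ and $f_n'\to\operatorname{sign}(x)$ in $L^1(\Omega)$, so that the lemma identifies $\operatorname{sign}(x)$ as the distributional derivative of $|x|$.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 200001)   # grid on Omega = (-1, 1)
dx = x[1] - x[0]

def l1_norm(values):
    """Riemann-sum approximation of the L1 norm on the grid."""
    return np.sum(np.abs(values)) * dx

errors = []
for n in (1, 10, 100, 1000):
    f_n = 2.0 / n * np.logaddexp(0.0, n * x) - x   # smooth approximation of |x|
    df_n = np.tanh(0.5 * n * x)                    # its derivative
    errors.append((n, l1_norm(f_n - np.abs(x)), l1_norm(df_n - np.sign(x))))

for n, err_f, err_df in errors:
    print(n, err_f, err_df)
```

Both columns of errors decrease towards zero as $n$ grows, the second one more slowly because the derivatives have to resolve the jump of the sign function at the origin.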
Acknowledgements
The author is grateful to Elvezio Ronchetti for helpful discussions and suggestions, including a key contribution to an early development of the idea of limiting influence function.
References
A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248, 2013.
M. Avella-Medina and E. Ronchetti. Robust and consistent variable selection for generalized linear and additive models. Manuscript, 2014.
T. Bednarski. Fréchet differentiability of statistical functionals and implications to robust statistics. New Directions in Statistical Data Analysis and Robustness, pages 25–34, 1993.
C. Berge. Topological Spaces: including a treatment of multi-valued functions, vector spaces, and convexity. Dover Publications, 1997.
L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011.
B. Clarke. Nonsmooth analysis and Fréchet differentiability of M-functionals. Probability Theory and Related Fields, 73(2):197–209, 1986.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.