Influence functions for penalized M-estimators



AVELLA MEDINA, Marco Andrés

AVELLA MEDINA, Marco Andrés. Influence functions for penalized M-estimators. 2014

Available at:

http://archive-ouverte.unige.ch/unige:35319

Disclaimer: layout of this document may differ from the published version.


Marco Avella-Medina

Research Center for Statistics University of Geneva, Switzerland

March 2014

Abstract

We study the local robustness properties of general penalized M-estimators via the influence function. More precisely, we propose a framework that allows us to define rigorously the influence function as the limiting influence function of a sequence of approximating estimators. Our approach can deal with nondifferentiable penalized M-estimators and a diverging number of parameters. We show that it can be used to characterize the robustness properties of a wide range of sparse estimators and we derive its form for general penalized M-estimators including lasso and adaptive lasso type estimators. We prove that our influence function is equivalent to a derivative in the sense of distribution theory.

1 Introduction

Sparse models have become very popular in recent years. Since their introduction in the linear model (Breiman [1995], Tibshirani [1996]), many extensions and algorithms have been proposed. See Tibshirani [2011] for a retrospective. Asymptotic properties of lasso-type estimators have been studied in the fixed dimensional parameter case (Knight and Fu [2000], Fan and Li [2001], Zou [2006]), as well as in the high dimensional set up where the number of parameters is allowed to grow at an even faster rate than the sample size (Bühlmann and Van De Geer [2011]).

Given the increasing importance that sparsity inducing penalties play in modern statistics, the need for a clear understanding of the robustness properties of these types of procedures is evident. Robust statistics develops a theoretical framework that allows us to take into account that the models used for fitting the data are only idealized approximations of reality. It provides methods that still give reliable results when slight deviations from the stochastic assumptions on the model occur. Book-length expositions can be found in Huber [1981] and its 2nd edition by Huber and Ronchetti [2009], Hampel et al. [1986] and Maronna et al. [2006]. One of the main lines of research in the robustness literature was opened by Hampel, who considered local robustness, i.e. the impact of moderate distributional deviations from ideal models on a statistical procedure. In this setting the quantities of interest are viewed as functionals of the underlying distribution. Typically their linear approximation is studied to assess the behavior of the estimators in a neighborhood of the model. In this approach the influence function plays a crucial role in describing the local stability of the functional analyzed.

Although some authors have suggested sparse estimators that limit the impact of contamination in the data (e.g. Sardy et al. [2001], Wang et al. [2007a], Li et al. [2011] and Lozano and Meinshausen [2013] among many others), the robustness properties of these procedures are not well understood. All these procedures rely on the intuition that a loss function that defines robust estimates in the well understood unpenalized fixed dimensional M-estimator should also define a robust estimator when it is penalized by a deterministic function. To the best of our knowledge, only Alfons et al. [2013] and Wang et al. [2013] established formal robustness properties for their estimators. Both papers computed finite sample breakdown points. The latter also gives influence functions, although in a limited set up and without developing a rigorous framework. Further details can be found in Section 4.

The goal of this article is to characterize the robustness properties of a wide class of penalized M-estimators, which covers most of the existing proposals, by deriving their influence function. This requires developing a new framework. Indeed, the typical tools used to derive the influence function of M-estimators suffer from two main problems when considering penalized M-estimators. They are not well suited for a scenario where the number of parameters is allowed to increase with the sample size, and they cannot handle nondifferentiable penalty functions, which are necessary for achieving sparsity (Fan and Li [2001]).

Our work provides a number of contributions to the existing literature.

First, we introduce an influence function defined through a sequence of approximating estimators and show that it is uniquely defined for penalized M-estimators and two-stage penalized M-estimators. The former class covers lasso and group lasso type estimators. The latter includes adaptive lasso type estimators. We compute the influence function of all these important examples. The result is valid when the number of parameters diverges with the sample size; in particular the result holds for high dimensional problems.

Second, we show that the two main features of the influence functions for M-estimators are also valid for the influence function of penalized M-estimators, i.e. (a) it allows one to assess the relative influence of individual observations toward the value of an estimate; (b) it allows an immediate and simple, informal assessment of the asymptotic properties of an estimate (Huber [1981], p. 14-15). Third, we show that our limiting influence function can be viewed as a distributional derivative in the sense of Schwartz [1959]. This opens the door for further research exploiting the tools of distribution theory, which to the best of our knowledge have essentially not been used in the statistical literature previously. Finally, a key step in our theoretical argument is the innovative use of Berge’s maximum theorem. This is a powerful tool that could have more applications in statistics.

The rest of the article is organized as follows. Section 2 introduces the general framework and provides the main results regarding our influence function. In Section 3 we compute the influence function for a wide class of penalized M-estimators. Section 4 introduces distributional derivatives for the computation of influence functions. Section 5 concludes with some remarks. All the proofs are given at the end of the article and auxiliary results are provided in the Appendix.


2 Penalized M-estimators and Influence Functions

We consider parametric estimators obtained as the minimizers of regularized problems of the form

$$\Lambda_\lambda(\theta; F, p) = E_F[L(Z, \theta)] + p(\theta; \lambda) \tag{1}$$

with respect to the parameter θ ∈ Θ ⊂ R^d, where L is a continuous loss function, p(·;λ) a continuous penalty function with regularization parameter λ, Z ∈ R^k and F the distribution function of Z. We denote the resulting penalized M-estimators by T(F) = θ_F,λ = θ^∗. Important examples are given in Section 3.

For M-estimators the standard argument for showing the existence and deriving the form of the influence function is to use an appropriate implicit function theorem. For (1), this typically requires the existence of two derivatives with respect to the parameter. However, it is well known that a penalty function has to be singular at the origin in order to achieve sparsity (Fan and Li [2001]). Therefore new tools that can deal with nondifferentiable penalty functions are required to derive influence functions of many modern penalized estimators. We propose to define the influence function as the limit of the influence functions of a sequence of differentiable penalized M-estimators that converge to the penalized M-estimator of interest.

2.1 Influence Function

We will require the following set of conditions for the derivation of our theoretical results:

(C1) The set Θ⊂R^d is compact.

(C2) E_F[L(Z, θ)] has two derivatives with respect to θ denoted by E_F[ψ(Z, θ)] and E_F[ψ̇(Z, θ)] respectively.

(C3) The functions ψ(z, θ) and E_F[ψ̇(Z, θ)] are continuous at θ^∗.


(C4) Let {p_m}_{m≥1} and {p′_m}_{m≥1} be two sequences in C^∞(Θ) = {f : Θ → R | f continuous and infinitely differentiable} converging to p in the Sobolev space W^{2,2}(Θ) when m → ∞.

For a function h : R^d → R, we denote by ∇h and ∇²h its first two derivatives and by ∇_x h its derivative with respect to the variable x. Recall that the influence function (Hampel [1974], Hampel et al. [1986]) of a functional T(F) is a special Gâteaux derivative given by

$$IF(z; F, T) = \lim_{\epsilon \to 0} \frac{T(F_\epsilon) - T(F)}{\epsilon},$$

where F_ϵ = (1−ϵ)F + ϵ∆_z and ∆_z is the probability distribution that assigns mass 1 at the point z and 0 elsewhere. It has the heuristic interpretation of describing the effect of an infinitesimal contamination at the point z on the estimate, standardized by the mass of contamination. When the penalty function is sufficiently smooth, a standard argument establishes the existence and the form of the influence function of the minimizer of (1).

Lemma 1 : Assume (C1) and (C2). Let p : Θ → R be twice differentiable and S := E_F[ψ̇(Z, θ^∗)] + ∇²p(θ^∗;λ) be invertible. Then the influence function of T(F) exists for all z ∈ R^k and we have

$$IF(z; F, T) = -S^{-1}\left\{\psi(z, \theta^*) + \nabla p(\theta^*; \lambda)\right\}.$$
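As an informal numerical check of Lemma 1, the following sketch uses a toy setting that is not taken from the paper: a scalar parameter, the squared loss L(z, θ) = (y − xθ)²/2 with ψ(z, θ) = −(y − xθ)x, the ridge penalty p(θ;λ) = λθ²/2, and a distribution F with finite support. The function names and the support points are hypothetical. The code compares the closed form of the lemma with the difference quotient (T(F_ϵ) − T(F))/ϵ for a small contamination mass.

```python
import numpy as np

# Toy distribution F with finite support (illustrative values, not from the paper).
xs = np.array([-1.0, 0.5, 1.0, 2.0])
ys = np.array([-0.8, 0.7, 1.1, 1.9])
ws = np.array([0.25, 0.25, 0.25, 0.25])  # probabilities, sum to 1
lam = 0.3

def ridge_functional(xs, ys, ws, lam):
    """T(F): minimizer of E_F[(Y - X*theta)^2 / 2] + lam * theta^2 / 2."""
    return np.sum(ws * xs * ys) / (np.sum(ws * xs**2) + lam)

def influence_lemma1(z, xs, ys, ws, lam):
    """Closed form -S^{-1} (psi(z, theta*) + grad p(theta*)) for this toy problem."""
    x0, y0 = z
    theta = ridge_functional(xs, ys, ws, lam)
    psi = -(y0 - x0 * theta) * x0      # psi(z, theta*) = dL/dtheta at z
    grad_p = lam * theta               # gradient of the ridge penalty
    S = np.sum(ws * xs**2) + lam       # E_F[psi_dot] + second derivative of p
    return -(psi + grad_p) / S

def influence_finite_diff(z, xs, ys, ws, lam, eps=1e-6):
    """(T(F_eps) - T(F)) / eps with F_eps = (1 - eps) F + eps * Delta_z."""
    x0, y0 = z
    xs_c = np.append(xs, x0)
    ys_c = np.append(ys, y0)
    ws_c = np.append((1.0 - eps) * ws, eps)
    t_eps = ridge_functional(xs_c, ys_c, ws_c, lam)
    return (t_eps - ridge_functional(xs, ys, ws, lam)) / eps

z = (3.0, -2.0)  # contamination point (x0, y0)
if_formula = influence_lemma1(z, xs, ys, ws, lam)
if_numeric = influence_finite_diff(z, xs, ys, ws, lam)
print(if_formula, if_numeric)
```

The ridge penalty is chosen only because its minimizer is available in closed form, which keeps the comparison free of optimization error. Because ψ is unbounded in z for the squared loss, moving the contamination point further out makes both quantities grow without bound.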

It follows that, just as for M-estimators, a bounded derivative of the loss function is required in order to obtain bounded-influence estimators in the penalized setting. When the penalty function in (1) is not differentiable the conditions of Lemma 1 do not hold. We therefore propose to study the limiting form of the influence function of penalized M-estimators obtained using smooth penalty functions p_m such that lim_{m→∞} p_m = p. Such penalized M-estimators, denoted by T(F;p_m), are defined as the global minimizers of

$$\Lambda_\lambda(\theta; F, p_m) = E_F[L(Z, \theta)] + p_m(\theta; \lambda). \tag{2}$$

Let IF_{p_m}(z;F, T) be the influence function of T(F;p_m). We define the influence function of T(F) as

$$IF(z; F, T) := \lim_{m \to \infty} IF_{p_m}(z; F, T). \tag{3}$$


The following lemma states the uniqueness of the limiting estimator T(F). The use of Berge’s maximum theorem (Berge [1997]) is a key step in our proof. It constitutes an innovative tool in the statistical literature. The lemma is crucial for the uniqueness argument of the limiting influence function, stated in Proposition 1. While completing this article, the author noticed that Machado [1993] has also used Berge’s maximum theorem for the derivation of qualitative robustness for model selection criteria based on M-estimators.

Lemma 2 : Under (C1), (C2) and (C4) we have lim_m T(F;p_m) = lim_m T(F;p′_m) = T(F).

Remark 1 : We could extend the uniqueness argument of Lemma 2 to local minimizers if the selection step given at the end of the proof (see Section 6) can be extended to local minima. This could be easily done, for example, if there is a class of approximating penalty functions {p_m} that generate at least the same number of local minima as p for m > m₀.

Proposition 1 : Let T(F;p_m) satisfy the conditions of Lemma 1 and assume (C1)-(C4). Then the limiting influence function defined in (3) does not depend on the choice of p_m.
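The convergence stated in Lemma 2 can be illustrated on a hypothetical scalar problem that is not part of the paper: minimize (θ − a)²/2 + λ s_m(θ), where s_m is the smooth approximation of the absolute value used in the proof of Proposition 3, whose derivative equals tanh(tm/2). The limiting nonsmooth problem has the well known soft-thresholding solution. The sketch below, with illustrative function names, solves the smooth first-order condition by bisection and prints the path of minimizers as m grows.

```python
import math

def smooth_minimizer(a, lam, m, tol=1e-12):
    """Minimizer of (theta - a)^2 / 2 + lam * s_m(theta), found by bisection.

    The objective is strictly convex and its derivative is
    g(theta) = theta - a + lam * tanh(m * theta / 2).
    """
    g = lambda th: th - a + lam * math.tanh(m * th / 2.0)
    lo, hi = -abs(a), abs(a)  # g(lo) <= 0 <= g(hi), so the root is bracketed
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def soft_threshold(a, lam):
    """Minimizer of the limiting problem (theta - a)^2 / 2 + lam * |theta|."""
    return math.copysign(max(abs(a) - lam, 0.0), a)

lam = 0.5
for a in (2.0, 0.3, -1.2):
    path = [smooth_minimizer(a, lam, m) for m in (1, 10, 100, 10_000)]
    print(a, path, soft_threshold(a, lam))
```

For |a| ≤ λ the smooth minimizers are never exactly zero but shrink towards zero at rate 1/m, which is the behaviour one would expect from a sequence of differentiable problems approaching a sparse solution.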

2.2 Two-stage penalized M-estimators

We can extend the previous results to a class of two-stage penalized M-estimators. Important examples of such estimators are studied and discussed in Section 3.2. The following set up can be viewed as a direct extension of the framework provided by Zhelonkin et al. [2012]. Let F be the distribution function of Z = (Z⁽¹⁾, Z⁽²⁾) and let θ = (θ₁, θ₂) be a vector defining the arguments of the first and second stages, with θ₁ ∈ Θ₁ ⊂ R^{d₁}, θ₂ ∈ Θ₂ ⊂ R^{d₂} and d = d₁ + d₂. We consider penalized M-estimators (θ₁^∗, θ₂^∗) = (S(F), T(F)) defined by

$$\theta_1^* = \operatorname*{argmin}_{\theta_1} \left\{ E_F[L^{(1)}(Z^{(1)}, \theta_1)] + p^{(1)}(\theta_1; \gamma) \right\} \tag{4}$$

$$\theta_2^* = \operatorname*{argmin}_{\theta_2} \left\{ E_F[L^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1^*)] + p^{(2)}(\theta_2, \theta_1^*; \lambda) \right\} \tag{5}$$


where L⁽ⁱ⁾ and p⁽ⁱ⁾ denote respectively the loss and penalty functions in the ith stage. For the theoretical argument we adapt conditions (C2)-(C4) to this set up in the following straightforward way:

(C2’) E_F[L⁽¹⁾(Z⁽¹⁾, θ₁)] has two derivatives with respect to θ₁ denoted by E_F[ψ⁽¹⁾(Z⁽¹⁾, θ₁)] and E_F[ψ̇⁽¹⁾(Z⁽¹⁾, θ₁)] respectively. E_F[L⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] has two derivatives with respect to θ₂ denoted by E_F[ψ⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] and E_F[ψ̇⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] respectively. Furthermore E_F[ψ⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] is differentiable with respect to θ₁.

(C3’) The functions ψ⁽¹⁾(z⁽¹⁾, θ₁) and ψ⁽²⁾(z⁽²⁾, θ₂, z⁽¹⁾, θ₁) are continuous at θ₁^∗, and E_F[ψ̇⁽¹⁾(Z⁽¹⁾, θ₁)], ∇_θ₁E_F[ψ⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] and E_F[ψ̇⁽²⁾(Z⁽²⁾, θ₂, Z⁽¹⁾, θ₁)] are continuous at (θ₁^∗, θ₂^∗).

(C4’) For i = 1, 2, let {p⁽ⁱ⁾_{m_i}} and {p′⁽ⁱ⁾_{m_i}} be two sequences in C^∞(Θ_i) converging to p⁽ⁱ⁾ in the Sobolev space W^{2,2}(R^{d_i}) when m_i → ∞. Furthermore {∇_θ₂p⁽²⁾_m} and {∇_θ₂p′⁽²⁾_m} are differentiable with respect to θ₁.

We first provide the influence function of (5) for sufficiently smooth penalty functions.

Lemma 3 : Denote by θ^∗ = (S(F), T(F)) the estimators defined by (4)-(5). Assume (C1), (C2’) and let p⁽ⁱ⁾ : Θ_i → R be twice differentiable with respect to θ_i and ∇_θ₂p⁽²⁾ be differentiable with respect to θ₁. Let also S := E_F[ψ̇⁽²⁾(Z⁽²⁾, θ₂^∗, Z⁽¹⁾, θ₁^∗)] + ∇_{θ₂,θ₂}p⁽²⁾(θ₂^∗, θ₁^∗;λ) be invertible. Then the influence function of T(F) exists for all z ∈ R^k and we have:

$$IF(z; F, T) = -S^{-1}\Big\{ \psi^{(2)}(z^{(2)}, \theta_2^*, z^{(1)}, \theta_1^*) + \nabla_{\theta_2} p^{(2)}(\theta_2^*, \theta_1^*; \lambda) + \big( \nabla_{\theta_2} E_F[\dot\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1)] + \nabla_{\theta_2,\theta_1} p^{(2)}(\theta_2^*, \theta_1^*; \lambda) \big)\, IF(z^{(1)}; F, S) \Big\},$$

where IF(z⁽¹⁾;F, S) has the form given in Lemma 1.

Unsurprisingly, bounded-influence estimators are obtained only by taking loss functions with bounded derivatives in both stages. The expression obtained is very similar to the one derived in the unpenalized set up in Zhelonkin et al. [2012]. We are now ready to state the uniqueness of the limiting two-stage estimator (5) and its influence function.

Lemma 4 : Under (C1), (C2’) and (C4’) we have lim_m T(F;p⁽²⁾_m) = lim_m T(F;p′⁽²⁾_m) = T(F).

Proposition 2 : Let p_m = (p⁽¹⁾_m, p⁽²⁾_m), let T(F;p_m) satisfy the conditions of Lemma 1 and assume (C1), (C2’)-(C4’). Then the limiting influence function (3) of (5) does not depend on the choice of p_m.

3 Examples

We are now ready to derive the influence function of some general penalized M-estimators using approximating sequences. We consider different classes of sparsity inducing penalty functions in the analysis. Without loss of generality we will assume that the tuning parameter λ is such that the resulting estimators are sparse. More specifically we consider that θ_F,λ = θ^∗ = (θ₁^{∗T}, θ₂^{∗T})^T with θ₁^∗ ∈ R^s, s < d and θ₂^∗ = 0. Note that (1) is a population version of the empirical problem encountered in practice, where F is replaced by its empirical version F̂, i.e. the distribution that assigns mass 1/n to each observation (x_i, y_i), i = 1, . . . , n. In a set up where the number of parameters is allowed to grow with the sample size, the true underlying distribution generating the data is F_n and its empirical version is F̂_n. For notational convenience we will write F instead of F_n as our result holds in both cases, as well as for their empirical counterparts. In particular our results hold in the high dimensional framework where d > n.

3.1 Lasso and group lasso type penalties

The following proposition gives the form of the influence function of estimators that arise when considering a class of general penalty functions. It covers as special cases convex penalties such as the lasso (Tibshirani [1996]) and nonconvex penalties such as the scad (Fan and Li [2001]).


Proposition 3 : Denote by θ^∗ = T(F) the penalized M-estimators obtained as the minimizer of (1) with penalty functions of the form p_λ(θ) = Σ_{j=1}^{d} p_λ,j(|θ_j|), where p_λ,j(·) are differentiable functions. Then under (C1)-(C3) the influence function (3) of T(F) has the form

$$IF(z; F, T) = -S^{-1}\left\{\psi(z, \theta^*) + \varphi_\lambda(\theta^*)\right\},$$

where S⁻¹ = blockdiag{(M₁₁+P_λ)⁻¹, 0}, M₁₁ = E_F[ψ̇₁₁(Z, T(F))], P_λ is a diagonal matrix with diagonal elements p″_λ,j(|θ_j^∗|) and φ_λ(θ^∗) is a d-dimensional vector with components p′_λ,j(|θ_j^∗|)sign(θ_j^∗) for j = 1, . . . , d.

We now give the form of the influence function of penalized M-estimators achieving sparsity for grouped variables via group lasso type of penalties (e.g. Yuan and Lin [2006], Wang et al. [2007b], Huang et al. [2009]).

Proposition 4 : Denote by θ^∗ = T(F) the penalized M-estimators obtained as the minimizer of (1) with group penalty functions of the form p_λ(θ) = Σ_{g=1}^{G} p_λ,g(‖θ_(g)‖₂), where p_λ,g(·) are differentiable functions, θ = (θ_(1), . . . , θ_(G)) and each θ_(g) is a subvector of θ corresponding to the gth group of variables. Then under (C1)-(C3) the influence function (3) of T(F) has the form

$$IF(z; F, T) = -S^{-1}\left\{\psi(z, \theta^*) + \varphi_\lambda(\theta^*)\right\},$$

where S⁻¹ = blockdiag{(M₁₁+P_λ)⁻¹, 0}, M₁₁ = E_F[ψ̇₁₁(Z, T(F))] and P_λ is a block diagonal matrix with blocks p″_λ,g(‖θ_(g)‖₂)(θ_(g)θ_(g)^T − ‖θ_(g)‖₂ I)/‖θ_(g)‖₂³, where

$$\varphi_\lambda(\theta^*) = \begin{cases} p'_{\lambda,g}(\|\theta^*_{(g)}\|_2)\, \theta_j^* / \|\theta^*_{(g)}\|_2 & \text{if } \theta^*_{(g)} \neq 0 \\ 0 & \text{if } \theta^*_{(g)} = 0 \end{cases}$$

and p′_λ,g(t) and p″_λ,g(t) are the first two derivatives of p_λ,g(t) for t > 0.

It is clear from Proposition 3 and Proposition 4 that a bounded ψ function is necessary for a penalized M-estimator to have a bounded limiting influence function. The fact that the influence function has some zero components is rather surprising. Further discussion can be found in the next two sections.
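The zero components can be traced back to the block form of S⁻¹, which arises in the proofs from the formula for the inverse of a partitioned matrix. The following sketch is only an illustration under assumed values: M is a hypothetical 3 × 3 Hessian with the first s = 2 coordinates active, and kappa stands in for the second derivative of a smoothed lasso penalty at an inactive coordinate (λ s″_m(0) = λm/2 for the approximation used in the proof of Proposition 3), which diverges with m. For the lasso P_λ = 0, so the limit is blockdiag{M₁₁⁻¹, 0}.

```python
import numpy as np

# Hypothetical Hessian E_F[psi_dot] for d = 3 parameters, the first s = 2 being active.
M = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.5, 0.2],
              [0.3, 0.2, 1.0]])
s = 2

def smoothed_inverse(M, s, kappa):
    """Inverse of M + diag(0, ..., 0, kappa, ..., kappa).

    kappa plays the role of the second derivative of a smoothed penalty
    at the inactive coordinates, which grows without bound with m.
    """
    penalty_hessian = np.zeros_like(M)
    idx = np.arange(s, M.shape[0])
    penalty_hessian[idx, idx] = kappa
    return np.linalg.inv(M + penalty_hessian)

# Limiting form blockdiag{M11^{-1}, 0} (lasso case, where P_lambda = 0).
limit = np.zeros_like(M)
limit[:s, :s] = np.linalg.inv(M[:s, :s])

gaps = []
for kappa in (1e2, 1e4, 1e6):
    gap = np.max(np.abs(smoothed_inverse(M, s, kappa) - limit))
    gaps.append(gap)
    print(f"kappa = {kappa:.0e}: max abs deviation from the limit = {gap:.2e}")
```

The deviation decays at rate 1/kappa, so the rows and columns of the inverse associated with the inactive coordinates vanish in the limit, and with them the corresponding components of the influence function.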


3.2 Adaptive lasso type penalties

The adaptive lasso of Zou [2006] is a popular two-stage procedure that improves on the results of the lasso by ensuring the oracle properties of Fan and Li [2001]. The tools developed in Section 2.2 allow us to derive the influence function of adaptive lasso type estimators.

Proposition 5 : Let θ be the d-dimensional parameter of interest and let θ⁽⁰⁾ = S(F) be an initial estimate of θ, with s⁽⁰⁾ nonzero components, defined by (4). For j = 1, . . . , d and some function w, define the weights w_j = w(|θ_j⁽⁰⁾|). Denote by θ^∗ = T(F) the penalized M-estimators obtained as the minimizer of (5) with penalty function of the form p(θ, θ⁽⁰⁾;λ) = λ Σ_{j=1}^{d} w_j|θ_j| and loss function L(Z, θ, θ⁽⁰⁾). Then under (C1), (C2’) and (C3’) the influence function (3) of T(F) has the form

$$IF(z; F, T) = -S^{-1}\left\{\psi(z, \theta^*) + \varphi_\lambda(\theta^*, \theta^{(0)}) + \phi_\lambda(\theta^*, \theta^{(0)})\, IF(z; F, S)\right\},$$

where S⁻¹ = blockdiag{M₁₁⁻¹, 0}, M₁₁ = E_F[ψ̇₁₁(Z, T(F))], φ_λ(θ^∗, θ⁽⁰⁾) is a d-dimensional vector with components λw′(|θ_j⁽⁰⁾|)sign(θ_j⁽⁰⁾)sign(θ_j^∗) for j = 1, . . . , d with w′ denoting the derivative of w, and ϕ_λ(θ^∗, θ⁽⁰⁾) = blockdiag{H_γ, 0} where H_γ = {∇_{θ_k⁽⁰⁾} φ_λ,j(θ^∗, θ⁽⁰⁾)}, j, k = 1, . . . , s⁽⁰⁾.

The form of IF(z;F, T) depends on the choice of the initial estimator. A bounded influence estimator T(F) can only be obtained by taking a bounded influence initial estimator S(F) and choosing a loss function defining a bounded ψ function in the second stage. Among the nonsparse initial estimates proposed in the literature, Zou [2006] proposed to use maximum likelihood estimates for the fixed parameter case where d does not vary with n. In the high dimensional set up where d > n, Huang et al. [2008] proposed to use an initial zero consistent estimate, e.g. marginal least squares. For those cases the influence function of S(F) is simply proportional to the score function as they are well known M-estimators. Lasso estimators have also been proposed as initial estimates. See for instance Fan et al. [2014] and Avella-Medina and Ronchetti [2014], in the context of high dimensional penalized likelihood and robust quasilikelihood estimation respectively. Note that for w_j = 1/|θ_j⁽⁰⁾| the usual convention is that for θ_j⁽⁰⁾ = 0 we define w_j = ∞ and ∞ · 0 = 0. Hence for this choice of w_j, a coefficient set to zero in the first step will never appear in T(F).

4 Connections to distribution theory

As mentioned in the introduction, Wang et al. [2013] studied the robustness properties of their estimator by calculating its finite sample breakdown point and influence function. It can be seen that their derivation of the influence function (Theorem 3) could easily be extended to more general loss functions. By the argument given above in Section 3, it also extends to a set up with a diverging number of parameters. In their proof they implicitly use distributions in the sense of Schwartz [1959], since they require the first two derivatives of the absolute value function. Indeed they use sign(x) as the first derivative of |x| and the Dirac delta function δ(x) as its second derivative. These derivatives are justified by the theory of distributions. However, working explicitly with the Dirac delta function understood informally as

$$\delta(x) = \begin{cases} +\infty & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}$$

and inverting a matrix containing such an expression is not fully satisfactory from a formal mathematical standpoint. Interestingly, the expression obtained in Theorem 3 in Wang et al. [2013] is the same as the one we give in Proposition 3. This suggests that a more careful treatment of the problem with a rigorous use of differentiation in the sense of distribution theory will yield the same influence function.

We establish such a result in a very simple set up. Assume an orthogonal linear model, i.e.

$$y_i = x_i^T \theta + \epsilon_i, \quad i = 1, \ldots, n,$$

where ϵ_i are zero mean errors, y_i the responses and x_i the covariates with X^TX = I and X = (x₁, . . . , x_n)^T. We consider the penalized least squares problem

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 + \sum_{j=1}^{d} p_\lambda(|\theta_j|) \tag{6}$$

where p_λ(·) is a penalty function and λ the regularization parameter. For the lasso and scad penalties, the resulting estimators have an explicit solution that allows for easy computations of distributional derivatives.
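For concreteness, the explicit solutions stated in the proof of Lemma 5 can be written as thresholding rules applied coordinatewise to the least squares coefficient θ_j⁽⁰⁾. The sketch below is a direct transcription under stated assumptions: the function names are hypothetical, and the default a = 3.7 is only a commonly used value for the scad parameter, the rule itself requiring a > 2.

```python
import numpy as np

def lasso_threshold(theta0, lam):
    """Soft thresholding: sign(theta0) * (|theta0| - lam)_+ ."""
    return np.sign(theta0) * np.maximum(np.abs(theta0) - lam, 0.0)

def scad_threshold(theta0, lam, a=3.7):
    """Scad thresholding rule as stated in the proof of Lemma 5 (requires a > 2)."""
    theta0 = np.asarray(theta0, dtype=float)
    abs0 = np.abs(theta0)
    # Middle branch, used when 2 * lam <= |theta0| <= a * lam.
    middle = ((a - 1.0) * theta0 - np.sign(theta0) * a * lam) / (a - 2.0)
    return np.where(abs0 < 2.0 * lam, lasso_threshold(theta0, lam),
                    np.where(abs0 <= a * lam, middle, theta0))

lam = 1.0
grid = np.array([-5.0, -3.0, -1.5, -0.5, 0.0, 0.5, 1.5, 3.0, 5.0])
print(lasso_threshold(grid, lam))
print(scad_threshold(grid, lam))
```

Both rules are continuous and piecewise linear in θ_j⁽⁰⁾, and they are constant (equal to zero) on the thresholded region, which is why the derivative with respect to the contamination vanishes there.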

Lemma 5 : Let θ⁽⁰⁾ = T⁽⁰⁾(F) denote the least squares estimator of θ. Then the influence function of the minimizer θ^∗ = T(F) of (6) exists as a distributional derivative for the lasso and scad penalties, and an explicit computation yields

$$IF(z; F, T) = \begin{cases} 0 & \text{if } |\theta_j^{(0)}| \le \lambda \\ \psi_j(z, \theta^*) + \lambda\, \mathrm{sign}(\theta_j^*) & \text{otherwise} \end{cases}$$

and

$$IF(z; F, T) = \begin{cases} 0 & \text{if } |\theta_j^{(0)}| \le 2\lambda \\ \dfrac{\psi_j(z, \theta^*) + p'_\lambda(|\theta_j^*|)\, \mathrm{sign}(\theta_j^*)}{1 + p''_\lambda(|\theta_j^*|)} & \text{otherwise} \end{cases}$$

respectively, where z = (y, x), ψ_j(z, θ^∗) = (y − x^Tθ^∗)x_j, and p′_λ(t) and p″_λ(t) denote the first two derivatives of p_λ(t) for t > 0.

At first glance the theory of distributions seems to provide a natural and rigorous way of tackling nondifferentiable penalties. However, the theory suffers from at least two major drawbacks for the purposes of deriving influence functions as in Lemma 5. The product of two distributions cannot be consistently defined in general. This makes the manipulation of distributions delicate. Furthermore, to the best of our knowledge there is no implicit function theorem that would easily allow us to derive the form of the derivative of an implicit function. Therefore extending the proof of Lemma 5 to more general problems does not look obvious. We can however show that the influence functions derived in Section 3 using the limiting influence function can be viewed as distributional derivatives. Before giving this result in Proposition 6, we need an intermediate result concerning the continuity of T(F_ϵ) with respect to ϵ that is interesting on its own. Its proof uses Berge’s maximum theorem and is similar to the proof of Lemma 2.

Lemma 6 : Under (C1) and (C2), the penalized M-estimator T(F_ϵ) resulting from the minimization of (1) is continuous with respect to ϵ ∈ (−ε, ε), ε > 0.


Proposition 6 : Under the assumptions of Proposition 1, the influence function (3) of the minimizer T(F) of (1) is the distributional derivative of T(F_ϵ) with respect to ϵ evaluated at 0.

Note that with Proposition 6 at hand, Lemma 5 becomes a direct consequence of it. We present Lemma 5 and give its proof through direct calculations, as it constitutes one rare example where distributional derivatives can easily and explicitly be computed for a penalized M-estimator.

5 Discussion

We introduced the idea of calculating the influence function of penalized M-estimators with the help of a sequence of approximating estimators. In Section 2 we justified the validity of such an approach. In Section 4 we showed that it is equivalent to computing a distributional derivative. In Section 3 we derived the limiting influence functions of general penalized M-estimators, arising when choosing some classes of sparsity inducing penalty functions. The three families of penalty functions analyzed cover the most prominent examples studied in the literature. These derivations are easy to compute, fairly intuitive and give explicit solutions.

Let us now discuss two main properties of the influence function. First, since by definition the influence function is essentially an ordinary derivative with respect to ϵ, if a statistical functional T(F) is sufficiently regular, a von Mises expansion (Taylor expansion, Mises [1947]) yields

$$T(G) \approx T(F) + \int IF(z; F, T)\, d(G - F)(z). \tag{7}$$

Substituting in this expression the empirical distribution F̂ for G, we obtain

$$\sqrt{n}\,\big(T(\hat F) - T(F)\big) \approx \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IF(z_i; F, T)$$

because ∫ IF(z;F, T) dF(z) = 0. Then by the central limit theorem we have that √n(T(F̂) − T(F)) is asymptotically normal with mean 0 and variance V(F, T) = ∫ IF(z;F, T) IF^T(z;F, T) dF(z). A rigorous general argument can be found in Huber [1981]. For M-estimators, the conditions for the Fréchet differentiability of Clarke [1986] guarantee the validity of the von Mises expansion and imply good robustness properties as discussed in Bednarski [1993]. Using these formulas and the influence functions derived in Section 3, we see that for penalized M-estimators, the approximation (7) leads to the expression

$$\sqrt{n}\,(\hat\theta - \theta^*) \to_d N(0, V(F, T)),$$

where V(F, T) is block diagonal with 0 entries for the elements corresponding to θ₂. All the analysis in Section 3 was developed for a fixed λ. If we assume that there is a true underlying set of parameters θ₀, this implies in general that θ^∗ ≠ θ₀. Note however that if we work instead with a λ_n tending to zero at an appropriate rate, the heuristic asymptotic distribution would match the oracle properties (Fan and Li [2001]) for the class of lasso type estimators satisfying such properties. Examples of this type of estimators were derived for instance in Fan and Lv [2010] and Avella-Medina and Ronchetti [2014]. The von Mises expansion can therefore be used to assess informally the asymptotic properties of an important class of penalized M-estimators. A more careful study of this phenomenon is required for a better understanding of the conditions under which it holds. Still, as pointed out in Hampel et al. [1986], p.85: “. . . it is usually easier to verify the asymptotic normality in another way instead of trying to assess the necessary regularity conditions to make this approach rigourous”.
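The two identities used in this heuristic argument, the centering ∫ IF dF = 0 and the variance formula V(F, T) = ∫ IF IF^T dF, can be checked by direct computation in a toy case. The sketch below is only illustrative and rests on assumptions that are not in the paper: a scalar parameter, squared loss, a smooth ridge penalty with fixed λ (so that Lemma 1 applies directly) and a hypothetical distribution F with four support points.

```python
import numpy as np

# Toy finite-support distribution F (illustrative values, not from the paper).
xs = np.array([-1.0, 0.5, 1.0, 2.0])
ys = np.array([-0.8, 0.7, 1.1, 1.9])
ws = np.full(4, 0.25)  # probabilities of the support points
lam = 0.3

# Ridge-penalized least squares functional and its influence function (Lemma 1),
# evaluated at every support point z_i = (x_i, y_i).
S = np.sum(ws * xs**2) + lam                 # E_F[psi_dot] + second derivative of p
theta_star = np.sum(ws * xs * ys) / S        # T(F)
psi = -(ys - xs * theta_star) * xs           # psi(z_i, theta*)
influence = -(psi + lam * theta_star) / S    # IF(z_i; F, T)

mean_if = np.sum(ws * influence)       # integral of IF with respect to F
variance = np.sum(ws * influence**2)   # V(F, T) in the scalar case
print(mean_if, variance)
```

The centering holds here because the first-order condition E_F[ψ(Z, θ^∗)] + ∇p(θ^∗;λ) = 0 makes the penalty gradient cancel the mean of ψ exactly.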

Second, it allows for an easy assessment of the relative influence of individual observations on the value of an estimate. If it is unbounded, a single outlier could cause trouble. Considering the approximation (7) over a neighborhood of the model, we see that the influence function can be used to linearize the asymptotic bias in a neighborhood of the ideal model. Therefore a bounded influence function implies a bounded approximate bias. Just as in the unpenalized case, a bounded ψ is needed for an estimator to have a bounded influence function. This reflects the intuition that the sources of local instability for M-estimators and their penalized counterparts should be the same. As shown in this article, in both situations, the local robustness properties are a direct result of the form and boundedness of the derivative of the loss function. Extensive simulations illustrating the impact of deviations from the stochastic assumptions on penalized M-estimators can be found, among many others, in Sardy et al. [2001], Li et al. [2011], Wang et al. [2013], Lozano and Meinshausen [2013] and Avella-Medina and Ronchetti [2014].

6 Proofs

Proof of Lemma 1

The estimator T(F_ϵ) = θ_{F_ϵ,λ} is obtained as a solution of

$$H(\epsilon, \theta) := E_{F_\epsilon}[\psi(Z, \theta)] + \nabla p(\theta; \lambda) = 0.$$

Note that H has partial derivatives with respect to ϵ and θ at (0, θ_F,λ). Therefore the claimed result follows immediately from the implicit function theorem.

Proof of Lemma 2

Let b(p) = inf_{θ∈Θ} Λ_λ(θ;F, p) and f(θ, p) = b(p) − Λ_λ(θ;F, p). The mapping Γ : W^{2,2} ↦ Θ, Γp = {θ | θ ∈ Θ, f(θ, p) ≤ 0} is closed by construction (Berge [1997], Example, p.111). Therefore Γp is compact for any p, which implies that Γ is continuous (Berge [1997], Example p.109). From Berge’s maximum theorem, M(p) = max{−E_F[L(Z, θ)] | θ ∈ Γp} is continuous in W^{2,2} and the mapping φp = {θ | θ ∈ Γp, E_F[L(Z, θ)] = M(p)} is upper semi-continuous from W^{2,2} to Θ. Define θ_m = {θ | θ ∈ φp_m, E_F[L(Z, θ)] = sup M(p_m)} and θ^∗ = {θ | θ ∈ φp, E_F[L(Z, θ)] = sup M(p)}. Then from Lemma 7 and Lemma 8 in the Appendix we have that lim_m θ_m = θ^∗ ∈ Γp = lim_m Γp_m.

Proof of Proposition 1

For ease of notation we will write ψ_p = ψ(Z;T(F;p_m)), S_p = E_F[ψ̇(Z;T(F;p_m))] + ∇²p_m(T(F;p_m)), H = F − ∆_z and IF(p_m) = IF_{p_m}(z;F, T), with ψ_p′ and S_p′ defined analogously for p′_m. By Lemma 1

$$\begin{aligned} IF(p'_m) - IF(p_m) &= S_p^{-1} E_H[\psi_p] - S_{p'}^{-1} E_H[\psi_{p'}] \\ &= S_{p'}^{-1} E_H[\psi_p - \psi_{p'}] + S_p^{-1}(S_p - S_{p'})\, S_p^{-1} E_H[\psi_p] + o(\|S_p - S_{p'}\|). \end{aligned}$$

By Lemma 2 we have lim_m T(F;p′_m) = lim_m T(F;p_m) = T(F). Therefore (C3) and (C4) guarantee that E_H[ψ_p − ψ_p′] → 0 and ‖S_p − S_p′‖ → 0. Hence lim_m {IF(p_m) − IF(p′_m)} = 0.

Proof of Lemma 3

The existence of IF(z⁽¹⁾;F, S) follows from Lemma 1. The estimator T(F_ϵ) is obtained as a solution in θ₂ of

$$H(\epsilon, \theta_2) := E_{F_\epsilon}[\psi^{(2)}(Z^{(2)}, \theta_2, Z^{(1)}, \theta_1^*)] + \nabla_{\theta_2} p^{(2)}(\theta_2, \theta_1^*; \lambda) = 0.$$

Since H has partial derivatives with respect to ϵ and θ₂ at (0, θ₂^∗), the existence of IF(z;F, T) follows from the implicit function theorem. The claimed expression is obtained by slightly modifying the computations provided in Zhelonkin et al. [2012] for unpenalized two-stage M-estimators.

Proof of Lemma 4

For a given first stage estimator the proof is similar to the one given for Lemma 2.

Proof of Proposition 2

For ease of notation we will write (θ₁^m, θ₂^m) = (S(F;p⁽¹⁾_m), T(F;p⁽²⁾_m)), ψ_p = ψ(z⁽²⁾, θ₂^m, z⁽¹⁾, θ₁^m), S_p = E_F[ψ̇⁽²⁾(Z⁽²⁾, θ₂^m, Z⁽¹⁾, θ₁^m)] + ∇_{θ₂,θ₂}p⁽²⁾_m(T(F;p⁽²⁾_m)), H = F − ∆_z, IF(p⁽¹⁾_m) = IF_{p⁽¹⁾_m}(z⁽¹⁾;F, S) and IF(p⁽²⁾_m) = IF_{p⁽²⁾_m}(z;F, T). By Lemma 3

$$\begin{aligned} IF(p'^{(2)}_m) - IF(p^{(2)}_m) &= S_p^{-1}\Big\{ E_H[\psi_p] + \big(\nabla_{\theta_2} E_F[\psi_p] + \nabla_{\theta_2,\theta_1} p^{(2)}\big)\, IF(z^{(1)}; F, S) \Big\} \\ &\quad - S_{p'}^{-1}\Big\{ E_H[\psi_{p'}] + \big(\nabla_{\theta_2} E_F[\psi_{p'}] + \nabla_{\theta_2,\theta_1} p'^{(2)}\big)\, IF(z^{(1)}; F, S) \Big\} \\ &= T_1 + T_2\, IF(z^{(1)}; F, S). \end{aligned}$$

By the arguments given in Proposition 1, T₁ → 0. In particular ‖S_p − S_p′‖ → 0. Furthermore from Lemma 4 we have lim_m T(F;p⁽²⁾_m) = lim_m T(F;p′⁽²⁾_m) = T(F). Therefore (C3’) and (C4’) guarantee that lim_m {IF(p′⁽²⁾_m) − IF(p⁽²⁾_m)} = 0.


Proof of Proposition 3

From Proposition 1 it suffices to show that the limiting influence function of a smooth approximation of the problem has the desired form. One possible infinitely differentiable approximation for the absolute value is

$$s_m(t) = \frac{2}{m} \log(e^{tm} + 1) - t \;\xrightarrow[m \to \infty]{}\; |t|.$$

Its first two derivatives have the form

$$s'_m(t) := \frac{2e^{tm}}{e^{tm} + 1} - 1 \;\xrightarrow[m \to \infty]{}\; \mathrm{sign}(t) = \begin{cases} -1 & \text{if } t < 0 \\ +1 & \text{if } t > 0 \\ 0 & \text{otherwise} \end{cases}$$

and

$$s''_m(t) = \frac{2me^{tm}}{(e^{tm} + 1)^2} \;\xrightarrow[m \to \infty]{}\; \begin{cases} 0 & \text{if } t \neq 0 \\ +\infty & \text{otherwise.} \end{cases}$$

Defining p_m(θ) = Σ_{j=1}^{d} p_λ,j(s_m(θ_j)), from Lemma 1 and the notation of Proposition 1 we have

$$IF_{p_m}(z; F, T) = -S_{p_m}^{-1}\left\{\psi(z; T(F; p_m)) + \nabla p_m(T(F; p_m))\right\}.$$

Remember that for a partitioned matrix A, if all the necessary inverses exist, the elements of A⁻¹ are

$$A^{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}, \quad A^{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}, \quad A^{12} = -A^{11}A_{12}A_{22}^{-1}, \quad A^{21} = -A_{22}^{-1}A_{21}A^{11}.$$

Hence, bearing in mind this formula and the previous limits, it is easily seen that IF(z;F, T) = lim_m IF_{p_m}(z;F, T) has the claimed form.
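The behaviour of the smooth approximation s_m used in this proof is easy to inspect numerically. The sketch below, with illustrative function names, evaluates s_m and its first two derivatives in an overflow-safe way, using the identities 2e^{tm}/(e^{tm} + 1) − 1 = tanh(tm/2) and 2me^{tm}/(e^{tm} + 1)² = (m/2)/cosh²(tm/2). The approximation error satisfies s_m(t) − |t| = (2/m) log(1 + e^{−|t|m}), so it is at most (2 log 2)/m and is attained at t = 0.

```python
import numpy as np

def s_m(t, m):
    """Smooth approximation (2/m) * log(exp(t*m) + 1) - t of |t|, computed stably."""
    return (2.0 / m) * np.logaddexp(0.0, t * m) - t

def s_m_prime(t, m):
    """First derivative 2*exp(t*m)/(exp(t*m) + 1) - 1, which equals tanh(t*m/2)."""
    return np.tanh(t * m / 2.0)

def s_m_second(t, m):
    """Second derivative 2*m*exp(t*m)/(exp(t*m) + 1)^2 = (m/2) / cosh(t*m/2)^2."""
    # Clipping avoids overflow in cosh; the value is numerically zero there anyway.
    return 0.5 * m / np.cosh(np.clip(t * m / 2.0, -350.0, 350.0)) ** 2

t = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
for m in (1, 10, 100, 1000):
    err = np.max(np.abs(s_m(t, m) - np.abs(t)))
    print(m, err, s_m_prime(t, m), s_m_second(t, m))
```

As m grows the first derivative approaches sign(t) at every fixed t, while the second derivative vanishes away from the origin and equals m/2 at t = 0, which is the divergence exploited in the partitioned inverse argument.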

Proof of Proposition 4

The difficulty of deriving the influence function for group lasso type of penalties comes from the fact that p

|t| is not differentiable. As for the lasso penalty we will approximate the group lasso penalty by a differentiable one guaranteeing the conditions of Proposition 1. We approximate √

t by

(19)

ps_m(t), t≥0 where s_m(t) is the same function used in Proposition 3. This yields

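$$
\big(\sqrt{s_m(t)}\big)' = \frac{e^{tm} - 1}{(e^{tm} + 1)\, s_m^{1/2}(t)} \;\xrightarrow[m\to\infty]{}\;
\begin{cases}
t^{-1/2} & \text{if } t > 0, \\
\infty & \text{for } t \downarrow 0
\end{cases}
$$

and

$$
\big(\sqrt{s_m(t)}\big)'' = \frac{2me^{tm}}{(e^{tm} + 1)^2\, s_m^{1/2}(t)} - \frac{(e^{tm} - 1)^2}{2(e^{tm} + 1)^2\, s_m^{3/2}(t)} \;\xrightarrow[m\to\infty]{}\;
\begin{cases}
-t^{-3/2} & \text{if } t > 0, \\
-\infty & \text{for } t \downarrow 0.
\end{cases}
$$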

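Note that after some simplifications

$$\Big\{\big(\sqrt{s_m(t)}\big)''\Big\}^{-1} \big(\sqrt{s_m(t)}\big)' = \frac{s_m(t)(e^{tm} + 1)(e^{tm} - 1)}{4me^{tm}s_m(t) - (e^{tm} - 1)^2} \;\xrightarrow[m\to\infty]{}\; -t, \quad t \ge 0. \qquad (8)$$

For the penalty $p_m(\theta) = \sum_{g=1}^{G} p_{\lambda,g}\big(\sqrt{s_m(\|\theta_{(g)}\|_2^2)}\big)$, using the formula for the inverse of partitioned matrices, we have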

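$$
S_{p_m}^{-1} =
\begin{bmatrix}
E_F[\dot\psi_{11}(Z, T(F))] + \nabla^2 p_m^1(T_1(F)) & E_F[\dot\psi_{12}(Z, T(F))] \\
E_F[\dot\psi_{21}(Z, T(F))] & E_F[\dot\psi_{22}(Z, T(F))] + \nabla^2 p_m^2(T_2(F))
\end{bmatrix}^{-1}
\;\xrightarrow[m\to\infty]{}\;
\begin{bmatrix}
(M_{11} + P_\lambda)^{-1} & 0 \\
0 & 0
\end{bmatrix}
= S^{-1}. \qquad (9)
$$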

From Lemma 1 we have

$$IF_{p_m}(z; T, F) = -\Big\{ E_F[\dot\psi(Z; T(F; p_m))] + \nabla^2 p_m(T(F; p_m)) \Big\}^{-1} \Big\{ \psi(z; T(F; p_m)) + \nabla p_{\lambda,m}(T(F; p_m)) \Big\}. \qquad (10)$$

Finally, using (8)-(9) when taking the limit of (10) as $m \to \infty$ yields the claimed result.

Proof of Proposition 5

From Lemma 3 and Proposition 2 we know that we only need to take a sequence of approximating influence functions. First note that $\nabla_{\theta^{(0)}} E_F[\psi(Z, T(F))] = 0$ and $\nabla_{\theta_j,\theta_j}\, w_j|\theta_j| = 0$ for $\theta_j \neq 0$. Then an argument similar to the one given in Proposition 3 completes the proof.


Proof of Lemma 5

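In the orthogonal linear model the lasso and scad estimators have the explicit solutions

$$\hat\theta_j^{lasso} = \operatorname{sign}(\theta_j^{(0)})(|\theta_j^{(0)}| - \lambda)_+$$

and

$$
\hat\theta_j^{scad} =
\begin{cases}
\operatorname{sign}(\theta_j^{(0)})(|\theta_j^{(0)}| - \lambda)_+ & \text{if } |\theta_j^{(0)}| < 2\lambda, \\
\big\{(a-1)\theta_j^{(0)} - \operatorname{sign}(\theta_j^{(0)})\, a\lambda\big\}/(a-2) & \text{if } 2\lambda \le |\theta_j^{(0)}| \le a\lambda, \\
\theta_j^{(0)} & \text{otherwise,}
\end{cases}
$$

where $(x)_+ = \max\{0, x\}$ and $a > 2$. The functional form of the least squares estimator is very simple and the $j$th coefficient at $F_\epsilon$ is

$$\theta_j^{(0)}(\epsilon) = f(\epsilon) = \int X_j Y \, dF_\epsilon = A + \epsilon(B - A),$$

where $A = E_F[X_jY]$ and $B = x_jy$. Recall that a distribution is a linear functional on the space $\mathcal{D}$ of test functions, i.e. the space of infinitely differentiable functions with compact support. The Dirac delta function, for instance, is the linear functional defined by $\langle\delta, \phi\rangle = \phi(0)$, $\phi \in \mathcal{D}$. The distributional derivative of a function $g : \mathbb{R} \to \mathbb{R}$ is defined as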

$$\langle g', \phi\rangle = -\langle g, \phi'\rangle = -\int_\Omega g(x)\phi'(x)\,dx,$$

where $\Omega$ is the support of $\phi$ and $\phi'$ denotes the derivative of $\phi$. More details on distributions can be found in the Appendix.
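The closed-form lasso and scad solutions stated at the start of this proof can be sketched in code as follows. This is an illustrative implementation only: the function names are hypothetical, and the default `a=3.7` is an assumption (the value commonly used following Fan and Li [2001]), not something fixed by this proof, which only requires $a > 2$.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution in the orthogonal design: sign(z) * (|z| - lam)_+."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD solution in the orthogonal design (a > 2; a=3.7 is an assumed default)."""
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    # Middle regime 2*lam <= |z| <= a*lam
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(abs_z < 2.0 * lam, soft_threshold(z, lam),
                    np.where(abs_z <= a * lam, middle, z))

z = np.array([-3.0, 0.5, 1.5, 2.0, 3.0, 5.0])
print(soft_threshold(z, 1.0))
print(scad_threshold(z, 1.0))
```

The three scad regimes join continuously at $|z| = 2\lambda$ and $|z| = a\lambda$, which is easy to confirm by evaluating the middle expression at those two points.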

Let us first compute the influence function of the lasso. The distributional derivative of $\hat\theta_j^{lasso}$ is by definition

$$\langle\nabla\hat\theta_j^{lasso}, \phi\rangle = -\int \operatorname{sign}(f(\epsilon))(|f(\epsilon)| - \lambda)_+\, \phi'(\epsilon)\, d\epsilon. \qquad (11)$$

In order to obtain an expression for the influence function of the lasso estimator, it suffices to integrate (11) over a small interval $[0, \varepsilon]$ because we are only interested in the local behavior of the distributional derivative. Clearly for $f(0) = 0$, $\lambda > 0$ and $\varepsilon$ small enough

$$\langle\nabla\hat\theta_j^{lasso}, \phi\rangle = 0. \qquad (12)$$


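When $f(0) > 0$ and $\varepsilon > 0$ is the smallest value such that $f(\varepsilon) = 0$, we have

$$
\begin{aligned}
\langle\nabla\hat\theta_j^{lasso}, \phi\rangle &= -\int_{[0,\varepsilon]} (f(\epsilon) - \lambda)\phi'(\epsilon)\, d\epsilon = -\phi(\epsilon)(f(\epsilon) - \lambda)\Big|_0^\varepsilon + \int_{[0,\varepsilon]} f'(\epsilon)\phi(\epsilon)\, d\epsilon \\
&= -\phi(\varepsilon)(f(\varepsilon) - \lambda) + \phi(0)(f(0) - \lambda) - (A - B)\int_{[0,\varepsilon]} \phi(\epsilon)\, d\epsilon.
\end{aligned}
$$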

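Hence for $\epsilon' \in [0, \varepsilon]$ we obtain

$$\nabla\hat\theta_j^{lasso}(\epsilon') = -(A - B) - \phi(\epsilon')(f(\epsilon') - \lambda)\big\{\delta(\epsilon') - \delta(0)\big\}. \qquad (13)$$

An analogous argument yields for $f(0) < 0$ and $\epsilon'$ small enough

$$\nabla\hat\theta_j^{lasso}(\epsilon') = -(A - B) + \phi(\epsilon')(f(\epsilon') + \lambda)\big\{\delta(\epsilon') - \delta(0)\big\}. \qquad (14)$$

Noticing that $A = -\lambda\operatorname{sign}(\hat\theta_j^{lasso})$ when $\hat\theta_j^{lasso} \neq 0$, we conclude from (12)-(14) that

$$
\nabla\hat\theta_j^{lasso}(0) =
\begin{cases}
0 & \text{if } |\theta_j^{(0)}| \le \lambda, \\
\psi(z, \hat\theta_j^{lasso}) + \lambda\operatorname{sign}(\hat\theta_j^{lasso}) & \text{otherwise.}
\end{cases}
$$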
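The two regimes of the lasso derivative can be illustrated with a one-sided finite difference of $\epsilon \mapsto \operatorname{sign}(f(\epsilon))(|f(\epsilon)| - \lambda)_+$ at $\epsilon = 0$, with $f(\epsilon) = A + \epsilon(B - A)$. This is a rough numerical sketch, not the distributional argument of the proof; the values of `A`, `B`, `lam` and the step `h` are arbitrary illustrative choices. It only shows that the slope is $B - A$ when $|A| > \lambda$ and $0$ when $|A| < \lambda$.

```python
import math

def lasso_functional(eps, A, B, lam):
    """Soft-thresholded least squares coefficient along the contamination path."""
    f = A + eps * (B - A)
    return math.copysign(max(abs(f) - lam, 0.0), f)

def numerical_slope(A, B, lam, h=1e-6):
    """One-sided finite difference of the lasso functional at eps = 0."""
    return (lasso_functional(h, A, B, lam) - lasso_functional(0.0, A, B, lam)) / h

slope_active_pos = numerical_slope(A=2.0, B=5.0, lam=1.0)    # |A| > lam, expect B - A = 3
slope_active_neg = numerical_slope(A=-2.0, B=5.0, lam=1.0)   # |A| > lam, expect B - A = 7
slope_zero = numerical_slope(A=0.5, B=5.0, lam=1.0)          # |A| < lam, expect 0
print(slope_active_pos, slope_active_neg, slope_zero)
```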

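Let us now turn to the derivation of the influence function of the scad estimator. It is clear from the form of $\hat\theta_j^{scad}$ that $\nabla\hat\theta_j^{scad}(0)$ has the same form as $\nabla\hat\theta_j^{lasso}(0)$ when $|\theta_j^{(0)}| < 2\lambda$. It is also obvious that $\nabla\hat\theta_j^{scad}(0) = \psi(z, \hat\theta_j^{scad})$ when $|\theta_j^{(0)}| > a\lambda$. We therefore only have to consider the case $2\lambda \le |\theta_j^{(0)}| \le a\lambda$. Suppose that $f(0) > 0$ and denote by $\varepsilon$ the smallest positive real number such that $f(\varepsilon) = 0$. Then

$$
\begin{aligned}
\langle\nabla\hat\theta_j^{scad}, \phi\rangle &= -\int_{[0,\varepsilon]} \frac{(a-1)f(\epsilon) - a\lambda}{a-2}\, \phi'(\epsilon)\, d\epsilon \\
&= -\frac{a-1}{a-2} f(\epsilon)\phi(\epsilon)\Big|_0^\varepsilon + \frac{a-1}{a-2}\int_{[0,\varepsilon]} f'(\epsilon)\phi(\epsilon)\, d\epsilon - \frac{a\lambda}{a-2}\phi(\epsilon)\Big|_0^\varepsilon \\
&= \frac{a-1}{a-2}(A - B)\int_{[0,\varepsilon]} \phi(\epsilon)\, d\epsilon + \frac{a-1}{a-2}\big\{f(0)\phi(0) - f(\varepsilon)\phi(\varepsilon)\big\} + \frac{a\lambda}{a-2}\big\{\phi(0) - \phi(\varepsilon)\big\}.
\end{aligned}
$$

Hence for sufficiently small $\epsilon'$ we have

$$\nabla\hat\theta_j^{scad}(\epsilon') = \frac{a-1}{a-2}(A - B) + \frac{a-1}{a-2} f(\epsilon')\phi(\epsilon')\big\{\delta(0) - \delta(\epsilon')\big\} + \frac{a\lambda}{a-2}\phi(\epsilon')\big\{\delta(0) - \delta(\epsilon')\big\}$$

and thus

$$\nabla\hat\theta_j^{scad}(0) = \frac{a-1}{a-2}\big\{\psi(z, \hat\theta_j^{scad}) + p_\lambda'(\hat\theta_j^{scad})\big\}.$$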

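The same expression is obtained when $f(0) < 0$. Remember that for $t > 0$ the scad penalty is defined through its derivative

$$p_\lambda'(t) = \lambda\Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\}.$$

Hence for $2\lambda \le t \le a\lambda$ we have $p_\lambda''(t) = -1/(a-1)$, so that $1 + p_\lambda''(t) = 1 - 1/(a-1) = (a-2)/(a-1)$.

This completes the proof.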
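The factor $(a-1)/(a-2)$ and the identity $1 + p_\lambda''(t) = (a-2)/(a-1)$ can be checked numerically in the middle scad regime. The sketch below uses central finite differences; the values `a = 3.7`, `lam = 1.0`, the evaluation point `3.0` and the step `h` are illustrative assumptions, and the function names are hypothetical.

```python
def scad_penalty_derivative(t, lam, a):
    """p'_lambda(t) for t > 0: lam on (0, lam], then (a*lam - t)_+ / (a - 1)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

def scad_threshold(z, lam, a):
    """SCAD solution in the orthogonal design for a scalar z."""
    sign = 1.0 if z >= 0 else -1.0
    if abs(z) < 2.0 * lam:
        return sign * max(abs(z) - lam, 0.0)
    if abs(z) <= a * lam:
        return ((a - 1.0) * z - sign * a * lam) / (a - 2.0)
    return z

a, lam, point, h = 3.7, 1.0, 3.0, 1e-6   # point lies in [2*lam, a*lam]

# Central differences: slope of the estimator and second derivative of the penalty
threshold_slope = (scad_threshold(point + h, lam, a) - scad_threshold(point - h, lam, a)) / (2 * h)
penalty_second = (scad_penalty_derivative(point + h, lam, a)
                  - scad_penalty_derivative(point - h, lam, a)) / (2 * h)

print(threshold_slope, (a - 1.0) / (a - 2.0))
print(1.0 + penalty_second, (a - 2.0) / (a - 1.0))
```

The two printed pairs agree up to rounding, and the estimator slope is the reciprocal of $1 + p_\lambda''(t)$ in this regime.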

Proof of Lemma 6

The idea of the proof is similar to the one given in Lemma 2, but using the variant of Berge’s maximum theorem given in the Appendix.

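Let $\epsilon \in (-\varepsilon, \varepsilon)$ for $\varepsilon > 0$, $b(\epsilon) = \inf_{\theta\in\Theta} \Lambda_\lambda(\theta; F_\epsilon, p)$ and $f(\theta, \epsilon) = b(\epsilon) - \Lambda_\lambda(\theta; F_\epsilon, p)$. The mapping $\Gamma : (-\varepsilon, \varepsilon) \mapsto \Theta$, $\Gamma_\epsilon = \{\theta \mid \theta \in \Theta, f(\theta, \epsilon) \le 0\}$ is closed by construction (Berge [1997], Example, p. 111). Therefore $\Gamma_\epsilon$ is compact for any $\epsilon$, which implies that $\Gamma$ is continuous (Berge [1997], Example, p. 109). From Proposition 7 in the Appendix, $M(\epsilon) = \max\{-E_{F_\epsilon}[L(Z, \theta)] \mid \theta \in \Gamma_\epsilon\}$ is continuous in $(-\varepsilon, \varepsilon)$ and the mapping $\phi_\epsilon = \{\theta \mid \theta \in \Gamma_\epsilon, E_{F_\epsilon}[L(Z, \theta)] = M(\epsilon)\}$ is upper semi-continuous from $(-\varepsilon, \varepsilon)$ to $\Theta$. Let $\{\epsilon_m\}_{m\ge1}$ be a sequence converging to $\epsilon \in (-\varepsilon, \varepsilon)$ as $m \to \infty$, and define $\theta_m = \{\theta \mid \theta \in \phi_{\epsilon_m}, E_{F_{\epsilon_m}}[L(Z, \theta)] = \sup M(\epsilon_m)\}$ and $\theta^* = \{\theta \mid \theta \in \phi_\epsilon, E_{F_\epsilon}[L(Z, \theta)] = \sup M(\epsilon)\}$. Then from Lemma 7 and Lemma 8 in the Appendix we have that $\lim_m \theta_m = \theta^* \in \Gamma_\epsilon = \lim_m \Gamma_{\epsilon_m}$.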

Proof of Proposition 6

By Lemma 6, $T(F_\epsilon)$ is a continuous function of $\epsilon$ in a neighborhood of 0. Therefore $T(F_\epsilon)$ is integrable and the result follows from Lemma 10 in the Appendix.


Appendix

For completeness, we recall some definitions and results from Berge [1997] and distribution theory (Schwartz [1959]).

Auxiliary results for Berge’s maximum theorem

This material is needed for the proofs of Lemma 2, Lemma 4 and Lemma 6.

Let $\Gamma$ be a mapping of the topological space $X$ to the topological space $Y$. We say that $\Gamma$ is lower semi-continuous at $x_0 \in X$ if for each open set $O$ meeting $\Gamma x_0$ there is a neighborhood $N(x_0)$ such that $x \in N(x_0) \Rightarrow \Gamma x \cap O \neq \emptyset$. We say that $\Gamma$ is upper semi-continuous at $x_0 \in X$ if for each open set $O$ containing $\Gamma x_0$ there is a neighborhood $N(x_0)$ such that $x \in N(x_0) \Rightarrow \Gamma x \subset O$. If $\Gamma$ is both lower and upper semi-continuous at $x_0$ we say that $\Gamma$ is continuous at $x_0$.

If $\Gamma$ is lower semi-continuous at each point of $X$ it is called lower semi-continuous in $X$. We say that $\Gamma$ is upper semi-continuous in $X$ if it is upper semi-continuous at each point of $X$ and if also $\Gamma x$ is compact for each $x \in X$. If $\Gamma$ is both lower and upper semi-continuous in $X$, then it is called continuous in $X$.

Berge's maximum theorem : (Berge [1997], p. 116) If $h$ is a continuous function in $Y$ and $\Gamma$ is a continuous mapping of $X$ to $Y$ such that, for each $x$, $\Gamma x \neq \emptyset$, then the function $M$ defined by $M(x) = \max\{h(y) \mid y \in \Gamma x\}$ is continuous in $X$ and the mapping $\phi$ defined by $\phi x = \{y \mid y \in \Gamma x, h(y) = M(x)\}$ is an upper semi-continuous mapping of $X$ into $Y$.

Lemma 7 : (Berge [1997], Theorem 4, p. 111) If $\Gamma$ is a closed mapping then

$$
\left.
\begin{array}{l}
x_n \to x_0 \\
y_n \to y_0 \\
\forall n : y_n \in \Gamma x_n
\end{array}
\right\} \Rightarrow y_0 \in \Gamma x_0.
$$

Lemma 8 : (Berge [1997], Theorem 6, p. 112) Every upper semi-continuous mapping is closed.


Lemma 9 : (Berge [1997], Theorem 7, p. 112) If $\Gamma_1$ is a closed mapping of $X$ into $Y$ and $\Gamma_2$ is an upper semi-continuous mapping of $X$ into $Y$, then the mapping $\Gamma = \Gamma_1 \cap \Gamma_2$ is upper semi-continuous.

We provide a variant of Berge’s maximum theorem. Its proof is similar to the one of the original theorem and is given for completeness.

Proposition 7 : If $h$ is a continuous function in $X \times Y$ and $\Gamma$ is a continuous mapping of $X$ to $Y$ such that, for each $x$, $\Gamma x \neq \emptyset$, then the function $M$ defined by $M(x) = \max\{h(x, y) \mid y \in \Gamma x\}$ is continuous in $X$ and the mapping $\phi$ defined by $\phi x = \{y \mid y \in \Gamma x, h(x, y) = M(x)\}$ is an upper semi-continuous mapping of $X$ into $Y$.

Proof : The function $h$ is continuous in $X \times Y$ and so $M$ is a continuous function. Furthermore the mapping $\Delta$ given by

$$\Delta x = \{y \mid M(x) - h(x, y) \le 0\}$$

is closed (Berge [1997], Example, p. 111). Hence by Lemma 9, $\phi = \Gamma \cap \Delta$ is upper semi-continuous.

Auxiliary results from distribution theory

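Define $C_c^\infty(\Omega) = \{f : \Omega \to \mathbb{R} \mid f \in C^\infty(\Omega), \text{ with compact support}\}$ where $\Omega$ is an open set in $\mathbb{R}^d$.

Definition : A sequence $\{\phi_n\}_{n\ge1}$ of functions $\phi_n \in C_c^\infty(\Omega)$ converges to $\phi \in C_c^\infty(\Omega)$ in the sense of test functions if:

(a) there exists $\Omega' \Subset \Omega$ such that $\operatorname{supp}\phi_n \subset \Omega'$ for every $n \in \mathbb{N}$;

(b) $\partial^\alpha\phi_n \to \partial^\alpha\phi$ as $n \to \infty$ uniformly on $\Omega$ for every multi-index $\alpha \in \mathbb{N}^d$.

The topological vector space $\mathcal{D}(\Omega)$ consists of $C_c^\infty(\Omega)$ equipped with the topology that corresponds to convergence in the sense of test functions. A linear functional on $\mathcal{D}(\Omega)$ is a linear map $T : \mathcal{D}(\Omega) \to \mathbb{R}$. The value of $T$ acting on a test function $\phi$ is denoted by $\langle T, \phi\rangle$. Therefore if $T$ is linear we have

$$\langle T, \lambda\phi + \mu\psi\rangle = \lambda\langle T, \phi\rangle + \mu\langle T, \psi\rangle, \quad \text{for all } \lambda, \mu \in \mathbb{R} \text{ and } \phi, \psi \in \mathcal{D}(\Omega).$$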


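A functional $T$ is continuous if $\phi_n \to \phi$ in the sense of test functions implies that $\langle T, \phi_n\rangle \to \langle T, \phi\rangle$ in $\mathbb{R}$.

Definition : A distribution on $\Omega$ is a continuous linear functional $T : \mathcal{D}(\Omega) \to \mathbb{R}$.

A sequence of distributions $\{T_n\}$ converges to $T$ if $\langle T_n, \phi\rangle \to \langle T, \phi\rangle$ for every $\phi \in \mathcal{D}(\Omega)$. The topological vector space $\mathcal{D}'(\Omega)$ consists of the distributions on $\Omega$ equipped with the topology corresponding to this notion of convergence. An important example of a distribution is the Dirac delta function supported at $x \in \Omega$, which is the distribution $\delta_x : \mathcal{D}(\Omega) \to \mathbb{R}$ defined by $\langle\delta_x, \phi\rangle = \phi(x)$.

Definition : For $1 \le i \le d$, the $i$th partial derivative of a distribution $T \in \mathcal{D}'(\Omega)$ is the distribution $\partial_i T \in \mathcal{D}'(\Omega)$ defined by $\langle\partial_i T, \phi\rangle = -\langle T, \partial_i\phi\rangle$ for all $\phi \in \mathcal{D}(\Omega)$. For $\alpha \in \mathbb{N}^d$, the derivative $\partial^\alpha T \in \mathcal{D}'(\Omega)$ of order $|\alpha|$ is defined by $\langle\partial^\alpha T, \phi\rangle = (-1)^{|\alpha|}\langle T, \partial^\alpha\phi\rangle$ for all $\phi \in \mathcal{D}(\Omega)$.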
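To make the definition concrete, the sketch below checks numerically that the distributional derivative of $g(x) = |x|$ acts like $\operatorname{sign}(x)$, i.e. that $-\int |x|\phi'(x)\,dx = \int \operatorname{sign}(x)\phi(x)\,dx$ for one particular test function. The choice of test function (a standard bump centred at the arbitrary point `0.3`), the grid and the hand-written trapezoidal rule are all illustrative assumptions.

```python
import numpy as np

def bump(u):
    """Standard test function exp(-1/(1-u^2)) on (-1, 1), zero elsewhere."""
    out = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def bump_prime(u):
    """Derivative of the bump: bump(u) * (-2u / (1-u^2)^2) on (-1, 1)."""
    out = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    ui = u[inside]
    out[inside] = np.exp(-1.0 / (1.0 - ui ** 2)) * (-2.0 * ui / (1.0 - ui ** 2) ** 2)
    return out

def trapezoid(y, x):
    """Composite trapezoidal rule written out explicitly."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

center = 0.3                              # shift so the test function is not symmetric about 0
x = np.linspace(-1.0, 2.0, 300001)        # grid covering the support (-0.7, 1.3)

lhs = -trapezoid(np.abs(x) * bump_prime(x - center), x)   # -<g, phi'>
rhs = trapezoid(np.sign(x) * bump(x - center), x)         # <sign, phi>
print(lhs, rhs)
```

The two numbers agree up to discretization error, as the integration by parts behind the definition predicts.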

We recall that for $1 \le p < \infty$, the space $L^p(\Omega)$ consists of all $p$-integrable functions, i.e. the Lebesgue measurable functions $f : \Omega \to \mathbb{R}$ such that $\int_\Omega |f|^p\,dx < \infty$. Equipped with the norm $\|f\|_p = \big(\int_\Omega |f|^p\,dx\big)^{1/p}$ they constitute a Banach space. Note that any (locally) integrable function defines a regular distribution $T_f \in \mathcal{D}'(\Omega)$ by $\langle T_f, \phi\rangle = \int_\Omega f\phi\,dx$. Typically the function $f$ and the distribution $T_f$ are regarded as equivalent.

Lemma 10 : Let $f \in L^1(\Omega)$ and $\{f_n\}$ be a sequence of functions in $C^\infty(\Omega)$ such that $f_n \to f$ and $\partial^\alpha f_n \to g$ in $L^1(\Omega)$. Then $f$ has distributional derivative given by $g = \partial^\alpha f \in L^1(\Omega)$.

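Proof : Note that since $f_n \to f$ in $L^1(\Omega)$ and $\phi \in C_c^\infty(\Omega)$, we have

$$\int_\Omega f_n\phi\,dx \;\xrightarrow[n\to\infty]{}\; \int_\Omega f\phi\,dx,$$

because for $K = \operatorname{supp}\phi$

$$\left|\int_\Omega f_n\phi\,dx - \int_\Omega f\phi\,dx\right| = \left|\int_K (f_n - f)\phi\,dx\right| \le \sup_K|\phi| \int_K |f_n - f|\,dx \to 0.$$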

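Hence, for every $\phi \in C_c^\infty(\Omega)$, the convergence of $f_n$ and $\partial^\alpha f_n$ implies that

$$\int_\Omega f\,\partial^\alpha\phi\,dx = \lim_{n\to\infty}\int_\Omega f_n\,\partial^\alpha\phi\,dx = (-1)^{|\alpha|}\lim_{n\to\infty}\int_\Omega \partial^\alpha f_n\,\phi\,dx = (-1)^{|\alpha|}\int_\Omega g\phi\,dx.$$

Therefore the distributional derivative of $f$ is $\partial^\alpha f = g$.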

Acknowledgements

The author is grateful to Elvezio Ronchetti for helpful discussions and suggestions, including a key contribution to an early development of the idea of limiting influence function.

References

A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248, 2013.

M. Avella-Medina and E. Ronchetti. Robust and consistent variable selection for generalized linear and additive models. Manuscript, 2014.

T. Bednarski. Fréchet differentiability of statistical functionals and implications to robust statistics. New Directions in Statistical Data Analysis and Robustness, pages 25–34, 1993.

C. Berge. Topological Spaces: including a treatment of multi-valued functions, vector spaces, and convexity. Dover Publications, 1997.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011.

B. Clarke. Nonsmooth analysis and Fréchet differentiability of M-functionals. Probability Theory and Related Fields, 73(2):197–209, 1986.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.