Sparse Regularization: Convergence Of Iterative Jumping Thresholding Algorithm

Jinshan Zeng, Shaobo Lin, and Zongben Xu

Abstract—In recent studies on sparse modeling, non-convex penalties have received considerable attention due to their superiority over convex counterparts in inducing sparsity. In this paper, we study the convergence of a non-convex iterative thresholding algorithm for solving a class of sparse regularized optimization problems whose thresholding functions are discontinuous with jump discontinuities. Therefore, we call the algorithm the iterative jumping thresholding (IJT) algorithm. The finite support and sign convergence of IJT algorithm is first verified by taking advantage of such jump discontinuities. Together with the introduced restricted Kurdyka–Łojasiewicz (rKL) property, the global convergence¹ of the entire sequence can then be proved. Furthermore, we show that IJT algorithm converges to a strictly local minimizer at an eventual linear rate² under some additional conditions. Moreover, we derive an a posteriori computable error estimate, which can be used to design an efficient termination rule. It should be pointed out that the ℓq quasi-norm (0 < q < 1) is an important subclass of the non-convex penalties studied in this paper. In particular, when applied to ℓq regularization, IJT algorithm can converge to a local minimizer with an eventual linear rate under certain concentration conditions.

We also apply the proposed algorithm to sparse signal recovery and synthetic aperture radar imaging problems. The experimental results demonstrate the effectiveness of the proposed algorithm.

Index Terms—Sparse regularization, non-convex optimization, iterative thresholding algorithm, ℓq regularization (0 < q < 1), Kurdyka–Łojasiewicz inequality.

I. INTRODUCTION

THE sparse regularized optimization problems emerging in many areas of scientific research and engineering practice have attracted considerable attention in recent years. Typical applications include regression [37], visual coding [32], signal processing [20], compressed sensing [10], [23], and microwave imaging [40]. These problems can be intuitively modeled as the following ℓ0 quasi-norm regularized optimization problem

$$\min_{x\in\mathbb{R}^N}\ \{F(x) + \lambda\|x\|_0\}, \qquad (1)$$

Manuscript received July 27, 2015; revised March 14, 2016 and May 26, 2016; accepted July 17, 2016. Date of publication July 28, 2016; date of current version August 08, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ami Wiesel. The work of J. Zeng was supported in part by the National Science Foundation (NSF) under Grant 11501440. The work of S. Lin was supported in part by the NSF under Grants 61502342 and 11401462. (Corresponding author: Shaobo Lin.)

J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, China (e-mail: jsh.zeng@gmail.com).

S. Lin is with the College of Mathematics and Information Science, Wenzhou University, Wenzhou 325035, China (e-mail: sblin1983@gmail.com).

Z. Xu is with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: zbxu@mail.xjtu.edu.cn).

Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2016.2595499

¹The global convergence in this paper is defined in the sense that the entire sequence converges regardless of the initial point.

²It is also known as the asymptotic or local linear rate in other papers.

where F : R^N → [0,∞) is a proper lower semi-continuous function, ‖x‖_0, commonly called the ℓ0 quasi-norm, denotes the number of nonzero components of x, and λ > 0 is a regularization parameter. Some efficient algorithms, including the iterative hard thresholding algorithm [3], [29], were developed to solve (1).

Besides the ℓ0 regularized optimization problem, a more general class of problems is considered in both practice and theory, that is,

$$\min_{x\in\mathbb{R}^N}\ \{F(x) + \lambda\Phi(x)\}, \qquad (2)$$

where Φ(x) is a certain separable, continuous penalty with Φ(x) = Σ_{i=1}^{N} φ(|x_i|) and x = (x_1, ···, x_N)^T. One of the most important cases is the ℓ1-norm with Φ(x) = ‖x‖_1 = Σ_{i=1}^{N} |x_i|. The ℓ1-norm is convex, and thus the corresponding ℓ1-norm regularized optimization problem can be efficiently solved. Nevertheless, the ℓ1-norm may not induce adequate sparsity when applied to certain applications [13]. Alternatively, many non-convex penalties were proposed as relaxations of the ℓ0 quasi-norm. Some typical non-convex examples are the ℓq quasi-norm (0 < q < 1) [13], [14], [39], the smoothly clipped absolute deviation (SCAD) [24], and the log-sum penalty [11]. Compared with the ℓ1-norm, the non-convex penalties can usually induce better sparsity, while the corresponding non-convex regularized optimization problems are generally more difficult to solve.

There are mainly four classes of algorithms for solving the non-convex regularized optimization problem (2). The first one is the half-quadratic (HQ) algorithm [26], [27]. HQ algorithms can be efficient when both subproblems are easy to solve (particularly, when both subproblems have closed-form solutions). The second class is the iterative reweighted algorithms, including iterative reweighted least squares (IRLS) minimization ([15], [21], [28], [30]) and iterative reweighted ℓ1-minimization (IRL1) [11]. The basic idea of the iterative reweighted algorithms is to obtain an approximate sparse solution via solving a sequence of weighted least squares (or ℓ1-minimization) problems. The third class is the difference of convex functions algorithm (DC programming) [25]. The DC programming method first converts the original problem into the difference of two convex problems (called the primal and dual problems, respectively), and then iteratively optimizes these two problems. The last class is the iterative thresholding algorithm, which fits the framework of the forward-backward splitting (FBS) algorithm [2] and the generalized gradient projection method [7] when applied to a separable non-convex penalty. Some typical iterative thresholding algorithms include the iterative hard [3], soft [22] and half [39] thresholding algorithms.


Compared to other types of non-convex algorithms such as the HQ, IRLS, IRL1 and DC programming algorithms, the iterative thresholding algorithm is easy to implement and has almost the least computational complexity for large-scale problems (see [40] for instance).

Although the effectiveness of the iterative thresholding algorithms for non-convex regularized optimization problems has been verified in many applications, the convergence of most of these algorithms, except for the iterative hard [29] and half [41] thresholding algorithms, has not been thoroughly investigated. Basically, three questions should be answered: when, where, and how fast does the algorithm converge?

A. Main Contribution

In this paper, we give the convergence analysis for the iterative jumping thresholding algorithm (called IJT algorithm henceforth) for solving a certain class of non-convex regularized optimization problems. The main contributions can be summarized as follows:

a) We prove that the supports and signs of any sequence generated by IJT algorithm converge within finitely many iterations.

b) Under the further assumption that there exists a limit point at which the objective function satisfies the so-called restricted Kurdyka–Łojasiewicz (rKL) property (see Definition 2), the whole sequence converges to this point (see Theorem 1).

c) Under certain second-order conditions, we demonstrate that IJT algorithm converges to a strictly local minimizer at an eventual linear rate (see Theorems 2 and 3).

d) When applied to the ℓq (0 < q < 1) regularization, IJT algorithm converges to a local minimizer at an eventual linear rate as long as the matrix satisfies a certain concentration property (see Theorem 4).

B. Notations and Organization

We denote by R, N and C the sets of real numbers, natural numbers and complex numbers, respectively. For any vector x ∈ R^N, x_i is its ith component, and for a given index set I ⊂ I_N ≜ {1, . . . , N}, x_I represents its subvector containing all the components restricted to I. I^c represents the complementary set of I, i.e., I^c = I_N \ I. ‖x‖_2 represents the Euclidean norm of a vector x. Supp(x) is the support of x, i.e., Supp(x) = {i : |x_i| > 0, i = 1, . . . , N}. For any matrix A ∈ R^{N×N}, σ_i(A) and σ_min(A) (λ_i(A) and λ_min(A)) denote the ith and the minimal singular values (eigenvalues) of A, respectively. Similar to the vector case, for a given index set I, A_I represents the submatrix of A containing all the columns restricted to I. For any z ∈ R, sign(z) denotes its sign function, i.e.,

$$\mathrm{sign}(z) = \begin{cases} 1, & \text{for } z > 0 \\ 0, & \text{for } z = 0 \\ -1, & \text{for } z < 0. \end{cases}$$

The remainder of this paper is organized as follows. In Section II, we give the problem settings and then introduce IJT algorithm with some basic properties. In Section III, we give the convergence analysis of IJT algorithm. In Section IV, we apply the developed theoretical analysis to the ℓq (0 < q < 1) regularization. In Section V, we give some related works and comparisons. In Section VI, we present some applications to show the effectiveness of the proposed algorithm. We conclude this paper in Section VII. The proofs are presented in the Appendix.

II. IJT ALGORITHM

A. Problem Settings

We make several assumptions on the concerned problem

$$\min_{x\in\mathbb{R}^N}\ \{T_\lambda(x) = F(x) + \lambda\Phi(x)\}, \qquad (3)$$

where Φ(x) is separable with Φ(x) = Σ_{i=1}^{N} φ(|x_i|).

Assumption 1: F : R^N → [0,∞) is continuously differentiable with Lipschitz continuous gradient, i.e., it holds that

$$\|\nabla F(u) - \nabla F(v)\|_2 \le L\|u - v\|_2, \quad \forall u, v \in \mathbb{R}^N,$$

where L > 0 is the Lipschitz constant.

Note that Assumption 1 is a general assumption on F. For example, the least squares and logistic loss functions used in machine learning are two typical cases.
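As a simple worked instance (ours, not taken from the text), consider the least squares loss; a direct computation gives its gradient Lipschitz constant:

$$F(x) = \tfrac{1}{2}\|Ax - y\|_2^2, \quad \nabla F(x) = A^T(Ax - y), \quad \|\nabla F(u) - \nabla F(v)\|_2 = \|A^TA(u - v)\|_2 \le \|A\|_2^2\,\|u - v\|_2,$$

so Assumption 1 holds with L = ‖A‖_2^2 = λ_max(A^T A), which is consistent with the step-size condition 0 < μ < 1/‖A‖_2^2 used for ℓq regularization in Section IV.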

Assumption 2: φ : [0,∞) → [0,∞) is continuous and satisfies the following assumptions:

a) φ is non-decreasing with φ(0) = 0 and φ(z) → ∞ as z → ∞.

b) For each b > 0, there exists an a > 0 such that φ(z) ≥ az² for z ∈ [0, b].

c) φ is differentiable on (0,∞), and its first derivative φ′ is strictly convex with φ′(z) → ∞ as z → 0 and lim_{z→∞} φ′(z)/z = 0.

d) φ has a continuous second derivative φ″ on (0,∞).

Most of the above assumptions were considered in [7]. It can be observed that Assumption 2(a) ensures the coercivity of φ, and thus the existence of a minimizer. Assumption 2(b) guarantees the weak sequential lower semi-continuity of φ in ℓ2, and Assumption 2(c) is assumed in order to induce sparsity.

In practice, there are many non-convex functions satisfying Assumption 2. Two of the most typical subclasses are φ(z) = z^q and φ(z) = log(1 + z^q) with q ∈ (0,1), as shown in Fig. 1.

B. IJT Algorithm

In order to describe IJT algorithm, we need to define the following proximity operator of Φ,

$$\mathrm{Prox}_{\mu,\lambda\Phi}(x) = \arg\min_{u\in\mathbb{R}^N}\left\{\frac{\|x - u\|_2^2}{2\mu} + \lambda\Phi(u)\right\}, \qquad (4)$$

where μ > 0 is a parameter. Since Φ is separable, computing Prox_{μ,λΦ} can be reduced to solving a one-dimensional minimization problem, that is,

$$\mathrm{prox}_{\mu,\lambda\phi}(z) = \arg\min_{v\in\mathbb{R}}\left\{\frac{|z - v|^2}{2\mu} + \lambda\phi(|v|)\right\}. \qquad (5)$$

Therefore,

$$\mathrm{Prox}_{\mu,\lambda\Phi}(x) = (\mathrm{prox}_{\mu,\lambda\phi}(x_1), \cdots, \mathrm{prox}_{\mu,\lambda\phi}(x_N))^T. \qquad (6)$$

We list some useful results on prox_{μ,λφ} obtained in [7].


Fig. 1. Typical φ satisfying Assumption 2 and the corresponding thresholding functions. We plot φ(|z|) = |z|^{1/2}, |z|^{2/3}, log(1 + |z|^{1/3}), and their corresponding thresholding functions. For comparison, we also plot two well-known cases, i.e., the ℓ0-norm with φ(|z|) = 1_{|z|>0} (the indicator function of |z| > 0) and the ℓ1-norm with φ(|z|) = |z|, together with their corresponding thresholding functions. (a) Typical penalty functions. (b) Thresholding functions.

Lemma 1 ([7, Lemmas 3.2 and 3.3]): Assume that φ satisfies Assumption 2. Then

a) for each μ > 0, the function ρ_μ : z → z + λμφ′(z) is well defined on R_+;

b) the function ψ : z → 2(φ(z) − zφ′(z))/z² is strictly decreasing and one-to-one from (0,∞) to (0,∞);

c) for any z > 0, φ″(z) is negative and monotonically increasing;

d) prox_{μ,λφ} is well defined and can be specified as

$$\mathrm{prox}_{\mu,\lambda\phi}(z) = \begin{cases} \mathrm{sign}(z)\,\rho_\mu^{-1}(|z|), & \text{for } |z| \ge \tau_\mu \\ 0, & \text{for } |z| \le \tau_\mu \end{cases}, \qquad (7)$$

for any z ∈ R, with

$$\tau_\mu = \rho_\mu(\eta_\mu) \quad\text{and}\quad \eta_\mu = \psi^{-1}\big((\lambda\mu)^{-1}\big). \qquad (8)$$

Moreover, the range of prox_{μ,λφ} is {0} ∪ [η_μ, +∞).

It can be observed that the proximity operator is discontinuous with a jump discontinuity, which is one of the most significant features of the class of non-convex penalties studied in this paper. Henceforth, we call prox_{μ,λφ} the jumping thresholding function. Moreover, it can be easily checked that the proximity operator is not nonexpansive in general. (Some specific proximity operators are shown in Fig. 1(b).)
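As an illustration of this jump behavior (a sketch of ours, not the authors' code), the one-dimensional problem (5) can be minimized by brute force on a grid; with φ(z) = z^{1/2}, the output stays at zero up to a threshold and then jumps to a value of magnitude at least η_μ. The grid parameters below are illustrative assumptions.

```python
import numpy as np

def jumping_threshold(z, mu, lam, phi, grid_size=200001):
    """Numerically evaluate the scalar proximity operator (5),
    prox_{mu,lam*phi}(z) = argmin_v |z - v|^2/(2*mu) + lam*phi(|v|),
    by brute-force minimization over a fine grid (illustration only)."""
    v_max = abs(z) + 1.0                      # the minimizer satisfies |v| <= |z|
    v = np.linspace(-v_max, v_max, grid_size)
    obj = (z - v) ** 2 / (2.0 * mu) + lam * phi(np.abs(v))
    return v[np.argmin(obj)]

if __name__ == "__main__":
    mu, lam = 1.0, 1.0
    phi = lambda t: np.sqrt(t)                # phi(z) = z^{1/2}, satisfies Assumption 2
    zs = np.linspace(0.0, 3.0, 301)
    vs = np.array([jumping_threshold(z, mu, lam, phi) for z in zs])
    # The output is 0 below the threshold tau_mu and jumps to a value
    # no smaller than eta_mu just above it (the jump discontinuity).
    first_nonzero = np.argmax(vs > 0)
    print("approximate jump location (tau_mu):", zs[first_nonzero])
    print("value just after the jump (>= eta_mu):", vs[first_nonzero])
```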

Formally, the iterative form of IJT algorithm can be expressed as follows:

$$x^{n+1} \in \mathrm{Prox}_{\mu,\lambda\Phi}\big(x^n - \mu\nabla F(x^n)\big), \qquad (9)$$

where μ > 0 is a step size parameter. For simplicity, we define

$$G_{\mu,\lambda\Phi}(x) = \mathrm{Prox}_{\mu,\lambda\Phi}\big(x - \mu\nabla F(x)\big), \quad x \in \mathbb{R}^N,$$

and its fixed point set F_μ ≜ {x : x = G_{μ,λΦ}(x)}.

C. Some Basic Properties of IJT Algorithm

Property 1: Let x* be a fixed point of G_{μ,λΦ} and {x^n} be a sequence generated by IJT algorithm. Then it holds that

a) for any i ∈ Supp(x*), |x*_i| ≥ η_μ and [∇F(x*)]_i + λ sign(x*_i)φ′(|x*_i|) = 0; and for any i ∈ Supp(x*)^c, |x*_i| = 0 and |[∇F(x*)]_i| ≤ τ_μ/μ;

b) for any i ∈ Supp(x^{n+1}), |x^{n+1}_i| ≥ η_μ and x^{n+1}_i + λμ sign(x^{n+1}_i)φ′(|x^{n+1}_i|) = x^n_i − μ[∇F(x^n)]_i; and for any i ∈ Supp(x^{n+1})^c, |x^{n+1}_i| = 0 and |x^n_i − μ[∇F(x^n)]_i| ≤ τ_μ, for n ∈ N,

where [∇F(x*)]_i and [∇F(x^{n+1})]_i represent the ith components of ∇F(x*) and ∇F(x^{n+1}), respectively.

This property can be easily derived from the definition of the proximity operator and Lemma 1(d). Actually, Property 1(a) is a certain type of optimality condition for problem (3). We call x a stationary point of (3) if x satisfies Property 1(a), and we denote by Ω_μ the stationary point set for a given μ. Then, according to Property 1(a), it holds that F_μ ⊂ Ω_μ.

Property 2: Let {x^n} be a sequence generated by IJT algorithm with a bounded initialization. Assume that 0 < μ < 1/L. Then it holds that

a) T_λ(x^{n+1}) ≤ T_λ(x^n) − (1/2)(1/μ − L)‖x^{n+1} − x^n‖_2^2, and there exists a constant T*_λ such that lim_{n→∞} T_λ(x^n) → T*_λ;

b) ‖x^{n+1} − x^n‖_2 → 0 as n → ∞;

c) each accumulation point of {x^n} is a fixed point of G_{μ,λΦ};

d) if {x^n} possesses an isolated accumulation point, then the whole sequence converges to some x* ∈ F_μ.

This property follows from [7, Propositions 2.1, 2.3 and Corollary 2.1] with μ_n ≡ μ. Property 2(a) is commonly called the sufficient decrease property, which is a basic property desired for a descent method. Let X be the accumulation point set of {x^n}; then by Property 2(c), X ⊂ F_μ, and further by Property 1(a), X ⊂ Ω_μ.

Property 3: Suppose that 0 < μ < 1/L. Then each global minimizer of T_λ is a fixed point of G_{μ,λΦ}. Let M be the set of global minimizers; then M ⊂ F_μ.

Property 3 is a corollary of [7, Proposition 2.2] with a uniform step size. From Properties 2 and 3, the following relations hold:

X ⊂ F_μ, M ⊂ F_μ and F_μ ⊂ Ω_μ.

III. CONVERGENCE ANALYSIS

In this section, we answer the basic questions concerning IJT algorithm raised in the introduction, i.e., when, where and how fast does the algorithm converge?

A. rKL Property

The Kurdyka–Łojasiewicz (KL) property has been widely used to prove the convergence of non-convex algorithms (see [2] for instance).

Definition 1 (KL property): A function f : R^N → R ∪ {+∞} is said to have the KL property at x* ∈ dom(∂f) if there exist η ∈ (0, +∞], a neighborhood U of x* and a continuous concave function ϕ : [0, η) → R_+ such that:

i) ϕ(0) = 0 and ϕ is C^1 on (0, η);

ii) for all s ∈ (0, η), ϕ′(s) > 0;

iii) for all x in U ∩ {x : f(x*) < f(x) < f(x*) + η}, the KL inequality holds:

$$\varphi'\big(f(x) - f(x^*)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \ge 1. \qquad (10)$$

(4)

Proper lower semi-continuous functions which satisfy the KL inequality at each point of dom(∂f) are called KL functions.

The KL property of f at some point x* means that "f is amenable to sharpness at x*" [6], and the KL inequality (10) is equivalent to

$$\mathrm{dist}\big(0, \partial\big(\varphi(f(x) - f(x^*))\big)\big) \ge 1, \qquad (11)$$

for all x ∈ U ∩ {x : f(x*) < f(x) < f(x*) + η} (simply use the "one-sided" chain rule [35, Theorem 10.6]). KL functions include real analytic functions, semialgebraic functions and locally strongly convex functions (more information can be found in Sec. 2.2 of [38] and the references therein). However, according to [4] (Sec. 1, page 1), some simple functions, such as f(x) = exp(−1/x²), ∀x ∈ R, are not KL functions, and in the later proof of Proposition 1 (see Appendix B), a class of simple functions are shown to be not KL functions.

Motivated by this, in this paper we introduce another related but weaker property, called the rKL property. Before formally stating the definition of the rKL property, we define a projection mapping associated with an index set I ⊂ I_N,

$$P_I : \mathbb{R}^N \to \mathbb{R}^{|I|}, \quad P_I x = x_I, \quad \forall x \in \mathbb{R}^N.$$

We also denote by P_I^T the transpose of P_I,

$$P_I^T : \mathbb{R}^{|I|} \to \mathbb{R}^N, \quad (P_I^T z)_I = z \ \text{ and }\ (P_I^T z)_{I^c} = 0, \quad \forall z \in \mathbb{R}^{|I|},$$

where |I| is the cardinality of I and I^c = I_N \ I.

Definition 2 (rKL property): A function f : R^N → R ∪ {+∞} is said to have the rKL property at x* ∈ dom(∂f) if g : R^{|I|} → R ∪ {+∞}, g(z) = f(P_I^T z), satisfies the KL property at z* = x*_I with I = Supp(x*).

From Definition 2, the rKL property only requires that the subgradient of f with respect to the nonzero variables becomes sharp after a certain concave transform, while the KL property requires such a property for all variables around the point. In the following, we give a sufficient condition for the rKL property.

Lemma 2: Given an index set I ⊂ I_N, consider the function g(z) = f(P_I^T z). Assume that z* is a stationary point of g (i.e., ∇g(z*) = 0), and that g is twice continuously differentiable in a neighborhood of z*, i.e., in B(z*, ε_0) for some ε_0 > 0. Moreover, if ∇²g(z*) is nonsingular, then f satisfies the rKL property at P_I^T z*. Actually, it holds that

$$|g(z) - g(z^*)| \le C\|\nabla g(z)\|_2^2, \quad \forall z \in B(z^*, \varepsilon),$$

for some 0 < ε < ε_0 and a positive constant C > 0.

The proof of this lemma is shown in Appendix A. Then we present a proposition to show that rKL property is an extension of KL property.

Proposition 1 (rKL is a generalization of KL): If f satisfies the KL property at x*, then f satisfies the rKL property at x*, but not vice versa.

The proof of this proposition is presented in Appendix B. According to the proof procedure of Proposition 1, the conditions listed in Lemma 2 are essential for the rKL property, in the sense that there exists a function satisfying the conditions of Lemma 2 but not the KL property.

B. Convergence of Entire Sequence

Lemma 3 (Finite Support Convergence): Let {x^n} be a sequence generated by IJT algorithm and I_n = Supp(x^n). Assume that 0 < μ < 1/L. Then there exist a positive integer n*, an index set I and a sign vector S such that when n > n*, the following hold:

a) I_n = I and Supp(x*) = I, ∀x* ∈ X;

b) sign(x^n) = S and sign(x*) = S, ∀x* ∈ X.

The proof of this lemma is presented in Appendix C. According to Lemma 3, the support and sign freeze after finitely many iterations. Furthermore, by Lemma 3, we can claim that {x^n} converges to x* if the new sequence {x^{i+n*}}_{i∈N} converges to x*, which is also equivalent to the convergence of the sequence {z^{i+n*}}_{i∈N},

$$z^{i+n^*} \to z^* \ \text{ as } i \to \infty, \qquad (12)$$

with z^{i+n*} = P_I x^{i+n*} and z* = P_I x*. Let

$$\hat z^{n} = z^{n+n^*}; \qquad (13)$$

then {ẑ^n} has the same convergence behavior as {x^n}.

For any ε > 0, we define a one-dimensional real set

$$\mathbb{R}_\varepsilon \triangleq \mathbb{R} \setminus [-\varepsilon, \varepsilon].$$

In particular, let R_0 = R \ {0}. We let

$$\mathcal{Z} \triangleq P_I\mathcal{X} = \{P_I x : x \in \mathcal{X}\};$$

then Z is the accumulation point set of the sequence {ẑ^n}. We define T : R^{|I|}_{η_μ/2} → R and f : R^{|I|}_{η_μ/2} → R as

$$T(z) = T_\lambda(P_I^T z) \quad\text{and}\quad f(z) = F(P_I^T z), \quad \forall z \in \mathbb{R}^{|I|}_{\eta_\mu/2}. \qquad (14)$$

For any z* ∈ Z, it can be observed from Property 1(a) that z* ∈ R^{|I|}_{η_μ} and that z* is a stationary point of T. Moreover, we define a series of mappings φ_{1,m} : R^m_0 → R^m and φ_{2,m} : R^m_0 → R^{m×m} as follows:

$$\varphi_{1,m}(z) = (\mathrm{sign}(z_1)\phi'(|z_1|), \cdots, \mathrm{sign}(z_m)\phi'(|z_m|))^T,$$
$$\varphi_{2,m}(z) = \mathrm{diag}(\phi''(|z_1|), \cdots, \phi''(|z_m|)), \quad m = 1, \ldots, N, \qquad (15)$$

where diag(z) represents the diagonal matrix generated by z.

For brevity, we denote φ_{1,m} and φ_{2,m} as φ_1 and φ_2, respectively, when m is fixed and there is no confusion.

By Properties 1 and 2, we can easily verify that {ẑ^n} satisfies the following properties.

Lemma 4: {ẑ^n} satisfies the following:

a) (Sufficient decrease condition). For each n ∈ N,

$$T(\hat z^{n+1}) \le T(\hat z^{n}) - \frac{1}{2}\left(\frac{1}{\mu} - L\right)\|\hat z^{n+1} - \hat z^{n}\|_2^2.$$

b) (Relative error condition). For each n ∈ N,

$$\|\nabla T(\hat z^{n+1})\|_2 \le \left(\frac{1}{\mu} + L\right)\|\hat z^{n+1} - \hat z^{n}\|_2.$$

c) (Continuity condition). There exist a subsequence {ẑ^{n_j}}_{j∈N} and z* such that

$$\hat z^{n_j} \to z^* \quad\text{and}\quad T(\hat z^{n_j}) \to T(z^*), \quad\text{as } j \to \infty.$$

(5)

Lemma 4(a) and (c) are obvious by Property 2, the specific form of T in (14) and the construction of {ẑ^n} in (13). Lemma 4(b) holds mainly due to Property 1(b) and Assumptions 1–2. Specifically, by Property 1(b), it can be easily checked that

$$\hat z^{n+1} + \lambda\mu\,\varphi_1(\hat z^{n+1}) = \hat z^{n} - \mu\nabla f(\hat z^{n}),$$

which implies

$$\mu\big(\nabla f(\hat z^{n+1}) + \lambda\varphi_1(\hat z^{n+1})\big) = (\hat z^{n} - \hat z^{n+1}) + \mu\big(\nabla f(\hat z^{n+1}) - \nabla f(\hat z^{n})\big).$$

Thus,

$$\|\nabla T(\hat z^{n+1})\|_2 = \frac{1}{\mu}\big\|(\hat z^{n} - \hat z^{n+1}) + \mu\big(\nabla f(\hat z^{n+1}) - \nabla f(\hat z^{n})\big)\big\|_2.$$

By Assumption 1, ∇F is Lipschitz continuous, so

$$\|\nabla f(\hat z^{n+1}) - \nabla f(\hat z^{n})\|_2 \le \|\nabla F(P_I^T\hat z^{n+1}) - \nabla F(P_I^T\hat z^{n})\|_2 \le L\|P_I^T\hat z^{n+1} - P_I^T\hat z^{n}\|_2 = L\|\hat z^{n+1} - \hat z^{n}\|_2.$$

Therefore, ‖∇T(ẑ^{n+1})‖_2 ≤ (1/μ + L)‖ẑ^{n+1} − ẑ^n‖_2.

From Lemma 4, if T further has the KL property at the limit point z*, then, according to Theorem 2.9 in [2], {ẑ^n} converges to z*. Noting the construction of {ẑ^n}, we can obtain the following convergence result.

Theorem 1 (Global Convergence): Assume that F and φ satisfy Assumptions 1 and 2, respectively. Let {x^n} be a sequence generated by IJT algorithm. Suppose that 0 < μ < 1/L. Then {x^n} converges subsequentially to a set X. If, further, there is a limit point x* ∈ X at which T_λ satisfies the rKL property, then the whole sequence converges to x*.

Together with Lemma 2, the following corollary holds.

Corollary 1: Under Assumptions 1 and 2, suppose that 0 < μ < 1/L and that there exists a limit point x* of {x^n} such that F is twice continuously differentiable at x* and ∇²T(P_I x*) is nonsingular; then {x^n} converges to x*.

Remark 1: A similar condition is also used to guarantee the convergence of the steepest descent method in [34, Theorem 2, pp. 266]. Obviously, if z* is a strictly local minimizer (or maximizer), or a strict saddle point of T, then the nonsingularity of ∇²T(z*) holds naturally. Therefore, if T is locally strictly convex or concave, then Corollary 1 holds.

C. Convergence to a Strictly Local Minimizer

As shown in Corollary 1, if ∇²T(P_I x*) is nonsingular at some limit point x*, then the sequence generated by IJT algorithm converges to x*. In this subsection, we will justify that x* is also a strictly local minimizer of the optimization problem if ∇²T(P_I x*) is positive definite.

Theorem 2 (Convergence to a Strictly Local Minimizer): Under the assumptions of Corollary 1, if further ∇²T(P_I x*) is positive definite, then x* is a strictly local minimizer of T_λ.

The proof of this theorem is rather intuitive. By Property 1(a) we have

$$[\nabla F(x^*)]_I + \lambda\varphi_1(x^*_I) = 0. \qquad (16)$$

This, together with the condition of the theorem,

$$\nabla^2 T(P_I x^*) = \nabla^2_{II}F(x^*) + \lambda\varphi_2(x^*_I) \succ 0,$$

implies that the second-order optimality conditions hold at x* = (x*_I, 0), where ∇²_{II}F(x*) denotes the submatrix of the Hessian ∇²F(x*) restricted to the index set I. For a sufficiently small vector h, we denote x_h = (x*_I + h_I, 0). It then follows that

$$F(x_h) + \lambda\sum_{i\in I}\phi(|x^*_i + h_i|) \ge F(x^*) + \lambda\sum_{i\in I}\phi(|x^*_i|). \qquad (17)$$

Furthermore, by Assumption 2(c), it obviously holds that

$$\phi(t) > \big(\|[\nabla F(x^*)]_{I^c}\|_\infty + 2\big)\,t/\lambda,$$

for sufficiently small t > 0. By this fact and the differentiability of F, for sufficiently small h, there hold

$$F(x^* + h) - F(x_h) + \lambda\sum_{i\in I^c}\phi(|h_i|) = h_{I^c}^T[\nabla F(x^*)]_{I^c} + \lambda\sum_{i\in I^c}\phi(|h_i|) + o(\|h_{I^c}\|) \ge \sum_{i\in I^c}\big(\|[\nabla F(x^*)]_{I^c}\|_\infty - |[\nabla F(x^*)]_i| + 1\big)|h_i| \ge 0. \qquad (18)$$

Summing up the two inequalities (17)–(18), one has that for all sufficiently small h,

$$T_\lambda(x^* + h) - T_\lambda(x^*) \ge 0, \qquad (19)$$

and hence x* is a local minimizer. Moreover, we can observe that when h ≠ 0, at least one of the two inequalities (17) and (18) holds strictly, which implies that x* is a strictly local minimizer.

D. Eventual Linear Convergence Rate

In order to derive the convergence rate of IJT algorithm, we first give some observations on ∇F and φ in the neighborhood of x*. For any 0 < ε < η_μ, we define a neighborhood of x* as follows:

$$\mathcal{N}(x^*, \varepsilon) = \{x \in \mathbb{R}^N : \|x_I - x^*_I\|_2 < \varepsilon,\ x_{I^c} = 0\}.$$

If F is twice continuously differentiable at x* and also λ_min(∇²_{II}F(x*)) > 0, then for any x ∈ N(x*, ε), there exist two sufficiently small positive constants c_F and c_φ (both c_F and c_φ depending on ε with c_F → 0 and c_φ → 0 as ε → 0) such that

$$\big\langle [\nabla F(x)]_I - [\nabla F(x^*)]_I,\ x_I - x^*_I \big\rangle \ge \big(\lambda_{\min}(\nabla^2_{II}F(x^*)) - c_F\big)\|x_I - x^*_I\|_2^2, \qquad (20)$$

$$\big\langle \varphi_1(x_I) - \varphi_1(x^*_I),\ x_I - x^*_I \big\rangle \ge \big(\phi''(e) - c_\phi\big)\|x_I - x^*_I\|_2^2, \qquad (21)$$

where (21) holds because φ′ is strictly convex on (0,∞), and thus φ″ is nondecreasing on (0,∞); consequently, min_{i∈I} φ″(|x*_i|) = φ″(min_{i∈I}|x*_i|) = φ″(e), with e = min_{i∈I}|x*_i|. With the observations (20) and (21), we obtain the following theorem.

Theorem 3 (Eventual Linear Rate): Under the conditions of Corollary 1, if the following conditions also hold:

a) λ_min(∇²_{II}F(x*)) > 0;

b) 0 < λ < −λ_min(∇²_{II}F(x*))/φ″(e);

c) either 0 < μ < min{ 2(λ_min(∇²_{II}F(x*)) + 2λφ″(e)) / (L² − (λφ″(e))²), 1/L }, or, for any sufficiently small 0 < ε < η_μ, the third derivative φ‴ is well defined, bounded and nonzero on the set ∪_{i∈I} B(x*_i, ε), where B(x*_i, ε) := (x*_i − ε, x*_i + ε),

where e = min_{i∈I}|x*_i|, then there exist a positive integer n_0 and a constant ρ ∈ (0,1) such that when n > n_0,

$$\|x^{n+1} - x^*\|_2 \le \rho\|x^{n} - x^*\|_2, \quad\text{and}\quad \|x^{n+1} - x^*\|_2 \le \frac{\rho}{1-\rho}\|x^{n+1} - x^{n}\|_2.$$

The proof of Theorem 3 is presented in Appendix D. As shown by this theorem, if we can fortunately obtain a good initial point, then IJT algorithm may converge fast, with a linear rate. On the other hand, Theorem 3 also provides an a posteriori computable error estimate for the algorithm, which can be used to design an efficient termination rule for IJT algorithm. It can be observed that the conditions of Theorem 3 are slightly stricter than those of Theorem 2, and thus x* is also a strictly local minimizer under the conditions of Theorem 3.
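As a small illustration (ours, not from the paper), the second bound in Theorem 3 can be turned into a termination test once an estimate of the contraction factor ρ is available; rho_hat and tol below are user-chosen assumptions.

```python
import numpy as np

def should_stop(x_new, x_old, rho_hat, tol):
    """Stop when the a posteriori bound of Theorem 3,
    ||x^{n+1} - x*||_2 <= rho/(1-rho) * ||x^{n+1} - x^n||_2,
    drops below the tolerance tol. rho_hat in (0,1) is a
    user-supplied estimate of the contraction factor rho."""
    bound = rho_hat / (1.0 - rho_hat) * np.linalg.norm(x_new - x_old)
    return bound <= tol
```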

IV. APPLICATION TO ℓq (0 < q < 1) REGULARIZATION

The ℓq (0 < q < 1) regularization is formulated as follows:

$$\min_{x\in\mathbb{R}^N}\left\{T_\lambda(x) = \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_q^q\right\}, \qquad (22)$$

where A ∈ R^{M×N} (commonly, M < N), y ∈ R^M, and ‖x‖_q^q = Σ_{i=1}^{N} |x_i|^q. The proximity operator prox_{μ,λ|·|^q} can be expressed as (see [7])

$$\mathrm{prox}_{\mu,\lambda|\cdot|^q}(z) = \begin{cases} \big(\cdot + \lambda\mu q\,\mathrm{sign}(\cdot)|\cdot|^{q-1}\big)^{-1}(z), & |z| \ge \tau_{\mu,q} \\ 0, & |z| \le \tau_{\mu,q} \end{cases} \qquad (23)$$

for any z ∈ R, where

$$\tau_{\mu,q} = \frac{2-q}{2-2q}\big(2\lambda\mu(1-q)\big)^{\frac{1}{2-q}}, \qquad (24)$$

$$\eta_{\mu,q} = \big(2\lambda\mu(1-q)\big)^{\frac{1}{2-q}}, \qquad (25)$$

and the range of prox_{μ,λ|·|^q} is {0} ∪ [η_{μ,q}, ∞). Specifically, for some special q (say, q = 1/2, 2/3), the corresponding proximity operators can be expressed analytically [39], [12].

According to [2] (see Example 5.4, page 122), the function T_λ(x) = (1/2)‖Ax − y‖_2^2 + λ‖x‖_q^q is a KL function, and it obviously satisfies the rKL property at any limit point. Then we can obtain the following corollary directly.

Corollary 2: Let {x^n} be a sequence generated by IJT algorithm for ℓq regularization with q ∈ (0,1). Assume that 0 < μ < 1/‖A‖_2^2. Then {x^n} converges to a stationary point of the ℓq regularization.

Moreover, it is easy to check that φ(z) = z^q satisfies the second part of condition (c) in Theorem 3. Therefore, the eventual linear convergence rate of IJT algorithm for ℓq regularization can be stated as follows.

Corollary 3: Under the conditions of Corollary 2, if the following conditions also hold:

a) λ_min(A_I^T A_I) > 0,

b) 0 < λ < λ_min(A_I^T A_I) e^{2−q} / (q(1−q)),

where I = Supp(x*) and e = min_{i∈I}|x*_i|, then IJT algorithm converges to a strictly local minimizer x* with an eventual linear rate.

It can be observed that the minimal nonzero entry e of x* is used in condition (b) of this corollary. A theoretical lower bound on e was estimated by Chen et al. [17]. In the following, we derive other sufficient conditions through the observation that the threshold value (25) is generally a tighter lower bound on e than that studied in [17]. Specifically, by (25), it holds that

$$e \ge \eta_{\mu,q} = \big(2\lambda\mu(1-q)\big)^{\frac{1}{2-q}}. \qquad (26)$$

Then, if λ_min(A_I^T A_I)/‖A‖_2^2 > q/2 and q/(2λ_min(A_I^T A_I)) < μ < 1/‖A‖_2^2, the conditions in Corollary 3 hold naturally.

Theorem 4: Under the conditions of Corollary 2, if the following conditions also hold:

a) λ_min(A_I^T A_I)/‖A‖_2^2 > q/2,

b) q/(2λ_min(A_I^T A_I)) < μ < 1/‖A‖_2^2,

then IJT algorithm converges to a strictly local minimizer x* with an eventual linear rate.

Theorem 4 shows that if the matrix A satisfies a certain concentration property and the step size μ is chosen appropriately, then IJT algorithm converges to a local minimizer with an eventual linear rate. Note that condition (a) in Theorem 4 naturally implies q/(2λ_min(A_I^T A_I)) < 1/‖A‖_2^2. Thus, condition (b) of Theorem 4 is a natural and reachable condition and, furthermore, whenever this condition is satisfied, the sequence {x^n} is indeed convergent by Corollary 2. This shows that only condition (a) is essential in Theorem 4. We notice that condition (a) is a concentration condition on the eigenvalues of the submatrix A_I^T A_I; in particular, it implies

$$\lambda_{\min}(A_I^T A_I) > q\,\lambda_{\max}(A_I^T A_I)/2,$$

or equivalently,

$$\mathrm{Cond}(A_I^T A_I) := \frac{\lambda_{\max}(A_I^T A_I)}{\lambda_{\min}(A_I^T A_I)} < \frac{2}{q}, \qquad (27)$$

where Cond(A_I^T A_I) is the condition number of A_I^T A_I. Thus, (27) shows that the submatrix A_I^T A_I is well-conditioned, with condition number lower than 2/q.

In recent years, a property called the restricted isometry property (RIP) of a matrix A was introduced to characterize the concentration degree of the eigenvalues of its submatrices with k columns [9]. A matrix A is said to satisfy the k-order RIP (denoted by δ_k-RIP) if there exists a δ_k ∈ (0,1) such that

$$(1-\delta_k)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta_k)\|x\|_2^2, \quad \forall\, \|x\|_0 \le k. \qquad (28)$$

In other words, the RIP ensures that all submatrices of A with k columns are close to an isometry, and therefore distance-preserving. Let K = ‖x*‖_0. It can be seen from (28) that if A possesses the δ_K-RIP with δ_K < (2−q)/(2+q), then

$$\mathrm{Cond}(A_I^T A_I) \le \frac{1+\delta_K}{1-\delta_K} < \frac{2}{q}.$$
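As a quick numerical check (an illustrative sketch of ours, not part of the paper), condition (a) of Theorem 4 and the implied condition-number bound (27) can be verified directly for a given matrix and support set.

```python
import numpy as np

def check_concentration(A, support, q):
    """Check condition (a) of Theorem 4, lambda_min(A_I^T A_I)/||A||_2^2 > q/2,
    and the implied condition-number bound (27), Cond(A_I^T A_I) < 2/q."""
    AI = A[:, list(support)]
    eigs = np.linalg.eigvalsh(AI.T @ AI)          # ascending eigenvalues
    lam_min, lam_max = eigs[0], eigs[-1]
    spec = np.linalg.norm(A, 2) ** 2              # ||A||_2^2
    cond_a = lam_min / spec > q / 2.0
    cond_27 = (lam_max / lam_min) < 2.0 / q
    return cond_a, cond_27
```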
