Global Convergence of ADMM in Nonconvex Nonsmooth Optimization

(1)

https://doi.org/10.1007/s10915-018-0757-z

Global Convergence of ADMM in Nonconvex Nonsmooth Optimization

Yu Wang¹ · Wotao Yin² ·Jinshan Zeng³

Received: 23 November 2016 / Revised: 6 December 2017 / Accepted: 30 May 2018 / Published online: 7 June 2018

Abstract In this paper, we analyze the convergence of the alternating direction method of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth objective function,φ(x0, . . . ,x_p,y), subject to coupled linear equality constraints. Our ADMM updates each of the primal variablesx₀, . . . ,x_p,y, followed by updating the dual variable. We separate the variableyfromx_i’s as it has a special role in our analysis. The developed convergence guarantee covers a variety of nonconvex functions such as piecewise linear functions,q

quasi-norm, Schatten-q quasi-norm (0 < q < 1), minimax concave penalty (MCP), and smoothly clipped absolute deviation penalty. It also allows nonconvex constraints such as compact manifolds (e.g., spherical, Stiefel, and Grassman manifolds) and linear complementarity constraints. Also, thex₀-block can be almost any lower semi-continuous function. By applying our analysis, we show, for the first time, that several ADMM algorithms applied to solve nonconvex models in statistical learning, optimization on manifold, and matrix decomposition are guaranteed to converge. Our results provide sufficient conditions for ADMM to converge on (convex or nonconvex) monotropic programs with three or more blocks, as they are special cases of our model. ADMM has been regarded as a variant to the augmented

The work of W. Yin was supported in part by NSF Grants DMS-1720237 and ECCS-1462397, and ONR Grant N00014171216. The work of J. Zeng was supported in part by the NSFC Grants (61603162, 11501440, 61772246, 61603163) and the doctoral start-up foundation of Jiangxi Normal Univerity.

B

Jinshan Zeng [email protected] Yu Wang

[email protected] Wotao Yin

[email protected]

1 Department of Statistics, University of California, Berkeley (UCB), Berkeley CA 94704, USA 2 Department of Mathematics, University of California, Los Angeles (UCLA),

Los Angeles CA 90025, USA

3 School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China

(2)

Lagrangian method (ALM). We present a simple example to illustrate how ADMM converges but ALM diverges with bounded penalty parameterβ. Indicated by this example and other analysis in this paper, ADMM might be a better choice than ALM for some nonconvex nonsmoothproblems, because ADMM is not only easier to implement, it is also more likely to converge for the concerned scenarios.

Keywords ADMM·Nonconvex optimization·Augmented Lagrangian method·Block coordinate descent·Sparse optimization

1 Introduction

In this paper, we consider the (possibly nonconvex and nonsmooth) optimization problem:

minimize

x0,x1,...,xp,y φ(x₀,x₁, . . . ,x_p,y) (1)

subject to A0x0+A1x1+ · · · +Apxp+By=b,

whereφ:Rⁿ⁰×· · ·×Rⁿ^p×R^q →R∪{∞}is a continuous function,x_i ∈Rⁿⁱare variables with their coefficient matricesA_i ∈ R^m×nⁱ,i =0, . . . ,p, and y ∈R^q is the last variable with its coefficient matrixB∈R^m×q. The model remains as general withoutyandBy; but we keepyandBto simplify the notation.

We setb=0 throughout the paper to simplify our analysis. All of our results still hold if b=0 is in the image of the matrixB, i.e.,b∈Im(B).

Besides the linear constraints in (1), any constraint on each variablex₀,x₁, . . . ,x_pandy can be treated as an indicator function and included in the objective functionφ. Therefore, we do not include constraints like:x0 ∈X0,x1∈X1, . . . ,xp∈Xp,y∈Y.

In spite of the success of ADMM on convex problems, the behavior of ADMM on nonconvex problems has been largely a mystery, especially when there are also nonsmooth functions and nonconvex sets in the problems. ADMM generally fails on nonconvexity problems, but it has found to not only work in some applications but often exhibit great performance! Indeed, successful examples include: matrix completion and separation [47,49,61,63], asset alloca- tion [56], tensor factorization [34], phase retrieval [57], compressive sensing [9], optimal power flow [64], direction fields correction [31], noisy color image restoration [31], image registration [6], network inference [39], and global conformal mapping [31]. In these applications, the objective function can be nonconvex, nonsmooth, or both. Examples include the piecewise linear function, theqquasi-norm forq∈(0,1), the Schatten-q(0<q<1)[59]

quasi-norm f(X)=

iσi(X)^q(whereσi(X)denotes theith largest singular value ofX), and the indicator functionι_B, whereBis a nonconvex set.

The success of these applications can be intriguing, since these applications are far beyond the scope of the theoretical conditions that ADMM is proved to converge. In fact, even the three-block ADMM can diverge on a simple convex problem [10]. Nonetheless, we still find that it works well in practice. This has motivated us to explore in the paper and respond to this question: when will the ADMM type algorithms converge if the objective function includes nonconvex nonsmooth functions?

We present our Algorithm1, whereL_βdenotes the augmented Lagrangian (2), and show that it converges for a large class of problems. For simplicity, Algorithm1uses the standard ADMM subproblems, which minimize the augmented LagrangianLβwith all but one variable fixed. It is possible to extend them to inexact, linearized, and/or prox-gradient subproblems as long as a few key principles (cf. §3.1) are preserved.

(3)

Algorithm 1Nonconvex ADMM for (1)

Initializex₁⁰, . . . ,x⁰_p,y⁰, w⁰ whilestopping criteria not satisfieddo

fori=0, . . . ,pdo

x_i^k+1←argmin_x_iLβ(x^k+1_<i ,xi,x^k_>i,y^k, w^k); end for

y^k+1←argmin_yLβ(x^k+1,y, w^k); w^k+1←w^k+β

Ax^k+1+By^k+1

; k←k+1;

end while

returnx₁^k, . . . ,x^k_pandy^k.

In this paper, under some assumptions on the objective and matrices, Algorithm1is proved to converge. Algorithm1is a generalization to the coordinate descent method. By setting A₀,A₁, . . . ,A_p,Bto 0, Algorithm1reduces to thecycliccoordinate descent method.

1.1 Proposed Algorithm

Our variable isx:= [x₀;. . .;x_p] ∈Rⁿwheren=p

i=0n_i. LetA:= [A₀ · · · A_p] ∈R^m×n andAx:=_p

i=0Aixi ∈R^m. To present our algorithm, we define the augmented Lagrangian:

Lβ(x,y, w):=φ(x,y)+ w,Ax+By + β

2Ax+By². (2)

The proposed Algorithm1extends the standard ADMM to multiple variable blocks. It also extends thecoordinate descentalgorithms dealing with linear constraints. We letx_<i :=

[x0;. . .;xi−1] ∈Rⁿ⁰⁺ⁿ¹^+···+nⁱ⁻¹andx_>i := [xi+1;. . .;xp] ∈Rⁿⁱ⁺¹^+···+n^p(clearly,x_<0and x_>_pare null variables, which may be used for notational ease). Subvectorsx_≤i := [x_<i,x_i] andx_≥iare defined similarly. The convergence of Algorithm1will be given in Theorems1 and2.

1.2 Relation to the Augmented Lagrangian Method (ALM)

ALM is a widely-used method for solving constrained optimization models [23,44]. It applies broadly to nonconvex nonsmooth problems. ADMM is an approximation to ALM by sequen- tially updating each of the primal variables.

ALM generally uses a sequence of penalty parameters{β^k}, which is nondecreasing and possibly unbounded. Whenβ^kbecomes large, the ALM subproblem becomes ill-conditioned.

Therefore, using boundedβ^kis practically desirable (see [12, Theorem 5.3], [3, Proposition 2.4], or [4, Chapter 7]). For general nonconvex and nonsmooth problems, it is well known thatβ^k,k ∈ N is bounded is not enough for the convergence of ALM [3, Section 2.1].

Proposition2below introduces a simple example on which ALM diverges with any bounded β^k. It is surprising, however, that ADMM converges in finite steps for any fixedβ >1 on this example.

Proposition 1 Consider the problem minimize

x,y∈R x²−y² (3)

subject to x =y, x∈ [−1,1].

(4)

It holds that

1. If{β^k|k∈N}is bounded, ALM generates a divergent sequence;

2. for any fixedβ >4, ADMM generates a convergent and finite sequence to a solution.

The proof is straightforward and included in the “Appendix”. ALM diverges because L_β(x,y, w) does not have a saddle point, and there is a non-zero duality gap. ADMM, however, is unaffected. As the proof shows, the ADMM sequence satisfies 2y^k= −w^k,∀k.

By substitutingw≡ −2yintoLβ(x,y, w), we get a convex function in(x,y)! Indeed, ρ(x,y):=L_β(x,y, w)_w=−2y=(x²−y²)+ι[−1,1](x)−2y(x−y)+β

2x−y²

= β+2

2 |x−y|²+ι[−1,1](x),

whereιS is the indicator function of set S(that is,ιS(x) = 0 ifx ∈ S; otherwise, equals infinity). It turns out that ADMM solves (3) by performing the following coordinate descent iteration toρ(x,y):

x^k⁺¹=argmin_xρ(x,y^k), y^k+1=y^k−_β2^β−4 d

dyρ(x^k+1,y^k).

Our analysis for the general case will show that the primal variableysomehow “controls”

the dual variablewand reduces ADMM to an iteration that is similar to coordinate descent.

1.3 Related Literature

The original ADMM was proposed in [19,21]. For convex problems, its convergence was established firstly in [20] and its convergence rates given in [15,16,22] in different set- tings. When the objective function is nonconvex, the recent results [28,37,61] directly make assumptions on the iterates(x^k,y^k, w^k). Hong et al. [24] deals with the nonconvex separable objective functions for some specificAi, which forms the sharing and consensus problem.

Li and Pong [32] studied the convergence of ADMM for some special nonconvex models, where one of the matrices AandB is an identity matrix. Wang et al. [51,52] studied the convergence of the nonconvex Bregman ADMM algorithm, which includes ADMM as a special case. We review their results and compare to ours in Sect.4below.

1.4 Contribution and Novelty

The main contribution of this paper is the establishment of the global convergence of Algo- rithm1under certain assumptions given in Theorems1and2below. The assumptions apply to largely many nonconvex and nonsmooth objective functions. The developed theoretical results can be extended to the case where subproblems are solved inexactly with summable errors. We also allow the primal block variablesx₁, . . . ,x_pto be updated in an arbitrary order as long asx0is updated first andyis updated last (just before thew-update). The novelty of this paper can be summarized as follows:

(1) Weaker assumptions.Compared to the related works [24,28,32,37,51,52,61], the convergence conditions in this paper are weaker, extending the ADMM theory to sig- nificantly more nonconvex functions and nonconvex sets. See Table1. In addition, we allow the primal variablesx₁, . . . ,x_p to be updated in an arbitrary order at each iter-

(5)

ation,¹which is new in the ADMM literature. We show that most of our assumptions are necessary by providing counter examples. We also give the first example that causes ADMM to converge but ALM to diverge.

(2) New examples.By applying our main theorems, we prove convergence for the nonconvex ADMM applied to the following problems which could not be recovered from previous convergence theory:

– statistical regression based on nonconvex regularizer such as minimax concave penalty(MCP), smoothly clipped absolute deviation (SCAD), andqquasi-norm;

– minimizing smooth functions subject to norm or Stiefel/Grassmannian manifold constraints;

– matrix decomposition using nonconvex Schatten-qregularizer;

– smooth minimization subject to complementarity constraints.

(3) Novel techniques.We improve upon the existing analysis techniques and introduce new ones.

(a) An induction technique for nonconvex, nonsmooth case. The analysis uses the aug- mented Lagrangian as the Lyapunov function: Algorithm1 produces a sequence of points whose augmented Lagrangian function values are decreasing and lower bounded. This technique appeared first in [24] and also in [32,51]. However, it has trouble handling nonsmooth functions. An induction technique is introduced to overcome this difficulty and extend the current framework to nonconvex, nonsmooth, multi-block cases. The technique is used in the proof of Lemma9.

(b) Restricted prox-regularity. Most of the convergence analysis of nonconvex opti- mization either assumes or proves the sufficient descent and bounded subgradient properties (c.f., [1,24]). This property is easily obtainable if the objective is smooth.

However, some nonconvex and nonsmooth objectives (e.g. nonconvexqquasi-norm) violate these properties. We overcome this challenge with the introducedrestricted prox-regularity property(Definition2). If the objective satisfies such a property, we prove that the sequence enjoy sufficient descent and bounded subgradients after a finite number of iterations.

(c) More general linear mappings. Most nonconvex ADMM analysis is applied to the primal variablesxandydirectly. This requires the matrices A₀,A₁, . . . ,A_p,Bto either identity or have full column/row rank. In this paper, we introduce techniques to work with possibly rank-deficientA0,A1, . . . ,Ap,B(see, for example, Lemma 5). This allows us to ensure convergence of ADMM on some important applications in signal processing and statistical learning (see Sect.5).

In addition, we use several other techniques that are tailored to relax our convergence assumptions as much as possible.

1.5 Notation and Organization

We denoteRas the real number set,R∪ {+∞}as the extended real number set,R+as the positive real number set, andNas the natural number set. Given a matrixX, Im(X)denotes its image,σi(X)denotes itsith largest singular value. · represents the Euclidean norm for a vector or the Frobenius norm for a matrix. dom(f)denotes the domain of a function

1This is the best that one hope (except for very specific problems) since [62, Section 1] shows a convex 2-block problem, which ADMM fails to converge.

(6)

f. For any two square matricesAandBwith the same size, A Bmeans that A−Bis positively semi-definite.

The remainder of this paper is organized as follows. Section2presents the main convergence analysis. Section3gives the detailed proofs. Section4discusses the tightness of the assumptions, the primal variable update order, and inexact minimization issues. Section5 applies the developed theorems in some typical applications and obtains novel convergence results. Finally, Sect.6concludes this paper.

2 Main Results 2.1 Definitions

In our definitions,∂f denotes the set of general subgradients of f in [45, Definition 8.3].

We call a functionLipschitz differentiableif it is differentiable and its gradient is Lipschitz continuous. The functions given in the next two definitions are permitted in our model.

Definition 1 (Piecewise linear function)A function f :Rⁿ →Rispiecewise linearif there exist polyhedraU1, . . . ,UK ⊂Rⁿ, vectorsa1, . . . ,aK ∈Rⁿ, and pointsb1, . . . ,bK ∈R such that_K

i=1U_i =Rⁿ,U_i

U_j = ∅(∀i = j), and f(x) = a_i^Tx +b_i whenx ∈U_i, i=1, . . . ,K.

Definition 2 (Restricted prox-regularity)For a lower semi-continuous function f, letM ∈ R+, f :R^N →R∪ {∞}, and define the exclusion set

S_M:= {x∈dom(f): d>Mfor alld∈∂f(x)}.

f is calledrestricted prox-regular if, for any M > 0 and bounded set T ⊆ domf, there existsγ >0 such that

f(y)+γ

2x−y²≥ f(x)+ d,y−x , ∀x∈T \S_M, y∈T,d∈∂f(x), d ≤M.

(4) (IfT\SM is empty, (4) is satisfied.)

Definition2 is related to, but weaker than, the conceptsprox-regularity[43], hypomono- tonicity[45, Example 12.28] andsemi-convexity[26,30,38,40], all of which impose global conditions. Definition2only requires (4) to hold over a subset. As shown in Proposition 1, while prox-regular functions include any convex functions and anyC¹ functions with Lipschitz continuous gradients, restricted prox-regular functions further include a set of nonsmooth non-convex functions such asq quasi-norms (0<q<1), Schatten-qquasi-norms (0<q<1), and indicator functions of compact smooth manifolds.

Proposition 1 Examples of (restricted) prox-regular functionsThe following functions are prox-regular functions:

(1) convex functions, including indicator functions of convex sets, (2) C¹smooth functions with L-Lipschitz continuous gradient.

The following functions are restricted prox-regular functions:

(1) q(x):= x^qq function for q∈(0,1);

(7)

Table1ConditionsforADMMconvergence(note:f0,f1,...,fpisnotrequiredtoexist) Scenario1Scenario2 Modelminimize x=(x0,...,xp),yφ(x,y):=g(x)+p i=0fi(xi)+h(y) subjecttoAx+By=0

minimize x,yφ(x,y) subjecttoAx+By=0 φCoerciveoverthefeasibleset{(x,y):Ax+By=0};seeassumptionA1 g,hLipschitzdifferentiableφLipschitzdifferentiable Scenario1aScenario1b f0Lowersemi-continuous∂fboundedinanyboundedset f1,...,fpRestrictedprox-regularPiecewiselinear A,BIm(A)⊆Im(B) SolutiontoeachADMMsub-problemisLipschitzw.r.t.input(A3)

(8)

(2) Schatten-q quasi-norm:Aq = _n

i=1σ_i^q, where q ∈ (0,1)andσi is the i th largest singular value of A;

(3) Indicator functions ιS of a compact C² manifold, such as the unit sphere in a finite Euclidean space.

Definition2introduces functions that do not satisfy (4) globallyonly becausethey are asymptotically “steep” in the exclusion setS_M. Such functions include|x|^q (0<q <1), for whichS_Mhas the form(−M,0)∪(0, M); the Schatten-qquasi-norm (0<q<1), for whichS_M = {X: ∃i, σi(X) < M}as well as log(x), for whichS_M =(0, M), whereM

is a constant depending onM. We only need (4) because the iteratesx_i^kof Algorithm1, for all largek, never enter the exclusion setSM.

2.2 Main Theorems

To ensure the boundedness of the sequence(x^k,y^k, w^k), we only need the coercivity of the objective function within the feasible set.

A1 (coercivity)Define the feasible setF := {(x,y)∈Rⁿ⁺^q :Ax+By=0}. The objective function φ(x,y) is coercive over this set, that is,φ(x,y) → ∞if(x,y) ∈ F and (x,y) → ∞;

If the feasible set of(x,y)is bounded, then A1 holds trivially for any continuous objective function. Therefore, A1 is much weaker than assuming that the objective function is coercive over the entire spaceR^n+q. The assumption A1 can be dropped if the boundedness of the sequence can be deducted from other means.

Within the proof,A_ix_i^kandBy^koften appear in the first order conditions (e.g. see Eqs. (12), (13)). In order to have a reverse control, i.e., controllingx_i^k,y^kbased onAix_i^k,By^k, we need the following two assumptions on matricesA_i andB.

A2 (feasibility)Im(A)⊆Im(B), where Im(·)returns the image of a matrix;

A3 (Lipschitz sub-minimization paths)

(a) For any fixedx, argmin_y{φ(x,y):By=u}has a unique minimizer.H :Im(B)→ R^qdefined byH(u)argmin_y{φ(x,y):By=u}is a Lipschitz continuous map.

(b) Fori = 0, . . . ,p and anyx_<i,x_>i andy, argmin_x_i{φ(x_<i,xi,x_>i,y) : Aixi = u} has a unique minimizer and Fi : Im(Ai) → Rⁿⁱ defined by Fi(u) argmin_x_i{φ(x_<i,xi,x_>i,y):Aixi =u}is a Lipschitz continuous map.

Moreover, the aboveF_iandHhave a universal Lipschitz constantM¯ >0.

These two assumptions allow us to controlx_i^k,y^kbyAix_i^k,By^kas in Lemma1.

Lemma 1 It holds that,∀k1,k2 ∈N,

y^k¹−y^k² ≤ ¯MBy^k¹−By^k², (5) x_i^k¹−x_i^k² ≤ ¯MAix_i^k¹−Aix_i^k², i=0,1, . . . ,p, (6) whereM is given in A3.¯

They weaken the full column rank assumption typically imposed on matrices Ai and B. When Ai andBhave full column rank, their null spaces are trivial and, therefore,Fi,H reduce to linear operators and satisfy A3. However, the assumption A3 allows non-trivial null

(9)

H(u^∗) + Null(B) H(u^∗)

H(u)

sub-minimization path

Fig. 1 Illustration of the assumption A3, which assumes thatH(u)=argmin{h(y):By=u}is Lipschitz [46]

spaces and holds for more functions. For example, if a function f is aC²with its Hessian matrix H bounded everywhereσ1I H σ2I (σ1 > σ2 > 0), thenFsatisfies A3 for any matrixA. If the uniqueness fails to hold, i.e., there existsy₁,y₂ such thatBy₁ = By₂ andφ(x,y₁)=φ(x,y₂), then the augmented Lagrangian cannot distinguish them, causing troubles to the boundedness of the sequence (Fig.1).

As for the objective function, we consider two different scenarios:

– Theorem1considers the scenario wherexandyare decoupled in the objective function;

– Theorem2considers the scenario wherexandyare possibly coupled but their function φ(x,y)is Lipschitz differentiable.

The model in the first scenario is minimize

x0,x1,...,xp,y f(x0,x1, . . . ,xp) +h(y) (7)

subject to A₀x₀+A₁x₁+ · · · +A_px_p+By=b, where the function f :Rⁿ →R∪ {∞}(n=p

i=0n_i) is proper, continuous, and possibly nonsmooth, and the functionh:R^q →Ris proper and differentiable. Both f andhcan be nonconvex.

Theorem 1 Suppose that A1–A3 and the following assumptions hold.

A4 (objective-f regularity) f has the form f(x):=g(x)+

p

i=0

f_i(x_i) where

(i) g(x)is Lipschitz differentiable with constant L_g, (ii) Either

a. f0is lower semi-continuous, fi(xi)is restricted prox-regular (Definition2) for i=1, . . . ,p; Or,

b. The supremumsup{d :x₀∈S,d∈∂f₀(x₀)}is bounded for any bounded set S, f_i(x_i)is continuous and piecewise linear (Definition1) for i=1, . . . ,p;

A5 (objective-hregularity)h(y)is Lipschitz differentiable with constant L_h;

Then, Algorithm1converges subsequently for any sufficiently largeβ(the lower bound is given in Lemma9), that is, starting from any x₁⁰, . . . ,x⁰_p,y⁰, w⁰, it generates a sequence that is bounded, has at least one limit point, and that each limit point(x^∗,y^∗, w^∗)is a stationary point ofL_β, namely,0∈∂L_β(x^∗,y^∗, w^∗).

(10)

In addition, ifL_β is a Kurdyka-Łojasiewicz (KŁ) function [1,5,35], then (x^k,y^k, w^k) converges globally²to the unique limit point(x^∗,y^∗, w^∗).

Assumptions A4 and A5 regulate the objective functions. None of the functions needs to be convex. f0can be any lower semi-continuous function, and the non-Lipschitz differentiable parts f1, . . . ,fnof f shall satisfy either Definitions1or2. Under Assumptions A4 and A5, the augmented Lagrangian functionL_βis lower semi-continuous.

It will be easy to see, from our proof in Sect.3.3, that the Lipschitz differentiable assumption ongcan be relaxed to hold just in any bounded set, since the boundedness of{x^k}is established before that property is used in our proof. Consequently,gcan be functions like e^x, whose derivative is not globally Lipschitz.

Functions satisfying the KŁ inequality include real analytic functions, semi-algebraic functions and locally strongly convex functions (more information can be referred to Sect.

2.2 in [60] and references therein).

In the second scenario,xandy can be coupled in the objective as shown in (1), but the objective needs to be smooth.

Theorem 2 Suppose that A1–A3 hold andφin(1)is Lipschitz differentiable with constant L_φ. Then, Algorithm1has the same subsequential and global convergence results as stated in Theorem1.

Although Theorems1and2impose different conditions on the objective functions, their proofs are similar. Hence, we will focus on proving Theorem1first and leave the proof of Theorem2to the “Appendix”.

3 Proof 3.1 Keystones

The following properties hold for Algorithm1under our assumptions. Here, we first list them and present Proposition 2, which establishes convergence assuming these properties. Then in the next two subsections, we prove these properties.

P1 (Boundedness){x^k,y^k, w^k}is bounded, andL_β(x^k,y^k, w^k)is lower bounded.

P2 (Sufficient descent)There is a constantC₁(β) >0 such that for all sufficiently largek, we have

L_β(x^k,y^k, w^k)−L_β(x^k+1,y^k+1, w^k+1)

≥C1(β)

B(y^k+1−y^k)²+

p

i=1

Ai(x_i^k−x_i^k+1)²

. (8)

P3 (Subgradient bound)There existsC₂(β) >0 andd^k⁺¹∈∂L_β(x^k⁺¹,y^k⁺¹, w^k⁺¹)such that

d^k⁺¹ ≤C2(β)

B(y^k⁺¹−y^k) +

p

i=1

Ai(x_i^k+1−x_i^k)

. (9)

It is our intention to startiat 1, thus skipping thex0-block, in (8) and (9).

2“Globally”here means regardless of where the initial point is.

(11)

P4 (Limiting continuity)If(x^∗,y^∗, w^∗)is the limit point of a sub-sequence(x^k^s,y^k^s, w^k^s) fors∈N, thenL_β(x^∗,y^∗, w^∗)=lim_s→∞L_β(x^k^s,y^k^s, w^k^s).

The proposition below is standard and not new though it does not appear exactly in the literature.

Proposition 2 Suppose that when an algorithm is applied to the problem(7), its sequence (x^k,y^k, w^k) satisfies P1–P4. Then, the sequence has at least a limit point(x^∗,y^∗, w^∗), and any limit point(x^∗,y^∗, w^∗) is a stationary point. That is,0 ∈ ∂L_β(x^∗,y^∗, w^∗), or equivalently,

0=Ax^∗+By^∗, (10a)

0∈∂f(x^∗)+A^Tw^∗, (10b)

0∈∂h(y^∗)+B^Tw^∗. (10c)

Furthermore, the running best rates³of the sequences{B(y^k+1−y^k)²+_p

i=1Ai(x_i^k− x_i^k+1)²}and{d^k⁺¹}are o(¹_k)and o(^√¹_k), respectively. Moreover, ifL_βis a KŁ function, then(x^k,y^k, w^k)converges globally to the unique point(x^∗,y^∗, w^∗).

Proof The proof is standard. Similar steps are found in, for example, [1,60].

By P1, the sequence(x^k,y^k, w^k)is bounded, so there exist a convergent subsequence and a limit point, denoted by (x^k^s,y^k^s, w^k^s)s∈N → (x^∗,y^∗, w^∗) as s → +∞. By P1 and P2,L_β(x^k,y^k, w^k)is monotonically nonincreasing and lower bounded, and therefore A_ix_i^k−A_ix_i^k+1 →0 andBy^k−By^k+1 →0 ask → ∞. Based on P3, there exists d^k ∈ ∂Lβ(x^k,y^k, w^k)such thatd^k → 0. In particular,d^k^s → 0 ass → ∞. Based on P4,L_β(x^∗,y^∗, w^∗)= limsL_β(x^k^s,y^k^s, w^k^s). By definition of general subgradient [45, Definition 8.3], we have 0∈∂L_β(x^∗,y^∗, w^∗).

The running best rate of the sequence{B(y^k⁺¹−y^k)²+p

i=1A_i(x_i^k−x_i^k⁺¹)²}can be easily obtained via taking advantage of [17, Lemma 1.2] or [29, Theorem 3.3.1]. By (9), it is obvious that the running best rate of the sequence{d^k+1}iso(^√¹_k).

Similar to the proof of Theorem 2.9 in [1], we can claim the global convergence of the considered sequence(x^k,y^k, w^k)k∈Nunder the KŁ assumption ofL_β. In P2, the sufficient descent inequality (8) is only required for any sufficiently largek, not allk.

In our analysis, P1 gives subsequence convergence, P2 measures the augmented Lagrangian descent, and P3 bounds the subgradient by total point changes. The reader may still obtain P1–P4 when generalizing Algorithm1, for example, by replacing the direct minimization subproblems to prox-gradient or inexact subproblems and by relaxing the ordering in which the primal variables are updated.

3.2 Preliminaries

In this subsection, we give some useful lemmas that will be used in the main proof. To save space, throughout this section we assume assumptions A1–A5 hold, and let

(x⁺,y⁺, w⁺):=(x^k+1,y^k+1, w^k+1). (11) In addition, we letA_<sx_<s:=

i<s Aixi and, in a similar fashion,A_>sx_>s :=

i>sAixi.

3A nonnegative sequenceakinduces its running best sequencebk =min{ai :i ≤k}; therefore,akhas running best rate ofo(1/k)ifbk=o(1/k).

(12)

Lemma 2 Ifβ > M¯²L_h(M is defined in A3), all the subproblems in Algorithm¯ 1are well defined.

This lemma is on its own, so we leave its proof to the “Appendix”.

Lemma 3 (bound dual by primal)Letλ₊₊(B^TB)be the smallest strictly-positive eigenvalue of B^TB, CLhM¯λ^−1/2₊₊ (B^TB). For all k∈N, it holds that

(a) B^Tw^k= −∇h(y^k).

(b) w⁺−w^k ≤CBy⁺−By^k.

Proof Part (a) follows directly from the optimality condition ofy^k: 0= ∇h(y^k)+B^Tw^k−1+ βB^T(Ax^k+By^k),andw^k=w^k−1+β

Ax^k+By^k .

Then let us prove Part (b). Sincew⁺−w^k=β(Ax⁺+By⁺)∈Im(B), we get w⁺−w^k ≤λ^−1/2₊₊ (B^TB)B^T(w⁺−w^k) =λ^−1/2₊₊ (B^TB)∇h(y⁺)

−∇h(y^k) ≤CBy⁺−By^k. The last inequality follows from the Lipschitz property of∇hand Lemma1.

3.3 Main Proof

This subsection proves Theorem1for Algorithm1under Assumptions A1–A5. For allk∈N andi =0, . . . ,p, because of the optimality ofx_i^k, we can introduce the followinggeneral subgradients d_i^kandd¯_i^k,

d¯_i^k:= −(A^T_iw⁺+βρ_i^k)∈∂if(x_<i⁺,x_i⁺,x^k_>_i), (12) d_i^k:= −∇ig(x_<i⁺,x_i⁺,x_>i^k )+ ¯d_i^k ∈∂f_i(x_i⁺), (13) where

ρ_i^k :=A^T_i(A_>ix_>^k_i−A_>ix_>i⁺)+A^T_i(By^k−By⁺).

The next two lemmas estimate the descent ofL_β(x,y, w)at each iteration.

Lemma 4 (descent of_L_βduringx_iupdate)The iterates in Algorithm1satisfy 1. L_β(x_<⁺_i,x^k_i,x_>^k_i,y^k, w^k)≥L_β(x⁺_<_i,x⁺_i ,x_>^k_i,y^k, w^k), i =0, . . . ,p;

2. _L_β(x^k,y^k, w^k)≥Lβ(x⁺,y^k, w^k); 3. Lβ(x^k,y^k, w^k)−Lβ(x⁺,y^k, w^k)=_p

i=0ri, where

r_i := f(x_<⁺_i,x_i^k,x^k_>i)− f(x_<⁺_i,x_i⁺,x_>i^k )− ¯d_i^k,x_i^k−x_i⁺ + β

2A_ix_i^k−A_ix_i⁺² ≥0, (14) whered¯_i^kis defined in(12).

4. For i =1, . . . ,p (without the block i=0), if f_i(x_i^k)+γi

2x_i^k−x_i⁺²≥ f_i(x_i⁺)+ d_i^k,x_i^k−x_i⁺ , (15) holds with constantγi ≥0(later, this condition will be shown to hold), then we have

ri ≥ β−γiM¯²−LgM¯²

2 Aix_i^k−Aix_i⁺², (16) where the constants L_gandM are defined in Assumptions A4 and A3, respectively.¯