SMAI-JCM
SMAI Journal of Computational Mathematics

Block-wise Alternating Direction Method of Multipliers for Multiple-block Convex Programming and Beyond

Bingsheng He & Xiaoming Yuan, Volume 1 (2015), p. 145-174.

<http://smai-jcm.cedram.org/item?id=SMAI-JCM_2015__1__145_0>

© Société de Mathématiques Appliquées et Industrielles, 2015. Some rights reserved.

Article published online by cedram, the Centre de diffusion des revues académiques de mathématiques, http://www.cedram.org/

Vol. 1, 145-174 (2015)

Block-wise Alternating Direction Method of Multipliers for Multiple-block Convex Programming and Beyond

Bingsheng He (1), Xiaoming Yuan (2)

(1) Department of Mathematics, South University of Science and Technology of China, and Department of Mathematics, Nanjing University, China.
E-mail address: hebma@nju.edu.cn

(2) Department of Mathematics, Hong Kong Baptist University, Hong Kong.
E-mail address: xmyuan@hkbu.edu.hk

Abstract. The alternating direction method of multipliers (ADMM) is a benchmark for solving a linearly constrained convex minimization model with a two-block separable objective function; and it has been shown that its direct extension to a multiple-block case, where the objective function is the sum of more than two functions, is not necessarily convergent. For the multiple-block case, a natural idea is to artificially group the objective functions and the corresponding variables into two groups and then apply the original ADMM directly; the block-wise ADMM is so named because each of the resulting ADMM subproblems may involve more than one function in its objective. Such a subproblem of the block-wise ADMM may not be easy, as it may require minimizing more than one function with coupled variables simultaneously. We discuss how to further decompose the block-wise ADMM's subproblems and obtain easier subproblems, so that the properties of each function in the objective can be individually and thus effectively used, while the convergence can still be ensured. The generalized ADMM and the strictly contractive Peaceman-Rachford splitting method, two schemes closely related to the ADMM, will also be extended to block-wise versions to tackle multiple-block convex programming cases. We present the convergence analysis, including both the global convergence and the worst-case convergence rate measured by the iteration complexity, for these three block-wise splitting schemes in a unified framework.

Math. classification. 90C25, 90C06, 65K05.

Keywords. Convex programming, Operator splitting methods, Alternating direction method of multipliers, Proximal point algorithm, Douglas-Rachford splitting method, Peaceman-Rachford splitting method, Convergence rate, Iteration complexity.

1. Introduction

We consider a separable convex minimization problem with linear constraints, whose objective function is the sum of more than one function without coupled variables:

\[
\min\Big\{\sum_{i=1}^{m}\theta_i(x_i)\;\Big|\;\sum_{i=1}^{m}A_i x_i = b,\ x_i\in X_i,\ i=1,\cdots,m\Big\}, \qquad (1.1)
\]
where $\theta_i:\mathbb{R}^{n_i}\to\mathbb{R}$ $(i=1,\cdots,m)$ are convex (not necessarily smooth) closed functions; $A_i\in\mathbb{R}^{l\times n_i}$, $b\in\mathbb{R}^l$, and $X_i\subseteq\mathbb{R}^{n_i}$ $(i=1,\cdots,m)$ are convex sets. The solution set of (1.1) is assumed to be nonempty throughout our discussions in this paper. We also assume that the matrices $A_i^TA_i$ $(i=1,\ldots,m)$ are all nonsingular.

Bingsheng He was supported by the NSFC grant: 11471156.

Xiaoming Yuan was supported by a General Research Fund from Hong Kong Research Grants Council.


Let the augmented Lagrangian function of (1.1) be
\[
\mathcal{L}^m_\beta(x_1,x_2,\cdots,x_m,\lambda) = \sum_{i=1}^{m}\theta_i(x_i) - \lambda^T\Big(\sum_{i=1}^{m}A_i x_i - b\Big) + \frac{\beta}{2}\Big\|\sum_{i=1}^{m}A_i x_i - b\Big\|^2, \qquad (1.2)
\]
with $\lambda\in\mathbb{R}^l$ the Lagrange multiplier and $\beta>0$ a penalty parameter. For the special case of (1.1) with $m=2$, the alternating direction method of multipliers (ADMM) in [14] reads as

\[
\left\{\begin{aligned}
x_1^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(x_1,x_2^k,\lambda^k)\;\big|\;x_1\in X_1\big\},\\
x_2^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(x_1^{k+1},x_2,\lambda^k)\;\big|\;x_2\in X_2\big\},\\
\lambda^{k+1} &= \lambda^k - \beta\big(A_1x_1^{k+1} + A_2x_2^{k+1} - b\big).
\end{aligned}\right. \qquad (1.3)
\]
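To fix ideas, here is a minimal NumPy sketch of the iteration (1.3) for the toy choice $\theta_1(x_1)=\frac{1}{2}\|x_1-c_1\|^2$, $\theta_2(x_2)=\frac{1}{2}\|x_2-c_2\|^2$ with $X_1=\mathbb{R}^{n_1}$, $X_2=\mathbb{R}^{n_2}$, so that both subproblems reduce to small linear systems; the function name, the toy objective, and all parameter values are illustrative assumptions and not part of the paper.

```python
import numpy as np

def admm_two_block(A1, A2, b, c1, c2, beta=1.0, iters=300):
    """Sketch of scheme (1.3) for theta_i(x_i) = 0.5*||x_i - c_i||^2 and X_i = R^{n_i};
    each subproblem is then an unconstrained strongly convex quadratic."""
    x1, x2 = np.zeros(A1.shape[1]), np.zeros(A2.shape[1])
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        # x1-step: minimize L_beta(x1, x2^k, lambda^k) over x1
        x1 = np.linalg.solve(np.eye(A1.shape[1]) + beta * A1.T @ A1,
                             c1 + A1.T @ lam - beta * A1.T @ (A2 @ x2 - b))
        # x2-step: minimize L_beta(x1^{k+1}, x2, lambda^k) over x2
        x2 = np.linalg.solve(np.eye(A2.shape[1]) + beta * A2.T @ A2,
                             c2 + A2.T @ lam - beta * A2.T @ (A1 @ x1 - b))
        # multiplier update
        lam = lam - beta * (A1 @ x1 + A2 @ x2 - b)
    return x1, x2, lam

# usage on a random feasible instance
rng = np.random.default_rng(1)
A1, A2 = rng.standard_normal((6, 3)), rng.standard_normal((6, 4))
b = A1 @ rng.standard_normal(3) + A2 @ rng.standard_normal(4)
x1, x2, lam = admm_two_block(A1, A2, b, np.ones(3), np.ones(4))
print(np.linalg.norm(A1 @ x1 + A2 @ x2 - b))  # constraint residual, expected to be small
```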

Recently, the ADMM has found many efficient applications in a broad spectrum of fields such as machine learning, statistical learning, computer vision, wireless networks, and so on. We refer the reader to [1, 7, 12] for some recent review papers on the ADMM. For the multiple-block case of (1.1) with $m\ge 3$, the direct extension of the ADMM reads as

\[
\left\{\begin{aligned}
x_1^{k+1} &= \arg\min\big\{\mathcal{L}^m_\beta(x_1,x_2^k,\cdots,x_m^k,\lambda^k)\;\big|\;x_1\in X_1\big\},\\
&\;\;\vdots\\
x_i^{k+1} &= \arg\min\big\{\mathcal{L}^m_\beta(x_1^{k+1},\cdots,x_{i-1}^{k+1},x_i,x_{i+1}^k,\cdots,x_m^k,\lambda^k)\;\big|\;x_i\in X_i\big\},\\
&\;\;\vdots\\
x_m^{k+1} &= \arg\min\big\{\mathcal{L}^m_\beta(x_1^{k+1},\cdots,x_{m-1}^{k+1},x_m,\lambda^k)\;\big|\;x_m\in X_m\big\},\\
\lambda^{k+1} &= \lambda^k - \beta\Big(\sum_{i=1}^{m}A_ix_i^{k+1} - b\Big).
\end{aligned}\right. \qquad (1.4)
\]

The direct extension of ADMM (1.4) indeed works empirically for some applications, as shown in, e.g., [32, 34]. However, it was shown in [3] that the scheme (1.4) is not necessarily convergent. In the literature, some surrogates of (1.4) have been well studied. For example, the schemes in [22, 21] suggest correcting the output of (1.4) appropriately to ensure the convergence; the scheme in [23] slightly changes the order of updating the Lagrange multiplier and twists some of the subproblems appropriately; and the scheme in [26] suggests attaching a shrinking factor to the Lagrange multiplier updating step in (1.4).

Given the wide applicability of the ADMM (1.3) for a two-block convex minimization model and the convergence difficulty of its direct extension (1.4) for a multiple-block counterpart, in this paper we mainly answer the question of how to use the original ADMM scheme (1.3) directly for the multiple-block model (1.1) and how to design an implementable algorithm with provable convergence that can individually take advantage of the properties of the functions in the objective of (1.1). Let us first elaborate on the difficulty of applying the original ADMM (1.3) to the model (1.1) with $m\ge 3$. Conceptually, we can group the $m$ functions in the objective of (1.1), and accordingly all the variables, into two groups; the original ADMM scheme (1.3) then becomes applicable in a block-wise form.

That is, we can rewrite the model (1.1) as
\[
\min\Big\{\sum_{i=1}^{m_1}\theta_i(x_i) + \sum_{j=1}^{m_2}\phi_j(y_j)\;\Big|\;\sum_{i=1}^{m_1}A_ix_i + \sum_{j=1}^{m_2}B_jy_j = b,\ x_i\in X_i,\ y_j\in Y_j\Big\}, \qquad (1.5)
\]

where $m_1\ge 1$, $m_2\ge 1$ and $m = m_1 + m_2$. Note that for the second group, we relabel $(\theta_i,x_i,A_i,X_i)$ for $i=m_1+1,\cdots,m$ in (1.1) as $(\phi_j,y_j,B_j,Y_j)$ with $j=1,2,\cdots,m_2$ in (1.5), respectively. This relabeling will significantly simplify our notation in the coming analysis and help us expose our idea more clearly. Furthermore, with the notation
\[
\mathbf{x} = \begin{pmatrix} x_1\\ \vdots\\ x_{m_1}\end{pmatrix},\quad
\mathbf{y} = \begin{pmatrix} y_1\\ \vdots\\ y_{m_2}\end{pmatrix},\quad
\mathcal{A} = (A_1,\ldots,A_{m_1}),\quad
\mathcal{B} = (B_1,\ldots,B_{m_2}), \qquad (1.6)
\]
and
\[
\vartheta(\mathbf{x}) = \sum_{i=1}^{m_1}\theta_i(x_i),\quad
\varphi(\mathbf{y}) = \sum_{j=1}^{m_2}\phi_j(y_j),\quad
\mathcal{X} = \prod_{i=1}^{m_1}X_i,\quad
\mathcal{Y} = \prod_{j=1}^{m_2}Y_j, \qquad (1.7)
\]
the reformulation (1.5) of the model (1.1) can be written in the block-wise form

\[
\min\big\{\vartheta(\mathbf{x}) + \varphi(\mathbf{y})\;\big|\;\mathcal{A}\mathbf{x} + \mathcal{B}\mathbf{y} = b,\ \mathbf{x}\in\mathcal{X},\ \mathbf{y}\in\mathcal{Y}\big\}. \qquad (1.8)
\]
The Lagrange function of (1.8) is
\[
L^2(\mathbf{x},\mathbf{y},\lambda) = \vartheta(\mathbf{x}) + \varphi(\mathbf{y}) - \lambda^T(\mathcal{A}\mathbf{x} + \mathcal{B}\mathbf{y} - b), \qquad (1.9)
\]
which is defined on
\[
\Omega = \mathcal{X}\times\mathcal{Y}\times\mathbb{R}^l = X_1\times\cdots\times X_{m_1}\times Y_1\times\cdots\times Y_{m_2}\times\mathbb{R}^l. \qquad (1.10)
\]
The augmented Lagrangian function can be denoted by

\[
\begin{aligned}
\mathcal{L}^2_\beta(\mathbf{x},\mathbf{y},\lambda) &= L^2(\mathbf{x},\mathbf{y},\lambda) + \frac{\beta}{2}\|\mathcal{A}\mathbf{x} + \mathcal{B}\mathbf{y} - b\|^2\\
&= \sum_{i=1}^{m_1}\theta_i(x_i) + \sum_{j=1}^{m_2}\phi_j(y_j) - \lambda^T\Big(\sum_{i=1}^{m_1}A_ix_i + \sum_{j=1}^{m_2}B_jy_j - b\Big) + \frac{\beta}{2}\Big\|\sum_{i=1}^{m_1}A_ix_i + \sum_{j=1}^{m_2}B_jy_j - b\Big\|^2. \qquad (1.11)
\end{aligned}
\]
Then, conceptually, the original ADMM (1.3) is implementable for the block-wise reformulation (1.8) of the model (1.1). The resulting scheme, called a block-wise ADMM hereafter, reads as

\[
\left\{\begin{aligned}
\mathbf{x}^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(\mathbf{x},\mathbf{y}^k,\lambda^k)\;\big|\;\mathbf{x}\in\mathcal{X}\big\},\\
\mathbf{y}^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(\mathbf{x}^{k+1},\mathbf{y},\lambda^k)\;\big|\;\mathbf{y}\in\mathcal{Y}\big\},\\
\lambda^{k+1} &= \lambda^k - \beta\big(\mathcal{A}\mathbf{x}^{k+1} + \mathcal{B}\mathbf{y}^{k+1} - b\big).
\end{aligned}\right. \qquad (1.12)
\]
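For instance, in the smallest multiple-block case $m=3$ (the setting of the counter example mentioned below), grouping with $m_1=1$ and $m_2=2$ gives
\[
\mathbf{x} = x_1,\quad \mathbf{y} = \begin{pmatrix} y_1\\ y_2\end{pmatrix} = \begin{pmatrix} x_2\\ x_3\end{pmatrix},\quad
\mathcal{A} = A_1,\quad \mathcal{B} = (A_2,A_3),\quad
\vartheta(\mathbf{x}) = \theta_1(x_1),\quad \varphi(\mathbf{y}) = \theta_2(x_2) + \theta_3(x_3),
\]
so the x-subproblem of (1.12) involves only $\theta_1$, while the y-subproblem couples $\theta_2$ and $\theta_3$ through the quadratic term in (1.11).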

For a generic case of (1.1) with $m\ge 3$, however, solving the x- and y-subproblems in (1.12) is usually hard or even infeasible for many concrete applications of the abstract model (1.1). This is mainly because such a subproblem is in a block-wise form and may require minimizing more than one function with variables coupled by the quadratic term in (1.11). Recall that for the ADMM (1.3) and its variants, all the functions in the objective of (1.1) are treated individually in their decomposed subproblems, and thus these functions' properties can be effectively exploited in the algorithmic implementation. One representative case arising often in sparse and/or low-rank optimization models is that when a function $\theta_i$ is simple in the sense that its proximal operator can be evaluated explicitly, the corresponding subproblem or its linearized version is easy enough to have a closed-form solution, meaning that the subproblem is completely exempted from any inner iteration. This feature is indeed the main reason for the efficiency of ADMM-related schemes in a broad spectrum of applications. Therefore, although the scheme (1.12) represents a direct application of the original ADMM to the block-wise reformulation (1.8) and it makes theoretical sense, it is necessary to discuss how to solve its x- and y-subproblems efficiently. With the just-mentioned desire of taking advantage of the properties of the $\theta_i$'s individually, we can consider further decomposing the x- and y-subproblems in (1.12) into $m_1$ and $m_2$ smaller subproblems, respectively (applying the decomposition to the quadratic term in (1.11) which couples the variables), so that each of them only has to deal with one $\theta_i(x_i)$ in its objective. That is, for solving the multiple-block model (1.1) with $m\ge 3$, we can consider the following splitting version of the block-wise ADMM (1.12):

\[
\left\{\begin{aligned}
x_1^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(x_1,x_2^k,\cdots,x_{m_1}^k,\mathbf{y}^k,\lambda^k)\;\big|\;x_1\in X_1\big\},\\
&\;\;\vdots\\
x_i^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(x_1^k,\cdots,x_{i-1}^k,x_i,x_{i+1}^k,\cdots,x_{m_1}^k,\mathbf{y}^k,\lambda^k)\;\big|\;x_i\in X_i\big\},\\
&\;\;\vdots\\
x_{m_1}^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(x_1^k,x_2^k,\cdots,x_{m_1-1}^k,x_{m_1},\mathbf{y}^k,\lambda^k)\;\big|\;x_{m_1}\in X_{m_1}\big\},\\
y_1^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(\mathbf{x}^{k+1},y_1,y_2^k,\cdots,y_{m_2}^k,\lambda^k)\;\big|\;y_1\in Y_1\big\},\\
&\;\;\vdots\\
y_j^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(\mathbf{x}^{k+1},y_1^k,\cdots,y_{j-1}^k,y_j,y_{j+1}^k,\cdots,y_{m_2}^k,\lambda^k)\;\big|\;y_j\in Y_j\big\},\\
&\;\;\vdots\\
y_{m_2}^{k+1} &= \arg\min\big\{\mathcal{L}^2_\beta(\mathbf{x}^{k+1},y_1^k,y_2^k,\cdots,y_{m_2-1}^k,y_{m_2},\lambda^k)\;\big|\;y_{m_2}\in Y_{m_2}\big\},\\
\lambda^{k+1} &= \lambda^k - \beta\big(\mathcal{A}\mathbf{x}^{k+1} + \mathcal{B}\mathbf{y}^{k+1} - b\big).
\end{aligned}\right. \qquad (1.13)
\]

Note that in (1.13), internally we consider the parallel (i.e., Jacobian) decomposition for the x- and y-subproblems arising in (1.12). See Remark 3.5 for the discussion where the alternating (i.e., Gauss-Seidel) decomposition is implemented internally for these subproblems. With this parallel decomposition, it is possible to solve the smaller subproblems decomposed from the x- and y-subproblems simultaneously on parallel computing infrastructures, which is of particular interest for scenarios where large-dimensional data is considered and distributed computation is requested. Thus, the scheme (1.13) is a mixture of alternating decomposition outside (i.e., the block-wise variables $\mathbf{x}$ and $\mathbf{y}$ are updated alternatingly) and parallel decomposition inside (i.e., the individual variables in both the x- and y-subproblems are updated in parallel). It is clear that the scheme (1.13) enjoys the same good feature as the ADMM-related schemes, because each of its subproblems could be very easy if the functions $\theta_i$ are simple enough (e.g., their proximal operators can be evaluated explicitly). Despite this nice feature, the scheme (1.13) is, however, not necessarily convergent; see a counter example in the appendix.

In fact, the divergence of (1.13) can be intuitively understood: the decomposed $(x_1,x_2,\cdots,x_{m_1})$-subproblems (resp., $(y_1,y_2,\cdots,y_{m_2})$-subproblems) in (1.13) represent a more implementable but inexact version of the block-wise x-subproblem (resp., y-subproblem) in (1.12). This approximation, however, is not accurate enough, because the block-wise x-subproblem (resp., y-subproblem) in (1.12) is decomposed $m_1$ (resp., $m_2$) times in (1.13). Therefore, the convergence of the block-wise ADMM (1.12), which is indeed guaranteed under the condition that both of its block-wise subproblems are solved exactly, or inexactly but with a certain requirement on the inexactness (see, e.g., [18, 30]), does not necessarily hold for its splitting version (1.13). The counter example in the appendix is indeed a case of the model (1.1) with $m=3$; thus we have $m_1=1$ and $m_2=2$ when implementing the splitting version (1.13), meaning that the x-subproblem in (1.13) is the same as that in (1.12), while only the y-subproblem in (1.12) is approximated by two further decomposed subproblems in (1.13). For this case with the least extent of approximation, the scheme (1.13) is already shown not to be necessarily convergent by this counter example. This fact clearly indicates the failure of convergence of (1.13) for a generic case where both $m_1\ge 2$ and $m_2\ge 2$. How to derive a convergence-guaranteed algorithm based on the scheme (1.13) is thus the emphasis of this paper.

We will present an algorithm based on the scheme (1.13) in Section 3, preceded by some preliminaries summarized in Section 2 for further analysis. Then, in Section 4, we prove the convergence of the new scheme and establish its worst-case convergence rate measured by the iteration complexity in both the ergodic and a nonergodic senses. In Sections 5 and 6, we apply a similar analysis to the generalized ADMM in [6] and the strictly contractive Peaceman-Rachford splitting method in [20], respectively, and obtain their block-wise versions which are suitable for the model (1.1). The convergence analysis, including the convergence rate analysis, for these three block-wise splitting methods can be presented in a unified framework. Finally, we make some conclusions in Section 7.

2. Preliminaries

In this section, we summarize some known results in the literature for further analysis.

2.1. A variational inequality characterization

We denote by $\mathbf{x} = (x_1,\ldots,x_{m_1})$ and $\mathbf{y} = (y_1,\ldots,y_{m_2})$. Let $(\mathbf{x}^*,\mathbf{y}^*,\lambda^*)$ be a saddle point of the Lagrange function (1.9). Then, for any $\lambda\in\mathbb{R}^l$, $\mathbf{x}\in\mathcal{X}$, $\mathbf{y}\in\mathcal{Y}$, we have
\[
L^2(\mathbf{x}^*,\mathbf{y}^*,\lambda) \le L^2(\mathbf{x}^*,\mathbf{y}^*,\lambda^*) \le L^2(\mathbf{x},\mathbf{y},\lambda^*).
\]
Indeed, finding a saddle point of $L^2(\mathbf{x},\mathbf{y},\lambda)$ can be expressed as the following variational inequalities: finding $(x_1^*,\ldots,x_{m_1}^*,y_1^*,\ldots,y_{m_2}^*,\lambda^*)\in\Omega$, where $\Omega$ is defined in (1.10), such that
\[
\left\{\begin{aligned}
&x_1^*\in X_1, && \theta_1(x_1)-\theta_1(x_1^*) + (x_1-x_1^*)^T(-A_1^T\lambda^*)\ge 0, && \forall\, x_1\in X_1,\\
&\;\;\vdots\\
&x_{m_1}^*\in X_{m_1}, && \theta_{m_1}(x_{m_1})-\theta_{m_1}(x_{m_1}^*) + (x_{m_1}-x_{m_1}^*)^T(-A_{m_1}^T\lambda^*)\ge 0, && \forall\, x_{m_1}\in X_{m_1},\\
&y_1^*\in Y_1, && \phi_1(y_1)-\phi_1(y_1^*) + (y_1-y_1^*)^T(-B_1^T\lambda^*)\ge 0, && \forall\, y_1\in Y_1,\\
&\;\;\vdots\\
&y_{m_2}^*\in Y_{m_2}, && \phi_{m_2}(y_{m_2})-\phi_{m_2}(y_{m_2}^*) + (y_{m_2}-y_{m_2}^*)^T(-B_{m_2}^T\lambda^*)\ge 0, && \forall\, y_{m_2}\in Y_{m_2},\\
&\lambda^*\in\mathbb{R}^l, && (\lambda-\lambda^*)^T\Big(\sum_{i=1}^{m_1}A_ix_i^* + \sum_{j=1}^{m_2}B_jy_j^* - b\Big)\ge 0, && \forall\, \lambda\in\mathbb{R}^l.
\end{aligned}\right. \qquad (2.1)
\]

The variational inequalities in (2.1) can be rewritten in the compact form
\[
\mathrm{VI}(\Omega,F,\theta):\qquad w^*\in\Omega,\quad f(u) - f(u^*) + (w-w^*)^T F(w^*)\ge 0,\quad \forall\, w\in\Omega, \qquad (2.2a)
\]
with the definitions
\[
f(u) = \sum_{i=1}^{m_1}\theta_i(x_i) + \sum_{j=1}^{m_2}\phi_j(y_j), \qquad (2.2b)
\]
\[
u = \begin{pmatrix} x_1\\ \vdots\\ x_{m_1}\\ y_1\\ \vdots\\ y_{m_2}\end{pmatrix},\qquad
w = \begin{pmatrix} x_1\\ \vdots\\ x_{m_1}\\ y_1\\ \vdots\\ y_{m_2}\\ \lambda\end{pmatrix},\qquad
F(w) = \begin{pmatrix} -A_1^T\lambda\\ \vdots\\ -A_{m_1}^T\lambda\\ -B_1^T\lambda\\ \vdots\\ -B_{m_2}^T\lambda\\ \sum_{i=1}^{m_1}A_ix_i + \sum_{j=1}^{m_2}B_jy_j - b\end{pmatrix}, \qquad (2.2c)
\]
and $\Omega$ is given in (1.10) (it can also be expressed as $\Omega = \mathcal{X}\times\mathcal{Y}\times\mathbb{R}^l$). We also denote by $\Omega^*$ the set of all saddle points of $\mathcal{L}^2_\beta(\mathbf{x},\mathbf{y},\lambda)$.

Recall the notation defined in (1.6)-(1.7). For the variational inequality (2.2), we further have
\[
f(u) = \vartheta(\mathbf{x}) + \varphi(\mathbf{y}), \qquad (2.3a)
\]
where
\[
u = \begin{pmatrix} \mathbf{x}\\ \mathbf{y}\end{pmatrix},\qquad
w = \begin{pmatrix} \mathbf{x}\\ \mathbf{y}\\ \lambda\end{pmatrix},\qquad
F(w) = \begin{pmatrix} -\mathcal{A}^T\lambda\\ -\mathcal{B}^T\lambda\\ \mathcal{A}\mathbf{x} + \mathcal{B}\mathbf{y} - b\end{pmatrix}. \qquad (2.3b)
\]

This variational inequality indeed characterizes the first-order optimality condition of the block-wise reformulation (1.5) of the model (1.1). We need this variational inequality characterization for the upcoming theoretical analysis.

2.2. Some properties of the matrices A and B

For the matrices $\mathcal{A}$ and $\mathcal{B}$ defined in (1.6), we have some properties which are useful for our analysis. We summarize them in the following lemma; its proof is omitted as it is trivial.

Lemma 2.1. For the matrices $\mathcal{A}$ and $\mathcal{B}$ defined in (1.6), we have the following conclusions:
\[
m_1\cdot\mathrm{diag}(\mathcal{A}^T\mathcal{A}) \succeq \mathcal{A}^T\mathcal{A}
\quad\text{and}\quad
m_2\cdot\mathrm{diag}(\mathcal{B}^T\mathcal{B}) \succeq \mathcal{B}^T\mathcal{B}, \qquad (2.4)
\]
where $\mathrm{diag}(\mathcal{A}^T\mathcal{A})$ and $\mathrm{diag}(\mathcal{B}^T\mathcal{B})$ are defined by
\[
\mathrm{diag}(\mathcal{A}^T\mathcal{A}) := \begin{pmatrix} A_1^TA_1 & & \\ & \ddots & \\ & & A_{m_1}^TA_{m_1}\end{pmatrix}
\quad\text{and}\quad
\mathrm{diag}(\mathcal{B}^T\mathcal{B}) := \begin{pmatrix} B_1^TB_1 & & \\ & \ddots & \\ & & B_{m_2}^TB_{m_2}\end{pmatrix},
\]
respectively.
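In fact, (2.4) is a direct consequence of the Cauchy-Schwarz inequality: for any $\mathbf{x} = (x_1,\ldots,x_{m_1})$,
\[
\mathbf{x}^T\mathcal{A}^T\mathcal{A}\,\mathbf{x} = \Big\|\sum_{i=1}^{m_1}A_ix_i\Big\|^2 \le m_1\sum_{i=1}^{m_1}\|A_ix_i\|^2 = m_1\,\mathbf{x}^T\mathrm{diag}(\mathcal{A}^T\mathcal{A})\,\mathbf{x},
\]
and the same argument with $m_2$ gives the second relation for $\mathcal{B}$.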

Lemma 2.2. Let $\tau_i > m_i - 1$ for $i=1,2$. Then we have
\[
D_{\mathcal{A}} := (\tau_1+1)\beta\,\mathrm{diag}(\mathcal{A}^T\mathcal{A}) - \beta\mathcal{A}^T\mathcal{A} \succ 0 \qquad (2.5)
\]
and
\[
D_{\mathcal{B}} := (\tau_2+1)\beta\,\mathrm{diag}(\mathcal{B}^T\mathcal{B}) - \beta\mathcal{B}^T\mathcal{B} \succ 0, \qquad (2.6)
\]
where $\mathrm{diag}(\mathcal{A}^T\mathcal{A})$ and $\mathrm{diag}(\mathcal{B}^T\mathcal{B})$ are defined in Lemma 2.1.

Proof. We only need to show (2.5). It follows from (2.4) that
\[
D_{\mathcal{A}} = (\tau_1+1)\beta\,\mathrm{diag}(\mathcal{A}^T\mathcal{A}) - \beta\mathcal{A}^T\mathcal{A} \succeq \beta(\tau_1+1-m_1)\,\mathrm{diag}(\mathcal{A}^T\mathcal{A}).
\]
The positive definiteness of $D_{\mathcal{A}}$ follows directly from $\tau_1+1-m_1>0$ and $\mathrm{diag}(\mathcal{A}^T\mathcal{A})\succ 0$.
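Relations (2.4) and (2.5) can also be checked numerically. The sketch below uses assumed random data (the dimensions, block sizes and the choice $\tau_1 = m_1 - 1 + 10^{-3}$ are arbitrary), with blocks $A_i$ of full column rank as assumed after (1.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random instance: m1 blocks A_i with full column rank (tall Gaussian matrices).
m1, l = 4, 30
n = [3, 5, 2, 4]                                  # column sizes n_i (arbitrary)
A_blocks = [rng.standard_normal((l, ni)) for ni in n]
A = np.hstack(A_blocks)                           # A = (A_1, ..., A_{m1}) as in (1.6)

AtA = A.T @ A
diag_AtA = np.zeros_like(AtA)
offset = 0
for Ai in A_blocks:                               # block-diagonal part diag(A^T A)
    ni = Ai.shape[1]
    diag_AtA[offset:offset + ni, offset:offset + ni] = Ai.T @ Ai
    offset += ni

beta, tau1 = 1.0, m1 - 1 + 1e-3                   # any tau_1 > m_1 - 1, as in Lemma 2.2

# (2.4): m1 * diag(A^T A) - A^T A should be positive semidefinite.
print(np.linalg.eigvalsh(m1 * diag_AtA - AtA).min() >= -1e-10)

# (2.5): D_A = (tau1 + 1) * beta * diag(A^T A) - beta * A^T A should be positive definite.
DA = (tau1 + 1) * beta * diag_AtA - beta * AtA
print(np.linalg.eigvalsh(DA).min() > 0)
```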

3. A splitting version of the block-wise ADMM (1.12) for (1.1)

In this section, we present a splitting version of the block-wise ADMM (1.12) for the model (1.1) and give some remarks.

Algorithm 1: A splitting version of the block-wise ADMM for (1.1)

Initialization: Specify a grouping strategy for (1.1) and determine the integers $m_1$ and $m_2$. Choose the constants $\tau_1 > m_1 - 1$, $\tau_2 > m_2 - 1$ and $\beta > 0$. For a given iterate $w^k = (\mathbf{x}^k,\mathbf{y}^k,\lambda^k) = (x_1^k,\ldots,x_{m_1}^k,y_1^k,\ldots,y_{m_2}^k,\lambda^k)$, the new iterate $w^{k+1}$ is generated by the following steps.

\[
\left\{\begin{aligned}
x_i^{k+1} &= \arg\min_{x_i\in X_i}\Big\{\mathcal{L}^2_\beta\big(x_1^k,\ldots,x_{i-1}^k,x_i,x_{i+1}^k,\ldots,x_{m_1}^k,\mathbf{y}^k,\lambda^k\big) + \frac{\tau_1\beta}{2}\|A_i(x_i - x_i^k)\|^2\Big\}, \quad i=1,\cdots,m_1,\\
y_j^{k+1} &= \arg\min_{y_j\in Y_j}\Big\{\mathcal{L}^2_\beta\big(\mathbf{x}^{k+1},y_1^k,\ldots,y_{j-1}^k,y_j,y_{j+1}^k,\ldots,y_{m_2}^k,\lambda^k\big) + \frac{\tau_2\beta}{2}\|B_j(y_j - y_j^k)\|^2\Big\}, \quad j=1,\cdots,m_2,\\
\lambda^{k+1} &= \lambda^k - \beta\big(\mathcal{A}\mathbf{x}^{k+1} + \mathcal{B}\mathbf{y}^{k+1} - b\big).
\end{aligned}\right. \qquad (3.1)
\]
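For concreteness, the following NumPy sketch implements the scheme (3.1) for the toy choice $\theta_i(x_i)=\frac{1}{2}\|x_i-c_i\|^2$, $\phi_j(y_j)=\frac{1}{2}\|y_j-d_j\|^2$ with $X_i=\mathbb{R}^{n_i}$ and $Y_j=\mathbb{R}^{n_j}$, so that every proximally regularized subproblem is solved in closed form via a small linear system; the function name, the toy objective, and the specific values $\tau_i = m_i - 1 + 0.1$, $\beta = 1$ are illustrative assumptions and not part of the paper.

```python
import numpy as np

def blockwise_admm(A_blocks, B_blocks, c, d, b, beta=1.0, iters=500):
    """Sketch of scheme (3.1) for theta_i(x_i) = 0.5*||x_i - c_i||^2,
    phi_j(y_j) = 0.5*||y_j - d_j||^2, with unconstrained blocks; every
    regularized subproblem then has a closed-form solution."""
    m1, m2 = len(A_blocks), len(B_blocks)
    tau1, tau2 = m1 - 1 + 0.1, m2 - 1 + 0.1          # tau_i > m_i - 1 (Algorithm 1)
    x = [np.zeros(A.shape[1]) for A in A_blocks]
    y = [np.zeros(B.shape[1]) for B in B_blocks]
    lam = np.zeros(b.shape[0])

    Ax = lambda xs: sum(A @ xi for A, xi in zip(A_blocks, xs))
    By = lambda ys: sum(B @ yj for B, yj in zip(B_blocks, ys))

    for _ in range(iters):
        # x-group: Jacobian (parallel) updates, other blocks frozen at iterate k
        resid_k = Ax(x) + By(y) - b
        x_new = []
        for A, xi, ci in zip(A_blocks, x, c):
            r = resid_k - A @ xi                      # residual excluding block i
            H = np.eye(A.shape[1]) + beta * (1 + tau1) * (A.T @ A)
            g = ci + A.T @ lam - beta * A.T @ r + tau1 * beta * (A.T @ (A @ xi))
            x_new.append(np.linalg.solve(H, g))
        x = x_new

        # y-group: Jacobian updates using the new x-group (alternating outside)
        resid_x = Ax(x) + By(y) - b
        y_new = []
        for B, yj, dj in zip(B_blocks, y, d):
            r = resid_x - B @ yj
            H = np.eye(B.shape[1]) + beta * (1 + tau2) * (B.T @ B)
            g = dj + B.T @ lam - beta * B.T @ r + tau2 * beta * (B.T @ (B @ yj))
            y_new.append(np.linalg.solve(H, g))
        y = y_new

        # multiplier update
        lam = lam - beta * (Ax(x) + By(y) - b)
    return x, y, lam
```

Each regularized subproblem only touches one block $A_i$ or $B_j$, and within the x-group (and within the y-group) the updates do not depend on each other, so they could be carried out in parallel, which is exactly the point of the internal Jacobian decomposition.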


Remark 3.1. As analyzed in the introduction, the scheme (1.13) is not necessarily convergent because its decomposed subproblems might not approximate the x- and y-subproblems of the block-wise ADMM (1.12) accurately enough. Compared to (1.13), the splitting version (3.1) proximally regularizes the decomposed subproblems in (1.13), yet the proximally regularized subproblems are of the same difficulty as their counterparts in (1.13). In fact, these added proximal terms play the role of controlling the proximity of the solutions of the decomposed subproblems to the solutions of the block-wise subproblems in (1.12), and the extent of this control is determined by the proximal coefficients $\tau_1$ and $\tau_2$. This is the intrinsic mechanism by which the convergence of the scheme (3.1) can be guaranteed by adding proximal terms to its subproblems. The requirement $\tau_i > m_i - 1$ for $i=1,2$ is the specific mathematical condition on how tightly the mentioned proximity should be controlled in order to guarantee convergence. More specifically, we require this condition to guarantee the positive definiteness of the matrix $G$ defined in (4.13) and thus ensure the strict contraction of the sequence generated by the scheme (3.1) with respect to the solution set.

Remark 3.2. Algebraically, the scheme (3.1) can be derived from the proximal version of the block-wise ADMM (1.12) with appropriate choices of proximal terms; see, e.g., [18] for an abstract discussion in the variational inequality context. But it should be mentioned that it is significantly different from the so-called linearized ADMM in the literature (e.g., [8, 35, 36]), which aims at linearizing the quadratic terms in the ADMM subproblems with a sufficiently large proximal parameter (e.g., greater than $\beta\cdot\|A_i^TA_i\|$ if the $x_i$-subproblem in (3.1) is considered), thus yielding easier linearized subproblems. In other words, the proximal parameters in the current linearized ADMM literature depend on the matrices involved in the corresponding quadratic terms, and they may need to be very large to ensure convergence if these matrices happen to be ill-conditioned; see, e.g., [16]. In (3.1), however, the proximal parameters $\tau_1$ and $\tau_2$ are just constants independent of the matrices. Indeed, they only rely on the user-assigned numbers of blocks for regrouping the model (1.1), and could be small. More importantly, the idea of using proximal terms to regularize the subproblems in the block-wise ADMM (1.12) and obtain (3.1) is precisely to decompose each block-wise ADMM subproblem into smaller and simpler ones when the block-wise reformulation (1.5) of (1.1) is considered. We could further discuss how to linearize the subproblems in (3.1) and obtain even easier subproblems, as mentioned at the end of Section 7; but our emphasis is the scheme (3.1) itself, and we do not discuss more elaborate versions of it.

Remark 3.3. Notice that if we use $\mathbf{x}^k$, instead of $\mathbf{x}^{k+1}$, in the $y_j$-subproblems of (1.13), we obtain a full Jacobian decomposition of the augmented Lagrangian method applied to the model (1.1); its divergence was shown in [17] even for the case of $m=2$. For that case, adding appropriate proximal terms to regularize the resulting subproblems can also guarantee the convergence; see [17] for a detailed analysis. The scheme (3.1) differs from the scheme in [17] in that the block-wise variables $\mathbf{x}$ and $\mathbf{y}$ are updated alternatingly, i.e., it is $\mathbf{x}^{k+1}$, not $\mathbf{x}^k$, that is used in the $y_j$-subproblems of (3.1). This indeed represents a more accurate approximation of the augmented Lagrangian method when it is applied to the model (1.1). This difference also brings a significant difference in the requirement on the proximal coefficients $\tau_1$ and $\tau_2$: the conditions $\tau_1 > m_1 - 1$ and $\tau_2 > m_2 - 1$ in (3.1) are much less restrictive than the condition in [17], where the proximal coefficient is required to be greater than $m-1$. For example, if we consider a case of (1.1) with $m=20$, then the scheme in [17] requires the proximal coefficient to be greater than 19, whereas if we choose $m_1 = m_2 = 10$ to implement the scheme (3.1), then $\tau_1$ and $\tau_2$ are only required to be greater than 9. This is an important difference, because a larger proximal coefficient makes the proximal term dominate the objective function; the proximity is then controlled in a more conservative way, which is more likely to result in slower convergence.

Remark 3.4. Note that we only discuss grouping the variables and functions of (1.1) "blindly" as in (1.5). For a particular application of (1.1), based on some known information or features, we may group the functions and variables more smartly and even discuss an optimal grouping strategy; but we would emphasize that how to better group the variables and functions for a particular application really depends on the particular structures and features of the application itself. In this paper, we just provide the methodology and theoretical analysis that guarantee the convergence for the most general setting in the form of (1.5).

Remark 3.5. As mentioned, the splitting scheme (3.1) is obtained by implementing the parallel (i.e., Jacobian) decomposition of the block-wise x- and y-subproblems in (1.13). If we instead consider the alternating (i.e., Gauss-Seidel) decomposition of the same subproblems in (1.13), then it is easy to see that the direct extension of ADMM (1.4) is recovered. Similarly to what we will analyze, we can also consider how to ensure the convergence of the scheme (1.4) by adding proximal regularization terms to its subproblems. For succinctness, we omit the details of this analysis.

4. Convergence analysis

In this section, we analyze the convergence of the splitting version (3.1) of the block-wise ADMM (1.12). We will prove the global convergence and establish the worst-case convergence rate measured by the iteration complexity in both the ergodic and a nonergodic senses. First of all, we rewrite the iterative scheme (3.1) in a form that is more favorable for our analysis.

4.1. A reformulation of (3.1)

In our analysis, we need the auxiliary variables

\[
\tilde{\mathbf{x}}^k = \mathbf{x}^{k+1}, \qquad \tilde{\mathbf{y}}^k = \mathbf{y}^{k+1}, \qquad (4.1)
\]
and
\[
\tilde{\lambda}^k = \lambda^k - \beta\big(\mathcal{A}\mathbf{x}^{k+1} + \mathcal{B}\mathbf{y}^k - b\big). \qquad (4.2)
\]

Recall the notation $w = (\mathbf{x},\mathbf{y},\lambda) = (x_1,\cdots,x_{m_1},y_1,\cdots,y_{m_2},\lambda)$ with superscripts $k$ and $k+1$; and further define $\tilde{w} = (\tilde{\mathbf{x}},\tilde{\mathbf{y}},\tilde{\lambda}) = (\tilde{x}_1,\cdots,\tilde{x}_{m_1},\tilde{y}_1,\cdots,\tilde{y}_{m_2},\tilde{\lambda})$ with superscripts. Then, we can artificially rewrite the iterative scheme (3.1) as the following prediction-correction framework.

Prediction.

\[
\tilde{x}_i^k = \arg\min_{x_i\in X_i}\Big\{\mathcal{L}^2_\beta\big(x_1^k,\ldots,x_{i-1}^k,x_i,x_{i+1}^k,\ldots,x_{m_1}^k,\mathbf{y}^k,\lambda^k\big) + \frac{\tau_1\beta}{2}\|A_i(x_i - x_i^k)\|^2\Big\}, \qquad (4.3a)
\]
\[
\tilde{y}_j^k = \arg\min_{y_j\in Y_j}\Big\{\mathcal{L}^2_\beta\big(\tilde{\mathbf{x}}^k,y_1^k,\ldots,y_{j-1}^k,y_j,y_{j+1}^k,\ldots,y_{m_2}^k,\lambda^k\big) + \frac{\tau_2\beta}{2}\|B_j(y_j - y_j^k)\|^2\Big\}, \qquad (4.3b)
\]
\[
\tilde{\lambda}^k = \lambda^k - \beta\big(\mathcal{A}\tilde{\mathbf{x}}^k + \mathcal{B}\mathbf{y}^k - b\big). \qquad (4.3c)
\]

Correction.

\[
w^{k+1} = w^k - M(w^k - \tilde{w}^k), \qquad (4.4a)
\]
where
\[
M = \begin{pmatrix} I & 0 & 0\\ 0 & I & 0\\ 0 & -\beta\mathcal{B} & I \end{pmatrix}, \qquad (4.4b)
\]
and $\tilde{w}^k$ is the predictor generated by (4.3).
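To see that (4.3)-(4.4) is indeed just a rewriting of (3.1), note that the first two block rows of $M$ give $\mathbf{x}^{k+1} = \tilde{\mathbf{x}}^k$ and $\mathbf{y}^{k+1} = \tilde{\mathbf{y}}^k$, in accordance with (4.1), while the third row of (4.4a), combined with (4.3c), recovers the multiplier update in (3.1):
\[
\lambda^{k+1} = \lambda^k + \beta\mathcal{B}(\mathbf{y}^k - \tilde{\mathbf{y}}^k) - (\lambda^k - \tilde{\lambda}^k)
= \tilde{\lambda}^k + \beta\mathcal{B}(\mathbf{y}^k - \tilde{\mathbf{y}}^k)
= \lambda^k - \beta\big(\mathcal{A}\tilde{\mathbf{x}}^k + \mathcal{B}\tilde{\mathbf{y}}^k - b\big)
= \lambda^k - \beta\big(\mathcal{A}\mathbf{x}^{k+1} + \mathcal{B}\mathbf{y}^{k+1} - b\big).
\]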

The iterative scheme (4.3)-(4.4) can thus be understood as a prediction-correction method. We would emphasize that it is the scheme (3.1) that will be implemented in practice; the reformulation (4.3)-(4.4) is considered in our analysis for two reasons. First, our upcoming convergence analysis is essentially based on showing that the sequence generated by (3.1) is strictly contractive with respect to the solution set of (1.1), and it turns out that the progress of proximity to the solution set at each iteration can be measured by the quantity $\|w^k - \tilde{w}^k\|_G^2$, where $G$ is defined in (4.13); see the inequality (4.20) in Theorem 4.4. Second, the analysis we will present in Sections 5 and 6 for two other methods can also be written as prediction-correction frameworks, and using this kind of reformulation allows us to present the analysis for these three methods in a unified framework.

4.2. Global convergence

Indeed, to prove the global convergence of the scheme (3.1), we mainly need to prove two conclusions.

We summarize them in the following two theorems respectively.

Theorem 4.1. Let $\tilde{w}^k$ be generated by (4.3) from the given vector $w^k$. Then we have
\[
\tilde{w}^k\in\Omega,\quad f(u) - f(\tilde{u}^k) + (w - \tilde{w}^k)^T F(\tilde{w}^k) \ge (w - \tilde{w}^k)^T Q(w^k - \tilde{w}^k),\quad \forall\, w\in\Omega, \qquad (4.5)
\]
where $Q$ is defined by
\[
Q = \begin{pmatrix}
(\tau_1+1)\beta\,\mathrm{diag}(\mathcal{A}^T\mathcal{A}) - \beta\mathcal{A}^T\mathcal{A} & 0 & 0\\
0 & (\tau_2+1)\beta\,\mathrm{diag}(\mathcal{B}^T\mathcal{B}) & 0\\
0 & -\mathcal{B} & \frac{1}{\beta}I
\end{pmatrix}. \qquad (4.6)
\]

Proof. First, the $x_i$-subproblem in (4.3a) can be written as
\[
\begin{aligned}
\tilde{x}_i^k &= \arg\min_{x_i\in X_i}\Big\{\mathcal{L}^2_\beta\big(x_1^k,\ldots,x_{i-1}^k,x_i,x_{i+1}^k,\ldots,x_{m_1}^k,\mathbf{y}^k,\lambda^k\big) + \frac{\tau_1\beta}{2}\|A_i(x_i - x_i^k)\|^2\Big\}\\
&\overset{(1.11)}{=} \arg\min_{x_i\in X_i}\Big\{\theta_i(x_i) - (\lambda^k)^T A_i x_i + \frac{\beta}{2}\big\|A_i(x_i - x_i^k) + (\mathcal{A}\mathbf{x}^k + \mathcal{B}\mathbf{y}^k - b)\big\|^2 + \frac{\tau_1\beta}{2}\|A_i(x_i - x_i^k)\|^2\Big\},
\end{aligned}
\]
in which some constant terms are ignored in the objective function. The first-order optimality condition of the above convex minimization problem can be written as
\[
\tilde{x}_i^k\in X_i,\quad \theta_i(x_i) - \theta_i(\tilde{x}_i^k) + (x_i - \tilde{x}_i^k)^T\Big\{-A_i^T\lambda^k + \beta A_i^T\big[A_i(\tilde{x}_i^k - x_i^k) + (\mathcal{A}\mathbf{x}^k + \mathcal{B}\mathbf{y}^k - b)\big] + \tau_1\beta A_i^TA_i(\tilde{x}_i^k - x_i^k)\Big\} \ge 0,\quad \forall\, x_i\in X_i. \qquad (4.7)
\]
Then, it follows from (4.3c) that
\[
\lambda^k = \tilde{\lambda}^k + \beta\big(\mathcal{A}\tilde{\mathbf{x}}^k + \mathcal{B}\mathbf{y}^k - b\big).
\]
Hence, based on (4.7), we have
\[
\tilde{x}_i^k\in X_i,\quad \theta_i(x_i) - \theta_i(\tilde{x}_i^k) + (x_i - \tilde{x}_i^k)^T\Big\{-A_i^T\big[\tilde{\lambda}^k + \beta(\mathcal{A}\tilde{\mathbf{x}}^k + \mathcal{B}\mathbf{y}^k - b)\big] + \beta A_i^T\big[A_i(\tilde{x}_i^k - x_i^k) + (\mathcal{A}\mathbf{x}^k + \mathcal{B}\mathbf{y}^k - b)\big] + \tau_1\beta A_i^TA_i(\tilde{x}_i^k - x_i^k)\Big\} \ge 0,\quad \forall\, x_i\in X_i.
\]
Consequently, we have
\[
\tilde{x}_i^k\in X_i,\quad \theta_i(x_i) - \theta_i(\tilde{x}_i^k) + (x_i - \tilde{x}_i^k)^T\Big\{-A_i^T\tilde{\lambda}^k + \beta A_i^T\big[A_i(\tilde{x}_i^k - x_i^k) - \mathcal{A}(\tilde{\mathbf{x}}^k - \mathbf{x}^k)\big] + \tau_1\beta A_i^TA_i(\tilde{x}_i^k - x_i^k)\Big\} \ge 0,\quad \forall\, x_i\in X_i,
\]
and it can be written as
\[
\tilde{x}_i^k\in X_i,\quad \theta_i(x_i) - \theta_i(\tilde{x}_i^k) + (x_i - \tilde{x}_i^k)^T\Big\{-A_i^T\tilde{\lambda}^k - \beta A_i^T\mathcal{A}(\tilde{\mathbf{x}}^k - \mathbf{x}^k) + (\tau_1+1)\beta A_i^TA_i(\tilde{x}_i^k - x_i^k)\Big\} \ge 0,\quad \forall\, x_i\in X_i.
\]
