A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non Separable Functions


HAL Id: hal-01497104

https://hal.archives-ouvertes.fr/hal-01497104v2

Submitted on 5 Feb 2019



Olivier Fercoq, Pascal Bianchi

To cite this version:

Olivier Fercoq, Pascal Bianchi. A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non Separable Functions. SIAM Journal on Optimization, Society for Industrial and Applied Mathematics, 2019, 29 (1), pp.100-134. ⟨hal-01497104v2⟩


A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions∗

Olivier Fercoq†, Pascal Bianchi‡

February 5, 2018

Abstract

This paper introduces a randomized coordinate descent version of the Vũ-Condat algorithm. By coordinate descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at each iteration, the other coordinates being kept at their previous values. Our method solves optimization problems that combine differentiable functions, constraints, and non-separable, non-differentiable regularizers.

We show that the sequences generated by our algorithm almost surely converge to a saddle point of the problem at hand, for a wider range of parameter values than previous methods. In particular, the condition on the step sizes depends on the coordinate-wise Lipschitz constants of the gradient of the differentiable function, which is a major feature allowing classical coordinate descent to perform so well when it is applicable. We then prove a sublinear rate of convergence in general and a linear rate of convergence if the objective enjoys strong convexity properties.

We illustrate the performance of the algorithm on a total-variation regularized least squares regression problem and on large scale support vector machine problems.

1 Introduction

1.1 Motivation

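We consider the optimization problem

$$\inf_{x \in \mathcal{X}} f(x) + g(x) + h(Mx) \qquad (1)$$

where $\mathcal{X}$ is a Euclidean space and $M : \mathcal{X} \to \mathcal{Y}$ is a linear operator into a second Euclidean space $\mathcal{Y}$; the functions $f : \mathcal{X} \to \mathbb{R}$, $g : \mathcal{X} \to (-\infty, +\infty]$ and $h : \mathcal{Y} \to (-\infty, +\infty]$ are assumed proper, closed and convex; the function $f$ is moreover assumed differentiable. We assume that $\mathcal{X}$ and $\mathcal{Y}$ are product spaces of the form $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ and $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_p$ for some integers $n, p$. For any $x \in \mathcal{X}$, we use the notation $x = (x^{(1)}, \dots, x^{(n)})$ to represent the (blocks of) coordinates of $x$ (similarly for $y = (y^{(1)}, \dots, y^{(p)})$ in $\mathcal{Y}$).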

Problem (1) has numerous applications e.g. in machine learning [8], image processing [9] or distributed optimization [7].

Under the standard qualification condition $0 \in \mathrm{ri}(M\,\mathrm{dom}\,g - \mathrm{dom}\,h)$ (where dom and ri stand for domain and relative interior, respectively), a point $x \in \mathcal{X}$ is a minimizer of (1) if and only if there exists $y \in \mathcal{Y}$ such that $(x, y)$ is a saddle point of the Lagrangian function

$$L(x, y) = f(x) + g(x) + \langle y, Mx \rangle - h^\star(y)$$

∗ A summary of the results of this paper has been published in the proceedings of the 2016 Conference on Decision and Control [5]. This work has been supported by the Orange/Telecom ParisTech think tank Phi-TAB.

† LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France (olivier.fercoq@telecom-paristech.fr).

‡ LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France (pascal.bianchi@telecom-paristech.fr).


where $\langle \cdot\,, \cdot \rangle$ is the inner product and $h^\star : y \mapsto \sup_{z \in \mathcal{Y}} \langle y, z \rangle - h(z)$ is the Fenchel-Legendre transform of $h$.

There is a rich literature on primal-dual algorithms searching for a saddle point of $L$ (see [45] and references therein). In the special case where $f = 0$, the alternating direction method of multipliers (ADMM) proposed by Glowinski and Marroco [25] and Gabay and Mercier [23], and the algorithm of Chambolle and Pock [12], are amongst the most celebrated ones. Based on an elegant idea also used in [27], Vũ [51] and Condat [16] independently proposed a primal-dual algorithm that also handles $\nabla f$ explicitly and requires one evaluation of the gradient of $f$ at each iteration. Here, $\nabla f$ is handled explicitly in the sense that the algorithm does not involve, for instance, the call of a proximity operator associated with $f$. A convergence rate analysis is provided in [13] (see also [45]). A related splitting method has been recently introduced by [17].

This paper introduces a coordinate descent (CD) version of the Vũ-Condat algorithm. By coordinate descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at each iteration, the other coordinates being kept at their previous values. Coordinate descent was historically used in the context of coordinate-wise minimization of a single function in a Gauss-Seidel sense [52, 4, 48]. Tseng et al. [32, 49, 50] and Nesterov [35] developed CD versions of gradient descent. In [35] as well as in this paper, the updated coordinates are randomly chosen at each iteration. The algorithm of [35] has at least two interesting features. Not only is it often easier to evaluate a single coordinate of the gradient vector rather than the whole vector, but the conditions under which the CD version of the algorithm is provably convergent are generally weaker than in the case of standard gradient descent. The key point is that the step size used when updating a given coordinate $i$ can be chosen inversely proportional to the coordinate-wise Lipschitz constant of $\nabla f$ along its $i$th coordinate, rather than the global Lipschitz constant of $\nabla f$ (as would be the case in standard gradient descent). Hence, coordinate descent allows the use of larger step sizes, which potentially results in better performance. The random CD gradient descent of [35] was later generalized by Richtárik and Takáč [38] to the minimization of a sum of two convex functions $f + g$ (that is, $h = 0$ in problem (1)). The algorithm of [38] is analyzed under the additional assumption that the function $g$ is separable in the sense that for each $x \in \mathcal{X}$, $g(x) = \sum_{i=1}^n g_i(x^{(i)})$ for some functions $g_i : \mathcal{X}_i \to (-\infty, +\infty]$. Accelerated and parallel versions of the algorithm were later developed by [41, 40, 20, 31], always assuming the separability of $g$.
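The gap between the two kinds of Lipschitz constants can be made concrete on a least squares function. The sketch below assumes $f(x) = \frac12\|Ax - b\|^2$ (a choice made here for illustration), for which the coordinate-wise constants are the squared column norms of $A$ and the global constant is the largest eigenvalue of $A^\top A$. The helper names are hypothetical, and the power iteration is only a rough estimator started from the all-ones direction.

```python
def coordinate_lipschitz(A):
    """beta_i = ||A e_i||^2, the squared norm of the i-th column of A."""
    n = len(A[0])
    return [sum(row[i] ** 2 for row in A) for i in range(n)]


def global_lipschitz(A, iters=200):
    """Largest eigenvalue of A^T A, estimated by power iteration."""
    n = len(A[0])
    v = [n ** -0.5] * n  # normalized all-ones starting direction
    lam = 0.0
    for _ in range(iters):
        Av = [sum(row[i] * v[i] for i in range(n)) for row in A]
        w = [sum(row[i] * r for row, r in zip(A, Av)) for i in range(n)]
        lam = sum(wi * wi for wi in w) ** 0.5
        if lam == 0.0:
            return 0.0
        v = [wi / lam for wi in w]
    return lam


# f(x) = 0.5 * (x1 + x2 + x3 - 1)^2, i.e. A = [1 1 1] and b = 1.
A = [[1.0, 1.0, 1.0]]
beta = coordinate_lipschitz(A)   # coordinate-wise constants
L = global_lipschitz(A)          # global constant
```

On this instance each coordinate-wise constant is 1 while the global constant is 3, so a coordinate-wise step of order $1/\beta_i$ is three times larger than a step of order $1/L$.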

In the literature, several papers seek to apply the principle of coordinate descent to primal-dual algorithms. In the case where $f = 0$, $h$ is separable and smooth and $g$ is strongly convex, Zhang and Xiao [53] introduce a stochastic CD primal-dual algorithm and analyze its convergence rate (see also [44] for related works). In 2013, Iutzeler et al. [29] proved that random coordinate descent can be successfully applied to fixed point iterations of firmly non-expansive (FNE) operators. According to [22], the ADMM can be written as a fixed point algorithm of an FNE operator, which led the authors of [29] to propose a coordinate descent version of ADMM with application to distributed optimization. The key idea behind the convergence proof of [29] is to establish the so-called stochastic Fejér monotonicity of the sequence of iterates, as noted by [15]. In a more general setting than [29], Combettes et al. [15] and Bianchi et al. [6] extend the proof to the so-called α-averaged operators, which include FNE operators as a special case. This generalization makes it possible to apply the coordinate descent principle to a broader class of primal-dual algorithms, no longer restricted to the ADMM or the Douglas-Rachford algorithm. For instance, Forward-Backward splitting is considered in [15] and particular cases of the Vũ-Condat algorithm are considered in [6, 37]. Nevertheless, the above approach has two major limitations.

First, in order to derive a converging coordinate descent version of a given deterministic algorithm, the latter must be expressible as a fixed point algorithm over some product Hilbert space of the form $H = H_1 \times \cdots \times H_q$ where the inner product in $H$ is the sum of the inner products in the $H_i$'s. Unfortunately, this condition does not hold in general for the Vũ-Condat method, because the inner product over $H$ involves the coupling linear operator $M$. A workaround was proposed in [6] but for a particular example only.

Second, and even more importantly, the approach of [29, 15, 6, 37] needs “small” step sizes. More precisely, the convergence conditions are identical to those of the original method without coordinate descent. These conditions involve the global Lipschitz constant of the gradient $\nabla f$ rather than its coordinate-wise Lipschitz constants. In practice, this means that the application of coordinate descent to primal-dual algorithms as suggested by [15] and [6] is restricted to the use of potentially small step sizes. One of the major benefits of coordinate descent is lost.

Some recent works also focused on designing primal-dual coordinate descent methods with a guaranteed convergence rate. In [24] and [11], an $O(1/k)$ rate is obtained for the ergodic mean of the sequences. The rates are given in terms of feasibility and optimality or Bregman distance. Those two papers require all the dual variables to be updated at each iteration, which may not be efficient if there are more than a few dual variables. In the present paper, we have much more flexibility in the variables we choose to update at each iteration, while retaining a provable convergence rate.

1.2 Contribution

• Our main contribution is to provide a CD primal-dual algorithm with a broad range of admissible step sizes. Our numerical experiments show that remarkable performance gains can be obtained when using larger step sizes.

• We identify two setups for which the structure of the problem is favorable to coordinate descent algorithms.

• We prove a sublinear rate of convergence in general and a linear rate of convergence if the objective enjoys strong convexity properties.

1.3 Organization of the paper

The algorithm is introduced in Section 2. At each iteration $k$, an index $i$ is randomly chosen w.r.t. the uniform distribution on $\{1, \dots, n\}$ where $n$ is, as we recall, the number of primal coordinates. The coordinate $x_k^{(i)}$ of the current primal iterate $x_k$ is updated, as well as a set of associated dual iterates. Under some assumptions involving the coordinate-wise Lipschitz constants of $\nabla f$, the primal-dual iterates converge to a saddle point of the Lagrangian. As a remarkable feature, our CD algorithm makes no assumption of separability of the functions $f$, $g$ or $h$. In the special case where $h = 0$ and $g$ is separable, the algorithm reduces to the CD proximal gradient algorithm of [38].

The convergence proof is provided in Section 3. It is worth noting that, under the stated assumption on the step size, the stochastic Fejér monotonicity of the sequence of iterates, which is the key idea in [29, 15, 6], does not hold (a counter-example is provided). Our proof relies on the introduction of an adequate Lyapunov function. In Section 4, we prove a sublinear rate of convergence in general and a linear rate of convergence if the objective enjoys strong convexity properties. In Section 5, the proposed algorithm is instantiated to the case of total-variation regularization and support vector machines. Numerical experiments performed on real MRI and text data establish the attractive behavior of the proposed algorithm and emphasize the importance of using primal-dual CD with large step sizes.

2 Coordinate Descent Primal-Dual Algorithm

2.1 Notation

We write $M = (M_{j,i} : j \in \{1, \dots, p\}, i \in \{1, \dots, n\})$ where $M_{j,i} : \mathcal{X}_i \to \mathcal{Y}_j$ are the block components of $M$. For each $j \in \{1, \dots, p\}$, we introduce the set

$$I(j) := \{ i \in \{1, \dots, n\} : M_{j,i} \neq 0 \}.$$

In other words, the $j$th component of the vector $Mx$ only depends on $x$ through the coordinates $x^{(i)}$ such that $i \in I(j)$. We denote by

$$m_j := \mathrm{card}(I(j))$$

the number of such coordinates. Without loss of generality, we assume that $m_j \neq 0$ for all $j$. We also denote

$$\pi_j := \frac{1}{\mathrm{card}(I(j))}.$$

For all $i \in \{1, \dots, n\}$, we define

$$J(i) := \{ j \in \{1, \dots, p\} : M_{j,i} \neq 0 \}.$$

Note that for every pair $(i, j)$, the statements $i \in I(j)$ and $j \in J(i)$ are equivalent.
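The index sets above are straightforward to extract from the sparsity pattern of $M$. The following sketch assumes scalar blocks, a dense list-of-lists representation of $M$ and 0-based indices; the function name `index_sets` is hypothetical.

```python
def index_sets(M):
    """Return (I, J, m, pi) for a p x n matrix M given as a list of rows.

    I[j] lists the columns i with M[j][i] != 0, J[i] lists the rows j with
    M[j][i] != 0, m[j] = card(I[j]) and pi[j] = 1 / m[j].
    Every row is assumed to contain at least one nonzero entry.
    """
    p, n = len(M), len(M[0])
    I = [[i for i in range(n) if M[j][i] != 0] for j in range(p)]
    J = [[j for j in range(p) if M[j][i] != 0] for i in range(n)]
    m = [len(I_j) for I_j in I]
    pi = [1.0 / m_j for m_j in m]
    return I, J, m, pi


# One-dimensional finite-difference operator on three points.
M = [[1, -1, 0],
     [0, 1, -1]]
I, J, m, pi = index_sets(M)
```

For this finite-difference operator every row touches two coordinates, so $m_j = 2$ and $\pi_j = 1/2$ for both rows.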

If $\ell$ is an integer, $\gamma = (\gamma_1, \dots, \gamma_\ell)$ is a collection of positive real numbers and $A = A_1 \times \cdots \times A_\ell$ is a product of Euclidean spaces, we introduce the weighted norm $\|\cdot\|_\gamma$ on $A$ given by

$$\|u\|_\gamma^2 = \sum_{i=1}^{\ell} \gamma_i \|u^{(i)}\|_{A_i}^2$$

for every $u = (u^{(1)}, \dots, u^{(\ell)})$, where $\|\cdot\|_{A_i}$ stands for the norm on $A_i$. If $F : A \to (-\infty, +\infty]$ denotes a convex proper lower-semicontinuous function, we introduce the proximity operator $\mathrm{prox}_{\gamma,F} : A \to A$ defined for any $u \in A$ by

$$\mathrm{prox}_{\gamma,F}(u) := \arg\min_{w \in A} F(w) + \frac12 \|w - u\|_{\gamma^{-1}}^2$$

where we use the notation $\gamma^{-1} = (\gamma_1^{-1}, \dots, \gamma_\ell^{-1})$. We denote by $\mathrm{prox}^{(i)}_{\gamma,F} : A \to A_i$ the $i$th coordinate mapping of $\mathrm{prox}_{\gamma,F}$, that is, $\mathrm{prox}_{\gamma,F}(u) = (\mathrm{prox}^{(1)}_{\gamma,F}(u), \dots, \mathrm{prox}^{(\ell)}_{\gamma,F}(u))$ for any $u \in A$. The notation $D_A(\gamma)$ (or simply $D(\gamma)$ when no ambiguity occurs) stands for the diagonal operator on $A \to A$ given by $D_A(\gamma)(u) = (\gamma_1 u^{(1)}, \dots, \gamma_\ell u^{(\ell)})$ for every $u = (u^{(1)}, \dots, u^{(\ell)})$.

Finally, the adjoint of a linear operator $B$ is denoted $B^\star$. The spectral radius of a square matrix $A$ is denoted by $\rho(A)$. The number of nonzero elements of a matrix $A$ is denoted by $\mathrm{nnz}(A)$.
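As an illustration of the weighted proximity operator, the sketch below takes $F = \lambda\|\cdot\|_1$ with scalar blocks (an example chosen here, not one singled out in this section). In that case $\mathrm{prox}_{\gamma,F}$ reduces to a coordinate-wise soft-thresholding whose threshold $\lambda\gamma_i$ depends on the coordinate. The names `prox_l1` and `prox_objective` are hypothetical.

```python
import math


def prox_l1(gamma, u, lam=1.0):
    """prox_{gamma, lam*||.||_1}(u): soft-thresholding with threshold lam * gamma_i."""
    return [math.copysign(max(abs(ui) - lam * gi, 0.0), ui)
            for gi, ui in zip(gamma, u)]


def prox_objective(w, gamma, u, lam=1.0):
    """Value of lam * ||w||_1 + 0.5 * ||w - u||^2 in the gamma^{-1}-weighted norm."""
    return sum(lam * abs(wi) + 0.5 * (wi - ui) ** 2 / gi
               for wi, gi, ui in zip(w, gamma, u))


gamma = [0.5, 2.0]
u = [1.0, -1.0]
w = prox_l1(gamma, u)  # first coordinate shrinks by 0.5, second is set to zero
```

A larger weight $\gamma_i$ penalizes the distance to $u^{(i)}$ less, which is why the second coordinate is thresholded more aggressively than the first.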

2.2 Main algorithm

Consider Problem (1). Let $\sigma = (\sigma_1, \dots, \sigma_p)$ and $\tau = (\tau_1, \dots, \tau_n)$ be two tuples of positive real numbers. Consider an independent and identically distributed sequence $(i_k : k \in \mathbb{N}^*)$ with uniform distribution on $\{1, \dots, n\}$¹. The proposed primal-dual CD algorithm consists in updating two sequences $x_k \in \mathcal{X}$, $y_k \in \mathcal{Y}$. It is provided in Algorithm 1 below.

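Algorithm 1 Coordinate-descent primal-dual algorithm

Initialization: Choose $x_0 \in \mathcal{X}$, $y_0 \in \mathcal{Y}$.

Iteration $k$: Define:

$$\bar y_{k+1} = \mathrm{prox}_{\sigma,h^\star}\big(y_k + D(\sigma) M x_k\big)$$

$$\bar x_{k+1} = \mathrm{prox}_{\tau,g}\Big(x_k - D(\tau)\big(\nabla f(x_k) + 2M^\star \bar y_{k+1} - M^\star y_k\big)\Big).$$

For $i = i_{k+1}$ and for each $j \in J(i_{k+1})$, update:

$$x_{k+1}^{(i)} = \bar x_{k+1}^{(i)}$$

$$y_{k+1}^{(j)} = y_k^{(j)} + \pi_j\big(\bar y_{k+1}^{(j)} - y_k^{(j)}\big).$$

Otherwise, set $x_{k+1}^{(i')} = x_k^{(i')}$ and $y_{k+1}^{(j')} = y_k^{(j')}$.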
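A direct, unoptimized transcription of Algorithm 1 for scalar blocks may help fix ideas. The sketch below is only illustrative: it recomputes the full vectors $\bar x_{k+1}, \bar y_{k+1}$ at every iteration, which an efficient implementation would avoid, and the function name `algorithm1`, the callables passed to it and the toy problem (a two-point total-variation denoising instance, $f(x) = \frac12\|x - c\|^2$, $g = 0$, $h = \lambda\|\cdot\|_1$) are assumptions made here.

```python
import random


def algorithm1(grad_f, prox_g, prox_hconj, M, tau, sigma, x0, y0, iters, seed=0):
    """Naive sketch of Algorithm 1 with scalar blocks and 0-based indices."""
    rng = random.Random(seed)
    p, n = len(M), len(M[0])
    J = [[j for j in range(p) if M[j][i] != 0] for i in range(n)]
    pi = [1.0 / sum(1 for i in range(n) if M[j][i] != 0) for j in range(p)]
    x, y = list(x0), list(y0)
    for _ in range(iters):
        Mx = [sum(M[j][i] * x[i] for i in range(n)) for j in range(p)]
        y_bar = prox_hconj(sigma, [y[j] + sigma[j] * Mx[j] for j in range(p)])
        v = [2.0 * y_bar[j] - y[j] for j in range(p)]
        Mtv = [sum(M[j][i] * v[j] for j in range(p)) for i in range(n)]
        g = grad_f(x)
        x_bar = prox_g(tau, [x[i] - tau[i] * (g[i] + Mtv[i]) for i in range(n)])
        i = rng.randrange(n)                  # uniform choice of the primal block
        x[i] = x_bar[i]                       # only coordinate i of x moves
        for j in J[i]:                        # only the dual blocks coupled with i move
            y[j] += pi[j] * (y_bar[j] - y[j])
    return x, y


# Toy instance: minimize 0.5*||x - c||^2 + lam*|x1 - x2| (illustrative choice).
lam, c, M = 0.5, [0.0, 2.0], [[1.0, -1.0]]
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]              # beta_i = 1
prox_g = lambda tau, u: list(u)                                    # g = 0
prox_hconj = lambda sigma, v: [min(max(vj, -lam), lam) for vj in v]  # projection on [-lam, lam]
sigma = [1.0]
# Here m_1 = 2 and pi_1 = 1/2, so (2 - pi_1) * m_1 = 3 in Assumption 2.1(e).
tau = [0.9 / (1.0 + 3.0 * sigma[0] * M[0][i] ** 2) for i in range(2)]
x, y = algorithm1(grad_f, prox_g, prox_hconj, M, tau, sigma, [0.0, 0.0], [0.0], iters=5000)
```

On this instance the optimality conditions give the saddle point $x_* = (0.5, 1.5)$, $y_* = -0.5$, which the iterates should approach.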

For every $i \in \{1, \dots, n\}$, we denote by $U_i : \mathcal{X}_i \to \mathcal{X}$ the linear operator such that all coordinates of $U_i(u)$ are zero except the $i$th coordinate, which coincides with $u$: $U_i(u) = (0, \cdots, 0, u, 0, \cdots, 0)$. Our convergence result holds under the following assumptions.

¹ The results of this paper easily extend to the selection of several primal coordinates at each iteration with a uniform

Assumption 2.1.

a) The functions $f$, $g$, $h$ are closed proper and convex.

b) The function $f$ is differentiable on $\mathcal{X}$.

c) For every $i \in \{1, \dots, n\}$, there exists $\beta_i \geq 0$ such that for any $x \in \mathcal{X}$, any $u \in \mathcal{X}_i$,

$$f(x + U_i u) \leq f(x) + \langle \nabla f(x), U_i u \rangle + \frac{\beta_i}{2} \|u\|_{\mathcal{X}_i}^2.$$

d) The random sequence $(i_k)_{k \in \mathbb{N}^*}$ is independent, uniformly distributed on $\{1, \dots, n\}$.

e) The step sizes $\tau = (\tau_1, \dots, \tau_n)$ and $\sigma = (\sigma_1, \dots, \sigma_p)$ satisfy for all $i \in \{1, \dots, n\}$,

$$\tau_i < \frac{1}{\beta_i + \rho\Big(\sum_{j \in J(i)} (2 - \pi_j)\, m_j \sigma_j M_{j,i}^\star M_{j,i}\Big)}.$$
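With scalar blocks, the spectral radius in Assumption 2.1(e) is just the scalar $\sum_{j \in J(i)} (2 - \pi_j) m_j \sigma_j M_{j,i}^2$, so the bound on $\tau_i$ can be evaluated in a few lines. This is a minimal sketch under that scalar-block assumption; the function name `max_step_sizes` is hypothetical.

```python
def max_step_sizes(M, beta, sigma):
    """Upper bounds on tau_i from Assumption 2.1(e), assuming scalar blocks.

    M is a p x n list of rows, beta the coordinate-wise Lipschitz constants
    and sigma the dual step sizes. Each denominator is assumed to be nonzero.
    """
    p, n = len(M), len(M[0])
    m = [sum(1 for i in range(n) if M[j][i] != 0) for j in range(p)]
    bounds = []
    for i in range(n):
        coupling = sum((2.0 - 1.0 / m[j]) * m[j] * sigma[j] * M[j][i] ** 2
                       for j in range(p) if M[j][i] != 0)
        bounds.append(1.0 / (beta[i] + coupling))
    return bounds


M = [[1, -1, 0],
     [0, 1, -1]]
bounds = max_step_sizes(M, beta=[1.0, 1.0, 1.0], sigma=[1.0, 1.0])
```

Any $\tau_i$ strictly below the returned value satisfies the assumption; the middle coordinate, which is coupled with both dual variables, gets the smallest bound.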

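We denote by $\mathcal{S}$ the set of saddle points of the Lagrangian function $L$. In other words, a couple $(x_*, y_*) \in \mathcal{X} \times \mathcal{Y}$ lies in $\mathcal{S}$ if and only if it satisfies the following inclusions

$$0 \in \nabla f(x_*) + \partial g(x_*) + M^\star y_* \qquad (2)$$

$$0 \in -M x_* + \partial h^\star(y_*). \qquad (3)$$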

We shall also refer to elements of S as primal-dual solutions.

Theorem 1. Let Assumption 2.1 hold true and suppose that $\mathcal{S} \neq \emptyset$. Let $(x_k, y_k)$ be a sequence generated by Algorithm 1. Almost surely, there exists $(x_*, y_*) \in \mathcal{S}$ such that

$$\lim_{k \to \infty} x_k = x_*, \qquad \lim_{k \to \infty} y_k = y_*.$$

2.3 Efficient implementation using problem structure

In Algorithm 1, it is worth noting that the quantities $(\bar x_{k+1}, \bar y_{k+1})$ do not need to be explicitly calculated. At iteration $k$, only the coordinates

$$\bar x_{k+1}^{(i_{k+1})} \quad \text{and} \quad \bar y_{k+1}^{(j)}, \ \forall j \in J(i_{k+1})$$

are needed to perform the update. From a computational point of view, it is often the case that the evaluation of the above coordinates is less demanding than the computation of the whole vectors $\bar x_{k+1}$, $\bar y_{k+1}$.

Two situations have been reported in the literature:

• If $g$ is separable, one only needs to compute the quantities $\nabla_{i_{k+1}} f(x_k)$, $(2M^\star \bar y_{k+1} - M^\star y_k)^{(i_{k+1})}$ and $\mathrm{prox}_{\tau_{i_{k+1}}, g_{i_{k+1}}}$ to perform the $k$th iteration. A classical example of such smart residual update [36] can be found in the proximal coordinate descent gradient algorithm (case $g$ separable and $h = 0$) [39]. More generally, if $g$ (resp. $h^\star$) is block-separable, we can use this structure in the algorithm, even if this block structure does not match $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ (resp. $\mathcal{Y}_1 \times \cdots \times \mathcal{Y}_p$).

We used this idea in Section 5.1 to deal efficiently with the proximal operator of the $\ell_{2,1}$ norm.

• If $g$ is the indicator of the consensus constraint $\{x_1 = \cdots = x_n\}$, $f$ is separable and $h = 0$, we recover MISO [33]. In that case, we can store $\nabla f(x_k)$ and update its average. Thanks to the separability of $f$, only one coordinate of $\nabla f(x_k)$ needs to be updated at each iteration.

We used similar ideas in Section 5.2 to deal efficiently with the projection onto the subspace orthogonal to a vector.

To illustrate the importance of these implementation tricks, Table 1 compares the number of operations needed to compute the updates of the standard Vũ-Condat method with that of the proposed algorithm.

| Problem / Dimension of data | Vũ-Condat | Our algorithm |
|---|---|---|
| Total Variation + $\ell_1$ regularization ($A \in \mathbb{R}^{m \times n}$: dense; $M \in \mathbb{R}^{3n \times n}$: $\mathrm{nnz}(M) = 6n$) | $O(mn + 6n)$ | $O(m + 12)$ |
| Support Vector Machines ($A \in \mathbb{R}^{m \times n}$: sparse) | $O(\mathrm{nnz}(A) + n)$ | $O(\mathrm{nnz}(Ae_i) + 1)$ |

Table 1: Number of operations per iteration for the proposed algorithm and for the standard Vũ-Condat algorithm. The use cases are the ones described in the numerical section. The numbers 6 and 12 highlight the (mild) overhead of duplication in the Total Variation + $\ell_1$ regularized least squares problem.
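The residual update mentioned in the first bullet can be sketched on the simplest possible case. The code below assumes $f(x) = \frac12\|Ax - b\|^2$ with $g = h = 0$ and the step size $\tau_i = 1/\beta_i$ of [38]; it is not the total-variation or SVM implementation of Section 5. The residual $r = Ax - b$ is stored, so that each iteration touches a single column of $A$. The function name `cd_least_squares` is hypothetical.

```python
import random


def cd_least_squares(A, b, iters, seed=0):
    """Randomized coordinate descent on 0.5*||Ax - b||^2 with a stored residual."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    cols = [[A[r][i] for r in range(m)] for i in range(n)]
    beta = [sum(a * a for a in col) for col in cols]   # coordinate-wise Lipschitz constants
    x = [0.0] * n
    residual = [-bi for bi in b]                       # r = A x - b at x = 0
    for _ in range(iters):
        i = rng.randrange(n)
        if beta[i] == 0.0:
            continue                                   # f does not depend on x_i
        grad_i = sum(a * r for a, r in zip(cols[i], residual))   # i-th partial derivative
        delta = -grad_i / beta[i]
        x[i] += delta
        # Update the residual using column i only, instead of recomputing A x - b.
        residual = [r + delta * a for r, a in zip(residual, cols[i])]
    return x, residual


x, residual = cd_least_squares([[2.0, 0.0], [0.0, 1.0]], [2.0, 3.0], iters=50)
```

With a dense $A \in \mathbb{R}^{m \times n}$, one iteration costs $O(m)$ operations, whereas recomputing the full gradient would cost $O(mn)$.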

2.4 Primal dual coordinate descent with duplicated dual variables

In this section, we present a generalization of Algorithm 1 that allows for more flexibility in the update rule for dual variables. It will also be a convenient formulation for the analysis.

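Recall that $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_p$. For every $j \in \{1, \dots, p\}$, we use the notation $\boldsymbol{\mathcal{Y}}_j := \mathcal{Y}_j^{I(j)}$, which means that $\boldsymbol{\mathcal{Y}}_j$ consists of $|I(j)|$ copies of $\mathcal{Y}_j$ indexed by $I(j)$. An arbitrary element $u$ in $\boldsymbol{\mathcal{Y}}_j$ will be represented by $u = (u(i) : i \in I(j))$. We define $\boldsymbol{\mathcal{Y}} := \boldsymbol{\mathcal{Y}}_1 \times \cdots \times \boldsymbol{\mathcal{Y}}_p$. An arbitrary element $\boldsymbol{y}$ in $\boldsymbol{\mathcal{Y}}$ will be represented as $\boldsymbol{y} = (\boldsymbol{y}^{(1)}, \dots, \boldsymbol{y}^{(p)})$ and we shall call such an element a duplicated dual variable. This notation is recalled in Table 2 below.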

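Table 2: Standing notation.

| Space | Element | Dimension (if blocks of size 1) |
|---|---|---|
| $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ | $x = (x^{(i)} : i \in \{1, \dots, n\})$ | $n$ |
| $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_p$ | $y = (y^{(j)} : j \in \{1, \dots, p\})$ | $p$ |
| $\boldsymbol{\mathcal{Y}}_j = \mathcal{Y}_j^{I(j)}$ | $u = (u(i) : i \in I(j))$ | $\mathrm{card}(I(j))$ |
| $\boldsymbol{\mathcal{Y}} = \boldsymbol{\mathcal{Y}}_1 \times \cdots \times \boldsymbol{\mathcal{Y}}_p$ | $\boldsymbol{y} = (\boldsymbol{y}^{(j)} : j \in \{1, \dots, p\})$ where $\boldsymbol{y}^{(j)} = (\boldsymbol{y}^{(j)}(i) : i \in I(j))$ $\forall j$ | $\mathrm{nnz}(M)$ |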

In our algorithm, we will stack a collection of primal variables $(x_k^{(i)} : i \in \{1, \dots, n\})$ at iteration $k$, and a set of (duplicated) dual variables $(\boldsymbol{y}_k^{(j)}(i) : i \in \{1, \dots, n\}, j \in J(i))$. In a coordinate descent spirit, however, we update only a subset of these variables at every iteration $k$. First, we choose uniformly at random a block of primal coordinates $i_{k+1}$: eventually, only the primal variable $x_k^{(i_{k+1})}$ will be updated. As far as the dual variables are concerned, a natural choice is to update the dual variables $(\boldsymbol{y}_k^{(j)}(i_{k+1}) : j \in J(i_{k+1}))$ associated with the primal variable $x_k^{(i_{k+1})}$. This case will be investigated in Section 2.5.1. For reasons that will be made clear later on, it may be interesting in some situations to update a larger set of duplicated dual variables at iteration $k$, namely $(\boldsymbol{y}_k^{(j)}(l) : (l, j) \in \mathcal{J}(i_{k+1}))$ where for every $i \in \{1, \dots, n\}$, $\mathcal{J}(i)$ is a subset of $\{1, \dots, n\} \times \{1, \dots, p\}$ chosen in such a way that

$$\{i\} \times J(i) \subset \mathcal{J}(i) \subset \{(l, j) : j \in J(l)\}. \qquad (4)$$

We shall also define the probability that $j \in J(i_{k+1})$ knowing that $(l, j) \in \mathcal{J}(i_{k+1})$ as

$$\pi_j(i) = \frac{1}{\mathrm{card}(\{l : (i, j) \in \mathcal{J}(l)\})}. \qquad (5)$$

Note that $0 < \pi_j(i) \leq 1$. In the special case where $\mathcal{J}(i) = \{i\} \times J(i)$, note also that $\pi_j(i) = 1$ for every $j \in J(i)$.

As for Algorithm 1, we consider an independent and identically distributed sequence $(i_k : k \in \mathbb{N}^*)$ with uniform distribution on $\{1, \dots, n\}$. The algorithm consists in updating four sequences $x_k \in \mathcal{X}$, $w_k \in \mathcal{X}$, $z_k \in \mathcal{Y}$ and $\boldsymbol{y}_k \in \boldsymbol{\mathcal{Y}}$, as specified in Algorithm 2 below.
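The quantity $\pi_j(i)$ in (5) simply counts how many of the sets $\mathcal{J}(l)$ contain the pair $(i, j)$. The sketch below evaluates it for the two dual samplings considered later in Section 2.5; the second one is written as the set of pairs $(l, j)$ with $j \in J(i)$ and $l \in I(j)$, following the description of which copies are updated in Section 2.5.2. Scalar blocks, 0-based indices and the name `dual_sampling_probabilities` are assumptions of this illustration.

```python
def dual_sampling_probabilities(calJ):
    """pi[(i, j)] = 1 / card({l : (i, j) in calJ[l]}) as in Eq. (5)."""
    count = {}
    for pairs in calJ:
        for pair in pairs:
            count[pair] = count.get(pair, 0) + 1
    return {pair: 1.0 / c for pair, c in count.items()}


M = [[1, -1, 0],
     [0, 1, -1]]
p, n = len(M), len(M[0])
I = [[i for i in range(n) if M[j][i] != 0] for j in range(p)]
J = [[j for j in range(p) if M[j][i] != 0] for i in range(n)]

# Smallest admissible choice: only the copies attached to the selected primal block.
calJ_min = [{(i, j) for j in J[i]} for i in range(n)]
# Largest choice used in this paper: every copy of every dual block coupled with i.
calJ_full = [{(l, j) for j in J[i] for l in I[j]} for i in range(n)]

pi_min = dual_sampling_probabilities(calJ_min)
pi_full = dual_sampling_probabilities(calJ_full)
```

For the smallest choice every $\pi_j(i)$ equals 1, while for the largest one $\pi_j(l) = 1/m_j$, which is $1/2$ on this finite-difference example.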

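Algorithm 2 Coordinate-descent primal-dual algorithm with duplicated variables

Initialization: Choose $x_0 \in \mathcal{X}$, $\boldsymbol{y}_0 \in \boldsymbol{\mathcal{Y}}$.

For all $i \in \{1, \dots, n\}$, set $w_0^{(i)} = \sum_{j \in J(i)} M_{j,i}^\star\, \boldsymbol{y}_0^{(j)}(i)$.

For all $j \in \{1, \dots, p\}$, set $z_0^{(j)} = \frac{1}{m_j} \sum_{i \in I(j)} \boldsymbol{y}_0^{(j)}(i)$.

Iteration $k$: Define:

$$\bar y_{k+1} = \mathrm{prox}_{\sigma,h^\star}\big(z_k + D(\sigma) M x_k\big)$$

$$\bar x_{k+1} = \mathrm{prox}_{\tau,g}\Big(x_k - D(\tau)\big(\nabla f(x_k) + 2M^\star \bar y_{k+1} - w_k\big)\Big).$$

For $i = i_{k+1}$ and for each $(l, j) \in \mathcal{J}(i_{k+1})$, update:

$$x_{k+1}^{(i)} = \bar x_{k+1}^{(i)}$$

$$\boldsymbol{y}_{k+1}^{(j)}(l) = \boldsymbol{y}_k^{(j)}(l) + \pi_j(l)\big(\bar y_{k+1}^{(j)} - \boldsymbol{y}_k^{(j)}(l)\big)$$

$$w_{k+1}^{(l)} = w_k^{(l)} + \sum_{j : (l,j) \in \mathcal{J}(i)} M_{j,l}^\star \big(\boldsymbol{y}_{k+1}^{(j)}(l) - \boldsymbol{y}_k^{(j)}(l)\big)$$

$$z_{k+1}^{(j)} = z_k^{(j)} + \frac{1}{m_j} \sum_{l : (l,j) \in \mathcal{J}(i)} \big(\boldsymbol{y}_{k+1}^{(j)}(l) - \boldsymbol{y}_k^{(j)}(l)\big).$$

Otherwise, set $x_{k+1}^{(i')} = x_k^{(i')}$, $w_{k+1}^{(l')} = w_k^{(l')}$, $z_{k+1}^{(j')} = z_k^{(j')}$ and $\boldsymbol{y}_{k+1}^{(j')}(l') = \boldsymbol{y}_k^{(j')}(l')$.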

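Theorem 2. Let Assumption 2.1 hold true and

$$\tau_i < \frac{1}{\beta_i + \rho\Big(\sum_{j \in J(i)} (2 - \pi_j(i))\, m_j \sigma_j M_{j,i}^\star M_{j,i}\Big)}. \qquad (6)$$

Suppose that Eq. (4) holds and that $\mathcal{S} \neq \emptyset$. Let $(x_k, \boldsymbol{y}_k)$ be a sequence generated by Algorithm 2. Almost surely, there exists $(x_*, y_*) \in \mathcal{S}$ such that

$$\lim_{k \to \infty} x_k = x_*, \qquad \lim_{k \to \infty} \boldsymbol{y}_k^{(j)}(i) = y_*^{(j)} \quad (\forall j \in \{1, \dots, p\},\ \forall i \in I(j)).$$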

2.5 Special Cases

2.5.1 The case $\mathcal{J}(i) = \{i\} \times J(i)$ for all $i$

According to (4), the smallest possible choice for $\mathcal{J}(i)$ is $\mathcal{J}(i) = \{i\} \times J(i)$. In that case, $\pi_j(i) = 1$ for all $j \in J(i)$ and the update of the dual variable simplifies to:

$$\forall j \in J(i_{k+1}), \quad \boldsymbol{y}_{k+1}^{(j)}(i_{k+1}) = \bar y_{k+1}^{(j)}.$$

This choice of dual sampling also implies that the primal and dual variables are grouped into $n$ disjoint primal-dual blocks of the type $(x^{(i)}, (\boldsymbol{y}_{k+1}^{(j)}(i))_{j \in J(i)})$.

2.5.2 The case $\mathcal{J}(i) = \cup_{j \in J(i)} I(j) \times J(i)$ for all $i$

With this update scheme for dual variables, given $i_{k+1}$, we update $\boldsymbol{y}_{k+1}^{(j)}(l)$ for all $j \in J(i_{k+1})$ and all $l \in I(j)$. We have $\pi_j(l) = \frac{1}{|I(j)|} = \frac{1}{m_j}$ for all $l \in I(j)$. The advantage of this update scheme is that, provided there exists $y'_0$ such that $\boldsymbol{y}_0^{(j)}(l) = y_0'^{(j)}$ for all $l \in I(j)$, we have for all $l \in I(j)$ and all $k \geq 0$,

$$\boldsymbol{y}_{k+1}^{(j)}(l) = y_{k+1}'^{(j)} = \frac{1}{m_j}\, \bar y_{k+1}^{(j)} + \Big(1 - \frac{1}{m_j}\Big) y_k'^{(j)}.$$

Hence, choosing $\mathcal{J}(i) = \cup_{j \in J(i)} I(j) \times J(i)$ allows us to undo the duplication of dual variables and reduce the size of the vector of dual variables from the number of nonzero elements in $M$, $\mathrm{nnz}(M)$, to its number of rows $p$.

This shows the following equivalence result.

Proposition 1. Algorithm 1 with initial point $y'_0$ is equivalent to Algorithm 2 with the choice of dual sampling $\mathcal{J}(i) = \cup_{j \in J(i)} I(j) \times J(i)$, $\forall i \in \{1, \dots, n\}$ and initial point $\boldsymbol{y}_0^{(j)}(l) = y_0'^{(j)}$, $\forall j \in \{1, \dots, p\}$, $\forall l \in I(j)$.

So, a byproduct of the proof of Theorem 2 will be a proof for Theorem 1.
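The de-duplication argument can be checked numerically. The sketch below is a naive scalar-block transcription of Algorithm 2, run with the dual sampling of Section 2.5.2 (coded as the pairs $(l, j)$ with $j \in J(i)$, $l \in I(j)$) and an initial point whose copies coincide. One can then verify that all copies $\boldsymbol{y}^{(j)}(l)$ remain equal to $z^{(j)}$ and that $w$ stays equal to $M^\star$ applied to the common value. The function name `algorithm2` and the toy problem ($f(x) = \frac12\|x - c\|^2$, $g = 0$, $h = \lambda\|\cdot\|_1$) are assumptions of this illustration.

```python
import random


def algorithm2(grad_f, prox_g, prox_hconj, M, tau, sigma, x0, y_dup0, calJ, iters, seed=0):
    """Naive sketch of Algorithm 2 (scalar blocks); y[(l, j)] stands for y^{(j)}(l)."""
    rng = random.Random(seed)
    p, n = len(M), len(M[0])
    I = [[i for i in range(n) if M[j][i] != 0] for j in range(p)]
    m = [len(I_j) for I_j in I]
    count = {}
    for pairs in calJ:
        for pair in pairs:
            count[pair] = count.get(pair, 0) + 1
    pi = {pair: 1.0 / c for pair, c in count.items()}       # Eq. (5)
    x, y = list(x0), dict(y_dup0)
    w = [sum(M[j][i] * y[(i, j)] for j in range(p) if M[j][i] != 0) for i in range(n)]
    z = [sum(y[(i, j)] for i in I[j]) / m[j] for j in range(p)]
    for _ in range(iters):
        Mx = [sum(M[j][i] * x[i] for i in range(n)) for j in range(p)]
        y_bar = prox_hconj(sigma, [z[j] + sigma[j] * Mx[j] for j in range(p)])
        Mty = [sum(M[j][i] * y_bar[j] for j in range(p)) for i in range(n)]
        g = grad_f(x)
        x_bar = prox_g(tau, [x[i] - tau[i] * (g[i] + 2.0 * Mty[i] - w[i]) for i in range(n)])
        i = rng.randrange(n)
        x[i] = x_bar[i]
        for (l, j) in calJ[i]:
            delta = pi[(l, j)] * (y_bar[j] - y[(l, j)])
            y[(l, j)] += delta
            w[l] += M[j][l] * delta      # keeps w^{(l)} = sum_j M_{j,l} y^{(j)}(l)
            z[j] += delta / m[j]         # keeps z^{(j)} = average of the copies
    return x, y, w, z


lam, c = 0.5, [0.0, 1.0, 3.0]
M = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
p, n = 2, 3
I = [[0, 1], [1, 2]]
J = [[0], [0, 1], [1]]
calJ = [[(l, j) for j in J[i] for l in I[j]] for i in range(n)]
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]
prox_g = lambda tau, u: list(u)
prox_hconj = lambda sigma, v: [min(max(vj, -lam), lam) for vj in v]
sigma = [1.0, 1.0]
tau = [0.9 / 4.0, 0.9 / 7.0, 0.9 / 4.0]   # 0.9 times the bound (6) with pi_j(l) = 1/2
y0 = {(l, j): 0.0 for j in range(p) for l in I[j]}
x, y, w, z = algorithm2(grad_f, prox_g, prox_hconj, M, tau, sigma, [0.0] * n, y0, calJ, iters=200)
```

Since the copies never separate, an implementation may store a single vector of size $p$ instead of $\mathrm{nnz}(M)$ duplicated variables, which is exactly what Algorithm 1 does.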

2.5.3 The Case $m_1 = \cdots = m_p = 1$

We consider the special case $m_1 = \cdots = m_p = 1$. In other words, the linear operator $M$ has a single nonzero component $M_{j,i}$ per row $j \in \{1, \dots, p\}$. This happens for instance in the context of distributed optimization [6]. This case will also be extensively used in the proofs.

In this scenario, the notation can be drastically simplified. Indeed, for every $j \in \{1, \dots, p\}$, $I(j)$ is a singleton. The corresponding set of duplicated dual variables $(\boldsymbol{y}_k^{(j)}(i) : i \in I(j))$ is reduced to a single variable $\boldsymbol{y}_k^{(j)}(I(j))$, which we shall simply denote as $y_k^{(j)}$. According to (4), $\mathcal{J}(i)$ is a subset of $\{(l, j) : l \in I(j)\}$, which simply coincides with the set $\{(I(j), j) : j \in \{1, \dots, p\}\}$. Therefore, the set $\mathcal{J}(i)$ is uniquely determined by its projection onto the second set of indices. In other words, the selection of $\mathcal{J}(i)$ for a given $i$ is equivalent to the selection of a subset of $\{1, \dots, p\}$, which we abusively denote by $\mathcal{J}(i)$ in this paragraph.

Then, Algorithm 2 simplifies to Algorithm 3 below. Note that Algorithm 3 has a range of applicability which is different from that of Algorithm 1. We make an additional assumption on $M$ but we have more freedom in the dual sampling $\mathcal{J}$.

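Algorithm 3 Coordinate-descent primal-dual algorithm - Case $m_1 = \cdots = m_p = 1$.

Initialization: Choose $x_0 \in \mathcal{X}$, $y_0 \in \mathcal{Y}$.

Iteration $k$: Define:

$$\bar y_{k+1} = \mathrm{prox}_{\sigma,h^\star}\big(y_k + D(\sigma) M x_k\big)$$

$$\bar x_{k+1} = \mathrm{prox}_{\tau,g}\Big(x_k - D(\tau)\big(\nabla f(x_k) + M^\star(2\bar y_{k+1} - y_k)\big)\Big).$$

For $i = i_{k+1}$ and for each $j \in \mathcal{J}(i_{k+1})$, update:

$$x_{k+1}^{(i)} = \bar x_{k+1}^{(i)}$$

$$y_{k+1}^{(j)} = y_k^{(j)} + \pi_j\big(\bar y_{k+1}^{(j)} - y_k^{(j)}\big).$$

Otherwise, set $x_{k+1}^{(i')} = x_k^{(i')}$, $y_{k+1}^{(j')} = y_k^{(j')}$.

2.5.4 The Case $h = 0$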

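Instantiating Algorithm 2 in the special case $h = 0$, it boils down to the following CD forward-backward algorithm:

$$x_{k+1}^{(i)} = \begin{cases} \mathrm{prox}^{(i)}_{\tau,g}\big(x_k - D(\tau) \nabla f(x_k)\big), & \text{if } i = i_{k+1}, \\ x_k^{(i)}, & \text{otherwise.} \end{cases} \qquad (7)$$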

As a consequence, Algorithm 2 recovers the CD proximal gradient algorithm of [38], with the notable difference that we do not assume the separability of $g$. On the other hand, Assumption 2.1(e) becomes $\tau_i < 1/\beta_i$ whereas in the separable case, [38] assumes $\tau_i = 1/\beta_i$. This remark leads us to conjecture that, even though Assumption 2.1(e) generally allows for the use of larger step sizes than the ones suggested by the approach of [15, 6], one might be able to use even larger step sizes than the ones allowed by Theorem 2.

Note that a similar CD forward-backward algorithm can be found in [15] with no need to require the separability of $g$. However, the algorithm of [15] assumes that the step size $\tau_i$ (there assumed to be independent of $i$) is less than $2/\beta$ where $\beta$ is the global Lipschitz constant of $\nabla f$. As discussed in the introduction, an attractive feature of our algorithm is the fact that our convergence condition $\tau_i < 1/\beta_i$ only involves the coordinate-wise Lipschitz constants of $\nabla f$.

2.6 Failure of Stochastic Fejér Monotonicity

As discussed in the introduction, an existing approach to prove convergence of CD algorithms in a general setting (that is, not restricted to $h = 0$ and separable $g$) is to establish the stochastic Fejér monotonicity of the iterates. The idea was used in [29] and extended by [15] and [6] to a more general setting. Unfortunately, this approach requires selecting a “small” step size, as noticed in the previous section. The use of a small step size is unfortunate in practice, as it may significantly affect the convergence rate.

It is natural to ask whether the existing convergence proof based on stochastic Fejér monotonicity can be extended to the use of larger step sizes. The answer is negative, as shown by the following example.

Example 1. Consider the toy problem

$$\min_{x \in \mathbb{R}^3} \frac12 \big(x^{(1)} + x^{(2)} + x^{(3)} - 1\big)^2$$

that is, we take $f(x) = \frac12 (x^{(1)} + x^{(2)} + x^{(3)} - 1)^2$ and $g = h = M = 0$. One of the minimizers is $x_* = (\frac13, \frac13, \frac13)$. The global Lipschitz constant of $\nabla f$ is equal to 3 and the coordinate-wise Lipschitz constants are equal to 1. The CD proximal gradient algorithm (7) reads

$$x_{k+1}^{(i)} = \begin{cases} x_k^{(i)} - \tau \big(x_k^{(1)} + x_k^{(2)} + x_k^{(3)} - 1\big) & \text{if } i = i_{k+1} \\ x_k^{(i)} & \text{otherwise} \end{cases}$$

where we used $\tau_1 = \tau_2 = \tau_3 \triangleq \tau$ for simplicity. By Theorem 2, $x_k$ converges almost surely to a minimizer whenever $\tau < 1$. Setting $x_0 = 0$, one has $\|x_0 - x_*\|^2 = \frac13$. It is immediately seen that

$$\mathbb{E}\|x_1 - x_*\|^2 = \Big(\tau - \frac13\Big)^2 + \frac19 + \frac19$$

where $\mathbb{E}$ represents the expectation. In particular, $\mathbb{E}\|x_1 - x_*\|^2 > \|x_0 - x_*\|^2$ as soon as $\tau > 2/3$. Therefore, the sequence $\mathbb{E}\|x_k - x_*\|^2$ is not decreasing. This example shows that the proof techniques based on monotone operators and Fejér monotonicity are not directly applicable in the case of long step sizes. Indeed, as shown in Lemma 3 below, one needs to make use of another Lyapunov function, defined in (19). That inequality shows that the sequence exhibits a stochastic monotonicity property in the Bregman divergence sense [1].
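The computation in Example 1 is small enough to be reproduced exactly. The sketch below enumerates the three equally likely choices of $i_1$ to evaluate $\mathbb{E}\|x_1 - x_*\|^2$, and also runs the CD iteration to confirm that the objective still goes to zero for a step size $\tau \in (2/3, 1)$. The function names are hypothetical.

```python
import random

X_STAR = [1.0 / 3.0] * 3


def expected_sq_dist_after_one_step(tau):
    """E||x_1 - x_*||^2 starting from x_0 = 0, by enumerating i_1 in {1, 2, 3}."""
    x0 = [0.0, 0.0, 0.0]
    total = 0.0
    for i in range(3):
        x1 = list(x0)
        x1[i] -= tau * (sum(x0) - 1.0)
        total += sum((a - b) ** 2 for a, b in zip(x1, X_STAR)) / 3.0
    return total


def run_cd(tau, iters, seed=0):
    """Run the CD iteration of Example 1 and return the final residual sum(x) - 1."""
    rng = random.Random(seed)
    x = [0.0, 0.0, 0.0]
    for _ in range(iters):
        i = rng.randrange(3)
        x[i] -= tau * (sum(x) - 1.0)
    return sum(x) - 1.0


initial = sum(b ** 2 for b in X_STAR)                  # ||x_0 - x_*||^2 = 1/3
after_large = expected_sq_dist_after_one_step(0.9)     # exceeds 1/3
after_small = expected_sq_dist_after_one_step(0.5)     # below 1/3
final_residual = run_cd(0.9, iters=50)
```

Whatever coordinate is selected, the residual $x^{(1)} + x^{(2)} + x^{(3)} - 1$ is multiplied by $1 - \tau$ at each iteration, so the objective decreases geometrically even though the expected distance to $x_*$ increases at the first step.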

3 Proof of Theorem 2

3.1 Preliminary Lemma

For every $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we define

$$V(x, y) := \frac12 \|x\|_{\tau^{-1}}^2 + \langle y, Mx \rangle + \frac12 \|y\|_{\sigma^{-1}}^2. \qquad (8)$$

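Lemma 1. Let Assumption 2.1(a-b) hold true. Let $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $(x_*, y_*) \in \mathcal{S}$. Define

$$\bar y = \mathrm{prox}_{\sigma,h^\star}\big(y + D(\sigma) M x\big)$$

$$\bar x = \mathrm{prox}_{\tau,g}\Big(x - D(\tau)\big(\nabla f(x) + M^\star(2\bar y - y)\big)\Big)$$

and set $z = (x, y)$, $z_* = (x_*, y_*)$, $\bar z = (\bar x, \bar y)$. Then,

$$\langle \nabla f(x_*) - \nabla f(x), x_* - \bar x \rangle + V(\bar z - z) \leq V(z - z_*) - V(\bar z - z_*).$$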

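Proof. The inclusions (2)-(3) also read

$$\forall u \in \mathcal{X}, \quad g(u) \geq g(x_*) + \langle -\nabla f(x_*) - M^\star y_*, u - x_* \rangle$$

$$\forall v \in \mathcal{Y}, \quad h^\star(v) \geq h^\star(y_*) + \langle M x_*, v - y_* \rangle.$$

Setting $u = \bar x$ and $v = \bar y$ in the above inequalities, we obtain

$$g(\bar x) \geq g(x_*) + \langle \nabla f(x_*) + M^\star y_*, x_* - \bar x \rangle \qquad (9)$$

$$h^\star(\bar y) \geq h^\star(y_*) + \langle M x_*, \bar y - y_* \rangle. \qquad (10)$$

By definition of the proximal operator,

$$\bar y = \arg\min_{v \in \mathcal{Y}} h^\star(v) - \langle v, Mx \rangle + \frac12 \|v - y\|_{\sigma^{-1}}^2 \qquad (11)$$

$$\bar x = \arg\min_{u \in \mathcal{X}} g(u) + \langle u, \nabla f(x) + M^\star(2\bar y - y) \rangle + \frac12 \|u - x\|_{\tau^{-1}}^2. \qquad (12)$$

Consider Equality (11) above. It classically implies [47] that for any $v \in \mathcal{Y}$,

$$h^\star(\bar y) - \langle \bar y, Mx \rangle + \frac12 \|\bar y - y\|_{\sigma^{-1}}^2 \leq h^\star(v) - \langle v, Mx \rangle + \frac12 \|v - y\|_{\sigma^{-1}}^2 - \frac12 \|\bar y - v\|_{\sigma^{-1}}^2. \qquad (13)$$

Setting $v = y_*$, we obtain

$$h^\star(\bar y) \leq h^\star(y_*) + \langle \bar y - y_*, Mx \rangle + \frac12 \|y_* - y\|_{\sigma^{-1}}^2 - \frac12 \|\bar y - y_*\|_{\sigma^{-1}}^2 - \frac12 \|\bar y - y\|_{\sigma^{-1}}^2 \qquad (14)$$

and using (10), we finally have

$$\langle M(x_* - x), \bar y - y_* \rangle \leq \frac12 \|y_* - y\|_{\sigma^{-1}}^2 - \frac12 \|\bar y - y_*\|_{\sigma^{-1}}^2 - \frac12 \|\bar y - y\|_{\sigma^{-1}}^2. \qquad (15)$$

Similarly, Equality (12) implies that for any $u \in \mathcal{X}$,

$$g(\bar x) + \langle \bar x, \nabla f(x) + M^\star(2\bar y - y) \rangle + \frac12 \|\bar x - x\|_{\tau^{-1}}^2 \leq g(u) + \langle u, \nabla f(x) + M^\star(2\bar y - y) \rangle + \frac12 \|u - x\|_{\tau^{-1}}^2 - \frac12 \|\bar x - u\|_{\tau^{-1}}^2. \qquad (16)$$

We set $u = x_*$. This yields

$$g(\bar x) \leq g(x_*) + \langle x_* - \bar x, \nabla f(x) + M^\star(2\bar y - y) \rangle + \frac12 \|x_* - x\|_{\tau^{-1}}^2 - \frac12 \|\bar x - x_*\|_{\tau^{-1}}^2 - \frac12 \|\bar x - x\|_{\tau^{-1}}^2.$$

Using moreover Inequality (9), we obtain

$$\langle \nabla f(x_*) + M^\star y_*, x_* - \bar x \rangle \leq \langle x_* - \bar x, \nabla f(x) + M^\star(2\bar y - y) \rangle + \frac12 \|x_* - x\|_{\tau^{-1}}^2 - \frac12 \|\bar x - x_*\|_{\tau^{-1}}^2 - \frac12 \|\bar x - x\|_{\tau^{-1}}^2$$

hence, rearranging the terms,

$$\langle \nabla f(x_*) - \nabla f(x), x_* - \bar x \rangle - \frac12 \|x_* - x\|_{\tau^{-1}}^2 + \frac12 \|\bar x - x_*\|_{\tau^{-1}}^2 + \frac12 \|\bar x - x\|_{\tau^{-1}}^2 \leq \langle 2\bar y - y - y_*, M(x_* - \bar x) \rangle.$$

Summing the above inequality with (15),

$$\begin{aligned} \langle \nabla f(x_*) - \nabla f(x), x_* - \bar x \rangle &+ \frac12 \|\bar x - x\|_{\tau^{-1}}^2 + \langle \bar y - y, M(\bar x - x) \rangle + \frac12 \|\bar y - y\|_{\sigma^{-1}}^2 \\ &\leq \frac12 \|x - x_*\|_{\tau^{-1}}^2 + \langle y - y_*, M(x - x_*) \rangle + \frac12 \|y - y_*\|_{\sigma^{-1}}^2 \\ &\quad - \frac12 \|\bar x - x_*\|_{\tau^{-1}}^2 - \langle \bar y - y_*, M(\bar x - x_*) \rangle - \frac12 \|\bar y - y_*\|_{\sigma^{-1}}^2. \end{aligned}$$

This completes the proof of the lemma thanks to the definition of $V$.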
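Lemma 1 can be sanity-checked numerically on a small instance. The sketch below uses a toy problem chosen here for illustration ($f(x) = \frac12\|x - c\|^2$ with $c = (0, 2)$, $g = 0$, $h = \lambda|\cdot|$ with $\lambda = 0.5$ and $M = (1, -1)$), whose saddle point $x_* = (0.5, 1.5)$, $y_* = -0.5$ follows from (2)-(3). It evaluates the gap between the right-hand side and the left-hand side of the lemma at arbitrary points $(x, y)$ and arbitrary positive step sizes. All names are hypothetical.

```python
import random

lam, c, M = 0.5, [0.0, 2.0], [1.0, -1.0]      # M has a single row
x_star, y_star = [0.5, 1.5], -0.5


def V(dx, dy, tau, sigma):
    """V from Eq. (8) for the toy instance (one dual variable)."""
    Mdx = M[0] * dx[0] + M[1] * dx[1]
    return 0.5 * sum(d * d / t for d, t in zip(dx, tau)) + dy * Mdx + 0.5 * dy * dy / sigma


def lemma1_gap(x, y, tau, sigma):
    """Right-hand side minus left-hand side of Lemma 1; should be nonnegative."""
    Mx = M[0] * x[0] + M[1] * x[1]
    y_bar = min(max(y + sigma * Mx, -lam), lam)           # prox of h* = projection on [-lam, lam]
    v = 2.0 * y_bar - y
    x_bar = [x[i] - tau[i] * ((x[i] - c[i]) + M[i] * v) for i in range(2)]   # g = 0
    # grad f(x_*) - grad f(x) = x_* - x because grad f(x) = x - c.
    inner = sum((x_star[i] - x[i]) * (x_star[i] - x_bar[i]) for i in range(2))
    sub = lambda a, b: [ai - bi for ai, bi in zip(a, b)]
    lhs = inner + V(sub(x_bar, x), y_bar - y, tau, sigma)
    rhs = V(sub(x, x_star), y - y_star, tau, sigma) - V(sub(x_bar, x_star), y_bar - y_star, tau, sigma)
    return rhs - lhs


rng = random.Random(0)
gaps = []
for _ in range(1000):
    x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    y = rng.uniform(-3, 3)
    tau = [rng.uniform(0.1, 3.0), rng.uniform(0.1, 3.0)]
    sigma = rng.uniform(0.1, 3.0)
    gaps.append(lemma1_gap(x, y, tau, sigma))
smallest_gap = min(gaps)
```

Note that the lemma does not require any step-size condition: $V$ need not be nonnegative, yet the inequality holds for every positive $\tau$ and $\sigma$.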

3.2 Study of Algorithm 3

We first prove Theorem 2 in the special case $m_1 = \cdots = m_p = 1$. In that case, Algorithm 2 boils down to Algorithm 3. We recall that in this case, the vector $\boldsymbol{y}_k^{(j)}$ is reduced to a single value $\boldsymbol{y}_k^{(j)}(i) \in \mathcal{Y}_j$ where $i$ is the unique index such that $M_{j,i} \neq 0$. We simply denote this value by $y_k^{(j)}$.

We denote by $\mathcal{F}_k$ the filtration generated by the random variables (r.v.) $i_1, \cdots, i_k$. We denote by $\mathbb{E}_k(\cdot) = \mathbb{E}(\cdot \mid \mathcal{F}_k)$ the conditional expectation w.r.t. $\mathcal{F}_k$.

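Lemma 2. Let Assumptions 2.1(a,b,d) hold true. Suppose $m_1 = \cdots = m_p = 1$. Consider Algorithm 3 and let $\gamma_1, \dots, \gamma_n, \gamma'_1, \dots, \gamma'_p$ be arbitrary positive coefficients. For every $k \geq 1$ and every $\mathcal{F}_k$-measurable pair of random variables $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$,

$$\mathbb{E}_k(x_{k+1}) = \frac1n \bar x_{k+1} + \Big(1 - \frac1n\Big) x_k$$

$$\mathbb{E}_k(\|x_{k+1} - X\|_\gamma^2) = \frac1n \|\bar x_{k+1} - X\|_\gamma^2 + \Big(1 - \frac1n\Big) \|x_k - X\|_\gamma^2$$

$$\mathbb{E}_k(\|y_{k+1} - Y\|_{\gamma'}^2) = \frac1n \|\bar y_{k+1} - Y\|_{\gamma'}^2 + \Big(1 - \frac1n\Big) \|y_k - Y\|_{\gamma'}^2 - \frac1n \|\bar y_{k+1} - y_k\|_{D(1-\pi)\gamma'}^2$$

$$\mathbb{E}_k(\langle y_{k+1} - Y, M(x_{k+1} - X) \rangle) = \frac1n \langle \bar y_{k+1} - Y, M(\bar x_{k+1} - X) \rangle + \Big(1 - \frac1n\Big) \langle y_k - Y, M(x_k - X) \rangle - \frac1n \langle D(1 - \pi)(\bar y_{k+1} - y_k), M(\bar x_{k+1} - x_k) \rangle.$$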

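Proof. The first equality is immediate.

Consider the second one. $\mathbb{E}_k(\|x_{k+1} - X\|^2_{\gamma}) = \sum_{i=1}^n \gamma_i \mathbb{E}_k(\|x^{(i)}_{k+1} - X^{(i)}\|^2)$, which coincides with $\sum_{i=1}^n \gamma_i \big(\tfrac1n \|\bar x^{(i)}_{k+1} - X^{(i)}\|^2 + (1 - \tfrac1n)\|x^{(i)}_k - X^{(i)}\|^2\big)$, and the second equality is proved.

Similarly for the third equality, $\mathbb{E}_k(\|y_{k+1} - Y\|^2_{\gamma'}) = \sum_{j=1}^p \gamma'_j \mathbb{E}_k(\|y^{(j)}_{k+1} - Y^{(j)}\|^2)$ and for every $j$,
\[
\mathbb{E}_k(\|y^{(j)}_{k+1} - Y^{(j)}\|^2) = \|y^{(j)}_k + \pi_j(\bar y^{(j)}_{k+1} - y^{(j)}_k) - Y^{(j)}\|^2\, \mathbb{P}(j \in \mathcal{J}(i_{k+1})) + \|y^{(j)}_k - Y^{(j)}\|^2\, \mathbb{P}(j \notin \mathcal{J}(i_{k+1})) .
\]
As $j \in J(i_{k+1}) \Leftrightarrow i_{k+1} \in I(j)$, we get
\[
\mathbb{P}(j \in J(i_{k+1})) = \mathbb{P}(i_{k+1} \in I(j)) = \mathrm{card}(I(j))/n = 1/n .
\]
From (5),
\[
\pi_j = \mathbb{P}(j \in J(i_{k+1}) \,|\, j \in \mathcal{J}(i_{k+1})) = \frac{\mathbb{P}(j \in J(i_{k+1}) \ \&\ j \in \mathcal{J}(i_{k+1}))}{\mathbb{P}(j \in \mathcal{J}(i_{k+1}))} = \frac{\mathbb{P}(j \in J(i_{k+1}))}{\mathbb{P}(j \in \mathcal{J}(i_{k+1}))}
\]
and so
\[
\mathbb{P}(j \in \mathcal{J}(i_{k+1})) = \frac{1}{n\pi_j} = \frac{|\{i : j \in \mathcal{J}(i)\}|}{n} .
\]
We also have
\[
\|y^{(j)}_k + \pi_j(\bar y^{(j)}_{k+1} - y^{(j)}_k) - Y^{(j)}\|^2 = \pi_j \|\bar y^{(j)}_{k+1} - Y^{(j)}\|^2 + (1 - \pi_j)\|y^{(j)}_k - Y^{(j)}\|^2 - \pi_j(1 - \pi_j)\|\bar y^{(j)}_{k+1} - y^{(j)}_k\|^2 .
\]
This leads to
\[
\mathbb{E}_k(\|y^{(j)}_{k+1} - Y^{(j)}\|^2) = \tfrac1n \|\bar y^{(j)}_{k+1} - Y^{(j)}\|^2 + (1 - \tfrac1n)\|y^{(j)}_k - Y^{(j)}\|^2 - \frac{1 - \pi_j}{n} \|\bar y^{(j)}_{k+1} - y^{(j)}_k\|^2 .
\]
This proves the third equality.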

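Consider the fourth equality. Note that
\[
\langle y_{k+1} - Y, M(x_{k+1} - X)\rangle = \sum_{i=1}^n \sum_{j \in J(i)} \langle y^{(j)}_{k+1} - Y^{(j)}, M_{j,i}(x^{(i)}_{k+1} - X^{(i)})\rangle .
\]
For any pair $(i, j)$ such that $j \in J(i)$, the conditional expectation of each term in the sum is equal to
\begin{align*}
&\tfrac1n \langle \pi_j \bar y^{(j)}_{k+1} + (1 - \pi_j) y^{(j)}_k - Y^{(j)}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle \\
&\quad + \big(\tfrac{1}{n\pi_j} - \tfrac1n\big)\langle \pi_j \bar y^{(j)}_{k+1} + (1 - \pi_j) y^{(j)}_k - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&\quad + \big(1 - \tfrac{1}{n\pi_j}\big)\langle y^{(j)}_k - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&= \tfrac{\pi_j}{n}\langle \bar y^{(j)}_{k+1} - Y^{(j)}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle + \big(1 - \tfrac2n + \tfrac{\pi_j}{n}\big)\langle y^{(j)}_k - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&\quad + \big(\tfrac1n - \tfrac{\pi_j}{n}\big)\langle y^{(j)}_k - Y^{(j)}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle + \big(\tfrac1n - \tfrac{\pi_j}{n}\big)\langle \bar y^{(j)}_{k+1} - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&= \tfrac1n \langle \bar y^{(j)}_{k+1} - Y^{(j)}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle + (1 - \tfrac1n)\langle y^{(j)}_k - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&\quad + \big(\tfrac1n - \tfrac{\pi_j}{n}\big)\langle y^{(j)}_k - \bar y^{(j)}_{k+1}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle + \big(\tfrac1n - \tfrac{\pi_j}{n}\big)\langle \bar y^{(j)}_{k+1} - y^{(j)}_k, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&= \tfrac1n \langle \bar y^{(j)}_{k+1} - Y^{(j)}, M_{j,i}(\bar x^{(i)}_{k+1} - X^{(i)})\rangle + (1 - \tfrac1n)\langle y^{(j)}_k - Y^{(j)}, M_{j,i}(x^{(i)}_k - X^{(i)})\rangle \\
&\quad + \big(\tfrac1n - \tfrac{\pi_j}{n}\big)\langle y^{(j)}_k - \bar y^{(j)}_{k+1}, M_{j,i}(\bar x^{(i)}_{k+1} - x^{(i)}_k)\rangle .
\end{align*}
Finally, summing over all pairs $(i, j)$, we obtain
\[
\mathbb{E}_k(\langle y_{k+1} - Y, M(x_{k+1} - X)\rangle) = \tfrac1n \langle \bar y_{k+1} - Y, M(\bar x_{k+1} - X)\rangle + (1 - \tfrac1n)\langle y_k - Y, M(x_k - X)\rangle - \tfrac1n \langle D(1 - \pi)(\bar y_{k+1} - y_k), M(\bar x_{k+1} - x_k)\rangle ,
\]
which is the fourth equality in the Lemma.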
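The third equality of Lemma 2 can be sanity-checked in exact arithmetic. The sketch below is only an illustration for a single scalar dual coordinate: it assumes that the coordinate is updated with probability $1/(n\pi_j)$ and left unchanged otherwise, as in the proof above. The function names `expected_sq_dist` and `lemma2_rhs` are hypothetical and do not come from the paper.

```python
from fractions import Fraction as F

def expected_sq_dist(n, pi, y, y_bar, Y):
    """Direct expectation of (y_{k+1} - Y)^2 for one scalar dual coordinate.

    Assumption: the coordinate moves to y + pi*(y_bar - y) with
    probability 1/(n*pi) and stays at y otherwise.
    """
    p_update = F(1, n) / pi
    moved = y + pi * (y_bar - y)
    return p_update * (moved - Y) ** 2 + (1 - p_update) * (y - Y) ** 2

def lemma2_rhs(n, pi, y, y_bar, Y):
    """Right-hand side of the third equality of Lemma 2 (scalar case)."""
    return (F(1, n) * (y_bar - Y) ** 2
            + (1 - F(1, n)) * (y - Y) ** 2
            - (1 - pi) / n * (y_bar - y) ** 2)

# Arbitrary rational test data with pi >= 1/n so that 1/(n*pi) is a probability.
args = (4, F(1, 2), F(3), F(-2), F(5))
left, right = expected_sq_dist(*args), lemma2_rhs(*args)
```

Using `Fraction` avoids rounding, so the two sides can be compared with strict equality.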

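Assume that $\tau_i^{-1} > \beta_i$ for each $i \in \{1, \dots, n\}$. Define for every $z = (x, y) \in \mathcal{X} \times \mathcal{Y}$,
\[
\tilde V(z) = \tilde V(x, y) := \tfrac12 \|x\|^2_{\tau^{-1} - \beta} + \langle D(2 - \pi) y, M x\rangle + \tfrac12 \|y\|^2_{\sigma^{-1}(2 - \pi)} . \tag{17}
\]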

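Lemma 3. Let Assumptions 2.1(a,b,c,d) hold true. Suppose $m_1 = \cdots = m_p = 1$ and assume that $\tau_i^{-1} > \beta_i$ for each $i \in \{1, \dots, n\}$. Consider Algorithm 3 and define for every $k \in \mathbb{N}$,
\[
S_{k,*} := f(x_k) - f(x_*) - \langle \nabla f(x_*), x_k - x_*\rangle . \tag{18}
\]
Then the following inequality holds:
\[
\mathbb{E}_k\big[S_{k+1,*} + V(z_{k+1} - z_*)\big] \le (1 - \tfrac1n) S_{k,*} + V(z_k - z_*) - \tfrac1n \tilde V(\bar z_{k+1} - z_k) \tag{19}
\]
where $\bar z_{k+1} = (\bar x_{k+1}, \bar y_{k+1})$.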

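Proof. We can write the relations of Lemma 2 as
\begin{align*}
\|\bar x_{k+1} - X\|^2_{\tau^{-1}} &= n\mathbb{E}_k(\|x_{k+1} - X\|^2_{\tau^{-1}}) - (n - 1)\|x_k - X\|^2_{\tau^{-1}} \\
\|\bar y_{k+1} - Y\|^2_{\sigma^{-1}} &= n\mathbb{E}_k(\|y_{k+1} - Y\|^2_{\sigma^{-1}}) - (n - 1)\|y_k - Y\|^2_{\sigma^{-1}} + \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}(1 - \pi)} \\
\langle \bar y_{k+1} - Y, M(\bar x_{k+1} - X)\rangle &= n\mathbb{E}_k(\langle y_{k+1} - Y, M(x_{k+1} - X)\rangle) - (n - 1)\langle y_k - Y, M(x_k - X)\rangle \\
&\quad + \langle D(1 - \pi)(\bar y_{k+1} - y_k), M(\bar x_{k+1} - x_k)\rangle .
\end{align*}
Choosing $Z = (X, Y)$, denoting $\bar z_k = (\bar x_k, \bar y_k)$ and $z_k = (x_k, y_k)$, we obtain
\[
V(\bar z_{k+1} - Z) = n\mathbb{E}_k(V(z_{k+1} - Z)) - nV(z_k - Z) + V(z_k - Z) + \tfrac12 \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}(1 - \pi)} + \langle D(1 - \pi)(\bar y_{k+1} - y_k), M(\bar x_{k+1} - x_k)\rangle . \tag{20}
\]
We shall denote
\[
R_\pi = \tfrac12 \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}(1 - \pi)} + \langle D(1 - \pi)(\bar y_{k+1} - y_k), M(\bar x_{k+1} - x_k)\rangle . \tag{21}
\]
Let $z_* = (x_*, y_*) \in \mathcal{S}$. By Lemma 1,
\[
\langle \nabla f(x_*) - \nabla f(x_k), x_* - \bar x_{k+1}\rangle + V(\bar z_{k+1} - z_k) \le V(z_k - z_*) - V(\bar z_{k+1} - z_*) .
\]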

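Identifying $Z$ in (20) to $z_*$ and $z_k$ successively, we obtain
\[
\langle \nabla f(x_*) - \nabla f(x_k), x_* - \bar x_{k+1}\rangle + n\mathbb{E}_k(V(z_{k+1} - z_k)) \le nV(z_k - z_*) - n\mathbb{E}_k(V(z_{k+1} - z_*)) - 2R_\pi .
\]
Dividing both sides of the above inequality by $n$ and using that $\bar x_{k+1} = n\mathbb{E}_k(x_{k+1}) - (n - 1)x_k$, we obtain
\[
\langle \nabla f(x_*) - \nabla f(x_k), x_* - \mathbb{E}_k(x_{k+1}) + (1 - \tfrac1n)(x_k - x_*)\rangle + \mathbb{E}_k(V(z_{k+1} - z_k)) \le V(z_k - z_*) - \mathbb{E}_k(V(z_{k+1} - z_*)) - \tfrac2n R_\pi .
\]
Rearranging the terms,
\begin{align*}
&\mathbb{E}_k\big[\langle \nabla f(x_k) - \nabla f(x_*), x_{k+1} - x_k\rangle + V(z_{k+1} - z_*)\big] \tag{22} \\
&\quad \le -\tfrac1n \langle \nabla f(x_k) - \nabla f(x_*), x_k - x_*\rangle + V(z_k - z_*) - \mathbb{E}_k(V(z_{k+1} - z_k)) - \tfrac2n R_\pi .
\end{align*}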

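We now use Assumption 2.1(c), knowing that $x_{k+1}$ only differs from $x_k$ along coordinate $i_{k+1}$:
\[
f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{\beta_{i_{k+1}}}{2}\|x_{k+1} - x_k\|^2 = f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac12 \|x_{k+1} - x_k\|^2_{\beta} \tag{23}
\]
which implies that $\langle \nabla f(x_k), x_{k+1} - x_k\rangle \ge f(x_{k+1}) - f(x_k) - \tfrac12 \|x_{k+1} - x_k\|^2_{\beta}$. Thus, plugging this into (22),
\begin{align*}
&\mathbb{E}_k\big[f(x_{k+1}) - f(x_k) - \tfrac12 \|x_{k+1} - x_k\|^2_{\beta} - \langle \nabla f(x_*), x_{k+1} - x_k\rangle + V(z_{k+1} - z_*)\big] \\
&\quad \le -\tfrac1n \langle \nabla f(x_k) - \nabla f(x_*), x_k - x_*\rangle + V(z_k - z_*) - \mathbb{E}_k(V(z_{k+1} - z_k)) - \tfrac2n R_\pi .
\end{align*}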

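Introducing the quantity $S_{k,*}$ as in (18), the inequality simplifies to
\begin{align*}
&\mathbb{E}_k\big[S_{k+1,*} + V(z_{k+1} - z_*) - \tfrac12 \|x_{k+1} - x_k\|^2_{\beta}\big] \\
&\quad \le f(x_k) - f(x_*) - (1 - \tfrac1n)\langle \nabla f(x_*), x_k - x_*\rangle - \tfrac1n \langle \nabla f(x_k), x_k - x_*\rangle + V(z_k - z_*) - \mathbb{E}_k(V(z_{k+1} - z_k)) - \tfrac2n R_\pi .
\end{align*}
An estimate of the right-hand side is obtained upon noticing that $\langle \nabla f(x_k), x_k - x_*\rangle \ge f(x_k) - f(x_*)$. Therefore,
\[
\mathbb{E}_k\big[S_{k+1,*} + V(z_{k+1} - z_*) - \tfrac12 \|x_{k+1} - x_k\|^2_{\beta}\big] \le (1 - \tfrac1n) S_{k,*} + V(z_k - z_*) - \mathbb{E}_k(V(z_{k+1} - z_k)) - \tfrac2n R_\pi .
\]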

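Using Lemma 2, (17) and (21), it is immediate that
\[
\mathbb{E}_k\big(V(z_{k+1} - z_k) - \tfrac12 \|x_{k+1} - x_k\|^2_{\beta}\big) + \tfrac2n R_\pi = \tfrac1n V(\bar z_{k+1} - z_k) - \tfrac1n R_\pi - \tfrac{1}{2n}\|\bar x_{k+1} - x_k\|^2_{\beta} + \tfrac2n R_\pi = \tfrac1n \tilde V(\bar z_{k+1} - z_k)
\]
and the proof is complete.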

Recall that we denote by ρ(A) the spectral radius of a matrix A.

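Lemma 4. Suppose that $m_1 = \cdots = m_p = 1$ and assume that the following condition holds for every $i \in \{1, \dots, n\}$:
\[
\tau_i < \frac{1}{\beta_i + \rho\Big(\sum_{j \in J(i)} (2 - \pi_j)\sigma_j M_{j,i}^\star M_{j,i}\Big)} . \tag{24}
\]
Then $\tilde V^{1/2}$ is a norm on $\mathcal{X} \times \mathcal{Y}$.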

Note that under the assumptions of Lemma 4, V1/2 is also, a fortiori, a norm, but that V1/2 need not

be a norm.

Proof. Let $\gamma^{-1} = \tau^{-1} - \beta$. Denote by $\sigma'_j = (2 - \pi_j)\sigma_j$ for all $j$ and by $D(\sigma')$ the diagonal matrix on $\mathcal{Y} \to \mathcal{Y}$ defined by $D(\sigma')(y) := (\sigma'_1 y^{(1)}, \dots, \sigma'_p y^{(p)})$ for every $y = (y^{(1)}, \dots, y^{(p)})$. We define $D(\gamma)$ similarly on $\mathcal{X} \to \mathcal{X}$. By [28, Theorem 7.7.6], a sufficient (and necessary) condition for $\tilde V$ to be a squared norm is that $D(\gamma^{-1}) \succ M^\star D(\sigma') M$ (where the notation $A \succ B$ means that $A - B$ is a positive definite matrix). Defining $R = D(\sigma'^{1/2}) M D(\gamma^{1/2})$ (that is, $R_{j,i} = \sqrt{\gamma_i \sigma'_j}\, M_{j,i}$ for every $j, i$), the condition reads equivalently $\rho(R^\star R) < 1$. As the set $I(j)$ is reduced to a unique element for all $j$, the matrix $R^\star R$ is (block) diagonal. Precisely, for any $1 \le i, \ell \le n$, the $(i, \ell)$-component $(R^\star R)_{i,\ell}$ is zero whenever $i \neq \ell$ and is equal to $(R^\star R)_{i,i} = \gamma_i \sum_{j \in J(i)} \sigma'_j M_{j,i}^\star M_{j,i}$ otherwise. The condition $\rho(R^\star R) < 1$ yields $\gamma_i\, \rho\big(\sum_{j \in J(i)} \sigma'_j M_{j,i}^\star M_{j,i}\big) < 1$ for each $i \in \{1, \dots, n\}$, which is in turn equivalent to (24).
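Condition (24) can be illustrated numerically. The following is a minimal sketch, assuming scalar blocks (so that $\rho(M_{j,i}^\star M_{j,i}) = M_{j,i}^2$) and a matrix with exactly one non-zero entry per row, as required by $m_1 = \cdots = m_p = 1$. The helper names `max_step_sizes` and `v_tilde_matrix` and all numerical values are hypothetical choices made for this illustration.

```python
import numpy as np

def max_step_sizes(M, sigma, beta, pi):
    """Upper bound on tau_i from condition (24), scalar blocks assumed."""
    sigma_prime = (2.0 - pi) * sigma                      # sigma'_j = (2 - pi_j) sigma_j
    column_energy = (sigma_prime[:, None] * M ** 2).sum(axis=0)
    return 1.0 / (beta + column_energy)

def v_tilde_matrix(M, tau, sigma, pi, beta):
    """Symmetric matrix Q such that V~(x, y) = 0.5 * z^T Q z with z = (x, y)."""
    cross = (2.0 - pi)[:, None] * M                       # D(2 - pi) M
    return np.block([
        [np.diag(1.0 / tau - beta), cross.T],
        [cross, np.diag((2.0 - pi) / sigma)],
    ])

# Each row of M has a single non-zero entry (m_j = 1 for all j).
M = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, -3.0]])
sigma = np.array([0.5, 0.1, 0.2])
pi = np.array([1.0, 0.5, 1.0])
beta = np.array([1.0, 2.0])

tau_max = max_step_sizes(M, sigma, beta, pi)
min_eig_inside = np.linalg.eigvalsh(v_tilde_matrix(M, 0.9 * tau_max, sigma, pi, beta)).min()
min_eig_outside = np.linalg.eigvalsh(v_tilde_matrix(M, 1.1 * tau_max, sigma, pi, beta)).min()
```

With step sizes strictly below the bound the quadratic form is positive definite, while exceeding the bound on every coordinate makes it indefinite in this example.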

Proof of Theorem 1 in the case $m_1 = \cdots = m_p = 1$. Let $z_*$ be an arbitrary point in $\mathcal{S}$. Whenever condition (24) is met, the r.v. $V(z_k - z_*)$ and $\tilde V(\bar z_{k+1} - z_k)$ are non-negative. The r.v. $S_{k,*}$ is non-negative as well by convexity of $f$. We review two important consequences of Lemma 3.

• Define $U_k := S_{k,*} + V(z_k - z_*)$. A first consequence of Lemma 3 is that for all $k$,
\[
\mathbb{E}_k(U_{k+1}) \le U_k - \tfrac1n S_{k,*} .
\]
Recalling that $U_k$ and $S_{k,*}$ are non-negative r.v., the Robbins-Siegmund Lemma [42] implies that almost surely, $\lim_{k \to \infty} U_k$ exists and $\sum_k S_{k,*} < \infty$. In particular, $S_{k,*}$ converges almost surely to zero. By definition of $U_k$, this implies that $\lim_{k \to \infty} V(z_k - z_*)$ exists almost surely. Following the argument of [3, Prop. 9] (see also [29], [15, Prop. 2.3]), this implies that there exists an event $A$ of probability one such that for every $\omega \in A$ and every $\check z \in \mathcal{S}$, $\lim_{k \to \infty} V^{1/2}(z_k(\omega) - \check z)$ exists.

• A second consequence of Lemma 3 is that, by taking the expectation $\mathbb{E}$ of both sides of (19),
\[
\mathbb{E}\big[S_{k+1,*} + V(z_{k+1} - z_*)\big] \le \mathbb{E}\big[S_{k,*} + V(z_k - z_*)\big] - \tfrac1n \mathbb{E}\big(\tilde V(\bar z_{k+1} - z_k)\big)
\]
and by summing these inequalities, we obtain
\[
0 \le S_{0,*} + V(z_0 - z_*) - \frac1n \sum_{i=0}^{k} \mathbb{E}\big(\tilde V(\bar z_{i+1} - z_i)\big) . \tag{25}
\]
Thus $\mathbb{E}\big(\sum_{i=0}^{\infty} \tilde V(\bar z_{i+1} - z_i)\big) < \infty$. The integrand is non-negative by Lemma 4. It is therefore finite almost everywhere. In particular, the sequence $\tilde V(\bar z_{k+1} - z_k)$ converges almost surely to zero. By Lemma 4, $\bar z_{k+1} - z_k$ converges to zero almost surely. Say $\bar z_{k+1}(\omega) - z_k(\omega) \to 0$ for every $\omega \in B$ where $B$ is an event of probability one.

We introduce the mapping $T : \mathcal{X} \times \mathcal{Y} \to \mathcal{X} \times \mathcal{Y}$ such that for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the quantity $T(x, y)$ coincides with the couple $(\bar x, \bar y)$ given by
\begin{align*}
\bar y &= \mathrm{prox}_{\sigma, h^\star}\big(y + D(\sigma) M x\big) \\
\bar x &= \mathrm{prox}_{\tau, g}\big(x - D(\tau)\nabla f(x) - D(\tau) M^\star(2\bar y - y)\big) .
\end{align*}
With this definition, $\bar z_{k+1} = T(z_k)$. By non-expansiveness of the proximity operator, it is straightforward to show that $T$ is continuous. It is also straightforward to verify that its set of fixed points coincides with $\mathcal{S}$.

From now on to the end of this paragraph, we select a fixed $\omega \in A \cap B$. Note that $z_k(\omega)$ is a bounded sequence. Let $\tilde z$ be a cluster point of the latter. We have shown that $T(z_k(\omega)) - z_k(\omega) \to 0$, which implies that $T(\tilde z) - \tilde z = 0$ by continuity of $T$. Thus, $\tilde z \in \mathcal{S}$. This implies that $\lim_{k \to \infty} V^{1/2}(z_k(\omega) - \tilde z)$ exists. Since $V^{1/2}(z_k(\omega) - \tilde z)$ tends to zero at least on some subsequence, we conclude that $\lim_{k \to \infty} V^{1/2}(z_k(\omega) - \tilde z) = 0$. Otherwise stated, the sequence $z_k(\omega)$ converges to some point $\tilde z \in \mathcal{S}$. This completes the proof of Theorem 2 in the case $m_1 = \cdots = m_p = 1$.
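The claim that the fixed points of $T$ are exactly the saddle points can be illustrated on a toy instance. The sketch below is an assumption-laden example, not the paper's setting: it takes $f(x) = \tfrac12\|x - c\|^2$, $g = 0$ (so that $\mathrm{prox}_{\tau,g}$ is the identity) and $h = \mu\|\cdot\|_1$ (so that $\mathrm{prox}_{\sigma,h^\star}$ is the projection onto $[-\mu, \mu]^p$), with scalar step sizes. The vector $c$ is constructed so that a chosen pair $(x_*, y_*)$ satisfies the optimality conditions; `make_T` is a hypothetical helper name.

```python
import numpy as np

def make_T(M, c, mu, tau, sigma):
    """Operator T for f(x) = 0.5*||x - c||^2, g = 0, h = mu*||.||_1 (toy assumptions)."""
    def T(x, y):
        # prox of h* is the projection onto the box [-mu, mu]^p
        y_bar = np.clip(y + sigma * (M @ x), -mu, mu)
        # prox of g = 0 is the identity; grad f(x) = x - c
        x_bar = x - tau * (x - c) - tau * (M.T @ (2.0 * y_bar - y))
        return x_bar, y_bar
    return T

M = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
mu = 0.5
x_star = np.array([3.0, 1.0, 2.0])       # M @ x_star has no zero entry
y_star = mu * np.sign(M @ x_star)        # y* in the subdifferential of h at M x*
c = x_star + M.T @ y_star                # enforces grad f(x*) + M^T y* = 0

T = make_T(M, c, mu, tau=0.3, sigma=0.4)
x_fixed, y_fixed = T(x_star, y_star)                 # saddle point: left unchanged
x_moved, y_moved = T(np.zeros(3), np.zeros(2))       # non-saddle point: moved by T
```

The saddle point is returned unchanged for any positive step sizes, whereas the origin, which violates the optimality conditions here, is displaced.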

3.3 General Case

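For every $j \in \{1, \dots, p\}$, $\boldsymbol{\mathcal{Y}}_j = \mathcal{Y}_j^{I(j)}$ is equipped with the inner product $\langle u, v\rangle = \sum_{i \in I(j)} \langle u(i), v(i)\rangle$. The space $\boldsymbol{\mathcal{Y}}_j$ stores $|I(j)|$ duplicates of the original problem's $j$th dual variable $y^{(j)}$. We introduce the averaging operator $S_j : \boldsymbol{\mathcal{Y}}_j \to \mathcal{Y}_j$ defined for every $u \in \boldsymbol{\mathcal{Y}}_j$ by
\[
S_j(u) := \frac{1}{m_j} \sum_{i \in I(j)} u(i) .
\]
The averaging operators allow us to come back from duplicated dual variables to actual dual variables. For any $u \in \mathcal{Y}_j$, we denote by $1_{m_j} \otimes u = (u, \dots, u)$ the vector of $\boldsymbol{\mathcal{Y}}_j$ whose components all coincide with $u$. We introduce the linear operator $K_j : \mathcal{X} \to \boldsymbol{\mathcal{Y}}_j$ by
\[
K_j(x) = \big(M_{j,i}(x^{(i)}) : i \in I(j)\big) .
\]
The operators $S : \boldsymbol{\mathcal{Y}} \to \mathcal{Y}$ and $K : \mathcal{X} \to \boldsymbol{\mathcal{Y}}$ are respectively defined by $S(\boldsymbol{y}) := (S_1(\boldsymbol{y}^{(1)}), \dots, S_p(\boldsymbol{y}^{(p)}))$ and $K(x) := (K_1(x), \dots, K_p(x))$. It is immediate to verify that
\[
M = D(m) S K \tag{26}
\]
where $m = (m_1, \dots, m_p)$. To give some insight, the following example illustrates the construction.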

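Example 2. Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^3$ and define $M : \mathcal{X} \to \mathcal{Y}$ as the $3 \times 3$ matrix
\[
M = \begin{pmatrix} M_{1,1} & M_{1,2} & 0 \\ 0 & M_{2,2} & 0 \\ M_{3,1} & M_{3,2} & M_{3,3} \end{pmatrix} .
\]
Here, $I(1) = \{1, 2\}$ is the set of non-zero coefficients of the first row of $M$ and its cardinality is $m_1 = 2$. Similarly $m_2 = 1$, $m_3 = 3$ and $\boldsymbol{\mathcal{Y}} = \mathbb{R}^6$. Then $K : \mathbb{R}^3 \to \mathbb{R}^6$ coincides with the matrix
\[
K = \begin{pmatrix} M_{1,1} & 0 & 0 \\ 0 & M_{1,2} & 0 \\ 0 & M_{2,2} & 0 \\ M_{3,1} & 0 & 0 \\ 0 & M_{3,2} & 0 \\ 0 & 0 & M_{3,3} \end{pmatrix}
\]
and each row of $K$ contains exactly one non-zero coefficient. On the other hand, $S$ and $D(m)$ respectively coincide with
\[
S = \begin{pmatrix} \tfrac12 & \tfrac12 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & \tfrac13 & \tfrac13 & \tfrac13 \end{pmatrix}
\quad\text{and}\quad
D(m) = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{pmatrix}
\]
and obviously $D(m) S K = M$.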
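The construction of Example 2 can be reproduced numerically. The sketch below assumes scalar blocks and arbitrary numerical values for the non-zero entries of $M$; the helper `duplicate` is a hypothetical name. It builds $K$, $S$ and $D(m)$ from the sparsity pattern and also evaluates the identity $S D(\tilde\sigma) K = D(\sigma) M$ with $\tilde\sigma_j = m_j \sigma_j$.

```python
import numpy as np

def duplicate(M):
    """Build K, S and D(m) from the sparsity pattern of M (scalar blocks assumed)."""
    p, n = M.shape
    m = np.count_nonzero(M, axis=1)               # m_j = card(I(j))
    K = np.zeros((int(m.sum()), n))
    S = np.zeros((p, int(m.sum())))
    row = 0
    for j in range(p):
        for i in np.flatnonzero(M[j]):            # i runs over I(j)
            K[row, i] = M[j, i]                   # one non-zero entry per row of K
            S[j, row] = 1.0 / m[j]                # averaging over the duplicates
            row += 1
    return K, S, np.diag(m.astype(float)), m

# Same sparsity pattern as Example 2, with arbitrary non-zero values.
M = np.array([[1.5, -2.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.5, 3.0, -1.0]])
K, S, D_m, m = duplicate(M)

sigma = np.array([0.3, 0.7, 0.2])
sigma_tilde = np.repeat(m * sigma, m)             # (sigma~_1 1_{m_1}, ..., sigma~_p 1_{m_p})
lhs = S @ np.diag(sigma_tilde) @ K
rhs = np.diag(sigma) @ M
```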

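We define the function $\bar h := h \circ (D(m) S)$. By (26), Problem (1) is equivalent to
\[
\min_{x \in \mathcal{X}} f(x) + g(x) + \bar h(Kx) . \tag{27}
\]
We denote by $\boldsymbol{\mathcal{S}}$ the set of primal-dual solutions of the above problem, i.e., the set of pairs $(x_*, \boldsymbol{y}_*) \in \mathcal{X} \times \boldsymbol{\mathcal{Y}}$ satisfying
\begin{align*}
0 &\in \nabla f(x_*) + \partial g(x_*) + K^\star \boldsymbol{y}_* \\
0 &\in -K x_* + \partial \bar h^\star(\boldsymbol{y}_*) .
\end{align*}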

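Substituting $M$ with $K$, we may now apply Algorithm 3 to (27). For a fixed parameter $\sigma = (\sigma_1, \dots, \sigma_p)$, we define $\tilde\sigma_j := m_j \sigma_j$ and we define $\tilde\sigma \in \mathbb{R}^{\sum_{j=1}^p m_j}$ as the vector $\tilde\sigma := (\tilde\sigma_1 1_{m_1}, \dots, \tilde\sigma_p 1_{m_p})$ where $1_{m_j}$ is a vector of size $m_j$ whose components are all equal to one. Algorithm 3 writes

Initialization: Choose $x_0 \in \mathcal{X}$, $\boldsymbol{y}_0 \in \boldsymbol{\mathcal{Y}}$.

Iteration $k$: Define:
\begin{align*}
\bar{\boldsymbol{y}}_{k+1} &= \mathrm{prox}_{\tilde\sigma, \bar h^\star}\big(\boldsymbol{y}_k + D(\tilde\sigma) K x_k\big) \tag{28} \\
\bar x_{k+1} &= \mathrm{prox}_{\tau, g}\Big(x_k - D(\tau)\big(\nabla f(x_k) + K^\star(2\bar{\boldsymbol{y}}_{k+1} - \boldsymbol{y}_k)\big)\Big) . \tag{29}
\end{align*}
For $i = i_{k+1}$ and for each $(l, j) \in \mathcal{J}(i_{k+1})$, update:
\begin{align*}
x^{(i)}_{k+1} &= \bar x^{(i)}_{k+1} \tag{30} \\
\boldsymbol{y}^{(j)}_{k+1}(l) &= \boldsymbol{y}^{(j)}_k(l) + \pi_j(l)\big(\bar{\boldsymbol{y}}^{(j)}_{k+1}(l) - \boldsymbol{y}^{(j)}_k(l)\big) . \tag{31}
\end{align*}
Otherwise, set $x^{(i)}_{k+1} = x^{(i)}_k$, $\boldsymbol{y}^{(j)}_{k+1}(l) = \boldsymbol{y}^{(j)}_k(l)$.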

Using the results of Section 3.2 and the properties of $K$, the sequence $(x_k, \boldsymbol{y}_k)$ converges almost surely to a primal-dual point of Problem (27), provided that such a point exists and that the following condition holds:
\[
\tau_i < \frac{1}{\beta_i + \rho\Big(\sum_{(l,j) \in \{i\} \times J(i)} (2 - \pi_j(l))\tilde\sigma_j K_{(l,j),i}^\star K_{(l,j),i}\Big)} = \frac{1}{\beta_i + \rho\Big(\sum_{j \in J(i)} (2 - \pi_j(i))\tilde\sigma_j M_{j,i}^\star M_{j,i}\Big)}
\]
which is equivalent to (6). It remains to prove that the algorithm given by the iterations (28)–(31) coincides with Algorithm 2. To that end, we need the following Lemma.

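Lemma 5. For any $\boldsymbol{y} \in \boldsymbol{\mathcal{Y}}$,
\[
\mathrm{prox}_{\tilde\sigma, \bar h^\star}(\boldsymbol{y}) = \big(1_{m_1} \otimes \mathrm{prox}^{(1)}_{\sigma, h^\star}(S(\boldsymbol{y})), \dots, 1_{m_p} \otimes \mathrm{prox}^{(p)}_{\sigma, h^\star}(S(\boldsymbol{y}))\big) .
\]

Proof. We have $\bar h(\boldsymbol{y}) = h(m_1 S_1(\boldsymbol{y}^{(1)}), \dots, m_p S_p(\boldsymbol{y}^{(p)}))$. Thus,
\[
\bar h^\star(\boldsymbol{\varphi}) = \sup_{\boldsymbol{y} \in \boldsymbol{\mathcal{Y}}} \langle \boldsymbol{\varphi}, \boldsymbol{y}\rangle - h(m_1 S_1(\boldsymbol{y}^{(1)}), \dots, m_p S_p(\boldsymbol{y}^{(p)})) .
\]
For all $j \in \{1, \dots, p\}$, denote by $C_j$ the subset of $\boldsymbol{\mathcal{Y}}_j$ formed by the vectors of the form $(u, \dots, u)$ for some $u \in \mathcal{Y}_j$, and define $C = C_1 \times \cdots \times C_p$. Clearly, $\bar h^\star(\boldsymbol{\varphi}) = +\infty$ whenever $\boldsymbol{\varphi} \notin C$ and $\partial \bar h^\star(\boldsymbol{\varphi}) = \emptyset$ in that case. If on the other hand $\boldsymbol{\varphi} \in C$, one can write $\boldsymbol{\varphi}$ under the form $\boldsymbol{\varphi} = (1_{m_1} \otimes \varphi^{(1)}, \dots, 1_{m_p} \otimes \varphi^{(p)})$ for some $\varphi \in \mathcal{Y}$. In that case,
\begin{align*}
\bar h^\star(\boldsymbol{\varphi}) &= \sup_{y \in \mathcal{Y}} \sum_{j=1}^p \langle 1_{m_j} \otimes \varphi^{(j)}, 1_{m_j} \otimes y^{(j)}\rangle - h(m_1 y^{(1)}, \dots, m_p y^{(p)}) \\
&= \sup_{y \in \mathcal{Y}} \sum_{j=1}^p \langle \varphi^{(j)}, m_j y^{(j)}\rangle - h(m_1 y^{(1)}, \dots, m_p y^{(p)}) \\
&= h^\star(\varphi) .
\end{align*}
Then, $\boldsymbol{u} \in \partial \bar h^\star(\boldsymbol{\varphi})$ if and only if for every $\psi \in \mathcal{Y}$, $h^\star(\psi) \ge h^\star(\varphi) + \sum_{j=1}^p \langle \boldsymbol{u}^{(j)}, 1_{m_j} \otimes (\psi^{(j)} - \varphi^{(j)})\rangle$ or equivalently, $h^\star(\psi) \ge h^\star(\varphi) + \sum_{j=1}^p \langle m_j S_j(\boldsymbol{u}^{(j)}), \psi^{(j)} - \varphi^{(j)}\rangle$. Therefore, $\boldsymbol{u} \in \partial \bar h^\star(\boldsymbol{\varphi})$ if and only if $D(m) S(\boldsymbol{u}) \in \partial h^\star(\varphi)$.

Now consider an arbitrary $\boldsymbol{y} \in \boldsymbol{\mathcal{Y}}$ and set $\boldsymbol{q} = \mathrm{prox}_{\tilde\sigma, \bar h^\star}(\boldsymbol{y})$. This is equivalent to
\[
D(\tilde\sigma^{-1})(\boldsymbol{y} - \boldsymbol{q}) \in \partial \bar h^\star(\boldsymbol{q}) . \tag{32}
\]
In particular, $\boldsymbol{q} \in \mathrm{dom}(\partial \bar h^\star)$ and thus $\boldsymbol{q}$ has the form $\boldsymbol{q} = (1_{m_1} \otimes q^{(1)}, \dots, 1_{m_p} \otimes q^{(p)})$ for some $q \in \mathcal{Y}$. The inclusion (32) reads $D(m) S D(\tilde\sigma^{-1})(\boldsymbol{y} - \boldsymbol{q}) \in \partial h^\star(q)$. Since $D(m) S D(\tilde\sigma^{-1}) = D(\sigma^{-1}) S$, we obtain $D(\sigma^{-1})(S(\boldsymbol{y}) - q) \in \partial h^\star(q)$, which is equivalent to $q = \mathrm{prox}_{\sigma, h^\star}(S(\boldsymbol{y}))$. This completes the proof.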
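Lemma 5 can be checked on a simple instance. The sketch below assumes $h = \mu\|\cdot\|_1$ with scalar blocks, so that $h^\star$ is the indicator of the box $[-\mu, \mu]^p$ and $\mathrm{prox}_{\sigma, h^\star}$ is a clipping, independent of $\sigma$. The duplication pattern, the helper names `prox_duplicated` and `objective`, and the random data are hypothetical. The formula of the lemma is compared against randomly drawn feasible points of the duplicated problem, which must all have a larger or equal objective value.

```python
import numpy as np

def prox_duplicated(y_dup, groups, mu):
    """Lemma 5 formula when h = mu*||.||_1: average the duplicates, clip, replicate."""
    out = np.empty_like(y_dup)
    for idx in groups:
        out[idx] = np.clip(y_dup[idx].mean(), -mu, mu)   # 1_{m_j} (x) prox(S_j(y))
    return out

def objective(q_dup, y_dup, groups, sigma):
    """0.5 * ||q - y||^2 in the metric sigma~^{-1}, where sigma~_j = m_j * sigma_j."""
    return sum(np.sum((q_dup[idx] - y_dup[idx]) ** 2) / (2.0 * len(idx) * sigma[j])
               for j, idx in enumerate(groups))

rng = np.random.default_rng(0)
groups = [np.array([0, 1]), np.array([2]), np.array([3, 4, 5])]   # m = (2, 1, 3)
sigma = np.array([0.5, 1.0, 0.2])
mu = 0.7
y_dup = 2.0 * rng.normal(size=6)

q_star = prox_duplicated(y_dup, groups, mu)
best = objective(q_star, y_dup, groups, sigma)

# Feasible competitors: duplicates equal within each group, values inside the box.
gaps = []
for _ in range(200):
    values = rng.uniform(-mu, mu, size=len(groups))
    candidate = np.empty(6)
    for j, idx in enumerate(groups):
        candidate[idx] = values[j]
    gaps.append(objective(candidate, y_dup, groups, sigma) - best)
smallest_gap = min(gaps)
```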

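The proof of the following Lemma is immediate.

Lemma 6. For any $\boldsymbol{y} \in \boldsymbol{\mathcal{Y}}$,
\[
K^\star(\boldsymbol{y}) = \Big(\sum_{j \in J(1)} M_{j,1}^\star(\boldsymbol{y}^{(j)}(1)), \dots, \sum_{j \in J(n)} M_{j,n}^\star(\boldsymbol{y}^{(j)}(n))\Big) .
\]
In particular, for any $y \in \mathcal{Y}$,
\[
K^\star(1_{m_1} \otimes y^{(1)}, \dots, 1_{m_p} \otimes y^{(p)}) = M^\star y .
\]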

The following example shows how we are going to use the concept of duplication.

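Example 3 (Total variation). Let us consider $\mathcal{X} = \mathbb{R}^{n_1 \times n_2 \times n_3}$, $\mathcal{Y} = \mathbb{R}^{3 \times n_1 \times n_2 \times n_3}$ and the total variation regularizer defined as $h \circ M$ where
\[
h(y) = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} \sqrt{\sum_{j=1}^{3} y^2_{j,i_1,i_2,i_3}} = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} h_{i_1,i_2,i_3}(y_{:,i_1,i_2,i_3})
\]
and $M$ is defined by blocks of the type
\[
M_{(i_1,i_2,i_3)} =
\begin{array}{c|cccc}
 & x_{i_1,i_2,i_3} & x_{i_1+1,i_2,i_3} & x_{i_1,i_2+1,i_3} & x_{i_1,i_2,i_3+1} \\ \hline
y_{1,i_1,i_2,i_3} & -1 & 1 & 0 & 0 \\
y_{2,i_1,i_2,i_3} & -1 & 0 & 1 & 0 \\
y_{3,i_1,i_2,i_3} & -1 & 0 & 0 & 1
\end{array}
\]
Each line has two nonzero elements so we duplicate dual variables as
\[
K_{(i_1,i_2,i_3)} =
\begin{array}{c|cccc}
 & x_{i_1,i_2,i_3} & x_{i_1+1,i_2,i_3} & x_{i_1,i_2+1,i_3} & x_{i_1,i_2,i_3+1} \\ \hline
\boldsymbol{y}_{1,i_1,i_2,i_3}(1) & -1 & 0 & 0 & 0 \\
\boldsymbol{y}_{1,i_1,i_2,i_3}(2) & 0 & 1 & 0 & 0 \\
\boldsymbol{y}_{2,i_1,i_2,i_3}(1) & -1 & 0 & 0 & 0 \\
\boldsymbol{y}_{2,i_1,i_2,i_3}(2) & 0 & 0 & 1 & 0 \\
\boldsymbol{y}_{3,i_1,i_2,i_3}(1) & -1 & 0 & 0 & 0 \\
\boldsymbol{y}_{3,i_1,i_2,i_3}(2) & 0 & 0 & 0 & 1
\end{array}
\]
Hence, we can write
\[
\bar h_{i_1,i_2,i_3}(\boldsymbol{y}_{i_1,i_2,i_3,:}) = \sqrt{\sum_{j=1}^{3} \big(\boldsymbol{y}_{i_1,i_2,i_3,j}(1) + \boldsymbol{y}_{i_1,i_2,i_3,j}(2)\big)^2}
\]
and $\mathrm{prox}_{m\sigma, \bar h^\star}(\boldsymbol{y}) = (1_{m_1} \otimes \mathrm{prox}^{(1)}_{\sigma, h^\star}(S(\boldsymbol{y})), \dots, 1_{m_p} \otimes \mathrm{prox}^{(p)}_{\sigma, h^\star}(S(\boldsymbol{y})))$ becomes, denoting $e_l$ the $l$th coordinate vector,
\[
\mathrm{prox}_{2\sigma, \bar h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}) =
\begin{pmatrix}
e_1^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2)) \\
e_1^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2)) \\
e_2^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2)) \\
e_2^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2)) \\
e_3^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2)) \\
e_3^\top \mathrm{prox}_{\sigma, h^\star_{i_1,i_2,i_3}}(\boldsymbol{y}_{i_1,i_2,i_3,:}(1) + \boldsymbol{y}_{i_1,i_2,i_3,:}(2))
\end{pmatrix}
\]
Suppose we would like to update $x_{3,4,5}$:

• The dual variables corresponding to $x_{3,4,5}$ are $\boldsymbol{y}_{3,4,5,1}(1)$, $\boldsymbol{y}_{3,4,5,2}(1)$, $\boldsymbol{y}_{3,4,5,3}(1)$, $\boldsymbol{y}_{2,4,5,1}(2)$, $\boldsymbol{y}_{3,3,5,2}(2)$ and $\boldsymbol{y}_{3,4,4,3}(2)$.

• We compute $\mathrm{prox}_{2\sigma, \bar h^\star_{3,4,5}}(\boldsymbol{y}_{3,4,5,:})$, $\mathrm{prox}_{2\sigma, \bar h^\star_{2,4,5}}(\boldsymbol{y}_{2,4,5,:})$, $\mathrm{prox}_{2\sigma, \bar h^\star_{3,3,5}}(\boldsymbol{y}_{3,3,5,:})$ and $\mathrm{prox}_{2\sigma, \bar h^\star_{3,4,4}}(\boldsymbol{y}_{3,4,4,:})$, which amounts to 12 real numbers.

• We update only the 6 useful dual values.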
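The bookkeeping of Example 3 can be sketched as a small index computation. The function below is a hypothetical helper, not part of the paper: it lists, for an interior voxel and with the 1-based indexing of the example, the six duplicated dual variables touched by a primal update, each encoded as a tuple $(i_1, i_2, i_3, j, \text{copy})$. Boundary voxels are deliberately ignored in this sketch.

```python
def dual_duplicates(i1, i2, i3):
    """Labels (i1, i2, i3, j, copy) of the dual duplicates coupled with x[i1, i2, i3].

    Copy 1 carries the -1 entry of the voxel's own block; copy 2 carries the +1
    entry of the neighbouring block located one step back along direction j.
    Interior voxel assumed (no boundary handling).
    """
    voxel = (i1, i2, i3)
    own = [voxel + (j, 1) for j in (1, 2, 3)]
    neighbours = []
    for j in (1, 2, 3):
        previous = list(voxel)
        previous[j - 1] -= 1
        neighbours.append(tuple(previous) + (j, 2))
    return own + neighbours

touched = dual_duplicates(3, 4, 5)
blocks_to_prox = sorted({label[:3] for label in touched})   # blocks whose prox is needed
```

For $x_{3,4,5}$ this gives four blocks, hence four proximal computations of three components each, of which only six entries are actually used.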

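We are now in a position to simplify the iterations (28)–(31). For every $k$, we define the vectors $\bar y_{k+1} = \mathrm{prox}_{\sigma, h^\star}\big(S(\boldsymbol{y}_k + D(\tilde\sigma) K x_k)\big)$ and $\bar{\boldsymbol{y}}_{k+1} = (1_{m_1} \otimes \bar y^{(1)}_{k+1}, \dots, 1_{m_p} \otimes \bar y^{(p)}_{k+1})$. Upon noting that $S D(\tilde\sigma) K = D(\sigma) D(m) S K = D(\sigma) M$, we obtain
\[
\bar y_{k+1} = \mathrm{prox}_{\sigma, h^\star}(z_k + D(\sigma) M x_k) \tag{33}
\]
where we defined $z_k = S(\boldsymbol{y}_k)$; otherwise stated, for each $j \in \{1, \dots, p\}$,
\[
z^{(j)}_k = \frac{1}{m_j} \sum_{i \in I(j)} \boldsymbol{y}^{(j)}_k(i) .
\]
Note that $z_{k+1}$ differs from $z_k$ only along the components $j$ for which $\boldsymbol{y}^{(j)}_{k+1}(i)$ differs from $\boldsymbol{y}^{(j)}_k(i)$ for some $i$. That is, $z^{(j)}_{k+1} = z^{(j)}_k$ for each $j$ such that $(i, j) \notin \mathcal{J}(i_{k+1})$ for all $i$, while for any $j$ such that there exists $i$ such that $(i, j) \in \mathcal{J}(i_{k+1})$,
\[
z^{(j)}_{k+1} = z^{(j)}_k + \frac{1}{m_j} \sum_{i : (i,j) \in \mathcal{J}(i_{k+1})} \big(\boldsymbol{y}^{(j)}_{k+1}(i) - \boldsymbol{y}^{(j)}_k(i)\big) . \tag{34}
\]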

Now consider equation (29). By Lemma 6, $K^\star \bar{\boldsymbol{y}}_{k+1} = M^\star \bar y_{k+1}$. Thus, setting $w_k = K^\star \boldsymbol{y}_k$, equation (29) simplifies to:
\[
\bar x_{k+1} = \mathrm{prox}_{\tau, g}\Big(x_k - D(\tau)\big(\nabla f(x_k) + 2M^\star \bar y_{k+1} - w_k\big)\Big) . \tag{35}
\]
By Lemma 6 again, $w_k = \big(\sum_{j \in J(1)} M_{j,1}^\star \boldsymbol{y}^{(j)}_k(1), \dots, \sum_{j \in J(n)} M_{j,n}^\star \boldsymbol{y}^{(j)}_k(n)\big)$. Therefore, $w_{k+1}$ only differs from $w_k$ along the coordinates $i$ such that there exists $(i, j) \in \mathcal{J}(i_{k+1})$ and the update reads:
\[
w^{(i)}_{k+1} = w^{(i)}_k + \sum_{(i,j) \in \mathcal{J}(i_{k+1})} M_{j,i}^\star \big(\boldsymbol{y}^{(j)}_{k+1}(i) - \boldsymbol{y}^{(j)}_k(i)\big) . \tag{36}
\]
Putting all pieces together, the update equations (33)–(36) coincide with Algorithm 2. We have thus proved that Algorithm 2 is such that $(x_k, \boldsymbol{y}_k)$ converges to a primal-dual point of Problem (27) provided that such a point exists. To complete the proof, the final step is to relate the primal-dual solutions of Problem (27) to the primal-dual solutions of the initial Problem (1).

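Consider the mapping $G : \mathcal{X} \times \mathcal{Y} \to \mathcal{X} \times \boldsymbol{\mathcal{Y}}$ defined by
\[
G(x, y) := \big(x, (1_{m_1} \otimes y^{(1)}, \dots, 1_{m_p} \otimes y^{(p)})\big) .
\]

Lemma 7. $\boldsymbol{\mathcal{S}} = G(\mathcal{S})$.

Proof. Let $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and set $\boldsymbol{y} = (1_{m_1} \otimes y^{(1)}, \dots, 1_{m_p} \otimes y^{(p)})$. Then $M^\star y = K^\star \boldsymbol{y}$, therefore
\[
0 \in \nabla f(x) + \partial g(x) + K^\star \boldsymbol{y} \ \Leftrightarrow\ 0 \in \nabla f(x) + \partial g(x) + M^\star y .
\]
Moreover,
\[
0 \in -Kx + \partial \bar h^\star(\boldsymbol{y}) \ \Leftrightarrow\ Kx \in \partial \bar h^\star(\boldsymbol{y}) \ \Leftrightarrow\ D(m) S(Kx) \in \partial h^\star(y) \ \Leftrightarrow\ Mx \in \partial h^\star(y)
\]
where we used Lemma 5 along with the identities $D(m) S K = M$ and $S(\boldsymbol{y}) = y$. The proof is completed upon noting that if $(x, \boldsymbol{y}) \in \boldsymbol{\mathcal{S}}$, then there exists $y \in \mathcal{Y}$ such that $\boldsymbol{y}$ has the form $\boldsymbol{y} = (1_{m_1} \otimes y^{(1)}, \dots, 1_{m_p} \otimes y^{(p)})$.

We have shown that, almost surely, $(x_k, \boldsymbol{y}_k)$ converges to some point in $G(\mathcal{S})$. This completes the proof of Theorem 2.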

4 Convergence rate

In this section, we are interested in the rate of convergence of the method. We consider three cases:

• $h$ is Lipschitz continuous: we prove a $O(1/\sqrt{k})$ decrease for the function value (Theorem 3).

• $h = I_{\{b\}}$, i.e. $h(y) = 0$ if $y = b$ and $h(y) = +\infty$ otherwise. This corresponds to an optimization problem under the affine constraints $Mx = b$. We prove a $O(1/\sqrt{k})$ decrease for the function value and the feasibility (Theorem 3).

• $f + g$ is strongly convex and $\nabla h$ is Lipschitz continuous: we prove a $O(e^{-\mu k})$ rate for the distance to the optimum (Theorem 4).

These convergence guarantees are of the same order as what can be obtained by other primal-dual methods like the ADMM [18], i.e. $O(1/\sqrt{k})$ in general and linear rate of convergence under strong convexity

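Theorem 3. Define for $\alpha \ge 1$,
\begin{align*}
C_{1,\alpha} &= \max_{1 \le i \le n} \frac{\tau_i^{-1} + \tau_i^{-1/2}\rho\big(\sum_{j \in J(i)} m_j \sigma_j M_{j,i}^\star M_{j,i}\big)^{1/2}}{\tau_i^{-1} - \rho\big(\sum_{j \in J(i)} m_j \sigma_j M_{j,i}^\star M_{j,i}\big)} \Big(1 + \frac{n}{\alpha}\Big) \\
C_{2,\alpha} &= 1 + \max_{1 \le i \le n} \frac{\alpha^{-1}(n(n-1) + 1) + 1}{\tau_i^{-1} - \beta_i - \rho\big(\sum_{j \in J(i)} (2 - \pi_j(i)) m_j \sigma_j M_{j,i}^\star M_{j,i}\big)}\, \beta_i .
\end{align*}
We have that $C_{1,\alpha}$ and $C_{2,\alpha}$ are nonincreasing with respect to $\alpha$, and thus bounded.

Define the number of iterations $\hat K \in \{1, \dots, k\}$ as a random variable, independent of $\{i_1, \dots, i_k\}$ and such that $\Pr(\hat K = l) = \frac1k$ for all $l \in \{1, \dots, k\}$.

If $h$ is $L(h)$-Lipschitz in the norm $\|\cdot\|_{D(m)\sigma}$, then for all $k \ge 0$,
\[
\mathbb{E}\big(f(\bar x_{\hat K}) + g(\bar x_{\hat K}) + h(M \bar x_{\hat K}) - f(x_*) - g(x_*) - h(M x_*)\big) \le \frac{C_{2,\sqrt{k}} + 2C_{1,k}}{\sqrt{k}}\, n\big(S_{0,*} + V(z_0 - z_*)\big) + \frac{4}{\sqrt{k}} L(h)^2
\]
where $V$ is defined in (8) and $S_{0,*}$ is defined in (18).

If $h = I_{\{b\}}$, then for all $k \ge 0$,
\begin{align*}
\mathbb{E}\big(f(\bar x_{\hat K}) + g(\bar x_{\hat K}) - f(x_*) - g(x_*)\big) &\le \frac{C_{2,\sqrt{k}} + 2C_{1,k}}{\sqrt{k}}\, n\big(S_{0,*} + V(z_0 - z_*)\big) + \|y_*\|\, \mathbb{E}\big(\|M \bar x_{\hat K} - b\|\big) \\
\mathbb{E}\big(\|M \bar x_{\hat K} - b\|_{D(m)\sigma}\big) &\le \frac{2}{\sqrt{k}} \Big(\sqrt{C_{2,\sqrt{k}} + 2C_{1,k}} + \sqrt{2C_{1,k}}\Big) \Big(n\big(S_{0,*} + V(z_0 - z_*)\big)\Big)^{1/2} .
\end{align*}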

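Proof. We begin with the proof for Algorithm 3, that is the case $m_1 = \cdots = m_p = 1$.

We combine the following inequalities proved in the previous sections and that are valid for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
\begin{align*}
g(\bar x_{k+1}) + \langle \bar x_{k+1}, \nabla f(x_k) + M^\star(2\bar y_{k+1} - y_k)\rangle + \tfrac12 \|\bar x_{k+1} - x_k\|^2_{\tau^{-1}} &\overset{(16)}{\le} g(x) + \langle x, \nabla f(x_k) + M^\star(2\bar y_{k+1} - y_k)\rangle + \tfrac12 \|x - x_k\|^2_{\tau^{-1}} - \tfrac12 \|\bar x_{k+1} - x\|^2_{\tau^{-1}} \\
h^\star(\bar y_{k+1}) - \langle \bar y_{k+1}, M x_k\rangle + \tfrac12 \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}} &\overset{(13)}{\le} h^\star(y) - \langle y, M x_k\rangle + \tfrac12 \|y - y_k\|^2_{\sigma^{-1}} - \tfrac12 \|\bar y_{k+1} - y\|^2_{\sigma^{-1}} \\
\mathbb{E}_k(f(x_{k+1})) &\overset{(23)+\text{Lem. 2}}{\le} f(x_k) + \tfrac1n \langle \nabla f(x_k), \bar x_{k+1} - x_k\rangle + \tfrac{1}{2n} \|\bar x_{k+1} - x_k\|^2_{\beta} \\
f(x) &\ge f(x_k) + \langle \nabla f(x_k), x - x_k\rangle
\end{align*}
We obtain that for all $z \in \mathcal{X} \times \mathcal{Y}$ such that $z$ is measurable with respect to $\mathcal{F}_k$,
\begin{align*}
&g(\bar x_{k+1}) + n\mathbb{E}_k(f(x_{k+1})) - (n - 1) f(x_k) + \langle M \bar x_{k+1}, y\rangle - h^\star(y) + h^\star(\bar y_{k+1}) - \langle M^\top \bar y_{k+1}, x\rangle - g(x) - f(x) \\
&\quad \le V(z_k - z) - V(\bar z_{k+1} - z) - V(\bar z_{k+1} - z_k) + \tfrac12 \|\bar x_{k+1} - x_k\|^2_{\beta} .
\end{align*}
As $\nabla f$ is $n$-Lipschitz in the norm $\|\cdot\|_{\beta}$ [40] and $n\mathbb{E}_k(x_{k+1}) - (n - 1)x_k - \bar x_{k+1} = 0$,
\begin{align*}
n\mathbb{E}_k(f(x_{k+1})) - (n - 1) f(x_k) &\ge n\mathbb{E}_k\big[f(\bar x_{k+1}) + \langle \nabla f(\bar x_{k+1}), x_{k+1} - \bar x_{k+1}\rangle\big] \\
&\quad - (n - 1)\big[f(\bar x_{k+1}) + \langle \nabla f(\bar x_{k+1}), x_k - \bar x_{k+1}\rangle + \tfrac{n}{2} \|x_k - \bar x_{k+1}\|^2_{\beta}\big] \\
&\ge f(\bar x_{k+1}) - \frac{n(n - 1)}{2} \|x_k - \bar x_{k+1}\|^2_{\beta} .
\end{align*}
We also have for all $\alpha > 0$,
\begin{align*}
V(z_k - z) - V(\bar z_{k+1} - z) - V(\bar z_{k+1} - z_k) &= \langle z_k - \bar z_{k+1}, \bar z_{k+1} - z\rangle_V \\
&\le 2 V(z_k - \bar z_{k+1})^{1/2} V(\bar z_{k+1} - z)^{1/2} \le \alpha V(z_k - \bar z_{k+1}) + \frac{1}{\alpha} V(\bar z_{k+1} - z) .
\end{align*}
Gathering everything, we get
\begin{align*}
&g(\bar x_{k+1}) + f(\bar x_{k+1}) + \langle M \bar x_{k+1}, y\rangle - h^\star(y) + h^\star(\bar y_{k+1}) - \langle M^\top \bar y_{k+1}, x\rangle - g(x) - f(x) - \frac{1}{\alpha} V(\bar z_{k+1} - z) \\
&\quad \le \alpha V(z_k - \bar z_{k+1}) + \frac{n(n - 1) + 1}{2} \|\bar x_{k+1} - x_k\|^2_{\beta} .
\end{align*}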

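We can show by tedious but straightforward algebra that the norms $V^{1/2}$, $\tilde V^{1/2}$ and $\big(\tfrac12(\|x\|^2_{\tau^{-1}} + \|y\|^2_{\sigma^{-1}})\big)^{1/2}$ are equivalent with constants given by
\begin{align*}
V(z) &\le \max_{1 \le i \le n} \Big(1 + \sqrt{\tau_i\, \rho\big(\textstyle\sum_{j \in J(i)} \sigma_j M_{j,i}^\star M_{j,i}\big)}\Big)\, \tfrac12\big(\|x\|^2_{\tau^{-1}} + \|y\|^2_{\sigma^{-1}}\big) \le 2 \times \tfrac12\big(\|x\|^2_{\tau^{-1}} + \|y\|^2_{\sigma^{-1}}\big) \\
\tfrac12\big(\|x\|^2_{\tau^{-1}} + \|y\|^2_{\sigma^{-1}}\big) &\le \max_{1 \le i \le n} \frac{\tau_i^{-1} + \tau_i^{-1/2}\rho\big(\sum_{j \in J(i)} \sigma_j M_{j,i}^\star M_{j,i}\big)^{1/2}}{\tau_i^{-1} - \rho\big(\sum_{j \in J(i)} \sigma_j M_{j,i}^\star M_{j,i}\big)}\, V(z) = C_{1,\infty} V(z) \\
\alpha V(z) + \frac{n(n - 1) + 1}{2} \|\bar x_{k+1} - x_k\|^2_{\beta} &\le \Big(\alpha + \max_{1 \le i \le n} \frac{n(n - 1) + 1 + \alpha}{\tau_i^{-1} - \beta_i - \rho\big(\sum_{j \in J(i)} (2 - \pi_j)\sigma_j M_{j,i}^\star M_{j,i}\big)}\, \beta_i\Big) \tilde V(z) = \alpha C_{2,\alpha} \tilde V(z)
\end{align*}
where $C_{2,\alpha} \in O(1)$ for $\alpha \to \infty$. Denoting the smoothed gap [46] as
\[
G_{2/\alpha}(\bar z_k, \bar z_k) = \sup_z\; g(\bar x_k) + f(\bar x_k) + \langle M \bar x_k, y\rangle - h^\star(y) + h^\star(\bar y_k) - \langle M^\top \bar y_k, x\rangle - g(x) - f(x) - \frac{2}{2\alpha} \|\bar x_k - x\|^2_{\tau^{-1}} - \frac{2}{2\alpha} \|\bar y_k - y\|^2_{\sigma^{-1}} ,
\]
we have
\[
G_{2/\alpha}(\bar z_k, \bar z_k) \le \alpha C_{2,\alpha} \tilde V(\bar z_k - z_{k-1}) .
\]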

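Now, by (25) and the fact that $\hat K$ is independent of the coordinate selection process,
\[
\mathbb{E}\big(\tilde V(\bar z_{\hat K} - z_{\hat K - 1})\big) \le \sum_{i=1}^{k} \frac1k \mathbb{E}\big(\tilde V(\bar z_i - z_{i-1})\big) \overset{(25)}{\le} \frac{n}{k}\big(S_{0,*} + V(z_0 - z_*)\big)
\]
so
\[
\mathbb{E}\big(G_{2/\alpha}(\bar z_{\hat K}, \bar z_{\hat K})\big) \le \frac{\alpha C_{2,\alpha}}{k}\, n\big(S_{0,*} + V(z_0 - z_*)\big) .
\]
Taking $\alpha = \sqrt{k}$ as in [18], we get
\[
\mathbb{E}\big(G_{2/\sqrt{k}}(\bar z_{\hat K}, \bar z_{\hat K})\big) \le \frac{C_{2,\sqrt{k}}}{\sqrt{k}}\, n\big(S_{0,*} + V(z_0 - z_*)\big) .
\]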

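We can also bound
\begin{align*}
\tfrac12 \mathbb{E}\big(\|\bar x_{\hat K} - x_*\|^2_{\tau^{-1}}\big) &\le C_{1,\infty} \mathbb{E}\big(V(\bar z_{\hat K} - z_*)\big) \\
&\overset{(20)+(21)}{=} C_{1,\infty} \mathbb{E}\big(nV(z_{\hat K} - z_*) - nV(z_{\hat K - 1} - z_*) + V(z_{\hat K - 1} - z_*) + R^{(\hat K)}_\pi\big) \\
&= \frac{C_{1,\infty}}{k} \sum_{i=1}^{k} \mathbb{E}\big(nV(z_i - z_*) - nV(z_{i-1} - z_*) + V(z_{i-1} - z_*) + R^{(i)}_\pi\big) \\
&= \frac{C_{1,\infty}}{k} \mathbb{E}\Big(nV(z_k - z_*) - nV(z_0 - z_*) + \sum_{i=1}^{k} \big(V(z_{i-1} - z_*) + R^{(i)}_\pi\big)\Big) \\
&\overset{(19)}{\le} C_{1,\infty} \frac{n + k}{k}\big(S_{0,*} + V(z_0 - z_*)\big) + \frac{C_{1,\infty}}{k} \sum_{i=1}^{k} \big(R^{(i)}_\pi - \tilde V(\bar z_i - z_{i-1})\big) \\
&\le C_{1,k}\big(S_{0,*} + V(z_0 - z_*)\big)
\end{align*}
where the last inequality follows from $R^{(i)}_\pi - \tilde V(\bar z_i - z_{i-1}) = \tfrac12 \|\bar x_i - x_{i-1}\|^2_{\beta} - V(\bar z_i - z_{i-1}) \le 0$.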

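If $h$ is $L(h)$-Lipschitz in the norm $\|\cdot\|_{\sigma}$, we can choose $y \in \partial h(M \bar x_k) \neq \emptyset$ so that $\langle M \bar x_k, y\rangle - h^\star(y) = h(M \bar x_k)$, and $x = x_*$ so that $h^\star(\bar y_k) - \langle M^\top \bar y_k, x_*\rangle \ge -h(M x_*)$.

We then use the inequality
\[
G_{2/\sqrt{k}}(\bar z_{\hat K}, \bar z_{\hat K}) \ge f(\bar x_{\hat K}) + g(\bar x_{\hat K}) + h(M \bar x_{\hat K}) - \frac{4}{\sqrt{k}} L(h)^2 - f(x_*) - g(x_*) - h(M x_*) - \frac{1}{\sqrt{k}} \|\bar x_{\hat K} - x_*\|^2_{\tau^{-1}}
\]
to conclude.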

If h = I{b}, then using Lemma 1 in [46], we get that

E(f (¯xKˆ) + g(¯xKˆ) − f (x∗) − g(x∗)) ≤ C_2,√ k √ k n(S0,∗+ V (z0− z∗))+ E(√1 k ¯x_Kˆ− x∗ 2 τ−1+ 1 √ k y¯_Kˆ − y∗ 2 σ−1− hy∗, M ¯xKˆ − bi) E( M ¯xKˆ − b _σ) ≤ 2 √ k E( ¯yKˆ − y? _σ−1) + E( ¯yKˆ− y? 2 σ−1) + 2 2/√k C_2,√ k √ k n(S0,∗+ V (z0− z∗)) + E( x¯Kˆ − x? 2 τ−1) 1/2

To obtain the result for Algorithm 2 we only need to remark that when we need to duplicate dual variables we have $\bar h^\star(\bar{\boldsymbol{y}}_k) = h^\star(\bar y_k)$. One then just needs to replace $\sigma_j$ by $m_j \sigma_j$ in the conditions.

Remark 1. To prove the result of Theorem 3, we use a random number of iterations. This has also been proposed for instance in [43] for the stochastic dual coordinate ascent algorithm. Note that the number of iterations can be sampled beforehand, which means that the procedure comes with no computational cost.

When $\hat K$ iterations have taken place, one just needs to compute $\bar x_{\hat K + 1}$ once in order to obtain the guarantee.
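The random stopping rule of Remark 1 can be sketched as follows. This is a schematic illustration only: `step` stands for one iteration of the coordinate descent method and is a hypothetical callable supplied by the user, and two separate generators are used so that the draw of $\hat K$ does not interact with the coordinate selection.

```python
import random

def run_with_random_stop(step, z0, k, seed_stop=0, seed_coords=1):
    """Draw K_hat uniformly in {1, ..., k} beforehand, then run K_hat iterations.

    `step(z, rng)` is a user-supplied (hypothetical) function performing one
    randomized coordinate update and returning the new iterate.
    """
    k_hat = random.Random(seed_stop).randint(1, k)   # sampled before iterating
    coord_rng = random.Random(seed_coords)           # independent stream for i_1, i_2, ...
    z = z0
    for _ in range(k_hat):
        z = step(z, coord_rng)
    return k_hat, z

# Dummy step that only counts iterations, to show the control flow.
k_hat, final_state = run_with_random_stop(lambda z, rng: z + 1, 0, k=50)
```

After the loop, a single full primal-dual step from the returned iterate would give the averaged point to which the guarantee applies.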

We also have a fast rate if the problem has particular properties. We prove that if the Lagrangian function satisfies a strong convexity and strong concavity assumption, then Algorithm 2 converges exponentially fast with a rate that depends on the step size.

Assumption 4.1. There exist non-negative constants $\mu_g$ and $\mu_f$ such that $\mu_f + \mu_g > 0$ and a constant $\mu_{h^\star} > 0$ such that $g$ is $\mu_g$-strongly convex in the norm $\|\cdot\|_{\tau^{-1}}$, $f$ is $\mu_f$-strongly convex in the norm $\|\cdot\|_{\tau^{-1}}$ and $h^\star$ is $\mu_{h^\star}$-strongly convex

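Theorem 4. For $z = (x, y)$, denote $V_\mu(z) = V(z) + \mu_g \|x\|^2_{\tau^{-1}} + \mu'_{h^\star} \|y\|^2_{(D(m)\sigma)^{-1}}$ where
\[
\mu'_{h^\star} = \min\Big(\mu_{h^\star},\ \sup\Big\{\mu > 0 : \forall i,\ \tau_i^{-1} > \beta_i + \rho\Big(\sum_{j \in J(i)} \frac{(2 - \pi_j(i))^2 \sigma_j m_j}{2 - \pi_j(i) - \mu(1 - \pi_j(i))} M_{j,i}^\star M_{j,i}\Big)\Big\}\Big)
\]
(note that if $\pi_j(i) = 1$ for all $i$ and $j$, then $\mu'_{h^\star} = \mu_{h^\star}$). If Assumption 4.1 holds then the iterates of Algorithm 2 satisfy
\[
\mathbb{E}\big[S_{k,*} + V_\mu(z_k - z_*)\big] \le \Big(1 - \frac1n \frac{(\mu_f + 2\mu_g)\mu'_{h^\star}}{\mu_f + 2\mu_g + \mu'_{h^\star}}\Big)^k \big[S_{0,*} + V_\mu(z_0 - z_*)\big] .
\]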

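In order to prove this theorem, we begin with a lemma that generalizes Lemma 1.

Lemma 8. If Assumption 4.1 holds, then
\[
\langle \nabla f(x_*) - \nabla f(x), x_* - \bar x\rangle + V(\bar z - z) \le V(z - z_*) - V(\bar z - z_*) - \mu_g \|\bar x - x_*\|^2_{\tau^{-1}} - \mu_{h^\star} \|\bar y - y_*\|^2_{\sigma^{-1}} .
\]

Proof. Assumption 4.1 gives us: for $(x_*, y_*) \in \mathcal{S}$,
\begin{align*}
g(x) &\ge g(x_*) + \langle \nabla f(x_*) + M^\star y_*, x_* - x\rangle + \frac{\mu_g}{2} \|x - x_*\|^2_{\tau^{-1}} , \tag{37} \\
h^\star(y) &\ge h^\star(y_*) + \langle M x_*, y - y_*\rangle + \frac{\mu_{h^\star}}{2} \|y - y_*\|^2_{\sigma^{-1}} . \tag{38}
\end{align*}
With the same argument as in (14), we have
\[
h^\star(\bar y) \le h^\star(y_*) + \langle \bar y - y_*, M x\rangle + \tfrac12 \|y_* - y\|^2_{\sigma^{-1}} - \frac{1 + \mu_{h^\star}}{2} \|\bar y - y_*\|^2_{\sigma^{-1}} - \tfrac12 \|\bar y - y\|^2_{\sigma^{-1}}
\]
and so using (38)
\[
\langle M(x_* - x), \bar y - y_*\rangle \le \tfrac12 \|y - y_*\|^2_{\sigma^{-1}} - \frac{1 + 2\mu_{h^\star}}{2} \|\bar y - y_*\|^2_{\sigma^{-1}} - \tfrac12 \|\bar y - y\|^2_{\sigma^{-1}} . \tag{39}
\]
Similarly, we have
\[
\langle \nabla f(x_*) - \nabla f(x), x_* - \bar x\rangle + \frac{1 + 2\mu_g}{2} \|\bar x - x_*\|^2_{\tau^{-1}} - \tfrac12 \|x - x_*\|^2_{\tau^{-1}} + \tfrac12 \|\bar x - x\|^2_{\tau^{-1}} \le \langle 2\bar y - y - y_*, M(x_* - \bar x)\rangle .
\]
Summing the above inequality with (39), and recalling the definition of $V(z) = V(x, y) = \tfrac12 \|x\|^2_{\tau^{-1}} + \langle y, M x\rangle + \tfrac12 \|y\|^2_{\sigma^{-1}}$, we get
\[
\langle \nabla f(x_*) - \nabla f(x), x_* - \bar x\rangle + V(\bar z - z) \le V(z - z_*) - V(\bar z - z_*) - \mu_g \|\bar x - x_*\|^2_{\tau^{-1}} - \mu_{h^\star} \|\bar y - y_*\|^2_{\sigma^{-1}} .
\]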

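Proof of Theorem 4. We begin with the case $m_1 = \cdots = m_p = 1$.

By Assumption 2.1(e), if $\mu_{h^\star} > 0$, then $\mu'_{h^\star} > 0$ and if $h^\star$ is $\mu_{h^\star}$-strongly convex, it is also $\mu'_{h^\star}$-strongly convex. Then, by a straightforward adaptation of the proof of Lemma 3 to the strongly convex case, we have
\begin{align*}
&\mathbb{E}_k\Big[S_{k+1,*} + V(z_{k+1} - z_*) + \frac{2\mu_g}{2} \|x_{k+1} - x_*\|^2_{\tau^{-1}} + \frac{2\mu'_{h^\star}}{2} \|y_{k+1} - y_*\|^2_{\sigma^{-1}}\Big] \\
&\quad \le (1 - \tfrac1n) S_{k,*} + V(z_k - z_*) + \frac{2(n - 1)\mu_g - \mu_f}{2n} \|x_k - x_*\|^2_{\tau^{-1}} + \frac{2(n - 1)\mu'_{h^\star}}{2n} \|y_k - y_*\|^2_{\sigma^{-1}} \\
&\qquad - \tfrac1n \tilde V(\bar z_{k+1} - z_k) + \frac{\mu'_{h^\star}}{n} \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}(1 - \pi)} .
\end{align*}
As soon as $\tau_i^{-1} > \beta_i + \rho\Big(\sum_{j \in J(i)} \frac{(2 - \pi_j)^2}{2 - \pi_j - \mu'_{h^\star}(1 - \pi_j)} \sigma_j M_{j,i}^\star M_{j,i}\Big)$, we can remove the term $-\tfrac1n \tilde V(\bar z_{k+1} - z_k) + \frac{\mu'_{h^\star}}{n} \|\bar y_{k+1} - y_k\|^2_{\sigma^{-1}(1 - \pi)}$.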

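In order to prove a linear convergence rate $(1 - \eta)$, it suffices to prove that $(1 - \tfrac1n) \le (1 - \eta)$ and that with respect to the order of semi-definite matrices,
\[
\begin{pmatrix} \tau^{-1}\big(1 + \frac{2(n-1)\mu_g - \mu_f}{n}\big) & M^\star \\ M & \sigma^{-1}\big(1 + \frac{2(n-1)\mu'_{h^\star}}{n}\big) \end{pmatrix} \preceq (1 - \eta) \begin{pmatrix} \tau^{-1}(1 + 2\mu_g) & M^\star \\ M & \sigma^{-1}(1 + 2\mu'_{h^\star}) \end{pmatrix} .
\]
Using the fact that $M$ is block-diagonal, this gives for all $i$ the conditions
\begin{align*}
1 + \frac{2(n - 1)\mu_g - \mu_f}{n} &\le (1 - \eta)(1 + 2\mu_g) \\
1 + \frac{2(n - 1)\mu'_{h^\star}}{n} &\le (1 - \eta)(1 + 2\mu'_{h^\star}) \\
\tau_i^{-1}\Big(-\eta(1 + 2\mu_g) + \frac{\mu_f + 2\mu_g}{n}\Big) &\ge \sum_{j \in J(i)} \frac{\sigma_j}{-\eta(1 + 2\mu'_{h^\star}) + \frac{2\mu'_{h^\star}}{n}}\, \eta^2 M_{j,i}^\star M_{j,i} .
\end{align*}
Using the second condition we can multiply the third one by $-\eta(1 + 2\mu'_{h^\star}) + \frac{2\mu'_{h^\star}}{n} \ge 0$ and we obtain the condition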

condition η2τ_i−1− X j∈J (i) σjMj,i? Mj,i + τ_i−1η2(2µg+ 2µ0h?) − η µf+ 2µg+ 2µ 0 h? n − 4µgµ0h? n − 2µfµ0h?+ 4µgµ0h? n + (µf+ 2µg)µ0h? n2 ≥ 0 .

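The first term is nonnegative thanks to Assumption 2.1(e). The second term is nonnegative as soon as
\[
\eta \le \frac1n \frac{(\mu_f + 2\mu_g)\mu'_{h^\star}}{\mu_f + 2\mu_g + \mu'_{h^\star}} .
\]
To conclude, we remark that
\[
\frac1n \frac{(\mu_f + 2\mu_g)\mu'_{h^\star}}{\mu_f + 2\mu_g + \mu'_{h^\star}} \le \min\Big(\frac1n \frac{\mu_f + 2\mu_g}{1 + 2\mu_g},\ \frac1n \frac{2\mu'_{h^\star}}{1 + 2\mu'_{h^\star}}\Big) \le \frac1n .
\]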

This result also implies the same rate for the iterates of Algorithm 2 because $h^\star$ is $\mu_{h^\star}$-strongly convex in the norm $\|\cdot\|_{\sigma^{-1}}$ if and only if $\bar h^\star$ is $\mu_{h^\star}$-strongly convex in the norm $\|\cdot\|_{\tilde\sigma^{-1}}$.

Remark 2. It is worth noting that the algorithm does not depend on the strong convexity constants, which means that it automatically adapts to local strong convex-concave parameters of the Lagrangian. Moreover as can be seen on Figure 2 we do observe linear convergence in some cases, even when Assumption 4.1 is not satisfied. Thus we think that Theorem 4 can give an indication of how the algorithm behaves in favorable cases.

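Remark 3. Of particular interest is the relation between the rate proved in Theorem 4 and the size of the steps. Having longer step sizes improves the rate greatly since $\mu_f$, $\mu_g$ and $\mu_{h^\star}$, measured in the weighted norm, are “proportional” to the step-sizes: as $\mu_g \|x\|^2_{\tau^{-1}} = (\alpha\mu_g) \|x\|^2_{(\alpha\tau)^{-1}}$ for all $\alpha > 0$, multiplying the step-sizes by $\alpha > 1$ also multiplies $\mu_f$, $\mu_g$ and $\mu_{h^\star}$ by $\alpha$, which leads to an improved rate
\[
1 - \frac1n \frac{(\alpha\mu_f + 2\alpha\mu_g)\alpha\mu_{h^\star}}{\alpha\mu_f + 2\alpha\mu_g + \alpha\mu_{h^\star}} = 1 - \alpha\, \frac1n \frac{(\mu_f + 2\mu_g)\mu_{h^\star}}{\mu_f + 2\mu_g + \mu_{h^\star}} < 1 - \frac1n \frac{(\mu_f + 2\mu_g)\mu_{h^\star}}{\mu_f + 2\mu_g + \mu_{h^\star}} .
\]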

As shown in Section 5.2, in large scale applications one can expect much more than twice larger steps and so we can expect a much faster algorithm by using large steps than by using the steps proposed in [29].
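The scaling argument of Remark 3 can be made concrete with a few lines of arithmetic. The sketch below evaluates the contraction factor of Theorem 4 for illustrative constants, which are hypothetical values and not taken from the paper's experiments, and compares the factor obtained after multiplying the step sizes, and hence the three constants, by $\alpha = 2$. The helper names `linear_rate` and `iterations_to_tolerance` are likewise assumptions of this example.

```python
import math

def linear_rate(mu_f, mu_g, mu_h, n):
    """Contraction factor 1 - (1/n) * (mu_f + 2 mu_g) mu_h / (mu_f + 2 mu_g + mu_h)."""
    a = mu_f + 2.0 * mu_g
    return 1.0 - (a * mu_h) / (n * (a + mu_h))

def iterations_to_tolerance(rate, eps):
    """Smallest k with rate**k <= eps, for 0 < rate < 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(rate))

n = 10
mu_f, mu_g, mu_h = 0.1, 0.2, 0.3          # illustrative constants in the weighted norms
alpha = 2.0                               # step sizes multiplied by alpha

base = linear_rate(mu_f, mu_g, mu_h, n)
scaled = linear_rate(alpha * mu_f, alpha * mu_g, alpha * mu_h, n)
k_base = iterations_to_tolerance(base, 1e-6)
k_scaled = iterations_to_tolerance(scaled, 1e-6)
```

Since the quantity subtracted from one is multiplied by $\alpha$, the predicted number of iterations to reach a given tolerance drops by roughly the same factor when the rate is close to one.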