
Asynchrony and Acceleration in Gossip Algorithms


HAL Id: hal-02989459

https://hal.archives-ouvertes.fr/hal-02989459v2

Preprint submitted on 10 Feb 2021


Asynchrony and Acceleration in Gossip Algorithms

Hadrien Hendrikx, Laurent Massoulié, Mathieu Even

To cite this version:

Hadrien Hendrikx, Laurent Massoulié, Mathieu Even. Asynchrony and Acceleration in Gossip Algorithms. 2021. hal-02989459v2


MATHIEU EVEN, Inria, ENS Paris, PSL Research University, France

HADRIEN HENDRIKX, Inria, ENS Paris, PSL Research University, France

LAURENT MASSOULIÉ, Inria, ENS Paris, PSL Research University, France

This paper considers the minimization of a sum of smooth and strongly convex functions dispatched over the nodes of a communication network. Previous works on the subject either focus on synchronous algorithms, which can be heavily slowed down by a few slow nodes (the straggler problem), or consider a model of asynchronous operation (Boyd et al. [5]) in which adjacent nodes communicate at the instants of Poisson point processes. We have two main contributions. 1) We propose CACDM (a Continuously Accelerated Coordinate Dual Method), and for the Poisson model of asynchronous operation, we prove that CACDM converges to optimality at an accelerated convergence rate in the sense of Nesterov and Stich [33]. In contrast, previously proposed asynchronous algorithms have not been proven to achieve such an accelerated rate. While CACDM is based on discrete updates, the proof of its convergence crucially relies on a continuous-time analysis. 2) We introduce a new communication scheme based on Loss Networks, that is programmable in a fully asynchronous and decentralized way, unlike the Poisson model of asynchronous operation, which does not capture essential aspects of asynchrony such as non-instantaneous communications and computations. Under this Loss-Network model of asynchrony, we establish for CDM (a Coordinate Dual Method) a rate of convergence in terms of the eigengap of the Laplacian of the graph weighted by local effective delays. We believe this eigengap to be a fundamental bottleneck for convergence rates of asynchronous optimization. Finally, we verify empirically that CACDM enjoys an accelerated convergence rate in the Loss-Network model of asynchrony.

Additional Key Words and Phrases: gossip algorithms, loss networks, distributed optimization, asynchrony, acceleration

1 INTRODUCTION

In this paper, we consider minimization of a function $f$ given by a sum of local functions:

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^n f_i(x). \tag{1}$$

A typical example is provided by Empirical Risk Minimization (ERM), in which the local functions $f_i$ correspond to the empirical risk evaluated on subsets of the whole dataset. We further assume that there is an underlying communication network, and that each $f_i$, or gradients thereof, can only be computed at node $i$ of this network. In the case of ERM, $f_i$ represents the empirical risk for the dataset available at node $i$. We aim to solve Problem (1) in a decentralized fashion, where each node can only communicate with its neighbors in the graph.

Another important example is that of network averaging. It corresponds to $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, where $c_i$ is a vector attached to node $i$. The solution of Problem (1) is then provided by $x^* = \frac{1}{n}\sum_{i=1}^n c_i$. Typical decentralized approaches for this problem rely on gossip communications [39] and first-order local gradient steps [6, 32, 38, 41, 43, 46]. Yet, these approaches often rely on global synchronous rounds, in which all nodes exchange with their neighbours at the same time. Such synchronous approaches are well suited to networks with homogeneous communication and computation delays, but the presence of a few slow links or nodes drastically degrades their performance. Our work targets asynchronous distributed algorithms, for which we aim to obtain fast rates of convergence in networks with heterogeneous computation and communication delays, while being competitive with synchronous approaches in homogeneous environments.

Authors' addresses: Mathieu Even, mathieu.even@inria.fr, Inria, ENS Paris, PSL Research University, France; Hadrien Hendrikx, hadrien.hendrikx@inria.fr, Inria, ENS Paris, PSL Research University, France; Laurent Massoulié, laurent.massoulie@inria.fr, Inria, ENS Paris, PSL Research University, France.

1.1 Main Contributions

We consider local operations and communication schemes where each pair of neighboring nodes $(i, j)$ can exchange local variables at activation times of the corresponding edge $(ij)$. We denote by $\mathcal{P}_{ij} \subset \mathbb{R}_+$ the point process of the corresponding activation times. Upon activation of edge $(ij)$, nodes $i$ and $j$ can exchange local variables such as gradients of their local functions, and update their local variables accordingly. We mainly study two models for the point processes $\mathcal{P}_{ij}$: i) the Poisson model of asynchrony popularized by Boyd et al. [5], where the $(\mathcal{P}_{ij})_{(ij) \in E}$ are independent Poisson point processes of rates $p_{ij}$ [17]; we refer to this model as the Poisson point process model (P.p.p. model). ii) A more complex model, inspired by loss networks (Kelly [16]), that we call the Refined Loss Network Model (RLNM), designed to capture essential aspects of asynchronous communications and computations.

1.1.1 Randomized Gossip and P.p.p. model. We extend results obtained by [5] on gossip algorithms for network averaging to more general optimization problems of the form of Problem (1), through a dual formulation. We obtain a convergence rate that depends both on the condition number of the optimization problem and on the Laplacian matrix of the graph, weighted by the rates of the Poisson point processes $\mathcal{P}_{ij}$. The proof relies on a continuous-time analysis, which paves the way for the introduction of an accelerated algorithm, CACDM (Continuously Accelerated Coordinate Dual Method). CACDM can be interpreted as an accelerated coordinate gradient descent on the dual problem involving infinitesimal contractions. Using this interpretation, we prove that CACDM converges at an accelerated rate in the sense of Nesterov and Stich [33]. To the best of our knowledge, this is the first asynchronous algorithm proven to achieve accelerated convergence rates in the P.p.p. model.

1.1.2 Refined Loss Network Model. Though the P.p.p. model is very convenient, it assumes that communications and computations are performed instantaneously. We thus modify the communication scheme to model communications more realistically: busy nodes (i.e. computing or communicating nodes) are made unavailable for other nodes to communicate with. This model is directly inspired by Loss Networks, in which busy nodes are locked away from the network, and which we refine by adding a busy-checking operation. For this communication model, we derive a rate of convergence that depends on the Laplacian matrix of the graph weighted by local communication constraints. We thus recover the robustness to stragglers obtained in the P.p.p. model, but with a theory that is more faithful to the implementation. The construction and analysis of this model enable us to identify key parameters of the communication network that condition achievable convergence rates for realistic asynchronous and distributed operation.

1.2 Related Work

1.2.1 Gossip Algorithms and Asynchrony. In gossip averaging algorithms [5, 10], nodes of the network communicate with their neighbors, without any central coordinator, in order to compute the global average of local vectors. These algorithms are particularly relevant since they can be generalized to address our distributed optimization problem with local functions $f_i$ beyond the special case $f_i(x) = \frac{1}{2}\|x - c_i\|^2$. Two types of gossip algorithms appear in the literature: synchronous ones, where all nodes communicate with each other simultaneously [4, 10, 38], and asynchronous ones, also called randomized gossip [5, 14, 31], where at any given time $t \geq 0$ only a pair of adjacent nodes can communicate. In the synchronous framework, the communication speed is limited by the slowest node (the straggler problem).


Although qualified as asynchronous, the P.p.p. model cannot be implemented in a fully distributed and asynchronous way: it assumes that communications and computations are instantaneous. Two different approaches can be considered to deal with the fact that communications and computations are in fact non-instantaneous: (i) when a node $i$ receives information from a neighbor $j$ at a time $t \geq 0$, account for the fact that this information is delayed, or (ii) forbid communications along a busy (i.e. communicating or computing) edge, thereby removing the need to handle delayed information. The first approach (i) is considered for asynchronous but centralized optimization by [19, 34], where delayed variables are modelled as so-called perturbed iterates. The second approach (ii) is reminiscent of Loss Networks, initially considered for telecommunication networks [16], yet also adequate to reflect primitives in distributed computing such as locks and atomic transactions. In the perturbed-iterate modelling, a central unit delegates computations to workers. Asynchrony lies in the fact that these workers do not wait for the central unit to update their current version of the optimization variable $x$, but instead send gradients $\nabla f_i(x_i)$ whenever they can, even if based on an outdated variable $x_i$. Thus, the parameter of the central unit is updated using perturbed (delayed) gradients [27]. Section 4 focuses on the second modelling: nodes behave as in the P.p.p. model, but are made busy, and hence non-available to other nodes, for a time $\tau_{ij} > 0$ after their activation. The system is asynchronous in the sense that communications are performed in a random pairwise fashion (instead of global synchronous rounds), and nodes do not wait for specific neighbours. Yet, received gradients are never out of date, since nodes always finish their current operation (communicating or computing) before engaging in a new one.

1.2.2 Acceleration in an Asynchronous Setting. Acceleration means gaining orders of magnitude in convergence speed compared to classical algorithms. Accelerating gossip algorithms has been studied in previous works in the synchronous framework: SSDA [38], Chebyshev acceleration [30], Jacobi polynomial acceleration in the first iterations [4]; and in the asynchronous P.p.p. model: Geographic Gossip [9], shift registers [23]. However, no algorithm in the P.p.p. model has been rigorously proven to achieve an accelerated rate for general graphs without additional synchronization between nodes. For instance, inspired by ACDM [33], [14] introduced ESDACD, where at each iteration only a pair of adjacent nodes communicate, but all nodes need to make local contractions and thus need to know that an update is taking place somewhere else in the graph. This last requirement, also present in Stochastic Heavy Ball methods [26], is a departure from purely asynchronous operation, and thus a limitation of these methods. Section 3.3 presents a continuous alternative to ACDM, in which the contractions just mentioned are performed continuously.

Our algorithm ( CACDM , for Continuously Accelerated Coordinate Descent Method) obtains in the P.p.p. model the same accelerated rate as [9, 14, 26] for any graph, without assuming access to any global iteration counter: it only needs local clock-synchronization between adjacent nodes.

Although our analysis of CACDM does not extend to more general communication models such as those presented in Section 4, we observe empirically that CACDM enjoys accelerated rates in the Loss-Network model as well as in the P.p.p. model.

The detailed problem statement and notations are given in Section 2. Section 3 contains our results on asynchronous gossip in the P.p.p. model, first for a non-accelerated algorithm based on simple gradient descent steps, then for the accelerated algorithm CACDM. Section 4 finally presents our results for gossip algorithms in the Refined Loss Network Model.


2 PROBLEM FORMULATION AND NOTATIONS

2.1 Basic assumptions and notations

The communication network is represented by an undirected graph $G = (V, E)$ on the set of nodes $V = [n]$, and is assumed to be connected. Two nodes are said to be neighbors or adjacent in the graph, and we write $i \sim j$, if $(ij) \in E$. Two edges $(ij), (kl) \in E$ are adjacent in the graph if $(ij) = (kl)$ or if they share a node. Each node $i \in V$ has access to a local function $f_i$ defined on $\mathbb{R}^d$, assumed to be $L_i$-smooth and $\sigma_i$-strongly convex [7], i.e. for all $x, y \in \mathbb{R}^d$:

$$f_i(x) \leq f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{L_i}{2}\|x - y\|^2, \qquad f_i(x) \geq f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{\sigma_i}{2}\|x - y\|^2. \tag{2}$$

Let us denote $f(z) = \sum_{i \in [n]} f_i(z)$ for $z \in \mathbb{R}^d$, and $F(x) = \sum_{i \in [n]} f_i(x_i)$ for $x = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}$, where $x_i \in \mathbb{R}^d$ is attached to node $i \in [n]$. Let

$$L_{\max} := \max_i L_i \quad \text{and} \quad \sigma_{\min} := \min_i \sigma_i \tag{3}$$

denote the global complexity numbers. Computing gradients and communicating them between two neighboring nodes $i \sim j$ is assumed to take time $\tau_{ij} > 0$. This constant takes into account both the communication and computation times, and should be understood as an upper bound on the delays between nodes $i$ and $j$.

In this decentralized setting, Problem (1) can be formulated as follows:

$$\min_{x \in \mathbb{R}^{n \times d} :\; x_1 = \dots = x_n} F(x), \tag{4}$$

where $x_1 = \dots = x_n$ enforces consensus on all the nodes. We add the following structural constraints:

(1) Local computations: node $i$ (and node $i$ only) can compute first-order characteristics of $f_i$ such as $\nabla f_i$ or $\nabla f_i^*$;

(2) Local communications: node $i$ can send information only to neighboring nodes $j \sim i$.

These operations may be performed asynchronously and in parallel, and each node possesses a local version $x_i \in \mathbb{R}^d$ of the global parameter $x$. The rate of convergence of our algorithms will be controlled by the smallest positive eigenvalue $\gamma$ of the Laplacian of the graph $G$ [28], weighted by some constants $\nu_{ij}$ that depend on the local communication and computation delays.

Definition 1 (Graph Laplacian). Let $(\nu_{ij})_{(ij) \in E}$ be a set of non-negative real numbers. The Laplacian of the graph $G$ weighted by the $\nu_{ij}$'s is the matrix with $(i, j)$ entry equal to $-\nu_{ij}$ if $(ij) \in E$, to $\sum_{k \sim i} \nu_{ik}$ if $j = i$, and to 0 otherwise. In the sequel, $\nu_{ij}$ always refers to the weights of the Laplacian, and $\gamma(\nu_{ij})$ denotes this Laplacian's second smallest eigenvalue.
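As a concrete illustration of Definition 1, the weighted Laplacian and its second smallest eigenvalue $\gamma(\nu_{ij})$ can be computed numerically. This is a minimal sketch; the graph and weights are illustrative, not taken from the paper:

```python
import numpy as np

def weighted_laplacian(n, edges):
    """Laplacian of a graph on n nodes: entry (i, j) is -nu_ij for an edge
    (i, j), and the diagonal entry (i, i) is sum_{k ~ i} nu_ik."""
    L = np.zeros((n, n))
    for (i, j), nu in edges.items():
        L[i, j] -= nu
        L[j, i] -= nu
        L[i, i] += nu
        L[j, j] += nu
    return L

def gamma(n, edges):
    """Second smallest eigenvalue of the weighted Laplacian."""
    eig = np.sort(np.linalg.eigvalsh(weighted_laplacian(n, edges)))
    return eig[1]

# Path graph 0-1-2 with unit weights: eigenvalues are 0, 1, 3.
edges = {(0, 1): 1.0, (1, 2): 1.0}
print(round(gamma(3, edges), 6))  # → 1.0
```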

For any function $g : \mathbb{R}^p \to \mathbb{R}$, $g^*$ denotes its Fenchel conjugate on $\mathbb{R}^p$, defined as

$$\forall y \in \mathbb{R}^p, \quad g^*(y) = \sup_{x \in \mathbb{R}^p} \langle x, y \rangle - g(x) \in \mathbb{R} \cup \{+\infty\}.$$

Throughout the paper, $\mathcal{F}_t$ for $t \in \mathbb{R}_+$ denotes the filtration of the point processes $\mathcal{P} = \bigcup_{(ij) \in E} \mathcal{P}_{ij}$ up to time $t$. If $t_k$, $k \in \mathbb{N}$ (with $t_0 = 0$), are the successive points of $\mathcal{P}$, we write, when there is no ambiguity, $\mathcal{F}_k = \mathcal{F}_{t_k}$, $k \in \mathbb{N}$.


2.2 Dual Formulation of the Problem

A standard way to deal with the constraint $x_1 = \dots = x_n$ is to use a dual formulation [14, 38, 45], by introducing a dual variable $\lambda$ indexed by the edges. We first introduce a matrix $A \in \mathbb{R}^{n \times E}$ such that $\mathrm{Ker}(A^\top) = \mathrm{Vect}(\mathbb{1})$, where $\mathbb{1}$ is the constant vector $(1, \dots, 1)^\top$ of dimension $n$. $A$ is chosen such that:

$$\forall (ij) \in E, \quad A e_{ij} = \mu_{ij}(e_i - e_j), \tag{5}$$

for some non-null constants $\mu_{ij}$. We define $\mu_{ij} = -\mu_{ji}$ for this writing to be consistent. This matrix $A$ is a square root of the Laplacian of the graph weighted by $\nu_{ij} = \mu_{ij}^2$. The constraint $x_1 = \dots = x_n$ can then be written $A^\top x = 0$. The dual problem reads as follows:

$$\min_{x \in \mathbb{R}^{n \times d},\, A^\top x = 0} \sum_{i=1}^n f_i(x_i) = \min_{x \in \mathbb{R}^{n \times d}} \max_{\lambda \in \mathbb{R}^{E \times d}} \sum_{i=1}^n f_i(x_i) - \langle A^\top x, \lambda \rangle.$$

Let $F_A^*(\lambda) := F^*(A\lambda)$ for $\lambda \in \mathbb{R}^{E \times d}$, where $F^*$ is the Fenchel conjugate of $F$. The dual problem reads

$$\min_{x \in \mathbb{R}^{n \times d},\, x_1 = \dots = x_n} F(x) = \max_{\lambda \in \mathbb{R}^{E \times d}} -F_A^*(\lambda).$$

Thus $F_A^*(\lambda) = \sum_{i=1}^n f_i^*((A\lambda)_i)$ is to be minimized over the dual variable $\lambda \in \mathbb{R}^{E \times d}$.
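This duality can be checked numerically in the quadratic case $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, for which $f_i^*(v) = \frac{1}{2}\|v\|^2 + \langle v, c_i \rangle$. The sketch below uses a path graph with $\mu_{ij} = 1$ and illustrative sizes (these choices are assumptions for the example, not from the paper), and verifies $\min_{A^\top x = 0} F(x) = \max_\lambda -F_A^*(\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
c = rng.normal(size=(n, d))  # local vectors c_i

# Primal optimum under consensus: x_1 = ... = x_n = mean of the c_i.
primal = 0.5 * ((c - c.mean(axis=0)) ** 2).sum()

# Square root A of the path-graph Laplacian: column for edge (i, i+1)
# is e_i - e_{i+1} (mu_ij = 1).
A = np.zeros((n, n - 1))
for e in range(n - 1):
    A[e, e], A[e + 1, e] = 1.0, -1.0

# F_A^*(lambda) = 0.5*||A lam||^2 + <A lam, c>, minimized at
# A lam = -(c - mean(c)), solvable here by least squares.
lam, *_ = np.linalg.lstsq(A, -(c - c.mean(axis=0)), rcond=None)
w = A @ lam
dual = -(0.5 * (w ** 2).sum() + (w * c).sum())
print(np.isclose(primal, dual))  # → True
```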

We now make a parallel between pairwise operations between adjacent nodes in the network and coordinate gradient steps on $F_A^*$. As $F_A^*(\lambda) = \max_{x \in \mathbb{R}^{n \times d}} -F(x) + \langle A\lambda, x \rangle$, to any $\lambda \in \mathbb{R}^{E \times d}$ a primal variable $x \in \mathbb{R}^{n \times d}$ is uniquely associated through the formula $\nabla F(x) = A\lambda$. The partial derivative of $F_A^*$ with respect to coordinate $(ij)$ of $\lambda$ reads:

$$\nabla_{ij} F_A^*(\lambda) = (A e_{ij})^\top \nabla F^*(A\lambda) = \mu_{ij}\big(\nabla f_i^*((A\lambda)_i) - \nabla f_j^*((A\lambda)_j)\big).$$

Consider then the following step of coordinate gradient descent for $F_A^*$ on coordinate $(ij)$ of $\lambda$, performed when edge $(ij)$ is activated at iteration $k$ (corresponding to time $t_k$), and where $U_{ij} = e_{ij} e_{ij}^\top$:

$$\lambda_{t_{k+1}} = \lambda_{t_k} - \frac{1}{(\sigma_i^{-1} + \sigma_j^{-1})\,\mu_{ij}^2}\, U_{ij} \nabla_{ij} F_A^*(\lambda_{t_k}). \tag{6}$$

Denoting $v_k = A\lambda_{t_k} \in \mathbb{R}^{n \times d}$, we obtain the following formula for updating coordinates $i, j$ of $v$ when $(ij)$ is activated:

$$v_{k+1,i} = v_{k,i} - \frac{\nabla f_i^*(v_{k,i}) - \nabla f_j^*(v_{k,j})}{\sigma_i^{-1} + \sigma_j^{-1}}, \tag{7}$$

$$v_{k+1,j} = v_{k,j} + \frac{\nabla f_i^*(v_{k,i}) - \nabla f_j^*(v_{k,j})}{\sigma_i^{-1} + \sigma_j^{-1}}. \tag{8}$$

Such updates can be performed locally at nodes $i$ and $j$ after communication between the two nodes. We refer in the sequel to this scheme as the Coordinate Descent Method (CDM). While $\lambda \in \mathbb{R}^{E \times d}$ is a dual variable defined on the edges, $v \in \mathbb{R}^{n \times d}$ is also a dual variable, but defined on the nodes. The primal surrogate of $v$ is defined as $x = \nabla F^*(v)$, i.e. $x_i = \nabla f_i^*(v_i)$ at node $i$. It can hence be computed with local updates on $v$ ((7) and (8)). Thus CDM, based on coordinate gradient descent for the dual problem, translates into local updates for the primal variables $x_i$. Note that in order to perform CDM, an initialization $v(0) \in \mathrm{Im}(A)$ at all nodes is required, to ensure the existence of $\lambda \in \mathbb{R}^{E \times d}$ such that $A\lambda(0) = v(0)$. We thus usually take $v_i(0) = 0$ for all nodes $i$.
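For intuition, updates (7) and (8) can be simulated for the quadratic functions $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ of the averaging problem, for which $\nabla f_i^*(v) = v + c_i$ and $\sigma_i = 1$; the primal surrogates $x_i = \nabla f_i^*(v_i)$ then converge to the average of the $c_i$. A minimal sketch, where the cycle graph, uniform edge activations and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 2
c = rng.normal(size=(n, d))          # local vectors c_i
sigma = np.ones(n)                   # f_i(x) = 0.5*||x - c_i||^2 is 1-strongly convex
v = np.zeros((n, d))                 # dual variables, v(0) = 0 lies in Im(A)
grad_conj = lambda i, vi: vi + c[i]  # nabla f_i^*(v) = v + c_i in the quadratic case

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
for _ in range(2000):
    i, j = edges[rng.integers(len(edges))]  # uniformly activated edge
    g = (grad_conj(i, v[i]) - grad_conj(j, v[j])) / (1 / sigma[i] + 1 / sigma[j])
    v[i] -= g                                # update (7)
    v[j] += g                                # update (8)

x = np.array([grad_conj(i, v[i]) for i in range(n)])  # primal surrogates x_i
print(np.allclose(x, c.mean(axis=0), atol=1e-5))      # all x_i close to the mean of the c_i
```

In this quadratic case the update reduces to pairwise averaging of the $x_i$, so convergence to consensus on $\bar c$ is the classical randomized gossip behavior.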

Remark 1. We hence have two notions of duality. For $x = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}$ the primal variables associated with the network nodes, $v = (v_1, \dots, v_n) \in \mathbb{R}^{n \times d}$ is its convex-dual conjugate, with $v_i = \nabla f_i(x_i)$, while $\lambda \in \mathbb{R}^{E \times d}$ such that $A\lambda = v$ is its edge-dual conjugate.

Remark 2. Matrix $A$ is introduced only for the purpose of the analysis. Indeed, we analyze our algorithms through edge-dual formulations, with updates of the form (6) on these variables. However, we present the algorithm with the convex-dual variables, (7), (8), for which $\mu_{ij}^2$, and hence the effect of matrix $A$, disappears.

2.3 Gossip Averaging Problem

As previously mentioned, the initial problem (1) with functions $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, $x \in \mathbb{R}^d$, for some vectors $c_1, \dots, c_n \in \mathbb{R}^d$, reduces to the gossip averaging problem, which aims at computing in a decentralized way, with local computations, the value $\bar c = \frac{1}{n}\sum_{i=1}^n c_i$. We contrast in this particular framework the rates that can be obtained by synchronous and asynchronous methods. These rates are expressed in terms of the weighted graph Laplacian, where for synchronous updates the edge weights are tuned to the worst-case delay, whereas in the asynchronous case the edge weights can be tuned to local delays. Thus the advantage of asynchronous methods over synchronous ones is captured by these different edge weights in the considered Laplacian.

Synchronous Communications: In synchronous gossip iterations [10], all nodes update their values synchronously by taking a weighted average of the values of their neighbors (see Appendix A.1 for more details). These algorithms converge linearly with a rate given by the second smallest eigenvalue of the graph Laplacian weighted by weights $\nu_{ij} \leq 1$. Since every iteration takes a time $\tau_{\max}$, synchronous gossip algorithms have a linear rate of convergence $\gamma_{synch} = \gamma(\nu_{ij})$ with weights $\nu_{ij} \leq \tau_{\max}^{-1}$ for all $(ij) \in E$ (Definition 1). We rephrase this as the following proposition.

Proposition 1 (Synchronous Gossip). Let $x(t) = (x_1(t), \dots, x_n(t))^\top \in \mathbb{R}^{n \times d}$ be the matrix of vectors $x_i(t)$ attached to node $i$ at time $t \geq 0$. For continuous time $t \geq 0$ and for synchronous gossip algorithms as in [10], we have:

$$\|x(t) - \bar c\|^2 \leq \exp\big(-(t - \tau_{\max})\,\gamma_{synch}\big)\,\|x(0) - \bar c\|^2, \tag{9}$$

with $\gamma_{synch}$ the second smallest eigenvalue of the graph Laplacian weighted by $\nu_{ij} \equiv \tau_{\max}^{-1}$.

Asynchronous Communications in the P.p.p. model: This is the setting of randomized gossip as considered by [5], where the point processes $\mathcal{P}_{ij}$ are independent P.p.p. of rates $p_{ij} > 0$. When edge $(ij)$ is activated, nodes $i$ and $j$ update their values by a local averaging (Appendix A.2). We have the following convergence result.

Proposition 2 (Randomized Gossip). For randomized gossip as in [5], we have:

$$\mathbb{E}\big[\|x(t) - \bar c\|^2\big] \leq \exp(-t\,\gamma_{asynch})\,\|x(0) - \bar c\|^2, \tag{10}$$

with $\gamma_{asynch}$ the second smallest eigenvalue of the graph Laplacian weighted by $\nu_{ij} = p_{ij}$. Moreover, this rate is optimal in the sense that there exists $x(0) \in \mathbb{R}^{n \times d}$ such that (10) is an equality for all $t \geq 0$.

Proofs of (9) and (10) and details about synchronous and randomized gossip can be found in Appendix A. Equation (10) follows from derivations in [5], combined with a study of infinitesimal intervals of time $[t, t + dt]$. We generalize this result to the initial optimization problem (1) in the next section.

In the P.p.p. model, the terms $1/p_{ij}$ capture the average time between consecutive activations of edge $(ij)$ and are thus naturally related to the delays $\tau_{ij}$. This suggests that asynchrony brings about a speed-up reflected by the change in the Laplacian's spectral gap $\gamma(\nu_{ij})$ when the weights $\nu_{ij} \equiv \tau_{\max}^{-1}$ are replaced by $\nu_{ij} = \tau_{ij}^{-1}$. The fact that $\gamma_{asynch}$ is optimal leads us to believe that this quantity, the smallest non-null eigenvalue of the Laplacian with local weights, best describes the asynchronous speed-up.

The above argument identifying $\tau_{ij}$ with $p_{ij}^{-1}$ is heuristic. Our analysis of the Loss-Network model will establish a more rigorous bridge between the spectral gap of the Laplacian with edge weights based on local delays and the convergence speed of asynchronous schemes.

3 RANDOMIZED GOSSIP: THE P.P.P. MODEL

3.1 The P.p.p. Model and Randomized Gossip Algorithms

The P.p.p. model: Each edge $(ij) \in E$ has a clock that ticks at the instants of a Poisson point process $\mathcal{P}_{ij}$ of intensity $p_{ij}$, where the $\mathcal{P}_{ij}$ are mutually independent. At each tick of its clock, edge $(ij)$ is activated, and nodes $i$ and $j$ can communicate together. The superposition $\mathcal{P} = \bigcup_{(ij) \in E} \mathcal{P}_{ij}$ is again a P.p.p., of intensity

$$I = \sum_{(ij) \in E} p_{ij}. \tag{11}$$
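In simulations, the superposition $\mathcal{P}$ can be sampled directly: inter-arrival times are exponential with parameter $I$, and each tick belongs to edge $(ij)$ with probability $p_{ij}/I$. A small sketch with illustrative intensities:

```python
import numpy as np

rng = np.random.default_rng(1)
p = {(0, 1): 2.0, (1, 2): 1.0, (0, 2): 0.5}  # edge intensities p_ij (illustrative)
edges = list(p)
rates = np.array([p[e] for e in edges])
I = rates.sum()                              # total intensity of the superposition P

def next_activation(t):
    """Next tick of the superposed process after time t: the wait is
    Exp(I), and the ticking edge is (ij) with probability p_ij / I."""
    t += rng.exponential(1.0 / I)
    e = edges[rng.choice(len(edges), p=rates / I)]
    return t, e

# Long-run activation frequencies match p_ij / I.
counts = {e: 0 for e in edges}
t = 0.0
for _ in range(20000):
    t, e = next_activation(t)
    counts[e] += 1
freq01 = counts[(0, 1)] / 20000
print(freq01)  # close to p_01 / I = 2 / 3.5 ≈ 0.571
```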

Randomized Gossip Algorithm: Each node $i$ maintains a local variable $(x_i(t))_{t \geq 0}$. We denote by $(v_i(t))_{t \geq 0}$ its local convex-dual conjugate and write $v(t) = (v_i(t))_i$. We initialize with $v_i(0) = 0$ at all nodes. Based on the dual problem formulation of Section 2.2, we consider CDM. Specifically, when clock $(ij)$ ticks at time $t \geq 0$, perform the following update on variable $v(t)$:

$$v_i(t) \leftarrow_t v_i(t) - \frac{\nabla f_i^*(v_i(t)) - \nabla f_j^*(v_j(t))}{\sigma_i^{-1} + \sigma_j^{-1}}, \qquad v_j(t) \leftarrow_t v_j(t) + \frac{\nabla f_i^*(v_i(t)) - \nabla f_j^*(v_j(t))}{\sigma_i^{-1} + \sigma_j^{-1}}. \tag{12}$$

The desired output at node $i$ and time $t$ is then $x_i(t) = \nabla f_i^*(v_i(t))$. Note that, as mentioned in Section 2.2, the outputs $v_i(t)$ and $x_i(t)$ at any node $i$ and time $t$ are completely independent of the initial choice of matrix $A$, whose only use is for the analysis. Observe that in the gossip averaging problem, $v_i(t) = x_i(t) - x_i(0)$, and Equation (12) simplifies to

$$x_i(t),\, x_j(t) \leftarrow_t \frac{x_i(t) + x_j(t)}{2}, \tag{13}$$

which coincides with classical randomized gossip updates for the averaging problem.

3.2 Continuous Time Convergence Analysis

The classical analysis of gossip algorithms [5] proceeds as follows: at every clock tick of $\mathcal{P}$, an edge $(ij)$ is selected with probability $q_{ij} = \frac{p_{ij}}{I}$. A discrete-time analysis of the state variables at these ticking times is then performed. In order to derive bounds in continuous time $t$, we instead study infinitesimal intervals of time $[t, t + dt]$, which gives us more degrees of freedom, as shown in Section 3.3.

Theorem 1. For the CDM updates (12), in the P.p.p. model with intensities $p_{ij}$, we have the following guarantee for all $t \geq 0$:

$$\mathbb{E}\big(F^*(v(t)) - F^*(v^*)\big) \leq \big(F^*(v(0)) - F^*(v^*)\big)\exp\left(-\frac{\sigma_{\min}}{2 L_{\max}}\,\gamma_p\, t\right), \tag{14}$$

where $v^* = A\lambda^*$ is the minimizer of $F^*$ on $\mathrm{Im}(A)$, $\lambda^*$ being a minimizer of $F_A^*$, $\gamma_p = \gamma(p_{ij})$ is the spectral gap of the graph Laplacian weighted by $\nu_{ij} = p_{ij}$, and $\sigma_{\min}, L_{\max}$ are defined in (3).

Since $x(t) = \nabla F^*(v(t))$ and $x^* = \nabla F^*(v^*)$, where $x^*$ is the minimizer of $F$ under the consensus constraint, we have on the primal variable $x(t)$ (Lemma 3):

$$\mathbb{E}\big[\|x_t - x^*\|^2\big] \leq \frac{2 L_{\max}}{\sigma_{\min}^2}\,\big(F^*(v(0)) - F^*(v^*)\big)\exp\left(-\frac{\sigma_{\min}}{2 L_{\max}}\,\gamma_p\, t\right). \tag{15}$$

We thus obtain in the convergence rate a factor $\gamma_p$ that reflects communication speed, and a factor $\frac{\sigma_{\min}}{L_{\max}}$, whose inverse $\frac{L_{\max}}{\sigma_{\min}}$ upper-bounds the condition number of the objective function. The sketch of proof below relies on a classical analysis of coordinate descent algorithms adapted to continuous time; the technical details are deferred to Appendix B. We believe the proof technique to be of independent interest: it could be applied to analyze optimization methods such as gradient descent algorithms (stochastic, proximal or accelerated ones) with increments ruled by Poisson point processes, with simple proofs based on the establishment of differential inequalities.

Proof. We prove Theorem 1 by considering edge-dual variables $\lambda_t \in \mathbb{R}^{E \times d}$ associated to $x(t)$ and $v(t)$, in particular with $A\lambda_t = v(t)$ and $A\lambda^* = v^*$. Since $v(0) = 0$, we take $\lambda_0 = 0$. We consider the matrix $A$ in (5) with $\mu_{ij}^2 = \frac{p_{ij}}{\sigma_i^{-1} + \sigma_j^{-1}}$. When clock $(ij)$ ticks at time $t \geq 0$, the following update is performed on variable $\lambda_t$:

$$\lambda_t \leftarrow_t \lambda_t - \frac{1}{(\sigma_i^{-1} + \sigma_j^{-1})\,\mu_{ij}^2}\, U_{ij} \nabla_{ij} F_A^*(\lambda_t). \tag{16}$$

Furthermore, note that we have $F^*(v(t)) = F_A^*(\lambda_t)$. A key ingredient in the proof is the lemma below, which establishes a local smoothness property. Its proof is given in Appendix B.

Lemma 1. For $\lambda \in \mathbb{R}^{E \times d}$ and $(ij) \in E$, we have:

$$F_A^*\!\left(\lambda - \frac{1}{\mu_{ij}^2(\sigma_i^{-1} + \sigma_j^{-1})}\, U_{ij} \nabla_{ij} F_A^*(\lambda)\right) - F_A^*(\lambda) \leq -\frac{1}{2\mu_{ij}^2(\sigma_i^{-1} + \sigma_j^{-1})} \left\|\nabla_{ij} F_A^*(\lambda)\right\|^2. \tag{17}$$

Then, using this lemma, for $t \geq 0$ and $dt > 0$:

$$\begin{aligned}
\mathbb{E}^{\mathcal{F}_t}[F_A^*(\lambda_{t+dt}) - F_A^*(\lambda_t)] &= (1 - I\,dt)\,\mathbb{E}^{\mathcal{F}_t}\big[F_A^*(\lambda_{t+dt}) - F_A^*(\lambda_t) \,\big|\, \text{no activations in } [t, t+dt]\big] \\
&\quad + \sum_{(ij) \in E} p_{ij}\,dt\;\mathbb{E}^{\mathcal{F}_t}\big[F_A^*(\lambda_{t+dt}) - F_A^*(\lambda_t) \,\big|\, (ij) \text{ activated in } [t, t+dt]\big] \\
&= -dt \sum_{(ij) \in E} p_{ij}\left(F_A^*(\lambda_t) - F_A^*\Big(\lambda_t - \tfrac{1}{(\sigma_i^{-1} + \sigma_j^{-1})\mu_{ij}^2}\, U_{ij} \nabla_{ij} F_A^*(\lambda_t)\Big)\right) + o(dt) \\
&\leq -dt \sum_{(ij) \in E} \frac{p_{ij}}{2(\sigma_i^{-1} + \sigma_j^{-1})\mu_{ij}^2} \left\|\nabla_{ij} F_A^*(\lambda_t)\right\|^2 + o(dt) \\
&= -\frac{dt}{2} \left\|\nabla F_A^*(\lambda_t)\right\|^2 + o(dt).
\end{aligned}$$

Lemma 8 in the Appendix implies that $\|\nabla F_A^*(\lambda)\|^2 \geq 2\sigma_A\big(F_A^*(\lambda) - F_A^*(\lambda^*)\big)$, where $\sigma_A$ is the strong convexity parameter of $F_A^*$ with respect to the Euclidean norm on the orthogonal complement of $\mathrm{Ker}(A)$. We thus have:

$$\mathbb{E}^{\mathcal{F}_t}[F_A^*(\lambda_{t+dt}) - F_A^*(\lambda_t)] \leq -dt\,\sigma_A\big(F_A^*(\lambda_t) - F_A^*(\lambda^*)\big) + o(dt).$$

Then, dividing by $dt$ and letting $dt \to 0$ yields:

$$\frac{d}{dt}\,\mathbb{E}[F_A^*(\lambda_t) - F_A^*(\lambda^*)] \leq -\sigma_A\,\mathbb{E}[F_A^*(\lambda_t) - F_A^*(\lambda^*)].$$

We then obtain an exponential rate of convergence $\sigma_A$ by integrating. Finally, Lemma 5 in the Appendix gives $\sigma_A \geq \frac{\lambda_{\min}^+(A A^\top)}{L_{\max}}$, where $\lambda_{\min}^+(A A^\top)$ is the smallest non-null eigenvalue of $A A^\top$. As $A A^\top$ is the Laplacian of the graph with weights $\nu_{ij} = \mu_{ij}^2 = \frac{p_{ij}}{\sigma_i^{-1} + \sigma_j^{-1}}$ (Lemma 6), we have $\lambda_{\min}^+(A A^\top) \geq \sigma_{\min}\gamma_p / 2$, and (14) follows. □

Remark 3. The above study of infinitesimal intervals of time directly leads to continuous-time bounds. These could also be derived from a discrete-time analysis. Denote by $t_k \geq 0$ the time of the $k$-th activation, $k \in \mathbb{N}$, with $t_0 = 0$. We can prove that:

$$\mathbb{E}[F_A^*(\lambda_{t_k}) - F_A^*(\lambda^*)] \leq (1 - \sigma_A / I)^k\,\big(F_A^*(\lambda_0) - F_A^*(\lambda^*)\big), \tag{18}$$

where $I = \sum_{(ij) \in E} p_{ij}$. Then, we have in continuous time, for any $t \in \mathbb{R}_+$:

$$\begin{aligned}
\mathbb{E}[F_A^*(\lambda_t) - F_A^*(\lambda^*)] &= \sum_{k \in \mathbb{N}} e^{-It}\frac{(It)^k}{k!}\,\mathbb{E}\big[F_A^*(\lambda_t) - F_A^*(\lambda^*) \,\big|\, k \text{ activations in } [0, t]\big] \\
&\leq \sum_{k \in \mathbb{N}} e^{-It}\frac{(It)^k}{k!}\,(1 - \sigma_A / I)^k\,\big(F_A^*(\lambda_0) - F_A^*(\lambda^*)\big) \\
&= e^{-\sigma_A t}\,\big(F_A^*(\lambda_0) - F_A^*(\lambda^*)\big),
\end{aligned}$$

giving the same result. However, in the next section we will see that the continuous-time viewpoint is essential in the design of the CACDM algorithm, as well as for its analysis through consideration of infinitesimal intervals and differential calculus.
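The Poisson summation step above reduces to the identity $\sum_k e^{-It}\frac{(It)^k}{k!}(1 - \sigma_A/I)^k = e^{-\sigma_A t}$, which can be checked numerically (the constants below are illustrative):

```python
import math

# Summing the per-count contraction (1 - sigma_A/I)^k against Poisson(I*t)
# weights recovers exp(-sigma_A * t), since the series is exp(-It) * exp(It*(1 - sigma_A/I)).
I, sigma_A, t = 3.0, 0.7, 2.0
mixture = sum(
    math.exp(-I * t) * (I * t) ** k / math.factorial(k) * (1 - sigma_A / I) ** k
    for k in range(60)  # remaining terms are negligible
)
print(abs(mixture - math.exp(-sigma_A * t)) < 1e-12)  # → True
```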

3.3 Accelerated Gossip in the P.p.p. model

Inspired by previous works [14, 33], we propose CACDM (Continuously Accelerated Coordinate Descent Method), a gossip algorithm that, for the P.p.p. model , provably obtains an accelerated rate of convergence in the sense of Nesterov and Stich [33] (Theorem 2).

3.3.1 CACDM algorithm and convergence guarantees. Similarly to other standard acceleration techniques, the algorithm requires maintaining two variables $x(t), y(t) \in \mathbb{R}^{n \times d}$, whose convex-dual conjugates are denoted $u(t), v(t)$. The variable $v(t)$ plays the role of a momentum. We initialize such that $u(0) = v(0) = 0$. As in the last subsection, variables $u_i(t), v_i(t), x_i(t), y_i(t) \in \mathbb{R}^d$ for $i \in [n]$ are attached to node $i$. The algorithm involves two types of operations: continuous contractions, and pairwise updates along each edge $(ij)$ when its Poisson clock ticks.

(1) Continuous Contractions: For all times $t \in \mathbb{R}_+$ and every node $i \in [n]$, for some fixed $\theta > 0$ to be specified, make the infinitesimal contraction between times $t$ and $t + dt$:

$$\begin{pmatrix} u_i(t + dt) \\ v_i(t + dt) \end{pmatrix} = \begin{pmatrix} 1 - I\theta\,dt & I\theta\,dt \\ I\theta\,dt & 1 - I\theta\,dt \end{pmatrix} \begin{pmatrix} u_i(t) \\ v_i(t) \end{pmatrix}.$$

Between times $s < t$, if node $i$ is not activated, this amounts to performing the contraction:

$$\begin{pmatrix} u_i(t) \\ v_i(t) \end{pmatrix} = \exp\left((t - s)\, I \begin{pmatrix} -\theta & \theta \\ \theta & -\theta \end{pmatrix}\right) \begin{pmatrix} u_i(s) \\ v_i(s) \end{pmatrix} = \begin{pmatrix} \frac{1 + e^{-2 I \theta (t - s)}}{2} & \frac{1 - e^{-2 I \theta (t - s)}}{2} \\ \frac{1 - e^{-2 I \theta (t - s)}}{2} & \frac{1 + e^{-2 I \theta (t - s)}}{2} \end{pmatrix} \begin{pmatrix} u_i(s) \\ v_i(s) \end{pmatrix}. \tag{19}$$
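The closed-form matrix exponential in (19) can be checked against a first-order Euler discretization of the contraction dynamics (constants illustrative):

```python
import numpy as np

# exp(s * I * [[-theta, theta], [theta, -theta]]) has entries
# (1 ± exp(-2*I*theta*s)) / 2; compare with a fine Euler product.
I, theta, s = 2.0, 0.3, 1.5
e = np.exp(-2 * I * theta * s)
closed = 0.5 * np.array([[1 + e, 1 - e],
                         [1 - e, 1 + e]])

G = I * np.array([[-theta, theta], [theta, -theta]])  # contraction generator
steps = 200000
euler = np.linalg.matrix_power(np.eye(2) + (s / steps) * G, steps)
print(np.allclose(closed, euler, atol=1e-4))  # → True
```

The generator has eigenvector $(1,1)^\top$ with eigenvalue $0$ (the sum $u_i + v_i$ is preserved) and $(1,-1)^\top$ with eigenvalue $-2I\theta$ (the difference decays), which is exactly what the closed form expresses.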

(2) Local Updates: Let $\gamma_p$ be the smallest non-null eigenvalue of the Laplacian of the graph weighted by the local rates $\nu_{ij} = p_{ij}$ (Definition 1), and $L_{\max}$ be defined in (3). When edge $(ij)$ is activated at time $t \geq 0$, perform the local update between nodes $i$ and $j$:

$$u_i(t) \leftarrow_t u_i(t) - \frac{\nabla f_i^*(u_i(t)) - \nabla f_j^*(u_j(t))}{\sigma_i^{-1} + \sigma_j^{-1}}, \tag{20}$$

$$v_i(t) \leftarrow_t v_i(t) - \frac{\theta L_{\max}}{\gamma_p}\big(\nabla f_i^*(u_i(t)) - \nabla f_j^*(u_j(t))\big), \tag{21}$$

and symmetrically at node $j$. The desired output at node $i$ and at time $t$ is then $x_i(t) = \nabla f_i^*(u_i(t))$ (Section 2.2).

This procedure can be performed asynchronously and at discrete times: the length $t - s$ between two consecutive activations of an edge, which appears in the exponential contraction (19), is a local variable that can be computed from a local clock. More formally, the stochastic process defined above is the following, where $V_t = (u(t), v(t))^T$ and the $N_{ij}$ are independent P.p.p. of intensities $p_{ij}$:
$$dV_t = I \begin{pmatrix} -\theta & \theta \\ \theta & -\theta \end{pmatrix} V_t\, dt - \sum_{(ij) \in E} dN_{ij}(t) \begin{pmatrix} \dfrac{\nabla f_i(u_t(i)) - \nabla f_j(u_t(j))}{\sigma_i^{-1} + \sigma_j^{-1}} \\[2mm] \dfrac{\theta L_{\max}}{\gamma_p} \left( \nabla f_i(u_t(i)) - \nabla f_j(u_t(j)) \right) \end{pmatrix}.$$

Define the Lyapunov function
$$\mathcal{L}_t = \left\| v(t) - v^* \right\|^2_{(A^{*\top} A^*)} + \frac{2\theta^2 S^2 L_{\max}^2}{\gamma_p^2} \left( F(u(t)) - F(v^*) \right), \quad (22)$$
where $v^* = A\lambda^*$ is the minimizer of $F$ on $\operatorname{Im}(A)$, $\lambda^*$ being a minimizer of $F_A$, $\theta, S > 0$ are to be defined, and $A^*$ is the pseudo-inverse of matrix $A$ tuned with $\mu_{ij}^2 = p_{ij}$. Let $\gamma_p$ be the smallest non-null eigenvalue of the Laplacian of the graph, with weights $\nu_{ij} = p_{ij}$ (Definition 1).

Theorem 2. For the CACDM algorithm defined by Equations (19), (20), (21) in the P.p.p. model, if $\theta = \sqrt{\frac{\gamma_p}{I S^2 L_{\max}}}$ for $S$ verifying the inequality:
$$S^2 \geq \sup_{(ij) \in E} \frac{\sigma_i^{-1} + \sigma_j^{-1}}{2 p_{ij} / I}, \quad (23)$$
where $\sigma_i$ is defined in (2), $I$ in (11) and $\sigma_{\min}, L_{\max}$ in (3), we have for all $t \in \mathbb{R}_+$:
$$\mathbb{E}[\mathcal{L}_t] \leq \mathcal{L}_0\, e^{-I\theta t},$$
where $\mathcal{L}_t$ is defined in (22).
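All quantities entering Theorem 2 are computable from the activation intensities alone. The sketch below (our own helper, not taken from the paper) builds the $p_{ij}$-weighted Laplacian, extracts $\gamma_p$, and sets $S^2$ and $\theta$ according to (23) with equality; it is checked on the complete graph $K_4$ with uniform intensities and $\sigma_i = L_{\max} = 1$:

```python
import numpy as np

def cacdm_parameters(n, edges, p, sigma, L_max):
    """gamma_p, S^2 and theta as in Theorem 2, with I the sum of intensities."""
    I = sum(p.values())
    # Laplacian of the graph weighted by the intensities nu_ij = p_ij.
    Lap = np.zeros((n, n))
    for (i, j) in edges:
        w = p[(i, j)]
        Lap[i, i] += w; Lap[j, j] += w
        Lap[i, j] -= w; Lap[j, i] -= w
    eig = np.linalg.eigvalsh(Lap)
    gamma_p = eig[eig > 1e-12][0]     # smallest non-null eigenvalue
    # Condition (23), taken with equality.
    S2 = max((1/sigma[i] + 1/sigma[j]) / (2*p[(i, j)]/I) for (i, j) in edges)
    theta = np.sqrt(gamma_p / (I * S2 * L_max))
    return gamma_p, S2, theta

# Complete graph K4, uniform intensities p_ij = 1/6 (so I = 1), sigma_i = 1.
edges = [(i, j) for i in range(4) for j in range(i+1, 4)]
p = {e: 1/6 for e in edges}
gamma_p, S2, theta = cacdm_parameters(4, edges, p, {i: 1.0 for i in range(4)}, 1.0)
assert np.isclose(gamma_p, 2/3)   # K_n with uniform weight w: non-null eigenvalues n*w
assert np.isclose(S2, 6.0)        # (1 + 1) / (2 * (1/6))
assert np.isclose(theta, 1/3)     # sqrt((2/3) / 6)
```

On $K_4$ the numbers come out exactly: $\gamma_p = 4 \cdot \tfrac{1}{6} = \tfrac{2}{3}$, $S^2 = 6$, hence $\theta = 1/3$.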

The proof of this theorem uses edge-dual variables and differential inequalities through the study of infinitesimal intervals $[t, t+dt]$, as in the proof of Theorem 1, further combined with inequalities from [33] for the study of accelerated coordinate descent. We first make a few comments on this theorem, and then proceed to its proof. Since $x(t) = \nabla F(u(t))$ and $v^* = \nabla F(x^*)$, where $x^*$ is the minimizer of $F$ under the consensus constraint, we have on the primal variable $x(t)$:
$$\mathbb{E}\left[ \left\| x_t - x^* \right\|^2 \right] \leq \frac{2 L_{\max}}{\sigma_{\min}^2} \cdot \frac{\gamma_p^2}{2 \theta^2 S^2 L_{\max}^2}\, \mathcal{L}_0\, e^{-I\theta t}. \quad (24)$$

Remarks on the bound: $\gamma_p / I$ is the normalized non-accelerated randomized gossip rate of convergence; it is divided by $I$ so that the $p_{ij}$ sum to 1. If there exists a constant $c > 0$ such that:
$$\forall (ij) \in E, \quad \frac{p_{ij}}{I} \geq \frac{c}{|E|},$$
then $S^2 = \sigma_{\min}^{-1} |E| / c$ satisfies (23), leading to the following rate of convergence:
$$I \times \sqrt{\frac{c\, \sigma_{\min}}{L_{\max}} \times \frac{\gamma_p}{I\, |E|}}.$$
Taking $I = 1$ (re-normalizing time) and the simple averaging problem leads to an improved rate of $n^2$ on the line graph instead of $n^3$ [28]. For the 2D-grid, we have $n^{3/2}$ instead of $n^2$ [28]. However, there is no improvement on the complete graph ($1/n$ in both cases). These rates are the same as in [9, 14, 25]. Yet, our algorithm does not require knowledge of the number of activations performed on the whole network, and only requires local clocks. Moreover, similarly to Hendrikx et al. [14], it works for any graph, and for the more general problem of distributed optimization of smooth and strongly convex functions, provided dual gradients of the local functions are computable.
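The $n^3 \to n^2$ improvement on the line graph can be checked numerically. The sketch below (our own illustration) compares the non-accelerated rate $\gamma_p$ against the accelerated rate $\sqrt{\gamma_p / |E|}$, taking $I = 1$, $c = 1$ and $\sigma_{\min} = L_{\max} = 1$ as in the averaging problem, and verifies how each scales when $n$ doubles:

```python
import numpy as np

def rates_line_graph(n):
    # Path graph on n nodes with uniform intensities p_ij = 1/|E|, I = 1.
    E = n - 1
    # lambda_2 of the unweighted path Laplacian is 2(1 - cos(pi/n));
    # scaling all edge weights by 1/E scales the spectrum by 1/E.
    gamma_p = 2.0 * (1.0 - np.cos(np.pi / n)) / E
    plain = gamma_p                   # non-accelerated gossip rate
    accel = np.sqrt(gamma_p / E)      # accelerated rate of Theorem 2
    return plain, accel

plain_n, accel_n = rates_line_graph(64)
plain_2n, accel_2n = rates_line_graph(128)
assert accel_n > plain_n                     # acceleration helps on the line graph
# plain ~ 1/n^3 (ratio ~ 1/8) and accel ~ 1/n^2 (ratio ~ 1/4) when n doubles.
assert abs(plain_2n / plain_n - 1/8) < 0.02
assert abs(accel_2n / accel_n - 1/4) < 0.02
```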

CACDM algorithm for the averaging problem: for the gossip averaging problem, we have $u(t) = x(t) - x(0)$ and $v(t) = y(t) - y(0)$, and the updates (20) and (21) read:
$$x_i(t) \leftarrow \frac{x_i(t) + x_j(t)}{2}, \qquad y_i(t) \leftarrow y_i(t) - \frac{\theta}{\gamma_p} \left( x_t(i) - x_t(j) \right).$$
The first variable thus performs classical local averaging while mixing continuously with the second one (the momentum).
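To make the event loop concrete, here is a minimal simulation of these averaging updates under the P.p.p. model. It is our own sketch under stated assumptions: a 6-cycle with uniform intensities ($I = 1$), $\sigma_i = L_{\max} = 1$, and $\theta$, $S^2$ set as in Theorem 2; `contract` is a hypothetical helper name. Each node applies the exponential contraction (19) lazily, using only the time elapsed since its last update, which is exactly the local-clock property discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
edges = [(i, (i + 1) % n) for i in range(n)]    # 6-cycle
p = 1.0 / len(edges)                             # uniform intensities, I = 1

# gamma_p of the p-weighted cycle Laplacian: 2(1 - cos(2 pi / n)) * p = 1/6 here.
gamma_p = 2.0 * (1.0 - np.cos(2.0 * np.pi / n)) * p
S2 = 2.0 / (2.0 * p)                             # condition (23) with sigma_i = 1
theta = np.sqrt(gamma_p / S2)                    # Theorem 2 with I = L_max = 1

x0 = rng.normal(size=n)                          # initial values; y(0) = x(0)
x, y = x0.copy(), x0.copy()
last = np.zeros(n)                               # local clocks: last update time
t = 0.0

def contract(i, t):
    # Lazy exponential contraction (19) applied to u_i = x_i - x_i(0),
    # v_i = y_i - y_i(0), over the elapsed local time t - last[i].
    e = np.exp(-2.0 * theta * (t - last[i]))
    u, v = x[i] - x0[i], y[i] - x0[i]
    x[i] = x0[i] + 0.5 * ((1 + e) * u + (1 - e) * v)
    y[i] = x0[i] + 0.5 * ((1 - e) * u + (1 + e) * v)
    last[i] = t

for _ in range(20000):
    t += rng.exponential(1.0)                    # total activation rate I = 1
    i, j = edges[rng.integers(len(edges))]       # uniform edge activation
    contract(i, t); contract(j, t)
    d = x[i] - x[j]
    x[i] = x[j] = 0.5 * (x[i] + x[j])            # update (20): local averaging
    y[i] -= (theta / gamma_p) * d                # update (21): momentum step
    y[j] += (theta / gamma_p) * d

for i in range(n):                               # flush pending contractions
    contract(i, t)

assert np.max(np.abs(x - x0.mean())) < 1e-3 * np.max(np.abs(x0 - x0.mean()))
```

Note that no node ever needs a global iteration counter: the only non-local quantity is $\gamma_p$, a fixed spectral constant of the weighted graph.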

3.3.2 Proof of Theorem 2. Let the two edge-dual variables $\lambda, \omega \in \mathbb{R}^{E \times d}$ be the edge-dual conjugates of $x(t), y(t)$. Variable $\omega$ plays the role of the momentum. Since $u(0) = v(0) = 0$, we can take $\lambda_0 = \omega_0 = 0$. Operations (20) and (21) translate as follows on these variables when clock $(ij)$ ticks. Let $\sigma_A$ be the strong convexity parameter of $F_A$ with respect to the Euclidean norm on the orthogonal complement of $\operatorname{Ker}(A)$. In Appendix B, we prove that, if $\mu_{ij}^2 = p_{ij}$, then $\sigma_A = \frac{\gamma_p}{L_{\max}}$. While working with $F_A$ and edge-dual variables, we use $\sigma_A$ instead of $\frac{\gamma_p}{L_{\max}}$ as presented in the algorithm, in order to keep in mind its meaning for $F_A$. Define the coordinate gradient step:
$$\eta_{ij,t} = - \begin{pmatrix} \dfrac{1}{2 \mu_{ij}^2 (\sigma_i^{-1} + \sigma_j^{-1})}\, U_{ij} \nabla_{ij} F_A(\lambda_t) \\[2mm] \dfrac{\theta}{\sigma_A\, p_{ij}}\, U_{ij} \nabla_{ij} F_A(\lambda_t) \end{pmatrix}, \quad (25)$$
where $U_{ij} = e_{ij} e_{ij}^T$, and perform the gradient step:
$$\begin{pmatrix} \lambda_t \\ \omega_t \end{pmatrix} \leftarrow \begin{pmatrix} \lambda_t \\ \omega_t \end{pmatrix} + \eta_{ij,t}. \quad (26)$$

Define:
$$L_t = \left\| \omega_t - \lambda^* \right\|^2 + \frac{2 \theta^2 S^2}{\sigma_A^2} \left( F_A(\lambda_t) - F_A(\lambda^*) \right),$$
where $\|\cdot\|$ is the Euclidean norm on the orthogonal complement of $\operatorname{Ker}(A)$, and $\lambda^*$ is an optimizer of $F_A$. Note that we have $L_t = \mathcal{L}_t$ for all $t \geq 0$.

Proof of Theorem 2. The proof closely follows the lines of Hendrikx et al. [14] and Nesterov and Stich [33], adapted to fit our continuous-time algorithm. Without loss of generality, we assume that $I = 1$, i.e., that the $p_{ij}$ sum to 1 (by rescaling time with $t' = tI$). We denote $r_t = \| \omega_t - \lambda^* \|$ and $f_t = F_A(\lambda_t) - F_A(\lambda^*)$, so that $L_t = r_t^2 + \frac{2\theta^2 S^2}{\sigma_A^2} f_t$. Let $t \geq 0$ and $dt > 0$. The following equalities and inequalities are true up to a $o(dt)$ approximation, which disappears when we let $dt \to 0$.

Let us start with the term $r_t^2$:
$$\mathbb{E}_{\mathcal{F}_t}\big[r_{t+dt}^2\big] = (1 - dt)\, \mathbb{E}_{\mathcal{F}_t}\big[r_{t+dt}^2 \,\big|\, \text{no activation between } t \text{ and } t+dt\big] \quad (27)$$
$$\qquad\qquad + dt\, \mathbb{E}_{\mathcal{F}_t}\big[r_{t+dt}^2 \,\big|\, \text{1 activation between } t \text{ and } t+dt\big]. \quad (28)$$

For the first term, we get:
$$\mathbb{E}_{\mathcal{F}_t}\big[r_{t+dt}^2 \,\big|\, \text{no activation in } [t, t+dt]\big] = \left\| (1 - \theta dt)\, \omega_t + \theta dt\, \lambda_t - \lambda^* \right\|^2 \leq (1 - \theta dt)\, r_t^2 + \theta dt\, \left\| \lambda_t - \lambda^* \right\|^2,$$

where the inequality uses convexity of the squared norm. For the other term, we decompose the event "1 activation between $t$ and $t+dt$" into the disjoint events "$(ij)$ activated between $t$ and $t+dt$", each of probability $p_{ij}\, dt$, to get the following equation, which is true up to a $o(1)$ approximation (which is enough since we multiply by $dt$ afterwards):
$$\mathbb{E}_{\mathcal{F}_t}\big[r_{t+dt}^2 \,\big|\, \text{1 activation between } t \text{ and } t+dt\big] = \sum_{(ij) \in E} p_{ij} \left\| \omega_t - \frac{\theta}{p_{ij} \sigma_A} U_{ij} \nabla_{ij} F_A(\lambda_t) - \lambda^* \right\|^2$$
$$= \left\| \omega_t - \lambda^* \right\|^2 + \sum_{ij} p_{ij} \frac{\theta^2}{\sigma_A^2 p_{ij}^2} \left\| U_{ij} \nabla_{ij} F_A(\lambda_t) \right\|^2 - 2 \sum_{ij} p_{ij} \frac{\theta}{p_{ij} \sigma_A} \left\langle U_{ij} \nabla_{ij} F_A(\lambda_t),\, \omega_t - \lambda^* \right\rangle. \quad (29)$$

For the term $\sum_{ij} p_{ij} \frac{\theta^2}{\sigma_A^2 p_{ij}^2} \| U_{ij} \nabla_{ij} F_A(\lambda_t) \|^2$, we get, by definition of $S^2$ and by a local smoothness inequality (namely, for all $y$, $F_A(y) - F_A\big(y - \frac{1}{\mu_{ij}^2 (\sigma_i^{-1} + \sigma_j^{-1})} U_{ij} \nabla_{ij} F_A(y)\big) \geq \frac{1}{2 \mu_{ij}^2 (\sigma_i^{-1} + \sigma_j^{-1})} \| \nabla_{ij} F_A(y) \|^2$, proved in Lemma 4):
$$\sum_{ij} p_{ij} \frac{\theta^2}{\sigma_A^2 p_{ij}^2} \left\| U_{ij} \nabla_{ij} F_A(\lambda_t) \right\|^2 \leq \sum_{ij} p_{ij} \frac{2 \theta^2 S^2}{\sigma_A^2\, \mu_{ij}^2 (\sigma_i^{-1} + \sigma_j^{-1})} \left\| U_{ij} \nabla_{ij} F_A(\lambda_t) \right\|^2$$
$$\leq \sum_{ij} p_{ij} \frac{2 \theta^2 S^2}{\sigma_A^2} \left( F_A(\lambda_t) - F_A\Big( \lambda_t - \tfrac{\theta}{\sigma_A p_{ij}} U_{ij} \nabla_{ij} F_A(\lambda_t) \Big) \right)$$
$$= \frac{2 \theta^2 S^2}{\sigma_A^2} \left( F_A(\lambda_t) - \mathbb{E}_{\mathcal{F}_t}\big[ F_A(\lambda_{t+dt}) \,\big|\, \text{1 activation in } [t, t+dt] \big] \right). \quad (30)$$

For the term $-2 \sum_{ij} p_{ij} \frac{\theta}{p_{ij} \sigma_A} \langle U_{ij} \nabla_{ij} F_A(\lambda_t), \omega_t - \lambda^* \rangle$, which after multiplication by $dt$ equals $-2 dt \frac{\theta}{\sigma_A} \langle \nabla F_A(\lambda_t), \omega_t - \lambda^* \rangle$, we get, by adding and subtracting $\lambda_t$ in the bracket, and by convexity of $F_A$ ($\sigma_A$ is the strong convexity parameter of $F_A$):
$$-2 dt \frac{\theta}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \omega_t - \lambda^* \right\rangle = -2 dt \frac{\theta}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \omega_t - \lambda_t \right\rangle - 2 dt \frac{\theta}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \lambda_t - \lambda^* \right\rangle$$
$$\leq -\frac{2}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \theta dt\, (\omega_t - \lambda_t) \right\rangle - 2 dt \frac{\theta}{\sigma_A} \left( F_A(\lambda_t) - F_A(\lambda^*) + \frac{\sigma_A}{2} \left\| \lambda_t - \lambda^* \right\|^2 \right).$$
Then, let us define $\bar\lambda_{t+dt} = (1 - \theta dt)\, \lambda_t + \theta dt\, \omega_t = \mathbb{E}_{\mathcal{F}_t}[\lambda_{t+dt} \mid \text{no activation in } [t, t+dt]]$. By noticing that $\theta dt\, (\omega_t - \lambda_t) = \bar\lambda_{t+dt} - \lambda_t$, we get:
$$-\frac{2}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \theta dt\, (\omega_t - \lambda_t) \right\rangle = -\frac{2}{\sigma_A} \left\langle \nabla F_A(\lambda_t), \bar\lambda_{t+dt} - \lambda_t \right\rangle \quad (31)$$
$$= -\frac{2}{\sigma_A} \left\langle \nabla F_A(\bar\lambda_{t+dt}), \bar\lambda_{t+dt} - \lambda_t \right\rangle \quad (32)$$
$$\leq -\frac{2}{\sigma_A} \left( F_A(\bar\lambda_{t+dt}) - F_A(\lambda_t) \right), \quad (33)$$
