Distributed Optimization in Multiagent Systems

(1)

Distributed Optimization in Multiagent Systems

(2)

The Consensus Problem

1 2

3 4

x

x x

1/25

(3)

The Consensus Problem

x + x + x + x

=

x

No single agent knows the target function to optimize

(4)

The Consensus Problem

x + x + x + x

=

x

No single agent knows the target function to optimize

1/25

(5)

Formally

minx∈X N

X

n=1

fn(x)

I N = number of nodes / agents

I X =R^d

I fn is thecost functionof agentn

I Two agentsnandmcan exchange messages ifn∼m Numerous works on that problem

Early work: Tsitsiklis ’84

(6)

Example #1: Wireless Sensor Networks

Yn= random observation of sensorn x = unknown parameter to be estimated

p(Y1,· · ·,YN;x) =p1(Y1;x)· · ·pN(YN;x) The maximum likelihood estimate writes

ˆ

x= arg max

x

X

n

lnpn(Yn;x)

[Schizas’08, Moura’11]

3/25

(7)

Example #2: Machine Learning

Data set formed byT samples (Xi,Yi) (i= 1. . .T)

I Yi = variable to be explained

I Xi = explanatory features minx

T

X

i=1

`(x^TXi,Yi) +r(x) Split data intoNbatches

min

x N

X

n=1

X

i

`(x^TXi,n,Yi,n) +r(x)

n.b.: some problems are more involved (I. Colin’16) min

x

X

i

X

j

f(x;Xi,Yi,Xj,Yj) +r(x)

(8)

Example #3: Resource Allocation

Letxn be the resource of an agentn

I Agents share a resourceb: X

n

xn≤b

I Agentngets rewardRn(xn) for using resourcexn I Maximize the global reward

max

x:P nx_n≤b

N

X

n=1

Rn(xn) The dual of a sharing problem is a consensus problem

5/25

(9)

Networks

Parallel:

.

Distributed:

.

(10)

Outline

Distributed gradient descent

Distributed Alternating Direction Method of Multipliers (D-ADMM)

Total Variation Regularization on Graphs

(11)

Outline

(12)

Adapt-and-combine (Tsitsiklis’84)

I [Local step]Each agentngenerates a temporary update

˜

xn^k+1=xn^k−γk∇fn(xn^k)

I [Agreement step]Connected agents merge their temporary estimates x_n^k+1=X

m∼n

A(n,m) ˜x_m^k+1

whereAsatisfies technical constraints (must be doubly stochastic)

Convergence rates(e.g. [Nedic’09], [Duchi’12])

I Decreasing step sizeγk→0 is needed in general

I Sublinear converges rates

7/25

(13)

Adapt-and-combine (Tsitsiklis’84)

I [Local step]Each agentngenerates a temporary update

˜

xn^k+1=xn^k−γk∇fn(xn^k)

I [Agreement step]Connected agents merge their temporary estimates x_n^k+1=X

m∼n

A(n,m) ˜x_m^k+1

whereAsatisfies technical constraints (must be doubly stochastic) Convergence rates(e.g. [Nedic’09], [Duchi’12])

I Decreasing step sizeγk→0 is needed in general

I Sublinear converges rates

(14)

Distributed stochastic gradient algorithm

Under technical conditions,

Convergence (Bianchi et al.’12): xn^k tends to a KKT pointx^? Convergence rate (Morral et. al’12): Ifx^?∈int(C)

√γk

−1(xn^k−x^?)−→ N^L (0,ΣOPT+ ΣNET)

I ΣOPT is the covariance corresponding to the centralized setting

I ΣNET is the excess variance due to the distributed setting Remark: ΣNET = 0 for some protocols which can be characterized

(16)

Outline

(17)

Alternating Direction Method of Multipliers

Consider the generic problem

minx F(x) +G(Mx) whereF,G are convex. Rewrite as a constrained problem

z=MxminF(x) +G(z) The augmented Lagrangian is:

Lρ(x,z;λ) =F(x) +G(z) +hλ,Mx−zi+ρ

2kMx−zk² ADMM

x^k+1 = arg min

x Lρ(x,z^k;λ^k)→onlyF needed z^k+1 = arg min

z Lρ(x^k+1,z;λ^k)→onlyG needed λ^k+1 = λ^k+ρ(Mx^k+1−z^k+1)

(18)

Back to our problem

All functionsfn:X →Rare assumed convex. Consider the problem:

min

u∈X N

X

n=1

fn(u) Main trick: Define

F :x = (x1, . . . ,xN)7→X

n

fn(xn) Equivalent problem:

min

x∈X^NF(x) +ιsp(1)(x)

whereιsp(1)(x) =

0 ifx1=· · ·=xN

+∞ otherwise

I F is separable inx1, . . . ,xN

I G =ιsp(1)couples the variables but is simple

11/25

(19)

ADMM illustrated

Setx^k=_N¹P

nxn^k

Algorithm (see e.g. [Boyd’11])

For alln, λ^kn = λ^k−1n +ρ(xn^k−x^k) xn^k+1 = arg min

y fn(y) +ρ

2kx^k−ρ⁻¹λ^kn−yk²

.

x^kn

1. Transmit current estimates

.

(20)

ADMM illustrated

Setx^k=_N¹P

nxn^k

y fn(y) +ρ

.

x^k 2. Compute averagex^k

.

12/25

(21)

ADMM illustrated

Setx^k=_N¹P

nxn^k

y fn(y) +ρ

.

x^k

3. Transmitx^kto all agents

.

(22)

ADMM illustrated

Setx^k=_N¹P

nxn^k

y fn(y) +ρ

.

4. Computeλ^k_n,x^k+1_n for alln

.

12/25

(23)

ADMM illustrated

Setx^k=_N¹P

nx_n^k

For alln, λ^kn = λ^k−1n +ρ(xn^k−x^k) x_n^k+1 = arg min

y fn(y) +ρ

2kx^k−ρ⁻¹λ^k_n−yk²

.

4. Computeλ^k_n,x^k+1_n for alln

.

(24)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (¹₁) x2

x3

∈sp (¹₁)



 x3

x4

x5



∈sp₁

1 1

consensus within subgraphs⇔global consensus

13/25

(25)

Subgraph consensus

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (¹₁)

x2

x3

∈sp (¹₁)



 x3

x4

x5



∈sp₁

1 1

(26)

Subgraph consensus

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (¹₁) x2

x3

∈sp (¹₁)



 x3

x4

x5



∈sp₁

1 1

13/25

(27)

Subgraph consensus

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (¹₁) x2

x3

∈sp (¹₁)



 x3

x4

x5



∈sp₁

1 1

(28)

Subgraph consensus

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (¹₁) x2

x3

∈sp (¹₁)



 x3

x4

x5



∈sp₁

1 1

13/25

(29)

Example (Cont.)

The initial problem is

min

x∈X^NF(x) +G(Mx)

where Mx=





 x1

x3

x2

x3

x4

x5







that is: M=







1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1







and whereG is the indicator function of the subspace of vectors of the form





 α α β β δ δ δ







(30)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λ^kn = Λ^k−1n +ρ(xn^k−χ^kn)

xn^k+1 = arg min

y fn(y) +ρ|σn|

2 kχ^kn−ρ⁻¹Λ^kn−yk² where|σn|= number of “neighbors” ofn

.

Computex^k_A₁

.

1. For each subgraph, compute averagex^k_A_`

15/25

(31)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λ^k_n = Λ^k−1_n +ρ(x_n^k−χ^k_n)

x_n^k+1 = arg min

y fn(y) +ρ|σn|

2 kχ^kn−ρ⁻¹Λ^k_n−yk² where|σn|= number of “neighbors” ofn

.

Computex^k_A₂

.

(32)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λ^kn = Λ^k−1n +ρ(xn^k−χ^kn)

x_n^k+1 = arg min

y fn(y) +ρ|σn|

2 kχ^k_n−ρ⁻¹Λ^k_n−yk² where|σn|= number of “neighbors” ofn

.

Computex^k_A₃

.

1. For each subgraph, compute averagex^k_A_`

15/25

(33)

Distributed ADMM illustrated

x_n^k+1 = arg min

y fn(y) +ρ|σn|

.

k k

(34)

Distributed ADMM illustrated

x_n^k+1 = arg min

y fn(y) +ρ|σn|

.

3. For eachn, computeλ^k_n andx_n^k+1

15/25

(35)

Linear convergence of the Distributed ADMM

Assumption: H?:=P

n∇²fn(x?)>0 at the minimizerx?

kxn^k−x?k ∼α^k ask→ ∞

I [Shi et al.’ 13] non-asymptotic bound but pessimistic

I [Iutzeler et al.’ 16] asymptotic but tight

0 10 20 30 40 50 60 70 80 90 100

−0.2

−0.1 0 0.1 0.2 0.3 0.4 0.5

Number of iterations

Simulation log α = −0.19 Bound of Shi = −5e−4

1

klogkx^k−1⊗x?kas a function ofk

(36)

Example: ring network

^.

.

Defineα= limk→∞kxn^k−x?k^1/k Setfn:R→Randf_n⁰⁰(x?) =σ²

α≥

s 1 + cos^2π_N

2(1 + sin^2π_N) with equality when ρ= σ² 2 sin^2π_N

17/25

(37)

Asynchronous D-ADMM

I All agents must complete their arg min computation before combining

I The network waits for the slowest agents Our objective: allow for asynchronism

(38)

Revisiting ADMM as a fixed point algorithm

Setζ^k=λ^k+ρz^k. Fact: λ^k=P(ζ^k) wherePis a projection.

ADMM can be written as a fixed point algorithm[Gabay,83] [Eckstein,92]

ζ^k+1=J(ζ^k) whereJis firmly non-expansivei.e.,

kJ(x)−J(y)|²≤ kx−yk²− k(I−J)(x)−(I−J)(y)k²

.

x y

JA(y)

JA(x)

.

19/25

(39)

Random coordinate descent

Introducing the block-components ofζ^k+1=J(ζ^k) :





 ζ₁^k+1

.. . ζ_`^k+1

.. . ζ_L^k+1







=





 J1(ζ^k)

.. . J`(ζ^k)

.. . JL(ζ^k)







Convergence of the Asynchronous ADMM[Iutzeler’13]

This algorithm still converges if active components are chosen at random

Main idea: For a well-chosen norm|||.|||and a fixed pointζ^?of J, prove E

ζ^k+1−ζ^?

2

|Fk

≤ ζ^k−ζ^?

2

⇒ζ^k is getting “stochastically” closer toζ^?

(40)

Random coordinate descent

If only one block`=`(k+ 1) is active at timek+ 1:





 ζ₁^k+1

.. . ζ_`^k+1

.. . ζ_L^k+1







=





 ζ₁^k

.. . J`(ζ^k)

.. . ζ_L^k







ζ^k+1−ζ^?

2

|Fk

≤ ζ^k−ζ^?

2

20/25

(41)

Random coordinate descent

If only one block`=`(k+ 1) is active at timek+ 1:





 ζ₁^k+1

.. . ζ_`^k+1

.. . ζ_L^k+1







=





 ζ₁^k

.. . J`(ζ^k)

.. . ζ_L^k







ζ^k+1−ζ^?

2|Fk

≤ ζ^k−ζ^?

2

(42)

Asynchronous ADMM explicited

.

m n

.

Activate two nodesA`={m,n}

I Agentncomputes x_n^k+1= arg min

x fn(x)+X

j∼k

hx, λ^k_j,ni+ρ

2kx−x¯_j,n^k k² and similarly for Agentm.

I They exchangex_m^k+1andx_n^k+1

I Agentncomputes

¯ x_m,n^k+1=1

2(x_m^k+1+x_n^k+1), λ^k+1m,n =λ^km,n+ρx_n^k+1−x_m^k+1

2 and similarly for Agentm.

21/25

(43)

Asynchronous ADMM explicited

.

m n

.

x fn(x)+X

j∼k

hx, λ^k_j,ni+ρ

I Agentncomputes

¯ x_m,n^k+1=1

(44)

Asynchronous ADMM explicited

.

m n

.

x fn(x)+X

j∼k

hx, λ^k_j,ni+ρ

I Agentncomputes

¯ x_m,n^k+1=1

21/25

(45)

Generalization: Distributed V˜ u-Condat algorithm

I V˜u-Condat algorithm generalizes ADMM (allows “gradients” evaluations)

I Distributed V˜u-Condat algorithm is applicable using the same principle

I Bianchi’16, Fercoq’17 provide a random coordinate descent version

I The algorithm is asynchronous at thenode leveland not at theedge level

(46)

Stochastic Optimization

minx∈X N

X

n=1

E(fn(x, ξn))

I Law ofξnunknown, but revealedon-linethrough random copiesξ¹n, ξ²n, . . .

I Stochastic approximation: at timek, replace the unknown function E(fn(. , ξn)) by its random versionfn(. , ξ^k_n)

Example : stochastic gradient descent

I Thesis of A. Salim: Stochastic versions of generic optimization algorithms (Forward-Backward, Douglas-Rachford, ADMM, V˜u-Condat, etc.)

I Byproduct : distributed stochastic algorithms

23/25

(47)

Outline

(48)

Total variation regularization (1/2)

Notation:On a graphG= (V,E), the total variation ofx∈R^V is TV(x) = X

{i,j}∈E

|xi−xj|

General problem:

min

x∈R^V

F(x) + TV(x)

I Trend filtering: F(x) =¹₂kx−mk²wherem∈R^V are noisy measurements

I Graph inpainting: complete possibly missing measurements on the nodes Proximal gradient algorithm:

xn+1=prox_γTV(xn−γ∇F(xn))

I Computingprox_TVis difficult over large unstructured graphs

I But efficient algorithms exist for 1D-graphs(Mammen’97) (Condat’13)

24/25

(49)

Total variation regularization (2/2)

Write TV as an expectation: Letξbe simple random walk inG of fixed length TVG(x) ∝ E(TVξ(x))

Algorithm(Salim’16)At timen,

I Draw a random walkξn+1 I Computexn+1=prox_γ_n_TV

ξn+1(xn−γn∇F(xn))→easy, 1D Hidden difficulty: one should avoid loops when choosing the walk. . .

Trend filtering example. Cost function vs time(s). Stochastic block model 10⁵nodes, 25.10⁶edges.

Distributed Optimization in Multiagent Systems