Distributed Optimization in Multiagent Systems
The Consensus Problem
1 2
3 4
x
x
x x
1/25
The Consensus Problem
x + x + x + x
=
x
No single agent knows the target function to optimize
The Consensus Problem
x + x + x + x
=
x
No single agent knows the target function to optimize
1/25
Formally
minx∈X N
X
n=1
fn(x)
I N = number of nodes / agents
I X =Rd
I fn is thecost functionof agentn
I Two agentsnandmcan exchange messages ifn∼m Numerous works on that problem
Early work: Tsitsiklis ’84
Example #1: Wireless Sensor Networks
Yn= random observation of sensorn x = unknown parameter to be estimated
p(Y1,· · ·,YN;x) =p1(Y1;x)· · ·pN(YN;x) The maximum likelihood estimate writes
ˆ
x= arg max
x
X
n
lnpn(Yn;x)
[Schizas’08, Moura’11]
3/25
Example #2: Machine Learning
Data set formed byT samples (Xi,Yi) (i= 1. . .T)
I Yi = variable to be explained
I Xi = explanatory features minx
T
X
i=1
`(xTXi,Yi) +r(x) Split data intoNbatches
min
x N
X
n=1
X
i
`(xTXi,n,Yi,n) +r(x)
n.b.: some problems are more involved (I. Colin’16) min
x
X
i
X
j
f(x;Xi,Yi,Xj,Yj) +r(x)
Example #3: Resource Allocation
Letxn be the resource of an agentn
I Agents share a resourceb: X
n
xn≤b
I Agentngets rewardRn(xn) for using resourcexn I Maximize the global reward
max
x:P nxn≤b
N
X
n=1
Rn(xn) The dual of a sharing problem is a consensus problem
5/25
Networks
Parallel:
.
.
Distributed:
.
.
Outline
Distributed gradient descent
Distributed Alternating Direction Method of Multipliers (D-ADMM)
Total Variation Regularization on Graphs
Outline
Distributed gradient descent
Distributed Alternating Direction Method of Multipliers (D-ADMM)
Total Variation Regularization on Graphs
Adapt-and-combine (Tsitsiklis’84)
I [Local step]Each agentngenerates a temporary update
˜
xnk+1=xnk−γk∇fn(xnk)
I [Agreement step]Connected agents merge their temporary estimates xnk+1=X
m∼n
A(n,m) ˜xmk+1
whereAsatisfies technical constraints (must be doubly stochastic)
Convergence rates(e.g. [Nedic’09], [Duchi’12])
I Decreasing step sizeγk→0 is needed in general
I Sublinear converges rates
7/25
Adapt-and-combine (Tsitsiklis’84)
I [Local step]Each agentngenerates a temporary update
˜
xnk+1=xnk−γk∇fn(xnk)
I [Agreement step]Connected agents merge their temporary estimates xnk+1=X
m∼n
A(n,m) ˜xmk+1
whereAsatisfies technical constraints (must be doubly stochastic) Convergence rates(e.g. [Nedic’09], [Duchi’12])
I Decreasing step sizeγk→0 is needed in general
I Sublinear converges rates
More problems
1. Asynchronism
Some agents are active at timen, others aren’t Random link failures
2. Noise
Gradients may be observed up to a random noise (online algorithms) 3. Constraints
Minimize
N
X
n=1
fn(x) subject tox∈C
˜
xnk+1 = projC[xnk−γk(∇fn(xnk)+noise)]
xnk+1 = X
m∼n
Ak+1(n,m) ˜xmk+1
8/25
Distributed stochastic gradient algorithm
Under technical conditions,
Convergence (Bianchi et al.’12): xnk tends to a KKT pointx? Convergence rate (Morral et. al’12): Ifx?∈int(C)
√γk
−1(xnk−x?)−→ NL (0,ΣOPT+ ΣNET)
I ΣOPT is the covariance corresponding to the centralized setting
I ΣNET is the excess variance due to the distributed setting Remark: ΣNET = 0 for some protocols which can be characterized
Outline
Distributed gradient descent
Distributed Alternating Direction Method of Multipliers (D-ADMM)
Total Variation Regularization on Graphs
Alternating Direction Method of Multipliers
Consider the generic problem
minx F(x) +G(Mx) whereF,G are convex. Rewrite as a constrained problem
z=MxminF(x) +G(z) The augmented Lagrangian is:
Lρ(x,z;λ) =F(x) +G(z) +hλ,Mx−zi+ρ
2kMx−zk2 ADMM
xk+1 = arg min
x Lρ(x,zk;λk)→onlyF needed zk+1 = arg min
z Lρ(xk+1,z;λk)→onlyG needed λk+1 = λk+ρ(Mxk+1−zk+1)
Back to our problem
All functionsfn:X →Rare assumed convex. Consider the problem:
min
u∈X N
X
n=1
fn(u) Main trick: Define
F :x = (x1, . . . ,xN)7→X
n
fn(xn) Equivalent problem:
min
x∈XNF(x) +ιsp(1)(x)
whereιsp(1)(x) =
0 ifx1=· · ·=xN
+∞ otherwise
I F is separable inx1, . . . ,xN
I G =ιsp(1)couples the variables but is simple
11/25
ADMM illustrated
Setxk=N1P
nxnk
Algorithm (see e.g. [Boyd’11])
For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min
y fn(y) +ρ
2kxk−ρ−1λkn−yk2
.
xkn
1. Transmit current estimates
.
ADMM illustrated
Setxk=N1P
nxnk
Algorithm (see e.g. [Boyd’11])
For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min
y fn(y) +ρ
2kxk−ρ−1λkn−yk2
.
xk 2. Compute averagexk
.
12/25
ADMM illustrated
Setxk=N1P
nxnk
Algorithm (see e.g. [Boyd’11])
For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min
y fn(y) +ρ
2kxk−ρ−1λkn−yk2
.
xk
3. Transmitxkto all agents
.
ADMM illustrated
Setxk=N1P
nxnk
Algorithm (see e.g. [Boyd’11])
For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min
y fn(y) +ρ
2kxk−ρ−1λkn−yk2
.
4. Computeλkn,xk+1n for alln
.
12/25
ADMM illustrated
Setxk=N1P
nxnk
Algorithm (see e.g. [Boyd’11])
For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min
y fn(y) +ρ
2kxk−ρ−1λkn−yk2
.
4. Computeλkn,xk+1n for alln
.
Subgraph consensus
LetA1,A2,· · ·,ALbe subsets of agents
.
2
4
5 3
1
.
A1={1,3},A2={2,3},A3={3,4,5}
x1
x3
∈sp (11) x2
x3
∈sp (11)
x3
x4
x5
∈sp1
1 1
consensus within subgraphs⇔global consensus
13/25
Subgraph consensus
LetA1,A2,· · ·,ALbe subsets of agents
.
2
4
5 3
1
.
A1={1,3},A2={2,3},A3={3,4,5}
x1
x3
∈sp (11)
x2
x3
∈sp (11)
x3
x4
x5
∈sp1
1 1
consensus within subgraphs⇔global consensus
Subgraph consensus
LetA1,A2,· · ·,ALbe subsets of agents
.
2
4
5 3
1
.
A1={1,3},A2={2,3},A3={3,4,5}
x1
x3
∈sp (11) x2
x3
∈sp (11)
x3
x4
x5
∈sp1
1 1
consensus within subgraphs⇔global consensus
13/25
Subgraph consensus
LetA1,A2,· · ·,ALbe subsets of agents
.
2
4
5 3
1
.
A1={1,3},A2={2,3},A3={3,4,5}
x1
x3
∈sp (11) x2
x3
∈sp (11)
x3
x4
x5
∈sp1
1 1
consensus within subgraphs⇔global consensus
Subgraph consensus
LetA1,A2,· · ·,ALbe subsets of agents
.
2
4
5 3
1
.
A1={1,3},A2={2,3},A3={3,4,5}
x1
x3
∈sp (11) x2
x3
∈sp (11)
x3
x4
x5
∈sp1
1 1
consensus within subgraphs⇔global consensus
13/25
Example (Cont.)
The initial problem is
min
x∈XNF(x) +G(Mx)
where Mx=
x1
x3
x2
x3
x3
x4
x5
that is: M=
1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
and whereG is the indicator function of the subspace of vectors of the form
α α β β δ δ δ
Distributed ADMM illustrated
Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)
xnk+1 = arg min
y fn(y) +ρ|σn|
2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn
.
ComputexkA1
.
1. For each subgraph, compute averagexkA`
15/25
Distributed ADMM illustrated
Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)
xnk+1 = arg min
y fn(y) +ρ|σn|
2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn
.
ComputexkA2
.
Distributed ADMM illustrated
Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)
xnk+1 = arg min
y fn(y) +ρ|σn|
2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn
.
ComputexkA3
.
1. For each subgraph, compute averagexkA`
15/25
Distributed ADMM illustrated
Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)
xnk+1 = arg min
y fn(y) +ρ|σn|
2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn
.
.
k k
Distributed ADMM illustrated
Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)
xnk+1 = arg min
y fn(y) +ρ|σn|
2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn
.
.
3. For eachn, computeλkn andxnk+1
15/25
Linear convergence of the Distributed ADMM
Assumption: H?:=P
n∇2fn(x?)>0 at the minimizerx?
kxnk−x?k ∼αk ask→ ∞
I [Shi et al.’ 13] non-asymptotic bound but pessimistic
I [Iutzeler et al.’ 16] asymptotic but tight
0 10 20 30 40 50 60 70 80 90 100
−0.2
−0.1 0 0.1 0.2 0.3 0.4 0.5
Number of iterations
Simulation log α = −0.19 Bound of Shi = −5e−4
1
klogkxk−1⊗x?kas a function ofk
Example: ring network
..
Defineα= limk→∞kxnk−x?k1/k Setfn:R→Randfn00(x?) =σ2
α≥
s 1 + cos2πN
2(1 + sin2πN) with equality when ρ= σ2 2 sin2πN
17/25
Asynchronous D-ADMM
I All agents must complete their arg min computation before combining
I The network waits for the slowest agents Our objective: allow for asynchronism
Revisiting ADMM as a fixed point algorithm
Setζk=λk+ρzk. Fact: λk=P(ζk) wherePis a projection.
ADMM can be written as a fixed point algorithm[Gabay,83] [Eckstein,92]
ζk+1=J(ζk) whereJis firmly non-expansivei.e.,
kJ(x)−J(y)|2≤ kx−yk2− k(I−J)(x)−(I−J)(y)k2
.
x y
JA(y)
JA(x)
.
19/25
Random coordinate descent
Introducing the block-components ofζk+1=J(ζk) :
ζ1k+1
.. . ζ`k+1
.. . ζLk+1
=
J1(ζk)
.. . J`(ζk)
.. . JL(ζk)
Convergence of the Asynchronous ADMM[Iutzeler’13]
This algorithm still converges if active components are chosen at random
Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E
ζk+1−ζ?
2
|Fk
≤ ζk−ζ?
2
⇒ζk is getting “stochastically” closer toζ?
Random coordinate descent
If only one block`=`(k+ 1) is active at timek+ 1:
ζ1k+1
.. . ζ`k+1
.. . ζLk+1
=
ζ1k
.. . J`(ζk)
.. . ζLk
Convergence of the Asynchronous ADMM[Iutzeler’13]
This algorithm still converges if active components are chosen at random
Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E
ζk+1−ζ?
2
|Fk
≤ ζk−ζ?
2
⇒ζk is getting “stochastically” closer toζ?
20/25
Random coordinate descent
If only one block`=`(k+ 1) is active at timek+ 1:
ζ1k+1
.. . ζ`k+1
.. . ζLk+1
=
ζ1k
.. . J`(ζk)
.. . ζLk
Convergence of the Asynchronous ADMM[Iutzeler’13]
This algorithm still converges if active components are chosen at random
Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E
ζk+1−ζ?
2|Fk
≤ ζk−ζ?
2
⇒ζk is getting “stochastically” closer toζ?
Asynchronous ADMM explicited
.
m n
.
Activate two nodesA`={m,n}
I Agentncomputes xnk+1= arg min
x fn(x)+X
j∼k
hx, λkj,ni+ρ
2kx−x¯j,nk k2 and similarly for Agentm.
I They exchangexmk+1andxnk+1
I Agentncomputes
¯ xm,nk+1=1
2(xmk+1+xnk+1), λk+1m,n =λkm,n+ρxnk+1−xmk+1
2 and similarly for Agentm.
21/25
Asynchronous ADMM explicited
.
m n
.
Activate two nodesA`={m,n}
I Agentncomputes xnk+1= arg min
x fn(x)+X
j∼k
hx, λkj,ni+ρ
2kx−x¯j,nk k2 and similarly for Agentm.
I They exchangexmk+1andxnk+1
I Agentncomputes
¯ xm,nk+1=1
2(xmk+1+xnk+1), λk+1m,n =λkm,n+ρxnk+1−xmk+1
2 and similarly for Agentm.
Asynchronous ADMM explicited
.
m n
.
Activate two nodesA`={m,n}
I Agentncomputes xnk+1= arg min
x fn(x)+X
j∼k
hx, λkj,ni+ρ
2kx−x¯j,nk k2 and similarly for Agentm.
I They exchangexmk+1andxnk+1
I Agentncomputes
¯ xm,nk+1=1
2(xmk+1+xnk+1), λk+1m,n =λkm,n+ρxnk+1−xmk+1
2 and similarly for Agentm.
21/25
Generalization: Distributed V˜ u-Condat algorithm
I V˜u-Condat algorithm generalizes ADMM (allows “gradients” evaluations)
I Distributed V˜u-Condat algorithm is applicable using the same principle
I Bianchi’16, Fercoq’17 provide a random coordinate descent version
I The algorithm is asynchronous at thenode leveland not at theedge level
Stochastic Optimization
minx∈X N
X
n=1
E(fn(x, ξn))
I Law ofξnunknown, but revealedon-linethrough random copiesξ1n, ξ2n, . . .
I Stochastic approximation: at timek, replace the unknown function E(fn(. , ξn)) by its random versionfn(. , ξkn)
Example : stochastic gradient descent
I Thesis of A. Salim: Stochastic versions of generic optimization algorithms (Forward-Backward, Douglas-Rachford, ADMM, V˜u-Condat, etc.)
I Byproduct : distributed stochastic algorithms
23/25
Outline
Distributed gradient descent
Distributed Alternating Direction Method of Multipliers (D-ADMM)
Total Variation Regularization on Graphs
Total variation regularization (1/2)
Notation:On a graphG= (V,E), the total variation ofx∈RV is TV(x) = X
{i,j}∈E
|xi−xj|
General problem:
min
x∈RV
F(x) + TV(x)
I Trend filtering: F(x) =12kx−mk2wherem∈RV are noisy measurements
I Graph inpainting: complete possibly missing measurements on the nodes Proximal gradient algorithm:
xn+1=proxγTV(xn−γ∇F(xn))
I ComputingproxTVis difficult over large unstructured graphs
I But efficient algorithms exist for 1D-graphs(Mammen’97) (Condat’13)
24/25
Total variation regularization (2/2)
Write TV as an expectation: Letξbe simple random walk inG of fixed length TVG(x) ∝ E(TVξ(x))
Algorithm(Salim’16)At timen,
I Draw a random walkξn+1 I Computexn+1=proxγnTV
ξn+1(xn−γn∇F(xn))→easy, 1D Hidden difficulty: one should avoid loops when choosing the walk. . .
Trend filtering example. Cost function vs time(s). Stochastic block model 105nodes, 25.106edges.