• Aucun résultat trouvé

Distributed Optimization in Multiagent Systems

N/A
N/A
Protected

Academic year: 2022

Partager "Distributed Optimization in Multiagent Systems"

Copied!
49
0
0

Texte intégral

(1)

Distributed Optimization in Multiagent Systems

(2)

The Consensus Problem

1 2

3 4

x

x

x x

1/25

(3)

The Consensus Problem

x + x + x + x

=

x

No single agent knows the target function to optimize

(4)

The Consensus Problem

x + x + x + x

=

x

No single agent knows the target function to optimize

1/25

(5)

Formally

minx∈X N

X

n=1

fn(x)

I N = number of nodes / agents

I X =Rd

I fn is thecost functionof agentn

I Two agentsnandmcan exchange messages ifn∼m Numerous works on that problem

Early work: Tsitsiklis ’84

(6)

Example #1: Wireless Sensor Networks

Yn= random observation of sensorn x = unknown parameter to be estimated

p(Y1,· · ·,YN;x) =p1(Y1;x)· · ·pN(YN;x) The maximum likelihood estimate writes

ˆ

x= arg max

x

X

n

lnpn(Yn;x)

[Schizas’08, Moura’11]

3/25

(7)

Example #2: Machine Learning

Data set formed byT samples (Xi,Yi) (i= 1. . .T)

I Yi = variable to be explained

I Xi = explanatory features minx

T

X

i=1

`(xTXi,Yi) +r(x) Split data intoNbatches

min

x N

X

n=1

X

i

`(xTXi,n,Yi,n) +r(x)

n.b.: some problems are more involved (I. Colin’16) min

x

X

i

X

j

f(x;Xi,Yi,Xj,Yj) +r(x)

(8)

Example #3: Resource Allocation

Letxn be the resource of an agentn

I Agents share a resourceb: X

n

xn≤b

I Agentngets rewardRn(xn) for using resourcexn I Maximize the global reward

max

x:P nxn≤b

N

X

n=1

Rn(xn) The dual of a sharing problem is a consensus problem

5/25

(9)

Networks

Parallel:

.

.

Distributed:

.

.

(10)

Outline

Distributed gradient descent

Distributed Alternating Direction Method of Multipliers (D-ADMM)

Total Variation Regularization on Graphs

(11)

Outline

Distributed gradient descent

Distributed Alternating Direction Method of Multipliers (D-ADMM)

Total Variation Regularization on Graphs

(12)

Adapt-and-combine (Tsitsiklis’84)

I [Local step]Each agentngenerates a temporary update

˜

xnk+1=xnk−γk∇fn(xnk)

I [Agreement step]Connected agents merge their temporary estimates xnk+1=X

m∼n

A(n,m) ˜xmk+1

whereAsatisfies technical constraints (must be doubly stochastic)

Convergence rates(e.g. [Nedic’09], [Duchi’12])

I Decreasing step sizeγk→0 is needed in general

I Sublinear converges rates

7/25

(13)

Adapt-and-combine (Tsitsiklis’84)

I [Local step]Each agentngenerates a temporary update

˜

xnk+1=xnk−γk∇fn(xnk)

I [Agreement step]Connected agents merge their temporary estimates xnk+1=X

m∼n

A(n,m) ˜xmk+1

whereAsatisfies technical constraints (must be doubly stochastic) Convergence rates(e.g. [Nedic’09], [Duchi’12])

I Decreasing step sizeγk→0 is needed in general

I Sublinear converges rates

(14)

More problems

1. Asynchronism

Some agents are active at timen, others aren’t Random link failures

2. Noise

Gradients may be observed up to a random noise (online algorithms) 3. Constraints

Minimize

N

X

n=1

fn(x) subject tox∈C

˜

xnk+1 = projC[xnk−γk(∇fn(xnk)+noise)]

xnk+1 = X

m∼n

Ak+1(n,m) ˜xmk+1

8/25

(15)

Distributed stochastic gradient algorithm

Under technical conditions,

Convergence (Bianchi et al.’12): xnk tends to a KKT pointx? Convergence rate (Morral et. al’12): Ifx?∈int(C)

√γk

−1(xnk−x?)−→ NL (0,ΣOPT+ ΣNET)

I ΣOPT is the covariance corresponding to the centralized setting

I ΣNET is the excess variance due to the distributed setting Remark: ΣNET = 0 for some protocols which can be characterized

(16)

Outline

Distributed gradient descent

Distributed Alternating Direction Method of Multipliers (D-ADMM)

Total Variation Regularization on Graphs

(17)

Alternating Direction Method of Multipliers

Consider the generic problem

minx F(x) +G(Mx) whereF,G are convex. Rewrite as a constrained problem

z=MxminF(x) +G(z) The augmented Lagrangian is:

Lρ(x,z;λ) =F(x) +G(z) +hλ,Mx−zi+ρ

2kMx−zk2 ADMM

xk+1 = arg min

x Lρ(x,zkk)→onlyF needed zk+1 = arg min

z Lρ(xk+1,z;λk)→onlyG needed λk+1 = λk+ρ(Mxk+1−zk+1)

(18)

Back to our problem

All functionsfn:X →Rare assumed convex. Consider the problem:

min

u∈X N

X

n=1

fn(u) Main trick: Define

F :x = (x1, . . . ,xN)7→X

n

fn(xn) Equivalent problem:

min

x∈XNF(x) +ιsp(1)(x)

whereιsp(1)(x) =

0 ifx1=· · ·=xN

+∞ otherwise

I F is separable inx1, . . . ,xN

I G =ιsp(1)couples the variables but is simple

11/25

(19)

ADMM illustrated

Setxk=N1P

nxnk

Algorithm (see e.g. [Boyd’11])

For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min

y fn(y) +ρ

2kxk−ρ−1λkn−yk2

.

xkn

1. Transmit current estimates

.

(20)

ADMM illustrated

Setxk=N1P

nxnk

Algorithm (see e.g. [Boyd’11])

For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min

y fn(y) +ρ

2kxk−ρ−1λkn−yk2

.

xk 2. Compute averagexk

.

12/25

(21)

ADMM illustrated

Setxk=N1P

nxnk

Algorithm (see e.g. [Boyd’11])

For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min

y fn(y) +ρ

2kxk−ρ−1λkn−yk2

.

xk

3. Transmitxkto all agents

.

(22)

ADMM illustrated

Setxk=N1P

nxnk

Algorithm (see e.g. [Boyd’11])

For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min

y fn(y) +ρ

2kxk−ρ−1λkn−yk2

.

4. Computeλkn,xk+1n for alln

.

12/25

(23)

ADMM illustrated

Setxk=N1P

nxnk

Algorithm (see e.g. [Boyd’11])

For alln, λkn = λk−1n +ρ(xnk−xk) xnk+1 = arg min

y fn(y) +ρ

2kxk−ρ−1λkn−yk2

.

4. Computeλkn,xk+1n for alln

.

(24)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (11) x2

x3

∈sp (11)

 x3

x4

x5

∈sp1

1 1

consensus within subgraphs⇔global consensus

13/25

(25)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (11)

x2

x3

∈sp (11)

 x3

x4

x5

∈sp1

1 1

consensus within subgraphs⇔global consensus

(26)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (11) x2

x3

∈sp (11)

 x3

x4

x5

∈sp1

1 1

consensus within subgraphs⇔global consensus

13/25

(27)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (11) x2

x3

∈sp (11)

 x3

x4

x5

∈sp1

1 1

consensus within subgraphs⇔global consensus

(28)

Subgraph consensus

LetA1,A2,· · ·,ALbe subsets of agents

.

2

4

5 3

1

.

A1={1,3},A2={2,3},A3={3,4,5}

x1

x3

∈sp (11) x2

x3

∈sp (11)

 x3

x4

x5

∈sp1

1 1

consensus within subgraphs⇔global consensus

13/25

(29)

Example (Cont.)

The initial problem is

min

x∈XNF(x) +G(Mx)

where Mx=

 x1

x3

x2

x3

x3

x4

x5

that is: M=

1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1

and whereG is the indicator function of the subspace of vectors of the form

 α α β β δ δ δ

(30)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)

xnk+1 = arg min

y fn(y) +ρ|σn|

2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn

.

ComputexkA1

.

1. For each subgraph, compute averagexkA`

15/25

(31)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)

xnk+1 = arg min

y fn(y) +ρ|σn|

2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn

.

ComputexkA2

.

(32)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)

xnk+1 = arg min

y fn(y) +ρ|σn|

2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn

.

ComputexkA3

.

1. For each subgraph, compute averagexkA`

15/25

(33)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)

xnk+1 = arg min

y fn(y) +ρ|σn|

2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn

.

.

k k

(34)

Distributed ADMM illustrated

Distributed ADMM (early works by [Schizas’08]) For alln, Λkn = Λk−1n +ρ(xnk−χkn)

xnk+1 = arg min

y fn(y) +ρ|σn|

2 kχkn−ρ−1Λkn−yk2 where|σn|= number of “neighbors” ofn

.

.

3. For eachn, computeλkn andxnk+1

15/25

(35)

Linear convergence of the Distributed ADMM

Assumption: H?:=P

n2fn(x?)>0 at the minimizerx?

kxnk−x?k ∼αk ask→ ∞

I [Shi et al.’ 13] non-asymptotic bound but pessimistic

I [Iutzeler et al.’ 16] asymptotic but tight

0 10 20 30 40 50 60 70 80 90 100

−0.2

−0.1 0 0.1 0.2 0.3 0.4 0.5

Number of iterations

Simulation log α = −0.19 Bound of Shi = −5e−4

1

klogkxk1x?kas a function ofk

(36)

Example: ring network

.

.

Defineα= limk→∞kxnk−x?k1/k Setfn:R→Randfn00(x?) =σ2

α≥

s 1 + cosN

2(1 + sinN) with equality when ρ= σ2 2 sinN

17/25

(37)

Asynchronous D-ADMM

I All agents must complete their arg min computation before combining

I The network waits for the slowest agents Our objective: allow for asynchronism

(38)

Revisiting ADMM as a fixed point algorithm

Setζkk+ρzk. Fact: λk=P(ζk) wherePis a projection.

ADMM can be written as a fixed point algorithm[Gabay,83] [Eckstein,92]

ζk+1=J(ζk) whereJis firmly non-expansivei.e.,

kJ(x)−J(y)|2≤ kx−yk2− k(I−J)(x)−(I−J)(y)k2

.

x y

JA(y)

JA(x)

.

19/25

(39)

Random coordinate descent

Introducing the block-components ofζk+1=J(ζk) :

 ζ1k+1

.. . ζ`k+1

.. . ζLk+1

=

 J1k)

.. . J`k)

.. . JLk)

Convergence of the Asynchronous ADMM[Iutzeler’13]

This algorithm still converges if active components are chosen at random

Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E

ζk+1−ζ?

2

|Fk

≤ ζk−ζ?

2

⇒ζk is getting “stochastically” closer toζ?

(40)

Random coordinate descent

If only one block`=`(k+ 1) is active at timek+ 1:

 ζ1k+1

.. . ζ`k+1

.. . ζLk+1

=

 ζ1k

.. . J`k)

.. . ζLk

Convergence of the Asynchronous ADMM[Iutzeler’13]

This algorithm still converges if active components are chosen at random

Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E

ζk+1−ζ?

2

|Fk

≤ ζk−ζ?

2

⇒ζk is getting “stochastically” closer toζ?

20/25

(41)

Random coordinate descent

If only one block`=`(k+ 1) is active at timek+ 1:

 ζ1k+1

.. . ζ`k+1

.. . ζLk+1

=

 ζ1k

.. . J`k)

.. . ζLk

Convergence of the Asynchronous ADMM[Iutzeler’13]

This algorithm still converges if active components are chosen at random

Main idea: For a well-chosen norm|||.|||and a fixed pointζ?of J, prove E

ζk+1−ζ?

2|Fk

≤ ζk−ζ?

2

⇒ζk is getting “stochastically” closer toζ?

(42)

Asynchronous ADMM explicited

.

m n

.

Activate two nodesA`={m,n}

I Agentncomputes xnk+1= arg min

x fn(x)+X

j∼k

hx, λkj,ni+ρ

2kx−x¯j,nk k2 and similarly for Agentm.

I They exchangexmk+1andxnk+1

I Agentncomputes

¯ xm,nk+1=1

2(xmk+1+xnk+1), λk+1m,nkm,n+ρxnk+1−xmk+1

2 and similarly for Agentm.

21/25

(43)

Asynchronous ADMM explicited

.

m n

.

Activate two nodesA`={m,n}

I Agentncomputes xnk+1= arg min

x fn(x)+X

j∼k

hx, λkj,ni+ρ

2kx−x¯j,nk k2 and similarly for Agentm.

I They exchangexmk+1andxnk+1

I Agentncomputes

¯ xm,nk+1=1

2(xmk+1+xnk+1), λk+1m,nkm,n+ρxnk+1−xmk+1

2 and similarly for Agentm.

(44)

Asynchronous ADMM explicited

.

m n

.

Activate two nodesA`={m,n}

I Agentncomputes xnk+1= arg min

x fn(x)+X

j∼k

hx, λkj,ni+ρ

2kx−x¯j,nk k2 and similarly for Agentm.

I They exchangexmk+1andxnk+1

I Agentncomputes

¯ xm,nk+1=1

2(xmk+1+xnk+1), λk+1m,nkm,n+ρxnk+1−xmk+1

2 and similarly for Agentm.

21/25

(45)

Generalization: Distributed V˜ u-Condat algorithm

I V˜u-Condat algorithm generalizes ADMM (allows “gradients” evaluations)

I Distributed V˜u-Condat algorithm is applicable using the same principle

I Bianchi’16, Fercoq’17 provide a random coordinate descent version

I The algorithm is asynchronous at thenode leveland not at theedge level

(46)

Stochastic Optimization

minx∈X N

X

n=1

E(fn(x, ξn))

I Law ofξnunknown, but revealedon-linethrough random copiesξ1n, ξ2n, . . .

I Stochastic approximation: at timek, replace the unknown function E(fn(. , ξn)) by its random versionfn(. , ξkn)

Example : stochastic gradient descent

I Thesis of A. Salim: Stochastic versions of generic optimization algorithms (Forward-Backward, Douglas-Rachford, ADMM, V˜u-Condat, etc.)

I Byproduct : distributed stochastic algorithms

23/25

(47)

Outline

Distributed gradient descent

Distributed Alternating Direction Method of Multipliers (D-ADMM)

Total Variation Regularization on Graphs

(48)

Total variation regularization (1/2)

Notation:On a graphG= (V,E), the total variation ofx∈RV is TV(x) = X

{i,j}∈E

|xi−xj|

General problem:

min

x∈RV

F(x) + TV(x)

I Trend filtering: F(x) =12kx−mk2wherem∈RV are noisy measurements

I Graph inpainting: complete possibly missing measurements on the nodes Proximal gradient algorithm:

xn+1=proxγTV(xn−γ∇F(xn))

I ComputingproxTVis difficult over large unstructured graphs

I But efficient algorithms exist for 1D-graphs(Mammen’97) (Condat’13)

24/25

(49)

Total variation regularization (2/2)

Write TV as an expectation: Letξbe simple random walk inG of fixed length TVG(x) ∝ E(TVξ(x))

Algorithm(Salim’16)At timen,

I Draw a random walkξn+1 I Computexn+1=proxγnTV

ξn+1(xn−γn∇F(xn))→easy, 1D Hidden difficulty: one should avoid loops when choosing the walk. . .

Trend filtering example. Cost function vs time(s). Stochastic block model 105nodes, 25.106edges.

Références

Documents relatifs

For general games, there might be non-convergence, but when the convergence of the ODE holds, considered stochastic algorithms converge towards Nash equilibria.. For games admitting

Fortunately, this is possible to go further, observing that many of the previous classes (ordinal, (exact) potential, continuous potential, load balancing games, congestion games,

P2P stochastic bandits and our results In our P2P bandit setup, we assume that each of the N peers has access to the same set of K arms (with the same unknown distributions that

Recently, Munos ( 2011 ) proposed the SOO algorithm for deterministic optimization, that assumes that f is locally smooth with respect to some semi-metric ℓ, but that this

The communication layer may estimate some parameters of the OS/library communication layer (the same parameters the application layer can estimate regarding the

– Finite Time Activity Identification Under the assumption that the non- smooth component of the optimization problem is partly smooth around a global minimizer relative to its

In this section, we apply the average-based distributed algorithm studied above to derive, from local average-based interactions, two global properties of our system: first

The main contribution of this work is the derivation of the convergence properties of matrix exponential learning in games played over bounded regions of positive-semidefinite