Communication-Efficient Distributionally Robust Decentralized Learning
Matteo Zecchin
SSIE 2021 - Student Workshop
Decentralized learning
Data-driven optimization methods deal with datasets distributed across multiple computing nodes.
The customary system objective $L(\cdot)$ is the sum of the loss terms at the network nodes $\{\ell_i(\cdot)\}_{i=1}^m$.
Network Objective
The goal is to find a configuration of the model parameters θ that minimizes the function L(θ).
D-SGD
Decentralized (stochastic) gradient descent (D-SGD) is a versatile tool that harnesses data and computational resources in a privacy-preserving and fault-tolerant manner:
• Local data is never off-loaded and used only for local updates.
• Flexible w.r.t. communication topology.
Local update phase
Communication phase
Averaging phase
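The three phases can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function name `dsgd_round`, the quadratic toy losses, the 3-node path graph and its mixing weights are all hypothetical choices made for the example.

```python
import numpy as np

def dsgd_round(thetas, grads, W, lr):
    """One D-SGD round for m nodes (illustrative sketch).

    thetas: (m, d) array, row i is the model held by node i.
    grads:  (m, d) array, row i is node i's local gradient at its own model.
    W:      (m, m) doubly stochastic mixing matrix, W[i, j] > 0 only if
            nodes i and j are neighbors (or i == j).
    """
    # Local update phase: each node steps along its own gradient only.
    local = thetas - lr * grads
    # Communication + averaging phases: each node receives its neighbors'
    # models and replaces its own with the W-weighted average.
    return W @ local

# Toy problem (assumed): node i holds the loss 0.5 * ||theta - c_i||^2.
centers = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]])
# Path graph 0 - 1 - 2 with symmetric, doubly stochastic weights (assumed).
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])

thetas = np.zeros_like(centers)
for _ in range(500):
    grads = thetas - centers  # exact gradients here; D-SGD would use stochastic ones
    thetas = dsgd_round(thetas, grads, W, lr=0.1)

# The average of the node models approaches the minimizer of the summed loss.
network_average = thetas.mean(axis=0)
```

Raw data never leaves a node in this sketch: only the model rows are mixed through `W`. With a constant step size the individual node models agree only approximately, while their average converges to the minimizer of the network objective.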
Fairness through distributional robustness
Data heterogeneity entails a fundamental challenge, deeply linked to the notion of fairness.
The vanilla formulation of the loss function may neglect hard and under-represented sub-populations.
Example: drug discovery with world-wide data and local extraneous factors.
The minimizer of the average loss formulation may perform badly, and dangerously so, in Asia.
Distributionally Robust Formulation
Minimax formulation of the learning process, so as to focus on worst-case performance.
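Given a collection of differently distributed datasets $D_i \sim P_i^{\otimes n_i}$ for $i = 1, \dots, m$, define the convex hull
$$\mathcal{P} := \left\{ \sum_i \lambda_i P_i \;:\; \lambda_i \ge 0,\ \sum_i \lambda_i = 1 \right\}$$
and the distributionally robust objective
$$\max_{P \in \mathcal{P}} \min_{\theta \in \mathbb{R}^d} \mathbb{E}_{z \sim P}[\ell(\theta, z)]$$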
Interpretation: a two-player game in which player one tunes θ to minimize the loss w.r.t. P, while player two chooses the most challenging distribution P for the current model θ.
Fairness and distributional robustness have previously been considered in federated learning¹.
¹ Mohri, Mehryar, Gary Sivek, and Ananda Theertha Suresh. "Agnostic federated learning." International Conference on Machine Learning. 2019.
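For a fixed model θ, the expected loss under a mixture of the local distributions with weights λ is linear in λ, so the most challenging mixture puts all its weight on the node with the largest local loss. The sketch below checks this numerically. The loss values are made-up numbers for illustration, and `mixture_loss` and `worst_case_loss` are hypothetical helper names.

```python
import numpy as np

def mixture_loss(lam, local_losses):
    """Expected loss under the mixture sum_i lam_i * P_i, for a fixed model."""
    return float(np.dot(lam, local_losses))

def worst_case_loss(local_losses):
    """Maximum of the mixture loss over the simplex: attained at a vertex."""
    return float(np.max(local_losses))

# Made-up local losses of one fixed model on m = 4 nodes (illustrative only).
local_losses = np.array([0.31, 0.27, 0.92, 0.40])

uniform = np.full(4, 0.25)
average = mixture_loss(uniform, local_losses)  # what the vanilla objective sees
worst = worst_case_loss(local_losses)          # what the adversary can reach

# No mixture weight vector sampled from the simplex beats the worst node.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(4), size=1000)
never_exceeds = all(mixture_loss(lam, local_losses) <= worst + 1e-12
                    for lam in samples)
```

The gap between `average` and `worst` is exactly what the average-loss formulation hides: a model can look good on average while one sub-population is served poorly.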
Agnostic Decentralized Gradient Descent Ascent (AD-GDA)
We propose AD-GDA, a distributionally robust learning algorithm for fully decentralized learning.
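AD-GDA objective:
$$\max_{\lambda \in \Lambda} \min_{\theta \in \mathbb{R}^d} \underbrace{\sum_{i=1}^m \lambda_i \mathbb{E}_{z \sim P_i}[\ell(\theta, z)]}_{\text{adversarial loss mixture}} + \underbrace{\mu r(\lambda)}_{\text{regularizer to prevent degenerate adversary config.}}$$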
Key features of AD-GDA
• Single-loop ascent-descent: computing nodes query the local gradient oracle once and perform, in parallel, a gradient descent step in θ and an ascent step in λ.
• Communication efficient: nodes exploit compressed gossip to reduce the network communication load.
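The single-loop structure can be illustrated with a centralized toy version of gradient descent ascent: one evaluation of the local losses and gradients feeds both a descent step in θ and a projected ascent step in λ. This is only a sketch under several assumptions. It omits the gossip and compression that make AD-GDA decentralized, it uses the regularizer r(λ) = -½‖λ - 1/m‖² (an assumed choice, the slides do not specify r), and the names `project_simplex` and `gda_step`, the quadratic losses and the step sizes are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def gda_step(theta, lam, losses, grads, mu, lr_theta, lr_lam):
    """One simultaneous descent (theta) / ascent (lambda) step.

    losses: (m,) local losses at theta; grads: (m, d) local gradients at theta.
    Assumed regularizer: r(lam) = -0.5 * ||lam - 1/m||^2.
    """
    m = len(lam)
    grad_theta = lam @ grads                  # gradient of sum_i lam_i * L_i(theta)
    grad_lam = losses - mu * (lam - 1.0 / m)  # gradient of mixture + mu * r(lam)
    new_theta = theta - lr_theta * grad_theta             # descent step
    new_lam = project_simplex(lam + lr_lam * grad_lam)    # projected ascent step
    return new_theta, new_lam

# Toy problem (assumed): L_i(theta) = 0.5 * (theta - c_i)^2, one outlier node.
centers = np.array([[0.0], [0.0], [4.0]])
theta, lam = np.zeros(1), np.full(3, 1.0 / 3.0)
for _ in range(5000):
    losses = 0.5 * np.sum((theta - centers) ** 2, axis=1)  # queried once per step
    grads = theta - centers
    # The theta step size is chosen smaller than the lambda step size.
    theta, lam = gda_step(theta, lam, losses, grads, mu=1.0,
                          lr_theta=0.01, lr_lam=0.02)
```

On this toy problem the minimizer of the average loss is θ = 4/3, whereas the iterates settle near θ = 68/35 ≈ 1.94 with extra weight on the outlier node, which is the saddle point of this particular regularized problem.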
Guarantees
AD-GDA is a provably convergent algorithm: we provide convergence guarantees in both convex and non-convex settings.
• In the convex scenario, AD-GDA approaches a solution with optimality gap $O(1/\sqrt{T})$.
• In the non-convex scenario, AD-GDA returns an approximate stationary solution with gradient vanishing as $O(1/\sqrt{T})$.
Key proof ingredient: the step size for θ must be smaller than the one used to update λ.
⇓
Because θ changes slowly, the optimization problem faced by the adversary changes slowly enough for λ to be optimized.
Experiment
Set-up:
• Classification task on the CIFAR-10 dataset, with data scattered over 10 nodes connected in a ring topology.
• One node in the system stores images with lower contrast and one stores images with higher contrast.
Figure: sample images with (a) low contrast, (b) original contrast, (c) high contrast.
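The set-up can be mocked up as follows. The slides do not state the mixing weights or the contrast transformation, so both are assumptions here: each ring node averages itself and its two neighbors with weight 1/3, and contrast is changed by scaling pixel deviations from the image mean. `ring_mixing_matrix` and `adjust_contrast` are hypothetical helper names, and a random array stands in for a CIFAR-10 image.

```python
import numpy as np

def ring_mixing_matrix(m):
    """Symmetric, doubly stochastic mixing matrix for a ring of m >= 3 nodes.

    Each node weights itself and its two neighbors by 1/3 (assumed choice).
    """
    eye = np.eye(m)
    return (eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)) / 3.0

def adjust_contrast(image, factor):
    """Scale deviations from the image mean; assumes pixel values in [0, 1].

    factor < 1 lowers the contrast, factor > 1 raises it.
    """
    mean = image.mean()
    return np.clip(mean + factor * (image - mean), 0.0, 1.0)

W = ring_mixing_matrix(10)
# Spectral gap: 1 minus the second-largest eigenvalue magnitude of W.
eigvals = np.sort(np.abs(np.linalg.eigvalsh(W)))
spectral_gap = 1.0 - eigvals[-2]

# Stand-in for a 32x32 RGB image (random pixels, not real CIFAR-10 data).
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(32, 32, 3))
low = adjust_contrast(image, 0.5)
high = adjust_contrast(image, 1.5)
```

A ring has a small spectral gap compared with denser graphs, so information spreads slowly between nodes. That makes it a demanding topology for a gossip-based method.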
Experiment Results
Effect of regularization:
• AD-GDA increases the worst-case accuracy with a modest reduction in the average accuracy.
• The regularization parameter allows trading one for the other.
Figure: accuracy vs. iteration (0 to 5000) for AD-GDA with α = 0.1, 1 and 10; worst-case and average accuracy are shown for each value.
Communication Efficiency:
• AD-GDA employs compression to reduce the communication load.
• AD-GDA outperforms distributionally robust federated learning (DRFA) and CHOCO-SGD in accuracy for a given communication budget.
Figure: accuracy vs. communication load in MB (0 to 160). Curves: CHOCO-SGD (4-bit quant. + 25% spars.), AD-GDA (4-bit quant. + 25% spars.), DRFA FedAvg.
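A compression operator of the kind named in the legend (4-bit quantization combined with 25% sparsification) could look like the sketch below. The exact operators used in the experiments are not given in the slides, so the choices here are assumptions: top-k sparsification by magnitude, followed by deterministic uniform quantization with one sign bit and three magnitude bits. The function names are hypothetical, and the sketch covers only the compression of a single message, not the compressed gossip protocol itself.

```python
import numpy as np

def top_k_sparsify(x, keep_fraction):
    """Keep the largest-magnitude entries of a 1-D array, zero the rest."""
    k = max(1, int(np.ceil(keep_fraction * x.size)))
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def quantize_uniform(x, bits):
    """Deterministic uniform quantization: sign plus (bits - 1) magnitude bits."""
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    levels = 2 ** (bits - 1) - 1  # 7 magnitude levels for 4 bits
    return np.round(x / scale * levels) / levels * scale

def compress(x, keep_fraction=0.25, bits=4):
    """Sparsify first, then quantize the surviving entries."""
    return quantize_uniform(top_k_sparsify(x, keep_fraction), bits)

rng = np.random.default_rng(0)
message = rng.normal(size=1000)  # stand-in for a model update
compressed = compress(message)
relative_error = np.linalg.norm(compressed - message) / np.linalg.norm(message)
```

Only the surviving indices and their 4-bit codes (plus one scale value) would need to be transmitted, which is where the saving in MB comes from. The price is the distortion measured by `relative_error`.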
Recap
Problem: decentralized learning is becoming increasingly popular, but with heterogeneous local data sources it can produce unfair predictors.
Starting from the distributionally robust formulation of the learning task, we provide AD-GDA, which:
• effectively trades average accuracy for worst-case performance (fair predictors).
• is based on a single-loop procedure (simple and fast).
• employs message compression (communication efficient).
• is provably convergent (sound algorithm).
Thank you for your attention :)