Robust and effective decentralized learning in terms of communication

(1)

Communication-Efficient Distributionally Robust Decentralized Learning

Matteo Zecchin

SSIE 2021 - Student Workshop

(2)

Decentralized learning

Data-driven optimization methods deal with datasets distributed across multiple computing nodes.

The costumary system objective isL(·) the sum of loss terms at the network nodes{`i(·)}^m_i=1.

Network Objective

The goal is to find a configuration of theθthat minimizes the function L(θ).

(3)

D-SGD

Decentralized (stochastic) gradient descent (D-SGD) is a versatile tool that harness data and computational resources in a privacy preserving and fault tolerant manner:

• Local data is never off-loaded and used only for local updates.

• Flexible w.r.t. communication topology.

Local update phase

(4)

D-SGD

• Flexible w.r.t. communication topology .

Communication phase

(5)

D-SGD

• Flexible w.r.t. communication topology .

Averaging phase

(6)

Fairness through distributional robustness

Data heterogeneityentail a fundamental challenge, deeply linked to the notion of fairness.

The vanilla formulation of loss function may neglect hard and under-represented sub-populations.

Example: Drug discovery with world-wide data and local extraneous factor.

The minimizer of the average loss formulation may perform badly and dangerously Asia.

(7)

Distributional Robust Formulation

Minimax formulation of the learning process as to focus on worst case performance.

Given a bunch of dataset from differently distributed datasetD_i ∼P_i^⊗nⁱ fori= 1, . . . ,mdefine the convex hull

P :=

( X

i

λiPi

λi ≥0, X

i

λi= 1 )

and the distributionally robust objective maxP∈Pmin

θ∈R^d

Ez∼P[`(θ,z)]

Interpretation: two player game, player one tunesθ to minimize the loss w.r.t. toPwhile player two chooses the most challenging distributionP for the current modelθ.

Fairness and distributional robust have been previously considered in federated learning¹.

1Mohri, Mehryar, Gary Sivek, and Ananda Theertha Suresh. ”Agnostic federated learning.” International Conference on Machine Learning. 2019.

(8)

Agnostic Decentralized Gradient Descent Ascent (AD-GDA)

We propose a AD-GDA, a distributionally robust learning algorithm for thefully decentralized learning.

AD-GDA objective:

maxλ∈Λ min

θ∈R^d m

X

i=1

λiEz∼Pi[`(θ,z)]

adversarial loss mixture

+

regularizer to prevent degenerate adversary config.

µr(~λ)

Key features of AD-GDA

• Single loopascent-descent: computing nodes query once the local gradient oracle and perform aparallel gradient descent step inθand an ascent one inλ.

• Communication efficient: nodes exploit compressed gossip to share reduce the network communication load.

(9)

Guarantees

AD-GDA is aprovably convergentalgorithm, we provide convergence guarantees in both convex and non-convex settings

• In theconvex scenario, AD-GDA approaches a solution with optimality gapO(^√¹

T)

• In thenon-convexscenario, AD-GDA returns a approximate stationary solution with gradient vanishing asO(^√¹

T)

Key proof ingredient: the step-size forθ must be smaller that the one used to updateλ.

⇓

The optimization problem w.r.t. θ changes slowly enough to optimize w.r.t. λ.

(10)

Experiment

Set-up:

• Classification task over CIFAR-10 dataset scattering data over a 10 nodes connected according to a ring topology.

• One node in the system store images with lower contrast and one with higher contrast.

(a)low contrast (b)original contrast (c)high contrast

(11)

Experiment Results

Effect of regularization:

• AD-GDA increases the worst case accuracy with a modest reduction in the avearage accuracy.

• Regularization param. allows to trade the two.

0 1000 2000 3000 4000 5000

Iteration 0.15

0.20 0.25 0.30 0.35 0.40 0.45

Accuracy

AG-GDAα= 0.1 worst case acc.

AG-GDAα= 0.1 average acc.

AG-GDAα= 1 worst case acc.

AG-GDAα= 1 mean acc.

AG-GDAα= 10 worst case acc.

AG-GDAα= 10 mean acc.

Communication Efficiency:

• AD-GDA employs compression to reduce the comm. load.

• AD-GDA beats distributionally robust federated learning and CHOCO-SGD.

0 20 40 60 80 100 120 140 160

MB 0.10

0.15 0.20 0.25 0.30 0.35 0.40

Accuracy

CHOCO-SGD 4 bit Quant.25% Spars.

AD-GDA 4 bit Quant.+25% Spars.

DRFA FedAvg

(12)

Recap

Problem: decentralized learning is becoming increasingly popular but in case of heterogeneous local data sources can produceunfairpredictors.

Starting from thedistributionally robust formulation of the learning task, we provide a AD-GDA which:

• effectively trades average accuracy for worst case performance(fair predictors).

• is based on a single loop procedure(simple and fast).

• employs message compression(communication efficient).

• is provably convergent(sound algorithm).

Thank you for your attention :)

(13)

Recap

Problem: decentralized learning is becoming increasingly popular but in case of heterogeneous local data sources can produceunfairpredictors.

Starting from thedistributionally robust formulation of the learning task, we provide a AD-GDA which:

• effectively trades average accuracy for worst case performance(fair predictors).

• is based on a single loop procedure(simple and fast).

• employs message compression(communication efficient).

• is provably convergent(sound algorithm).