
Academic year: 2022


Communication-Efficient Distributionally Robust Decentralized Learning

Matteo Zecchin

SSIE 2021 - Student Workshop


Decentralized learning

Data-driven optimization methods deal with datasets distributed across multiple computing nodes.

The customary system objective $L(\cdot)$ is the sum of the loss terms $\{\ell_i(\cdot)\}_{i=1}^{m}$ at the network nodes.

Network Objective

$$L(\theta) = \sum_{i=1}^{m} \ell_i(\theta)$$

The goal is to find a configuration of $\theta$ that minimizes the function $L(\theta)$.


D-SGD

Decentralized (stochastic) gradient descent (D-SGD) is a versatile tool that harnesses data and computational resources in a privacy-preserving and fault-tolerant manner:

• Local data is never off-loaded and used only for local updates.

• Flexible w.r.t. communication topology.

Local update phase


Communication phase


Averaging phase
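The three phases above (local update, communication, averaging) can be sketched on a toy problem. This is a minimal illustration, not the setup from the talk: the quadratic local losses, the uniform 1/3 ring weights, the step size, and the function names are all assumptions made for the example.

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic mixing matrix for a ring of m >= 3 nodes.
    Each node weights itself and its two neighbours by 1/3 (assumed weights)."""
    W = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i, j % m] = 1.0 / 3.0
    return W

def dsgd(targets, steps=200, lr=0.1):
    """Toy D-SGD: node i holds the hypothetical loss
    l_i(theta) = 0.5 * ||theta - targets[i]||^2 and its own model copy."""
    m, d = targets.shape
    W = ring_mixing_matrix(m)
    theta = np.zeros((m, d))
    for _ in range(steps):
        # Local update phase: each node uses only its own data (exact gradient here).
        theta_half = theta - lr * (theta - targets)
        # Communication + averaging phases: each node receives its neighbours'
        # models and replaces its own with the weighted average.
        theta = W @ theta_half
    return theta

targets = np.arange(10, dtype=float).reshape(10, 1)  # one scalar target per node
theta = dsgd(targets)
print(theta.mean(), theta.min(), theta.max())
```

Because the mixing matrix is doubly stochastic, the network-wide average of the models follows plain gradient descent on $L(\theta)$; with a constant step size the individual nodes stay close to, but not exactly at, consensus.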


Fairness through distributional robustness

Data heterogeneity entails a fundamental challenge, deeply linked to the notion of fairness.

The vanilla formulation of the loss function may neglect hard and under-represented sub-populations.

Example: drug discovery with world-wide data and local extraneous factors.

The minimizer of the average-loss formulation may perform badly, even dangerously, in Asia.


Distributional Robust Formulation

A minimax formulation of the learning process focuses on worst-case performance.

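Given datasets drawn from different distributions, $D_i \sim P_i^{\otimes n_i}$ for $i = 1, \dots, m$, define the convex hull

$$\mathcal{P} := \left\{ \sum_{i} \lambda_i P_i \;:\; \lambda_i \ge 0,\ \sum_{i} \lambda_i = 1 \right\}$$

and the distributionally robust objective

$$\max_{P \in \mathcal{P}} \min_{\theta \in \mathbb{R}^d} \mathbb{E}_{z \sim P}[\ell(\theta, z)]$$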

Interpretation: a two-player game. Player one tunes $\theta$ to minimize the loss w.r.t. $P$, while player two chooses the most challenging distribution $P$ for the current model $\theta$.
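For a fixed model $\theta$, the expected loss under a mixture $\sum_i \lambda_i P_i$ is linear in $\lambda$, so the second player's best response puts all weight on the node with the largest loss. The sketch below checks this numerically; the per-node loss values and function names are hypothetical.

```python
import numpy as np

def mixture_loss(lam, node_losses):
    """Expected loss under the mixture sum_i lam_i * P_i for a fixed model."""
    return float(np.dot(lam, node_losses))

def worst_case_weights(node_losses):
    """Adversary's best response: a vertex of the simplex at the hardest node."""
    lam = np.zeros(len(node_losses))
    lam[int(np.argmax(node_losses))] = 1.0
    return lam

node_losses = np.array([0.30, 0.25, 0.90, 0.35])  # hypothetical per-node losses
uniform = np.full(4, 0.25)

avg = mixture_loss(uniform, node_losses)
worst = mixture_loss(worst_case_weights(node_losses), node_losses)
print(avg, worst)
```

The gap between the two values is what the average-loss formulation hides: the third node is far worse off than the mean suggests.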

Fairness and distributional robustness have previously been considered in federated learning [1].

[1] Mohri, Mehryar, Gary Sivek, and Ananda Theertha Suresh. "Agnostic federated learning." International Conference on Machine Learning. 2019.


Agnostic Decentralized Gradient Descent Ascent (AD-GDA)

We propose AD-GDA, a distributionally robust learning algorithm for fully decentralized learning.

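AD-GDA objective:

$$\max_{\lambda \in \Lambda} \min_{\theta \in \mathbb{R}^d} \underbrace{\sum_{i=1}^{m} \lambda_i \, \mathbb{E}_{z \sim P_i}[\ell(\theta, z)]}_{\text{adversarial loss mixture}} + \underbrace{\mu \, r(\lambda)}_{\text{regularizer to prevent degenerate adversary config.}}$$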

Key features of AD-GDA

• Single-loop ascent-descent: computing nodes query the local gradient oracle once and perform, in parallel, a gradient descent step in $\theta$ and an ascent step in $\lambda$.

• Communication efficient: nodes exploit compressed gossip to reduce the network communication load.
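The single-loop idea can be illustrated with a simplified, centralized gradient descent-ascent loop. This is a sketch under stated assumptions, not the AD-GDA algorithm itself: there is no gossip or compression, gradients are exact, the local losses are hypothetical quadratics $\ell_i(\theta) = \tfrac{1}{2}\|\theta - c_i\|^2$, the regularizer is assumed to be $r(\lambda) = -\tfrac{1}{2}\|\lambda - \mathbf{1}/m\|^2$, and all names and step sizes are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def single_loop_gda(centers, mu=1.0, eta_theta=0.01, eta_lam=0.05, steps=5000):
    """Descent in theta and projected ascent in lambda, one step each per iteration."""
    m, d = centers.shape
    theta = np.zeros(d)
    lam = np.full(m, 1.0 / m)
    for _ in range(steps):
        # One oracle query per iteration: losses and gradients at the current theta.
        diffs = theta - centers                     # row i is the gradient of l_i
        losses = 0.5 * np.sum(diffs ** 2, axis=1)   # l_i(theta)
        grad_theta = lam @ diffs                    # gradient of the loss mixture
        grad_lam = losses - mu * (lam - 1.0 / m)    # losses + mu * grad r(lambda)
        # Parallel step: both updates use quantities from the same iterate.
        theta = theta - eta_theta * grad_theta
        lam = project_simplex(lam + eta_lam * grad_lam)
    return theta, lam

# Three nodes agree on 0, one under-represented node prefers 4.
centers = np.array([[0.0], [0.0], [0.0], [4.0]])
theta, lam = single_loop_gda(centers)
print(theta, lam)
```

The minimizer of the average loss in this toy problem is $\theta = 1$; the robust iterate moves towards the midpoint between the two groups, and the adversary's weight on the fourth node rises above the uniform value 0.25. The step size for $\theta$ is deliberately the smaller of the two.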


Guarantees

AD-GDA is a provably convergent algorithm; we provide convergence guarantees in both convex and non-convex settings:

• In the convex scenario, AD-GDA approaches a solution with optimality gap $O(1/T)$.

• In the non-convex scenario, AD-GDA returns an approximate stationary solution with gradient vanishing as $O(1/T)$.

Key proof ingredient: the step size for $\theta$ must be smaller than the one used to update $\lambda$.

As a result, the optimization problem changes slowly enough in $\theta$ for the ascent steps to optimize $\lambda$.


Experiment

Set-up:

• Classification task on the CIFAR-10 dataset, with data scattered over 10 nodes connected in a ring topology.

• One node in the system stores images with lower contrast and one stores images with higher contrast.

[Figure: example images with (a) low contrast, (b) original contrast, (c) high contrast.]

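The slides do not say how the contrast was changed. One common way to produce such variants, shown here purely as an assumption, is to scale each pixel's deviation from the image mean; the function name and the factors 0.5 and 1.5 are hypothetical.

```python
import numpy as np

def adjust_contrast(img, factor):
    """Scale pixel deviations from the per-image mean by `factor`.
    Assumes pixel values in [0, 1]; factor < 1 lowers contrast, > 1 raises it."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))          # stand-in for a CIFAR-10-sized image
low = adjust_contrast(img, 0.5)        # hypothetical low-contrast node
high = adjust_contrast(img, 1.5)       # hypothetical high-contrast node
print(low.std(), img.std(), high.std())
```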

Experiment Results

Effect of regularization:

• AD-GDA increases the worst-case accuracy with a modest reduction in the average accuracy.

• The regularization parameter allows trading one for the other.

[Figure: accuracy (0.15 to 0.45) versus iteration (0 to 5000), showing the worst-case and average accuracy of AD-GDA for α = 0.1, α = 1, and α = 10.]

Communication Efficiency:

• AD-GDA employs compression to reduce the comm. load.

• AD-GDA beats distributionally robust federated learning and CHOCO-SGD.

[Figure: accuracy (0.10 to 0.40) versus transmitted data in MB (0 to 160). Curves: CHOCO-SGD and AD-GDA with 4-bit quantization and 25% sparsification, DRFA, and FedAvg.]
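The two compression schemes named in the plot legend can be sketched as follows. The exact operators used in the experiments are not given in the slides, so these are generic versions (magnitude-based top-k sparsification and uniform min-max quantization) with hypothetical function names; the compressed gossip mechanism that uses them is not shown.

```python
import numpy as np

def top_k_sparsify(x, fraction=0.25):
    """Keep the `fraction` of entries with largest magnitude, zero the rest."""
    k = max(1, int(round(fraction * x.size)))
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def quantize_uniform(x, bits=4):
    """Round each entry to one of 2**bits evenly spaced levels in [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    codes = np.round((x - lo) / (hi - lo) * levels)   # integer codes in [0, levels]
    return lo + codes * (hi - lo) / levels

rng = np.random.default_rng(0)
x = rng.standard_normal(100)           # stand-in for a model update
sparse = top_k_sparsify(x, 0.25)
quant = quantize_uniform(x, 4)
print(np.count_nonzero(sparse), len(np.unique(quant)))
```

With sparsification only the kept values and their indices need to be sent; with quantization each entry costs `bits` bits plus the two range endpoints.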


Recap

Problem: decentralized learning is becoming increasingly popular, but with heterogeneous local data sources it can produce unfair predictors.

Starting from the distributionally robust formulation of the learning task, we provide AD-GDA, which:

• effectively trades average accuracy for worst-case performance (fair predictors).

• is based on a single-loop procedure (simple and fast).

• employs message compression (communication efficient).

• is provably convergent (sound algorithm).

Thank you for your attention :)
