HAL Id: hal-01651145

https://hal.inria.fr/hal-01651145

Submitted on 28 Nov 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Distributed deep learning on edge-devices in the Parameter Server Model

Corentin Hardy, Erwan Le Merrer, Bruno Sericola

To cite this version:

Corentin Hardy, Erwan Le Merrer, Bruno Sericola. Distributed deep learning on edge-devices in the Parameter Server Model. Workshop on Decentralized Machine Learning, Optimization and Privacy, Sep 2017, Lille, France. pp.1. ⟨hal-01651145⟩


Distributed deep learning on edge-devices in the Parameter Server Model

Corentin Hardy, Erwan Le Merrer, Bruno Sericola

Challenges: towards distribution on edge-devices

User devices such as gateways, set-top boxes and smartphones generate lots of data (voice, pictures, video, network traffic, ...). Most of this data is private and therefore difficult to exploit. We ask the following question: how to train a deep neural network while keeping user data on the devices?

The Parameter Server (PS) model allows the SGD iterations

W(t+1) = W(t) − η·ΔW(t)

to be distributed, where η is the learning rate, W(t) is the state of the machine learning model (such as a deep neural network), and ΔW(t) is the gradient of a loss function computed on a batch of training data. In the PS model, W is stored on a server. A pool of workers computes the term ΔW using different training sets. To speed up the training, the workers push their updates to the PS and pull the current state of the model asynchronously.
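As an illustration of these push/pull mechanics (not the actual implementation used in the experiments), here is a minimal Python sketch of a PS with asynchronous workers on a toy least-squares model; the class, function names and the locking scheme are illustrative only:

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the model state W and applies workers' updates as they arrive."""
    def __init__(self, dim, lr=0.1):
        self.W = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        # A worker fetches the current state of the model.
        with self.lock:
            return self.W.copy()

    def push(self, delta_W):
        # A worker sends its gradient; the PS applies W <- W - lr * delta_W.
        with self.lock:
            self.W -= self.lr * delta_W

def worker(ps, local_data, local_labels, steps=100, batch=32):
    rng = np.random.default_rng()
    for _ in range(steps):
        W = ps.pull()                              # pull the current model
        idx = rng.choice(len(local_data), batch)   # sample a local mini-batch
        X, y = local_data[idx], local_labels[idx]
        # Gradient of a least-squares loss on the local batch (toy model).
        delta_W = X.T @ (X @ W - y) / batch
        ps.push(delta_W)                           # push the update, no barrier

# Each worker runs on its own device/thread and never shares its raw data.
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=worker,
                            args=(ps, np.random.randn(1000, 10),
                                  np.random.randn(1000)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```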

Instead of running the distributed learning in a datacenter, we propose to run the workers on edge-devices and to compute the updates using their local data.

In the context of the PS model, deploying workers on edge-devices raises several issues and challenges:

- The training datasets are composed of user data: how to enforce user privacy?

- The gradient computation of a deep neural network must cope with asynchrony and crashes.

- How many workers can train the DNN in parallel?

- Instead of datacenter communication, updates are sent through the Internet, with bandwidth and latency constraints.

Ingress traffic

The ingress traffic is a critical bottleneck: all workers send their updates to the PS, which is located on a central node. Classically, an update ΔW has the same size as the model; for a deep neural network, this means the central node has to receive between a few MB and several GB per worker each time a worker computes an update. This bottleneck limits parallelism and consequently the performance of the training.

[Figure: our experimental platform — a VM Manager deploys LXC containers hosting the PS, a supervisor and the workers, connected by a virtual network; the workers' updates form the ingress traffic at the PS.]
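As a rough illustration of the orders of magnitude involved (the model size and update count below are assumptions; only the 200 workers come from the experiment reported further down):

```python
# Back-of-the-envelope ingress traffic at the PS (numbers are assumptions).
params = 10_000_000          # assumed model size, e.g. a mid-sized CNN
bytes_per_param = 4          # float32
workers = 200                # as in the experiment below
updates_per_worker = 1       # a single push per worker

update_mb = params * bytes_per_param / 1e6
ingress_gb = update_mb * workers * updates_per_worker / 1e3
print(f"one dense update ≈ {update_mb:.0f} MB, "
      f"ingress for one round ≈ {ingress_gb:.0f} GB")
# one dense update ≈ 40 MB, ingress for one round ≈ 8 GB
```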

Our proposal: AdaComp

To deal with the ingress-traffic bottleneck, we propose AdaComp, a method which combines gradient selection, to send sparse updates, with a learning rate adapted using per-element staleness.

The selection keeps the gradients of the parameters with the largest magnitude: for each variable of the model (i.e., each weight matrix in the case of a DNN), workers send the gradients of only a fraction c of its elements, selected at each iteration.
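A minimal sketch of this largest-magnitude selection, assuming the fraction c is applied independently to each variable; function and variable names are illustrative:

```python
import numpy as np

def select_largest(delta_W, c=0.01):
    """Keep only the fraction c of gradient entries with the largest magnitude."""
    flat = delta_W.ravel()
    k = max(1, int(np.ceil(c * flat.size)))        # number of entries kept
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k magnitudes
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]                      # every other entry is dropped
    return sparse.reshape(delta_W.shape)

# Example: with c = 1%, a 100x100 weight matrix sends only 100 gradient entries.
grad = np.random.randn(100, 100)
sparse_grad = select_largest(grad, c=0.01)
print(np.count_nonzero(sparse_grad))  # 100
```

In practice a worker would transmit only the (index, value) pairs of the kept entries, which is what makes the update sparse on the wire.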

The staleness is computed locally for each parameter of the model. At iteration i, the staleness of parameter k counts the iterations u, among the d iterations covered by the worker's delay (the age of its current state), for which the gradient of parameter k was non-zero.

The staleness is then used to adapt the learning rate of each parameter: starting from the global learning rate η, the learning rate of a parameter is reduced as its staleness grows. Thus, parameters which have rarely been updated since the last iteration of the worker obtain a higher learning rate than parameters which are often updated. AdaComp reduces the ingress traffic at the PS while keeping a good accuracy.
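One plausible way to write these two rules explicitly, assuming the staleness counts the non-zero gradients of parameter k over the d iterations preceding iteration i and that the learning rate is divided by the staleness (floored at 1); the exact expressions used by AdaComp may differ:

```latex
s_k(i) \;=\; \bigl|\{\, u \in \{i-d, \dots, i-1\} \;:\; \Delta W_k(u) \neq 0 \,\}\bigr|,
\qquad
\eta_k(i) \;=\; \frac{\eta}{\max\bigl(1,\, s_k(i)\bigr)}
```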

[Figure: the PS model on edge-devices — the Parameter Server runs in the datacenter while the workers and their training datasets stay on the users' devices, exchanging the updates ΔW and the model W.]

[Figure: test accuracy of a distributed CNN on MNIST with 200 workers, comparing the classical PS model, the PS model with c = 10% and c = 1%, and AdaComp with c = 10% and c = 1%; the marked values include a 97% accuracy level and ingress traffic volumes of 1.3 GB, 14 GB and 260 GB.]
