HAL Id: hal-01930346

https://hal.archives-ouvertes.fr/hal-01930346

Submitted on 15 Jan 2019

Distributed Optimization for Deep Learning with Gossip Exchange

Michael Blot, David Picard, Nicolas Thome, Matthieu Cord

To cite this version:

Michael Blot, David Picard, Nicolas Thome, Matthieu Cord. Distributed Optimization for Deep Learning with Gossip Exchange. Neurocomputing, Elsevier, 2019, 330, pp.287-296.

10.1016/j.neucom.2018.11.002. hal-01930346


Distributed optimization for deep learning with gossip exchange

Michael Blot (a,1), David Picard (b,a), Nicolas Thome (a,c), Matthieu Cord (a)

a. UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, Sorbonne Universités, 4 place Jussieu, Paris 75005, France
b. ETIS UMR 8051, Université Paris Seine, Université Cergy-Pontoise, ENSEA, CNRS, France
c. CEDRIC, Conservatoire National des Arts et Métiers, 292 Rue St Martin, Paris 75003, France

Keywords: Optimization, Distributed gradient descent, Gossip, Deep learning, Neural networks

Abstract

We address the issue of speeding up the training of convolutional neural networks by studying a distributed method adapted to stochastic gradient descent. Our parallel optimization setup uses several threads, each applying individual gradient descents on a local variable. We propose a new way of sharing information between different threads based on gossip algorithms that show good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized.

1. Introduction

With deep convolutional neural networks (CNN) introduced by Fukushima [1] and LeCun et al. [2], computer vision tasks and more specifically image classification have made huge improvements in the years following [3]. CNN performances benefit a lot from big collections of annotated images like Russakovsky et al. [4] or Lin et al. [5]. They are trained by optimizing a loss function with gradient descents computed on random mini-batches according to Bottou [6]. The method called stochastic gradient descent (SGD) has proved to be very efficient to train neural networks in general.

However, current CNN structures are extremely deep, like the 100-layer ResNet of He et al. [7], and contain a lot of parameters (around 60M for AlexNet [3] and 130M for VGG [8]). Those structures involve heavy gradient computation times, making the training on big datasets very slow. Computation on GPU accelerates the training but requires huge local memory caches.

Corresponding author. E-mail addresses: michael.blot@lip6.fr (M. Blot), picard@ensea.fr (D. Picard), nicolas.thome@lip6.fr (N. Thome), matthieu.cord@lip6.fr (M. Cord).

1. Thanks to DGA for financing my research.

Nevertheless, the mini-batch optimization seems suitable for distributing the training over several threads. Many methods have been studied, like Refs. [9,10], which propose to distribute the batches over different threads called workers that periodically exchange information via a central thread to synchronize their models. In [9], when a worker exchanges information with the central node, it updates its parameter variable with a weighted average of its own parameter variable and the central node's. The master thread does the symmetrical update. In [10], the workers accumulate in an additional buffer the gradients computed at the last steps, and apply this aggregated gradient to the master thread parameters.
The major issue of these methods is that they induce waits in all computing nodes due to the synchronization problems with the critical resource that is the central node. A popular way of tackling this issue in distributed optimization involves exchanging information in a peer-to-peer fashion. In this category, Gossip protocols [11] have been successfully applied to many machine learning problems such as kernel methods [12], PCA [13] and k-means clustering [14,15]. We propose to combine Gossip protocols with SGD to efficiently train deep CNN without the need of a central node.

The contributions of this paper are the following. First, we propose a new framework for distributed SGD in which we can formally analyze and compare the properties of existing distributed deep learning training algorithms. Using this framework, we show that distributing the mini-batches is equivalent to using bigger batches in non-distributed SGD, which has implications on the convergence speed of the algorithms. Finally, by carefully analyzing the key aspects of our framework, we propose two distributed SGD algorithms for training deep CNN. The first one, called PerSyn, uses a central node in a very efficient way. The second one, called GoSGD, is based on Gossip exchanges and can thus be scaled to an arbitrarily large number of nodes, providing a significant speedup to the training procedure.

The remainder of this paper is organized as follows. In the next section, we recall the principles of SGD applied to deep learning as well as the relevant literature about distributing its computation. Next, in Section 3, we detail our formalization of distributed SGD, which allows us to compare existing algorithms and write our first proposition, PerSyn. In Section 4, we present a decentralized alternative based on Gossip algorithms, that we call GoSGD, which fits nicely in our framework. Finally, we present experiments showing the efficiency of distributing SGD in Section 5 before we conclude.

2. SGD and related work

In image classification, training deep CNN follows the empirical risk minimization (ERM) principle [16]: the objective is to minimize

$$L(x) = \mathbb{E}_{Y \sim I}\big[\ell(x, Y)\big] \qquad (1)$$

where $Y$ is a pair variable consisting of an image and the associated label following the natural image distribution $I$, $x$ are the CNN parameters and $\ell$ the loss function quantifying the prediction error made on $Y$. Despite the objective function usually being non-convex, gradient descent has been shown to be a successful optimization method for the problem. In the gradient descent algorithm [17], the update operation

$$x \leftarrow x - \eta\, \nabla L(x) \qquad (2)$$

is sequentially applied to the optimized variable until a convincing local minimum for $L$ is found or some stopping criteria are reached. $\eta$ is the gradient step size, called learning rate. In image classification, it is not possible to compute the expected gradient $\nabla L(x)$ because the distribution $I$ is unknown (and not properly modeled). However, it is possible to approximate it using a Monte Carlo method: $\nabla L(x) = \nabla_x \mathbb{E}_Y[\ell(x,Y)] = \mathbb{E}_Y[\nabla_x \ell(x,Y)]$ (the properties allowing to swap the derivative and the integral are supposed to be respected). This last quantity is approximated by a Monte Carlo estimator:

$$\hat{\nabla} L(x) = \frac{1}{N} \sum_{i=1}^{N} \nabla_x \ell(x, Y_i)$$

The set $(Y_i)_{i=1..N}$ of independent samples drawn from $I$ is called a batch and $N$ is the batch size. $\hat{\nabla} L(x)$ is then the descent direction used at each update. In practice, the images in a batch are drawn randomly from the training set and are different for each gradient step.
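To make the estimator concrete, here is a minimal NumPy sketch of the mini-batch gradient estimate and of the update of Eq. (2). The quadratic loss and the synthetic samples are illustrative assumptions, not the CNN setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(x, y):
    """Per-sample gradient of an illustrative quadratic loss l(x, Y) = 0.5 * ||x - y||^2."""
    return x - y

def batch_gradient(x, batch):
    """Monte Carlo estimate of the expected gradient over a batch (Y_i)_{i=1..N}."""
    return np.mean([grad_loss(x, y) for y in batch], axis=0)

# One SGD step, Eq. (2): x <- x - eta * gradient estimate
eta, N = 0.1, 16
x = rng.standard_normal(5)
batch = rng.standard_normal((N, 5)) + 2.0   # stands in for N samples drawn from I
x = x - eta * batch_gradient(x, batch)
```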

Concerning the precision of the gradient estimator, we point out that it is unbiased. It can also be shown that the approximation error with respect to the true gradient depends on the batch size. In fact, $\mathbb{E}\,\|\hat{\nabla} L(x) - \nabla L(x)\|_2^2 = \frac{1}{N}\,\mathrm{tr}\big(\mathrm{Cov}_Y[\nabla \ell(x,Y)]\big)$, where $\mathrm{Cov}_Y[\nabla \ell(x,Y)]$ is the covariance matrix of the gradient estimator and $\mathrm{tr}$ denotes the trace of the matrix (see Appendix A). Therefore, the bigger the batch size $N$, the more accurate the gradient estimator. Unfortunately, GPU memory is limited and bounds the number of images that can be processed in a batch. For instance, a GPU card with 12 GB of memory cannot process batches bigger than 48 images of 224×224 RGB pixels using ResNet [7] with 30 layers. Indeed, at each update step, the GPU needs to host the whole model intermediate computations for all images in the batch. As such, using a single GPU limits the precision of the stochastic gradient descent and thus hinders its convergence.
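The $1/N$ behaviour of the estimator error is easy to verify numerically. The snippet below is an illustration on the same toy quadratic loss as above (an assumption, not an experiment from the paper): the empirical squared error of the batch gradient shrinks roughly like $\mathrm{tr}(\mathrm{Cov})/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.zeros(5)
mu = 2.0 * np.ones(5)
true_grad = x - mu                              # true gradient of the toy loss is x - E[Y]

def batch_grad(N):
    ys = rng.standard_normal((N, 5)) + mu       # samples Y_k ~ N(mu, I)
    return np.mean(x - ys, axis=0)

for N in (8, 32, 128):
    err = np.mean([np.sum((batch_grad(N) - true_grad) ** 2) for _ in range(2000)])
    print(N, err)    # error decreases roughly like tr(Cov)/N = 5/N for this unit-variance noise
```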

2.1. Related work

There are many ways to accelerate the convergence time of SGD, one of which is to optimize the learning rate $\eta$ for each gradient descent iteration, as aimed at by second order gradient descent methods [17]. For deep learning, the parameter space is of high dimension and second order methods are computationally too heavy to be applied efficiently. Examples of first order gradient descent that adapt the learning rate with success can be found in [18] and [19].

A second way to accelerate the gradient descent is to reduce the computation time of $\hat{\nabla}L(x)$, which is the purpose of this paper. Based on the availability of several GPU cards, different methods that overcome the GPU batch size limitations have been proposed. There are two main ways of distributing the computation of the gradient update: either splitting the model and distributing parts of it among the different GPUs, or splitting the batch of images into subsets and distributing them among the GPUs.

A famous technique used by Krizhevsky et al. [3] for distributing the model consists in parallelizing the forward/backward pass of a batch over two GPU cards. The convolutional part of the network is split into two independent parts that are only aggregated with a fully connected layer at the top of the network. This conception enables loading the two parallel parts on two different GPUs, liberating space for more images in batches. However, the CNN models claiming current state-of-the-art performances in image classification, like He et al. [7] and Huang et al. [20], benefit a lot from the inter-correlation between features of each layer and do not allow independent partitioning within a layer. Thus, if the network is loaded on several GPU cards, they would have to communicate during the forward pass of a batch. The communications being time-consuming, it is preferable to keep the whole network on a single GPU card.

The second approach, which consists in distributing subsets of the batch to the different GPUs, is becoming increasingly popular. We use the following notations to describe such methods: as in [9], the different threads that manage a neural network model are called workers. With M GPU cards, there are M workers and their corresponding set of parameters is noted x_m for each worker m. The set of parameters used for inference is noted x̃ and is specifically the model we want to optimize. The easiest way of distributing the batches is presented in [21] and consists in computing the average of all the workers' models as in Algorithm 1.

Algorithm 1 Fully synchronous SGD: pseudo-code.
1: Input: M: number of threads, η: learning rate
2: Initialize: x is initialized randomly, x_m = x
3: repeat
4:   ∀m, v_m = ∇̂L_m(x_m)
5:   v = (1/M) Σ_{m=1}^{M} v_m
6:   ∀m, x_m = x_m − η v
7: until Maximum iteration reached
8: return (1/M) Σ_{m=1}^{M} x_m
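A minimal single-process simulation of Algorithm 1, where the M GPU workers become plain Python loops and the per-worker mini-batch gradients are replaced by a noisy toy quadratic (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
M, eta, dim, steps = 4, 0.1, 5, 100

def local_grad(x, m):
    # stands in for the mini-batch gradient computed by worker m on its own batch
    return (x - 2.0) + 0.1 * rng.standard_normal(dim)

x = rng.standard_normal(dim)
xs = [x.copy() for _ in range(M)]                    # x_m = x
for _ in range(steps):
    vs = [local_grad(xs[m], m) for m in range(M)]    # line 4: local gradient estimates
    v = np.mean(vs, axis=0)                          # line 5: aggregation by the master
    xs = [xm - eta * v for xm in xs]                 # line 6: identical update everywhere
x_tilde = np.mean(xs, axis=0)                        # line 8: returned model
```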

We stress the fact that after each iteration all workers host the same value of the variable x̃ to optimize (i.e., ∀m, x_m = x̃). This method is equivalent to using M times bigger batches for the gradient computation (in [21] they use the sum instead of the average of the gradients, but both are equivalent by adjusting the learning rate). It is then possible to alleviate the GPU memory constraint limiting the number of examples in a batch in order to get more accurate gradients. However, in comparison with the classical SGD, there are three additional phases of communication and aggregation that have incompressible time:

1. The transmission to the master of the local gradient estimates computed by the threads at line 4.
2. The aggregation of the gradients at line 5.
3. The transmission of the aggregated gradient ∇L(x) to all threads before the local updates at line 6.

In large scale image classification, the communications involved at lines 4 and 6 imply a CNN with millions of parameters and are thus very costly. Moreover, to perform line 5, the master has to wait for all workers to have completed line 4. The additional time involved by communications and synchronizations can have a significant impact on the global training time. This is even more true when the GPUs are not on the same server, where the communication delays are substantial. Thus, this easy synchronous distributed method can be very inefficient.

The focus of this paper is to reduce these communication and aggregation costs. As with existing strategies such as Refs. [9,10], the main idea is to control the amount of information exchanged between the nodes. This amount is a key factor in making all workers contribute to the optimization of the test model x̃. Indeed, if no information is ever exchanged, the distributed system is equivalent to training M independent models. As a consequence of the symmetry property of CNN explained in [22], these independent models are likely to be very different and almost impossible to combine. As such, the distributed training would not improve over using a single GPU with a reduced batch size. The main challenge of this work is thus to allow as little information exchange as possible to ensure reduced communication costs, while still optimizing the combined model x̃.

In the next section, we introduce a matrix framework that allows us to explicitly control the amount of exchanged information, and thus to explore different optimization strategies.

3. Framework

In this section, we propose a general framework for distributed SGD methods. We start from the centralized SGD update procedure and we show the equivalent synchronous distributed procedure. Then, we show that relaxing some equality constraints among workers is equivalent to considering different communication strategies. We are thus able to write strategies found in the literature using a simple matrix expression.

First, we unroll the recursion of the classic SGD update of Eq. (2) to obtain the value of x̃ after T+1 gradient descent steps:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \hat{\nabla}L\big(\tilde{x}^{(t)}\big) \qquad (3)$$

with $\tilde{x}^{(t)}$ being variable x̃ at time t. The equivalent batch distribution method is the following:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{m=1}^{M} \hat{\nabla}L_m\big(\tilde{x}^{(t)}\big)$$

In order to make the communication between the workers and the central model more visible, we can use the equivalent update rule that introduces local variables x_m and consensus constraints. By doing so, we obtain the standard distributed SGD method:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{m=1}^{M} \hat{\nabla}L_m\big(x_m^{(t)}\big) \qquad (4)$$
$$\text{s.t.}\quad \forall t,\; \forall m,\; x_m^{(t)} = \tilde{x}^{(t)} \qquad (5)$$

We can replace the consensus equality constraints (5) by equivalent local update rules obtained from (4):

$$\forall r,\; x_r^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{s=1}^{M} \hat{\nabla}L_s\big(x_s^{(t)}\big)$$

This formulation is a first step towards decentralized and communication efficient algorithms since it makes the communications between workers visible. Indeed, at each step, each worker gathers the gradients from all other workers and applies them to its local variable. The main drawback of this formulation is that the number of messages exchanged by workers at each step is now M(M−1) instead of 2M in the classical version.

To allow for more efficient communication, we relax the consensus constraints by allowing local variables to differ slightly. This can be done efficiently by introducing non-uniform weights $\alpha_{r,s}$ in the combination of the local gradients:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \sum_{m=1}^{M} \alpha_{0,m}\, \hat{\nabla}L_m\big(x_m^{(t)}\big)$$
$$\forall r,\; x_r^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \sum_{s=1}^{M} \alpha_{r,s}\, \hat{\nabla}L_s\big(x_s^{(t)}\big)$$
$$\text{s.t.}\quad \forall r,\; \sum_{s=1}^{M} \alpha_{r,s} = 1$$

We introduce a matrix notation over the $\alpha_{r,s}$ to simplify the expression of these update rules. We note $x^{(t)} = [\tilde{x}^{(t)}, x_1^{(t)}, \dots, x_M^{(t)}]$ the vector concatenating the central variable $\tilde{x}^{(t)}$ and all local variables $x_m^{(t)}$ at time step $t$. Similarly, we note $v^{(t)} = [0, \hat{\nabla}L_1(x_1^{(t)}), \dots, \hat{\nabla}L_M(x_M^{(t)})]$ the vector concatenating all the local gradients $\hat{\nabla}L_m(x_m^{(t)})$ at time step $t$, with a leading 0 to have the same size as $x$. The weights $\alpha_{r,s}$ of the gradient combinations are enclosed in a matrix $K$, which leads to the following matrix update rule:

$$x^{(T+1)} = \left(\prod_{t=0}^{T} K\right) x^{(0)} - \eta \sum_{t=0}^{T} \left(\prod_{\tau=t}^{T} K\right) v^{(t)}$$

$K$ is the communication matrix and is responsible for the mixing of the local updates. Remark that its rows have to sum to 1 to avoid exploding gradients. The sparser $K$ is, the fewer workers exchange information and thus the more efficient the algorithm is. However, this comes at the price of less consensus among the local models. At the extreme, if $K$ is diagonal (no information exchanged between workers), these update rules are equivalent to $M$ different models being independently trained. Hence, the choice of $K$ is critical to have a good trade-off between low communication costs and good information sharing between workers.

To further allow for the optimization of $K$, we introduce time dependent communication matrices $K^{(t)}$. This allows for the design of very sparse communication matrices most of the time, balanced with the use of a dense $K^{(t)}$ to enforce better consensus whenever it is required. We note $P_t^T = \prod_{\tau=t}^{T} K^{(\tau)}$, and obtain:

$$x^{(T+1)} = P_0^T\, x^{(0)} - \eta \sum_{t=0}^{T} P_t^T\, v^{(t)}$$

This corresponds to the following recursion:

$$x^{(T+1)} = K^{(T)} x^{(T)} - \eta\, K^{(T)} v^{(T)}$$

Remark that since the update is linear, mixing the local gradients and mixing the local variables are equivalent. As such, we can define an intermediate variable $x^{(t+\frac{1}{2})} = x^{(t)} - \eta\, v^{(t)}$ that considers only the local updates and obtain the following update rules:

$$x^{(T+\frac{1}{2})} = x^{(T)} - \eta\, v^{(T)} \qquad (6)$$
$$x^{(T+1)} = K^{(T)}\, x^{(T+\frac{1}{2})} \qquad (7)$$

Since these rules make a clear distinction between local computation (step $T+\frac{1}{2}$) and communication (step $T+1$), they are the ones we will use in the remainder of this article.
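The whole framework then fits in two lines of linear algebra. Below is a sketch where the stacked state $[\tilde{x}, x_1, \dots, x_M]$ is stored as an $(M+1)\times d$ array; the communication matrix $K$ and the gradient stack $v$ are placeholders to be provided by a concrete strategy (PerSyn, EASGD, GoSGD).

```python
import numpy as np

def framework_step(x, v, K, eta):
    """One iteration of Eqs. (6)-(7): local computation then communication.

    x : (M+1, d) array stacking [x_tilde, x_1, ..., x_M]
    v : (M+1, d) array of local gradients with a leading zero row
    K : (M+1, M+1) communication matrix (rows sum to 1)
    """
    x_half = x - eta * v          # Eq. (6): local gradient steps only
    return K @ x_half             # Eq. (7): mixing decided by K

# sanity check: K = identity means M independent models, no information exchanged
M, d = 4, 3
x = np.random.default_rng(3).standard_normal((M + 1, d))
v = np.vstack([np.zeros(d), np.ones((M, d))])
assert np.allclose(framework_step(x, v, np.eye(M + 1), 0.1), x - 0.1 * v)
```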

In the following, we show how several existing distributed SGD can be specified by their sequence of communication matrices $K^{(t)}$. We first start with a naive yet communication efficient method we call PerSyn (Periodically Synchronous) that considers exchanging information only once every few steps. Then, we show how EASGD [9] and Downpour [10] compare to PerSyn.

3.1. PerSyn

As a first illustration of our framework we introduce a new communication and aggregation strategy for distributed SGD that is very simple to implement and yet surprisingly effective. The main idea is to relax the synchronization of all local variables with the master variable to occur only once every $\tau$ steps. This is done by defining the following time dependent communication matrices:

$$K^{(t)} = \begin{cases}
\begin{pmatrix} 0 & \frac{1}{M}\mathbf{1}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{if } t \bmod \tau = 0\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{1} & 0 \end{pmatrix}, & \text{if } t \bmod \tau = 1\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{else}
\end{cases}$$

$I$ being the identity matrix of size $M \times M$, and $\mathbf{1}$ the vector of size $M$ with 1 on every component. The corresponding algorithm is described in Algorithm 2. Remark that it is very similar to the standard distributed SGD (Algorithm 1) and thus can be very easily implemented.

Algorithm 2 PerSyn SGD: pseudo-code.
1: Input: M: number of threads, η: learning rate
2: Initialize: x is initialized randomly, x_m = x, t = 0
3: repeat
4:   for all m, x_m^(t+1/2) ← x_m^(t) − η ∇̂L_m^N(x_m^(t))
5:   t = t + 1
6:   if t mod τ = 0 then
7:     x̃^(t+1) = (1/M) Σ_{m=1}^{M} x_m^(t+1/2)
8:     ∀m, x_m^(t+1) ← (1/M) Σ_{m=1}^{M} x_m^(t+1/2)
9:   else
10:    ∀m, x_m^(t+1) ← x_m^(t+1/2)
11: end if
12: until Maximum iteration reached
13: return (1/M) Σ_{m=1}^{M} x_m
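A small sketch constructing the PerSyn communication matrices as reconstructed above (the exact gather/broadcast layout should be treated as an assumption) and checking that every row sums to 1, as required by the framework:

```python
import numpy as np

def persyn_K(t, tau, M):
    """Communication matrix over the stacked state [x_tilde, x_1, ..., x_M]."""
    K = np.zeros((M + 1, M + 1))
    if t % tau == 0:                 # master gathers the average of the workers
        K[0, 1:] = 1.0 / M
        K[1:, 1:] = np.eye(M)
    elif t % tau == 1:               # master broadcasts back to every worker
        K[:, 0] = 1.0
    else:                            # no communication at all
        K[0, 0] = 1.0
        K[1:, 1:] = np.eye(M)
    return K

M, tau = 4, 10
for t in range(2 * tau):
    assert np.allclose(persyn_K(t, tau, M).sum(axis=1), 1.0)   # rows sum to 1
```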

As we can see, PerSyn involves no communication at all during a fraction $(\tau-1)/\tau$ of the time. However, during that time, models are driven by their local gradients only and can thus diverge from one another, leading to incoherent distributed optimization. The trade-off with PerSyn is then to choose $\tau$ so as to minimize the communication costs while forbidding models to diverge from one another during the time where no communication is made.

One main limitation of PerSyn is that the communication matrix is very dense when $t \bmod \tau = 0$. This is taken care of by EASGD [9] by considering a moving average of the local models instead of a strict average.

3.2. EASGD

In EASGD, the local models are periodically averaged like in PerSyn. However, the averaging is performed in an elastic way that allows for reduced communication compared to PerSyn. The corresponding matrix is then:

$$K^{(t)} = \begin{cases}
\begin{pmatrix} 1 - M\alpha & \alpha \mathbf{1}^\top \\ \alpha \mathbf{1} & (1-\alpha)\, I \end{pmatrix}, & \text{if } t \bmod \tau = 0\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{else}
\end{cases}$$

with $\alpha$ being a mixing weight.

As we can see, most of the time no communication is performed and the communication matrix is the identity. Every $\tau$ iterations, an averaging synchronization is performed to keep the local models from diverging from one another. However, contrary to PerSyn, which replaces all models (local and global) by the new average, this is done by computing a weighted average between the previous model and the current average. By doing so, EASGD is able to exchange only 2M messages every $\tau$ iterations, without having to wait for the central node to compute the new average. However, a global synchronization is still required as the master has to perform the weighted average of local models that are consistent, that is, local models that have been updated the same number of times.
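The elastic matrix translates directly into code. The sketch below builds $K^{(t)}$ for EASGD and checks that, at a synchronization step, it reproduces the usual elastic update $\tilde{x} \leftarrow (1-M\alpha)\tilde{x} + \alpha\sum_m x_m$ and $x_m \leftarrow (1-\alpha)x_m + \alpha\tilde{x}$; the values of $\alpha$ and $\tau$ are illustrative.

```python
import numpy as np

def easgd_K(t, tau, M, alpha):
    if t % tau != 0:
        return np.eye(M + 1)                       # identity: no communication
    K = np.zeros((M + 1, M + 1))
    K[0, 0] = 1.0 - M * alpha                      # master keeps (1 - M*alpha) of itself
    K[0, 1:] = alpha                               # ... plus alpha of every worker
    K[1:, 0] = alpha                               # each worker pulls alpha of the master
    K[1:, 1:] = (1.0 - alpha) * np.eye(M)          # ... and keeps (1 - alpha) of itself
    return K

M, alpha = 4, 0.05
rng = np.random.default_rng(4)
x = rng.standard_normal((M + 1, 3))
mixed = easgd_K(0, 10, M, alpha) @ x
assert np.allclose(mixed[0], (1 - M * alpha) * x[0] + alpha * x[1:].sum(axis=0))
assert np.allclose(mixed[1], (1 - alpha) * x[1] + alpha * x[0])
```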

3.3. Downpour SGD

In Downpour SGD [10], an asynchronous update scheme is discussed. By asynchronous, the authors mean that each worker is endowed with its own clock and processes gradients at a rate that is completely independent from the other workers. This can be easily described in our matrix framework by considering a universal clock that ticks each time one of the workers' clocks ticks. This corresponds to considering the finest time resolution possible, when only one worker is awake at any given time step. In particular, this redefines $v^{(t)} = [0, \dots, 0, \hat{\nabla}L_m(x_m^{(t)}), 0, \dots]$ to have zeros everywhere but on the component corresponding to the awakened worker $m$.

Whenever it awakes, each worker has the option to communicate with the master while performing its gradient update. These communications can either be fetching the global model from the master or sending the current update to the master. This is expressed by the following communication matrices:

$$K^{(\text{send})} = \begin{pmatrix} 1 & e_m^\top \\ 0 & I \end{pmatrix}, \qquad K^{(\text{receive})} = \begin{pmatrix} 1 & \mathbf{0}^\top \\ e_m & I - e_m e_m^\top \end{pmatrix}$$

with $e_m$ being the vector of zeros with a 1 on component $m$. When nodes are neither sending nor receiving, the communication matrix is simply the identity and thus the only ongoing operation is a computation which involves a single worker that performs a local gradient update. The main drawback of Downpour is that it involves a master that stores the most up-to-date model. This single entry point can be a source of weakness in the training process since it acts as the communication bottleneck (all workers have to communicate with the master) and is also a critical point of failure. We intend to overcome this drawback by introducing new communication matrices that lead to fully decentralized training algorithms.
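As an illustration of the rank-one structure of these matrices, the sketch below builds $K^{(\text{receive})}$ for an awakened worker $m$ and checks that applying it overwrites $x_m$ with the master's model while leaving every other row untouched (only the receive case is exercised here; the indexing convention is mine):

```python
import numpy as np

def downpour_K_receive(m, M):
    """Worker m (1-indexed into the stacked state) fetches the global model from the master."""
    e_m = np.zeros(M)
    e_m[m - 1] = 1.0
    K = np.zeros((M + 1, M + 1))
    K[0, 0] = 1.0                                   # master row: unchanged
    K[1:, 0] = e_m                                  # worker m reads the master...
    K[1:, 1:] = np.eye(M) - np.outer(e_m, e_m)      # ...and forgets its own copy; others keep theirs
    return K

M, m = 4, 2
x = np.random.default_rng(5).standard_normal((M + 1, 3))
y = downpour_K_receive(m, M) @ x
assert np.allclose(y[m], x[0]) and np.allclose(y[0], x[0]) and np.allclose(y[1], x[1])
```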


4. Gossip Stochastic Gradient Descent (GoSGD)

To remove the burden of using a master, several key modifications to the communication process have to be made. First, the absence of a master means that each worker is expected to converge to the same value x̃, which corresponds to ensuring a strict consensus among workers. Furthermore, this absence of a master is reflected by the communication matrix having its first row and first column set to 0. Therefore, all communications are performed in a peer-to-peer way, which is reflected in the communication matrix where the columns are the senders and the rows are the receivers.

Second, the absence of a master also implies the absence of a global clock. This means that workers are performing their updates at any time, independently of the others. To reflect this, we consider the same clock model as with Downpour, where at each time step $t$, only a single worker $s$ is awakened. Consequently, this also redefines $v^{(t)} = [0, \dots, 0, \hat{\nabla}L_s(x_s^{(t)}), 0, \dots]$ to have zeros everywhere but on the component corresponding to the awakened worker $s$. Furthermore, since the active worker is random, the communications have to be randomized too, which leads to random communication matrices $K^{(t)}$. These random peer-to-peer communication systems are known as Randomized Gossip Protocols [11], and are known to converge to the consensus exponentially fast in the absence of perturbations (which in our case corresponds to $\forall t,\, v^{(t)} = 0$). To control the communication cost, the frequency at which a worker emits messages is set by using a random variable deciding whether the awakened node shares its variables with neighbors.

Finally, we want a communication protocol where no worker is waiting for another, so as to maximize the time they spend computing their local gradients. Waiting occurs when a worker is expecting a message from another worker and has to idle until this message arrives. This is the case in symmetrical communication where workers expect replies to their messages. It is well known in the Randomized Gossip literature that such local blocking waits can cause global synchronization issues where most of the workers spend their time waiting for the needed resources to be available. To overcome this problem, asymmetric gossip protocols have been developed, such as in [23]. Asymmetric means that no worker is both sending and receiving messages at the same time, which translates into constraints on the non-zero coefficients of the communication matrix $K^{(t)}$. To ensure convergence to the consensus, a weight $w_m^{(t)}$ has to be associated with every variable $x_m^{(t)}$ and is shared by workers using the same communications as the variables $x_m^{(t)}$. These communication protocols are known as Sum-Weight Gossip protocols because of the introduction of the weights $w_m^{(t)}$.

Using these assumptions, we define the Gossip Stochastic Gradient Descent (GoSGD) as follows: let $s$ be the worker awakened at time $t$, which is our potential sender. Let $S \sim \mathcal{B}(p)$ be a random variable following a Bernoulli distribution of probability $p$. Let $r$ be a random variable sampled from the uniform distribution on $\{1,\dots,M\} \setminus \{s\}$; $r$ represents our potential receiver. The communication matrix $K^{(t)}$ is then:

tionmatrixK(t) isthen:

K(t)=

⎧ ⎪

⎪ ⎪

⎪ ⎨

⎪ ⎪

⎪ ⎪

0 ...

... I+w(tw)(st) s +w(rt)

eres +

w(st) w(st)+w(rt)−1

eses

,ifS

0 ... ... I

,else

(8)

If $S$ is successful, the weights of the sender $s$ and the receiver $r$ are updated as follows:

$$w_s^{(t+1)} = \frac{w_s^{(t)}}{2}, \qquad w_r^{(t+1)} = w_r^{(t)} + \frac{w_s^{(t)}}{2} \qquad (9)$$

Notice that this update involves a single message from $s$ to $r$, just as for the update of $x_r^{(t)}$ using $K^{(t)}$. In practice, both $x_s^{(t)}$ and $w_s^{(t)}$ are encapsulated in a single message and sent together.

Remark also that, contrarily to what is found in the sum-weight gossip literature, we do not perform explicit ratios of the variables and their associated weights. Our implicit ratio is performed by including the weights in the communication matrix $K^{(t)}$. This leads to more complicated matrices than what is common in the literature and hinders the theoretical analysis of the communication process (since the $K^{(t)}$ are no longer i.i.d.). However, it also leads to a much easier implementation when considering large deep neural networks since the gradient is computed on the variable only and not on the ratio of variables and associated weights. Furthermore, since the updates are equivalent, our proposed communication matrix keeps all the exponential convergence to consensus properties of standard sum-weight gossip protocols.
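A compact sketch of one exchange written with the worker block of $K^{(t)}$ as reconstructed above, together with the weight update of Eq. (9). The assertions check the properties the consensus argument relies on: rows of the block sum to 1, the sender's variable is untouched, and the total weight is conserved. Helper names and indices are mine.

```python
import numpy as np

def gossip_K_block(ws, s, r):
    """Worker block of K(t) for a successful exchange s -> r (Eq. (8), as reconstructed)."""
    M = len(ws)
    c = ws[s] / (ws[s] + ws[r])
    K = np.eye(M)
    K[r, s] += c          # receiver pulls a fraction of the sender's variable
    K[r, r] -= c          # ...and forgets the same fraction of its own
    return K

def weight_update(ws, s, r):
    """Eq. (9): the sender keeps half of its weight, the receiver absorbs the other half."""
    ws = ws.copy()
    ws[r] += ws[s] / 2.0
    ws[s] /= 2.0
    return ws

ws = np.array([0.25, 0.5, 0.25])
K = gossip_K_block(ws, s=1, r=0)
assert np.allclose(K.sum(axis=1), 1.0)                        # rows sum to 1: no exploding gradients
assert np.allclose(K[1], np.eye(3)[1])                        # the sender's variable is left untouched
assert np.isclose(weight_update(ws, 1, 0).sum(), ws.sum())    # total weight is conserved
```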

4.1. GoSGD algorithm

The procedure run by each worker to perform GoSGD is described in Algorithm 3. Each worker is endowed with a queue q_m which can be concurrently accessed by all workers. Each worker repeats the same loop, which consists in a sequence of three operations, namely processing incoming messages, performing a local gradient descent update and possibly sending a message to a neighbor worker.

Algorithm 3 GoSGD: worker pseudo-code.
1: Input: p: probability of exchange, M: number of workers, η: learning rate, ∀m, q_m: message queue associated with worker m, s: current worker id.
2: Initialize: x is initialized randomly, x_s = x, w_s = 1/M
3: repeat
4:   processMessages(q_s)
5:   x_s ← x_s − η ∇̂L_s(x_s)
6:   if S ∼ B(p) then
7:     r = Random(M)
8:     pushMessage(q_r, x_s, w_s)
9:   end if
10: until Maximum iteration reached
11: return x_s

Algorithm 4 Gossip update functions.
1: function pushMessage(queue q_r, x_s, w_s)
2:   x_s ← x_s
3:   w_s ← w_s / 2
4:   q_r.push((x_s, w_s))
5: end function
6: function processMessages(queue q_r)
7:   repeat
8:     (x_s, w_s) ← q_r.pop()
9:     x_r ← (w_r / (w_s + w_r)) x_r + (w_s / (w_s + w_r)) x_s
10:    w_r ← w_r + w_s
11:  until q_r.empty()
12: end function
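Below is a runnable single-machine sketch of Algorithms 3 and 4 in which the workers become Python threads, the message queues are `queue.Queue` objects and the CNN mini-batch gradient is replaced by a noisy toy quadratic. Everything except the queue handling (which mirrors pushMessage/processMessages) is an illustrative assumption.

```python
import queue
import threading
import numpy as np

M, p, eta, steps, dim = 4, 0.1, 0.05, 500, 8
rng = np.random.default_rng(7)
queues = [queue.Queue() for _ in range(M)]               # q_m, one per worker
xs = [rng.standard_normal(dim) for _ in range(M)]        # local variables x_m
ws = [1.0 / M] * M                                       # associated weights w_m

def worker(s):
    local_rng = np.random.default_rng(s)
    for _ in range(steps):
        # processMessages(q_s): drain the queue, mixing with the sum-weight ratios
        while not queues[s].empty():
            x_in, w_in = queues[s].get()
            xs[s] = (ws[s] * xs[s] + w_in * x_in) / (ws[s] + w_in)
            ws[s] += w_in
        # local SGD step on a toy quadratic loss (stands in for the CNN mini-batch gradient)
        grad = (xs[s] - 1.0) + 0.1 * local_rng.standard_normal(dim)
        xs[s] = xs[s] - eta * grad
        # with probability p, push (x_s, w_s/2) to a random neighbour and keep the other half
        if local_rng.random() < p:
            r = int(local_rng.integers(M - 1))
            r = r if r < s else r + 1                    # uniform over {1..M} \ {s}
            ws[s] /= 2.0
            queues[r].put((xs[s].copy(), ws[s]))

threads = [threading.Thread(target=worker, args=(s,)) for s in range(M)]
for t in threads: t.start()
for t in threads: t.join()
print(max(np.linalg.norm(x - xs[0]) for x in xs))        # workers end up close to a consensus
```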

Remark that all messages are processed in a delayed fashion, in the sense that several messages can be received and the corresponding variables may have been updated by their respective workers before the receiving worker applies the reception procedure. From this point of view, it seems the workers are taking into account outdated information, which is even more the case with a low communication frequency controlled by a low probability p. However, as we show in the experiments, even very low probabilities of communication (p = 0.01) yield very good convergence properties and satisfying consensus.

Fig. 1. Training loss evolution for PerSyn (a) and GoSGD (b) for different frequency/probability of exchange p (training loss vs. number of batches; curves for p = 0.400, 0.100, 0.010 and a 1-thread baseline with batch size 16).

Fig. 2. Convergence speed comparison between GoSGD and EASGD for similar frequency/probability of exchange p = 0.02.

4.2. Distributed optimization interpretation

We now show that our proposed GoSGD solves a consensus-augmented version of the distributed ERM principle. As shown in Section 3, the easiest way of tackling the distributed ERM principle is to consider local variables constrained to be equal:

$$\min_{x_1,\dots,x_M} \sum_{m=1}^{M} L(x_m) \quad \text{s.t.}\quad \forall m,\; x_m = \hat{x}, \qquad \hat{x} = \frac{1}{M}\sum_{m=1}^{M} x_m$$

As used in [9] and [24], the consensus constraints can be relaxed, which leads to the following augmented problem:

$$\min_{x_1,\dots,x_M} \sum_{m=1}^{M} \left[ L(x_m) + \frac{\rho}{2}\,\|x_m - \hat{x}\|_2^2 \right] \qquad (10)$$
$$\text{s.t.}\quad \hat{x} = \frac{1}{M}\sum_{m=1}^{M} x_m \qquad (11)$$

We can rewrite this loss in order to make the equivalent gossip communications visible:

$$\min_{x_1,\dots,x_M} \sum_{s=1}^{M} \left[ L(x_s) + \frac{\rho}{4M} \sum_{r=1}^{M} \|x_s - x_r\|_2^2 \right] \qquad (12)$$

The gradient of this objective function contains two terms, the first being related to the empirical risk and the second being the pairwise distance between local models. We prove in Appendix B that the update rules of GoSGD are in expectation equivalent to these two gradient terms, and thus that GoSGD performs a stochastic alternate gradient descent on this augmented problem.
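For reference, here is a small sketch of the augmented objective (12) and of the consensus part of its gradient, $\frac{\rho}{M}\sum_r (x_s - x_r)$, together with a finite-difference check; the quadratic local loss is a stand-in and the helper names are mine.

```python
import numpy as np

def augmented_objective(xs, rho, local_loss):
    """Eq. (12): empirical risk plus pairwise consensus penalty."""
    M = len(xs)
    risk = sum(local_loss(x) for x in xs)
    penalty = rho / (4 * M) * sum(np.sum((xs[s] - xs[r]) ** 2)
                                  for s in range(M) for r in range(M))
    return risk + penalty

def consensus_grad(xs, s, rho):
    """Gradient of the penalty with respect to x_s: (rho / M) * sum_r (x_s - x_r)."""
    M = len(xs)
    return rho / M * sum(xs[s] - xs[r] for r in range(M))

rng = np.random.default_rng(8)
xs, rho = [rng.standard_normal(3) for _ in range(4)], 0.5
loss = lambda x: 0.5 * np.sum(x ** 2)          # illustrative local empirical risk

# finite-difference check of the consensus part of the gradient on coordinate 0 of x_0
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
xp = [x.copy() for x in xs]
xp[0] = xp[0] + e0
penalty_only = lambda x: 0.0                   # isolate the penalty term
num = (augmented_objective(xp, rho, penalty_only) - augmented_objective(xs, rho, penalty_only)) / eps
assert abs(num - consensus_grad(xs, 0, rho)[0]) < 1e-4
```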

5. Experiments

We conducted experiments on the CIFAR-10 dataset (see [25] for a detailed presentation) using a CNN defined in [9] and described in [26]. In the first part, we test the convergence rate of different distributed algorithms, namely PerSyn, GoSGD and EASGD. The second part of our experiments considers the consensus convergence ability of the different communication strategies. For distributed experiments we always use M = 8 workers loaded on 8 different Tesla K-20 GPU cards with 5 GB of RAM each. The deep learning framework that we used is Torch 7 [27]. For information, we obtained with GoSGD in this framework a time of 0.25 s per SGD iteration for a probability of exchange equal to 0.01. For all the experiments, p represents the frequency/probability of communication per batch for one worker. In order to get a fair comparison, the methods are always compared at equal frequency/probability of exchange.

5.1. Training speed and generalization performances

In the following experiments we study the behaviour of the distributed algorithms during training on CIFAR-10. We use the same data augmentation strategy as in [9]. During training, the learning rate is set to 0.1 and the weight decay is set to $10^{-4}$. We used eight workers.

We show on Fig. 1 the evolution of the training loss for both PerSyn and GoSGD for different values of the frequency/probability of exchange p. As we can see, PerSyn seems to be slightly faster in terms of convergence speed (as measured by the number of iterations required to reach a specific loss value). However, remember that, even not taking into account synchronization issues, PerSyn requires double the amount of messages of GoSGD for the same frequency/probability since workers have to wait for the master to answer.

Fig. 3. Validation accuracy evolution for PerSyn and GoSGD for different frequency/probability of exchange p.

Similarly, we compare the training loss of GoSGD with EASGD on Fig. 2 using a real world clock. As we can see, GoSGD is significantly faster than EASGD, which can be explained by the non-blocking updates of GoSGD as well as by the fact that GoSGD requires fewer messages to keep the same consensus error. Finally, we show on Fig. 3 the evolution of the validation accuracy for both PerSyn and GoSGD for different values of the frequency/probability of communication p. For low frequency p = 0.01, despite PerSyn being faster in terms of training loss decrease, both PerSyn and GoSGD models obtain equivalent validation accuracy. However, GoSGD is expected to be at least twice as fast in terms of real world clock since it uses half the number of messages for the same rate p.

Fig. 4. Evolution of the consensus error ε(t) against time for different communication frequency/probability p for both GoSGD and PerSyn.

At higher exchange rate p = 0.4, we can see that GoSGD obtains better validation accuracy than PerSyn, despite having higher training loss, which means that GoSGD is in practice more robust to overfitting. This can be explained by the randomized nature of the gossip exchanges, which perform some kind of stochastic exploration of the parameter space, similarly to what is done by [26].

Remark that although the batch size is equal for each worker, the implicit global batch size of the algorithms is not equivalent for all curves. Indeed, the single thread run has a batch size of 16, while each of the 8 workers of the distributed runs has a batch size of 16. Therefore the distributed runs have an equivalent batch size of 16×8 = 128, as calculated in Section 3. These batch sizes are chosen so as to have equal hardware requirements per available machine, which we believe is the most fair and practical use case. In Section 2, we recall that having a bigger batch size leads to much more accurate gradient updates, which may significantly speed up the convergence, and this is indeed what we observe in Fig. 1.

5.2. Consensus with random updates

In this experiment, we assess the ability of GoSGD and PerSyn to keep consensus. We consider a worst-case scenario where the local updates are not correlated and, as such, we replace the gradient term by a random variable sampled from $\mathcal{N}(0, 1)$, independently and identically distributed on each worker. We show on Fig. 4 the following consensus error for different values of the frequency/probability of exchange p:

$$\varepsilon(t) = \sum_{m=1}^{M} \big\| x_m^{(t)} - \tilde{x}^{(t)} \big\|^2$$

As we can see, at high frequency/probability of exchange, GoSGD and PerSyn are equivalent on average. For low frequency/probability p = 0.01, the Gossip strategy has a consensus error in the range of the higher values of PerSyn, as both share the same magnitude. The main difference between the two strategies is that PerSyn exhibits the expected periodicity in its behavior, which leads to big variations of the consensus error, while GoSGD seems to have much less variation. Overall, GoSGD is able to maintain consensus properties comparable to synchronous algorithms despite its random nature.
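A sketch of how such a consensus curve can be produced: the local "gradients" are i.i.d. $\mathcal{N}(0,1)$ noise as in the worst-case setup above, the exchanges follow the GoSGD weight rules, and $\tilde{x}^{(t)}$ is taken to be the average of the local models (an assumption on the reference point).

```python
import numpy as np

def consensus_error(xs):
    """epsilon(t) = sum_m || x_m - x_bar ||^2, with x_bar the average of the local models."""
    x_bar = np.mean(xs, axis=0)
    return float(sum(np.sum((x - x_bar) ** 2) for x in xs))

rng = np.random.default_rng(9)
M, dim, eta, p, steps = 8, 10, 0.1, 0.05, 200
xs, ws = [np.zeros(dim) for _ in range(M)], [1.0 / M] * M
for _ in range(steps):
    xs = [x - eta * rng.standard_normal(dim) for x in xs]    # uncorrelated local updates
    for s in range(M):                                       # randomized gossip exchanges
        if rng.random() < p:
            r = (s + int(rng.integers(1, M))) % M            # uniform receiver != s
            ws[s] /= 2.0
            xs[r] = (ws[r] * xs[r] + ws[s] * xs[s]) / (ws[r] + ws[s])
            ws[r] += ws[s]
print(consensus_error(xs))
```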

6. Conclusion

In this paper, we discussed distributed computing strategies for training deep convolutional neural networks. We show that the classical stochastic gradient descent can benefit from such distribution since its accuracy depends on the size of the sample batch used at each iteration. We show that distributed SGD involves two key operations, namely computation and communication. We also show that there is a crucial trade-off between low communication costs and sufficient communication to ensure that all workers contribute to the same objective function.

We develop the original distributed SGD algorithm into a novel framework that allows us to design update strategies that precisely control the communication costs. From this framework, we propose two new algorithms, PerSyn and GoSGD, the latter being based on Randomized Gossip protocols.

In the experiments, we show that both PerSyn and GoSGD are able to work under communication rates as low as 0.01 message/update, which renders the communication costs almost negligible. We also show that GoSGD is faster than EASGD [9]. Furthermore, we show that the random communications of GoSGD allow for better generalization capabilities when compared to non-random communications.

Appendix A. Estimator variance

The error introduced by the Monte Carlo estimator of the gradient is inversely proportional to the batch size.

Proof. The relation between the approximation error and the size of the sample batch is given by:

$$\mathbb{E}\,\big\|\hat{\nabla}L(x) - \nabla L(x)\big\|_2^2 = \mathbb{E}\sum_{i=1}^{\dim(x)} \big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2 = \sum_{i=1}^{\dim(x)} \mathbb{E}\,\big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2$$

$\hat{\nabla}L(x)_i$ being an unbiased estimator of $\nabla L(x)_i$, we have $\mathbb{E}\big[\hat{\nabla}L(x)_i\big] = \nabla L(x)_i$ and $\mathbb{E}\,\big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2 = \mathrm{Var}\big(\hat{\nabla}L(x)_i\big)$. For Monte Carlo estimators we have that $\mathrm{Var}\big(\hat{\nabla}L(x)_i\big) = \mathrm{Var}\big(\frac{1}{N}\sum_{k=1}^{N} \nabla\ell(x, Y_k)_i\big) = \frac{1}{N}\mathrm{Var}\big(\nabla\ell(x, Y)_i\big)$, the $Y_k$ being i.i.d. Hence

$$\mathbb{E}\,\big\|\hat{\nabla}L(x) - \nabla L(x)\big\|_2^2 = \sum_{i=1}^{\dim(x)} \frac{1}{N}\mathrm{Var}\big(\nabla\ell(x, Y)_i\big) = \frac{1}{N}\,\mathrm{tr}\big(\mathrm{Cov}_Y[\nabla\ell(x, Y)]\big) \qquad \square$$

Appendix B. Alternate gradient

The update rules of GoSGD are in expectation equivalent to the empirical risk and the consensus loss parts of the gradient of the consensus-augmented distributed problem.

For the empirical risk part, since at each time step $t$ one of the workers performs a local stochastic gradient descent update with probability $1/M$, it is in expectation equivalent to performing the gradient descent on the empirical risk part with a learning rate divided by $M$.

For the consensus part, let us recall that the update for a receiving worker is:

$$x_r^{(t+1)} = \frac{w_r}{w_s + w_r}\, x_r^{(t)} + \frac{w_s}{w_s + w_r}\, x_s^{(t)}$$

This update occurs with probability $\frac{p}{M(M-1)}$ for all pairs $(x_s, x_r)$ of workers. First we demonstrate the following lemma:

Lemma 1. $\mathbb{E}\left[\frac{w_r}{w_r + w_s}\right] = \frac{1}{2}$ for all possible pairs $(w_r, w_s)$.

Proof. The weight update can be written with a random communication matrix $A^{(t)}$ with the following definition:

$$A^{(t)} = \begin{cases} I, & \text{with probability } p\\ I - \tfrac{1}{2}\, e_s e_s^\top + \tfrac{1}{2}\, e_r e_s^\top, & \text{with probability } 1-p \end{cases}$$

The sequence of $A^{(t)}$ are i.i.d. and we note $\mathbb{E}\big[A^{(t)}\big] = A$:

$$A = pI + (1-p) \sum_{s,\, r \neq s} \frac{1}{M(M-1)} \left( I - \tfrac{1}{2}\, e_s e_s^\top + \tfrac{1}{2}\, e_r e_s^\top \right) = \left(1 - \frac{1-p}{2(M-1)}\right) I + \frac{1-p}{2M(M-1)}\, \mathbf{1}\mathbf{1}^\top$$

Using $A$ and denoting $w^{(t)}$ the vector concatenating all weights at time $t$, we have the following recursion:

$$\mathbb{E}\big[w^{(t+1)} \,\big|\, w^{(t)}\big] = \mathbb{E}\big[A^{(t)}\big]\, w^{(t)} = A\, w^{(t)}$$

Since all weights are initialized to 1, unrolling the recursion leads to:

$$\mathbb{E}\big[w^{(t+1)}\big] = A^{t}\, \mathbf{1}$$

Since $\mathbf{1}$ is a right eigenvector of $A$ (associated with eigenvalue $\lambda$), we have:

$$\mathbb{E}\big[w^{(t+1)}\big] = \lambda^{t}\, \mathbf{1}$$

Which means that all weights are equal in expectation. $\square$
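A quick numerical sanity check of the lemma, added for illustration: simulating the weight dynamics many times (with the convention that an awakened worker gossips with probability $p$), the empirical mean of every weight stays at its common initial value, and the total weight is conserved exactly in every run.

```python
import numpy as np

rng = np.random.default_rng(10)
M, p, steps, runs = 5, 0.3, 50, 20000
acc = np.zeros(M)
for _ in range(runs):
    w = np.ones(M)                                 # common initial weight
    for _ in range(steps):
        s = int(rng.integers(M))                   # awakened worker
        if rng.random() < p:                       # it gossips with probability p
            r = (s + int(rng.integers(1, M))) % M  # uniform receiver != s
            w[s] /= 2.0                            # Eq. (9)
            w[r] += w[s]
    acc += w
print(acc / runs)        # every coordinate stays close to 1; the sum is exactly M in each run
```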
