HAL Id: hal-01930346

https://hal.archives-ouvertes.fr/hal-01930346

Submitted on 15 Jan 2019

Distributed Optimization for Deep Learning with Gossip Exchange

Michael Blot, David Picard, Nicolas Thome, Matthieu Cord

To cite this version:

Michael Blot, David Picard, Nicolas Thome, Matthieu Cord. Distributed Optimization for Deep Learning with Gossip Exchange. Neurocomputing, Elsevier, 2019, 330, pp.287-296.

10.1016/j.neucom.2018.11.002. hal-01930346


Distributed optimization for deep learning with gossip exchange

Michael Blot (a,1), David Picard (b,a), Nicolas Thome (a,c), Matthieu Cord (a)

a. UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, Sorbonne Universités, 4 place Jussieu, Paris 75005, France
b. ETIS UMR 8051, Université Paris Seine, Université Cergy-Pontoise, ENSEA, CNRS, France
c. CEDRIC, Conservatoire National des Arts et Métiers, 292 Rue St Martin, Paris 75003, France

Keywords: Optimization, Distributed gradient descent, Gossip, Deep learning, Neural networks

Abstract

We address the issue of speeding up the training of convolutional neural networks by studying a distributed method adapted to stochastic gradient descent. Our parallel optimization setup uses several threads, each applying individual gradient descents on a local variable. We propose a new way of sharing information between different threads based on gossip algorithms that show good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized.

1. Introduction

With deep convolutional neural networks (CNN) introduced by Fukushima [1] and LeCun et al. [2], computer vision tasks and more specifically image classification have made huge improvements in the years following [3]. CNN performances benefit a lot from big collections of annotated images like Russakovsky et al. [4] or Lin et al. [5]. They are trained by optimizing a loss function with gradient descents computed on random mini-batches according to Bottou [6]. The method called stochastic gradient descent (SGD) has proved to be very efficient to train neural networks in general.

However, current CNN structures are extremely deep, like the 100-layer ResNet of He et al. [7], and contain a lot of parameters (around 60M for AlexNet [3] and 130M for VGG [8]). Those structures involve heavy gradient computation times, making the training on big datasets very slow. Computation on GPU accelerates the training but requires huge local memory caches.

Corresponding author. E-mail addresses: michael.blot@lip6.fr (M. Blot), picard@ensea.fr (D. Picard), nicolas.thome@lip6.fr (N. Thome), matthieu.cord@lip6.fr (M. Cord).

1. Thanks to DGA for financing my research.

Nevertheless, the mini-batch optimization seems suitable for distributing the training over several threads. Many methods have been studied, like Refs. [9,10], which propose to distribute the batches over different threads called workers that periodically exchange information via a central thread to synchronize their models. In [9], when a worker exchanges information with the central node, it updates its parameter variable with a weighted average of its own parameter variable and the central node's. The master thread does the symmetrical update. In [10], the workers accumulate in an additional buffer the gradients computed at the last steps, and apply this aggregated gradient to the master thread parameters.
The major issue of these methods is that they induce waits in all computing nodes due to the synchronization problems with the critical resource that is the central node. A popular way of tackling this issue in distributed optimization involves exchanging information in a peer-to-peer fashion. In this category, Gossip protocols [11] have been successfully applied to many machine learning problems such as kernel methods [12], PCA [13] and k-means clustering [14,15]. We propose to combine Gossip protocols with SGD to efficiently train deep CNN without the need of a central node.

The contributions of this paper are the following. First, we propose a new framework for distributed SGD in which we can formally analyze and compare the properties of existing distributed deep learning training algorithms. Using this framework, we show that distributing the mini-batches is equivalent to using bigger batches in non-distributed SGD, which has implications on the convergence speed of the algorithms. Finally, by carefully analyzing the key aspects of our framework, we propose two distributed SGD algorithms for training deep CNN. The first one, called PerSyn, uses a central node in a very efficient way. The second one, called GoSGD, is based on Gossip exchanges and can thus be scaled to an arbitrarily large number of nodes, providing a significant speedup to the training procedure.

The remainder of this paper is organized as follows. In the next section, we recall the principles of SGD applied to deep learning as well as the relevant literature about distributing its computation. Next, in Section 3, we detail our formalization of distributed SGD, which allows us to compare existing algorithms and write our first proposition, PerSyn. In Section 4, we present a decentralized alternative based on Gossip algorithms, that we call GoSGD, which fits nicely in our framework. Finally, we present experiments showing the efficiency of distributing SGD in Section 5 before we conclude.

2. SGD and related work

In image classification, training deep CNN follows the empirical risk minimization (ERM) principle [16]: the objective is to minimize

$$L(x) = \mathbb{E}_{Y \sim I}\big[\ell(x, Y)\big] \qquad (1)$$

where $Y$ is a pair variable consisting of an image and the associated label following the natural image distribution $I$, $x$ are the CNN parameters and $\ell$ the loss function quantifying the prediction error made on $Y$. Despite the objective function usually being non-convex, gradient descent has been shown to be a successful optimization method for the problem. In the gradient descent algorithm [17], the update operation

$$x \leftarrow x - \eta\, \nabla L(x) \qquad (2)$$

is sequentially applied to the optimized variable until a convincing local minimum for $L$ is found or some stopping criteria are reached. $\eta$ is the gradient step size, called learning rate. In image classification, it is not possible to compute the expected gradient $\nabla L(x)$ because the distribution $I$ is unknown (and not properly modeled). However, it is possible to approximate it using a Monte Carlo method: $\nabla L(x) = \nabla_x \mathbb{E}_Y[\ell(x,Y)] = \mathbb{E}_Y[\nabla_x \ell(x,Y)]$ (the properties allowing to swap the derivative and the integral are supposed to be respected). This last quantity is approximated by a Monte Carlo estimator:

$$\hat{\nabla} L(x) = \frac{1}{N} \sum_{i=1}^{N} \nabla_x \ell(x, Y_i)$$

The set $(Y_i)_{i=1..N}$ of independent samples drawn from $I$ is called a batch and $N$ is the batch size. $\hat{\nabla} L(x)$ is then the descent direction used at each update. In practice, the images in a batch are drawn randomly from the training set and are different for each gradient step.
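To make the estimator concrete, here is a minimal NumPy sketch of the mini-batch gradient estimate and of the update of Eq. (2). The quadratic loss and the synthetic samples are illustrative assumptions, not the CNN setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(x, y):
    """Per-sample gradient of an illustrative quadratic loss l(x, Y) = 0.5 * ||x - y||^2."""
    return x - y

def batch_gradient(x, batch):
    """Monte Carlo estimate of the expected gradient over a batch (Y_i)_{i=1..N}."""
    return np.mean([grad_loss(x, y) for y in batch], axis=0)

# One SGD step, Eq. (2): x <- x - eta * gradient estimate
eta, N = 0.1, 16
x = rng.standard_normal(5)
batch = rng.standard_normal((N, 5)) + 2.0   # stands in for N samples drawn from I
x = x - eta * batch_gradient(x, batch)
```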

Concerning the precision of the gradient estimator, we point out that it is unbiased. It can also be shown that the approximation error with respect to the true gradient depends on the batch size. In fact, $\mathbb{E}\,\|\hat{\nabla} L(x) - \nabla L(x)\|_2^2 = \frac{1}{N}\,\mathrm{tr}\big(\mathrm{Cov}_Y[\nabla \ell(x,Y)]\big)$, where $\mathrm{Cov}_Y[\nabla \ell(x,Y)]$ is the covariance matrix of the gradient estimator and $\mathrm{tr}$ denotes the trace of the matrix (see Appendix A). Therefore, the bigger the batch size $N$, the more accurate the gradient estimator. Unfortunately, GPU memory is limited and bounds the number of images that can be processed in a batch. For instance, a GPU card with 12 GB of memory cannot process batches bigger than 48 images of 224×224 RGB pixels using ResNet [7] with 30 layers. Indeed, at each update step, the GPU needs to host the whole model intermediate computations for all images in the batch. As such, using a single GPU limits the precision of the stochastic gradient descent and thus hinders its convergence.
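The $1/N$ behaviour of the estimator error is easy to verify numerically. The snippet below is an illustration on the same toy quadratic loss as above (an assumption, not an experiment from the paper): the empirical squared error of the batch gradient shrinks roughly like $\mathrm{tr}(\mathrm{Cov})/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.zeros(5)
mu = 2.0 * np.ones(5)
true_grad = x - mu                              # true gradient of the toy loss is x - E[Y]

def batch_grad(N):
    ys = rng.standard_normal((N, 5)) + mu       # samples Y_k ~ N(mu, I)
    return np.mean(x - ys, axis=0)

for N in (8, 32, 128):
    err = np.mean([np.sum((batch_grad(N) - true_grad) ** 2) for _ in range(2000)])
    print(N, err)    # error decreases roughly like tr(Cov)/N = 5/N for this unit-variance noise
```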

2.1. Related work

There are many ways to accelerate the convergence time of SGD, one of which is to optimize the learning rate $\eta$ for each gradient descent iteration, as aimed at by second order gradient descent methods [17]. For deep learning, the parameter space is of high dimension and second order methods are computationally too heavy to be applied efficiently. Examples of first order gradient descent that adapt the learning rate with success can be found in [18] and [19].

A second way to accelerate the gradient descent is to reduce the computation time of $\hat{\nabla}L(x)$, which is the purpose of this paper. Based on the availability of several GPU cards, different methods that overcome the GPU batch size limitations have been proposed. There are two main ways of distributing the computation of the gradient update: either splitting the model and distributing parts of it among the different GPUs, or splitting the batch of images into subsets and distributing them among the GPUs.

A famous technique used by Krizhevsky et al. [3] for distributing the model consists in parallelizing the forward/backward pass of a batch over two GPU cards. The convolutional part of the network is split into two independent parts that are only aggregated with a fully connected layer at the top of the network. This conception enables loading the two parallel parts on two different GPUs, liberating space for more images in batches. However, the CNN models claiming current state-of-the-art performances in image classification, like He et al. [7] and Huang et al. [20], benefit a lot from the inter-correlation between features of each layer and do not allow independent partitioning within a layer. Thus, if the network is loaded on several GPU cards, they would have to communicate during the forward pass of a batch. The communications being time-consuming, it is preferable to keep the whole network on a single GPU card.

The second approach, which consists in distributing subsets of the batch to the different GPUs, is becoming increasingly popular. We use the following notations to describe such methods: as in [9], the different threads that manage a neural network model are called workers. With M GPU cards, there are M workers and their corresponding set of parameters is noted x_m for each worker m. The set of parameters used for inference is noted x̃ and is specifically the model we want to optimize. The easiest way of distributing the batches is presented in [21] and consists in computing the average of all the workers' models as in Algorithm 1.

Algorithm 1 Fully synchronous SGD: pseudo-code.
1: Input: M: number of threads, η: learning rate
2: Initialize: x is initialized randomly, x_m = x
3: repeat
4:   ∀m, v_m = ∇̂L_m(x_m)
5:   v = (1/M) Σ_{m=1}^{M} v_m
6:   ∀m, x_m = x_m − η v
7: until Maximum iteration reached
8: return (1/M) Σ_{m=1}^{M} x_m
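A minimal single-process simulation of Algorithm 1, where the M GPU workers become plain Python loops and the per-worker mini-batch gradients are replaced by a noisy toy quadratic (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
M, eta, dim, steps = 4, 0.1, 5, 100

def local_grad(x, m):
    # stands in for the mini-batch gradient computed by worker m on its own batch
    return (x - 2.0) + 0.1 * rng.standard_normal(dim)

x = rng.standard_normal(dim)
xs = [x.copy() for _ in range(M)]                    # x_m = x
for _ in range(steps):
    vs = [local_grad(xs[m], m) for m in range(M)]    # line 4: local gradient estimates
    v = np.mean(vs, axis=0)                          # line 5: aggregation by the master
    xs = [xm - eta * v for xm in xs]                 # line 6: identical update everywhere
x_tilde = np.mean(xs, axis=0)                        # line 8: returned model
```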

We stress the fact that after each iteration all workers host the same value of the variable x̃ to optimize (i.e., ∀m, x_m = x̃). This method is equivalent to using M times bigger batches for the gradient computation (in [21] they use the sum instead of the average of the gradients, but both are equivalent by adjusting the learning rate). It is then possible to alleviate the GPU memory constraint limiting the number of examples in a batch in order to get more accurate gradients. However, in comparison with the classical SGD, there are three additional phases of communication and aggregation that have incompressible time:

1. The transmission to the master of the local gradient estimates computed by the threads at line 4.
2. The aggregation of the gradients at line 5.
3. The transmission of the aggregated gradient ∇L(x) to all threads before the local updates at line 6.

In large scale image classification, the communications involved at lines 4 and 6 imply a CNN with millions of parameters and are thus very costly. Moreover, to perform line 5, the master has to wait for all workers to have completed line 4. The additional time involved by communications and synchronizations can have a significant impact on the global training time. This is even more true when the GPUs are not on the same server, where the communication delays are substantial. Thus, this easy synchronous distributed method can be very inefficient.

The focus of this paper is to reduce these communication and aggregation costs. As with existing strategies such as Refs. [9,10], the main idea is to control the amount of information exchanged between the nodes. This amount is a key factor in making all workers contribute to the optimization of the test model x̃. Indeed, if no information is ever exchanged, the distributed system is equivalent to training M independent models. As a consequence of the symmetry property of CNN explained in [22], these independent models are likely to be very different and almost impossible to combine. As such, the distributed training would not improve over using a single GPU with a reduced batch size. The main challenge of this work is thus to allow as little information exchange as possible to ensure reduced communication costs, while still optimizing the combined model x̃.

In the next section, we introduce a matrix framework that allows us to explicitly control the amount of exchanged information, and thus to explore different optimization strategies.

3. Framework

In this section, we propose a general framework for distributed SGD methods. We start from the centralized SGD update procedure and we show the equivalent synchronous distributed procedure. Then, we show that relaxing some equality constraints among workers is equivalent to considering different communication strategies. We are thus able to write strategies found in the literature using a simple matrix expression.

First, we unroll the recursion of the classic SGD update of Eq. (2) to obtain the value of x̃ after T+1 gradient descent steps:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \hat{\nabla}L\big(\tilde{x}^{(t)}\big) \qquad (3)$$

with $\tilde{x}^{(t)}$ being variable x̃ at time t. The equivalent batch distribution method is the following:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{m=1}^{M} \hat{\nabla}L_m\big(\tilde{x}^{(t)}\big)$$

In order to make the communication between the workers and the central model more visible, we can use the equivalent update rule that introduces local variables x_m and consensus constraints. By doing so, we obtain the standard distributed SGD method:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{m=1}^{M} \hat{\nabla}L_m\big(x_m^{(t)}\big) \qquad (4)$$
$$\text{s.t.}\quad \forall t,\; \forall m,\; x_m^{(t)} = \tilde{x}^{(t)} \qquad (5)$$

We can replace the consensus equality constraints (5) by equivalent local update rules obtained from (4):

$$\forall r,\; x_r^{(T+1)} = \tilde{x}^{(0)} - \frac{\eta}{M} \sum_{t=0}^{T} \sum_{s=1}^{M} \hat{\nabla}L_s\big(x_s^{(t)}\big)$$

This formulation is a first step towards decentralized and communication efficient algorithms since it makes the communications between workers visible. Indeed, at each step, each worker gathers the gradients from all other workers and applies them to its local variable. The main drawback of this formulation is that the number of messages exchanged by workers at each step is now M(M−1) instead of 2M in the classical version.

To allow for more efficient communication, we relax the consensus constraints by allowing local variables to differ slightly. This can be done efficiently by introducing non-uniform weights $\alpha_{r,s}$ in the combination of the local gradients:

$$\tilde{x}^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \sum_{m=1}^{M} \alpha_{0,m}\, \hat{\nabla}L_m\big(x_m^{(t)}\big)$$
$$\forall r,\; x_r^{(T+1)} = \tilde{x}^{(0)} - \eta \sum_{t=0}^{T} \sum_{s=1}^{M} \alpha_{r,s}\, \hat{\nabla}L_s\big(x_s^{(t)}\big)$$
$$\text{s.t.}\quad \forall r,\; \sum_{s=1}^{M} \alpha_{r,s} = 1$$

We introduce a matrix notation over the $\alpha_{r,s}$ to simplify the expression of these update rules. We note $x^{(t)} = [\tilde{x}^{(t)}, x_1^{(t)}, \dots, x_M^{(t)}]$ the vector concatenating the central variable $\tilde{x}^{(t)}$ and all local variables $x_m^{(t)}$ at time step $t$. Similarly, we note $v^{(t)} = [0, \hat{\nabla}L_1(x_1^{(t)}), \dots, \hat{\nabla}L_M(x_M^{(t)})]$ the vector concatenating all the local gradients $\hat{\nabla}L_m(x_m^{(t)})$ at time step $t$, with a leading 0 to have the same size as $x$. The weights $\alpha_{r,s}$ of the gradient combinations are enclosed in a matrix $K$, which leads to the following matrix update rule:

$$x^{(T+1)} = \left(\prod_{t=0}^{T} K\right) x^{(0)} - \eta \sum_{t=0}^{T} \left(\prod_{\tau=t}^{T} K\right) v^{(t)}$$

$K$ is the communication matrix and is responsible for the mixing of the local updates. Remark that its rows have to sum to 1 to avoid exploding gradients. The sparser $K$ is, the fewer workers exchange information and thus the more efficient the algorithm is. However, this comes at the price of less consensus among the local models. At the extreme, if $K$ is diagonal (no information exchanged between workers), these update rules are equivalent to $M$ different models being independently trained. Hence, the choice of $K$ is critical to have a good trade-off between low communication costs and good information sharing between workers.

To further allow for the optimization of $K$, we introduce time dependent communication matrices $K^{(t)}$. This allows for the design of very sparse communication matrices most of the time, balanced with the use of a dense $K^{(t)}$ to enforce better consensus whenever it is required. We note $P_t^T = \prod_{\tau=t}^{T} K^{(\tau)}$, and obtain:

$$x^{(T+1)} = P_0^T\, x^{(0)} - \eta \sum_{t=0}^{T} P_t^T\, v^{(t)}$$

This corresponds to the following recursion:

$$x^{(T+1)} = K^{(T)} x^{(T)} - \eta\, K^{(T)} v^{(T)}$$

Remark that since the update is linear, mixing the local gradients and mixing the local variables are equivalent. As such, we can define an intermediate variable $x^{(t+\frac{1}{2})} = x^{(t)} - \eta\, v^{(t)}$ that considers only the local updates and obtain the following update rules:

$$x^{(T+\frac{1}{2})} = x^{(T)} - \eta\, v^{(T)} \qquad (6)$$
$$x^{(T+1)} = K^{(T)}\, x^{(T+\frac{1}{2})} \qquad (7)$$

Since these rules make a clear distinction between local computation (step $T+\frac{1}{2}$) and communication (step $T+1$), they are the ones we will use in the remainder of this article.
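The whole framework then fits in two lines of linear algebra. Below is a sketch where the stacked state $[\tilde{x}, x_1, \dots, x_M]$ is stored as an $(M+1)\times d$ array; the communication matrix $K$ and the gradient stack $v$ are placeholders to be provided by a concrete strategy (PerSyn, EASGD, GoSGD).

```python
import numpy as np

def framework_step(x, v, K, eta):
    """One iteration of Eqs. (6)-(7): local computation then communication.

    x : (M+1, d) array stacking [x_tilde, x_1, ..., x_M]
    v : (M+1, d) array of local gradients with a leading zero row
    K : (M+1, M+1) communication matrix (rows sum to 1)
    """
    x_half = x - eta * v          # Eq. (6): local gradient steps only
    return K @ x_half             # Eq. (7): mixing decided by K

# sanity check: K = identity means M independent models, no information exchanged
M, d = 4, 3
x = np.random.default_rng(3).standard_normal((M + 1, d))
v = np.vstack([np.zeros(d), np.ones((M, d))])
assert np.allclose(framework_step(x, v, np.eye(M + 1), 0.1), x - 0.1 * v)
```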

In the following, we show how several existing distributed SGD can be specified by their sequence of communication matrices $K^{(t)}$. We first start with a naive yet communication efficient method we call PerSyn (Periodically Synchronous) that considers exchanging information only once every few steps. Then, we show how EASGD [9] and Downpour [10] compare to PerSyn.

3.1. PerSyn

As a first illustration of our framework we introduce a new communication and aggregation strategy for distributed SGD that is very simple to implement and yet surprisingly effective. The main idea is to relax the synchronization of all local variables with the master variable to occur only once every $\tau$ steps. This is done by defining the following time dependent communication matrices:

$$K^{(t)} = \begin{cases}
\begin{pmatrix} 0 & \frac{1}{M}\mathbf{1}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{if } t \bmod \tau = 0\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{1} & 0 \end{pmatrix}, & \text{if } t \bmod \tau = 1\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{else}
\end{cases}$$

$I$ being the identity matrix of size $M \times M$, and $\mathbf{1}$ the vector of size $M$ with 1 on every component. The corresponding algorithm is described in Algorithm 2. Remark that it is very similar to the standard distributed SGD (Algorithm 1) and thus can be very easily implemented.

Algorithm 2 PerSyn SGD: pseudo-code.
1: Input: M: number of threads, η: learning rate
2: Initialize: x is initialized randomly, x_m = x, t = 0
3: repeat
4:   for all m, x_m^(t+1/2) ← x_m^(t) − η ∇̂L_m^N(x_m^(t))
5:   t = t + 1
6:   if t mod τ = 0 then
7:     x̃^(t+1) = (1/M) Σ_{m=1}^{M} x_m^(t+1/2)
8:     ∀m, x_m^(t+1) ← (1/M) Σ_{m=1}^{M} x_m^(t+1/2)
9:   else
10:    ∀m, x_m^(t+1) ← x_m^(t+1/2)
11: end if
12: until Maximum iteration reached
13: return (1/M) Σ_{m=1}^{M} x_m
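A small sketch constructing the PerSyn communication matrices as reconstructed above (the exact gather/broadcast layout should be treated as an assumption) and checking that every row sums to 1, as required by the framework:

```python
import numpy as np

def persyn_K(t, tau, M):
    """Communication matrix over the stacked state [x_tilde, x_1, ..., x_M]."""
    K = np.zeros((M + 1, M + 1))
    if t % tau == 0:                 # master gathers the average of the workers
        K[0, 1:] = 1.0 / M
        K[1:, 1:] = np.eye(M)
    elif t % tau == 1:               # master broadcasts back to every worker
        K[:, 0] = 1.0
    else:                            # no communication at all
        K[0, 0] = 1.0
        K[1:, 1:] = np.eye(M)
    return K

M, tau = 4, 10
for t in range(2 * tau):
    assert np.allclose(persyn_K(t, tau, M).sum(axis=1), 1.0)   # rows sum to 1
```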

As we can see, PerSyn involves no communication at all during a fraction $(\tau-1)/\tau$ of the time. However, during that time, models are driven by their local gradients only and can thus diverge from one another, leading to incoherent distributed optimization. The trade-off with PerSyn is then to choose $\tau$ so as to minimize the communication costs while forbidding models to diverge from one another during the time where no communication is made.

One main limitation of PerSyn is that the communication matrix is very dense when $t \bmod \tau = 0$. This is taken care of by EASGD [9] by considering a moving average of the local models instead of a strict average.

3.2. EASGD

In EASGD, the local models are periodically averaged like in PerSyn. However, the averaging is performed in an elastic way that allows for reduced communication compared to PerSyn. The corresponding matrix is then:

$$K^{(t)} = \begin{cases}
\begin{pmatrix} 1 - M\alpha & \alpha \mathbf{1}^\top \\ \alpha \mathbf{1} & (1-\alpha)\, I \end{pmatrix}, & \text{if } t \bmod \tau = 0\\[6pt]
\begin{pmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & I \end{pmatrix}, & \text{else}
\end{cases}$$

with $\alpha$ being a mixing weight.

As we can see, most of the time no communication is performed and the communication matrix is the identity. Every $\tau$ iterations, an averaging synchronization is performed to keep the local models from diverging from one another. However, contrary to PerSyn, which replaces all models (local and global) by the new average, this is done by computing a weighted average between the previous model and the current average. By doing so, EASGD is able to exchange only 2M messages every $\tau$ iterations, without having to wait for the central node to compute the new average. However, a global synchronization is still required as the master has to perform the weighted average of local models that are consistent, that is, local models that have been updated the same number of times.
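The elastic matrix translates directly into code. The sketch below builds $K^{(t)}$ for EASGD and checks that, at a synchronization step, it reproduces the usual elastic update $\tilde{x} \leftarrow (1-M\alpha)\tilde{x} + \alpha\sum_m x_m$ and $x_m \leftarrow (1-\alpha)x_m + \alpha\tilde{x}$; the values of $\alpha$ and $\tau$ are illustrative.

```python
import numpy as np

def easgd_K(t, tau, M, alpha):
    if t % tau != 0:
        return np.eye(M + 1)                       # identity: no communication
    K = np.zeros((M + 1, M + 1))
    K[0, 0] = 1.0 - M * alpha                      # master keeps (1 - M*alpha) of itself
    K[0, 1:] = alpha                               # ... plus alpha of every worker
    K[1:, 0] = alpha                               # each worker pulls alpha of the master
    K[1:, 1:] = (1.0 - alpha) * np.eye(M)          # ... and keeps (1 - alpha) of itself
    return K

M, alpha = 4, 0.05
rng = np.random.default_rng(4)
x = rng.standard_normal((M + 1, 3))
mixed = easgd_K(0, 10, M, alpha) @ x
assert np.allclose(mixed[0], (1 - M * alpha) * x[0] + alpha * x[1:].sum(axis=0))
assert np.allclose(mixed[1], (1 - alpha) * x[1] + alpha * x[0])
```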

3.3. Downpour SGD

In Downpour SGD [10], an asynchronous update scheme is discussed. By asynchronous, the authors mean that each worker is endowed with its own clock and processes gradients at a rate that is completely independent from the other workers. This can be easily described in our matrix framework by considering a universal clock that ticks each time one of the workers' clocks ticks. This corresponds to considering the finest time resolution possible, when only one worker is awake at any given time step. In particular, this redefines $v^{(t)} = [0, \dots, 0, \hat{\nabla}L_m(x_m^{(t)}), 0, \dots]$ to have zeros everywhere but on the component corresponding to the awakened worker $m$.

Whenever it awakes, each worker has the option to communicate with the master while performing its gradient update. These communications can either be fetching the global model from the master or sending the current update to the master. This is expressed by the following communication matrices:

$$K^{(\text{send})} = \begin{pmatrix} 1 & e_m^\top \\ 0 & I \end{pmatrix}, \qquad K^{(\text{receive})} = \begin{pmatrix} 1 & \mathbf{0}^\top \\ e_m & I - e_m e_m^\top \end{pmatrix}$$

with $e_m$ being the vector of zeros with a 1 on component $m$. When nodes are neither sending nor receiving, the communication matrix is simply the identity and thus the only ongoing operation is a computation which involves a single worker that performs a local gradient update. The main drawback of Downpour is that it involves a master that stores the most up-to-date model. This single entry point can be a source of weakness in the training process since it acts as the communication bottleneck (all workers have to communicate with the master) and is also a critical point of failure. We intend to overcome this drawback by introducing new communication matrices that lead to fully decentralized training algorithms.
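As an illustration of the rank-one structure of these matrices, the sketch below builds $K^{(\text{receive})}$ for an awakened worker $m$ and checks that applying it overwrites $x_m$ with the master's model while leaving every other row untouched (only the receive case is exercised here; the indexing convention is mine):

```python
import numpy as np

def downpour_K_receive(m, M):
    """Worker m (1-indexed into the stacked state) fetches the global model from the master."""
    e_m = np.zeros(M)
    e_m[m - 1] = 1.0
    K = np.zeros((M + 1, M + 1))
    K[0, 0] = 1.0                                   # master row: unchanged
    K[1:, 0] = e_m                                  # worker m reads the master...
    K[1:, 1:] = np.eye(M) - np.outer(e_m, e_m)      # ...and forgets its own copy; others keep theirs
    return K

M, m = 4, 2
x = np.random.default_rng(5).standard_normal((M + 1, 3))
y = downpour_K_receive(m, M) @ x
assert np.allclose(y[m], x[0]) and np.allclose(y[0], x[0]) and np.allclose(y[1], x[1])
```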


4. Gossip Stochastic Gradient Descent (GoSGD)

To remove the burden of using a master, several key modifications to the communication process have to be made. First, the absence of a master means that each worker is expected to converge to the same value x̃, which corresponds to ensuring a strict consensus among workers. Furthermore, this absence of a master is reflected by the communication matrix having its first row and first column set to 0. Therefore, all communications are performed in a peer-to-peer way, which is reflected in the communication matrix where the columns are the senders and the rows are the receivers.

Second, the absence of a master also implies the absence of a global clock. This means that workers are performing their updates at any time, independently of the others. To reflect this, we consider the same clock model as with Downpour, where at each time step $t$, only a single worker $s$ is awakened. Consequently, this also redefines $v^{(t)} = [0, \dots, 0, \hat{\nabla}L_s(x_s^{(t)}), 0, \dots]$ to have zeros everywhere but on the component corresponding to the awakened worker $s$. Furthermore, since the active worker is random, the communications have to be randomized too, which leads to random communication matrices $K^{(t)}$. These random peer-to-peer communication systems are known as Randomized Gossip Protocols [11], and are known to converge to the consensus exponentially fast in the absence of perturbations (which in our case corresponds to $\forall t,\, v^{(t)} = 0$). To control the communication cost, the frequency at which a worker emits messages is set by using a random variable deciding whether the awakened node shares its variables with neighbors.

Finally, we want a communication protocol where no worker is waiting for another, so as to maximize the time they spend computing their local gradients. Waiting occurs when a worker is expecting a message from another worker and has to idle until this message arrives. This is the case in symmetrical communication where workers expect replies to their messages. It is well known in the Randomized Gossip literature that such local blocking waits can cause global synchronization issues where most of the workers spend their time waiting for the needed resources to be available. To overcome this problem, asymmetric gossip protocols have been developed, such as in [23]. Asymmetric means that no worker is both sending and receiving messages at the same time, which translates into constraints on the non-zero coefficients of the communication matrix $K^{(t)}$. To ensure convergence to the consensus, a weight $w_m^{(t)}$ has to be associated with every variable $x_m^{(t)}$ and is shared by workers using the same communications as the variables $x_m^{(t)}$. These communication protocols are known as Sum-Weight Gossip protocols because of the introduction of the weights $w_m^{(t)}$.

Using these assumptions, we define the Gossip Stochastic Gradient Descent (GoSGD) as follows: let $s$ be the worker awakened at time $t$, which is our potential sender. Let $S \sim \mathcal{B}(p)$ be a random variable following a Bernoulli distribution of probability $p$. Let $r$ be a random variable sampled from the uniform distribution on $\{1,\dots,M\} \setminus \{s\}$; $r$ represents our potential receiver. The communication matrix $K^{(t)}$ is then:

tionmatrixK(t) isthen:

K(t)=

⎧ ⎪

⎪ ⎪

⎪ ⎨

⎪ ⎪

⎪ ⎪

0 ...

... I+w(tw)(st) s +w(rt)

eres +

w(st) w(st)+w(rt)−1

eses

,ifS

0 ... ... I

,else

(8)

If $S$ is successful, the weights of the sender $s$ and the receiver $r$ are updated as follows:

$$w_s^{(t+1)} = \frac{w_s^{(t)}}{2}, \qquad w_r^{(t+1)} = w_r^{(t)} + \frac{w_s^{(t)}}{2} \qquad (9)$$

Notice that this update involves a single message from $s$ to $r$, just as for the update of $x_r^{(t)}$ using $K^{(t)}$. In practice, both $x_s^{(t)}$ and $w_s^{(t)}$ are encapsulated in a single message and sent together.

Remark also that, contrarily to what is found in the sum-weight gossip literature, we do not perform explicit ratios of the variables and their associated weights. Our implicit ratio is performed by including the weights in the communication matrix $K^{(t)}$. This leads to more complicated matrices than what is common in the literature and hinders the theoretical analysis of the communication process (since the $K^{(t)}$ are no longer i.i.d.). However, it also leads to a much easier implementation when considering large deep neural networks since the gradient is computed on the variable only and not on the ratio of variables and associated weights. Furthermore, since the updates are equivalent, our proposed communication matrix keeps all the exponential convergence to consensus properties of standard sum-weight gossip protocols.
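A compact sketch of one exchange written with the worker block of $K^{(t)}$ as reconstructed above, together with the weight update of Eq. (9). The assertions check the properties the consensus argument relies on: rows of the block sum to 1, the sender's variable is untouched, and the total weight is conserved. Helper names and indices are mine.

```python
import numpy as np

def gossip_K_block(ws, s, r):
    """Worker block of K(t) for a successful exchange s -> r (Eq. (8), as reconstructed)."""
    M = len(ws)
    c = ws[s] / (ws[s] + ws[r])
    K = np.eye(M)
    K[r, s] += c          # receiver pulls a fraction of the sender's variable
    K[r, r] -= c          # ...and forgets the same fraction of its own
    return K

def weight_update(ws, s, r):
    """Eq. (9): the sender keeps half of its weight, the receiver absorbs the other half."""
    ws = ws.copy()
    ws[r] += ws[s] / 2.0
    ws[s] /= 2.0
    return ws

ws = np.array([0.25, 0.5, 0.25])
K = gossip_K_block(ws, s=1, r=0)
assert np.allclose(K.sum(axis=1), 1.0)                        # rows sum to 1: no exploding gradients
assert np.allclose(K[1], np.eye(3)[1])                        # the sender's variable is left untouched
assert np.isclose(weight_update(ws, 1, 0).sum(), ws.sum())    # total weight is conserved
```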

4.1. GoSGD algorithm

The procedure run by each worker to perform GoSGD is described in Algorithm 3. Each worker is endowed with a queue q_m which can be concurrently accessed by all workers. Each worker repeats the same loop, which consists in a sequence of three operations, namely processing incoming messages, performing a local gradient descent update and possibly sending a message to a neighbor worker.

Algorithm 3 GoSGD: worker pseudo-code.
1: Input: p: probability of exchange, M: number of workers, η: learning rate, ∀m, q_m: message queue associated with worker m, s: current worker id.
2: Initialize: x is initialized randomly, x_s = x, w_s = 1/M
3: repeat
4:   processMessages(q_s)
5:   x_s ← x_s − η ∇̂L_s(x_s)
6:   if S ∼ B(p) then
7:     r = Random(M)
8:     pushMessage(q_r, x_s, w_s)
9:   end if
10: until Maximum iteration reached
11: return x_s

Algorithm 4 Gossip update functions.
1: function pushMessage(queue q_r, x_s, w_s)
2:   x_s ← x_s
3:   w_s ← w_s / 2
4:   q_r.push((x_s, w_s))
5: end function
6: function processMessages(queue q_r)
7:   repeat
8:     (x_s, w_s) ← q_r.pop()
9:     x_r ← (w_r / (w_s + w_r)) x_r + (w_s / (w_s + w_r)) x_s
10:    w_r ← w_r + w_s
11:  until q_r.empty()
12: end function
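Below is a runnable single-machine sketch of Algorithms 3 and 4 in which the workers become Python threads, the message queues are `queue.Queue` objects and the CNN mini-batch gradient is replaced by a noisy toy quadratic. Everything except the queue handling (which mirrors pushMessage/processMessages) is an illustrative assumption.

```python
import queue
import threading
import numpy as np

M, p, eta, steps, dim = 4, 0.1, 0.05, 500, 8
rng = np.random.default_rng(7)
queues = [queue.Queue() for _ in range(M)]               # q_m, one per worker
xs = [rng.standard_normal(dim) for _ in range(M)]        # local variables x_m
ws = [1.0 / M] * M                                       # associated weights w_m

def worker(s):
    local_rng = np.random.default_rng(s)
    for _ in range(steps):
        # processMessages(q_s): drain the queue, mixing with the sum-weight ratios
        while not queues[s].empty():
            x_in, w_in = queues[s].get()
            xs[s] = (ws[s] * xs[s] + w_in * x_in) / (ws[s] + w_in)
            ws[s] += w_in
        # local SGD step on a toy quadratic loss (stands in for the CNN mini-batch gradient)
        grad = (xs[s] - 1.0) + 0.1 * local_rng.standard_normal(dim)
        xs[s] = xs[s] - eta * grad
        # with probability p, push (x_s, w_s/2) to a random neighbour and keep the other half
        if local_rng.random() < p:
            r = int(local_rng.integers(M - 1))
            r = r if r < s else r + 1                    # uniform over {1..M} \ {s}
            ws[s] /= 2.0
            queues[r].put((xs[s].copy(), ws[s]))

threads = [threading.Thread(target=worker, args=(s,)) for s in range(M)]
for t in threads: t.start()
for t in threads: t.join()
print(max(np.linalg.norm(x - xs[0]) for x in xs))        # workers end up close to a consensus
```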

Remark that all messages are processed in a delayed fashion, in the sense that several messages can be received and the corresponding variables may have been updated by their respective workers before the receiving worker applies the reception procedure. From this point of view, it seems the workers are taking into account outdated information, which is even more the case with a low communication frequency controlled by a low probability p. However, as we show in the experiments, even very low probabilities of communication (p = 0.01) yield very good convergence properties and satisfying consensus.

Fig. 1. Training loss evolution for PerSyn (a) and GoSGD (b) for different frequency/probability of exchange p (training loss vs. number of batches; curves for p = 0.400, 0.100, 0.010 and a 1-thread baseline with batch size 16).

Fig. 2. Convergence speed comparison between GoSGD and EASGD for similar frequency/probability of exchange p = 0.02.

4.2. Distributed optimization interpretation

We now show that our proposed GoSGD solves a consensus-augmented version of the distributed ERM principle. As shown in Section 3, the easiest way of tackling the distributed ERM principle is to consider local variables constrained to be equal:

$$\min_{x_1,\dots,x_M} \sum_{m=1}^{M} L(x_m) \quad \text{s.t.}\quad \forall m,\; x_m = \hat{x}, \qquad \hat{x} = \frac{1}{M}\sum_{m=1}^{M} x_m$$

As used in [9] and [24], the consensus constraints can be relaxed, which leads to the following augmented problem:

$$\min_{x_1,\dots,x_M} \sum_{m=1}^{M} \left[ L(x_m) + \frac{\rho}{2}\,\|x_m - \hat{x}\|_2^2 \right] \qquad (10)$$
$$\text{s.t.}\quad \hat{x} = \frac{1}{M}\sum_{m=1}^{M} x_m \qquad (11)$$

We can rewrite this loss in order to make the equivalent gossip communications visible:

$$\min_{x_1,\dots,x_M} \sum_{s=1}^{M} \left[ L(x_s) + \frac{\rho}{4M} \sum_{r=1}^{M} \|x_s - x_r\|_2^2 \right] \qquad (12)$$

The gradient of this objective function contains two terms, the first being related to the empirical risk and the second being the pairwise distance between local models. We prove in Appendix B that the update rules of GoSGD are in expectation equivalent to these two gradient terms, and thus that GoSGD performs a stochastic alternate gradient descent on this augmented problem.
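For reference, here is a small sketch of the augmented objective (12) and of the consensus part of its gradient, $\frac{\rho}{M}\sum_r (x_s - x_r)$, together with a finite-difference check; the quadratic local loss is a stand-in and the helper names are mine.

```python
import numpy as np

def augmented_objective(xs, rho, local_loss):
    """Eq. (12): empirical risk plus pairwise consensus penalty."""
    M = len(xs)
    risk = sum(local_loss(x) for x in xs)
    penalty = rho / (4 * M) * sum(np.sum((xs[s] - xs[r]) ** 2)
                                  for s in range(M) for r in range(M))
    return risk + penalty

def consensus_grad(xs, s, rho):
    """Gradient of the penalty with respect to x_s: (rho / M) * sum_r (x_s - x_r)."""
    M = len(xs)
    return rho / M * sum(xs[s] - xs[r] for r in range(M))

rng = np.random.default_rng(8)
xs, rho = [rng.standard_normal(3) for _ in range(4)], 0.5
loss = lambda x: 0.5 * np.sum(x ** 2)          # illustrative local empirical risk

# finite-difference check of the consensus part of the gradient on coordinate 0 of x_0
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
xp = [x.copy() for x in xs]
xp[0] = xp[0] + e0
penalty_only = lambda x: 0.0                   # isolate the penalty term
num = (augmented_objective(xp, rho, penalty_only) - augmented_objective(xs, rho, penalty_only)) / eps
assert abs(num - consensus_grad(xs, 0, rho)[0]) < 1e-4
```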

5. Experiments

We conducted experiments on the CIFAR-10 dataset (see [25] for a detailed presentation) using a CNN defined in [9] and described in [26]. In the first part, we test the convergence rate of different distributed algorithms, namely PerSyn, GoSGD and EASGD. The second part of our experiments considers the consensus convergence ability of the different communication strategies. For distributed experiments we always use M = 8 workers loaded on 8 different Tesla K-20 GPU cards with 5 GB of RAM each. The deep learning framework that we used is Torch 7 [27]. For information, we obtained with GoSGD in this framework a time of 0.25 s per SGD iteration for a probability of exchange equal to 0.01. For all the experiments, p represents the frequency/probability of communication per batch for one worker. In order to get a fair comparison, the methods are always compared at equal frequency/probability of exchange.

5.1. Training speed and generalization performances

In the following experiments we study the behaviour of the distributed algorithms during training on CIFAR-10. We use the same data augmentation strategy as in [9]. During training, the learning rate is set to 0.1 and the weight decay is set to $10^{-4}$. We used eight workers.

We show on Fig. 1 the evolution of the training loss for both PerSyn and GoSGD for different values of the frequency/probability of exchange p. As we can see, PerSyn seems to be slightly faster in terms of convergence speed (as measured by the number of iterations required to reach a specific loss value). However, remember that, even not taking into account synchronization issues, PerSyn requires double the amount of messages of GoSGD for the same frequency/probability since workers have to wait for the master to answer.

Fig. 3. Validation accuracy evolution for PerSyn and GoSGD for different frequency/probability of exchange p.

Similarly, we compare the training loss of GoSGD with EASGD on Fig. 2 using a real world clock. As we can see, GoSGD is significantly faster than EASGD, which can be explained by the non-blocking updates of GoSGD as well as by the fact that GoSGD requires fewer messages to keep the same consensus error. Finally, we show on Fig. 3 the evolution of the validation accuracy for both PerSyn and GoSGD for different values of the frequency/probability of communication p. For low frequency p = 0.01, despite PerSyn being faster in terms of training loss decrease, both PerSyn and GoSGD models obtain equivalent validation accuracy. However, GoSGD is expected to be at least twice as fast in terms of real world clock since it uses half the number of messages for the same rate p.

Fig. 4. Evolution of the consensus error ε(t) against time for different communication frequency/probability p for both GoSGD and PerSyn.

At higher exchange rate p = 0.4, we can see that GoSGD obtains better validation accuracy than PerSyn, despite having higher training loss, which means that GoSGD is in practice more robust to overfitting. This can be explained by the randomized nature of the gossip exchanges, which perform some kind of stochastic exploration of the parameter space, similarly to what is done by [26].

Remark that although the batch size is equal for each worker, the implicit global batch size of the algorithms is not equivalent for all curves. Indeed, the single thread run has a batch size of 16, while each of the 8 workers of the distributed runs has a batch size of 16. Therefore the distributed runs have an equivalent batch size of 16×8 = 128, as calculated in Section 3. These batch sizes are chosen so as to have equal hardware requirements per available machine, which we believe is the most fair and practical use case. In Section 2, we recall that having a bigger batch size leads to much more accurate gradient updates, which may significantly speed up the convergence, and this is indeed what we observe in Fig. 1.

5.2. Consensus with random updates

In this experiment, we assess the ability of GoSGD and PerSyn to keep consensus. We consider a worst-case scenario where the local updates are not correlated and, as such, we replace the gradient term by a random variable sampled from $\mathcal{N}(0, 1)$, independently and identically distributed on each worker. We show on Fig. 4 the following consensus error for different values of the frequency/probability of exchange p:

$$\varepsilon(t) = \sum_{m=1}^{M} \big\| x_m^{(t)} - \tilde{x}^{(t)} \big\|^2$$

As we can see, at high frequency/probability of exchange, GoSGD and PerSyn are equivalent on average. For low frequency/probability p = 0.01, the Gossip strategy has a consensus error in the range of the higher values of PerSyn, as both share the same magnitude. The main difference between the two strategies is that PerSyn exhibits the expected periodicity in its behavior, which leads to big variations of the consensus error, while GoSGD seems to have much less variation. Overall, GoSGD is able to maintain consensus properties comparable to synchronous algorithms despite its random nature.
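A sketch of how such a consensus curve can be produced: the local "gradients" are i.i.d. $\mathcal{N}(0,1)$ noise as in the worst-case setup above, the exchanges follow the GoSGD weight rules, and $\tilde{x}^{(t)}$ is taken to be the average of the local models (an assumption on the reference point).

```python
import numpy as np

def consensus_error(xs):
    """epsilon(t) = sum_m || x_m - x_bar ||^2, with x_bar the average of the local models."""
    x_bar = np.mean(xs, axis=0)
    return float(sum(np.sum((x - x_bar) ** 2) for x in xs))

rng = np.random.default_rng(9)
M, dim, eta, p, steps = 8, 10, 0.1, 0.05, 200
xs, ws = [np.zeros(dim) for _ in range(M)], [1.0 / M] * M
for _ in range(steps):
    xs = [x - eta * rng.standard_normal(dim) for x in xs]    # uncorrelated local updates
    for s in range(M):                                       # randomized gossip exchanges
        if rng.random() < p:
            r = (s + int(rng.integers(1, M))) % M            # uniform receiver != s
            ws[s] /= 2.0
            xs[r] = (ws[r] * xs[r] + ws[s] * xs[s]) / (ws[r] + ws[s])
            ws[r] += ws[s]
print(consensus_error(xs))
```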

6. Conclusion

In this paper, we discussed distributed computing strategies for training deep convolutional neural networks. We show that the classical stochastic gradient descent can benefit from such distribution since its accuracy depends on the size of the sample batch used at each iteration. We show that distributed SGD involves two key operations, namely computation and communication. We also show that there is a crucial trade-off between low communication costs and sufficient communication to ensure that all workers contribute to the same objective function.

We develop the original distributed SGD algorithm into a novel framework that allows us to design update strategies that precisely control the communication costs. From this framework, we propose two new algorithms, PerSyn and GoSGD, the latter being based on Randomized Gossip protocols.

In the experiments, we show that both PerSyn and GoSGD are able to work under communication rates as low as 0.01 message/update, which renders the communication costs almost negligible. We also show that GoSGD is faster than EASGD [9]. Furthermore, we show that the random communications of GoSGD allow for better generalization capabilities when compared to non-random communications.

Appendix A. Estimator variance

The error introduced by the Monte Carlo estimator of the gradient is inversely proportional to the batch size.

Proof. The relation between the approximation error and the size of the sample batch is given by:

$$\mathbb{E}\,\big\|\hat{\nabla}L(x) - \nabla L(x)\big\|_2^2 = \mathbb{E}\sum_{i=1}^{\dim(x)} \big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2 = \sum_{i=1}^{\dim(x)} \mathbb{E}\,\big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2$$

$\hat{\nabla}L(x)_i$ being an unbiased estimator of $\nabla L(x)_i$, we have $\mathbb{E}\big[\hat{\nabla}L(x)_i\big] = \nabla L(x)_i$ and $\mathbb{E}\,\big|\hat{\nabla}L(x)_i - \nabla L(x)_i\big|^2 = \mathrm{Var}\big(\hat{\nabla}L(x)_i\big)$. For Monte Carlo estimators we have that $\mathrm{Var}\big(\hat{\nabla}L(x)_i\big) = \mathrm{Var}\big(\frac{1}{N}\sum_{k=1}^{N} \nabla\ell(x, Y_k)_i\big) = \frac{1}{N}\mathrm{Var}\big(\nabla\ell(x, Y)_i\big)$, the $Y_k$ being i.i.d. Hence

$$\mathbb{E}\,\big\|\hat{\nabla}L(x) - \nabla L(x)\big\|_2^2 = \sum_{i=1}^{\dim(x)} \frac{1}{N}\mathrm{Var}\big(\nabla\ell(x, Y)_i\big) = \frac{1}{N}\,\mathrm{tr}\big(\mathrm{Cov}_Y[\nabla\ell(x, Y)]\big) \qquad \square$$

Appendix B. Alternate gradient

The update rules of GoSGD are in expectation equivalent to the empirical risk and the consensus loss parts of the gradient of the consensus-augmented distributed problem.

For the empirical risk part, since at each time step $t$ one of the workers performs a local stochastic gradient descent update with probability $1/M$, it is in expectation equivalent to performing the gradient descent on the empirical risk part with a learning rate divided by $M$.

For the consensus part, let us recall that the update for a receiving worker is:

$$x_r^{(t+1)} = \frac{w_r}{w_s + w_r}\, x_r^{(t)} + \frac{w_s}{w_s + w_r}\, x_s^{(t)}$$

This update occurs with probability $\frac{p}{M(M-1)}$ for all pairs $(x_s, x_r)$ of workers. First we demonstrate the following lemma:

Lemma 1. $\mathbb{E}\left[\frac{w_r}{w_r + w_s}\right] = \frac{1}{2}$ for all possible pairs $(w_r, w_s)$.

Proof. The weight update can be written with a random communication matrix $A^{(t)}$ with the following definition:

$$A^{(t)} = \begin{cases} I, & \text{with probability } p\\ I - \tfrac{1}{2}\, e_s e_s^\top + \tfrac{1}{2}\, e_r e_s^\top, & \text{with probability } 1-p \end{cases}$$

The sequence of $A^{(t)}$ are i.i.d. and we note $\mathbb{E}\big[A^{(t)}\big] = A$:

$$A = pI + (1-p) \sum_{s,\, r \neq s} \frac{1}{M(M-1)} \left( I - \tfrac{1}{2}\, e_s e_s^\top + \tfrac{1}{2}\, e_r e_s^\top \right) = \left(1 - \frac{1-p}{2(M-1)}\right) I + \frac{1-p}{2M(M-1)}\, \mathbf{1}\mathbf{1}^\top$$

Using $A$ and denoting $w^{(t)}$ the vector concatenating all weights at time $t$, we have the following recursion:

$$\mathbb{E}\big[w^{(t+1)} \,\big|\, w^{(t)}\big] = \mathbb{E}\big[A^{(t)}\big]\, w^{(t)} = A\, w^{(t)}$$

Since all weights are initialized to 1, unrolling the recursion leads to:

$$\mathbb{E}\big[w^{(t+1)}\big] = A^{t}\, \mathbf{1}$$

Since $\mathbf{1}$ is a right eigenvector of $A$ (associated with eigenvalue $\lambda$), we have:

$$\mathbb{E}\big[w^{(t+1)}\big] = \lambda^{t}\, \mathbf{1}$$

Which means that all weights are equal in expectation. $\square$
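A quick numerical sanity check of the lemma, added for illustration: simulating the weight dynamics many times (with the convention that an awakened worker gossips with probability $p$), the empirical mean of every weight stays at its common initial value, and the total weight is conserved exactly in every run.

```python
import numpy as np

rng = np.random.default_rng(10)
M, p, steps, runs = 5, 0.3, 50, 20000
acc = np.zeros(M)
for _ in range(runs):
    w = np.ones(M)                                 # common initial weight
    for _ in range(steps):
        s = int(rng.integers(M))                   # awakened worker
        if rng.random() < p:                       # it gossips with probability p
            r = (s + int(rng.integers(1, M))) % M  # uniform receiver != s
            w[s] /= 2.0                            # Eq. (9)
            w[r] += w[s]
    acc += w
print(acc / runs)        # every coordinate stays close to 1; the sum is exactly M in each run
```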
