Laboratoire de l'Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité Mixte de Recherche CNRS-INRIA-ENS LYON n° 5668 SPI
Algorithmic Issues
for (Distributed) Heterogeneous
Computing Platforms
Extended Abstract
Vincent Boudet, Fabrice Rastello and
Yves Robert
March 1999
Research Report N° 1999-19
École Normale Supérieure de Lyon
46 Allée d'Italie, 69364 Lyon Cedex 07, France
Telephone: +33(0)4.72.72.80.37
Abstract
Future computing platforms will be distributed and heterogeneous. Such platforms range from heterogeneous networks of workstations (NOWs) to collections of NOWs and parallel servers scattered throughout the world and linked through high-speed networks. Implementing tightly-coupled algorithms on such platforms raises several challenging issues. New data distribution and load balancing strategies are required to squeeze the most out of heterogeneous platforms.
In this paper, we first summarize previous results obtained for heterogeneous NOWs, dealing with the implementation of standard numerical kernels such as finite-difference stencils or dense linear solvers. Next we target distributed collections of heterogeneous NOWs, and we discuss data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.
Keywords: meta-computing, heterogeneous networks, computational grid, distributed-memory, different-speed processors, scheduling, mapping, finite-difference stencils, numerical libraries.
Résumé
Without any doubt, the parallel machines of the future will be distributed and heterogeneous. They range from the simple heterogeneous network of workstations (NOW) to the interconnection of such networks and of parallel machines spread all over the world and linked by fast networks. In this report, we first summarize previously obtained results on linear algebra computations and finite-difference problems on a single heterogeneous NOW. Next we address the data allocation problem for linear algebra on a larger network made up of sub-networks, and so on. These results show that a substantial effort is needed in this direction before an efficient linear algebra library can eventually be set up on the worldwide network of workstations.
Keywords: "meta-computing", "computational grid", heterogeneous platform, distributed memory, different-speed processors, scheduling, distribution, computing libraries
1 Introduction
The future of computing is best described by the keywords distributed and heterogeneous. At the low end of the field of distributed and heterogeneous computing, heterogeneous networks of workstations or PCs are ubiquitous in university departments and companies, and they represent the typical poor man's parallel computer: running a large PVM or MPI experiment (possibly all night long) is a cheap alternative to buying supercomputer hours. The idea is to make use of all available resources, namely slower machines in addition to more recent ones.
At the high end of the field, linking the most powerful supercomputers of the largest supercomputing centers through dedicated high-speed networks will give rise to the most powerful computational science and engineering problem-solving environment ever assembled: the so-called computational grid [13]. Providing desktop access to this "grid" will make computing routinely parallel, distributed, collaborative and immersive.
In the middle of the field, we can think of connecting medium-size parallel servers through fast but non-dedicated links. For instance, each institution participating in a meta-computing project could build its own specialized parallel machine equipped with application-specific databases and application-oriented software, thus creating a "meta-system". The user is then able to access all the machines of this meta-system remotely and transparently, without each institution duplicating the resources and the exploitation costs.
Whereas the architectural vision is clear, the software developments are not so well understood. Even at the low end of the field, the programmer is faced with several challenges. The major limitation to programming heterogeneous platforms arises from the additional difficulty of balancing the load when using processors running at different speeds. Distributing the computations (together with the associated data) can be performed either dynamically or statically, or by a mixture of both. Some simple schedulers are available, but they use naive mapping strategies such as master-slave techniques or paradigms based upon the idea "use the past to predict the future", i.e. use the currently observed speed of computation of each machine to decide the next distribution of work [10, 9, 2]. Furthermore, data dependences may well slow the whole computing process down to the pace of the slowest processor, as examples taken from standard linear algebra kernels demonstrate (see below). In fact, extensions of parallel libraries such as ScaLAPACK are not yet available. A major algorithmic effort must be undertaken to tackle heterogeneous computing resources. Block-cyclic distribution is no longer enough: there is a challenge in determining a trade-off between the data distribution parameters and the process spawning and possible migration policies. Redundant computations might also be necessary to use a heterogeneous cluster at its best capabilities.
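As an illustration of the "use the past to predict the future" idea mentioned above, the following minimal sketch splits the next batch of work in proportion to the speeds observed so far. The function name `next_distribution` and the largest-remainder rounding are assumptions made for this example; they are not taken from the schedulers cited in [10, 9, 2].

```python
def next_distribution(work_units, observed_speeds):
    """Split work_units among machines in proportion to their observed speeds."""
    total_speed = sum(observed_speeds)
    exact = [work_units * s / total_speed for s in observed_speeds]
    shares = [int(x) for x in exact]  # round every share down first
    leftover = work_units - sum(shares)
    # Hand the remaining units to the machines with the largest fractional parts.
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - shares[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# A machine observed to be three times faster receives three times more work.
print(next_distribution(100, [1.0, 3.0]))
```

Such a rule balances independent work units, but it does not account for data dependences, which is the limitation discussed in this paragraph.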
At the high end of the field, the first task is to logically assemble the distributed computer: given the network infrastructure, configure the distributed collection of machines to which access is given. Software of this category includes low-level communication protocols that enable distributed resources to communicate efficiently. Extensions of PVM and MPI such as Plus [17], PACX-MPI [11] or PVMPI [12] are needed. Once this software layer is built, the user must be provided with meta-computing tools and libraries, i.e. software that is able to split the computation into tasks that will be dynamically allocated to the different resources available. Current strategies to allocate tasks to resources are very simple (similar to those previously discussed). We discuss more elaborate strategies in this paper.
The ultimate goal would be to use the computing resources remotely and transparently, just as we do with electricity: without knowing where it comes from. Before reaching this ambitious goal, there are several layers of software to be provided. Much effort in the area of building and operating meta-systems is targeted to infrastructure, services and applications. Not so many [...] the major conceptual challenge to be tackled.
In this paper, we first target heterogeneous NOWs, and we report both theoretical results and PVM experiments on the implementation of standard numerical kernels such as finite-difference stencils or dense linear solvers. It turns out that refined static allocation strategies lead to good results for such kernels. Owing to these strategies, quite satisfactory results can be obtained with little software effort on a single heterogeneous NOW. Next we target distributed collections of heterogeneous NOWs, and we discuss both static and dynamic data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.
2 Three case studies on a heterogeneous NOW
In this section we demonstrate that static allocation strategies are the key to efficiently implementing tightly-coupled algorithms on a dedicated heterogeneous NOW.
Dynamic strategies are likely to prove useful in the following two situations:
• when targeting non-dedicated workstations with unpredictable workload and possible failures;
• when programming a large application made up of several loosely-coupled tasks.
However, when implementing a tightly-coupled algorithm (such as a linear system solver), carefully tuned scheduling and mapping strategies are required. At first sight, we may think that dynamic strategies are likely to perform better, because the machine loads will be self-regulated, hence self-balanced, if processors pick up new tasks just as they terminate their current computation. However, data dependences, communication costs and control overhead may well slow the whole process down to the pace of the slowest processors. On the other hand, static strategies will suppress (or at least minimize) data redistributions and control overhead during execution.
To be successful, static strategies must obey a more refined model than standard block-cyclic distributions: such distributions are well-suited to processors of equal speed but would lead to a great load imbalance with processors of different speeds. To show that static strategies achieving a good load balance on a heterogeneous NOW can be designed for a variety of problems, we briefly sketch three case studies hereafter.
Tiling. We start with a simple example where dependences prevent dynamic strategies from reaching a good efficiency: consider a tiled computation over a rectangular iteration space as represented in Figure 1 (see [14] and [7] for further information on tiling).
There are p available processors, numbered from 1 to p, which are assigned columns of tiles. When targeting a homogeneous NOW, a natural way to allocate tile columns to physical processors is to use a pure cyclic allocation [15, 14, 1]. For heterogeneous NOWs, we use a refined periodic allocation (see [6]) which proves efficient both theoretically and experimentally: we have run several MPI experiments. We point out that purely cyclic allocations do not perform well, although they would be the outcome of a greedy master-slave strategy. Indeed, processors will be allocated the first p columns in any order. Re-number processors according to this initial assignment. Then, throughout the computation, $P_j$ will return after $P_{j-1}$ and just before $P_{j+1}$ (take indices modulo p), because of the dependences. Hence computations would only progress at the speed of the slowest processor.
Figure 1: A tiled iteration space with horizontal and vertical dependences. (Axes i and j; extents $N_1$ and $N_2$; one tile is labeled $T_{2,3}$.)
In a word, our solution consists in allocating panels of B tile columns to the p processors in a periodic fashion. Inside each panel, processors receive a number of columns proportional to their speed (i.e. inversely proportional to their cycle-time). Within each panel, the work is well-balanced, and dependences do not slow down the execution. See [6] for details.
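To make the gap between the two allocations concrete, the sketch below compares the steady-state time per tile column of a pure cyclic allocation with the best rate that a speed-proportional allocation can approach. It is a back-of-the-envelope model under stated assumptions: communication and start-up costs are ignored, a "cycle-time" is the time a processor needs for one column, and the function names are hypothetical.

```python
from fractions import Fraction


def cyclic_time_per_column(cycle_times):
    """Pure cyclic allocation: each round of p columns lasts as long as
    the slowest processor needs for its single column."""
    p = len(cycle_times)
    return Fraction(max(cycle_times), p)


def balanced_time_per_column(cycle_times):
    """Ideal rate when every processor is kept busy: the aggregate speed
    is the sum of the individual speeds 1/t."""
    aggregate_speed = sum(Fraction(1, t) for t in cycle_times)
    return 1 / aggregate_speed


cycle_times = [1, 2, 4]  # hypothetical processors, in time units per column
print(cyclic_time_per_column(cycle_times))    # 4/3
print(balanced_time_per_column(cycle_times))  # 4/7
```

With these three processors the cyclic allocation is more than twice as slow as the balanced bound, which is the effect described above: the computation advances at the pace of the slowest processor.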
Finite-difference stencil computations. In this example we study a tiled version of the Fermi-Pasta-Ulam one-dimensional relaxation problem [18].
Figure 2: The Fermi-Pasta-Ulam model of a one-dimensional crystal. (Molecules at positions $x_1, x_2, \ldots, x_l$ are linked by springs; each molecule obeys $m\vec{a} = \sum \vec{F}$, where the force on molecule $i$ combines terms in $(x_{i+1} - x_i)$ and $(x_{i-1} - x_i)$ with coefficients $k$ and $K$.)
Figure 2 illustrates this model. Let $x_i^t$ denote the coordinate of the $i$-th molecule at time $t$. According to the model, we have
$$x_i^t = 2x_i^{t-1} - x_i^{t-2} + \frac{(dt)^2}{m} f(x_{i-1}^{t-1}, x_i^{t-1}, x_{i+1}^{t-1}) = g(x_{i-1}^{t-1}, x_i^{t-1}, x_{i+1}^{t-1}, x_i^{t-2})$$
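The recurrence above is a three-point stencil in space that also reaches two steps back in time. A minimal sketch of one time step follows, using the standard central-difference form (new position = twice the previous one, minus the one before, plus the force term). The force function `linear_force` (purely linear springs) and the fixed end molecules are assumptions chosen to keep the example short; the actual force law of the model is the one shown in Figure 2.

```python
def step(prev, prev2, f, dt, m):
    """Compute the positions at time t from those at t-1 (prev) and t-2 (prev2).

    f(left, mid, right) returns the force on a molecule given its own
    position and those of its two neighbours at time t-1.
    The two end molecules are held fixed (an assumption of this sketch).
    """
    new = list(prev)
    for i in range(1, len(prev) - 1):
        force = f(prev[i - 1], prev[i], prev[i + 1])
        new[i] = 2 * prev[i] - prev2[i] + (dt ** 2 / m) * force
    return new


def linear_force(left, mid, right, k=1.0):
    """Hypothetical force law: two linear springs of stiffness k."""
    return k * (right - mid) + k * (left - mid)


# An evenly spaced chain at rest feels no net force and does not move.
rest = [0, 1, 2, 3]
print(step(rest, rest, linear_force, dt=0.1, m=1.0))
```

Each new value depends only on values at times t-1 and t-2 at positions i-1, i and i+1, which is what makes the iteration space amenable to tiling.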
After some compiler transformations, we end up with a tiled iteration space whose shape is a parallelogram and whose dependence vectors are the pair $(0,1)^t$ and $(1,0)^t$. This turns out to be a very general situation when solving finite-difference problems.
Our solution is similar to that for the rectangular-shaped problem: columns of tiles are distributed to the processors. More precisely, the panel size B is chosen such that the load is best balanced and that no idle time is created by too large blocks of columns. However, because the domain has been skewed, dependences further constrain the problem, and there is a technical condition to enforce so that no processor is kept idle. See [5] for more details.
Figure 3: The task graph of LU and QR decompositions. (F denotes a factor task and U an update task.)
Dense linear system solvers. The last kernels that we deal with are the (blocked) LU and QR decompositions taken from the ScaLAPACK library [8]. We report both theoretical results and PVM experiments in a companion paper [4]. Static strategies are much more efficient than dynamic ones because they dramatically reduce the amount of data redistribution during the execution.
First we find an optimal allocation of the column blocks, i.e. such that each phase of updates (a row of tasks labeled U in Figure 3) is best balanced. In this solution the only communications are the broadcasts of the pivot block at each phase (after each task labeled F in Figure 3). Then we address the more complicated problem where the pivot block factorization is taken into account for load balancing: we propose an algorithm such that each pivot factorization is executed by the fastest processor. This induces a few more communications, but the algorithm is perfectly load-balanced.
3 Semi-static strategies for collections of heterogeneous NOWs
Here we target distributed collections of heterogeneous NOWs, and we discuss data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.
The machine and network models are the following: we hierarchically define a (d+1)-deep grid as a homogeneous network of heterogeneous d-deep grids. Of course a 1-deep grid simply is a heterogeneous NOW. Then a 2-deep grid is a collection of heterogeneous NOWs, where the inter-NOW communication links are assumed to have the same speed, typically one order of magnitude slower than the intra-NOW communication links. For instance two local networks in Europe and in the US may be connected by a slower (non-dedicated) link.
We first address the problem of finding an optimal allocation for the LU and QR factorizations on a 2-deep grid. To this purpose, we assume that in each NOW, a processor is dedicated to handling the communications between NOWs, as shown in Figure 4.
Figure 4: Modeling a 2-deep grid. (Three clusters A, B and C are shown.)
Because of the characteristics of the 2-deep grid, we have to increase the granularity of the computations. The basic chunk of data that is allocated to a given NOW is a panel of B blocks of r columns, where r is chosen to ensure Level 3 BLAS performance [3] and B is a machine-dependent parameter. The basic idea is to overlap inter-NOW communications (typically the broadcast of a panel) with independent computations. Updating a panel requires $nB^2r^2\tau_a$ units of time, where $\tau_a$ is the elemental computation time. Communicating a panel between NOWs requires $nBr\tau_c$ units of time, where $\tau_c$ is the inter-NOW communication rate. Of course $\tau_c$ is several orders of magnitude greater than $\tau_a$, but letting B be large enough (in fact $B \geq \frac{\tau_c}{r\tau_a}$) will indeed permit the desired communication-computation overlap. Note that such an overlap cannot usually be achieved within a single NOW.
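The overlap condition can be checked numerically. In the sketch below, `tau_a` stands for the elemental computation time and `tau_c` for the inter-NOW communication time per matrix element; these Python names, the function names and the numerical values are assumptions made for illustration, not measurements. The condition "update time at least equal to communication time" reduces to B being at least tau_c / (r * tau_a).

```python
import math


def update_time(n, B, r, tau_a):
    """Time to update a panel of B blocks of r columns (n rows)."""
    return n * B ** 2 * r ** 2 * tau_a


def comm_time(n, B, r, tau_c):
    """Time to send a panel of B blocks of r columns between two NOWs."""
    return n * B * r * tau_c


def min_panel_size(r, tau_a, tau_c):
    """Smallest B for which updating a panel hides its communication."""
    return math.ceil(tau_c / (r * tau_a))


# Hypothetical values: communication 1000 times slower than computation.
n, r, tau_a, tau_c = 1000, 32, 1, 1000
B = min_panel_size(r, tau_a, tau_c)
print(B, update_time(n, B, r, tau_a) >= comm_time(n, B, r, tau_c))
```

With these values the smallest admissible panel size is 32 blocks; with 31 blocks the communication would take longer than the update and could not be fully overlapped.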
We report in the Appendix several strategies to implement LU and QR factorizations on a 2-deep grid. These strategies are prospective: they are intended to balance computations while overlapping communications. Actual experiments are needed to validate our implementation skeletons. However, we can already draw the conclusion that a major software effort is needed. Some basic tools to write the factorization routines, such as the BLAS3 operations, are still available. Some other tools such as the BLACS subroutines need to be extended to cope with several NOWs, using packages such as those in [17, 11, 12]. What seems unavoidable is a change in the philosophy: we would access pointers to local arrays rather than addressing a shared-memory global matrix as in the current ScaLAPACK distribution. Such a major change is a sine qua non to tackle the implementation of ScaLAPACK on d-deep grids, where $d \geq 2$: it does not seem reasonable to emulate a global addressing on a collection of heterogeneous NOWs or parallel servers that are scattered all around the world.
4 Conclusion
The main objectives of this paper have been:
• to survey existing algorithm design methods for heterogeneous platforms, mainly at the low end of the field, i.e. targeting a single heterogeneous NOW;
• to demonstrate that refined static allocation strategies are very efficient for a variety of standard computational kernels;
• [...] applications on meta-systems (collections of NOWs).
We believe that squeezing the most out of meta-computing systems will require solving challenging algorithmic problems. We insist that the community should tackle these problems very rapidly to make full use of the many hardware resources that are already at its disposal.
References
[1] Rumen Andonov and Sanjay Rajopadhye. Optimal orthogonal tiling of two-dimensional iterations. Journal of Parallel and Distributed Computing, 45(2):159-165, 1997.
[2] F. Berman. High-performance schedulers. In I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, pages 279-309. Morgan-Kaufmann, 1998.
[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.
[4] V. Boudet, F. Rastello, and Y. Robert. A proposal for an heterogeneous cluster ScaLAPACK (dense linear solvers). Technical Report RR-99-17, LIP, ENS Lyon, 1999. Available at www.ens-lyon.fr/LIP/lip/publis/publis.us.html. Submitted to the PDPTA'99 Conference.
[5] P. Boulet, J. Dongarra, F. Rastello, Y. Robert, and F. Vivien. Algorithmic issues for heterogeneous computing platforms. Technical Report RR-98-49, LIP, ENS Lyon, 1998. Available at www.ens-lyon.fr/LIP/lip/publis/publis.us.html.
[6] P. Boulet, J. Dongarra, Yves Robert, and Frédéric Vivien. Tiling for heterogeneous computing platforms. Technical Report UT-CS-97-373, University of Tennessee, Knoxville, 1997. To appear in the journal Parallel Computing.
[7] P. Y. Calland, J. Dongarra, and Y. Robert. Tiling with limited resources. In L. Thiele, J. Fortes, K. Vissers, V. Taylor, T. Noll, and J. Teich, editors, Application Specific Systems, Architectures, and Processors, ASAP'97, pages 229-238. IEEE Computer Society Press, 1997. Extended version available on the WEB at http://www.ens-lyon.fr/~yrobert.
[8] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications, 97:1-15, 1996. (Also LAPACK Working Note #95.)
[9] Michal Cierniak, Mohammed J. Zaki, and Wei Li. Customized dynamic load balancing for a network of workstations. Journal of Parallel and Distributed Computing, 43:156-162, 1997.
[10] Michal Cierniak, Mohammed J. Zaki, and Wei Li. Scheduling algorithms for heterogeneous network of workstations. The Computer Journal, 40(6):356-372, 1997.
[11] Th. Eickermann, J. Henrichs, M. Resch, R. Stoy, and R. Volpel. Metacomputing in gigabit environments: networks, tools and applications. Parallel Computing, 24:1847-1872, 1998.
[12] [...] management under PVMPI. In M. Bubak, J. Dongarra, and J. Wasniewski, editors, Recent Advances in PVM and MPI, volume 1332 of Lecture Notes in Computer Science, pages 91-98. Springer Verlag, 1997.
[13] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1998.
[14] K. Hogstedt, L. Carter, and J. Ferrante. Determining the idle time of a tiling. In Principles of Programming Languages, pages 160-173. ACM Press, 1997. Extended version available as Technical Report UCSD-CS96-489, and on the WEB at http://www.cse.ucsd.edu/~carter.
[15] H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjustment in compiling general DOACROSS loop nests. In 1995 International Conference on Supercomputing, pages 270-279. ACM Press, 1995.
[16] J. M. Ortega and C. H. Romine. The ijk forms of factorization methods II. Parallel systems. Parallel Computing, 7:149-162, 1988.
[17] A. Reinefeld, J. Gehring, and M. Brune. Communicating across parallel message-passing environments. Journal of Systems Architecture, 44:261-272, 1998.
[18] M. Remoissenet. Waves Called Solitons. Springer Verlag, 1994.
Appendix
We discuss in this section several strategies to implement the LU and QR factorizations on a collection of heterogeneous NOWs. Section 4 of the companion research report RR-1999-17 [4] deals with the same problem on a single heterogeneous NOW. Reading Section 4 of [4] is a prerequisite to this Appendix.
We target a 2-deep grid, as described in Section 3 and in Figure 4. There is a (slow) serial link between any pair of NOWs. To communicate with a processor of cluster B (in Figure 4), a processor of cluster A has to send its message to the dedicated processor of its own cluster. Then the message will be forwarded to the dedicated processor of cluster B; this communication can take place in parallel with independent computations within the clusters A and B. Finally, the receiver in cluster B receives its data from the dedicated processor.
Static strategy
We decompose our matrix into panels of size B: a panel is a slice of B column blocks. The size of the panels is the same for all the clusters. The value for B is discussed below.
Roughly speaking, the number of panels allocated to each cluster is proportional to its speed, i.e. inversely proportional to its cycle-time: we compute the time needed to update a panel of B column blocks for each cluster. These times are the "cycle-times" of the clusters. For example, consider a cluster A with 3 machines whose cycle-times are 2, 3 and 4, and a cluster B with 3 machines whose cycle-times are 3, 5 and 8. Suppose that the size of the panels is B = 5. The optimal allocation for cluster A is 24232 (meaning that the processor of cycle-time 2 receives the first, third and fifth blocks, starting the numbering from the right, and so on). The optimal allocation for B is 38353. In both clusters the fastest processor handles three blocks and is the last to finish (3 × 2 = 6 and 3 × 3 = 9). Hence the "cycle-time" for cluster A is 6 and the "cycle-time" for cluster B is 9.
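The allocation in this example can be reproduced with a simple incremental rule: give each successive block to the processor that would finish its share soonest if it received one more block. This rule, the tie-breaking toward the faster processor and the function name `incremental_allocation` are assumptions of the sketch below; the actual optimal algorithm is the one described in [4].

```python
def incremental_allocation(cycle_times, num_blocks):
    """Assign blocks one at a time; return the owners (by cycle-time,
    in assignment order) and the time needed to process all blocks."""
    counts = [0] * len(cycle_times)
    order = []
    for _ in range(num_blocks):
        # Finishing time if processor i received one more block;
        # ties go to the faster processor (assumption of this sketch).
        best = min(range(len(cycle_times)),
                   key=lambda i: ((counts[i] + 1) * cycle_times[i],
                                  cycle_times[i]))
        counts[best] += 1
        order.append(cycle_times[best])
    makespan = max(c * t for c, t in zip(counts, cycle_times))
    return order, makespan


order_a, time_a = incremental_allocation([2, 3, 4], 5)
order_b, time_b = incremental_allocation([3, 5, 8], 5)
# The text numbers blocks from the right, so reverse before printing.
print("".join(str(t) for t in reversed(order_a)), time_a)  # 24232 6
print(time_b)                                               # 9
```

Under these assumptions the sketch recovers the allocation 24232 for cluster A and the panel times 6 and 9 used as cluster "cycle-times".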
[...] would use the optimal allocation of [4] with two machines of cycle-times 6 and 9, leading to a periodic allocation of panels as ...|BAABA|BAABA|BAABA. In fact we can do slightly better and continue the optimized distribution inside each cluster from one panel to the next one. In the example, the allocation of the first five panels is BAABA (as stated above, starting the numbering from the right), and inside the clusters we have the distribution
$$B_8 B_5 B_3 B_3 B_5 \mid A_3 A_2 A_4 A_2 A_3 \mid A_2 A_3 A_2 A_4 A_3 \mid B_3 B_8 B_3 B_5 B_3 \mid A_2 A_4 A_2 A_3 A_2.$$
Finally, we can further improve this solution by re-evaluating the "cycle-times" of the clusters. Indeed, in our example, the time needed to compute the second panel of cluster B is 10 and not 9. So to speak, to refine the allocation of the panels we may take the "cycle-times" of the clusters into account on the fly.
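The remark about re-evaluating cycle-times can be checked with a short computation: continue the block-by-block distribution inside a cluster across consecutive panels and record how long each panel takes. The greedy rule and its tie-breaking are the same assumptions as before (they stand in for the optimal allocation of [4]), and `panel_times` is a hypothetical helper name.

```python
def panel_times(cycle_times, blocks_per_panel, num_panels):
    """Time needed by one cluster for each of its successive panels when
    the distribution of blocks is continued from one panel to the next."""
    counts = [0] * len(cycle_times)  # blocks received since the start
    times = []
    for _panel in range(num_panels):
        in_panel = [0] * len(cycle_times)  # blocks received in this panel
        for _block in range(blocks_per_panel):
            best = min(range(len(cycle_times)),
                       key=lambda i: ((counts[i] + 1) * cycle_times[i],
                                      cycle_times[i]))
            counts[best] += 1
            in_panel[best] += 1
        times.append(max(c * t for c, t in zip(in_panel, cycle_times)))
    return times


# Cluster B of the example: machines of cycle-times 3, 5 and 8, panels of 5 blocks.
print(panel_times([3, 5, 8], 5, 2))  # [9, 10]
```

The second panel of cluster B takes 10 time units because the machine of cycle-time 5 receives two of its blocks, which is why the cluster "cycle-time" is worth updating from panel to panel.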
As explained in Section 3, the size of the panels is chosen so that communicating a panel to another cluster takes less time than updating a given panel. The scheduling corresponds to the look-ahead strategy for the pointwise algorithm [16]. It is illustrated in Figure 5. Each task in Figure 5 represents a panel (and not a single column block as in Figure 3). After the factor task at step 2 is completed, all processors of cluster B gather the current panel on the dedicated processor. The broadcast of the panel can take place while cluster A updates its third panel and cluster B updates its second panel at step 3.
Figure 5: Scheduling with 2 clusters A and B. Exponents represent the nature of the tasks (factoring or updating a panel). Indices represent the steps at which the tasks are processed.
Dynamic strategy
As above, we decompose the matrix into panels of size B and we compute the "cycle-times" of the different clusters. We still distribute the panels to the different clusters as explained, using an optimal allocation inside the clusters. However, we decide that the factor task and the preceding update task are always executed by the fastest cluster. In the initial distribution of panels, we suppress the first two occurrences of the fastest cluster to take into account the factor and update tasks.
For example, if we have 3 clusters of cycle-times 3, 5 and 8, the allocation will be 33|85335385. The two panels on the left correspond to the factor task and its corresponding update. As before, the scheduling strategy is "update, factor and broadcast ASAP". The fastest cluster will indeed compute the next pivot as soon as possible, and broadcast it to the other clusters, before computing its last updates. There are additional communications due to the fact that all the pivot panels must be processed by the fastest cluster, as illustrated in Figure 6, which corresponds to the example above with 3 clusters. The short communications labeled "Ga" are local gathers within a cluster (the dedicated communication processor gathers the panel) whereas the long communications stand [...] the fastest cluster, or the broadcast of the pivot panel after factoring. The cost of an inter-cluster communication remains smaller than the time needed for an update because of the choice of B.
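The allocation string of this example can be rebuilt mechanically. The sketch below generates a greedy sequence of ten panel owners for clusters of cycle-times 3, 5 and 8, removes the first two occurrences of the fastest cluster, and reserves them for the factor task and its update. The greedy rule with ties going to the faster cluster is an assumption standing in for the optimal allocation of [4], the helper names are hypothetical, and cycle-times are assumed distinct and single-digit so that they can serve as labels.

```python
def greedy_sequence(cycle_times, count):
    """Owners (by cycle-time) of `count` panels, assigned one at a time to
    the cluster that would finish soonest with one more panel."""
    counts = [0] * len(cycle_times)
    seq = []
    for _ in range(count):
        best = min(range(len(cycle_times)),
                   key=lambda i: ((counts[i] + 1) * cycle_times[i],
                                  cycle_times[i]))
        counts[best] += 1
        seq.append(cycle_times[best])
    return seq


def dynamic_allocation(cycle_times, num_panels):
    """Allocation string: two panels reserved for the fastest cluster,
    then the remaining panels, written right-to-left as in the text."""
    fastest = min(cycle_times)
    remaining, skipped = [], 0
    for owner in greedy_sequence(cycle_times, num_panels):
        if owner == fastest and skipped < 2:
            skipped += 1  # suppress the first two occurrences
        else:
            remaining.append(owner)
    tail = "".join(str(owner) for owner in reversed(remaining))
    return f"{fastest}{fastest}|{tail}"


print(dynamic_allocation([3, 5, 8], 10))  # 33|85335385
```

Under these assumptions the output coincides with the allocation 33|85335385 given in the example.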
Figure 6: Scheduling the tasks and the communications with the dynamic strategy. (Labels in the figure: clusters 3, 5 and 8; tasks F and U; communications Ga, Send and Br.)