Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon. Unité Mixte de Recherche CNRS-INRIA-ENS LYON n° 5668, SPI


Algorithmic Issues for (Distributed) Heterogeneous Computing Platforms

Extended Abstract

Vincent Boudet, Fabrice Rastello and Yves Robert

March 1999

Research Report No 1999-19

École Normale Supérieure de Lyon

46 Allée d'Italie, 69364 Lyon Cedex 07, France

Telephone: +33(0)4.72.72.80.37


Abstract

Future computing platforms will be distributed and heterogeneous. Such platforms range from heterogeneous networks of workstations (NOWs) to collections of NOWs and parallel servers scattered throughout the world and linked through high-speed networks. Implementing tightly-coupled algorithms on such platforms raises several challenging issues. New data distribution and load balancing strategies are required to squeeze the most out of heterogeneous platforms.

In this paper, we first summarize previous results obtained for heterogeneous NOWs, dealing with the implementation of standard numerical kernels such as finite-difference stencils or dense linear solvers.

Next we target distributed collections of heterogeneous NOWs, and we discuss data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.

Keywords: meta-computing, heterogeneous networks, computational grid, distributed-memory, different-speed processors, scheduling, mapping, finite-difference stencils, numerical libraries.

Summary

Without any doubt, the parallel machines of the future will be distributed and heterogeneous. They range from a simple heterogeneous network of workstations (NOW) to the interconnection of such networks and of parallel machines spread across the whole world and linked by fast networks. In this report, we first summarize previously obtained results on linear algebra computations and finite-difference problems on a single heterogeneous NOW. Next, we address the problem of data allocation for linear algebra in the case of a larger network made up of sub-networks, and so on. These results show that a substantial effort in this direction is needed before an efficient linear algebra library can eventually be deployed on the worldwide network of workstations.

Keywords: "meta-computing", "computational grid", heterogeneous platform, distributed memory, different-speed processors, scheduling, distribution, computation libraries

1 Introduction

The future of computing is best described by the keywords distributed and heterogeneous. At the low end of the field of distributed and heterogeneous computing, heterogeneous networks of workstations or PCs are ubiquitous in university departments and companies, and they represent the typical poor man's parallel computer: running a large PVM or MPI experiment (possibly all night long) is a cheap alternative to buying supercomputer hours. The idea is to make use of all available resources, namely slower machines in addition to more recent ones.

At the high end of the field, linking the most powerful supercomputers of the largest supercomputing centers through dedicated high-speed networks will give rise to the most powerful computational science and engineering problem-solving environment ever assembled: the so-called computational grid [13]. Providing desktop access to this "grid" will make computing routinely parallel, distributed, collaborative and immersive.

In the middle of the field, we can think of connecting medium-size parallel servers through fast but non-dedicated links. For instance, each institution participating in a meta-computing project could build its own specialized parallel machine equipped with application-specific databases and application-oriented software, thus creating a "meta-system". The user is then able to access all the machines of this meta-system remotely and transparently, without each institution duplicating the resources and the exploitation costs.

Whereas the architectural vision is clear, the software developments are not so well understood. Even at the low end of the field, the programmer is faced with several challenges. The major limitation to programming heterogeneous platforms arises from the additional difficulty of balancing the load when using processors running at different speeds. Distributing the computations (together with the associated data) can be performed either dynamically or statically, or with a mixture of both. Some simple schedulers are available, but they use naive mapping strategies such as master-slave techniques or paradigms based upon the idea "use the past to predict the future", i.e. use the currently observed speed of computation of each machine to decide on the next distribution of work [10, 9, 2]. Furthermore, data dependences may well slow the whole computing process down to the pace of the slowest processor, as examples taken from standard linear algebra kernels demonstrate (see below). In fact, extensions of parallel libraries such as ScaLAPACK are not yet available. A major algorithmic effort must be undertaken to tackle heterogeneous computing resources. Block-cyclic distribution is no longer enough: there is a challenge in determining a trade-off between the data distribution parameters and the process spawning and possible migration policies. Redundant computations might also be necessary to use a heterogeneous cluster at its best capabilities.

At the high end of the field, the first task is to logically assemble the distributed computer: given the network infrastructure, configure the distributed collection of machines to which access is given. Software of this category includes low-level communication protocols that enable distributed resources to efficiently communicate. Extensions of PVM and MPI such as Plus [17], PACX-MPI [11] or PVMPI [12] are needed. Once this software layer is built, the user must be provided with meta-computing tools and libraries, i.e. software that is able to split the computation into tasks that will be dynamically allocated to the different resources available. Current strategies to allocate tasks to resources are very simple (similar to those previously discussed). We discuss more elaborate strategies in this paper.

The ultimate goal would be to use the computing resources remotely and transparently, just as we do with electricity: without knowing where it comes from. Before reaching this ambitious goal, there are several layers of software to be provided. Lots of efforts in the area of building and operating meta-systems are targeted to infrastructure, services and applications. Not so many


the major conceptual challenge to be tackled.

In this paper, we first target heterogeneous NOWs, and we report both theoretical results and PVM experiments on the implementation of standard numerical kernels such as finite-difference stencils or dense linear solvers. It turns out that refined static allocation strategies lead to good results for such kernels. Owing to these strategies, quite satisfactory results can be obtained with little software effort on a single heterogeneous NOW. Next we target distributed collections of heterogeneous NOWs, and we discuss both static and dynamic data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.

2 Three case studies on a heterogeneous NOW

In this section we demonstrate that static allocation strategies are the key to efficiently implementing tightly-coupled algorithms on a dedicated heterogeneous NOW.

Dynamic strategies are likely to prove useful in the following two situations:

- when targeting non-dedicated workstations with unpredictable workload and possible failures;

- when programming a large application made up of several loosely-coupled tasks.

However, when implementing a tightly-coupled algorithm (such as a linear system solver), carefully tuned scheduling and mapping strategies are required. At first sight, we may think that dynamic strategies are likely to perform better, because the machine loads will be self-regulated, hence self-balanced, if processors pick up new tasks just as they terminate their current computation. However, data dependences, communication costs and control overhead may well slow the whole process down to the pace of the slowest processors. On the other hand, static strategies will suppress (or at least minimize) data redistributions and control overhead during execution.

To be successful, static strategies must obey a more refined model than standard block-cyclic distributions: such distributions are well-suited to processors of equal speed but would lead to a great load imbalance with processors of different speeds. To show that static strategies achieving a good load balance on a heterogeneous NOW can be designed for a variety of problems, we briefly sketch three case studies hereafter.

Tiling. We start with a simple example where dependences prevent dynamic strategies from reaching a good efficiency: consider a tiled computation over a rectangular iteration space as represented in Figure 1 (see [14] and [7] for further information on tiling).

There are p available processors, numbered from 1 to p, which are assigned columns of tiles. When targeting a homogeneous NOW, a natural way to allocate tile columns to physical processors is to use a pure cyclic allocation [15, 14, 1]. For heterogeneous NOWs, we use a refined periodic allocation (see [6]) which proves efficient both theoretically and experimentally: we have run several MPI experiments. We point out that purely cyclic allocations do not perform well, while they would be the outcome of a greedy master-slave strategy. Indeed, processors will be allocated the first p columns in any order. Re-number processors according to this initial assignment. Then throughout the computation, $P_j$ will return after $P_{j-1}$ and just before $P_{j+1}$ (take indices modulo p), because of the dependences. Hence computations would only progress at the speed of the slowest processor.

[Figure 1: grid of tiles with axes i and j, extent N, and one tile labeled $T_{2,3}$.]

Figure 1: A tiled iteration space with horizontal and vertical dependences.

In a word, our solution consists in allocating panels of B tile columns to the p processors in a periodic fashion. Inside each panel, processors receive a number of columns inversely proportional to their cycle-time (that is, proportional to their speed). Within each panel, the work is well-balanced, and dependences do not slow down the execution. See [6] for details.
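As a rough illustration of this periodic allocation, the sketch below hands out the B columns of one panel greedily, always giving the next column to the processor that would finish it soonest, and compares the result with a pure cyclic allocation. The greedy rule, the function names and the cycle-times 1, 2 and 4 are assumptions made for this example and are not taken from [6]; the estimate only counts computation time and ignores dependence stalls and communications.

```python
def panel_counts(cycle_times, B):
    """Greedily hand out B tile columns; return how many each processor gets."""
    counts = [0] * len(cycle_times)
    for _ in range(B):
        # Give the next column to the processor that would finish it soonest
        # (ties go to the lowest-numbered processor).
        i = min(range(len(cycle_times)),
                key=lambda k: (counts[k] + 1) * cycle_times[k])
        counts[i] += 1
    return counts


def time_per_column(cycle_times, counts):
    """Steady-state time per column for one period, ignoring dependence stalls."""
    busiest = max(c * t for c, t in zip(counts, cycle_times))
    return busiest / sum(counts)


cycle_times = [1, 2, 4]      # hypothetical cycle-times (time to process one column)
periodic = panel_counts(cycle_times, B=7)
cyclic = [1, 1, 1]           # pure cyclic allocation: one column each per round

print(periodic, time_per_column(cycle_times, periodic))
print(cyclic, time_per_column(cycle_times, cyclic))
```

With these hypothetical values, a panel of 7 columns is split 4/2/1 and every processor is busy for 4 time units, whereas a cyclic round also lasts 4 time units but only covers 3 columns.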

Finite-difference stencil computations. In this example we study a tiled version of the Fermi-Pasta-Ulam one-dimensional relaxation problem [18].

[Figure 2 labels: Spring, Molecule, positions $x_1, x_2, \ldots, x_l$. Equations shown in the figure: $m\vec{a} = \sum \vec{F}$ and $m\ddot{x}_i = k(x_{i+1} - x_i) + K(x_{i+1} - x_i) + k(x_{i-1} - x_i) + K(x_{i-1} - x_i)$.]

Figure 2: The Fermi-Pasta-Ulam model of a one-dimensional crystal.

Figure 2 illustrates this model. Let $x_i^t$ denote the coordinate of the $i$-th molecule at time $t$. According to the model, we have

$$x_i^t = 2x_i^{t-1} - x_i^{t-2} + \frac{(dt)^2}{m} f(x_{i-1}^{t-1}, x_i^{t-1}, x_{i+1}^{t-1}) = g(x_{i-1}^{t-1}, x_i^{t-1}, x_{i+1}^{t-1}, x_i^{t-2})$$
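To make the recurrence concrete, here is a minimal sketch of one time step, using the standard central-difference update in which each new position depends on the three neighbouring positions at time t-1 and on the molecule's own position at time t-2. The force law (linear springs plus a cubic correction), the fixed end molecules and all names are assumptions made for illustration only.

```python
def fpu_step(x_prev, x_curr, dt, m, k, K):
    """Return positions at time t from those at t-2 (x_prev) and t-1 (x_curr)."""
    def force(left, mid, right):
        # Assumed force law: linear springs (k) plus a cubic correction (K).
        return (k * (right - mid) + K * (right - mid) ** 3
                + k * (left - mid) + K * (left - mid) ** 3)

    x_next = list(x_curr)  # assumption: the two end molecules are held fixed
    for i in range(1, len(x_curr) - 1):
        f = force(x_curr[i - 1], x_curr[i], x_curr[i + 1])
        # Stencil: three neighbours at time t-1, plus the same molecule at t-2.
        x_next[i] = 2 * x_curr[i] - x_prev[i] + dt ** 2 / m * f
    return x_next


# A chain at rest (equally spaced molecules) stays at rest.
rest = [0.0, 1.0, 2.0, 3.0]
print(fpu_step(rest, rest, dt=0.1, m=1.0, k=1.0, K=0.5))
```

Every interior update reads only values from the two previous time steps, which is the dependence pattern that the tiling has to respect.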

After some compiler transformations, we end up with a tiled iteration space whose shape is a parallelogram and whose dependence vectors are the pair $(0,1)^t$ and $(1,0)^t$. This turns out to be a very general situation when solving finite-difference problems.

Our solution is similar to that for the rectangular-shaped problem: columns of tiles are distributed to the processors. More precisely, the panel size B is chosen such that the load is best balanced and that no idle time is created by too large blocks of columns. However, because the domain has been skewed, dependences further constrain the problem, and there is a technical condition to enforce so that no processor is kept idle. See [5] for more details.

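[Figure 3: each step consists of a factor task F followed by a row of update tasks U; the rows shown contain three, two and one update tasks.]

Figure 3: The task graph of LU and QR decompositions.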

Dense linear system solvers. The last kernels that we deal with are the (blocked) LU and QR decompositions taken from the ScaLAPACK library [8]. We report both theoretical results and PVM experiments in a companion paper [4]. Static strategies are much more efficient than dynamic ones because they dramatically reduce the amount of data redistributions during the execution.

First we find an optimal allocation of the column blocks, i.e. such that each phase of updates (a row of tasks labeled U in Figure 3) is best balanced. In this solution the only communications are the broadcasts of the pivot block at each phase (after each task labeled F in Figure 3). Then we address the more complicated problem where the pivot block factorization is taken into account for load balancing the block: we propose an algorithm such that each pivot factorization is executed by the fastest processor. This induces a few more communications, but the algorithm is perfectly load-balanced.

3 Semi-static strategies for collections of heterogeneous NOWs

Here we target distributed collections of heterogeneous NOWs, and we discuss data allocation strategies for dense linear solvers on top of such platforms. These results indicate that a major algorithmic and software effort is needed to come up with efficient numerical libraries on the computational grid.

The machine and network models are the following: we hierarchically define a (d+1)-deep grid as a homogeneous network of heterogeneous d-deep grids. Of course a 1-deep grid simply is a heterogeneous NOW. Then a 2-deep grid is a collection of heterogeneous NOWs, where the inter-NOW communication links are assumed to have the same speed, typically one order of magnitude slower than the intra-NOW communication links. For instance two local networks in Europe and in the US may be connected by a slower (non-dedicated) link.

We first address the problem of finding an optimal allocation for the LU and QR factorizations on a 2-deep grid. To this purpose, we assume that in each NOW, a processor is dedicated to handling the communications between NOWs, as shown in Figure 4.

[Figure 4: three clusters, labeled Cluster A, Cluster B and Cluster C.]

Figure 4: Modeling a 2-deep grid.

Because of the characteristics of the 2-deep grid, we have to increase the granularity of the computations. The basic chunk of data that is allocated to a given NOW is a panel of B blocks of r columns, where r is chosen to ensure Level 3 BLAS performance [3] and B is a machine-dependent parameter. The basic idea is to overlap inter-NOW communications (typically the broadcast of a panel) with independent computations. Updating a panel requires $n B^2 r^2 \tau_a$ units of time, where $\tau_a$ is the elemental computation time. Communicating a panel between NOWs requires $n B r \tau_c$ units of time, where $\tau_c$ is the inter-NOW communication rate. Of course $\tau_c$ is several orders of magnitude greater than $\tau_a$, but letting B be large enough (in fact $B \geq \frac{\tau_c}{r \tau_a}$) will indeed permit the desired communication-computation overlap. Note that such an overlap cannot usually be achieved within a single NOW.
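The condition on B follows from requiring that a panel update last at least as long as a panel transfer. The short sketch below works this out numerically; the parameter names `tau_a` (elemental computation time) and `tau_c` (inter-NOW communication rate) and the numerical values are labels and figures chosen here for illustration.

```python
import math


def update_time(n, B, r, tau_a):
    """Time to update one panel of B blocks of r columns."""
    return n * B ** 2 * r ** 2 * tau_a


def comm_time(n, B, r, tau_c):
    """Time to send one panel between two NOWs."""
    return n * B * r * tau_c


def min_panel_size(r, tau_a, tau_c):
    """Smallest integer B for which an update hides a panel transfer."""
    return max(1, math.ceil(tau_c / (r * tau_a)))


# Hypothetical values in arbitrary time units: communication 1000x slower.
n, r, tau_a, tau_c = 1000, 32, 1, 1000
B = min_panel_size(r, tau_a, tau_c)
print(B, update_time(n, B, r, tau_a) >= comm_time(n, B, r, tau_c))
```

With these made-up values the threshold is 1000 / 32 = 31.25, so B = 32 is the smallest panel size for which the update is long enough to cover the transfer.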

We report in the Appendix several strategies to implement LU and QR factorizations on a 2-deep grid. These strategies are prospective: they are intended to balance computations while overlapping communications. Actual experiments are needed to validate our implementation skeletons. However, we can already draw the conclusion that a major software effort is needed. Some basic tools to write the factorization routines, such as the BLAS3 operations, remain available. Some other tools such as the BLACS subroutines need to be extended to cope with several NOWs, using packages such as those in [17, 11, 12]. What seems unavoidable is a change in the philosophy: we would access pointers to local arrays rather than addressing a shared-memory global matrix as in the current ScaLAPACK distribution. Such a major change is a sine qua non to tackle the implementation of ScaLAPACK on d-deep grids, where $d \geq 2$: it does not seem reasonable to emulate a global addressing on a collection of heterogeneous NOWs or parallel servers that are scattered all around the world.

4 Conclusion

The main objectives of this paper have been:

- to survey existing algorithm design methods for heterogeneous platforms, mainly at the low end of the field, i.e. targeting a single heterogeneous NOW

- to demonstrate that refined static allocation strategies are very efficient for a variety of standard computational kernels


applications on meta-systems (collections of NOWs).

We believe that squeezing the most out of meta-computing systems will require solving challenging algorithmic problems. We insist that the community should tackle these problems very rapidly to make full use of the many hardware resources that are already at its disposal.

References

[1] Rumen Andonov and Sanjay Rajopadhye. Optimal orthogonal tiling of two-dimensional iterations. Journal of Parallel and Distributed Computing, 45(2):159-165, 1997.

[2] F. Berman. High-performance schedulers. In I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, pages 279-309. Morgan-Kaufmann, 1998.

[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[4] V. Boudet, F. Rastello, and Y. Robert. A proposal for an heterogeneous cluster ScaLAPACK (dense linear solvers). Technical Report RR-99-17, LIP, ENS Lyon, 1999. Available at www.ens-lyon.fr/LIP/lip/publis/publis.us.html. Submitted to the PDPTA'99 Conference.

[5] P. Boulet, J. Dongarra, F. Rastello, Y. Robert, and F. Vivien. Algorithmic issues for heterogeneous computing platforms. Technical Report RR-98-49, LIP, ENS Lyon, 1998. Available at www.ens-lyon.fr/LIP/lip/publis/publis.us.html.

[6] P. Boulet, J. Dongarra, Yves Robert, and Frederic Vivien. Tiling for heterogeneous computing platforms. Technical Report UT-CS-97-373, University of Tennessee, Knoxville, 1997. To appear in the journal Parallel Computing.

[7] P. Y. Calland, J. Dongarra, and Y. Robert. Tiling with limited resources. In L. Thiele, J. Fortes, K. Vissers, V. Taylor, T. Noll, and J. Teich, editors, Application Specific Systems, Architectures, and Processors, ASAP'97, pages 229-238. IEEE Computer Society Press, 1997. Extended version available on the WEB at http://www.ens-lyon.fr/yrobert.

[8] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications, 97:1-15, 1996. (also LAPACK Working Note #95).

[9] Michal Cierniak, Mohammed J. Zaki, and Wei Li. Customized dynamic load balancing for a network of workstations. Journal of Parallel and Distributed Computing, 43:156-162, 1997.

[10] Michal Cierniak, Mohammed J. Zaki, and Wei Li. Scheduling algorithms for heterogeneous network of workstations. The Computer Journal, 40(6):356-372, 1997.

[11] Th. Eickermann, J. Henrichs, M. Resch, R. Stoy, and R. Volpel. Metacomputing in gigabit environments: networks, tools and applications. Parallel Computing, 24:1847-1872, 1998.


management under PVMPI. In M. Bubak, J. Dongarra, and J. Wasniewski, editors, Recent Advances in PVM and MPI, volume 1332 of Lecture Notes in Computer Science, pages 91-98. Springer Verlag, 1997.

[13] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1998.

[14] K. Hogstedt, L. Carter, and J. Ferrante. Determining the idle time of a tiling. In Principles of Programming Languages, pages 160-173. ACM Press, 1997. Extended version available as Technical Report UCSD-CS96-489, and on the WEB at http://www.cse.ucsd.edu/carter.

[15] H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjustment in compiling general DOACROSS loop nests. In 1995 International Conference on Supercomputing, pages 270-279. ACM Press, 1995.

[16] J. M. Ortega and C. H. Romine. The ijk forms of factorization methods II. Parallel systems. Parallel Computing, 7:149-162, 1988.

[17] A. Reinefeld, J. Gehring, and M. Brune. Communicating across parallel message-passing environments. Journal of Systems Architecture, 44:261-272, 1998.

[18] M. Remoissenet. Waves Called Solitons. Springer Verlag, 1994.

Appendix

We discuss in this section several strategies to implement the LU and QR factorizations on a collection of heterogeneous NOWs. Section 4 of the companion research report RR-1999-17 [4] deals with the same problem on a single heterogeneous NOW. The reading of Section 4 of [4] is a prerequisite to this Appendix.

We target a 2-deep grid, as described in Section 3 and in Figure 4. There is a (slow) serial link between any pair of NOWs. To communicate with a processor of cluster B (in Figure 4), a processor of cluster A has to send its message to the dedicated processor of its own cluster. Then the message will be forwarded to the dedicated processor of cluster B; this communication can take place in parallel with independent computations within the clusters A and B. Finally, the receiver in cluster B receives its data from the dedicated processor.

Static strategy

We decompose our matrix into panels of size B: a panel is a slice of B column blocks. The size of the panels is the same for all the clusters. The value for B is discussed below.

Roughly speaking, the number of panels allocated to each cluster is inversely proportional to its cycle-time: we compute the time needed to update a panel of B column blocks for each cluster. These times are the "cycle-times" of the clusters. For example, consider a cluster A with 3 machines whose cycle-times are 2, 3 and 4, and a cluster B with 3 machines whose cycle-times are 3, 5 and 8. Suppose that the size of the panels is B = 5. The optimal allocation for cluster A is 24232 (meaning that the processor of cycle-time equal to 2 receives the first, third and fifth blocks, starting the numbering from the right, and so on). The optimal allocation for B is 35383. Hence the "cycle-time" for cluster A is 6 and the "cycle-time" for cluster B is 9.


would use the optimal allocation of [4] with two machines of cycle-times 6 and 9, leading to a periodic allocation of panels as ... | BAABA | BAABA | BAABA. In fact we can do slightly better and continue the optimized distribution inside each cluster from one panel to the next one. In the example, the allocation of the first five panels is BAABA (as stated above, starting the numbering from the right), and inside the clusters we have the distribution

$$B_8 B_5 B_3 B_5 \mid A_3 A_2 A_4 A_2 A_3 \mid A_2 A_3 A_2 A_4 A_3 \mid B_3 B_8 B_3 B_5 B_3 \mid A_2 A_4 A_2 A_3 A_2.$$

Finally, we can further improve this solution by re-evaluating the "cycle-times" of the clusters. Indeed, in our example, the time needed to compute the second panel of cluster B is 10 and not 9. So to speak, to refine the allocation of the panels we may take the "cycle-times" of the clusters into account on the fly.
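The numbers of this example can be reproduced with a short sketch. It assumes a greedy incremental rule (each block or panel goes to the unit that would finish it soonest, ties going to the first unit), applied first inside each cluster and then across clusters; this rule and the function name are assumptions made for illustration, and [4] should be consulted for the actual algorithm.

```python
def greedy_order(cycle_times, count):
    """Assign `count` items one by one to the unit that finishes each soonest."""
    loads = [0] * len(cycle_times)
    order = []
    for _ in range(count):
        i = min(range(len(cycle_times)), key=lambda k: loads[k] + cycle_times[k])
        loads[i] += cycle_times[i]
        order.append(i)
    return order, max(loads)


B = 5
order_a, period_a = greedy_order([2, 3, 4], B)   # machines of cluster A
_, period_b = greedy_order([3, 5, 8], B)         # machines of cluster B

# Second level: the clusters act as two machines of cycle-times 6 and 9.
panel_order, _ = greedy_order([period_a, period_b], 5)

# The text numbers blocks and panels from the right, so reverse before printing.
print("".join(str([2, 3, 4][i]) for i in reversed(order_a)))   # 24232
print(period_a, period_b)                                       # 6 9
print("".join("AB"[i] for i in reversed(panel_order)))          # BAABA
```

Under these assumptions the sketch recovers the allocation 24232 for cluster A, the cluster cycle-times 6 and 9, and the period BAABA.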

As explained in Section 3, the size of the panels is chosen so that communicating a panel to another cluster takes less time than updating a given panel. The scheduling corresponds to the look-ahead strategy for the pointwise algorithm [16]. It is illustrated in Figure 5. Each task in Figure 5 represents a panel (and not a single column block as in Figure 3). After the factor task at step 2 is completed, all processors of cluster B gather the current panel on the dedicated processor. The broadcast of the panel can take place while cluster A updates its third panel and cluster B updates its second panel at step 3.

[Figure 5: panels labeled A and B (sequence B A A B A B A A B A) carrying factor tasks F and update tasks U, indexed by steps 0 to 6.]

Figure 5: Scheduling with 2 clusters A and B. Exponents represent the nature of the tasks (factoring or updating a panel). Indices represent the steps at which the tasks are processed.

Dynamic strategy

As above, we decompose the matrix into panels of size B and we compute the "cycle-times" of the different clusters. We still distribute the panels to the different clusters as explained, using an optimal allocation inside the clusters. However, we decide that the factor task and the preceding update task are always executed by the fastest cluster. In the initial distribution of panels, we suppress the first two occurrences of the fastest cluster to take into account the factor and update tasks.

For example, if we have 3 clusters of relative cycle-times 3, 5 and 8, the allocation will be 33|85335385. The two panels on the left correspond to the factor task and its corresponding update. As before, the scheduling strategy is "update, factor and broadcast ASAP". The fastest cluster will indeed compute the next pivot as soon as possible, and broadcast it to the other clusters, before computing its last updates. There are additional communications due to the fact that all the pivot panels must be processed by the fastest cluster, as illustrated in Figure 6 which corresponds to the example above with 3 clusters. The short communications labeled "Ga" are local gathers within a cluster (the dedicated communication processor gathers the panel) whereas the long communications stand


the fastest cluster, or the broadcast of the pivot panel after factoring. The cost of an inter-cluster
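The panel allocation of the dynamic strategy can be sketched as follows: build the static order, then drop the first two occurrences of the fastest cluster, which are reserved for the factor task and its update. The greedy rule used for the static order (ties going to the first cluster), the function names, and the reading of 3, 5 and 8 as cycle-times (so that the cluster labeled 3 is the fastest) are assumptions made for illustration.

```python
def static_order(cycle_times, count):
    """Greedy static order: each panel goes to the cluster finishing it soonest."""
    loads = [0] * len(cycle_times)
    order = []
    for _ in range(count):
        i = min(range(len(cycle_times)), key=lambda k: loads[k] + cycle_times[k])
        loads[i] += cycle_times[i]
        order.append(i)
    return order


def dynamic_order(cycle_times, count):
    """Static order with the first two occurrences of the fastest cluster removed."""
    fastest = cycle_times.index(min(cycle_times))
    order, skipped = [], 0
    for i in static_order(cycle_times, count):
        if i == fastest and skipped < 2:
            skipped += 1      # slot reserved for the factor task or its update
            continue
        order.append(i)
    return order


cycle_times = [3, 5, 8]
order = dynamic_order(cycle_times, 10)
# Panels are numbered from the right in the text, so reverse before printing.
print("".join(str(cycle_times[i]) for i in reversed(order)))   # 85335385
```

Under these assumptions, the eight remaining panels come out in the order 85335385 given in the example.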


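[Figure 6: schedule for the example with clusters 3, 5 and 8; the rows for Cluster 3 and Cluster 8 show factor (F) and update (U) tasks together with communications labeled Ga, Send and Br.]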
