HAL Id: inria-00319241

https://hal.inria.fr/inria-00319241

Submitted on 7 Sep 2008


MPI Applications on Grids: A Topology-Aware

Approach

Camille Coti, Thomas Herault, Franck Cappello

To cite this version:

Camille Coti, Thomas Herault, Franck Cappello. MPI Applications on Grids: A Topology-Aware

Approach. [Research Report] RR-6633, INRIA. 2008, pp.21. ⟨inria-00319241⟩

Rapport de recherche
ISSN 0249-6399  ISRN INRIA/RR--6633--FR+ENG

Thème NUM

MPI Applications on Grids: a Topology-Aware Approach

Camille Coti — Thomas Herault — Franck Cappello

N° 6633


Centre de recherche INRIA Futurs

Parc Orsay Université, ZAC des Vignes,

Camille Coti, Thomas Herault, Franck Cappello

Thème NUM: Systèmes numériques
Équipe-Projet Grand Large

Research Report n° 6633, September 2008, 18 pages

Abstract: Large Grids are built by aggregating smaller parallel machines through a public long-distance interconnection network (such as the Internet). Therefore, their structure is intrinsically hierarchical. Each level of the network hierarchy gives performance which differs from the other levels in terms of latency and bandwidth. MPI is the de facto standard for programming parallel machines, and therefore an attractive solution for programming parallel applications on this kind of grid. However, because of the aforementioned differences of communication performance, the application continuously communicates back and forth between clusters, with a significant impact on performance. In this report, we present an extension of the information provided by the run-time environment of an MPI library, a set of efficient collective operations for grids, and a methodology to organize communication patterns within applications with respect to the underlying physical topology, and implement it in a geophysics application.

Résumé : Large-scale computing grids are built by aggregating smaller parallel machines and connecting them through a public long-distance interconnection network (such as the Internet). Their structure is therefore intrinsically hierarchical. Each level of the network hierarchy provides performance that differs from the other levels in terms of latency and bandwidth. MPI is the de facto standard for programming parallel machines, and therefore an attractive solution for programming parallel applications for this kind of grid. However, because of the aforementioned differences in communication performance, applications communicate between clusters with a significant impact on performance. In this report, we present an extension of the information provided by the runtime environment of an MPI library, a set of collective communications for grids, and a method to organize the communication patterns within the application according to the underlying physical topology, and we implement it in a geophysics application.

1 Introduction

Large Institutional Grids are built by aggregating smaller parallel machines (usually clusters) through a public interconnection network (such as the Internet). Because of their strong experience in parallel machines, many users wish to program such grids as traditional parallel machines although they do not know all the characteristics of such machines.

The de facto standard for programming parallel machines is the Message Passing Interface (MPI). One of the advantages of MPI is that it provides a single well-defined programming paradigm, based on explicit message passing and collective communications. It is interesting to consider an MPI for Grids, since complex applications may use non-trivial communication schemes both inside and between clusters.

In order to port MPI to Grids, several issues have to be addressed. In [4], we already addressed the problem of designing an efficient runtime environment for MPI on Grids and enabling transparent inter-cluster communications. With the framework we designed in this preliminary work, an MPI application cannot take full advantage of the Grid power. Indeed, the communication topology does not differentiate communications between nodes inside a same cluster and remote nodes. Thus, in the general case, the application continuously communicates back and forth between clusters, with a significant impact on performance.

In this report we address the key problem of adapting the application's communication scheme to the topology of the grid on which the application is spawned. Our approach consists in specifying, during the application design and implementation, the topologies the application can be efficiently run on. In this specification, which will be described more thoroughly in section 3, roles for specific nodes and network requirements between nodes can be defined.

A few approaches tried to tackle the topology adaptation problem (e.g. PACX-MPI and MPICH-G [8, 14]) by publishing a topology description to the application at runtime. However, practical experiments demonstrated that it is a difficult task to compute an efficient communication scheme from a topology discovered at runtime.

In the approach we develop inside the QosCosGrid European Project, the topology published at runtime to the application is compatible with the requirements of the application for an efficient data distribution and communications. Moreover, the published topology contains additional information on nodes and networks, thus allowing to specialize roles for specific nodes.

Furthermore, topology information can be used to organize point-to-point communications within collective operations. Collective operations have been studied and optimized during the last decades under a common assumption: the communication cost function between two nodes is the same throughout the system. This hypothesis is no longer valid in a grid environment, since communication costs can vary by three orders of magnitude between two consecutive levels of hierarchy. Hence, performance can greatly benefit from collective operations that adapt their point-to-point communications to the topology.

The main contributions of this article are: a grid-enabled MPI middleware, featuring topology discovery functionality; a set of adapted collective operations that fit with the topology of the grid using topology information, namely Broadcast, Reduce, Gather, Allgather, Barrier and Alltoall; and a grid-enabled application that takes advantage of the aforementioned features.

2 Related Work

Open MPI [5] is the MPI implementation used as the basis of the grid-enabled middleware [4] of the QosCosGrid project [16]. It fits particularly well to grid computing, because it allows heterogeneous 32/64 bits representation, and can handle more complex shifting between different architectures.

The PACX-MPI [8] project was created to interconnect two vendor parallel machines. It can be used to aggregate several clusters and create a meta-computer.

Grid-MPI [17] is a grid-enabled extension of YAMPII using the IMPI [9] standard. It implements the MPI-1.2 standard and most of the MPI-2 standard. In [17] some optimized collective operations are presented for two clusters. The AllReduce algorithm is based on the works presented in [20]. The broadcast algorithm is based on [1]. Collective operations are optimized to make an intensive use of inter-cluster bandwidth, with the assumption that inter-cluster communications have access to a higher bandwidth than intra-cluster ones. However, this is not always a valid assumption. Nowadays, the bandwidth of high-speed and low-latency networks is higher than that of inter-cluster Internet links. An environment where the inter-cluster bandwidth is lower than the intra-cluster bandwidth is out of the scope of [17]. Furthermore, GridMPI does not allow the use of heterogeneous environments because of different precision in floating point and 32/64 bits support.

The Globus Toolkit [7] is a set of software that aims to provide tools for an efficient use of grids. MPICH [10] has been extended to take advantage of these features [14] and make an intensive use of the available resources for MPI applications. It uses GT for resource discovery, staging, spawning, launching and to monitor the application, input and output redirection and topology discovery. MPICH-G2 introduced the concept of colors to describe the available topology. It is limited to at most four levels: WAN, LAN, system area and, if available, vendor MPI. Those four levels are usually enough to cover most use cases. However, one can expect finer-grain topology information and more flexibility for large-scale grid systems. Besides, colors in MPICH-G2 are integers only. This is convenient to create new communicators using MPI_Comm_split(), but the user has to guess what color corresponds to the requested communicator.

All of these approaches expose the physical topology to the application. On the other hand, our approach focuses on the communication pattern: the application has access to the logical topology, which maps the communications of the application. Another element of the grid architecture, the meta-scheduler, which is not described in this work, is in charge of matching this requested logical topology with a compatible physical topology.

Collective operations have been studied widely and extensively over the last decades. One can cite Fibonacci (k-ary) trees and binomial trees for broadcast operations. Binary trees can be optimized as split-binary trees. The message is split equally by the root [23], and each half is sent to each side of the tree. At the end of the first phase, each half of the nodes has received half of the message. Each node then exchanges its half with a node among the other half.

However, as pointed out in [20], those strategies are optimal in homogeneous environments, and most often with a power-of-two number of processes. Their performance is drastically harmed by non-power-of-two numbers of processes and heterogeneous message transmission times.

Topology-discovery features in Globus have been used to implement a topology-aware hierarchical broadcast algorithm in MPICH-G2 [13]. The basic idea of this study is to minimize inter-cluster communications. Instead of sending log(p) messages between two clusters (p being the overall number of nodes in the system), as a traditional binary tree would, the algorithm presented in this article sends only one message.

A hierarchical broadcast algorithm is also presented in [3], with a very detailed communication cost model and a deep analysis of the problem. The model used in this study considers not only heterogeneity in terms of communication delays but also in terms of message processing speed. It considers a multi-level strategy to minimize the complexity of an instance of the Weighted Broadcast Problem with an extension of a Greedy Broadcasting Algorithm.

An optimized reduction algorithm for heterogeneous environments is presented in [15]. However, the model used in that study considers only the time each node takes to process a message, not point-to-point communication times.

Wormhole is a routing protocol targeted to massively parallel computers. Messages are divided into small pieces called flits, which are pipelined through the network. [18] studies how collective operations can benefit or be harmed by message splicing.

3 Architecture

To implement our topology-aware approach, we have designed QCG-OMPI, a modified version of Open MPI for the QosCosGrid project. The general architecture of QCG-OMPI is detailed in [4]. A set of persistent services extend the runtime environment of the parallel application in order to support applications spanning over several clusters.

These services are called Inter-Cluster Communication Services (ICCS). They make possible peer-to-peer connections between nodes separated by a firewall and/or a NAT. The ICCS architecture is depicted in figure 1.

Several inter-cluster connection techniques are implemented in QCG-OMPI. Basic techniques include direct connection (when possible according to the firewall policy of clusters), relaying (like PACX-MPI, though with only one relay daemon instead of one for incoming communications and one for outgoing ones), limitation to an open port range (like MPICH-G2 and PACX-MPI) and reverse connection. More elaborate methods include traversing TCP, which is presented in [21].

These features are provided by a hierarchical distributed grid infrastructure consisting of a set of grid services. A connection helper is running on every node to help MPI processes establish connections with other processes. A frontal service is running on the front-end machine of every cluster. It is used for message relaying within the infrastructure, and caches information. The only centralized service is the broker. It maintains an up-to-date global contact list about all processes of the applications and knows which is the most efficient technique to interconnect two processes.

Figure 1: Inter-cluster communication services (a connection helper on every cluster node, a frontal/proxy on each cluster front-end, and a central broker)

3.1 Multi-level topology discovery

Institutional grids are made of several levels of hierarchy: nodes can feature multi-core or multi-processor architectures, they are put together as clusters (sharing a same network), sometimes several clusters are geographically close to each other (e.g., in the same machine room or in the same building), yet cannot be considered as a single cluster, and clusters located in different organizations are interconnected through the Internet. Hence, applications ought to be adapted to the underlying topology, in order to take it into account to organize their communications. However, the MPI standard does not specify any access to such geographical information.

We propose to describe the topology by naming the groups of nodes the processes will be mapped on. This information is passed as a parameter of the scheduler to specify hardware requirements on the requested nodes.

We give the developer the possibility to describe a topology for his application. Beside the application itself, he can write a JobProfile describing requirements in terms of communication possibilities and/or computational power for the application processes. A meta-scheduler is in charge of mapping the requested topology on available nodes whose characteristics match as tightly as possible the ones requested in the JobProfile.

This approach allows the application developer to organize his application (and more specifically his communications) with the physical topology the application will be executed on in mind. An example of a minimal JobProfile is presented in figure 2. It describes the process groups involved in the computation (only one group here), in particular by specifying the number of processes in each group (here $NPROCS, i.e., all the available processes in the job). Some parameters are left blank, and filled by the meta-scheduler with the characteristics of the obtained mapping.

<grmsJob appId="helloworld-job">
  <task taskId="helloworld-task1">
    <resourcesDescription>
      <topology>
        <group groupId="g1"/>
        <processes>
          <groupIdReference groupId="g1"/>
          <processesCount>
            <value>$NPROCS</value>
          </processesCount>
        </processes>
      </topology>
    </resourcesDescription>
    <execution type="open_mpi">
      <executable>
        <execFile>
          <file>
            <url>gsiftp://host.domain.com/~/hello</url>
          </file>
        </execFile>
      </executable>
      <processesCount>
        <value>$NPROCS</value>
      </processesCount>
    </execution>
  </task>
</grmsJob>

Figure 2: Simple JobProfile example

The groupId defined in the JobProfile is the same as the one obtained by the application in the topology description. This way it is easy to determine in the application which group a given process belongs to, and which processes belong to a given group. Furthermore, a bijective function is provided by the MPI library to get the color integer corresponding to the name of the group, to provide the ability to create communicators.

The following two subsections explain how to take advantage of those colors. Subsection 3.2 describes a set of collective operations designed to fit the physical topology and the knowledge about proximity between processes, and subsection 3.3 describes a specifically adapted application.

The difference between the physical and the logical topology is that the latter is defined by the developer, whereas the former is obtained by the scheduler and matches the logical topology. For example, if the developer requested three groups for tightly coupled processes, the scheduler can map them on two clusters only: the physical topology meets the requirements of the logical topology, but is not exactly the same as what was specified. Global collective operations must consider the whole set of processes as two subsets, whereas the application considers it as three subsets.

3.2 Adapted collective operations

Collective operations are critical in MPI applications. A study conducted at the Stuttgart High-Performance Computing Center [19, 20] showed that on their Cray T3E, they represent 45% of the overall time spent in MPI routines.

To the best of our knowledge, no equivalent study was ever done on a production grid during such a long period. However, one can expect non-topology-aware collective communications to be even more time-consuming (with respect to all the other operations) on a heterogeneous platform.

// communicators are in a table of MPI_Comm
QCG_Bcast(buffer, ROOT, **comms) {
    for (i = 0; i < depth[my_rank]; i++) {
        MPI_Bcast(&buffer, 1, MPI_INT, ROOT, comm[i]);
    }
}

QCG_Barrier(**comms) {
    for (i = depth[my_rank] - 1; i >= 0; i--) {
        MPI_Barrier(comm[i]);
    }
    QCG_Bcast(&i, ROOT, comms);
}

Figure 3: Hierarchical broadcast and barrier

Usage of topology information   In order to use the topology information for collective operations, we define the following groups of processes and the corresponding communicators: a) each topological group (e.g., a cluster) is assigned a communicator; b) among each of those groups, a master is defined (e.g., the lowest global rank); c) all the processes within a given group that are masters of their respective subgroups are part of a master group.
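As an illustration of how such communicators can be derived in practice, the following sketch builds one communicator per topological group and the corresponding master communicator with plain MPI calls. The function name and the assumption that the topology layer hands us one integer color per group are ours; only MPI_Comm_split and its semantics are standard. The resulting communicators play the role of the per-level comm[] entries the pseudocode of figure 3 iterates over.

#include <mpi.h>

/* Sketch: building the communicators used by the hierarchical collectives.
 * group_color is assumed to come from the topology description (one integer
 * per topological group, e.g., per cluster); everything else is plain MPI. */
void build_topology_comms(int group_color,
                          MPI_Comm *group_comm, MPI_Comm *master_comm)
{
    int world_rank, group_rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* a) one communicator per topological group */
    MPI_Comm_split(MPI_COMM_WORLD, group_color, world_rank, group_comm);

    /* b) the master of a group is its lowest global rank,
     *    i.e., local rank 0 after the split above */
    MPI_Comm_rank(*group_comm, &group_rank);

    /* c) the masters join the master communicator; every other process
     *    passes MPI_UNDEFINED and gets MPI_COMM_NULL back */
    MPI_Comm_split(MPI_COMM_WORLD,
                   (group_rank == 0) ? 0 : MPI_UNDEFINED,
                   world_rank, master_comm);
}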

MPI_Bcast   Sending a message between two clusters takes significantly more time than sending a message within a cluster. The latency for small synchronization messages can be higher by several orders of magnitude, and the inter-cluster bandwidth is shared between all the nodes communicating between clusters. Thus, a priority of any collective algorithm in grid environments is to minimize inter-cluster communications.

In this study, we use a hierarchical broadcast. Most often, broadcasts are initiated by the process of rank 0 of the concerned communicator. The root of the broadcast, being part of the top-level masters communicator, broadcasts the message along this top-level communicator. Each process then broadcasts the message along its sub-masters communicator, until the lowest-level nodes are reached.

The system of colors does not allow the processes to know about communication costs within a level, so we assume here that each sub-group is homogeneous.

MPI_Reduce   Reduction algorithms are not just reverse broadcasting algorithms. Indeed, the operation may not be commutative, although all the predefined operators are commutative. However, it is required to be associative.

We assume that the processes mapped in a given cluster have consecutive ranks. Therefore, we can use the hierarchical approach for a reduction operation too. Each lowest-level cluster performs a reduction towards its master, and for each level until the top level is reached, the masters perform a reduction toward their level master.
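A minimal sketch of this two-level reduction, written with the communicators built above (the function name is ours, and the integer sum stands for any associative operation):

#include <stdlib.h>
#include <mpi.h>

/* Sketch: two-level hierarchical reduction. group_comm groups the processes
 * of one cluster; master_comm groups the per-cluster masters and is
 * MPI_COMM_NULL on the other processes. */
void hier_reduce_sum(const int *sendbuf, int *recvbuf, int count,
                     MPI_Comm group_comm, MPI_Comm master_comm)
{
    int *partial = malloc(count * sizeof(int));

    /* Step 1: reduce inside each cluster towards the local master (rank 0);
     * these messages stay on the fast intra-cluster network. */
    MPI_Reduce(sendbuf, partial, count, MPI_INT, MPI_SUM, 0, group_comm);

    /* Step 2: the masters combine their partial results, so only one
     * reduction crosses the inter-cluster links. Associativity of the
     * operation is what makes this two-step evaluation valid. */
    if (master_comm != MPI_COMM_NULL)
        MPI_Reduce(partial, recvbuf, count, MPI_INT, MPI_SUM, 0, master_comm);

    free(partial);
}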

MPI_Gather   A Gather algorithm sends the contents of a buffer from each process to the root process of the gather operation. It can also be done in a hierarchical way: a root is defined in each cluster and sub-cluster, and an optimized gather algorithm is used within the lowest level of hierarchy, then for each upper level until the root is reached.

The executions among sub-masters gather buffers which are actually aggregations of buffers. This aggregation minimizes the number of inter-cluster communications, for the cost of only one trip time, while making a better use of the inter-cluster bandwidth.
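The aggregation can be sketched as follows with standard MPI calls (the names are ours; the sketch gathers one integer per process and assumes, as in the text, that the processes of a cluster have consecutive global ranks, with masters ordered accordingly):

#include <stdlib.h>
#include <mpi.h>

/* Sketch: two-level hierarchical gather of one int per process.
 * group_comm / master_comm are the communicators built above. */
void hier_gather(int value, int *result, /* significant at the global root */
                 MPI_Comm group_comm, MPI_Comm master_comm)
{
    int lsize, lrank, *local = NULL;

    MPI_Comm_size(group_comm, &lsize);
    MPI_Comm_rank(group_comm, &lrank);
    if (lrank == 0)
        local = malloc(lsize * sizeof(int));

    /* Step 1: intra-cluster gather towards the local master. */
    MPI_Gather(&value, 1, MPI_INT, local, 1, MPI_INT, 0, group_comm);

    /* Step 2: each master forwards its aggregated buffer in a single
     * inter-cluster message; clusters may differ in size, hence Gatherv. */
    if (master_comm != MPI_COMM_NULL) {
        int nmasters, mrank, *counts = NULL, *displs = NULL;
        MPI_Comm_size(master_comm, &nmasters);
        MPI_Comm_rank(master_comm, &mrank);
        if (mrank == 0) {
            counts = malloc(nmasters * sizeof(int));
            displs = malloc(nmasters * sizeof(int));
        }
        MPI_Gather(&lsize, 1, MPI_INT, counts, 1, MPI_INT, 0, master_comm);
        if (mrank == 0)
            for (int i = 0, off = 0; i < nmasters; off += counts[i], i++)
                displs[i] = off;
        MPI_Gatherv(local, lsize, MPI_INT,
                    result, counts, displs, MPI_INT, 0, master_comm);
        free(counts);
        free(displs);
    }
    free(local);
}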

MPI_Barrier   MPI_Barrier blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call [22]. This can be guaranteed by successive calls to MPI_Barrier from the lowest-level communicator up to the highest-level masters' communicator. At this point, all the lowest-level masters are sure that all the callers of MPI_Barrier entered the barrier; they then spread the information using a hierarchical broadcast, referred to as QCG_Bcast in figure 3.

MPI_Allgather   can be achieved using the same algorithm as MPI_Barrier.

MPI_Alltoall   [12] presents an algorithm for all-to-all communications on clusters of clusters. Nodes within each cluster perform an all-to-all communication, then several nodes in each cluster send an aggregation of several buffers to nodes belonging to the other clusters. These meta-buffers are broadcast within the other clusters.

The goal of this cutting-through is to share the quantity of data that must be sent between two given nodes. However, if the meta-buffers do not contain enough data to fill the inter-cluster bandwidth, it will not fully take advantage of those links.

Applications do not usually send extremely large data through MPI communications; hence, the meta-buffer containing the result of each local all-to-all will be sent in only one communication.

To summarize, an all-to-all communication is first performed within each lowest-level set of processes. Then at each level, the masters perform an all-to-all communication with aggregation of the buffers sent in the lower level. When the top level is reached, the result is broadcast (using a topology-aware broadcast algorithm).

3.3 Grid-enabled application

We implemented and used these collective operations in Ray2mesh [11], a geophysics application that traces seismic rays along a given mesh containing a geographic area description. It uses the Snell-Descartes law in spherical geometry to propagate a wave front from a source (earthquake epicenter) to a receiver (seismograph). It features two parallel implementations, including a master-worker-based approach.

The execution of Ray2mesh can be split up into three phases. The first one consists of successive collective operations to distribute information to the computation nodes. It is made of three broadcast operations, from the master toward the workers. Broadcasts are tree-based collective operations: calls return as soon as the process's participation in the operation is completed, not when the whole operation is completed. For example, the first stage of a tree-based operation can start as soon as the second stage of the previous one starts. Operations performed here between broadcasts are simple operations, like memory allocations: the broadcasts are pipelined. The second phase is the master-worker computation itself. The third phase is made of collective operations: a broadcast and an all-to-all operation.

Figure 4: Organization of a multi-level master-worker application. SM is the Super Master, B are the Bosses, W are the Workers.

The master-worker approach is used for a very wide range of parallel applications. Its major drawback is the concentration of all the data distribution work in the master, creating a bottleneck. Nevertheless, the unique-queue (central master) design is a basic approach of a master-worker data distribution.

Moreover, the master can be located in a remote cluster. After computing a chunk of data, slaves wait for a new chunk of data: the longer it takes to send the result to the master and to receive a new chunk, the longer the worker will be idle. Data prefetching to overlap communication and computation [2] can be a solution to eliminate those idle phases; however, this approach requires massive modification of an application and complicates the use of generic libraries.

Our approach is focused on communication patterns. We use the topological information to build a multi-level implementation of the three phases involved in Ray2mesh to make the communication pattern fit with the topology of the Grid.

Processes which are close to each other can communicate often, whereas distant processes ought to communicate barely (or not at all) with each other. We organize communications with this basic principle in mind. Processes located in the same cluster (we consider as a cluster a set of processes that share the same lowest-level color) share the same data queue. A process within each cluster is in charge of distributing the data among the other processes: we call it the local boss. Then, workers receive data from their local boss, involving local-only communications.

Bosses get their data queue from an upper-level boss (which can be the top-level master, or super-master). Therefore, most communications are done within clusters. The most expensive communications in terms of latency, between two top-level clusters, are the least frequent ones. They are used to transfer larger sets of data, destined to bosses: they make a better use of a (supposedly) wider bandwidth and reduce the effects of a higher latency.

The organization of a multi-queue master-worker application is represented in figure 4, with 2 clusters and 2 sub-clusters in one cluster.

This approach provides the same attractive properties as a traditional master-worker application, with any number of levels of hierarchy. Hence, it performs the same automatic load-balancing, not only among the workers, but also among the queues: if a cluster computes its data faster than the other ones, it will get a new set of data immediately after the results are recovered. This property is necessary to suit different sizes of clusters and different computation speeds. Moreover, each master has to handle fewer work requests than a unique master would have to.
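To make the scheme concrete, here is a minimal sketch of a local boss's loop: it fetches batches of work units from its upper-level master and serves them chunk by chunk to the workers of its own cluster. The tags, chunk and batch sizes, and the omission of result collection are simplifications of ours, not Ray2mesh's actual protocol.

#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_WORK    2
#define TAG_STOP    3
#define CHUNK_LEN   64                 /* doubles per work unit (illustrative) */
#define BATCH_LEN   (8 * CHUNK_LEN)    /* a batch is a whole number of chunks  */

/* Boss loop: runs on the local boss of each cluster. upper_comm connects it
 * to its upper-level master (rank 0 there); local_comm is its own cluster.
 * Result collection is omitted for brevity. */
void boss_loop(MPI_Comm upper_comm, MPI_Comm local_comm)
{
    double batch[BATCH_LEN];
    MPI_Status st;
    int local_size, workers_left, nleft = 0, offset = 0, stopped = 0;

    MPI_Comm_size(local_comm, &local_size);
    workers_left = local_size - 1;     /* everyone in the cluster but the boss */

    while (workers_left > 0) {
        if (nleft == 0 && !stopped) {
            /* Ask the upper level for a new batch of work units. */
            MPI_Send(NULL, 0, MPI_BYTE, 0, TAG_REQUEST, upper_comm);
            MPI_Recv(batch, BATCH_LEN, MPI_DOUBLE, 0, MPI_ANY_TAG,
                     upper_comm, &st);
            if (st.MPI_TAG == TAG_STOP) {
                stopped = 1;
            } else {
                MPI_Get_count(&st, MPI_DOUBLE, &nleft);
                offset = 0;
            }
        }
        /* Serve the next local request with a chunk, or tell the worker to
         * stop once the upper level has run out of work. */
        MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, TAG_REQUEST,
                 local_comm, &st);
        if (nleft > 0) {
            MPI_Send(batch + offset, CHUNK_LEN, MPI_DOUBLE,
                     st.MPI_SOURCE, TAG_WORK, local_comm);
            offset += CHUNK_LEN;
            nleft  -= CHUNK_LEN;
        } else {
            MPI_Send(NULL, 0, MPI_BYTE, st.MPI_SOURCE, TAG_STOP, local_comm);
            workers_left--;
        }
    }
}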

Besides this master-worker computation, Ray2mesh uses several collective operations. Some information related to the computation is broadcast before the actual master-worker phase: our grid-enabled broadcast algorithm is used three times before the computation phase, then once after it. Some information about the results computed by each slave is eventually exchanged between all processes: we use here our hierarchical MPI_Alltoall.

4 Experimental Evaluation

In this section we present the performance measurements of our implementation. We conducted the experiments on two traditional platforms of high performance computing: clusters of workstations with a Gigabit Ethernet network, and computational grids. These experiments were done on the experimental Grid'5000 [6] platform or some of its components.

First, we measure the efficiency of topology-aware collective operations, using micro-benchmarks to isolate their performance. Then we measure the effects of hierarchy on a master-worker data distribution pattern and the effects on the Ray2mesh application.

4.1 Experimental Platform

Grid'5000 is a dedicated, reconfigurable and controllable experimental platform featuring 13 clusters, each with 58 to 342 PCs, connected by Renater (the French Educational and Research wide area network).

It gathers roughly 5000 CPU cores featuring four architectures (Itanium, Xeon, G5 and Opteron) distributed into 13 clusters over 9 cities in France.

For the two families of measurements we conducted (cluster and grid), we used only homogeneous clusters with AMD Opteron 248 (2 GHz / 1 MB L2 cache) bi-processors. This includes 3 of the 13 clusters of Grid'5000: the 93-node cluster at Bordeaux, the 312-node cluster at Orsay, and a 99-node cluster at Rennes. Nodes are interconnected by a Gigabit Ethernet switch.

We also used QCG, a cluster of 4 multi-core nodes with dual-core Intel Pentium D (2.8 GHz / 2x1 MB L2 cache) processors interconnected by a 100 Mb/s Ethernet network.

All the nodes were booted under Linux 2.6.18.3 on Grid'5000 and 2.6.22 on the QCG cluster. The tests and benchmarks are compiled with GCC 4.0.3 (with flag -O3). All tests are run in dedicated mode.

Figure 5: Hierarchical MPI_Gather

Inter-cluster throughput on Grid'5000 is 136.08 Mb/s and latency is 7.8 ms, whereas intra-cluster throughput is 894.39 Mb/s and latency is 0.1 ms. On the QCG cluster, shared-memory communications have a throughput of 3979.46 Mb/s and a latency of 0.02 ms, whereas TCP communications have a throughput of 89.61 Mb/s and a latency of 0.1 ms.
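As an order of magnitude, assuming a simple latency-plus-bandwidth cost model T(m) = L + m/B (an assumption of ours, not a measurement), these figures give for a 1 MB (8 Mb) message on Grid'5000:

\[
T_{\mathrm{intra}} \approx 0.1\,\mathrm{ms} + \frac{8\,\mathrm{Mb}}{894.39\,\mathrm{Mb/s}} \approx 9.0\,\mathrm{ms},
\qquad
T_{\mathrm{inter}} \approx 7.8\,\mathrm{ms} + \frac{8\,\mathrm{Mb}}{136.08\,\mathrm{Mb/s}} \approx 66.6\,\mathrm{ms},
\]

so a cross-cluster transfer costs roughly 7 times an intra-cluster one for 1 MB messages, and about 78 times for latency-bound small messages, which is what the hierarchical collectives of the next section try to exploit.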

4.2 Collective operations

We implemented the collective operations described in section 3.2 using MPI functions. We used them on 32 nodes across two clusters in Orsay and Rennes (figures 6a-b). Although not exactly a grid in the sense of a cluster of clusters, a configuration with two clusters is an extreme situation to evaluate our collective communications: a small constant number of inter-cluster messages are sent by topology-aware communications, whereas O(log(p)) (p: total number of nodes) inter-cluster messages are sent by standard collective operations. With any number of clusters, topology-aware collective operations send O(log(C)) inter-cluster messages, where C is the number of clusters.

We also used the QCG cluster with 8 processes mapped on each machine. Although this oversubscribes the nodes (8 processes for 2 available slots), our benchmarks are not CPU-bound, and this configuration increases the stress on the network interface. Measurements with a profiling tool validated the very low CPU usage during our benchmark runs.

We used the same measurement method as described in [14], using the barrier described in section 3.2.

Figure 6: Comparison between standard and grid-enabled collective operations on a grid. (a) Broadcast, (b) Reduce, (c) Reduce, (d) Gather.

Since we implemented our hierarchical collective operations in MPI, some pre-treatment of the buffers may be useful. We compared the performances of our hierarchical MPI_Gather for different message sizes on the QCG cluster in figure 5. Pre-splicing the buffer appeared to be interesting for messages larger than 1 kB. Then it is possible to pipeline the successive stages of the hierarchical operation. It appeared to be useful especially when shared-memory communications were involved, which can be explained by the better performances of shared-memory communications on small messages.
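A sketch of what this pre-splicing looks like on top of the hierarchical broadcast of figure 3 (the segment size and the assumption that the level master has rank root in every communicator are illustrative):

#include <mpi.h>

#define QCG_SEG_SIZE 1024   /* bytes; splicing paid off above roughly 1 kB */

/* Sketch: spliced (segmented) hierarchical broadcast. Each segment travels
 * level by level, so a master can forward segment k on its lower-level
 * communicator while the upper level is already sending segment k+1. */
void QCG_Bcast_spliced(char *buf, int count, int root,
                       MPI_Comm *comm, int depth)
{
    for (int offset = 0; offset < count; offset += QCG_SEG_SIZE) {
        int seg = count - offset;
        if (seg > QCG_SEG_SIZE)
            seg = QCG_SEG_SIZE;
        for (int i = 0; i < depth; i++)
            MPI_Bcast(buf + offset, seg, MPI_BYTE, root, comm[i]);
    }
}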

Figures 6(a) and 6(b) picture comparisons between the standard and hierarchical MPI_Bcast and MPI_Reduce on Grid'5000. Splicing appeared to be useful for MPI_Bcast: we spliced the messages every 10 kB for messages of size 64 kB and beyond, whereas it was useless for MPI_Reduce, since big messages are already cut up by the algorithm implemented in Open MPI.

One can see that, as expected, the hierarchical MPI_Bcast (figure 6(a)) always performs better than the standard one. Moreover, splicing and pipelining make it possible to avoid the performance step around the eager/rendezvous mode transition.

When messages are large regarding the communicator size, MPI_Reduce (figure 6(b)) in Open MPI is implemented using a pipeline mechanism. This way, the communication cost is dominated by the high throughput of the pipeline rather than the latency of a multi-step tree-like structure. Hierarchy shortens the pipeline: its latency (i.e., the time to load the pipeline) is then smaller and it performs better on short messages. But for large messages (beyond 100 kB), the higher throughput of a longer pipeline outperforms the latency-reduction strategy. In this case, hierarchization of the communications is not the appropriate approach. The high standard deviation of the standard MPI_Reduce for small messages can be explained by the high variability of concurrent traffic on the inter-cluster links, and the fact that the standard algorithm for small messages is more sensitive to this.

Figures 6(c) and 6(d) picture comparisons between the standard and hierarchical MPI_Reduce and MPI_Gather on the QCG cluster. On a cluster of multi-cores, collective operations over shared memory outperform TCP communications significantly enough to have a negligible cost. Therefore, on a configuration including a smaller number of physical nodes, inducing more shared-memory communications, our hierarchical MPI_Reduce performs better (figure 6(c)).

For the same reasons, MPI_Gather performs better if communications are organized in order to fit with the physical topology. Moreover, performances seem to be drastically harmed by the transition to the rendezvous communication protocol, which is by default set to 4 kB in Open MPI. We cut messages every 1 kB for the hierarchical mode, as explained in figure 5.

4.3 Adapted application

The execution phases of Ray2mesh are presented in section 3.3. It is made of 3 phases: 2 communication phases and a master-worker computation phase in between them. When the number of processes increases, one can expect the second phase to be faster but the first and third phases to take more time, since more nodes are involved in the collective communications.

Figure 7 presents the scalability of Ray2mesh under three configurations: standard (vanilla), using grid-enabled collective operations, and using a hierarchical master-worker pattern together with grid-enabled collective operations. Those three configurations represent the three levels of adaptation of applications to the Grid. The standard deviation is lower than 1% for each point.

First of all, Ray2mesh scales remarkably well, even when some processes are located on a remote cluster. But performances are slightly better on a single cluster, for obvious reasons: collective operations are faster on a cluster, and workers stay idle longer on a grid while they are waiting for a new set of data.

When a large number of nodes are involved in the computation, collective operations represent an important part of the overall execution time. We can see the improvement obtained from grid-enabled collectives on the grid-optimized collectives line in figure 7. The performance gain for 180 processes is 9.5%.

Small-scale measurements show that the grid-enabled version of Ray2mesh does not perform as well as the standard version. The reason for that is that several processes are used to distribute the data (the bosses) instead of only one. For example, with 16 processes distributed on two clusters, 15 processes will actually work for the computation in a single-queue master-worker application, whereas only 12 of them will contribute to the computation in a multi-level (two-level) master-worker application.

However, we ran processes on each of the available processors, regardless of their role in the system. Bosses are mainly used for communications, whereas workers do not communicate a lot (during the master-worker phase, they communicate with their boss only). Therefore, a worker process can be run on the same slot as a boss without competing for the same resources. For a given number of workers, as represented by the workers and master only line in figure 7, the three implementations show the same performance for a small number of processes, and the grid-enabled implementations are more scalable. The performance gain for 180 processes is 35%.

Figure 7: Scalability of Ray2mesh on a grid (execution time in seconds vs. number of processes, for the vanilla version, grid-optimized collectives, the grid-enabled application, and the grid-enabled application counting workers and master only).

5 Conclusion

In this report, we presented an extension of the runtime environment of an MPI implementation targeting institutional grids to provide topology information to the application. Those features have been implemented in our specific adaptation of an MPI library for grids.

We proposed a methodology to use MPI programming techniques on grids. First we described a set of efficient collective operations that organize their communications with respect to the topology in order to minimize the number of high-latency communications. Experiments showed the benefits of this approach and its limitations.

Then we presented a programming method to make use of the topology information in the application and define adapted communication patterns. We used this method to implement a grid-enabled version of the Ray2mesh geophysics application featuring a multi-level master-worker pattern and our hierarchical collective operations. Experiments showed that using optimized collectives fitted to the physical topology of the Grid induces a performance improvement. They also showed that adapting the application itself can improve the performance even further.

We are now working on the integration of our grid-enabled MPI middleware with the Distributed Resource Management Application API (DRMAA) OpenDSP [24]. OpenDSP features job submission, job termination and job monitoring services, and provides a dedicated authentication service to increase the level of security and confidentiality across the grid.

Acknowledgments

The authors thank Emmanuel Jeannot for the interesting discussions about collective operations, and Stephane Genaud and Marc Grunberg for their help on Ray2mesh. Special thanks must be given to George Bosilca for his explanations about the implementation of collective operations in Open MPI.

Part of the authors are funded through the QosCosGrid European Project (grant number: FP6-2005-IST-5033883).

References

[1] Mike Barnett, Satya Gupta, David G. Payne, Lance Shuler, Robert van de Geijn, and Jerrell Watts. Building a high-performance collective communication library. In Proceedings of Supercomputing '94, pages 107-116, Washington DC, November 1994. IEEE.

[2] Salah-Salim Boutammine, Daniel Millot, and Christian Parrot. An adaptive scheduling method for grid computing. In Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner, editors, 12th International Euro-Par Conference (Euro-Par'06), volume 4128 of Lecture Notes in Computer Science (LNCS), pages 188-197, Dresden, Germany, August-September 2006. Springer-Verlag (Berlin/New York).

[3] Franck Cappello, Pierre Fraigniaud, Bernard Mans, and Arnold L. Rosenberg. HiHCoHP: Toward a realistic communication model for hierarchical hyperclusters of heterogeneous processors. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (2nd IPDPS'01), page 42, San Francisco, CA, USA, April 2001. IEEE Computer Society (Los Alamitos, CA).

[4] Camille Coti, Thomas Herault, Sylvain Peyronnet, Ala Rezmerita, and Franck Cappello. Grid services for MPI. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08), Lyon, France, May 2008. ACM/IEEE.

[5] E. Gabriel et al. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97-104, Budapest, Hungary, September 2004.

[6] F. Cappello et al. Grid'5000: a large scale, reconfigurable, controlable and monitorable grid platform. In Proceedings of the IEEE/ACM Grid'2005 workshop, Seattle, USA, 2005.

[7] Ian T. Foster. Globus Toolkit version 4: Software for service-oriented systems. J. Comput. Sci. Technol., 21(4):513-520, 2006.

[8] Edgar Gabriel, Michael M. Resch, Thomas Beisel, and Rainer Keller. Distributed computing in a heterogeneous computing environment. In Vassil N. Alexandrov and Jack Dongarra, editors, PVM/MPI, volume 1497 of Lecture Notes in Computer Science, pages 180-187. Springer, 1998.

[9] William L. George, John G. Hagedorn, and Judith E. Devaney. IMPI: Making MPI interoperable. Journal of Research of the National Institute of Standards and Technology, 105(3), 2000.

[10] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. High-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, September 1996.

[11] Marc Grunberg, Stephane Genaud, and Catherine Mongenet. Parallel seismic ray tracing in a global earth model. In Hamid R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'02), volume 3, pages 1151-1157, Las Vegas, Nevada, USA, June 2002. CSREA Press.

[12] E. Jeannot and Luiz-Angelo Steffenel. Fast and efficient total exchange on two clusters. In the 13th International Euro-Par Conference, volume 4641 of LNCS, pages 848-857, Rennes, France, August 2007. Springer Verlag.

[13] N. T. Karonis, B. de Supinski, I. Foster, B. Gropp, E. Lusk, and T. Bresnahan. Exploiting hierarchy in parallel computer networks to optimize collective operation performance. In IPPS: 14th International Parallel Processing Symposium. IEEE Computer Society Press, 2000.

[14] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2: A grid-enabled implementation of the Message Passing Interface. CoRR, cs.DC/0206040, 2002.

[15] Liu and Wang. Reduction optimization in heterogeneous cluster environments. In IPPS: 14th International Parallel Processing Symposium. IEEE Computer Society Press, 2000.

[16] M. Charlot et al. The QosCosGrid project: Quasi-opportunistic supercomputing for complex systems simulations. Description of a general framework from different types of applications. In Ibergrid 2007 conference, Centro de Supercomputacion de Galicia (CESGA), 2007.

[17] Motohiko Matsuda, Tomohiro Kudoh, Yuetsu Kodama, Ryousei Takano, and Yutaka Ishikawa. Efficient MPI collective operations for clusters in long-and-fast networks. In CLUSTER. IEEE, 2006.

[18] McKinley, Tsai, and Robinson. Collective communication in wormhole-routed massively parallel computers. COMPUTER: IEEE Computer, 28, 1995.

[19] Rolf Rabenseifner. Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512, January 25, 1999.

[20] Rolf Rabenseifner. Optimization of collective reduction operations. In Marian Bubak, G. Dick van Albada, Peter M. A. Sloot, and Jack Dongarra, editors, International Conference on Computational Science, volume 3036 of Lecture Notes in Computer Science, pages 1-9. Springer, 2004.

[21] Ala Rezmerita, Tangui Morlier, Vincent Néri, and Franck Cappello. Private virtual cluster: Infrastructure and protocol for instant grids. In Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner, editors, Euro-Par, volume 4128 of Lecture Notes in Computer Science, pages 393-404. Springer, 2006.

[22] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference. The MIT Press, 1996.

[23] Rajeev Thakur and William Gropp. Improving the performance of collective operations in MPICH. In Jack Dongarra, Domenico Laforenza, and Salvatore Orlando, editors, PVM/MPI, volume 2840 of Lecture Notes in Computer Science, pages 257-267. Springer, 2003.

[24] Peter Tröger, Hrabri Rajic, Andreas Haas, and Piotr Domagalski. Standardization of an API for distributed resource management systems. In CCGRID, pages 619-626. IEEE Computer Society, 2007.


