HAL Id: hal-00261436
https://hal.archives-ouvertes.fr/hal-00261436
Submitted on 7 Mar 2008
A framework for adaptive collective communications for heterogeneous hierarchical computing systems
Luiz Angelo Steffenel, Grégory Mounié
To cite this version:
Luiz Angelo Steffenel, Grégory Mounié. A framework for adaptive collective communications for
heterogeneous hierarchical computing systems. Journal of Computer and System Sciences, Elsevier,
2008, 74 (6), pp.1082-1093. ⟨hal-00261436⟩
A Framework for Adaptive Collective Communications for Heterogeneous Hierarchical Computing Systems
Luiz Angelo Steffenel (1), Grégory Mounié (2)
(1) Université Nancy 2 / LORIA, Nancy, France
(2) Laboratoire ID-IMAG, Grenoble, France
Abstract

Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represents a great challenge to the development of efficient high performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics, since it considers transient network characteristics when dividing the nodes into clusters. Our experiments with the broadcast operation on a real grid setup indicate that an adaptive framework allows significant performance improvements on MPI collective communications.

Key words: Grid computing; Performance modeling; Adaptive techniques; Polyalgorithms; Collective communication; MPI
In the last years, there was a huge development in the field of parallel and distributed processing, especially at the architectural level, leading to a wide variety of execution supports. The major innovation was the phenomenal spread of architectures like clusters and grids. These platforms represent a reasonable alternative to traditional parallel machines and have become the most cost-effective computing supports for solving a large range of high performance computing applications, due to the good cost/performance ratio that they provide. However, the introduction of such parallel systems has a major impact on the design of efficient parallel algorithms. Indeed, new characteristics have to be taken into account, including scalability and portability. Moreover, such parallel systems are often upgraded with new generations of processors and network technologies. Adaptability therefore becomes crucial because of the frequent changes of the system hardware. These different elements require us to revise the classical parallel algorithms, which consider only regular architectures with static configurations, and to propose new approaches.
Our objective in this work is to propose a generic framework based on communication models and scheduling techniques to deal with communication scheduling in heterogeneous environments such as computational grids. More precisely, this paper proposes a communication scheduling methodology with two adaptation levels. The first level operates within each cluster, determining the most efficient communication algorithm from a set of well-known algorithms from the literature. At the second level, our framework de[...]
Email addresses: Luiz-Angelo.Steffenel@univ-nancy2.fr (Luiz Angelo Steffenel), Gregory.Mounie@imag.fr (Grégory Mounié).
[...] execution time of a collective communication. Therefore, our framework differs significantly from other works, as existing adaptive approaches presented in the literature [1,2,3] proceed by simply scheduling communications at the inter-cluster level, i.e., over long-distance links. On the other side, works like [4,5,6] only try to minimize the execution time of collective communication operations in the context of intra-cluster environments. To the best of our knowledge, our framework provides the first general methodology to automatically associate efficient intra-cluster algorithms with inter-cluster communication heuristics, reducing the overall execution time of a collective communication.
The remainder of the paper is organized as follows. We begin in Section 2 by describing our assumptions for the communication environment. In Section 3 we first define the concept of polyalgorithm, presenting our framework for adaptive communications and detailing its components. Section 4 describes the platform partitioning phase, where we organize the grid into homogeneous logical clusters. Then, in Section 5, we present a case study where we apply the second part of our framework to the development of a grid-aware MPI_Bcast communication operation. To validate the framework contributions, we conduct both practical experiments on a grid environment (Section 6) and numerical simulations (Section 7). These results concern both the evaluation of the optimization overhead and the scalability of the algorithms, demonstrating the interest of this work. Finally, Section 8 concludes the paper and discusses some perspectives to extend this work.
Heterogeneity Model: We assume a generic platform composed of heterogeneous clusters as described in [7]. The platform studied exhibits heterogeneity along three orthogonal axes: (i) the processors that populate the clusters may differ in computational power, even within the same cluster; (ii) the clusters are organized hierarchically and are interconnected via a hierarchy of networks of possibly differing latencies and bandwidths, and at the level of physical clusters the interconnection networks are assumed to be heterogeneous; (iii) the clusters at each level of the hierarchy may differ in size.
Communication Model: We assume that the network is fully connected. The links between pairs of processes are bidirectional, and each process can transmit data on at most one link and receive data on at most one link at any given time. This model is well known in the literature as 1-port full-duplex.
Transmission Model: The literature contains several parallel communication models [8,9,10,11,12,3]. These models differ in their computational and network assumptions, such as latency, heterogeneity, network contention, etc. In this work we adopt the parameterized LogP model (pLogP) [3]. We chose the pLogP model because transmission rates can differ according to the message size, as a consequence of transport protocols and hardware policies. Hence, throughout this paper we use L for the communication latency between two nodes, P for the number of nodes and g(m) for the gap of a message of size m. The gap of a message of size m represents the time required to transmit the message through the network (excluding the latency), which is inversely proportional to the bandwidth of the link. In the case of message segmentation, the segment size s of the message m is a [...] initial message m into k segments.
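As an illustration of how the pLogP parameters can be used in practice, the following minimal sketch builds a gap function g(m) from a few benchmark samples and predicts a point-to-point transfer time as L + g(m). The helper names (`make_gap`, `point_to_point_time`), the piecewise-linear interpolation between samples, and the toy sample values are assumptions made for this example; they are not part of the pLogP definition or of the original framework.

```python
from bisect import bisect_left


def make_gap(samples):
    """Return g(m) built from (message_size, gap) benchmark pairs.

    Piecewise-linear interpolation between samples is an assumption of
    this sketch; pLogP only requires that the gap depends on m.
    Sample sizes are assumed to be distinct.
    """
    points = sorted(samples)
    if len(points) < 2:
        raise ValueError("need at least two samples")
    sizes = [size for size, _ in points]
    gaps = [gap for _, gap in points]

    def g(m):
        if m <= sizes[0]:
            return gaps[0]
        # Right end of the interpolation interval; sizes beyond the last
        # sample reuse the last interval (linear extrapolation).
        i = min(bisect_left(sizes, m), len(sizes) - 1)
        s0, s1 = sizes[i - 1], sizes[i]
        g0, g1 = gaps[i - 1], gaps[i]
        return g0 + (g1 - g0) * (m - s0) / (s1 - s0)

    return g


def point_to_point_time(L, g, m):
    """Predicted time to deliver one message of size m: L + g(m)."""
    return L + g(m)


# Toy values chosen for readability, not measurements from the paper.
g = make_gap([(0, 0.0), (1024, 1.0), (2048, 3.0)])
print(point_to_point_time(0.5, g, 1536))
```

Keeping g as a function rather than a single bandwidth figure reflects the reason given above for choosing pLogP: the effective transmission rate may change with the message size.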
3 An Adaptive Framework for Grid-Aware Communications
In this section, we describe our framework for adaptive communication scheduling in an execution environment characterized by its heterogeneity and its hierarchical organization. We consider a grid environment composed of different clusters $C_1$ to $C_n$ with respectively $n_1, n_2, \ldots, n_n$ nodes. A wide-area network, called a backbone, interconnects these clusters. We assume that a cluster uses the same network card to communicate with one of its own nodes or with a node of another cluster, although each cluster may use a different network technology (Fast Ethernet, Gigabit Ethernet, Myrinet, etc.). Based on that topology, inter-cluster communications are never faster than communications within a cluster.
Most MPI libraries (LAM-MPI, OpenMPI, MPICH2, etc.) implement collective communications assuming that all the nodes are in the same cluster, which means that all communications have the same weight. However, in our case, some messages are transferred within a cluster (from a node of $C_1$ to another node of $C_1$, for example) and others between two clusters. In the first case, bandwidth is higher and latency is lower than in the second case. Therefore, we need to associate different tools to model the overall performance. We assume that communication performance can be predicted based on communication cost models (for instance, the pLogP model [3]) and benchmarks on the real system.
An overview of the framework is sketched in Figure 1. Since the target system may experience heterogeneity at different levels (computing performance, net[...] high performance computing. One way to circumvent this problem is to subdivide the network into homogeneous subnets (or logical clusters), handling each cluster individually and subsequently aggregating them at the grid level. Therefore, the framework is separated into two successive phases. During the first one, we aim to partition the execution platform into subnets with homogeneous characteristics. Then, in the second phase, we determine for each subnet (i.e., for each cluster) the communication algorithm that performs best in that cluster. Indeed, using pLogP, we are able to predict the communication performance on each cluster, allowing us to compare different communication algorithms. In the same way, pLogP is used to define efficient wide-area communication schedules adapted to a heterogeneous grid environment.
[Figure 1 diagram. Components shown: target execution platform; dynamic monitoring tool; platform partitioning (clustering); parameters of the target environment; communication models; performance predictions; adaptive approach (communication scheduling).]
Figure 1. Conceptual framework of the adaptive mechanism
Once the platform is partitioned into separate homogeneous hierarchical clusters, we determine, for each cluster, the algorithm that performs best in that network environment. In practice, we compare the expected performance of different algorithms from the literature (each algorithm being previously mod[...] characteristics and the number of nodes.
Through the analysis of the inter-cluster and intra-cluster performance predictions, we are able to define a communication schedule that minimizes the overall execution time. Once again, we can compare different scheduling policies (heuristics), which are chosen according to their estimated termination time. The framework indeed allows the implementation of scheduling heuristics that act on different communication levels, be it at the inter-cluster level (mostly appropriate to collective operations like broadcast [2] and reduce [13]) or at the node-to-node level (for operations such as the all-to-all [4]).
4 Platform Partition
We propose a method to automatically discover the network topology, allowing the construction of optimized multilevel collective operations. We prefer automatic topology discovery to a predefined topology because, if there are hidden heterogeneities inside a cluster, they may interfere with the communication and induce a non-negligible imprecision in the models. The automatic discovery we propose is done in two phases: the first phase collects reachability data from different networks. The second phase, executed at the application start-up, subdivides the networks into homogeneous logical clusters and finally acquires pLogP parameters to model collective communications.
Several specialized tools can be used to gather connectivity information through network monitoring. These tools may acquire data from direct probing, like NWS [14], from SNMP queries to network equipment, like REMOS [15], or even combine both approaches, like TopoMon [16]. NWS seems to be the best candidate for our needs: as a de facto standard in the grid community, [...] throughput, CPU load and available memory. For instance, we may identify groups of machines with similar communication characteristics using latency and throughput data obtained from NWS.
4.1 Clustering
One reason to construct logical clusters is that even machines in the same network may behave differently, in spite of their physical location. Indeed, such differences introduce undesirable heterogeneities that may invalidate the performance models used to optimize collective communications. For this reason, we are interested in grouping machines with similar performance into "logical clusters" to reduce the scheduling complexity.
Clustering may be performed according to different approaches. The best-known approach tries to define a spanning tree such that each node connects to the closest node in the network. This approach can be implemented through agglomerative construction of the spanning tree from a given parameter, or by pruning the full interconnection graph [17]. Another approach consists in defining a "closeness" parameter ρ, which indicates the maximum variance among nodes in the same group. In the specific case of our work, the latter technique seems to be the most appropriate, as at this point we are simply interested in the definition of homogeneous clusters.
Therefore, we may consider a weighted digraph $dG(V, E)$ of order $n$ with $V = \{p_0, \ldots, p_{n-1}\}$ to represent our network. In this digraph, the vertices represent the process nodes and the edges represent the links between pairs of nodes. An integer $w_{i,j}$ is associated with each edge $E_{i,j}$, representing the distance between nodes $p_i$ and $p_j$ (communication latency, for example), and we define ρ as the [...] digraph corresponds to the distance matrix $M$ defined by:

$$ M_{i,j} = \begin{cases} w_{i,j} & \text{if there is a local link between } i \text{ and } j \\ 0 & \text{otherwise} \end{cases} \tag{1} $$
For instance, a trivial algorithm to solve this problem initially sorts the outgoing edges from each node in increasing order of their weights. Starting from the smallest weighted edge $w_{x,y}$, we define an initial group $\{x, y\}$. At each step we select a candidate node $a$ and compare its distance to every node within a group $S$. If the distance does not vary by more than ρ, node $a$ can be included in group $S$. Otherwise, if node $a$ does not fit into any existing group, it becomes the first node of a new group $S'$. The algorithm terminates after all outgoing edges have been evaluated. This algorithm can be defined by the expression:
$$ \forall x, \forall y \in S,\ x \neq y,\ a \in S \Rightarrow |w(a, x) - w(x, y)| \leq \rho \tag{2} $$

Because we need to compare node $a$ to each node from group $S$, this algorithm executes in $O(N^2)$ steps. To reduce this cost, Lowekamp [18] presented a greedy algorithm, which was implemented within the ECO library and is also adopted in our work. More specifically, Lowekamp's algorithm compares a candidate node $a$ with the smallest edge $w_{min}$ within a group $S$. This algorithm, which requires only $O(N)$ steps, corresponds to the following expression:

$$ \forall x, \forall y \in S,\ x \neq y,\ a \in S \Rightarrow |w(a, x) - w_{min}(S)| \leq \rho \tag{3} $$

Although the distance between two nodes can be expressed with the help of different parameters (latency, bandwidth, hops, etc.), we considered latency as [...] Indeed, latency has proved to be sufficiently accurate to distinguish nodes connected to different switches in a local network. Further, latency can be easily measured in a wide area network without disturbing the ongoing traffic, contrary to a bandwidth measurement.
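The grouping criterion of Eq. (3) can be illustrated with a short, unoptimized sketch. It is not Lowekamp's ECO implementation and does not reach the O(N) bound quoted above: it simply seeds each group with the smallest remaining local edge and admits a candidate when all its distances to the group stay within ρ of that smallest edge. The function name `cluster_nodes`, the convention that a zero entry means "no local link" (following Eq. (1)), and the latency values are assumptions made for the example.

```python
def cluster_nodes(w, rho):
    """Group nodes of a symmetric distance matrix w into logical clusters.

    w[i][j] > 0 is the distance (e.g. latency) of a local link between
    nodes i and j; 0 means there is no local link, as in Eq. (1).
    """
    unassigned = set(range(len(w)))
    groups = []
    while unassigned:
        nodes = sorted(unassigned)
        local_edges = [(w[i][j], i, j)
                       for i in nodes for j in nodes
                       if i < j and w[i][j] > 0]
        if not local_edges:
            # No local link left: every remaining node is its own group.
            groups.extend([a] for a in nodes)
            break
        # Seed a new group with the smallest remaining edge (w_min).
        wmin, x, y = min(local_edges)
        group = [x, y]
        for a in nodes:
            if a in group:
                continue
            # Condition of Eq. (3), checked against every current member.
            if all(w[a][m] > 0 and abs(w[a][m] - wmin) <= rho
                   for m in group):
                group.append(a)
        groups.append(sorted(group))
        unassigned -= set(group)
    return groups


# Hypothetical latencies in microseconds: nodes 0-2 share one switch,
# nodes 3-4 share another, and there is no local link between the two sets.
latency = [
    [0, 50, 52, 0, 0],
    [50, 0, 55, 0, 0],
    [52, 55, 0, 0, 0],
    [0, 0, 0, 0, 100],
    [0, 0, 0, 100, 0],
]
print(cluster_nodes(latency, rho=10))
```

With a smaller tolerance such as `rho=1`, node 2 no longer fits the first group and ends up alone, which shows how ρ controls the granularity of the logical clusters.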
In addition, the topology discovery process may be detached from the application, minimizing the overhead on the application performance. Indeed, the most expensive part of the process consists in contacting every other node to compose a distance matrix, while the clustering part is quite simple. An offline topology discovery is recommended for such applications, following the principles used by MagPIe [2], which reads the topology description from a file. A daemon process may conduct regular updates of the description file, inducing almost no overhead on the application.
4.2 Efficient Acquisition of pLogP Parameters
Once the logical cluster organization of our grid is identified, we must acquire other network parameters such as the bandwidth (or the gap, for the pLogP model). Fortunately, there is no need to execute $n(n-1)$ pLogP measurements, one for each possible interconnection. Using the topology information we can obtain the pLogP parameters efficiently by considering a single process to represent each cluster. As one single measurement may represent the entire subnet, the total number of pLogP measurements is considerably reduced. If we add the measurements needed to obtain the parameters of the inter-cluster connections, we execute at most $C \times (C-1) + C$ experiments, where $C$ is the number of clusters. Further, if we consider symmetric links, only half of the probes are needed, minimizing the interference on the network.
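To give an order of magnitude for this reduction, the small sketch below compares the two probe counts. The function names and the example platform (88 nodes in 6 logical clusters) are hypothetical, and halving only the inter-cluster probes for symmetric links is one possible reading of the statement above.

```python
def full_probe_count(n):
    """One pLogP measurement per ordered pair of nodes: n(n - 1)."""
    return n * (n - 1)


def clustered_probe_count(c, symmetric=False):
    """One representative per cluster: C(C - 1) inter-cluster probes
    plus C intra-cluster probes.

    Assumption of this sketch: with symmetric links, only the
    inter-cluster probes are halved.
    """
    inter = c * (c - 1)
    if symmetric:
        inter //= 2
    return inter + c


# Hypothetical platform: 88 nodes grouped into 6 logical clusters.
print(full_probe_count(88))                      # all node pairs
print(clustered_probe_count(6))                  # one representative per cluster
print(clustered_probe_count(6, symmetric=True))  # symmetric links
```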
5.1 Intra-cluster Communication Strategy Selection
With Broadcast, a single process, called root, sends the same message of size m to all other (P − 1) processes. Classical implementations of the Broadcast operation rely on d-ary trees characterized by two parameters, d and h, where d is the maximum number of successors a node can have, and h is the height of the tree, i.e., the longest path from the root to any of the tree leaves. In particular, most MPI implementations rely on the Binomial Tree broadcast, an algorithm that is optimal on homogeneous networks if we assume that messages cannot be segmented.
Barnett et al. [19] demonstrate, however, that better performance can be obtained if we compose a pipeline among the processes. This strategy benefits from message segmentation, as recent works indicate [3][20]. In a Segmented Chain Broadcast, the transmission of a segment k overlaps with the reception of segment k+1, reducing the overall time.
To fully benefit from the pipeline effect, the segment size must be chosen according to the network environment. Indeed, messages that are too small pay more for their headers than for their content, while messages that are too large do not exploit the pipeline enough. Therefore, an efficient method to identify an adequate segment size s consists in searching through all values of s where $s = m/2^i$, $i \in [0 \ldots \log_2 m]$, such that s minimizes the predicted execution time of the communication operation. To refine the search, we can also apply heuristics like local hill-climbing, as proposed by Kielmann et al. [3].
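The exhaustive search over s = m/2^i can be sketched as follows for the Segmented Chain (Pipeline) cost model, (P−1)×(g(s)+L) + g(s)×(k−1). The affine gap function (a fixed per-message overhead plus size divided by bandwidth) and all numeric values are assumptions chosen for illustration; in the framework, g would come from the measured pLogP parameters.

```python
import math


def segmented_chain_cost(P, L, g, m, s):
    """Predicted time of a pipelined chain broadcast with segment size s."""
    k = math.ceil(m / s)  # number of segments
    return (P - 1) * (g(s) + L) + g(s) * (k - 1)


def best_segment_size(P, L, g, m):
    """Try s = m / 2**i for i in [0 .. log2(m)] and keep the cheapest."""
    candidates = [m // 2 ** i for i in range(int(math.log2(m)) + 1)]
    return min(candidates,
               key=lambda s: segmented_chain_cost(P, L, g, m, s))


def example_gap(size):
    """Hypothetical gap: 10 microseconds overhead plus 100 MB/s bandwidth."""
    return 1e-5 + size / 1e8


P, L, m = 16, 5e-5, 2 ** 20  # 16 nodes, 50 microseconds latency, 1 MiB
s = best_segment_size(P, L, example_gap, m)
print(s, segmented_chain_cost(P, L, example_gap, m, s))
```

Because only about log2(m) candidates are evaluated, the search is cheap enough to run for every cluster; a hill-climbing refinement around the best candidate is left out of this sketch.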
[...] techniques, which are presented in Table 1. From these models, we can easily determine the broadcast algorithm that performs best on each cluster. Indeed, using the pLogP parameters obtained during the topology discovery phase, we can predict the broadcast execution time with good accuracy and select the fastest algorithm for each cluster, as we presented in [21].
Table 1
Some communication models for the Broadcast operation

Algorithm                     Communication Cost
----------------------------  ------------------------------------------------------
Flat Tree                     L + (P−1) × g(m)
Segmented Flat Tree           L + (P−1) × (g(s) × k)
Chain                         (P−1) × (g(m) + L)
Segmented Chain (Pipeline)    (P−1) × (g(s) + L) + (g(s) × (k−1))
Binary Tree                   ≤ ⌈log2 P⌉ × (2 × g(m) + L)
Binomial Tree                 ⌈log2 P⌉ × L + ⌊log2 P⌋ × g(m)
Segmented Binomial Tree       ⌈log2 P⌉ × L + ⌊log2 P⌋ × g(s) × k
k-chain [22] with a degree d  (d+⌈P−(2(2d+1)d+1)⌉) × (g(s) + L) + (g(s) × (k−1))
Scatter/Collection [23]       (log2 P + P − 1) × L + 2 × ((p−1)/p) × g(m)
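The selection step can be illustrated by evaluating a few of the Table 1 models and keeping the cheapest one. The sketch below covers only four rows (Flat Tree, Chain, Segmented Chain and Binomial Tree); the function name `select_broadcast`, the affine gap function and the numeric parameters are hypothetical and stand in for values that would be measured on each cluster.

```python
import math


def select_broadcast(P, L, g, m, s):
    """Return (best algorithm name, predicted costs) for a broadcast of
    m bytes among P processes, using segment size s where applicable."""
    k = math.ceil(m / s)  # number of segments
    log_p = math.log2(P)
    costs = {
        "Flat Tree": L + (P - 1) * g(m),
        "Chain": (P - 1) * (g(m) + L),
        "Segmented Chain (Pipeline)":
            (P - 1) * (g(s) + L) + g(s) * (k - 1),
        "Binomial Tree":
            math.ceil(log_p) * L + math.floor(log_p) * g(m),
    }
    best = min(costs, key=costs.get)
    return best, costs


def example_gap(size):
    """Hypothetical gap: 10 microseconds overhead plus 100 MB/s bandwidth."""
    return 1e-5 + size / 1e8


# Large message: the pipeline is expected to win under these assumptions.
print(select_broadcast(16, 5e-5, example_gap, 2 ** 20, 8192)[0])
# Small message, no useful segmentation (s = m).
print(select_broadcast(16, 5e-5, example_gap, 64, 64)[0])
```

The two calls show why a single fixed algorithm is not enough: with the same cluster parameters, the preferred strategy changes with the message size.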
5.2 Grid-aware Communication Scheduling
The literature presents several works that aim to optimize collective communications in heterogeneous environments. While some works just focus on the search for the best broadcast tree of a network [17], most authors, such as Banikazemi [24], Bhat [4], Liu [5], Park [25], Mateescu [26] and Vorakosit [27], try to generate optimal broadcast trees according to a given root process.
Unfortunately, most of these works were designed for small-scale systems. One of the first works on collective communication for grid systems was the ECO library proposed by Lowekamp [18], where machines are grouped according to [...] [2], where processes are hierarchically organized in two levels with the objective of minimizing the exchange of wide-area messages.
A common characteristic of these two implementations is that only inter-cluster communications are optimized. Hence, to improve communication performance, we must also improve inter-cluster communications. One of the first works to address this problem was presented by Karonis [1], who defined a multilevel hierarchy that allows communication overlapping between different levels. While this structure on multiple levels allows a performance improvement, it relies on flat trees to disseminate messages between two wide-area levels, the same strategy as ECO or MagPIe. It is important to note that a flat tree is far from being optimal on heterogeneous systems. Because the exhaustive search for the optimal tree is expensive, we decided to employ different optimization heuristics. In this work, we explore a different approach to improve communication efficiency.
We consider that wide-area latency is no longer the single parameter that may contribute to the broadcast time. Indeed, the communication cost inside a cluster may represent an important factor in the overall completion time. For example, let us consider two clusters from Grid'5000, one located at Orsay and the other at Grenoble (approximately 700 km from each other). The transmission of 1 MB between these clusters over a private backbone of 1 Gbit/s needs 350 milliseconds. At the same time, a binomial-tree broadcast among 50 nodes interconnected by a Gigabit Ethernet network requires almost 600 milliseconds for the same message size. Ignoring the intra-cluster time may lead to inefficient communication schedules if the clusters are not well balanced.
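A toy calculation illustrates why the intra-cluster time matters for the schedule. In this sketch a root sends the message to one coordinator per cluster, one after the other (1-port model), and each cluster finishes after its own internal broadcast. The 350 ms transfer and the 600 ms internal broadcast echo the figures quoted above; the second cluster with a 50 ms internal broadcast, the cluster names and the `makespan` helper are hypothetical. This is an illustration of the argument, not one of the scheduling heuristics of the framework.

```python
def makespan(order, transfer, intra):
    """Completion time (ms) when the root serves clusters in 'order'.

    transfer[c]: time the root is busy sending to cluster c's coordinator.
    intra[c]:    predicted broadcast time inside cluster c.
    """
    clock = 0
    finish = 0
    for c in order:
        clock += transfer[c]                    # root sends sequentially
        finish = max(finish, clock + intra[c])  # cluster c completes
    return finish


transfer = {"big": 350, "small": 350}  # wide-area transfer times (ms)
intra = {"big": 600, "small": 50}      # internal broadcast times (ms)

print(makespan(["small", "big"], transfer, intra))  # slow cluster served last
print(makespan(["big", "small"], transfer, intra))  # slow cluster served first
```

Both orders look equivalent if only the wide-area transfer times are considered, yet serving the slower cluster first shortens the overall completion time under these assumptions.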