A framework for adaptive collective communications for heterogeneous hierarchical computing systems


HAL Id: hal-00261436

https://hal.archives-ouvertes.fr/hal-00261436

Submitted on 7 Mar 2008


Luiz Angelo Steffenel, Grégory Mounié

To cite this version:

Luiz Angelo Steffenel, Grégory Mounié. A framework for adaptive collective communications for heterogeneous hierarchical computing systems. Journal of Computer and System Sciences, Elsevier, 2008, 74 (6), pp.1082-1093. hal-00261436


Luiz Angelo Steffenel¹, Grégory Mounié²

¹ Université Nancy 2/LORIA, Nancy, France

² Laboratoire ID-IMAG, Grenoble, France

Abstract

Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represents a great challenge for the development of efficient high performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics since it considers transient network characteristics for dividing the nodes into clusters. Our experiments with the broadcast operation on a real-grid setup indicate that an adaptive framework allows significant performance improvements on MPI collective communications.

Key words: Grid computing; Performance modeling; Adaptive techniques; Polyalgorithms; Collective communication; MPI

1 Introduction

In the last years, there was a huge development in the field of parallel and distributed processing, especially at the architectural level, leading to a wide variety of execution supports. The major innovation was the phenomenal spread of architectures like clusters and grids. These platforms represent a reasonable alternative to traditional parallel machines and have become the most cost-effective computing supports for solving a large range of high performance computing applications, due to the good cost/performance ratio that they provide. However, the introduction of such parallel systems has a major impact on the design of efficient parallel algorithms. Indeed, new characteristics have to be taken into account, including scalability and portability. Moreover, such parallel systems are often upgraded with new generations of processors and network technologies. For instance, adaptability becomes crucial because of the frequent changes of the system hardware. These different elements require revising the classical parallel algorithms, which consider only regular architectures with static configurations, and proposing new approaches.

Our objective in this work is to propose a generic framework based on communication models and scheduling techniques to deal with communication scheduling in heterogeneous environments such as computational grids. More precisely, this paper proposes a communication schedule methodology with two adaptation levels. At the first level we proceed at the intra-cluster level, by determining the most efficient communication algorithm from a set of well known algorithms from the literature. At a second level, our framework de[...]

Email addresses: Luiz-Angelo.Steffenel@univ-nancy2.fr (Luiz Angelo Steffenel), Gregory.Mounie@imag.fr (Grégory Mounié).

[...] execution time of a collective communication. Therefore, our framework differs significantly from other works, as existing adaptive approaches presented in the literature [1,2,3] proceed by simply scheduling communications at the inter-cluster level, i.e., long-distance links. On the other side, works like [4,5,6] only try to minimize the execution time of collective communication operations in the context of intra-cluster environments. To the best of our knowledge, our framework provides the first general methodology to automatically associate efficient intra-cluster algorithms with inter-cluster communication heuristics, reducing the overall execution time of a collective communication.

The remainder of the paper is organized as follows. We begin in Section 2 by describing our assumptions for the communication environment. In Section 3 we first define the concept of polyalgorithm, presenting our framework for adaptive communications and detailing its components. Section 4 describes the platform partitioning phase, where we organize the grid into homogeneous logical clusters. Hence, in Section 5 we present a case study where we apply the second part of our framework for the development of a grid-aware MPI_Bcast communication operation. To validate the framework contributions, we conduct both practical experiments on a grid environment (Section 6) and numerical simulations (Section 7). These results concern both the evaluation of the optimization overhead and the scalability of the algorithms, proving the interest of this work. Finally, Section 8 concludes the paper and discusses some perspectives to extend this work.

Heterogeneity Model: We assume a generic platform composed by heterogeneous clusters as described in [7]. The platform studied enjoys heterogeneity along three orthogonal axes: (i) the processors that populate the clusters may differ in computational powers, even within the same cluster; (ii) the clusters are organized hierarchically and are interconnected via a hierarchy of networks of possibly differing latencies and bandwidths. At the level of physical clusters, the interconnection networks are assumed to be heterogeneous; (iii) the clusters at each level of the hierarchy may differ in sizes.

Communication Model: We assume that the network is fully connected. The links between pairs of processes are bidirectional, and each process can transmit data on at most one link and receive data on at most one link at any given time. This model is well-known in the literature as 1-port full-duplex.

Transmission Model: The literature contains several parallel communication models [8,9,10,11,12,3]. These models differ on the computational and network assumptions, such as latency, heterogeneity, network contention, etc. In this work we adopted the parameterized LogP model (pLogP) [3]. Our choice of the pLogP model comes from the fact that we can experience different transmission rates according to the message size, as a consequence of transport protocols and hardware policies. Hence, all along this paper we shall use L as the communication latency between two nodes, P as the number of nodes and g(m) for the gap of a message of size m. The gap of a message m represents the time required to transmit a message through the network (excluding the latency), which is inversely proportional to the bandwidth of the link. In the case of message segmentation, the segment size s of the message m is a [...] initial message m into k segments.
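To make these parameters concrete, the following minimal sketch models a size-dependent gap g(m) by interpolating between measured samples and derives the number of segments k. It is an illustration only, not the authors' implementation: the names `make_gap`, `point_to_point` and `segment_count`, the linear interpolation between samples, the cost L + g(m) for a single message and the rule k = ⌈m/s⌉ are assumptions chosen for the example.

```python
import math
from bisect import bisect_left

def make_gap(samples):
    """Return g(m), linearly interpolated over measured (size, gap) samples."""
    points = sorted(samples)
    sizes = [size for size, _ in points]

    def g(m):
        if m <= sizes[0]:
            return points[0][1]          # clamp below the smallest sample
        if m >= sizes[-1]:
            # Assumption: beyond the largest sample the gap grows linearly with m.
            return points[-1][1] * m / sizes[-1]
        i = bisect_left(sizes, m)        # sizes[i-1] < m <= sizes[i]
        (s0, g0), (s1, g1) = points[i - 1], points[i]
        return g0 + (g1 - g0) * (m - s0) / (s1 - s0)

    return g

def point_to_point(L, g, m):
    """Assumed time to deliver one message of size m: latency plus gap."""
    return L + g(m)

def segment_count(m, s):
    """Number of segments k when a message of size m is cut into pieces of size s."""
    return math.ceil(m / s)
```

Storing g as a table of samples rather than a single bandwidth figure is what lets the model reflect different transmission rates for different message sizes.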

3 An Adaptive Framework for Grid-Aware Communications

In this section, we describe our framework for adaptive communication scheduling in an execution environment characterized by its heterogeneity and its hierarchical organization. We consider a grid environment composed by different clusters $C_1$ to $C_n$ with respectively $n_1, n_2, \ldots, n_n$ nodes. A wide-area network, called a backbone, interconnects these clusters. We assume that a cluster uses the same network card to communicate with one of its nodes or with a node of another cluster, although each cluster may use different network technologies (Fast Ethernet, Gigabit Ethernet, Myrinet, etc.). Based on that topology, inter-cluster communications are never faster than communications within a cluster.

Most MPI libraries (LAM-MPI, OpenMPI, MPICH2, etc.) implement collective communications assuming that all the nodes are on the same cluster, which means that all communications have the same weight. However, in our case, some messages are transferred within a cluster (from a node of $C_1$ to a node of $C_1$, for example), or between two clusters. In the first case, bandwidth and latency are better than in the second case. Therefore, we need to associate different tools to model the overall performance. We assume that communication performances can be predicted based on communication cost models (for instance, the pLogP model [3]) and benchmarks on the real system.

An overview of the framework is sketched in Figure 1. Since the target system may experience heterogeneity at different levels (computing performance, net[...] high performance computing. One way to circumvent this problem is to subdivide the network in homogeneous subnets (or logical clusters), handling each cluster individually to subsequently aggregate them at the grid level. Therefore, the framework is separated in two successive phases. During the first one, we aim to partition the execution platform into subnets with homogeneous characteristics. Then, when executing the second phase, we determine for each subnet (i.e., for each cluster) the communication algorithm that performs better in that cluster. Indeed, using pLogP, we are able to predict the communication performance on each different cluster, allowing us to compare different communication algorithms. In the same way, pLogP is used to define efficient wide-area communication schedules adapted to a heterogeneous grid environment.

[Figure 1: block diagram. Blocks: target execution platform, dynamic monitoring tool, parameters of the target environment, platform partitioning, clustering, adaptive approach, communication models, performance predictions, communication scheduling.]

Figure 1. Conceptual framework of the adaptive mechanism

Once the platform is partitioned in separated homogeneous hierarchical clusters we determine, for each cluster, an algorithm which performs better in that network environment. Actually, we compare the expected performance of different algorithms from the literature (each algorithm being previously mod[...] characteristics and the number of nodes.

Through the analysis of the inter-cluster and intra-cluster performance predictions we are able to define a communication schedule that minimizes the overall execution time. Once again we can compare different schedule policies (heuristics), which are chosen according to their estimated termination time. The framework allows, indeed, to implement scheduling heuristics that act on different communication levels, be it at inter-cluster level (mostly appropriate to collective operations like broadcast [2] and reduce [13]) or at node-to-node level (for operations such as the all-to-all [4]).

4 Platform Partition

We propose a method to automatically discover the network topology, allowing the construction of optimized multilevel collective operations. We prefer automatic topology discovery instead of a predefined topology because if there are hidden heterogeneities inside a cluster, they may interfere with the communication and induce a non negligible imprecision in the models. The automatic discovery we propose should be done in two phases: the first phase collects reachability data from different networks. The second phase, executed at the application start-up, subdivides the networks in homogeneous logical clusters and finally acquires pLogP parameters to model collective communications.

Several specialized tools can be used to gather connectivity information through network monitoring. These tools may acquire data from direct probing, like NWS [14], from SNMP queries to network equipment, like REMOS [15], or even combine both approaches, like TopoMon [16]. NWS seems to be the best candidate to our needs: as a de facto standard in the grid community, [...] throughput, CPU load and available memory. For instance, we may identify groups of machines with similar communication characteristics using latency and throughput data obtained from NWS.

4.1 Clustering

One reason to construct logical clusters is that even machines in the same network may behave differently, in spite of their physical location. Indeed, such differences introduce undesirable heterogeneities that may invalidate the performance models used to optimize collective communications. For instance, we are interested in grouping machines with similar performances into "logical clusters" to reduce the scheduling complexity.

Clustering may be performed according to different approaches. The best-known approach tries to define a spanning tree such that each node connects to the closest node in the network. This approach can be implemented through agglomerative construction of the spanning tree from a given parameter, but can also be implemented by pruning the full interconnection graph [17]. Another approach consists in defining a "closeness" parameter $\rho$, which indicates the maximum variance among nodes in the same group. In the specific case of our work, the last technique seems to be the most appropriate, as at this point we are simply interested in the definition of homogeneous clusters.

Therefore, we may consider a weighted digraph $dG(V, E)$ of order $n$ with $V = \{p_0, \ldots, p_{n-1}\}$ to represent our network. In this digraph, the vertices represent the process nodes and the edges represent the link between two nodes. An integer $w_{i,j}$ is associated with each edge $E_{i,j}$, representing the distance between nodes $p_i$ and $p_j$ (communication latency, for example), and we define $\rho$ as the [...] digraph corresponds to the distance matrix $M$ defined by:

$$M_{i,j} = \begin{cases} w_{i,j} & \text{if there is a local link between } \{i, j\} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

For instance, a trivial algorithm to solve this problem initially sorts the outgoing edges from each node in increasing order of their weights. By proceeding from the smallest weighted edge $w_{x,y}$, we define an initial group $\{x, y\}$. At each step we select a candidate node $a$ and compare its distance to any node within a group $S$. If the distance does not vary more than $\rho$, node $a$ can be included in group $S$. Otherwise, if node $a$ does not fit into any existent group, it becomes the first node of a new group $S'$. The algorithm terminates after all outgoing edges have been evaluated. Indeed, this algorithm can be defined by the expression:

$$\forall x, \forall y \in S,\ x \neq y,\ a \in S \Rightarrow |w(a, x) - w(x, y)| \leq \rho \tag{2}$$
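The grouping rule of expression (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the symmetric distance matrix `w` (a list of lists), the order in which candidates are visited, and the rule for one-node groups (compare w(a, x) with ρ directly, since such a group has no internal edge yet) are all hypothetical choices.

```python
def fits_pairwise(a, group, w, rho):
    """Expression (2): |w(a, x) - w(x, y)| <= rho for all distinct x, y in the group."""
    if len(group) == 1:
        # Assumption: a one-node group has no internal edge, so test w(a, x) alone.
        return w[a][group[0]] <= rho
    return all(abs(w[a][x] - w[x][y]) <= rho
               for x in group for y in group if x != y)

def cluster_pairwise(w, rho):
    """Place each node in the first group it fits, or open a new group."""
    n = len(w)
    # Visit nodes by increasing weight of their smallest outgoing edge.
    order = sorted(range(n),
                   key=lambda i: min(w[i][j] for j in range(n) if j != i))
    groups = []
    for a in order:
        for group in groups:
            if fits_pairwise(a, group, w, rho):
                group.append(a)
                break
        else:
            groups.append([a])       # a starts a new group S'
    return groups
```

With two sites whose internal latency is 1 and whose cross-site latency is 50 (arbitrary units), a tolerance of ρ = 2 separates the four nodes into two groups of two.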

Because we need to compare node $a$ to each node from group $S$, this algorithm executes in $O(N^2)$ steps. Therefore, Lowekamp [18] presented a greedy algorithm, which was implemented within the ECO library and is also adopted in our work. More specifically, Lowekamp's algorithm compares a candidate node $a$ with the smallest edge $w_{min}$ within a group $S$. This algorithm, which requires only $O(N)$ steps, corresponds to the following expression:

$$\forall x, \forall y \in S,\ x \neq y,\ a \in S \Rightarrow |w(a, x) - w_{min}(S)| \leq \rho \tag{3}$$
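A possible reading of expression (3) is sketched here: each group caches its smallest internal edge, so a candidate is tested against one reference value per member instead of every pair of members. This is only a sketch in the spirit of the greedy approach, not the ECO library code; the function name, the data layout, the visiting order and the use of 0 as the reference for one-node groups are assumptions.

```python
def cluster_greedy(w, rho):
    """Group nodes with expression (3), keeping w_min(S) cached per group."""
    n = len(w)
    # Visit nodes by increasing weight of their smallest outgoing edge.
    order = sorted(range(n),
                   key=lambda i: min(w[i][j] for j in range(n) if j != i))
    groups = []                       # list of [members, w_min] pairs
    for a in order:
        for group in groups:
            members, w_min = group
            # Assumption: a one-node group has no internal edge, so use 0 as reference.
            ref = 0 if w_min is None else w_min
            if all(abs(w[a][x] - ref) <= rho for x in members):
                closest = min(w[a][x] for x in members)
                group[1] = closest if w_min is None else min(w_min, closest)
                members.append(a)
                break
        else:
            groups.append([[a], None])   # a starts a new group
    return [members for members, _ in groups]
```

The cached minimum is updated only when a node joins, so no pass over the internal edges of a group is ever needed.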

Although the distance between two nodes can be expressed with the help of different parameters (latency, bandwidth, hops, etc.), we considered latency as [...] Indeed, latency has proved to be sufficiently accurate to distinguish nodes connected to different switches in a local network. Further, latency can be easily measured in a wide area network without disturbing the ongoing traffic, contrarily to a bandwidth measurement.

In addition, the topology discovery process may be detached from the application, minimizing the overhead in the application performance. Indeed, the most expensive part of the process consists in contacting each other node to compose a distance matrix, while the clustering part is quite simple. An offline topology discovery is recommended for such applications, following the principles used by MagPIe [2], which reads the topology description from a file. A daemon process may conduct regular updates on the description file, inducing almost no overhead to the application.

4.2 Efficient Acquisition of pLogP Parameters

Once the logical cluster organization of our grid is identified, we must acquire other network parameters such as the bandwidth (or the gap, for the pLogP model). Fortunately, there is no need to execute $n(n-1)$ pLogP measures, one for each possible interconnection. Using the topology information we can get pLogP parameters in an efficient way by considering a single process to represent each cluster. As one single measure may represent the entire subnet, the total number of pLogP measures is greatly reduced. If we sum up the measures to obtain the parameters for the inter-cluster connections, we shall execute at most $C \times (C-1) + C$ experiments, where $C$ is the number of clusters. Further, if we consider symmetric links, only half of the probes are needed, minimizing the interference on the network.
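The saving can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the example sizes (88 nodes, 6 logical clusters) are hypothetical, and it assumes that the halving for symmetric links applies to the inter-cluster probes while the C intra-cluster measures are kept.

```python
def full_mesh_measures(n):
    """One pLogP measure per ordered pair of nodes: n(n - 1)."""
    return n * (n - 1)

def clustered_measures(c, symmetric=False):
    """One representative per cluster: C(C - 1) inter-cluster probes plus C intra-cluster ones."""
    inter = c * (c - 1)
    if symmetric:
        inter //= 2      # assumption: each link is probed in one direction only
    return inter + c

# Hypothetical platform: 88 nodes grouped into 6 logical clusters.
print(full_mesh_measures(88))                 # 7656
print(clustered_measures(6))                  # 36
print(clustered_measures(6, symmetric=True))  # 21
```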

5.1 Intra-cluster Communication Strategy Selection

With Broadcast, a single process, called root, sends the same message of size m to all other (P − 1) processes. Classical implementations of the Broadcast operation rely on d-ary trees characterized by two parameters, d and h, where d is the maximum number of successors a node can have, and h is the height of the tree, the longest path from the root to any of the tree leaves. Therefore, most MPI implementations rely on the Binomial Tree broadcast, an algorithm that is optimal on homogeneous networks if we assume that messages cannot be segmented.

Barnett et al. [19] demonstrate, however, that better performances can be obtained if we compose a pipeline among the processes. This strategy benefits from message segmentation, as recent works indicate [3][20]. In a Segmented Chain Broadcast, the transmission of a segment k overlaps with the reception of segment k+1, reducing the overall time.

To fully benefit from the pipeline effect, the segment size must be chosen according to the network environment. Indeed, too small messages pay more for their headers than for their content, while too large messages do not exploit the pipeline enough. Therefore, an efficient method to identify an adequate segment size s consists in searching through all values of s where $s = m/2^i$, $i \in [0 \ldots \log_2 m]$, such that s minimizes the predicted completion time of the communication operation. To refine the search, we can also apply some heuristics like local hill-climbing, as proposed by Kielmann et al. [3].
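The search over $s = m/2^i$ can be sketched as an exhaustive scan of the candidate sizes. The example below is a minimal sketch, not the authors' code: it assumes the Segmented Chain cost listed in Table 1, k = ⌈m/s⌉, and a hypothetical gap function with a fixed per-message overhead plus a per-byte cost; all names and numeric values are illustrative.

```python
import math

def segmented_chain_time(P, L, g, m, s):
    """Segmented Chain cost from Table 1, with k = ceil(m / s) segments."""
    k = math.ceil(m / s)
    return (P - 1) * (g(s) + L) + g(s) * (k - 1)

def best_segment_size(P, L, g, m):
    """Scan s = m / 2**i for i in [0, log2(m)] and keep the cheapest prediction."""
    candidates = [m // (2 ** i) for i in range(int(math.log2(m)) + 1)]
    candidates = [s for s in candidates if s >= 1]
    return min(candidates, key=lambda s: segmented_chain_time(P, L, g, m, s))

# Hypothetical gap: 5 microseconds per message plus 1 byte per 10 ns (100 MB/s).
def example_gap(size):
    return 5e-6 + size / 1e8

best = best_segment_size(P=16, L=1e-4, g=example_gap, m=2 ** 20)
print(best)  # 8192
```

With these hypothetical values the scan settles on an intermediate size: very small segments are dominated by the per-message overhead, while a single large segment gains nothing from the pipeline.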

[...] techniques, which are presented in Table 1. From these models, we are able to easily determine the broadcast algorithm that best performs on each cluster. Indeed, using the pLogP parameters obtained during the topology discovery phase, we can predict the broadcast execution time with a good accuracy and select the fastest algorithm for each cluster, as we presented in [21].

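Table 1. Some communication models for the Broadcast operation

| Algorithm | Communication Cost |
|---|---|
| Flat Tree | $L + (P-1) \times g(m)$ |
| Segmented Flat Tree | $L + (P-1) \times (g(s) \times k)$ |
| Chain | $(P-1) \times (g(m) + L)$ |
| Segmented Chain (Pipeline) | $(P-1) \times (g(s) + L) + (g(s) \times (k-1))$ |
| Binary Tree | $\leq \lceil \log_2 P \rceil \times (2 \times g(m) + L)$ |
| Binomial Tree | $\lceil \log_2 P \rceil \times L + \lfloor \log_2 P \rfloor \times g(m)$ |
| Segmented Binomial Tree | $\lceil \log_2 P \rceil \times L + \lfloor \log_2 P \rfloor \times g(s) \times k$ |
| k-chain [22] with a degree d | (d + ⌈^P⁻₍₂⁽²d+1)^d⁺¹⁾⌉) × (g(s) + L) + (g(s) × (k−1)) |
| Scatter/Collection [23] | $(\log_2 P + P - 1) \times L + 2 \times \left(\frac{P-1}{P}\right) \times g(m)$ |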
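The selection step can be illustrated by evaluating the models of Table 1 for one cluster and keeping the smallest prediction. The code below is a hedged sketch rather than the framework itself: the function names are hypothetical, k is taken as ⌈m/s⌉, the Binary Tree entry uses its upper bound, and the k-chain model is left out.

```python
import math

def broadcast_costs(P, L, g, m, s):
    """Predicted broadcast times for P nodes, message size m and segment size s."""
    k = math.ceil(m / s)
    up = math.ceil(math.log2(P))
    down = math.floor(math.log2(P))
    return {
        "Flat Tree": L + (P - 1) * g(m),
        "Segmented Flat Tree": L + (P - 1) * (g(s) * k),
        "Chain": (P - 1) * (g(m) + L),
        "Segmented Chain": (P - 1) * (g(s) + L) + g(s) * (k - 1),
        "Binary Tree (upper bound)": up * (2 * g(m) + L),
        "Binomial Tree": up * L + down * g(m),
        "Segmented Binomial Tree": up * L + down * g(s) * k,
        "Scatter/Collection": (math.log2(P) + P - 1) * L
                              + 2 * ((P - 1) / P) * g(m),
    }

def fastest_broadcast(P, L, g, m, s):
    """Name of the algorithm with the smallest predicted time."""
    costs = broadcast_costs(P, L, g, m, s)
    return min(costs, key=costs.get)
```

Because every model takes the same (P, L, g, m, s) inputs, the comparison can be repeated per cluster with that cluster's own measured parameters.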

5.2 Grid-aware Communication Scheduling

The literature presents several works that aim to optimize collective communications in heterogeneous environments. While some works just focus on the search for the best broadcast tree of a network [17], most authors such as Banikazemi [24], Bhat [4], Liu [5], Park [25], Mateescu [26] and Vorakosit [27] try to generate optimal broadcast trees according to a given root process.

Unfortunately, most of these works were designed for small-scale systems. One of the first works on collective communication for grid systems was the ECO library proposed by Lowekamp [18], where machines are grouped according to [...] [2], where processes are hierarchically organized in two levels with the objective to minimize the exchange of wide-area messages.

A common characteristic of these two implementations is that only inter-cluster communications are optimized. Hence, to improve communication performances, we must also improve inter-cluster communications. One of the first works to address this problem was presented by Karonis [1], who defined a multilevel hierarchy that allows communication overlapping between different levels. While this structure on multiple levels allows a performance improvement, it relies on flat trees to disseminate messages between two wide area levels, the same strategy as ECO or MagPIe. It is important to note that a flat tree is far from being optimal on heterogeneous systems. Because the exhaustive search of the optimal tree is expensive, we decided to employ different optimization heuristics. For instance, in this work we explore a different approach to improve communication efficiency.

We consider that wide-area latency is no longer the single parameter that may contribute to the broadcast time. Indeed, the communication cost inside a cluster may represent an important factor to the overall completion time. For example, let us consider two clusters from Grid'5000, one located at Orsay and the other at Grenoble (approximately 700 km from each other). The transmission of 1 MB between these clusters with a private backbone of 1 Gbit/s needs 350 milliseconds. At the same time, a binomial-tree broadcast with 50 nodes interconnected by a Gigabit Ethernet network for the same message size requires almost 600 milliseconds. Ignoring the intra-cluster time may lead to inefficient communication schedules if the clusters are not well balanced.
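A small calculation shows why the intra-cluster time matters for the schedule. The sketch below is hypothetical: it assumes a root that serves remote clusters one after another (as in the 1-port model), reuses the 350 ms and 600 ms figures quoted above for a cluster labelled "slow", and adds an invented cluster "fast" whose local broadcast takes 50 ms; the root's own local broadcast is ignored.

```python
def flat_tree_makespan(order, wan_ms, intra_ms):
    """Completion time when the root sends to each cluster coordinator in turn."""
    clock = 0
    finish = 0
    for cluster in order:
        clock += wan_ms[cluster]                  # wide-area transfers are serialized
        finish = max(finish, clock + intra_ms[cluster])
    return finish

wan_ms = {"slow": 350, "fast": 350}      # same wide-area cost for both clusters
intra_ms = {"slow": 600, "fast": 50}     # "fast" is a hypothetical small cluster

print(flat_tree_makespan(["slow", "fast"], wan_ms, intra_ms))  # 950
print(flat_tree_makespan(["fast", "slow"], wan_ms, intra_ms))  # 1300
```

Both orders look identical to a scheduler that only considers wide-area costs, yet serving the cluster with the longer local broadcast first shortens the predicted completion time from 1300 ms to 950 ms in this example.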