S hed


M2 report / ENS-Lyon

Andreea Chis

Under the guidance of Frederic Desprez and Eddy Caron

July 19, 2007


Introduction

1.1 General Purpose of the Target Application

The world's climate is currently changing due to the increase of greenhouse gases in the atmosphere. Climate fluctuations are forecast for the years to come. A proper study of the incoming changes requires numerical simulations using general circulation models of a climate system (atmosphere, ocean, continental surfaces), run in forced mode or in coupled mode (i.e. allowing information exchanges between the components during the simulation).

Natural variability, seasonal forecasting and global warming characteristics are some examples of the use of numerical simulation and coupled models. The climatologists' purpose in this case is to launch parallel simulations (10 or more), where each independent simulation models the evolution of the present climate followed by the 21st century, each with a distinct physical parametrization of the atmospheric model. By comparing the independent simulations, they expect to better estimate the sensitivity of global warming to the model parametrization.

1.2 Application Description

One "scenario" simulation of the present climate followed by the 21st century, for a total of 150 years, combines 1800 simulations of one month each, launched one after the other, as seen in Figure 1.1. The results of the n-th monthly simulation are the starting point of the (n+1)-th monthly simulation. The whole experiment consists in launching several such scenario simulations (10 or more).

Figure 1.1: One scenario simulation and the whole experiment


One monthly simulation consists of a pre-processing phase, a one-month run of the climate model and a post-processing phase, as seen in Figure 1.2.

The pre-processing phase consists of 2 independent tasks:

• modify_parameters (mp) - updates the parametrization of each model according to the time coordinate position of the monthly run within the whole scenario experiment; its execution time is 1 second;
• concatenate_atmospheric_input_files (caif) - modifies the atmospheric model input files, merging the initial state and the limit conditions; its execution time is 1 second.

Given their small duration, we choose to include the pre-processing tasks in the main processing task of a monthly simulation.

The main computing task, process_coupled_run (pcr) in Figure 1.2, represents the climate model OCC17, which includes an atmospheric model (ARPEGE) [2], an ocean and sea-ice model (OPA) [3] and a runoff pathway model (TRIP) [4]. The coupler, OASIS [10], ensures their simultaneous running and synchronous information exchanges at component interfaces. The ARPEGE (Action de Recherche Petite Echelle Grande Echelle) code is fully parallel. OPA (Ocean Parallelise), TRIP (Total Runoff Integrating Pathways) and the OASIS coupler are sequential applications. The execution time of process_coupled_run depends on the number of processors allocated to the atmospheric model.

Figure 1.2: One month simulation.

The execution times of process_coupled_run for different numbers of processors are displayed in Table 1.1 (note that executing this task on more than 11 processors does not decrease the running time any further):

Table 1.1: Execution times for process_coupled_run

Number of processors    4     5     6     7     8     9     10    11
Time (s)                5400  3000  2200  1800  1620  1459  1354  1260

The post-processing phase consists of 3 tasks:

• convert_output_format (cof) - converts the climate model outputs to a standardized format; its execution time is 1 minute;
• extract_minimum_information (emf) - reduces the size of the data by computing global or regional means; its execution time is around 30 seconds;
• compress_diags (cd) - reduces the data size; its execution time is around 30 seconds.

For a "scenario" simulation, 1800 such monthly simulations are chained as seen in Figure 1.3.

Figure 1.3: Chain of 3 consecutive monthly simulations

The data exchanged between 2 consecutive monthly simulations belonging to the same "scenario" simulation amounts to 120 MB, while the rest of the data exchanges are small enough to be modeled as going through NFS.

1.3 Our Goal

Our goal regarding the climate forecasting application is to analyze it thoroughly in order to model its needs in terms of execution model, data access pattern and computing needs. Once a proper model of the application has been derived, appropriate scheduling heuristics can be proposed, tested and compared. The next step is to provide generic scheduling schemes for applications with similar dependence graphs.

The rest of the report is organized as follows: in chapter 2 we give a brief overview of the theoretical problems related to the current application and of their solutions; in chapter 3 we present scheduling heuristics specific to the climate forecasting application and then scheduling schemes for generic applications following the same model. In chapter 4 we present some experimental results of simulations, and finally we conclude in chapter 5.


Related Works

2.1 Multiple DAGs Scheduling

A Directed Acyclic Graph (DAG) consists of nodes and edges, where each node represents a task and the edges represent precedence constraints on tasks. Scheduling several applications structured as Directed Acyclic Graphs (DAGs) can be done in several ways, which are presented in [11]. A naive approach would be to schedule the DAGs strictly one after the other. This could leave resources idle, so a better approach is to schedule each DAG as soon as possible in order to fill the idle slots left by the scheduling of the previous DAGs. In this case, the order in which the DAGs are scheduled may influence the makespan.

The other possibility is to schedule the DAGs concurrently. One approach is to create a composite DAG by making the entry tasks of all the DAGs the successors of a unique new entry node and, similarly, by making a unique new exit node the successor of the exit nodes of all the DAGs, and then to schedule the resulting DAG. Another approach is to build the same composite DAG but to group the tasks into levels containing independent tasks, which are then scheduled using algorithms for independent tasks. A third approach would be to create the composite DAG as before but to schedule tasks from each DAG with a round-robin policy among DAGs. Finally, another possibility is to create the composite DAG by linking the exit nodes of the smaller DAGs to the task of a larger DAG whose longest path from its entry node is about the same as the longest path of the short DAG. The authors of [11] propose heuristics aiming to minimize the makespan of the DAGs and to maintain a degree of fairness among the DAGs when choosing the one from which the next task should be scheduled.
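The construction of a composite DAG with a unique entry node and a unique exit node can be illustrated with a minimal Python sketch. It assumes that each DAG is given as a dictionary mapping a task to the list of its successors; the function name build_composite_dag and the node names ENTRY and EXIT are hypothetical and are not taken from [11].

```python
def build_composite_dag(dags):
    """Merge several DAGs into one composite DAG.

    Assumption: each DAG is a dict {task: [successor tasks]} in which
    every task appears as a key. ENTRY and EXIT are hypothetical names
    for the unique new entry and exit nodes.
    """
    composite = {"ENTRY": [], "EXIT": []}
    for index, dag in enumerate(dags):
        # Tasks that appear as a successor have at least one predecessor.
        with_predecessor = {s for succs in dag.values() for s in succs}
        for task, succs in dag.items():
            node = (index, task)  # prefix with the DAG index to keep names unique
            composite[node] = [(index, s) for s in succs]
            if task not in with_predecessor:
                composite["ENTRY"].append(node)  # entry task of this DAG
            if not succs:
                composite[node].append("EXIT")   # exit task of this DAG
    return composite


if __name__ == "__main__":
    example = [{"a": ["b"], "b": []}, {"x": []}]
    for node, successors in build_composite_dag(example).items():
        print(node, "->", successors)
```

Any single-DAG scheduling algorithm can then be applied to the returned graph; the fairness-oriented heuristics of [11] are not reproduced here.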

2.2 Mixed Parallelism

Parallel scientific applications usually exhibit two types of parallelism: data parallelism and task parallelism. The first type occurs whenever the same operation is applied in parallel on different elements of a data set, while the second one appears in the form of concurrent computations running on different data sets. The combination of these two approaches yields the simultaneous exploitation of both types of parallelism, the so-called mixed parallelism, which offers better speedups compared to pure task parallelism or pure data parallelism.

Scheduling a DAG on a finite number of homogeneous resources is known to be NP-complete even for the simpler case of tasks that execute only on a single processor, and heuristics have been developed to tackle the problem. For the case of DAGs composed of data-parallel tasks and homogeneous platforms, several scheduling heuristics exist as well.


In [7], a 2-step approach has been proposed: first, convex programming is used to find the number of processors on which a data-parallel task should be executed, and then a list scheduling heuristic is used to effectively map the tasks onto processors. In [8], an approach for scheduling task graphs with a specific topology (series-parallel) is proposed. Series compositions of tasks are allocated the whole set of processors, while for parallel compositions, the processors are partitioned into disjoint sets on which the tasks are scheduled.

In [6], a 1-step algorithm (Critical Path Reduction - CPR) is proposed for scheduling DAGs with data-parallel tasks onto homogeneous platforms. The algorithm starts by allocating 1 processor to each data-parallel task (M-task) and computes the makespan of the current allocation using a list scheduling procedure. Iteratively, the algorithm increases the number of processors allocated to those tasks situated on the critical path for which the increase reduces the current makespan. The iterations stop when the makespan (recomputed at each step) does not improve for any attempt to increase the number of processors allocated to tasks situated on the critical path.

In [5], a 2-step algorithm is proposed (CPA - Critical Path and Area based Scheduling). In the first step, the number of processors allocated to each data-parallel task is determined with the goal of obtaining a compromise between the critical path (the longest path from an entry node to an exit node) and the processor utilization; in the second step, the tasks are scheduled on resources using a list scheduling procedure. The algorithm starts similarly to CPR, allocating 1 processor to each multi-processor task, and then iteratively increases the number of processors of the task situated on the critical path which would benefit the most from this increase. The iterations end when the critical path becomes smaller than the processor utilization.

2.3 Pipelined data parallel tasks

Computations consisting of a chain of data-parallel tasks that process successive data sets in a pipeline fashion are a particular case where mixed parallelism occurs.

For this type of application, the two key metrics to be optimized are the latency and the throughput. The latency is the time needed to process one data set, while the throughput is the rate at which data sets can be processed. The inverse of the throughput is the period, i.e. the time interval between the submission of two consecutive data sets to the pipeline. Minimizing the latency (which could be attained simply by mapping all the stages of the pipeline on the whole set of processors) conflicts with maximizing the throughput (which can be achieved through the simultaneous processing of different data sets, either in parallel or in a pipelined fashion).

In [9], the authors propose a dynamic programming solution for the problem of minimizing the latency under a throughput constraint, and a near-optimal solution for the problem of maximizing the throughput under a latency constraint on homogeneous platforms.

Several aspects must be kept in mind when mapping the tasks of a pipeline on the resources. Subchains of consecutive tasks in the pipeline can be clustered into modules (which could thus reduce communications and improve latency), and the resources can be split among the resulting modules. The resources available to a module can in turn be split into several groups that process alternate data sets (which corresponds to the replication of the module); this improves the throughput but degrades the latency.


Scheduling Heuristics

3.1 Ocean-Atmosphere Application Scheduling

We consider a homogeneous platform composed of R resources and assume that data transfers are made through NFS. Thus, the execution time of any task is assumed to include the time necessary to access the data, the time to redistribute it to processors (in case the task is a multi-processor task), the effective computing time and finally the time needed to store the data on the data storage. Given the short duration of the pre-processing tasks compared to the duration of the main processing tasks, we decided to group them all in a single task. The same decision was taken for the 3 post-processing tasks.

The purpose of this scheduling algorithm is to divide the resources of the platform into disjoint sets on which the multi-processor tasks execute, such that the overall makespan is minimal, under the assumption that all multi-processor tasks are executed on the same number of processors.

The following notations are introduced:

• $R$ - total number of processors;
• $R_1$ - number of processors (among the total R processors) allocated to the multi-processor tasks;
• $R_2$ - number of processors allocated to the post-processing tasks;
• $nb_{max}$ - maximum number of multi-processor tasks that can run simultaneously given the current choice for the number of processors to be allocated to a multi-processor task;
• $G$ - number of processors allocated to a single multi-processor task;
• $T_G$ - execution time of a multi-processor task on G processors;
• $NM$ - number of months in an independent simulation;
• $NS$ - number of independent simulations;
• $T_P$ - execution time of a post-processing task.

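For a certain grouping of the $R_1$ processors in groups of $G$ processors, if $R \,\mathrm{div}\, G < NS$ then $R_2 = R \bmod G$, $R_1 = R - R_2$ and $nb_{max} = R_1 / G$. In the opposite case, since the months of one simulation must run sequentially, at most NS multi-processor tasks can run at the same time, and we have: $R_1 = NS \times G$, $R_2 = R - R_1$ and $nb_{max} = NS$. In conclusion:

• $nb_{max} = \min\{NS, R \,\mathrm{div}\, G\}$;
• $R_1 = nb_{max} \times G$;
• $R_2 = R - R_1$.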
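The three relations above translate directly into a few lines of Python. This is a minimal sketch; the function name partition_resources is hypothetical.

```python
def partition_resources(R, G, NS):
    """Return (nb_max, R1, R2) when the R processors are split into groups of G.

    nb_max: number of multi-processor tasks that can run simultaneously,
    R1: processors used by the multi-processor tasks,
    R2: processors left for the post-processing tasks.
    """
    nb_max = min(NS, R // G)  # R div G, capped by the number of simulations
    R1 = nb_max * G
    R2 = R - R1
    return nb_max, R1, R2


if __name__ == "__main__":
    # 53 processors, groups of 7, 10 simulations
    print(partition_resources(53, 7, 10))
```

For R = 53, G = 7 and NS = 10 this gives 7 simultaneous tasks on 49 processors, with 4 processors left for post-processing.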

There are 2 cases to be considered: $R_2 = 0$ and $R_2 \neq 0$ respectively.

Case 1. $R_2 = 0$

In this case, the multi-processor tasks are executed first, followed by the post-processing tasks. The makespan of the multi-processor tasks is given by:

$$MS_{multiproc} = \left\lceil \frac{NS \times NM}{nb_{max}} \right\rceil \times T_G \tag{3.1.1}$$

If $(NS \times NM) \bmod nb_{max} = 0$, the total makespan is given by:

$$MS = \frac{NS \times NM}{nb_{max}} \times T_G + \left\lceil \frac{NS \times NM}{R} \right\rceil \times T_P \tag{3.1.2}$$

Dark grey rectangles in Figure 3.1 represent multi-processor tasks and light grey rectangles represent the corresponding post-processing tasks:

Figure 3.1: Makespan without processors allocated to the post-processing (the two phases are labeled $MS_{multiproc}$ and $MS_{post-proc}$).

If $(NS \times NM) \bmod nb_{max} \neq 0$, the number $T_{rem}$ of post-processing tasks which do not fit on the resources left unoccupied during the last set of multi-processor tasks ($R_{left} = R - ((NS \times NM) \bmod nb_{max}) \times G$), together with the $(NS \times NM) \bmod nb_{max}$ post-processing tasks corresponding to the last multi-processor tasks, is:

$$T_{rem} = (NS \times NM) \bmod nb_{max} + \max\left\{0,\; NS \times NM - ((NS \times NM) \bmod nb_{max}) - \left\lfloor \frac{T_G}{T_P} \right\rfloor \times R_{left}\right\} \tag{3.1.3}$$

The makespan in this situation is:

$$MS = \left\lceil \frac{NS \times NM}{nb_{max}} \right\rceil \times T_G + \left\lceil \frac{T_{rem}}{R} \right\rceil \times T_P \tag{3.1.4}$$

Case 2. $R_2 \neq 0$

In this case, the makespan of the multi-processor tasks is again given by:

$$MS_{multiproc} = \left\lceil \frac{NS \times NM}{nb_{max}} \right\rceil \times T_G \tag{3.1.5}$$

For a set of $nb_{max}$ multi-processor tasks, the execution time of the corresponding post-processing tasks is given by:

$$MS_{postproc\_phase} = \left\lceil \frac{nb_{max}}{R_2} \right\rceil \times T_P \tag{3.1.6}$$

Figure 3.2: Post-processing tasks overpassing case.

This time may be greater than the execution time of a multi-processor task, in which case the execution of the post-processing tasks will overpass (run into) the execution of the next set of multi-processor tasks, as seen in Figure 3.2.

The number of post-processing tasks that can be executed during the interval $T_G$ on the $R_2$ resources reserved for them is:

$$N_{possible} = \left\lfloor \frac{T_G}{T_P} \right\rfloor \times R_2 \tag{3.1.7}$$

This value must be tested against the $nb_{max}$ value (since there are $nb_{max}$ multi-processor tasks generating the same number of post-processing tasks) in order to determine whether the $R_2$ remaining resources are sufficient for the post-processing tasks. If they are sufficient, a part of the $R_2$ resources ($R_{unused}$) may remain unused during the whole process; if they are not, the post-processing tasks which do not fit on the resources are postponed to the end of the multi-processor tasks.

$$R_{unused} = R_2 - \left\lceil \frac{nb_{max}}{\left\lfloor T_G / T_P \right\rfloor} \right\rceil \tag{3.1.8}$$
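As a numerical check of formulas (3.1.7) and (3.1.8), the sketch below evaluates them for groups of 7 processors on a 53-processor platform. Two assumptions are made here and are not stated in the text: the inner bracket of (3.1.8) is read as a floor and the outer one as a ceiling, and the grouped post-processing task is taken to last $T_P$ = 120 s (the sum 60 + 30 + 30 of the three post-processing durations). The function name is hypothetical.

```python
def post_processing_capacity(T_G, T_P, R2, nb_max):
    """Evaluate (3.1.7) and (3.1.8); assumes T_G >= T_P so the floor is non-zero."""
    per_processor = T_G // T_P                    # floor(T_G / T_P)
    n_possible = per_processor * R2               # (3.1.7)
    needed = -(-nb_max // per_processor)          # ceil(nb_max / floor(T_G / T_P))
    r_unused = R2 - needed                        # (3.1.8)
    return n_possible, r_unused


if __name__ == "__main__":
    # T_G = 1800 s for G = 7 (Table 1.1); T_P = 120 s is an assumption.
    print(post_processing_capacity(T_G=1800, T_P=120, R2=4, nb_max=7))
```

Under these assumptions the 4 reserved processors could absorb 60 post-processing tasks per interval while only 7 are generated, so 3 of them stay unused.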

We denote by $n$ the total number of sets of simultaneous multi-processor jobs:

$$n = \left\lceil \frac{NS \times NM}{nb_{max}} \right\rceil \tag{3.1.9}$$

Again, two separate cases must be treated, namely $(NS \times NM) \bmod nb_{max} = 0$ and $(NS \times NM) \bmod nb_{max} \neq 0$.

In the case $(NS \times NM) \bmod nb_{max} = 0$, the number of tasks postponed to the end of the multi-processor tasks (in case such tasks exist) is:

$$N_{overpass} = \max\{0,\; (n - 1) \times (nb_{max} - N_{possible})\} \tag{3.1.10}$$

The total makespan is given by:

$$MS = MS_{multiproc} + \left\lceil \frac{N_{overpass} + nb_{max}}{R} \right\rceil \times T_P \tag{3.1.11}$$

In the case $(NS \times NM) \bmod nb_{max} \neq 0$, a total of $N_{overpass}$ post-processing tasks corresponding to the first $n-2$ sets of simultaneous multi-processor tasks will overpass the execution of the last $n-2$ complete sets of simultaneous tasks (Figure 3.3):

$$N_{overpass} = \max\{0,\; (n - 2) \times (nb_{max} - N_{possible})\} \tag{3.1.12}$$

Figure 3.3: Post-processing tasks overpassing.

Figure 3.4: Post-processing tasks overpassing and final schedule.

Along with the $nb_{max}$ post-processing tasks from the last complete set of simultaneous multi-processor tasks, this gives a total of $N_{overtot} = N_{overpass} + nb_{max}$ tasks that should be scheduled starting on the resources left unoccupied in the last set of multi-processor tasks ($R_{left} = R - G \times [(NS \times NM) \bmod nb_{max}]$) (Figure 3.4).

On 1 processor of the $R_{left}$ remaining ones, $\lfloor T_G / T_P \rfloor$ post-processing tasks can be scheduled. The number of remaining tasks, together with the post-processing tasks corresponding to the last (incomplete) set of multi-processor tasks ($(NS \times NM) \bmod nb_{max}$), is:

$$T_{rem} = (NS \times NM) \bmod nb_{max} + \max\left\{0,\; N_{overtot} - \left\lfloor \frac{T_G}{T_P} \right\rfloor \times R_{left}\right\} \tag{3.1.13}$$

Finally, the global makespan will be given by:

$$MS = MS_{multiproc} + \left\lceil \frac{T_{rem}}{R} \right\rceil \times T_P \tag{3.1.14}$$

All the 8 possibilities for the parameter G (4 → 11) are tested and the one yielding the smallest makespan is chosen. The optimal grouping for various numbers of resources (11 → 150) is plotted in Figure 3.5.
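The exhaustive test over G can be sketched as follows. The code evaluates formulas (3.1.1) to (3.1.14) for each group size of Table 1.1 and keeps the smallest makespan. It rests on assumptions that are not stated in the text: the lost brackets are read as ceilings around the "number of sets" fractions and as floors around $T_G / T_P$, and the grouped post-processing task is taken to last $T_P$ = 120 s. All function names are hypothetical.

```python
# Execution times of the main task from Table 1.1 (seconds).
T = {4: 5400, 5: 3000, 6: 2200, 7: 1800, 8: 1620, 9: 1459, 10: 1354, 11: 1260}


def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def makespan(R, G, NS, NM, T_P):
    """Makespan for groups of G processors, or None if no group fits."""
    T_G = T[G]
    total = NS * NM
    nb_max = min(NS, R // G)
    if nb_max == 0:
        return None
    R2 = R - nb_max * G
    n = ceil_div(total, nb_max)          # number of sets of simultaneous tasks
    ms_multiproc = n * T_G               # (3.1.1) / (3.1.5)
    last = total % nb_max                # size of the last, incomplete set
    fit = T_G // T_P                     # post-processing tasks per processor in T_G
    R_left = R - last * G
    if R2 == 0:
        # Case 1: post-processing runs after the multi-processor tasks.
        if last == 0:
            return ms_multiproc + ceil_div(total, R) * T_P           # (3.1.2)
        t_rem = last + max(0, total - last - fit * R_left)           # (3.1.3)
        return ms_multiproc + ceil_div(t_rem, R) * T_P               # (3.1.4)
    # Case 2: R2 processors are reserved for post-processing.
    n_possible = fit * R2                                            # (3.1.7)
    if last == 0:
        n_overpass = max(0, (n - 1) * (nb_max - n_possible))         # (3.1.10)
        return ms_multiproc + ceil_div(n_overpass + nb_max, R) * T_P  # (3.1.11)
    n_overpass = max(0, (n - 2) * (nb_max - n_possible))             # (3.1.12)
    n_overtot = n_overpass + nb_max
    t_rem = last + max(0, n_overtot - fit * R_left)                  # (3.1.13)
    return ms_multiproc + ceil_div(t_rem, R) * T_P                   # (3.1.14)


def best_group_size(R, NS, NM, T_P):
    """Return (G, makespan) minimizing the makespan over the sizes of Table 1.1."""
    results = {G: makespan(R, G, NS, NM, T_P) for G in T}
    feasible = {G: ms for G, ms in results.items() if ms is not None}
    return min(feasible.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(best_group_size(R=53, NS=10, NM=1800, T_P=120))
```

With these assumptions, R = 53 and 10 simulations of 1800 months select G = 7, which agrees with the grouping reported in the text for this platform size.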

Figure 3.5: Optimal groupings for 10 scenario simulations.

For a given optimal grouping, however, a set of concurrent multi-processor tasks and the associated post-processing tasks may not use all the available resources. For example, for R = 53 resources and 10 "scenario" simulations, the optimal grouping is G = 7. Hence a total of 7 multi-processor tasks can execute concurrently, occupying 49 resources. The corresponding post-processing tasks need only 1 resource, which leaves 3 resources unoccupied during the whole simulation. In order to improve the makespan, the unoccupied resources can be distributed among the 7 groups of resources for the multi-processor tasks, resulting in 3 groups with 8 resources and 4 groups with 7 resources, plus 1 resource for the post-processing tasks, which gives a gain of 4.5% (58 hours less on the makespan).

Given that the multi-processor tasks scale well and the post-processing tasks have a small duration, another possibility for reducing the makespan is to use the resources normally reserved for post-processing tasks for multi-processor tasks and to leave all the post-processing to the end.

The optimal distribution of the R processors into groups on which the multi-processor tasks should be executed can be viewed as an instance of the Knapsack problem with an extra constraint. Given a set of items, each with a cost and a value, it is required to determine the number of each item to include in a collection such that the total cost does not exceed a given bound and the total value is as large as possible.

In this case, the items are the 8 possible groupings (groups of 4 to 11). The cost of an item (a grouping in this case) is the number of resources of that grouping. The value of a specific grouping G is given by $\frac{1}{T[G]}$, which represents the fraction of a multi-processor task that gets executed during a time unit by that specific group of processors. The total cost is bounded by the total number of resources R.

The goal when dividing the processors into groups for the multi-processor tasks is to achieve as much work as possible during a time interval.

We have $n_i$ unknowns ($i$ from 4 → 11) representing the number of groups with $i$ resources which will be taken in the final solution. The goal is to maximize $\sum_{i=4}^{11} n_i \times \frac{1}{T[i]}$ under the constraints $\sum_{i=4}^{11} i \times n_i \leq R$ and $\sum_{i=4}^{11} n_i \leq NS$ (given that no more than NS tasks can be executed simultaneously).
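The text does not say how this constrained Knapsack instance is solved. One possible way, shown here only as a sketch, is a small dynamic program over the number of groups and the number of resources; the function name knapsack_groups and the table layout are assumptions.

```python
# Execution times of the main task from Table 1.1 (seconds).
T = {4: 5400, 5: 3000, 6: 2200, 7: 1800, 8: 1620, 9: 1459, 10: 1354, 11: 1260}


def knapsack_groups(times, R, NS):
    """Maximize sum(n_i / times[i]) with sum(i * n_i) <= R and sum(n_i) <= NS.

    Returns (value, counts) where counts maps a group size i to n_i.
    """
    # dp[k][r]: best (value, counts) using at most k groups and at most r resources.
    dp = [[(0.0, {}) for _ in range(R + 1)] for _ in range(NS + 1)]
    for k in range(1, NS + 1):
        for r in range(R + 1):
            best = dp[k - 1][r]                      # do not add a k-th group
            for size, t in times.items():
                if size <= r:
                    value, counts = dp[k - 1][r - size]
                    candidate = value + 1.0 / t      # add one group of this size
                    if candidate > best[0] + 1e-15:
                        new_counts = dict(counts)
                        new_counts[size] = new_counts.get(size, 0) + 1
                        best = (candidate, new_counts)
            dp[k][r] = best
    return dp[NS][R]


if __name__ == "__main__":
    value, counts = knapsack_groups(T, R=53, NS=10)
    print(counts, value)
```

The table has (NS + 1) × (R + 1) cells and each cell tries the 8 group sizes, so the cost is negligible for the platform sizes considered here (up to 150 resources).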

The multi-processor tasks are distributed over the different groupings level by level (the multi-processor tasks for month $n$ cannot be scheduled until all the multi-processor tasks from month $n - 1$ have been executed), in order to exploit the task parallelism as much as possible and to prevent the situation where some simulations are executed at a faster rate than others. Processor groups are kept sorted according to their ready times, and the multi-processor tasks of a level are kept sorted according to the ready times of the corresponding multi-processor tasks of the previous level. At the time a multi-processor group becomes ready, the multi-processor task of the current level with the smallest ready time is scheduled.
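The level-by-level policy can be simulated with a priority queue of processor groups. The sketch below is an interpretation of the description above, not the report's implementation: group_times[g] is assumed to be the execution time of one multi-processor task on group g, post-processing is ignored, and the function name is hypothetical.

```python
import heapq


def schedule_levels(group_times, NS, NM):
    """Return the makespan of NS chains of NM tasks scheduled level by level.

    group_times[g] is the (assumed) execution time of one task on group g.
    """
    groups = [(0, g) for g in range(len(group_times))]  # (ready time, group id)
    heapq.heapify(groups)
    ready = [0] * NS  # finish time of the previous month of each simulation
    for _ in range(NM):
        # Tasks of the current level, sorted by the ready time of their predecessor.
        order = sorted(range(NS), key=lambda s: ready[s])
        finish = list(ready)
        for s in order:
            group_ready, g = heapq.heappop(groups)  # earliest available group
            start = max(group_ready, ready[s])
            finish[s] = start + group_times[g]
            heapq.heappush(groups, (finish[s], g))
        ready = finish
    return max(ready)


if __name__ == "__main__":
    # 3 groups of 8 processors and 4 groups of 7 processors (times from Table 1.1).
    print(schedule_levels([1620] * 3 + [1800] * 4, NS=10, NM=3))
```

Because a level is fully assigned before the next one starts, a fast group may wait for a task whose predecessor ran on a slow group, which is one source of the gaps on some processor groups.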

The gains obtained with the 3 possible improvements presented above, with respect to the first version of the scheduling, are plotted in Figure 3.6. The negative values are due to the scheduling policy, which prevents a new level from being started before all the tasks of the previous level have been scheduled and which sometimes introduces gaps on some processor groups.

Figure 3.6: Gains (in %, as a function of the number of resources) obtained by using resources left unoccupied (Gain 1), using the resources reserved for post-processing tasks (Gain 2) and using the Knapsack problem model (Gain 3).

3.2 Generic Scheduling Heuristics

We propose scheduling heuristics for a general class of applications that consist of independent identical chains of identical DAGs, where each such DAG contains 4 data-parallel tasks: one independent pre-processing task, one independent post-processing task, one main computing task and one inter-processing task linking the successive chained DAGs, as seen in Figure 3.7:

Figure 3.7: Generalized Application.

We denote by:

• $NS$ = number of independent chains of DAGs (simulations);
• $NR$ = number of repetitions of the basic DAG;
• $R$ = number of resources;
• $A$ = the independent pre-processing type of task;
• $B$ = the inter-processing type of task;
• $C$ = the main processing type of task;
• $D$ = the independent post-processing type of task;
• $T_{TYPE}[i]$ = the processing time of a task of type TYPE on $i$ resources.

One possible approach to schedule this type of application scenario is to create a composite DAG by linking all entry tasks (all the independent pre-processing tasks in this case, and the inter-processing tasks corresponding to the first basic DAG in each chain) to a common entry node and all exit tasks (the independent post-processing tasks and the main processing tasks corresponding to the last repetition of the basic DAG in each chain) to a common exit node, and to apply mixed-parallelism scheduling algorithms (such as CPA [5] or CPR [6]) to this newly created DAG.

CPR will not work for this type of application when the number of resources is at least two times greater than the number of independent chains of tasks to be scheduled, because any increase of the number of processors allocated to a task belonging to a certain chain of DAGs (each chain of DAGs represents an independent critical path) will only decrease the makespan of that specific chain and will have no influence on the overall makespan. Hence, CPR will only iterate through all the tasks of the composite DAG without any success in improving the overall makespan and will stop with a 1-processor allocation for each task.

CPA has the advantage of a low complexity, O(V(V+E)P) (where V is the number of vertices in the composite DAG, E is the number of edges and P the number of processors). However, it is a 2-step algorithm which decouples the allocation of processors to tasks from the effective scheduling on processors, and its complexity depends on the number of tasks in the DAG (which in this case is significant).

Exploiting the specific structure of the graph, we propose a scheduling methodology based upon notions and techniques for scheduling data-parallel pipelines and for scheduling independent malleable tasks.

At a first level, we may observe that any of the independent chains of DAGs is actually a pipeline consisting of NR identical stages, each stage being represented by the basic DAG with 4 data-parallel tasks. At a deeper level, we may view the whole application as a single DAG - the basic DAG - which must process data sets in a pipeline fashion. The first data sets are the initial NS data sets given as input for the whole scenario, while in the next steps, the NS data sets from the previous level of iterations are processed in a round-robin fashion, and so on.

The advantage which cannot be exploited by regular mixed-parallelism scheduling algorithms is the possibility to separate the independent pre- and post-processing (scheduling them independently) and to focus on optimally scheduling the tasks with dependencies which stretch over the whole chain of DAGs (the inter-processing and main computation tasks).

Basically, we can treat the whole application as a pipeline of 2 stages (the inter-processing and main processing tasks) preceded by the independent pre-processing tasks and followed by the post-processing tasks, provided that the resulting pipeline does not process more than NS data sets concurrently.

The pre-processing tasks are all independent of each other and can be executed in any order as long as they precede the corresponding main processing tasks. Similarly, the post-processing tasks are all independent of each other but must be executed after the corresponding main processing tasks.

The optimal scheduling of identical malleable tasks has a complexity exponential in the number of processors. However, there exists an algorithm that approximates an optimal schedule up to a factor $\frac{5}{4}$ in constant time [1] by building a phase-by-phase schedule (a schedule consisting of phases in which each job uses the same number of processors and in which a new phase cannot start until the previous phase has finished); we will use it for the scheduling of the pre- and post-processing tasks. In the following, phase_by_phase(T, n, R) denotes the function returning the makespan obtained by scheduling $n$ identical independent malleable tasks of type T onto $R$ resources.


For the pipeline of the inter-processing and main processing tasks, we have 2 options: either we consider them as an interval and schedule them on the same number of resources, or we allocate them separately.

When considering the 2 tasks as an interval, the makespan of the inter-processing and main processing tasks for a certain number of resources $G_{BC}$ allocated to them can be computed as:

$$MS_{BC} = \left\lceil \frac{NS \times NR}{\min\left(NS, \left\lfloor \frac{R}{G_{BC}} \right\rfloor\right)} \right\rceil \times \left(T_B[G_{BC}] + T_C[G_{BC}]\right) \tag{3.2.1}$$

The $G_{BC}$ yielding the minimum value for $MS_{BC}$ is chosen and the total makespan is given by:

$$MS_{total} = \mathrm{phase\_by\_phase}(A, NS \times NR, R) + MS_{BC} + \mathrm{phase\_by\_phase}(D, NS \times NR, R) \tag{3.2.2}$$
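The choice of $G_{BC}$ can be sketched as an exhaustive search over the group sizes for which both execution times are known. The timing tables in the example are made up for illustration, the function names are hypothetical, and phase_by_phase is deliberately left out because the algorithm of [1] is not described here.

```python
def ms_bc_interval(T_B, T_C, G, R, NS, NR):
    """Formula (3.2.1): makespan of B and C scheduled together on groups of G."""
    groups = min(NS, R // G)                 # simultaneous (B, C) intervals
    rounds = -(-(NS * NR) // groups)         # ceil(NS * NR / groups)
    return rounds * (T_B[G] + T_C[G])


def best_interval_group(T_B, T_C, R, NS, NR):
    """Return (G_BC, MS_BC) minimizing formula (3.2.1)."""
    candidates = [G for G in T_B if G in T_C and G <= R]
    best = min(candidates, key=lambda G: ms_bc_interval(T_B, T_C, G, R, NS, NR))
    return best, ms_bc_interval(T_B, T_C, best, R, NS, NR)


if __name__ == "__main__":
    # Hypothetical execution times (seconds) indexed by the number of resources.
    T_B = {1: 10, 2: 6}
    T_C = {1: 100, 2: 60}
    print(best_interval_group(T_B, T_C, R=4, NS=2, NR=3))
```

In this toy instance both group sizes allow 2 simultaneous intervals, so the larger groups win simply because each interval is shorter.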

When considering the 2 tasks independently, we must decide on a distribution of the R resources between the 2 tasks such that the overall makespan is minimal. Considering $n_1$ resources allocated to tasks of type B and $n_2 = R - n_1$ resources allocated to tasks of type C, there exist several possibilities to divide the $n_1$ and $n_2$ resources into groups for tasks of type B and C respectively. If we denote by $G_B$ the number of resources allocated to a task of type B and by $G_C$ the number of resources allocated to a task of type C, there can be $Nr_B = \left\lfloor \frac{n_1}{G_B} \right\rfloor$ groups of processors for tasks of type B and $Nr_C = \left\lfloor \frac{n_2}{G_C} \right\rfloor$ groups of processors for tasks of type C. This is equivalent to saying that task B is replicated $Nr_B$ times and task C is replicated $Nr_C$ times.

Hence the period of this pipeline is given by:

$$P_{BC} = \max\left\{\frac{T_B[G_B]}{Nr_B},\; \frac{T_C[G_C]}{Nr_C}\right\} \tag{3.2.3}$$

The latency for a data set to cross this pipeline is given by $L = T_B[G_B] + T_C[G_C]$. We must ensure that no more than NS data sets are processed by the pipeline simultaneously, since there cannot be more than NS independent data sets. This constraint is violated when $L > NS \times P_{BC}$, which requires increasing the period of the pipeline to the value $P_{BC} = \frac{L}{NS}$. The makespan of the inter-processing and main processing tasks treated as a pipeline is given by:

$$MS_{BC} = L + (NS \times NR - 1) \times P_{BC} \tag{3.2.4}$$
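Formulas (3.2.3) and (3.2.4), together with the period adjustment, can be checked on a small example. The timing values below are invented for illustration and the function name pipeline_bc is hypothetical.

```python
def pipeline_bc(T_B, T_C, G_B, G_C, n1, n2, NS, NR):
    """Return (period, latency, makespan) of the B-C pipeline.

    T_B and T_C map a number of resources to an execution time;
    n1 and n2 are the resources given to tasks of type B and C.
    """
    nr_b = n1 // G_B                       # replications of task B
    nr_c = n2 // G_C                       # replications of task C
    period = max(T_B[G_B] / nr_b, T_C[G_C] / nr_c)   # (3.2.3)
    latency = T_B[G_B] + T_C[G_C]
    if latency > NS * period:
        # More than NS data sets would be in flight: slow the pipeline down.
        period = latency / NS
    makespan = latency + (NS * NR - 1) * period      # (3.2.4)
    return period, latency, makespan


if __name__ == "__main__":
    # Hypothetical execution times (seconds).
    T_B = {1: 10}
    T_C = {2: 60}
    print(pipeline_bc(T_B, T_C, G_B=1, G_C=2, n1=1, n2=4, NS=2, NR=3))
    print(pipeline_bc(T_B, T_C, G_B=1, G_C=2, n1=1, n2=4, NS=4, NR=3))
```

In the first call only 2 independent data sets exist, so the period grows from 30 to 35 time units; in the second call the natural period of 30 is kept.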

The overall makespan is the same as in the previous case.

This choice of scheduling could leave some resources unused (due to the integer division as well as to the constraint on the maximum number of independent data sets that can be processed simultaneously, NS) throughout the length of the pipeline. Also, trying to minimize just the makespan of the pipeline formed by the tasks of type B and C could result in a local minimum for the overall makespan.

A better approach is to allow pre- and post-processing tasks to be executed while the pipeline B-C is active, provided that the dependency relations are respected.

When treating the tasks of type B and C separately in a pipeline, one must decide on a distribution of the R resources among the pre- and post-processing tasks, the inter-processing tasks and the main processing tasks. Assume $n_{AD}$ resources for the former and $n_B$ and $n_C$ resources for the latter two ($n_{AD} \in [0, R - 2]$, $n_B \in [1, R - 1 - n_{AD}]$, $n_C = R - n_{AD} - n_B$), and a distribution of the $n_B$ resources in groups of $G_B$ resources and of the $n_C$ resources in groups of $G_C$ resources. This yields $Nr_B = \left\lfloor \frac{n_B}{G_B} \right\rfloor$ and $Nr_C = \left\lfloor \frac{n_C}{G_C} \right\rfloor$ replications for the inter-processing and main processing tasks respectively, and a latency $L = T_B[G_B] + T_C[G_C]$.

The period of the pipeline is the same as in equation 3.2.3. Similarly, it changes to $P_{BC} = \frac{L}{NS}$ in the case $L > NS \times P_{BC}$. The makespan of the pipeline B-C is $MS_{BC}$, as given by formula 3.2.4.


Given that the period of the pipeline is the maximum of 2 periods and that an integer division with remainder is performed when making the groups of resources, some resources may not actually be used by the pipeline:

$$R_{unused} = n_B - \left\lceil \frac{T_B[G_B]}{P_{BC}} \right\rceil \times G_B + n_C - \left\lceil \frac{T_C[G_C]}{P_{BC}} \right\rceil \times G_C \tag{3.2.5}$$
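A short numerical illustration of formula (3.2.5) follows. It assumes that the lost brackets are ceilings, i.e. that a stage of duration T needs ceil(T / P_BC) replicas to sustain the period P_BC; the values and the function name are hypothetical.

```python
from math import ceil


def unused_resources(t_b, t_c, G_B, G_C, n_B, n_C, period):
    """Formula (3.2.5): resources of the B-C pipeline that are never used.

    t_b and t_c are the execution times T_B[G_B] and T_C[G_C].
    """
    used_b = ceil(t_b / period) * G_B   # replicas of B actually needed, times G_B
    used_c = ceil(t_c / period) * G_C   # replicas of C actually needed, times G_C
    return (n_B - used_b) + (n_C - used_c)


if __name__ == "__main__":
    # Stage C (60 s on groups of 2, 7 resources) imposes a period of 20 s;
    # stage B (10 s on 1 resource, 3 resources) then needs a single replica.
    print(unused_resources(t_b=10, t_c=60, G_B=1, G_C=2, n_B=3, n_C=7, period=20))
```

Here 2 of the 3 resources given to B and 1 of the 7 given to C are idle, so 3 resources could be handed over to the pre- and post-processing tasks.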

These unused resources are added to the $n_{AD}$ resources. The pre- and post-processing tasks are each scheduled on 1 processor on the resources reserved for them. The rationale behind this decision is that, the tasks being numerous, the phase_by_phase procedure would have scheduled $\left\lfloor \frac{NS \times NR}{R} \right\rfloor$ rounds of tasks on 1 processor each, and the remaining $(NS \times NR) \bmod R$ tasks on $\left\lfloor \frac{R}{(NS \times NR) \bmod R} \right\rfloor$ resources each. Hence the great majority of tasks would have been scheduled on 1 resource anyway, and scheduling a part of them on the resources kept aside for this purpose, as well as on the resources left unoccupied, should decrease the overall makespan.

The period that the pre-processing tasks could attain if pipelined on the $n_{AD}$ resources is $P_A = \frac{T_A[1]}{n_{AD}}$. Similarly, for the post-processing tasks: $P_D = \frac{T_D[1]}{n_{AD}}$. We treat 3 possible cases according to the relation between $P_A$ and $P_{BC}$ on one side and between $P_D$ and $P_{BC}$ on the other side.

Case 1. $P_A \leq P_{BC}$

The significance of this case is that a pipeline of the pre-processing tasks on their $n_{AD}$ processors would output results at a higher (or at least equal) rate than the pipeline of the inter- and main processing tasks. Hence, if the first pre-processing task starts $T_A[1]$ time units before the first main processing task, we are guaranteed that all main processing tasks will have the necessary data available at their start. For ease of computation of the resulting makespan, we consider the pre- and post-processing tasks to be scheduled in blocks rather than pipelined.

The duration of the pre-processing tasks (considering that R tasks are scheduled before the start of the B-C pipeline) is:

$$MS_A = \left\lceil \frac{NS \times NR - R}{n_{AD}} \right\rceil \times T_A[1] \tag{3.2.6}$$

According to the relation between $P_D$ and $P_{BC}$, we distinguish 2 sub-cases:

Case 1.1 $P_D \leq P_{BC}$

The significance of this case is that a pipeline of the post-processing tasks on the $n_{AD}$ resources would advance at a rate higher than the rate at which the main processing tasks produce results. This is the mirror image of the case $P_A \leq P_{BC}$. If the post-processing task corresponding to the last main processing task starts immediately after it, and the rest of the post-processing tasks form a pipeline laid out backwards towards the start of the application, we guarantee that no post-processing task starts before the corresponding main processing task. Again, to ease the computation, we consider that they are scheduled in blocks and not shifted as in a pipeline, and that R tasks are scheduled immediately after the end of the pipeline B-C. The time taken by the post-processing tasks that execute in parallel with the pipeline B-C is:

$$MS_D = \left\lceil \frac{NS \times NR - R}{n_{AD}} \right\rceil \times T_D[1] \tag{3.2.7}$$

If $MS_A + MS_D > MS_{BC}$, a total of $N_{D\_rest}$ post-processing tasks must be scheduled after the end of the pipeline:

$$N_{D\_rest} = NS \times NR - \left\lfloor \frac{MS_{BC} - MS_A}{T_D[1]} \right\rfloor \times n_{AD} \tag{3.2.8}$$

The makespan in this case is:

$$MS_{total} = T_A[1] + MS_{BC} + \mathrm{phase\_by\_phase}(D, N_{D\_rest}, R) \tag{3.2.9}$$


If $MS_A + MS_D \leq MS_{BC}$:

$$MS_{total} = T_A[1] + MS_{BC} + T_D[1] \tag{3.2.10}$$

Case 1.2 $P_D > P_{BC}$

The significance of this case is that a post-processing task pipeline on the $n_{AD}$ resources would produce results at a rate slower than the B-C pipeline, hence there would be no danger of scheduling a post-processing task before its corresponding main processing task. Scheduling them in blocks, we must start their execution at least $T_B[G_B] + T_C[G_C] + (n_{AD} - 1) \times P_{BC}$ time units after the start of the pipeline B-C to ensure that $n_{AD}$ multi-processor tasks have finished. Considering that they actually start after all the pre-processing tasks are finished, there remain $N_{D\_rest}$ tasks to be done after the pipeline B-C:

$$N_{D\_rest} = NS \times NR - \left\lfloor \frac{MS_{BC} - \max\left(MS_A,\; T_B[G_B] + T_C[G_C] + (n_{AD} - 1) \times P_{BC}\right)}{T_D[1]} \right\rfloor \times n_{AD} \tag{3.2.11}$$

The makespan is given by:

$$MS_{total} = T_A[1] + MS_{BC} + \mathrm{phase\_by\_phase}(D, N_{D\_rest}, R) \tag{3.2.12}$$

Case 2. $P_A > P_{BC}$

The significance of this case is that a pipeline of the pre-processing tasks would advance at a lower rate than the main processing tasks needing their results, hence we consider the pre-processing tasks that fit on the $n_{AD}$ resources starting backwards from the end of the pipeline towards the beginning, and leaving all the post-processing at the end. The time interval where pre-processing tasks can be executed (in blocks) safely (such that there would exist main processing tasks using their results) is:

$$T_{A\_conc} = MS_{BC} - \left((n_{AD} - 1) \times P_{BC} + T_C[G_C]\right); \qquad (3.2.13)$$

There remain $N_{A\_rest}$ pre-processing tasks to be executed at the beginning of the pipeline:

$$N_{A\_rest} = N_S \times N_R - \left\lfloor \frac{T_{A\_conc}}{T_A[1]} \right\rfloor \times n_{AD}; \qquad (3.2.14)$$

The makespan is given by:

$$MS_{total} = \mathit{phase\_by\_phase}(A, N_{A\_rest}, R) + MS_{BC} + \mathit{phase\_by\_phase}(D, N_S \times N_R, R); \qquad (3.2.15)$$
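Equations (3.2.13) and (3.2.14) can be illustrated by the following sketch. The names are ours and the rounding down of the number of pre-processing blocks is an assumption about the intended formula.

```python
import math
from typing import Tuple


def case_2_pre_processing(n_s: int, n_r: int, n_ad: int,
                          ms_bc: float, p_bc: float,
                          t_c: float, t_a1: float) -> Tuple[float, int]:
    """Return (T_A_conc, N_A_rest); t_c stands for T_C[G_C]."""
    # (3.2.13): interval of the pipeline in which a pre-processing block
    # still has main processing tasks left to consume its results.
    t_a_conc = ms_bc - ((n_ad - 1) * p_bc + t_c)
    # (3.2.14): blocks of n_ad tasks fitting in that interval (assumed floor);
    # the remaining tasks must run before the pipeline starts.
    blocks = max(0, math.floor(t_a_conc / t_a1))
    n_a_rest = max(0, n_s * n_r - blocks * n_ad)
    return t_a_conc, n_a_rest
```

With 20 tasks, `n_ad = 2`, a pipeline of length 30, period 3, `t_c = 6` and `t_a1 = 4`, the safe interval is 21 time units, five blocks fit and 10 tasks remain for the beginning.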

Similar formulas can be deduced for the case of pipelining the tasks B and C as an interval.

In conclusion, we have provided 4 possible solutions for scheduling the proposed application: one solution in which we test all the possible groupings of the R resources (in groups with an equal number of resources), using the rest of the resources for the post-processing tasks; one solution in which we distribute the possible resources left unused by the post-processing tasks to the existing groups; one solution in which we leave no resource at all for the post-processing tasks, distributing evenly the rest of the resources to the existing groupings and leaving all the post-processing at the end; and finally a solution where the number of processors in the groups of processors is chosen so as to maximize the cumulated percentage of tasks executed in one time unit (leaving the post-processing at the end). We have also proposed 4 possible solutions for the generalized problem: one solution is to separate the pre-processing and post-processing and to schedule the inter-processing and main processing tasks together (a main processing task along with its corresponding inter-processing task on the same resources); one solution is to make a pipeline of the inter-processing and main processing tasks and schedule all the pre-processing at the beginning and all the post-processing at the end; and finally 2 solutions that take into account the possibility of executing pre and post-processing tasks while the inter-processing and main processing are pipelined (either separately or as a block).


Experimental Results

The behavior of the 4 pipeline-based heuristics was tested against the results provided by the CPA [5] scheduling algorithm applied to the composite DAG created by linking all entry tasks of the simulations to a common unique entry node and all exit tasks to a common unique exit node.
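The construction of the composite DAG can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the adjacency-list representation, the `ENTRY` and `EXIT` node names and the function name are all assumptions of ours.

```python
from typing import Dict, Hashable, List


def build_composite_dag(chains: List[Dict[str, List[str]]]) -> Dict[Hashable, list]:
    """Link every entry task to a unique ENTRY node and every exit task to EXIT.

    Each chain is a DAG given as {task: [successor tasks]}; tasks of chain i
    are renamed (i, task) so that identical chains do not collide.
    """
    composite: Dict[Hashable, list] = {"ENTRY": [], "EXIT": []}
    for i, dag in enumerate(chains):
        successors = {s for succs in dag.values() for s in succs}
        nodes = sorted(set(dag) | successors)
        for node in nodes:
            composite[(i, node)] = [(i, s) for s in dag.get(node, [])]
        for node in nodes:
            if node not in successors:      # no predecessor: entry task
                composite["ENTRY"].append((i, node))
            if not dag.get(node):           # no successor: exit task
                composite[(i, node)].append("EXIT")
    return composite
```

For two identical two-task chains `A -> B`, the result has `ENTRY` pointing to both copies of `A` and both copies of `B` pointing to `EXIT`.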

Tasks' execution time as a function of the allocated number of processors is modeled by the following function according to Amdahl's law: $T(t, n) = \left(\alpha + \frac{1 - \alpha}{n}\right) \times T_1$, where $T_1$ is the execution time of task $t$ on 1 processor, $\alpha$ is the fraction of the task that is sequential and $n$ is the number of processors allocated to the task.
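The execution time model above translates directly into a small function. The name `execution_time` and the argument checks are ours; the numeric values in the usage note reuse the task length (500) and the sequential fractions of the test configurations.

```python
def execution_time(t1: float, alpha: float, n: int) -> float:
    """Amdahl's law: T(t, n) = (alpha + (1 - alpha) / n) * T_1."""
    if n < 1:
        raise ValueError("at least one processor must be allocated")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a fraction between 0 and 1")
    # The sequential part (alpha) does not shrink; the rest is divided by n.
    return (alpha + (1.0 - alpha) / n) * t1
```

A task of length 500 with $\alpha = 0.1$ takes about 95 time units on 10 processors, whereas a totally sequential task ($\alpha = 1.0$) still takes 500 whatever the allocation.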

Figure 4.1: Gains obtained for the first test configuration, plotted as gain in % against resources (processors, from 10 to 100) for Gain1 to Gain4. Panel (a): $\alpha_{interprocessing} = 0.1$; panel (b): $\alpha_{interprocessing} = 0.8$.

Several configurations have been tested. Some of them, and their results as gains with respect to the results of the CPA algorithm, are presented next. Gain1 in the following figures represents the gain of the block scheduling of the inter-processing and main processing tasks (preceded by the pre-processing and followed by the post-processing tasks), Gain2 represents the gain of the approach of pipelining B and C separately (pre and post-processing tasks executed separately), Gain3 represents the gain of the approach of pipelining the interval of tasks B-C with pre and post-processing tasks executed simultaneously on resources specially reserved and on the resources left unoccupied, and finally Gain4 represents the gain of the approach of pipelining B and C separately with pre and post-processing executed simultaneously as in the previous case.
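The text does not spell out how the gain is computed. A minimal sketch, assuming that the gain is the relative makespan reduction with respect to CPA expressed in percent (an assumption on our part, as are the names used), would be:

```python
def gain_percent(ms_cpa: float, ms_heuristic: float) -> float:
    """Assumed definition: relative makespan reduction over CPA, in percent."""
    if ms_cpa <= 0:
        raise ValueError("the CPA makespan must be positive")
    # Positive when the heuristic is faster than CPA, negative when it is slower.
    return 100.0 * (ms_cpa - ms_heuristic) / ms_cpa
```

Under this reading, a heuristic that halves the CPA makespan has a gain of 50%, and one that is 10% slower has a gain of -10%.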

First, a rather homogeneous DAG with the same execution time for all the tasks (500) on 1 processor and the same percentage of sequentiality ($\alpha = 0.1$) has been investigated. The gains obtained by the 4 algorithms with respect to the schedule obtained by the CPA algorithm for 10 independent chains of tasks and 1800 iterations are plotted in Figure 4.1(a). For a value $\alpha = 0.8$ for the inter-processing task (i.e. the inter-processing task is almost sequential), the gain results are plotted in Figure 4.1(b). Gains of 50% are obtained in this case by exploiting the pipelined nature of the application.

Figure 4.2: Gains obtained for the second test configuration, plotted as gain in % against resources (processors, from 10 to 100) for Gain1 to Gain4. Panel (a): $\alpha_{interprocessing} = 0.6$; panel (b): $\alpha_{interprocessing} = 1.0$.

Another configuration that was tested is one with pre and post-processing tasks of duration 50 on a single processor and inter-processing and main processing tasks of duration 500, with the same parameter $\alpha = 0.1$ for all tasks except for the inter-processing task, for which the parameter is varied. In Figure 4.2(a) the gains for $\alpha_{interprocessing} = 0.6$ are plotted, while Figure 4.2(b) shows the gains for $\alpha_{interprocessing} = 1.0$ (totally sequential task).

A third configuration tested was one with all tasks of the same length (500) and the same coefficient of sequentiality, except for the main processing task, for which the duration is varied. Figures 4.3(a) and 4.3(b) present the gain results obtained.

The experimental results show that the proposed scheduling approaches, which exploit the pipelined nature of the application, can obtain significant improvements over CPA (up to 60% of gain). Even though sometimes one of the 4 proposed heuristics behaves worse than CPA, the other 3 behave much better.

Figure 4.3: Gain in % against resources (processors, from 10 to 100) for Gain1 to Gain4. Panel (a): $T_{main\ processing\ task} = 1500$; panel (b): $T_{main\ processing\ task} = 3000$.


Conclusion

This report presents the work of analyzing and modeling a real climatology application with the purpose of deriving appropriate scheduling heuristics.

First, the application has been modeled as independent identical workflows derived through the chaining of several basic DAGs. Then a simplified model with clustered tasks, based upon the actual time parameters of the application, has been derived.

For this new model, a first scheduling heuristic (driven by the principle of allocating the same number of processors to all multi-processor tasks and leaving what is left to the post-processing tasks) has been proposed. Three improved versions have then been derived: a first one that distributes the resources left unused evenly across the groups of processors, a second one which does not leave any resource for the post-processing tasks and distributes all remaining resources evenly to the groups of processors, and a third one that models the problem of dividing the resources of the platform into disjoint sets as an instance of the Knapsack problem with a supplementary constraint. The three improved versions have been simulated and yielded gains of up to 9%.

Finally, scheduling heuristics for the generalized problem of scheduling independent identical chains of identical DAGs (composed of an independent pre-processing task, an independent post-processing task, a main processing task and an inter-processing task linking successive DAGs, all tasks being multi-processor) have been proposed and compared to the approach of applying a mixed-parallelism scheduling algorithm to the composite DAG resulting from linking all entry tasks to a common entry node and all exit tasks to a common exit node. The results of the 4 proposed heuristics were highly encouraging, not only in terms of gains obtained with respect to the results of the CPA mixed-parallelism scheduling algorithm (up to 60% of gain), but also in terms of running times for finding the solution (at most a second for determining the optimal pipeline, compared to tens of minutes or even an hour for running CPA on a problem of dimension 10 chains of 1800 iterations of the basic DAG each).

As future work, we intend to enhance the heuristics by taking into account a more precise communication cost model, in case bigger data exchanges are encountered in applications from other fields. We also plan to perform real simulations on the Grid'5000 platform in order to validate the theoretical results. Finally, we intend to analyze other applications using a similar approach, with the long-term goal of deriving application-dependent scheduling schemes and implementing them within the DIET middleware scheduler.


[1] T. Decker, T. Lücking, and B. Monien. A 5/4-approximation algorithm for scheduling identical malleable tasks. Theoretical Computer Science, 361(2):226-240, 2006.

[2] M. Deque, C. Dreveton, A. Braun, and D. Cariolle. The ARPEGE/IFS atmosphere model: a contribution to the French community climate modeling. Clim Dyn, 10:249-266.

[3] G. Madec. NEMO Reference manual, ocean dynamic component: NEMO-OPA. Number 27. Institut Pierre Simon Laplace (IPSL), 2006. ISSN 1288-1619.

[4] T. Oki and Y. C. Sud. Design of total runoff integrating pathways (TRIP). Technical Report 2, Earth Integration, 1998.

[5] A. Radulescu and A. J. C. van Gemund. Low-cost mixed task and data parallel scheduling. In 30th International Conference on Parallel Processing (ICPP), pages 69-76, August 2001.

[6] Andrei Radulescu, Cristina Nicolescu, Arjan J. C. van Gemund, and Pieter Jonker. CPR: Mixed task and data parallel scheduling for distributed systems. In IEEE International Parallel and Distributed Processing Symposium, page 39. IEEE Computer Society, 2001.

[7] S. Ramaswamy, S. Sapatnekar, and P. Banerjee. A framework for exploiting data and functional parallelism on distributed memory multicomputers, 1994.

[8] Thomas Rauber and Gudula Rünger. Compiler support for task scheduling in hierarchical execution models. Journal of Systems Architecture, 45(6-7):483-503, 1999.

[9] Jaspal Subhlok and Gary Vondran. Optimal use of mixed task and data parallelism for pipelined computations. Journal of Parallel and Distributed Computing, 60(3):297-319, 2000.

[10] S. Valcke, R. Caubel, A. Vogelsang, and D. Declat. OASIS 3, user guide. Technical Report PRISM Report Serie no 2 (5th edition), CERFACS, Toulouse, 2004. 60 pp.

[11] Henan Zhao and Rizos Sakellariou. Scheduling multiple DAGs onto heterogeneous systems. In IEEE International Parallel and Distributed Processing Symposium. IEEE, 2006.
