Improving Reactivity and Communication Overlap in MPI using a Generic I/O Manager


HAL Id: inria-00177167

https://hal.inria.fr/inria-00177167

Submitted on 5 Oct 2007



François Trahay, Alexandre Denis, Olivier Aumage, Raymond Namyst

To cite this version:

François Trahay, Alexandre Denis, Olivier Aumage, Raymond Namyst. Improving Reactivity and Communication Overlap in MPI using a Generic I/O Manager. EuroPVM/MPI 2007, Oct 2007, Paris, France. pp.170-177, ⟨10.1007/978-3-540-75416-9_27⟩. ⟨inria-00177167⟩


INRIA, LaBRI, Université Bordeaux 1

351, cours de la Libération

F-33405 TALENCE, France

{trahay,denis,aumage,namyst}@labri.fr

Abstract. MPI applications may waste thousands of CPU cycles if they do not efficiently overlap communications and computation. In this paper, we present a generic and portable I/O manager that is able to make communication progress asynchronously using tasklets. It automatically chooses the most appropriate communication method, depending on the context: multi-threaded application or not, SMP machine or not. We have implemented and evaluated our I/O manager with Mad-MPI, our own MPI implementation, and compared it to other existing MPI implementations regarding the ability to efficiently overlap communication and computation.

Keywords: Polling, Interrupt, Thread, Scheduler, High-Speed Network

1 Introduction

Asynchronism is becoming ubiquitous in modern communication runtimes. This evolution is the combined result of multiple factors. Firstly, communication subsystems implement increasingly complex optimizations in order to make better use of networking hardware. As we have shown in [1], such optimizations require online analysis of the communication schemes and hence require the de-synchronization of the communication request submission from its processing. Moreover, providing rich functionality such as communication flow multiplexing or transparent multi-method, heterogeneous networking implies that the runtime system should again take an active part in-between the communication request submission and its processing. And finally, overlapping communication with computation and being reactive actually matter more now than they ever have [2,3]. The latency of network transactions is on the order of several thousand CPU cycles at least. Everything must therefore be done to prevent independent computations from being blocked by an ongoing network transaction. This is even more true with the increasingly dense SMP, multicore, SMT (also known as Intel's Hyperthreading) architectures where many computing units share a few NICs.

Since portability is one of the most important requirements for communication [...] communication runtimes indeed are starting to make use of threads internally and also allow applications to be multithreaded, as can be seen with both MPICH-2 [4] and Open MPI [5,6]. Low-level communication libraries such as Quadrics' Elan [7] and Myricom's MX [8] also make use of multithreading. Such an introduction of threads inside communication subsystems does not come without trouble, however. The fact that multithreading is still usually optional with these runtimes is symptomatic of the difficulty of getting the benefits of multithreading in the context of networking without suffering from the potential drawbacks.

In this paper, we analyze the two fundamental approaches of integrating multithreading and communications: interrupts and polling. We study their respective benefits and their potential drawbacks, and we discuss the importance of the cooperation between the asynchronous event management code and the thread scheduling code in order to avoid such disadvantages. We then introduce our proposal for symbiotically combining both approaches inside a new generic network I/O event manager. The paper is organized as follows. Section 2 exposes the problem of integrating threads and communications. Section 3 introduces our proposal for a new asynchronous event management model and gives details about our implementation. We evaluate this implementation in Section 4, and Section 5 concludes and gives an insight into ongoing and future work.

2 Integrating threads and communication: the problems of network I/O events management

The detection of network I/O events can be achieved by two main strategies. The most common approach consists in using active waiting: a polling function is called repeatedly until a network I/O event is detected. The polling function is usually inexpensive, but repeating this operation thousands of times may be prohibitive. The other method for detecting communication events is passive waiting, which is based on blocking calls. In that case, the NIC informs the operating system that a network I/O event has occurred by using an interrupt, making this method much more reactive than polling. However, this operation involves interrupt handlers and context switches, which are rather costly.

The best method to use depends on the application, but in both cases, some behaviors may lead to suboptimal performance. When using interrupt-based methods, priority issues may occur: the thread that is waiting for the communication event may be scheduled with some delay. This is the case when, for example, it has been computing for a long period before it blocks, lowering its priority. Moreover, the system has to support methods to detect the network I/O events. For instance, in a pure user-level scheduler, interrupt-driven blocking calls are prohibited (unless a specific OS extension like the Scheduler Activations [9] is used).

Using polling methods can also be problematic: if the system is overloaded (i.e. there are more running threads than available CPUs), the polling thread may scarcely be scheduled, thus increasing the reaction [...] a regular polling so that the preliminary phase makes progress. As shown in [2], some applications would significantly improve their execution time by efficiently overlapping communication and computation, which requires polling communication events regularly.

3 An I/O manager model

To resolve these kinds of problems, we propose an I/O manager that provides the communication runtime systems with a network event detection service. Thus, communication libraries themselves become independent of the multithread issues and related hardware issues such as the number of CPUs. Thereby, they can focus their efforts on communication optimizations and other functionalities. By working closely with a specific thread scheduler, the I/O manager can be viewed as a progression engine able to schedule a communicating thread when needed or to dynamically adapt the polling frequency to maximize the reactivity/overhead ratio. The I/O manager handles both polling and interrupt-based methods, switching from one method to another depending on the context.

The implementation of our I/O manager, called PIOMan (PM2 I/O Manager), relies on a two-level thread scheduler [10] which was slightly modified to interact with the I/O manager when necessary. The use of a two-level scheduler makes it possible to precisely control thread scheduling at the user level, with almost no explicit (and expensive) interaction with the OS. This way, we can dynamically favour the scheduling of a thread requiring a high reactivity to communication events during a fixed period. PIOMan is available in three main versions: no-thread, mono (user-level threads) or SMP (user threads on top of kernel threads).

3.1 Overview of the I/O manager

The mechanism of our I/O manager is described through an example shown in Figure 1: the application first registers a callback function for each event type to detect. When the application starts a communication, it can submit the requests to poll (1) and wait for them or simply continue its computation. Periodically, the thread scheduler calls the I/O manager (2) in order to poll the network by calling the callback functions (3).
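The register / submit / poll cycle described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names `io_manager`, `io_register`, `io_submit`, `io_poll` and `countdown_cb` are hypothetical and are not PIOMan's API, and the NIC is replaced by a stand-in callback that completes after a given number of polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENT_TYPES 4
#define MAX_REQUESTS    16

/* Polling callback: returns true once the request has completed. */
typedef bool (*poll_cb)(void *payload);

struct io_request {
    int   type;     /* index of a registered event type */
    void *payload;  /* opaque handle handed to the callback */
    bool  done;
};

/* Hypothetical manager state (not PIOMan's real data structures). */
struct io_manager {
    poll_cb            callbacks[MAX_EVENT_TYPES];
    struct io_request *pending[MAX_REQUESTS];
    size_t             npending;
};

/* The application first registers one callback per event type. */
static void io_register(struct io_manager *m, int type, poll_cb cb)
{
    assert(type >= 0 && type < MAX_EVENT_TYPES && cb != NULL);
    m->callbacks[type] = cb;
}

/* Step (1): submit a request to poll. Returns false if the table is full. */
static bool io_submit(struct io_manager *m, struct io_request *r)
{
    assert(r->type >= 0 && r->type < MAX_EVENT_TYPES);
    assert(m->callbacks[r->type] != NULL);
    if (m->npending == MAX_REQUESTS)
        return false;
    r->done = false;
    m->pending[m->npending++] = r;
    return true;
}

/* Steps (2) and (3): invoked by the thread scheduler; polls every pending
 * request once through its callback. Returns the number of completions. */
static size_t io_poll(struct io_manager *m)
{
    size_t completed = 0, i = 0;
    while (i < m->npending) {
        struct io_request *r = m->pending[i];
        if (m->callbacks[r->type](r->payload)) {
            r->done = true;
            m->pending[i] = m->pending[--m->npending]; /* swap-remove */
            completed++;
        } else {
            i++;
        }
    }
    return completed;
}

/* Stand-in for a NIC test function: completes after *payload polls. */
static bool countdown_cb(void *payload)
{
    int *remaining = payload;
    return --(*remaining) <= 0;
}
```

An application thread that wants to wait rather than keep computing would test `r->done` after each scheduler-driven call to `io_poll`.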

We propose to manage the communication events in a dedicated controller linked to the thread scheduler for several reasons. Firstly, centralizing avoids the concurrency issues encountered when several threads try to poll the same network. Since the I/O manager has a global view of the pending requests, it can poll each request one after another. Moreover, the manager has the opportunity to aggregate multiple requests. If several threads are waiting for messages on a single network interface, it can be interesting to aggregate these requests when polling.
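One possible reading of request aggregation is sketched below: instead of issuing one poll per waiting request, the manager polls each network interface at most once and matches the result against all waiters. The NIC model (`struct nic`, `nic_poll`) and the tag-matching rule are assumptions made for the example; messages with no matching waiter are simply dropped here, which a real library would not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NICS 2
#define MAX_MSGS 8

struct request {
    int  nic;   /* interface the request is waiting on */
    int  tag;   /* message tag it expects */
    bool done;
};

/* Hypothetical NIC model: a queue of arrived message tags. */
struct nic {
    int      arrived[MAX_MSGS];
    size_t   head, tail;
    unsigned polls;          /* number of polls issued to this NIC */
};

/* One poll: returns the tag of the next arrived message, or -1 if none. */
static int nic_poll(struct nic *n)
{
    n->polls++;
    return n->head < n->tail ? n->arrived[n->head++] : -1;
}

/* Poll each NIC once on behalf of all its waiters, rather than once per
 * waiting request. Returns the number of requests completed. */
static size_t poll_aggregated(struct nic nics[], struct request reqs[],
                              size_t nreqs)
{
    size_t completed = 0;
    for (int id = 0; id < MAX_NICS; id++) {
        bool has_waiter = false;
        for (size_t i = 0; i < nreqs; i++) {
            if (!reqs[i].done && reqs[i].nic == id) {
                has_waiter = true;
                break;
            }
        }
        if (!has_waiter)
            continue;               /* nobody waits here: skip the poll */
        int tag = nic_poll(&nics[id]);
        if (tag < 0)
            continue;               /* nothing arrived yet */
        for (size_t i = 0; i < nreqs; i++) {
            if (!reqs[i].done && reqs[i].nic == id && reqs[i].tag == tag) {
                reqs[i].done = true;
                completed++;
                break;
            }
        }
    }
    return completed;
}
```

With three requests waiting on the same interface, a scheduler tick costs one poll instead of three.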

Secondly, the thread scheduler has the opportunity to preempt a comput[...]

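[Figure: diagram with the components Thread Scheduler, I/O Manager, Communication Library and NIC, the calls submit_request( ), wait_event( ), poll( ), polling callback( ) and mx_test, and the numbered steps 1, 2 and 3.]

Fig. 1. Example of interaction between the I/O manager and the MPI library.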

[Figure: three panels, (a) Regular execution, (b) Preventing a blocking syscall, (c) Rescuing ready threads, showing user threads t1, t2 and t3 running on LWPs bound to CPUs, a low-priority spare kernel thread, and a blocked kernel thread.]

Fig. 2. Low-priority, spare kernel-level threads are used to schedule remaining application threads in case a blocking syscall occurs during an I/O critical operation.

[...] I/O completion and thus make the communication progress. This is useful when the application performs asynchronous operations that require some processing once the communication ends. For example, in a rendezvous protocol, the receiver has to post a receiving request to synchronize with the sender. Once both sides are synchronized, the transfer can start: one side receives the data that the other side sends. In that case, the progression offered by the I/O manager and the thread scheduler makes it possible to completely overlap the communication with computation.
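The rendezvous case can be pictured as a small state machine that only advances when someone runs a progress step. The sketch below is a single-process model under stated assumptions: the state names (RTS for "request to send", CTS for "clear to send") follow common usage and do not come from the text, and `rdv_progress` stands for the work a polling step would do on behalf of the application.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rendezvous states. */
enum rdv_state { RDV_IDLE, RDV_RTS_SENT, RDV_CTS_SENT, RDV_DATA_DONE };

struct rdv {
    enum rdv_state state;
    bool           recv_posted;  /* receiver has posted its receive request */
};

/* Sender side: start the handshake. */
static void rdv_send(struct rdv *r)
{
    assert(r->state == RDV_IDLE);
    r->state = RDV_RTS_SENT;
}

/* Receiver side: post the receive request, then go back to computing. */
static void rdv_post_recv(struct rdv *r)
{
    r->recv_posted = true;
}

/* One progress step, as a polling step could run it in the background.
 * Returns true once the data transfer has completed. */
static bool rdv_progress(struct rdv *r)
{
    switch (r->state) {
    case RDV_RTS_SENT:
        if (r->recv_posted)
            r->state = RDV_CTS_SENT;   /* both sides are now synchronized */
        break;
    case RDV_CTS_SENT:
        r->state = RDV_DATA_DONE;      /* the transfer itself */
        break;
    default:
        break;
    }
    return r->state == RDV_DATA_DONE;
}
```

If nothing calls `rdv_progress` while the receiver computes, the sender stays in `RDV_RTS_SENT` until the receiver finally waits; calling it from the scheduler is what lets the handshake complete during the computation.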

3.2 Passive waiting: interrupts

Passive monitoring through blocking system calls is tricky to implement in a two-level scheduler. Indeed, during regular execution of application threads, our scheduler binds exactly one kernel thread (also called Light Weight Process, LWP) per processor (Fig. 2-a), so that the scheduling of threads can be entirely performed at the user level. A blocking system call could therefore prevent a whole subset of user-level threads from running. To avoid this and keep the reaction time low, we proceed as follows.

Before executing a (potentially blocking) I/O system call, the client [...] thread runs at a very low priority, it will not be scheduled until the previous kernel thread blocks. Thus, if the system call completes without blocking, the I/O client will continue its execution with a very high priority, as requested. At the end of the I/O section, the spare kernel thread simply returns to the sleep state. On the opposite, if the call blocks, the original kernel thread yields the CPU to the spare one (Fig. 2-c). Upon I/O completion, the NIC interrupt handler will wake up the original kernel thread that will, in turn, immediately continue the execution of the client thread. This way, the reactivity of the client thread is optimal.

Note that no modification to the underlying operating system is required, as opposed to solutions such as Scheduler Activations [9,11].

3.3 Active waiting: polling

In implementing active polling, our system carefully cooperates with the thread scheduler to avoid busy waiting and unnecessary context switches. Applications register new types of I/O events with some polling trigger(s) (at every context switch, after a period of time, when a CPU gets idle, etc.). The thread scheduler then invokes the I/O manager accordingly. However, these invocations occur in a restricted context with some classes of actions being prohibited (synchronization primitives, typically). Thus, they are similar to interrupt handlers within an operating system.
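Registering an event type with a set of polling triggers could look like the following sketch. The flag names and the `registry_*` functions are hypothetical; only the three triggers themselves (context switch, timer, idle CPU) come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trigger flags, one per scheduler hook named in the text. */
enum poll_trigger {
    TRIG_CTX_SWITCH = 1 << 0,  /* at every context switch */
    TRIG_TIMER      = 1 << 1,  /* after a period of time */
    TRIG_IDLE       = 1 << 2   /* when a CPU gets idle */
};

#define MAX_EVENTS 8

typedef void (*poll_fn)(void *arg);

struct event_type {
    unsigned triggers;  /* bitwise OR of enum poll_trigger values */
    poll_fn  fn;
    void    *arg;
};

struct registry {
    struct event_type ev[MAX_EVENTS];
    size_t            n;
};

/* Register a new event type; returns its index, or -1 if the table is full. */
static int registry_add(struct registry *r, unsigned triggers,
                        poll_fn fn, void *arg)
{
    assert(fn != NULL && triggers != 0);
    if (r->n == MAX_EVENTS)
        return -1;
    r->ev[r->n].triggers = triggers;
    r->ev[r->n].fn = fn;
    r->ev[r->n].arg = arg;
    return (int)r->n++;
}

/* Called from a scheduler hook: run every poll function registered for
 * trigger t. Returns how many functions were invoked. */
static size_t registry_fire(const struct registry *r, enum poll_trigger t)
{
    size_t fired = 0;
    for (size_t i = 0; i < r->n; i++) {
        if (r->ev[i].triggers & (unsigned)t) {
            r->ev[i].fn(r->ev[i].arg);
            fired++;
        }
    }
    return fired;
}

/* Stand-in poll function: counts its invocations. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A scheduler would call `registry_fire(&r, TRIG_TIMER)` from its timer hook, `registry_fire(&r, TRIG_IDLE)` from its idle loop, and so on; the functions invoked this way run in the restricted context and must stay short.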

Most of the I/O manager code is consequently run outside the restricted context in the form of tasklets [12]. Tasklets have been introduced in operating systems to defer treatments that cannot be performed within an interrupt handler. They run as soon as possible (they have a very high priority) when the scheduler reaches a point where it is safe to run tasklets. They have additional properties. Firstly, tasklets of the same type run under mutual exclusion, which simplifies the I/O manager code and even makes it more efficient. Secondly, the execution of tasklets can be enforced on a particular processor, which makes it possible to maximize cache affinity by running tasklets on the same processor as their client thread.
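The deferral pattern behind tasklets can be illustrated with a minimal single-threaded sketch: the restricted context only enqueues work, and the scheduler runs it later at a safe point. The names `tl_queue`, `tl_schedule` and `tl_run_pending` are hypothetical, and the sketch leaves out what a real implementation needs on an SMP machine (atomic operations, per-processor queues, priorities).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8

struct tasklet {
    void (*fn)(void *arg);
    void  *arg;
    bool   scheduled;  /* already queued: scheduling again is a no-op */
};

/* Fixed-size circular queue of pending tasklets. */
struct tl_queue {
    struct tasklet *items[QUEUE_LEN];
    size_t          head, count;
};

/* Callable from the restricted context: it only enqueues the tasklet.
 * Returns false if the queue is full. */
static bool tl_schedule(struct tl_queue *q, struct tasklet *t)
{
    assert(t->fn != NULL);
    if (t->scheduled)
        return true;                       /* coalesce duplicate requests */
    if (q->count == QUEUE_LEN)
        return false;
    q->items[(q->head + q->count) % QUEUE_LEN] = t;
    q->count++;
    t->scheduled = true;
    return true;
}

/* Called by the scheduler at a point where running tasklets is safe.
 * Returns the number of tasklets executed. */
static size_t tl_run_pending(struct tl_queue *q)
{
    size_t ran = 0;
    while (q->count > 0) {
        struct tasklet *t = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_LEN;
        q->count--;
        t->scheduled = false;  /* may be rescheduled from inside fn */
        t->fn(t->arg);
        ran++;
    }
    return ran;
}

/* Stand-in tasklet body: counts its executions. */
static void count_run(void *arg)
{
    ++*(int *)arg;
}
```

Because a tasklet that is already queued is not queued twice, a burst of triggers results in a single deferred execution.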

3.4 Handling of both interrupts and polling

Most of the network interfaces (MX/Myrinet, Infiniband Verbs, TCP sockets) provide both polling and interrupt-based functions to detect network I/O events. To ensure a good reactivity, our I/O manager uses one method or the other depending on the context: number of running threads and available CPUs. This kind of strategy has already been developed in Panda [13], but ours also takes into account the upper layer's preference: the communication library or the application has full knowledge of the request completion time. A smarter approach could also take into account the history of requests or their priorities. A similar method was developed in polling watchdog [14] but it required a specific kernel [...]
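A decision rule of this kind might be sketched as follows. The exact policy used by PIOMan is not given in the text, so the rule below (honour the upper layer's preference, otherwise poll when a CPU is free and block when the machine is overloaded) is an assumption, as are all the names.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_method { WAIT_POLLING, WAIT_INTERRUPT };

/* Hint from the communication library or the application. */
enum preference { PREF_NONE, PREF_POLLING, PREF_INTERRUPT };

/* Hypothetical selection policy, not PIOMan's actual one. */
static enum wait_method choose_method(int running_threads, int cpus,
                                      bool blocking_supported,
                                      enum preference pref)
{
    assert(running_threads >= 0 && cpus > 0);
    if (!blocking_supported)
        return WAIT_POLLING;      /* blocking calls unavailable: poll */
    if (pref == PREF_POLLING)
        return WAIT_POLLING;
    if (pref == PREF_INTERRUPT)
        return WAIT_INTERRUPT;
    /* No hint: poll while a CPU is free, block once the machine is
     * overloaded and a polling thread would rarely be scheduled. */
    return running_threads < cpus ? WAIT_POLLING : WAIT_INTERRUPT;
}
```

For example, with four CPUs and one running thread the sketch picks polling, while eight running threads on the same machine make it pick the interrupt-based method unless the upper layer asks otherwise.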

Sender              Receiver
get_time(t1);       MPI_Irecv(...);
MPI_Send(...);      compute();  /* approx. 50 ms computation */
get_time(t2);       MPI_Wait(...);

[Figure: sending time (µs, logarithmic axis from 0.1 to 1e+06) as a function of data size (bytes, from 32 B to 1 MB). Curves: OpenMPI; MPICH; MadMPI, no PIOMan; MadMPI + PIOMan/mono; MadMPI + PIOMan/SMP interrupt; MadMPI + PIOMan/SMP polling; no computation (reference).]

Fig. 3. MPI_Send time with MX.

4 Evaluation

We have evaluated the implementation of our I/O manager using the NewMadeleine [1] communication library and its built-in MPI implementation called Mad-MPI. The point-to-point nonblocking posting (isend, irecv) and completion (wait, test) operations of Mad-MPI are directly mapped to the equivalent operations of NewMadeleine. We performed benchmarks that evaluate the MPI asynchronous operation progression in background (communication/computation overlap) and benchmarks that evaluate the overhead of PIOMan. All these experiments have been carried out on a set of two dual-core 1.8 GHz Opteron boxes interconnected through Myri-10G NICs with the MX 1.2.1 driver providing a latency of 2.3 µs.

MPI asynchronous progression of communications. To evaluate the MPI asynchronous progression, we use the benchmark program listed in Table 1. This program attempts to overlap communication and computation on the receiver side. We record the time spent in sending and we compare the results to a reference obtained with Mad-MPI.

Figure 3 shows the sending time (time spent in MPI_Send) we measured over MX/Myrinet with Mad-MPI, OpenMPI 1.2.1, and MPICH/MX 1.2.7. We measured similar results over other network types (Infiniband and [...]

            no-thread    mono        SMP
polling     0.038 µs     0.085 µs    0.142 µs
interrupt   -            -           1.68 µs

[...] to the network latency. For larger messages, when a rendez-vous is performed, we observe three different behaviors:

no asynchronous progress: OpenMPI and plain Mad-MPI do not support background progress of the rendez-vous handshake. Therefore, the sender is blocked until the receiver reaches the MPI_Wait. MPICH makes the handshake progress thanks to the MX progression thread but in the current implementation, the notification of the transfer is not overlapped.

coarse-grained interleaved progress: PIOMan/mono tasklets are scheduled upon timer interrupt, every 10 ms. We observe that the delay to complete the rendez-vous is now bounded by 10 ms instead of the full computation time.

full overlap: PIOMan/SMP is able to schedule tasklets on another LWP, thus we get a full overlap of communication and computation. We observe on the figure that the rendez-vous performance does not suffer from the computation on the receiver side.
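The three behaviors can be summarized by a toy model of the worst-case extra delay the sender sees before the rendez-vous handshake is answered. This is an idealized reading of the observations above, not a measurement: the function name and the assumption that full overlap adds no delay at all are simplifications made for the example.

```c
#include <assert.h>

enum progress_mode { NO_ASYNC_PROGRESS, COARSE_INTERLEAVED, FULL_OVERLAP };

/* Worst-case extra handshake delay in milliseconds, given the length of the
 * receiver's computation and the period of the timer that runs tasklets. */
static double handshake_delay_bound_ms(enum progress_mode mode,
                                       double compute_ms, double tick_ms)
{
    assert(compute_ms >= 0.0 && tick_ms > 0.0);
    switch (mode) {
    case NO_ASYNC_PROGRESS:
        return compute_ms;  /* answered only when the receiver waits */
    case COARSE_INTERLEAVED:
        return tick_ms < compute_ms ? tick_ms : compute_ms;
    case FULL_OVERLAP:
        return 0.0;         /* idealized: progress runs on another LWP */
    }
    return compute_ms;
}
```

With the benchmark's computation of about 50 ms and a 10 ms timer, the model gives bounds of 50 ms, 10 ms and 0 ms for the three modes.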

We conclude that PIOMan is able to actually overlap MPI communication and computation while OpenMPI, MPICH, and plain Mad-MPI were not able to make communication progress asynchronously.

Overhead evaluation. We have evaluated the overhead of the I/O manager with empty polling and blocking functions. The results are shown in Table 2. The polling overhead differs from one version to the other. This is due to the cost of synchronization being different in each version. The interrupt overhead has only been evaluated on the SMP version since only this version implements the mechanism. We observe that the overhead is negligible for polling. On the other hand, the cost of blocking calls (interrupts) is quite high due to the awakening of the sleeping LWP and the communication between LWPs. However, interrupts are supposed to be used when the CPU is doing computation, where the delay would have been several orders of magnitude higher without interrupts.

5 Conclusions and Future Work

Overlapping MPI communications and computation does matter if we do not want to waste thousands of CPU cycles. However, making communications progress efficiently is not as simple as adding a communication thread. In this paper, we have proposed a generic and portable communication events manager that is able to actually overlap communication and computation. This I/O manager is able to handle both active polling and [...] I/O manager, as opposed to other widespread MPIs.

In the near future, we plan to use PIOMan inside other MPI implementations such as MPICH-2 or communication frameworks like PadicoTM. We also intend to make a more efficient use of NUMA architectures by trying to execute polling tasklets on the most suitable CPU given the architecture topology.

References

1. Aumage, O., Brunet, E., Furmento, N., Namyst, R.: Newmadeleine: a fast communication scheduling engine for high performance networks. In: CAC 2007: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2007

2. Sancho, J.C., et al.: Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications. In: SC 06, Tampa, FL, IEEE Computer Society (2006)

3. Doerfler, D., Brightwell, R.: Measuring MPI send and receive overhead and application availability in high performance network interfaces. In: EuroPVM/MPI. (2006) 331-338

4. ANL, MCS Division: MPICH-2 Home Page (2007) http://www.mcs.anl.gov/mpi/mpich/.

5. The Open MPI Project: Open MPI: Open Source High Performance Computing (2007) http://www.open-mpi.org/.

6. Graham, R.L., et al.: Open MPI: A high-performance, heterogeneous MPI. In: Proceedings, Fifth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks, Barcelona, Spain (September 2006)

7. Quadrics Ltd.: Elan Programming Manual (2003) http://www.quadrics.com/.

8. Myricom Inc.: Myrinet EXpress (MX): A High Performance, Low-level, Message-Passing Interface for Myrinet (2003) http://www.myri.com/scs/.

9. Anderson, T.E., Bershad, B.N., Lazowska, E.D., Levy, H.M.: Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Trans. Comput. Syst. 10(1) (1992) 53-79

10. Runtime Team, LaBRI-Inria Futurs: Marcel: A POSIX-compliant thread library for hierarchical multiprocessor machines (2007) http://runtime.futurs.inria.fr/marcel/.

11. Danjean, V., Namyst, R., Russell, R.: Integrating kernel activations in a multithreaded runtime system on Linux. In: Parallel and Distributed Processing. Proc. 4th Workshop on Runtime Systems for Parallel Programming (RTSPP '00). (2000)

12. Russel, P.: Unreliable guide to hacking the linux kernel (2000)

13. Langendoen, K., Romein, J., Bhoedjang, R., Bal, H.: Integrating polling, interrupts, and thread management. frontiers 00 (1996) 13

14. Maquelin, O., et al.: Polling watchdog: combining polling and interrupts for efficient message handling. In: ISCA '96: Proceedings of the 23rd annual international symposium on Computer architecture.