HAL Id: inria-00177167
https://hal.inria.fr/inria-00177167
Submitted on 5 Oct 2007
Improving Reactivity and Communication Overlap in MPI using a Generic I/O Manager
François Trahay, Alexandre Denis, Olivier Aumage, Raymond Namyst
To cite this version:
François Trahay, Alexandre Denis, Olivier Aumage, Raymond Namyst. Improving Reactivity and
Communication Overlap in MPI using a Generic I/O Manager. EuroPVM/MPI 2007, Oct 2007,
Paris, France. pp.170-177, 10.1007/978-3-540-75416-9_27. inria-00177167
INRIA, LaBRI, Université Bordeaux 1
351, cours de la Libération
F-33405 TALENCE, France
{trahay,denis,aumage,namyst}@labri.fr
Abstract. MPI applications may waste thousands of CPU cycles if they
do not efficiently overlap communications and computation. In this
paper, we present a generic and portable I/O manager that is able to
make communication progress asynchronously using tasklets. It chooses
automatically the most appropriate communication method, depending on
the context: multi-threaded application or not, SMP machine or not. We
have implemented and evaluated our I/O manager with Mad-MPI, our own
MPI implementation, and compared it to other existing MPI
implementations regarding the ability to efficiently overlap
communication and computation.
Keywords: Polling, Interrupt, Thread, Scheduler, High-Speed Network
1 Introduction
Asynchronism is becoming ubiquitous in modern communication runtimes.
This evolution is the combined result of multiple factors. Firstly,
communication subsystems implement increasingly complex optimizations
in order to make better use of networking hardware. As we have shown
in [1], such optimizations require online analysis of the
communication schemes and hence require the de-synchronization of the
communication request submission from its processing. Moreover,
providing rich functionality such as communication flow multiplexing
or transparent multi-method, heterogeneous networking implies that the
runtime system should again take an active part in-between the
communication request submission and its processing. And finally,
overlapping communication with computation and being reactive matter
more now than ever [2,3]. The latency of network transactions is in
the order of magnitude of several thousand CPU cycles at least.
Everything must therefore be done to prevent independent computations
from being blocked by an ongoing network transaction. This is even
more true with the increasingly dense SMP, multicore, SMT (also known
as Intel's Hyperthreading) architectures where many computing units
share a few NICs.
Since portability is one of the most important requirements for
communication [...] communication runtimes indeed are starting to make
use of threads internally and also allow applications to be
multithreaded, as can be seen with both MPICH-2 [4] and Open MPI
[5,6]. Low level communication libraries such as Quadrics' Elan [7]
and Myricom's MX [8] also make use of multithreading. Such an
introduction of threads inside communication subsystems does not come
without trouble, however. The fact that multithreading is still
usually optional with these runtimes is symptomatic of the difficulty
of getting the benefits of multithreading in the context of networking
without suffering from the potential drawbacks.
In this paper, we analyze the two fundamental approaches of
integrating multithreading and communications: interrupts and polling.
We study their respective benefits and their potential drawbacks, and
we discuss the importance of the cooperation between the asynchronous
event management code and the thread scheduling code in order to avoid
such disadvantages. We then introduce our proposal for symbiotically
combining both approaches inside a new generic network I/O event
manager. The paper is organized as follows. Section 2 exposes the
problem of integrating threads and communications. Section 3
introduces our proposal for a new asynchronous event management model
and gives details about our implementation. We evaluate this
implementation in Section 4 and Section 5 concludes and gives an
insight of ongoing and future work.
2 Integrating threads and communication: the problems of network I/O
events management
The detection of network I/O events can be achieved by two main
strategies. The most common approach is active waiting: a polling
function is called repeatedly until a network I/O event is detected.
The polling function is usually inexpensive, but repeating this
operation thousands of times may be prohibitive. The other method for
detecting communication events is passive waiting, which is based on
blocking calls. In that case, the NIC informs the operating system
that a network I/O event has occurred by using an interrupt, making
this method much more reactive than polling. However, this operation
involves interrupt handlers and context switches, which are rather
costly.
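
The active-waiting strategy described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `poll_fn`, `active_wait` and `fake_poll` are hypothetical names, not functions of MX, Mad-MPI or PIOMan, and the network event is simulated by a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical polling function type: returns true once the event
 * has been detected. */
typedef bool (*poll_fn)(void *arg);

/* Active waiting: call the polling function repeatedly until it
 * reports an event or until max_polls attempts have been made.
 * Returns the number of calls performed, or 0 if no event was seen. */
static unsigned long active_wait(poll_fn poll, void *arg,
                                 unsigned long max_polls)
{
    assert(poll != NULL);
    for (unsigned long n = 1; n <= max_polls; n++) {
        if (poll(arg))
            return n;   /* event detected on the n-th call */
    }
    return 0;           /* gave up: all polls were wasted CPU time */
}

/* Simulated event source (assumption): the event "arrives" once the
 * counter pointed to by arg has been decremented to zero. */
static bool fake_poll(void *arg)
{
    unsigned long *remaining = arg;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

With a counter initialized to 3, `active_wait(fake_poll, &counter, 10)` needs four calls before the event is seen; each individual call is cheap, but the loop keeps the CPU busy for the whole wait.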
The best method to use depends on the application, but in both cases,
some behaviors may lead to suboptimal performance. When using
interrupt-based methods, priority issues may occur: the thread that is
waiting for the communication event may be scheduled with some delay.
This is the case when, for example, it has been computing for a long
period before it blocks, lowering its priority. Moreover, the system
has to support methods to detect the network I/O events. For instance,
in a pure user-level scheduler, interrupt-driven blocking calls are
prohibited (unless a specific OS extension like the Scheduler
Activations [9] is used).
Using polling methods can also be problematic: if the system is
overloaded (i.e. there are more running threads than available CPUs),
the polling thread may scarcely be scheduled, thus increasing the
reaction [...] a regular polling so that the preliminary phase makes
progress. As shown in [2], some applications would significantly
improve their execution time by efficiently overlapping communication
and computation, which requires polling communication events
regularly.
3 An I/O manager model
To resolve these kinds of problems, we propose an I/O manager that
provides the communication runtime systems with a network event
detection service. Thus, communication libraries themselves become
independent of the multithread issues and related hardware issues such
as the number of CPUs. Thereby, they can focus their efforts on
communication optimizations and other functionalities. By working
closely with a specific thread scheduler, the I/O manager can be
viewed as a progression engine able to schedule a communicating thread
when needed or to dynamically adapt the polling frequency to maximize
the reactivity/overhead ratio. The I/O manager handles both polling
and interrupt-based methods, switching from one method to another
depending on the context.
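
As an illustration of switching between methods depending on the context, the sketch below picks a detection method from the number of running threads and available CPUs. The exact rule is an assumption made for this example (the text only says the choice depends on the context), and `detect_method` and `choose_method` are hypothetical names, not part of PIOMan.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { METHOD_POLLING, METHOD_INTERRUPT } detect_method;

/* Hypothetical policy (assumption, not the actual PIOMan rule):
 * - if blocking calls are not supported (e.g. a pure user-level
 *   scheduler), polling is the only option;
 * - if the machine is overloaded (more running threads than CPUs),
 *   a polling thread may scarcely be scheduled, so prefer interrupts;
 * - otherwise poll, since polling is the cheaper operation. */
static detect_method choose_method(int running_threads,
                                   int available_cpus,
                                   bool blocking_supported)
{
    assert(running_threads >= 0 && available_cpus > 0);
    if (!blocking_supported)
        return METHOD_POLLING;
    if (running_threads > available_cpus)
        return METHOD_INTERRUPT;
    return METHOD_POLLING;
}
```

For example, with 8 running threads on 4 CPUs this rule selects interrupts, whereas 2 threads on 4 CPUs leave spare processors and polling is kept.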
The implementation of our I/O manager, called PIOMan (PM2 I/O
Manager), relies on a two-level thread scheduler [10] which was
slightly modified to interact with the I/O manager when necessary. The
use of a two-level scheduler allows precise control of thread
scheduling at the user level, with almost no explicit (and expensive)
interaction with the OS. This way, we can dynamically favour the
scheduling of a thread requiring a high reactivity to communication
events during a fixed period. PIOMan is available in three main
versions: no-thread, mono (user-level threads) or SMP (user threads on
top of kernel threads).
3.1 Overview of the I/O manager
The mechanism of our I/O manager is described through an example shown
in Figure 1: the application first registers a callback function for
each event type to detect. When the application starts a
communication, it can submit the requests to poll (1) and wait for
them or simply continue its computation. Periodically, the thread
scheduler calls the I/O manager (2) in order to poll the network by
calling the callback functions (3).
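
The register / submit / poll cycle described above can be sketched as a small table-driven manager. This is a minimal sketch under stated assumptions, not the PIOMan API: all names (`io_manager`, `io_register`, `io_submit`, `io_poll`, `io_pending`, `countdown_poll`) and the fixed table sizes are hypothetical, and the network test function is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TYPES    4   /* hypothetical limits chosen for the sketch */
#define MAX_REQUESTS 16

/* Polling callback: returns true once the given request completed. */
typedef bool (*poll_callback)(void *request);

typedef struct {
    int   type;      /* event type the request belongs to */
    void *request;   /* opaque handle owned by the library */
    bool  pending;
} io_request;

typedef struct {
    poll_callback callbacks[MAX_TYPES];
    io_request    requests[MAX_REQUESTS];
} io_manager;

static void io_init(io_manager *m)
{
    for (int t = 0; t < MAX_TYPES; t++)
        m->callbacks[t] = NULL;
    for (int i = 0; i < MAX_REQUESTS; i++)
        m->requests[i].pending = false;
}

/* Register the callback used to detect events of a given type. */
static void io_register(io_manager *m, int type, poll_callback cb)
{
    assert(type >= 0 && type < MAX_TYPES && cb != NULL);
    m->callbacks[type] = cb;
}

/* Step (1): submit a request to poll. Returns a slot id, -1 if full. */
static int io_submit(io_manager *m, int type, void *request)
{
    assert(type >= 0 && type < MAX_TYPES && m->callbacks[type] != NULL);
    for (int i = 0; i < MAX_REQUESTS; i++) {
        if (!m->requests[i].pending) {
            m->requests[i] = (io_request){ type, request, true };
            return i;
        }
    }
    return -1;
}

/* Steps (2) and (3): called by the scheduler; polls each pending
 * request in turn through the callback of its type.
 * Returns the number of requests that completed during this pass. */
static int io_poll(io_manager *m)
{
    int completed = 0;
    for (int i = 0; i < MAX_REQUESTS; i++) {
        io_request *r = &m->requests[i];
        if (r->pending && m->callbacks[r->type](r->request)) {
            r->pending = false;
            completed++;
        }
    }
    return completed;
}

static bool io_pending(const io_manager *m, int id)
{
    assert(id >= 0 && id < MAX_REQUESTS);
    return m->requests[id].pending;
}

/* Stand-in for a network test function (assumption): the request
 * completes once its counter has been decremented to zero. */
static bool countdown_poll(void *arg)
{
    int *remaining = arg;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

Because every pending request goes through a single `io_poll` pass, the requests are polled one after another from one place rather than by competing threads.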
We propose to manage the communication events in a dedicated
controller linked to the thread scheduler for several reasons.
Firstly, centralizing avoids the concurrency issues encountered when
several threads try to poll the same network. Since the I/O manager
has a global view of the pending requests, it can poll each request
one after another. Moreover, the manager has the opportunity to
aggregate multiple requests. If several threads are waiting for
messages on a single network interface, it can be interesting to
aggregate these requests when polling.
Secondly, the thread scheduler has the opportunity to preempt a
comput[...]
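[Figure 1: diagram showing the Thread Scheduler, the I/O Manager, the
Communication Library and the NIC, linked by the numbered steps 1, 2
and 3; labels: submit_request(), wait_event(), poll(), polling
callback(), mx_test.]
Fig. 1. Example of interaction between the I/O manager and the MPI
library.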
[Figure 2: three panels showing user-level threads t1, t2, t3
scheduled on LWPs bound to CPUs, with a low-priority spare kernel
thread and a blocked kernel thread: (a) Regular execution,
(b) Preventing a blocking syscall, (c) Rescuing ready threads.]
Fig. 2. Low priority, spare kernel-level threads are used to schedule
remaining application threads in case a blocking syscall occurs during
an I/O critical operation.
I/O completion and thus make the communication progress. This is
useful when the application performs asynchronous operations that
require some processing once the communication ends. For example, in a
rendez-vous protocol, the receiver has to post a receiving request to
synchronize with the sender. Once both sides are synchronized, the
transfer can start: one side receives the data that the other side
sends. In that case, the progression offered by the I/O manager and
the thread scheduler makes it possible to completely overlap the
communication with computation.
3.2 Passive waiting: interrupts
Passive monitoring through blocking system calls is tricky to
implement in a two-level scheduler. Indeed, during regular execution
of application threads, our scheduler binds exactly one kernel thread
(also called Light Weight Process, LWP) per processor (Fig. 2-a), so
that the scheduling of threads can be entirely performed at the
user-level. A blocking system call could therefore prevent a whole
subset of user-level threads from running. To avoid this and keep the
reaction time low, we proceed as follows.
Before executing a (potentially blocking) I/O system call, the client
[...] thread runs at a very low priority; it will not be scheduled
until the previous kernel thread blocks. Thus, if the system call
completes without blocking, the I/O client will continue its execution
with a very high priority, as requested. At the end of the I/O
section, the spare kernel thread simply returns to the sleep state.
Conversely, if the call blocks, the original kernel thread yields the
CPU to the spare one (Fig. 2-c). Upon I/O completion, the NIC
interrupt handler will wake up the original kernel thread that will,
in turn, immediately continue the execution of the client thread. This
way, the reactivity of the client thread is optimal.
Note that no modification to the underlying operating system is
required, as opposed to solutions such as Scheduler Activations
[9,11].
3.3 Active waiting: polling
In implementing active polling, our system carefully cooperates with
the thread scheduler to avoid busy waiting and unnecessary context
switches. Applications register new types of I/O events with some
polling trigger(s) (at every context switch, after a period of time,
when a CPU gets idle, etc.). The thread scheduler then invokes the I/O
manager accordingly. However, these invocations occur in a restricted
context with some classes of actions being prohibited (synchronization
primitives, typically). Thus, they are similar to interrupt handlers
within an operating system.
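
One simple way to represent the polling triggers attached to an event type is a bitmask that the scheduler checks at each trigger point. The sketch below is an assumption made for illustration: `poll_trigger`, `event_type` and `should_poll` are hypothetical names and do not come from PIOMan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trigger points named in the text, encoded as bit flags. */
typedef enum {
    TRIGGER_CONTEXT_SWITCH = 1 << 0,  /* at every context switch */
    TRIGGER_TIMER          = 1 << 1,  /* after a period of time  */
    TRIGGER_IDLE           = 1 << 2   /* when a CPU gets idle    */
} poll_trigger;

/* Hypothetical descriptor of a registered I/O event type. */
typedef struct {
    unsigned triggers;   /* OR-ed poll_trigger flags */
} event_type;

/* Called by the scheduler at a trigger point: should the I/O manager
 * be invoked for this event type? */
static bool should_poll(const event_type *t, poll_trigger occurred)
{
    assert(t != NULL);
    return (t->triggers & (unsigned)occurred) != 0;
}
```

An event type registered with `TRIGGER_TIMER | TRIGGER_IDLE` is polled on timer ticks and on idle CPUs, but not at context switches.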
Most of the I/O manager code is consequently run outside the
restricted context in the form of tasklets [12]. Tasklets have been
introduced in operating systems to defer treatments that cannot be
performed within an interrupt handler. They run as soon as possible
(they have a very high priority) when the scheduler reaches a point
where it is safe to run tasklets. They have additional properties.
Firstly, tasklets of the same type run under mutual exclusion, which
simplifies the I/O manager code and even makes it more efficient.
Secondly, the execution of tasklets can be enforced on a particular
processor, which makes it possible to maximize cache affinity by
running tasklets on the same processor as their client thread.
3.4 Handling of both interrupts and polling
Most of the network interfaces (MX/Myrinet, Infiniband Verbs, TCP
sockets) provide both polling and interrupt-based functions to detect
network I/O events. To ensure a good reactivity, our I/O manager uses
one method or the other depending on the context: number of running
threads and available CPUs. This kind of strategy has already been
developed in Panda [13], but ours also takes into account the upper
layer's preference: the communication library or the application has
full knowledge of the request completion time. A smarter approach
could also take into account the history of requests or their
priorities. A similar method was developed in polling watchdog [14]
but it required a specific kernel [...]
Table 1. Benchmark program.

Sender                  Receiver
get_time(t1);           MPI_Irecv(...);
MPI_Send(...);          compute();
get_time(t2);           /* approx. 50 ms computation */
                        MPI_Wait(...);
[Figure 3: plot of the sending time (µs, logarithmic scale) against
the data size (bytes, from 32 B to 1 MB). Series: OpenMPI; MPICH;
MadMPI, no PIOMan; MadMPI + PIOMan/mono; MadMPI + PIOMan/SMP
interrupt; MadMPI + PIOMan/SMP polling; no computation (reference).]
Fig. 3. MPI_Send time with MX.
4 Evaluation
We have evaluated the implementation of our I/O manager using the
NewMadeleine [1] communication library and its built-in MPI
implementation called Mad-MPI. The point-to-point nonblocking posting
(isend, irecv) and completion (wait, test) operations of Mad-MPI are
directly mapped to the equivalent operations of NewMadeleine. We
performed benchmarks that evaluate the MPI asynchronous operation
progression in background (communication/computation overlap) and
benchmarks that evaluate the overhead of PIOMan. All these experiments
have been carried out on a set of two dual-core 1.8 GHz Opteron boxes
interconnected through Myri-10G NICs with the MX 1.2.1 driver
providing a latency of 2.3 µs.
MPI asynchronous progression of communications. To evaluate the MPI
asynchronous progression, we use the benchmark program listed in
Table 1. This program attempts to overlap communication and
computation on the receiver side. We record the time spent in sending
and we compare the results to a reference obtained with Mad-MPI.
Figure 3 shows the sending time (time spent in MPI_Send) we measured
over MX/Myrinet with Mad-MPI, OpenMPI 1.2.1, and MPICH/MX 1.2.7. We
measured similar results over other network types (Infiniband and
[...]
Table 2. Overhead of the I/O manager.

            no-thread   mono        SMP
polling     0.038 µs    0.085 µs    0.142 µs
interrupt   -           -           1.68 µs
to the network latency. For larger messages, when a rendez-vous is
performed, we observe three different behaviors:
no asynchronous progress: OpenMPI and plain Mad-MPI do not support
  background progress of rendez-vous handshake. Therefore, the sender
  is blocked until the receiver reaches the MPI_Wait. MPICH makes the
  handshake progress thanks to the MX progression thread but in the
  current implementation, the notification of the transfer is not
  overlapped.
coarse-grained interleaved progress: PIOMan/mono tasklets are
  scheduled upon timer interrupt, every 10 ms. We observe that the
  delay to complete the rendez-vous is now bounded by 10 ms instead of
  the full computation time.
full overlap: PIOMan/SMP is able to schedule tasklets on another LWP,
  thus we get a full overlap of communication and computation. We
  observe on the figure that the rendez-vous performance does not
  suffer from the computation on the receiver side.
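
The three behaviors above can be summarized by a very simplified model of how long the rendez-vous handshake may be delayed on the sender side. This is a sketch under stated assumptions, not a measurement: it ignores network latency and transfer time, and `progress_mode` and `handshake_delay_bound_us` are hypothetical names introduced for the illustration.

```c
#include <assert.h>

typedef enum {
    PROGRESS_NONE,   /* no asynchronous progress                   */
    PROGRESS_TIMER,  /* tasklets scheduled upon timer interrupt    */
    PROGRESS_FULL    /* tasklets scheduled on another LWP          */
} progress_mode;

/* Simplified upper bound (in microseconds) on the delay added to the
 * rendez-vous handshake by the receiver's computation. */
static long handshake_delay_bound_us(progress_mode mode,
                                     long compute_us,
                                     long timer_period_us)
{
    assert(compute_us >= 0 && timer_period_us > 0);
    switch (mode) {
    case PROGRESS_NONE:
        /* The handshake waits until the receiver reaches MPI_Wait. */
        return compute_us;
    case PROGRESS_TIMER:
        /* The handshake waits at most for the next timer tick. */
        return compute_us < timer_period_us ? compute_us
                                            : timer_period_us;
    case PROGRESS_FULL:
        /* Another LWP handles the handshake immediately. */
        return 0;
    }
    return compute_us;  /* unreachable for valid modes */
}
```

With the values from the benchmark (about 50 ms of computation, a 10 ms timer), the model gives bounds of 50000 µs, 10000 µs and 0 µs respectively, which matches the ordering of the three behaviors.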
We conclude that PIOMan is able to actually overlap MPI communication
and computation while OpenMPI, MPICH, and plain Mad-MPI were not able
to make communication progress asynchronously.
Overhead evaluation. We have evaluated the overhead of the I/O manager
with empty polling and blocking functions. The results are shown in
Table 2. The polling overhead differs from one version to the other.
This is due to the cost of synchronization being different in each
version. The interrupt overhead has only been evaluated on the SMP
version since only this version implements the mechanism. We observe
that the overhead is negligible for polling. On the other hand, the
cost of blocking calls (interrupts) is quite high due to the awakening
of the sleeping LWP and the communication between LWPs. However,
interrupts are supposed to be used when the CPU is doing computation,
where the delay would have been several orders of magnitude higher
without interrupts.
5 Conclusions and Future Work
Overlapping MPI communications and computation does matter if we do
not want to waste thousands of CPU cycles. However, making
communications progress efficiently is not as simple as adding a
communication thread. In this paper, we have proposed a generic and
portable communication events manager that is able to actually overlap
communication and computation. This I/O manager is able to handle both
active polling and [...] I/O manager, as opposed to other widespread
MPIs.
In the near future, we plan to use PIOMan inside other MPI
implementations such as MPICH-2 or communication frameworks like
PadicoTM. We also intend to make a more efficient use of NUMA
architectures by trying to execute polling tasklets on the most
suitable CPU given the architecture topology.
References
1. Aumage, O., Brunet, E., Furmento, N., Namyst, R.: NewMadeleine: a
   fast communication scheduling engine for high performance networks.
   In: CAC 2007: Workshop on Communication Architecture for Clusters,
   held in conjunction with IPDPS 2007
2. Sancho, J.C., et al.: Quantifying the potential benefit of
   overlapping communication and computation in large-scale scientific
   applications. In: SC06, Tampa, FL, IEEE Computer Society (2006)
3. Doerfler, D., Brightwell, R.: Measuring MPI send and receive
   overhead and application availability in high performance network
   interfaces. In: EuroPVM/MPI. (2006) 331–338
4. ANL, MCS Division: MPICH-2 Home Page (2007)
   http://www.mcs.anl.gov/mpi/mpich/.
5. The Open MPI Project: Open MPI: Open Source High Performance
   Computing (2007) http://www.open-mpi.org/.
6. Graham, R.L., et al.: Open MPI: A high-performance, heterogeneous
   MPI. In: Proceedings, Fifth International Workshop on Algorithms,
   Models and Tools for Parallel Computing on Heterogeneous Networks,
   Barcelona, Spain (September 2006)
7. Quadrics Ltd.: Elan Programming Manual (2003)
   http://www.quadrics.com/.
8. Myricom Inc.: Myrinet EXpress (MX): A High Performance, Low-level,
   Message-Passing Interface for Myrinet (2003)
   http://www.myri.com/scs/.
9. Anderson, T.E., Bershad, B.N., Lazowska, E.D., Levy, H.M.:
   Scheduler activations: effective kernel support for the user-level
   management of parallelism. ACM Trans. Comput. Syst. 10(1) (1992)
   53–79
10. Runtime Team, LaBRI-Inria Futurs: Marcel: A POSIX-compliant thread
    library for hierarchical multiprocessor machines (2007)
    http://runtime.futurs.inria.fr/marcel/.
11. Danjean, V., Namyst, R., Russell, R.: Integrating kernel
    activations in a multithreaded runtime system on Linux. In:
    Parallel and Distributed Processing. Proc. 4th Workshop on Runtime
    Systems for Parallel Programming (RTSPP '00). (2000)
12. Russel, P.: Unreliable guide to hacking the linux kernel (2000)
13. Langendoen, K., Romein, J., Bhoedjang, R., Bal, H.: Integrating
    polling, interrupts, and thread management. frontiers 00 (1996) 13
14. Maquelin, O., et al.: Polling watchdog: combining polling and
    interrupts for efficient message handling. In: ISCA '96:
    Proceedings of the 23rd annual international symposium on Computer
    architecture.