HAL Id: inria-00565361
https://hal.inria.fr/inria-00565361
Submitted on 11 Feb 2011
An Efficient and Transparent Thread Migration Scheme
in the PM2 Runtime System
Gabriel Antoniu, Luc Bougé, Raymond Namyst
To cite this version:
Gabriel Antoniu, Luc Bougé, Raymond Namyst. An Efficient and Transparent Thread Migration
Scheme in the PM2 Runtime System. Proceedings of the 11 IPPS/SPDP’99 Workshops Held in
Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on
Parallel and Distributed Processing, Apr 1999, Po Rico, Puerto Rico. pp. 496–510. inria-00565361
An efficient and transparent thread migration scheme in the PM2 runtime system
Gabriel Antoniu, Luc Bougé, and Raymond Namyst
LIP, ENS Lyon, 46, Allée d'Italie, 69364 Lyon Cedex 07, France.
Contact: {Gabriel.Antoniu, Luc.Bouge, Raymond.Namyst}@ens-lyon.fr.
Abstract. This paper describes a new iso-address approach to the dynamic
allocation of data in a multithreaded runtime system with thread migration
capability. The system guarantees that the migrated threads and their
associated static data are relocated exactly at the same virtual address on
the destination nodes, so that no post-migration processing is needed to keep
pointers valid. In the experiments reported, a thread can be migrated in less
than 75 µs.
1 Introduction
Multithreading has proven useful to implement massively parallel activities in
distributed systems, since it provides an efficient way of overlapping
communication and computation. When the application behavior is hardly
predictable at compile time, dynamic load balancing becomes essential. It can
be achieved by transparently migrating computation threads from the overloaded
nodes to the underloaded ones. In the implementation described below, a thread
can be migrated across the Myrinet network in less than 75 µs.
Migrating a thread usually means moving the thread stack, but sometimes may
also mean moving the static data used by the thread. In this respect, several
migration approaches have been implemented in the existing multithreaded
systems, depending on the rationale underlying the use of thread migration. In
Ariadne [8], threads are migrated to get close to the remote data they use.
Static data never moves. On migration, the thread stack is relocated at a
usually different address on the destination node, such that pointers need to
be updated. As shown in Section 2, several problems cannot be solved by this
approach. In Millipede [7], thread migration is directed by a load balancing
module integrated in the system, whereas static data get moved only when they
get accessed by remote threads. The threads and their data are always
relocated at the same virtual addresses on all nodes. Yet, thread creation is
expensive, therefore the number of concurrent threads is statically fixed at
initialization. In both systems mentioned above, data are shared and can be
accessed by more than one thread. UPVM [4] provides thread migration for PVM
applications, in order to support load balancing. Threads have private heaps,
for private dynamic allocations. Thread creation is expensive in this system,
too, since it is carried out by means
compiling. Consequently, our study focuses on the case in which data are not
shared: they belong to some unique thread and thus have to follow it on
migration. Our iso-address allocator has been implemented in the PM2
multithreaded runtime system [10], which serves as a runtime support for two
data-parallel compilers [1]. We target applications having to execute on
homogeneous clusters of workstations or PCs interconnected by a high-speed
network (e.g., Myrinet [9]).
This paper is structured as follows. In section 2, we give a quick description
of PM2: a multithreaded runtime system providing thread migration. An overview
of our iso-address approach is given in section 3 and some implementation
details are presented in section 4. Section 5 shows some performance figures.
Finally, section 6 summarizes our main results and points out what we intend
to address in the near future.
2 PM2: a multithreaded runtime system with thread
migration
PM2 is a multithreaded runtime system especially designed to serve as a
runtime support for highly parallel irregular applications. In such
applications, threads may need to start or terminate at arbitrary moments
during the execution. At the same time, the system has to efficiently cope
with a large number of concurrent threads. Therefore, PM2 provides very
efficient primitives to handle these operations: creation, destruction and
context switching. A distinctive feature of PM2 is thread migration. Since the
execution of irregular applications may lead to severe load imbalances, thread
migration can be used to support the implementation of load balancing policies
based on dynamic activity redistribution.
In a PM2 application, there is a single (heavy) process running at each node
and each such process may contain tens of thousands of threads. We often
identify this container process with the node running it. At the simplest
level, a PM2 thread is an execution flow managing a set of resources, i.e., its
state descriptor and its private execution stack. The code to be executed by
the threads is replicated on each node (SPMD approach) and is not part of the
thread. Again, we emphasize that we do not consider the aspects of data
sharing between threads in this paper, nor the problem of a thread using
global process resources such as files, network interfaces, etc. In this
setting, migrating a thread simply means moving the thread resources from the
(heavy) process running on the local node to another (heavy) process located
on some remote node. In PM2, the migration operation is carried out in three
main steps:
1. The thread gets stopped (frozen) and its resources get copied to a
communication buffer. The memory area storing the resources is set free.
2. The buffer contents get sent to the destination node through the network.
point during its execution. It may also be preemptively migrated by another
thread running on the same node. This latter property is essential, since it
ensures that application threads may be transparently migrated across the
nodes. Consequently, a generic module implemented outside the running
application could balance the load by migrating the application threads. The
threads are unaware of their being migrated and keep on running irrespective
of their location.
Source code:
void p1()
{
  int x;
  x = 1;
  pm2_printf("value = %d\n", x);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", x);
}
Execution:
[node0] value = 1
[node1] value = 1
Fig. 1. Thread migration without pointers.
Source code:
void p2()
{
  int x;
  int *ptr = &x;
  x = 1;
  pm2_printf("value = %d\n", *ptr);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", *ptr);
}
Execution:
[node0] value = 1
Segmentation fault
Fig. 2. Thread migration in the presence of pointers to stack data
An example of thread migration is given in Figure 1. Assume that a thread
running on node 0 calls procedure p1. The thread declares a local variable x,
writes the value 1 to this variable, then prints it. Next, the thread migrates
to node 1 and prints the value of the variable x again. At run time, we can
see that the value 1 is displayed in both cases, before and after migration.
The local variable x gets automatically moved to node 1, since it is stored in
the thread stack.
A difficulty turns up as soon as a migrating thread makes use of pointers.
Such a situation is illustrated in Figure 2. Here, the thread which calls p2
reads variable x through pointer ptr. After migration, there is no guarantee
that variable x is still located at address ptr and the execution (most
probably!) fails.
One way to tackle this problem is to update all references to stack data after
migration, before the thread resumes its execution, by adding some offset to
all pointers. Two categories of pointers to stack data require such
post-migration processing: the implicit pointers generated by the compiler in
order to chain the stack frames and the explicit pointers used by the
programmer. The former
in the early versions of PM2, which provided primitives to register/unregister
user-level pointers. When a thread moved to another node, all its registered
pointers were updated (Figure 3).
Source code:
void p2()
{
  int x;
  int *ptr = &x;
  unsigned int key;
  key = pm2_register_pointer(&ptr);
  x = 1;
  pm2_printf("value = %d\n", *ptr);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", *ptr);
  pm2_unregister_pointer(key);
}
Execution:
[node0] value = 1
[node1] value = 1
Fig. 3. Thread migration with registered pointers
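The offset-based update described above can be illustrated with a minimal sketch. It is not the early PM2 implementation: the registry, the frame_t layout and the function names are hypothetical, the "migration" is a plain memcpy inside one process, and the code assumes that a data pointer and a uintptr_t have the same size and representation (true on mainstream platforms).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGISTERED 8

/* Simplified stand-in for a thread stack: a variable and a pointer to it. */
typedef struct {
    int  x;
    int *ptr;
} frame_t;

/* Hypothetical registry: offsets (from the stack base) of registered pointers. */
static size_t registered[MAX_REGISTERED];
static int    nb_registered = 0;

/* Remember where a pointer variable lives inside the stack image. */
static unsigned int register_pointer(const void *base, const void *location)
{
    assert(nb_registered < MAX_REGISTERED);
    registered[nb_registered] = (size_t)((uintptr_t)location - (uintptr_t)base);
    return (unsigned int)nb_registered++;
}

/* Add the relocation offset to every registered pointer of the copied image. */
static void relocate_pointers(void *new_base, const void *old_base)
{
    uintptr_t delta = (uintptr_t)new_base - (uintptr_t)old_base;
    for (int i = 0; i < nb_registered; i++) {
        unsigned char *where = (unsigned char *)new_base + registered[i];
        uintptr_t value;
        memcpy(&value, where, sizeof value);
        value += delta;                       /* shift by the relocation offset */
        memcpy(where, &value, sizeof value);
    }
}

/* Simulate a migration: copy the frame elsewhere, then fix its pointer.
   Returns 1 if the relocated pointer designates the relocated variable. */
static int demo_relocation(void)
{
    frame_t origin, destination;
    origin.x = 1;
    origin.ptr = &origin.x;
    nb_registered = 0;
    register_pointer(&origin, &origin.ptr);

    memcpy(&destination, &origin, sizeof origin);  /* ptr is now stale */
    relocate_pointers(&destination, &origin);

    return destination.ptr == &destination.x && *destination.ptr == 1;
}
```

Every pointer that is not registered stays stale after the copy, which is the weakness this scheme suffers from.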
Source code:
void p3()
{
  int *t = (int *)malloc(100 * sizeof(int));
  t[10] = 1;
  pm2_printf("value = %d\n", t[10]);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", t[10]);
}
Execution:
[node0] value = 1
Segmentation fault
Fig. 4. Thread migration with pointers to heap data
Clearly, this approach does not extend to complex applications. Moreover, it
does not cope with resources located outside of the stack, such as heap data
dynamically allocated by the malloc primitive of the C language. Figure 4
shows a thread which calls malloc to allocate some memory area, writes
(potentially large) data into this area, migrates, and eventually tries to
read at the same virtual address. The program obviously fails, since the
allocated data has not been migrated.
One way to solve this problem consists in remallocating the data on the
destination node. In this case, the programmer has to explicitly handle the
data packing and unpacking, and to manage the pointer updating, as the
allocation address is usually different from the original one. As in the case
of pointers to stack data, this approach cannot be used for arbitrarily
complex applications making use of a large number of pointers to heap data.
Moreover, this approach cannot cope with compiler-generated pointers in case
optimization options are used, since such pointers are not registered and
cannot be updated. Fundamental
3.1 General overview
A much better approach to the problem described in the previous section is to
provide a mechanism able to guarantee that both the stack and the private,
dynamically allocated data of a thread can be migrated and remallocated at the
same virtual address on the destination node (iso-address allocation and
migration). The idea is to locally allocate storage areas in a system-wide,
globally consistent way. The allocation mechanism must guarantee that each
range of virtual addresses at which memory has been mmapped at some node is
kept free on all the other nodes. Such an approach has several advantages.
Simplicity The migration mechanism is simplified, because no post-migration
pointer update is necessary any longer.
Transparency Applications may make free use of pointers without having to
take into account possible problems related to thread migration. User-level
pointers are always guaranteed to be safe.
Portability No compiler knowledge about the thread stack structure is
required, since the stack contents remain exactly the same after migration. In
particular, compiler-generated pointers are migration-safe, too. Consequently,
any compiler may be used and compiler optimizations are allowed.
Preemptiveness Preemptive migration is possible, given that no assumption is
made about the thread state at migration time.
The isomalloc allocation mechanism relies on a few basic principles. These
rules ensure that each node may use its globally reserved memory without
having to inform the other nodes. We thus avoid any synchronization when
allocating memory to threads.
1. The physical execution environment is assumed to be homogeneous (same type
of processor, same operating system). Moreover, all nodes have the same
memory mapping: the same binary code is loaded on each of them at the same
virtual address (so that no code needs to be moved upon migration). The
(unique) process stack is also located at the same virtual address on all
nodes.
2. On each node, all iso-address allocations take place within a special
address range called iso-address area. We have located it between the process
stack and the heap (Figure 5). This zone corresponds to the same virtual range
on all nodes.
3. Separate ranges of virtual addresses within the iso-address area are
globally reserved for each node, so that each address may be used by a single
node at a time.
4. The actual memory allocation is carried out locally, within an address range
[Figure 5: memory layout diagram. Labels: code, data, "fixed at compile time",
local heap, iso-address area, (UNIX) process stack. Size annotations:
~ 3.5 GB, << 500 MB, < 100 kB.]
Fig. 5. All nodes have the same memory mapping. In particular, the iso-address
area covers the same virtual address range on all nodes
3.2 The slot layer
In this improved view, a PM2 thread is an execution flow managing a set of
resources, i.e., its state descriptor, its private execution stack, and a
series of dynamically allocated sub-areas within the iso-address area. Let us
introduce some terminology at this point for the sake of clarity. An address
slot is a range of virtual addresses within the iso-address area. A slot is
free if no memory has been mmapped at this address. Otherwise, it is busy, and
we say that memory has been allocated in this slot. Then, data may be stored
within this slot of virtual addresses. The iso-address discipline guarantees
that a slot which is busy on a node remains free on any other node.
Our goal is to design the management policy so as to avoid inter-node
synchronization as far as possible and to remain compatible with the heap
management mechanisms of the container (heavy) process. To manage slots in a
consistent system-wide manner, it is convenient to give them a uniform size,
very much like memory pages at the node level. The choice of this size is
obviously crucial and we will discuss it later. We introduce again some
terminology. At any point, exactly one agent, a node or a thread, is
responsible for managing a given slot. It is the owner of the slot. The slots
owned by a node or a thread are called its private slots. A slot owner is
responsible for mmapping or unmmapping
[Figure 6: diagram of the memory address space of nodes 1/2 and 2/2 over
Steps 1 to 4, showing the stack and data slots owned by the thread and the
slots owned by the local node.]
Fig. 6. Slot ownership may change due to migration. In this example, a thread
is created and acquires a slot owned by the local node to store its stack
(Step 1). The thread acquires other slots from the local node, to store its
private data (Step 2). The thread migrates along with its slots (Step 3). The
thread dies and its slots are acquired by the destination node (Step 4).
At initialization time, each slot is owned by a unique node and is free. When
a thread is created, the local node gives the thread a slot to store its
initial resources: this slot is from now on owned by the thread. When
isomallocating data dynamically, a thread acquires additional slots from the
local node. Notice that all these changes of ownership do not require any
synchronization between nodes whatsoever. A thread is associated with the list
of its private slots where it stores its resources. On migration, these slots
migrate along with the thread, which still owns them after the migration,
though the memory is allocated at another node. At any point, a thread may
release slots. They are then given to the node the thread is currently
visiting. This node may be different from the
irrelevant at this point. We will discuss the choice of this distribution
later.
3.3 The block layer
Since our goal is to provide an allocation function compatible with the malloc
C primitive, the isomalloc allocator has been refined so as to cope with
arbitrarily-sized zones of memory. This leads to a new concept: the block. A
medium-sized slot may contain multiple small-sized blocks.
Conversely, when large requests are to be handled, a block may stretch over
multiple contiguous slots. If the current local node owns the necessary number
of contiguous slots, this allocation is carried out the same way as a simple,
single-slot allocation. The set of contiguous slots is simply merged into a
large slot. Otherwise, the node has to enter a negotiation with other nodes to
buy from them the necessary set of contiguous slots. As such an operation
involves synchronization and mutual exclusion, it is clearly much more
expensive than usual, local allocations. Everything has to be done to keep it
exceptional. It is of course possible to increase the slot size defined at
initialization. It is much more efficient to adjust the initial distribution
of slots so as to favor the contiguity of the slots owned by nodes. We discuss
these aspects further in Section 4.1.
3.4 The programming interface
The PM2 high-level programming interface provides two primitives by means of
which threads may allocate (respectively release) memory in the iso-address
area: pm2_isomalloc and pm2_isofree. These primitives have the same prototype
as the classic C functions malloc and free:
void *pm2_isomalloc(size_t size);
void pm2_isofree(void *addr);
A thread must call pm2_isomalloc instead of malloc to allocate memory for
private, non-shared data that are required to migrate with the thread. PM2
guarantees that all data stored at addresses returned by pm2_isomalloc follow
the calling thread in case of migration. All addresses allocated by
pm2_isomalloc have to be set free through a call to pm2_isofree. Using these
primitives ensures that all references to the address areas handled by them
remain valid and that accesses to the corresponding data are migration-safe.
Migration is thus transparent and the migrating threads may use pointers in an
arbitrary way.
An example of code using pm2_isomalloc is given in Figure 7. Let us suppose
that the procedure p4 is called by a thread running on node 0. The thread
allocates memory blocks in the iso-address zone through successive calls to
pm2_isomalloc and creates a linked list. Then, the thread begins to traverse
the list while printing its elements. When the 101st element is reached, the
thread
PM2 guarantees that all blocks allocated by pm2_isomalloc migrate with the
thread and keep the same virtual addresses.
#define NB_ELEMENTS 100000
#define NB_ITERATIONS 20000
typedef struct _item {int value; struct _item *next;} item;
[...]
void p4() {
int j; item *head, *ptr;
/* Create a list. */
head = NULL;
for (j = 0; j < NB_ELEMENTS; j++) {
ptr = (item *) pm2_isomalloc(sizeof(item));
ptr->value = j * 2 + 1; /* For example */
ptr->next = head; head = ptr;
}
pm2_printf("I am thread %p\n", marcel_self());
[...]
/* Print the list elements. */
j = 0; ptr = head;
while(ptr != NULL) {
if (j == 100) { /* Migrate! */
pm2_printf("Initializing migration from node %d\n", pm2_self());
pm2_migrate(marcel_self(), 1);
pm2_printf("Arrived at node %d\n", pm2_self());
}
pm2_printf("Element %d = %d\n", j, ptr->value);
ptr = ptr->next; j++;
}
}
Fig. 7. Sample code using pm2_isomalloc. Procedure p4 is called by a thread
initially running on node 0. After having allocated a few blocks in the
iso-address area and constructed a linked list, the thread starts traversing
the list. Arrived at element 100, the thread migrates to node 1 and continues
the traversal.
4 Implementation details
4.1 Basic requirements
info% pm2load example1
[node0] I am thread eeff0020
[node0] Element 0 = 1
[node0] Element 1 = 3
[...]
[node0] Element 99 = 199
[node0] Initializing migration from node 0
[node1] Arrived at node 1
[node1] Element 100 = 201
[node1] Element 101 = 203
[node1] Element 102 = 205
Fig. 8. Execution trace for the code in Figure 7. The list traversal starts on
node 0 and continues on node 1. Using malloc instead of pm2_isomalloc would
result in a memory access error (Figure 9), since the list is not migrated
with the thread in this case
[node0] I am thread eeff0020
[node0] Element 0 = 1
[node0] Element 1 = 3
[...]
[node0] Element 99 = 199
[node0] Initializing migration from node 0
[node1] Arrived at node 1
[node1] Element 100 = -1797270816
[node1] Element 101 = 57654
Segmentation fault
Fig. 9. If the call to pm2_isomalloc is replaced by a call to malloc in the
code given in Figure 7, an error occurs when the thread tries to access its
list after the migration
Iso-address area A specific part of the virtual space has to be dedicated to
iso-address memory allocations on all nodes. To this purpose, we defined an
iso-address area situated between the process stack and the heap (Figure 5).
This is possible since all nodes are binary compatible and run the same
version of the operating system.
Global reservation, local allocation The iso-address area is divided into
fixed-size virtual address slots, each of which is given to a unique node at
initialization. To implement this global reservation, each node is provided
with a private bitmap which identifies the slots owned by the node (see
Section 4.2). The initial slot distribution pattern must ensure that no slot
is shared by several nodes. On each node, actual, local allocations may only
take place at the slots owned by the caller. Memory allocation is done using
the mmap primitive, which allows for memory allocation at specified virtual
addresses.
Slot distribution Initially, slots are distributed among the nodes according
to some user-defined distribution pattern which may be chosen so as to meet
the needs of the application. This choice should be made such that most
allocations are local and negotiations are as rare as possible. In our current
implementation, slots are assigned to nodes in a round-robin fashion: slot i
belongs to node i mod p in a p-node configuration. This choice has been made
for simplicity, but it behaves rather poorly for multi-slot allocations.
Nothing prevents the user from choosing other distributions. For instance,
instead of distributing single slots cyclically among the nodes, one may
distribute series of contiguous slots (block-cyclic distribution). An extreme
choice is to split the iso-address area into p sub-areas, one for each node,
but this scheme is not
slots to maximize the contiguity,
Slot size As previously explained, the slot size was chosen so as to fit a
thread stack and was fixed to 64 kB, that is 16 pages. Thus, thread creation
is a local operation (i.e., no negotiation is needed) irrespective of the slot
distribution, since a single slot is required. This is also valid for all
allocations of blocks smaller than a slot. As for larger allocations, details
are given in Section 4.4.
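The two distribution patterns mentioned above (round-robin and block-cyclic) can be written as small ownership functions. This is only a sketch: the function names and the series_len parameter are hypothetical, and only the round-robin rule "slot i belongs to node i mod p" is stated in the text.

```c
#include <assert.h>

/* Round-robin: slot i belongs to node i mod p. */
static int owner_round_robin(int slot, int nb_nodes)
{
    assert(slot >= 0 && nb_nodes > 0);
    return slot % nb_nodes;
}

/* Block-cyclic: series of `series_len` contiguous slots are dealt out
   cyclically among the nodes. series_len == 1 gives round-robin again. */
static int owner_block_cyclic(int slot, int nb_nodes, int series_len)
{
    assert(slot >= 0 && nb_nodes > 0 && series_len > 0);
    return (slot / series_len) % nb_nodes;
}

/* Longest run of contiguous slots initially owned by `node` among the first
   `nb_slots` slots: the largest request it can serve without negotiation. */
static int longest_initial_run(int node, int nb_slots, int nb_nodes,
                               int series_len)
{
    int best = 0, run = 0;
    for (int s = 0; s < nb_slots; s++) {
        if (owner_block_cyclic(s, nb_nodes, series_len) == node) {
            if (++run > best)
                best = run;
        } else {
            run = 0;
        }
    }
    return best;
}
```

With round-robin on more than one node, the longest initial run is a single slot, so any multi-slot request triggers a negotiation; a block-cyclic pattern with series of 8 slots lets requests of up to 8 slots stay local at first.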
4.2 Managing slots
Each node keeps track of its private slots by means of a private bitmap. Each
bit in this bitmap corresponds to a slot in the iso-address zone. Given that
this zone is typically as large as 3.5 GB and that a slot corresponds to
64 kB, the size of such a bitmap amounts to 7 kB. In each bitmap, the bits are
set to 1 if they correspond to slots owned by the local node, otherwise they
are set to 0. If a bit is set to 1, the corresponding slot is free. If it is
set to 0, the slot belongs either to another node (and it is necessarily free)
or to some local or remote thread.
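The 7 kB figure follows from the two sizes quoted above: 3.5 GB / 64 kB = 57344 slots, and 57344 bits / 8 = 7168 bytes. A minimal sketch of that arithmetic and of a bit lookup, assuming binary units (1 GB = 2^30 bytes) and a hypothetical LSB-first bit layout:

```c
#include <assert.h>
#include <stdint.h>

#define ISO_AREA_SIZE (UINT64_C(3584) << 20)   /* 3.5 GB = 3584 MB */
#define SLOT_SIZE     (UINT64_C(64) << 10)     /* 64 kB */

/* Number of slots in the iso-address area. */
static uint64_t nb_slots(void)
{
    return ISO_AREA_SIZE / SLOT_SIZE;
}

/* Bytes needed to hold one bit per slot. */
static uint64_t bitmap_bytes(void)
{
    return (nb_slots() + 7) / 8;
}

/* A bit set to 1 means: owned by the local node, hence free. */
static int slot_is_local_and_free(const unsigned char *bitmap, uint64_t slot)
{
    assert(slot < nb_slots());
    return (bitmap[slot / 8] >> (slot % 8)) & 1;
}
```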
When a slot request is issued by a thread (for instance, when a thread is
created or when it requires additional storage area), one of the slots owned
by the local node is given to the thread and the corresponding bit is set to 0
in the local bitmap. The slot does not belong to the local node any more. When
a slot is released by a thread (due to dynamic release or to thread death),
the corresponding bit in the current local bitmap is set to 1. Observe that
the bitmaps do not undergo any change on thread migration, since the migrating
slots keep being owned by the thread and the corresponding bits keep their
0-value on all nodes. Notice also that, due to migration, a slot may be
allocated on a node and released on another, so that the destination node may
eventually acquire slots that it did not possess initially.
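The bitmap transitions described in this paragraph can be sketched as follows. The type and function names are hypothetical, the area holds only 64 slots, and both nodes live in one process; the point is only to show that a slot acquired on one node and released on another changes owner without any exchange between the two bitmaps.

```c
#include <assert.h>

#define NB_SLOTS 64   /* tiny illustrative area */

typedef struct {
    unsigned char bits[NB_SLOTS / 8];
} slot_bitmap;

static int get_bit(const slot_bitmap *b, int s)
{
    return (b->bits[s / 8] >> (s % 8)) & 1;
}

static void set_bit(slot_bitmap *b, int s, int v)
{
    if (v)
        b->bits[s / 8] |= (unsigned char)(1u << (s % 8));
    else
        b->bits[s / 8] &= (unsigned char)~(1u << (s % 8));
}

/* Give one locally owned slot to a thread: the first bit found at 1 is
   cleared. Returns the slot index, or -1 if the node owns no slot. */
static int acquire_slot(slot_bitmap *local)
{
    for (int s = 0; s < NB_SLOTS; s++) {
        if (get_bit(local, s)) {
            set_bit(local, s, 0);
            return s;
        }
    }
    return -1;
}

/* A thread releases a slot to the node it is currently visiting. */
static void release_slot(slot_bitmap *visited, int s)
{
    assert(s >= 0 && s < NB_SLOTS);
    assert(!get_bit(visited, s));   /* thread-owned slots are 0 everywhere */
    set_bit(visited, s, 1);
}

/* Slot acquired on node 0, released on node 1 after a (simulated) migration.
   Returns 1 if node 1 ends up owning a slot it did not possess initially. */
static int demo_ownership_transfer(void)
{
    slot_bitmap node0 = {{0}}, node1 = {{0}};
    for (int i = 0; i < NB_SLOTS; i++)           /* round-robin, 2 nodes */
        set_bit(i % 2 == 0 ? &node0 : &node1, i, 1);

    int s = acquire_slot(&node0);                /* slot 0 goes to the thread */
    /* Migration itself leaves both bitmaps untouched. */
    release_slot(&node1, s);
    return s == 0 && !get_bit(&node0, s) && get_bit(&node1, s);
}
```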
Threads manage their private slots in a double-linked list (Figure 10). This
is in contrast with nodes which manage their private slots by means of a
bitmap. Chaining the slots owned by a thread makes it much easier to
manipulate them on migration. Actually, chaining is carried out by means of
pointers stored in the slot headers. Given that the slot contents get copied
at the same virtual address in case of migration, these pointers remain valid
and the chaining is thus preserved. As with user-level pointers, no
post-migration processing is necessary: an iso-address copy is enough.
4.3 Allocating blocks
In contrast to the traditional malloc/free primitives, which deal with dynamic
allocations in a contiguous heap, pm2_isomalloc and pm2_isofree manage
allocations of arbitrarily-sized blocks within a list of discontinuous slots.
Each slot contains a double-linked list of free blocks. Blocks have headers
storing their size, as well as pointers to the neighboring blocks in the list.
[Figure 10: diagram of the chained slots of a thread, holding the thread stack
and user-level data.]
Fig. 10. Each thread keeps its private slots in a double-linked list
the current implementation, a first-fit strategy is used, but other strategies
could be considered as well, especially if fragmentation is to be kept low. If
no suitable block is found, a new free slot belonging to the current local
node is acquired by the thread. It gets attached to its slot list. Then, a new
block is allocated in this new slot. This scheme works for all requests for
blocks smaller than the slot size, as long as the node owns at least one slot.
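A first-fit search inside one slot can be sketched as below. This is a simplified illustration, not the PM2 allocator: the free list is singly linked (the paper describes a double-linked list), blocks are never freed or coalesced, the slot is a 4 kB malloc'ed buffer instead of an mmapped 64 kB range, and all names are hypothetical. It assumes a C11 compiler for max_align_t and _Alignof.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOT_SIZE 4096   /* small illustrative slot; the paper uses 64 kB */
#define ALIGNMENT   _Alignof(max_align_t)
#define ROUND_UP(n) (((n) + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT)
#define HEADER      ROUND_UP(sizeof(block))

typedef struct block {
    size_t size;          /* usable bytes following the header */
    struct block *next;   /* next free block in this slot */
} block;

typedef struct {
    unsigned char *base;  /* start of the slot memory */
    block *free_list;     /* free blocks of this slot */
} slot;

/* Start with a single free block covering the whole slot. 0 on success. */
static int slot_init(slot *s)
{
    s->base = malloc(SLOT_SIZE);
    if (s->base == NULL)
        return -1;
    s->free_list = (block *)s->base;
    s->free_list->size = SLOT_SIZE - HEADER;
    s->free_list->next = NULL;
    return 0;
}

/* First fit: take the first free block that is large enough, and split off
   the remainder when it can still hold a header plus some payload. */
static void *slot_alloc(slot *s, size_t size)
{
    assert(size > 0);
    size = ROUND_UP(size);
    for (block **link = &s->free_list; *link != NULL; link = &(*link)->next) {
        block *b = *link;
        if (b->size < size)
            continue;
        if (b->size >= size + HEADER + ALIGNMENT) {
            block *rest = (block *)((unsigned char *)b + HEADER + size);
            rest->size = b->size - size - HEADER;
            rest->next = b->next;
            b->size = size;
            *link = rest;          /* remainder replaces b in the free list */
        } else {
            *link = b->next;       /* hand out the whole block */
        }
        return (unsigned char *)b + HEADER;
    }
    return NULL;   /* no suitable block: the caller would acquire a new slot */
}
```

A NULL result corresponds to the case where the thread has to acquire a new free slot from the local node.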
4.4 Coping with large-block allocations
To ensure the compatibility with malloc and free, our allocator can also cope
with arbitrarily-sized block requests, larger than a slot. In order to satisfy
such requests, the key point is to make up a larger slot out of n regular,
contiguous slots and to allocate the block inside this new slot (where n is
the smallest number of contiguous slots that would be necessary). For this
purpose, the following steps are accomplished.
1. The slot bitmap of the local node is scanned, in order to find the
necessary number of contiguous slots. A first-fit strategy is used. If this
search is successful, the corresponding slots are given to the thread, which
uses them to build up a large slot. This large slot gets attached to the slot
list of the thread.
2. If the search fails, a global negotiation phase among all the nodes is
launched. The initiating node behaves as follows.
(a) Enter a system-wide critical section. No other node is allowed to modify
its slot bitmap within this section. (It may still run its code and
allocate/free blocks, as long as no slot management is necessary.)
(b) Gather the local bitmaps of all nodes.
(c) Compute a global OR taking all bitmaps as operands.
(d) Search for the first series of n contiguous available slots in this global
bitmap and buy the non-local slots. It suffices to mark these slots with 1 in
the bitmap of the requesting node and 0 in the bitmap of their original owner
node.
simply enables a node to buy slots from some other nodes.
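The local search and steps (b) to (d) of the negotiation can be sketched on in-memory bitmaps. This is an illustration under assumptions: all node bitmaps sit in one array in a single process, so the critical section of step (a) and the communication are left out; one byte per slot is used instead of one bit for readability; the names are hypothetical. The sketch stops once the requesting node owns the run, before the slots are handed to the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NB_SLOTS  32
#define SLOT_SIZE (64 * 1024)

typedef struct {
    unsigned char bit[NB_SLOTS];   /* one byte per slot, for clarity */
} bitmap;

/* Smallest number n of contiguous slots able to hold `size` bytes. */
static int slots_needed(size_t size)
{
    return (int)((size + SLOT_SIZE - 1) / SLOT_SIZE);
}

/* First-fit search for n contiguous slots marked 1; first index or -1. */
static int find_run(const bitmap *b, int n)
{
    int run = 0;
    for (int s = 0; s < NB_SLOTS; s++) {
        run = b->bit[s] ? run + 1 : 0;
        if (run == n)
            return s - n + 1;
    }
    return -1;
}

/* Steps (b)-(d): OR all bitmaps, search the result, move ownership. */
static int negotiate(bitmap nodes[], int nb_nodes, int requester, int n)
{
    assert(nb_nodes > 0 && requester >= 0 && requester < nb_nodes && n > 0);
    bitmap global;
    memset(&global, 0, sizeof global);
    for (int k = 0; k < nb_nodes; k++)
        for (int s = 0; s < NB_SLOTS; s++)
            global.bit[s] |= nodes[k].bit[s];

    int first = find_run(&global, n);
    if (first < 0)
        return -1;
    for (int s = first; s < first + n; s++)
        for (int k = 0; k < nb_nodes; k++)
            nodes[k].bit[s] = (unsigned char)(k == requester);
    return first;
}

/* Step 1 (local search) first, step 2 (negotiation) as a fallback. */
static int acquire_contiguous(bitmap nodes[], int nb_nodes, int requester,
                              int n)
{
    int first = find_run(&nodes[requester], n);
    return first >= 0 ? first : negotiate(nodes, nb_nodes, requester, n);
}
```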
A global negotiation is obviously an expensive operation, because of the
global communication required. It should therefore be kept as exceptional as
possible. Two main factors have an impact on the frequency of these
negotiations: the slot size and the initial slot distribution. Since all
single-slot allocations are guaranteed to be local, the slot must be large
enough to avoid multiple-slot allocations as much as possible. On the other
hand, even for such allocations, negotiation may be avoided if the necessary
number of contiguous slots is locally available. It is therefore important to
choose a good initial slot distribution, in order to avoid negotiations even
more. Observe that there is no restriction whatsoever on the initial
distribution.
Notice also that the manipulation of the bitmaps on the local node may be
completely arbitrary. It is in particular possible for the local node to take
advantage of a negotiation phase to pre-buy slots in anticipation of
foreseeable large allocation requests. It is also possible to completely
restructure the slot distribution at the system level, for instance by
grouping contiguous free slots as much as possible on the various nodes. The
only requirement is that each slot present in the bitmaps must finally belong
to exactly one node.
5 Performance and optimizations
We present here some results obtained on our PoPC cluster. Each node consists
of a 200 MHz PentiumPro processor. The operating system is Linux 2.0.36. The
nodes are interconnected by a Myrinet network from Myricom [9] accessed
through the BIP low-level communication interface [12].
The time needed to migrate a thread with no static data between two nodes is
less than 75 µs. It was measured by means of a thread ping-pong between two
nodes. This time includes packing the thread resources, transferring them over
the network, allocating the memory on the destination node and unpacking the
resources. Notice that no post-migration processing whatsoever is needed
thanks to our iso-address approach. This time should be compared to the 150 µs
reported for the migration of a null thread in Active Threads [13]. This
performance figure is partly due to the very efficient Madeleine communication
layer used by PM2 [2].
Using the pm2_isomalloc function instead of the usual malloc induces a
non-significant overhead for the requests of blocks larger than one slot, as
shown in Figure 11. This overhead is mainly due to the negotiation
automatically required by any multi-slot allocation when the slots are
distributed in a round-robin way (which is the case in our experiment). This
negotiation takes 255 µs in a 2-node configuration when using BIP/Myrinet. If
the underlying architecture provides more than 2 nodes, another 165 µs should
be added per extra node. Notice that, for large allocations, this overhead is
small and rather insignificant compared to
[Figure 11: two plots of average allocation time (microseconds) against block
size (bytes), each comparing malloc and pm2_isomalloc.]
Fig. 11. Compared performance of malloc and pm2_isomalloc for respectively
small and large requests in a 2-node configuration.
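The two negotiation timings quoted in the text (255 µs for 2 nodes, plus 165 µs per extra node) can be combined into a simple cost estimate. The linear form below is only a reading of that sentence, and the symbol T_neg is introduced here for illustration; the paper reports no measurement beyond the 2-node configuration.

```latex
T_{\mathrm{neg}}(p) \approx 255 + 165\,(p - 2)\ \mu\mathrm{s}, \qquad p \ge 2
% Worked example for p = 4 nodes:
T_{\mathrm{neg}}(4) \approx 255 + 165 \times 2 = 585\ \mu\mathrm{s}
```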
6 Conclusion and future work
To validate our approach, we have integrated the iso-address allocation
primitives in the runtime libraries used by two data-parallel compilers [3,6].
These compilers have been previously modified, in order to generate
multithreaded code for PM2 [11,1]. Thanks to our new allocator, the runtime
code responsible for thread migration was significantly simplified. Given that
pre- and post-migration processing were reduced, we could notice an
improvement of our virtual processor migration time. We are currently working
on these aspects.
time at the next slot allocation. When migrating a slot attached to a thread,
it is sufficient to send its internally allocated blocks. Additional details
on the current implementation and a downloadable version can be found at
http://www.ens-lyon.fr/~rnamyst/pm2.html.
References
1. L. Bougé, P. Hatcher, R. Namyst, and C. Perez. Multithreaded code
generation for a HPF data-parallel compiler. In Proc. 1998 Int. Conf. Parallel
Architectures and Compilation Techniques (PACT'98), ENST, Paris, France,
October 1998. Preliminary version available at
ftp://ftp.lip.ens-lyon.fr/pub/LIP/Rapports/RR/RR1998/RR1998-43.ps.Z.
2. L. Bougé, J-F. Méhaut, and R. Namyst. Madeleine: an efficient and portable
communication interface for multithreaded environments. In Proc. 1998 Int.
Conf. Parallel Architectures and Compilation Techniques (PACT'98), pages
240–247, ENST, Paris, France, October 1998. IFIP WG 10.3 and IEEE. Preliminary
version available at
ftp://ftp.lip.ens-lyon.fr/pub/LIP/Rapports/RR/RR1998/RR1998-26.ps.Z.
3. Th. Brandes. Adaptor (HPF compilation system), developed at GMD-SCAI.
Available at http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html.
4. J. Casas, R. Konuru, S. W. Otto, R. Prouty, and J. Walpole. Adaptive load
migration systems for PVM. In Proc. Supercomputing '94, pages 390–399,
Washington, D.C., November 1994. Available at
http://www.mcs.vuw.ac.nz/~pmar/refs.html#R545.
5. D. Cronk, M. Haines, and P. Mehrotra. Thread migration in the presence of
pointers. In Proc. Mini-track on Multithreaded Systems, 30th Intl Conf. on
System Sciences, Hawaii, January 1997. Available at URL
http://www.cs.uwyo.edu/~haines/research/chant.
6. P. J. Hatcher. UNH C*. Available at
http://www.cs.unh.edu/pjh/vstar/cstar.html.
7. A. Itzkovitz, A. Schuster, and L. Shalev. Thread migration and its
application in distributed shared memory systems. J. Systems and Software,
42(1):71–87, July 1998. Available at
http://www.cs.technion.ac.il/Labs/Millipede/.
8. E. Mascarenhas and V. Rego. Ariadne: Architecture of a portable threads
system supporting mobile processes. Software: Practice & Experience,
26(3):327–356, March 1996.
9. Myricom. Myrinet link and routing specification. Available at
http://www.myri.com/myricom/document.html, 1995.
10. R. Namyst. PM2: an environment for a portable design and an efficient
execution of irregular parallel applications. PhD thesis, Univ. Lille 1,
France, January 1997. In French.
11. C. Perez. Load balancing HPF programs by migrating virtual processors. In
Second Int. Workshop on High-Level Progr. Models and Supportive Env.
(HIPS'97), pages 85–92, April 1997.
12. B. Tourancheau and L. Prylli. BIP messages. Available at
http://lhpca.univ-lyon1.fr/bip.html.
13. B. Weissman, B. Gomes, J. W. Quittek, and M. Holtkamp. Efficient fine-grain
thread migration with Active Threads. In Proceedings of IPPS/SPDP 1998,
Orlando, Florida, March 1998. Available at http://www.icsi.berkeley.edu/~sather/