An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System


HAL Id: inria-00565361

https://hal.inria.fr/inria-00565361

Submitted on 11 Feb 2011


Gabriel Antoniu, Luc Bougé, Raymond Namyst

To cite this version:

Gabriel Antoniu, Luc Bougé, Raymond Namyst. An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System. Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, Apr 1999, Po Rico, Puerto Rico. pp.496–510. ⟨inria-00565361⟩


Gabriel Antoniu, Luc Bougé, and Raymond Namyst

LIP, ENS Lyon, 46, Allée d'Italie, 69364 Lyon Cedex 07, France.

Contact: {Gabriel.Antoniu,Luc.Bouge,Raymond.Namyst}@ens-lyon.fr.

Abstract. This paper describes a new iso-address approach to the dynamic allocation of data in a multithreaded runtime system with thread migration capability. The system guarantees that the migrated threads and their associated static data are relocated exactly at the same virtual address on the destination nodes, so that no post-migration processing is needed to keep pointers valid. In the experiments reported, a thread can be migrated in less than 75 µs.

1 Introduction

Multithreading has proven useful to implement massively parallel activities in distributed systems, since it provides an efficient way of overlapping communication and computation. When the application behavior is hard to predict at compile time, dynamic load balancing becomes essential. It can be achieved by transparently migrating computation threads from the overloaded nodes to the underloaded ones. In the implementation described below, a thread can be migrated across the Myrinet network in less than 75 µs.

Migrating a thread usually means moving the thread stack, but sometimes may also mean moving the static data used by the thread. In this respect, several migration approaches have been implemented in the existing multithreaded systems, depending on the rationale underlying the use of thread migration. In Ariadne [8], threads are migrated to get close to the remote data they use. Static data never moves. On migration, the thread stack is relocated at a usually different address on the destination node, so that pointers need to be updated. As shown in Section 2, several problems cannot be solved by this approach. In Millipede [7], thread migration is directed by a load balancing module integrated in the system, whereas static data get moved only when they get accessed by remote threads. The threads and their data are always relocated at the same virtual addresses on all nodes. Yet, thread creation is expensive, therefore the number of concurrent threads is statically fixed at initialization. In both systems mentioned above, data are shared and can be accessed by more than one thread. UPVM [4] provides thread migration for PVM applications, in order to support load balancing. Threads have private heaps, for private dynamic allocations. Thread creation is expensive in this system, too, since it is carried out by means [...] compiling. Consequently, our study focuses on the case in which data are not shared: they belong to some unique thread and thus have to follow it on migration. Our iso-address allocator has been implemented in the PM2 multithreaded runtime system [10], which serves as a runtime support for two data-parallel compilers [1]. We target applications having to execute on homogeneous clusters of workstations or PCs interconnected by a high-speed network (e.g., Myrinet [9]).

This paper is structured as follows. In Section 2, we give a quick description of PM2: a multithreaded runtime system providing thread migration. An overview of our iso-address approach is given in Section 3 and some implementation details are presented in Section 4. Section 5 shows some performance figures. Finally, Section 6 summarizes our main results and points out what we intend to address in the near future.

2 PM2: a multithreaded runtime system with thread migration

PM2 is a multithreaded runtime system especially designed to serve as a runtime support for highly parallel irregular applications. In such applications, threads may need to start or terminate at arbitrary moments during the execution. At the same time, the system has to efficiently cope with a large number of concurrent threads. Therefore, PM2 provides very efficient primitives to handle these operations: creation, destruction and context switching. A distinctive feature of PM2 is thread migration. Since the execution of irregular applications may lead to severe load imbalances, thread migration can be used to support the implementation of load balancing policies based on dynamic activity redistribution.

In a PM2 application, there is a single (heavy) process running at each node and each such process may contain tens of thousands of threads. We often identify this container process with the node running it. At the simplest level, a PM2 thread is an execution flow managing a set of resources, i.e., its state descriptor and its private execution stack. The code to be executed by the threads is replicated on each node (SPMD approach) and is not part of the thread. Again, we emphasize that we do not consider the aspects of data sharing between threads in this paper, nor the problem of a thread using global process resources such as files, network interfaces, etc. In this setting, migrating a thread simply means moving the thread resources from the (heavy) process running on the local node to another (heavy) process located on some remote node. In PM2, the migration operation is carried out in three main steps:

1. The thread gets stopped (frozen) and its resources get copied to a communication buffer. The memory area storing the resources is set free.

2. The buffer contents get sent to the destination node through the network.

[...] point during its execution. It may also be preemptively migrated by another thread running on the same node. This latter property is essential, since it ensures that application threads may be transparently migrated across the nodes. Consequently, a generic module implemented outside the running application could balance the load by migrating the application threads. The threads are unaware of their being migrated and keep on running irrespective of their location.

Source code:

void p1()
{
  int x;

  x = 1;
  pm2_printf("value = %d\n", x);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", x);
}

Execution:

[node0] value = 1
[node1] value = 1

Fig. 1. Thread migration without pointers.

Source code:

void p2()
{
  int x;
  int *ptr = &x;

  x = 1;
  pm2_printf("value = %d\n", *ptr);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", *ptr);
}

Execution:

[node0] value = 1
Segmentation fault

Fig. 2. Thread migration in the presence of pointers to stack data.

An example of thread migration is given in Figure 1. Assume that a thread running on node 0 calls procedure p1. The thread declares a local variable x, writes the value 1 to this variable, then prints it. Next, the thread migrates to node 1 and prints the value of the variable x again. At run time, we can see that the value 1 is displayed in both cases, before and after migration. The local variable x gets automatically moved to node 1, since it is stored in the thread stack.

A difficulty turns up as soon as a migrating thread makes use of pointers. Such a situation is illustrated in Figure 2. Here, the thread which calls p2 reads variable x through pointer ptr. After migration, there is no guarantee that variable x is still located at address ptr and the execution (most probably!) fails.

One way to tackle this problem is to update all references to stack data after migration, before the thread resumes its execution, by adding some offset to all pointers. Two categories of pointers to stack data require such post-migration processing: the implicit pointers generated by the compiler in order to chain the stack frames and the explicit pointers used by the programmer. The former [...] in the early versions of PM2, which provided primitives to register/unregister user-level pointers. When a thread moved to another node, all its registered pointers were updated (Figure 3).

Source code:

void p2()
{
  int x;
  int *ptr = &x;
  unsigned int key;

  key = pm2_register_pointer(&ptr);
  x = 1;
  pm2_printf("value = %d\n", *ptr);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", *ptr);
  pm2_unregister_pointer(key);
}

Execution:

[node0] value = 1
[node1] value = 1

Fig. 3. Thread migration with registered pointers.

Source code:

void p3()
{
  int *t = (int *)malloc(100 * sizeof(int));

  t[10] = 1;
  pm2_printf("value = %d\n", t[10]);
  pm2_migrate(marcel_self(), 1);
  pm2_printf("value = %d\n", t[10]);
}

Execution:

[node0] value = 1
Segmentation fault

Fig. 4. Thread migration with pointers to heap data.

Clearly, this approach does not extend to complex applications. Moreover, it does not cope with resources located outside of the stack, such as heap data dynamically allocated by the malloc primitive of the C language. Figure 4 shows a thread which calls malloc to allocate some memory area, writes (potentially large) data into this area, migrates, and eventually tries to read at the same virtual address. The program obviously fails, since the allocated data has not been migrated.

One way to solve this problem consists in re-allocating the data with malloc on the destination node. In this case, the programmer has to explicitly handle the data packing and unpacking, and to manage the pointer updating, as the allocation addresses are usually different from the original ones. As in the case of pointers to stack data, this approach cannot be used for arbitrarily complex applications making use of a large number of pointers to heap data. Moreover, this approach cannot cope with compiler-generated pointers in case optimization options are used, since such pointers are not registered and cannot be updated. Fundamental [...]

3.1 General overview

A much better approach to the problem described in the previous section is to provide a mechanism able to guarantee that both the stack and the private, dynamically allocated data of a thread can be migrated and re-allocated at the same virtual address on the destination node (iso-address allocation and migration). The idea is to locally allocate storage areas in a system-wide, globally consistent way. The allocation mechanism must guarantee that each range of virtual addresses at which memory has been mmapped at some node is kept free on all the other nodes. Such an approach has several advantages.

Simplicity. The migration mechanism is simplified, because no post-migration pointer update is necessary any longer.

Transparency. Applications may make free use of pointers without having to take into account possible problems related to thread migration. User-level pointers are always guaranteed to be safe.

Portability. No compiler knowledge about the thread stack structure is required, since the stack contents remain exactly the same after migration. In particular, compiler-generated pointers are migration-safe, too. Consequently, any compiler may be used and compiler optimizations are allowed.

Preemptiveness. Preemptive migration is possible, given that no assumption is made about the thread state at migration time.

The isomalloc allocation mechanism relies on a few basic principles. These rules ensure that each node may use its globally reserved memory without having to inform the other nodes. We thus avoid any synchronization when allocating memory to threads.

1. The physical execution environment is assumed to be homogeneous (same type of processor, same operating system). Moreover, all nodes have the same memory mapping: the same binary code is loaded on each of them at the same virtual address (so that no code needs to be moved upon migration). The (unique) process stack is also located at the same virtual address on all nodes.

2. On each node, all iso-address allocations take place within a special address range called the iso-address area. We have located it between the process stack and the heap (Figure 5). This zone corresponds to the same virtual range on all nodes.

3. Separate ranges of virtual addresses within the iso-address area are globally reserved for each node, so that each address may be used by a single node at a time.

4. The actual memory allocation is carried out locally, within an address range [...]

[Figure 5: memory layout of a node. Regions shown: code and data (fixed at compile time), local heap, iso-address area, (UNIX) process stack. Size annotations: ~3.5 GB, << 500 MB, < 100 kB.]

Fig. 5. All nodes have the same memory mapping. In particular, the iso-address area covers the same virtual address range on all nodes.
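The requirement that memory be mapped at a chosen virtual address can be illustrated with the POSIX mmap primitive. The following is a minimal sketch, not the PM2 implementation: the helper names iso_map_at and iso_unmap are hypothetical, the address is assumed to be page-aligned, and MAP_FIXED is assumed to be acceptable only because the iso-address discipline keeps the target range free on the calling node.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes of zero-filled memory exactly at
 * `addr`. Returns `addr` on success and NULL on failure. MAP_FIXED silently
 * replaces any mapping already present in the range, so the caller must
 * know that the range is free (here, because the slot is owned locally). */
void *iso_map_at(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical helper: unmap the range so that it becomes free again. */
int iso_unmap(void *addr, size_t len)
{
    return munmap(addr, len);
}
```

On migration, the destination node would call such a helper with the address the data occupied on the source node, then copy the received bytes into place, so that every pointer into the range remains valid.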

3.2 The slot layer

In this improved view, a PM2 thread is an execution flow managing a set of resources, i.e., its state descriptor, its private execution stack, and a series of dynamically allocated sub-areas within the iso-address area. Let us introduce some terminology at this point for the sake of clarity. An address slot is a range of virtual addresses within the iso-address area. A slot is free if no memory has been mmapped at this address. Otherwise, it is busy, and we say that memory has been allocated in this slot. Then, data may be stored within this slot of virtual addresses. The iso-address discipline guarantees that a slot which is busy on a node remains free on any other node.

Our goal is to design the management policy so as to avoid inter-node synchronization as far as possible and to remain compatible with the heap management mechanisms of the container (heavy) process. To manage slots in a consistent system-wide manner, it is convenient to give them a uniform size, very much like memory pages at the node level. The choice of this size is obviously crucial and we will discuss it later. We introduce again some terminology. At any point, exactly one agent, a node or a thread, is responsible for managing a given slot. It is the owner of the slot. The slots owned by a node or a thread are called its private slots. A slot owner is responsible for mmapping or unmapping [...]

[Figure 6: memory address space of Node 1/2 and Node 2/2 across Steps 1 to 4, distinguishing the slots owned by the thread (stack, data) from the slots owned by the local node.]

Fig. 6. Slot ownership may change due to migration. In this example, a thread is created and acquires a slot owned by the local node to store its stack (Step 1). The thread acquires other slots from the local node, to store its private data (Step 2). The thread migrates along with its slots (Step 3). The thread dies and its slots are acquired by the destination node (Step 4).

At initialization time, each slot is owned by a unique node and is free. When a thread is created, the local node gives the thread a slot to store its initial resources: this slot is from now on owned by the thread. When isomallocating data dynamically, a thread acquires additional slots from the local node. Notice that these changes of ownership do not require any synchronization between nodes whatsoever. A thread is associated with the list of its private slots where it stores its resources. On migration, these slots migrate along with the thread, which still owns them after the migration, though the memory is allocated at another node. At any point, a thread may release slots. They are then given to the node the thread is currently visiting. This node may be different from the [...] irrelevant at this point. We will discuss the choice of this distribution later.
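The ownership rules of this section can be summarized by a small state machine. The sketch below is illustrative only: the types slot_owner and thread_info and the three functions are hypothetical names, not part of PM2. It shows that acquisition and release are purely local decisions, and that migration changes the node a thread is visiting without touching slot ownership.

```c
#include <assert.h>

typedef enum { OWNED_BY_NODE, OWNED_BY_THREAD } owner_kind;

typedef struct {
    owner_kind kind;  /* which kind of agent owns the slot */
    int id;           /* node number or thread number */
} slot_owner;

typedef struct {
    int id;    /* thread number */
    int node;  /* node the thread is currently visiting */
} thread_info;

/* A thread may only acquire a slot owned by the node it is visiting.
 * Returns 0 on success, -1 if the slot is not owned by that node. */
int owner_acquire(slot_owner *s, const thread_info *t)
{
    if (s->kind != OWNED_BY_NODE || s->id != t->node)
        return -1;
    s->kind = OWNED_BY_THREAD;
    s->id = t->id;
    return 0;
}

/* Migration changes the visited node only: slot ownership is untouched. */
void thread_migrate(thread_info *t, int dest_node)
{
    t->node = dest_node;
}

/* A released slot goes to the node the thread is currently visiting. */
void owner_release(slot_owner *s, const thread_info *t)
{
    assert(s->kind == OWNED_BY_THREAD && s->id == t->id);
    s->kind = OWNED_BY_NODE;
    s->id = t->node;
}
```

Following the four steps of Figure 6, a slot initially owned by node 0 that is acquired by a thread, carried to node 1 and then released ends up owned by node 1.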

3.3 The block layer

Since our goal is to provide an allocation function compatible with the malloc C primitive, the isomalloc allocator has been refined so as to cope with arbitrarily-sized zones of memory. This leads to a new concept: the block. A medium-sized slot may contain multiple small-sized blocks.

Conversely, when large requests are to be handled, a block may stretch over multiple contiguous slots. If the current local node owns the necessary number of contiguous slots, this allocation is carried out the same way as a simple, single-slot allocation. The set of contiguous slots is simply merged into a large slot. Otherwise, the node has to enter a negotiation with other nodes to buy from them the necessary set of contiguous slots. As such an operation involves synchronization and mutual exclusion, it is clearly much more expensive than usual, local allocations. Everything has to be done to keep it exceptional. It is of course possible to increase the slot size defined at initialization. It is much more efficient to adjust the initial distribution of slots so as to favor the contiguity of the slots owned by nodes. We discuss these aspects further in Section 4.1.

3.4 The programming interface

The PM2 high-level programming interface provides two primitives by means of which threads may allocate (respectively release) memory in the iso-address area: pm2_isomalloc and pm2_isofree. These primitives have the same prototypes as the classic C functions malloc and free:

void *pm2_isomalloc(size_t size);

void pm2_isofree(void *addr);

A thread must call pm2_isomalloc instead of malloc to allocate memory for private, non-shared data that are required to migrate with the thread. PM2 guarantees that all data stored at addresses returned by pm2_isomalloc follow the calling thread in case of migration. All addresses allocated by pm2_isomalloc have to be set free through a call to pm2_isofree. Using these primitives ensures that all references to the address areas handled by them remain valid and that accesses to the corresponding data are migration-safe. Migration is thus transparent and the migrating threads may use pointers in an arbitrary way.

An example of code using pm2_isomalloc is given in Figure 7. Let us suppose that the procedure p4 is called by a thread running on node 0. The thread allocates memory blocks in the iso-address zone through successive calls to pm2_isomalloc and creates a linked list. Then, the thread begins to traverse the list while printing its elements. When the 101st element is reached, the thread [...] PM2 guarantees that all blocks allocated by pm2_isomalloc migrate with the thread and keep the same virtual addresses.

#define NB_ELEMENTS 100000
#define NB_ITERATIONS 20000

typedef struct _item { int value; struct _item *next; } item;

[...]

void p4()
{
  int j; item *head, *ptr;

  /* Create a list. */
  head = NULL;
  for (j = 0; j < NB_ELEMENTS; j++) {
    ptr = (item *) pm2_isomalloc(sizeof(item));
    ptr->value = j * 2 + 1; /* For example */
    ptr->next = head; head = ptr;
  }
  pm2_printf("I am thread %p\n", marcel_self());

  [...]

  /* Print the list elements. */
  j = 0; ptr = head;
  while (ptr != NULL) {
    if (j == 100) { /* Migrate! */
      pm2_printf("Initializing migration from node %d\n", pm2_self());
      pm2_migrate(marcel_self(), 1);
      pm2_printf("Arrived at node %d\n", pm2_self());
    }
    pm2_printf("Element %d = %d\n", j, ptr->value);
    ptr = ptr->next; j++;
  }
}

Fig. 7. Sample code using pm2_isomalloc. Procedure p4 is called by a thread initially running on node 0. After having allocated a few blocks in the iso-address area and constructed a linked list, the thread starts traversing the list. Arrived at element 100, the thread migrates to node 1 and continues the traversal.

4 Implementation details

4.1 Basic requirements

info% pm2load example1
[node0] I am thread eeff0020
[node0] Element 0 = 1
[...]
[node0] Initializing migration from node 0
[node1] Arrived at node 1

Fig. 8. Execution trace for the code in Figure 7. The list traversal starts on node 0 and continues on node 1. Using malloc instead of pm2_isomalloc would result in a memory access error (Figure 9), since the list is not migrated with the thread in this case.

[node0] I am thread eeff0020
[...]
[node0] Initializing migration from node 0
[node1] Arrived at node 1
[node1] Element 100 = -1797270816
Segmentation fault

Fig. 9. If the call to pm2_isomalloc is replaced by a call to malloc in the code given in Figure 7, an error occurs when the thread tries to access its list after the migration.

Iso-address area. A specific part of the virtual space has to be dedicated to iso-address memory allocations on all nodes. To this purpose, we defined an iso-address area situated between the process stack and the heap (Figure 5). This is possible since all nodes are binary compatible and run the same version of the operating system.

Global reservation, local allocation. The iso-address area is divided into fixed-size virtual address slots, each of which is given to a unique node at initialization. To implement this global reservation, each node is provided with a private bitmap which identifies the slots owned by the node (see Section 4.2). The initial slot distribution pattern must ensure that no slot is shared by several nodes. On each node, actual, local allocations may only take place at the slots owned by the caller. Memory allocation is done using the mmap primitive, which allows for memory allocation at specified virtual addresses.

Slot distribution. Initially, slots are distributed among the nodes according to some user-defined distribution pattern which may be chosen so as to meet the needs of the application. This choice should be made such that most allocations are local and negotiations are as seldom as possible. In our current implementation, slots are assigned to nodes in a round-robin fashion: slot i belongs to node i mod p in a p-node configuration. This choice has been made for simplicity, but it behaves rather poorly for multi-slot allocations. Nothing prevents the user from choosing other distributions. For instance, instead of distributing single slots cyclically among the nodes, one may distribute series of contiguous slots (block-cyclic distribution). An extreme choice is to split the iso-address area into p sub-areas, one for each node, but this scheme is not [...] slots to maximize the contiguity.

Slot size. As previously explained, the slot size was chosen so as to fit a thread stack and was fixed to 64 kB, that is 16 pages. Thus, thread creation is a local operation (i.e., no negotiation is needed) irrespective of the slot distribution, since a single slot is required. This is also valid for all allocations of blocks smaller than a slot. As for larger allocations, details are given in Section 4.4.
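The effect of the distribution pattern on contiguity can be made concrete with a short sketch. The function names below are hypothetical and the code is an illustration of the round-robin and block-cyclic patterns described above, not the PM2 code: it computes the owner of a slot and the longest series of contiguous slots a node owns initially.

```c
#include <stddef.h>

/* Round-robin: slot i belongs to node i mod p. */
unsigned owner_round_robin(size_t slot, unsigned p)
{
    return (unsigned)(slot % p);
}

/* Block-cyclic: runs of `run` contiguous slots are dealt out cyclically. */
unsigned owner_block_cyclic(size_t slot, size_t run, unsigned p)
{
    return (unsigned)((slot / run) % p);
}

/* Length of the longest series of contiguous slots owned by `node` among
 * slots 0..nslots-1 under a block-cyclic distribution (run = 1 gives
 * round-robin). */
size_t longest_local_run(size_t nslots, size_t run, unsigned p, unsigned node)
{
    size_t best = 0, cur = 0;
    for (size_t i = 0; i < nslots; i++) {
        if (owner_block_cyclic(i, run, p) == node) {
            if (++cur > best) best = cur;
        } else {
            cur = 0;
        }
    }
    return best;
}
```

With two nodes, round-robin never leaves a node with two adjacent slots, so every multi-slot request needs a negotiation, whereas a block-cyclic distribution with runs of 8 slots serves requests of up to 8 slots locally.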

4.2 Managing slots

Each node keeps track of its private slots by means of a private bitmap. Each bit in this bitmap corresponds to a slot in the iso-address zone. Given that this zone is typically as large as 3.5 GB and that a slot corresponds to 64 kB, the size of such a bitmap amounts to 7 kB. In each bitmap, the bits are set to 1 if they correspond to slots owned by the local node, otherwise they are set to 0. If a bit is set to 1, the corresponding slot is free. If it is set to 0, the slot belongs either to another node (and it is necessarily free) or to some local or remote thread.

When a slot request is issued by a thread (for instance, when a thread is created or when it requires additional storage area), one of the slots owned by the local node is given to the thread and the corresponding bit is set to 0 in the local bitmap. The slot does not belong to the local node any more. When a slot is released by a thread (due to dynamic release or to thread death), the corresponding bit in the current local bitmap is set to 1. Observe that the bitmaps do not undergo any change on thread migration, since the migrating slots keep being owned by the thread and the corresponding bits keep their 0-value on all nodes. Notice also that, due to migration, a slot may be allocated on a node and released on another, so that the destination node may eventually acquire slots that it did not possess initially.
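The bitmap size quoted above follows from 3.5 GB / 64 kB = 57344 slots, that is 57344 / 8 = 7168 bytes. The following sketch, with hypothetical names and a first-found selection policy that is an assumption, shows how a node bitmap of that size could support the two operations just described: giving a slot to a thread (bit 1 to 0) and taking back a released slot (bit 0 to 1).

```c
#include <assert.h>
#include <stddef.h>

#define ISO_AREA_BYTES (3584ULL << 20)               /* 3.5 GB iso-address area */
#define SLOT_BYTES     (64ULL << 10)                 /* 64 kB per slot */
#define NB_SLOTS       (ISO_AREA_BYTES / SLOT_BYTES) /* 57344 slots */
#define BITMAP_BYTES   (NB_SLOTS / 8)                /* 7168 bytes = 7 kB */

typedef struct { unsigned char bits[BITMAP_BYTES]; } slot_bitmap;

/* Returns 1 if slot i is owned by the local node (and thus free). */
int bitmap_test(const slot_bitmap *b, size_t i)
{
    assert(i < NB_SLOTS);
    return (b->bits[i / 8] >> (i % 8)) & 1;
}

/* Sets the bit of slot i to `value` (0 or 1). */
void bitmap_put(slot_bitmap *b, size_t i, int value)
{
    assert(i < NB_SLOTS);
    if (value) b->bits[i / 8] |= (unsigned char)(1u << (i % 8));
    else       b->bits[i / 8] &= (unsigned char)~(1u << (i % 8));
}

/* Give the first slot owned by the local node to a thread: bit 1 -> 0.
 * Returns the slot index, or -1 if the node owns no slot. */
long bitmap_acquire(slot_bitmap *b)
{
    for (size_t i = 0; i < NB_SLOTS; i++) {
        if (bitmap_test(b, i)) { bitmap_put(b, i, 0); return (long)i; }
    }
    return -1;
}

/* A thread releases slot i on the node it is visiting: bit 0 -> 1. */
void bitmap_release(slot_bitmap *b, size_t i)
{
    bitmap_put(b, i, 1);
}
```

Nothing in these operations refers to another node, which is why slot acquisition, release and migration need no inter-node synchronization.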

Threads manage their private slots in a doubly-linked list (Figure 10). This is in contrast with nodes, which manage their private slots by means of a bitmap. Chaining the slots owned by a thread makes it much easier to manipulate them on migration. Actually, chaining is carried out by means of pointers stored in the slot headers. Given that the slot contents get copied at the same virtual address in case of migration, these pointers remain valid and the chaining is thus preserved. As with user-level pointers, no post-migration processing is necessary: an iso-address copy is enough.
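A slot header carrying the chaining pointers might look as follows. The layout and the names slot_header, slot_list_attach, slot_list_detach and slot_list_bytes are assumptions made for illustration; the text only states that the chaining pointers live in the slot headers.

```c
#include <stddef.h>

/* Hypothetical slot header, stored at the start of each busy slot. */
typedef struct slot_header {
    struct slot_header *prev;  /* previous slot owned by the same thread */
    struct slot_header *next;  /* next slot owned by the same thread */
    size_t size;               /* size of the slot in bytes */
} slot_header;

/* Attach a slot at the head of a thread's slot list. */
void slot_list_attach(slot_header **head, slot_header *s)
{
    s->prev = NULL;
    s->next = *head;
    if (*head != NULL) (*head)->prev = s;
    *head = s;
}

/* Detach a slot from the list, e.g. when the thread releases it. */
void slot_list_detach(slot_header **head, slot_header *s)
{
    if (s->prev != NULL) s->prev->next = s->next;
    else *head = s->next;
    if (s->next != NULL) s->next->prev = s->prev;
    s->prev = s->next = NULL;
}

/* Total number of bytes held in the slots of the list. */
size_t slot_list_bytes(const slot_header *head)
{
    size_t total = 0;
    for (; head != NULL; head = head->next) total += head->size;
    return total;
}
```

Because prev and next are absolute addresses inside the iso-address area, copying each slot to the same address on the destination node leaves the list intact without any pointer fix-up.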

4.3 Allocating blocks

In contrast to the traditional malloc/free primitives, which deal with dynamic allocations in a contiguous heap, pm2_isomalloc and pm2_isofree manage allocations of arbitrarily-sized blocks within a list of discontinuous slots. Each slot contains a doubly-linked list of free blocks. Blocks have headers storing their size, as well as pointers to the neighboring blocks in the list.

[Figure 10: the slots of a thread chained together, holding the thread stack and user-level data.]

Fig. 10. Each thread keeps its private slots in a doubly-linked list.

[...] the current implementation, a first-fit strategy is used, but other strategies could be considered as well, especially if fragmentation is to be kept low. If no suitable block is found, a new free slot belonging to the current local node is acquired by the thread. It gets attached to its slot list. Then, a new block is allocated in this new slot. This scheme works for all requests for blocks smaller than the slot size, as long as the node owns at least one slot.
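The first-fit search over a list of free blocks can be sketched as follows. The free_block layout and the function name first_fit are hypothetical, and the sketch deliberately omits block splitting and coalescing, which a real allocator would need.

```c
#include <stddef.h>

/* Hypothetical header of a free block inside a slot. */
typedef struct free_block {
    size_t size;               /* usable size of the block in bytes */
    struct free_block *prev;
    struct free_block *next;
} free_block;

/* First-fit: return the first free block of the list that can hold
 * `request` bytes, after unlinking it from the list; NULL if none fits. */
free_block *first_fit(free_block **list, size_t request)
{
    for (free_block *b = *list; b != NULL; b = b->next) {
        if (b->size >= request) {
            if (b->prev != NULL) b->prev->next = b->next;
            else *list = b->next;
            if (b->next != NULL) b->next->prev = b->prev;
            b->prev = b->next = NULL;
            return b;
        }
    }
    return NULL;
}
```

A NULL result corresponds to the case described above where no suitable block is found and the thread has to acquire a new slot from the local node.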

4.4 Coping with large-block allocations

To ensure the compatibility with malloc and free, our allocator can also cope with arbitrarily-sized block requests, larger than a slot. In order to satisfy such requests, the key point is to make up a larger slot out of n regular, contiguous slots and to allocate the block inside this new slot (where n is the smallest number of contiguous slots that would be necessary). For this purpose, the following steps are accomplished.

1. The slot bitmap of the local node is scanned, in order to find the necessary number of contiguous slots. A first-fit strategy is used. If this search is successful, the corresponding slots are given to the thread, which uses them to build up a large slot. This large slot gets attached to the slot list of the thread.

2. If the search fails, a global negotiation phase among all the nodes is launched. The initiating node behaves as follows.

(a) Enter a system-wide critical section. No other node is allowed to modify its slot bitmap within this section. (It may still run its code and allocate/free blocks, as long as no slot management is necessary.)

(b) Gather the local bitmaps of all nodes.

(c) Compute a global OR taking all bitmaps as operands.

(d) Search for the first series of n contiguous available slots in this global bitmap and buy the non-local slots. It suffices to mark these slots with 1 in the bitmap of the requesting node and 0 in the bitmap of their original owner node.

[...] simply enables a node to buy slots from some other nodes.

A global negotiation is obviously an expensive operation, because of the global communication required. It should therefore be kept as exceptional as possible. Two main factors have an impact on the frequency of these negotiations: the slot size and the initial slot distribution. Since all single-slot allocations are guaranteed to be local, the slot must be large enough to avoid multiple-slot allocations as much as possible. On the other hand, even for such allocations, negotiation may be avoided if the necessary number of contiguous slots is locally available. It is therefore important to choose a good initial slot distribution, in order to avoid negotiations even more. Observe that there is no restriction whatsoever on the initial distribution.

Notice also that the manipulation of the bitmaps on the local node may be completely arbitrary. It is in particular possible for the local node to take advantage of a negotiation phase to pre-buy slots in anticipation of foreseeable large allocation requests. It is also possible to completely restructure the slot distribution at the system level, for instance by grouping contiguous free slots as much as possible on the various nodes. The only requirement is that each slot present in the bitmaps must finally belong to exactly one node.
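Steps (b) to (d) of the negotiation can be sketched in a single address space as follows. This is an illustration under simplifying assumptions: the names are hypothetical, the bitmaps use one byte per slot over a tiny 16-slot area for readability, all bitmaps are held in one array instead of being gathered over the network, and the critical section of step (a) is omitted.

```c
#include <stddef.h>

#define NSLOTS 16  /* hypothetical, tiny iso-address area for illustration */

/* One byte per slot for clarity; 1 means "owned by this node and free". */
typedef struct { unsigned char bit[NSLOTS]; } node_bitmap;

/* First-fit search for n contiguous slots marked 1.
 * Returns the index of the first slot of the series, or -1. */
long find_contiguous(const node_bitmap *bm, size_t n)
{
    size_t run = 0;
    for (size_t i = 0; i < NSLOTS; i++) {
        run = bm->bit[i] ? run + 1 : 0;
        if (n > 0 && run == n) return (long)(i + 1 - n);
    }
    return -1;
}

/* Steps (b)-(d): OR all bitmaps, search the result, then buy the slots by
 * marking them 1 for the requester and 0 for every other node.
 * Returns the index of the first bought slot, or -1 if the search fails. */
long negotiate(node_bitmap maps[], unsigned p, unsigned requester, size_t n)
{
    node_bitmap global = {{0}};
    for (unsigned k = 0; k < p; k++)
        for (size_t i = 0; i < NSLOTS; i++)
            global.bit[i] |= maps[k].bit[i];

    long first = find_contiguous(&global, n);
    if (first < 0) return -1;

    for (size_t i = (size_t)first; i < (size_t)first + n; i++)
        for (unsigned k = 0; k < p; k++)
            maps[k].bit[i] = (k == requester);
    return first;
}
```

After a successful call, the requesting node finds the series in its own bitmap and proceeds as in step 1, and every slot still belongs to exactly one node, which is the only invariant the text requires.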

5 Performance and optimizations

We present here some results obtained on our PoPC cluster. Each node consists of a 200 MHz PentiumPro processor. The operating system is Linux 2.0.36. The nodes are interconnected by a Myrinet network from Myricom [9] accessed through the BIP low-level communication interface [12].

The time needed to migrate a thread with no static data between two nodes is less than 75 µs. It was measured by means of a thread ping-pong between two nodes. This time includes packing the thread resources, transferring them over the network, allocating the memory on the destination node and unpacking the resources. Notice that no post-migration processing whatsoever is needed thanks to our iso-address approach. This time should be compared to the 150 µs reported for the migration of a null thread in Active Threads [13]. This performance figure is partly due to the very efficient Madeleine communication layer used by PM2 [2].

Using the pm2_isomalloc function instead of the usual malloc induces a non-significant overhead for the requests of blocks larger than one slot, as shown in Figure 11. This overhead is mainly due to the negotiation automatically required by any multi-slot allocation when the slots are distributed in a round-robin way (which is the case in our experiment). This negotiation takes 255 µs in a 2-node configuration when using BIP/Myrinet. If the underlying architecture provides more than 2 nodes, another 165 µs should be added per extra node. Notice that, for large allocations, this overhead is small and rather insignificant compared to [...]

[Figure 11: two plots of average allocation time (microseconds) against block size (bytes) for malloc and pm2_isomalloc, one for block sizes up to 500000 bytes and one for block sizes up to 8e+06 bytes.]

Fig. 11. Compared performance of malloc and pm2_isomalloc for respectively small and large requests in a 2-node configuration.
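The negotiation timings reported in this section can be read as a simple linear cost model. The formula below is only a restatement of the two measured figures (255 µs for 2 nodes, 165 µs per extra node) under the assumption that the per-node increment stays constant; the symbol T_neg is introduced here for illustration.

```latex
T_{\mathrm{neg}}(p) \approx 255\,\mu\mathrm{s} + (p - 2)\cdot 165\,\mu\mathrm{s}, \qquad p \ge 2
% Example: p = 4 nodes gives 255 + 2 * 165 = 585 microseconds.
```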

6 Conclusion and future work

To validate our approach, we have integrated the iso-address allocation primitives in the runtime libraries used by two data-parallel compilers [3,6]. These compilers have been previously modified, in order to generate multithreaded code for PM2 [11,1]. Thanks to our new allocator, the runtime code responsible for thread migration was significantly simplified. Given that pre- and post-migration processing were reduced, we could notice an improvement of our virtual processor migration time. We are currently working on these aspects.

[...] time at the next slot allocation. When migrating a slot attached to a thread, it is sufficient to send its internally allocated blocks. Additional details on the current implementation and a downloadable version can be found at http://www.ens-lyon.fr/~rnamyst/pm2.html.

References

1. L. Bougé, P. Hatcher, R. Namyst, and C. Perez. Multithreaded code generation for a HPF data-parallel compiler. In Proc. 1998 Int. Conf. Parallel Architectures and Compilation Techniques (PACT'98), ENST, Paris, France, October 1998. Preliminary version available at ftp://ftp.lip.ens-lyon.fr/pub/LIP/Rapports/RR/RR1998/RR1998-43.ps.Z.

2. L. Bougé, J-F. Méhaut, and R. Namyst. Madeleine: an efficient and portable communication interface for multithreaded environments. In Proc. 1998 Int. Conf. Parallel Architectures and Compilation Techniques (PACT'98), pages 240–247, ENST, Paris, France, October 1998. IFIP WG 10.3 and IEEE. Preliminary version available at ftp://ftp.lip.ens-lyon.fr/pub/LIP/Rapports/RR/RR1998/RR1998-26.ps.Z.

3. Th. Brandes. Adaptor (HPF compilation system), developed at GMD-SCAI. Available at http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html.

4. J. Casas, R. Konuru, S. W. Otto, R. Prouty, and J. Walpole. Adaptive load migration systems for PVM. In Proc. Supercomputing '94, pages 390–399, Washington, D.C., November 1994. Available at http://www.mcs.vuw.ac.nz/~pmar/refs.html#R545.

5. D. Cronk, M. Haines, and P. Mehrotra. Thread migration in the presence of pointers. In Proc. Mini-track on Multithreaded Systems, 30th Intl Conf. on System Sciences, Hawaii, January 1997. Available at URL http://www.cs.uwyo.edu/~haines/research/chant.

6. P. J. Hatcher. UNH C*. Available at http://www.cs.unh.edu/pjh/vstar/cstar.html.

7. A. Itzkovitz, A. Schuster, and L. Shalev. Thread migration and its application in distributed shared memory systems. J. Systems and Software, 42(1):71–87, July 1998. Available at http://www.cs.technion.ac.il/Labs/Millipede/.

8. E. Mascarenhas and V. Rego. Ariadne: Architecture of a portable threads system supporting mobile processes. Software: Practice & Experience, 26(3):327–356, March 1996.

9. Myricom. Myrinet link and routing specification. Available at http://www.myri.com/myricom/document.html, 1995.

10. R. Namyst. PM2: an environment for a portable design and an efficient execution of irregular parallel applications. PhD thesis, Univ. Lille 1, France, January 1997. In French.

11. C. Perez. Load balancing HPF programs by migrating virtual processors. In Second Int. Workshop on High-Level Progr. Models and Supportive Env. (HIPS'97), pages 85–92, April 1997.

12. B. Tourancheau and L. Prylli. BIP messages. Available at http://lhpca.univ-lyon1.fr/bip.html.

13. B. Weissman, B. Gomes, J. W. Quittek, and M. Holtkamp. Efficient fine-grain thread migration with Active Threads. In Proceedings of IPPS/SPDP 1998, Orlando, Florida, March 1998. Available at http://www.icsi.berkeley.edu/~sather/