Derek Bruening, Srikrishna Devabhaktuni, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
iye@mit.edu
Abstract

We present Softspec, a technique for parallelizing sequential applications using only simple software mechanisms, requiring no complex program analysis or hardware support. Softspec parallelizes loops whose memory references are stride-predictable. By detecting and speculatively executing potential parallelism at runtime, Softspec succeeds in parallelizing loops whose memory access patterns are statically indeterminable. For example, Softspec can parallelize while loops with un-analyzable exit conditions, linked list traversals, and sparse matrix applications with predictable memory patterns. We show performance results using our prototype implementation.
1 Introduction

Parallel processing can provide scalable performance improvements, and multiprocessor hardware is becoming widely available. However, it is difficult to develop, debug, and maintain parallel code. Compilers can automatically parallelize some sequential applications but are limited in the type of code that they can parallelize. In order to identify parallel regions of code they must use complex interprocedural analyses to prove that the code has no data dependences for all possible inputs. Typical code targeted by these compilers consists of nested loops with affine array accesses written in a language such as FORTRAN that has limited aliasing. Large systems written in modern languages such as C, C++, or Java usually contain multiple modules and memory aliasing, which makes them not amenable to automatic parallelization. Furthermore, code whose memory access patterns are indeterminable at compile time due to dependence on program inputs can be impossible for these compilers to parallelize.
Hardware-based speculative parallelism has been proposed to address the problem of parallelizing code that is hard to analyze statically [HWO98, SM98, MBVS97]. In these schemes, code is speculatively executed in parallel and extra hardware is used to detect dependences and to undo the parallel execution in cases of misprediction. The additional hardware required is extensive due to the need for communication between all processors to determine if dependences exist. Softspec uses compiler and runtime techniques that take advantage of underlying properties of programs to parallelize difficult-to-analyze applications. Softspec succeeds in reducing the communication and computation overhead of speculation and is able to achieve performance improvements without requiring specialized hardware.

(This research was supported in part by the Defense Advanced Research Projects Agency under Grant DBT63-96-C-0036 and by a Nippon Telegraph & Telephone graduate fellowship.)
Our approach to parallelizing applications stems from two main observations. First, the performance improvements for only partially parallel code do not scale well due to synchronization costs. Second, memory access patterns in loops can often be predicted at runtime using simple value predictors. Softspec performs data dependence analysis at runtime using predicted access patterns. It speculatively parallelizes loops and detects speculation failures without inter-processor communication.
We describe the Softspec technique for parallelizing sequential applications using only simple software mechanisms, which can improve performance of modern programs on existing hardware. We present a prototype implementation consisting of a compiler and accompanying runtime system and give experimental results on a symmetric shared-memory multiprocessor.

Since Softspec often targets fine-grain parallelism, we expect even better performance on architectures with lower inter-processor data sharing costs, such as single-chip multiprocessors [HNO97, SM98].
Softspec requires only local program information and does not rely on any global analysis. Its simplicity means it could execute entirely at runtime and target program binaries. Runtime translation [Kla00, CHH+98, ATCL+98] and runtime optimization [BDB00] techniques are becoming prevalent. Softspec can be readily incorporated into such frameworks.
This paper makes the following contributions:

- It is the first application of stride prediction to parallelizing loops.
- It presents a novel approach to speculative parallelism that requires minimal program analysis and no additional hardware.
- It shows how interprocedural parallelism can be exploited without any interprocedural analysis.
- It gives a general technique for parallelizing while loops and is the first technique that is applicable to while loops when the exit condition is not known until the last iteration.
- It gives a general technique for parallelizing sparse matrix algorithms, especially effective for matrices with dense clusters.
The paper is organized as follows. The next section gives an overview of our technique. Section 3 describes the core algorithm in detail, and Section 4 explains how the algorithm handles more complicated loops. Section 5 gives experimental results of our prototype implementation. Related work and conclusions finish the paper.
2 The Softspec Approach

This section gives an overview of the Softspec parallelization technique. Softspec parallelizes loops whose memory references are stride-predictable. A memory access is stride-predictable if the address it accesses is incremented by a constant stride for each successive dynamic instance (e.g., in a loop). Array accesses with affine index expressions within loops are always stride-predictable. It has been shown that many other memory accesses are also stride-predictable [SS97].
Rather than proving at compile time that a loop contains no inter-iteration dependences, we calculate the dependences at runtime. First we dynamically profile the addresses in the first three iterations of the loop. If the addresses are stride-predictable in these three iterations, we predict that the addresses have the same strides for the rest of the loop. Once the stride of each memory access has been identified, we determine whether or not there are inter-iteration dependences among the memory accesses. We do not parallelize loops that contain inter-iteration dependences, since the synchronization costs can outweigh the benefits of parallelization; we only target loops whose iterations can all be executed in parallel (doall loops).
An inter-iteration dependence exists if a write in one iteration is to the same address as a read or a write in another iteration. In order to determine if any such dependences exist, we examine all memory accesses in the loop pairwise. Each memory access instruction covers a region of addresses throughout the iterations of the loop. For each pair we determine how many iterations may be executed before their regions overlap. If there are R memory reads and W memory writes in the loop, W(W+R) comparisons are performed. The minimum of all the results is then the number of parallelizable iterations in the loop.
If the number of parallelizable iterations in the loop is large enough, the loop is speculatively executed in parallel, its parallelizable iterations split evenly among the available processors. If there are very few parallelizable iterations and each iteration contains little work, speculation is not attempted and the loop is executed sequentially.
While speculating, each processor checks that the predicted addresses match the actual addresses. This is a local operation and requires no communication with other processors. In fact, the only global communication required is one bit of information at the end of speculation stating whether all predictions were correct or not. If there is a misprediction, all subsequent iterations must have their speculation undone and be re-executed sequentially. An undo buffer is allocated to store the original values at all addresses written in the loop. Since the predicted addresses are guaranteed to have no inter-iteration dependences, speculative memory accesses use the predicted addresses instead of the actual addresses (which may contain dependences). This enables each processor to undo the effects of speculation independently of the other processors. After undoing in parallel, the remaining iterations of the loop are executed sequentially.
void foo(double *a, double *b, int j) {
  double *p;
  int i;
  p = &a[j];
  for (i=0; i<500; i++) {
    a[i] = i;
    b[i+j] = b[i] - *p;
  }
}

Figure 1: A sample procedure containing a parallelizable loop.
Alternatively, the speculative process can be restarted; for example, speculation could attempt to parallelize only a piece of a loop at a time and only give up if there are many mispredictions. In this way loops with a few gaps in their stride-predictability or with widely varying iteration counts can be fully parallelized.
3 The Core Algorithm

This section describes in detail how Softspec parallelizes simple loops with known numbers of iterations whose bodies are straight-line code. This is the core of the Softspec algorithm. Section 4 explains how we extend the core algorithm to handle loop-carried dependences, branches in loop bodies, while loops, nested loops, and interprocedural parallelism.
To illustrate how Softspec works, consider the code in Figure 1. The loop in this procedure contains four memory accesses: two writes (a[i] and b[i+j]) and two reads (b[i] and *p). These accesses are stride-predictable with a stride of 0 for *p and strides equal to sizeof(double) for the others. If a and b are non-overlapping arrays with lengths at least 500 and j >= 500, there will be no inter-iteration dependences between the memory accesses and the entire loop will be parallelizable.
In order for a parallelizing compiler to parallelize this loop, it must prove statically that a and b are distinct arrays and that there will be no inter-iteration dependences between the memory accesses. Deducing this information at compile time requires sophisticated interprocedural analysis, or may be impossible if the memory addresses are dependent on program inputs. However, a compiler that makes use of runtime predicated parallelism can parallelize this loop by inserting a test that at runtime will deduce if there are memory dependences and only execute the parallel version of the loop if there are none. This example serves only to illustrate the core of the Softspec algorithm; the rest of the algorithm explained in Section 4 targets loops for which practical parallelism-detecting predicates do not exist.
Softspec parallelizes this loop by simply profiling the initial addresses and calculating the intersections of their strides. No complex compile-time analysis is required; all that needs to be done is instrument each memory access.
The Softspec algorithm replaces the loop with four loops: a profile loop, a detection loop, a speculation loop, and a recovery loop. The execution path through these loops is shown in Figure 2. The following sections describe this path in more detail. For further information on the core algorithm see [Dev99].
[Figure 2 timeline omitted: processors 0 through 3 executing the profile, detection, speculation, undo, and recovery loops over iterations i=0 to 499.]

Figure 2: The execution path of Softspec's parallelization of the sample loop in Figure 1. The loop contains 500 iterations, i=0 through i=499. The 27 iterations performed during detection is a typical empirical number for a small loop body. The top portion of the figure shows successful speculation. The bottom portion shows what would happen if a misprediction occurred when i=287.
/*** profile loop ***/
for (i=0; i<3; i++) {
  profile_address[i][0] = &a[i];
  a[i] = i;
  profile_address[i][1] = &b[i+j];
  profile_address[i][2] = &b[i];
  profile_address[i][3] = p;
  b[i+j] = b[i] - *p;
}

Figure 3: The profile loop created from the sequential loop in Figure 1.
3.1 Profiling and Parallelism Detection

The profile loop for the sample code of Figure 1 is shown in Figure 3. As can be seen, it runs the first three iterations of the original loop with instructions inserted to store the addresses of each memory access into data structures used by the runtime system. The outer index of the profile_address array corresponds to the iteration of the loop being profiled, and the inner index is used to number the memory accesses.

When the profile loop finishes, the runtime system calculates the stride of each memory access by simply taking the difference of addresses in consecutive iterations. If each memory access has a consistent stride (i.e., the strides between the first and second and between the second and third profiled iterations are the same), the runtime system then determines how many iterations can be parallelized before an inter-iteration dependence is encountered. If the strides are not consistent then no speculation is performed and the loop is run sequentially. This catches many accesses that are not stride-predictable early on, before any speculation needs to be undone. How Softspec handles accesses that are not executed in all three of the profiling iterations is discussed in Section 4.2.
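The stride calculation and consistency check over the three profiled iterations can be sketched as follows (profile_address mirrors the structure used in Figure 3; the function name and the fixed access count are ours):

```c
#include <stdbool.h>

#define PROFILE_ITERS 3
#define NUM_ACCESSES  4   /* four memory accesses in the Figure 1 loop */

/* Compute each access's stride from the three profiled iterations and
 * check consistency: the first-to-second and second-to-third address
 * differences must match.  Any mismatch means the access is not
 * stride-predictable, so the loop is run sequentially. */
bool compute_strides(char *profile_address[PROFILE_ITERS][NUM_ACCESSES],
                     long stride[NUM_ACCESSES])
{
    for (int a = 0; a < NUM_ACCESSES; a++) {
        long d1 = profile_address[1][a] - profile_address[0][a];
        long d2 = profile_address[2][a] - profile_address[1][a];
        if (d1 != d2)
            return false;   /* inconsistent stride: no speculation */
        stride[a] = d1;
    }
    return true;
}
```

For the loop in Figure 1, this would yield strides of sizeof(double) for the first three accesses and 0 for *p.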
A second thread performs the stride calculations and parallelism detection while the original thread continues sequential execution of the loop, as shown in Figure 2. This enables forward progress on the loop to be made while the detection calculations are performed (the calculations can take time equal to tens of iterations of loops with small bodies).
To identify whether parallelism exists, we examine the memory accesses pairwise to determine how many iterations may be executed before their addresses become equal. For a pair of addresses with values in the first iteration a0 and a1 and with strides s0 and s1, we need to find the minimum values of integers i and j such that a0 + i*s0 = a1 + j*s1. Rewriting the equation as i*s0 - j*s1 = a1 - a0, a solution exists if and only if the greatest common divisor (gcd) of s0 and s1 divides a1 - a0. The solution can be obtained as a singly parameterized set using Euclid's gcd algorithm [AHU74], from which the largest number of parallelizable iterations can be calculated.

[Figure 4 plot omitted: two profiled accesses extrapolated as lines over iterations 0 to 1000, with the parallelizable iterations divided among proc0 through proc3.]

Figure 4: Two memory accesses are compared to determine the number of iterations before they have an inter-iteration dependence. The profiled addresses are shown as dots; their strides are extrapolated to obtain two lines. Inter-iteration dependences may exist starting at the point where one line enters the other's address space. In this case the first access is to a 1000-element array located in memory adjacent to the second access. Thus, the two overlap at 1000 iterations. The parallelizable iterations are divided among four processors as illustrated on the right.
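The solvability test described above can be sketched directly (function names are ours; the full algorithm would go on to parameterize the solution set to find the earliest colliding iteration):

```c
#include <stdbool.h>

/* Euclid's algorithm for the greatest common divisor. */
static long gcd(long a, long b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

/* The equation i*s0 - j*s1 = a1 - a0 has integer solutions iff
 * gcd(s0, s1) divides a1 - a0. */
bool dependence_possible(long a0, long s0, long a1, long s1)
{
    long g = gcd(s0, s1);
    if (g == 0)                 /* both strides zero: fixed addresses */
        return a0 == a1;
    return (a1 - a0) % g == 0;
}
```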
We found that nearly 90% of the parallelism detection execution time was spent calculating gcd's. To avoid this calculation our prototype Softspec implementation uses an approximation algorithm that calculates a conservative solution but is much more efficient. In practice we have never found a case where the conservative algorithm reports less parallelism than the exact gcd algorithm.
This approximation algorithm views each address as a line defined by the initial address value a_i and the stride s_i: y = s_i * x + a_i. For a given pair of addresses, the algorithm extrapolates each profiled address into a line and determines the minimum x value in the first quadrant at which the two lines share y values. This x value is the first iteration at which an inter-iteration dependence can occur. Figure 4 gives an example of this process. The first inter-iteration dependence may actually occur later than the x value computed in this manner, since treating discrete address values as a continuous line considers addresses to overlap that may never actually line up due to equal strides but different offsets. For this reason we add a heuristic case to the approximation algorithm: if the pair of addresses have the same stride and either the same initial value or different initial values modulo the stride, then they will never overlap.
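A sketch of this conservative test, including the equal-stride heuristic, together with the pairwise minimum described in Section 2 (names are ours; strides are assumed non-negative for brevity):

```c
#include <limits.h>

/* Conservative first iteration at which two strided accesses can touch
 * the same address, treating each as the line y = s*x + a.  LONG_MAX
 * means "never".  Sketch: assumes non-negative strides. */
static long first_overlap(long a0, long s0, long a1, long s1)
{
    long t, d;
    if (a0 > a1) { t = a0; a0 = a1; a1 = t; t = s0; s0 = s1; s1 = t; }
    d = a1 - a0;
    if (s0 == s1) {
        /* heuristic: equal strides collide across iterations only if
         * the offsets differ by a nonzero multiple of the stride */
        if (s0 == 0) return d == 0 ? 0 : LONG_MAX;
        return (d != 0 && d % s0 == 0) ? d / s0 : LONG_MAX;
    }
    if (d == 0) return 0;             /* same start, different strides */
    if (s0 == 0) return LONG_MAX;     /* lower access never moves up */
    return (d + s0 - 1) / s0;         /* lower line's region reaches a1 */
}

/* Minimum over all write-(read or write) pairs gives the number of
 * iterations that can safely run in parallel. */
long parallelizable_iterations(const long wa[], const long ws[], int W,
                               const long ra[], const long rs[], int R)
{
    long min = LONG_MAX, n;
    for (int w = 0; w < W; w++) {
        for (int r = 0; r < R; r++)
            if ((n = first_overlap(wa[w], ws[w], ra[r], rs[r])) < min)
                min = n;
        for (int v = w + 1; v < W; v++)
            if ((n = first_overlap(wa[w], ws[w], wa[v], ws[v])) < min)
                min = n;
    }
    return min;
}
```

With the Figure 4 scenario (a write of stride 8 sweeping a 1000-element region adjacent to a read at the same stride), this reports 1000 parallelizable iterations.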
The cost of the detection algorithm becomes negligible as the amount of computation in the target loop increases. An adaptive detection algorithm that runs the approximation algorithm for small loops and the full gcd calculations for large loops could be used.
3.2 Speculation and Recovery from Misprediction

If Softspec detects enough parallelism to warrant speculating, the speculative version of the loop is executed by each processor. The speculative version of the sample loop from Figure 1 is shown in Figure 5. It uses the predicted addresses instead of the actual addresses so that only one misprediction conditional at the end of the loop is needed, instead of a conditional before every memory access. The predicted addresses are initialized before the loop by simply adding
/*** speculation loop ***/
for (i=start[thread]; i<stop[thread]; i++) {
  ncancelflag &= (predict0 == &a[i]);
  *undo_buffer = *predict0;
  undo_buffer++;
  *predict0 = i;
  ncancelflag &= (predict1 == &b[i+j]);
  ncancelflag &= (predict2 == &b[i]);
  ncancelflag &= (predict3 == p);
  *undo_buffer = *predict1;
  undo_buffer++;
  *predict1 = *predict2 - *predict3;
  if (!ncancelflag) { /* fail */ }
  predict0 += delta0;
  predict1 += delta1;
  predict2 += delta2;
  predict3 += delta3;
}

Figure 5: The speculation loop created from the sequential loop in Figure 1.
the stride multiplied by the starting iteration number for the thread to the initial address.
Code inserted into the loop stores values to be overwritten in the undo buffer and increments each predicted address by its stride. Each processor has its own undo buffer that is allocated prior to speculation (it is made large enough to hold all writes in the loop). Additional code checks that the predicted address matches the actual address; if not, the speculation fails: all writes in iterations during and after the misprediction are undone and those iterations are re-executed sequentially by the recovery loop. The recovery loop is identical to the original sequential loop except that it begins execution on the iteration of the misprediction. This process is illustrated at the bottom of Figure 2. Note that all operations are local to each processor except for the success or failure of the speculation at the end of the loop; no other global communication is required.
Undoing writes involves restoring the values from the undo buffer to the predicted addresses, from the most recently executed iteration backward to the earliest executed iteration. Since the predicted addresses are stride-predictable, the undo buffer does not need to store any memory addresses, only values. The processors can undo in parallel since there are no overlaps in the predicted addresses.

Profile      1-4    1-4    0
Detect       0      0      2-4
Speculate    6-11   10-13  2-5
Recovery     0      0      0

Table 1: Overhead in terms of machine instructions for the four loops that Softspec creates when parallelizing a target loop.
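The restore step can be sketched as follows for a loop with a single eight-byte write per iteration (the function name and the double-typed buffer are ours):

```c
/* Restore overwritten values from this processor's undo buffer, most
 * recent iteration first.  Because the written addresses are
 * stride-predictable they are recomputed from the first address and
 * the stride, so the buffer holds only values, no addresses. */
void undo_speculation(const double *undo_buffer, long n_iters,
                      double *first_addr, long stride_elems)
{
    for (long k = n_iters - 1; k >= 0; k--)
        first_addr[k * stride_elems] = undo_buffer[k];
}
```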
If many execution instances of a loop experience early speculation failures, the runtime system disables speculation of that loop and executes it sequentially instead to avoid overhead costs.
3.3 Runtime Overhead
The Softspec algorithm transforms the original sequential loop into four loops. One of these, the recovery loop, is essentially identical to the original loop. The other three loops have extra instructions added that lead to runtime overhead when compared to the original sequential loop.

The profile loop stores the address of each load and store into a shared data structure. The detection loop, which sequentially makes forward progress on the loop while the runtime system detects the amount of parallelism in the loop, is equivalent to the original loop with a check every iteration to see if the parallelism detection has finished. Finally, the speculation loop contains an increment of the predicted address and a check that the predicted address equals the actual address for all memory reads and writes, a store to the write buffer and increment of the write buffer pointer for each write, and additionally a single speculation failure branch. Table 1 summarizes these costs in units of machine instructions for typical compilations.
Note that since speculation uses predicted memory addresses, nested array references (such as A[B[i]]) or multiple levels of pointer indirection (such as **p) do not need to wait for the first memory reference to resolve before accessing the second. Breaking this dependence allows for more instruction-level parallelism and reordering.
Additional overhead is required to synchronize the threads. A barrier is required at the end of the parallel execution to determine if any threads encountered mispredictions. This barrier may cost hundreds or even thousands of cycles on a shared-memory multiprocessor. Fortunately it is the only global synchronization needed by Softspec.

Very simple extensions to the underlying hardware, such as adding a speculative bit to caches that eliminates the need for an undo buffer, could reduce Softspec's overhead significantly.
4 Extending the Core Algorithm

The core of the Softspec algorithm described in Section 3 only handles loops with known numbers of iterations whose bodies are straight-line code, i.e., no procedure calls or branches inside of loops, and no while loops or nested loops. This section explains how Softspec extends the core algorithm to handle loop bodies containing loop-carried dependences and nonlinear control flow, while loops, nested loops, and interprocedural parallelism. We have incorporated all but the last of these into our prototype Softspec implementation. We plan to implement interprocedural parallelism in the near future; we describe here how to do so without using any interprocedural analysis. We then discuss using compiler information to reduce runtime overhead, and compare the runtime overhead of these extensions to that of the core algorithm.
4.1 Loops Containing Loop-Carried Dependences

Memory addresses are not the only values in a loop that are often stride-predictable. Scalar variables with loop-carried dependences exhibiting stride-predictability range from simple induction variables, which are usually analyzable at compile time, to pointers that are difficult to analyze at compile time. A pointer used to traverse a linked list, when the list is laid out contiguously in memory, is a good example of a stride-predictable loop-carried dependence.

Loop-carried dependences are treated in a similar manner to memory addresses. Their values are profiled in the first three iterations of the loop just like memory addresses, and a stride is calculated that is predicted to hold for the rest of the loop. If the stride does hold, the loop is parallelizable; no inter-iteration dependence analysis is needed since the value of the variable can be computed independently for any iteration. This is in contrast to memory addresses, for which merely being stride-predictable is not sufficient for parallelism to exist, since the addresses and not the values at those addresses are being predicted, and the values may depend on each other.
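For instance, a contiguously allocated linked list makes the traversal pointer stride-predictable, so each processor can compute its starting pointer directly (a sketch; names are ours):

```c
#include <stddef.h>

struct node { int data; struct node *next; };

/* If profiling shows the list pointer advancing by a constant number
 * of bytes per iteration, the pointer for iteration i is computable
 * without executing iterations 0..i-1 -- which is what lets each
 * processor begin speculation in the middle of the loop. */
struct node *predict_list_pos(struct node *head, long stride_bytes, long i)
{
    return (struct node *)((char *)head + i * stride_bytes);
}
```

During speculation the predicted value is used as the "actual" value for the iteration, and the value produced by the loop body is checked against the prediction for the next iteration.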
During speculation the "actual" value of the loop-carried dependence is the predicted value for the current iteration. At the end of the iteration, after this actual value is modified in the loop body, its resulting value is compared to the predicted value for the next iteration, and if a misprediction occurs the speculation aborts. This is different from a memory address, whose actual value is computed independently of the predicted value.

The undo mechanism needs no extra information to be able to restore loop-carried dependences to the values they held prior to a misprediction. Each value can be computed using the profiling data for the iteration prior to the speculation failure.
If a loop-carried dependence is used as a pointer, care must be taken to avoid a misprediction causing a memory access outside of the program's address space. The predicted values of the pointer for the initial and final iterations of the loop need to be checked to ensure that they are within the application's address space. If they are, then so are all values at intermediate iterations. Within the address space, any misprediction will be detected and all erroneous writes will be restored to their original values from the undo buffer.
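This endpoint check can be sketched as follows (the region bounds lo and hi stand in for whatever valid-address information the runtime system has; names are ours):

```c
#include <stdbool.h>
#include <stdint.h>

/* Verify that the first and last predicted pointer values fall inside
 * a known-valid region.  Since the prediction is linear, every
 * intermediate iteration's value is then in range as well, so two
 * comparisons cover the whole loop. */
bool pointer_range_safe(uintptr_t first, long stride, long n_iters,
                        uintptr_t lo, uintptr_t hi)
{
    uintptr_t last = first + (uintptr_t)(stride * (n_iters - 1));
    return first >= lo && first < hi && last >= lo && last < hi;
}
```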
4.2 Loops Containing Nonlinear Control Flow

The core algorithm assumed that each memory access would be executed on each iteration. Branches inside the loop body often fail to meet this assumption. This can cause addresses to be encountered during speculation that were not executed during profiling, for which no predicted stride is available and no inter-iteration dependence analysis has been performed. Speculation must also abort if a branch is taken that leads out of the loop.
If a particular access is not executed in any of the three profiling iterations, that access must be distinguishable from fully profiled accesses by the speculation loop. A sentinel in the value of the predicted address is used for this purpose. Unfortunately, the speculation loop must branch immediately based on its check for the sentinel because it cannot execute instructions using an unknown address. Such a speculation failure branch must be placed before every address use that could potentially be unprofiled. This branch causes two problems. First, it degrades performance. This can be avoided in some cases by generating a version of the speculation loop without the branches and using that version when all addresses are executed in the profile loop. Second, it requires the ability to undo a partially finished iteration. This is easily solved: when a misprediction occurs the address it occurred at is recorded, and the undo mechanism uses that information to undo just the portion of the iteration prior to the misprediction.
If an access is executed in only one of the three profiling iterations, a stride cannot be calculated. However, a stride can be guessed based on the strides of other addresses in the loop or possibly on stored strides from previous instances of this loop. Our prototype implementation does not currently guess strides and does not attempt to speculate a loop containing an address that was profiled only once.

If an access is executed in two of the three profiling iterations, a stride can be calculated and speculated just like a fully profiled address. Alternatively, a more dynamic approach could decide to execute further profiling iterations to verify the stride.
Detecting unprofiled loop-carried dependences is done in the same manner as for memory addresses, but since loop-carried dependences can take on any value, a sentinel cannot be used. Instead, the profiling data structures are extended.
4.3 While Loops

While loops with un-analyzable exits are difficult to parallelize since the number of iterations in the loop is not known at compile time. In fact, the number of iterations is not known until the final iteration of the loop. The only way to parallelize such a loop is speculatively. The number of iterations must be guessed and overshooting (guessing more iterations than actually exist) must be handled.
The core algorithm is easily extended to handle while loops. Guessing the number of iterations is accomplished in one of several ways. It can be guessed outright. The loop can be run sequentially the first time it is encountered and the number of iterations from that execution can be used as the guess for the next execution. Or, repeated speculation can be used: a small number of iterations is repeatedly speculated until the loop exit condition is reached. Repeated speculation is a good strategy for while loops that may have a few stride-predictability gaps but for the most part are stride-predictable, and for while loops whose iteration count varies widely. In any case, later speculation of the loop is made adaptive by keeping a running average of the number of iterations in previous executions of the loop and using it to speculate future instances of the loop.

The while loop is transformed into a for loop that runs for the guessed number of iterations and breaks if the while loop exit condition is met before the for loop finishes. The profile, detection, and speculation loops are generated from this for loop. The recovery loop is the original while loop.
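The transformation can be sketched with a toy un-analyzable loop (the exit condition, body, and running-average update here are stand-ins of ours):

```c
/* Toy stand-in for a while loop whose trip count is unknown until the
 * final iteration: advance x until it reaches a limit. */
static long x;
static int  exit_condition(void) { return x >= 100; }
static void loop_body(void)      { x += 7; }

/* The while loop becomes a for loop over the guessed trip count that
 * breaks if the exit condition is reached early; the profile,
 * detection, and speculation loops are generated from this form. */
long run_with_guess(long guess)
{
    long i;
    for (i = 0; i < guess; i++) {
        if (exit_condition())
            break;              /* exit reached before the guess */
        loop_body();
    }
    return i;                   /* actual iteration count executed */
}

/* Adaptive guessing: fold the actual count into a running average
 * used to speculate future instances of the loop. */
long update_guess(long avg, long actual) { return (avg + actual) / 2; }
```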
If the while condition is overshot, i.e., too many iterations are executed, the extra iterations are undone and the running average of the number of iterations is updated. If the speculation ends successfully and the while condition has not been reached, speculation can begin again on the remainder of the loop, or else the recovery loop finishes executing the loop. Address values in the speculation loop must be checked to ensure they are within the address space of the program (predicted values in iterations beyond the actual loop end could be outside of the address space). This can be done by checking the initial and final predicted values prior to the speculation loop.
4.4 Nested Loops

The core algorithm only handles innermost loops. Even when parallelism exists in inner loops, greater performance improvements can be achieved by parallelizing outer loops, especially when executing on machines with high data sharing costs between processors' caches, where a higher computation to overhead ratio is required. However, targeting outer loops complicates parallelism detection because there are many more opportunities for inter-iteration dependences. Softspec extends the core algorithm to handle outer loops without unduly increasing its complexity.

In a nested loop, the inner loops can be treated in two ways. The inner loops' memory accesses can be required to be stride-predictable (the linear method), or they can be required to fall within a certain range (the range method). The two methods are handled similarly by Softspec. We implemented both methods and found that, as expected, the range method is less efficient than the linear method. Although the range method places fewer restrictions on which loops are parallelizable, we never encountered a loop for which the range method applied but the linear method did not. Our prototype implementation uses the linear method.
The profiling step for nested loops runs the outer loop for three iterations and the inner loops for their full number of iterations. The linear method only gathers profiling data on the first three iterations of the inner loops, while the range method keeps track of the maximum and minimum values of each address accessed in each inner loop. For the linear method, all memory accesses in a loop have a stride that holds within the outer loop; a memory access in an inner loop has a separate stride for that loop. Both strides are necessary to predict addresses that are accessed in both the inner and outer loop. To handle addresses that are accessed in more than two layers of loops, additional separate strides must be calculated.

Note that loop-carried dependences in nested loops can be treated just like in non-nested loops. Only a single stride is required for each loop-carried dependence since it does not matter what happens to its value inside inner loops; all that matters is that its value at the end of each iteration of the outer loop is stride-predictable. That is the only value that is profiled for both the linear and range methods.
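Under the linear method, the predicted address of an access in a doubly nested loop combines one stride per loop level; a sketch (names are ours):

```c
#include <stddef.h>

/* Predicted address for outer iteration i and inner iteration j:
 * base + i*outer_stride + j*inner_stride.  For a row-major 2-D array
 * of doubles, the outer stride is the row size in bytes and the inner
 * stride is sizeof(double). */
char *predict_nested(char *base, long i, long outer_stride,
                     long j, long inner_stride)
{
    return base + i * outer_stride + j * inner_stride;
}
```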
In the detection step, the linear method calculates inner and outer strides and from them calculates a range of values for each memory access based on the number of iterations in the inner loops, while the range method calculates each access's range directly from the profiled minimum and maximum values. Each range has a stride, just like a single memory access in the outer iteration has a stride. The detection algorithm extrapolates and compares the ranges in the same manner as the core algorithm compares lines to calculate the number of parallelizable iterations.
Speculation and undoing for the linear method are similar to non-nested loops. For the range method, since its addresses are not assumed to be stride-predictable, either the undo buffer must store the address of each write whose old value it stores, or the entire range of values overwritten in an inner loop must be copied to the undo buffer at once. Also, the range method uses actual addresses instead of predicted addresses in its inner loops, and so it must check before each memory access that the actual address is within the predicted range. These two factors combine to make the range method less efficient than the linear method.

The greatest challenge for Softspec in parallelizing nested loops is determining which loop to parallelize in multiply-nested loops. Compiler analysis can sometimes find the outermost parallelizable loop; a dynamic solution attempts parallelization of the outermost loop and if it fails tries each nested inner loop in turn.
4.5 Interprocedural Parallelism

In a purely runtime implementation, procedure calls are not an obstacle to parallelization with Softspec since the program is viewed at the execution trace level and procedures are essentially inlined.

Extending a compiler-based Softspec implementation to handle procedure calls in loop bodies can be done without any interprocedural analysis. Separate versions of each procedure for profiling and speculating can be generated independently. The memory references are then dynamically numbered. Libraries must have their procedures made "Softspec-ready". Runtime analysis will view the loop from the level of its trace, treating the procedures as though they were inlined, and not need any interprocedural information. A flag must be used to prevent nested speculation from loops inside of procedure calls.
This is markedly simpler than any other treatment of procedure calls in loops. Parallelizing compilers must perform complex interprocedural analysis to determine if procedure calls may be parallelized. Softspec enables a very simple mechanism for interprocedural parallelism. We have not yet incorporated this into our prototype implementation of Softspec, but plan to do so in the near future.
4.6 Using Advanced Compiler Analysis

For many accesses in a loop, the compiler may be able to prove that the accesses have a given pattern. For example, the access pattern of a local induction variable whose address is not taken can be easily discerned. These variables do not require any profiling or runtime checking. Also, the loop's dependence analysis can be partially evaluated at compile time.
The overhead of the undo buffer can be reduced or eliminated using compile-time information. If a memory reference is read before it is written, the read value can be directly written to the undo buffer, eliminating a load instruction. Furthermore, if the compiler can deduce how to re-create the original value of a modified memory access, no undo information for that access needs to be stored. Combining compile-time information with runtime checks in this way can also make the compiler much more robust: a loop with a few unanalyzable accesses will still be parallelizable with a little extra runtime overhead.
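To make the undo buffer concrete, here is a minimal sketch (the names and the fixed capacity are our illustration, not the prototype's implementation): each speculative store logs the original value before overwriting it, and a misspeculation rolls the log back in reverse order.

```c
#include <stddef.h>

#define UNDO_MAX 1024  /* illustrative fixed capacity */

typedef struct {
    double *addr;  /* location that was speculatively overwritten */
    double  old;   /* value it held before the store */
} undo_entry;

static undo_entry undo_buf[UNDO_MAX];
static size_t undo_len = 0;

/* Log the original value, then perform the speculative store. */
static void spec_store(double *addr, double val) {
    undo_buf[undo_len].addr = addr;
    undo_buf[undo_len].old  = *addr;  /* this load is what compile-time
                                         information can eliminate */
    undo_len++;
    *addr = val;
}

/* On misspeculation, restore logged values in reverse order. */
static void spec_rollback(void) {
    while (undo_len > 0) {
        undo_len--;
        *undo_buf[undo_len].addr = undo_buf[undo_len].old;
    }
}
```

The optimizations described above either remove the load inside `spec_store` (when the old value is already in a register) or remove the log entry entirely (when the old value is re-creatable).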
4.7 Runtime Overhead of Extended Algorithm
The extensions to the core algorithm described in this section add additional runtime overhead to that analyzed in Section 3.3. Loop-carried dependences do not affect the overhead much, since even in a nested loop they are only checked at the end of the outer loop. Nonlinear control flow, however, adds failure branches to the middle of the speculation loop, which can be relatively expensive. The only extra overheads of while loops come from overshooting the number of iterations and the added overhead of repeated speculation. Finally, nested loops using the linear method do not have appreciably more runtime costs than non-nested loops, while the range method adds costs in the form of failure branches and copying large regions to the undo buffer at once.
5 Experimental Results
We have developed a prototype implementation of the Softspec algorithm consisting of a compiler and accompanying runtime system. The prototype incorporates all of the algorithm components described in Section 4 except for interprocedural parallelism. The compiler transforms target sequential code into speculatively parallel code, which is then linked with the runtime system. The compiler was written in SUIF [AALT95] and the runtime system was written in C. The pthreads library is used to create a separate thread for each processor on the machine; these threads are created at program startup and are used for all speculation in the program.
This section presents the results of using our prototype to parallelize various types of applications. The target applications were run on a Digital AlphaServer 8400, which is a bus-based shared-memory multiprocessor containing eight Digital Alpha processors. We expect even better performance improvements on architectures with lower costs of data sharing between processors' caches, such as single-chip multiprocessors [HNO97, SM98].
Note that there is nothing about the Softspec technique that requires a source-code compiler. It could operate on program binaries, and could execute completely at runtime in a dynamic optimization framework. If the speedups presented here can be achieved in such environments, they will be even more impressive. The simplicity of the Softspec approach lends it versatility.
5.1 Dense Matrix Applications
We first examine how Softspec performs on dense matrix applications, which are typically highly parallelizable by current compilers. Obviously we do not expect Softspec to produce greater speedups than automatically parallelizing compilers, because of our runtime overhead. This section gives an idea of how Softspec compares to automatically parallelizing compilers.
Consider the matrix multiplication code in Figure 6. The middle loop is parallelizable, and Softspec parallelizes it successfully, with speedup shown in Figure 7.
for (i = 0; i < L; i++) {
  for (j = 0; j < M; j++) {
    for (k = 0; k < N; k++) {
      c[i][k] += a[i][j] * b[j][k];
    }
  }
}

Figure 6: Dense matrix multiplication loop nest.
Figure 7: Speedup obtained by Softspec for multiplication of a dense 1000 by 1000 square matrix.
When executed on one processor, speculation has nearly a two-times slowdown. The sequential program's loop body is very simple and more easily optimized by the compiler than the parallel version of the code generated by Softspec. The sparse matrix code in Section 5.4 is more difficult to optimize and thus exhibits higher speedups and much lower slowdown on one processor. On two processors the dense matrix multiplication breaks even, and its speedup increases linearly by a little more than 0.5 with each processor added.
Figure 8 gives Softspec's speedup when applied to the SPEC95 FP benchmark swim. The results are similar to the matrix multiplication results.
5.2 While Loops with Un-analyzable Exits
Automatically parallelizing compilers naturally cannot parallelize loops whose termination conditions are un-analyzable until the last iteration at runtime. Softspec's dynamic techniques can parallelize such loops. Our prototype implementation of Softspec executes a while loop sequentially the first time it is encountered and uses the number of iterations in that first execution as the number of iterations to speculate on. The iteration count is updated as a running average to adapt to varying-length while loops.
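The adaptive trip-count estimate can be sketched as follows (the equal weighting below is our assumption; the paper does not specify the exact averaging rule):

```c
/* Running-average estimate of a while loop's trip count, used as the
 * number of iterations to speculate on the loop's next invocation.
 * The 50/50 weighting is an assumed choice, not taken from the
 * prototype implementation. */
static double trip_estimate = 0.0;

static int next_speculation_count(int observed_iters) {
    if (trip_estimate == 0.0)
        trip_estimate = observed_iters;  /* first, sequential execution */
    else
        trip_estimate = 0.5 * trip_estimate + 0.5 * observed_iters;
    return (int)trip_estimate;
}
```

An exponential average like this adapts quickly when successive invocations of the loop have very different lengths, at the cost of some overshoot when the length is stable.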
Consider the example code in Figure 9, a procedure for performing convolution in which one array's end is marked with a sentinel. In a program that calls this procedure multiple times, Softspec obtains the speedup shown in Figure 10.
Figure 8: Speedup for the SPEC95 FP benchmark swim.
void convolve(double *A, int lengthA, double *B,
              double sentinel, double *C) {
  int i, j;
  i = lengthA;
  while (B[i] != sentinel) {
    double tmp = 0;
    for (j = 0; j < lengthA; j++) {
      tmp += A[j] * B[i-j];
    }
    C[i] = tmp;
    i++;
  }
}

Figure 9: Example code containing a loop whose exit condition is not analyzable at compile time. The code performs a convolution in which the end of array B is marked with a sentinel.
Figure 10: Speedup for a program that makes 100 calls to the convolve procedure of Figure 9, each with lengthA=1000 and length of B=20000.
void compute(node *link) {
  while (link != NULL) {
    /* do work on link */
    ...
    /* move to next link */
    link = link->next;
  }
}

Figure 11: Code traversing a linked list and performing some computation at each node.
5.3 Linked List Traversal
Code that traverses a linked list can be difficult to impossible to parallelize. If the nodes of the list are all laid out contiguously, then the traversal is amenable to parallelization by Softspec. In a system with a garbage collector, the memory layout of the list can be controlled. The garbage collector can cooperate with Softspec by keeping the list laid out contiguously as much as possible. A Java virtual machine, for example, could implement Softspec dynamically and use its control of memory layout to parallelize many loops otherwise not parallelizable.
Without control over the memory layout, a repeated speculation strategy can be used to obtain speedup on the regions of the list that happen to be contiguous. We investigated performance on lists with varying memory layouts. In practice, lists often have clusters of noncontiguous nodes. To model this, we allocate a 20,000-node list such that each sequence of 50 consecutive nodes has a certain chance of either being contiguous or having frequent gaps between the nodes. We parallelized a program that traverses this list and performs some computation at each node. The code framework is shown in Figure 11.
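The repeated speculation strategy relies on the fact that a run of contiguously allocated nodes yields a constant stride between successive link pointers. A minimal check for this property (our illustration, using a hypothetical node type) might be:

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    double data;
} node;

/* Returns 1 if the first run_len nodes starting at head are laid out
 * at a constant byte stride (e.g., allocated contiguously), which is
 * the condition under which their addresses are stride-predictable. */
static int run_is_stride_predictable(const node *head, int run_len) {
    if (head == NULL || head->next == NULL || run_len < 2)
        return 0;
    ptrdiff_t stride = (const char *)head->next - (const char *)head;
    const node *p = head->next;
    for (int i = 2; i < run_len && p->next != NULL; i++) {
        if ((const char *)p->next - (const char *)p != stride)
            return 0;  /* gap in memory layout: prediction would fail */
        p = p->next;
    }
    return 1;
}
```

In Softspec the equivalent check happens implicitly: speculation on predicted node addresses simply fails and rolls back when a run contains a gap.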
Performance results for this program are given for different chances of each sequence of the list containing gaps, and for different numbers of iterations to speculate (repeatedly). Results for a 0% chance of each sequence having gaps (a completely contiguous list) are shown in Figure 12, for a 1% chance in Figure 13, and for a 5% chance in Figure 14. Each figure shows the speedup for five different numbers of iterations speculated: "All" performs speculation of the entire loop and does not try again after reaching a misprediction, while the other four, "100", "250", "400", and "800", repeatedly speculate that many iterations at a time until the loop finishes, regardless of how many speculations fail due to mispredictions.
5.4 Sparse Matrix Applications
In this section we give an example of a sparse matrix application that is for some matrices stride-predictable and contains parallelism, but is not parallelizable by current compilers. The FORTRAN code in Figure 15 is the core code for matrix multiplication of sparse matrices stored in compressed row storage format. The many arrays and multiple levels of indirection make analysis of this loop too difficult for current parallelizing compilers, but Softspec's parallelization scheme is applicable.
When the current row in each of the two matrices being multiplied contains contiguous regions of non-zero values, the memory references in each of the two branches of the
Figure 12: Speedup for the sample program in Figure 11 operating on a completely contiguous list. Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
if are stride-predictable. Thus when many consecutive iterations of the inner loop take the same branch, the inner loop is parallelizable by Softspec. The then branch initializes new non-zero elements in the product matrix while the else branch adds to an existing element. The row-by-row multiplication ends up often entering the same branch on consecutive iterations of the inner loop. The frequency and duration of parallelizable iteration sequences are dependent on the input matrices. Sparse matrix data sets such as the Non-Hermitian Eigenvalue Problem Collection [NHE] often contain matrices with non-zero values in a stripe down the diagonal or in blocks down the diagonal. Such patterns lead to parallelizable iteration sequences of lengths equal to the width of the stripes or blocks.
Figure 16 shows the speedup obtained by Softspec when multiplying block diagonal sparse matrices with different block sizes. Six different sparse matrices with non-zero elements in square blocks down the diagonal were multiplied by themselves. The block widths for the six matrices were 25, 50, 100, 200, 300, and 400, respectively. The number of blocks does not affect the speedup much, as it only changes the total runtime and not the length of the parallelizable sequences.
No speedup (or slowdown) occurs for parallelizable sequences of 25 or 50 iterations. For 100 iterations moderate speedup is achieved, but there is not enough work for the speedup to scale beyond five or six processors. Iteration counts of 200, 300, and 400 all achieve scalable speedup.
The speedups obtained on a few hundred iterations of the inner loop of the sparse matrix multiplication are greater than those for the nested loop of the dense matrix multiplication, even though the latter has much more work in each parallelized iteration. The reason stems from the ability of the compiler to more aggressively optimize the dense matrix loop body versus the sparse matrix loop body. The multiple array indirections inhibit optimizations that can be applied to the simpler dense matrix code. This results in less of a disparity in speed between the sequential code and the
Figure 13: Speedup for the sample program in Figure 11 operating on a list with each sequence of 50 nodes having a 1% chance of containing many memory gaps (a 50% chance of a gap between each node). Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
Figure 14: Speedup for the sample program in Figure 11 operating on a list with each sequence of 50 nodes having a 5% chance of containing many memory gaps (a 50% chance of a gap between each node). Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
  do k = offb(ja(j)), offb(ja(j)+1)-1
    if (ptr(jb(k)) .eq. 0) then
      ptr(jb(k)) = i
      c(ptr(jb(k))) = a(j)*b(k)
      index(ptr(jb(k))) = jb(k)
      i = i + 1
    else
      c(ptr(jb(k))) = c(ptr(jb(k))) + a(j)*b(k)
    endif
  enddo
enddo

Figure 15: The core FORTRAN loop for multiplying sparse matrices a and b and storing the result in c. The sparse matrices are stored in compressed row storage format. The arrays offa, ja, offb, and jb are used to locate the columns and rows of elements in a and b.
Figure 16: Speedup for the sparse matrix multiplication code of Figure 15 operating on six different block diagonal matrices. The six matrices have blocks of widths 25, 50, 100, 200, 300, and 400, respectively.
parallelized code generated by Softspec.
6 Related Work
Over the last two decades, compiler techniques for automatically parallelizing sequential applications have advanced remarkably. Modern parallelizing compilers are very successful when targeting certain types of code, namely nested loops containing limited memory aliasing. However, these compilers are large systems that use complex interprocedural analyses: the Polaris compiler [BEH+94] contains over 170,000 lines of code, and the SUIF compiler [AALT95] contains over 150,000 lines of code.
Alias analysis algorithms [WL95, EGH94, RR98] can enable automatic parallelization in the presence of memory aliases. However, these alias analysis systems work only with small, self-contained programs and do not scale with program size.
Hardware-based speculative parallelization has been proposed to overcome the need for a compile-time proof that code is parallelizable. Candidate loops are speculatively executed in parallel while a complex hardware system observes all memory accesses in order to detect inter-iteration data dependences. When a dependence is detected, additional hardware mechanisms undo the speculative execution and the loop is re-executed sequentially. Some proposals extend existing multiprocessor hardware [HWO98, SM98] while others present completely new hardware structures [MBVS97]. Targeting procedural parallelism using hardware has also been proposed [OHL99]. Because speculative hardware schemes do not rely on program characteristics, they require a lot of work and global communication. Softspec takes advantage of memory access patterns to reduce the amount of work and communication needed to detect dependences, making it viable in software.
Fundamental research into program behavior has shown that both data and address values can be predicted by stride prediction and last-value prediction [SS97]. Stride-predictability of memory accesses in scientific applications has been successfully exploited to improve the cache behavior of these codes through compiler-inserted prefetching in uniprocessor and multiprocessor machines [FP91]. The stride-predictability of memory addresses has been used to perform speculative prefetching in out-of-order superscalars [GG97].
Software-based speculative parallelization schemes have mostly focused on determining the inter-iteration dependence patterns of a partially parallel loop in order to construct parallelization schedules. These schemes focus on an inspector-executor model [LZ93], in which an extracted inspector loop analyzes the memory accesses at runtime and constructs a schedule for the executor loop, which runs parts of the loop in parallel using synchronization.
Other software-based techniques speculatively run code in parallel and monitor all memory accesses using shadow arrays [RP95a]. When an inter-iteration data dependence is observed, the speculation is undone. This is similar to the hardware-based proposals, but with greater runtime overhead, and it targets only for loops that operate on arrays. Mechanisms for parallelizing certain types of while loops have been developed [RP95b], but they require the help of compiler analysis and code reorganization that could not practically be done at runtime.
7 Conclusions
We have presented a novel approach to parallelization of sequential code that does not require complex program analysis or additional hardware mechanisms. We have demonstrated the benefits of this approach on a variety of programs using a prototype implementation.
The trend towards object-oriented programming and shared library usage is making it increasingly difficult for an automatically parallelizing compiler to perform the whole-program analysis it needs to identify parallelism. We have shown how to exploit interprocedural parallelism with no need for interprocedural analysis, leading to a highly scalable solution that supports separate compilation. We plan to incorporate this into our prototype implementation of Softspec.
Softspec can be combined with existing automatically parallelizing compilers to increase the range of loops they can parallelize and to provide more robust performance.
We plan to simulate the performance of our implementation on proposed single-chip multiprocessors. A factor limiting the speedups obtainable on shared-memory multiprocessors is the high cost of data sharing between processors' caches. These costs are much lower on proposed single-chip multiprocessors. Although we showed speedup on existing hardware, simple hardware extensions, such as adding a speculative bit to caches that eliminates undo overhead, can drastically improve Softspec performance.
Softspec can be implemented entirely in software and can target program binaries. In the future we hope to integrate Softspec into a virtual machine environment with control of the garbage collector to enhance data structure stride-predictability. We would also like to investigate incorporating Softspec into a binary optimizer. This would enable parallelization of legacy code or third-party software that is only available in binary form and cannot be recompiled.
Softspec introduces a framework for parallelization based on prediction rather than proof of parallelism that enables parallelization of a large class of important applications that are currently unable to use automatic parallelization techniques. We believe that the Softspec framework will lead to many other techniques involving different optimization and prediction schemes.
References
[AALT95] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, Reading, MA, 1974.
[ATCL+98] Ali-Reza Adl-Tabatabai, Michal Cierniak, Guei-Yuan Lueh, Vishesh M. Parikh, and James M. Stichnoth. Fast, effective code generation in a just-in-time Java compiler. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998.
[BDB00] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: A transparent runtime optimization system. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, June 2000.
[BEH+94] William Blume, Rudolf Eigenmann, Jay Hoeflinger, David Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. Automatic detection of parallelism: A grand challenge for high-performance computing. IEEE Parallel and Distributed Technology, 2(3):37-47, Fall 1994. Also http://polaris.cs.uiuc.edu/polaris/.
[CHH+98] Anton Chernoff, Mark Herdeg, Ray Hookway, Chris Reeve, Norman Rubin, and Tony Tye. FX!32: A profile-directed binary translator. IEEE Micro, 18(2), March 1998.
[Dev99] Srikrishna Devabhaktuni. Softspec: Software-based speculative parallelism via stride prediction. Master's thesis, M.I.T., 1999. http://www.cag.lcs.mit.edu/~saman/student_thesis/Sri-99.ps.
[EGH94] M. Emami, R. Ghiya, and L. J. Hendren. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.
[FP91] J. W. C. Fu and J. H. Patel. Data prefetching in multiprocessor vector cache memories. In The 18th Annual International Symposium on Computer Architecture, May 1991.
[GG97] J. Gonzalez and A. Gonzalez. Speculative execution via address prediction and data prefetching. In 11th International Conference on Supercomputing, July 1997.
[HNO97] Lance Hammond, Basem Nayfeh, and Kunle Olukotun. A single-chip multiprocessor. IEEE Computer, 30(9), September 1997.
[HWO98] Lance Hammond, Mark Willey, and Kunle Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the Eighth ACM Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.
[Kla00] Alexander Klaiber. The technology behind Crusoe processors. Transmeta Corporation, January 2000. http://www.transmeta.com/crusoe/download/pdf/crusoetechwp.pdf.
[LZ93] Shun-Tak Leung and John Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.
[MBVS97] A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi. Dynamic speculation and synchronization of data dependencies. In 24th International Symposium on Computer Architecture, June 1997.
[NHE] Non-Hermitian eigenvalue problem collection. http://math.nist.gov/MatrixMarket/data/NEP.
[OHL99] J. Oplinger, D. Heine, and M. S. Lam. In search of speculative thread-level parallelism. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, October 1999.
[RP95a] Lawrence Rauchwerger and David Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.
[RP95b] Lawrence Rauchwerger and David Padua. Parallelizing while loops for multiprocessor systems. In Proceedings of the 9th International Parallel Processing Symposium, April 1995.
[RR98] R. Rugina and M. Rinard. Span: A shape and pointer analysis package. Technical Report LCS-TM-581, M.I.T., June 1998.
[SM98] J. G. Steffan and T. C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.
[SS97] Y. Sazeides and J. E. Smith. The predictability of data values. In Proceedings of the 30th Annual International Symposium on Microarchitecture, December 1997.
[WL95] R. P. Wilson and M. S. Lam. Efficient context-sensitive pointer analysis for C programs. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.