Derek Bruening, Srikrishna Devabhaktuni, Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
iye@mit.edu
Abstract

We present Softspec, a technique for parallelizing sequential applications using only simple software mechanisms, requiring no complex program analysis or hardware support. Softspec parallelizes loops whose memory references are stride-predictable. By detecting and speculatively executing potential parallelism at runtime, Softspec succeeds in parallelizing loops whose memory access patterns are statically indeterminable. For example, Softspec can parallelize while loops with un-analyzable exit conditions, linked list traversals, and sparse matrix applications with predictable memory patterns. We show performance results using our prototype implementation.
1 Introduction

Parallel processing can provide scalable performance improvements, and multiprocessor hardware is becoming widely available. However, it is difficult to develop, debug, and maintain parallel code. Compilers can automatically parallelize some sequential applications but are limited in the type of code that they can parallelize. In order to identify parallel regions of code they must use complex interprocedural analyses to prove that the code has no data dependences for all possible inputs. Typical code targeted by these compilers consists of nested loops with affine array accesses written in a language such as FORTRAN that has limited aliasing. Large systems written in modern languages such as C, C++, or Java usually contain multiple modules and memory aliasing, which makes them not amenable to automatic parallelization. Furthermore, code whose memory access patterns are indeterminable at compile time due to dependence on program inputs can be impossible for these compilers to parallelize.
Hardware-based speculative parallelism has been proposed to address the problem of parallelizing code that is hard to analyze statically [HWO98, SM98, MBVS97]. In these schemes, code is speculatively executed in parallel and extra hardware is used to detect dependences and to undo the parallel execution in cases of misprediction. The additional hardware required is extensive due to the need for communication between all processors to determine if dependences exist. Softspec uses compiler and runtime techniques that take advantage of underlying properties of programs to parallelize difficult-to-analyze applications. Softspec succeeds in reducing the communication and computation overhead of speculation and is able to achieve performance improvements without requiring specialized hardware.

(This research was supported in part by the Defense Advanced Research Projects Agency under Grant DBT63-96-C-0036 and by a Nippon Telegraph & Telephone graduate fellowship.)
Our approach to parallelizing applications stems from two main observations. First, the performance improvements for only partially parallel code do not scale well due to synchronization costs. Second, memory access patterns in loops can often be predicted at runtime using simple value predictors. Softspec performs data dependence analysis at runtime using predicted access patterns. It speculatively parallelizes loops and detects speculation failures without inter-processor communication.
We describe the Softspec technique for parallelizing sequential applications using only simple software mechanisms, which can improve performance of modern programs on existing hardware. We present a prototype implementation consisting of a compiler and accompanying runtime system and give experimental results on a symmetric shared-memory multiprocessor.

Since Softspec often targets fine-grain parallelism, we expect even better performance on architectures with lower inter-processor data sharing costs, such as single-chip multiprocessors [HNO97, SM98].
Softspec requires only local program information and does not rely on any global analysis. Its simplicity means it could execute entirely at runtime and target program binaries. Runtime translation [Kla00, CHH+98, ATCL+98] and runtime optimization [BDB00] techniques are becoming prevalent. Softspec can be readily incorporated into such frameworks.
This paper makes the following contributions:

- It is the first application of stride prediction to parallelizing loops.
- It presents a novel approach to speculative parallelism that requires minimal program analysis and no additional hardware.
- It shows how interprocedural parallelism can be exploited without any interprocedural analysis.
- It gives a general technique for parallelizing while loops and is the first technique that is applicable to while loops when the exit condition is not known until the last iteration.
- It gives a general technique for parallelizing sparse matrix algorithms, especially effective for matrices with dense clusters.
The paper is organized as follows. The next section gives an overview of our technique. Section 3 describes the core algorithm in detail, and Section 4 explains how the algorithm handles more complicated loops. Section 5 gives experimental results of our prototype implementation. Related work and conclusions finish the paper.
2 The Softspec Approach

This section gives an overview of the Softspec parallelization technique. Softspec parallelizes loops whose memory references are stride-predictable. A memory access is stride-predictable if the address it accesses is incremented by a constant stride for each successive dynamic instance (e.g., in a loop). Array accesses with affine index expressions within loops are always stride-predictable. It has been shown that many other memory accesses are also stride-predictable [SS97].
Rather than proving at compile time that a loop contains no inter-iteration dependences, we calculate the dependences at runtime. First we dynamically profile the addresses in the first three iterations of the loop. If the addresses are stride-predictable in these three iterations, we predict that the addresses have the same strides for the rest of the loop. Once the stride of each memory access has been identified, we determine whether or not there are inter-iteration dependences among the memory accesses. We do not parallelize loops that contain inter-iteration dependences, since the synchronization costs can outweigh the benefits of parallelization; we only target loops whose iterations can all be executed in parallel (doall loops).
An inter-iteration dependence exists if a write in one iteration is to the same address as a read or a write in another iteration. In order to determine if any such dependences exist, we examine all memory accesses in the loop pairwise. Each memory access instruction covers a region of addresses throughout the iterations of the loop. For each pair we determine how many iterations may be executed before their regions overlap. If there are R memory reads and W memory writes in the loop, W(W+R) comparisons are performed. The minimum of all the results is then the number of parallelizable iterations in the loop.
If the number of parallelizable iterations in the loop is large enough, the loop is speculatively executed in parallel, its parallelizable iterations split evenly among the available processors. If there are very few parallelizable iterations and each iteration contains little work, speculation is not attempted and the loop is executed sequentially.
While speculating, each processor checks that the predicted addresses match the actual addresses. This is a local operation and requires no communication with other processors. In fact, the only global communication required is one bit of information at the end of speculation stating whether all predictions were correct or not. If there is a misprediction, all subsequent iterations must have their speculation undone and be re-executed sequentially. An undo buffer is allocated to store the original values at all addresses written in the loop. Since the predicted addresses are guaranteed to have no inter-iteration dependences, speculative memory accesses use the predicted addresses instead of the actual addresses (which may contain dependences). This enables each processor to undo the effects of speculation independently of the other processors. After undoing in parallel, the remaining iterations of the loop are executed sequentially.
void foo(double *a, double *b, int j) {
  double *p;
  int i;
  p = &a[j];
  for (i=0; i<500; i++) {
    a[i] = i;
    b[i+j] = b[i] - *p;
  }
}

Figure 1: A sample procedure containing a parallelizable loop.
Alternatively, the speculative process can be restarted; for example, speculation could attempt to parallelize only a piece of a loop at a time and only give up if there are many mispredictions. In this way loops with a few gaps in their stride-predictability or with widely varying iteration counts can be fully parallelized.
3 The Core Algorithm

This section describes in detail how Softspec parallelizes simple loops with known numbers of iterations whose bodies are straight-line code. This is the core of the Softspec algorithm. Section 4 explains how we extend the core algorithm to handle loop-carried dependences, branches in loop bodies, while loops, nested loops, and interprocedural parallelism.
To illustrate how Softspec works, consider the code in Figure 1. The loop in this procedure contains four memory accesses: two writes (a[i] and b[i+j]) and two reads (b[i] and *p). These accesses are stride-predictable with a stride of 0 for *p and strides equal to sizeof(double) for the others. If a and b are non-overlapping arrays with lengths at least 500 and j >= 500, there will be no inter-iteration dependences between the memory accesses and the entire loop will be parallelizable.
In order for a parallelizing compiler to parallelize this loop, it must prove statically that a and b are distinct arrays and that there will be no inter-iteration dependences between the memory accesses. Deducing this information at compile time requires sophisticated interprocedural analysis, or may be impossible if the memory addresses are dependent on program inputs. However, a compiler that makes use of runtime predicated parallelism can parallelize this loop by inserting a test that at runtime will deduce if there are memory dependences and only execute the parallel version of the loop if there are none. This example serves only to illustrate the core of the Softspec algorithm; the rest of the algorithm explained in Section 4 targets loops for which practical parallelism-detecting predicates do not exist.
Softspec parallelizes this loop by simply profiling the initial addresses and calculating the intersections of their strides. No complex compile-time analysis is required; all that needs to be done is instrument each memory access.
The Softspec algorithm replaces the loop with four loops: a profile loop, a detection loop, a speculation loop, and a recovery loop. The execution path through these loops is shown in Figure 2. The following sections describe this path in more detail. For further information on the core algorithm see [Dev99].
[Figure 2 timeline omitted: processors 0 through 3 executing the profile, detection, speculation, undo, and recovery loops over iterations i=0 to 499.]

Figure 2: The execution path of Softspec's parallelization of the sample loop in Figure 1. The loop contains 500 iterations, i=0 through i=499. The 27 iterations performed during detection is a typical empirical number for a small loop body. The top portion of the figure shows successful speculation. The bottom portion shows what would happen if a misprediction occurred when i=287.
/*** profile loop ***/
for (i=0; i<3; i++) {
  profile_address[i][0] = &a[i];
  a[i] = i;
  profile_address[i][1] = &b[i+j];
  profile_address[i][2] = &b[i];
  profile_address[i][3] = p;
  b[i+j] = b[i] - *p;
}

Figure 3: The profile loop created from the sequential loop in Figure 1.
3.1 Profiling and Parallelism Detection

The profile loop for the sample code of Figure 1 is shown in Figure 3. As can be seen, it runs the first three iterations of the original loop with instructions inserted to store the addresses of each memory access into data structures used by the runtime system. The outer index of the profile_address array corresponds to the iteration of the loop being profiled, and the inner index is used to number the memory accesses.

When the profile loop finishes, the runtime system calculates the stride of each memory access by simply taking the difference of addresses in consecutive iterations. If each memory access has a consistent stride (i.e., the strides between the first and second and between the second and third profiled iterations are the same), the runtime system then determines how many iterations can be parallelized before an inter-iteration dependence is encountered. If the strides are not consistent then no speculation is performed and the loop is run sequentially. This catches many accesses that are not stride-predictable early on, before any speculation needs to be undone. How Softspec handles accesses that are not executed in all three of the profiling iterations is discussed in Section 4.2.
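The stride calculation and consistency check over the three profiled iterations can be sketched as follows (profile_address mirrors the structure used in Figure 3; the function name and the fixed access count are ours):

```c
#include <stdbool.h>

#define PROFILE_ITERS 3
#define NUM_ACCESSES  4   /* four memory accesses in the Figure 1 loop */

/* Compute each access's stride from the three profiled iterations and
 * check consistency: the first-to-second and second-to-third address
 * differences must match.  Any mismatch means the access is not
 * stride-predictable, so the loop is run sequentially. */
bool compute_strides(char *profile_address[PROFILE_ITERS][NUM_ACCESSES],
                     long stride[NUM_ACCESSES])
{
    for (int a = 0; a < NUM_ACCESSES; a++) {
        long d1 = profile_address[1][a] - profile_address[0][a];
        long d2 = profile_address[2][a] - profile_address[1][a];
        if (d1 != d2)
            return false;   /* inconsistent stride: no speculation */
        stride[a] = d1;
    }
    return true;
}
```

For the loop in Figure 1, this would yield strides of sizeof(double) for the first three accesses and 0 for *p.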
A second thread performs the stride calculations and parallelism detection while the original thread continues sequential execution of the loop, as shown in Figure 2. This enables forward progress on the loop to be made while the detection calculations are performed (the calculations can take time equal to tens of iterations of loops with small bodies).
To identify whether parallelism exists, we examine the memory accesses pairwise to determine how many iterations may be executed before their addresses become equal. For a pair of addresses with values in the first iteration a0 and a1 and with strides s0 and s1, we need to find the minimum values of integers i and j such that a0 + i*s0 = a1 + j*s1. Rewriting the equation as i*s0 - j*s1 = a1 - a0, a solution exists if and only if the greatest common divisor (gcd) of s0 and s1 divides a1 - a0. The solution can be obtained as a singly parameterized set using Euclid's gcd algorithm [AHU74], from which the largest number of parallelizable iterations can be calculated.

[Figure 4 plot omitted: two profiled accesses extrapolated as lines over iterations 0 to 1000, with the parallelizable iterations divided among proc0 through proc3.]

Figure 4: Two memory accesses are compared to determine the number of iterations before they have an inter-iteration dependence. The profiled addresses are shown as dots; their strides are extrapolated to obtain two lines. Inter-iteration dependences may exist starting at the point where one line enters the other's address space. In this case the first access is to a 1000-element array located in memory adjacent to the second access. Thus, the two overlap at 1000 iterations. The parallelizable iterations are divided among four processors as illustrated on the right.
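The solvability test described above can be sketched directly (function names are ours; the full algorithm would go on to parameterize the solution set to find the earliest colliding iteration):

```c
#include <stdbool.h>

/* Euclid's algorithm for the greatest common divisor. */
static long gcd(long a, long b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

/* The equation i*s0 - j*s1 = a1 - a0 has integer solutions iff
 * gcd(s0, s1) divides a1 - a0. */
bool dependence_possible(long a0, long s0, long a1, long s1)
{
    long g = gcd(s0, s1);
    if (g == 0)                 /* both strides zero: fixed addresses */
        return a0 == a1;
    return (a1 - a0) % g == 0;
}
```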
We found that nearly 90% of the parallelism detection execution time was spent calculating gcd's. To avoid this calculation our prototype Softspec implementation uses an approximation algorithm that calculates a conservative solution but is much more efficient. In practice we have never found a case where the conservative algorithm reports less parallelism than the exact gcd algorithm.
This approximation algorithm views each address as a line defined by the initial address value a_i and the stride s_i: y = s_i * x + a_i. For a given pair of addresses, the algorithm extrapolates each profiled address into a line and determines the minimum x value in the first quadrant at which the two lines share y values. This x value is the first iteration at which an inter-iteration dependence can occur. Figure 4 gives an example of this process. The first inter-iteration dependence may actually occur later than the x value computed in this manner, since treating discrete address values as a continuous line considers addresses to overlap that may never actually line up due to equal strides but different offsets. For this reason we add a heuristic case to the approximation algorithm: if the pair of addresses have the same stride and either the same initial value or different initial values modulo the stride, then they will never overlap.
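A sketch of this conservative test, including the equal-stride heuristic, together with the pairwise minimum described in Section 2 (names are ours; strides are assumed non-negative for brevity):

```c
#include <limits.h>

/* Conservative first iteration at which two strided accesses can touch
 * the same address, treating each as the line y = s*x + a.  LONG_MAX
 * means "never".  Sketch: assumes non-negative strides. */
static long first_overlap(long a0, long s0, long a1, long s1)
{
    long t, d;
    if (a0 > a1) { t = a0; a0 = a1; a1 = t; t = s0; s0 = s1; s1 = t; }
    d = a1 - a0;
    if (s0 == s1) {
        /* heuristic: equal strides collide across iterations only if
         * the offsets differ by a nonzero multiple of the stride */
        if (s0 == 0) return d == 0 ? 0 : LONG_MAX;
        return (d != 0 && d % s0 == 0) ? d / s0 : LONG_MAX;
    }
    if (d == 0) return 0;             /* same start, different strides */
    if (s0 == 0) return LONG_MAX;     /* lower access never moves up */
    return (d + s0 - 1) / s0;         /* lower line's region reaches a1 */
}

/* Minimum over all write-(read or write) pairs gives the number of
 * iterations that can safely run in parallel. */
long parallelizable_iterations(const long wa[], const long ws[], int W,
                               const long ra[], const long rs[], int R)
{
    long min = LONG_MAX, n;
    for (int w = 0; w < W; w++) {
        for (int r = 0; r < R; r++)
            if ((n = first_overlap(wa[w], ws[w], ra[r], rs[r])) < min)
                min = n;
        for (int v = w + 1; v < W; v++)
            if ((n = first_overlap(wa[w], ws[w], wa[v], ws[v])) < min)
                min = n;
    }
    return min;
}
```

With the Figure 4 scenario (a write of stride 8 sweeping a 1000-element region adjacent to a read at the same stride), this reports 1000 parallelizable iterations.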
The cost of the detection algorithm becomes negligible as the amount of computation in the target loop increases. An adaptive detection algorithm that runs the approximation algorithm for small loops and the full gcd calculations for large loops could be used.
3.2 Speculation and Recovery from Misprediction

If Softspec detects enough parallelism to warrant speculating, the speculative version of the loop is executed by each processor. The speculative version of the sample loop from Figure 1 is shown in Figure 5. It uses the predicted addresses instead of the actual addresses so that only one misprediction conditional at the end of the loop is needed, instead of a conditional before every memory access. The predicted addresses are initialized before the loop by simply adding
/*** speculation loop ***/
for (i=start[thread]; i<stop[thread]; i++) {
  ncancelflag &= (predict0 == &a[i]);
  *undo_buffer = *predict0;
  undo_buffer++;
  *predict0 = i;
  ncancelflag &= (predict1 == &b[i+j]);
  ncancelflag &= (predict2 == &b[i]);
  ncancelflag &= (predict3 == p);
  *undo_buffer = *predict1;
  undo_buffer++;
  *predict1 = *predict2 - *predict3;
  if (!ncancelflag) { /* fail */ }
  predict0 += delta0;
  predict1 += delta1;
  predict2 += delta2;
  predict3 += delta3;
}

Figure 5: The speculation loop created from the sequential loop in Figure 1.
the stride multiplied by the starting iteration number for the thread to the initial address.
Code inserted into the loop stores values to be overwritten in the undo buffer and increments each predicted address by its stride. Each processor has its own undo buffer that is allocated prior to speculation (it is made large enough to hold all writes in the loop). Additional code checks that the predicted address matches the actual address; if not, the speculation fails: all writes in iterations during and after the misprediction are undone and those iterations are re-executed sequentially by the recovery loop. The recovery loop is identical to the original sequential loop except that it begins execution on the iteration of the misprediction. This process is illustrated at the bottom of Figure 2. Note that all operations are local to each processor except for the success or failure of the speculation at the end of the loop; no other global communication is required.
Undoing writes involves restoring the values from the undo buffer to the predicted addresses, from the most recently executed iteration backward to the earliest executed iteration. Since the predicted addresses are stride-predictable, the undo buffer does not need to store any memory addresses, only values. The processors can undo in parallel since there are no overlaps in the predicted addresses.

Profile      1-4    1-4    0
Detect       0      0      2-4
Speculate    6-11   10-13  2-5
Recovery     0      0      0

Table 1: Overhead in terms of machine instructions for the four loops that Softspec creates when parallelizing a target loop.
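The restore step can be sketched as follows for a loop with a single eight-byte write per iteration (the function name and the double-typed buffer are ours):

```c
/* Restore overwritten values from this processor's undo buffer, most
 * recent iteration first.  Because the written addresses are
 * stride-predictable they are recomputed from the first address and
 * the stride, so the buffer holds only values, no addresses. */
void undo_speculation(const double *undo_buffer, long n_iters,
                      double *first_addr, long stride_elems)
{
    for (long k = n_iters - 1; k >= 0; k--)
        first_addr[k * stride_elems] = undo_buffer[k];
}
```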
If many execution instances of a loop experience early speculation failures, the runtime system disables speculation of that loop and executes it sequentially instead to avoid overhead costs.
3.3 Runtime Overhead
The Softspec algorithm transforms the original sequential loop into four loops. One of these, the recovery loop, is essentially identical to the original loop. The other three loops have extra instructions added that lead to runtime overhead when compared to the original sequential loop.

The profile loop stores the address of each load and store into a shared data structure. The detection loop, which sequentially makes forward progress on the loop while the runtime system detects the amount of parallelism in the loop, is equivalent to the original loop with a check every iteration to see if the parallelism detection has finished. Finally, the speculation loop contains an increment of the predicted address and a check that the predicted address equals the actual address for all memory reads and writes, a store to the write buffer and increment of the write buffer pointer for each write, and additionally a single speculation failure branch. Table 1 summarizes these costs in units of machine instructions for typical compilations.
Note that since speculation uses predicted memory addresses, nested array references (such as A[B[i]]) or multiple levels of pointer indirection (such as **p) do not need to wait for the first memory reference to resolve before accessing the second. Breaking this dependence allows for more instruction-level parallelism and reordering.
Additional overhead is required to synchronize the threads. A barrier is required at the end of the parallel execution to determine if any threads encountered mispredictions. This barrier may cost hundreds or even thousands of cycles on a shared-memory multiprocessor. Fortunately it is the only global synchronization needed by Softspec.

Very simple extensions to the underlying hardware, such as adding a speculative bit to caches that eliminates the need for an undo buffer, could reduce Softspec's overhead significantly.
4 Extending the Core Algorithm

The core of the Softspec algorithm described in Section 3 only handles loops with known numbers of iterations whose bodies are straight-line code, i.e., no procedure calls or branches inside of loops, and no while loops or nested loops. This section explains how Softspec extends the core algorithm to handle loop bodies containing loop-carried dependences and nonlinear control flow, while loops, nested loops, and interprocedural parallelism. We have incorporated all but the last of these into our prototype Softspec implementation. We plan to implement interprocedural parallelism in the near future; we describe here how to do so without using any interprocedural analysis. We then discuss using compiler information to reduce runtime overhead, and compare the runtime overhead of these extensions to that of the core algorithm.
4.1 Loops Containing Loop-Carried Dependences

Memory addresses are not the only values in a loop that are often stride-predictable. Scalar variables with loop-carried dependences exhibiting stride-predictability range from simple induction variables, which are usually analyzable at compile time, to pointers that are difficult to analyze at compile time. A pointer used to traverse a linked list, when the list is laid out contiguously in memory, is a good example of a stride-predictable loop-carried dependence.

Loop-carried dependences are treated in a similar manner to memory addresses. Their values are profiled in the first three iterations of the loop just like memory addresses, and a stride is calculated that is predicted to hold for the rest of the loop. If the stride does hold, the loop is parallelizable; no inter-iteration dependence analysis is needed since the value of the variable can be computed independently for any iteration. This is in contrast to memory addresses, for which merely being stride-predictable is not sufficient for parallelism to exist, since the addresses and not the values at those addresses are being predicted, and the values may depend on each other.
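For instance, a contiguously allocated linked list makes the traversal pointer stride-predictable, so each processor can compute its starting pointer directly (a sketch; names are ours):

```c
#include <stddef.h>

struct node { int data; struct node *next; };

/* If profiling shows the list pointer advancing by a constant number
 * of bytes per iteration, the pointer for iteration i is computable
 * without executing iterations 0..i-1 -- which is what lets each
 * processor begin speculation in the middle of the loop. */
struct node *predict_list_pos(struct node *head, long stride_bytes, long i)
{
    return (struct node *)((char *)head + i * stride_bytes);
}
```

During speculation the predicted value is used as the "actual" value for the iteration, and the value produced by the loop body is checked against the prediction for the next iteration.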
During speculation the "actual" value of the loop-carried dependence is the predicted value for the current iteration. At the end of the iteration, after this actual value is modified in the loop body, its resulting value is compared to the predicted value for the next iteration, and if a misprediction occurs the speculation aborts. This is different from a memory address, whose actual value is computed independently of the predicted value.

The undo mechanism needs no extra information to be able to restore loop-carried dependences to the values they held prior to a misprediction. Each value can be computed using the profiling data for the iteration prior to the speculation failure.
If a loop-carried dependence is used as a pointer, care must be taken to avoid a misprediction causing a memory access outside of the program's address space. The predicted values of the pointer for the initial and final iterations of the loop need to be checked to ensure that they are within the application's address space. If they are, then so are all values at intermediate iterations. Within the address space, any misprediction will be detected and all erroneous writes will be restored to their original values from the undo buffer.
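This endpoint check can be sketched as follows (the region bounds lo and hi stand in for whatever valid-address information the runtime system has; names are ours):

```c
#include <stdbool.h>
#include <stdint.h>

/* Verify that the first and last predicted pointer values fall inside
 * a known-valid region.  Since the prediction is linear, every
 * intermediate iteration's value is then in range as well, so two
 * comparisons cover the whole loop. */
bool pointer_range_safe(uintptr_t first, long stride, long n_iters,
                        uintptr_t lo, uintptr_t hi)
{
    uintptr_t last = first + (uintptr_t)(stride * (n_iters - 1));
    return first >= lo && first < hi && last >= lo && last < hi;
}
```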
4.2 Loops Containing Nonlinear Control Flow

The core algorithm assumed that each memory access would be executed on each iteration. Branches inside the loop body often fail to meet this assumption. This can cause addresses to be encountered during speculation that were not executed during profiling, for which no predicted stride is available and no inter-iteration dependence analysis has been performed. Speculation must also abort if a branch is taken that leads out of the loop.
If a particular access is not executed in any of the three profiling iterations, that access must be distinguishable from fully profiled accesses by the speculation loop. A sentinel in the value of the predicted address is used for this purpose. Unfortunately, the speculation loop must branch immediately based on its check for the sentinel because it cannot execute instructions using an unknown address. Such a speculation failure branch must be placed before every address use that could potentially be unprofiled. This branch causes two problems. First, it degrades performance. This can be avoided in some cases by generating a version of the speculation loop without the branches and using that version when all addresses are executed in the profile loop. Second, it requires the ability to undo a partially finished iteration. This is easily solved: when a misprediction occurs the address it occurred at is recorded, and the undo mechanism uses that information to undo just the portion of the iteration prior to the misprediction.
If an access is executed in only one of the three profiling iterations, a stride cannot be calculated. However, a stride can be guessed based on the strides of other addresses in the loop or possibly on stored strides from previous instances of this loop. Our prototype implementation does not currently guess strides and does not attempt to speculate a loop containing an address that was profiled only once.

If an access is executed in two of the three profiling iterations, a stride can be calculated and speculated just like a fully profiled address. Alternatively, a more dynamic approach could decide to execute further profiling iterations to verify the stride.
Detecting unprofiled loop-carried dependences is done in the same manner as for memory addresses, but since loop-carried dependences can take on any value, a sentinel cannot be used. Instead, the profiling data structures are extended.
4.3 While Loops

While loops with un-analyzable exits are difficult to parallelize since the number of iterations in the loop is not known at compile time. In fact, the number of iterations is not known until the final iteration of the loop. The only way to parallelize such a loop is speculatively. The number of iterations must be guessed and overshooting (guessing more iterations than actually exist) must be handled.
The core algorithm is easily extended to handle while loops. Guessing the number of iterations is accomplished in one of several ways. It can be guessed outright. The loop can be run sequentially the first time it is encountered and the number of iterations from that execution can be used as the guess for the next execution. Or, repeated speculation can be used: a small number of iterations is repeatedly speculated until the loop exit condition is reached. Repeated speculation is a good strategy for while loops that may have a few stride-predictability gaps but for the most part are stride-predictable, and for while loops whose iteration count varies widely. In any case, later speculation of the loop is made adaptive by keeping a running average of the number of iterations in previous executions of the loop and using it to speculate future instances of the loop.

The while loop is transformed into a for loop that runs for the guessed number of iterations and breaks if the while loop exit condition is met before the for loop finishes. The profile, detection, and speculation loops are generated from this for loop. The recovery loop is the original while loop.
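The transformation can be sketched with a toy un-analyzable loop (the exit condition, body, and running-average update here are stand-ins of ours):

```c
/* Toy stand-in for a while loop whose trip count is unknown until the
 * final iteration: advance x until it reaches a limit. */
static long x;
static int  exit_condition(void) { return x >= 100; }
static void loop_body(void)      { x += 7; }

/* The while loop becomes a for loop over the guessed trip count that
 * breaks if the exit condition is reached early; the profile,
 * detection, and speculation loops are generated from this form. */
long run_with_guess(long guess)
{
    long i;
    for (i = 0; i < guess; i++) {
        if (exit_condition())
            break;              /* exit reached before the guess */
        loop_body();
    }
    return i;                   /* actual iteration count executed */
}

/* Adaptive guessing: fold the actual count into a running average
 * used to speculate future instances of the loop. */
long update_guess(long avg, long actual) { return (avg + actual) / 2; }
```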
If the while condition is overshot, i.e., too many iterations are executed, the extra iterations are undone and the running average of the number of iterations is updated. If the speculation ends successfully and the while condition has not been reached, speculation can begin again on the remainder of the loop, or else the recovery loop finishes executing the loop. Address values in the speculation loop must be checked to ensure they are within the address space of the program (predicted values in iterations beyond the actual loop end could be outside of the address space). This can be done by checking the initial and final predicted values prior to the speculation loop.
4.4 Nested Loops

The core algorithm only handles innermost loops. Even when parallelism exists in inner loops, greater performance improvements can be achieved by parallelizing outer loops, especially when executing on machines with high data sharing costs between processors' caches, where a higher computation to overhead ratio is required. However, targeting outer loops complicates parallelism detection because there are many more opportunities for inter-iteration dependences. Softspec extends the core algorithm to handle outer loops without unduly increasing its complexity.

In a nested loop, the inner loops can be treated in two ways. The inner loops' memory accesses can be required to be stride-predictable (the linear method), or they can be required to fall within a certain range (the range method). The two methods are handled similarly by Softspec. We implemented both methods and found that, as expected, the range method is less efficient than the linear method. Although the range method places fewer restrictions on which loops are parallelizable, we never encountered a loop for which the range method applied but the linear method did not. Our prototype implementation uses the linear method.
The profiling step for nested loops runs the outer loop for three iterations and the inner loops for their full number of iterations. The linear method only gathers profiling data on the first three iterations of the inner loops, while the range method keeps track of the maximum and minimum values of each address accessed in each inner loop. For the linear method, all memory accesses in a loop have a stride that holds within the outer loop; a memory access in an inner loop has a separate stride for that loop. Both strides are necessary to predict addresses that are accessed in both the inner and outer loop. To handle addresses that are accessed in more than two layers of loops, additional separate strides must be calculated.

Note that loop-carried dependences in nested loops can be treated just like in non-nested loops. Only a single stride is required for each loop-carried dependence since it does not matter what happens to its value inside inner loops; all that matters is that its value at the end of each iteration of the outer loop is stride-predictable. That is the only value that is profiled for both the linear and range methods.
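Under the linear method, the predicted address of an access in a doubly nested loop combines one stride per loop level; a sketch (names are ours):

```c
#include <stddef.h>

/* Predicted address for outer iteration i and inner iteration j:
 * base + i*outer_stride + j*inner_stride.  For a row-major 2-D array
 * of doubles, the outer stride is the row size in bytes and the inner
 * stride is sizeof(double). */
char *predict_nested(char *base, long i, long outer_stride,
                     long j, long inner_stride)
{
    return base + i * outer_stride + j * inner_stride;
}
```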
In the detection step, the linear method calculates inner and outer strides and from them calculates a range of values for each memory access based on the number of iterations in the inner loops, while the range method calculates each access's range directly from the profiled minimum and maximum values. Each range has a stride, just like a single memory access in the outer iteration has a stride. The detection algorithm extrapolates and compares the ranges in the same manner as the core algorithm compares lines to calculate the number of parallelizable iterations.
Speculation and undoing for the linear method are similar to non-nested loops. For the range method, since its addresses are not assumed to be stride-predictable, either the undo buffer must store the address of each write whose old value it stores, or the entire range of values overwritten in an inner loop must be copied to the undo buffer at once. Also, the range method uses actual addresses instead of predicted addresses in its inner loops, and so it must check before each memory access that the actual address is within the predicted range. These two factors combine to make the range method less efficient than the linear method.

The greatest challenge for Softspec in parallelizing nested loops is determining which loop to parallelize in multiply-nested loops. Compiler analysis can sometimes find the outermost parallelizable loop; a dynamic solution attempts parallelization of the outermost loop and if it fails tries each nested inner loop in turn.
4.5 Interprocedural Parallelism

In a purely runtime implementation, procedure calls are not an obstacle to parallelization with Softspec since the program is viewed at the execution trace level and procedures are essentially inlined.

Extending a compiler-based Softspec implementation to handle procedure calls in loop bodies can be done without any interprocedural analysis. Separate versions of each procedure for profiling and speculating can be generated independently. The memory references are then dynamically numbered. Libraries must have their procedures made "Softspec-ready". Runtime analysis will view the loop from the level of its trace, treating the procedures as though they were inlined, and not need any interprocedural information. A flag must be used to prevent nested speculation from loops inside of procedure calls.
This is markedly simpler than any other treatment of procedure calls in loops. Parallelizing compilers must perform complex interprocedural analysis to determine if procedure calls may be parallelized. Softspec enables a very simple mechanism for interprocedural parallelism. We have not yet incorporated this into our prototype implementation of Softspec, but plan to do so in the near future.
4.6 Using Advanced Compiler Analysis

For many accesses in a loop, the compiler may be able to prove that the accesses have a given pattern. For example, the access pattern of a local induction variable whose address is not taken can be easily discerned. These variables do not require any profiling or runtime checking. Also, the loop's dependence analysis can be partially evaluated at compile time.
The overhead of the undo buffer can be reduced or eliminated using compile-time information. If a memory reference is read before it is written, the read value can be directly written to the undo buffer, eliminating a load instruction. Furthermore, if the compiler can deduce how to re-create the original value of a modified memory access, no undo information for that access needs to be stored. Combining compile-time information with runtime checks in this way can also make the compiler much more robust: a loop with a few unanalyzable accesses will still be parallelizable with a little extra runtime overhead.
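To make the undo buffer concrete, here is a minimal sketch (the names and the fixed capacity are our illustration, not the prototype's implementation): each speculative store logs the original value before overwriting it, and a misspeculation rolls the log back in reverse order.

```c
#include <stddef.h>

#define UNDO_MAX 1024  /* illustrative fixed capacity */

typedef struct {
    double *addr;  /* location that was speculatively overwritten */
    double  old;   /* value it held before the store */
} undo_entry;

static undo_entry undo_buf[UNDO_MAX];
static size_t undo_len = 0;

/* Log the original value, then perform the speculative store. */
static void spec_store(double *addr, double val) {
    undo_buf[undo_len].addr = addr;
    undo_buf[undo_len].old  = *addr;  /* this load is what compile-time
                                         information can eliminate */
    undo_len++;
    *addr = val;
}

/* On misspeculation, restore logged values in reverse order. */
static void spec_rollback(void) {
    while (undo_len > 0) {
        undo_len--;
        *undo_buf[undo_len].addr = undo_buf[undo_len].old;
    }
}
```

The optimizations described above either remove the load inside `spec_store` (when the old value is already in a register) or remove the log entry entirely (when the old value is re-creatable).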
4.7 Runtime Overhead of Extended Algorithm
The extensions to the core algorithm described in this section add additional runtime overhead to that analyzed in Section 3.3. Loop-carried dependences do not affect the overhead much, since even in a nested loop they are only checked at the end of the outer loop. Nonlinear control flow, however, adds failure branches to the middle of the speculation loop, which can be relatively expensive. The only extra overheads of while loops come from overshooting the number of iterations and the added overhead of repeated speculation. Finally, nested loops using the linear method do not have appreciably more runtime costs than non-nested loops, while the range method adds costs in the form of failure branches and copying large regions to the undo buffer at once.
5 Experimental Results
We have developed a prototype implementation of the Softspec algorithm consisting of a compiler and accompanying runtime system. The prototype incorporates all of the algorithm components described in Section 4 except for interprocedural parallelism. The compiler transforms target sequential code into speculatively parallel code, which is then linked with the runtime system. The compiler was written in SUIF [AALT95] and the runtime system was written in C. The pthreads library is used to create a separate thread for each processor on the machine; these threads are created at program startup and are used for all speculation in the program.
This section presents the results of using our prototype to parallelize various types of applications. The target applications were run on a Digital AlphaServer 8400, which is a bus-based shared-memory multiprocessor containing eight Digital Alpha processors. We expect even better performance improvements on architectures with lower costs of data sharing between processors' caches, such as single-chip multiprocessors [HNO97, SM98].
Note that there is nothing about the Softspec technique that requires a source-code compiler. It could operate on program binaries, and could execute completely at runtime in a dynamic optimization framework. If the speedups presented here can be achieved in such environments, they will be even more impressive. The simplicity of the Softspec approach lends it versatility.
5.1 Dense Matrix Applications
We first examine how Softspec performs on dense matrix applications, which are typically highly parallelizable by current compilers. Obviously we do not expect Softspec to produce greater speedups than automatically parallelizing compilers, because of our runtime overhead. This section gives an idea of how Softspec compares to automatically parallelizing compilers.
Consider the matrix multiplication code in Figure 6. The middle loop is parallelizable, and Softspec parallelizes it successfully, with speedup shown in Figure 7.
for (i = 0; i < L; i++) {
  for (j = 0; j < M; j++) {
    for (k = 0; k < N; k++) {
      c[i][k] += a[i][j] * b[j][k];
    }
  }
}

Figure 6: Dense matrix multiplication loop nest.
Figure 7: Speedup obtained by Softspec for multiplication of a dense 1000 by 1000 square matrix.
When executed on one processor, speculation has nearly a two-times slowdown. The sequential program's loop body is very simple and more easily optimized by the compiler than the parallel version of the code generated by Softspec. The sparse matrix code in Section 5.4 is more difficult to optimize and thus exhibits higher speedups and much lower slowdown on one processor. On two processors the dense matrix multiplication breaks even, and its speedup increases linearly by a little more than 0.5 with each processor added.
Figure 8 gives Softspec's speedup when applied to the SPEC95 FP benchmark swim. The results are similar to the matrix multiplication results.
5.2 While Loops with Un-analyzable Exits
Automatically parallelizing compilers naturally cannot parallelize loops whose termination conditions are un-analyzable until the last iteration at runtime. Softspec's dynamic techniques can parallelize such loops. Our prototype implementation of Softspec executes a while loop sequentially the first time it is encountered and uses the number of iterations in that first execution as the number of iterations to speculate on. The iteration count is updated as a running average to adapt to varying-length while loops.
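The adaptive trip-count estimate can be sketched as follows (the equal weighting below is our assumption; the paper does not specify the exact averaging rule):

```c
/* Running-average estimate of a while loop's trip count, used as the
 * number of iterations to speculate on the loop's next invocation.
 * The 50/50 weighting is an assumed choice, not taken from the
 * prototype implementation. */
static double trip_estimate = 0.0;

static int next_speculation_count(int observed_iters) {
    if (trip_estimate == 0.0)
        trip_estimate = observed_iters;  /* first, sequential execution */
    else
        trip_estimate = 0.5 * trip_estimate + 0.5 * observed_iters;
    return (int)trip_estimate;
}
```

An exponential average like this adapts quickly when successive invocations of the loop have very different lengths, at the cost of some overshoot when the length is stable.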
Consider the example code in Figure 9, a procedure for performing convolution in which one array's end is marked with a sentinel. In a program that calls this procedure multiple times, Softspec obtains the speedup shown in Figure 10.
Figure 8: Speedup for the SPEC95 FP benchmark swim.
void convolve(double *A, int lengthA, double *B,
              double sentinel, double *C) {
  int i, j;
  i = lengthA;
  while (B[i] != sentinel) {
    double tmp = 0;
    for (j = 0; j < lengthA; j++) {
      tmp += A[j] * B[i-j];
    }
    C[i] = tmp;
    i++;
  }
}

Figure 9: Example code containing a loop whose exit condition is not analyzable at compile time. The code performs a convolution in which the end of array B is marked with a sentinel.
Figure 10: Speedup for a program that makes 100 calls to the convolve procedure of Figure 9, each with lengthA=1000 and length of B=20000.
void compute(node *link) {
  while (link != NULL) {
    /* do work on link */
    ...
    /* move to next link */
    link = link->next;
  }
}

Figure 11: Code traversing a linked list and performing some computation at each node.
5.3 Linked List Traversal
Code that traverses a linked list can be difficult to impossible to parallelize. If the nodes of the list are all laid out contiguously, then the traversal is amenable to parallelization by Softspec. In a system with a garbage collector, the memory layout of the list can be controlled. The garbage collector can cooperate with Softspec by keeping the list laid out contiguously as much as possible. A Java virtual machine, for example, could implement Softspec dynamically and use its control of memory layout to parallelize many loops otherwise not parallelizable.
Without control over the memory layout, a repeated speculation strategy can be used to obtain speedup on the regions of the list that happen to be contiguous. We investigated performance on lists with varying memory layouts. In practice, lists often have clusters of noncontiguous nodes. To model this, we allocate a 20,000-node list such that each sequence of 50 consecutive nodes has a certain chance of either being contiguous or having frequent gaps between the nodes. We parallelized a program that traverses this list and performs some computation at each node. The code framework is shown in Figure 11.
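The repeated speculation strategy relies on the fact that a run of contiguously allocated nodes yields a constant stride between successive link pointers. A minimal check for this property (our illustration, using a hypothetical node type) might be:

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    double data;
} node;

/* Returns 1 if the first run_len nodes starting at head are laid out
 * at a constant byte stride (e.g., allocated contiguously), which is
 * the condition under which their addresses are stride-predictable. */
static int run_is_stride_predictable(const node *head, int run_len) {
    if (head == NULL || head->next == NULL || run_len < 2)
        return 0;
    ptrdiff_t stride = (const char *)head->next - (const char *)head;
    const node *p = head->next;
    for (int i = 2; i < run_len && p->next != NULL; i++) {
        if ((const char *)p->next - (const char *)p != stride)
            return 0;  /* gap in memory layout: prediction would fail */
        p = p->next;
    }
    return 1;
}
```

In Softspec the equivalent check happens implicitly: speculation on predicted node addresses simply fails and rolls back when a run contains a gap.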
Performance results for this program are given for different chances of each sequence of the list containing gaps, and for different numbers of iterations to speculate (repeatedly). Results for a 0% chance of each sequence having gaps (a completely contiguous list) are shown in Figure 12, for a 1% chance in Figure 13, and for a 5% chance in Figure 14. Each figure shows the speedup for five different numbers of iterations speculated: "All" performs speculation of the entire loop and does not try again after reaching a misprediction, while the other four, "100", "250", "400", and "800", repeatedly speculate that many iterations at a time until the loop finishes, regardless of how many speculations fail due to mispredictions.
5.4 Sparse Matrix Applications
In this section we give an example of a sparse matrix application that is for some matrices stride-predictable and contains parallelism, but is not parallelizable by current compilers. The FORTRAN code in Figure 15 is the core code for matrix multiplication of sparse matrices stored in compressed row storage format. The many arrays and multiple levels of indirection make analysis of this loop too difficult for current parallelizing compilers, but Softspec's parallelization scheme is applicable.
When the current row in each of the two matrices being multiplied contains contiguous regions of non-zero values, the memory references in each of the two branches of the
Figure 12: Speedup for the sample program in Figure 11 operating on a completely contiguous list. Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
if are stride-predictable. Thus when many consecutive iterations of the inner loop take the same branch, the inner loop is parallelizable by Softspec. The then branch initializes new non-zero elements in the product matrix while the else branch adds to an existing element. The row-by-row multiplication ends up often entering the same branch on consecutive iterations of the inner loop. The frequency and duration of parallelizable iteration sequences are dependent on the input matrices. Sparse matrix data sets such as the Non-Hermitian Eigenvalue Problem Collection [NHE] often contain matrices with non-zero values in a stripe down the diagonal or in blocks down the diagonal. Such patterns lead to parallelizable iteration sequences of lengths equal to the width of the stripes or blocks.
Figure 16 shows the speedup obtained by Softspec when multiplying block diagonal sparse matrices with different block sizes. Six different sparse matrices with non-zero elements in square blocks down the diagonal were multiplied by themselves. The block widths for the six matrices were 25, 50, 100, 200, 300, and 400, respectively. The number of blocks does not affect the speedup much, as it only changes the total runtime and not the length of the parallelizable sequences.
No speedup (or slowdown) occurs for parallelizable sequences of 25 or 50 iterations. For 100 iterations moderate speedup is achieved, but there is not enough work for the speedup to scale beyond five or six processors. Iteration counts of 200, 300, and 400 all achieve scalable speedup.
The speedups obtained on a few hundred iterations of the inner loop of the sparse matrix multiplication are greater than those for the nested loop of the dense matrix multiplication, even though the latter has much more work in each parallelized iteration. The reason stems from the ability of the compiler to more aggressively optimize the dense matrix loop body versus the sparse matrix loop body. The multiple array indirections inhibit optimizations that can be applied to the simpler dense matrix code. This results in less of a disparity in speed between the sequential code and the
Figure 13: Speedup for the sample program in Figure 11 operating on a list with each sequence of 50 nodes having a 1% chance of containing many memory gaps (a 50% chance of a gap between each node). Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
Figure 14: Speedup for the sample program in Figure 11 operating on a list with each sequence of 50 nodes having a 5% chance of containing many memory gaps (a 50% chance of a gap between each node). Softspec either used repeated speculation of a certain number of iterations (100, 250, 400, 800) or speculated the entire loop ("All").
  do k = offb(ja(j)), offb(ja(j)+1)-1
    if (ptr(jb(k)) .eq. 0) then
      ptr(jb(k)) = i
      c(ptr(jb(k))) = a(j)*b(k)
      index(ptr(jb(k))) = jb(k)
      i = i + 1
    else
      c(ptr(jb(k))) = c(ptr(jb(k))) + a(j)*b(k)
    endif
  enddo
enddo

Figure 15: The core FORTRAN loop for multiplying sparse matrices a and b and storing the result in c. The sparse matrices are stored in compressed row storage format. The arrays offa, ja, offb, and jb are used to locate the columns and rows of elements in a and b.
Figure 16: Speedup for the sparse matrix multiplication code of Figure 15 operating on six different block diagonal matrices. The six matrices have blocks of widths 25, 50, 100, 200, 300, and 400, respectively.
parallelized code generated by Softspec.
6 Related Work
Over the last two decades, compiler techniques for automatically parallelizing sequential applications have advanced remarkably. Modern parallelizing compilers are very successful when targeting certain types of code, namely nested loops containing limited memory aliasing. However, these compilers are large systems that use complex interprocedural analyses: the Polaris compiler [BEH+94] contains over 170,000 lines of code, and the SUIF compiler [AALT95] contains over 150,000 lines of code.
Alias analysis algorithms [WL95, EGH94, RR98] can enable automatic parallelization in the presence of memory aliases. However, these alias analysis systems work only with small, self-contained programs and do not scale with program size.
Hardware-based speculative parallelization has been proposed to overcome the need for a compile-time proof that code is parallelizable. Candidate loops are speculatively executed in parallel while a complex hardware system observes all memory accesses in order to detect inter-iteration data dependences. When a dependence is detected, additional hardware mechanisms undo the speculative execution and the loop is re-executed sequentially. Some proposals extend existing multiprocessor hardware [HWO98, SM98] while others present completely new hardware structures [MBVS97]. Targeting procedural parallelism using hardware has also been proposed [OHL99]. Because speculative hardware schemes do not rely on program characteristics, they require a lot of work and global communication. Softspec takes advantage of memory access patterns to reduce the amount of work and communication needed to detect dependences, making it viable in software.
Fundamental research into program behavior has shown that both data and address values can be predicted by stride prediction and last-value prediction [SS97]. Stride-predictability of memory accesses in scientific applications has been successfully exploited to improve the cache behavior of these codes through compiler-inserted prefetching in uniprocessor and multiprocessor machines [FP91]. The stride-predictability of memory addresses has been used to perform speculative prefetching in out-of-order superscalars [GG97].
Software-based speculative parallelization schemes have mostly focused on determining the inter-iteration dependence patterns of a partially parallel loop in order to construct parallelization schedules. These schemes focus on an inspector-executor model [LZ93], in which an extracted inspector loop analyzes the memory accesses at runtime and constructs a schedule for the executor loop, which runs parts of the loop in parallel using synchronization.
Other software-based techniques speculatively run code in parallel and monitor all memory accesses using shadow arrays [RP95a]. When an inter-iteration data dependence is observed, the speculation is undone. This is similar to the hardware-based proposals, but with greater runtime overhead, and it targets only for loops that operate on arrays. Mechanisms for parallelizing certain types of while loops have been developed [RP95b], but they require the help of compiler analysis and code reorganization that could not practically be done at runtime.
7 Conclusions
We have presented a novel approach to parallelization of sequential code that does not require complex program analysis or additional hardware mechanisms. We have demonstrated the benefits of this approach on a variety of programs using a prototype implementation.
The trend towards object-oriented programming and shared library usage is making it increasingly difficult for an automatically parallelizing compiler to perform the whole-program analysis it needs to identify parallelism. We have shown how to exploit interprocedural parallelism with no need for interprocedural analysis, leading to a highly scalable solution that supports separate compilation. We plan to incorporate this into our prototype implementation of Softspec.
Softspec can be combined with existing automatically parallelizing compilers to increase the range of loops they can parallelize and to provide more robust performance.
We plan to simulate the performance of our implementation on proposed single-chip multiprocessors. A factor limiting the speedups obtainable on shared-memory multiprocessors is the high cost of data sharing between processors' caches. These costs are much lower on proposed single-chip multiprocessors. Although we showed speedup on existing hardware, simple hardware extensions, such as adding a speculative bit to caches that eliminates undo overhead, can drastically improve Softspec performance.
Softspec can be implemented entirely in software and can target program binaries. In the future we hope to integrate Softspec into a virtual machine environment with control of the garbage collector to enhance data structure stride-predictability. We would also like to investigate incorporating Softspec into a binary optimizer. This would enable parallelization of legacy code or third-party software that is only available in binary form and cannot be recompiled.
Softspec introduces a framework for parallelization based on prediction rather than proof of parallelism that enables parallelization of a large class of important applications that are currently unable to use automatic parallelization techniques. We believe that the Softspec framework will lead to many other techniques involving different optimization and prediction schemes.
References
[AALT95] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, Reading, MA, 1974.
[ATCL+98] Ali-Reza Adl-Tabatabai, Michal Cierniak, Guei-Yuan Lueh, Vishesh M. Parikh, and James M. Stichnoth. Fast, effective code generation in a just-in-time Java compiler. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998.
[BDB00] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: A transparent runtime optimization system. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, June 2000.
[BEH+94] William Blume, Rudolf Eigenmann, Jay Hoeflinger, David Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. Automatic detection of parallelism: A grand challenge for high-performance computing. IEEE Parallel and Distributed Technology, 2(3):37-47, Fall 1994. Also http://polaris.cs.uiuc.edu/polaris/.
[CHH+98] Anton Chernoff, Mark Herdeg, Ray Hookway, Chris Reeve, Norman Rubin, and Tony Tye. FX!32: A profile-directed binary translator. IEEE Micro, 18(2), March 1998.
[Dev99] Srikrishna Devabhaktuni. Softspec: Software-based speculative parallelism via stride prediction. Master's thesis, M.I.T., 1999. http://www.cag.lcs.mit.edu/~saman/student_thesis/Sri-99.ps.
[EGH94] M. Emami, R. Ghiya, and L. J. Hendren. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.
[FP91] J. W. C. Fu and J. H. Patel. Data prefetching in multiprocessor vector cache memories. In The 18th Annual International Symposium on Computer Architecture, May 1991.
[GG97] J. Gonzalez and A. Gonzalez. Speculative execution via address prediction and data prefetching. In 11th International Conference on Supercomputing, July 1997.
[HNO97] Lance Hammond, Basem Nayfeh, and Kunle Olukotun. A single-chip multiprocessor. IEEE Computer, 30(9), September 1997.
[HWO98] Lance Hammond, Mark Willey, and Kunle Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the Eighth ACM Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.
[Kla00] Alexander Klaiber. The technology behind Crusoe processors. Transmeta Corporation, January 2000. http://www.transmeta.com/crusoe/download/pdf/crusoetechwp.pdf.
[LZ93] Shun-Tak Leung and John Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.
[MBVS97] A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi. Dynamic speculation and synchronization of data dependencies. In 24th International Symposium on Computer Architecture, June 1997.
[NHE] Non-Hermitian eigenvalue problem collection. http://math.nist.gov/MatrixMarket/data/NEP.
[OHL99] J. Oplinger, D. Heine, and M. S. Lam. In search of speculative thread-level parallelism. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, October 1999.
[RP95a] Lawrence Rauchwerger and David Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.
[RP95b] Lawrence Rauchwerger and David Padua. Parallelizing while loops for multiprocessor systems. In Proceedings of the 9th International Parallel Processing Symposium, April 1995.
[RR98] R. Rugina and M. Rinard. Span: A shape and pointer analysis package. Technical Report LCS-TM-581, M.I.T., June 1998.
[SM98] J. G. Steffan and T. C. Mowry. The potential for using thread-level data speculation to facilitate automatic parallelization. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.
[SS97] Y. Sazeides and J. E. Smith. The predictability of data values. In Proceedings of the 30th Annual International Symposium on Microarchitecture, December 1997.
[WL95] R. P. Wilson and M. S. Lam. Efficient context-sensitive pointer analysis for C programs. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.