HAL Id: hal-00641350
https://hal.inria.fr/hal-00641350
Submitted on 15 Nov 2011
Hardware acceleration of sequential loops
Pierre Michaud
To cite this version:
Pierre Michaud. Hardware acceleration of sequential loops. [Research Report] RR-7802, INRIA. 2011.
⟨hal-00641350⟩
ISSN 0249-6399
ISRN INRIA/RR--7802--FR+ENG
RESEARCH
REPORT
N° 7802
November 2011
Project-Teams ALF
RESEARCH CENTRE
RENNES – BRETAGNE ATLANTIQUE
Campus universitaire de Beaulieu
35042 Rennes Cedex
Pierre Michaud
Project-Team ALF
Research Report n° 7802, November 2011, 19 pages
Abstract: The current trend in general-purpose microprocessors is to take advantage of Moore's law to increase the number of cores on the same chip. In a few technology generations, this will lead to chips with hundreds of superscalar cores. Obtaining high performance on these so-called many-cores will require parallelizing the applications. Nevertheless, it is unlikely that all the applications will take full advantage of the high number of cores. Hence it is important, along with increasing the number of cores, to increase sequential performance and dedicate a relatively large silicon area and power budget for that purpose. In this study, we consider the possibility of increasing sequential performance with a loop accelerator. The loop accelerator sits beside a conventional superscalar core and is specialized for executing dynamic loops, i.e., periodic sequences of dynamic instructions. Loops are detected and accelerated automatically, without help from the programmer or the compiler. The execution is migrated from the superscalar core to the loop accelerator when a dynamic loop is detected, and back to the superscalar core when a loop exit condition is encountered. We describe the proposed loop accelerator and we study its performance on the SPEC CPU2006 applications. We show that significant performance gains may be achieved on some applications.
1 Introduction
During the last decade, single-thread performance has increased at a slower pace than during previous decades, despite transistor miniaturization continuing as dictated by Moore's law. For several reasons, processor makers have preferred to use the silicon area offered by miniaturization for implementing multi-cores. General-purpose multi-cores have actually benefited the Internet by increasing the throughput of server farms. However, the client side of the Internet has not benefited as much from multi-cores as the server side. While multi-cores have been definitely useful for increasing throughput, their potential for decreasing latency is still largely underexploited. Most existing code is sequential, and this is likely to remain so for some time. Writing parallel applications with portable performance speedups is very difficult, even for the elite of programmers who understand performance and know parallel programming. The gap between potential and actual performance will get larger as the number of on-chip cores increases, because of limited off-chip memory bandwidth and because of Amdahl's law. As the number of on-chip cores keeps increasing with years, we will progressively enter the many-core era and eventually reach a point where implementing a heterogeneous many-core with one big and fast core and several smaller cores will provide a significant performance advantage. For instance, let us consider a homogeneous many-core with 200 identical normal cores on one hand, and on the other hand a heterogeneous many-core with only 100 normal cores and one monster core taking the silicon area and consuming the power equivalent to 100 normal cores. Even if the monster core is only twice as fast as a normal core despite being 100 times bigger, a simple application of Amdahl's law shows that the heterogeneous many-core will outperform the homogeneous many-core on programs whose sequential fraction exceeds 1% of all the instructions executed (footnote 1). This extreme example illustrates a situation that has been analyzed by Hill and Marty [6], one of their conclusions being that researchers should investigate methods of speeding sequential performance even if they appear locally inefficient.
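The 1% figure in the example above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: the function names are ours, and it assumes perfect parallelism and that the monster core does not take part in the parallel phase.

```python
def homogeneous_time(s, cores=200):
    """Normalized execution time on `cores` identical normal cores.

    `s` is the sequential fraction of the executed instructions.
    """
    return s + (1.0 - s) / cores


def heterogeneous_time(s, normal_cores=100, monster_speedup=2.0):
    """Normalized execution time with one monster core plus normal cores.

    Assumption: the sequential part runs on the monster core only, and the
    parallel part runs on the normal cores only.
    """
    return s / monster_speedup + (1.0 - s) / normal_cores


def break_even_fraction(cores=200, normal_cores=100, monster_speedup=2.0):
    """Sequential fraction at which both designs take the same time."""
    # Solve s + (1-s)/cores = s/monster_speedup + (1-s)/normal_cores for s.
    a = 1.0 - 1.0 / monster_speedup        # sequential time saved per unit of s
    b = 1.0 / normal_cores - 1.0 / cores   # parallel time lost per unit of (1-s)
    return b / (a + b)


if __name__ == "__main__":
    print(break_even_fraction())  # about 0.0099, i.e., roughly 1%
```

Under these assumptions the break-even sequential fraction is 1/101, so the heterogeneous design wins as soon as slightly less than 1% of the instructions are sequential.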
What this means is that researchers must find solutions for accelerating sequential execution for future many-cores even if these solutions look absurd for today's multi-cores. However, increasing sequential performance is not obvious, even with the silicon area and power budget of a monster core. Increasing the IPC (number of instructions executed per cycle) without impacting the clock cycle is not straightforward. Perhaps it will be possible to implement wider superscalar cores, e.g., 8-wide cores like the canceled Alpha EV8 processor. But to the best of our knowledge it has not been proved that the superscalar width could be increased beyond 8 without actually losing some performance. The problem comes from structures that do not scale well with the issue width and the instruction window size, like register renaming, dynamic instruction scheduling, register ports, operand bypass network, and memory disambiguation mechanisms. Many propositions for solving these problems have been published in the last 20 years, and some of them are probably worth revisiting in the context of many-cores. Nevertheless, it is also important to explore new approaches.

We propose in this study to explore a new approach to sequential performance: hardware acceleration of loops. Many programs spend a significant part of the execution in dynamic loops, i.e., periodic sequences of dynamic instructions. We explore in this study the possibility of implementing an accelerator specialized for dynamic loops and that does not require any help from the programmer or the compiler.

Our goal is not to explore the design space of loop accelerators, which we believe is huge. Instead, we tried to imagine an accelerator microarchitecture exploiting loop properties and avoiding as much as possible the usual superscalar bottlenecks. We have simulated the proposed loop accelerator and we show that it can potentially deliver a high sustained IPC. Actually, we believe that the microarchitecture we propose is scalable and can be pipelined at a clock frequency at least as high as the superscalar core frequency.
This work makes the following contributions: (1) We propose a hardware mechanism for detecting dynamic loops; (2) We provide a characterization of the dynamic loop behavior of SPEC CPU2006 benchmarks and we show that a significant fraction of the dynamic loops have a large loop body consisting of several tens of instructions; (3) We propose a loop accelerator that can accelerate most dynamic loops with a small or large body and that can issue simultaneously up to 32 µops out of program order from a window of several thousands of µops, with a local hardware complexity not greater than that of a conventional superscalar core; (4) We propose solutions for memory disambiguation in a window of several thousands of µops, exploiting dynamic loop properties; (5) We show that a significant fraction of dynamic µops are actually redundant and can be removed from the dynamic loops.

This paper is organized as follows. Section 2 discusses prior work. We describe our simulation set-up in Section 3. Section 4 describes our loop detector and provides statistics about dynamic loops in the SPEC CPU2006 applications. Section 5 gives a detailed description of the proposed LA and provides a performance evaluation. Finally, Section 6 concludes this study and gives some directions for future work. This paper uses many definitions and acronyms; they are listed in Table 5 at the end of the paper.

clock frequency: 3 GHz
decode/rename: 1 inst (any) + 3 "simple" insts (1 or 2 µops)
reorder buffer: 64 µops
load queue: 32 loads
store queue: 16 stores
dispatch: 6 µops, dependency-based steering
schedulers: 4 × 8-µop int, 2 × 8-µop FP/SSE, 2 × 16-µop loads
execution: 4 int (1 mul or 1 div), 2 FP/SSE (1 div), 2 loads/stores
retirement: 6 µops
post-retirement: 16-store post-retirement queue, 2 stores/cycle
branch predictor: 64-Kbit TAGE, 18-Kbit ITTAGE [12]
branch misprediction: recovery at execution, 12 cycles minimum penalty
cache line: 64 B
IL1 cache: 32 KB, 8-way assoc., 1 line/cycle bandwidth, up to 6 pipelined misses
DL1 cache: 32 KB, 8-way assoc., write back, 2 cycles latency, 8 banks, 8-byte bank width, virtually indexed, PC-based stride prefetch
DL1 misses: 16 MSHRs
L2 cache: 512 KB, 8-way assoc., DIP policy [10], write back, 9 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
L3 cache: 8 MB, 16-way assoc., DIP policy [10], write back, 18 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
memory + bus: 210 cycles latency, 16 bytes/cycle bandwidth
ITLB: 64 entries, 4-way
DTLB: 64 entries, 4-way
TLB2: 512 entries, 4-way, 4 cycles latency
page size: 4 MB
load/store dependencies: store sets [2] + single-entry forwarding from store queue [8]
Table 1: Baseline superscalar core.

Footnote 1: This is assuming perfect parallelism. Limited off-chip memory bandwidth, various overheads and imperfect load balancing
2 Related work
Kobayashi found that many programs spend a significant fraction of the execution in dynamic loops, especially scientific programs [7]. Our definition of dynamic loops is close to Kobayashi's one. Tubella and González described a hardware mechanism for the automatic detection of dynamic loops [17]. Their definition of dynamic loops is different from ours and targeted toward speculative multithreading. Some recent processors have a loop buffer to decrease the energy consumption by avoiding re-fetching and re-decoding the same loop body again and again. For instance, the Intel Nehalem has a Loop Stream Detector that can hold up to 28 µops [4]. García et al. describe a loop buffer that implements a register renaming mechanism for loops, more efficient than conventional register renaming hardware [5]. Stitt et al. proposed a LA for a system-on-chip, implemented with some configurable logic, able to detect and accelerate loops transparently [14]. The time for analyzing the loop and configuring the LA appears to be very long (tens of millions of clock cycles for relatively simple loops [14]). It is not clear whether this approach could be used in general-purpose systems. Clark et al. described an approach where loops are identified and modulo-scheduled dynamically (hence preserving binary compatibility) onto a LA decoupling memory accesses from computations [3]. Their LA has a configurable compute accelerator which can execute up to 15 integer operations in only 2 clock cycles. Vajapeyam et al. proposed a dynamic vectorization (DV) scheme [18] based on trace processors [19, 11]. This DV scheme has a few similarities with our loop accelerator (e.g., use of FIFO queues), but is overall very different. The DV scheme assumes 64 processing elements (PEs), different instances of a loop iteration being processed by different PEs. Neither memory disambiguation nor problems concerning communication bandwidth and latency between PEs are addressed in [18].
3 Simulation set-up
Our baseline configuration simulates a modern x86 superscalar core. Our simulator is trace driven (we do not simulate wrong-path effects). We used Pin 2.8 [9] to generate a trace for each SPEC CPU2006 benchmark. We compiled benchmarks for the x86-64 architecture with gcc 4.4.3 "-O2". Each trace consists of 40 samples of 50 million consecutive instructions, for a total of 2 billion dynamic instructions (and the associated memory
[Figure 1 (bar chart): IPC, from 0.0 to 2.5, for each SPEC CPU2006 benchmark (400 to 483); two series: Nehalem and simulator.]
Figure 1: IPC of the simulated core vs. IPC measured on an Intel Nehalem for the SPEC CPU2006 benchmarks.
parameters of the baseline microarchitecture are listed in Table 1. Instructions are split into micro-ops (µops for short) at decode. We generate separate ADDR µops for memory accesses [8]. Load and store µops get the address from the associated ADDR µop, which is executed by any ALU. At most 8 µops are generated per instruction. A µop has at most 2 source registers and 1 destination register, not counting the flags. There are 4 integer schedulers, 2 floating-point schedulers and 2 load schedulers. Each scheduler can select one ready µop per cycle in its issue buffer. The µop is removed from the buffer when issued. We assume a rescheduling mechanism, but we did not simulate rescheduling penalties. We simulate hardware prefetchers for the DL1, L2 and L3 caches. The DL1 prefetcher is a PC-based stride prefetcher. The L2 and L3 prefetchers are stream prefetchers that issue prefetch requests based on the sequence of misses and hits on prefetched blocks [16]. The prefetch distance is adjusted dynamically with a feedback mechanism. We assume 4 MB memory pages. Using large pages decreases the number of TLB misses and makes prefetching in the L2 and L3 caches more effective, as stream prefetchers are limited by page boundaries [16]. TLB misses are hardware-managed, as in x86 processors [1].

The baseline IPC (instructions per cycle) numbers for the SPEC CPU2006 are reported in Figure 1. We also report the IPC measured on an Intel Nehalem processor when executing each benchmark to completion. The simulated microarchitecture is not identical to a Nehalem. For instance, we do not simulate µop fusion and macro-op fusion. Our goal was to simulate a realistic superscalar core, not to match the Nehalem perfectly.
4 The loop detector
In this study, the term loop is used in the sense of dynamic loop. A dynamic loop is a periodic sequence of dynamic instructions [7]. Instructions with different program counter addresses (PC, for short) are considered distinct, although they may perform the same action. The loop length is the number of dynamic instructions in the sequence. The body size $B$, in instructions, is the length of the smallest period. The loop body (or body, for short) consists of the first $B$ instructions in the sequence, i.e., iteration number 0. The next $B$ instructions belong to iteration number 1, and so on. Instructions in the body are not necessarily distinct. For example, the body itself may include a tiny loop unrolled, or a function executed multiple times.

The loop detector is a hardware mechanism whose goal is to find loops. Because there will be a transition penalty for entering and leaving the loop acceleration mode, we try to detect loops that are long enough. The loop detector consists of two main parts: the Loop Monitor and the Loop Table.
The loop monitor. The Loop Monitor (LM) is the main mechanism for finding loops. It includes a tiny LM table (e.g., 4 entries). Each LM entry contains an end-of-loop PC (EOL PC) which is the search key, a valid bit, a body size, a body signature, a previous signature, an instruction count and a first-iteration bit. The LM table is fully-associative. For each instruction retiring from the reorder buffer, the body size of each valid LM entry is incremented and the corresponding body signature is updated. The signature is
[Figure 2 (bar chart): percentage of execution spent in loops (0% to 100%), with an "instr" bar and a "time" bar for each SPEC CPU2006 benchmark (400 to 483); series: MBS=16, MBS=32, MBS=64, MBS=128, MBS=256.]
Figure 2: Percentage of execution (instructions and time) spent in loops (MinLM=900 instructions, LBI=5)
during the program execution. The longer the signature, the less likely that two distinct loop bodies have the same signature. In our simulations, we use 32-bit signatures, and we update the signature by rotating the signature 1 bit to the left and applying an XOR with the new instruction PC. When the body size of any valid LM entry exceeds a fixed maximum body size (MBS), we close that entry by resetting its valid bit. If the PC of a retiring instruction hits in the LM table, we look at the information stored in the matching entry. If the first-iteration bit is set or if the signature equals the previous signature, we add the body size to the instruction count, otherwise we reset the instruction count. When a branch instruction jumps backward, it is considered a potential end-of-loop. If no valid entry exists for that branch PC, we create a new entry (e.g., by reusing a closed entry): the EOL PC is set to the branch PC, the valid bit and first-iteration bit are set, and all the other fields are reset. If an entry already exists for the backward jump, we copy the body signature into the previous signature and we reset the first-iteration bit, the body signature and the body size. When the instruction count in one of the LM entries exceeds a predefined threshold MinLM, we close all the other entries, and we enter the loop building state. The only valid LM entry remaining is called the active LM entry. Loop building corresponds to a fixed number of loop iterations during which we extract the information that will be needed to execute the loop on the LA. The number of iterations spent in the loop building state is denoted LBI (loop build iterations). When loop building is done, we migrate the execution to the LA. After the loop exit, we update the instruction count in the active LM entry and we close the entry.
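The Loop Monitor behavior described above can be summarized by a small software model. This is only an illustrative sketch under our own assumptions, not the simulator used in this study: the class and method names are hypothetical, the order of the steps within one retirement is our reading of the text, a candidate loop is simply dropped when all entries are valid, and loop building and the Loop Table are left out.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


def update_signature(sig, pc):
    """Rotate the 32-bit signature left by 1 bit, then XOR it with the PC."""
    rotated = ((sig << 1) | (sig >> 31)) & MASK32
    return (rotated ^ pc) & MASK32


@dataclass
class LMEntry:
    eol_pc: int                    # end-of-loop PC (search key)
    valid: bool = True
    first_iteration: bool = True
    body_size: int = 0
    signature: int = 0
    prev_signature: int = 0
    instr_count: int = 0


class LoopMonitor:
    def __init__(self, num_entries=4, mbs=128, min_lm=900):
        self.num_entries, self.mbs, self.min_lm = num_entries, mbs, min_lm
        self.entries = []

    def _find(self, pc):
        return next((e for e in self.entries if e.valid and e.eol_pc == pc), None)

    def _allocate(self, pc):
        new = LMEntry(eol_pc=pc)
        for i, e in enumerate(self.entries):
            if not e.valid:                      # reuse a closed entry
                self.entries[i] = new
                return
        if len(self.entries) < self.num_entries:
            self.entries.append(new)
        # Assumption: if every entry is valid, the candidate is dropped.

    def retire(self, pc, backward_jump=False):
        """Process one retired instruction.

        Returns the active LM entry when a loop is detected, else None.
        """
        for e in self.entries:
            if e.valid:
                e.body_size += 1
                e.signature = update_signature(e.signature, pc)
                if e.body_size > self.mbs:       # body too large: close entry
                    e.valid = False
        hit = self._find(pc)
        if hit is not None:
            if hit.first_iteration or hit.signature == hit.prev_signature:
                hit.instr_count += hit.body_size
            else:
                hit.instr_count = 0
        if backward_jump:                        # potential end-of-loop
            if hit is None:
                self._allocate(pc)
            else:
                hit.prev_signature = hit.signature
                hit.first_iteration = False
                hit.signature = 0
                hit.body_size = 0
        if hit is not None and hit.instr_count > self.min_lm:
            for e in self.entries:               # close all the other entries
                if e is not hit:
                    e.valid = False
            return hit                           # enter loop building
        return None
```

For example, feeding this model a 3-instruction body ending with a backward jump, with MinLM set to 20, reports the loop once the accumulated instruction count first exceeds the threshold.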
The loop table. It takes MinLM dynamic instructions before an instruction sequence is considered a loop by the LM. On the one hand, if MinLM is fixed to a low value, short loops will be detected. But the execution time saved on the LA may not be worth the transition penalty. On the other hand, if MinLM is fixed to a high value, the MinLM instructions that could have been executed on the LA are instead executed by the superscalar core, which is a probable waste of performance. So MinLM is typically set to a medium value, e.g., 900 instructions. Yet, the next time we encounter this loop, we may assume that it will behave as the last time and enter the loop acceleration mode immediately. This is one of the functions of the loop table (LT). Each LT entry records some information about one loop, identified by its body signature. In particular, there is a confidence bit in each LT entry. When we close an LM entry, we set or reset the confidence bit depending on whether the instruction count is greater or less than MinLM. When the PC of a retired instruction matches one of the valid LM entries, we access the corresponding LT entry. If that LT entry exists and if the confidence bit is set, we close all the other LM entries and we enter loop building immediately.
4.1 Fraction of execution spent in loops
Figure 2 shows the fraction of execution spent in loops for different values of MBS. Two percentages are given for each benchmark, a percentage of executed instructions and a percentage of execution time. MinLM is set to 900 instructions and LBI=5. Several benchmarks spend a significant fraction of the execution in
benchmark          MBS=128    MBS=256
401.bzip2             5231       4838
403.gcc               8315       8312
433.milc           2190425    6415123
434.zeusmp            4257       4359
437.leslie3d          2218       2223
444.namd              3749       6320
447.dealII            2634       2627
450.soplex            3216       3261
456.hmmer             5446       5448
459.GemsFDTD          4249       4953
462.libquantum       15362      18216
465.tonto             4161       4027
481.wrf               2643       2645
483.xalancbmk         4239       5475
Table 2: Average loop length for loopy benchmarks (MinLM=900 instructions, LBI=5)

[Figure 3 (block diagram); labels: superscalar core, retired uops, loop detector, loop builder, loop accelerator, arch. reg., IL1, DL1, L2, L3.]
Figure 3: The superscalar core and the sequential loop accelerator
in succession, about one third of all the instructions executed belong to loops (as we have defined them) with a maximum body size of 256 instructions. In fact, many loops have a body not exceeding a few tens of instructions. Yet, several benchmarks spend a significant part of the execution in loops whose body size exceeds 128 instructions, in particular 433.milc, 444.namd, 459.GemsFDTD, 465.tonto and 483.xalancbmk. We define a loopy benchmark as a benchmark with more than 20% of dynamic loop instructions with MBS=256. About half of the SPEC CPU2006 benchmarks are loopy according to this definition. Loopy benchmarks are listed in Table 2, along with the average loop length according to our loop detector. In the remainder of this study, we assume MBS=128 instructions.
5 The loop accelerator
This study considers the possibility of accelerating the execution of dynamic loops, without help from the programmer and preserving binary compatibility. The goal of this study is not to explore the design space of loop accelerators (LAs), which we believe is huge. Instead, we have tried to imagine a new microarchitecture avoiding conventional superscalar techniques that do not scale well. We tried to exploit loop characteristics as much as possible in order to make the LA scalable. We have also tried to make the LA as general as possible so that it can accelerate loops with a small or large body, as it was shown in the previous section that both cases are important.

The proposed microarchitecture is depicted in Figure 3. We consider here only the "sequential" part of a heterogeneous many-core. During loop building, the loop builder takes as input the µops retired from the superscalar core reorder buffer and prepares the loop body for execution on the LA. When loop building is done, the superscalar core instruction window is cleared and the execution is migrated to the LA. When a loop exit condition occurs, like a branch taking a different direction, the architectural registers are updated and the execution is migrated back onto the superscalar core, until another dynamic loop is encountered.

The rest of this section is organized as follows. Section 5.1 introduces the cyclic register dependency graph, which is useful for characterizing a loop and understanding its execution on a LA. The proposed LA is described in Section 5.2 and Section 5.3 focuses on memory dependencies. Section 5.4 explains how the
[Figure 4 (graph): the nodes are listed below as "rank : PC  µop"; the edges are not shown.]
 0 : 5c8270  357:=ADDR(13)
 1 : 5c8270  109:=LOAD(357)
 2 : 5c8274  17,flags:=ALU(17)
 3 : 5c8278  357:=ADDR(12)
 4 : 5c8278  110:=LOAD(357)
 5 : 5c827c  357:=ADDR(18)
 6 : 5c827c  358:=LOAD(357)
 7 : 5c827c  109:=SSE(109,358)
 8 : 5c8280  357:=ADDR(26)
 9 : 5c8280  358:=LOAD(357)
10 : 5c8280  110:=SSE(110,358)
11 : 5c8285  12,flags:=ALU(12)
12 : 5c8289  13,flags:=ALU(13)
13 : 5c828d  18,flags:=ALU(18)
14 : 5c8291  109:=SSE(109,110)
15 : 5c8295  357:=ADDR(19)
16 : 5c8295  110:=LOAD(357)
17 : 5c8299  110:=SSE(109,110)
18 : 5c829d  357:=ADDR(19)
19 : 5c829d  STORE(357,110)
20 : 5c82a1  19,flags:=ALU(19)
21 : 5c82a5  flags:=ALU(17,20)
22 : 5c82a8  BRANCH(flags)
Figure 4: Example of loop CRDG found in benchmark 481.wrf. The body consists of 15 instructions split up into 23 static µops ranked from 0 to 22. Dependencies with µops not belonging to the loop are not part of the CRDG.

onto the LA for good performance. Other performance tuning we did is described in Section 5.6, and the simulated performance speedups are presented in Section 5.7.
5.1 Cyclic register dependency graph
To understand the performance of a LA, it is convenient to consider the register dependency graph (RDG) of a sequence of µops, where each node of the graph is a dynamic µop and directed edges represent register dependencies. The RDG of a loop has a periodic structure and it is possible to summarize it with a cyclic register dependency graph (CRDG) where each node of the graph is a static µop. Figure 4 shows a loop CRDG consisting of 23 static µops ranked from 0 to 22. The rank of a static µop corresponds to the sequential order. For example, the loop represented in Figure 4 consists of the dynamic instances of µops 0,1,...,22,0,1,...,22,... and so on. Dependencies in a CRDG are of two sorts: intra-iteration and inter-iteration. When the µops producing and consuming a register value belong to the same iteration, it is an intra-iteration dependency. Otherwise it is an inter-iteration dependency. For an inter-iteration dependency, if the producer belongs to iteration $k$, the consumer belongs to iteration $k+1$. In Figure 4, inter-iteration dependencies are represented by edges whose head rank is less than or equal to the tail rank (e.g., both edges originating from µop 11).

The longest chain in a RDG defines a minimum execution time. For a loop, the longest RDG chain can be found by considering cycles in the CRDG. Most loops have 1-µop cycles, like the loop shown in Figure 4 (µops 2, 11, 12, 13, 20 depend on themselves). However, a few loops have cycles consisting of several µops. The longest chain in a loop RDG corresponds to the greatest CRDG cycle, and the chain length is equal to the cycle size times the number of loop iterations. The actual execution time, though, may be longer than this minimum time if resources for executing the µops are limited. The impact of resource constraints on a loop performance can be taken into account by adding constraint edges to the CRDG. For instance, in order to simplify the hardware and permit a high clock frequency, we force the dynamic instances of the same static µop to execute in program order. This corresponds to adding to each µop in the CRDG a constraint edge to itself. Doing so does not increase the loop execution time. On the other hand, if we add a constraint edge between two distinct µops of the CRDG, we may create an artificial cycle consisting of several µops, thereby increasing the loop execution time.

5.2 A ring-shaped loop accelerator
Figure 5 depicts the global architecture of the proposed LA. The LA consists of 8 execution nodes (node for short) connected by a 4-lane pipelined bus. There are different node types, each node type being specialized for certain µops. The I-node executes µops having a 1-cycle latency. This includes ADDR µops, MOV µops (integer and floating-point) and all the µops that would normally execute on the superscalar core ALUs. The F-node can execute the most frequent floating-point operations, in particular additions and multiplications. It executes integer multiplications too. The L-node executes load µops and the S-node executes store µops. The M-node is a mutable node whose function is defined at loop building. It is actually an I-node augmented with floating-point dividers. If the loop body contains some floating-point division, the M-node executes only divisions. Otherwise it behaves like a normal I-node. It is not necessary for the LA to be able to execute the whole instruction set. For instance, we noticed that integer divisions are rare in the loops found by our loop detector in the SPEC CPU2006. Therefore, loops featuring integer divisions are executed by the superscalar core.

[Figure 5 (diagram): ring of 8 execution nodes (three I-nodes, two F-nodes, one L-node, one S-node, one M-node), with latches and a mask on the bus.]
Figure 5: Ring-shaped loop accelerator: 8 execution nodes connected by a 4-lane pipelined bus.
The pipelined bus. The pipelined bus consists of 4 lanes divided in 8 segments. The bus transports packets. A packet consists of a 128-bit data (footnote 2), a µop rank, and an 8-bit node vector (one bit per node). The µop rank tells which µop in the body is the producer of the data. The node vector tells which nodes need the data. The node vector is modified by a mask as it passes through segments. When the packet reaches the last node needing the data, the node vector becomes null and the packet is considered empty. Now it is possible for that node (or a subsequent node) to overwrite the empty packet. In the worst case, a data does a complete turn on the lane until it returns to the node from which it was emitted. There are 2 latches per lane segment (i.e., 16 latches per lane). When the bus clock is low, the first latch is transparent and the second latch is opaque. When the bus clock is high, it is the other way around. Because of lane segment RC delays, we assume that the bus is clocked at half the node clock frequency (footnote 3). To compensate this loss of throughput, each lane is doubled so that it transports two packets instead of one, as shown in Figure 5.
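The way a packet is consumed along one lane can be illustrated with a short software model. This is a hedged sketch and not the hardware design itself: the names `Packet` and `deliver` are ours, and we assume that the mask simply clears the bit of each node that reads the data. Latches, clocking and lane doubling are not modeled.

```python
from dataclasses import dataclass

NUM_NODES = 8  # one bit of the node vector per execution node


@dataclass
class Packet:
    data: int
    uop_rank: int      # rank of the µop that produced the data
    node_vector: int   # bit i set means node i still needs the data

    def is_empty(self):
        return self.node_vector == 0


def deliver(packet, src):
    """Move a packet around the ring, starting from node `src`.

    Returns the nodes that read the data, in the order they are reached.
    """
    readers = []
    node = src
    for _ in range(NUM_NODES):              # at most one complete turn
        if packet.is_empty():
            break                           # the packet can now be overwritten
        node = (node + 1) % NUM_NODES       # next lane segment
        bit = 1 << node
        if packet.node_vector & bit:
            readers.append(node)
            packet.node_vector &= ~bit & 0xFF   # mask off this node's bit
    return readers
```

In this model, a packet emitted by node 1 for nodes 0, 2 and 5 is read by nodes 2, 5 and 0 in that order, and a packet whose only consumer is its own emitter does a complete turn, which is the worst case mentioned above.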
Footnote 2: We assume 128-bit data, as this is the data size required by Intel SSE operations. We did not try to optimize this. Future work may try to reduce the data size to 64-bit, exploiting the fact that most operations are actually scalar.

[Figure 6 (block diagram): lanes 0 to 3 at the node input and output, four IQs, a distributor, and four execution units, each with its EVQs, CEC, EU, OQ and LEQ.]
Figure 6: Execution node (here, an I-node). Each execution node contains 4 execution units.

The execution node. Figure 6 shows the global structure of an I-node. Other node types share many characteristics with the I-node. Each node contains 4 execution units (EU). Each EU can execute 1 µop per cycle, i.e., the maximum node throughput is 4 µops per cycle. The loop builder maps static µops to EUs. All the dynamic instances of a static µop are executed by the same EU. The µops executed by the same EU execute in program order with respect to each other, but they can execute out-of-order with respect to µops on other EUs. Once a µop is executed, its output data is sent to the Output Queue (OQ), with the corresponding µop rank and node vector. A copy of the data is also pushed into the Loop Exit Queue (LEQ). The LEQ is used at loop exit time for updating the architectural registers with the correct values before resuming the execution on the superscalar core. Data in the OQ is sent onto the bus and removed from the OQ as soon as possible, i.e., when there is an empty packet at the corresponding lane segment. The execution of µops on a given EU is suspended if the associated OQ is on the verge of becoming full. At the node input, packets are read from the lanes. If the corresponding bit is set in the node vector, the data and µop rank are inserted into the Input Queue (IQ). We assume that the IQ never gets full (we will see later how we enforce this). Consequently, the data produced by a static µop enter the IQs where they are needed in sequential order. In each cycle, one data is dequeued from each IQ by the distributor. The distributor sends these data to the External Value Queues (EVQ). Each EU has a dedicated set of EVQs. A data from an IQ may be needed by several EUs in the same node. In this case the distributor duplicates the data (up to 4 copies, one per EU). An EVQ is associated with one particular static µop, i.e., all the data going through a given EVQ come from the same static µop. If a data cannot be distributed because one of the target EVQs is full, the data remains in the IQ until all the target EVQs can accept it. Non-constant data feeding the EU come either from an EVQ head or from the LEQ "top". The LEQ top consists of the $N_{max}$ data output most recently by the EU, where $N_{max}$ is the maximum number of static µops that can be mapped on the same EU (the LEQ is physically implemented as two distinct parts). The EVQ head data is dequeued only after the value is no longer needed by any µop on the EU. The execution of µops on an EU is controlled by a Cyclic Execution Controller (CEC). The CEC loops through the $N_{eu}$ static µops mapped onto the EU, repeatedly and in program order. In particular, the CEC controls the various multiplexors for selecting the EU inputs and dequeues data from EVQs. The CEC also holds constant values, i.e., values that are used as input by some µops and that are guaranteed to be constant throughout the loop execution (e.g., register values not modified by the loop). The CEC generates a sequence number for each dynamic µop: the sequence number for the $n$-th dynamic instance of a static µop of rank $k$ is $B_{max} \times n + k$, where $B_{max} = 8 \times MBS$ is the maximum number of µops in a loop body (footnote 4) and $n$ corresponds to the iteration number. Each EU executes µops in program order, which means that the sequence number increases monotonically. For µops with inter-iteration input dependencies, input data for the first iteration are obtained at loop building. It should be noted that there exists a datapath from each EU to every other EU in the LA, thanks to the distributors. We have chosen not to have any direct datapath between EUs in the same node to simplify the hardware (footnote 5). Hence if a µop takes as input a data produced by another µop in the same node but on a different EU, this data must travel around the lane to reach the consumer µop. If a short communication latency between 2 µops is needed for performance, we must try to map these 2 µops on the same EU, so that the consumer µop can catch the data from the LEQ top. If we find during loop building that the loop cannot be executed by the LA, the loop keeps executing on the superscalar core. This happens for instance if the loop contains a µop that the LA cannot execute, like an integer division. This happens also if $N_{max}$ is too small for that loop, or if there are too few EVQs. The F-node and M-node have a global structure similar to the I-node. The L-node and S-node are somewhat different.
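The sequence number generated by the CEC, $B_{max} \times n + k$, can be illustrated as follows. This is a small sketch with names of our own choosing; it uses MBS=128, the value assumed in this study, so that $B_{max} = 1024$. Because $B_{max}$ is a power of 2, the multiplication and addition reduce to concatenating the iteration number and the rank.

```python
MBS = 128                              # maximum body size, in instructions
B_MAX = 8 * MBS                        # at most 8 µops per instruction
RANK_BITS = B_MAX.bit_length() - 1     # 10 bits when B_MAX = 1024

assert B_MAX & (B_MAX - 1) == 0        # B_MAX must be a power of 2


def sequence_number(n, k):
    """Sequence number of the n-th dynamic instance of the µop of rank k."""
    assert 0 <= k < B_MAX
    return B_MAX * n + k


def sequence_number_concat(n, k):
    """Same value, obtained by concatenating the bits of n and k."""
    assert 0 <= k < B_MAX
    return (n << RANK_BITS) | k


def split(seq):
    """Recover (iteration number, rank) from a sequence number."""
    return seq >> RANK_BITS, seq & (B_MAX - 1)
```

For instance, the µop of rank 22 in iteration 3 gets sequence number 3094 with either function, and `split(3094)` returns `(3, 22)`.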
The S-node. The S-node receives store addresses and store data from other nodes; it does not write on the lanes (no OQ, no LEQ). There is one Loop Store Queue (LSQ) associated with each of the 4 EUs. EUs do not really execute the store µops, they just gather the address and data for each dynamic store. The address and data are then sent into the LSQ, along with the store sequence number. The S-node contains a store validation unit (SVU). The SVU sends stores into the post-retirement store queue, which the LA shares with the superscalar core and from which stores can write into the DL1 cache. The SVU has a table containing the ranks of all the static store µops in the loop body, in program order. The SVU logic loops through this table repeatedly, generating the sequence number corresponding to the next dynamic store, in program order. This SVU sequence number is searched among the 4 LSQ heads. The store is then validated: it is dequeued from its LSQ, sent to the post-retirement store queue, and the SVU sequence number is updated. In our simulations, we have assumed that the S-node can validate 2 stores per cycle.

⁴ $B_{max}$ is a power of 2, so generating the sequence number is simple.
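The store validation loop of the SVU described above can be sketched as follows. This is a simplified model, not the simulator code: it assumes the loop contains at least one store, represents each LSQ as a queue of hypothetical `(seq, addr, data)` tuples, and leaves out the freeze at the MSN that the global synchronizer imposes.

```python
from collections import deque


def validate_stores(store_ranks, lsqs, b_max, iteration=0, index=0, per_cycle=2):
    """Model one cycle of store validation.

    store_ranks: ranks of the static store µops, in program order (non-empty).
    lsqs: list of deques of (seq, addr, data), each in increasing seq order.
    Returns (validated stores, next iteration, next index in store_ranks).
    """
    validated = []
    for _ in range(per_cycle):
        # Sequence number of the next dynamic store in program order.
        expected = b_max * iteration + store_ranks[index]
        # Search this sequence number among the LSQ heads.
        head = next((q for q in lsqs if q and q[0][0] == expected), None)
        if head is None:
            break  # the next store has not reached an LSQ head yet
        validated.append(head.popleft())  # goes to the post-retirement store queue
        index += 1
        if index == len(store_ranks):
            index, iteration = 0, iteration + 1
    return validated, iteration, index
```

Validating stores strictly in sequence-number order is what lets four independent LSQs feed a single, program-ordered post-retirement store queue.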
The L-node. The L-node receives load addresses from I-nodes. It reads the DL1 cache and sends the load data through the bus to the other nodes. In the L-node, EUs are replaced with Load Issue Buffers (LIB) and OQs are replaced with Load Output Queues (LOQ). Each CEC steps through the static loads that have been mapped to it, in program order. The CEC allocates an entry in the LIB and in the LOQ for each dynamic load. The load address and LOQ entry identifier are written in the LIB entry. The LOQ entry has room for the load data. The CEC writes in the LOQ entry the µop sequence number and the 8-bit node vector for the data. In each cycle, one load is selected from each LIB and accesses the cache, i.e., the L-node can issue up to 4 loads per cycle. If the DL1 read succeeds, the load data is written into the LOQ entry, and the LIB entry is freed. Otherwise (bank conflict, DL1 miss, TLB miss), the load waits in the LIB and will be reissued later. Upon a miss, when the missing block is eventually inserted into the DL1, the associated MSHR wakes up the LIB entries that were waiting for that block, and the corresponding loads become ready for reissue. The LOQ behaves like a reorder buffer for loads. In a clock cycle, the load at the head of a LOQ is dequeued if the load data is there and if the packet on the lane segment can be overwritten.
The global synchronizer. The CEC has an $N_{max}$-entry µop buffer, a pointer on that buffer, and a Local Iteration Count (LIC). The pointer points to the µop that will next access the EVQs⁶. The pointer can be incremented in each clock cycle and is reset when its value becomes equal to $N_{eu}$. Every time the pointer is reset, the LIC is incremented. The 32 CECs (8 nodes, 4 lanes) work independently from each other. They may have different LIC values at a given time. However, we need a global synchronizer for keeping the program sequential semantics, a mechanism equivalent to the reorder buffer in a superscalar core.

In particular, the synchronizer must prevent a store from leaving the SVU before all the dynamic µops preceding the store in program order have been executed, as these µops may trigger a loop exit. A natural loop exit occurs when a static branch changes its behavior, i.e., a branch µop produces a result different from the one recorded at loop building. Stores after the loop exit point must not write into the DL1. After the loop exit, the LEQs are used for updating the architectural registers. But the LEQ size is limited. Hence one of the synchronizer functions is to prevent the execution on an EU from going too far ahead, making sure that the values needed for updating the architectural registers have not been pushed out of the LEQ.

The synchronizer is connected to all the nodes, receiving and sending signals from and to the nodes. The SVU in the S-node maintains a maximum sequence number (MSN). If the loop contains any store, store validation freezes when it reaches the MSN, and the SVU sends a signal to the synchronizer. Moreover, each CEC has a maximum value LICmax for its LIC. Different CECs may have different LICmax values, but with some constraints: in the S-node, all the CECs have the same LICmax = MinLICmax, where MinLICmax is fixed at loop building. On the other nodes, LICmax ≥ MinLICmax. When the LIC in a CEC reaches MinLICmax, the CEC sends a signal to the synchronizer. The CEC continues until the LIC reaches LICmax, then the CEC freezes. When the synchronizer has received a signal from each CEC and from the SVU, all CECs have a LIC greater than or equal to MinLICmax. The synchronizer then sends an unfreeze signal to all nodes. Upon receiving the unfreeze signal, each CEC subtracts MinLICmax from its LIC, which automatically unfreezes the frozen CECs (e.g., those in the S-node). Upon receiving the unfreeze signal, the SVU adds MinLICmax $\times B_{max}$ to the MSN, which unfreezes store validation.

The synchronizer maintains the program sequential semantics while preserving some decoupling between CECs. Still, it may impact performance. LICmax determines the instruction window size. The higher LICmax, the larger the instruction window and the more latency tolerance. However, the number of loop iterations in the window must not exceed the LEQ size, i.e., LICmax + MinLICmax must not exceed the LEQ size divided by $N_{eu}$. Global synchronization latency may also impact performance: the time during which a CEC is frozen is a waste of performance. Hence MinLICmax must not be too small. Another constraint is that MinLICmax cannot be greater than the LSQ size divided by $N_{eu}$, as the LSQ buffers the stores until the unfreeze signal increases the MSN, which permits validating the stores waiting in the LSQ. In summary, the LEQs and LSQs must be large enough for good performance.

⁶ The execution is fully pipelined (except floating-point divisions in the M-node). Data dependencies may stall the execution.
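The freeze/unfreeze handshake and the two sizing constraints described above can be summarized in a small behavioral sketch. The class and function names are hypothetical, signal latencies are ignored, and the SVU is reduced to a boolean saying whether store validation has reached the MSN.

```python
class CEC:
    """Minimal model of a CEC's Local Iteration Count (LIC)."""

    def __init__(self, lic_max):
        self.lic = 0
        self.lic_max = lic_max

    def frozen(self):
        return self.lic >= self.lic_max

    def finish_iteration(self):
        # Called each time the µop pointer wraps around; a frozen CEC does not advance.
        if not self.frozen():
            self.lic += 1


def synchronize(cecs, svu_at_msn, msn, min_lic_max, b_max):
    """Return the new MSN; unfreeze once every CEC and the SVU have signalled."""
    if svu_at_msn and all(c.lic >= min_lic_max for c in cecs):
        for c in cecs:
            c.lic -= min_lic_max  # CECs frozen at LICmax start running again
        return msn + min_lic_max * b_max
    return msn


def limits_ok(lic_max, min_lic_max, n_eu, leq_size, lsq_size):
    """Check the LEQ and LSQ constraints for one EU holding n_eu static µops."""
    return ((lic_max + min_lic_max) * n_eu <= leq_size
            and min_lic_max * n_eu <= lsq_size)
```

For example, with a 128-entry LEQ, a 64-entry LSQ and $N_{eu} = 4$, the pair LICmax = 20, MinLICmax = 10 satisfies both constraints, whereas LICmax = 24 would overflow the LEQ.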
Loop exit. When the execution of a µop triggers a loop exit, the µop sequence number, which we call the loop exit sequence number (LESN), is sent to the synchronizer. The LESN is then broadcast to all CECs. CECs whose sequence number already exceeds the LESN freeze immediately and ignore subsequent unfreeze signals. Other CECs continue to work until exceeding the LESN. If a new loop exit is detected while a previous loop exit was already pending, and if the new LESN is less than the LESN recorded in the synchronizer, it becomes the new loop exit point. We can exit the loop acceleration mode when all CEC sequence numbers exceed the LESN, all EU pipes have been drained, the SVU sequence number exceeds the LESN, and each LOQ in the L-node is either empty or has its head entry sequence number exceeding the LESN. Then, architectural registers can be updated with values found in the LEQs and the execution can resume on the superscalar core, starting from the first dynamic instruction following the loop exit point.
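The loop exit bookkeeping lends itself to a short sketch: the synchronizer keeps the smallest LESN reported so far, and acceleration mode can be left once every component has moved past it. The class below is an illustrative model with hypothetical names, not the hardware interface.

```python
class LoopExitTracker:
    """Tracks the pending loop exit sequence number (LESN)."""

    def __init__(self):
        self.lesn = None  # no loop exit pending

    def report_exit(self, seq):
        # The earliest exit point in program order becomes the loop exit point.
        if self.lesn is None or seq < self.lesn:
            self.lesn = seq

    def can_leave(self, cec_seqs, pipes_drained, svu_seq, loq_head_seqs):
        """loq_head_seqs: head sequence number of each LOQ, or None if the LOQ is empty."""
        if self.lesn is None:
            return False
        return (all(s > self.lesn for s in cec_seqs)
                and pipes_drained
                and svu_seq > self.lesn
                and all(h is None or h > self.lesn for h in loq_head_seqs))
```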
Full IQ and premature loop exit. We have assumed that the IQs never get full. It is actually difficult to make sure that an IQ can never get full, but it is possible to make it a rare event. If an IQ becomes full, we trigger a premature loop exit. The LESN is set to MSN - 1. Then, the loop exit takes place as described previously. Nevertheless, for limiting the occurrence of premature loop exits without oversizing the IQs, we introduce a throttling mechanism. We associate a wired-OR with each lane. When the occupancy of any IQ on a lane exceeds a certain threshold (e.g., half the IQ capacity), the wired-OR is asserted and the OQs on that lane stop sending data until the wired-OR is deasserted.
Deadlocks. The LA we have described is not immune to deadlocks. Instead of trying to prevent deadlocks in all cases, it is simpler to make them as rare as possible. In particular, the EVQs must be large enough to maintain a low deadlock probability. We detect a deadlock when no unfreeze signal has been generated for 10000 cycles. When this happens, we trigger a premature loop exit. A deadlock leads to a much higher performance penalty than a full IQ.
5.3 Memory dependencies
The proposed loop accelerator can emulate a window of several thousands of instructions. But we must deal with memory dependencies. We must guarantee a correct execution without sacrificing too much performance. We describe in this section some mechanisms to deal with memory dependencies inside loops. We exploit loop properties to simplify memory dependency enforcement. First, we expect most loops to exhibit a very good temporal and spatial locality. Second, we expect constant-stride accesses to be frequent in loops. Third, we expect dependencies between a store and a load in the same loop to repeat on each iteration.
Figure 7: Store-to-load bypassing ($a$ and $b$ are the ranks of the store and CHECK µops after transformation). [Diagram labels: "dependency is predicted", "bypassing", "double bypassing".]

The memory zone checker. The memory zone checker (MZC) is a table located in the L-node but accessed both by loads and stores. The purpose of the MZC is to detect memory order violations within loops. The MZC takes advantage of the spatial locality that exists in loops. Our MZC is conceptually similar to the Memory Disambiguation Table described by Stone et al. [15], except that we record only loads in the MZC. We define a zone as a memory region whose size is a power of two and which is aligned in memory. The main originality of our MZC is that the zone size is fixed at loop building (cf. Section 5.6). The zone address is obtained from the load/store address by a right shifting of the address bits. Each MZC entry is tagged with a distinct zone address and contains a valid bit, a load sequence number, a load address, a load data size, a conflict bit, a load PC and a known_pc bit. When a load executes, it accesses the MZC with its zone address. Upon a MZC miss, we search a free entry and we initialize it. A free entry is an entry whose valid bit was reset because its sequence number is less than the SVU sequence number. If no free entry is found, a premature loop exit is triggered. Upon a MZC hit, the sequence number in the MZC entry is compared with the load sequence number. If the load sequence number is greater, it overwrites the entry sequence number. If the data address and data size in the entry do not match those of the load, we set the conflict bit. If the load PC does not match the PC in the entry, we reset the known_pc bit. The MZC is also accessed by stores when they are validated by the SVU. A potential memory order violation is detected if the store zone hits in the MZC, if the store sequence number is less than the entry sequence number, and if the conflict bit is set or if the store data overlaps with the address and data size recorded in the entry.

The optimal zone size, i.e., the one that minimizes the number of premature loop exits, is not the same for all loops. If there is a good spatial locality in the loop, taking a large zone prevents filling the MZC. On the other hand, if loads and stores are independent but access the same memory zones, decreasing the zone size may prevent unnecessary premature loop exits. In our simulations, we have assumed that the MZC could be updated by 4 loads and read by 2 stores in the same cycle.
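A behavioral sketch of the MZC may help to follow the rules above. It is a simplified software model under explicit assumptions: a new entry starts with the conflict bit cleared and known_pc set, an access never straddles two zones, and the table is modeled as a dictionary rather than an associative array with ports. All identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MZCEntry:
    seq: int
    addr: int
    size: int
    pc: int
    conflict: bool = False  # assumption: cleared when the entry is initialized
    known_pc: bool = True   # assumption: set when the entry is initialized


class MZC:
    def __init__(self, n_entries, log2_zone):
        self.n_entries = n_entries
        self.log2_zone = log2_zone
        self.entries = {}  # zone address -> MZCEntry

    def load(self, seq, addr, size, pc, svu_seq):
        """Record an executing load; return False if a premature loop exit is needed."""
        # Entries older than the SVU sequence number are free (valid bit reset).
        self.entries = {z: e for z, e in self.entries.items() if e.seq >= svu_seq}
        zone = addr >> self.log2_zone
        e = self.entries.get(zone)
        if e is None:  # MZC miss
            if len(self.entries) >= self.n_entries:
                return False  # no free entry
            self.entries[zone] = MZCEntry(seq, addr, size, pc)
            return True
        e.seq = max(e.seq, seq)  # MZC hit: keep the youngest load
        if (e.addr, e.size) != (addr, size):
            e.conflict = True
        if e.pc != pc:
            e.known_pc = False
        return True

    def store_violates(self, seq, addr, size):
        """Check a store being validated for a potential memory order violation."""
        e = self.entries.get(addr >> self.log2_zone)
        if e is None or seq >= e.seq:
            return False
        overlap = addr < e.addr + e.size and e.addr < addr + size
        return e.conflict or overlap
```

Once two different accesses fall in the same zone, the conflict bit makes any older store to that zone look like a violation, which is why a zone that is too large can cause unnecessary loop exits.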
The stride-based checker. The MZC is sufficient to guarantee a correct execution, but it is not sufficient for good performance. A reasonably-sized MZC may trigger a lot of premature loop exits on nonexistent memory dependencies. We introduce a stride-based checker (SBC) to assist the MZC. The SBC takes advantage of constant-stride memory accesses. The SBC is a small fully-associative table where each entry summarizes the memory accesses generated by a particular static load. Each SBC entry is tagged with the load PC and contains a valid bit, a sequence number, a first address, a last address, a stride, a data size, and an iteration count. When executing a load, we search that load PC in the SBC. If an entry exists, we update it as follows. If the iteration count is null, we initialize the entry with the load address, data size and sequence number. Else, if the iteration count is non-null, we compute $S = \mathit{load\_address} - \mathit{last\_address}$. If the iteration count equals 1, we set the stride to $S$. Otherwise, if the iteration count is greater than 1 and if the stride differs from $S$, we invalidate the entry. Finally, we set the last address to the load address, and we increment the iteration count.

When validating a store, if the MZC detects a potential memory order violation and if the known_pc bit is set, we access the SBC with the load PC recorded in the MZC entry. If the entry does not exist, we create it (possibly evicting a load), reset its content, and a premature loop exit is triggered with the store as the loop exit point. If the entry exists, we use its content to confirm the possible memory order violation. We use the first address, the last address, the stride sign and the data size to find which memory region the load has accessed so far. If the store data does not overlap with that region, we are sure that there was no memory order violation for that store. Otherwise, we assume that there was one, and a premature loop exit is triggered. When validating a store, we "remove" the oldest dynamic load of every valid SBC entry whose iteration count is non-null and whose sequence number is less than the store sequence number: we decrement the iteration count and, if the iteration count is non-null, we add $B_{max}$ to the sequence number and we add the recorded stride to the first address.

Store-to-load bypassing. Some loops actually contain true memory dependencies. A solution for allowing the LA to execute these loops is to do store-to-load bypassing at loop building. The loop builder contains a Store Sets memory dependency predictor [2], identical to the one implemented in the superscalar core⁹. At loop building, if a static load is predicted to depend on a static store with the same data size, we transform the body as shown in Figure 7. The store µop stays in place, but 2 extra MOV µops are added in the loop body just behind the store. The load µop is removed from the body and replaced with a CHECK µop and a MOV µop. The MOV µop transmits the store data to the µops consuming the load value. The CHECK µop compares the load and store addresses. All CHECK µops and MOV µops execute on I-nodes. If a CHECK fails, a loop exit is triggered, taking as loop exit point the dynamic µop preceding (in sequential order) the instruction containing the removed load.

⁹ Our store sets LFST is fully associative, tagged with the SSID, which permits taking a wide SSID while keeping the LFST small.
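Returning to the stride-based checker (SBC), its per-entry rules can be sketched in a few functions. This is an illustrative model with hypothetical names. The region test is deliberately conservative: it uses the bounding byte range between the first and last addresses rather than the individual strided elements, and it assumes a violation whenever the entry is invalid or empty, a case the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class SBCEntry:
    valid: bool = True
    seq: int = 0     # sequence number of the oldest recorded load
    first: int = 0   # first address
    last: int = 0    # last address
    stride: int = 0
    size: int = 0    # data size in bytes
    count: int = 0   # iteration count


def sbc_load(e, seq, addr, size):
    """Update the entry of a static load when one of its dynamic instances executes."""
    if e.count == 0:
        e.seq, e.first, e.size = seq, addr, size
    else:
        s = addr - e.last
        if e.count == 1:
            e.stride = s
        elif e.stride != s:
            e.valid = False  # not a constant stride
    e.last = addr
    e.count += 1


def sbc_confirms_violation(e, st_addr, st_size):
    """Return True unless the store provably misses the region read so far."""
    if not e.valid or e.count == 0:
        return True  # assumption: be conservative without a usable summary
    lo = min(e.first, e.last)           # handles negative strides
    hi = max(e.first, e.last) + e.size  # exclusive upper bound
    return st_addr < hi and lo < st_addr + st_size


def sbc_remove_oldest(e, store_seq, b_max):
    """On each validated store, drop the oldest recorded load if it precedes the store."""
    if e.valid and e.count > 0 and e.seq < store_seq:
        e.count -= 1
        if e.count > 0:
            e.seq += b_max
            e.first += e.stride
```

Removing the oldest load as stores are validated keeps the summarized region limited to loads that are still younger than some unvalidated store, which is what makes the overlap test meaningful.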
The Bypassed Store Table. Even if a CHECK succeeds, we must still verify that stores between the dependent store-load pair do not write that memory location. The SVU contains a small fully-associative Bypassed Store Table (BST). Each BST entry contains a store address, a store PC, a data size and a lifetime. When validating a bypassed store, we search a free BST entry for that store, i.e., an entry whose lifetime is null. We initialize this entry with the store address, PC, data size, and we set the lifetime to the maximum number of dynamic stores separating the bypassed store-load pair(s). This value, which may be null, is obtained during loop building. If the lifetime is non-null and no free entry was found in the BST, we trigger a loop exit with that bypassed store as the loop exit point. When validating a store (whether bypassed or not), we check whether the store conflicts with any BST entry whose lifetime is non-null. If a conflict is detected, a loop exit is triggered with the validated store as the loop exit point. For each validated store, we decrement all the non-null lifetime values in the BST. When a conflict is detected, we train the store sets predictor with the validated store PC and the bypassed store PC. Doing so merges these two stores' store sets [2], so that on future occurrences of that loop, the load is predicted to depend on the correct store.
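The BST rules can be condensed into a short sketch. The order of operations within one store validation (conflict check, then lifetime decrement, then allocation for a bypassed store) is an assumption made here so that a bypassed store neither conflicts with nor ages its own entry; the text does not fix this order. Names and return values are hypothetical.

```python
class BST:
    """Illustrative model of the Bypassed Store Table."""

    def __init__(self, n_entries=16):
        # Each slot is None or a list [addr, pc, size, lifetime].
        self.entries = [None] * n_entries

    def validate_store(self, addr, pc, size, bypass_lifetime=None):
        """Process one validated store.

        bypass_lifetime is None for a store that is not bypassed.
        Returns None if validation proceeds, else a (reason, pc) pair for a loop exit.
        """
        # 1. Conflict check against live entries (byte-range overlap).
        for e in self.entries:
            if e and e[3] > 0 and addr < e[0] + e[2] and e[0] < addr + size:
                return ("conflict", e[1])  # e[1]: bypassed store PC, used for training
        # 2. Every validated store ages the live entries.
        for e in self.entries:
            if e and e[3] > 0:
                e[3] -= 1
        # 3. A bypassed store with a non-null lifetime needs a free entry.
        if bypass_lifetime:
            free = next((i for i, e in enumerate(self.entries)
                         if e is None or e[3] == 0), None)
            if free is None:
                return ("bst_full", pc)
            self.entries[free] = [addr, pc, size, bypass_lifetime]
        return None
```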
Double bypassing. The bypassing method described previously is effective only if the distance between the dynamic store and the dynamic load is less than one loop body. We found that this represents the majority of cases. However, one benchmark, 456.hmmer, suffered many failed CHECKs. To solve this case, we introduce double bypassing, which is the possibility for a dynamic store to communicate its data to a dynamic load at a distance between one and two bodies. Double bypassing is illustrated in Figure 7. It uses the BST like simple bypassing, except that the BST entry lifetime is set to a longer value. The loop builder contains a Failed Check Table (FCT), which is a small fully-associative table. When a CHECK triggers a loop exit, we record in the FCT the bypassed load PC, so that if a predicted-dependent load hits in the FCT during loop building, we apply double bypassing.
5.4 Loop body reduction
Redundant execution exists in most programs [13]. On the example of Figure 4, µops 8 and 9 are redundant: they produce the same result on each iteration. This result is obtained at loop building, hence these µops can be removed from the loop body before executing the loop on the LA. This is an iterative process: in each iteration during loop building, we remove from the CRDG the µops that have no inputs left in the CRDG. The number of redundant µops identified depends on LBI¹¹. We must be careful with loads and stores. A store µop, even if it produces the same result on each iteration, cannot be considered redundant in a shared-memory architecture, as removing it would break the memory consistency model. A redundant load can be removed, but we must check memory dependencies. Some hardware support is necessary for that. The SVU contains a Removed Loads Table (RLT). Each RLT entry contains a valid bit, a load address, a load data size and a load PC. During loop building, as long as there is room in the RLT, redundant loads are removed from the body and recorded in the RLT. When validating a store, we check whether the store conflicts with any RLT entry. If a conflict is detected, a loop exit is triggered with the store as the loop exit point. The PC recorded in the conflicting RLT entry is used to train the store sets predictor. For loopy benchmarks, we found that redundant µops typically represent between 10% and 20% of dynamic loop µops.

MOV bypassing. It is possible to bypass certain MOV µops, meaning that, at loop building, we make the MOV's consumer µop depend directly on the MOV's producer µop. MOV bypassing is possible if the distance (in the sequence of dynamic instructions) between the producer and consumer µops is less than one loop body. If MOV bypassing can be applied for all the consumers of the MOV, the MOV itself can be removed from the body.
5.5 Mapping heuristic
The LA performance is very dependent on the mapping of µops onto EUs. The in-depth study of mapping heuristics is left for future studies. For this preliminary study, we tried to find a good-enough mapping through trial and error. The heuristic we found helped us understand some important properties for a good mapping heuristic on such an LA. We just give a high-level description here, omitting some details.

¹¹ With LBI=5, we were able to remove all the redundant µops.

clock frequency: 3 GHz; LM table: 4 entries; LT: 64 entries; MinLM: 900 instructions; LBI: 5 iterations; MBS: 128 instructions; $N_{max}$: 12 µops; EVQs per EU: 12; EVQ: 32 data; IQ: 32 data; OQ: 16 data; LIB: 64 loads; LOQ: 128 data; LEQ: 128 data; LSQ: 64 stores; MZC: 64 entries; SBC: 16 entries; BST: 16 entries; FCT: 16 entries; RLT: 64 entries; SVU throughput: 2 stores/cycle; unfreeze signal latency: 4 cycles; wired-OR latency: 4 cycles; loop mode transition penalty: 100 cycles; extra transition penalty on a LT miss: 10000 cycles; store sets SSIT: 4096 entries; store sets LFST: 16 entries; SSIT clearing period: 10 million cycles.
Table 3: Fixed parameters for the loop detector, loop builder, and loop accelerator.
EU balancing. The µops mapped onto the same EU form a cycle in the constraint-augmented CRDG. Therefore, the average number of clock cycles per loop iteration cannot be less than the $N_{eu}$ of the most loaded EU. The mapping heuristic must try to reach a balanced distribution of µops on EUs. This is particularly important for loops with a small body: the mapping heuristic should avoid putting two µops on the same EU whenever possible, as this potentially doubles the loop execution time. Our mapping heuristic computes, from the body characteristics, a µop quota per EU type, assuming EU balancing. Once a µop is mapped onto an EU, we consider another µop, and so on until all µops have been mapped. We try to avoid mapping a µop onto an EU that has already reached its quota.

Lane segment sharing. Data that must go through the same lane segment share its bandwidth, which we have limited to one data per LA clock cycle. The µops producing these data cannot execute at an average rate greater than the rate at which the data go through the shared lane segments. Our mapping heuristic tries to minimize lane usage but does not try explicitly to minimize lane segment sharing. We try to map a µop on the same EU as one of its input µops, provided this EU has not reached its quota. Many µops have a single consumer, and this often permits avoiding using the bus for transmitting the data. If we must use the bus, we try to put the consumer µop on the node following the node where the producer µop has been mapped, whenever possible.

Natural CRDG cycles. The CRDG contains some natural cycles due only to data dependencies. Most natural cycles are 1-µop cycles consisting of an integer addition depending on itself. Yet, some loops contain multi-µop cycles which may increase the loop execution time considerably, as it takes several cycles for a data to travel from one EU to another. Whenever possible, our mapping heuristic tries to map onto the same EU the µops forming a natural cycle.

Artificial CRDG cycles. Artificial CRDG cycles are cycles in the constraint-augmented CRDG that are not natural cycles. Artificial CRDG cycles may increase the loop execution time dramatically, either because the cycle contains floating-point operations or, worse, because some µops in the cycle are mapped onto different EUs. Mapping µops onto the same EUs as their producers permits decreasing the probability of creating costly CRDG cycles, but this is not always sufficient. Whenever possible, our mapping heuristic tries to avoid mapping a µop on an EU if this would create an artificial CRDG cycle.

5.6 Performance tuning
Our simulator implements the loop detector and loop accelerator as described in Sections 4 and 5, i.e., with a great level of detail. Parameters we have used for the simulation are given in Table 3. Notice the loop mode transition penalty, which we have fixed at 100 clock cycles. We assume that this penalty takes into account the time elapsed after loop building and before the LA starts executing the loop, and the time necessary to update the architectural registers after the loop exit. In case of a LT miss, we add an extra penalty of 10000 cycles.
Figure 8: Global performance speedups obtained with the loop accelerator, relative to the baseline with no accelerator (higher is better). The second bar represents the fraction of instructions executed by the loop accelerator. [Bar chart; x-axis: benchmark; series: performance, fraction loop instructions.]

Maximum LICmax. For good performance, the values of LICmax and MinLICmax are set dynamically at loop building. As explained before, we try to set LICmax and MinLICmax as high as possible but under the limit permitted by the LSQ and the LEQ (which also depends on $N_{eu}$). However, when LICmax exceeds the EVQ depth, the deadlock probability becomes non-null. Actually, there is a LICmax value beyond which the deadlock probability increases dramatically. This value is not the same for all loops. To solve this issue, we record in each LT entry a value ML which defines a maximum for LICmax. When a deadlock occurs, we halve the ML value recorded in the LT entry for that loop. Moreover, a high LICmax is useful only for loops with a short body, for which a large instruction window represents many iterations. If the body exceeds 20 instructions, or if the time elapsed since the last deadlock is less than 1 million cycles, we set ML equal to the EVQ depth. Otherwise, we allow ML to be as high as 128. Overall, with this method, deadlocks have a negligible impact on performance.
Memory zone size. The optimal memory zone size is not the same for all loops. We record in each LT entry the $\log_2$ of the zone size. Each LT entry also contains a 3-bit saturating counter. The zone size for a loop is initially set to 1024 bytes. When a premature loop exit occurs because the MZC is full, the 3-bit counter is incremented. If the 3-bit counter value is equal to 7, we double the zone size. If the MZC or the SBC signals a memory order violation, we decrement the 3-bit counter. If the 3-bit counter value is null, we halve the zone size.
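The zone size adaptation can be sketched as a small state machine. Several details are assumptions made for the sake of a runnable example, as the text does not specify them: the initial counter value (4 here), the fact that the counter is left unchanged after the zone size is adjusted, and the lower and upper bounds on the zone size. All names are hypothetical.

```python
class ZoneSizeTuner:
    """Per-loop zone size state, as it could be kept in an LT entry."""

    def __init__(self, log2_zone=10, counter=4, min_log2=3, max_log2=20):
        self.log2_zone = log2_zone  # zone size starts at 1024 bytes
        self.counter = counter      # 3-bit saturating counter (initial value assumed)
        self.min_log2 = min_log2    # assumed bounds on the zone size
        self.max_log2 = max_log2

    @property
    def zone_size(self):
        return 1 << self.log2_zone

    def on_mzc_full(self):
        """Premature loop exit because the MZC is full: zones may be too small."""
        self.counter = min(7, self.counter + 1)
        if self.counter == 7:
            self.log2_zone = min(self.max_log2, self.log2_zone + 1)

    def on_memory_order_violation(self):
        """Violation signalled by the MZC or SBC: zones may be too large."""
        self.counter = max(0, self.counter - 1)
        if self.counter == 0:
            self.log2_zone = max(self.min_log2, self.log2_zone - 1)
```

The counter adds hysteresis: a single MZC overflow or a single violation does not change the zone size, only a sustained imbalance between the two kinds of events does.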
Disabling "bad" loops. Migrating the execution to the LA may actually decrease performance. This generally happens because of premature loop exits. For good performance, "bad" loops must be detected and their execution on the LA disabled. Each LT entry contains a $TimeLost$ value which estimates the time that has been lost so far by executing the loop on the LA rather than on the superscalar core. On a loop exit, the loop detector is trained with the number $n_i$ of instructions executed by the LA, the time $n_t$ spent on the LA and the number $n_m$ of L3 cache misses generated by the loop. The $TimeLost$ value is updated as follows:

$$TimeLost \leftarrow TimeLost + n_t - 0.7 \times n_i - 20 \times n_m$$

where values $0.7$ and $20$ were found empirically. $TimeLost$ is initially set to 0. For most loops it becomes negative, indicating that the LA is likely to perform better than the superscalar core. For some loops, $TimeLost$ increases. When $TimeLost$ exceeds 100000 cycles, we consider that this is a "bad" loop. Sometimes, we reset all the $TimeLost$ values in the LT. We do this when the time elapsed since the last reset exceeds 100 times the sum of positive $TimeLost$ values in the LT.

5.7 Simulation results
Figure 8 shows the performance obtained with the loop accelerator. Speedups are relative to the simulated baseline (cf. Figure 1). The second bar in Figure 1 shows the fraction of instructions executed by the LA. Some speedups are obtained on most loopy benchmarks, except 433.milc, whose loop behavior comes mostly from loop bodies greater than 128 instructions. Performance gains exceeding 25% are obtained on 6 benchmarks: 434.zeusmp, 437.leslie3d, 456.hmmer, 459.GemsFDTD, 462.libquantum and 481.wrf. These 6 benchmarks are the ones that execute at least 60% of the instructions on the LA. We measured the local acceleration provided by the LA on the loop fraction. For the 6 benchmarks mentioned above, the local acceleration is respectively 2.3, 2, 1.7, 2.8, 2.1 and 1.9 (ignoring the transition penalty). This level of local [...]

benchmark       all    no SBC  no byp.  no reduc.  naive mapping  ML=32  zone=1KB  bad loops
434.zeusmp      1.43   1.32    1.33     1.18       1.00           1.42   1.25      1.43
437.leslie3d    1.45   1.26    1.43     1.10       0.95           1.44   1.40      1.38
456.hmmer       1.38   0.99    0.99     1.22       0.99           1.38   0.99      1.38
459.GemsFDTD    1.32   1.00    1.13     1.13       1.00           1.33   1.28      1.32
462.libquantum  1.39   1.24    1.39     1.37       1.19           1.19   1.39      1.39
481.wrf         1.26   1.12    1.21     1.19       0.95           1.26   1.26      1.26
29 bench. mean  1.084  1.036   1.055    1.045      0.996          1.068  1.061     1.072

Table 4: Performance relative to the baseline. The second column is for all features enabled. Following columns give performance when disabling a single feature: respectively, disabling the SBC, store-to-load bypassing, loop body reduction, using a naive mapping heuristic, using a small ML, using a fixed zone size, and allowing bad loops.
Table 4 shows performance for the 6 benchmarks mentioned above and the average performance on all 29 benchmarks. The second column is for all features enabled. Following columns give performance when disabling a single feature. The mapping heuristic is very important. To quantify its impact, we have simulated a "naive" mapping heuristic which scans all EUs in a fixed order: we map on the current EU one µop that can be mapped on it, then we move to the next EU. We do this repeatedly until all µops have been mapped. With this naive heuristic, the average performance is even slightly lower than the baseline. Another important feature is the SBC. Without it, false memory dependencies trigger many premature loop exits. Loop body reduction is also very important. Without it, the loop accelerator is clearly not working at its full potential. Store-to-load bypassing brings significant performance gains on 456.hmmer and 459.GemsFDTD. The last 3 columns show that the performance tuning described in Section 5.6 brings non-negligible performance gains. The fixed memory zone size prevents 456.hmmer from benefiting from the accelerator. Using a large ML is important for 462.libquantum, which has small loop bodies. Disabling bad loops is a useful feature, especially for benchmarks which do not benefit from the accelerator.
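For reference, the "naive" mapping heuristic mentioned above is simple enough to sketch. The sketch assumes that a µop "can be mapped" on an EU when its kind matches the kind of the EU's node and the EU holds fewer than $N_{max}$ µops; the data representation and the function name are hypothetical.

```python
def naive_mapping(uops, eus, n_max):
    """Round-robin mapping of µops onto EUs.

    uops: list of (name, kind) pairs in program order.
    eus: list of (eu_id, kind) pairs in the fixed scan order.
    Returns {eu_id: [µop names]} or None if some µop cannot be mapped.
    """
    mapping = {eu_id: [] for eu_id, _ in eus}
    pending = list(uops)
    while pending:
        progress = False
        for eu_id, kind in eus:  # scan all EUs in a fixed order
            if len(mapping[eu_id]) >= n_max:
                continue
            # Map on the current EU the first pending µop that fits, then move on.
            for i, (name, uop_kind) in enumerate(pending):
                if uop_kind == kind:
                    mapping[eu_id].append(name)
                    del pending[i]
                    progress = True
                    break
        if not progress:
            return None  # the loop cannot be executed by the LA
    return mapping
```

This heuristic balances the number of µops per EU but ignores where producers and consumers are placed, which is consistent with the poor results reported for it in Table 4.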
6 Conclusion and future work
We have proposed a hardware mechanism for detecting dynamic loops. We found that about one third of all the instructions executed by the SPEC CPU2006 suite come from dynamic loops whose body size can be quite diverse, from a few instructions to several hundreds.

We have proposed a loop accelerator microarchitecture that can accelerate most dynamic loops without help from the compiler or the programmer. The accelerator configuration we have focused on has 8 nodes and 4 lanes, and can execute up to 32 µops simultaneously from a window of several thousands of µops. Its local hardware complexity is no greater than that of a conventional superscalar core. Our efforts for obtaining good performance speedups have shown the importance of mapping µops onto execution units very carefully. We have proposed new solutions for dealing with memory dependencies in a window of several thousands of instructions, exploiting loop properties. We have shown that a significant fraction of dynamic loop instructions are redundant and need not be executed by the loop accelerator.

This is a preliminary study, intended to provide a basis for future work on loop acceleration. We believe the design space of loop accelerators is huge. The mapping heuristic is very important for performance. We believe some of the characteristics we have outlined for a good mapping heuristic are somewhat general, like the importance of preventing artificial CRDG cycles. Other aspects of our mapping heuristic may be dependent on some of the choices we made, like not providing any direct datapath between EUs on the same node, or choosing a pipelined ring for communicating between nodes. Future studies may reconsider these choices and try to find a better tradeoff between the IPC, the hardware complexity of the bus and nodes, and the mapping heuristic. Nevertheless, we believe that the loop accelerator microarchitecture we have proposed is scalable beyond the particular configuration we considered. For instance, increasing the number [...] will increase the complexity of the distributor and some parts of the L-node (for instance the MZC, whose number of ports is proportional to the number of lanes). Yet, we believe that if one can double the width of the superscalar core, doubling the number of lanes should not be more difficult. The node microarchitecture depicted in Figure 6 is, we believe, easier to pipeline than a conventional 4-way superscalar execution core. Future work may consider the possibility to overclock the loop accelerator. The global speedups we have demonstrated are limited by Amdahl's law, i.e., by the fraction of the execution spent in dynamic loops. Nevertheless, the local acceleration we have obtained on dynamic loops is considerable for a hardware-only solution. A compiler aware of the presence of a loop accelerator may try to exploit it.
References
[1] T. W. Barr, A. L. Cox, and S. Rixner. Translation caching: Skip, don't walk (the page table). In Proc. of the 37th Int. Symp. on Computer Architecture, 2010.
[2] G. Z. Chrysos and J. S. Emer. Memory dependence prediction using store sets. In Proc. of the 25th Int. Symp. on Computer Architecture, 1998.
[3] N. Clark, A. Hormati, and S. Mahlke. VEAL: virtualized execution accelerator for loops. In Proc. of the 35th Int. Symp. on Computer Architecture, 2008.
[4] M. Dixon, P. Hammarlund, S. Jourdan, and R. Singhal. The next-generation Intel Core microarchitecture. Intel Technology Journal, 14(3), 2010.
[5] A. García, O. J. Santana, E. Fernández, P. Medina, and M. Valero. LPA: a first approach to the loop processor architecture. In Proc. of the 3rd Int. Conf. on High Performance Embedded Architectures and Compilers, 2008.
[6] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. IEEE Computer, 41(7):33-38, July 2008.
[7] M. Kobayashi. Dynamic characteristics of loops. IEEE Transactions on Computers, C-33(2):125-132, 1984.
[8] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: a cycle-level simulator for highly detailed microarchitecture exploration. In Proc. of the Int. Symp. on Performance Analysis of Systems and Software, 2009.
[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Janapa Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005. http://www.pintool.org.
[10] M. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr, and J. Emer. Adaptive insertion policies for high performance caching. In Proc. of the 34th Int. Symp. on Computer Architecture, 2007.
[11] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E. Smith. Trace processors. In Proc. of the 30th Int. Symp. on Microarchitecture, 1997.
[12] A. Seznec and P. Michaud. A case for (partially) tagged geometric history length branch prediction. Journal of Instruction Level Parallelism, April 2006. http://www.jilp.org/vol8.
[13] A. Sodani and G. S. Sohi. Dynamic instruction reuse. In Proc. of the 24th Int. Symp. on Computer Architecture, 1997.
[14] G. Stitt, R. Lysecky, and F. Vahid. Dynamic hardware/software partitioning: a first approach. In Proc. of the Design Automation Conference, 2003.
[15] S. S. Stone, K. M. Woley, and M. I. Frank. Address-indexed memory disambiguation and store-to-load [...]
[16] J. M. Tendler, J. S. Dodson, J. S. Field, H. Le, and B. Sinharoy. POWER4 system architecture. IBM Journal of Research and Development, 46(1), January 2002.
[17] J. Tubella and A. González. Control speculation in multithreaded processors through dynamic loop detection. In Proc. of the 4th Int. Symp. on High-Performance Computer Architecture, 1998.
[18] S. Vajapeyam, P. J. Joseph, and T. Mitra. Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs. In Proc. of the 26th Int. Symp. on Computer Architecture, 1999.
[19] S. Vajapeyam and T. Mitra. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In Proc. of the 24th Int. Symp. on Computer Architecture, 1997.
$B_{max}$: maximum number of µops for the loop body; BST: bypassed store table; CEC: cyclic execution controller; CRDG: cyclic register dependency graph; EU: execution unit; EVQ: external value queue; FCT: failed check table; IQ: input queue; LA: loop accelerator; LBI: loop build iterations; LEQ: loop exit queue; LESN: loop exit sequence number; LIB: load issue buffer; LIC: local iteration count; LM: loop monitor; LOQ: load output queue; LSQ: loop store queue; LT: loop table; MBS: maximum body size in instructions; MinLM: LM threshold in instructions; ML: maximum value of LICmax; MSN: maximum sequence number; MZC: memory zone checker; $N_{eu}$: number of static µops mapped onto a given EU; $N_{max}$: maximum number of static µops per EU; OQ: output queue; RDG: register dependency graph; RLT: removed loads table; SBC: stride-based checker; SVU: store validation unit.