Hardware acceleration of sequential loops


HAL Id: hal-00641350

https://hal.inria.fr/hal-00641350

Submitted on 15 Nov 2011



Pierre Michaud

To cite this version:

Pierre Michaud. Hardware acceleration of sequential loops. [Research Report] RR-7802, INRIA. 2011. ⟨hal-00641350⟩


ISSN 0249-6399    ISRN INRIA/RR--7802--FR+ENG

RESEARCH

REPORT

N° 7802

November 2011

Project-Teams ALF


RESEARCH CENTRE

RENNES – BRETAGNE ATLANTIQUE

Campus universitaire de Beaulieu

35042 Rennes Cedex

Pierre Michaud

Project-Teams ALF

Research Report n° 7802, November 2011, 19 pages

Abstract: The current trend in general-purpose microprocessors is to take advantage of Moore's law to increase the number of cores on the same chip. In a few technology generations, this will lead to chips with hundreds of superscalar cores. Obtaining high performance on these so-called many-cores will require parallelizing the applications. Nevertheless, it is unlikely that all the applications will take full advantage of the high number of cores. Hence it is important, along with increasing the number of cores, to increase sequential performance and dedicate a relatively large silicon area and power budget for that purpose. In this study, we consider the possibility of increasing sequential performance with a loop accelerator. The loop accelerator sits beside a conventional superscalar core and is specialized for executing dynamic loops, i.e., periodic sequences of dynamic instructions. Loops are detected and accelerated automatically, without help from the programmer or the compiler. The execution is migrated from the superscalar core to the loop accelerator when a dynamic loop is detected, and back to the superscalar core when a loop exit condition is encountered. We describe the proposed loop accelerator and we study its performance on the SPEC CPU2006 applications. We show that significant performance gains may be achieved on some applications.




1 Introduction

During the last decade, single-thread performance has increased at a slower pace than during previous decades, despite transistor miniaturization continuing as dictated by Moore's law. For several reasons, processor makers have preferred to use the silicon area offered by miniaturization for implementing multi-cores. General-purpose multi-cores have actually benefited the Internet by increasing the throughput of server farms. However, the client side of the Internet has not benefited as much from multi-cores as the server side. While multi-cores have been definitely useful for increasing throughput, their potential for decreasing latency is still largely underexploited. Most existing code is sequential, and this is likely to remain so for some time. Writing parallel applications with portable performance speedups is very difficult, even for the elite of programmers who understand performance and know parallel programming. The gap between potential and actual performance will get larger as the number of on-chip cores increases, because of limited off-chip memory bandwidth and because of Amdahl's law. As the number of on-chip cores keeps increasing with years, we will progressively enter the many-core era and eventually reach a point where implementing a heterogeneous many-core with one big and fast core and several smaller cores will provide a significant performance advantage. For instance, let us consider a homogeneous many-core with 200 identical normal cores on one hand, and on the other hand a heterogeneous many-core with only 100 normal cores and one monster core taking the silicon area and consuming the power equivalent to 100 normal cores. Even if the monster core is only twice as fast as a normal core despite being 100 times bigger, a simple application of Amdahl's law shows that the heterogeneous many-core will outperform the homogeneous many-core on programs whose sequential fraction exceeds 1% of all the instructions executed¹. This extreme example illustrates a situation that has been analyzed by Hill and Marty [6], one of their conclusions being that researchers should investigate methods of speeding sequential performance even if they appear locally inefficient.
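The 1% break-even point in the example above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: perfect parallelism, a sequential phase that runs on a single core, and a monster core that does not take part in the parallel phase. The function names are hypothetical and not taken from the report.

```python
def exec_time(seq_fraction, seq_speed, parallel_cores):
    """Normalized execution time under Amdahl's law (perfect parallelism)."""
    return seq_fraction / seq_speed + (1.0 - seq_fraction) / parallel_cores


def homogeneous(s):
    # 200 identical normal cores; sequential code runs on one normal core.
    return exec_time(s, seq_speed=1.0, parallel_cores=200)


def heterogeneous(s):
    # 100 normal cores plus one monster core twice as fast as a normal core.
    # Assumption: the monster core is idle during the parallel phase.
    return exec_time(s, seq_speed=2.0, parallel_cores=100)


# Solving s/2 + (1-s)/100 = s + (1-s)/200 for s gives s = 1/101 (about 1%).
BREAK_EVEN = 1.0 / 101.0

if __name__ == "__main__":
    for s in (0.005, 0.01, 0.02, 0.05):
        print(f"s={s:.3f}  homogeneous={homogeneous(s):.5f}  "
              f"heterogeneous={heterogeneous(s):.5f}")
```

Under these assumptions the two configurations tie at a sequential fraction of 1/101, and the heterogeneous one is faster for any larger fraction.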

What this means is that researchers must find solutions for accelerating sequential execution for future many-cores even if these solutions look absurd for today's multi-cores. However, increasing sequential performance is not obvious, even with the silicon area and power budget of a monster core. Increasing the IPC (number of instructions executed per cycle) without impacting the clock cycle is not straightforward. Perhaps it will be possible to implement wider superscalar cores, e.g., 8-wide cores like the canceled Alpha EV8 processor. But to the best of our knowledge it has not been proved that the superscalar width could be increased beyond 8 without actually losing some performance. The problem comes from structures that do not scale well with the issue width and the instruction window size, like register renaming, dynamic instruction scheduling, register ports, operand bypass network, and memory disambiguation mechanisms. Many propositions for solving these problems have been published in the last 20 years, and some of them are probably worth revisiting in the context of many-cores. Nevertheless, it is also important to explore new approaches.

We propose in this study to explore a new approach to sequential performance: hardware acceleration of loops. Many programs spend a significant part of the execution in dynamic loops, i.e., periodic sequences of dynamic instructions. We explore in this study the possibility of implementing an accelerator that is specialized for dynamic loops and does not require any help from the programmer or the compiler.

Our goal is not to explore the design space of loop accelerators, which we believe is huge. Instead, we tried to imagine an accelerator microarchitecture exploiting loop properties and avoiding as much as possible the usual superscalar bottlenecks. We have simulated the proposed loop accelerator and we show that it can potentially deliver a high sustained IPC. Actually, we believe that the microarchitecture we propose is scalable and can be pipelined at a clock frequency at least as high as the superscalar core frequency.

This work makes the following contributions: (1) We propose a hardware mechanism for detecting dynamic loops; (2) We provide a characterization of the dynamic loop behavior of SPEC CPU2006 benchmarks and we show that a significant fraction of the dynamic loops have a large loop body consisting of several tens of instructions; (3) We propose a loop accelerator that can accelerate most dynamic loops with a small or large body and that can issue simultaneously up to 32 µops out of program order from a window of several thousands of µops, with a local hardware complexity not greater than that of a conventional superscalar core; (4) We propose solutions for memory disambiguation in a window of several thousands of µops, exploiting dynamic loop properties; (5) We show that a significant fraction of dynamic µops are actually redundant and can be removed from the dynamic loops.

¹ This is assuming perfect parallelism. Limited off-chip memory bandwidth, various overheads and imperfect load balancing […]

clock frequency: 3 GHz
decode/rename: 1 inst (any) + 3 "simple" insts (1 or 2 µops)
reorder buffer: 64 µops
load queue: 32 loads
store queue: 16 stores
dispatch: 6 µops, dependency-based steering
schedulers: 4 × 8-µop int, 2 × 8-µop FP/SSE, 2 × 16-µop loads
execution: 4 int (1 mul or 1 div), 2 FP/SSE (1 div), 2 loads/stores
retirement: 6 µops
post-retirement: 16-store post-retirement queue, 2 stores/cycle
branch predictor: 64-Kbit TAGE, 18-Kbit ITTAGE [12]
branch misprediction: recovery at execution, 12 cycles minimum penalty
cache line: 64 B
IL1 cache: 32 KB, 8-way assoc., 1 line/cycle bandwidth, up to 6 pipelined misses
DL1 cache: 32 KB, 8-way assoc., write back, 2 cycles latency, 8 banks, 8-byte bank width, virtually indexed, PC-based stride prefetch
DL1 misses: 16 MSHRs
L2 cache: 512 KB, 8-way assoc., DIP policy [10], write back, 9 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
L3 cache: 8 MB, 16-way assoc., DIP policy [10], write back, 18 cycles latency, 1 line/cycle bandwidth, stream prefetch [16]
memory + bus: 210 cycles latency, 16 bytes/cycle bandwidth
ITLB: 64 entries, 4-way
DTLB: 64 entries, 4-way
TLB2: 512 entries, 4-way, 4 cycles latency
page size: 4 MB
load/store dependencies: store sets [2] + single-entry forwarding from store queue [8]

Table 1: Baseline superscalar core.

This paper is organized as follows. Section 2 discusses prior work. We describe our simulation set-up in Section 3. Section 4 describes our loop detector and provides statistics about dynamic loops in the SPEC CPU2006 applications. Section 5 gives a detailed description of the proposed loop accelerator (LA) and provides a performance evaluation. Finally, Section 6 concludes this study and gives some directions for future work. This paper uses many definitions and acronyms; they are listed in Table 5 at the end of the paper.

2 Related work

Kobayashi found that many programs spend a significant fraction of the execution in dynamic loops, especially scientific programs [7]. Our definition of dynamic loops is close to Kobayashi's. Tubella and González described a hardware mechanism for the automatic detection of dynamic loops [17]. Their definition of dynamic loops is different from ours and targeted toward speculative multithreading. Some recent processors have a loop buffer to decrease the energy consumption by avoiding re-fetching and re-decoding the same loop body again and again. For instance, the Intel Nehalem has a Loop Stream Detector that can hold up to 28 µops [4]. García et al. describe a loop buffer that implements a register renaming mechanism for loops, more efficient than conventional register renaming hardware [5]. Stitt et al. proposed a LA for a system-on-chip, implemented with some configurable logic, able to detect and accelerate loops transparently [14]. The time for analyzing the loop and configuring the LA appears to be very long (tens of millions of clock cycles for relatively simple loops [14]). It is not clear whether this approach could be used in general-purpose systems. Clark et al. described an approach where loops are identified and modulo-scheduled dynamically (hence preserving binary compatibility) onto a LA decoupling memory accesses from computations [3]. Their LA has a configurable compute accelerator which can execute up to 15 integer operations in only 2 clock cycles. Vajapeyam et al. proposed a dynamic vectorization (DV) scheme [18] based on trace processors [19, 11]. This DV scheme has a few similarities with our loop accelerator (e.g., use of FIFO queues), but is overall very different. The DV scheme assumes 64 processing elements (PEs), different instances of a loop iteration being processed by different PEs. Neither memory disambiguation nor problems concerning communication bandwidth and latency between PEs are addressed in [18].

3 Simulation set-up

Our baseline configuration simulates a modern x86 superscalar core. Our simulator is trace driven (we do not simulate wrong-path effects). We used Pin 2.8 [9] to generate a trace for each SPEC CPU2006 benchmark. We compiled benchmarks for the x86-64 architecture with gcc 4.4.3 "-O2". Each trace consists of 40 samples of 50 million consecutive instructions, for a total of 2 billion dynamic instructions (and the associated memory […] parameters of the baseline microarchitecture are listed in Table 1. Instructions are split into micro-ops (µops for short) at decode. We generate separate ADDR µops for memory accesses [8]. Load and store µops get the address from the associated ADDR µop, which is executed by any ALU. At most 8 µops are generated per instruction. A µop has at most 2 source registers and 1 destination register, not counting the flags. There are 4 integer schedulers, 2 floating-point schedulers and 2 load schedulers. Each scheduler can select one ready µop per cycle in its issue buffer. The µop is removed from the buffer when issued. We assume a rescheduling mechanism, but we did not simulate rescheduling penalties. We simulate hardware prefetchers for the DL1, L2 and L3 caches. The DL1 prefetcher is a PC-based stride prefetcher. The L2 and L3 prefetchers are stream prefetchers that issue prefetch requests based on the sequence of misses and hits on prefetched blocks [16]. The prefetch distance is adjusted dynamically with a feedback mechanism. We assume 4 MB memory pages. Using large pages decreases the number of TLB misses and makes prefetching in the L2 and L3 caches more effective, as stream prefetchers are limited by page boundaries [16]. TLB misses are hardware-managed, as in x86 processors [1].

[Figure 1: bar chart. Y-axis: IPC (0.0 to 2.5). X-axis: SPEC CPU2006 benchmarks (400 to 483). Series: Nehalem, simulator.]

Figure 1: IPC of the simulated core vs. IPC measured on an Intel Nehalem for the SPEC CPU2006 benchmarks.

The baseline IPC (instructions per cycle) numbers for the SPEC CPU2006 are reported in Figure 1. We also report the IPC measured on an Intel Nehalem processor when executing each benchmark to completion. The simulated microarchitecture is not identical to a Nehalem. For instance, we do not simulate µop fusion and macro-op fusion. Our goal was to simulate a realistic superscalar core, not to match the Nehalem perfectly.

4 The loop detector

In this study, the term loop is used in the sense of dynamic loop. A dynamic loop is a periodic sequence of dynamic instructions [7]. Instructions with different program counter addresses (PC, for short) are considered distinct, although they may perform the same action. The loop length is the number of dynamic instructions in the sequence. The body size $B$, in instructions, is the length of the smallest period. The loop body (or body, for short) consists of the first $B$ instructions in the sequence, i.e., iteration number 0. The next $B$ instructions belong to iteration number 1, and so on. Instructions in the body are not necessarily distinct. For example, the body itself may include a tiny unrolled loop, or a function executed multiple times. The loop detector is a hardware mechanism whose goal is to find loops. Because there will be a transition penalty for entering and leaving the loop acceleration mode, we try to detect loops that are long enough. The loop detector consists of two main parts: the Loop Monitor and the Loop Table.
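The definitions of loop length and body size can be made concrete with a small software sketch. The function below is a brute-force illustration of "smallest period" on a recorded PC sequence; it is not the hardware mechanism described next, and the name `body_size` and the requirement of at least two full iterations are assumptions made for the example.

```python
def body_size(pcs):
    """Return the smallest period B of a sequence of PCs.

    A candidate B is accepted when every instruction equals the one
    B positions later. Returns None if no period B <= len(pcs) // 2
    exists, i.e., if fewer than two full iterations are visible.
    """
    n = len(pcs)
    for b in range(1, n // 2 + 1):
        if all(pcs[i] == pcs[i + b] for i in range(n - b)):
            return b
    return None


if __name__ == "__main__":
    # The body may contain the same PC several times (0x14 appears twice).
    body = [0x10, 0x14, 0x18, 0x14, 0x1C]
    sequence = body * 4                   # 4 iterations
    b = body_size(sequence)
    print("loop length:", len(sequence))  # number of dynamic instructions
    print("body size B:", b)
    print("iterations :", len(sequence) // b)
```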

The loop monitor. The Loop Monitor (LM) is the main mechanism for finding loops. It includes a tiny LM table (e.g., 4 entries). Each LM entry contains an end-of-loop PC (EOL PC) which is the search key, a valid bit, a body size, a body signature, a previous signature, an instruction count and a first-iteration bit. The LM table is fully-associative. For each instruction retiring from the reorder buffer, the body size of each valid LM entry is incremented and the corresponding body signature is updated. The signature is […] during the program execution. The longer the signature, the less likely that two distinct loop bodies have the same signature. In our simulations, we use 32-bit signatures, and we update the signature by rotating the signature 1 bit to the left and applying an XOR with the new instruction PC. When the body size of any valid LM entry exceeds a fixed maximum body size (MBS), we close that entry by resetting its valid bit. If the PC of a retiring instruction hits in the LM table, we look at the information stored in the matching entry. If the first-iteration bit is set or if the signature equals the previous signature, we add the body size to the instruction count, otherwise we reset the instruction count. When a branch instruction jumps backward, it is considered a potential end-of-loop. If no valid entry exists for that branch PC, we create a new entry (e.g., by reusing a closed entry): the EOL PC is set to the branch PC, the valid bit and first-iteration bit are set, and all the other fields are reset. If an entry already exists for the backward jump, we copy the body signature into the previous signature and we reset the first-iteration bit, the body signature and the body size. When the instruction count in one of the LM entries exceeds a predefined threshold MinLM, we close all the other entries, and we enter the loop building state. The only valid LM entry remaining is called the active LM entry. Loop building corresponds to a fixed number of loop iterations during which we extract the information that will be needed to execute the loop on the LA. The number of iterations spent in the loop building state is denoted LBI (loop build iterations). When loop building is done, we migrate the execution to the LA. After the loop exit, we update the instruction count in the active LM entry and we close the entry.

[Figure 2: bar chart. Y-axis: percentage of execution spent in loops (0% to 100%), two bars per benchmark (instr, time). X-axis: SPEC CPU2006 benchmarks (400 to 483). Series: MBS=16, MBS=32, MBS=64, MBS=128, MBS=256.]

Figure 2: Percentage of execution (instructions and time) spent in loops (MinLM=900 instructions, LBI=5)
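The Loop Monitor bookkeeping can be sketched in software as follows. This is a minimal behavioural model, not the author's simulator: the class and method names are hypothetical, the order in which the hit check and the backward-branch update are applied is an assumption, and a new entry is simply dropped when the table is full of valid entries (the text does not say what happens in that case). Loop building and migration to the LA are not modelled.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


def update_signature(sig, pc):
    """Rotate the 32-bit signature left by 1 bit, then XOR it with the PC."""
    rotated = ((sig << 1) | (sig >> 31)) & MASK32
    return (rotated ^ pc) & MASK32


@dataclass
class LMEntry:
    eol_pc: int                  # end-of-loop PC (search key)
    valid: bool = True
    first_iteration: bool = True
    body_size: int = 0
    signature: int = 0
    prev_signature: int = 0
    instr_count: int = 0


class LoopMonitor:
    def __init__(self, mbs=128, min_lm=900, num_entries=4):
        self.mbs, self.min_lm, self.num_entries = mbs, min_lm, num_entries
        self.entries = []

    def _find(self, pc):
        return next((e for e in self.entries if e.valid and e.eol_pc == pc), None)

    def _allocate(self, pc):
        self.entries = [e for e in self.entries if e.valid]  # reuse closed entries
        if len(self.entries) < self.num_entries:             # assumption: else drop
            self.entries.append(LMEntry(eol_pc=pc))

    def retire(self, pc, backward_branch=False):
        """Process one retired instruction.

        Returns the EOL PC when the instruction count exceeds MinLM, else None.
        """
        for e in self.entries:
            if e.valid:
                e.body_size += 1
                e.signature = update_signature(e.signature, pc)
                if e.body_size > self.mbs:
                    e.valid = False                  # body too large: close entry
        hit = self._find(pc)
        if hit is not None:
            if hit.first_iteration or hit.signature == hit.prev_signature:
                hit.instr_count += hit.body_size
            else:
                hit.instr_count = 0
        if backward_branch:
            if hit is None:
                self._allocate(pc)
            else:
                hit.prev_signature = hit.signature
                hit.first_iteration = False
                hit.signature = 0
                hit.body_size = 0
        if hit is not None and hit.instr_count > self.min_lm:
            self.entries = [hit]                     # close all the other entries
            return hit.eol_pc
        return None


def first_detection(monitor, trace):
    """Number of instructions retired when a loop is first detected, or None."""
    for count, (pc, backward) in enumerate(trace, start=1):
        if monitor.retire(pc, backward) is not None:
            return count
    return None
```

In this model, a 10-instruction body ending with a backward branch is reported after 920 retired instructions with MinLM=900: the first iteration only allocates the entry, and each later iteration adds 10 to the instruction count.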

The loop table. It takes MinLM dynamic instructions before an instruction sequence is considered a loop by the LM. On the one hand, if MinLM is fixed to a low value, short loops will be detected. But the execution time saved on the LA may not be worth the transition penalty. On the other hand, if MinLM is fixed to a high value, the MinLM instructions that could have been executed on the LA are instead executed by the superscalar core, which is a probable waste of performance. So MinLM is typically set to a medium value, e.g., 900 instructions. Yet, the next time we encounter this loop, we may assume that it will behave as the last time and enter the loop acceleration mode immediately. This is one of the functions of the loop table (LT). Each LT entry records some information about one loop, identified by its body signature. In particular, there is a confidence bit in each LT entry. When we close an LM entry, we set or reset the confidence bit depending on whether the instruction count is greater or less than MinLM. When the PC of a retired instruction matches one of the valid LM entries, we access the corresponding LT entry. If that LT entry exists and if the confidence bit is set, we close all the other LM entries and we enter loop building immediately.
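The confidence-bit policy of the loop table can be illustrated with a short sketch. It is deliberately simplified: an unbounded dictionary stands in for the hardware table, only the confidence bit is kept, the class and method names are hypothetical, and an instruction count exactly equal to MinLM is treated as "not greater" (the text leaves that case open).

```python
class LoopTable:
    """Maps a body signature to one confidence bit."""

    def __init__(self, min_lm=900):
        self.min_lm = min_lm
        self.confident = {}

    def on_lm_entry_closed(self, body_signature, instr_count):
        # Set the bit if the loop lasted more than MinLM instructions,
        # reset it otherwise.
        self.confident[body_signature] = instr_count > self.min_lm

    def can_build_immediately(self, body_signature):
        # True if an LT entry exists for this loop and its confidence bit is set.
        return self.confident.get(body_signature, False)


if __name__ == "__main__":
    lt = LoopTable()
    print(lt.can_build_immediately(0x1234))   # unknown loop
    lt.on_lm_entry_closed(0x1234, 5000)       # the loop ran long enough
    print(lt.can_build_immediately(0x1234))
```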

4.1 Fraction of execution spent in loops

Figure 2 shows the fraction of execution spent in loops for different values of MBS. Two percentages are given for each benchmark, a percentage of executed instructions and a percentage of execution time. MinLM is set to 900 instructions and LBI=5. Several benchmarks spend a significant fraction of the execution in […] in succession, about one third of all the instructions executed belong to loops (as we have defined them) with a maximum body size of 256 instructions. In fact, many loops have a body not exceeding a few tens of instructions. Yet, several benchmarks spend a significant part of the execution in loops whose body size exceeds 128 instructions, in particular 433.milc, 444.namd, 459.GemsFDTD, 465.tonto and 483.xalancbmk. We define a loopy benchmark as a benchmark with more than 20% of dynamic loop instructions with MBS=256. About half of the SPEC CPU2006 benchmarks are loopy according to this definition. Loopy benchmarks are listed in Table 2, along with the average loop length according to our loop detector. In the remainder of this study, we assume MBS=128 instructions.

benchmark         MBS=128    MBS=256
401.bzip2            5231       4838
403.gcc              8315       8312
433.milc          2190425    6415123
434.zeusmp           4257       4359
437.leslie3d         2218       2223
444.namd             3749       6320
447.dealII           2634       2627
450.soplex           3216       3261
456.hmmer            5446       5448
459.GemsFDTD         4249       4953
462.libquantum      15362      18216
465.tonto            4161       4027
481.wrf              2643       2645
483.xalancbmk        4239       5475

Table 2: Average loop length for loopy benchmarks (MinLM=900 instructions, LBI=5)

[Figure 3: block diagram showing the superscalar core, the IL1, DL1, L2 and L3 caches, the architectural registers, the loop detector, the loop builder (fed with retired µops) and the loop accelerator.]

Figure 3: The superscalar core and the sequential loop accelerator

5 The loop accelerator

This study considers the possibility of accelerating the execution of dynamic loops, without help from the programmer and preserving binary compatibility. The goal of this study is not to explore the design space of loop accelerators (LAs), which we believe is huge. Instead, we have tried to imagine a new microarchitecture avoiding conventional superscalar techniques that do not scale well. We tried to exploit loop characteristics as much as possible in order to make the LA scalable. We have also tried to make the LA as general as possible so that it can accelerate loops with a small or large body, as it was shown in the previous section that both cases are important.

The proposed microarchitecture is depicted in Figure 3. We consider here only the "sequential" part of a heterogeneous many-core. During loop building, the loop builder takes as input the µops retired from the superscalar core reorder buffer and prepares the loop body for execution on the LA. When loop building is done, the superscalar core instruction window is cleared and the execution is migrated to the LA. When a loop exit condition occurs, like a branch taking a different direction, the architectural registers are updated and the execution is migrated back onto the superscalar core, until another dynamic loop is encountered.

The rest of this section is organized as follows. Section 5.1 introduces the cyclic register dependency graph, which is useful for characterizing a loop and understanding its execution on a LA. The proposed LA is described in Section 5.2 and Section 5.3 focuses on memory dependencies. Section 5.4 explains how the […] onto the LA for good performance. Other performance tuning we did is described in Section 5.6, and the simulated performance speedups are presented in Section 5.7.

[Figure 4: dependency graph. Nodes are listed below by rank; the edges of the graph are not reproduced.]

rank   PC       µop
0      5c8270   357:=ADDR(13)
1      5c8270   109:=LOAD(357)
2      5c8274   17,flags:=ALU(17)
3      5c8278   357:=ADDR(12)
4      5c8278   110:=LOAD(357)
5      5c827c   357:=ADDR(18)
6      5c827c   358:=LOAD(357)
7      5c827c   109:=SSE(109,358)
8      5c8280   357:=ADDR(26)
9      5c8280   358:=LOAD(357)
10     5c8280   110:=SSE(110,358)
11     5c8285   12,flags:=ALU(12)
12     5c8289   13,flags:=ALU(13)
13     5c828d   18,flags:=ALU(18)
14     5c8291   109:=SSE(109,110)
15     5c8295   357:=ADDR(19)
16     5c8295   110:=LOAD(357)
17     5c8299   110:=SSE(109,110)
18     5c829d   357:=ADDR(19)
19     5c829d   STORE(357,110)
20     5c82a1   19,flags:=ALU(19)
21     5c82a5   flags:=ALU(17,20)
22     5c82a8   BRANCH(flags)

Figure 4: Example of loop CRDG found in benchmark 481.wrf. The body consists of 15 instructions split up into 23 static µops ranked from 0 to 22. Dependencies with µops not belonging to the loop are not part of the CRDG.

5.1 Cyclic register dependency graph

To understand the performance of a LA, it is convenient to consider the register dependency graph (RDG) of a sequence of µops, where each node of the graph is a dynamic µop and directed edges represent register dependencies. The RDG of a loop has a periodic structure and it is possible to summarize it with a cyclic register dependency graph (CRDG) where each node of the graph is a static µop. Figure 4 shows a loop CRDG consisting of 23 static µops ranked from 0 to 22. The rank of a static µop corresponds to the sequential order. For example, the loop represented in Figure 4 consists of the dynamic instances of µops 0, 1, ..., 22, 0, 1, ..., 22, ... and so on. Dependencies in a CRDG are of two sorts: intra-iteration and inter-iteration. When the µops producing and consuming a register value belong to the same iteration, it is an intra-iteration dependency. Otherwise it is an inter-iteration dependency. For an inter-iteration dependency, if the producer belongs to iteration $k$, the consumer belongs to iteration $k+1$. In Figure 4, inter-iteration dependencies are represented by edges whose head rank is less than or equal to the tail rank (e.g., both edges originating from µop 11). The longest chain in a RDG defines a minimum execution time. For a loop, the longest RDG chain can be found by considering cycles in the CRDG. Most loops have 1-µop cycles, like the loop shown in Figure 4 (µops 2, 11, 12, 13, 20 depend on themselves). However, a few loops have cycles consisting of several µops. The longest chain in a loop RDG corresponds to the greatest CRDG cycle, and the chain length is equal to the cycle size times the number of loop iterations. The actual execution time, though, may be longer than this minimum time if resources for executing the µops are limited. The impact of resource constraints on a loop performance can be taken into account by adding constraint edges to the CRDG. For instance, in order to simplify the hardware and permit a high clock frequency, we force the dynamic instances of the same static µop to execute in program order. This corresponds to adding to each µop in the CRDG a constraint edge to itself. Doing so does not increase the loop execution time. On the other hand, if we add a constraint edge between two distinct µops of the CRDG, we may create an artificial cycle consisting of several µops, thereby increasing the loop execution time.
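The intra-iteration and inter-iteration edges of a CRDG can be derived mechanically from the register names of the static µops. The sketch below does this for the 23 µops of Figure 4, transcribed as (destination registers, source registers) pairs. It is an illustration under assumptions: the edges are inferred purely from register names, register numbers are treated as opaque labels, and the function name `crdg_edges` is hypothetical.

```python
# (destinations, sources) for each static µop of Figure 4, indexed by rank.
BODY = [
    (("357",), ("13",)),          # 0  ADDR
    (("109",), ("357",)),         # 1  LOAD
    (("17", "flags"), ("17",)),   # 2  ALU
    (("357",), ("12",)),          # 3  ADDR
    (("110",), ("357",)),         # 4  LOAD
    (("357",), ("18",)),          # 5  ADDR
    (("358",), ("357",)),         # 6  LOAD
    (("109",), ("109", "358")),   # 7  SSE
    (("357",), ("26",)),          # 8  ADDR
    (("358",), ("357",)),         # 9  LOAD
    (("110",), ("110", "358")),   # 10 SSE
    (("12", "flags"), ("12",)),   # 11 ALU
    (("13", "flags"), ("13",)),   # 12 ALU
    (("18", "flags"), ("18",)),   # 13 ALU
    (("109",), ("109", "110")),   # 14 SSE
    (("357",), ("19",)),          # 15 ADDR
    (("110",), ("357",)),         # 16 LOAD
    (("110",), ("109", "110")),   # 17 SSE
    (("357",), ("19",)),          # 18 ADDR
    ((), ("357", "110")),         # 19 STORE
    (("19", "flags"), ("19",)),   # 20 ALU
    (("flags",), ("17", "20")),   # 21 ALU
    ((), ("flags",)),             # 22 BRANCH
]


def crdg_edges(body):
    """Return a set of (producer_rank, consumer_rank, kind) edges."""
    last_writer = {}                  # last µop of the body writing each register
    for rank, (dests, _) in enumerate(body):
        for reg in dests:
            last_writer[reg] = rank
    edges = set()
    current = {}                      # most recent writer so far in this iteration
    for rank, (dests, srcs) in enumerate(body):
        for reg in srcs:
            if reg in current:
                edges.add((current[reg], rank, "intra"))
            elif reg in last_writer:
                # Value produced by the previous iteration.
                edges.add((last_writer[reg], rank, "inter"))
            # Otherwise the producer is outside the loop: not part of the CRDG.
        for reg in dests:
            current[reg] = rank
    return edges


if __name__ == "__main__":
    edges = crdg_edges(BODY)
    print("1-µop cycles:", sorted(p for p, c, _ in edges if p == c))
    print("edges from µop 11:", sorted((p, c) for p, c, _ in edges if p == 11))
```

With this derivation, the µops that depend on themselves are 2, 11, 12, 13 and 20, and µop 11 has two outgoing edges (to µop 3 and to itself), both inter-iteration, which agrees with the description of Figure 4.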

5.2 A ring-shaped loop accelerator

Figure 5 depicts the global architecture of the proposed LA. The LA consists of 8 execution nodes (node for short) connected by a 4-lane pipelined bus. There are different node types, each node type being specialized for certain µops. The I-node executes µops having a 1-cycle latency. This includes ADDR µops, MOV µops (integer and floating-point) and all the µops that would normally execute on the superscalar core ALUs. The F-node can execute the most frequent floating-point operations, in particular additions and multiplications. It executes integer multiplications too. The L-node executes load µops and the S-node executes store µops. The M-node is a mutable node whose function is defined at loop building. It is actually an I-node augmented with floating-point dividers. If the loop body contains some floating-point division, the M-node executes only divisions. Otherwise it behaves like a normal I-node. It is not necessary for the LA to be able to execute the whole instruction set. For instance, we noticed that integer divisions are rare in the loops found by our loop detector in the SPEC CPU2006. Therefore, loops featuring integer divisions are executed by the superscalar core.

[Figure 5: ring of execution nodes (I-node, L-node, F-node, S-node, M-node) connected by bus lanes, with a mask and latches on each lane segment.]

Figure 5: Ring-shaped loop accelerator: 8 execution nodes connected by a 4-lane pipelined bus.

The pipelined bus. The pipelined bus consists of 4 lanes divided into 8 segments. The bus transports packets. A packet consists of a 128-bit data², a µop rank, and an 8-bit node vector (one bit per node). The µop rank tells which µop in the body is the producer of the data. The node vector tells which nodes need the data. The node vector is modified by a mask as it passes through segments. When the packet reaches the last node needing the data, the node vector becomes null and the packet is considered empty. Now it is possible for that node (or a subsequent node) to overwrite the empty packet. In the worst case, a data does a complete turn on the lane until it returns to the node from which it was emitted. There are 2 latches per lane segment (i.e., 16 latches per lane). When the bus clock is low, the first latch is transparent and the second latch is opaque. When the bus clock is high, it is the other way around. Because of lane segment RC delays, we assume that the bus is clocked at half the node clock frequency. To compensate for this loss of throughput, each lane is doubled so that it transports two packets instead of one, as shown in Figure 5.
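The life of a packet on one lane can be illustrated with a small model. It is a sketch under assumptions: the mask is modelled as clearing the bit of the node the packet has just passed, nodes are numbered 0 to 7 in ring order, and clocking, latches and lane doubling are ignored. `Packet`, `pass_node` and `route` are hypothetical names.

```python
from dataclasses import dataclass

NUM_NODES = 8


@dataclass
class Packet:
    data: int          # 128-bit payload
    uop_rank: int      # rank of the producer µop in the loop body
    node_vector: int   # 8 bits; bit i is set while node i still needs the data

    @property
    def empty(self):
        return self.node_vector == 0


def pass_node(packet, node):
    """Move the packet past `node`; return True if that node reads the data."""
    wanted = bool(packet.node_vector & (1 << node))
    packet.node_vector &= ~(1 << node) & 0xFF    # mask out this node's bit
    return wanted


def route(packet, source):
    """Follow a packet emitted by `source` until it becomes empty.

    Returns (nodes that read the data, number of segments travelled).
    """
    readers, hops, node = [], 0, source
    while not packet.empty and hops < NUM_NODES:
        node = (node + 1) % NUM_NODES
        hops += 1
        if pass_node(packet, node):
            readers.append(node)
    return readers, hops


if __name__ == "__main__":
    p = Packet(data=0xABCD, uop_rank=7, node_vector=(1 << 2) | (1 << 5))
    print(route(p, source=0), p.empty)
```

In this model a packet emitted by node 0 for nodes 2 and 5 becomes empty after 5 segments, while a packet whose only consumer is the emitting node itself travels all 8 segments, which corresponds to the worst case of a complete turn.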

The exe ution node. Figure 6showsthe global stru ture of anI-node. Other node typesshare many

hara teristi s with the I-node. Ea h node ontains 4exe ution units (EU). Ea h EU anexe ute 1

µ

op per y le, i.e., the maximum node throughput is 4

µ

ops per y le. The loopbuilder maps stati

µ

ops to EUs. Allthedynami instan esofastati

µ

opareexe utedbythesameEU.The

µ

opsexe utedby the sameEU exe utein programorder with respe t to ea h other,but they anexe uteout-of-order

with respe t to

µ

ops onother EUs. On e a

µ

opis exe uted, itsoutput data is sentto the OutputQueue (OQ),withthe orresponding

µ

oprankandnodeve tor. A opyofthedata isalsopushedinto theLoop Exit Queue (LEQ). The LEQ is used at loop exit time for updating the ar hite tural registers with the

orre tvaluesbeforeresuming theexe utiononthesupers alar ore. Data in theOQis sentontothe bus

and removed from the OQ as soon as possible, i.e., when there is an empty pa ket at the orresponding

lanesegment. Theexe ution of

µ

ops onagivenEU is suspended ifthe asso iated OQ is onthe vergeof be oming full. Atthe node input, pa kets areread from the lanes. If the orresponding bit is set in the

nodeve tor,thedataand

µ

oprankareinsertedintotheInputQueue(IQ).WeassumethattheIQ never gets full (we will see later howweenfor e this). Consequently, thedata produ ed bya stati

µ

op enter the IQs where they are needed in sequential order. Inea h y le, onedata is dequeued from ea h IQ by

the distributor. The distributor sends these data to the External ValueQueues (EVQ). Ea h EU hasa

dedi ated set ofEVQs. Adata from anIQ may beneededby severalEUs in the samenode. In this ase

2

Weassume128-bitdata,asthisisthedatasizerequiredbyIntelSSEoperations. Wedidnottrytooptimizethis. Future

workmaytrytoredu ethedatasizeto64-bit,exploitingthefa tthatmostoperationsarea tuallys alar. 3

(13)

IQ

C

E

C

EU

OQ

C

E

C

EU

OQ

C

E

C

EU

OQ

C

E

C

EU

OQ

lane 0

lane 1

lane 2

lane 3

lane 0

lane 1

lane 2

lane 3

EVQs

DISTRIBUTOR

LEQ

Figure6: Exe utionnode(here,anI-node). Ea hexe utionnode ontains4exe utionunits.

the distributor duplicates the data (up to 4 copies, one per EU). An EVQ is associated with one particular static µop, i.e., all the data going through a given EVQ come from the same static µop. If a data cannot be distributed because one of the target EVQs is full, the data remains in the IQ until all the target EVQs can accept it. Non-constant data feeding the EU come either from an EVQ head or from the LEQ "top". The LEQ top consists of the $N_{max}$ data output most recently by the EU, where $N_{max}$ is the maximum number of static µops that can be mapped on the same EU (the LEQ is physically implemented as two distinct parts). The EVQ head data is dequeued only after the value is no longer needed by any µop on the EU. The execution of µops on an EU is controlled by a Cyclic Execution Controller (CEC). The CEC loops through the $N_{eu}$ static µops mapped onto the EU, repeatedly and in program order. In particular, the CEC controls the various multiplexors for selecting the EU inputs and dequeues data from EVQs. The CEC also holds constant values, i.e., values that are used as input by some µops and that are guaranteed to be constant throughout the loop execution (e.g., register values not modified by the loop). The CEC generates a sequence number for each dynamic µop: the sequence number for the $n$th dynamic instance of a static µop of rank $k$ is $B_{max} \times n + k$, where $B_{max} = 8 \times MBS$ is the maximum number of µops in a loop body⁴ and $n$ corresponds to the iteration number. Each EU executes µops in program order, which means that the sequence number increases monotonically. For µops with inter-iteration input dependencies, input data for the first iteration are obtained at loop building. It should be noted that there exists a datapath from each EU to every other EU in the LA, thanks to the distributors. We have chosen not to have any direct datapath between EUs in the same node to simplify the hardware. Hence if a µop takes as input a data produced by another µop in the same node but on a different EU, this data must travel around the lane to reach the consumer µop. If a short communication latency between 2 µops is needed for performance, we must try to map these 2 µops on the same EU, so that the consumer µops can catch the data from the LEQ top. If we find during loop building that the loop cannot be executed by the LA, the loop keeps executing on the superscalar core. This happens for instance if the loop contains a µop that the LA cannot execute, like an integer division. This happens also if $N_{max}$ is too small for that loop, or if there are too few EVQs. The F-node and M-node have a global structure similar to the I-node. The L-node and S-node are somewhat different.
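The sequence numbering generated by a CEC can be illustrated with a short Python sketch. This is only a software model under stated assumptions: the function names are hypothetical, iterations are assumed to be counted from 0, and MBS = 128 is taken from Table 3.

```python
MBS = 128                        # maximum body size in instructions (Table 3)
B_MAX = 8 * MBS                  # maximum number of µops in a loop body (a power of 2)
SHIFT = B_MAX.bit_length() - 1   # log2(B_MAX)


def sequence_number(n: int, k: int) -> int:
    """Sequence number of the n-th dynamic instance of the static µop of rank k."""
    assert 0 <= k < B_MAX
    # Because B_MAX is a power of 2 and k < B_MAX, a shift and an OR
    # give the same result as B_MAX * n + k.
    return (n << SHIFT) | k


def cec_sequence(ranks, iterations):
    """Yield sequence numbers in the order a CEC steps through its static µops
    (assumption: ranks are visited in increasing order, i.e., program order)."""
    for n in range(iterations):
        for k in sorted(ranks):
            yield sequence_number(n, k)
```

Since each CEC visits its µops in program order and then moves to the next iteration, the generated numbers increase monotonically, as stated above.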

⁴ $B_{max}$ is a power of 2, so generating the sequence number is simple.

The S-node. The S-node receives store addresses and store data from other nodes; it does not write on the lanes (no OQ, no LEQ). There is one Loop Store Queue (LSQ) associated with each of the 4 EUs. EUs do not really execute the store µops, they just gather the address and data for each dynamic store. The address and data are then sent into the LSQ, along with the store sequence number. The S-node contains a store validation unit (SVU). The SVU sends stores into the post-retirement store queue, which the LA shares with the superscalar core and from which stores can write into the DL1 cache. The SVU has a table containing the ranks of all the static store µops in the loop body, in program order. The SVU logic loops through this table repeatedly, generating the sequence number corresponding to the next dynamic store, in program order. This SVU sequence number is searched among the 4 LSQ heads. The store is then validated: it is dequeued from its LSQ, sent to the post-retirement store queue, and the SVU sequence number is updated. In our simulations, we have assumed that the S-node can validate 2 stores per cycle.
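The validation loop of the SVU can be sketched as follows. This is a simplified Python model, not the hardware: the class and method names are hypothetical, stores are represented as (sequence number, address, data) tuples, a plain list stands in for the post-retirement store queue, and the freeze at the MSN is left out.

```python
from collections import deque

B_MAX = 1024  # 8 x MBS with MBS = 128 (Table 3)


class SVU:
    def __init__(self, store_ranks, num_lsq=4):
        self.ranks = sorted(store_ranks)   # ranks of the static stores, program order
        self.idx = 0                       # position in the rank table
        self.iteration = 0                 # current loop iteration (assumed 0-based)
        self.lsqs = [deque() for _ in range(num_lsq)]
        self.retired = []                  # stands in for the post-retirement store queue

    def next_seq(self):
        """Sequence number of the next dynamic store in program order."""
        return B_MAX * self.iteration + self.ranks[self.idx]

    def validate_one(self):
        """Search the LSQ heads for the expected store; validate it if present."""
        seq = self.next_seq()
        for q in self.lsqs:
            if q and q[0][0] == seq:
                self.retired.append(q.popleft())
                self.idx += 1
                if self.idx == len(self.ranks):   # wrap around the rank table
                    self.idx = 0
                    self.iteration += 1
                return True
        return False

    def cycle(self, width=2):
        """Validate up to `width` stores in one cycle; return how many were validated."""
        done = 0
        while done < width and self.validate_one():
            done += 1
        return done
```

A store that arrives early simply waits at its LSQ head until the SVU sequence number catches up with it, which is what keeps store validation in program order.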

The L-node. The L-node receives load addresses from I-nodes. It reads the DL1 cache and sends the load data through the bus to the other nodes. In the L-node, EUs are replaced with Load Issue Buffers (LIB) and OQs are replaced with Load Output Queues (LOQ). Each CEC steps through the static loads that have been mapped to it, in program order. The CEC allocates an entry in the LIB and in the LOQ for each dynamic load. The load address and LOQ entry identifier are written in the LIB entry. The LOQ entry has room for the load data. The CEC writes in the LOQ entry the µop sequence number and the 8-bit node vector for the data. In each cycle, one load is selected from each LIB and accesses the cache, i.e., the L-node can issue up to 4 loads per cycle. If the DL1 read succeeds, the load data is written into the LOQ entry, and the LIB entry is freed. Otherwise (bank conflict, DL1 miss, TLB miss), the load waits in the LIB and will be reissued later. Upon a miss, when the missing block is eventually inserted into the DL1, the associated MSHR wakes up the LIB entries that were waiting for that block, and the corresponding loads become ready for reissue. The LOQ behaves like a reorder buffer for loads. In a clock cycle, the load at the head of a LOQ is dequeued if the load data is there and if the packet on the lane segment can be overwritten.

The global synchronizer. The CEC has an $N_{max}$-entry µop buffer, a pointer on that buffer, and a Local Iteration Count (LIC). The pointer points to the µop that will next access the EVQs⁶. The pointer can be incremented in each clock cycle and is reset when its value becomes equal to $N_{eu}$. Every time the pointer is reset, the LIC is incremented. The 32 CECs (8 nodes, 4 lanes) work independently from each other. They may have different LIC values at a given time. However, we need a global synchronizer for keeping the program sequential semantics, a mechanism equivalent to the reorder buffer in a superscalar core.

In particular, the synchronizer must prevent a store from leaving the SVU before all the dynamic µops preceding the store in program order have been executed, as these µops may trigger a loop exit. A natural loop exit occurs when a static branch changes its behavior, i.e., a branch µop produces a result different from the one recorded at loop building. Stores after the loop exit point must not write into the DL1. After the loop exit, the LEQs are used for updating the architectural registers. But the LEQ size is limited. Hence one of the synchronizer functions is to prevent the execution on an EU from going too far ahead, making sure that the values needed for updating the architectural registers have not been pushed out of the LEQ.

⁶ The execution is fully pipelined (except floating-point divisions in the M-node). Data dependencies may stall the execution.

The synchronizer is connected to all the nodes, receiving and sending signals from and to the nodes. The SVU in the S-node maintains a maximum sequence number (MSN). If the loop contains any store, store validation freezes when it reaches the MSN, and the SVU sends a signal to the synchronizer. Moreover, each CEC has a maximum value LICmax for its LIC. Different CECs may have different LICmax values, but with some constraints: in the S-node, all the CECs have the same LICmax = MinLICmax, where MinLICmax is fixed at loop building. On the other nodes, LICmax ≥ MinLICmax. When the LIC in a CEC reaches MinLICmax, the CEC sends a signal to the synchronizer. The CEC continues until the LIC reaches LICmax, then the CEC freezes. When the synchronizer has received a signal from each CEC and from the SVU, all CECs have a LIC greater than or equal to MinLICmax. The synchronizer then sends an unfreeze signal to all nodes. Upon receiving the unfreeze signal, each CEC subtracts MinLICmax from its LIC, which automatically unfreezes the frozen CECs (e.g., those in the S-node). Upon receiving the unfreeze signal, the SVU adds MinLICmax $\times B_{max}$ to the MSN, which unfreezes store validation. The synchronizer maintains the program sequential semantics while preserving some decoupling between CECs. Still, it may impact performance. LICmax determines the instruction window size. The higher LICmax, the larger the instruction window and the more latency tolerance. However, the number of loop iterations in the window must not exceed the LEQ size, i.e., LICmax + MinLICmax must not exceed the LEQ size divided by $N_{eu}$. Global synchronization latency may also impact performance: the time during which a CEC is frozen is a waste of performance. Hence MinLICmax must not be too small. Another constraint is that MinLICmax cannot be greater than the LSQ size divided by $N_{eu}$, as the LSQ buffers the stores until the unfreeze signal increases the MSN, which permits validating the stores waiting in the LSQ. In summary, the LEQs and LSQs must be large enough for good performance.
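The freeze/unfreeze protocol and the sizing constraints above can be summarized with a small Python model. It is a sketch under assumptions: the names `CEC`, `try_unfreeze` and `feasible` are hypothetical, the default LEQ and LSQ sizes are those of Table 3, and signal latencies are ignored.

```python
class CEC:
    """Tracks only the Local Iteration Count (LIC) of one controller."""

    def __init__(self, lic_max):
        self.lic = 0
        self.lic_max = lic_max

    def frozen(self):
        return self.lic >= self.lic_max

    def finish_iteration(self):
        """Called when the µop pointer wraps around; a frozen CEC does not advance."""
        if not self.frozen():
            self.lic += 1


def try_unfreeze(cecs, svu_signalled, min_lic_max):
    """Synchronizer step: unfreeze once every CEC and the SVU have signalled."""
    if svu_signalled and all(c.lic >= min_lic_max for c in cecs):
        for c in cecs:
            c.lic -= min_lic_max   # frozen CECs resume automatically
        return True
    return False


def feasible(lic_max, min_lic_max, n_eu, leq_size=128, lsq_size=64):
    """Check the constraints relating LICmax and MinLICmax to the LEQ and LSQ sizes."""
    return (min_lic_max <= lic_max
            and (lic_max + min_lic_max) * n_eu <= leq_size
            and min_lic_max * n_eu <= lsq_size)
```

For example, with 4 µops per EU, a 128-entry LEQ and a 64-entry LSQ, this model accepts LICmax = MinLICmax = 16 but rejects LICmax = 20 with MinLICmax = 16.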

Loop exit. When the execution of a µop triggers a loop exit, the µop sequence number, which we call the loop exit sequence number (LESN), is sent to the synchronizer. The LESN is then broadcast to all CECs. CECs whose sequence number already exceeds the LESN freeze immediately and ignore subsequent unfreeze signals. Other CECs continue to work until exceeding the LESN. If a new loop exit is detected while a previous loop exit was already pending, and if the new LESN is less than the LESN recorded in the synchronizer, it becomes the new loop exit point. We can exit the loop acceleration mode when all CEC sequence numbers exceed the LESN, all EU pipes have been drained, the SVU sequence number exceeds the LESN, and each LOQ in the L-node is either empty or has its head entry sequence number exceeding the LESN. Then, architectural registers can be updated with values found in the LEQs and the execution can resume on the superscalar core, starting from the first dynamic instruction following the loop exit point.

Full IQ and premature loop exit. We have assumed that the IQs never get full. It is actually difficult to make sure that an IQ can never get full. But it is possible to make it a rare event. If an IQ becomes full, we trigger a premature loop exit. The LESN is set to MSN-1. Then, the loop exit takes place as described previously. Nevertheless, for limiting the occurrence of premature loop exits without oversizing the IQs, we introduce a throttling mechanism. We associate a wired-OR with each lane. When the occupancy of any IQ on a lane exceeds a certain threshold (e.g., half the IQ capacity), the wired-OR is asserted and the OQs on that lane stop sending data until the wired-OR is deasserted.

Deadlocks. The LA we have described is not immune to deadlocks. Instead of trying to prevent deadlocks in all cases, it is simpler to make them as rare as possible. In particular, the EVQs must be large enough to maintain a low deadlock probability. We detect a deadlock when no unfreeze signal has been generated for 10000 cycles. When this happens, we trigger a premature loop exit. A deadlock leads to a much higher performance penalty than a full IQ.

5.3 Memory dependencies

The proposed loop accelerator can emulate a window of several thousands of instructions. But we must deal with memory dependencies. We must guarantee a correct execution without sacrificing too much performance. We describe in this section some mechanisms to deal with memory dependencies inside loops. We exploit loop properties to simplify memory dependency enforcement. First, we expect most loops to exhibit a very good temporal and spatial locality. Second, we expect constant-stride accesses to be frequent in loops. Third, we expect dependencies between a store and a load in the same loop to repeat on each iteration.

[Figure 7 diagram: three panels labeled "dependency is predicted", "bypassing" and "double bypassing".]

Figure 7: Store-to-load bypassing ($a$ and $b$ are the ranks of the store and CHECK µops after transformation)

The memory zone checker. The memory zone checker (MZC) is a table located in the L-node but accessed both by loads and stores. The purpose of the MZC is to detect memory order violations within loops. The MZC takes advantage of the spatial locality that exists in loops. Our MZC is conceptually similar to the Memory Disambiguation Table described by Stone et al. [15] except that we record only loads in the MZC. We define a zone as a memory region whose size is a power of two and which is aligned in memory. The main originality of our MZC is that the zone size is fixed at loop building (cf. Section 5.6). The zone address is obtained from the load/store address by right-shifting the address bits. Each MZC entry is tagged with a distinct zone address and contains a valid bit, a load sequence number, a load address, a load data size, a conflict bit, a load PC and a known_pc bit. When a load executes, it accesses the MZC with its zone address. Upon a MZC miss, we search a free entry and we initialize it. A free entry is an entry whose valid bit was reset because its sequence number is less than the SVU sequence number. If no free entry is found, a premature loop exit is triggered. Upon a MZC hit, the sequence number in the MZC entry is compared with the load sequence number. If the load sequence number is greater, it overwrites the entry sequence number. If the data address and data size in the entry do not match those of the load, we set the conflict bit. If the load PC does not match the PC in the entry, we reset the known_pc bit. The MZC is also accessed by stores when they are validated by the SVU. A potential memory order violation is detected if the store zone hits in the MZC, if the store sequence number is less than the entry sequence number, and if the conflict bit is set or if the store data overlaps with the address and data size recorded in the entry. The optimal zone size, i.e., the one that minimizes the number of premature loop exits, is not the same for all loops. If there is a good spatial locality in the loop, taking a large zone prevents filling the MZC. On the other hand, if loads and stores are independent but access the same memory zones, decreasing the zone size may prevent unnecessary premature loop exits. In our simulations, we have assumed that the MZC could be updated by 4 loads and read by 2 stores in the same cycle.
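The MZC update and check rules can be modeled in a few lines of Python. This is an illustrative sketch, not the simulated hardware: the class and method names are hypothetical, a dictionary stands in for the tagged table, accesses are assumed not to straddle a zone boundary, and port limits are ignored.

```python
class MZC:
    def __init__(self, zone_log2=10, capacity=64):
        self.shift = zone_log2      # log2 of the zone size, fixed at loop building
        self.capacity = capacity    # number of entries (64 in Table 3)
        self.entries = {}           # zone address -> entry fields

    def load(self, seq, addr, size, pc):
        """Record an executing load. Returns False if no free entry exists,
        which corresponds to a premature loop exit."""
        zone = addr >> self.shift
        e = self.entries.get(zone)
        if e is None:                              # MZC miss: allocate an entry
            if len(self.entries) >= self.capacity:
                return False
            self.entries[zone] = dict(seq=seq, addr=addr, size=size,
                                      conflict=False, pc=pc, known_pc=True)
            return True
        if seq > e["seq"]:                         # keep the youngest load
            e["seq"] = seq
        if (addr, size) != (e["addr"], e["size"]):
            e["conflict"] = True
        if pc != e["pc"]:
            e["known_pc"] = False
        return True

    def store_violates(self, seq, addr, size):
        """Check performed when the SVU validates a store."""
        e = self.entries.get(addr >> self.shift)
        if e is None or seq >= e["seq"]:
            return False
        overlap = addr < e["addr"] + e["size"] and e["addr"] < addr + size
        return e["conflict"] or overlap

    def release(self, svu_seq):
        """Free the entries whose sequence number is less than the SVU sequence number."""
        self.entries = {z: e for z, e in self.entries.items() if e["seq"] >= svu_seq}
```

Note that once the conflict bit is set, any older store hitting the zone is reported as a potential violation, which is why the zone size matters for the number of premature loop exits.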

The stride-based checker. The MZC is sufficient to guarantee a correct execution, but it is not sufficient for good performance. A reasonably-sized MZC may trigger a lot of premature loop exits on nonexistent memory dependencies. We introduce a stride-based checker (SBC) to assist the MZC. The SBC takes advantage of constant-stride memory accesses. The SBC is a small fully-associative table where each entry summarizes the memory accesses generated by a particular static load. Each SBC entry is tagged with the load PC and contains a valid bit, a sequence number, a first address, a last address, a stride, a data size, and an iteration count. When executing a load, we search that load PC in the SBC. If an entry exists, we update it as follows. If the iteration count is null, we initialize the entry with the load address, data size and sequence number. Else, if the iteration count is non-null, we compute $S = load\_address - last\_address$. If the iteration count equals 1, we set the stride to $S$. Otherwise, if the iteration count is greater than 1 and if the stride differs from $S$, we invalidate the entry. Finally, we set the last address to the load address, and we increment the iteration count. When validating a store, if the MZC detects a potential memory order violation and if the known_pc bit is set, we access the SBC with the load PC recorded in the MZC entry. If the entry does not exist, we create it (possibly evicting a load), reset its content, and a premature loop exit is triggered with the store as the loop exit point. If the entry exists, we use its content to confirm the possible memory order violation. We use the first address, the last address, the stride sign and the data size to find which memory region the load has accessed so far. If the store data does not overlap with that region, we are sure that there was no memory order violation for that store. Otherwise, we assume that there was one, and a premature loop exit is triggered. When validating a store, we "remove" the oldest dynamic load of every valid SBC entry whose iteration count is non-null and whose sequence number is less than the store sequence number: we decrement the iteration count and, if the iteration count is non-null, we add $B_{max}$ to the sequence number and we add the recorded stride to the first address.
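The life cycle of a single SBC entry can be sketched in Python as follows. The class and method names are hypothetical, and two points are assumptions rather than statements from the text: an invalidated entry ignores further loads, and an invalidated entry is treated conservatively as confirming the violation.

```python
from dataclasses import dataclass

B_MAX = 1024  # 8 x MBS with MBS = 128 (Table 3)


@dataclass
class SBCEntry:
    valid: bool = True
    seq: int = 0      # sequence number of the oldest recorded load
    first: int = 0    # first address
    last: int = 0     # last address
    stride: int = 0
    size: int = 0
    count: int = 0    # iteration count

    def on_load(self, seq, addr, size):
        if not self.valid:
            return
        if self.count == 0:
            self.seq, self.first, self.size = seq, addr, size
        else:
            s = addr - self.last
            if self.count == 1:
                self.stride = s
            elif self.stride != s:
                self.valid = False      # not a constant-stride load
        self.last = addr
        self.count += 1

    def region(self):
        """Half-open address range covered so far, for either stride sign."""
        lo, hi = sorted((self.first, self.last))
        return lo, hi + self.size

    def confirms_violation(self, addr, size):
        """True if the store may overlap the region read by this load."""
        if not self.valid:
            return True
        lo, hi = self.region()
        return addr < hi and lo < addr + size

    def on_store_validated(self, store_seq):
        """'Remove' the oldest dynamic load if it precedes the validated store."""
        if self.valid and self.count > 0 and self.seq < store_seq:
            self.count -= 1
            if self.count > 0:
                self.seq += B_MAX
                self.first += self.stride
```

Removing the oldest load at each store validation shrinks the recorded region from its old end, so a store is only compared against loads that are younger than it.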

Store-to-load bypassing. Some loops actually contain true memory dependencies. A solution for allowing the LA to execute these loops is to do store-to-load bypassing at loop building. The loop builder contains a Store Sets memory dependency predictor [2], identical to the one implemented in the superscalar core⁹. At loop building, if a static load is predicted to depend on a static store with the same data size, we transform the body as shown in Figure 7. The store µop stays in place, but 2 extra MOV µops are added in the loop body just behind the store. The load µop is removed from the body and replaced with a CHECK µop and a MOV µop. The MOV µop transmits the store data to the µops consuming the load value. The CHECK µop compares the load and store addresses. All CHECK µops and MOV µops execute on I-nodes. If a CHECK fails, a loop exit is triggered, taking as loop exit point the dynamic µop preceding (in sequential order) the instruction containing the removed load.

⁹ Our store sets LFST is fully associative, tagged with the SSID, which permits taking a wide SSID while keeping the LFST small.

The Bypassed Store Table. Even if a CHECK succeeds, we must still verify that stores between the dependent store-load pair do not write that memory location. The SVU contains a small fully-associative Bypassed Store Table (BST). Each BST entry contains a store address, a store PC, a data size and a lifetime. When validating a bypassed store, we search a free BST entry for that store, i.e., an entry whose lifetime is null. We initialize this entry with the store address, PC, data size, and we set the lifetime to the maximum number of dynamic stores separating the bypassed store-load pair(s). This value, which may be null, is obtained during loop building. If the lifetime is non-null and no free entry was found in the BST, we trigger a loop exit with that bypassed store as the loop exit point. When validating a store (whether bypassed or not), we check whether the store conflicts with any BST entry whose lifetime is non-null. If a conflict is detected, a loop exit is triggered with the validated store as the loop exit point. For each validated store, we decrement all the non-null lifetime values in the BST. When a conflict is detected, we train the store sets predictor with the validated store PC and the bypassed store PC. Doing so merges these two stores' store sets [2], so that on future occurrences of that loop, the load is predicted to depend on the correct store.
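A possible software model of the BST is given below. It is a sketch with hypothetical names; the order of the three steps (conflict check, lifetime decrement, then allocation for a bypassed store) is an assumption, chosen so that a bypassed store does not conflict with its own entry. Store sets training is not modeled.

```python
class BST:
    def __init__(self, size=16):
        # A null lifetime marks a free entry (16 entries in Table 3).
        self.entries = [dict(addr=0, pc=0, size=0, lifetime=0) for _ in range(size)]

    def validate_store(self, addr, size, pc, bypass_lifetime=None):
        """Process one validated store.

        Returns None if validation can proceed, or the reason for a loop exit
        ("conflict" or "bst_full"). `bypass_lifetime` is given only for
        bypassed stores and may be 0.
        """
        live = [e for e in self.entries if e["lifetime"] > 0]
        for e in live:
            if addr < e["addr"] + e["size"] and e["addr"] < addr + size:
                return "conflict"
        for e in live:
            e["lifetime"] -= 1
        if bypass_lifetime:   # bypassed store with a non-null lifetime
            free = next((e for e in self.entries if e["lifetime"] == 0), None)
            if free is None:
                return "bst_full"
            free.update(addr=addr, pc=pc, size=size, lifetime=bypass_lifetime)
        return None
```

With a lifetime of 2, the two stores validated after the bypassed store are checked against its address, after which the entry becomes free again.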

Double bypassing. The bypassing method described previously is effective only if the distance between the dynamic store and the dynamic load is less than one loop body. We found that this represents the majority of cases. However, one benchmark, 456.hmmer, suffered many failed CHECKs. To solve this case, we introduce double bypassing, which is the possibility for a dynamic store to communicate its data to a dynamic load at a distance between one and two bodies. Double bypassing is illustrated in Figure 7. It uses the BST like simple bypassing, except that the BST entry lifetime is set to a longer value. The loop builder contains a Failed Check Table (FCT), which is a small fully associative table. When a CHECK triggers a loop exit, we record in the FCT the bypassed load PC, so that if a predicted-dependent load hits in the FCT during loop building, we apply double bypassing.

5.4 Loop body reduction

Redundant execution exists in most programs [13]. In the example of Figure 4, µops 8 and 9 are redundant: they produce the same result on each iteration. This result is obtained at loop building, hence these µops can be removed from the loop body before executing the loop on the LA. This is an iterative process: in each iteration during loop building, we remove from the CRDG the µops that have no inputs left in the CRDG. The number of redundant µops identified depends on LBI¹¹. We must be careful with loads and stores. A store µop, even if it produces the same result on each iteration, cannot be considered redundant in a shared-memory architecture, as removing it would break the memory consistency model. A redundant load can be removed, but we must check memory dependencies. Some hardware support is necessary for that. The SVU contains a Removed Loads Table (RLT). Each RLT entry contains a valid bit, a load address, a load data size and a load PC. During loop building, as long as there is room in the RLT, redundant loads are removed from the body and recorded in the RLT. When validating a store, we check whether the store conflicts with any RLT entry. If a conflict is detected, a loop exit is triggered with the store as the loop exit point. The PC recorded in the conflicting RLT entry is used to train the store sets predictor. For loopy benchmarks, we found that redundant µops typically represent between 10% and 20% of dynamic loop µops.
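The iterative removal of redundant µops can be illustrated with the following Python sketch. It is an assumption-laden model: the CRDG is represented as a hypothetical dictionary mapping each µop to the set of loop µops it takes inputs from (constant inputs are simply absent), one pass is performed per loop build iteration, and the RLT is reduced to a count of free entries.

```python
def reduce_body(inputs, stores=frozenset(), loads=frozenset(), rlt_room=64, lbi=5):
    """Return (removed µops, remaining graph).

    inputs:   dict µop -> set of producer µops inside the loop body
    stores:   µops that are stores (never removed)
    loads:    µops that are loads (removed only while the RLT has room)
    lbi:      number of passes, one per loop build iteration
    """
    graph = {uop: set(prods) for uop, prods in inputs.items()}
    removed = []
    for _ in range(lbi):
        step = []
        for uop, prods in graph.items():
            if prods or uop in stores:
                continue                  # still has inputs, or is a store
            if uop in loads:
                if rlt_room == 0:
                    continue              # no room to record the removed load
                rlt_room -= 1
            step.append(uop)
        if not step:
            break
        for uop in step:
            del graph[uop]
        for prods in graph.values():
            prods.difference_update(step)  # consumers lose the removed inputs
        removed.extend(step)
    return removed, graph
```

A µop that depends on itself across iterations (such as a loop counter) always keeps an input in the graph, so it is never considered redundant by this procedure.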

MOV bypassing. It is possible to bypass certain MOV µops, meaning that, at loop building, we make the MOV's consumer µop depend directly on the MOV's producer µop. MOV bypassing is possible if the distance (in the sequence of dynamic instructions) between the producer and consumer µops is less than one loop body. If MOV bypassing can be applied for all the consumers of the MOV, the MOV itself can be removed from the body.

5.5 Mapping heuristic

The LA performance is very dependent on the mapping of µops onto EUs. The in-depth study of mapping heuristics is left for future studies. For this preliminary study, we tried to find a good-enough mapping through trial and error. The heuristic we found helped us understand some important properties for a good mapping heuristic on such an LA. We just give a high-level description here, omitting some details.

¹¹ With LBI=5, we were able to remove all the redundant µops.

clock frequency: 3 GHz; LM table: 4 entries; LT: 64 entries; MinLM: 900 instructions;
LBI: 5 iterations; MBS: 128 instructions; $N_{max}$: 12 µops; EVQs per EU: 12;
EVQ: 32 data; IQ: 32 data; OQ: 16 data; LIB: 64 loads; LOQ: 128 data; LEQ: 128 data; LSQ: 64 stores;
MZC: 64 entries; SBC: 16 entries; BST: 16 entries; FCT: 16 entries; RLT: 64 entries;
SVU throughput: 2 stores/cycle; unfreeze signal latency: 4 cycles; wired-OR latency: 4 cycles;
loop mode transition penalty: 100 cycles; extra transition penalty on a LT miss: 10000 cycles;
store sets SSIT: 4096 entries; store sets LFST: 16 entries; SSIT clearing period: 10 million cycles

Table 3: Fixed parameters for the loop detector, loop builder, and loop accelerator.

EU balancing. The µops mapped onto the same EU form a cycle in the constraint-augmented CRDG. Therefore, the average number of clock cycles per loop iteration cannot be less than the $N_{eu}$ of the most loaded EU. The mapping heuristic must try to reach a balanced distribution of µops on EUs. This is particularly important for loops with a small body: the mapping heuristic should avoid putting two µops on the same EU whenever possible, as this potentially doubles the loop execution time. Our mapping heuristic computes, from the body characteristics, a µop quota per EU type, assuming EU balancing. Once a µop is mapped onto an EU, we consider another µop, and so on until all µops have been mapped. We try to avoid mapping a µop onto an EU that has already reached its quota.

Lane segment sharing. Data that must go through the same lane segment share its bandwidth, which we have limited to one data per LA clock cycle. The µops producing these data cannot execute at an average rate greater than the rate at which the data go through the shared lane segments. Our mapping heuristic tries to minimize lane usage but does not try explicitly to minimize lane segment sharing. We try to map a µop on the same EU as one of its input µops, provided this EU has not reached its quota. Many µops have a single consumer, and this often permits avoiding using the bus for transmitting the data. If we must use the bus, we try to put the consumer µop on the node following the node where the producer µop has been mapped, whenever possible.

Natural CRDG cycles. The CRDG contains some natural cycles due only to data dependencies. Most natural cycles are 1-µop cycles consisting of an integer addition depending on itself. Yet, some loops contain multi-µop cycles which may increase the loop execution time considerably, as it takes several cycles for a data to travel from one EU to another. Whenever possible, our mapping heuristic tries to map onto the same EU the µops forming a natural cycle.

Artificial CRDG cycles. Artificial CRDG cycles are cycles in the constraint-augmented CRDG that are not natural cycles. Artificial CRDG cycles may increase the loop execution time dramatically, either because the cycle contains floating-point operations or, worse, because some µops in the cycle are mapped onto different EUs. Mapping µops onto the same EUs as their producers permits decreasing the probability of creating costly CRDG cycles, but this is not always sufficient. Whenever possible, our mapping heuristic tries to avoid mapping a µop on an EU if this would create an artificial CRDG cycle.

5.6 Performance tuning

Our simulator implements the loop detector and loop accelerator as described in Sections 4 and 5, i.e., with a great level of detail. The parameters we have used for the simulation are given in Table 3. Notice the loop mode transition penalty, which we have fixed at 100 clock cycles. We assume that this penalty takes into account the time elapsed after loop building and before the LA starts executing the loop, and the time necessary to update the architectural registers after the loop exit. In case of a LT miss, we add an extra penalty of 10000 cycles.

Maximum LICmax. For good performance, the values of LICmax and MinLICmax are set dynamically at loop building. As explained before, we try to set LICmax and MinLICmax as high as possible but under the limit permitted by the LSQ and the LEQ (which also depends on $N_{eu}$). However, when LICmax exceeds the EVQ depth, the deadlock probability becomes non-null. Actually, there is a LICmax value beyond which the deadlock probability increases dramatically. This value is not the same for all loops. To solve this issue, we record in each LT entry a value ML which defines a maximum for LICmax. When a deadlock occurs, we halve the ML value recorded in the LT entry for that loop. Moreover, a high LICmax is useful only for loops with a short body, for which a large instruction window represents many iterations. If the body exceeds 20 instructions, or if the time elapsed since the last deadlock is less than 1 million cycles, we set ML equal to the EVQ depth. Otherwise, we allow ML to be as high as 128. Overall, with this method, deadlocks have a negligible impact on performance.

[Figure 8 bar chart: x-axis, benchmark (400 to 483); y-axis from 0.0 to 1.6; two bars per benchmark: performance, fraction loop instructions.]

Figure 8: Global performance speedups obtained with the loop accelerator, relative to the baseline with no accelerator (higher is better). The second bar represents the fraction of instructions executed by the loop accelerator.

Memory zone size. The optimal memory zone size is not the same for all loops. We record in each LT entry the $\log_2$ of the zone size. Each LT entry also contains a 3-bit saturating counter. The zone size for a loop is initially set to 1024 bytes. When a premature loop exit occurs because the MZC is full, the 3-bit counter is incremented. If the 3-bit counter value is equal to 7, we double the zone size. If the MZC or the SBC signals a memory order violation, we decrement the 3-bit counter. If the 3-bit counter value is null, we halve the zone size.
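The zone size adaptation can be sketched as a small Python class. Several details are assumptions not given in the text: the class name, the initial counter value (4), the bounds on the zone size, and the fact that the counter is left unchanged after a doubling or a halving.

```python
class ZoneSizeTuner:
    def __init__(self, log2_size=10, counter=4, min_log2=0, max_log2=30):
        self.log2_size = log2_size   # log2 of the zone size (1024 bytes initially)
        self.counter = counter       # 3-bit saturating counter, range 0..7
        self.min_log2 = min_log2     # hypothetical lower bound
        self.max_log2 = max_log2     # hypothetical upper bound

    @property
    def zone_size(self):
        return 1 << self.log2_size

    def on_mzc_full(self):
        """Premature loop exit because the MZC is full."""
        self.counter = min(self.counter + 1, 7)
        if self.counter == 7:
            self.log2_size = min(self.log2_size + 1, self.max_log2)

    def on_order_violation(self):
        """The MZC or the SBC signalled a memory order violation."""
        self.counter = max(self.counter - 1, 0)
        if self.counter == 0:
            self.log2_size = max(self.log2_size - 1, self.min_log2)
```

The counter provides hysteresis: a single event of either kind does not change the zone size unless the counter is already at or next to one of its extremes.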

Disabling "bad" loops. Migrating the execution to the LA may actually decrease performance. This generally happens because of premature loop exits. For good performance, "bad" loops must be detected and their execution on the LA disabled. Each LT entry contains a $TimeLost$ value which estimates the time that has been lost so far by executing the loop on the LA rather than on the superscalar core. On a loop exit, the loop detector is trained with the number $n_i$ of instructions executed by the LA, the time $n_t$ spent on the LA and the number $n_m$ of L3 cache misses generated by the loop. The $TimeLost$ value is updated as follows:

$$TimeLost \leftarrow TimeLost + n_t - 0.7 \times n_i - 20 \times n_m$$

where the values $0.7$ and $20$ were found empirically. $TimeLost$ is initially set to 0. For most loops it becomes negative, indicating that the LA is likely to perform better than the superscalar core. For some loops, $TimeLost$ increases. When $TimeLost$ exceeds 100000 cycles, we consider that this is a "bad" loop. Sometimes, we reset all the $TimeLost$ values in the LT. We do this when the time elapsed since the last reset exceeds 100 times the sum of positive $TimeLost$ values in the LT.
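The bookkeeping for "bad" loops is simple enough to write down directly. The following Python sketch uses hypothetical function and parameter names; the constants are the ones given above.

```python
BAD_LOOP_THRESHOLD = 100_000   # cycles


def update_time_lost(time_lost, n_t, n_i, n_m, per_instr=0.7, per_miss=20):
    """Apply one TimeLost update after a loop exit.

    n_t: time spent on the LA (cycles); n_i: instructions executed by the LA;
    n_m: L3 cache misses generated by the loop.
    """
    return time_lost + n_t - per_instr * n_i - per_miss * n_m


def is_bad_loop(time_lost):
    return time_lost > BAD_LOOP_THRESHOLD


def should_reset(elapsed_since_reset, time_lost_values):
    """Reset rule: elapsed time exceeds 100 times the sum of positive TimeLost values."""
    positive_sum = sum(v for v in time_lost_values if v > 0)
    return elapsed_since_reset > 100 * positive_sum
```

For instance, a loop that executes 2000 instructions in 1000 cycles with no L3 miss decreases its TimeLost by 400 cycles, while one that takes 5000 cycles for 1000 instructions and 10 misses increases it by 4100 cycles.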

5.7 Simulation results

Figure 8 shows the performance obtained with the loop accelerator. Speedups are relative to the simulated baseline (cf. Figure 1). The second bar in Figure 8 shows the fraction of instructions executed by the LA. Some speedups are obtained on most loopy benchmarks, except 433.milc, whose loop behavior comes mostly from loop bodies greater than 128 instructions. Performance gains exceeding 25% are obtained on 6 benchmarks: 434.zeusmp, 437.leslie3d, 456.hmmer, 459.GemsFDTD, 462.libquantum and 481.wrf. These 6 benchmarks are the ones that execute at least 60% of the instructions on the LA. We measured the local acceleration provided by the LA on the loop fraction. For the 6 benchmarks mentioned above, the local acceleration is respectively 2.3, 2, 1.7, 2.8, 2.1 and 1.9 (ignoring the transition penalty). This level of local [...]

benchmark        all     no SBC   no byp.   no reduc.   naive mapping   ML=32   zone=1KB   bad loops
434.zeusmp       1.43    1.32     1.33      1.18        1.00            1.42    1.25       1.43
437.leslie3d     1.45    1.26     1.43      1.10        0.95            1.44    1.40       1.38
456.hmmer        1.38    0.99     0.99      1.22        0.99            1.38    0.99       1.38
459.GemsFDTD     1.32    1.00     1.13      1.13        1.00            1.33    1.28       1.32
462.libquantum   1.39    1.24     1.39      1.37        1.19            1.19    1.39       1.39
481.wrf          1.26    1.12     1.21      1.19        0.95            1.26    1.26       1.26
29 bench. mean   1.084   1.036    1.055     1.045       0.996           1.068   1.061      1.072

Table 4: Performance relative to the baseline. The second column is for all features enabled. Following columns give performance when disabling a single feature: respectively, disabling the SBC, store-to-load bypassing, loop body reduction, using a naive mapping heuristic, using a small ML, using a fixed zone size, and allowing bad loops.

Table 4 shows performance for the 6 benchmarks mentioned above and the average performance on all 29 benchmarks. The second column is for all features enabled. Following columns give performance when disabling a single feature. The mapping heuristic is very important. To quantify its impact, we have simulated a "naive" mapping heuristic which scans all EUs in a fixed order: we map on the current EU one µop that can be mapped on it, then we move to the next EU. We do this repeatedly until all µops have been mapped. With this naive heuristic, the average performance is even slightly lower than the baseline. Another important feature is the SBC. Without it, false memory dependencies trigger many premature loop exits. Loop body reduction is also very important. Without it, the loop accelerator is clearly not working at its full potential. Store-to-load bypassing brings significant performance gains on 456.hmmer and 459.GemsFDTD. The last 3 columns show that the performance tuning described in Section 5.6 brings non-negligible performance gains. The fixed memory zone size prevents 456.hmmer from benefiting from the accelerator. Using a large ML is important for 462.libquantum, which has small loop bodies. Disabling bad loops is a useful feature, especially for benchmarks which do not benefit from the accelerator.
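The "naive" mapping heuristic used as a point of comparison can be sketched as follows. This Python model is only an interpretation: the function name is hypothetical, "can be mapped" is reduced to matching a µop kind (e.g., "I", "F", "M") against the kind of the EU, and the only capacity limit is $N_{max}$ µops per EU.

```python
def naive_mapping(uops, eus, n_max=12):
    """Scan the EUs in a fixed order, mapping one compatible µop per EU per scan.

    uops: list of (name, kind) in program order
    eus:  list of EU kinds, in the fixed scan order
    Returns a dict EU index -> list of µop names mapped onto that EU.
    """
    pending = list(uops)
    mapping = {i: [] for i in range(len(eus))}
    while pending:
        progress = False
        for i, kind in enumerate(eus):
            if len(mapping[i]) >= n_max:
                continue                      # this EU is full
            for uop in pending:
                if uop[1] == kind:            # first pending µop this EU can execute
                    mapping[i].append(uop[0])
                    pending.remove(uop)
                    progress = True
                    break
        if not progress:
            raise ValueError("the loop cannot be mapped")
    return mapping
```

Such a mapping spreads µops evenly but ignores producer-consumer relations, so dependent µops tend to land on different EUs, which is consistent with the low performance reported for the naive heuristic in Table 4.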

6 Conclusion and future work

We have proposed a hardware mechanism for detecting dynamic loops. We found that about one third of all the instructions executed by the SPEC CPU2006 suite come from dynamic loops whose body size can be quite diverse, from a few instructions to several hundreds.

We have proposed a loop accelerator microarchitecture that can accelerate most dynamic loops without help from the compiler or the programmer. The accelerator configuration we have focused on has 8 nodes and 4 lanes, and can execute up to 32 µops simultaneously from a window of several thousands of µops. Its local hardware complexity is no greater than that of a conventional superscalar core. Our efforts for obtaining good performance speedups have shown the importance of mapping µops onto execution units very carefully. We have proposed new solutions for dealing with memory dependencies in a window of several thousands of instructions, exploiting loop properties. We have shown that a significant fraction of dynamic loop instructions are redundant and need not be executed by the loop accelerator.

This is a preliminary study, intended to provide a basis for future work on loop acceleration. We believe the design space of loop accelerators is huge. The mapping heuristic is very important for performance. We believe that some of the characteristics we have outlined for a good mapping heuristic are somewhat general, like the importance of preventing artificial CRDG cycles. Other aspects of our mapping heuristic may be dependent on some of the choices we made, like not providing any direct datapath between EUs on the same node, or choosing a pipelined ring for communicating between nodes. Future studies may reconsider these choices and try to find a better tradeoff between the IPC, the hardware complexity of the bus and nodes, and the mapping heuristic. Nevertheless, we believe that the loop accelerator microarchitecture we have proposed is scalable beyond the particular configuration we considered. For instance, increasing the number [...] will increase the complexity of the distributor and some parts of the L-node (for instance the MZC, whose number of ports is proportional to the number of lanes). Yet, we believe that if one can double the width of the superscalar core, doubling the number of lanes should not be more difficult. The node microarchitecture depicted in Figure 6 is, we believe, easier to pipeline than a conventional 4-way superscalar execution core. Future work may consider the possibility to overclock the loop accelerator. The global speedups we have demonstrated are limited by Amdahl's law, i.e., by the fraction of the execution spent in dynamic loops. Nevertheless, the local acceleration we have obtained on dynamic loops is considerable for a hardware-only solution. A compiler aware of the presence of a loop accelerator may try to exploit it.

References

[1] T. W. Barr, A. L. Cox, and S. Rixner. Translation caching: Skip, don't walk (the page table). In Proc. of the 37th Int. Symp. on Computer Architecture, 2010.

[2] G. Z. Chrysos and J. S. Emer. Memory dependence prediction using store sets. In Proc. of the 25th Int. Symp. on Computer Architecture, 1998.

[3] N. Clark, A. Hormati, and S. Mahlke. VEAL: virtualized execution accelerator for loops. In Proc. of the 35th Int. Symp. on Computer Architecture, 2008.

[4] M. Dixon, P. Hammarlund, S. Jourdan, and R. Singhal. The next-generation Intel Core microarchitecture. Intel Technology Journal, 14(3), 2010.

[5] A. García, O. J. Santana, E. Fernández, P. Medina, and M. Valero. LPA: a first approach to the loop processor architecture. In Proc. of the 3rd Int. Conf. on High Performance Embedded Architectures and Compilers, 2008.

[6] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. IEEE Computer, 41(7):33-38, July 2008.

[7] M. Kobayashi. Dynamic characteristics of loops. IEEE Transactions on Computers, C-33(2):125-132, 1984.

[8] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: a cycle-level simulator for highly detailed microarchitecture exploration. In Proc. of the Int. Symp. on Performance Analysis of Systems and Software, 2009.

[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Janapa Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005. http://www.pintool.org.

[10] M. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr, and J. Emer. Adaptive insertion policies for high performance caching. In Proc. of the 34th Int. Symp. on Computer Architecture, 2007.

[11] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E. Smith. Trace processors. In Proc. of the 30th Int. Symp. on Microarchitecture, 1997.

[12] A. Seznec and P. Michaud. A case for (partially) tagged geometric history length branch prediction. Journal of Instruction Level Parallelism, April 2006. http://www.jilp.org/vol8.

[13] A. Sodani and G. S. Sohi. Dynamic instruction reuse. In Proc. of the 24th Int. Symp. on Computer Architecture, 1997.

[14] G. Stitt, R. Lysecky, and F. Vahid. Dynamic hardware/software partitioning: a first approach. In Proc. of the Design Automation Conference, 2003.

[15] S. S. Stone, K. M. Woley, and M. I. Frank. Address-indexed memory disambiguation and store-to-load [...]

[16] J. M. Tendler, J. S. Dodson, J. S. Field, H. Le, and B. Sinharoy. POWER4 system architecture. IBM Journal of Research and Development, 46(1), January 2002.

[17] J. Tubella and A. González. Control speculation in multithreaded processors through dynamic loop detection. In Proc. of the 4th Int. Symp. on High-Performance Computer Architecture, 1998.

[18] S. Vajapeyam, P. J. Joseph, and T. Mitra. Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs. In Proc. of the 26th Int. Symp. on Computer Architecture, 1999.

[19] S. Vajapeyam and T. Mitra. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In Proc. of the 24th Int. Symp. on Computer Architecture, 1997.

Bmax: maximum number of µops for the loop body; BST: bypassed store table; CEC: cyclic execution controller; CRDG: cyclic register dependency graph; EU: execution unit; EVQ: external value queue; FCT: failed check table; IQ: input queue; LA: loop accelerator; LBI: loop build iterations; LEQ: loop exit queue; LESN: loop exit sequence number; LIB: load issue buffer; LIC: local iteration count; LM: loop monitor; LOQ: load output queue; LSQ: loop store queue; LT: loop table; MBS: maximum body size in instructions; MinLM: LM threshold in instructions; ML: maximum value of LICmax; MSN: maximum sequence number; MZC: memory zone checker

;N