A Case for a Complexity-Effective, Width-partitioned Microarchitecture


HAL Id: inria-00000211

https://hal.inria.fr/inria-00000211

Submitted on 13 Sep 2005


Olivier Rochecouste, Gilles Pokam, André Seznec

To cite this version:

Olivier Rochecouste, Gilles Pokam, André Seznec. A Case for a Complexity-Effective, Width-partitioned Microarchitecture. [Research Report] PI 1742, 2005, pp.27. ⟨inria-00000211⟩

IRISA, Publication interne No. 1742, August 2005, 27 pages

Communicating Systems (Systèmes communicants), CAPS project

Abstract: Current superscalar processors feature 64-bit datapaths to execute the program instructions, regardless of their operand size. Our analysis indicates, however, that most executions comprise a large amount (40%) of narrow-width operations, i.e. instructions which exclusively process narrow-width operands and results. We further noticed that these operations are well distributed across a program run. In this paper, we exploit these properties to master the hardware complexity of superscalar processors. We propose a width-partitioned microarchitecture (WPM) to decouple the treatment of narrow-width operations from that of the other program instructions. We split a 4-way issue processor into two clusters: one executing 64-bit operations, load/store and complex operations, and the other treating the 16-bit operations. We show that revealing the narrow-width operations to the hardware is sufficient to keep the workload balanced and the communications minimized between clusters. Using a WPM reduces the complexity of several critical processor components: register file and bypass network. A WPM also lowers the complexity of the interconnection fabric since the 16-bit cluster is only able to propagate narrow-width data. We examine simple configurations of WPM while discussing their tradeoffs. We evaluate a speculative heuristic to steer the narrow-width operations towards clusters. A detailed complexity analysis shows that using a WPM model saves power and area with a minimal impact on performance.

Key-words: Hardware complexity, power consumption, superscalar processor, width-partitioned microarchitecture, register file, narrow-width operations, data-width predictor


* orocheco@irisa.fr
** gpokam@cs.ucsd.edu
*** seznec@irisa.fr

Résumé: [...] operands. Our analysis indicates, however, that most executions comprise a considerable fraction (40%) of narrow-width operations, i.e. instructions that exclusively manipulate small operands and results. We also observed that narrow-width operations are well distributed over the course of an execution. This study exploits these properties to master the complexity of superscalar processors. To this end, we propose a clustered microarchitecture (WPM) to decouple the processing of narrow-width operations from that of the other program instructions. We thus partition the processor into two clusters: a 64-bit cluster executing the 64-bit, load/store and complex operations, and a 16-bit cluster processing the 16-bit operations. Given the properties of narrow-width operations, we show that revealing them to the hardware is sufficient to maintain load balance and to minimize communications between the clusters. The WPM model effectively reduces the complexity of several critical processor components: register file, bypass network. This model also reduces the complexity of the interconnection network because the 16-bit cluster can only propagate 16-bit data. We examine different WPM configurations while discussing their tradeoffs. We evaluate a speculative heuristic to distribute the instructions to the clusters. A detailed complexity analysis indicates that the WPM model reduces power consumption and silicon area with a minimal impact on performance.

Mots clés (keywords): Hardware complexity, power consumption, superscalar processor, clustered microarchitecture, register file, narrow-width operations, width predictor

computational units to reduce the overall complexity [21, 9, 1]. In these studies, the partitioning is dictated by the need to break the complexity growth factor of the critical components by reducing their sizes. Hence, the resulting clusters have simpler structures, thereby enabling fast clock rates. However, a major bottleneck with this approach is the interconnect fabric used to communicate data between clusters. This interconnect fabric is relatively slow and dissipates a large amount of power [19]. It is therefore desirable to minimize the number of inter-cluster communications while also keeping the workload among clusters balanced.

Other studies have considered a careful design of the critical processor components to reduce this complexity. These studies are mainly directed by empirical analysis of runtime data, such as the seminal observation made by Brooks et al. [4] that most applications only need part of the full datapath-width to execute. Several optimizations have been proposed which exploit this narrow-width operand property of programs to reduce power consumption [4, 24, 5, 14] or to improve performance [8, 17, 25, 18, 20]. While they do help reduce the complexity of certain critical processor components (e.g. the register file), quantifying their impact on the overall microarchitecture is more difficult. This is because many of these proposals feature complex implementations, sometimes requiring major changes to the hardware.

This paper proposes to make efficient use of the available silicon by exploring new possibilities of partitioning a microarchitecture based on narrow-width data. Central to our approach is the observation that the occurrences of narrow-width operations, i.e. instructions exclusively comprising narrow-width operands, and of the other program instructions are relatively balanced and highly interleaved across a complete program run. We observed this program property on the MediaBench and SPEC2000 benchmarks. Because of the relative prevalence of these narrow-width operations in programs (about 40% of the instructions exhibit this property for the considered benchmarks), we suggest using a width-partitioned microarchitecture (WPM) to master the hardware complexity of superscalar processors. In a WPM, we resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g. the register file), as no additional hardware is required for managing each type of instruction; yet, the interleaving of the two instruction types balances the workload among the clusters. We also show that WPM reduces the complexity of the interconnect fabric. In fact, since clusters with a narrow-width datapath can only communicate narrow-width data, the datapath-width of the interconnect fabric is significantly reduced, yielding a corresponding saving in interconnect power and area. We present an efficient design of WPM, discussing various implementation choices, including steering heuristics to distribute instructions among the clusters, and a detailed analysis of the complexity factors affecting performance, power and area. Our complexity analysis shows that using a WPM architecture instead of a classical 64-bit 2-cluster microarchitecture can indeed save power and silicon area with only a minimal impact on the overall performance.

The remainder of this paper is organized as follows. Section 2 elaborates on the motivations of this work, providing some intuitive observations about the rationale of our approach. WPMs are described in detail in Section 3, while their complexity analysis is discussed in Section 4. The instruction steering mechanism is presented in Section 5.2. Results are presented in Section 6, while Section 7 discusses the related work. We conclude in Section 8.

2 Motivations

In recent works, several authors [4, 18, 24, 8] have pointed out the large availability of narrow-width data within compute-intensive integer and multimedia programs. To exploit this program property, various definitions of the operations executing with narrow-width operands have been assumed, depending on their application to the architecture. Brooks et al. [4] have qualified a narrow-width operation as an instance where both source operands can be represented with fewer than 16 bits, whereas Pokam et al. [24] considered the basic-block granularity to define narrow-width regions in a program. We formulate a different assumption that considers a narrow-width operation to be an operation where no operand exceeds 16 bits, including the destination operand.

Characterizing narrow-width operations. We have quantified the number of occurrences of these narrow-width operations across the MediaBench and the SPEC2000 benchmarks. Our bitwidth analysis is exclusively devoted to operations processed through the integer functional unit, including the address calculation. Operations that execute with narrow-width operands in the two's complement form are also considered. Figure 1 reports the classification and the distribution of the integer operations using narrow-width operands. As a convention, we write N for a narrow-width operand and F for a full-width operand. We use a 3-letter notation for categorizing an operation: the two leading letters represent the width type of the source operands and the last letter is the width type of the result. For instance, NFN stands for an operation which processes one narrow-width and one full-width source operand and produces a narrow-width result. For monadic operations, we consider that all of the source operands feature the same width.
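As an illustration, the 3-letter notation can be sketched in a few lines of Python. This is a minimal sketch and not the tooling used for the measurements: the names `is_narrow` and `width_class` are hypothetical, and reading "does not exceed 16 bits" as the signed 16-bit two's complement range is an assumption.

```python
NARROW_BITS = 16  # narrow-width threshold used in the text


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """Return True if `value` fits in `bits` bits as a two's complement integer.

    Assumption: "does not exceed 16 bits" is read as the signed 16-bit range.
    """
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high


def width_class(src1: int, src2: int, result: int) -> str:
    """Build the 3-letter class: two source letters, then the result letter.

    Source letters are ordered N before F, so a mixed pair is always "NF".
    For a monadic operation, pass the single source value twice.
    """
    sources = sorted(("N" if is_narrow(v) else "F" for v in (src1, src2)),
                     reverse=True)
    return "".join(sources) + ("N" if is_narrow(result) else "F")
```

Under these assumptions, an addition of two small values whose sum still fits in 16 bits is classified NNN, whereas the same addition overflowing the 16-bit range is classified NNF.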

Figure 1: Classification and distribution of integer operations using narrow-width data.

We observe from Figure 1 that a significant part of the integer execution, about 40%, is devoted to the narrow-width operations (NNN). These results corroborate the prior observations made by Brooks et al. [4] regarding the prevalence of narrow-width data for the integer operations. This suggests a scheme that decouples the processing of narrow-width operations onto dedicated narrow operators. This would significantly reduce the complexity of certain processor components lying on the critical path. In this study, we advocate decoupling the processing of narrow-width operations onto dedicated narrow-width clusters, where one or more clusters feature a narrow datapath-width. We refer to such a partitioned model as a width-partitioned microarchitecture (WPM). As for a conventional partitioned architecture, a WPM calls for a proper steering mechanism to distribute the narrow-width operations. It is crucial for both performance and power that the steering heuristics balance the workload among clusters while minimizing inter-cluster communications.

Inter-cluster communications. Figure 1 also provides an estimate of the average number of communications that take place within a WPM. An inter-cluster communication in a WPM can be triggered if an operation consumes a narrow-width value produced in a remote cluster (e.g. NNF, NFN, NFF) or if it produces a narrow-width value that must be propagated to a remote cluster (e.g. NFN, FFN). As shown in Figure 1, this concerns roughly 20% of the integer operations. For a WPM, this might translate into the worst-case scenario where one operation out of five triggers an inter-cluster communication. However, this is an upper bound, since the actual number depends strongly on the distribution of narrow-width operations and on the data dependencies among operations, i.e. not all the narrow-width operations have a data dependency with the other, larger-width program instructions. Our results section indeed shows that the number of inter-cluster communications is far below this bound.
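The worst-case bound discussed above is simply the share of operations belonging to the classes that may communicate. The following Python sketch makes this explicit; the function name `comm_upper_bound` and the input format (a mapping from 3-letter class to dynamic operation count) are assumptions made for illustration, and no measured data is embedded in it.

```python
from typing import Mapping

# Classes that may trigger an inter-cluster communication according to the text:
# a narrow value consumed remotely (NNF, NFN, NFF) or a narrow result that must
# be propagated to the remote cluster (NFN, FFN).
COMM_CLASSES = frozenset({"NNF", "NFN", "NFF", "FFN"})


def comm_upper_bound(class_counts: Mapping[str, int]) -> float:
    """Return the fraction of operations that could communicate (worst case).

    `class_counts` maps a 3-letter width class to its dynamic count.
    This is an upper bound only: an operation in one of these classes
    communicates only if a cross-cluster data dependency actually exists.
    """
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    candidates = sum(n for cls, n in class_counts.items() if cls in COMM_CLASSES)
    return candidates / total
```

NNN and FFF operations never enter the numerator, which is why the bound stays well below the overall share of operations touching narrow-width data.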

Figure 2: Distance between narrow-width operations and the other program instructions at runtime.

Workload balance. Another relevant task for the instruction steering mechanism is to guarantee a good workload balance among clusters. We have approximated the workload balance that a WPM might be subject to as follows. For each narrow-width operation, we have collected the distance separating it from the next operation that executes with larger data. Figure 2 displays the mean of the most frequent distances observed over all benchmark applications at runtime. The standard deviation across applications is also reported and reveals the strong correlation of the narrow-width distribution between applications. Another phenomenon illustrated in Figure 2 is the dominance of short distances at runtime. This may be due to the fact that we also included address calculations, which frequently solicit the full datapath-width. This suggests that occurrences of narrow-width operations are highly interleaved with the other operations in program execution. From a WPM viewpoint, this means that a simple steering heuristic may be able to achieve a balanced workload.
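The distance metric can be sketched as follows in Python. This is only one possible reading of the measurement, stated as an assumption: the trace is a sequence of booleans (True for a narrow-width operation), the distance is counted in dynamic instructions, and narrow-width operations with no later wider operation are ignored. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def distances_to_next_wide(is_narrow_trace: Iterable[bool]) -> List[int]:
    """For each narrow-width operation, return the number of instructions
    separating it from the next operation that executes with larger data.

    Narrow-width operations that are never followed by a wider operation
    are skipped (assumption of this sketch).
    """
    flags = list(is_narrow_trace)
    distances = []
    next_wide = None  # index of the closest wider operation seen so far
    # Scan backwards so that the next wider operation is always known.
    for i in range(len(flags) - 1, -1, -1):
        if not flags[i]:
            next_wide = i
        elif next_wide is not None:
            distances.append(next_wide - i)
    distances.reverse()  # restore program order
    return distances


def distance_histogram(is_narrow_trace: Iterable[bool]) -> Counter:
    """Count how often each distance occurs in the trace."""
    return Counter(distances_to_next_wide(is_narrow_trace))
```

A histogram dominated by small distances corresponds to the highly interleaved behaviour described above.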

3 Width-partitioned Microarchitecture

Most integer and multimedia applications exhibit a large fraction of narrow-width operations that are also well distributed across the execution. To take advantage of this program property, we examine a novel partitioned architecture that can efficiently operate on narrow-width operations as well as on the other program instructions, with reduced complexity. We refer to this novel partitioned organization as a width-partitioned microarchitecture (WPM). This section describes the implementation of such a 4-way WPM design. One can easily consider scaling up this design to a larger issue-width. Doing so will require some modifications to the inter-cluster communication model. This is however beyond the scope of this paper.

Figure 3: Baseline organization

Figure 4: WPM organization

3.1 Baseline model

Our baseline model is derived from the Alpha 21264 [13]. It is a 64-bit, out-of-order, dual-cluster machine. We assume that the floating-point operations are processed in a dedicated cluster not described in this paper. Figure 3 shows the block diagram of this baseline organization. As depicted in the figure, the processor front-end (fetch, decode and rename) and the data cache are shared by all clusters. Similarly to the Alpha 21264 [13], we assume that the issue queues are decoupled from the reorder buffer and partitioned among clusters. The other components comprise the functional units and the register file, which is duplicated onto each cluster. Both clusters are capable of issuing up to two instructions per cycle. Every 64-bit ALU can treat complex instructions such as multiplication or shift operations. We assume that the scheduling of memory operations is restricted to a single cluster. In addition, we consider that the load/store unit is capable of processing integer and logic operations. Since we examine a dual-cluster implementation, a fully-connected topology is advocated to circumvent potential resource contention and maximize performance. To support this topology, each register file (RF) copy must feature a number of write ports equal to the total number of ALUs, i.e. 4 write ports per cluster.

Once fetched and decoded, instructions are processed by the renaming stage. At this step, the steering logic is responsible for dispatching the instructions to the proper cluster. We rely upon an instruction steering heuristic similar to [7], which steers an instruction to the cluster that produces most of its operands if this cluster comprises the proper functional unit. An instruction can only access its source operands from the local RF. We assume that inter-cluster communications are implicitly done by propagating every result to the local and the remote RF. For the producing cluster, data are bypassed in the same cycle to allow back-to-back executions, whereas broadcasting data to the other cluster takes additional cycles.

Figure 5: Inter-cluster communication scenario. The suffix indicates the width of the operand.
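The baseline steering rule of Section 3.1 (send an instruction to the cluster that produces most of its operands, provided that cluster has the proper functional unit) can be sketched as below. This is an illustrative sketch rather than the heuristic of [7] itself: the function name `steer`, the integer cluster identifiers and the tie-breaking rule (lowest cluster identifier) are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def steer(producer_clusters: Iterable[Optional[int]],
          capable_clusters: Set[int]) -> int:
    """Pick the cluster that should execute an instruction.

    `producer_clusters` gives, for each source operand, the cluster that
    produces it (None if the operand has no in-flight producer).
    `capable_clusters` is the set of clusters owning a suitable functional
    unit, e.g. a single cluster for memory operations.
    Ties are broken toward the lowest cluster id (assumption of this sketch).
    """
    if not capable_clusters:
        raise ValueError("no cluster can execute this instruction")
    produced = Counter(c for c in producer_clusters if c is not None)
    # Most operands produced first, then lowest cluster id.
    return max(capable_clusters, key=lambda c: (produced[c], -c))
```

For example, an instruction whose two operands are produced in cluster 1 is steered to cluster 1, unless only cluster 0 has the required unit, in which case it goes to cluster 0 and relies on the broadcast of results to the remote RF.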

3.2 WPM design

The basic WPM design considered in this study splits the integer core into two distinct clusters: (1) a main full-width cluster featuring a 64-bit datapath and (2) a narrow-width cluster featuring a 16-bit datapath. As in the baseline model, each cluster is composed of a set of functional units (FUs) and a local RF. The narrow-width cluster features two 16-bit ALUs and a local 16-bit RF (called narrow-width RF). As shown in Section 4, this organization dramatically reduces the overall processor complexity as there is no need for additional hardware to keep track of the different datapath-width execution modes. The narrow-width RF has four read ports and two write ports to provide support for the execution of two operations per cycle. The full-width cluster, on the other hand, comprises a 64-bit ALU and one load/store unit capable of executing simple arithmetic and logic operations. A 64-bit local RF (called full-width RF) is provided with four read ports and two write ports to support the execution of two 64-bit operations per cycle. Restricting the load/store unit to the full-width cluster is coherent with our approach since address calculations generally operate on the full datapath-width. Figure 4 illustrates this basic WPM organization. Similar to the baseline model, we do not consider partitioning the processor front-end and the data cache. We do however need to address with care the communication between the narrow-width cluster and the full-width cluster.

3.2.1 Inter-cluster communications

The need to communicate data between the narrow-width cluster and the full-width cluster is dictated by the propagation of data dependencies among the narrow-width operations and the other program instructions. Consider for instance the execution scenario depicted [...]

[...] inter-cluster communications (see Figure 1). The 16-bit duplicate RF shown in Figure 4 has been specifically designed to break down this complexity. This RF provides a copy of the narrow-width RF and is kept synchronized with it by the functional units in both clusters.

Narrow-width cluster implementation details. Regarding the narrow-width cluster, two write ports are provided by the 16-bit duplicate RF to allow the 16-bit ALUs to keep each written-back register up to date with its copy in the remote cluster. There are two reasons to keep the ALUs in the narrow-width cluster fully connected with the remote RF copy. First, as shown in Figure 1, the availability of narrow-width operations in programs is large enough to justify the need for more communication bandwidth between the narrow-width cluster and the full-width cluster. Second, Figure 1 shows that among the operations that may potentially involve a remote communication with the narrow-width cluster, the NFF operations are by far the most frequent. An NFF operation may consume its value from the 16-bit duplicate RF, meaning that the copy must have been kept up to date by the narrow-width cluster.

Full-width cluster implementation details. In the full-width cluster, two read ports and two write ports are provided by the 16-bit duplicate RF to allow the 64-bit ALU and the load/store unit to read their operands and to write back their results. However, only one write port is actually connected to the remote RF copy. This choice is motivated by the observation that only a small fraction of the operations executing on the full-width cluster need to be synchronized with their remote copy. In fact, these operations are restricted to the subset of instructions that produce a 16-bit result, e.g. IF3 and IF1 in Figure 5. As illustrated in Figure 1 with FFN and NFN, their share in programs is negligible; there is therefore no need to provide the full write bandwidth to keep both copies synchronized. Moreover, our analysis showed that among those instructions that may involve a remote communication with the narrow-width cluster, a large percentage are actually narrow-width loads. This explains the additional port on the narrow-width RF, which is write-connected with the load/store unit in the full-width cluster. The broadcast to the remote RF copy is done each time the load/store unit writes to the 16-bit duplicate RF.

Since only the load/store unit keeps both RF copies synchronized with each other, it is possible that a value being written back by the 64-bit ALU in the 16-bit duplicate RF is not available in the narrow-width RF when a dependent narrow-width operation is ready to issue. In that case, we assume the hardware automatically inserts a copy instruction to forward that value to the RF copy [22]. Note however that this case is rare since the only operations that may potentially communicate their result to the remote narrow-width cluster are FFN and NFN. These operations account for less than 3% of the operations on our benchmarks (see Figure 1). It is also important to note that only narrow-width operations can be executed on the narrow-width cluster. The other types of operations involving narrow-width data, i.e. FFN, NNF, NFN and NFF, execute on the full-width cluster and read/write their narrow-width data from/to the 16-bit duplicate RF. The 16-bit duplicate RF therefore serves both as a copy of the narrow-width RF and as a local 16-bit RF, since many values may be read and written into it without actually modifying their copy. We show in Section 4 that this significantly reduces the complexity.
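The placement rules of the basic WPM described in this section can be summarized with a small Python sketch. It is a simplified model under stated assumptions, not a description of the actual hardware: the names `Placement`, `place`, `narrow_rf`, `dup16_rf` and `full_rf` are hypothetical labels, and the `needs_full_cluster` flag (for loads/stores and complex operations, which only the full-width cluster can execute) is an assumption about how such instructions would be flagged.

```python
from typing import NamedTuple, Tuple


class Placement(NamedTuple):
    cluster: str                           # "narrow" or "full"
    register_files: Tuple[str, str, str]   # RF used for (src1, src2, result)


def place(width_class: str, needs_full_cluster: bool = False) -> Placement:
    """Map a 3-letter width class (e.g. "NFF") to a cluster and register files.

    Only NNN operations run on the narrow-width cluster. Every other class
    runs on the full-width cluster, where narrow operands and results use
    the 16-bit duplicate RF and full-width ones use the 64-bit RF.
    """
    if len(width_class) != 3 or set(width_class) - {"N", "F"}:
        raise ValueError("expected a 3-letter class made of N and F")
    if width_class == "NNN" and not needs_full_cluster:
        return Placement("narrow", ("narrow_rf",) * 3)
    rfs = tuple("dup16_rf" if w == "N" else "full_rf" for w in width_class)
    return Placement("full", rfs)
```

The sketch makes visible why the 16-bit duplicate RF acts as a local register file: an NFF operation reads one operand from it without involving the narrow-width cluster at all.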

3.2.2 Limited inter-cluster connectivity

We also explored a scheme with limited inter-cluster connectivity to further mitigate overall complexity. In this new organization, we reduce the number of write ports on the 16-bit duplicate RF from 4 to 2. In the narrow-width cluster, we remove the path labelled 2 in Figure 4, meaning that only one 16-bit ALU is now able to propagate its result to the remote RF copy. In the full-width cluster, we note that there is no need to provide 2 write ports on the 16-bit duplicate RF, as keeping it synchronized with the copy is done by the load/store unit. Moreover, our analysis shows that there are only a few operations (NFN and FFN) executing on the main cluster that produce narrow-width results. Hence, it makes sense to remove the path labelled 1 in Figure 4, but operations producing narrow-width data will now have to be steered toward the load/store unit. Note however that the 64-bit ALU can still execute operations with narrow-width data. If the operation produces a narrow-width result, this result will have to be written back to the 64-bit RF. Although this allows a more efficient use of computing resources, it should be noted that we may miss some optimization opportunities.

We also propose to optimize the number of inter-cluster communications, since only a small fraction of the integer operations executing on the full-width cluster use and produce narrow-width data. For this purpose, we advocate using a copy instruction scheme [22] to update the content of a 16-bit register only when necessary. This approach can lead to significant power savings in the interconnect fabric and register files. Nevertheless, using this approach may also have a negative impact on the overall performance, as the consuming operations will be delayed until the copy instructions write their results back. To mitigate the performance degradation, we propose to broadcast the value of load operations as done in the basic WPM. It makes sense to do so as we observed that load operations which produce narrow-width data are relatively frequent at runtime. However, a more efficient optimization would be related to the use of a narrow-width usage predictor to predict on which cluster a value will