HAL Id: inria-00000211
https://hal.inria.fr/inria-00000211
Submitted on 13 Sep 2005
A Case for a Complexity-Effective, Width-partitioned Microarchitecture
Olivier Rochecouste, Gilles Pokam, André Seznec
To cite this version:
Olivier Rochecouste, Gilles Pokam, André Seznec. A Case for a Complexity-Effective, Width-partitioned Microarchitecture. [Research Report] PI 1742, 2005, pp.27. inria-00000211
IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires)
Internal Publication No. 1742
A Case for a Complexity-Effective, Width-Partitioned Microarchitecture
Olivier Rochecouste*, Gilles Pokam**, André Seznec***
Systèmes communicants
Projet CAPS
Internal Publication No. 1742, August 2005, 27 pages
Abstract: Current superscalar processors feature 64-bit datapaths to execute the program instructions, regardless of their operand size. Our analysis indicates, however, that most executions comprise a large amount (40%) of narrow-width operations, i.e. instructions which exclusively process narrow-width operands and results. We further noticed that these operations are well distributed across a program run. In this paper, we exploit these properties to master the hardware complexity of superscalar processors. We propose a width-partitioned microarchitecture (WPM) to decouple the treatment of narrow-width operations from that of the other program instructions. We split a 4-way issue processor into two clusters: one executing 64-bit operations, load/store and complex operations, and the other treating the 16-bit operations. We show that revealing the narrow-width operations to the hardware is sufficient to keep the workload balanced and the communications between clusters minimized. Using a WPM reduces the complexity of several critical processor components: register file and bypass network. A WPM also lowers the complexity of the interconnection fabric since the 16-bit cluster is only able to propagate narrow-width data. We examine simple configurations of WPM while discussing their tradeoffs. We evaluate a speculative heuristic to steer the narrow-width operations towards clusters. A detailed complexity analysis shows that using a WPM model saves power and area with a minimal impact on performance.
Key-words: Hardware complexity, power consumption, superscalar processor, width-partitioned microarchitecture, register file, narrow-width operations, data-width predictor
* orocheco@irisa.fr
** gpokam@cs.ucsd.edu
*** seznec@irisa.fr
Résumé: [...] operands. Our analysis indicates, however, that most executions include a considerable fraction (40%) of narrow-width operations, i.e. instructions that exclusively manipulate small operands and results. We also observed that narrow-width operations are well distributed over the course of an execution. This study exploits these properties to master the complexity of superscalar processors. To do so, we propose a clustered microarchitecture (WPM) that decouples the treatment of narrow-width operations from that of the other program instructions. We thus partition the processor into two clusters: a 64-bit cluster executing the 64-bit, load/store and complex operations, and a 16-bit cluster handling the 16-bit operations. Given the properties of narrow-width operations, we show that revealing them to the hardware is sufficient to maintain load balance and to minimize communications between the clusters. The WPM model effectively reduces the complexity of several critical processor components: register file and bypass network. This model also reduces the complexity of the interconnection network, since the 16-bit cluster can only propagate 16-bit data. We examine different WPM configurations and discuss their tradeoffs. We evaluate a speculative heuristic for distributing instructions to the clusters. A detailed complexity analysis indicates that the WPM model reduces power consumption and silicon area with a minimal impact on performance.
Keywords: Hardware complexity, power consumption, superscalar processor, clustered microarchitecture, register file, narrow-width operations, width predictor
computational units to reduce the overall complexity [21, 9, 1]. In these studies, the partitioning is dictated by the need to break the complexity growth factor of the critical components by reducing their sizes. Hence, the resulting clusters have simpler structures, thereby enabling fast clock rates. However, a major bottleneck with this approach is the interconnect fabric used to communicate data between clusters. This interconnect fabric is relatively slow and dissipates a large amount of power [19]. It is therefore desirable to minimize the number of inter-cluster communications while also keeping the workload among clusters balanced.
Other studies have considered a careful design of the critical processor components to reduce this complexity. These studies are mainly directed by empirical analysis made on runtime data, such as the seminal observation made by Brooks et al. [4] that most applications only need part of the full datapath-width to execute. Several optimizations have been proposed which exploit this narrow-width operand property of programs to reduce power consumption [4, 24, 5, 14] or to improve performance [8, 17, 25, 18, 20]. While they do actually help reduce the complexity of certain critical processor components (e.g. the register file), quantifying their impact on the overall microarchitecture is more difficult. This is because many of these proposals feature complex implementations, sometimes requiring major changes to the hardware.
This paper proposes to make efficient use of the available silicon by exploring new possibilities of partitioning a microarchitecture based on narrow-width data. Central to our approach is the observation that the occurrence of narrow-width operations, i.e. instructions exclusively comprising narrow-width operands, and that of the other program instructions are relatively balanced and highly interleaved across a complete program run. We observed this program property on the MediaBench and SPEC2000 benchmarks. Because of the relative prevalence of these narrow-width operations in programs (about 40% of the instructions exhibit this property for the considered benchmarks), we suggest using a width-partitioned microarchitecture (WPM) to master the hardware complexity of superscalar processors. In a WPM, we resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g. the register file), as no additional hardware is required for managing each type of instruction; yet, the interleaving of the two instruction types balances the workload among the clusters. We also show that WPM reduces the complexity of the interconnect fabric. In fact, since clusters with a narrow-width datapath can only communicate narrow-width data, the datapath-width of the interconnect fabric is significantly reduced, yielding a corresponding saving of the interconnect power and area. We present an efficient design of WPM, discussing various implementation choices including steering heuristics to distribute instructions among the clusters, and a detailed analysis of the complexity factors affecting the performance, power and area. Our complexity analysis shows that using a WPM architecture instead of a classical 64-bit 2-cluster microarchitecture can indeed save power and silicon area with only a minimal impact on the overall performance.
The remainder of this paper is organized as follows. Section 2 elaborates on the motivations of this work, providing some intuitive observations about the rationale of our approach. WPMs are described in detail in Section 3, while their complexity analysis is discussed in Section 4. The instruction steering mechanism is presented in Section 5.2. Results are presented in Section 6, while Section 7 discusses the related work. We conclude in Section 8.
2 Motivations
In recent works, several authors [4, 18, 24, 8] have pointed out the large availability of narrow-width data within compute-intensive integer and multimedia programs. To exploit this program property, various definitions of the operations executing with narrow-width operands have been assumed, depending on their application to the architecture. Brooks et al. [4] have qualified a narrow-width operation as an instance where both source operands can be represented with fewer than 16 bits, whereas Pokam et al. [24] considered the basic-block granularity to define narrow-width regions in a program. We formulate a different assumption that considers a narrow-width operation to be an operation where no operand exceeds 16 bits, including the destination operand.
Characterizing narrow-width operations. We have quantified the number of occurrences of these narrow-width operations across the MediaBench and the SPEC2000 benchmarks. Our bitwidth analysis is exclusively devoted to operations processed through the integer functional unit, including the address calculation. Operations that execute with narrow-width operands in two's complement form are also considered. Figure 1 reports the classification and the distribution of the integer operations using narrow-width operands. As a convention, we note N for a narrow-width operand and F for a full-width operand. We use a 3-letter notation for categorizing an operation: the two leading letters represent the width type of the source operands and the last letter is the width type of the result. For instance, NFN stands for an operation which processes one narrow-width and one full-width source operand and produces a narrow-width result. For the monadic operations, we consider that all of the source operands feature the same width.
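The 3-letter notation can be illustrated with a minimal sketch. The helper names `is_narrow` and `width_class` are hypothetical, and two points are assumptions rather than statements from the text: a value is taken to be narrow when it fits in a signed 16-bit two's complement range, and the two source letters are ordered with N first because the notation does not appear to distinguish NFx from FNx.

```python
def is_narrow(value: int, bits: int = 16) -> bool:
    """True if `value` fits in `bits` bits as a two's complement integer
    (assumed interpretation of "no operand exceeds 16 bits")."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def width_class(sources, result, bits: int = 16) -> str:
    """Return the 3-letter class (e.g. 'NFN') of one integer operation.

    `sources` holds one or two source operand values. A monadic operation
    is treated as if both sources had the width of its single operand.
    """
    letters = ["N" if is_narrow(v, bits) else "F" for v in sources]
    if len(letters) == 1:
        letters = letters * 2
    # Assumption: source order is irrelevant, so list 'N' before 'F'.
    letters.sort(key=lambda c: c != "N")
    return "".join(letters) + ("N" if is_narrow(result, bits) else "F")
```

For example, adding two small values that overflow the 16-bit range yields the class NNF, while an operation whose sources and result all fit in 16 bits is NNN.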
Figure 1: Classification and distribution of integer operations using narrow-width data.
We observe from Figure 1 that a significant part of the integer execution, about 40%, is devoted to the narrow-width operations (NNN). These results corroborate the prior observations made by Brooks regarding the prevalence of narrow-width data for the integer operations. This suggests a scheme that decouples the processing of narrow-width operations onto dedicated narrow operators. This would significantly reduce the complexity of certain processor components lying on the critical path. In this study, we advocate decoupling the processing of narrow-width operations onto dedicated narrow-width clusters, where one or more clusters feature a narrow datapath-width. We refer to such a partitioned model as a width-partitioned microarchitecture (WPM). As for a conventional partitioned architecture, a WPM calls for a proper steering mechanism to distribute the narrow-width operations. It is crucial for both performance and power that the steering heuristics balance the workload among clusters while minimizing inter-cluster communications.
Inter-cluster communications. Figure 1 also provides an estimate of the average number of communications that take place within a WPM. An inter-cluster communication in WPM can be triggered if an operation consumes a narrow-width value produced in a remote cluster (e.g. NNF, NFN, NFF) or if it produces a narrow-width value that must be propagated to a remote cluster (e.g. NFN, FFN). As shown in Figure 1, this concerns roughly 20% of the integer operations. For a WPM, this might translate into the worst-case scenario where one operation out of five triggers an inter-cluster communication. However, this is a maximal bound since this is strongly correlated with the narrow-width operations distribution and the data dependency among operations, i.e. not all the narrow-width operations have a data dependency with the other larger-width program instructions. Our result section indeed shows that the number of inter-cluster communications is far below this bound.
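The worst-case bound described above can be sketched as a simple count over a stream of width classes. The function name and the input format (a sequence of 3-letter class strings) are assumptions for illustration; the set of classes is the one listed in the text. The result is only an upper bound, since it ignores the actual data dependencies.

```python
from collections import Counter

# Classes that may consume or produce a narrow-width value across clusters,
# as enumerated in the text.
CROSS_CLUSTER_CLASSES = {"NNF", "NFN", "NFF", "FFN"}


def communication_upper_bound(classes) -> float:
    """Fraction of operations that could trigger an inter-cluster
    communication, ignoring real dependencies (hence a maximal bound)."""
    counts = Counter(classes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[c] for c in CROSS_CLUSTER_CLASSES) / total
```

A measured trace would replace the hypothetical class stream; the actual communication rate depends on which of these operations really depend on a remotely produced value.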
Figure 2: Distance between narrow-width operations and the other program instructions at runtime.
Workload balance. Another relevant task for the instruction steering mechanism is to guarantee a good workload balance among clusters. We have approximated the workload balance that a WPM might be subject to as follows. For each operation, we have collected the distance separating a narrow-width operation from the next operation that executes with larger data. Figure 2 displays the mean of the most frequent distances observed over all benchmark applications at runtime. The standard deviation across applications is also reported and reveals the strong correlation of narrow-width distribution between applications. Another phenomenon illustrated in Figure 2 is the dominance of short distances at runtime. This may be due to the fact that we also included address calculations, which frequently solicit the full datapath-width. This might therefore mean that occurrences of narrow-width operations are highly interleaved with the other operations in program execution. From a WPM viewpoint, this means that a simple steering heuristic may be able to achieve a balanced workload.
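The distance metric can be sketched as follows. This is an illustrative reading, not the authors' tool: it assumes the distance is counted in dynamic instructions, that "an operation that executes with larger data" means any operation whose class is not NNN, and that narrow-width operations with no later wide operation are ignored. The function name is hypothetical.

```python
def narrow_to_wide_distances(classes):
    """For each NNN operation in a dynamic trace of width classes, return
    the number of instructions to the next non-NNN operation."""
    distances = []
    next_wide = None  # index of the closest later non-NNN operation
    for i in range(len(classes) - 1, -1, -1):
        if classes[i] != "NNN":
            next_wide = i
        elif next_wide is not None:
            distances.append(next_wide - i)
    distances.reverse()  # restore program order
    return distances
```

A histogram of the returned distances (for instance with `collections.Counter`) would give the kind of distribution summarized in Figure 2; short distances indicate that the two instruction types are closely interleaved.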
3 Width-partitioned Microarchitecture
Most integer and multimedia applications exhibit a large fraction of narrow-width operations that are also well distributed across the execution. To take advantage of this program property, we examine a novel partitioned architecture that can efficiently operate on narrow-width operations as well as on the other program instructions, with reduced complexity. We refer to this novel partitioned organization as a width-partitioned microarchitecture (WPM). This section describes the implementation of such a 4-way WPM design. One can easily consider scaling up this design to a larger issue-width. Doing so will require some modifications to the inter-cluster communication model. This is however beyond the scope of this paper.
Figure 3: Baseline organization.
Figure 4: WPM organization.
3.1 Baseline model
Our baseline model is derived from the Alpha 21264 [13]. It is a 64-bit, out-of-order, dual-cluster machine. We assume that the floating-point operations are processed in a dedicated cluster not described in this paper. Figure 3 shows the block diagram of this baseline organization. As depicted in the figure, the processor front-end (fetch, decode and rename) and the data cache are shared by all clusters. Similarly to the Alpha 21264 [13], we assume that the issue queues are decoupled from the reorder buffer and partitioned among clusters. The other components comprise the functional units and the register file, which is duplicated onto each cluster. Both clusters are capable of issuing up to two instructions per cycle. Every 64-bit ALU can treat complex instructions such as multiplication or shift operations. We assume that the scheduling of memory operations is restricted to a single cluster. In addition, we consider that the load/store unit is capable of processing integer and logic operations. Since we examine a dual-cluster implementation, a fully-connected topology is advocated to circumvent potential resource contention and maximize performance. To support this topology, each register file (RF) copy must feature a number of write ports equal to the total number of ALUs, i.e. 4 write ports per cluster.
Once fetched and decoded, instructions are processed by the renaming stage. At this step, the steering logic is responsible for dispatching the instructions to the proper cluster. We rely upon an instruction steering heuristic similar to [7], which steers an instruction to the cluster that produces most of its operands if this cluster comprises the proper functional unit. An instruction can only access its source operands from the local RF. We assume that inter-cluster communications are implicitly done by propagating every result to the local and the remote RF. For the producing cluster, data are bypassed in the same cycle to allow back-to-back executions, whereas broadcasting data to the other cluster takes additional cycles.

Figure 5: Inter-cluster communication scenario. The suffix indicates the width of the operand.
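The baseline steering rule can be sketched as a small function. This is a simplified illustration in the spirit of the heuristic described above, not the algorithm of [7]: the function name, the data structures, the tie-break by lowest cluster id and the fallback to the first cluster owning the required unit are all assumptions.

```python
def steer(source_regs, producer_cluster, needs_unit, cluster_units):
    """Pick a cluster for one instruction.

    source_regs      -- names of the source registers
    producer_cluster -- dict: register name -> id of the producing cluster
    needs_unit       -- kind of functional unit required, e.g. "alu" or "mem"
    cluster_units    -- dict: cluster id -> set of unit kinds it owns
    """
    votes = {}
    for reg in source_regs:
        cluster = producer_cluster.get(reg)
        if cluster is not None:
            votes[cluster] = votes.get(cluster, 0) + 1
    # Prefer the cluster producing most operands, if it has the right unit.
    for cluster in sorted(votes, key=lambda c: (-votes[c], c)):
        if needs_unit in cluster_units[cluster]:
            return cluster
    # Assumed fallback: first cluster that owns the required unit.
    for cluster in sorted(cluster_units):
        if needs_unit in cluster_units[cluster]:
            return cluster
    raise ValueError(f"no cluster provides unit {needs_unit!r}")
```

With memory operations restricted to a single cluster, a load whose operands were all produced remotely is still steered to the cluster holding the load/store unit, which is when the extra broadcast cycles matter.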
3.2 WPM design
The basic WPM design considered in this study splits the integer core into two distinct clusters: (1) a main full-width cluster featuring a 64-bit datapath and (2) a narrow-width cluster featuring a 16-bit datapath. As in the baseline model, each cluster is composed of a set of functional units (FUs) and a local RF. The narrow-width cluster features two 16-bit ALUs and a local 16-bit RF (called the narrow-width RF). As shown in Section 4, this organization dramatically reduces the overall processor complexity as there is no need for additional hardware to keep track of the different datapath-width execution modes. The narrow-width RF has four read ports and two write ports to provide support for the execution of two operations per cycle. The full-width cluster, on the other hand, comprises a 64-bit ALU and one load/store unit capable of executing simple arithmetic and logic operations. A 64-bit local RF (called the full-width RF) is provided with four read ports and two write ports to support the execution of two 64-bit operations per cycle. Restricting the load/store unit to the full-width cluster is coherent with our approach since address calculations generally operate on the full datapath-width. Figure 4 illustrates this basic WPM organization. Similar to the baseline model, we do not consider partitioning the processor front-end and the data cache. We do however need to address with care the communication between the narrow-width cluster and the full-width cluster.
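The cluster parameters stated above can be collected in a small configuration sketch. The class and field names are hypothetical; the numbers are those given in the text for the local RFs only (the 16-bit duplicate RF is not modeled here). The check assumes two source reads and one result write per issued operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    datapath_bits: int
    functional_units: tuple
    rf_read_ports: int
    rf_write_ports: int

    @property
    def issue_width(self) -> int:
        # One operation per functional unit per cycle.
        return len(self.functional_units)


NARROW = ClusterConfig("narrow-width", 16, ("alu16", "alu16"), 4, 2)
FULL = ClusterConfig("full-width", 64, ("alu64", "load_store"), 4, 2)


def supports_full_issue(cfg: ClusterConfig) -> bool:
    """True if the local RF can feed every unit in the same cycle,
    assuming two reads and one write per issued operation."""
    return (cfg.rf_read_ports >= 2 * cfg.issue_width
            and cfg.rf_write_ports >= cfg.issue_width)
```

Both clusters issue two operations per cycle, which together give the 4-way design considered in this section.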
3.2.1 Inter-cluster communications
The need to communicate data between the narrow-width cluster and the full-width cluster is dictated by the propagation of data dependency among the narrow-width operations and the other program instructions. Consider for instance the execution scenario depicted [...] inter-cluster communications (see Figure 1). The 16-bit duplicate RF shown in Figure 4 has been specifically designed to break down this complexity. This RF provides a copy of the narrow-width RF and is kept synchronized with it by the functional units in both clusters.
Narrow-width cluster implementation details. Regarding the narrow-width cluster, two write ports are provided by the 16-bit duplicate RF to allow the 16-bit ALUs to keep each writeback register up to date with its copy in the remote cluster. There are two reasons to maintain the ALUs in the narrow-width cluster fully connected with the remote RF copy. First, as shown in Figure 1, the availability of narrow-width operations in programs is large enough to justify the need for more communication bandwidth between the narrow-width cluster and the full-width cluster. Second, Figure 1 shows that, among the operations that may potentially involve a remote communication with the narrow-width cluster, the NFF operations are by far the most frequent. An NFF operation may consume its value from the 16-bit duplicate RF, meaning that the copy must have been kept up to date by the narrow-width cluster.
Full-width cluster implementation details. In the full-width cluster, two read ports and two write ports are provided by the 16-bit duplicate RF to allow the 64-bit ALU and the load/store unit to read their operands and to write back their results. However, only one write port is actually connected to the remote RF copy. This choice is motivated by the observation that only a small fraction of the operations executing on the full-width cluster needs to be synchronized with their remote copy. In fact, these operations are restricted to the subset of instructions that produce a 16-bit result, e.g. IF3 and IF1 in Figure 5. As illustrated in Figure 1 with FFN and NFN, their share in programs is negligible; there is therefore no need to provide the full write bandwidth to keep both copies synchronized. Moreover, our analysis showed that, among those instructions that may involve a remote communication with the narrow-width cluster, a large percentage are actually narrow-width loads. This explains the additional port on the narrow-width RF, which is write-connected with the load/store unit in the full-width cluster. The broadcast to the remote RF copy is done each time the load/store unit writes to the 16-bit duplicate RF.
Since only the load/store unit keeps both RF copies synchronized with each other, it is possible that a value being written back by the 64-bit ALU in the 16-bit duplicate RF is not available in the narrow-width RF when a dependent narrow-width operation is ready to issue. In that case, we assume the hardware automatically inserts a copy instruction to forward that value to the RF copy [22]. Note however that this case is rare since the only operations that may potentially communicate their result to the remote narrow-width cluster are FFN and NFN. These operations account for less than 3% on our benchmarks (see Figure 1). It is also important to note that only narrow-width operations can be executed on the narrow-width cluster. The other types of operations containing narrow-width data, i.e. FFN, NNF, NFN and NFF, execute on the full-width cluster and read/write their narrow-width data from/to the 16-bit duplicate RF. The 16-bit duplicate RF therefore serves both as a copy of the narrow-width RF and as a local 16-bit RF, since many values may be read and written into it without actually modifying their copy. We show in Section 4 that this significantly reduces the complexity.
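The behaviour described in this subsection can be summarized in a minimal behavioural sketch. It is not a model of the actual hardware: the names `wpm_cluster` and `DuplicateRFModel` and the producer labels are hypothetical, and timing, ports and register renaming are ignored. It only captures which writes reach the narrow-width RF directly and when a copy instruction would be needed.

```python
def wpm_cluster(width_class: str) -> str:
    """Only pure narrow-width operations (NNN) run on the 16-bit cluster."""
    return "narrow" if width_class == "NNN" else "full"


class DuplicateRFModel:
    """Tracks which 16-bit registers are stale in the narrow-width RF."""

    def __init__(self):
        self.stale_in_narrow = set()
        self.copies_inserted = 0

    def write(self, reg, producer):
        """`producer` is 'narrow_alu', 'load_store' or 'alu64' (assumed labels)."""
        if producer == "alu64":
            # Written to the 16-bit duplicate RF only; the remote copy lags.
            self.stale_in_narrow.add(reg)
        else:
            # Narrow ALUs and the load/store unit update both copies.
            self.stale_in_narrow.discard(reg)

    def read_on_narrow_cluster(self, reg) -> bool:
        """Return True if a copy instruction had to be inserted first."""
        if reg in self.stale_in_narrow:
            self.stale_in_narrow.discard(reg)
            self.copies_inserted += 1
            return True
        return False
```

In this sketch a copy is only counted when a narrow-width consumer actually reads a value last written by the 64-bit ALU, which matches the rare FFN/NFN case discussed above.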
3.2.2 Limited inter-cluster connectivity
We also explored a scheme with limited inter-cluster connectivity to further mitigate overall complexity. In this new organization, we reduce the number of write ports on the 16-bit duplicate RF from 4 to 2. In the narrow-width cluster, we remove the path labelled 2 in Figure 4, meaning that only one 16-bit ALU is now able to propagate its result to the remote RF copy. In the full-width cluster, we note that there is no need to provide 2 write ports on the 16-bit duplicate RF, as keeping it synchronized with the copy is done by the load/store unit. Moreover, our analysis shows that there are only a few operations (NFN and FFN) executing on the main cluster that produce narrow-width results. Hence, it makes sense to remove the path labelled 1 in Figure 4, but operations producing narrow-width data will now have to be steered toward the load/store unit. Note however that the 64-bit ALU can still execute operations with narrow-width data. If the operation produces a narrow-width result, this result will have to be written back to the 64-bit RF. Although a more efficient use of computing resources can be realized by doing so, it should be noted that we may miss some optimization opportunities.
We also propose to optimize the number of inter-cluster communications, as a small fraction of the integer operations executing on the full-width cluster use and produce narrow-width data. For this purpose, we advocate using a copy instruction scheme [22] to update the content of a 16-bit register only when necessary. This approach can lead to significant power savings in the interconnect fabric and register files. Nevertheless, using this approach may also have a negative impact on the overall performance, as the consuming operations will be delayed until the copy instructions write their results back. To mitigate the performance degradation, we propose to broadcast the value of load operations as done in the basic WPM. It makes sense to do so as we observed that load operations which produce narrow-width data are relatively frequent at runtime. However, a more efficient optimization would be related to the use of a narrow-width usage predictor to predict on which cluster a value will