HAL Id: hal-00758195
https://hal.inria.fr/hal-00758195
Submitted on 28 Nov 2012
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
microarchitecture studies
Pierre Michaud
To cite this version:
Pierre Michaud. Constant-work multiprogram throughput metrics for microarchitecture studies.
[Re-search Report] RR-8150, INRIA. 2012. �hal-00758195�
0
2
4
9
-6
3
9
9
IS
R
N
IN
R
IA
/R
R
--8
1
5
0
--F
R
+
E
N
G
RESEARCH
REPORT
N° 8150
November 2012
Constant-work
multiprogram
throughput metrics for
microarchitecture studies
Pierre Michaud
RESEARCH CENTRE
RENNES – BRETAGNE ATLANTIQUE
Pierre Mi haud
Proje t-TeamALF
Resear hReport n°8150November201221pages
Abstra t:
Multi orepro essors animproveperforman ebyde reasingtheexe utionlaten yofparallel
pro-grams,orbyin reasingthroughput,i.e.,thequantityofworkdoneperunitoftimewhenexe uting
independenttasks. Throughputisnotne essarilyproportionalto thenumberof oresand anbe
impa tedsigni antlybyresour esharingin several parts of themi roar hite ture. Quantifying
theimpa tofresour esharingonthroughputrequiresathroughputmetri . Amajorityof
mi roar- hite turestudiesuseequal-timethroughputmetri s,su hasIPCthroughputorweightedspeedup,
thatarebasedontheimpli itassumptionthatallthejobsexe uteforaxedandequaltime. We
argue that this assumption is not realisti . We propose and hara terize some new throughput
metri sbasedontheassumptionthat jobs exe uteaxed andequalquantityof work. Weshow
that using su h equal-work throughputmetri may hangethe on lusionofami roar hite ture
study.
Résumé: Les pro esseursmulti- oeursaugmententlesperforman esen réduisantle temps
d'exé utiondesprogrammesparallèleouenaugmentantledébit, 'est-à-direlaquantitéde
tra-vail ee tuée par unité de temps lorsqu'on exé ute des tâ hes indépendantes. Le débit n'est
pastoujoursproportionnelaunombrede oeurs,il peutdépendrefortementdes onitssurles
ressour espartagéesdelami roar hite ture. Lesmétriquesdedébitpermettentau
mi roar hi-te tedequantierl'impa tde es onitssurledébit. Unemajoritéd'étudesenmi roar hite ture
utilisentdesmétriquesdedébit tellesqueledébit d'instru tionsoulasommedes a élérations
qui sont basées sur l'hypothèse impli ite que toutes les tâ hes s'exé utent pendant des temps
onstants et égaux. Nous pensons que ette hypothèse n'est pas réaliste. Nous proposons et
ara térisonsde nouvellesmétriques de débit basées sur l'hypothèse que les tâ hes exé utent
des quantités de travail onstantes et égales. Nous montrons que l'utilisation de es nouvelles
métriquesdedébitpeuvent hangerles on lusionsde ertainesétudes.
1 Introdu tion
Multi ore pro essors an improve performan e in two ways: they an de rease the exe ution
laten yofparallelprograms,orthey anin reasethroughput,i.e.,thequantityofworkdoneper
unit of timewhen exe utingmultiprogramworkloadsmade ofindependent threads. Whilethe
importan e of parallelprograms in the urrentorfuture softwaree osystemmaybedis ussed,
thebenetofhigherthroughputisundisputableandhasbeenmostlyvisibleindata entersand
intheserversideoftheinternet. However,threadsrunningsimultaneouslyonamulti oreshare
someresour es like the last-level a he, pin bandwidth, hip thermaldesign power, et . With
simultaneous multithreading (SMT), threads share ore mi roar hite ture stru tures: level-1
a hes,bran h predi tors, physi al registers,s hedulinglogi , exe utionunits, et . Be ause of
resour esharing, throughputis not ne essarilyproportionalto thenumber ofthreads running
simultaneously. Moore'slawisexpe tedto ontinuein thenear future,andwithitthegrowing
of the number of ores on a single hip, leading to so- alled "many ore" hips. Commer ial
many ore hips existing so far exploit data parallelismand havefound an important ni he in
graphi s pro essing,i.e., mostlyonthe lientsideof theinternet. Whatis not leariswhether
hips with several hundredsof general-purpose oreswill beuseful on theserverside. Indeed,
someof thesharedresour esareunlikelyto growlinearlywiththe numberof ores, espe ially
pinbandwidth. Mi roar hite tsstudyinggeneral-purposemulti oresormany oresareinterested
in quantifying throughput, to understand, and possibly de rease, the performan e loss due to
resour esharing.
Weargueinthispaperthatthemultiprogramthroughputmetri sthatare ommonlyusedfor
mi roar hite turestudiesarede ient. Weproposenewthroughputmetri saimedat orre ting
thesede ien ies.
Thispaperisorganizedasfollows. InSe tion2welistthemostfrequentlyusedmultiprogram
throughput metri s and we explain why, in our opinion, they are de ient. We introdu e in
Se tion3anewthroughputmetri ,theEW-IPC,whi hisbasedontheassumptionofequal-work
jobs. WeshowthattheEW-IPCmayleadtothroughputparadoxes:a eleratingathreadmight
sometimesde reasethroughput. Be ausethereisnosimpleanalyti alformulafortheEW-IPC,
weintrodu einSe tion4somepra ti alproxiesfortheEW-IPC:theH-IPC,ASH-IPCand
SSH-IPC. We dis ussin Se tion 5thepossibleimpli ationsof using equal-workthroughput metri s
and we show with an example that the on lusions of a study may hange dramati ally. We
on ludethisstudybyre ommendingtousetheASH-IPCortheSSH-IPCinmi roar hite ture
studies on ernedwithmultiprogramthroughput.
2 Multiprogram throughput metri s are broken
Multiprogramthroughputisgenerallymeasuredby onsideringasetofsingle-threadben hmarks
and running simultaneouslysome ombinations of ben hmarks, alled workloads in the rest of
thisstudy. Ingeneral,ifthepro essor anrunupto
K
threadssimultaneously,ea hworkloadis a ombinationofK
ben hmarks(not ne essarilydistin t). Itis ustomaryin mi roar hite ture studiestodeneaxedsetofworkloadsandtosimulateea hworkloadseparatelyforaxedtimeinterval. Fromsimulations,aper-threadperforman enumberisobtainedforea hworkload,and
aglobalthroughputnumberisobtainedbyaggregatingtheper-workloadperforman enumbers.
Obtainingaglobalthroughputnumberisimportantbe ausethis issometimestheonlyway
tode idewhetherame hanismshouldbeimplementedornot.Indeed,multiprogramthroughput
metri sareoftenusedtostudyarbitrationme hanismsinmi roar hite tureswhereoneorseveral
resour esaresharedbetweenthreads runningsimultaneously. Dierentarbitrationme hanisms
modify the arbitration me hanisms in su h a way that ertain threads are a eleratedat the
ostofslowingdownsomeotherthreads. Lookingatindividualthreadsorindividualworkloads
mayyieldsomeinsightwhendoingastudy,buteventuallyitisglobalthroughputthatpermits
de idingwhi harbitrationme hanismshouldbeimplemented.
Sofar,thethreemostfrequentlyusedmultiprogramthroughputmetri sinmi roar hite ture
studiesaretheso- alledIPCthroughput,weightedspeedupandharmoni meanofspeedups. These
threemetri sdierinhowtheydenetheper-workloadperforman e.
2.1 The ET-IPC (aka IPC throughput)
IPCthroughputdenestheperforman eofagivenworkloadasthesumoftheIPCs(instru tions
exe uted per y le) of ea h thread in the workload. Then, global throughput is obtained by
averagingtheper-workloadperforman es:
ET-IPC
=
1
W
W
X
w=1
K
X
i=1
IP C(w, i)
(1)where
K
isthenumberoflogi al ores,W
isthenumberofworkloadsandIP C(w, i)
istheIPC ofthethreadrunningon orei
inthew
th
workload. Intherestofthisstudy,wedenotetheIPC
throughput the equal-time IPC (ET-IPC), as ea h workloadis runfor a xed and equal time
interval. TheET-IPCwasrstusedinthe1990'sforstudyingSMTmi roar hite tures[13,12℄,
andisstilloneofthethreemostpopularthroughputmetri s. However,blindlymaximizingthe
ET-IPC mayleadto implement resour earbitrationme hanismsthat favorhigh-IPCthreads,
penalizing low-IPC ones [12℄. This led to the denition of alternative throughput metri s, in
parti ulartheweightedspeedup[10,7, 9,11℄.
2.2 The weighted speedup and harmoni mean of speedups are
in on-sistent
Instead of raw IPCs orexe ution times, resear hersoften prefer to onsider speedups, that is,
relativeperforman e[5℄. Weightedspeedupdenesper-workloadperforman easthesumofthe
speedupsofea hthread onstituting theworkload,that is,theperforman eofworkload
w
is:K
X
i=1
IP C(w, i)
IP C
ref
(w, i)
where
IP C
ref
(w, i)
isareferen eIPCforthethreadrunningon orei
inthew
th
workload(for
instan e,
IP C
ref
(w, i)
maybedenedastheIPCofthethreadwhenitrunsaloneonareferen e ma hine).Weightedspeedupisoneofthethreemostfrequentlyusedmultiprogramthroughputmetri s.
However,wehaveshowninare entstudythatweightedspeedupisin onsistent,asitgivesmore
weighttoben hmarkswithalowreferen eIPC[6℄. Onemaywanttoweightdierentben hmarks
dierently,butthisshouldbestatedexpli itlyandthismustbejustied. Toourknowledge,none
ofthepaststudiesthathaveusedweightedspeeduphaveprovidedareasonforweightingdierent
ben hmarksdierently,perhapsbe ausetheywerenotawareofthein onsisten yproblem. The
harmoni meanofspeedups,thethirdmostfrequentlyusedthroughputmetri ,isin onsistentas
well[6℄.
Using an in onsistent metri may lead to arti ial on lusions, as illustrated in Table 1.
single thread ma hine X runningAB ma hine Y runningAB ben hmarkA
IP C
ref
= 0.2
IP C = 0.1
IP C = 0.2
ben hmarkBIP C
ref
= 1
IP C = 0.2
IP C = 0.1
weightedspeedup 0.7 1.1harmoni meanofspeedups 0.29 0.18
Table 1: Example illustrating the in onsisten y of weighted speedup and harmoni mean of
speedups. Ben hmarksAandBareequallyimportant. Without furtherinformationon
ben h-marks,theonlypossible on lusionisthatma hinesXandYoerthesamethroughput.
aunit ofworkwhi his notthesameforalltheben hmarks andwhi hdependsonareferen e
ma hine, the ma hine on whi h referen e IPCs are measured. Troughput is dened as the
quantityof work doneperunit of time. A sound throughputmetri ,onewhi h anbeused to
omparema hines,shoulddenetheunitofworkindependentlyfromareferen ema hine,asthe
hoi e ofaparti ularreferen ema hineisarbitrary(the hoi eofadierentreferen ema hine
ould hangethe on lusions). If onewants, forwhateverreason,to givemoreweightto some
ben hmarks,su hweightingshouldbedone ons iouslyandshould bejustied.
2.3 What about geometri mean of speedups ?
It iswellknown that,in the aseof single-threadperforman e, speedups should be aggregated
withageometri meaninordertoavoid onsisten yproblems[2,5℄,andresear hersgenerallyuse
the geometri mean to summarize single-threadspeedups. Oddly, in the aseof multiprogram
throughput,thepra ti eofaddingspeedups(asin weightedspeedup)ortaking theirharmoni
meanisstillwidespread.
Thoughwedonotex lude ompletelythepossibilitytodeneameaningfulthroughputmetri
usingageometri mean,itisnot learhowsu hmetri shouldbedened. Thegeometri mean
ofspeedupsgivesanestimateofthemedianspeedupofrandomprogramsundertheassumption
that ben hmarks are representativeand that speedups are distributed log-normally [5℄. That
is, an assumption is made not only on programs (as in all performan e metri s) but also on
ma hines. It iseasytoimagine aseswherethisassumptiondoesnothold 1
. Theassumptionof
log-normality annotbetruein generalbut oin identally[4℄.
2.4 The SPECrate throughput metri
SPECdenesathroughputmetri alledSPECrate,basedonrunninghomogeneousworkloads,
i.e.,runningseveralindependent opiesofthesameben hmarkandmeasuringhowlongittakes
forall opiestonishexe uting[1℄. However,SPECrateislimitedtohomogeneousworkloads. It
isnotanappropriatemetri forstudyingmi roar hite turesthatexploitthepossible
heterogene-ityofbehavioramong on urrentlyrunningthreads. Forexample,ifweuse onlyhomogeneous
workloadstoevaluatemultiprogramthroughput,wemaybeledto on ludethat ashared
last-level a heoerslittlethroughputadvantageoverprivate a hes,overlookingthepossibilityfor
a shared a he to exploit the fa t that dierent appli ations may have dierent a he spa e
requirements[8℄.
1
Forexample, onsiderapro essor Xsu hthatIPCsaredistributedlog-normally,withamedianIPCequal to1. Thatis,50%ofprogramshaveanIPCgreater than1. Then onsiderapro essorYthat anissueonlya singleinstru tionper y le,sothat50%ofprogramshaveanIPC loseto1. TheIPCdistributionofpro essor Yisnotlog-normal,neitherthespeedups.
2.5 Equal-time throughput metri s
Ameaningfulthroughputmetri isasso iatedwithathroughputexperimentsu hthatthe
quan-tity of work per unit of time measured with this experiment is equalto the throughput value
givenbythemetri .
BoththeET-IPCand theweightedspeedup metri sareequal-time throughputmetri s: the
throughputexperimentwithwhi htheyareasso iateddividestheexe utionofasetof
indepen-dentjobsintoequaltimeintervalssu hthat,duringatimeinterval,asingleworkloadisrunning
ontinuously(thatis,no ontextswit hhappens) 2
.
Hen e ifasetof workloadsisdened independentlyfrom anyparti ularma hine,andifall
theben hmarks ontribute equallyto the jobs onstituting the workloads, allthe ben hmarks
runforthesametotaltime,whateverthe ma hine's performan e.
There exists someappli ations that a tually behavelike that, i.e., they tryto do asmu h
workaspossibleduringaxedtime. Thistypi ally orrespondstosomeintera tiveorreal-time
appli ationsthat anadaptthequalityoftheiroutputasafun tionofthema hine'sperforman e.
Su happli ationsprodu ejobs thatmaybetermed onstant-time jobs. However,this on erns
aminorityofappli ations.
Most jobs, in luding bat h jobs but also intera tive and real-time jobs, are onstant-work
jobs. A onstant-workjob hasxedworkto do,whi h typi ally orrespondstoaxednumber
of program instru tions to exe ute, and the time to do this work depends on the ma hine's
performan e.
In our opinion, the impli it assumption of onstant-time jobs in the onventional "IPC
throughput"metri is the main drawba kof that metri . Repla ing raw IPCs with speedups,
as in weighted speedup, not only makes the metri in onsistent but does not solve the basi
problem: thequantityofworkdonebyajobstilldependsonthema hinesperforman e.
Tosolvethisissue,weproposeintherest ofthisstudysomenewmultiprogramthroughput
metri sthat arebasedontheassumptionthat jobsexe uteaxedand equalquantityofwork.
Ourgoalisto statetheassumptions learlyand to hara terizethemetri swepropose, unlike
whathasbeendonewithotherthroughputmetri s.
3 The EW-IPC: an equal-work IPC throughput metri
Wepropose to dene an equal-workthroughput metri formi roar hite ture studies basedon
thefollowingthroughput experiment:
(1)Theunitofworkistheinstru tion
(2) Thebehaviorof ajob issimilar to oneof the ben hmarks, hosenrandomly,and
alltheben hmarksareequallylikely
(3)Allthejobs exe uteaxed andequalnumberofinstru tions
(4)Thereisasinglejob queue,whi hisneverempty
(5)Thes hedulingpoli yis rst-inrst-out(FIFO,akarst- omerst-served)
Weassumethattheunit of work istheinstru tion,whi hisanaturalunit ofwork forthe
mi- roar hite t. Thisbrings onsisten y: alltheben hmarksaretreatedequally. Hen ethroughput
isgivenin instru tionsperunit of time, e.g., instru tionsper y le(for axed lo k y le)or
2
Theonlydieren e between ET-IPC and weighted speedupis thatthey usedierent unitsof work. The ET-IPCassumesthattheunitofworkistheinstru tion,andthatoneinstru tionfromoneben hmarkisworth oneinstru tionofanotherben hmark. Theweightedspeedupassumesthattheunitofworkforaben hmarkis theaveragenumberofinstru tionsexe utedina lo k y lebytheben hmarkwhenitrunsaloneonareferen e ma hine(hen edierentunitsofworkfordierentben hmarks).
instru tions perse ond. In the rest of this study, we onsider throughput in instru tions per
y le.
These ondassumptionmakessensebe ause,unlesswehavesomespe i informationabout
the representativeness of a ben hmark (whi h is rarely the ase), we have no reason to treat
dierentben hmarksdierently.
Thethird assumptionisthatjobsexe utea onstantwork,andthatthequantityofworkis
thesameforalljobs. Of ourse, onareal system,thereis noreasonfor alljobsto exe utethe
samequantityofwork.Butitisne essarytomakesomeassumptionswhendeningathroughput
metri . Itisrarethatwehaveanyinformationontherepresentativenessofben hmarks,andwe
havenoreasontogivemoreweighttosomeben hmarks. Thethirdassumptionisfundamental:it
isequivalenttosayingthatthenumberofinstru tionsexe utedbyarandomjobisnot orrelated
with thatjob's IPC, i.e., weassumethat,onaverage,arandom low-IPCjob exe utesasmany
instru tionsasarandomhigh-IPCjob.
Thefourthassumptionimpliesthatthejobarrivalrateissu ientlyhighsothatthroughput
islimitedbythema hineperforman e,notbythearrivalrate. Thisassumptionmakessensefor
mi roar hite turestudies,where thethroughputmetri isusedto omparedierentpro essors.
Finally,thelastassumption onsidersthatjobsarebat hjobs. Ingeneral,jobs hedulingmay
impa t throughputquitesigni antly. Wealsoexperimentedwith aper- ore job queuesand a
round-robins hedulingpoli y,forvarioustimequantaandvaryingthedegreeof
multiprogram-ming: wedidnotobserveanysigni antdieren ewithFIFOs heduling. Althoughwehaveno
mathemati alproof,we onje turethatthese ondassumption(jobsare hosenrandomlyamong
ben hmarks)is su ientforthroughputto beindependentof thes hedulingpoli y,as longas
the s hedulingpoli y isunawareof jobsIPCsandtreatsalljobs equally.
Themetri dened by this throughput experiment will be denoted EW-IPC in the rest of
the paper. Noti ethat, be ausejobs exe utean equalnumberof instru tions, the EW-IPCis
dire tlyproportionalto thejobthroughput.
3.1 The IPC matrix method
A possibleway to measure EW-IPC would be to run a single y le-a urate simulation
orre-spondingtotheEW-IPCthroughputexperimentdes ribedinSe tion3. However,itwouldtake
averylongsimulationtimefor theEW-IPC to onvergeto itslimitvalue. Inpra ti e,afaster
simulation method is possible when we have few ben hmarks and few ores: the IPC matrix
method.
Thebasi ideaisto simulateseparatelyallthepossibleworkloads,i.e.,allthepossible
om-binationsof
K
ben hmarks,andre ordthethreadsIPCsintheIPCmatrix. Inthegeneral ase, there isoneIPCmatrixper ore. Ea hoftheK
matri eshasone rowperben hmarkand one olumn for ea h ombination of ben hmarks onthe other ores. So if we haveB
ben hmarks (orben hmarkssli es)andK
logi al ores,anIPCmatrixhasB
rowsandB
K−1
olumns. For
instan e, in the parti ular ase of a symmetri dual- ore and two ben hmarks A and B, both
oreshavethesameIPC matrix,oftheform
M
=
IP C
AA
IP C
AB
IP C
BA
IP C
BB
where
IP C
XY
isthe IPCof ben hmarkX whenrunningsimultaneouslywithben hmarkY. A throughputexperiment orrespondingtotheEW-IPCmetri anberepresentedbyanexe utiontraje tory, su h as the one shown in Figure 1. With 2 ores, the exe ution traje tory is in
a 2-dimensional spa e: the oordinates
(x, y)
of a point on the traje tory indi ate how many instru tionshavebeenexe utedsofaron ores1and2respe tively.B
A
A
A
B
B
A
B
A
B
B
B
A
B
CORE 1
CORE 2
A
A
Figure1: Exampleofexe utiontraje toryonasymmetri dual- ore,assumingtwoben hmarksAand
B.JobsoftypeAorBexe utewithequalprobability. Ea hjobexe utesaxednumberofinstru tions.
At a given time,the oordinates
(x, y)
of apoint on thetraje toryindi atehow manyinstru tions havebeenexe utedso faron ores1and2respe tively. InthisexampleIP C
BA
= 2 × IP C
AB
.On e the IPC matri es havebeenpopulated, anexe ution traje tory orresponding to the
EW-IPCmetri anbeobtainedwithaMonteCarlosimulation: thetotaltime anbeobtained
bysummingthetimes orrespondingtoea hsegmentofthetraje tory. Eventually,iftheMonte
Carlo simulation is long enough, the total number of instru tions divided by the total time
providesana urateapproximationoftheEW-IPCvalue.
The IPC matrixmethod is on eptuallyvery loseto the o-phasematrixmethod used by
VanBiesbrou ketal. [15,14℄.
3.2 A throughput paradox
EW-IPCisamoremeaningfulmetri than ommonlyusedequal-timethroughputmetri s.
How-ever,it isfundamentallydierent. Being loserto arealisti situation exposestheEW-IPC to
the ounter-intuitivebehaviorsthat maybeen ounteredinrealsituations.
Toourknowledge,there isnosimpleanalyti alformulasu h as(1) for omputingEW-IPC
in the general ase. In the general ase, EW-IPCvalues must be obtainedfrom Monte Carlo
simulations. However,asimpleanalyti alformulaexists forthesimple aseof twoben hmarks
andasymmetri dual- ore. Wederivethisformulaintherestofthisse tion,anduseittoshow
thattheEW-IPCmaya tuallyde reasewhenwea elerateanindividualthread.
As explained in Se tion 3.1, theexe ution traje tory issu ient to omputethe EW-IPC.
The exe ution traje tory onsists of segments. For instan e, the traje tory of Figure 1has 4
types of segments: AA, BB, AB and BA. For instan e, a segment AB orresponds to ore 1
runningben hmark Awhile ore2isrunningben hmarkB.Forsu h symmetri dual- ore,the
slopeofthetraje toryisequalto1insegmentsAAandBB, itisequalto
IP C
BA
IP C
AB
andIP C
AB
IP C
BA
insegmentsABandBArespe tively.
ofinstru tionsexe utedonsegmentsAA,BB,ABandBA.Fromthesesegmentfra tions,we an
obtain thetotal exe ution times
t
AA
,t
BB
,t
AB
andt
BA
spent on segmentsAA, BB, AB and BA:t
AA
=
2×IP C
f
AA
AA
N
I
t
AB
=
f
AB
IP C
AB
+IP C
BA
N
I
t
BB
=
f
BB
2×IP C
BB
N
I
t
BA
=
f
BA
IP C
AB
+IP C
BA
N
I
where
N
I
isthetotalnumberofinstru tionsexe uted. Then,theEW-IPC anbe omputedasEW-IPC
=
N
I
t
AA
+ t
BB
+ t
AB
+ t
BA
=
f
AA
1
2×IP C
AA
+
f
BB
2×IP C
BB
+
f
AB
+f
BA
IP C
AB
+IP C
BA
(2)Themaindi ultyliesin omputingthesegmentfra tions. Ingeneral,segmentfra tionsmust
beobtainedfromaMonte Carlosimulation. However,intheparti ular aseoftwoben hmarks
exe utingonasymmetri dual- ore, anexa tanalyti expression anbeobtained. Noti ethat
an exe ution traje torysu h asthe onedepi ted in Figure 1 doesnot depend on
IP C
AA
andIP C
BB
. Hen ethefollowingtwoIPC matri esM
=
IP C
AA
IP C
AB
IP C
BA
IP C
BB
M
′
=
IP C
AB
IP C
AB
IP C
BA
IP C
BA
yieldtheexa tsametraje tory,hen etheexa tsamesegmentfra tions. Butmatrix
M
′
orre-sponds toa situation where theIPC of ajob doesnotdepend onwhi h job is runningonthe
other ore. In this situation,the two oresareindependent, and thesegmentfra tions anbe
omputedeasily(seetheappendix):
f
AB
= f
BA
=
1
4
f
AA
=
2
1
×
IP C
AB
IP C
+IP C
BA
BA
f
BB
=
1
2
×
IP C
AB
IP C
AB
+IP C
BA
(3)Substituting (3)into(2), weeventuallyobtaintheEW-IPC:
EW-IPC
=
4 × (IP C
AB
+ IP C
BA
)
2 +
IP C
BA
IP C
AA
+
IP C
AB
IP C
BB
(4)Formula(4)isnontrivialandleadstosurprising onsequen es. Inparti ular,theEW-IPCmay
de reasewhenwein rease
IP C
AB
(orsymmetri ally,IP C
BA
):IP C
BA
IP C
BB
−
IP C
BA
IP C
AA
> 2 =⇒
∂
EW-IPC∂IP C
AB
< 0
Forinstan e, onsider thefollowingtwoIPCmatri es:
M
1
=
1
4
2 0.2
=⇒
EW-IPC= 1
M
2
=
1
0
2
0.2
=⇒
EW-IPC= 2
FromtheIPC matri es, onemay believe thatma hine
M
1
should oermorethroughputthan ma hineM
2
, while in fa t it is the opposite. In thisexample, de reasingIP C
AB
from 4 to 0 doublestheEW-IPC!ByfreezingajoboftypeAwhentheother orerunsajoboftypeB,wein reasethejob throughput, whi h is ounterintuitive. Letusanalyzewhat is happening. On
bothma hines,workloadBBisaworkloadtypethatwewouldliketoavoidasmu haspossible
be ausein this asetheIPC isverysmall. Onma hine
M
2
,ajob oftypeA annotterminate while the other ore runs a job of type B. Theonly way for aworkloadBB to o uris whenexa tlyatthe sametime(whi his quiteimprobable). Hen e onma hine
M
2
, workloadBB is inexistentinthelimit( f. equations(3)).ThisexampledoesnotmeanthattheEW-IPCmetri istheproblem. Thiskindofsituation,
whereaseeminglyobviousperforman eimprovementa tuallyde reasesthroughput,andwhi h
we all a throughput paradox, an happen in reality. However, the next se tion shows that
throughputparadoxesareunlikelyin pra ti e.
4 Pra ti al proxies for the EW-IPC
TheIPC matri es orresponding to mi roar hite ture studies are likely to have anon-random
stru ture. Ea h row of the IPC matrix gives the IPCs for one parti ular ben hmark for all
possible onditionsunder whi hthis ben hmark may run. Some ben hmarks are"insensitive":
their IPC is almost independent of the threads that are running on the other ores. Other
ben hmarksare"sensitive": theirIPC variesmoreorlessdependingonthe o-runningthreads.
Intheparti ular asewhenalltheben hmarksareinsensitive,theEW-IPC anbe omputed
easily. Let
IP C
i
[b]
bethe IPC of ben hmarkb
on orei
( oresare notne essarily identi al), whi his independentofthejobsrunningontheother ores(asben hmarksareassumedinsen-sitive). WhenweruntheEW-IPCthroughputexperimentforalongtime
T
, orei
exe utesN
i
instru tions,andea hoftheB
ben hmarks ontributesnearlyN
i
/B
instru tionsonthat ore. Hen eforall oresi ∈ [1, K]
,T =
B
X
b=1
N
i
/B
IP C
i
[b]
⇐⇒
N
i
=
B × T
P
B
b=1
IP C
1
i
[b]
TheEW-IPCis1
T
P
K
i=1
N
i
. SubstitutingN
i
,wegetEW-IPC
=
K
X
i=1
B
P
B
b=1
IP C
1
i
[b]
(5)whi his thesumonall oresoftheharmoni meanofben hmarksIPCs.
4.1 Denition of the H-IPC
An interesting question to study is whether we an generalize formula (5) in the ase where
ben hmarksaresensitiveandobtainanapproximationoftheEW-IPCthatwouldnotne essitate
MonteCarlosimulations. An obviousgeneralizationoftheharmoni meanin (5)is
H-IPC
=
K
X
i=1
B
K
P
(b,w)
IP C
1
i
[b,w]
(6)where
IP C
i
[b, w]
istheIPCvalueintheb
th
rowand
w
th
olumnoftheIPCmatrixof ore
i
,and(b, w)
runsoverallrowsand olumns. Thatis,theH-IPCisobtainedbysummingtheharmoni meansof theIPC matri es.Wedistinguishtwosortsofmulti ores: symmetri andasymmetri . Wesaythatamulti ore
issymmetri whenallthe oreshaveequivalentIPCmatri es 3
. Amulti orethatisnotsymmetri
is asymmetri . Forinstan e, most general-purpose multi ore pro essors today are symmetri ,
in ludingthosewhosephysi al oresareSMT.
3
IP C[b, w] = min(3, max(0.1, BIP C[b] × S[b, w]))
Randomvariable
S[b, w]
normallydistributedwithmeanequalto1BIP C[b] = 0.1 + R[b] × 2.9
forb ∈
[1, B]
Randomvariable
R[b]
uniformin[0, 1]
Figure2: IPC matrixmodel
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.2 0.4 0.6 0.8
1
1.2 1.4 1.6 1.8
2
CDF
speedup
stdev=0.05
stdev=0.1
stdev=0.2
stdev=0.3
Figure 3: IPC matrix model: umulative distribution fun tion of the normally-distributed speedup
S[b, w]
fordierent valuesofthestandarddeviation STDEV.Forsymmetri multi ores,theH-IPCformula anbesimplied:
H-IPC
=
K × B
K
P
(b,w)
IP C[b,w]
1
(7)
UnliketheEW-IPC,theH-IPCisnotasso iatedwithathroughputexperimentanditsphysi al
meaning an only be approximate. Nevertheless, we onje ture that the H-IPC an be used
as a proxy for the EW-IPC in most pra ti al situations. Ideally, we would like to he k this
onje turebydoingmanydierentmi roar hite turestudiesandby he kingthattheEW-IPC
andtheH-IPCleadto thesame on lusions. However,wewouldbelimitedinpra ti etoafew
mi roar hite turestudieswith smallIPCmatri es,from whi hit wouldbedi ultto on lude
anythingorgainanyinsight. Instead,weusethearti ialIPCmatrixmodeldes ribedinFigure
2. In this model,
S[b, w]
represents aspeedup following a normaldistribution of mean 1 and whose standarddeviationSTDEVis aparameterthat wewillvary. Forsymmetri multi ores,weenfor e
S[b, w
1
] = S[b, w
2
]
forworkloadsw
1
andw
2
that area tuallythe sameben hmarks multiset.STDEVquantiesben hmarkssensitivity: thehigherSTDEV,themoresensitivethe
ben h-marks. Figure 3 shows the umulative distribution fun tion of
S[b, w]
for dierent values of STDEV.Forinstan e, with STDEV=0.3,thereisaprobabilityofabout50%thatthespeedupislessthan0.8orgreaterthan1.2. Inthe ontextofmi roar hite turestudies,this orresponds
0.000
0.005
0.010
0.015
0.020
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.05
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.005
0.010
0.015
0.020
0.025
0.030
0.035
0.040
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.1
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.010
0.020
0.030
0.040
0.050
0.060
0.070
0.080
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.2
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.020
0.040
0.060
0.080
0.100
0.120
0.140
0.160
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.3
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
Figure 4: Coe ient of variation (lower is better) of
R =
H-IPC EW-IPC
for symmetri multi ores, for
dierentvaluesofSTDEV.
OurgoalistounderstandtowhatextenttheH-IPC anbeusedasaproxyfortheEW-IPC.
This doesnot ne essarilymean that the H-IPC valuesshould be lose to theEW-IPC values.
TheH-IPCandEW-IPCareequivalentif,foranytwoma hines XandY,
H-IPC
[Y ]
H-IPC[X]
≈
EW-IPC[Y ]
EW-IPC[X]
(8)
In words, the two metri s must indi ate approximately the same speedup. Condition (8) is
equivalentto H-IPC
[Y ]
EW-IPC[Y ]
≈
H-IPC[X]
EW-IPC[X]
(9)That is, the H-IPC and EW-IPC are equivalent if the ratio
R =
H-IPC EW-IPC
is approximately
onstant(not ne essarilyequalto1). Toquantify theequivalen ebetweenthetwometri s,we
generate1000IPCmatri esa ordingtotheIPCmatrixmodelofFigure2. Forea hIPCmatrix,
we ompute
R
(weobtaintheEW-IPCwithMonteCarlosimulations). Eventually,the oe ient ofvariationCV
R
= σ
R
/µ
R
(withµ
R
andσ
R
respe tivelythemeanandstandarddeviationofR
) quantiestheextenttowhi hthetwometri sareequivalent. Weobservedexperimentallythatthefra tion
R ∈ [µ
R
−
2σ
R
, µ
R
+ 2σ
R
]
isgreater than0.93. Forinstan e, aCV
R
of 1%means thatif we omparetwomi roar hite turesXand Y,and iftheH-IPC ofY is10%higherthantheH-IPC ofX,thentheEW-IPCofYislikelybetween8%and12%higherthantheEW-IPC
ofX.
Results arepresented in Figure 4forsymmetri multi oresand in Figure 5forasymmetri
0.000
0.002
0.004
0.006
0.008
0.010
0.012
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.05
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.005
0.010
0.015
0.020
0.025
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.1
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.005
0.010
0.015
0.020
0.025
0.030
0.035
0.040
0.045
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.2
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
0.000
0.010
0.020
0.030
0.040
0.050
0.060
0.070
0.080
0.090
2 cores
4 cores
6 cores
coefficient of variation of R
STDEV = 0.3
2 benchmarks
4 benchmarks
10 benchmarks
20 benchmarks
Figure 5: Coe ientof variation (loweris better) of
R =
H-IPC EW-IPC
for asymmetri multi ores, for
dierent valuesofSTDEV.
0
2
4
6
8
10
12
14
16
18
0
2
4
6
8
10
12
14
16
18
H-IPC
EW-IPC
6 cores, 2 benchmarks
0
2
4
6
8
10
12
14
0
2
4
6
8
10
12
14
16
H-IPC
EW-IPC
6 cores, 4 benchmarks
Figure 6: S atterplots withtheEW-IPC on thexaxis andthe H-IPCon theyaxis (1000points),
forasymmetri 6- ore,assumingSTDEV=0.3. Theleftplotisfor2ben hmarks,andtherightplot
ben hmarks
B
, and the IPC matrix model STDEV 4. Figure 6showssome s atter plots for a
oupleof ases(symmetri 6- ore,STDEV=0.3,2and4ben hmarks). Severalobservations an
bemade,someofwhi hnotobvious: Asexpe ted,
CV
R
in reaseswithSTDEV,thatis,themore sensitivetheben hmarks,thegreaterthedis repan ybetweentheH-IPCandtheEW-IPC;CV
R
de reaseswhenthenumberofben hmarksisin reased;Ex eptforsymmetri multi oreswith2ben hmarks,
CV
R
de reaseswhenwein reasethenumberof ores;CV
R
islowerforasymmetri multi ores.In pra ti e,theH-IPC anbeused asathroughput metri in mi roar hite turestudies, in
lieuoftheharder-to-obtainEW-IPC.Forinstan e,ifweuse20ben hmarkstostudyasymmetri
dual- ore,andiftheben hmarksaremoderatelysensitive(STDEV
< 0.1
),CV
R
islessthan1% and the H-IPC an reasonably be used to establish throughput dieren es of a few per ents.Still,whatweshowwiththeH-IPCisaperforman eunderworkloadsinvolvingseveraldierent
behaviors.Under ertainworkloads,inparti ularworkloadsinvolvingasmallnumberofdistin t
appli ationswitha"monolithi "behavior,orappli ationswithsimilar behaviors,theobserved
performan e might be quite dierent from the on lusions obtainedon the whole ben hmark
suite( f. throughputparadoxes,Se tion 3.2). Itwouldbepossibleto showseveralthroughput
numbers,onethroughputnumberforthewholeben hmarksuite,andalsothroughputnumbers
forsubsetsofben hmarks,e.g.,ben hmarkspairs. However,theH-IPCis notasafemetri for
studyingsensitiveben hmarkpairs,asitmightleadto on lusionshavingnophysi alreality.
4.2 Sampling the IPC matrix: the ASH-IPC and SSH-IPC
Sofar,wehaveassumedthattheIPCmatrixisfullypopulated. However,inpra ti al
mi roar- hite ture studies,it isoftennot possibleto lltheIPC matrix ompletely be ausethis would
requiretoomanysimulations. Afrequentpra ti eisto simulateasampleofseveraltens,
some-times afew hundreds of dierent workloads. A sampled versionof theH-IPC an be dened
asfollows. Let
W
be thenumberof workloadsin the sampleand letIP C(w, i)
bethe IPC of thethread on orei
in thew
th
workloads. Wedene theasymmetri sampledH-IPC,denoted
ASH-IPC,as ASH-IPC
=
K
X
i=1
W
P
W
w=1
1
IP C(w,i)
(10)Thatis,theASH-IPCisthesumofper- oreharmoni meansofIPCs. Inthe aseofasymmetri
multi ore,theIPCs ofall ores omefromthesameIPCmatrix,andwe anusethesymmetri
sampledH-IPC,denotedSSH-IPC:
SSH-IPC
= K ×
W × K
P
K
i=1
P
W
w=1
IP C(w,i)
1
(11)Thatis,theSSH-IPCistheglobalharmoni meanofIPCs timesthenumberof ores.
The oe ient of variation of
ASH-IPC H-IPC
and
SSH-IPC H-IPC
quanties the extent to whi h the
ASH-IPC and SSH-IPC give the same on lusions asthe H-IPC. As in Se tion 4.1, we dene
1000IPCmatri esa ordingtothemodelofFigure2,forsymmetri andasymmetri multi ores.
Results are shown in Figures 7and 8 for asymmetri and asymmetri 4- ore respe tively,
assuming20ben hmarks,STDEV=0.05 andSTDEV=0.3. Theleftgraphsshowthe oe ient
ofvariationof
ASH-IPC H-IPC
asafun tion ofthenumberofworkloadsin thesample, andtheright
4
Fortheasymmetri 6- ore ase,wedonothaveresultsfor20ben hmarksbe ausetheIPCmatri esarevery big,whi hmakessimulationsveryslow. Forsymmetri multi ores,weuseamu hmore ompa trepresentation oftheIPCmatrix,takingadvantageoftheredundan iesinit.
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
ASH-IPC, STDEV=0.05
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
SSH-IPC, STDEV=0.05
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
ASH-IPC, STDEV=0.3
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
SSH-IPC, STDEV=0.3
random
global balance
per-core balance
Figure7: Coe ientofvariationof
ASH-IPC H-IPC
(leftgraphs)and
SSH-IPC H-IPC
(rightgraphs)fora
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
ASH-IPC, STDEV=0.05
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
SSH-IPC, STDEV=0.05
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
ASH-IPC, STDEV=0.3
random
global balance
per-core balance
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
SSH-IPC, STDEV=0.3
random
global balance
per-core balance
Figure 8: Coe ient of variation of
ASH-IPC H-IPC
(left graphs) and
SSH-IPC H-IPC
(right graphs) for an
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0
50
100
150
200
250
300
350
400
Coefficient of variation
sample size W
STDEV=0.05
STDEV=0.1
STDEV=0.2
STDEV=0.3
Figure 9: Coe ient of variation of
S =
ASH-IPC EW-IPC
for a symmetri 4- ore, with 20 ben hmarks,
usingper- orebalan edworkloads.
graphsshowthe oe ient of variationof
SSH-IPC H-IPC
. For ea h graph, we show 3 urves, ea h
orrespondingtoadierentwaytosele ttheworkloads:
Random. Workloadsare hosen ompletelyrandomly.
Global balan e. Among the total
W × K
threads onstituting theW
workloads, ea h ben hmarko ursK × W/B
times,W
beingamultipleofthenumberB
ofben hmarks. Per- orebalan e. Ea hben hmarko urs
W/B
timeson ea h ore,withW
amultiple ofB
.A rst observation is that balan ing the ben hmarks de reases signi antly the oe ient of
variation. Forasymmetri multi ores, per- ore balan e is the best of the3 workloadsele tion
methods. Forsymmetri multi ores,per- orebalan eisbetterthanglobalbalan eforthe
ASH-IPC.Forsymmetri multi ores,theSSH-IPCisnearlyequivalenttotheASH-IPCunderper- ore
balan e,anditisbetterthantheASH-IPC underglobalbalan eandunderrandomworkloads.
Forasymmetri multi ores,asexpe ted,theASH-IPCissuperiortotheSSH-IPC.In on lusion,
were ommendusing per- ore balan ed workload samplesand the ASH-IPCor, forsymmetri
multi ores,theSSH-IPC.
Therequiredsample size depends onben hmarks sensitivity andon themi roar hite tures
being ompared.Ifben hmarkshaveasmallsensitivity(STDEV
< 0.05
)andiftheperforman e dieren ebetweenthetwomi roar hite turesissigni ant(afewper ents),takingafewtensofworkloadsisok. Butiftheperforman edieren ebetweenthemi roar hite turesislessthan1%,
severalhundredsofworkloadsmaybene essarytoquantifypre iselytheperforman edieren e.
Figure9showsthe oe ientofvariation
CV
S
ofS =
ASH-IPC EW-IPC
forasymmetri 4- orewith20
ben hmarks, using per- ore balan ed workloads. The numberof workloads ne essaryto make
CV
S
1%orsmallerin reaseswithben hmarkssensitivity(i.e.,withSTDEV).ForSTDEV=0.05, 20 workloadsare su ient forhavingCV
S
.
0.01
, but weneed 60workloadsfor STDEV=0.1 and400workloadsforSTDEV=0.2.workload single-thread IPCSoE IPCSoE
referen eIPC fairness=0.25 fairness=1
mgrid+art 0.31;0.42 0.28;0.19 0.25;0.19 lu as+applu 0.78;0.75 0.56;0.50 0.56;0.50 galgel+g 0.22;1.31 0.06;1.03 0.11;0.78 vortex+gzip 1.03;1.36 0.25;1.03 0.56;0.67 g +eon 0.56;1.44 0.08;1.25 0.33;0.92 gap+vpr 1.36;1.19 1.11;0.31 0.78;0.56 vpr+eon 1.19;1.72 0.22;1.33 0.58;0.83 apsi+swim 2.06;0.28 1.81;0.06 1.31;0.19 mgrid+mgrid 0.61;0.61 0.50;0.50 0.47;0.44 bzip2+bzip2 1.92;1.92 1.08;1.03 1.06;1.06
Table2: IPCnumbers reprodu edapproximatelyfromFigure6inreferen e[3℄
5 Impli ations for the mi roar hite t
From a mathemati al point of view, the dieren e between an equal-time throughput metri
su h asthe ET-IPC andan equal-work throughputmetri su h astheH-IPC isthe dieren e
betweenanarithmeti meanandaharmoni mean. Mi roar hite turete hniquesthatin reaseall
threadsIPCswilllikelybeassessedsimilarlybytheET-IPCandtheH-IPC,atleastqualitatively.
Howeverthesituationismore omplexwhensomeIPCsarein reasedwhileothersarede reased,
whi hhappensforinstan ewhen omparingseveraldierentwaystoarbitratesharedresour es.
Using the ET-IPC favors solutions that make high IPCs higher, even if this makes low IPCs
lower. In parti ular, if a high-IPCthread is in oni t with a low-IPC thread for aresour e,
theET-IPCtendstofavorsolutionsthat giveprioritytothehigh-IPCthread. Contrarytothe
ET-IPC, the H-IPC favors solutionsthat tend to equalize the threads IPCs. That is, when a
high-IPCthread is in oni t for aresour ewith a low-IPC thread,the H-IPC tends to favor
solutionsthat givepriorityto thelow-IPCthread.
The following example is taken from a study by Gabor et al. published in 2006 [3℄. In
thisstudy,theauthorsproposeame hanismfor ontrollingfairnessinpro essorsimplementing
Swit h-on-Event (SoE) multithreading, aka oarse grained multithreading. With SoE
multi-threading,asinglethreadisrunningatagiventime,theotherthreadsarewaiting. Uponalong
laten yevent,likealast-level a hemiss,theexe utionswit hestoanotherthread. Thispermits
hiding(tosomeexent)the a hemisslaten y. However,theauthorsshowthatanaive
implemen-tationofSoEthatswit hesthreadsonevery a hemissmaybeveryunfairtosmall-IPCthreads
thatexe utefewinstru tionsbetween onse utivemisses. Therefore,theyproposeame hanism
for ontrollingfairness. Therelativeperforman eof athread isthethread'sIPC in SoEmode
dividedbythethread'sIPC whenrunningalone. Theydenefairnessastheminimumrelative
performan e divided by the maximumrelative performan e. In the proposed me hanism, the
fairnesstarget is set between 0(no fairnessenfor ement)and 1(strong fairness enfor ement),
and the me hanism introdu es extra thread swit hes in su h a way that the a tual fairness is
lose to the fairness target. One of the on lusions they draw from their experiments is that
enfor ingfairnessrequirestosa ri esomethroughput. Butthethroughputmetri theyusedis
theET-IPC.
Table 2 shows the IPC number that we have measured approximately from Figure 6 in
ET-IPC weighted harmoni mean ASH-IPC SSH-IPC
speedup ofspeedups
fairness=0.25 1.32 1.17 0.41 0.51 0.49
fairness=1 1.22 1.19 0.58 0.82 0.81
ratio 0.92 1.02 1.40 1.61 1.65
Table 3: Throughputmetri sappliedontheIPCnumbersofTable2
SoE pro essor, for dierent valuesof the fairness target 5
. We givein Table 3the throughput
numbersforseveral metri sappliedontheIPCs of Table2. A ording tothe ET-IPC(se ond
olumn),in reasingthefairnesstargetde reasesthroughputmoderately,whi his onsistentwith
the on lusions of [3℄. Theauthors of [3℄ wantedto distinguish learly notionsof throughput
and fairness, this is why they did not use the weighted speedup and the harmoni mean of
speedups. Nevertheless,weshowthese twometri sin thethird and fourth olumnssin e they
are still frequently used. Going from a fairness target of 0.25 to a fainess target of 1 keeps
weightedspeedupalmostthesame. However,afairnesstargetof1in reasestheharmoni mean
ofspeedupsbyroughly40% omparedtoafairnesstargetof0.25. Thelasttwo olumnsofTable
3 give the ASH-IPC and the SSH-IPC. A ording to these equal-work throughput metri s, a
fairnesstargetof1yieldsabout60%-65%morethroughputthanafairnesstargetof0.25.
Equal-workthroughputisin reasedbe auseenfor ingfairnessin reasesthefra tionof pro essortime
allottedtosmall-IPCthreads,therebyin reasingthesethreadsIPCs. Hadtheauthorsof[3℄used
theASH-IPC orthe SSH-IPC,theywouldhave on ludedthat notonlydoestheirme hanism
allowto ontrolfairness,but itin reasesthroughputquitesubstantially.
6 Summary and on lusion
Thethroughputmetri susedinmi roar hite turesudiessofarareequal-timethroughputmetri s
basedontheassumptionthatallthejobsexe uteforaxedandequaltime. Wearguethatthis
assumption is not realisti , and we advo ate for equal-work throughput metri sbased on the
assumptionthattheIPCofarandomjobisnot orrelatedwiththetotalnumberofinstru tions
exe utedby thatjob. Wehaveintrodu ed su h equal-workthroughput metri , alled the
EW-IPC,whi hisasso iatedwithathroughputexperimentandhen ehasa learphysi almeaning.
We have shown that the EW-IPC ould lead to throughput paradoxessu h that a elerating
anindividual threadmightde rease throughput. Be ause thereis nosimpleanalyti alformula
fortheEW-IPCin thegeneral ase,weintrodu edtheH-IPC,asimpleproxyfortheEW-IPC.
Using an IPC matrix model, we have studied to what extent the H-IPC is a good proxy for
the EW-IPC. We have onsidered the pra ti al ase where only a relatively small number of
workloadsare simulated,and wehave proposed twodierent estimatorsfor theH-IPC in this
ase, the ASH-IPC and the SSH-IPC. We have shown that these estimators work best when
all the ben hmarksare equally representedin thesample workloads. We haveshown with an
examplethat using anequal-workthroughput metri su h astheH-IPC may leadto on lude
dierently in a mi roar hite ture study. While equal-time throughput metri s tend to favor
mi roar hite turesthat a elerate faster threads,equal-work throughput metri stend to favor
mi roar hite turesthatmakethreadsperforman emoreuniform. In on lusion,were ommend
5
Figure6in[3℄showsonly10ofthe16workloadstheyhaveusedfortheirstudy.Themissingworkloadsare homogeneousworkloadsthatareslightlyimpa tedbyfairnessenfor ement.
that mi roar hite turestudies on erned with multiprogram throughput usethe ASH-IPC or,
forsymmetri multi ores,theSSH-IPC.
Appendix
Letus assumethattheIPC of arunningjob doesnotdependon thejobrunningon theother
ore. Consequently, thetwo oresare independent. We denote
IP C
A
andIP C
B
theIPCs of jobs of type A and B respe tively. LetP
A
andP
B
= 1 − P
A
be the probabilities that, at a randomtime, agiven oreisexe uting ajob of typeA orBrespe tively. Inthelimit,be auseoftheassumptionsfortheEW-IPC(Se tion3),agiven oreexe utesasmanyinstru tionsfrom
jobsoftypeAandB,whi hmeans
P
A
×
IP C
A
=
P
B
×
IP C
B
=
(1 − P
A
)IP C
B
Solvingthis equationfor
P
A
,wegetP
A
=
IP C
B
IP C
A
+ IP C
B
(12)P
B
=
IP C
A
IP C
A
+ IP C
B
(13)TheaveragetotalIPCis
IP C = 2 × (P
A
×
IP C
A
+ P
B
×
IP C
B
) =
4IP C
A
IP C
B
IP C
A
+ IP C
B
(14)
Be ause the two ores are independent, the probabilities an bemultiplied. Forinstan e, the
probabilityto beonasegmentof typeABatarandomtime (thatis, ore1is exe utingajob
oftypeAwhile ore2isexe utingajoboftypeB)isequalto
P
A
×
P
B
. Thesegmentfra tions are:f
AA
=
P
2
A
×
2 × IP C
A
IP C
f
BB
=
P
2
B
×
2 × IP C
B
IP C
f
AB
= f
BA
=
P
A
P
B
×
(IP C
A
+ IP C
B
)
IP C
Repla ing
P
A
,P
B
andIP C
withtheexpressionsinequations(12),(13)and(14),yieldsf
AB
= f
BA
=
1
4
f
AA
=
1
2
×
IP C
B
IP C
A
+ IP C
B
f
BB
=
1
2
×
IP C
A
IP C
A
+ IP C
B
Referen es[1℄ A.Carlton. CINT92and CFP92 homogeneous apa ity methodoers fair measureof pro essing
[2℄ P. J. Fleming and J. J.Walla e. How not to lie withstatisti s: The orre t way to summarize
ben hmarkresults. Communi ationsof theACM,29(3):218221,Mar.1986.
[3℄ R.Gabor,S.Weiss,andA.Mendelson.Fairnessandthroughputinswit honeventmultithreading.
InPro .ofthe39thAnnual InternationalSymposiumonMi roar hite ture,2006.
[4℄ M. F.IqbalandL. K.John. Confusionby allmeans. InWorkshoponUnique ChipsandSystems
(UCAS),2010.
[5℄ J. R. Mashey. War of the ben hmark means : Time for a tru e. ACM SIGARCH Computer
Ar hite ture News,32(4),Sept.2004.
[6℄ P. Mi haud. Demystifying multi ore throughputmetri s. IEEE Computer Ar hite ture Letters,
Aug.2012.
[7℄ S.Parekh,S.Eggers,H.Levy,andJ.Lo.Thread-sensitives hedulingforSMTpro essors.Te hni al
ReportUW-CSE-00-04-02,UniversityofWashington,2000.
[8℄ M.K.QureshiandY.N.Patt.Utility-based a hepartitioning: Alow-overhead,high-performan e,
runtimeme hanismto partition shared a hes. In Pro . otthe International Symposium on
Mi- roar hite ture,2006.
[9℄ Y.SazeidesandT.Juan.Howto omparetheperforman eoftwoSMTmi roar hite tures.InPro .
oftheIEEE InternationalSymposiumonPerforman e AnalysisofSoftwareandSystems, 2001.
[10℄ A.SnavelyandD.M.Tullsen. Symbioti jobs hedulingforasimultaneousmultithreading
ar hite -ture. InPro .oftheInternationalConferen eonAr hite turalSupportforProgrammingLanguages
andOperatingSystems, 2000.
[11℄ A. Snavely,D.M. Tullsen,and G.Voelker. Symbioti jobs heduling withprioritiesfora
simulta-neousmultithreadingpro essor. InPro .oftheACM SIGMETRICSInternational Conferen eon
Measurement andModelingof ComputerSystems,2002.
[12℄ D.M.Tullsen,S.J.Eggers,J.S.Emer,H.M.Levy,J.L.Lo,andR.L.Stamm. Exploiting hoi e:
instru tionfet handissueonanimplementablesimultaneousmultithreadingpro essor.InPro .of
the23rdAnnual InternationalSymposiumonComputerAr hite ture,1996.
[13℄ D. M.Tullsen, S.J.Eggers,and H.M. Levy. Simultaneousmultithreading: Maximizingon- hip
parallelism.InPro .ofthe22ndAnnualInternationalSymposiumonComputerAr hite ture,1995.
[14℄ M. VanBiesbrou k, L.Ee khout, andB.Calder. Considering allstartingpointsforsimultaneous
multithreadingsimulation.InPro .oftheIEEEInternationalSymposiumonPerforman eAnalysis
ofSystemsandSoftware,2006.
[15℄ M.VanBiesbrou k,T.Sherwood,andB.Calder. A o-phase matrixtoguidesimultaneous
multi-threadingsimulation. InPro .of theIEEE InternationalSymposiumonPerforman e Analysisof