Constant-work multiprogram throughput metrics for microarchitecture studies

(1)

HAL Id: hal-00758195

https://hal.inria.fr/hal-00758195

Submitted on 28 Nov 2012

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

microarchitecture studies

Pierre Michaud

To cite this version:

Pierre Michaud. Constant-work multiprogram throughput metrics for microarchitecture studies.

[Re-search Report] RR-8150, INRIA. 2012. �hal-00758195�

(2)

0

2

4

9 -6

3

9

9 IS

R

N

IN

R

IA

/R

R

--8

1

5

0 --F

R

+

E

N

G

RESEARCH

REPORT

N° 8150

November 2012

Constant-work

multiprogram

throughput metrics for

microarchitecture studies

Pierre Michaud

(3)

(4)

RESEARCH CENTRE

RENNES – BRETAGNE ATLANTIQUE

Pierre Mi haud

Proje t-TeamALF

Resear hReport n°8150November201221pages

Abstra t:

Multi orepro essors animproveperforman ebyde reasingtheexe utionlaten yofparallel

pro-grams,orbyin reasingthroughput,i.e.,thequantityofworkdoneperunitoftimewhenexe uting

independenttasks. Throughputisnotne essarilyproportionalto thenumberof oresand anbe

impa tedsigni antlybyresour esharingin several parts of themi roar hite ture. Quantifying

theimpa tofresour esharingonthroughputrequiresathroughputmetri . Amajorityof

mi roar- hite turestudiesuseequal-timethroughputmetri s,su hasIPCthroughputorweightedspeedup,

thatarebasedontheimpli itassumptionthatallthejobsexe uteforaxedandequaltime. We

argue that this assumption is not realisti . We propose and hara terize some new throughput

metri sbasedontheassumptionthat jobs exe uteaxed andequalquantityof work. Weshow

that using su h equal-work throughputmetri may hangethe on lusionofami roar hite ture

study.

(5)

Résumé: Les pro esseursmulti- oeursaugmententlesperforman esen réduisantle temps

d'exé utiondesprogrammesparallèleouenaugmentantledébit, 'est-à-direlaquantitéde

tra-vail ee tuée par unité de temps lorsqu'on exé ute des tâ hes indépendantes. Le débit n'est

pastoujoursproportionnelaunombrede oeurs,il peutdépendrefortementdes onitssurles

ressour espartagéesdelami roar hite ture. Lesmétriquesdedébitpermettentau

mi roar hi-te tedequantierl'impa tde es onitssurledébit. Unemajoritéd'étudesenmi roar hite ture

utilisentdesmétriquesdedébit tellesqueledébit d'instru tionsoulasommedes a élérations

qui sont basées sur l'hypothèse impli ite que toutes les tâ hes s'exé utent pendant des temps

onstants et égaux. Nous pensons que ette hypothèse n'est pas réaliste. Nous proposons et

ara térisonsde nouvellesmétriques de débit basées sur l'hypothèse que les tâ hes exé utent

des quantités de travail onstantes et égales. Nous montrons que l'utilisation de es nouvelles

métriquesdedébitpeuvent hangerles on lusionsde ertainesétudes.

(6)

1 Introdu tion

Multi ore pro essors an improve performan e in two ways: they an de rease the exe ution

laten yofparallelprograms,orthey anin reasethroughput,i.e.,thequantityofworkdoneper

unit of timewhen exe utingmultiprogramworkloadsmade ofindependent threads. Whilethe

importan e of parallelprograms in the urrentorfuture softwaree osystemmaybedis ussed,

thebenetofhigherthroughputisundisputableandhasbeenmostlyvisibleindata entersand

intheserversideoftheinternet. However,threadsrunningsimultaneouslyonamulti oreshare

someresour es like the last-level a he, pin bandwidth, hip thermaldesign power, et . With

simultaneous multithreading (SMT), threads share ore mi roar hite ture stru tures: level-1

a hes,bran h predi tors, physi al registers,s hedulinglogi , exe utionunits, et . Be ause of

resour esharing, throughputis not ne essarilyproportionalto thenumber ofthreads running

simultaneously. Moore'slawisexpe tedto ontinuein thenear future,andwithitthegrowing

of the number of ores on a single hip, leading to so- alled "many ore" hips. Commer ial

many ore hips existing so far exploit data parallelismand havefound an important ni he in

graphi s pro essing,i.e., mostlyonthe lientsideof theinternet. Whatis not leariswhether

hips with several hundredsof general-purpose oreswill beuseful on theserverside. Indeed,

someof thesharedresour esareunlikelyto growlinearlywiththe numberof ores, espe ially

pinbandwidth. Mi roar hite tsstudyinggeneral-purposemulti oresormany oresareinterested

in quantifying throughput, to understand, and possibly de rease, the performan e loss due to

resour esharing.

Weargueinthispaperthatthemultiprogramthroughputmetri sthatare ommonlyusedfor

mi roar hite turestudiesarede ient. Weproposenewthroughputmetri saimedat orre ting

thesede ien ies.

Thispaperisorganizedasfollows. InSe tion2welistthemostfrequentlyusedmultiprogram

throughput metri s and we explain why, in our opinion, they are de ient. We introdu e in

Se tion3anewthroughputmetri ,theEW-IPC,whi hisbasedontheassumptionofequal-work

jobs. WeshowthattheEW-IPCmayleadtothroughputparadoxes:a eleratingathreadmight

sometimesde reasethroughput. Be ausethereisnosimpleanalyti alformulafortheEW-IPC,

weintrodu einSe tion4somepra ti alproxiesfortheEW-IPC:theH-IPC,ASH-IPCand

SSH-IPC. We dis ussin Se tion 5thepossibleimpli ationsof using equal-workthroughput metri s

and we show with an example that the on lusions of a study may hange dramati ally. We

on ludethisstudybyre ommendingtousetheASH-IPCortheSSH-IPCinmi roar hite ture

studies on ernedwithmultiprogramthroughput.

2 Multiprogram throughput metri s are broken

Multiprogramthroughputisgenerallymeasuredby onsideringasetofsingle-threadben hmarks

and running simultaneouslysome ombinations of ben hmarks, alled workloads in the rest of

thisstudy. Ingeneral,ifthepro essor anrunupto

K

threadssimultaneously,ea hworkloadis a ombinationof

K

ben hmarks(not ne essarilydistin t). Itis ustomaryin mi roar hite ture studiestodeneaxedsetofworkloadsandtosimulateea hworkloadseparatelyforaxedtime

interval. Fromsimulations,aper-threadperforman enumberisobtainedforea hworkload,and

aglobalthroughputnumberisobtainedbyaggregatingtheper-workloadperforman enumbers.

Obtainingaglobalthroughputnumberisimportantbe ausethis issometimestheonlyway

tode idewhetherame hanismshouldbeimplementedornot.Indeed,multiprogramthroughput

metri sareoftenusedtostudyarbitrationme hanismsinmi roar hite tureswhereoneorseveral

resour esaresharedbetweenthreads runningsimultaneously. Dierentarbitrationme hanisms

(7)

modify the arbitration me hanisms in su h a way that ertain threads are a eleratedat the

ostofslowingdownsomeotherthreads. Lookingatindividualthreadsorindividualworkloads

mayyieldsomeinsightwhendoingastudy,buteventuallyitisglobalthroughputthatpermits

de idingwhi harbitrationme hanismshouldbeimplemented.

Sofar,thethreemostfrequentlyusedmultiprogramthroughputmetri sinmi roar hite ture

studiesaretheso- alledIPCthroughput,weightedspeedupandharmoni meanofspeedups. These

threemetri sdierinhowtheydenetheper-workloadperforman e.

2.1 The ET-IPC (aka IPC throughput)

IPCthroughputdenestheperforman eofagivenworkloadasthesumoftheIPCs(instru tions

exe uted per y le) of ea h thread in the workload. Then, global throughput is obtained by

averagingtheper-workloadperforman es:

ET-IPC

=

1 W

W

X

w=1

K

X

i=1

IP C(w, i)

(1)

where

K

isthenumberoflogi al ores,

W

isthenumberofworkloadsand

IP C(w, i)

istheIPC ofthethreadrunningon ore

i

inthe

w

th

workload. Intherestofthisstudy,wedenotetheIPC

throughput the equal-time IPC (ET-IPC), as ea h workloadis runfor a xed and equal time

interval. TheET-IPCwasrstusedinthe1990'sforstudyingSMTmi roar hite tures[13,12℄,

andisstilloneofthethreemostpopularthroughputmetri s. However,blindlymaximizingthe

ET-IPC mayleadto implement resour earbitrationme hanismsthat favorhigh-IPCthreads,

penalizing low-IPC ones [12℄. This led to the denition of alternative throughput metri s, in

parti ulartheweightedspeedup[10,7, 9,11℄.

2.2 The weighted speedup and harmoni mean of speedups are

in on-sistent

Instead of raw IPCs orexe ution times, resear hersoften prefer to onsider speedups, that is,

relativeperforman e[5℄. Weightedspeedupdenesper-workloadperforman easthesumofthe

speedupsofea hthread onstituting theworkload,that is,theperforman eofworkload

w

is:

K

X

i=1

IP C(w, i)

IP C

ref

(w, i)

where

IP C

ref

(w, i)

isareferen eIPCforthethreadrunningon ore

i

inthe

w

th

workload(for

instan e,

IP C

ref

(w, i)

maybedenedastheIPCofthethreadwhenitrunsaloneonareferen e ma hine).

Weightedspeedupisoneofthethreemostfrequentlyusedmultiprogramthroughputmetri s.

However,wehaveshowninare entstudythatweightedspeedupisin onsistent,asitgivesmore

weighttoben hmarkswithalowreferen eIPC[6℄. Onemaywanttoweightdierentben hmarks

dierently,butthisshouldbestatedexpli itlyandthismustbejustied. Toourknowledge,none

ofthepaststudiesthathaveusedweightedspeeduphaveprovidedareasonforweightingdierent

ben hmarksdierently,perhapsbe ausetheywerenotawareofthein onsisten yproblem. The

harmoni meanofspeedups,thethirdmostfrequentlyusedthroughputmetri ,isin onsistentas

well[6℄.

Using an in onsistent metri may lead to arti ial on lusions, as illustrated in Table 1.

(8)

single thread ma hine X runningAB ma hine Y runningAB ben hmarkA

IP C

ref

= 0.2

IP C = 0.1

IP C = 0.2

ben hmarkB

IP C

ref

= 1

IP C = 0.2

IP C = 0.1

weightedspeedup 0.7 1.1

harmoni meanofspeedups 0.29 0.18

Table 1: Example illustrating the in onsisten y of weighted speedup and harmoni mean of

speedups. Ben hmarksAandBareequallyimportant. Without furtherinformationon

ben h-marks,theonlypossible on lusionisthatma hinesXandYoerthesamethroughput.

aunit ofworkwhi his notthesameforalltheben hmarks andwhi hdependsonareferen e

ma hine, the ma hine on whi h referen e IPCs are measured. Troughput is dened as the

quantityof work doneperunit of time. A sound throughputmetri ,onewhi h anbeused to

omparema hines,shoulddenetheunitofworkindependentlyfromareferen ema hine,asthe

hoi e ofaparti ularreferen ema hineisarbitrary(the hoi eofadierentreferen ema hine

ould hangethe on lusions). If onewants, forwhateverreason,to givemoreweightto some

ben hmarks,su hweightingshouldbedone ons iouslyandshould bejustied.

2.3 What about geometri mean of speedups ?

It iswellknown that,in the aseof single-threadperforman e, speedups should be aggregated

withageometri meaninordertoavoid onsisten yproblems[2,5℄,andresear hersgenerallyuse

the geometri mean to summarize single-threadspeedups. Oddly, in the aseof multiprogram

throughput,thepra ti eofaddingspeedups(asin weightedspeedup)ortaking theirharmoni

meanisstillwidespread.

Thoughwedonotex lude ompletelythepossibilitytodeneameaningfulthroughputmetri

usingageometri mean,itisnot learhowsu hmetri shouldbedened. Thegeometri mean

ofspeedupsgivesanestimateofthemedianspeedupofrandomprogramsundertheassumption

that ben hmarks are representativeand that speedups are distributed log-normally [5℄. That

is, an assumption is made not only on programs (as in all performan e metri s) but also on

ma hines. It iseasytoimagine aseswherethisassumptiondoesnothold 1

. Theassumptionof

log-normality annotbetruein generalbut oin identally[4℄.

2.4 The SPECrate throughput metri

SPECdenesathroughputmetri alledSPECrate,basedonrunninghomogeneousworkloads,

i.e.,runningseveralindependent opiesofthesameben hmarkandmeasuringhowlongittakes

forall opiestonishexe uting[1℄. However,SPECrateislimitedtohomogeneousworkloads. It

isnotanappropriatemetri forstudyingmi roar hite turesthatexploitthepossible

heterogene-ityofbehavioramong on urrentlyrunningthreads. Forexample,ifweuse onlyhomogeneous

workloadstoevaluatemultiprogramthroughput,wemaybeledto on ludethat ashared

last-level a heoerslittlethroughputadvantageoverprivate a hes,overlookingthepossibilityfor

a shared a he to exploit the fa t that dierent appli ations may have dierent a he spa e

requirements[8℄.

1

Forexample, onsiderapro essor Xsu hthatIPCsaredistributedlog-normally,withamedianIPCequal to1. Thatis,50%ofprogramshaveanIPCgreater than1. Then onsiderapro essorYthat anissueonlya singleinstru tionper y le,sothat50%ofprogramshaveanIPC loseto1. TheIPCdistributionofpro essor Yisnotlog-normal,neitherthespeedups.

(9)

2.5 Equal-time throughput metri s

Ameaningfulthroughputmetri isasso iatedwithathroughputexperimentsu hthatthe

quan-tity of work per unit of time measured with this experiment is equalto the throughput value

givenbythemetri .

BoththeET-IPCand theweightedspeedup metri sareequal-time throughputmetri s: the

throughputexperimentwithwhi htheyareasso iateddividestheexe utionofasetof

indepen-dentjobsintoequaltimeintervalssu hthat,duringatimeinterval,asingleworkloadisrunning

ontinuously(thatis,no ontextswit hhappens) 2

.

Hen e ifasetof workloadsisdened independentlyfrom anyparti ularma hine,andifall

theben hmarks ontribute equallyto the jobs onstituting the workloads, allthe ben hmarks

runforthesametotaltime,whateverthe ma hine's performan e.

There exists someappli ations that a tually behavelike that, i.e., they tryto do asmu h

workaspossibleduringaxedtime. Thistypi ally orrespondstosomeintera tiveorreal-time

appli ationsthat anadaptthequalityoftheiroutputasafun tionofthema hine'sperforman e.

Su happli ationsprodu ejobs thatmaybetermed onstant-time jobs. However,this on erns

aminorityofappli ations.

Most jobs, in luding bat h jobs but also intera tive and real-time jobs, are onstant-work

jobs. A onstant-workjob hasxedworkto do,whi h typi ally orrespondstoaxednumber

of program instru tions to exe ute, and the time to do this work depends on the ma hine's

performan e.

In our opinion, the impli it assumption of onstant-time jobs in the onventional "IPC

throughput"metri is the main drawba kof that metri . Repla ing raw IPCs with speedups,

as in weighted speedup, not only makes the metri in onsistent but does not solve the basi

problem: thequantityofworkdonebyajobstilldependsonthema hinesperforman e.

Tosolvethisissue,weproposeintherest ofthisstudysomenewmultiprogramthroughput

metri sthat arebasedontheassumptionthat jobsexe uteaxedand equalquantityofwork.

Ourgoalisto statetheassumptions learlyand to hara terizethemetri swepropose, unlike

whathasbeendonewithotherthroughputmetri s.

3 The EW-IPC: an equal-work IPC throughput metri

Wepropose to dene an equal-workthroughput metri formi roar hite ture studies basedon

thefollowingthroughput experiment:

(1)Theunitofworkistheinstru tion

(2) Thebehaviorof ajob issimilar to oneof the ben hmarks, hosenrandomly,and

alltheben hmarksareequallylikely

(3)Allthejobs exe uteaxed andequalnumberofinstru tions

(4)Thereisasinglejob queue,whi hisneverempty

(5)Thes hedulingpoli yis rst-inrst-out(FIFO,akarst- omerst-served)

Weassumethattheunit of work istheinstru tion,whi hisanaturalunit ofwork forthe

mi- roar hite t. Thisbrings onsisten y: alltheben hmarksaretreatedequally. Hen ethroughput

isgivenin instru tionsperunit of time, e.g., instru tionsper y le(for axed lo k y le)or

2

Theonlydieren e between ET-IPC and weighted speedupis thatthey usedierent unitsof work. The ET-IPCassumesthattheunitofworkistheinstru tion,andthatoneinstru tionfromoneben hmarkisworth oneinstru tionofanotherben hmark. Theweightedspeedupassumesthattheunitofworkforaben hmarkis theaveragenumberofinstru tionsexe utedina lo k y lebytheben hmarkwhenitrunsaloneonareferen e ma hine(hen edierentunitsofworkfordierentben hmarks).

(10)

instru tions perse ond. In the rest of this study, we onsider throughput in instru tions per

y le.

These ondassumptionmakessensebe ause,unlesswehavesomespe i informationabout

the representativeness of a ben hmark (whi h is rarely the ase), we have no reason to treat

dierentben hmarksdierently.

Thethird assumptionisthatjobsexe utea onstantwork,andthatthequantityofworkis

thesameforalljobs. Of ourse, onareal system,thereis noreasonfor alljobsto exe utethe

samequantityofwork.Butitisne essarytomakesomeassumptionswhendeningathroughput

metri . Itisrarethatwehaveanyinformationontherepresentativenessofben hmarks,andwe

havenoreasontogivemoreweighttosomeben hmarks. Thethirdassumptionisfundamental:it

isequivalenttosayingthatthenumberofinstru tionsexe utedbyarandomjobisnot orrelated

with thatjob's IPC, i.e., weassumethat,onaverage,arandom low-IPCjob exe utesasmany

instru tionsasarandomhigh-IPCjob.

Thefourthassumptionimpliesthatthejobarrivalrateissu ientlyhighsothatthroughput

islimitedbythema hineperforman e,notbythearrivalrate. Thisassumptionmakessensefor

mi roar hite turestudies,where thethroughputmetri isusedto omparedierentpro essors.

Finally,thelastassumption onsidersthatjobsarebat hjobs. Ingeneral,jobs hedulingmay

impa t throughputquitesigni antly. Wealsoexperimentedwith aper- ore job queuesand a

round-robins hedulingpoli y,forvarioustimequantaandvaryingthedegreeof

multiprogram-ming: wedidnotobserveanysigni antdieren ewithFIFOs heduling. Althoughwehaveno

mathemati alproof,we onje turethatthese ondassumption(jobsare hosenrandomlyamong

ben hmarks)is su ientforthroughputto beindependentof thes hedulingpoli y,as longas

the s hedulingpoli y isunawareof jobsIPCsandtreatsalljobs equally.

Themetri dened by this throughput experiment will be denoted EW-IPC in the rest of

the paper. Noti ethat, be ausejobs exe utean equalnumberof instru tions, the EW-IPCis

dire tlyproportionalto thejobthroughput.

3.1 The IPC matrix method

A possibleway to measure EW-IPC would be to run a single y le-a urate simulation

orre-spondingtotheEW-IPCthroughputexperimentdes ribedinSe tion3. However,itwouldtake

averylongsimulationtimefor theEW-IPC to onvergeto itslimitvalue. Inpra ti e,afaster

simulation method is possible when we have few ben hmarks and few ores: the IPC matrix

method.

Thebasi ideaisto simulateseparatelyallthepossibleworkloads,i.e.,allthepossible

om-binationsof

K

ben hmarks,andre ordthethreadsIPCsintheIPCmatrix. Inthegeneral ase, there isoneIPCmatrixper ore. Ea hofthe

K

matri eshasone rowperben hmarkand one olumn for ea h ombination of ben hmarks onthe other ores. So if we have

B

ben hmarks (orben hmarkssli es)and

K

logi al ores,anIPCmatrixhas

B

rowsand

B

K−1

olumns. For

instan e, in the parti ular ase of a symmetri dual- ore and two ben hmarks A and B, both

oreshavethesameIPC matrix,oftheform

M

₌

IP C

AA

IP C

AB

IP C

BA

IP C

BB

where

IP C

XY

isthe IPCof ben hmarkX whenrunningsimultaneouslywithben hmarkY. A throughputexperiment orrespondingtotheEW-IPCmetri anberepresentedbyanexe ution

traje tory, su h as the one shown in Figure 1. With 2 ores, the exe ution traje tory is in

a 2-dimensional spa e: the oordinates

(x, y)

of a point on the traje tory indi ate how many instru tionshavebeenexe utedsofaron ores1and2respe tively.

(11)

B

A

B

A

B

A

B

A

B

CORE 1

CORE 2

A

Figure1: Exampleofexe utiontraje toryonasymmetri dual- ore,assumingtwoben hmarksAand

B.JobsoftypeAorBexe utewithequalprobability. Ea hjobexe utesaxednumberofinstru tions.

At a given time,the oordinates

(x, y)

of apoint on thetraje toryindi atehow manyinstru tions havebeenexe utedso faron ores1and2respe tively. Inthisexample

IP C

BA

= 2 × IP C

AB

.

On e the IPC matri es havebeenpopulated, anexe ution traje tory orresponding to the

EW-IPCmetri anbeobtainedwithaMonteCarlosimulation: thetotaltime anbeobtained

bysummingthetimes orrespondingtoea hsegmentofthetraje tory. Eventually,iftheMonte

Carlo simulation is long enough, the total number of instru tions divided by the total time

providesana urateapproximationoftheEW-IPCvalue.

The IPC matrixmethod is on eptuallyvery loseto the o-phasematrixmethod used by

VanBiesbrou ketal. [15,14℄.

3.2 A throughput paradox

EW-IPCisamoremeaningfulmetri than ommonlyusedequal-timethroughputmetri s.

How-ever,it isfundamentallydierent. Being loserto arealisti situation exposestheEW-IPC to

the ounter-intuitivebehaviorsthat maybeen ounteredinrealsituations.

Toourknowledge,there isnosimpleanalyti alformulasu h as(1) for omputingEW-IPC

in the general ase. In the general ase, EW-IPCvalues must be obtainedfrom Monte Carlo

simulations. However,asimpleanalyti alformulaexists forthesimple aseof twoben hmarks

andasymmetri dual- ore. Wederivethisformulaintherestofthisse tion,anduseittoshow

thattheEW-IPCmaya tuallyde reasewhenwea elerateanindividualthread.

As explained in Se tion 3.1, theexe ution traje tory issu ient to omputethe EW-IPC.

The exe ution traje tory onsists of segments. For instan e, the traje tory of Figure 1has 4

types of segments: AA, BB, AB and BA. For instan e, a segment AB orresponds to ore 1

runningben hmark Awhile ore2isrunningben hmarkB.Forsu h symmetri dual- ore,the

slopeofthetraje toryisequalto1insegmentsAAandBB, itisequalto

IP C

BA

IP C

AB

and

IP C

AB

IP C

BA

in

segmentsABandBArespe tively.

(12)

ofinstru tionsexe utedonsegmentsAA,BB,ABandBA.Fromthesesegmentfra tions,we an

obtain thetotal exe ution times

t

AA

,

t

BB

,

t

AB

and

t

BA

spent on segmentsAA, BB, AB and BA:

t

AA

=

_{2×IP C}

f

AA

N

I

t

AB

=

f

AB

IP C

AB

+IP C

BA

N

I

t

BB

=

f

BB

2×IP C

BB

N

I

t

BA

=

f

BA

IP C

AB

+IP C

BA

N

I

where

N

I

isthetotalnumberofinstru tionsexe uted. Then,theEW-IPC anbe omputedas

EW-IPC

=

N

I

t

AA

+ t

BB

+ t

AB

+ t

BA

=

_f

_AA

1 2×IP C

AA

+

f

BB

2×IP C

BB

+

f

AB

+f

BA

IP C

AB

+IP C

BA

(2)

Themaindi ultyliesin omputingthesegmentfra tions. Ingeneral,segmentfra tionsmust

beobtainedfromaMonte Carlosimulation. However,intheparti ular aseoftwoben hmarks

exe utingonasymmetri dual- ore, anexa tanalyti expression anbeobtained. Noti ethat

an exe ution traje torysu h asthe onedepi ted in Figure 1 doesnot depend on

IP C

AA

and

IP C

BB

. Hen ethefollowingtwoIPC matri es

M

₌

IP C

AA

IP C

AB

IP C

BA

IP C

BB

M

′

=

IP C

AB

IP C

AB

IP C

BA

IP C

BA

yieldtheexa tsametraje tory,hen etheexa tsamesegmentfra tions. Butmatrix

M

′

orre-sponds toa situation where theIPC of ajob doesnotdepend onwhi h job is runningonthe

other ore. In this situation,the two oresareindependent, and thesegmentfra tions anbe

omputedeasily(seetheappendix):

f

AB

= f

BA

=

1 ₄

f

AA

=

₂

1 ×

_{IP C}

AB

IP C

+IP C

BA

f

BB

=

1

2 ×

IP C

AB

IP C

AB

+IP C

BA

(3)

Substituting (3)into(2), weeventuallyobtaintheEW-IPC:

EW-IPC

=

4 × (IP C

AB

+ IP C

BA

)

2 +

IP C

BA

IP C

AA

+

IP C

AB

IP C

BB

(4)

Formula(4)isnontrivialandleadstosurprising onsequen es. Inparti ular,theEW-IPCmay

de reasewhenwein rease

IP C

AB

(orsymmetri ally,

IP C

BA

):

IP C

BA

IP C

BB

−

IP C

BA

IP C

AA

> 2 =⇒

∂

EW-IPC

∂IP C

AB

< 0

Forinstan e, onsider thefollowingtwoIPCmatri es:

M

₁

₌

1

4 2 0.2

=⇒

EW-IPC

= 1

M

2 =

1

0

2

0.2 =⇒

EW-IPC

= 2

FromtheIPC matri es, onemay believe thatma hine

M

1

should oermorethroughputthan ma hine

M

2

, while in fa t it is the opposite. In thisexample, de reasing

IP C

AB

from 4 to 0 doublestheEW-IPC!ByfreezingajoboftypeAwhentheother orerunsajoboftypeB,we

in reasethejob throughput, whi h is ounterintuitive. Letusanalyzewhat is happening. On

bothma hines,workloadBBisaworkloadtypethatwewouldliketoavoidasmu haspossible

be ausein this asetheIPC isverysmall. Onma hine

M

2

,ajob oftypeA annotterminate while the other ore runs a job of type B. Theonly way for aworkloadBB to o uris when

(13)

exa tlyatthe sametime(whi his quiteimprobable). Hen e onma hine

M

2

, workloadBB is inexistentinthelimit( f. equations(3)).

ThisexampledoesnotmeanthattheEW-IPCmetri istheproblem. Thiskindofsituation,

whereaseeminglyobviousperforman eimprovementa tuallyde reasesthroughput,andwhi h

we all a throughput paradox, an happen in reality. However, the next se tion shows that

throughputparadoxesareunlikelyin pra ti e.

4 Pra ti al proxies for the EW-IPC

TheIPC matri es orresponding to mi roar hite ture studies are likely to have anon-random

stru ture. Ea h row of the IPC matrix gives the IPCs for one parti ular ben hmark for all

possible onditionsunder whi hthis ben hmark may run. Some ben hmarks are"insensitive":

their IPC is almost independent of the threads that are running on the other ores. Other

ben hmarksare"sensitive": theirIPC variesmoreorlessdependingonthe o-runningthreads.

Intheparti ular asewhenalltheben hmarksareinsensitive,theEW-IPC anbe omputed

easily. Let

IP C

i

[b]

bethe IPC of ben hmark

b

on ore

i

( oresare notne essarily identi al), whi his independentofthejobsrunningontheother ores(asben hmarksareassumed

insen-sitive). WhenweruntheEW-IPCthroughputexperimentforalongtime

T

, ore

i

exe utes

N

i

instru tions,andea hofthe

B

ben hmarks ontributesnearly

N

i

/B

instru tionsonthat ore. Hen eforall ores

i ∈ [1, K]

,

T =

B

X

b=1

N

i

/B

IP C

i

[b]

⇐⇒

_N

_i

₌

B × T

P

B

b=1

IP C

1 i

[b]

TheEW-IPCis

1 T

P

K

i=1

N

i

. Substituting

N

i

,weget

EW-IPC

=

K

X

i=1

B

P

B

b=1

IP C

1 i

[b]

(5)

whi his thesumonall oresoftheharmoni meanofben hmarksIPCs.

4.1 Denition of the H-IPC

An interesting question to study is whether we an generalize formula (5) in the ase where

ben hmarksaresensitiveandobtainanapproximationoftheEW-IPCthatwouldnotne essitate

MonteCarlosimulations. An obviousgeneralizationoftheharmoni meanin (5)is

H-IPC

=

K

X

i=1

B

K

P

(b,w)

IP C

1 i

[b,w]

(6)

where

IP C

i

[b, w]

istheIPCvalueinthe

b

th

rowand

w

th

olumnoftheIPCmatrixof ore

i

,and

(b, w)

runsoverallrowsand olumns. Thatis,theH-IPCisobtainedbysummingtheharmoni meansof theIPC matri es.

Wedistinguishtwosortsofmulti ores: symmetri andasymmetri . Wesaythatamulti ore

issymmetri whenallthe oreshaveequivalentIPCmatri es 3

. Amulti orethatisnotsymmetri

is asymmetri . Forinstan e, most general-purpose multi ore pro essors today are symmetri ,

in ludingthosewhosephysi al oresareSMT.

3

(14)

IP C[b, w] = min(3, max(0.1, BIP C[b] × S[b, w]))

Randomvariable

S[b, w]

normallydistributedwithmeanequalto1

BIP C[b] = 0.1 + R[b] × 2.9

for

b ∈

[1, B]

Randomvariable

R[b]

uniformin

[0, 1]

Figure2: IPC matrixmodel

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.2 0.4 0.6 0.8

1 1.2 1.4 1.6 1.8

2 CDF

speedup

stdev=0.05

stdev=0.1

stdev=0.2

stdev=0.3

Figure 3: IPC matrix model: umulative distribution fun tion of the normally-distributed speedup

S[b, w]

fordierent valuesofthestandarddeviation STDEV.

Forsymmetri multi ores,theH-IPCformula anbesimplied:

H-IPC

=

K × B

K

P

(b,w)

IP C[b,w]

1

(7)

UnliketheEW-IPC,theH-IPCisnotasso iatedwithathroughputexperimentanditsphysi al

meaning an only be approximate. Nevertheless, we onje ture that the H-IPC an be used

as a proxy for the EW-IPC in most pra ti al situations. Ideally, we would like to he k this

onje turebydoingmanydierentmi roar hite turestudiesandby he kingthattheEW-IPC

andtheH-IPCleadto thesame on lusions. However,wewouldbelimitedinpra ti etoafew

mi roar hite turestudieswith smallIPCmatri es,from whi hit wouldbedi ultto on lude

anythingorgainanyinsight. Instead,weusethearti ialIPCmatrixmodeldes ribedinFigure

2. In this model,

S[b, w]

represents aspeedup following a normaldistribution of mean 1 and whose standarddeviationSTDEVis aparameterthat wewillvary. Forsymmetri multi ores,

weenfor e

S[b, w

1 ] = S[b, w

2 ]

forworkloads

w

1

and

w

2

that area tuallythe sameben hmarks multiset.

STDEVquantiesben hmarkssensitivity: thehigherSTDEV,themoresensitivethe

ben h-marks. Figure 3 shows the umulative distribution fun tion of

S[b, w]

for dierent values of STDEV.Forinstan e, with STDEV=0.3,thereisaprobabilityofabout50%thatthespeedup

islessthan0.8orgreaterthan1.2. Inthe ontextofmi roar hite turestudies,this orresponds

(15)

0.000

0.005

0.010

0.015

0.020 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.05

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.005

0.010

0.015

0.020

0.025

0.030

0.035

0.040 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.1

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.010

0.020

0.030

0.040

0.050

0.060

0.070

0.080 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.2

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.020

0.040

0.060

0.080

0.100

0.120

0.140

0.160 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.3

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

Figure 4: Coe ient of variation (lower is better) of

R =

H-IPC EW-IPC

for symmetri multi ores, for

dierentvaluesofSTDEV.

OurgoalistounderstandtowhatextenttheH-IPC anbeusedasaproxyfortheEW-IPC.

This doesnot ne essarilymean that the H-IPC valuesshould be lose to theEW-IPC values.

TheH-IPCandEW-IPCareequivalentif,foranytwoma hines XandY,

H-IPC

[Y ]

H-IPC

[X]

≈

EW-IPC

[Y ]

EW-IPC

[X]

(8)

In words, the two metri s must indi ate approximately the same speedup. Condition (8) is

equivalentto H-IPC

[Y ]

EW-IPC

[Y ]

≈

H-IPC

[X]

EW-IPC

[X]

(9)

That is, the H-IPC and EW-IPC are equivalent if the ratio

R =

H-IPC EW-IPC

is approximately

onstant(not ne essarilyequalto1). Toquantify theequivalen ebetweenthetwometri s,we

generate1000IPCmatri esa ordingtotheIPCmatrixmodelofFigure2. Forea hIPCmatrix,

we ompute

R

(weobtaintheEW-IPCwithMonteCarlosimulations). Eventually,the oe ient ofvariation

CV

R

= σ

R

/µ

R

(with

µ

R

and

σ

R

respe tivelythemeanandstandarddeviationof

R

) quantiestheextenttowhi hthetwometri sareequivalent. Weobservedexperimentallythat

thefra tion

R ∈ [µ

R

−

2σ

R

, µ

R

+ 2σ

R

]

isgreater than0.93. Forinstan e, a

CV

R

of 1%means thatif we omparetwomi roar hite turesXand Y,and iftheH-IPC ofY is10%higherthan

theH-IPC ofX,thentheEW-IPCofYislikelybetween8%and12%higherthantheEW-IPC

ofX.

Results arepresented in Figure 4forsymmetri multi oresand in Figure 5forasymmetri

(16)

0.000

0.002

0.004

0.006

0.008

0.010

0.012 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.05

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.005

0.010

0.015

0.020

0.025 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.1

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.005

0.010

0.015

0.020

0.025

0.030

0.035

0.040

0.045 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.2

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

0.000

0.010

0.020

0.030

0.040

0.050

0.060

0.070

0.080

0.090 2 cores

4 cores

6 cores

coefficient of variation of R

STDEV = 0.3

2 benchmarks

4 benchmarks

10 benchmarks

20 benchmarks

Figure 5: Coe ientof variation (loweris better) of

R =

H-IPC EW-IPC

for asymmetri multi ores, for

dierent valuesofSTDEV.

0

2

4

6

8

10

12

14

16

18

0

2

4

6

8

10

12

14

16

18 H-IPC

EW-IPC

6 cores, 2 benchmarks

0

2

4

6

8

10

12

14

0

2

4

6

8

10

12

14

16 H-IPC

EW-IPC

6 cores, 4 benchmarks

Figure 6: S atterplots withtheEW-IPC on thexaxis andthe H-IPCon theyaxis (1000points),

forasymmetri 6- ore,assumingSTDEV=0.3. Theleftplotisfor2ben hmarks,andtherightplot

(17)

ben hmarks

B

, and the IPC matrix model STDEV 4

. Figure 6showssome s atter plots for a

oupleof ases(symmetri 6- ore,STDEV=0.3,2and4ben hmarks). Severalobservations an

bemade,someofwhi hnotobvious: Asexpe ted,

CV

R

in reaseswithSTDEV,thatis,themore sensitivetheben hmarks,thegreaterthedis repan ybetweentheH-IPCandtheEW-IPC;

CV

R

de reaseswhenthenumberofben hmarksisin reased;Ex eptforsymmetri multi oreswith2

ben hmarks,

CV

R

de reaseswhenwein reasethenumberof ores;

CV

R

islowerforasymmetri multi ores.

In pra ti e,theH-IPC anbeused asathroughput metri in mi roar hite turestudies, in

lieuoftheharder-to-obtainEW-IPC.Forinstan e,ifweuse20ben hmarkstostudyasymmetri

dual- ore,andiftheben hmarksaremoderatelysensitive(STDEV

< 0.1

),

CV

R

islessthan1% and the H-IPC an reasonably be used to establish throughput dieren es of a few per ents.

Still,whatweshowwiththeH-IPCisaperforman eunderworkloadsinvolvingseveraldierent

behaviors.Under ertainworkloads,inparti ularworkloadsinvolvingasmallnumberofdistin t

appli ationswitha"monolithi "behavior,orappli ationswithsimilar behaviors,theobserved

performan e might be quite dierent from the on lusions obtainedon the whole ben hmark

suite( f. throughputparadoxes,Se tion 3.2). Itwouldbepossibleto showseveralthroughput

numbers,onethroughputnumberforthewholeben hmarksuite,andalsothroughputnumbers

forsubsetsofben hmarks,e.g.,ben hmarkspairs. However,theH-IPCis notasafemetri for

studyingsensitiveben hmarkpairs,asitmightleadto on lusionshavingnophysi alreality.

4.2 Sampling the IPC matrix: the ASH-IPC and SSH-IPC

Sofar,wehaveassumedthattheIPCmatrixisfullypopulated. However,inpra ti al

mi roar- hite ture studies,it isoftennot possibleto lltheIPC matrix ompletely be ausethis would

requiretoomanysimulations. Afrequentpra ti eisto simulateasampleofseveraltens,

some-times afew hundreds of dierent workloads. A sampled versionof theH-IPC an be dened

asfollows. Let

W

be thenumberof workloadsin the sampleand let

IP C(w, i)

bethe IPC of thethread on ore

i

in the

w

th

workloads. Wedene theasymmetri sampledH-IPC,denoted

ASH-IPC,as ASH-IPC

=

K

X

i=1

W

P

W

w=1

1 IP C(w,i)

(10)

Thatis,theASH-IPCisthesumofper- oreharmoni meansofIPCs. Inthe aseofasymmetri

multi ore,theIPCs ofall ores omefromthesameIPCmatrix,andwe anusethesymmetri

sampledH-IPC,denotedSSH-IPC:

SSH-IPC

= K ×

W × K

P

K

i=1

P

W

w=1

IP C(w,i)

1

(11)

Thatis,theSSH-IPCistheglobalharmoni meanofIPCs timesthenumberof ores.

The oe ient of variation of

ASH-IPC H-IPC

and

SSH-IPC H-IPC

quanties the extent to whi h the

ASH-IPC and SSH-IPC give the same on lusions asthe H-IPC. As in Se tion 4.1, we dene

1000IPCmatri esa ordingtothemodelofFigure2,forsymmetri andasymmetri multi ores.

Results are shown in Figures 7and 8 for asymmetri and asymmetri 4- ore respe tively,

assuming20ben hmarks,STDEV=0.05 andSTDEV=0.3. Theleftgraphsshowthe oe ient

ofvariationof

ASH-IPC H-IPC

asafun tion ofthenumberofworkloadsin thesample, andtheright

4

Fortheasymmetri 6- ore ase,wedonothaveresultsfor20ben hmarksbe ausetheIPCmatri esarevery big,whi hmakessimulationsveryslow. Forsymmetri multi ores,weuseamu hmore ompa trepresentation oftheIPCmatrix,takingadvantageoftheredundan iesinit.

(18)

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

ASH-IPC, STDEV=0.05

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

SSH-IPC, STDEV=0.05

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

ASH-IPC, STDEV=0.3

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

SSH-IPC, STDEV=0.3

random

global balance

per-core balance

Figure7: Coe ientofvariationof

ASH-IPC H-IPC

(leftgraphs)and

SSH-IPC H-IPC

(rightgraphs)fora

(19)

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

ASH-IPC, STDEV=0.05

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

SSH-IPC, STDEV=0.05

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

ASH-IPC, STDEV=0.3

random

global balance

per-core balance

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

SSH-IPC, STDEV=0.3

random

global balance

per-core balance

Figure 8: Coe ient of variation of

ASH-IPC H-IPC

(left graphs) and

SSH-IPC H-IPC

(right graphs) for an

(20)

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0

50

100

150

200

250

300

350

400 Coefficient of variation

sample size W

STDEV=0.05

STDEV=0.1

STDEV=0.2

STDEV=0.3

Figure 9: Coe ient of variation of

S =

ASH-IPC EW-IPC

for a symmetri 4- ore, with 20 ben hmarks,

usingper- orebalan edworkloads.

graphsshowthe oe ient of variationof

SSH-IPC H-IPC

. For ea h graph, we show 3 urves, ea h

orrespondingtoadierentwaytosele ttheworkloads:

Random. Workloadsare hosen ompletelyrandomly.

Global balan e. Among the total

W × K

threads onstituting the

W

workloads, ea h ben hmarko urs

K × W/B

times,

W

beingamultipleofthenumber

B

ofben hmarks.

Per- orebalan e. Ea hben hmarko urs

W/B

timeson ea h ore,with

W

amultiple of

B

.

A rst observation is that balan ing the ben hmarks de reases signi antly the oe ient of

variation. Forasymmetri multi ores, per- ore balan e is the best of the3 workloadsele tion

methods. Forsymmetri multi ores,per- orebalan eisbetterthanglobalbalan eforthe

ASH-IPC.Forsymmetri multi ores,theSSH-IPCisnearlyequivalenttotheASH-IPCunderper- ore

balan e,anditisbetterthantheASH-IPC underglobalbalan eandunderrandomworkloads.

Forasymmetri multi ores,asexpe ted,theASH-IPCissuperiortotheSSH-IPC.In on lusion,

were ommendusing per- ore balan ed workload samplesand the ASH-IPCor, forsymmetri

multi ores,theSSH-IPC.

Therequiredsample size depends onben hmarks sensitivity andon themi roar hite tures

being ompared.Ifben hmarkshaveasmallsensitivity(STDEV

< 0.05

)andiftheperforman e dieren ebetweenthetwomi roar hite turesissigni ant(afewper ents),takingafewtensof

workloadsisok. Butiftheperforman edieren ebetweenthemi roar hite turesislessthan1%,

severalhundredsofworkloadsmaybene essarytoquantifypre iselytheperforman edieren e.

Figure9showsthe oe ientofvariation

CV

S

of

S =

ASH-IPC EW-IPC

forasymmetri 4- orewith20

ben hmarks, using per- ore balan ed workloads. The numberof workloads ne essaryto make

CV

S

1%orsmallerin reaseswithben hmarkssensitivity(i.e.,withSTDEV).ForSTDEV=0.05, 20 workloadsare su ient forhaving

CV

S

.

0.01

, but weneed 60workloadsfor STDEV=0.1 and400workloadsforSTDEV=0.2.

(21)

workload single-thread IPCSoE IPCSoE

referen eIPC fairness=0.25 fairness=1

mgrid+art 0.31;0.42 0.28;0.19 0.25;0.19 lu as+applu 0.78;0.75 0.56;0.50 0.56;0.50 galgel+g 0.22;1.31 0.06;1.03 0.11;0.78 vortex+gzip 1.03;1.36 0.25;1.03 0.56;0.67 g +eon 0.56;1.44 0.08;1.25 0.33;0.92 gap+vpr 1.36;1.19 1.11;0.31 0.78;0.56 vpr+eon 1.19;1.72 0.22;1.33 0.58;0.83 apsi+swim 2.06;0.28 1.81;0.06 1.31;0.19 mgrid+mgrid 0.61;0.61 0.50;0.50 0.47;0.44 bzip2+bzip2 1.92;1.92 1.08;1.03 1.06;1.06

Table2: IPCnumbers reprodu edapproximatelyfromFigure6inreferen e[3℄

5 Impli ations for the mi roar hite t

From a mathemati al point of view, the dieren e between an equal-time throughput metri

su h asthe ET-IPC andan equal-work throughputmetri su h astheH-IPC isthe dieren e

betweenanarithmeti meanandaharmoni mean. Mi roar hite turete hniquesthatin reaseall

threadsIPCswilllikelybeassessedsimilarlybytheET-IPCandtheH-IPC,atleastqualitatively.

Howeverthesituationismore omplexwhensomeIPCsarein reasedwhileothersarede reased,

whi hhappensforinstan ewhen omparingseveraldierentwaystoarbitratesharedresour es.

Using the ET-IPC favors solutions that make high IPCs higher, even if this makes low IPCs

lower. In parti ular, if a high-IPCthread is in oni t with a low-IPC thread for aresour e,

theET-IPCtendstofavorsolutionsthat giveprioritytothehigh-IPCthread. Contrarytothe

ET-IPC, the H-IPC favors solutionsthat tend to equalize the threads IPCs. That is, when a

high-IPCthread is in oni t for aresour ewith a low-IPC thread,the H-IPC tends to favor

solutionsthat givepriorityto thelow-IPCthread.

The following example is taken from a study by Gabor et al. published in 2006 [3℄. In

thisstudy,theauthorsproposeame hanismfor ontrollingfairnessinpro essorsimplementing

Swit h-on-Event (SoE) multithreading, aka oarse grained multithreading. With SoE

multi-threading,asinglethreadisrunningatagiventime,theotherthreadsarewaiting. Uponalong

laten yevent,likealast-level a hemiss,theexe utionswit hestoanotherthread. Thispermits

hiding(tosomeexent)the a hemisslaten y. However,theauthorsshowthatanaive

implemen-tationofSoEthatswit hesthreadsonevery a hemissmaybeveryunfairtosmall-IPCthreads

thatexe utefewinstru tionsbetween onse utivemisses. Therefore,theyproposeame hanism

for ontrollingfairness. Therelativeperforman eof athread isthethread'sIPC in SoEmode

dividedbythethread'sIPC whenrunningalone. Theydenefairnessastheminimumrelative

performan e divided by the maximumrelative performan e. In the proposed me hanism, the

fairnesstarget is set between 0(no fairnessenfor ement)and 1(strong fairness enfor ement),

and the me hanism introdu es extra thread swit hes in su h a way that the a tual fairness is

lose to the fairness target. One of the on lusions they draw from their experiments is that

enfor ingfairnessrequirestosa ri esomethroughput. Butthethroughputmetri theyusedis

theET-IPC.

Table 2 shows the IPC number that we have measured approximately from Figure 6 in

(22)

ET-IPC weighted harmoni mean ASH-IPC SSH-IPC

speedup ofspeedups

fairness=0.25 1.32 1.17 0.41 0.51 0.49

fairness=1 1.22 1.19 0.58 0.82 0.81

ratio 0.92 1.02 1.40 1.61 1.65

Table 3: Throughputmetri sappliedontheIPCnumbersofTable2

SoE pro essor, for dierent valuesof the fairness target 5

. We givein Table 3the throughput

numbersforseveral metri sappliedontheIPCs of Table2. A ording tothe ET-IPC(se ond

olumn),in reasingthefairnesstargetde reasesthroughputmoderately,whi his onsistentwith

the on lusions of [3℄. Theauthors of [3℄ wantedto distinguish learly notionsof throughput

and fairness, this is why they did not use the weighted speedup and the harmoni mean of

speedups. Nevertheless,weshowthese twometri sin thethird and fourth olumnssin e they

are still frequently used. Going from a fairness target of 0.25 to a fainess target of 1 keeps

weightedspeedupalmostthesame. However,afairnesstargetof1in reasestheharmoni mean

ofspeedupsbyroughly40% omparedtoafairnesstargetof0.25. Thelasttwo olumnsofTable

3 give the ASH-IPC and the SSH-IPC. A ording to these equal-work throughput metri s, a

fairnesstargetof1yieldsabout60%-65%morethroughputthanafairnesstargetof0.25.

Equal-workthroughputisin reasedbe auseenfor ingfairnessin reasesthefra tionof pro essortime

allottedtosmall-IPCthreads,therebyin reasingthesethreadsIPCs. Hadtheauthorsof[3℄used

theASH-IPC orthe SSH-IPC,theywouldhave on ludedthat notonlydoestheirme hanism

allowto ontrolfairness,but itin reasesthroughputquitesubstantially.

6 Summary and on lusion

Thethroughputmetri susedinmi roar hite turesudiessofarareequal-timethroughputmetri s

basedontheassumptionthatallthejobsexe uteforaxedandequaltime. Wearguethatthis

assumption is not realisti , and we advo ate for equal-work throughput metri sbased on the

assumptionthattheIPCofarandomjobisnot orrelatedwiththetotalnumberofinstru tions

exe utedby thatjob. Wehaveintrodu ed su h equal-workthroughput metri , alled the

EW-IPC,whi hisasso iatedwithathroughputexperimentandhen ehasa learphysi almeaning.

We have shown that the EW-IPC ould lead to throughput paradoxessu h that a elerating

anindividual threadmightde rease throughput. Be ause thereis nosimpleanalyti alformula

fortheEW-IPCin thegeneral ase,weintrodu edtheH-IPC,asimpleproxyfortheEW-IPC.

Using an IPC matrix model, we have studied to what extent the H-IPC is a good proxy for

the EW-IPC. We have onsidered the pra ti al ase where only a relatively small number of

workloadsare simulated,and wehave proposed twodierent estimatorsfor theH-IPC in this

ase, the ASH-IPC and the SSH-IPC. We have shown that these estimators work best when

all the ben hmarksare equally representedin thesample workloads. We haveshown with an

examplethat using anequal-workthroughput metri su h astheH-IPC may leadto on lude

dierently in a mi roar hite ture study. While equal-time throughput metri s tend to favor

mi roar hite turesthat a elerate faster threads,equal-work throughput metri stend to favor

mi roar hite turesthatmakethreadsperforman emoreuniform. In on lusion,were ommend

5

Figure6in[3℄showsonly10ofthe16workloadstheyhaveusedfortheirstudy.Themissingworkloadsare homogeneousworkloadsthatareslightlyimpa tedbyfairnessenfor ement.

(23)

that mi roar hite turestudies on erned with multiprogram throughput usethe ASH-IPC or,

forsymmetri multi ores,theSSH-IPC.

Appendix

Letus assumethattheIPC of arunningjob doesnotdependon thejobrunningon theother

ore. Consequently, thetwo oresare independent. We denote

IP C

A

and

IP C

B

theIPCs of jobs of type A and B respe tively. Let

P

A

and

P

B

= 1 − P

A

be the probabilities that, at a randomtime, agiven oreisexe uting ajob of typeA orBrespe tively. Inthelimit,be ause

oftheassumptionsfortheEW-IPC(Se tion3),agiven oreexe utesasmanyinstru tionsfrom

jobsoftypeAandB,whi hmeans

P

A

×

IP C

A

=

P

B

×

IP C

B

=

(1 − P

A

)IP C

B

Solvingthis equationfor

P

A

,weget

P

A

=

IP C

B

IP C

A

+ IP C

B

(12)

P

B

=

IP C

A

IP C

A

+ IP C

B

(13)

TheaveragetotalIPCis

IP C = 2 × (P

A

×

IP C

A

+ P

B

×

IP C

B

) =

4IP C

A

IP C

B

IP C

A

+ IP C

B

(14)

Be ause the two ores are independent, the probabilities an bemultiplied. Forinstan e, the

probabilityto beonasegmentof typeABatarandomtime (thatis, ore1is exe utingajob

oftypeAwhile ore2isexe utingajoboftypeB)isequalto

P

A

×

P

B

. Thesegmentfra tions are:

f

AA

=

P

2 A

×

2 × IP C

A

IP C

f

BB

=

P

2 B

×

2 × IP C

B

IP C

f

AB

= f

BA

=

P

A

P

B

×

(IP C

A

+ IP C

B

)

IP C

Repla ing

P

A

,

P

B

and

IP C

withtheexpressionsinequations(12),(13)and(14),yields

f

AB

= f

BA

=

1

4 f

AA

=

1

2 ×

IP C

B

IP C

A

+ IP C

B

f

BB

=

1

2 ×

IP C

A

IP C

A

+ IP C

B

Referen es

[1℄ A.Carlton. CINT92and CFP92 homogeneous apa ity methodoers fair measureof pro essing

(24)

[2℄ P. J. Fleming and J. J.Walla e. How not to lie withstatisti s: The orre t way to summarize

ben hmarkresults. Communi ationsof theACM,29(3):218221,Mar.1986.

[3℄ R.Gabor,S.Weiss,andA.Mendelson.Fairnessandthroughputinswit honeventmultithreading.

InPro .ofthe39thAnnual InternationalSymposiumonMi roar hite ture,2006.

[4℄ M. F.IqbalandL. K.John. Confusionby allmeans. InWorkshoponUnique ChipsandSystems

(UCAS),2010.

[5℄ J. R. Mashey. War of the ben hmark means : Time for a tru e. ACM SIGARCH Computer

Ar hite ture News,32(4),Sept.2004.

[6℄ P. Mi haud. Demystifying multi ore throughputmetri s. IEEE Computer Ar hite ture Letters,

Aug.2012.

[7℄ S.Parekh,S.Eggers,H.Levy,andJ.Lo.Thread-sensitives hedulingforSMTpro essors.Te hni al

ReportUW-CSE-00-04-02,UniversityofWashington,2000.

[8℄ M.K.QureshiandY.N.Patt.Utility-based a hepartitioning: Alow-overhead,high-performan e,

runtimeme hanismto partition shared a hes. In Pro . otthe International Symposium on

Mi- roar hite ture,2006.

[9℄ Y.SazeidesandT.Juan.Howto omparetheperforman eoftwoSMTmi roar hite tures.InPro .

oftheIEEE InternationalSymposiumonPerforman e AnalysisofSoftwareandSystems, 2001.

[10℄ A.SnavelyandD.M.Tullsen. Symbioti jobs hedulingforasimultaneousmultithreading

ar hite -ture. InPro .oftheInternationalConferen eonAr hite turalSupportforProgrammingLanguages

andOperatingSystems, 2000.

[11℄ A. Snavely,D.M. Tullsen,and G.Voelker. Symbioti jobs heduling withprioritiesfora

simulta-neousmultithreadingpro essor. InPro .oftheACM SIGMETRICSInternational Conferen eon

Measurement andModelingof ComputerSystems,2002.

[12℄ D.M.Tullsen,S.J.Eggers,J.S.Emer,H.M.Levy,J.L.Lo,andR.L.Stamm. Exploiting hoi e:

instru tionfet handissueonanimplementablesimultaneousmultithreadingpro essor.InPro .of

the23rdAnnual InternationalSymposiumonComputerAr hite ture,1996.

[13℄ D. M.Tullsen, S.J.Eggers,and H.M. Levy. Simultaneousmultithreading: Maximizingon- hip

parallelism.InPro .ofthe22ndAnnualInternationalSymposiumonComputerAr hite ture,1995.

[14℄ M. VanBiesbrou k, L.Ee khout, andB.Calder. Considering allstartingpointsforsimultaneous

multithreadingsimulation.InPro .oftheIEEEInternationalSymposiumonPerforman eAnalysis

ofSystemsandSoftware,2006.

[15℄ M.VanBiesbrou k,T.Sherwood,andB.Calder. A o-phase matrixtoguidesimultaneous

multi-threadingsimulation. InPro .of theIEEE InternationalSymposiumonPerforman e Analysisof

(25)