About effective cache miss penalty on out-of-order superscalar processors


INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE


André Seznec, Fabien Lloansi

N° 2726

November 1995

PROGRAMME 1


Programme 1 | Parallel architectures, databases, networks and distributed systems | Project Caps

Research report no. 2726 | November 1995 | 21 pages

Abstract:

For many years, the performance of microprocessors has depended on the miss ratio of L1 caches. The whole processor would stall on a cache miss, so the contribution of a cache miss to the execution time was exactly the miss penalty. Limiting the miss ratio of L1 caches has been a major issue for the last ten years. Studies showed that, for current cache sizes, 32-byte or 64-byte cache blocks were a good trade-off.

Today, technology has changed. Most of the newly announced processors implement a very complex superscalar microarchitecture allowing out-of-order execution. On these processors, instruction execution continues while L1 cache misses are serviced by a pipelined L2 cache.

In this paper, we show that, on such superscalar processors, the effective contribution of a cache miss to the execution time is quite distinct from the miss use penalty for the missing data or instruction.

We also show that the L2 cache busy time becomes a major bottleneck and that decreasing the demanded throughput on this cache tends to become more important than limiting the L1 miss ratio. This favors the use of short cache block sizes. For current L1 cache sizes and a 16-byte bus, a 16-byte block size is shown to be a good trade-off.

Key-words:

Out-of-order execution, cache block size, effective miss penalty

This work was partially supported by CNRS (inter-PRC project ILIAD).


On the cache miss penalty on out-of-order superscalar processors

Summary:

In this paper, we show that, for processors executing instructions out of order, the effective contribution of a cache miss to the execution time is very different from the time taken by the secondary cache to service the miss. We also show that the occupancy of the secondary cache becomes the major performance bottleneck, more so than the miss ratio of the caches. This tends to favor a small block size, typically the width of the bus connecting the secondary cache to the primary cache.


1 Introduction

Many newly announced processors [7, 6, 5] implement a very complex superscalar microarchitecture allowing out-of-order execution. On these processors, instruction execution continues while L1 cache misses are serviced.

Another technological trend is the limited L1 cache size. Two major constraints limit the L1 cache size. First, the access time of the cache increases with the cache size [12]. Second, multiple accesses to the L1 cache are needed in a single cycle when the superscalar degree increases [13]. Therefore we believe that the size of the L1 caches will remain limited (in the 8-32 Kbyte range). For these cache sizes, block sizes of 16 to 64 bytes lead to the best miss ratio [17].

On these processors, the L2 cache transactions are generally pipelined and many transactions may be pending at the same time. The L2 cache is a shared resource which has to service instruction and data misses, but also writes (either write through requests or write back requests), and maybe also prefetches and external transactions for maintaining memory consistency.

In this paper, we explore the precise performance impact of pipelined L2 caches on the performance of out-of-order execution microprocessors. We show that, on such superscalar processors, the effective contribution of a cache miss to the execution time is quite distinct from the miss use penalty for the missing data or instruction.

A cycle-by-cycle simulation of an aggressive out-of-order execution microprocessor and of its cache hierarchy shows that the L2 cache busy time becomes a major bottleneck for performance.

As a result, decreasing the demanded throughput on the L2 cache tends to become more important than limiting the L1 miss ratio. This favors the use of short cache block sizes. For current L1 cache sizes and a 16-byte bus, a 16-byte block size is shown to be a good trade-off. Since the L2 cache busy time becomes the major issue, implementing a prefetch mechanism has to be studied cautiously: prefetching may load useless data or instructions, waste L2 cache bandwidth and decrease performance.

The remainder of the paper is organized as follows. In Section 2, we explain why the effective contribution of a miss to the execution time is very difficult to model on out-of-order execution microprocessors. In Section 3, we present different phenomena happening in the L2 cache. These phenomena tend to favor the use of small block sizes. In Section 4, we describe the processor simulator, the benchmarks and the simulated memory hierarchy. Performance results are presented and analyzed in Section 5. Section 6 summarizes this study.

Related work

Numerous studies have investigated cache behaviors and many hardware mechanisms have been proposed to limit the miss ratio and the impact of misses on performance. Fewer studies have been done on non-blocking caches (also called lock-up free caches [14]) and pipelined accesses to the L2 cache.

Kroft [14] was the first to consider a processor which does not stall on I-cache misses.

Sohi and Franklin [19] investigated the organisation of a multiport non-blocking L1 cache. They mainly focused on the L1 organisation and showed that an L1 cache may be built with interleaved cache banks. However, they did not completely simulate the superscalar processor (no branch prediction, ...) and the whole memory hierarchy (instruction cache, write requests, ...).

Farkas and Jouppi [2] explored the alternative implementations of non-blocking caches.


The usefulness of non-blocking data loads in multiple-issue processors has been studied by Farkas et al. [3]. The study concludes that non-blocking loads and stream buffers are very efficient at enhancing performance; nevertheless, the model used for the external memory system (L2 cache or main memory) is very poor: contentions on the L2 cache are not simulated.

2 Effective miss penalty for on-chip L1 caches

Many newly announced processors (e.g. HP8000, Intel P6, MIPS R10000) implement a very complex superscalar microarchitecture featuring out-of-order execution. Instruction execution continues while L1 cache misses are serviced by a pipelined L2 cache. This L2 cache may be off-chip (e.g. MIPS R10000 or Intel P6) or on-chip (e.g. DEC 21164).

We explain here why the delay needed for servicing a miss on such out-of-order execution processors does not necessarily result in a similar loss in execution time.

2.1 Miss penalty for synchronous blocking caches

On most microprocessors introduced before 1994, a miss on the instruction or data cache resulted in a complete stall of the processor pipeline.

On these processors, the penalty paid on execution time for an L1 cache miss is quite simple to model. The time for servicing a miss consists of a latency $L$ for accessing the first word from the missing block plus an extra penalty $r$ paid for each extra word in the block. Let $K$ be the number of words in a cache block; Formula 1 represents the time for servicing the miss.

$L + (K - 1)\,r$ cycles (1)

$L$ includes the delay for checking the miss, driving the address pins for accessing the external L2 cache (or main memory), reading the L2 cache, getting the data into the processor, and resuming the execution. For instance, on 60 MHz Pentium systems, the minimum $L$ is 5 cycles and the minimum $r$ is one cycle.

In order to limit the penalty on execution time, in many implementations, the missing word is returned first. The execution can then be resumed as soon as the missing word is present, $L$ cycles after the beginning of the miss servicing. In this case, for each individual miss, the lost execution time is at minimum $L$ cycles; the average penalty on execution time is slightly higher than $L$ cycles, because a cache miss servicing may be delayed by the end of a previous cache miss servicing or a write update.

Notation:

For convenience, in the remainder of the paper, $r$, the access time of the L2 cache RAMs, will be called the L2-cycle.
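Formula 1 can be evaluated directly. The following Python sketch is only illustrative: the function names are ours, and the block length K = 4 used in the example is an assumption rather than a value taken from the text; L = 5 and r = 1 are the 60 MHz Pentium minima mentioned in this section.

```python
def miss_service_time(latency, words_per_block, word_time):
    """Cycles needed to fetch a whole block: L + (K - 1) * r (Formula 1)."""
    if words_per_block < 1:
        raise ValueError("a cache block holds at least one word")
    return latency + (words_per_block - 1) * word_time


def minimum_lost_time(latency):
    """With the missing word returned first, execution may resume after L cycles."""
    return latency


# L = 5 and r = 1 come from the text; K = 4 is an assumed block length.
print(miss_service_time(latency=5, words_per_block=4, word_time=1))  # 8
print(minimum_lost_time(latency=5))                                  # 5
```

With these values the block is completely loaded after 8 cycles, while a blocking processor that receives the missing word first loses at least 5 cycles.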

2.2 Miss penalty versus use penalty on out-of-order execution microprocessors

On recently announced microprocessors [7, 5, 6], aggressive out-of-order execution is implemented and/or pipelined L2 caches are used. We explain here why the execution time on these processors is less affected by each individual cache miss than on traditional synchronous processors.


First let us define two different notions, the effective miss penalty and the use penalty:

Definition 2.1

The use penalty is the delay between the reference to data (or instruction) in the L1 cache and the time the data (or instruction) is available for use.

On a miss, there is a minimum use penalty $L$ for using the data.

Definition 2.2

The effective miss penalty is the real contribution of an L1 cache miss to the execution time.

On an out-of-order execution microprocessor, the effective miss penalty associated with a specific cache miss depends on its whole context: an isolated cache miss may be hidden by the out-of-order execution, while a rapid succession of cache misses will lead to some processor stalls. In the remainder of the paper, we will use the average effective miss penalty.

When the miss results in a complete stall of the processor, the use penalty and the miss penalty are equal. But the average effective miss penalty may be far less than the use penalty when out-of-order execution is used.

Data caches

On an out-of-order execution microprocessor, a data cache miss does not stall the whole processor. On a load miss at cycle $T$, dependent subsequent instructions cannot be executed before cycle $T + \mathit{UsePenalty}$, while independent instructions continue to progress in the pipeline.

Subsequent load/store instructions may even be executed when a lock-up free (or non-blocking) cache is implemented [14]. On recent microprocessors, several cache misses may be pending while the execution of load/store instructions continues.

When the functional units in the processor find a sufficient number of independent instructions to execute, the use penalty is hidden. Aggressive compiler technology may schedule in advance load instructions which are suspected to miss [2, 1], thus allowing the miss penalty to be further reduced.

Moreover, on these microprocessors, accesses to the L2 cache are pipelined and more than one miss may be serviced at the same time, resulting in an average effective miss penalty significantly lower than the use penalty.
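As a rough illustration of why pipelining lowers the average effective miss penalty, the sketch below compares the time needed to obtain the data of n independent misses on a blocking cache and on a pipelined L2 cache. It is a simplified model under our own assumptions (all misses issued back to back, one bus word per miss, no other L2 traffic, no useful work overlapped); the function names and the values L = 6 and L2-cycle = 2 are illustrative only.

```python
def completion_blocking(n_misses, latency):
    """Blocking cache: each miss is fully serviced before the next one starts."""
    return n_misses * latency


def completion_pipelined(n_misses, latency, l2_cycle):
    """Pipelined L2: a new single-bus-word request may start every L2-cycle."""
    if n_misses == 0:
        return 0
    return latency + (n_misses - 1) * l2_cycle


n, latency, l2_cycle = 3, 6, 2  # assumed values, for illustration only
blocking = completion_blocking(n, latency)
pipelined = completion_pipelined(n, latency, l2_cycle)
print(blocking, pipelined)          # 18 10
print(blocking / n, pipelined / n)  # average cycles per miss
```

In this idealized case the three misses complete in 10 cycles instead of 18, so the average cost per miss falls well below the use penalty of 6 cycles.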

Instruction cache

On out-of-order execution superscalar microprocessors, instructions are generally read at a high rate. Due to data dependencies and resource conflicts, the execution units cannot sustain the same execution rate. Therefore some advance on instruction fetching may be obtained; this is particularly true when a sequence without instruction cache miss or branch misprediction is encountered. While servicing an instruction miss, instructions already issued may be selected for execution. In some cases, if enough instructions were issued in advance, an instruction miss would have no impact on the execution time.

Instruction accesses may also benefit from a pipelined L2 cache. Instructions generally exhibit high spatial locality. In order to limit the instruction miss penalty, systematic prefetching of the fall-through blocks after an instruction cache miss may be used, as on the DEC 21164. With this technique, a miss penalty may be seen only on the first miss when the L2-cycle and the processor cycle are equal.


2.3 Bus width is limited

A solution for limiting the miss penalty might consist in using a very wide bus (for example the width of a cache block). Unfortunately, technological and economic factors limit this width. When the L2 cache is on-chip (as for the DEC 21164), the power consumption is a major limitation: the wider the bus, the higher the power consumption. When the L2 cache is off-chip, pin-out is a serious limitation; the minimum cost of the L2 cache is also a major concern.

On the current processor generation, a 16-byte bus width is the dominant trend.

3 Out-of-order execution favors small cache blocks

In this section, we present some phenomena which explain why the optimal block size is smaller for out-of-order execution microprocessors than for synchronous microprocessors.

3.1 Synchronous blocking caches

When using synchronous blocking caches, for cache sizes in the 8-32 Kbyte range, and for a realistic latency $L$ (5-30 cycles) and memory (or L2 cache) access time $r$ (1-3 cycles), the optimal size of the cache block is between 32 and 64 bytes [17, 8].

3.2 Cache line size and out-of-order execution

On recently announced microprocessors [7, 5, 6], aggressive out-of-order execution is implemented and/or pipelined L2 caches are used. We explain here why this tends to favor the use of small cache blocks.

High throughput L2 caches

Specific L2 caches associated with the newly announced processors [7, 18, 5] have a very high bandwidth: 16 bytes per cycle on the DEC 21164 or the MIPS R10000, or 8 bytes per cycle on the Intel P6, for instance.

The access time to the L2 cache is also very short: on a data cache miss, the load data may be used 5 cycles after the load reference on the DEC 21164 and 6 cycles after on the Intel P6 or MIPS R10000.

Data caches

When accesses to the L2 cache are pipelined, more than one miss may be serviced at a time, resulting in an average visible penalty significantly lower than the time actually spent servicing the miss.

Figure 1 illustrates a situation where the use of a small cache line leads to a lower execution time. In this example, three loads are considered: load A and load B result in L1 misses and load C results in an L1 hit¹. Two cases are envisaged: (a) a cache line width equal to the bus width and (b) a cache line width four times larger than the bus width.

In this simple example, we have assumed that a single access port is implemented on the L1 cache. More complex L1 cache designs may be considered: multiple cache accesses by the processor, a dedicated write port used for updating the cache on a miss, ...

¹ In a real program, the sequences of misses would be different when using different block sizes. This example is only given to illustrate phenomena which can decrease performance when using a large block size.

For convenience, in the remainder of the paper, the cache block size will sometimes be measured in "bus-words": a four bus-words block is four times wider than the bus.

Figure 1: Servicing data misses on a pipelined L2 cache. Panels: (a) one bus word per cache line; (b) four bus words per cache line. Each panel is a timing diagram of L1 and L2 cache activity (ld, Wr, Use) for loads A, B and C over cycles 1 to 14.

Let us comment on this example:

1. Data availability: in (a), data B is available for use on cycle 7; in (b), data B is available for use on cycle 9.

2. L1 data cache occupancy: in (a), a miss busies the L1 data cache for two cycles: the first access resulting in a miss, and the update of the L1 cache with the missing block. In (b), the L1 data cache is busy for 5 cycles: the first access and then 4 cycles for updating the L1 cache.

When the L1 data cache is occupied by an update, it cannot service other requests², and therefore subsequent instruction executions may be delayed. For instance, assuming a single load/store port on the cache, load C is executed at cycle 7 in (a), but must wait for cycle 14 in (b).

3. L2 cache occupancy: the L2 cache is a shared resource which has to service instruction cache misses and data cache misses. It also has to support write transactions and service coherency transactions in a complete system. Using a small data cache line for the L1 cache limits its occupancy, as illustrated in figure 1.

² Unless more R/W ports are implemented on the cache. On some cache designs, a dedicated write port is used for updating the cache.

On the other hand, the use of a larger cache line may lead to a lower cache miss ratio, and the situation illustrated in Figure 1 is more likely to occur with a small block size than with a large block size. Precise simulations will be presented in order to study the trade-off.
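The occupancy figures of the Figure 1 example can be summarized by a small model. The sketch below assumes, as in that example, a single L1 access port and one L1 update cycle per bus word; it also counts the L2 occupancy as one L2-cycle per bus word. The function names are hypothetical and the model ignores any overlap between transactions.

```python
def l1_busy_cycles(bus_words_per_block):
    """One cycle for the access that misses, plus one update cycle per bus word."""
    return 1 + bus_words_per_block


def l2_busy_cycles(bus_words_per_block, l2_cycle=1):
    """The L2 cache delivers one bus word per L2-cycle (expressed in processor cycles)."""
    return bus_words_per_block * l2_cycle


for k in (1, 4):
    print(k, l1_busy_cycles(k), l2_busy_cycles(k))
# 1 2 1
# 4 5 4
```

The two rows reproduce cases (a) and (b): a miss busies the L1 data cache for 2 cycles with a one bus-word block and for 5 cycles with a four bus-words block.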

Instruction cache

Assuming systematic prefetching of the next cache blocks after an instruction cache miss, Figure 2 shows how block sizes larger than the bus width may result in a higher execution time, a higher L1 I-cache occupancy and a higher L2 cache occupancy, as for the data cache. The instruction sequence is assumed to be (I2, I3, I4, I5, J1). In this example, an instruction cache miss is assumed on bus-word I2, I2 being the third bus-word of a four bus-words block; J1 is assumed to be in the cache.

In this example, two phenomena are illustrated:

When using a four bus-words block, the availability of I4 is delayed by the immediate servicing of cache subblocks I0 and I1, which are not immediately used.

With a four bus-words block, the access to J1 is delayed by the servicing of cache subblocks I6 and I7, which are not immediately used.

Figure 2: Servicing instruction misses on a pipelined L2 cache. Panels: (a) one bus word per cache line; (b) four bus words per cache line. Each panel is a timing diagram of L1 and L2 cache activity for bus-words I0 to I7 and J1 over cycles 1 to 14.


Throughput demands on the L2 cache

The L2 cache has to service both data and instruction misses. When a write-through policy is used on L1, it also has to service write requests; when a write back policy is used, the L2 cache must service write backs. When a multiple bus-word block is loaded, some of the bus-words loaded may remain unused before the block is flushed from the cache. The demand on the L2 cache is then higher when a large block size is used. This may lead to a performance degradation when the demanded throughput on the L2 cache (i.e. the L1 miss ratios) increases.

Granularity of L2 cache transaction

When using a K bus-words cache block, the L2 cache is busy for K consecutive L2-cycles to service a L1 cache miss or a write back.

To enable fast instruction execution, priority of L2 cache access must be given first to data cache misses, then to instruction cache misses, and finally to write (through or back) servicing.

When a large block size is used, the servicing of a high-priority transaction (e.g. a data miss) may be delayed for a long time by a previous low-priority transaction currently being serviced (e.g. a cache block write back).

Having shorter atomic transactions would allow high-priority transactions to be serviced with a shorter average delay. This phenomenon explains why, in some cases, when using a multiple bus-word block size, using a write back policy leads to a lower performance level than with a write through policy.

On the other hand, when using a single bus-word cache block, the demand is generally lower on the L2 cache for a write back policy than for a write through policy. In both cases, servicing a write occupies the L2 cache for a single L2-cycle.
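The effect of transaction granularity can be sketched as follows, assuming (our simplification) that an L2 transaction in progress is never interrupted and that time is counted in L2-cycles. The function name is hypothetical.

```python
def wait_behind_transaction(bus_words_per_block, l2_cycles_already_done):
    """L2-cycles a newly arrived request waits behind a transaction in progress.

    The transaction occupies the L2 cache for `bus_words_per_block`
    consecutive L2-cycles, of which `l2_cycles_already_done` have elapsed.
    """
    if not 0 <= l2_cycles_already_done <= bus_words_per_block:
        raise ValueError("elapsed L2-cycles must lie between 0 and the block length")
    return bus_words_per_block - l2_cycles_already_done


# Worst case: the data miss arrives just as a write back has been granted.
for k in (1, 2, 4):
    print(k, wait_behind_transaction(k, 0))
# 1 1
# 2 2
# 4 4
```

Under this assumption, the worst-case delay seen by a data miss grows linearly with the block size measured in bus-words.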

Summary

Cache simulation results published in [4] tend to indicate that, when using a 16-byte block size, the cache miss ratio is only slightly higher than when using 32-byte or 64-byte block sizes. In the next sections, we investigate whether or not the phenomena described above lead to a lower execution time when using a 16-byte block size than when using 32-byte or 64-byte block sizes.

4 Evaluation methodology

A complete processor simulator was used in order to measure the effective execution time of various benchmarks on an out-of-order execution microprocessor while varying different parameters such as the cache block size, the cache size, the L2 cache access time, the L2-cycle, instruction prefetch policies, write policies, ...

4.1 The simulator

The Spa package developed by Gordon Irlam [9] was used to generate address traces for programs executed on a SUN SparcStation2. No modification of the binary to be analyzed was required; user code of a single application is completely traced.

Basic processor configuration

A reasonably aggressive superscalar processor was simulated.

We describe here the common characteristics of the processors which were simulated:


Up to four aligned instructions are fetched per cycle.

Four functional units are used: an integer unit, a floating point unit, a branch unit, and a load/store unit. A reservation table with 16 entries is associated with each functional unit.

Both the integer unit and the floating point unit execute up to two instructions per cycle, while the branch unit is able to process a single branch per cycle.

The load/store unit can process two loads/stores per cycle. The processing of a store is divided into two steps: the address computation and the effective write. These two phases are decoupled: the address computation may be executed before the data to be written is available. Up to 16 writes may be waiting for data availability. For the loads, we chose to strongly couple the address computation and the cache read. A strict order is maintained on address computations in order to enforce data dependencies on the memory accesses.

The data cache can support two reads or one store every cycle. This characteristic is used in the DEC 21164. We chose to simulate this design rather than a fully bi-ported data cache or an interleaved cache design [19].

Up to 16 pending loads missing on the L1 cache are tolerated.

A 256-entry combined branch target buffer/branch history table was simulated. A 2-bit prediction was used. On a misprediction (either address misprediction or branch misprediction), the minimum delay between the execution of a branch and the execution of the instruction in the correct path is 5 cycles. An optimistic instruction fetch policy is used [15]: on an instruction miss, the instruction is fetched from the L2 cache without waiting for the effective address computation; on a branch misprediction some instruction blocks may be loaded unnecessarily, but on the other hand, missing instruction blocks are returned sooner than when waiting for the branch resolution.

The execution stage in the pipeline cannot take place before cycle 5, cycle 1 being devoted to the instruction fetch, cycles 2, 3 and 4 being used for decoding, register renaming and writing the reservation tables.

All registers are renamed. 32 extra integer registers and 32 extra floating point registers are used for this purpose. Condition code registers are also renamed. WAW and WAR hazards on registers are then very limited (in practice, due to the high number of available registers, no such hazard was encountered in our simulations).

The latency of all integer operations and branch instructions is one cycle. A single load delay cycle is assumed for a load hitting on the L1 data cache.

A 3-cycle latency was assumed for all floating-point operations.

Performance metrics

For each tested benchmark, a simulation was run assuming a perfect cache (i.e. no miss) and a perfect write buffer (i.e. no stall due to a full write buffer). The execution time obtained for this simulation will be referred to as the basic execution time.

As a performance metric, we will use the normalized execution time. For illustrating the L2 cache busy time, we will use the normalized L2 busy time. These metrics are defined as follows:


Definition 4.1

On a specific benchmark $B$ and a particular processor configuration $P$, the normalized execution time is the ratio of the execution time obtained for $B$ on $P$ divided by the basic execution time for $B$.

Definition 4.2

On a specific benchmark $B$ and a particular processor configuration $P$, the normalized L2 busy time is the ratio of the L2 busy time obtained for $B$ on $P$ divided by its corresponding execution time.

The average effective miss penalty will also be illustrated. This average effective miss penalty is computed as the following ratio:

$\dfrac{\text{execution time} - \text{basic execution time}}{\text{total miss number}}$
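The three metrics can be written down as a short Python sketch. The function names are ours and the cycle and miss counts in the example are hypothetical values chosen for illustration; they are not simulation results.

```python
def normalized_execution_time(exec_time, basic_exec_time):
    """Execution time on configuration P divided by the basic execution time."""
    return exec_time / basic_exec_time


def normalized_l2_busy_time(l2_busy_time, exec_time):
    """Fraction of the execution time during which the L2 cache is busy."""
    return l2_busy_time / exec_time


def average_effective_miss_penalty(exec_time, basic_exec_time, total_misses):
    """Extra cycles, relative to a perfect cache, charged to each L1 miss."""
    if total_misses == 0:
        return 0.0
    return (exec_time - basic_exec_time) / total_misses


# Hypothetical counts (in cycles and misses), for illustration only.
exec_time, basic, l2_busy, misses = 600_000, 400_000, 300_000, 50_000
print(normalized_execution_time(exec_time, basic))               # 1.5
print(normalized_l2_busy_time(l2_busy, exec_time))               # 0.5
print(average_effective_miss_penalty(exec_time, basic, misses))  # 4.0
```

In the figures both normalized metrics are expressed as percentages, i.e. the values above multiplied by 100.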

Simulated configurations

We wanted to vary different parameters on the simulated processor configurations:

cache sizes and associativities

block sizes

the L2-cycle

the bus width

the L2 access time

the instruction prefetch policy and the prefetch buffer depth

the write policy (write through or write back) and the write buffer depth

In order to avoid an exponential explosion of simulation time, only a few configurations were explored. The following assumptions are made:

I and D-cache sizes are equal.

I and D-cache block sizes are equal.

I and D-cache associativities are equal.

The write buffer can hold eight writes or eight dirty cache blocks.

When implemented, the instruction prefetch buffer consists of four 16-byte words (i.e. four cache blocks for a 16-byte block size, two cache blocks for a 32-byte block size, ...).

A fixed priority on L2 cache accesses is implemented:

1. data cache misses

2. instruction cache misses

3. write requests

4. instruction prefetches


The granularity of the transfers between the L1 cache and the L2 cache is assumed to be 16 bytes.

In all our simulations, it was also assumed that several misses on a particular cache block may be pending [2]; notice that the complexity of the hardware needed to support this feature would increase with the block size.
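The fixed priority on L2 cache accesses listed among these assumptions can be sketched as a simple arbiter. This is a minimal illustration of a fixed-priority choice, not the simulator's code: the `Request` enumeration and both function names are hypothetical, and timing (L2-cycle, latency, buffer depths) is deliberately ignored.

```python
from enum import IntEnum


class Request(IntEnum):
    """L2 request kinds; a lower value means a higher priority."""
    DATA_MISS = 1
    INSTRUCTION_MISS = 2
    WRITE = 3
    INSTRUCTION_PREFETCH = 4


def next_request(pending):
    """Return the highest-priority pending request, or None when the L2 is idle."""
    return min(pending, default=None)


def service_order(pending):
    """Order in which a set of simultaneously pending requests would be granted."""
    remaining = list(pending)
    order = []
    while remaining:
        chosen = next_request(remaining)
        remaining.remove(chosen)
        order.append(chosen)
    return order


pending = [Request.WRITE, Request.INSTRUCTION_PREFETCH, Request.DATA_MISS]
print([r.name for r in service_order(pending)])
# ['DATA_MISS', 'WRITE', 'INSTRUCTION_PREFETCH']
```

A cycle-accurate model would additionally have to account for requests arriving while a multiple bus-word transaction is in progress.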

We varied the block size (16-byte, 32-byte and 64-byte) along with a couple of parameters around the basic configuration. The parameters of the simulated configurations are given in table 1.

Name             Cache size  Prefetch  L2-cycle  Assoc.  L2 latency  Write policy
Wthrough         8K          no        2         DM      6           Write Through
WBack            8K          no        2         DM      6           Write Back
1-cycle-L2       8K          no        1         DM      5           Write Through
12-cycle-lat     8K          no        2         DM      12          Write Through
4-assoc          8K          no        2         4-way   6           Write Through
32K              32K         no        2         DM      6           Write Through
PR-Wthrough      8K          yes       2         DM      6           Write Through
PR-1-cycle-L2    8K          yes       1         DM      5           Write Through
PR-12-cycle-lat  8K          yes       2         DM      12          Write Through

Table 1: Simulated configurations

A perfect L2 cache

We did not simulate the L2 cache and assumed a perfect cache (no miss) able to accept a single request (either a 16-byte read or a 16-byte write) per L2-cycle.

4.2 Benchmark selection

Eight benchmarks exhibiting a large spectrum of L1 cache behaviors are illustrated here. These benchmarks come from the SPEC92 collection and the SPLASH collection [16]. The first 50,000,000 instructions were discarded, then a chunk of 1,000,000 instructions was traced and simulated.
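This sampling can be sketched in a few lines of Python, assuming the trace is available as an iterable of instruction records. That representation is our assumption and does not describe the interface of the Spa package; the function name is hypothetical.

```python
from itertools import islice

SKIP = 50_000_000   # instructions discarded at the start of the run
CHUNK = 1_000_000   # instructions traced and simulated


def sample_trace(trace, skip=SKIP, chunk=CHUNK):
    """Discard the first `skip` records, then yield the next `chunk` records."""
    return islice(trace, skip, skip + chunk)


# Small stand-in trace to show the behaviour.
print(list(sample_trace(range(10), skip=3, chunk=4)))  # [3, 4, 5, 6]
```

Because `islice` is lazy, the discarded records are consumed one by one without being stored.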

Some characteristics of the trace chunks we used are listed in table 2. The ideal performance (assuming a perfect cache) varies from 1.53 to 3.06 Instructions Per Cycle (IPC). All miss ratios are given in misses per instruction for an 8-Kbyte direct-mapped cache; this measure indicates more clearly the demanded throughput on the L2 cache than the conventional data miss ratio (ratio of data misses to data references).

The different L1 cache behaviors of the eight considered traces are discussed below.

Bench 1: su2cor

This trace presents a very low data locality and a very high data miss ratio: the data miss ratio only slightly decreases when the block size increases.

Instructions exhibit a high spatial locality.

Bench 2: gcc

Trace gcc presents a high instruction miss ratio (for 8-Kbyte direct-mapped caches).

This instruction miss ratio decreases with the cache block size.

The data miss ratio is significantly lower than the instruction miss ratio.

Bench number     1       2     3      4        5      6     7         8
Bench name       su2cor  gcc   pthor  hydro2d  locus  li    compress  tomcatv
IPC              2.44    1.53  1.67   2.33     1.53   1.67  2.21      3.06
16 bytes data    .098    .013  .046   .086     .001   .010  .019      .019
32 bytes data    .089    .013  .040   .046     .001   .008  .020      .019
64 bytes data    .089    .015  .040   .028     .002   .007  .022      .039
16 bytes inst.   .018    .102  .058   .001     .077   .012  .000      .000
32 bytes inst.   .012    .073  .047   .000     .057   .008  .000      .000
64 bytes inst.   .007    .055  .033   .000     .044   .008  .000      .000

Table 2: Benchmark characteristics

Bench 3: pthor

Trace pthor exhibits a quite high global miss ratio. Misses are almost equally distributed among instruction misses and data misses.

Bench 4: hydro2d

Trace hydro2d exhibits almost no instruction misses. The prefetch effect of large cache blocks seems very efficient on this benchmark: increasing the block size from 16 bytes to 64 bytes decreases the data miss ratio by a factor of three.

Bench 5: locus

Trace locus exhibits almost no data misses, but a high instruction miss ratio.

Bench 6: li

On trace li, the misses are evenly distributed between instructions and data. The total miss number is relatively low.

Bench 7: compress

Trace compress exhibits almost no instruction misses. The data miss ratio hardly depends on the block size.

Bench 8: tomcatv

Trace tomcatv exhibits almost no instruction miss. The data miss ratio is quite low, except that it suddenly increases when the block size reaches 64 bytes.
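The miss ratios of table 2 can be turned into a rough estimate of the read throughput demanded on the L2 cache. The sketch below uses the su2cor column of table 2 and assumes that each miss loads a whole block over a 16-byte bus at one bus word per L2-cycle; write traffic and prefetches are ignored, and the names `SU2COR` and `l2_cycles_per_instruction` are ours.

```python
BUS_BYTES = 16

# (data, instruction) misses per instruction for su2cor, taken from table 2.
SU2COR = {16: (0.098, 0.018), 32: (0.089, 0.012), 64: (0.089, 0.007)}


def l2_cycles_per_instruction(block_bytes, data_mpi, inst_mpi, bus_bytes=BUS_BYTES):
    """L2-cycles of read traffic demanded per executed instruction."""
    bus_words = block_bytes // bus_bytes
    return bus_words * (data_mpi + inst_mpi)


for block_bytes, (data_mpi, inst_mpi) in SU2COR.items():
    print(block_bytes, round(l2_cycles_per_instruction(block_bytes, data_mpi, inst_mpi), 3))
# 16 0.116
# 32 0.202
# 64 0.384
```

For this trace the miss ratio decreases only slightly with the block size, so under these assumptions the demanded L2 read throughput more than triples between 16-byte and 64-byte blocks.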

5 Simulation results

Figures 3, 4 and 5 illustrate respectively the normalized execution time, the normalized L2 busy time and the average effective miss penalty for the eight benchmarks and for the three considered block sizes on all processor configurations which were simulated. In the remainder of this section, we try to identify the impact of the different parameter variations.

5.1 Impact of the L2 access latency

The parameters of the Wthrough and 12-cycle-lat configurations are identical except for the L2 access latency (6 cycles and 12 cycles respectively).

Figure 3: Normalized execution time (%) for benchmarks 1 to 8 with 16-byte, 32-byte and 64-byte blocks (vertical axis from 100 to 300). Panels: (a) Wthrough, (b) 12-cycle-lat, (c) 1-cycle-L2, (d) WBack, (e) 4-assoc, (f) 32K, (g) PR-Wthrough, (h) PR-12-cycle-lat, (i) PR-1-cycle-L2.

Figure 4: Normalized L2 busy time (%) for benchmarks 1 to 8 with 16-byte, 32-byte and 64-byte blocks (vertical axis from 0 to 100), nine panels.

Figure 5: Effective miss penalty (cycles), from 0 to 20, per benchmark (1 to 8); legend: basic, 16b, 32b, 64b; nine panels, one per simulated configuration.


12 cycles latency

It can be remarked (figure 3.b) that, for all the benchmarks for which the overall miss ratio significantly decreases with the block size (benchmarks 2, 4, 5), performance is higher with a larger cache block (e.g. 64 bytes) than with a smaller one (e.g. 16 bytes). Due to a lower L2 occupancy (figure 4.b), the effective miss penalty is smaller with a smaller block size; but for a high L2 access latency, this does not compensate for the higher miss ratio.

For benchmarks 1 and 8, the lower performance with a 64-byte block size is explained by a miss ratio larger than with a 16-byte block size.

It can also be remarked that, for some benchmarks (1, 3, 4, 7, 8), the effective miss penalty is significantly lower than the L2 access latency, but that when instruction misses dominate (benchmarks 2, 5, 6), the effective miss penalty is in the same range for the three cache block sizes and is very close to the L2 access latency: the advance in instruction issuing is not sufficient to hide the use penalty.

6 cycles penalty

With a smaller L2 access latency, the results are different. 16-byte and 32-byte cache lines both seem to be good trade-offs: a lower effective miss penalty makes up for a larger miss ratio.

Using a 64-byte block size increases the risk of performance degradation due to conflict misses.
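The trade-off discussed above can be sketched with a first-order model. The sketch below assumes that the time lost to misses is approximated by the number of misses multiplied by the average effective miss penalty; the function names and the numbers in the usage lines are hypothetical and are not taken from our simulations.

```python
def miss_stall_cycles(misses: int, effective_penalty: float) -> float:
    """First-order estimate of the cycles added to the execution time by misses."""
    return misses * effective_penalty


def smaller_block_wins(misses_small: int, penalty_small: float,
                       misses_large: int, penalty_large: float) -> bool:
    """True when the smaller block loses fewer cycles to misses overall.

    A smaller block usually has more misses but a lower effective penalty
    (lower L2 occupancy); it wins only if the second effect dominates.
    """
    return (miss_stall_cycles(misses_small, penalty_small)
            < miss_stall_cycles(misses_large, penalty_large))


# Hypothetical low-latency case: the penalty gap outweighs 20 % more misses.
print(smaller_block_wins(1200, 3.0, 1000, 4.0))   # 3600 < 4000
# Hypothetical high-latency case: penalties are close, the miss count decides.
print(smaller_block_wins(1200, 9.0, 1000, 10.0))  # 10800 > 10000
```

Under this model, the smaller block is preferable exactly when the ratio of effective penalties is lower than the inverse ratio of miss counts.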

5.2 Write policy impact

The Wthrough and WBack simulations differ only by the write policy used.

As mentioned in section 3.2, the write back of a cache block spanning multiple bus words busies the L2 cache for several consecutive L2-cycles. This explains why, for a 64-byte block size, the performance of the WBack simulation is worse than that of the Wthrough simulation for benchmarks 1, 3, 4, 6 and 8 (figure 3 (a) and (d)). When using a 16-byte block, the performance of the WBack simulation is worse than that of Wthrough for benchmark 1; we checked that, in this case, this phenomenon was due to a significantly higher data miss ratio with a write-back policy than with a write-through policy. This application is a scientific code. We guess that, in the trace chunk we simulated, some arrays are written without having been read in the recent past or being re-read in the near future.

It may be remarked that, except for benchmark 1, the performance differences are marginal.

This tends to argue in favor of a write-through policy, because it simplifies coherency management. In that case, solutions that combine writes to the same memory block in the write buffer (as in the DEC 21064) or write caches [10] may be used to limit the demanded throughput on the L2 cache.
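As an illustration of write combining, the sketch below merges pending writes that fall in the same memory block, so that several processor writes cost a single L2 transaction. It is a minimal model under stated assumptions, not a description of the DEC 21064 buffer: the class name, the entry count, the FIFO drain order and the restriction to writes that do not cross a block boundary are all hypothetical.

```python
from collections import OrderedDict


class CombiningWriteBuffer:
    """Write buffer that merges writes to the same memory block (hypothetical model)."""

    def __init__(self, block_bytes: int = 16, entries: int = 4):
        self.block_bytes = block_bytes
        self.entries = entries
        # block number -> set of byte offsets written, oldest entry first
        self.pending = OrderedDict()
        self.l2_writes = 0  # number of transactions sent to the L2 cache

    def write(self, address: int, size: int = 4) -> None:
        """Record a write; assumes it does not cross a block boundary."""
        block = address // self.block_bytes
        if block not in self.pending:
            if len(self.pending) == self.entries:
                self._drain_oldest()  # make room (FIFO order is an assumption)
            self.pending[block] = set()
        start = address % self.block_bytes
        self.pending[block].update(range(start, start + size))

    def _drain_oldest(self) -> None:
        """Send the oldest entry to the L2 cache as one transaction."""
        self.pending.popitem(last=False)
        self.l2_writes += 1

    def flush(self) -> None:
        """Drain every pending entry."""
        while self.pending:
            self._drain_oldest()


buf = CombiningWriteBuffer()
for addr in (0, 4, 8, 12, 16, 20):  # six word writes touching two blocks
    buf.write(addr)
buf.flush()
print(buf.l2_writes)  # two L2 transactions instead of six
```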

5.3 Impact of the L2-cycle

The 1-cycle-L2 and Wthrough simulations differ only by the L2-cycle used (resp. 1 processor cycle and 2 processor cycles). In 1-cycle-L2, the first word of a missing block may be used after 5 = 4 + 1 cycles, while in Wthrough it may be used after 6 = 4 + 2 cycles. This difference in minimum miss use penalty is marginal. But the L2 occupancy is directly proportional to the L2-cycle (figures 4 (a) and (c)), therefore the impact of the L2-cycle on performance is very important (figures 3 (a) and (c)). For instance, with a 64-byte block size, for benchmarks 1, 4, 7 and 8 the average effective miss penalty is more than two times higher for Wthrough than for 1-cycle-L2.

As in previous cases, with an L2-cycle of one processor cycle, benchmarks 1 and 8 require a small cache block size (e.g. 16 bytes), while benchmarks 2 and 5 require larger cache blocks, mainly because instruction misses dominate and the effective miss penalty on instructions is very close to the use penalty (this was already noted in 5.1). On the other benchmarks, the performance is approximately equal for the three block sizes.
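The arithmetic behind these two observations can be written out as follows. The symbols are introduced here for illustration only: T_L2 is the L2-cycle in processor cycles, B the block size, W = 16 bytes the bus width, and the constant 4 is the fixed part of the latency quoted above. The busy time assumes that a block transfer occupies the L2 cache for one L2-cycle per bus word.

```latex
\[
  t_{\mathrm{use}}^{\min} = 4 + T_{L2},
  \qquad
  t_{\mathrm{busy}} = \frac{B}{W}\, T_{L2}
\]
\[
\begin{array}{lcc}
 & t_{\mathrm{use}}^{\min} & t_{\mathrm{busy}}\ (B = 64) \\
\text{1-cycle-L2}\ (T_{L2} = 1) & 4 + 1 = 5 & 4 \times 1 = 4 \\
\text{Wthrough}\ (T_{L2} = 2)   & 4 + 2 = 6 & 4 \times 2 = 8
\end{array}
\]
```

Doubling the L2-cycle thus adds a single cycle to the minimum miss use penalty but doubles the time during which each block transfer busies the L2 cache.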

5.4 Cache size and associativity

The major impact of increasing cache size and/or associativity is to decrease the instruction and data miss ratios. The performance curves associated with the 4-assoc and 32K simulations (figure 3 (e) and (f)) are therefore flatter than for the Wthrough simulation, but have globally the same shapes.

An interesting point to be noted is that the effective miss penalty for 4-assoc with 16-byte blocks (figure 5.a) is much lower than the effective miss penalty for Wthrough for benchmarks 3 and 6; this increases the overall advantage of using associativity. We did not find any valid explanation for this phenomenon.

5.5 Impact of instruction prefetching

In order to decrease the penalty paid on instruction misses, after a miss the next blocks may be systematically prefetched, either into the cache or into a prefetch buffer [11]. This technique is implemented in the DEC 21164. It is known to be efficient in reducing the instruction cache miss ratio; nevertheless, to our knowledge, its effective impact on performance has never been presented when the whole processor and memory system is considered.

A prefetch buffer consisting of four 16-byte words was simulated. On an instruction miss, the prefetch buffer is checked before the cache is accessed. When an instruction miss occurs, whether it hits in the prefetch buffer or not, the prefetch engine tries to fetch the blocks following the missing block. Such a policy leads to prefetching some blocks already present in the I-cache, but checking this presence before prefetching would require a second access port on the I-cache tags.

For prefetching an instruction block from the L2 cache, the prefetch engine competes with data miss servicing, effective instruction miss servicing, and writes on the L2 cache. As a prefetch may be a useless access, instruction prefetching was given the lowest priority in our simulation.
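The prefetch policy described above can be sketched as follows. This is a simplified model rather than the simulator itself: the FIFO replacement of the buffer, the prefetch depth, the relative order of the three demand classes and all names are assumptions; the text only states that the buffer holds four entries and that prefetches have the lowest priority.

```python
from collections import deque

# Rank of each request class on the L2 cache (lower rank is served first).
# Only "prefetch is last" comes from the text; the rest is an assumption.
PRIORITY = {"data_miss": 0, "instr_miss": 1, "write": 2, "prefetch": 3}


class PrefetchBuffer:
    """Four-entry instruction prefetch buffer (FIFO replacement is assumed)."""

    def __init__(self, entries: int = 4):
        self.blocks = deque(maxlen=entries)

    def hit(self, block: int) -> bool:
        return block in self.blocks

    def fill(self, block: int) -> None:
        if block not in self.blocks:
            self.blocks.append(block)  # the oldest entry is dropped when full


def on_instruction_miss(block, buffer, pending, depth=1):
    """Queue L2 requests for an I-cache miss on `block`; return True on a buffer hit.

    Following blocks are requested whether or not the buffer hits, and
    without checking the I-cache tags. `depth` is a hypothetical parameter.
    """
    hit = buffer.hit(block)
    if not hit:
        pending.append(("instr_miss", block))
    for i in range(1, depth + 1):
        pending.append(("prefetch", block + i))
    return hit


def next_l2_request(pending):
    """Return the pending request with the highest priority, or None."""
    return min(pending, key=lambda req: PRIORITY[req[0]]) if pending else None


buffer, pending = PrefetchBuffer(), []
on_instruction_miss(10, buffer, pending)
pending.append(("data_miss", 99))
print(next_l2_request(pending))  # the data miss is served before the prefetch
```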

Instruction prefetching and performance

For some of the processor configurations we explored, the performance is worse with instruction prefetching than without it (figure 3 (g), (h) and (i)). Globally, note that on our benchmark set:

1. associating instruction prefetching with a 64 byte cache block decreases the performance.

2. associating instruction prefetching with a 32 byte cache block has globally no impact on performance for the Wthrough simulation, but slightly increases performance for the 1-cycle-L2 and 12-cycle-lat simulations.

3. associating instruction prefetching with a 16 byte cache block significantly increases the performance when instruction misses dominate.

This is due to two phenomena:


1. The granularity of a prefetch transaction: when considering an L2-cycle of 2 processor cycles and a 64-byte cache block, an instruction prefetch busies the L2 cache for 8 processor cycles. If the prefetch is useless, then useful transactions may have been delayed by up to 7 cycles. When no prefetching is used, 60 to 80 % of L2 cycles are already busy, so there is a high probability that such useful transactions would be delayed. It can be noticed on figures 4 (g) and (h) that, when performing instruction prefetching, the L2 bandwidth is nearly saturated.

2. The prefetch is treated with the lowest priority by the L2 cache: there may be quite a long delay before the prefetch gets serviced. When the instruction flow continues in sequence (and would normally hit in the prefetch buffer), it is likely that, because of its low priority, the prefetch has not been serviced. When the instruction sequence is disrupted by branches, the prefetch transaction has more chances to be serviced, but is also less likely to be useful.

It can be noted that, assuming instruction prefetching, using a single bus-word cache block size leads to better performance than other block sizes.

Let us identify the phenomena that explain the better behavior with a single-word cache block and distinguish between two cases:

1. 16-byte block and 1-cycle-L2 (figure 3 (i)):

the prefetch transaction busies the L2 cache for a single cycle; and since the prefetch has the lowest priority, it cannot delay any useful L2 access.

2. 16-byte block and 12-cycle-lat (figure 3 (h)) and Wthrough (figure 3 (g)):

Without instruction prefetching and with a 16-byte block size, on our benchmark set, the L2 busy time is quite low relative to the execution time (figure 4.a and b). This creates plenty of "holes" in the L2 occupancy which can be filled by the prefetches. When using a 16-byte block size, such a prefetch busies the L2 cache for only 2 cycles, therefore any useful L2 access can be delayed by at most one cycle.

The phenomenon is particularly marked for 12-cycle-lat because of the very low occupancy of the L2 cache (particularly on benchmarks 2, 3 and 5).
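The granularity argument of this section can be checked with a few lines of arithmetic. The sketch assumes, as in the cases quoted above, that a block transfer busies the L2 cache for one L2-cycle per 16-byte bus word and that a transaction in progress is not interrupted; the function names are hypothetical.

```python
BUS_BYTES = 16  # width of the bus between the L1 and L2 caches


def l2_busy_cycles(block_bytes: int, l2_cycle: int) -> int:
    """Processor cycles during which one block transfer busies the L2 cache."""
    bus_words = -(-block_bytes // BUS_BYTES)  # ceiling division
    return bus_words * l2_cycle


def worst_case_delay(block_bytes: int, l2_cycle: int) -> int:
    """Longest wait imposed on a useful access arriving just after a prefetch started."""
    return l2_busy_cycles(block_bytes, l2_cycle) - 1


for block_bytes, l2_cycle in ((64, 2), (16, 2), (16, 1)):
    print(block_bytes, l2_cycle,
          l2_busy_cycles(block_bytes, l2_cycle),
          worst_case_delay(block_bytes, l2_cycle))
# 64-byte block, 2-cycle L2: 8 busy cycles, delay of up to 7 cycles
# 16-byte block, 2-cycle L2: 2 busy cycles, delay of at most 1 cycle
# 16-byte block, 1-cycle L2: 1 busy cycle, no delay
```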

6 Summary

New out-of-order execution superscalar processors [7, 5, 6] are using small fast L1 caches backed with a large high throughput pipelined L2 cache. Generally this L2 cache has a low access latency (6 to 8 cycles) and a high throughput (typically one 16 byte word each L2-cycle).

In this paper, we have explored the precise impact of pipelined L2 caches on the performance of out-of-order execution microprocessors. We have shown that, on such superscalar processors, the effective contribution of a cache miss to the execution time is quite distinct from the miss use penalty for the missing data or instruction and may sometimes be low compared with this miss use penalty (e.g. on our benchmark 7).

Our simulations have also pointed out that, due to asynchronism, the throughput demanded by an out-of-order execution microprocessor on the L2 cache is very high relative to the execution time when the L2-cycle is longer than the processor cycle and when the L2 cache access latency is low. While the L2-cycle only marginally affects the minimum miss use penalty, its impact on performance is major. Using equal L2-cycle and processor cycle would decrease the pressure on the L2 cache. When an external L2 cache is used, equal L2-cycle and processor cycle can only be achieved with the use of very expensive synchronous SRAMs. When the L2 cache is implemented on the same chip as the processor, the accesses may be pipelined inside the SRAM block (as for instance in the DEC 21164).

Using a small cache block size is a means to decrease the pressure on the L2 cache. Our simulations have shown that the performance achieved using a 16-byte block size is equivalent to the performance of 32-byte or 64-byte block sizes. Moreover, instruction prefetching increases the performance when using a single bus-word cache block size; due to a larger L2 cache transaction granularity, instruction prefetching may decrease performance for multiple bus-word block sizes.

Our simulations have pointed out that prefetching should be only very cautiously considered for out-of-order execution superscalar microprocessors: L2 cache cycles cannot be wasted; priority must be given to sure requests. Using large block size leads to ... Further simulations are needed to realistically evaluate the large number of prefetch mechanisms which have been proposed in the literature over the past ten years. Nevertheless, we are convinced that such mechanisms would not increase performance if the granularity of the L2 access is not a single bus word.

References

[1] S.G. Abraham, R.A. Sugumar, D. Windheiser, B. Rau, R. Gupta, "Predictability of Load/Store Instruction Latencies", Proceedings of the 26th International Symposium on Microarchitecture, pp. 139-152, Dec. 1993

[2] K. Farkas, N.P. Jouppi, "Complexity/Performance Tradeoffs with non-blocking loads", Proceedings of the 21st International Symposium on Computer Architecture (IEEE-ACM), April 1994

[3] K. Farkas, N.P. Jouppi, P. Chow, "How Useful Are Non-Blocking Loads, Stream Buffers and Speculative Execution in Multiple Issue Processors?", Proceedings of the 1st International Symposium on High Performance Computer Architecture, January 1995

[4] J.D. Gee, M.D. Hill, D.N. Pnevmatikatos, A.J. Smith, "Cache Performance of the SPEC92 Benchmark Suite", IEEE Micro, Aug. 1993

[5] L. Gwennap, "MIPS R10000 Uses Decoupled Architecture", Microprocessor Report, October 1994

[6] L. Gwennap, "PA-8000 combines complexity and speed", Microprocessor Report, October 1994

[7] L. Gwennap, "P6 underscores Intel's lead", Microprocessor Report, Feb. 1995

[8] J.L. Hennessy, D.A. Patterson, Computer Architecture: a Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990

[9] G. Irlam, "Spa", personal communication, 1992; the Spa package is available from gor- [email protected]

[10] N.P. Jouppi, "Cache write policies and performance", Proceedings of the 20th International Symposium on Computer Architecture, May 1993


[11] N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the addition of a Small Fully-Associative Cache and Prefetch Buffers", Proceedings of the 17th International Symposium on Computer Architecture, June 1990

[12] N.P. Jouppi, S.J.E. Wilton, "Tradeoffs in Two-level on-chip caching", Proceedings of the 21st International Symposium on Computer Architecture (IEEE-ACM), April 1994

[13] S. Jourdan, P. Sainrat, D. Litaize, "Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor", Proceedings of the 22nd International Symposium on Computer Architecture (IEEE-ACM), June 1995

[14] D. Kroft, "Lockup-free instruction fetch/prefetch Cache organization", Proceedings of the 8th International Symposium on Computer Architecture, May 1981

[15] D. Lee, J.L. Baer, B. Calder, D. Grunwald, "Instruction Cache Fetch Policies for Speculative Execution", Proceedings of the 22nd International Symposium on Computer Architecture, June 1995

[16] J.P. Singh, W. Weber, A. Gupta, "SPLASH: Stanford Parallel Applications for Shared-Memory", Technical Report CSL-TR-91-469, Stanford University, 1991

[17] A.J. Smith, "Line (block) size choice for CPU cache memories", IEEE Transactions on Computers, Sept. 1987

[18] Alpha 21164 Microprocessor Hardware Reference Manual, Digital Equipment Corporation, 1994

[19] G. Sohi, M. Franklin, "High-Bandwidth Data Memory Systems for Superscalar Processors", Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-4), April 1991


Publisher: INRIA, Domaine de Voluceau, Rocquencourt, BP 105, 78153 LE CHESNAY Cedex (France). ISSN 0249-6399