
4.11 Cache Efficiency


Page 79 of the original book is missing.

Table 4.2: Operation codes for the cache instruction

Conventional name        Code (hex)    Conventional name        Code (hex)
Index Invalidate I       0x00          Hit Invalidate I         0x10
Index Writeback Inv D    0x01          Hit Invalidate D         0x11
Index Invalidate SI      0x02          Hit Invalidate SI        0x12
Index Writeback Inv SD   0x03          Hit Invalidate SD        0x13
Index Load Tag I         0x04          Fill I                   0x14
Index Load Tag D         0x05          Hit Writeback Inv D      0x15
Index Load Tag SI        0x06
Index Load Tag SD        0x07          Hit Writeback Inv SD     0x17
Index Store Tag I        0x08          Hit Writeback I          0x18
Index Store Tag D        0x09          Hit Writeback D          0x19
Index Store Tag SI       0x0A
Index Store Tag SD       0x0B          Hit Writeback SD         0x1B
Create Dirty Exc D       0x0D
                                       Hit Set Virtual SI       0x1E
Create Dirty Exc SD      0x0F          Hit Set Virtual SD       0x1F

• 64-bit CPUs that provide compatibility with the R4000 are just being helpful.

• How cache is addressed: Two different styles are used. In hit-type operations you provide a regular program address (virtual address), which is translated as necessary. If that location is currently cached, the operation is carried out on the relevant cache line; if the location is not in the cache, nothing happens.

Alternatively, there are index operations where the low bits of the address are used directly to select a cache line, without regard to the line’s present contents. This exposes the cache’s internal organization in a nonportable way.

Cache maintenance in a running system is done almost entirely with hit operations, while initialization requires index types; a sketch of a hit-type flush follows this list.

• Write back: Causes the cache line to be written back to memory if it is marked dirty — for clean lines this is a nop.

• Invalidate: Marks the line as invalid so that its data won’t be used again.
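As an illustration of the hit-type operations above, here is a minimal sketch of flushing a data buffer out to memory before an external device (a DMA controller, say) reads it. The function name, the Hit_Writeback_Inv_D() macro (built like the cache_op() macros sketched later in this section, using opcode 0x15 from Table 4.2), and the fixed 32-byte line size are assumptions for the example, not from the original text:

#define LNSIZE 32   /* assumed D-cache line size in bytes */

void dcache_wb_inv_range(void *buf, unsigned long nbytes)
{
    /* round the start address down to a line boundary */
    unsigned long addr = (unsigned long)buf & ~(unsigned long)(LNSIZE - 1);
    unsigned long end  = (unsigned long)buf + nbytes;

    /* hit-type op: dirty lines are written back and invalidated,
     * clean lines are just invalidated, and addresses not in the
     * cache at all are silently skipped by the hardware
     */
    for (; addr < end; addr += LNSIZE)
        Hit_Writeback_Inv_D(addr);
}

Because hit operations do nothing for uncached addresses, a routine like this is safe to run over any buffer without knowing what the cache currently holds; that is exactly why running-system maintenance favors hit operations.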


Page 81 of the original book is missing.

1. A buffer that size isn’t big enough to initialize a secondary cache; we’ll use a devious trick to manage without.

2. Set TagLo to zero, which makes sure that the valid bit is unset and the tag parity is consistent.

The TagLo register will be used by the cache Store Tag instructions to forcibly invalidate a line and clear the tag parity.

3. Disable interrupts if they might otherwise happen.

4. Initialize the I-cache first, then the D-cache. Following is C code for I-cache initialization. (You have to take on trust the functions or macros like Index_Store_Tag_I(), which do the low-level work; they’re either trivial assembly subroutines that run the appropriate machine instructions or, for the brave GNU C user, macros invoking a C asm statement. A sketch of the macro flavor follows the loop below.)

for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize) {
    /* clear tag to invalidate */
    Index_Store_Tag_I(addr);
    /* fill so data field parity is correct */
    Fill_I(addr);
    /* invalidate again - prudent but not strictly necessary */
    Index_Store_Tag_I(addr);
}
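For the brave GNU C user mentioned above, the macros might look like the following sketch. This is an illustration under assumptions, not the book’s own code: the cache_op() helper name is invented here, the opcodes come from Table 4.2, and TagLo is taken to be coprocessor 0 register 28 as on the R4x00 family:

/* Sketch: wrap the MIPS cache instruction in a GNU C asm statement.
 * The "i" constraint makes the opcode an instruction immediate; the
 * "R" constraint passes the address as a simple memory operand.
 */
#define cache_op(op, addr)                                      \
    __asm__ __volatile__("cache %0, %1"                         \
        : : "i" (op), "R" (*(unsigned char *)(addr)))

#define Index_Store_Tag_I(addr)  cache_op(0x08, addr)  /* Table 4.2 */
#define Fill_I(addr)             cache_op(0x14, addr)
#define Index_Store_Tag_D(addr)  cache_op(0x09, addr)

/* Step 2's "set TagLo to zero": TagLo is CP0 register 28 */
#define write_taglo(x)                                          \
    __asm__ __volatile__("mtc0 %0, $28" : : "r" (x))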

5. D-cache initialization is slightly more awkward because there is no cache Index Fill D operation; we have to load through the cache and rely on normal miss processing. And while the Fill instruction operates on a cache index, load processing always relates to memory addresses and hits in the cache based on the tags. You have to be careful about the tags: with a two-way set-associative cache, the I-cache-style loop would initialize half the D-cache twice, since clearing PTagLo also resets the bit used to decide which set of the cache line is to be used on the next cache miss.

Here’s how it’s done:

/* clear all tags */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize) {
    Index_Store_Tag_D(addr);
}

/* load from each line (in cached space) */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize) {
    junk = *addr;
}

/* clear all tags */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize) {
    Index_Store_Tag_D(addr);
}
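For this fragment to work as written, the loop variables need suitable declarations; in particular, the dummy loads must not be optimized away. A minimal sketch of the assumed surrounding declarations (they are not shown in the original text):

/* Assumed declarations for the fragment above. KSEG0 is the cached,
 * unmapped window starting at 0x80000000 on MIPS.
 */
#define KSEG0 ((volatile unsigned char *)0x80000000)

volatile unsigned char *addr;  /* byte pointer: addr += lnsize steps in bytes;
                                  volatile keeps the compiler from deleting
                                  the junk = *addr reads */
unsigned char junk;
unsigned long size, lnsize;    /* D-cache size and line size in bytes */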

Page 83 of the original book is missing.


These are not necessarily the best measures. For example, x86 CPUs are rather short of registers, so a program compiled for x86 will generate many more data load and store events than the same program compiled for MIPS.

But the extra loads and stores will be of the stack locations that the x86 compiler uses as surrogates for registers; this is a very heavily used area of memory and will be very effectively cached. To some extent, then, the number of cache misses is likely to be a characteristic of a particular chunk of program, whatever CPU it is compiled for.

However, the above comments are useful in pointing out the following obvious ways of making a system go faster.

• Reduce the cache miss rate:

Make the cache bigger. Always effective, but expensive. In 1996, 64KB of cache occupied something over half the silicon area of a top-end embedded CPU, so doubling the cache size is economically feasible only if you wait for Moore’s Law to give you the extra transistors in the same space.

Increase the set associativity of the cache. It’s worth going up to four-way, but after that the gains are too small to notice.

Add another level of cache. That makes the calculation much more complicated, of course. Apart from the complication of yet another subsystem, the miss rate in a secondary cache will be depressingly high; the primary cache has already skimmed the cream of the repetitive data access behavior of the CPU. To make it worthwhile, the secondary cache must be much larger (typically eight times or greater) than the primary cache, and a secondary cache hit must be much faster (two times or better) than a memory reference. (A back-of-the-envelope model of this trade-off follows this list.)

Reorganize your software to reduce the miss rate. It’s not clear whether this works in practice: it’s easy to reorganize a small or trivial program to great effect, but so far nobody has succeeded in building a general tool that has any useful effect on an arbitrary program.

See Section 4.12.

• Decrease the cache refill penalty:

Get the first word back to the CPU faster. DRAM memory systems have to do a lot of work to start up, then tend to provide data quite fast. The closer the memory is to the CPU and the shorter the data path between them, the sooner the data will arrive back.

Note that this is the only entry in this list where better performance goes with a cheaper system. Paradoxically, it’s had the least attention, probably because it requires more integration between the CPU interface and memory system design. CPU designers are loath to deal with system issues when they decide the interface of their chips, perhaps because their job is too complicated already!

Increase the memory burst bandwidth. This is traditionally approached by the expensive technique of bank interleaving, where two or more memories are used to store alternate words; after the startup delay, you can take words from each memory bank alternately, doubling the available bandwidth. 1996 saw the first large-scale use of a new memory technology, synchronous DRAM (SDRAM), which changes the DRAM interface to deliver much more bandwidth from a single bank, making bank interleaving an obsolete technique.

• Restart the CPU earlier: The simplest method is to arrange that the cache refill bursts start with the word that the CPU missed on and to restart the CPU as soon as that data arrives. The rest of the cache refill continues in parallel with CPU activity. MIPS CPUs since the R4x00 have allowed for this technique by using sub-block order for cache refill burst data, which can deliver any word of the block first. But only the R4600 and its descendants have taken advantage of this for data misses.

More radically, you can just let execution continue through a load; the load operation is handed off to a bus interface unit and the CPU runs on until such time as it actually refers to the register data that was loaded.

This is called a nonblocking load and is implemented on the R10000 and slated for the RM7000.

Most drastically, you can just keep running any code that isn’t dependent on unfetched data, as is done by the out-of-order execution R10000.

This kind of CPU uses this technique quite generally, not just for loads but for computational instructions and branches.

Intel’s Pentium Pro (progenitor of the Pentium II), MIPS’s R10000, and HP’s PA-8000 are out-of-order implementations; these 200+ MHz multiple-issue CPUs are reasonably happy being served by a large (and thus relatively slow) external cache.
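As promised above, here is a back-of-the-envelope model of how the miss rates and refill penalties in this list combine. It is not from the original text, and every number in it is an illustrative assumption; it simply shows why a secondary cache pays off only when it is much faster than memory and large enough to keep its own miss rate down:

#include <stdio.h>

/* Average memory access time in CPU cycles for a two-level cache:
 * every access pays the L1 hit time; L1 misses add the L2 hit time;
 * L2 misses add the full memory refill penalty on top of that.
 */
static double amat(double t_l1, double m_l1,
                   double t_l2, double m_l2, double t_mem)
{
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
}

int main(void)
{
    /* Assumed baseline: 1-cycle L1 hit, 5% L1 miss rate, 60-cycle
     * memory refill, and no L2 (every L1 miss goes to memory).
     */
    double no_l2   = amat(1.0, 0.05, 60.0, 0.0, 0.0);   /* 4.00 */

    /* A useful L2: hits in 20 cycles (better than 2x faster than
     * memory) and big enough that only 25% of L1 misses miss again.
     */
    double good_l2 = amat(1.0, 0.05, 20.0, 0.25, 60.0); /* 2.75 */

    /* A poor L2: barely faster than memory and so small that most
     * L1 misses miss again; it ends up worse than no L2 at all.
     */
    double poor_l2 = amat(1.0, 0.05, 50.0, 0.75, 60.0); /* 5.75 */

    printf("no L2: %.2f  good L2: %.2f  poor L2: %.2f cycles\n",
           no_l2, good_l2, poor_l2);
    return 0;
}

The numbers bear out the rule of thumb in the list: the good L2 cuts the average access time by almost a third, while the slow, small one actually makes things worse, since nearly every L1 miss pays the L2 lookup and then the memory refill anyway.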

4.12 Reorganizing Software to Influence Cache Efficiency
