
Caches for MIPS

4.5 Other Choices in Cache Design

The 1980s and 1990s saw much work on and exploration of how to build caches, so there are yet more choices:

• Physically addressed/virtually addressed: While the CPU is running a grown-up OS, data and instruction addresses in your program (the program address or virtual address) are translated before appearing as physical addresses in the system memory.

A cache that works purely on physical addresses is easier to manage (we’ll explain why below), but raw program (virtual) addresses are available to start the cache lookup earlier, letting the system run that little bit faster.

So what’s wrong with program addresses? They’re not unique; many different programs running in different address spaces on a CPU may share the same program address for different data. We could reinitialize the entire cache every time we switch contexts between different address spaces; that used to be done some years ago and may be a reasonable solution for very small caches. But for big caches it’s ridiculously inefficient, and we’ll need to include a field identifying the address space in the cache tag to make sure we don’t mix them up.

There’s another, more subtle problem with program addresses: The same physical location may be described by different addresses in different tasks. In turn, that might lead to the same memory location cached in two different cache entries (because they were referred to by different virtual addresses that selected different cache indexes). These cache aliases must be avoided by the OS’s memory manager; see Section 4.14.2 for details.

From the R4000 on, MIPS primary caches have used the program address to provide a fast index to start the cache lookup. But rather than using the program address plus an address space identifier to tag the cache line, they use the physical address. The physical address is unique to the cache line, and the scheme is efficient because it allows the CPU to translate program addresses to physical addresses at the same time it is looking up the cache.


• DMA into memory: If a device is loading data into memory, it’s important to invalidate any cache entries purporting to hold copies of the memory locations concerned; otherwise, the CPU reading these locations will obtain stale cached data. The cache entries should be invalidated before the CPU uses any data from the DMA input stream.

Page 69 of the original book is missing.


Why Not Manage Caches in Hardware?

Caches managed with hardware are often called “snoopy”. When another CPU or some DMA device accesses memory, the addresses concerned are made visible to the cache.

With a CPU attached to a shared bus, this is pretty straightforward; the address bus contains most of the information you need. The hardware watches (snoops) the address bus even when the CPU is not using it and picks out relevant cycles. It does that by looking up its own cache to see whether it holds a copy of the location being accessed.

If someone is writing data that is inside the cache, the controller can pick up the data and update the cache line but is more likely to just invalidate its own, now stale, copy. If someone is reading data for which updated information is held in the cache, the controller may be able to intervene on the bus, telling the memory controller that it has a more up-to-date version.

One major problem with doing this is that it works only within a system designed to operate that way. Not all systems have a single bus where all transactions appear; bought-in I/O controllers are unlikely to conform to the right protocols.

Also, it’s very complicated. Most of the locations that CPUs work with are the CPU’s private areas; they will never be read or written by any other CPU or device. We’d like not to build hardware ingenuity into the cache, loading every cache location and bus cycle with complexity that will only sometimes be used.

It’s easy to suppose that a hardware cache control mechanism must be faster than software, but that’s not necessarily so. A snoopy cache controller must look at the cache tags on every external cycle, which could shut the CPU out of its cache and slow it down; complex cache controllers usually hold two copies of the cache tags for this reason. Software management can operate on blocks of cache locations in a single fast loop; hardware management will interleave invalidations or write-backs with CPU accesses at I/O speed, and that usually implies access arbitration overhead.

So MIPS took the radical RISC position: MIPS CPUs either have no cache management hardware or, where designed for multiprocessors, they have everything — like the R4400MC or R10000.

• Writing instructions: When the CPU itself is storing instructions into memory for subsequent execution, you must first ensure that the instructions are written back to memory and then make sure that the corresponding I-cache locations are invalidated: The MIPS CPU has no connection between the D-cache and the I-cache.

If your software is going to fix these problems, it needs to be able to do two distinct operations on a cache entry.

The first operation is called write back. The CPU must be able to look in the cache for data for a particular location. If the data is present in the cache and is dirty (marked as having been written by the CPU since it was last obtained from memory or written back to memory), then the CPU copies the data from the cache into main memory.

Page 71 of the original book is missing.


Table 4.1: Cache evolution in MIPS CPUs

CPU          Primary                               Secondary             Tertiary
(MHz)        I-cache  D-cache  direct/  on-       Size   direct/  on-   Size  direct/  on-
                               n-way    chip?            n-way    chip?       n-way    chip?
R3000-33     32K      32K      Direct   No
R3052-33     8K       2K       Direct   Yes
R4000-100    8K       8K       Direct   Yes       1M     Direct   No
R4600-100    16K      16K      2-way    Yes
R10000-250   32K      32K      2-way    Yes       4M     2-way    No
R5000-200    32K      32K      2-way    Yes       1M     Direct   No
RM7000-xxx   16K      16K      4-way    Yes       256K   4-way    Yes   8M    Direct   No

CPUs that add another level of hierarchy reduce the miss penalty for the next cache inward, so the designers may be able to simplify the inner cache in search of higher clock rates, most obviously by making the next inner cache smaller. It seems likely that as many high-end CPUs gain on-chip secondary caches (from 1998 on), primary cache sizes will fall slightly, with dual 16KB primary caches a favored “sweet spot”.1

An off-chip cache is generally direct mapped because a set-associative cache system needs multiple buses and therefore an awful lot of pins to connect it. This is still an area for experimentation; the MIPS R10000 implements an external two-way set-associative cache with one data bus by delaying the returned data when the hit is not in the expected set.

Amidst all this evolution, there have been two main generations of the software interface to the cache. From a software point of view there is one style founded by the R3000 and followed by practically all 32-bit MIPS CPUs; there is another starting with the R4000 and used by all 64-bit CPUs to date.2

R3000-type MIPS CPUs have caches that are write through, direct mapped, and physically addressed. Cache locations are accessible only as whole words, so an operation that writes a byte (or anything less than a whole word) has to be managed specially. Cache management is done using special modes in which regular loads and stores do magic things to the cache.

1At least this is true of architectures where the primary cache access is mostly fitted into one clock cycle — always true of MIPS so far. It’s intuitively plausible that there should be a more or less fixed cache size whose access takes about the same time as the other activities traditionally fitted into one pipeline stage. However, early versions of the RISC HP-8x00 CPU family accept a two-clock-cycle primary cache latency in return for a huge external primary cache, and they seem to work well.

2One day (perhaps by the time you read this) there will probably be 32-bit MIPS CPUs with R4000-type caches.
