
Memory Management and the TLB

6.2 MIPS TLB Facts and Figures

The MIPS TLB has always been implemented on chip: The memory translation step is required even for cached references, so it's very much on the critical path of the machine. That meant it had to be small, particularly in the early days, and it makes up for its small size by being clever.

It’s basically a genuine associative memory. Each entry in an associative memory consists of a key field and a data field; you present the key and the hardware gives you the data of any entry where the key matches. Associative memories are wonderful, but they are expensive in hardware. MIPS TLBs have had between 32 and 64 entries; a store of this size is manageable as a silicon design.

R4000-style CPUs so far have used a TLB where each entry is doubled up to map two consecutive VPNs to independently specified physical pages. The paired entries double the amount of memory that can be mapped by the TLB with only a little extra logic, without requiring any large-scale rethinking of TLB management.

You will see the TLB referred to as being fully associative; this emphasizes that all keys are really compared with the input value in parallel.1

The TLB entry is shown schematically in Figure 6.3 (you’ll find detailed programming information later in Section 6.5). The TLB’s key consists of the following:

• VPN: The high-order bits of the virtual address (the virtual address of the page less its low bits). It becomes VPN2 with the double entry, to emphasize that if each physical page is 4KB, the virtual address selecting a pair of entries loses its least-significant bit (which now selects the left or right output field).

1The R4000's TLB would be correctly, if pedantically, described as a 48-way set-associative store, with two entries per set.

Figure 6.3: TLB entry fields. The R3000-style entry has the key fields VPN, ASID, and G, with a single PFN output. The R4000-style entry has the key fields VPN2, PageMask, ASID, and G, with two output fields (one PFN per page of the pair).

• PageMask: This is only found on later CPUs. It controls how much of the virtual address is compared with the VPN and how much is passed through to the physical address; a match on fewer bits maps a larger region. MIPS CPUs can be set up to map up to 16MB with a single entry. With all page sizes, the most significant masked bit is used to select the even or odd entry.

• ASID: Marks the translation as belonging to a particular address space, so it won't be matched unless the CPU's current ASID value matches too. The G bit, if set, disables the ASID match, making the translation entry apply to all address spaces (so this part of the address map is shared between all spaces). The ASID is 6 bits long on early CPUs, 8 bits on later ones.1
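The key-matching rule just described (compare the VPN under the page mask, and require ASID equality unless G is set) can be sketched in C. This is an illustrative model only; the field names and widths are assumptions for the sketch, not a hardware register layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one TLB entry's key side (R4000-style).
 * Field widths are assumptions for this sketch, not a hardware layout. */
struct tlb_key {
    uint32_t vpn2;      /* virtual page-pair number (vaddr >> 13 for 4KB pages) */
    uint32_t pagemask;  /* 1 bits = address bits NOT compared (larger pages) */
    uint8_t  asid;      /* address-space ID */
    bool     global;    /* G bit: ignore ASID on match */
};

/* Does this entry match a virtual address under the current ASID? */
static bool tlb_key_match(const struct tlb_key *e,
                          uint32_t vaddr, uint8_t cur_asid)
{
    /* Compare only the VPN bits not hidden by the page mask. With 4KB
     * pages the low 13 bits (12-bit offset plus the even/odd select
     * bit) never take part in the comparison. */
    uint32_t cmp_mask = ~(e->pagemask | 0x1FFF);
    if ((vaddr & cmp_mask) != ((e->vpn2 << 13) & cmp_mask))
        return false;
    return e->global || e->asid == cur_asid;
}
```

Masking more bits makes one entry match a larger region, which is exactly how the big-page sizes work.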

The TLB’s output side gives you the physical frame number and a small but sufficient bunch of flags:

• Physical frame number (PFN): This is the physical address with the low 12 bits cut off.

• Cache control (N/C): The 32-bit CPUs have just the N (noncacheable) bit: 0 for cacheable, 1 for noncacheable.

The 64-bit CPUs provide a 3-bit field C that can contain a larger range of values, which tell multiprocessor hardware what protocols to use when data in this page is shared with other processors. Those 64-bit CPUs that don't have hardware cache coherency features have maintained this TLB entry layout; only the two code values that mean cacheable with all R4000 cache features (3) and uncached (2) are standard over all R4000-style CPUs. Modern embedded CPUs can select different cache-management strategies with different values: write through vs. write back, or write allocate vs. uncached write on miss. See your CPU manual.

1The OS-aware reader will appreciate that even 256 is too small an upper limit for the number of simultaneously active processes on a big UNIX system. However, it's a reasonable limit so long as "active" in this context is given the special meaning of "may have translation entries in the TLB". Software has to recycle ASIDs where necessary, which will involve purging the TLB of translation entries for the process that is being downgraded. It's a dirty business, but so is quite a lot of what OSs have to do; and 256 entries should be enough to make sure it doesn't have to be done so often as to constitute a performance problem. For programming purposes, the G bit is stored with the output side's flags.

• Write control bit (D): Set to 1 to allow stores to this page to happen. The "D" comes from this being called the "dirty bit"; see Section 6.8 for why.

• Valid bit (V): If this is 0, the entry is unusable. This seems pretty pointless: Why have a record loaded into the TLB if you don't want the translation to work? It's because the software routine that refills the TLB is optimized for speed and doesn't want to check for special cases. When some further processing is needed before a program can use a page referred to by the memory-held table, the memory-held entry can be left marked invalid. After TLB refill, this will cause a different kind of trap, invoking special processing without having to put a test in every software refill event.
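On R4000-style CPUs these output fields are packed into the EntryLo registers (described in detail in Section 6.5). A sketch of unpacking such a word, assuming the common MIPS32 layout with G in bit 0, V in bit 1, D in bit 2, C in bits 3-5, and the PFN from bit 6 up; check your CPU manual for the layout your part actually uses:

```c
#include <stdint.h>

/* Output side of a TLB entry, unpacked from an EntryLo-style word.
 * The bit positions assumed here are the usual MIPS32/R4000 ones. */
struct tlb_out {
    uint32_t pfn;    /* physical frame number */
    unsigned c;      /* cache algorithm (3 = cacheable, 2 = uncached) */
    unsigned d;      /* 1 = stores permitted (the "dirty" bit) */
    unsigned v;      /* 1 = entry usable */
    unsigned g;      /* G travels with the output flags for programming */
};

static struct tlb_out entrylo_unpack(uint32_t lo)
{
    struct tlb_out o;
    o.g   =  lo        & 1;
    o.v   = (lo >> 1)  & 1;
    o.d   = (lo >> 2)  & 1;
    o.c   = (lo >> 3)  & 7;
    o.pfn =  lo >> 6;
    return o;
}
```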

Translating an address is now simple, and we can amplify the description above:

• CPU generates a program address: This happens for an instruction fetch, a load, or a store that doesn't lie in the special unmapped regions of the MIPS address space.

The low 12 bits are separated off, and the resulting VPN together with the current value of the ASID field in EntryHi is used as the key to the TLB, as modified in effect by the PageMask and G fields in TLB entries.

• TLB matches key: The matching entry is selected. The PFN is glued to the low-order bits of the program address to form a complete physical address.

• Valid? The V and D bits are consulted. If it isn't valid, or a store is being attempted with D unset, the CPU takes a trap. As with all translation traps, the BadVaddr register will be filled with the offending program address; as with any TLB exception, the TLB EntryHi register will be preloaded with the VPN of the offending address.

Don't use the convenience registers Context (and XContext on 64-bit CPUs) other than in TLB miss processing. At other times they might track things like BadVaddr or they might not; either would be a legitimate implementation.

• Cached? If the C bit is set, the CPU looks in the cache for a copy of the physical location's data; if it isn't there, it will be fetched from memory and a copy left in the cache. Where the C bit is clear, the CPU neither looks in nor refills the cache.
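The steps above can be modeled in C. This is a software sketch of what the hardware does on every mapped reference, using simplified 4KB-page paired entries (field names are illustrative, and PageMask is left out for clarity):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified 4KB-page TLB entry pair for this sketch (not a register layout). */
struct tlb_entry {
    uint32_t vpn2;          /* virtual address >> 13 (a pair of 4KB pages) */
    uint8_t  asid;
    bool     global;
    /* one output side per page of the pair: [0] = even, [1] = odd */
    uint32_t pfn[2];
    bool     valid[2], dirty[2];
};

enum tlb_result { TLB_OK, TLB_REFILL, TLB_INVALID, TLB_MOD };

/* Model one mapped access: look up vaddr, check V (and D on a store),
 * and glue the PFN to the page offset to form the physical address. */
static enum tlb_result translate(const struct tlb_entry *tlb, int nentries,
                                 uint32_t vaddr, uint8_t cur_asid,
                                 bool is_store, uint32_t *paddr)
{
    for (int i = 0; i < nentries; i++) {
        const struct tlb_entry *e = &tlb[i];
        if ((vaddr >> 13) != e->vpn2)
            continue;                      /* VPN2 mismatch */
        if (!e->global && e->asid != cur_asid)
            continue;                      /* wrong address space */
        int odd = (vaddr >> 12) & 1;       /* even or odd page of the pair */
        if (!e->valid[odd])
            return TLB_INVALID;            /* V clear: TLB invalid trap */
        if (is_store && !e->dirty[odd])
            return TLB_MOD;                /* D clear on a store: trap */
        *paddr = (e->pfn[odd] << 12) | (vaddr & 0xFFF);
        return TLB_OK;
    }
    return TLB_REFILL;                     /* no match: TLB refill trap */
}
```

In the hardware, of course, all entries are compared at once rather than in a loop; the loop here only stands in for the parallel match.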

Of course, the number of entries in the TLB permits you to translate only a relatively small number of program addresses — a few hundred KB worth.

This is far from enough for most systems. The TLB is almost always going to be used as a software-maintained cache for a much larger set of translations.

When a program address lookup in the TLB fails, a TLB refill trap is taken.1 System software has the following job:

• It figures out whether there is a correct translation; if not, the trap will be dispatched to the software that handles address errors.

• If there is a correct translation, it constructs a TLB entry that will implement it.

• If the TLB is already full (and it almost always is full in running systems), the software selects an entry that can be discarded.

• The software writes the new entry into the TLB.

See Section 6.7 for how this can be tackled, but note here that although special CPU features help out with one particular class of implementations, the software can refill the TLB any way it likes.
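A minimal model of that job in C, assuming a hypothetical flat page table and random replacement (real refill handlers are a few instructions of assembly, and the page-table layout and victim policy here are assumptions of the sketch, not anything mandated by the CPU):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical memory-held page-table entry for this sketch. */
struct pte { bool present; uint32_t pfn; bool dirty; };

#define TLB_ENTRIES 64

/* Toy software TLB standing in for the hardware array, so the refill
 * routine below has something to write into. */
struct { uint32_t vpn, pfn; bool dirty, used; } soft_tlb[TLB_ENTRIES];
bool refill_faulted;                     /* set when no translation exists */

/* The refill job: find the translation, pick a victim, write the entry. */
void tlb_refill(const struct pte *page_table, size_t npages, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> 12;          /* 4KB pages assumed */
    if (vpn >= npages || !page_table[vpn].present) {
        refill_faulted = true;           /* dispatch to address-error handling */
        return;
    }
    /* The TLB is usually full, so some entry must be discarded; the
     * replacement policy is entirely software's choice (random here). */
    int victim = rand() % TLB_ENTRIES;
    soft_tlb[victim].vpn   = vpn;
    soft_tlb[victim].pfn   = page_table[vpn].pfn;
    soft_tlb[victim].dirty = page_table[vpn].dirty;
    soft_tlb[victim].used  = true;
}
```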
