

2.1.1 CPU

2.1.1.1 Data cache

Data caches store recently-used data in a fast but small Static Random Access Memory (SRAM). They exploit the concepts of temporal and spatial locality: if some resource is accessed, it is likely to be re-accessed in the near future (temporal locality), as are its neighboring resources (spatial locality).

Cache hierarchy Since the Nehalem microarchitecture and up to the most recent Broadwell microarchitecture [Int14a], Intel processors use a cache hierarchy similar to the one depicted in Figure 2.3. There are usually three cache levels, called L1, L2 and L3. The L1 and L2 levels are private to each core and store several kilobytes. The L3 cache, also called the Last-Level Cache (LLC), is shared among cores and can store several megabytes.

To read or write data in main memory, the CPU first checks whether the memory location is present in the L1 cache. If the address is found, it is a cache hit and the CPU immediately reads or writes the data in the cache line. Otherwise, it is a cache miss and the CPU searches for the address in the next level, and so on, until reaching main memory. A cache hit is significantly faster than a cache miss.
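To make this latency gap concrete, the following sketch times a single memory access with the rdtscp instruction, once when the line is cached and once after flushing it with clflush. This is a minimal illustration assuming an x86-64 processor and the GCC/Clang intrinsics headers; the exact cycle counts, and the threshold separating hits from misses, are machine-dependent and must be calibrated.

/*
 * Minimal sketch (x86-64, GCC/Clang): timing one memory access to
 * distinguish a cache hit from a cache miss.
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

static uint64_t time_access(volatile char *p)
{
    unsigned int aux;
    uint64_t start, end;

    _mm_mfence();                 /* serialize pending memory operations */
    start = __rdtscp(&aux);
    (void)*p;                     /* the access being timed */
    end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}

int main(void)
{
    static char buf[4096];
    volatile char *p = buf;

    (void)*p;                      /* load the line: the next access hits */
    printf("hit:  %lu cycles\n", (unsigned long)time_access(p));

    _mm_clflush(buf);              /* evict the line from all cache levels */
    _mm_mfence();
    printf("miss: %lu cycles\n", (unsigned long)time_access(p));
    return 0;
}

On typical hardware the second measurement is several times larger than the first, which is precisely the signal exploited by timing-based cache measurements.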

Particularities of the last-level cache in recent processors In recent Intel processors, the last-level cache is divided into slices that are connected to the cores through a ring interconnect. Moreover, the last-level cache is inclusive, which means that it is a superset of the L1 and L2 caches, i.e., it contains all the data present in L1 and L2.
Figure 2.4: Simple addressing scheme. A 64-bit address is decomposed into t tag bits, s set bits and o offset bits; the set bits select the set within a cache level, and the offset bits select the byte within the line.

This property does not fully exploit the total available capacity of the cache levels; however, it is an advantageous design for performance reasons, as only one level needs to be checked to know whether a line is cached. Inclusiveness also simplifies the cache coherence protocol: to guarantee the inclusion property, a line evicted from the last-level cache is also removed (invalidated) from the L1 and L2 caches.

Addressing scheme Data is transferred between the cache and the memory in 64-byte blocks called lines. The location of a particular line depends on the cache structure. Today's caches are n-way associative, which means that a cache is composed of sets of n lines. A line is loaded in a specific set depending on its address, and occupies any of the n lines.

With caches that implement a direct addressing scheme, memory addresses can be decomposed into three parts: the tag, the set and the offset in the line. The lowest o bits determine the offset in the line, with o = log2(line size). The next s bits determine the set, with s = log2(number of sets). The remaining t bits form the tag. Figure 2.4 illustrates this scheme.
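As an illustration, the following sketch computes these three fields for an arbitrary address. The parameters are illustrative: 64-byte lines (o = 6) and 2048 sets (s = 11), matching the LLC slice characteristics used in Figure 2.5.

/*
 * Sketch: decomposing an address under the direct addressing scheme
 * of Figure 2.4, with illustrative cache parameters.
 */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64     /* bytes per line -> o = log2(64)   = 6  */
#define NUM_SETS  2048   /* sets           -> s = log2(2048) = 11 */

int main(void)
{
    uint64_t addr = 0x7f1234567abcULL;  /* arbitrary example address */

    uint64_t offset = addr % LINE_SIZE;               /* lowest o bits  */
    uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* next s bits    */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* remaining bits */

    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}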

In contrast to direct addressing, some caches implement a complex addressing scheme, where potentially all address bits are used to index the cache. Indeed, in the last-level cache, each physical address is associated with a slice via a function that is not documented by Intel, to the best of our knowledge.


Figure 2.5: Complex addressing scheme on the LLC: a hash function H of the physical address outputs 2 bits that select one of the slices 0 to 3, while the set is indexed directly by the address bits above the line offset. This assumes a quad-core processor, and the following characteristics of the LLC: 64 B lines and 2048 sets per slice.

As each slice has its own cache pipeline, the addressing function is designed to distribute the traffic evenly across all slices for a wide range of memory access patterns, to increase performance. The set is then directly addressed. Intel has implemented this complex addressing scheme since the Sandy Bridge microarchitecture (see Table 2.1). Figure 2.5 illustrates this scheme.
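Such a function is commonly assumed to compute each slice-selection bit as the XOR of a subset of physical address bits. The sketch below illustrates this mechanism only: the bit masks are hypothetical placeholders chosen for the example, not Intel's actual, undocumented function.

/*
 * Sketch of an XOR-based slice hash, in the spirit of the complex
 * addressing scheme of Figure 2.5. The masks below are ILLUSTRATIVE
 * placeholders, not the real function: each output bit of H is the
 * parity (XOR) of a subset of physical address bits, and the two
 * output bits select one of four slices.
 */
#include <stdio.h>
#include <stdint.h>

static inline int parity64(uint64_t x)
{
    return __builtin_parityll(x);   /* XOR of all set bits */
}

static unsigned int slice_hash(uint64_t paddr)
{
    /* Hypothetical masks: which physical address bits feed each
     * output bit of H. Bits 0-5 (the line offset) are excluded. */
    const uint64_t mask_bit0 = 0x1b5f575440ULL;
    const uint64_t mask_bit1 = 0x2eb5faa880ULL;

    unsigned int o0 = parity64(paddr & mask_bit0);
    unsigned int o1 = parity64(paddr & mask_bit1);
    return (o1 << 1) | o0;          /* slice id in 0..3 */
}

int main(void)
{
    for (uint64_t paddr = 0; paddr < 4 * 64; paddr += 64)
        printf("paddr=0x%llx -> slice %u\n",
               (unsigned long long)paddr, slice_hash(paddr));
    return 0;
}

Because the masks mix many high address bits, consecutive cache lines are spread across slices, which is exactly the load-balancing effect the paragraph above describes.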

The address used to compute the cache location can be either a physical or a virtual address. A Virtually Indexed, Virtually Tagged (VIVT) cache only uses virtual addresses to locate data in the cache. Modern processors involve physical addressing, thus cache levels are either Virtually Indexed, Physically Tagged (VIPT) or Physically Indexed, Physically Tagged (PIPT). The physical address is not known by processes, thus a process cannot know the location of a specific line in physically indexed caches. Typically, the L1 cache is VIPT, and the L2 and L3 are PIPT.
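A process is not completely blind, however: with standard 4 KiB pages, the lowest 12 bits of the virtual and physical addresses coincide, so part of the set index is known even for a physically indexed cache. The short sketch below works this out, again assuming the illustrative parameters of 64-byte lines and 2048 sets.

/*
 * Sketch: how many set-index bits a process can derive from a virtual
 * address alone, assuming 4 KiB pages (illustrative parameters).
 */
#include <stdio.h>

int main(void)
{
    const unsigned page_bits   = 12;  /* 4 KiB pages               */
    const unsigned offset_bits = 6;   /* 64-byte lines             */
    const unsigned set_bits    = 11;  /* e.g., 2048 sets per slice */

    /* set-index bits that fall inside the page offset are identical
     * in the virtual and physical address */
    unsigned known   = page_bits - offset_bits;
    unsigned unknown = set_bits > known ? set_bits - known : 0;

    printf("known set-index bits:   %u\n", known);    /* 6 */
    printf("unknown set-index bits: %u\n", unknown);  /* 5 */
    return 0;
}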

Replacement policy When a cache set is full, a cache line needs to be evicted before a new one can be stored. When a line is evicted from L1, it is stored back to L2, which can in turn lead to the eviction of a line to the last-level cache, and so on. The replacement policy selects the victim line to be evicted; good replacement policies choose the line that is least likely to be reused.

Table 2.1: Characteristics of the recent Intel microarchitectures.

                     Nehalem   Sandy Bridge   Ivy Bridge   Haswell
LLC slices           ✓         ✓              ✓            ✓
LLC complex addr.    ✗         ✓              ✓            ✓
Replacement policy   LRU       LRU            Quad-Age     Quad-Age

Commonly used policies include Pseudo-Random, Least Recently Used (LRU), and variations of LRU [JTSE10] (see Table 2.1). An adaptive policy can also be used, where the processor dynamically changes the replacement policy depending on the miss rate of specific cache sets [QJP+07]. An efficient replacement policy minimizes the number of cache misses and is thus crucial for performance; these policies are therefore not well documented for recent processors. For instance, the replacement policy used in the Ivy Bridge microarchitecture, a variation of LRU called Quad-Age, only appears as part of an Intel presentation [JGSW], and is, to the best of our knowledge, not fully documented. Details of the replacement policies can, however, be partially reverse-engineered using micro-benchmarks, as has been done for the Ivy Bridge microarchitecture [Won13].
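To give an idea of how such micro-benchmarks reason about a policy, the following sketch models a single n-way set under plain (textbook) LRU, not Quad-Age: reverse engineering compares the hits and misses predicted by such a model against the access times measured on real hardware.

/*
 * Sketch: software model of one n-way cache set with an LRU
 * replacement policy (age 0 = most recently used).
 */
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

static uint64_t set_tags[WAYS];   /* tag stored in each way */
static int      set_age[WAYS];    /* 0 = most recently used */
static int      set_used;         /* number of valid ways   */

/* Access a tag; returns 1 on a hit, 0 on a miss (with LRU eviction). */
static int access_tag(uint64_t tag)
{
    int i, victim = 0;

    for (i = 0; i < set_used; i++) {
        if (set_tags[i] == tag) {          /* hit: renew this line */
            int old = set_age[i];
            for (int j = 0; j < set_used; j++)
                if (set_age[j] < old) set_age[j]++;
            set_age[i] = 0;
            return 1;
        }
    }
    /* miss: fill a free way, or evict the oldest line */
    if (set_used < WAYS) {
        victim = set_used++;
    } else {
        for (i = 1; i < WAYS; i++)
            if (set_age[i] > set_age[victim]) victim = i;
    }
    for (i = 0; i < set_used; i++) set_age[i]++;
    set_tags[victim] = tag;
    set_age[victim]  = 0;
    return 0;
}

int main(void)
{
    /* Access pattern A B C D A E A: with a 4-way LRU set, E evicts B
     * (the least recently used line), so the final A still hits. */
    const char *names = "ABCDAEA";
    for (int i = 0; names[i]; i++)
        printf("%c: %s\n", names[i],
               access_tag((uint64_t)names[i]) ? "hit" : "miss");
    return 0;
}

Running the same access pattern on hardware and observing a different hit/miss sequence than the model predicts is evidence that the real policy deviates from plain LRU, which is how the Ivy Bridge policy was characterized [Won13].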