
Memory Management

In the document in C++ and MPI (Page 56-60)


2.2 Mathematical and Computational Concepts

2.2.6 Memory Management

Before we present BLAS (Basic Linear Algebra Subroutines) in the next section, we will give you some preliminary information concerning memory management which will help to make the next discussion more relevant.

In this section, we will discuss two issues:

Memory layout for matrices, and

Cache blocking.

Our discussion will be directed toward understanding how memory layout in main memory and cache (see figure 2.11) affects performance.

Memory Layout for Matrices

Computer memory consists of a linearly addressable space as illustrated in figure 2.5. In this illustration, we denote memory as a one-dimensional partitioned strip. To the right of the strip, the labels addr denote the address of the parcel of memory. By linearly addressable we mean that addr2 = addr1 + addrset, where addrset is the memory offset between two contiguous blocks of addressable memory.3 Single variables and one-dimensional arrays fit quite nicely into this concept, since a single integer variable needs only one parcel of memory, and the array needs only one parcel of memory per element of the array.
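The contiguity of a one-dimensional array can be checked directly with pointer arithmetic. The following sketch (our own illustration, not taken from the text; the helper name is hypothetical) verifies that neighboring elements of an array of floating point values are exactly sizeof(double) bytes apart, playing the role of addrset:

```cpp
#include <cassert>
#include <cstddef>

// Consecutive elements of a one-dimensional array occupy contiguous
// parcels of memory, so the byte offset between neighbors is exactly
// sizeof(element) -- the "addrset" of the text.
inline std::ptrdiff_t neighbor_offset_bytes(const double* x, std::size_t i) {
    const char* a = reinterpret_cast<const char*>(&x[i]);
    const char* b = reinterpret_cast<const char*>(&x[i + 1]);
    return b - a;   // distance in bytes between element i and element i+1
}
```

The same reasoning extends elementwise: element i of the array lives at the base address plus i times this offset.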

How are two-dimensional arrays stored in memory? Since memory is linearly addressable, in order to store a two-dimensional array we must decide how to decompose the matrix into one-dimensional units. The two obvious means of doing this are decomposing the matrix into

3We have remained general because different architectures allow different addressable sets. Some architectures are bit addressable, some byte addressable, and some only word addressable. We will not delve further into this matter, but the interested reader should consult a computer architecture book for more details.


Figure 2.5: Schematic showing the memory layout for an integer variable and an array of floating point values. The partitioned strip denotes the memory of the computer, and the labels addr to the right denote the address of the parcel of memory.

a collection of rows or decomposing the matrix into a collection of columns. The first is referred to as “row-major order”, and the latter is referred to as “column-major order”. This concept is illustrated in figure 2.6.

After examining figure 2.6, we draw your attention to the following statements:

The amount of memory used by both orderings is the same. Nine units of memory are used in both cases.

The linear ordering is different. Although both orderings start with the same entry (S00), the next addressable block in memory contains a different entry of S (S01 for row-major order versus S10 for column-major order). This is an important observation which we will discuss further in just a moment.

If S is a symmetric matrix (meaning Sij = Sji), then row-major ordering and column-major ordering appear identical.

C++ uses row-major ordering while FORTRAN uses column-major ordering. In section 5.1.4 we discuss how multi-dimensional arrays are allocated in C++, and how row-major ordering comes into play.
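The two orderings can be captured by the formulas mapping the entry S(i, j) of an n×n matrix to its position in linear memory. The following sketch (our own illustration, with hypothetical helper names) makes the difference concrete:

```cpp
#include <cstddef>

// Flat index of entry S(i, j) of an n x n matrix under each ordering.
// Row-major (C++): each row is contiguous, so index = i * n + j.
inline std::size_t row_major(std::size_t i, std::size_t j, std::size_t n) {
    return i * n + j;
}
// Column-major (FORTRAN): each column is contiguous, so index = i + j * n.
inline std::size_t col_major(std::size_t i, std::size_t j, std::size_t n) {
    return i + j * n;
}
```

Note how the second addressable block (index 1) holds S01 in row-major order but S10 in column-major order, exactly the observation made above.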

In addition to the comments above, one other item needs to be mentioned before we can conclude this discussion. As shown in figure 2.11, most modern computing architectures have several layers of memory between what we have referred to as “main memory” and the central processing unit (CPU). Main memory is normally much slower than the CPU,


Figure 2.6: The 3×3 matrix S is decomposed by row-major ordering on the left and by column-major ordering on the right.

and hence engineers decided to insert smaller but faster memory between main memory and the CPU (in this discussion, we will lump register memory in with the CPU and will refer to Level 1 (L1) and Level 2 (L2) cache simply as “cache”). If all the items necessary to accomplish a computation could fit into cache, then the total time to execute the instructions would be reduced, because loads and stores would be served by the fast cache rather than by the slower main memory.

In general, even for programs in which not all the data can fit into cache, the use of cache decreases the total time cost by reusing cached data as much as possible. This is referred to as cache reuse. The goal is that once some piece of information has been loaded into cache, it should be reused as much as possible (since access to it is much faster than access to the same piece of data in main memory). Whenever a piece of needed information is found in cache, we have what is referred to as a cache hit. Whenever a piece of information cannot be found in cache, and hence must be obtained from main memory (by loading it into cache), we have what is referred to as a cache miss.

Items from main memory are loaded in blocks of size equal to the size of a cache line, a term which comes from the size of the lines going from main memory modules to the cache memory modules. Hence, if you want to access one element of an array and it is not in cache (a miss), an entire cache line’s worth of information will be loaded into cache, which may include many contiguous elements of the array. We draw your attention to figure 2.7. In this illustration, we are examining the memory layout during a matrix-vector multiplication. The 3×3 matrix A is stored in row-major order, followed by a 3×1 vector x. In this simplified example, our cache consists of nine units which are loaded/stored in blocks of three units (our cache line is of size three units).

Figure 2.7: Main memory and cache layout for a matrix-vector multiplication.

At macro time t0, the first element of A is needed, so an entire cache line’s worth of information is loaded, which in this case consists of three units. Hence, the entire first row of A is loaded into cache. The first element of the vector x is needed, and hence an entire cache line’s worth of information is loaded, which is equal to all of x. To accomplish the dot product of the first row of A with the vector x, the other entries already in cache are needed.

Thus, we have several cache hits as other computations are accomplished in order to obtain the end result of the dot product, b1. At macro time t1, the first item in the second row of A is needed, and hence there is a cache miss. An entire cache line of information is loaded, which in this case is the entire second row. Again, many computations can be accomplished until another cache miss happens, in which case new information must be loaded to cache.

Although this is a very simplified example, it demonstrates how memory layout can be very important to cache reuse. What would have happened if we had stored the matrix A in column-major order, but had implemented the algorithm above? Because of the particular cache size and cache line size, we would have had a cache miss almost every time we needed an entry of A, and hence the total time cost would have been greatly increased by the excessive direct use of main memory instead of the fast cache.
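The cache-friendly traversal just described can be sketched as follows (an illustrative implementation of our own, with hypothetical names; the flat vector A holds the matrix row by row, so the inner loop walks contiguous memory exactly as in figure 2.7):

```cpp
#include <cstddef>
#include <vector>

// b = A x for an n x n matrix A stored row-major in a flat vector.
// The inner loop reads A[i*n + j] for consecutive j, so each cache line
// loaded on a miss is fully reused before the next row causes another miss.
std::vector<double> matvec_row_major(const std::vector<double>& A,
                                     const std::vector<double>& x,
                                     std::size_t n) {
    std::vector<double> b(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)        // one dot product per row
        for (std::size_t j = 0; j < n; ++j)    // contiguous accesses to row i
            b[i] += A[i * n + j] * x[j];
    return b;
}
```

Storing A column-major while keeping this loop order would make the inner loop stride through memory in jumps of n elements, which is the miss-heavy scenario described above.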

Cache Blocking

The concept of cache blocking is to structure the data and operations on that data so that maximum cache reuse can be achieved (maximize cache hits). In the example above, we have achieved this by choosing to store the matrix A in row-major order for the particular algorithm which we implemented.
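A classic instance of cache blocking, which goes beyond the layout choice above, is tiled matrix-matrix multiplication. The following is our own illustrative sketch, not an algorithm from the text; the block size BS is an assumed tuning parameter, chosen in practice so that three BS×BS tiles fit in cache together:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for n x n matrices in row-major flat storage; C must be
// zero-initialized by the caller. The three outer loops walk over tiles,
// so every tile element is reused roughly BS times while resident in cache,
// instead of being evicted between uses.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t BS) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // Multiply one BS x BS tile of A by one tile of B.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

The arithmetic is identical to the naive triple loop; only the order of the operations changes, which is precisely the point: cache blocking restructures the computation, not the mathematics.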

Keep this concept in mind during the discussion of BLAS in the next section. In several cases, different implementations are given for the same mathematical operation, some depending on whether your matrix is stored in column-major order or in row-major order. In the end, the computational time used to run your program can be significantly altered by paying attention to which algorithm is most efficient with which memory layout. Taking cache characteristics into account when determining the data partition is very important, and can greatly enhance or degrade the performance of your program.
