
5 Implications of Nonuniform Memory for Software

5.1 The Influence of Virtual Memory Management

Recall the horrific example from Section 4.2, a seemingly innocent matrix addition that nobody would give much thought to, were it not that it glaringly illustrates that the interplay between memory mapping, VMM, and the program’s instructions can have enormous consequences. This is greatly aggravated by the general ignorance of most programmers about memory mappings and related systems aspects. This ignorance may be entirely acceptable for in-core programs (programs that use strictly main memory for all storage needs, including all intermediate storage, and that access external memory, such as disks, only for the initial input of the data and the output of the final result), because for in-core programs the memory-mapping function has no significant performance implications. If all data are in main memory, any direct access to any data element takes time independent of its location, because of the random access property (RAP) of main memory. Consequently, it makes no difference in what way the data structures are mapped into main memory (as long as that mapping preserves the RAP). However, this ignorance can be fatal as soon as the program ceases to be in-core and instead uses external memory (usually magnetic disks), either in conjunction with VMM or through a direct out-of-core programming approach.

It is instructive to examine the differences between VMM and overt out-of-core programming. In out-of-core programming the programmer must specify which blocks of data are transferred between disk and main memory, and exactly at what point during program execution. This results in considerable additional effort for the programmer; it is also a potent source of errors. For these reasons, most programmers avoid out-of-core programming at all costs. Nevertheless, it should also be clear that out-of-core programming affords the programmer a significant amount of control over the actions of the program. In contrast, VMM creates a virtual memory space that is dramatically larger than the main memory; for most practical purposes it is unlimited.2 As a result, the programmer can proceed as if this virtual memory space were the actual one. Thus, the programmer need not concern herself with the explicit manipulation of (units of) space; instead, the instructions of the algorithm can be translated into code without having to manipulate blocks of data that must be moved between main memory and disk. While this approach is clearly convenient (who wants to worry about the transfer of blocks of data?), the programmer loses a good deal of control over the behavior of the program.3
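To make the contrast concrete, here is a minimal sketch in C (assuming a POSIX system); the block size and the process() routine are hypothetical placeholders for whatever the application actually computes. The first version manages the disk transfers explicitly, in the out-of-core style; the second maps the file into the virtual address space and leaves every transfer to the VMM.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK_WORDS 1024                     /* assumed transfer unit: 1024 doubles */

extern void process(double *buf, size_t n);  /* hypothetical per-block computation */

/* Out-of-core style: the programmer decides which block is read, and when. */
void out_of_core(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(EXIT_FAILURE); }
    double buf[BLOCK_WORDS];
    size_t n;
    while ((n = fread(buf, sizeof(double), BLOCK_WORDS, f)) > 0)
        process(buf, n);                     /* only this one block occupies main memory */
    fclose(f);
}

/* VMM style: map the whole file and let the paging mechanism fetch (and displace)
   pages implicitly as the data are touched -- convenient, but the programmer no
   longer controls which pages move, or when. */
void via_vmm(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(EXIT_FAILURE); }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(EXIT_FAILURE); }
    double *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    process(data, st.st_size / sizeof(double));
    munmap(data, st.st_size);
    close(fd);
}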

To understand the serious implications of these processes, we need a better understanding of the underlying architectural aspects of the various types of memory and, in particular, of the access characteristics of the types of memory involved.

2 In physical terms, it is essentially limited by the amount of (free) space on the disks available to the system.


Traditionally, the memory hierarchy of a computer system is as follows:

Registers – Cache – Main memory – External memory

Most programmers consider registers as places of action — the only components that perform instructions.4 However, they are also storage5 — the fastest storage available. It is beneficial to structure computations so that values stay in registers as long as they are needed. Caches are somewhat slower than registers, but faster than main memory. Therefore, values that will be needed in subsequent computations but that cannot be kept in registers should be stored in a cache, since access to these values is faster than if they are in main memory. If a value cannot be kept in the cache, it moves further down in the memory hierarchy, namely into main memory. The last element in this progression is external memory, usually magnetic disks, which are much slower than all the other storage structures.
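As a small illustration of keeping a value in a register for as long as it is needed, consider summing an array (a sketch; the array and its length are placeholders, and an optimizing compiler may of course transform either version). Accumulating into a local scalar lets the compiler hold the running sum in a register, whereas accumulating directly into a memory location can force a load and a store on every iteration.

#include <stddef.h>

/* Accumulate into a local scalar: the running sum can stay in a register for
   the entire loop, so memory is touched only once per array element. */
double sum_register_friendly(const double *a, size_t n)
{
    double s = 0.0;                      /* likely kept in a register */
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Accumulate into a memory location: compiled naively, *out is read and
   written back on every iteration -- memory traffic the algorithm itself
   never mentions. */
void sum_through_memory(const double *a, size_t n, double *out)
{
    *out = 0.0;
    for (size_t i = 0; i < n; i++)
        *out += a[i];
}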

It is crucial that programmers realize how much data transfer occurs implicitly in this computation model. In contrast, algorithms are typically not concerned with the effort required to retrieve or store a data item. Thus, while an algorithm may simply stipulate that a data item occurs in an operation, a program will have to retrieve this data item from whatever storage it resides on. This may involve several steps, all of which are implicit, that is, the programmer is not aware of them. Moreover, while the algorithm specifies one data item, many types of implicit retrieval do not read or write a single item, but an entire group in which that item resides (regardless of whether those other items of that group are used or not). Specifically, data are read from disks in blocks and are written to caches in (cache) lines.
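The group effect is easy to see in code. In the sketch below (assuming, purely for illustration, a cache line of 64 bytes, i.e., 8 doubles), the strided loop touches only one element per cache line, yet each access still drags the entire line into the cache; it therefore moves roughly as much data from main memory as the sequential loop while performing only one-eighth of the useful work.

#include <stddef.h>

#define LINE_BYTES 64                                   /* assumed cache-line size */
#define DOUBLES_PER_LINE (LINE_BYTES / sizeof(double))  /* 8 on most machines */

/* Touches every element: one cache-line transfer serves 8 useful accesses. */
double sum_all(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Touches one element per line: every access still costs a full line transfer,
   so almost all of the data moved into the cache is never used. */
double sum_one_per_line(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE)
        s += a[i];
    return s;
}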

While the technology of storage devices is a moving target, it is nevertheless very useful to have a rough idea of the access characteristics of these components. For our purposes, two are paramount: the size of the memory and the time required to access a data item. While one may quibble about specific numbers, it is the relationship between the numbers that is of interest here.

3 In VMM the space required by a program, the logical address space, is divided into pages. Each data item has a logical address. When an item is used in the program, VMM determines whether it resides physically in main memory. If it does, execution proceeds; otherwise, its logical address is used to find the page containing the item, which is then located on, and read from, disk. Since VMM allocates the program a fixed amount of main memory, the so-called active memory set, consisting of a fixed number of pages, reading a page ordinarily implies that another page in the active memory set must be displaced (unless the active memory set is not yet full, something that would usually occur only at the beginning of execution). Which page is displaced is determined by the replacement strategy, usually a variant of LRU (least recently used: the page that has not been used for the longest time is displaced). If the displaced page is dirty (has been modified or written), it must be written back to disk.

4 This means that actions (operations) can only occur if a data item resides in a register.

5 If registers did not play the role of storage, computers would need far fewer of them.


Let us first look at size, in words.6 The size (which is the number) of registers is on the order of 10, the size of caches is on the order of 10^3, the size of main memory is on the order of 10^8 or 10^9, and the size of magnetic disks is well in excess of 10^10. The gap between cache and registers is about two orders of magnitude, but that between main memory and cache is five or six, and that between main memory and disk is perhaps three.

Related to the sizes of the components is the unit of access. The basic unit of data that can be transferred from disk to main memory is a block or page, whose size is on the order of 10^3; the basic unit of data from main memory to cache is a cache line, of a size on the order of 10^2; the basic unit of data from cache to register is a single word. To illustrate this, suppose we want to carry out an operation involving a specific data element x, and suppose it is determined that this data element resides on magnetic disk. The following steps must typically be carried out: Locate the block that contains x and transfer it into main memory, likely displacing another data block there. Then determine the cache line in main memory that now contains x and transfer that line to the cache, likely displacing another data line there. Finally, locate the data item x and move it to the appropriate register. In all but the last transfer, a significant amount of data must be manipulated that has nothing to do with x.

Let us now consider the time required for each of these transfers. Again, the important factor is more the relationship between the numbers than their absolute values. Roughly speaking, transferring a word from cache to register takes on the order of 1 nsec, and transferring a cache line from main memory to cache may take on the order of 10 nsec, but transferring a block or page from disk to main memory takes on the order of 10 msec, or 10,000,000 nsec.7 Here is the root of all evil: It takes about six orders of magnitude longer to access a data item if it resides on disk than if it resides in the cache.

We can summarize the situation as follows:

                register    cache       main memory       external memory
size (words)    ~10         ~10^3       ~10^8 or 10^9     ~10^11
access time     1 ns        3 ns        10 ns             10,000,000 ns

6 While historically there have been changes in the word length (it was 16 bits 20 years ago, then it became 32 bits, and it is now moving to 64 bits), we will ignore them in this comparison, since a change in word length will affect all components equally.

7 The process of reading a block off a magnetic disk is somewhat complicated. There is the seek time of the read/write head — finding the track in which the requested block is located and moving the head to that track. This cannot be speeded up arbitrarily because of vibrations. Then the head must read almost one entire track (in the worst case) to determine the beginning of the block on the track; in the next rotation, that block is read. Thus, in the worst case, two rotations are required to read a block once the head has been moved to the correct track.


This gap of six orders of magnitude is astounding. It is also unlikely to be reduced; if anything, it will grow. The reason is quite simple: While registers, caches, and main memory are electronic (solid-state) devices, disk drives are mechanical. Since electronic devices can be speeded up by making them smaller, solid-state storage will continue to reduce its access times.

Magnetic disk drives cannot be speeded up significantly;8 in fact, their access times have shrunk by less than one order of magnitude over the past 20 years. Consequently, the outlook is that the gap between the access time of disk and that of main memory and cache will get wider, from the current six orders of magnitude to seven or more.9

This has dramatic consequences for access times, depending on where a data item resides. The location of a data item is the overriding factor when determining the time required to carry out an operation. We may reasonably assume that an operation takes on the order of 10 nsec. Thus, if the operands of the operation reside in the cache or in main memory, the time to retrieve the operands and then to carry out the operation is still on the order of 10 nsec. If the operands reside on disk, the overall time is dominated by the time required to retrieve them; instead of 10 nsec, it now is a million times more.
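A back-of-the-envelope calculation makes the point starkly; the figures below are the illustrative values used above, not measurements.

#include <stdio.h>

int main(void)
{
    const double op_ns   = 10.0;      /* one operation, operands already in cache */
    const double page_ns = 1.0e7;     /* one page retrieval from disk: ~10 msec   */
    const double n_ops   = 1.0e6;     /* one million operations                   */

    /* Operands in cache or main memory: the operations themselves dominate. */
    printf("in-core:          %.0f msec\n", n_ops * op_ns / 1.0e6);             /* ~10 msec */

    /* Worst case, one page retrieval per operation: the retrievals dominate so
       completely that the operation time disappears in the noise. */
    printf("one fault per op: %.0f sec\n", n_ops * (page_ns + op_ns) / 1.0e9);  /* ~10,000 sec */

    return 0;
}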

Here is where it is vitally important to be able to control data movement.

While it may be a very painful experience to write (and debug!) an out-of-core program, it allows one to exercise complete control over the determination of which block of data to retrieve from disk and at what time. In contrast, VMM hides these highly unpleasant details from the programmer, but at the cost of taking away the ability to determine, or even to know, which data are transferred when.

When one analyzes the problem more carefully, it turns out that the crucial question is not so much what is transferred, but what is being replaced. We noted that, absent unlimited main memory, when we bring in a new block, page, or line, we must find space for that unit of memory.

This typically involves displacing values that occupy the space we want to use. This is where things can go very wrong. If we displace values that are needed later, these values will have to be brought back in, requiring the displacement of other values. This is particularly dire when we need only one (or a few) values of a block or page, but given the process of transferring blocks, the entire page must be brought in. This was precisely the situation of the example in Section 4.2 if the memory-mapping function was column-major.

8 There are just two ways of improving the access time of magnetic disk drives: increasing the rotation speed of the spinning platters and refining the granularity of the magnetic recordings on the platters. The rotation speed cannot be increased much further, since eventually the centripetal forces will tear the platter apart. Making the granularity of the magnetic recordings finer implies getting the read/write head closer to the platter surface, which is also not feasible because the distances are already very small. Essentially, the technology plateaued about two decades ago, as far as access speed is concerned, and no further improvements are likely, owing to the mechanical limitations of the device.

9 A direct consequence is that more programs will change from being compute-bound to being input/output (I/O)-bound.


We managed to use one array element out of each page of size 2048, requiring this page to be brought back 2047 times. Since we did this for every page, the resulting performance became truly execrable.

The situation changes completely if the memory-mapping function is row-major. In that case we use all the elements of a page before the page is displaced, so it never has to be brought back again.

Practically, this is the worst-case scenario, as far as performance is concerned. Since in one situation (row-major memory mapping) all elements of a page are used and no page must be brought back again, while in the other situation (column-major mapping) every page must be brought in as many times as that page has elements, it follows that the performance hit here is a factor of 2048: column-major requires 2048 times more page retrievals than row-major. Since the retrieval of a page takes six orders of magnitude longer than an operation involving two operands, we can ignore the time taken by the operations; the retrieval times dominate by far. For the following program fragment,

for j:=1 to n do
   for i:=1 to n do
      C[i,j] := A[i,j] + B[i,j],

under the same assumptions, the situation is exactly reversed: Row-major memory mapping is horribly inefficient, while column-major memory mapping is optimal.
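A rough model of the number of page transfers makes the factor explicit. The sketch below assumes an n-by-n array, a page holding p consecutive elements in mapping order (p = 2048 in the example above), and an array so much larger than the program's memory allotment that no page survives in memory between two uses.

/* Approximate page reads for one full traversal of an n-by-n array, assuming
   each page holds p consecutive elements in mapping order and the array far
   exceeds the memory available to the program. */
long page_reads_matching_order(long n, long p)
{
    /* Consecutive accesses stay within the current page: each page is read once. */
    return (n * n + p - 1) / p;
}

long page_reads_opposing_order(long n, long p)
{
    /* Consecutive accesses land on different pages: in the worst case every
       single access causes a page to be (re)read, a factor of p more reads. */
    (void)p;
    return n * n;
}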

We pointed out earlier that different programming languages follow different conventions about memory-mapping functions; specifically, Fortran compilers invariably use column-major, while all other languages tend to use row-major. Thus, for a non-Fortran language, one should use the original code:

for i:=1 to n do
   for j:=1 to n do
      C[i,j] := A[i,j] + B[i,j],

while in Fortran, the code with the i-loop and the j-loop interchanged should be used:

for j:=1 to n do
   for i:=1 to n do
      C[i,j] := A[i,j] + B[i,j].

Of course, it should be clear that both code fragments produce exactly the same result; it is their performance that will differ drastically if VMM is involved.
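Since C lays out two-dimensional arrays in row-major order, the same contrast can be reproduced directly in C; the dimension below is an arbitrary illustrative value, and on a machine with ample main memory the effect appears in the cache and the TLB rather than in the VMM, but the principle is the same.

#include <stddef.h>

#define N 8192     /* illustrative: three 8192 x 8192 matrices of doubles, 1.5 GB */

/* Matches C's row-major layout: the innermost loop walks along consecutive
   addresses, so every page (and cache line) is used completely before it is
   displaced. */
void add_row_order(double (*A)[N], double (*B)[N], double (*C)[N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

/* Works against the layout: the innermost loop walks down a column, so
   consecutive accesses are N*sizeof(double) bytes apart and may each touch a
   different page or cache line. */
void add_column_order(double (*A)[N], double (*B)[N], double (*C)[N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            C[i][j] = A[i][j] + B[i][j];
}

In Fortran, which stores arrays column by column, the roles of the two versions are exactly interchanged.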


Since the size of a block or page is usually on the order of 1000, it follows that the greatest gain we can obtain (if we are lucky) is also on the order of 1000.10 In this argument we assume synchronous I/O; in other words, the program must wait for the page to be retrieved before it can continue executing. This is a fairly simple computation model; more sophisticated models may attempt to predict which page is needed in the future and initiate its retrieval while other computations continue. This requires that I/O be done in parallel with computations. This type of speculative page retrieval is complicated, and it is made difficult by the fact that about 1 million operations can be carried out in the time it takes to retrieve one page from disk.

To do this automatically (which is what a sophisticated VMM would have to do) is exceedingly challenging.11 This is significantly aggravated by the fact that VMM is part of the operating system and as such knows nothing about the program being executed. It is much more likely that a highly competent out-of-core programmer is capable of anticipating, sufficiently in advance of the computations, which page to retrieve next. (Sufficiently here means about a million operations earlier.) Unfortunately, this skill is very rare. Most programmers prefer to rely on the VMM, with occasionally disastrous (performance) results. An intermediate approach is to turn the task of scheduling a program’s I/O over to the compiler. We will return to this idea in Section 5.4.
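To illustrate what anticipating a page sufficiently in advance can look like in explicit out-of-core code, here is a minimal double-buffering sketch using POSIX threads (compile with -pthread); the block size and the process() routine are hypothetical placeholders. While one block is being processed, a helper thread is already reading the next one, so the disk transfer overlaps with the roughly one million operations the computation performs in the meantime.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_WORDS 1024                     /* assumed block size */

extern void process(double *buf, size_t n);  /* hypothetical per-block computation */

struct prefetch { FILE *f; double *buf; size_t n; };

/* Helper thread: read the next block while the main thread computes. */
static void *read_next_block(void *arg)
{
    struct prefetch *p = arg;
    p->n = fread(p->buf, sizeof(double), BLOCK_WORDS, p->f);
    return NULL;
}

void out_of_core_overlapped(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(EXIT_FAILURE); }

    double buf[2][BLOCK_WORDS];
    size_t n = fread(buf[0], sizeof(double), BLOCK_WORDS, f);

    for (int cur = 0; n > 0; cur ^= 1) {
        /* Start fetching the next block into the other buffer ... */
        struct prefetch next = { f, buf[cur ^ 1], 0 };
        pthread_t reader;
        pthread_create(&reader, NULL, read_next_block, &next);

        /* ... while the current block is being processed. */
        process(buf[cur], n);

        pthread_join(reader, NULL);          /* wait for the prefetch to complete */
        n = next.n;
    }
    fclose(f);
}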