4.2 Performance Discrepancies

Discrepancies between the performance suggested by the analysis of the algorithm and the observed performance of the code obtained by translating the algorithm faithfully are much more common than outright wrong results.

Occasionally, the reasons are similar. For example, choosing an inappropriate way of passing parameters may seriously affect the performance of the resulting program. A particularly egregious instance is provided by binary search, where the wrong parameter-passing mechanism can slow performance exponentially (a short sketch illustrating this follows after the footnotes below). Much more common causes are the memory hierarchy of modern computing architectures and the support systems (compilers, operating systems, run-time execution systems) that heavily influence how efficiently a program executes.

2 Sorting methods that guarantee that they will not interchange the order of identical elements are called stable sorting algorithms. The fact that a name was coined to differentiate them from those that might swap such elements indicates that this aspect is more important in applications than one might suspect if one focuses only on the sorting method itself.

3 One of the most insidious problems of parallel and distributed software is so-called race conditions, whereby two processes compete for some resource (e.g., access to memory, communication links, input/output (I/O) controllers). Sometimes one wins, and at other times the other wins, even though the starting configurations of the two instances are seemingly identical.
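To make the binary-search remark above concrete, here is a minimal C++ sketch of my own (not code from the text). The two versions differ only in how the array is passed; the copy made at every recursive call in the by-value version turns a Θ(log n) search into a Θ(n log n) one.

   #include <iostream>
   #include <vector>

   // Passed by value: every recursive call copies the entire vector,
   // so the total work grows to Θ(n log n) instead of Θ(log n).
   bool contains_by_value(std::vector<int> a, int lo, int hi, int key) {
       if (lo > hi) return false;
       int mid = lo + (hi - lo) / 2;
       if (a[mid] == key) return true;
       return a[mid] < key ? contains_by_value(a, mid + 1, hi, key)
                           : contains_by_value(a, lo, mid - 1, key);
   }

   // Passed by const reference: no copying, so the cost stays logarithmic.
   bool contains_by_ref(const std::vector<int>& a, int lo, int hi, int key) {
       if (lo > hi) return false;
       int mid = lo + (hi - lo) / 2;
       if (a[mid] == key) return true;
       return a[mid] < key ? contains_by_ref(a, mid + 1, hi, key)
                           : contains_by_ref(a, lo, mid - 1, key);
   }

   int main() {
       std::vector<int> a;
       for (int i = 0; i < 1000; ++i) a.push_back(2 * i);        // sorted input
       int hi = static_cast<int>(a.size()) - 1;
       std::cout << contains_by_ref(a, 0, hi, 998) << '\n';      // prints 1
       std::cout << contains_by_value(a, 0, hi, 999) << '\n';    // prints 0
   }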

We have already hinted at the evil influence of virtual memory management (VMM); similar but less dramatic observations hold for caches. We have more to say about this in Chapter 5. The major problem is that VMM interacts fairly subtly with other aspects of a program. For example, consider the problem of adding two matrices:

C := A + B,

where A, B, and C are matrices of type [1:n,1:n], for n a number large enough that the matrix addition cannot be carried out in main memory (in-core). For the algorithm, this formulation would be fully sufficient; for a program, we need to specify a good deal more. A typical program fragment might look like this:

for i := 1 to n do
   for j := 1 to n do
      C[i,j] := A[i,j] + B[i,j]

Since main memory is one-dimensional, the two-dimensional arrays A, B, and C must be mapped into main memory. There are two standard mapping functions for this purpose: row-major and column-major, as we explained in Section 2.2. Since in this scenario we do not have enough main memory to accommodate the three matrices, the mapping function will map each array into the logical memory space, which in turn is divided into blocks. It is these blocks (pages) that are fetched from disk and stored to disk by VMM.
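As a concrete illustration (a sketch of my own, using 0-based indices rather than the [1:n,1:n] convention above), the two mapping functions and the page a given element falls into can be written as follows in C++:

   #include <cstddef>
   #include <iostream>

   // Row-major: consecutive elements of a row are adjacent in memory.
   std::size_t addr_row_major(std::size_t i, std::size_t j, std::size_t n) {
       return i * n + j;
   }

   // Column-major: consecutive elements of a column are adjacent in memory.
   std::size_t addr_col_major(std::size_t i, std::size_t j, std::size_t n) {
       return j * n + i;
   }

   // The page holding an element is its linear address divided by the page size.
   std::size_t page_of(std::size_t addr, std::size_t page_size) {
       return addr / page_size;
   }

   int main() {
       const std::size_t n = 8192, page = 2048;      // n = 2^13, 2^11 words per page
       // Walking along row 5 stays on the same page under row-major mapping,
       // but jumps to a different page at every step under column-major mapping.
       std::cout << page_of(addr_row_major(5, 100, n), page) << ' '
                 << page_of(addr_row_major(5, 101, n), page) << '\n';   // 20 20
       std::cout << page_of(addr_col_major(5, 100, n), page) << ' '
                 << page_of(addr_col_major(5, 101, n), page) << '\n';   // 400 404
   }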

To make this more concrete, let us assume that n = 2^13, that the size of a page is 2^11 (words), and that our available main memory permits us to have 2^10 pages in main memory. If the memory-mapping function is row-major, each row consists of four pages; if it is column-major, each column consists of four pages. Since the total amount of space required for the three matrices is about 3 · 2^26 words, or 3 · 2^15 pages, but only 2^10 pages are available, VMM will swap pages in and out of main memory as dictated by the code above.
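Spelling out the arithmetic behind these figures (my own check, using nothing beyond the parameters just stated):

   #include <iostream>

   int main() {
       const long long n         = 1LL << 13;   // 8,192 rows and columns
       const long long page_size = 1LL << 11;   // 2,048 words per page
       const long long frames    = 1LL << 10;   // 1,024 pages fit in main memory

       const long long pages_per_row    = n / page_size;         // 4 pages per row (or column)
       const long long pages_per_matrix = n * n / page_size;     // 2^15 = 32,768 pages
       const long long pages_total      = 3 * pages_per_matrix;  // 98,304 pages for A, B, C
       const long long words_total      = 3 * n * n;             // 3 * 2^26, about 201 million words

       std::cout << pages_per_row << ' ' << pages_per_matrix << ' '
                 << pages_total << ' ' << words_total << ' ' << frames << '\n';
   }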

Here is the first problem: Most programmers are not aware of the memory-mapping function used.4 Therefore, they are unable to determine how many pages this very simple program fragment will swap in and out. The second problem is that most programmers are not particularly keen on understanding VMM.5 For our explanations, we assume that the replacement policy is pure LRU (least recently used); this means that the page that has been unused for the longest time will be the next to be swapped out if the need arises to bring in a page when all available memory space is occupied. Most common operating systems that support VMM implement some version of LRU.6,7

4 A rule of thumb is the following: A language directly based on Fortran uses column-major; all other languages use row-major memory mapping. However, it is always a good idea to make sure of this. Many programming languages do not specify which mapping function is to be used by the compiler, so this becomes a property of the compiler (thereby driving yet another nail into the coffin of portability).

5 Indeed, for many programmers, the most important aspect of VMM is that it permits them to ignore I/O problems.

An even greater problem is that most programmers believe all this information is of no relevance to writing good code. They would be correct if the three matrices fit into main memory.8 However, they do not, and the difference between the numbers of pages swapped for one and for the other mapping function is staggering. Specifically, if the memory-mapping function is row-major, 3 · 2^15 pages are swapped in and out, but if it is column-major, it is 3 · 2^26 pages. (With row-major mapping, the loop sweeps through each matrix sequentially, so every page is brought in exactly once; with column-major mapping, consecutive iterations of the inner loop touch different pages, and with only 2^10 page frames LRU evicts each page long before it is needed again, so essentially every element access causes a page fault.) In other words, one version swaps fewer than 100,000 pages, and the other swaps over 200 million. Thus, it is safe to assume that one version is about 2,000 times slower than the other. To be even more drastic, if the faster version takes 15 minutes to execute,9 the slower would take about 3 weeks. Yet, from an algorithmic point of view, the two versions have identical performance.
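These counts can be checked with a small simulation. The following C++ sketch (my own, with deliberately scaled-down, hypothetical parameters: n = 128, 32-word pages, 16 page frames) replays the element accesses of the addition loop against a pure-LRU page table and counts the faults. With row-major mapping every page faults exactly once (3 · n²/32 = 1,536 faults); with column-major mapping essentially every access faults (3 · n² = 49,152 faults), the same phenomenon as in the full-size example.

   #include <cstddef>
   #include <iostream>
   #include <vector>

   // Pure-LRU page-fault simulator for the loop  C[i,j] := A[i,j] + B[i,j].
   // Scaled-down parameters; the book's example uses n = 2^13, 2^11-word
   // pages, and 2^10 page frames.
   const std::size_t N         = 128;   // matrix dimension
   const std::size_t PAGE_SIZE = 32;    // words per page
   const std::size_t FRAMES    = 16;    // pages that fit in "main memory"

   const std::size_t PAGES_PER_MATRIX = N * N / PAGE_SIZE;
   const std::size_t TOTAL_PAGES      = 3 * PAGES_PER_MATRIX;   // A, B, C

   struct Lru {
       std::vector<long long> last_used;
       std::vector<bool>      resident;
       std::size_t            in_core = 0;
       long long              tick    = 0;
       long long              faults  = 0;

       Lru() : last_used(TOTAL_PAGES, -1), resident(TOTAL_PAGES, false) {}

       void access(std::size_t page) {
           ++tick;
           if (!resident[page]) {                    // page fault
               ++faults;
               if (in_core == FRAMES) {              // evict the least recently used page
                   std::size_t victim = TOTAL_PAGES;
                   for (std::size_t p = 0; p < TOTAL_PAGES; ++p)
                       if (resident[p] && (victim == TOTAL_PAGES ||
                                           last_used[p] < last_used[victim]))
                           victim = p;
                   resident[victim] = false;
                   --in_core;
               }
               resident[page] = true;
               ++in_core;
           }
           last_used[page] = tick;
       }
   };

   long long count_faults(bool row_major) {
       Lru lru;
       for (std::size_t i = 0; i < N; ++i)
           for (std::size_t j = 0; j < N; ++j) {
               // page offset of element (i,j) under the chosen mapping function
               std::size_t offset = (row_major ? i * N + j : j * N + i) / PAGE_SIZE;
               lru.access(0 * PAGES_PER_MATRIX + offset);   // read A[i,j]
               lru.access(1 * PAGES_PER_MATRIX + offset);   // read B[i,j]
               lru.access(2 * PAGES_PER_MATRIX + offset);   // write C[i,j]
           }
       return lru.faults;
   }

   int main() {
       std::cout << "row-major faults:    " << count_faults(true)  << '\n';   // 1536
       std::cout << "column-major faults: " << count_faults(false) << '\n';   // 49152
   }

The ratio between the two counts is exactly the page size (32 here, 2^11 in the full-size example), which is where the factor of about 2,000 above comes from.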

Note that for an in-core version, nothing ugly would happen, regardless of which mapping strategy is used. It is only once VMM comes into play that all hell breaks loose.

It is instructive to determine an approximate running time for the in-core version (which is the algorithm, for all practical purposes). There are 2^26 elements in each matrix; thus, the grand total of required memory for all three matrices is a bit over 200 million words (800 Mbytes, assuming one word has four bytes). There are about 67 million additions and 67 million assignments. It is probably quite conservative to assume that each of the 67 million elements of the matrix C can be computed in 50 nsec on a reasonably modern (circa 2005) processor, assuming all elements reside in main memory.10 Consequently, an in-core version might take about 3,356,000,000 nsec, or not even four seconds, for the computation. In practice, it will take much longer, since presumably even in an in-core version, the matrices A and B initially reside on disk and must first be fetched; similarly, the resulting matrix C most likely must be written to disk. These operations take significantly more time. Retrieving from or writing to disk one page takes more than 10 msec (that is, 10,000,000 nsec); since there are about 100,000 such page operations, this will take about 17 minutes.

6 Not all operating systems support VMM; for example, Cray supercomputers have never provided VMM, for precisely the performance issues that we explain here.

7 Typically, LRU is not implemented exactly, since this would require a good deal of space to store the age of each page. Instead, variants are preferred that use less space to provide information that approximates the age of a page.

8 Because of the random access property of main memory, the performance of this code fragment would be independent of the memory-mapping function, provided all matrices are in main memory, that is, if the problem were in-core. Note that we are again ignoring complications caused by the use of caches.

9 This is actually pushing it. It typically takes more than 10 msec to retrieve a page, so in 15 minutes fewer than 90,000 pages can be retrieved, assuming that nothing else happens.

10 This is very conservative because we do not assume overlapping of computations or pipelining. Assume a clock cycle of 2 nsec (also conservative). Each of the two operands must be retrieved (we ignore the address calculation and concentrate on retrieving contiguous elements from main memory); we assume five clock cycles for each. The same holds for storing the result back. We will assume that the operation itself requires 10 clock cycles. The grand total comes to 25 clock cycles, or 50 nsec. Pipelining can result in a significant time reduction, since after an initial start-up period (priming the pipeline), it could churn out a result every clock cycle.
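A back-of-the-envelope check of these timings (my own arithmetic, using only the section's figures of 50 nsec per element and 10 msec per page transfer):

   #include <iostream>

   int main() {
       const double ns_per_element = 50.0;                // compute one C[i,j] (conservative)
       const double ms_per_page    = 10.0;                // one disk page transfer
       const double elements       = 1LL << 26;           // 2^26 elements per matrix
       const double pages          = 3.0 * (1LL << 15);   // 3 * 2^15 pages for A, B, C

       const double compute_s     = elements * ns_per_element * 1e-9;      // about 3.4 s
       const double io_rowmajor_s = pages * ms_per_page * 1e-3;            // about 983 s, ~16 min
       const double io_colmajor_s = 3.0 * elements * ms_per_page * 1e-3;   // about 2.0e6 s, ~3 weeks

       std::cout << compute_s << " s of computation (in-core)\n"
                 << io_rowmajor_s / 60.0 << " min of page I/O (row-major)\n"
                 << io_colmajor_s / 86400.0 << " days of page I/O (column-major)\n";
   }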

We now have four scenarios with three very different timings:

1. 3 sec (in-core; for the computation alone, assuming A and B are in main memory and C stays there)

2. 17 minutes (in-core; assuming A and B must be fetched from disk and C must be written to disk)

3. 17 minutes (out-of-core; assuming A and B must be fetched from disk and C must be written to disk and assuming the memory-mapping function is row-major)

4. 3 weeks (out-of-core; assuming A and B must be fetched from disk and C must be written to disk and assuming the memory-mapping function is column-major)

Note that the fastest timing and the slowest timing are nearly six orders of magnitude apart. Algorithmic time complexity would suggest the fastest timing (computation of 67 million elements, each taking about 50 nsec), while actual observation might provide us with the slowest timing, namely 3 weeks.

In practice, no sane programmer would let a program execute for weeks if the running time predicted on the basis of the algorithm was a few seconds.11 Instead, the programmer would assume after a while that something went wrong and abort execution. Thus, there might even be the possible interpretation that the code is wrong, so this issue could be listed in Section 4.1 as well.

We will see in the next chapter that some techniques permit a programmer to avoid such problems. It is fairly easy to change the code in such a way that the worst-case behavior above (out-of-core; A and B must be fetched from disk and C must be written to disk, and the memory-mapping function is column-major) can be avoided. More intriguingly yet, such code changes could even be carried out automatically, using optimization techniques available in any good optimizing compiler.
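For instance, one standard transformation (a sketch of my own, not necessarily the exact change developed in Chapter 5) is loop interchange: if the arrays are mapped column-major, traversing them column by column makes the memory accesses sequential again, so each page is touched only once.

   #include <cstddef>
   #include <vector>

   // Hypothetical column-major n x n matrix stored in a flat vector:
   // element (i,j) lives at index j*n + i (0-based).
   struct Matrix {
       std::size_t n;
       std::vector<double> data;
       explicit Matrix(std::size_t n) : n(n), data(n * n) {}
       double&       operator()(std::size_t i, std::size_t j)       { return data[j * n + i]; }
       const double& operator()(std::size_t i, std::size_t j) const { return data[j * n + i]; }
   };

   // Interchanged loops: j (the column) is now the outer loop, so consecutive
   // iterations touch consecutive memory locations of the column-major arrays.
   void add(const Matrix& a, const Matrix& b, Matrix& c) {
       for (std::size_t j = 0; j < c.n; ++j)
           for (std::size_t i = 0; i < c.n; ++i)
               c(i, j) = a(i, j) + b(i, j);
   }

With row-major arrays, the original order (i outer, j inner) is already the page-friendly one; which loop order is right therefore depends on the mapping function the compiler uses.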

The fundamental issue in this situation is the truly incomprehensible vastness of the gap between the time needed to access an item in main memory and the time needed to access it on disk. It takes a few nanoseconds to access an item in main memory, but it takes more than 10 msec to access this item on disk.12 This gap is almost seven orders of magnitude. It clearly places a premium on keeping things local.

11 It is not even clear whether the system would stay up that long.

12 I know of nothing in the physical world that comes even close to such a discrepancy. To wit, consider transportation: The slowest common way to get from A to B is probably walking, the fastest using a jet airplane. Yet the jet is only about 200 times faster than walking, or less than three orders of magnitude.