

1.6 I/O Complexity

I/O complexity is a nonstandard complexity measure of algorithms, but it is of great significance for our purposes. Some of the justification for, and motivation behind, introducing this complexity measure will be provided in Part 2.

The I/O complexity of an algorithm is the amount of data transferred from one type of memory to another. We are primarily interested in transfers between disk and main memory; other types of transfer involve main memory and cache memory. In the case of cache memory the transfer is usually not under the control of the programmer. A similar situation occurs with disks when virtual memory management (VMM) is employed. In all these cases data are transferred in blocks (lines or pages). These are larger units of memory, providing space for a large number of values, typically on the order of hundreds or thousands. Not all programming environments provide VMM (for example, no Cray supercomputer has VMM); in the absence of VMM, programmers must design out-of-core programs wherein the transfer of blocks between disk and main memory is directly controlled by them. In contrast, an in-core program assumes that the input data are initially transferred into main memory, all computations reference data in main memory, and at the very end of the computations, the results are transferred to disk.

It should be clear that an in-core program reflects the uniformity of memory access that is almost always assumed in the analysis of algorithms.

Let us look at one illustration of the concept of an out-of-core algorithm.

Consider a two-dimensional (2D) finite difference method with a stencil of the form

    s[i,j] = s[i-2,j] +
             s[i-1,j-1] + s[i-1,j] + s[i-1,j+1] +
             s[i,j-2] + s[i,j-1] + s[i,j] + s[i,j+1] + s[i,j+2] +
             s[i+1,j-1] + s[i+1,j] + s[i+1,j+1] +
             s[i+2,j],

where we omitted the factors (weights) of each of the 13 terms. Suppose the matrix M to which we want to apply this stencil is of size [1:n,1:n], for n = 2^18. Consequently, we must compute another matrix M', whose [i,j] element is exactly the stencil applied to the matrix M at the [i,j] position. (For a somewhat different approach, see Exercise 11, page 35.) Now comes the problem: we have only space of size 2^20 available for this operation. Because of the size of the two matrices (each of which contains 2^36 elements), we can only bring small portions of M and M' into main memory; the rest of the matrices must remain on disk. We may use VMM, or we can use out-of-core programming, requiring us to design an algorithm that takes into consideration not only the computation, but also the movement of blocks between disk and main memory.
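To get a feel for these magnitudes, here is a quick back-of-the-envelope check (a minimal Python sketch; the variable names are ours):

    n = 2 ** 18                     # matrix dimension from the text
    matrix_elements = n * n         # 2**36 elements in each of M and M'
    memory_elements = 2 ** 20       # main memory available for the operation

    print(matrix_elements)                     # 68719476736
    print(memory_elements / matrix_elements)   # ~1.5e-05: a tiny fraction fits
    print(memory_elements // n)                # 4: whole rows that fit at once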

It is clear that we must have parts of M and M' in main memory. The question is which parts and how much of each matrix. Let us consider several possibilities:

1.6.1 Scenario 1

Assume that one block consists of an entire row of the matrices. This means each block is of size 2^18, so we have room for only four rows. One of these rows must be the ith row of M'; the other three rows can be from M. This presents a problem, since the computation of the [i,j] element of M' requires five rows of M, namely the rows with numbers i − 2, i − 1, i, i + 1, and i + 2. Here is where the I/O complexity becomes interesting. It measures the data transfers between disk and main memory, so in this case it should tell us how many blocks of size 2^18 will have to be transferred. Let us first take the rather naïve approach formulated in the following code fragment:


    for i := 1 to n do
        for j := 1 to n do
            M'[i,j] := M[i-2,j] +
                       M[i-1,j-1] + M[i-1,j] + M[i-1,j+1] +
                       M[i,j-2] + M[i,j-1] + M[i,j] + M[i,j+1] + M[i,j+2] +
                       M[i+1,j-1] + M[i+1,j] + M[i+1,j+1] +
                       M[i+2,j]

This turns out to have a truly horrific I/O complexity. To see why, let us analyze what occurs when M'[i,j] is computed. Since there is space for just four blocks, each containing one matrix row, we will first install in main memory the rows i − 2, i − 1, i, and i + 1 of M and compute M[i − 2,j] + M[i − 1,j − 1] + M[i − 1,j] + M[i − 1,j + 1] + M[i,j − 2] + M[i,j − 1] + M[i,j] + M[i,j + 1] + M[i,j + 2] + M[i + 1,j − 1] + M[i + 1,j] + M[i + 1,j + 1]. Then we replace one of these four rows with the M-row i + 2 to add to the sum the element M[i + 2,j]. Then we must displace another M-row to install the row i of M' so we may assign the complete sum to M'[i,j]. To be more specific, assume that we use the least recently used (LRU) replacement strategy that most virtual memory management systems employ. (This means the page or block that has not been used for the longest time is replaced by the new page to be installed.) Thus, in our example, we first replace the M-row i − 2 and then the M-row i − 1. We now have in memory the M-rows i, i + 1, and i + 2 and the M'-row i. To compute the next element, namely M'[i,j + 1], we again need the M-rows i − 2, i − 1, i, i + 1, and i + 2.

Under the LRU policy, since M-rows i − 2 and i − 1 are not present, they must be installed, replacing rows i and i + 1. Then the just-removed M-row i must be reinstalled, replacing M'-row i; subsequently M-row i + 1 must be reinstalled, replacing M-row i + 2. Now the just-removed M-row i + 2 is reinstalled, replacing M-row i − 2. Finally, M'-row i must be brought back, replacing M-row i − 1. It follows that each of the six rows involved in the computation (five M-rows and one M'-row) must be reinstalled when computing M'[i,j + 1] after having computed M'[i,j]. While the situation for the border elements (M'[i,j] for i = 1, 2, n − 1, n or j = 1, 2, n − 1, n) is slightly different, in general it follows that for each of the n^2 elements to be computed, six page transfers are required. Thus, the data movement is 3n times greater than the amount of data contained in the matrices.17 In particular, most of the n^2 elements of the matrix M are transferred 5n times; since n = 2^18, each of these M elements is transferred about 1.3 million times. This clearly validates our assertion about the ineffectiveness of this approach.

17 Each matrix consists of n pages. In total, 6n^2 pages are transferred. Since 6n^2/(2n) = 3n, the claim follows.
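This transfer count is easy to verify empirically. The following minimal simulation (Python; all names are ours, and a small n stands in for 2^18) replays the reference string of the naive inner loop with one page per matrix row and four frames under LRU, and reports six page transfers per element, as derived above:

    from collections import OrderedDict

    def lru_faults(refs, frames):
        """Count page faults of a reference string under LRU replacement."""
        mem = OrderedDict()                  # resident pages, oldest first
        faults = 0
        for page in refs:
            if page in mem:
                mem.move_to_end(page)        # refresh recency on a hit
            else:
                faults += 1
                if len(mem) == frames:
                    mem.popitem(last=False)  # evict the least recently used
                mem[page] = True
        return faults

    def row_refs(i, n):
        """Page references for computing M'[i,1..n]: reads of M rows
        i-2..i+2 in the order the stencil expression lists them, then
        the write to the M' row; one page holds one matrix row."""
        per_element = ([("M", i - 2)]
                       + [("M", i - 1)] * 3
                       + [("M", i)] * 5
                       + [("M", i + 1)] * 3
                       + [("M", i + 2)]
                       + [("M'", i)])
        return per_element * n

    n = 1000                                 # small stand-in for 2**18
    print(lru_faults(row_refs(500, n), frames=4) / n)   # 6.0 per element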

For the following, let us assume that we can specify explicitly which blocks we want to transfer. The above analysis implicitly assumed that the replacement operations are determined automatically. (After all, it is difficult to conceive of any programmer coming up with as hopelessly inefficient a strategy as the one we described, yet it was the direct consequence of seemingly rational decisions: LRU and a code fragment that looked entirely acceptable.) The following scheme allows us to compute the entire matrix M' (we assume that both M and M' are surrounded with 0s, so we do not get out-of-range problems). To compute M'[i,*]:

1. Fetch rows i − 2, i − 1, and i of M and compute in M'[i,*] the first three lines of the stencil.

2. Fetch rows i + 1 and i + 2 of M, replacing two existing rows of M, and compute the remaining two lines of the stencil.

3. Store M'[i,*] on disk.

Thus, for computing M'[i,*] we need to fetch five rows of M and store one row of M'. If we iterate this for every value of i, we retrieve 5n rows and store n rows. If we are a bit cleverer and recognize that we can reuse one of the old rows (specifically, in computing M'[i,*], the second fetch operation overwrites M[i − 2,*] and one other row, so the row that is still there is useful in the computation of M'[i + 1,*]), this reduces the block retrievals from 5n to 4n. Thus, even though M and M' together have only 2n rows, the I/O complexity is 5n; in other words, the data movement is 250% of the amount of data manipulated, a dramatic improvement over the previous result.
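For concreteness, here is a minimal Python sketch of this two-stage, row-based scheme (not the book's code: the "disk" is just a dict, fetch_row and at are helpers of ours, and a tiny n stands in for 2^18); it counts transfers explicitly:

    import numpy as np

    n = 64                                  # small stand-in for 2**18
    disk_M = {i: np.random.rand(n) for i in range(1, n + 1)}
    disk_Mp = {}                            # M' rows, written back to "disk"
    fetches = stores = 0

    def fetch_row(i):
        """Fetch one row (= one block) of M; rows outside [1, n] read as 0."""
        global fetches
        if 1 <= i <= n:
            fetches += 1
            return disk_M[i].copy()
        return np.zeros(n)

    def at(row, j):
        """Element access with the 0 border in the column direction."""
        return row[j] if 0 <= j < n else 0.0

    for i in range(1, n + 1):
        out = np.zeros(n)                   # M'[i,*] occupies the 4th row slot
        # Step 1: rows i-2, i-1, i give the first three lines of the stencil.
        r2, r1, r0 = fetch_row(i - 2), fetch_row(i - 1), fetch_row(i)
        for j in range(n):
            out[j] = (at(r2, j)
                      + at(r1, j - 1) + at(r1, j) + at(r1, j + 1)
                      + at(r0, j - 2) + at(r0, j - 1) + at(r0, j)
                      + at(r0, j + 1) + at(r0, j + 2))
        # Step 2: rows i+1, i+2 (conceptually replacing two resident rows)
        # give the last two lines of the stencil.
        s1, s2 = fetch_row(i + 1), fetch_row(i + 2)
        for j in range(n):
            out[j] += at(s1, j - 1) + at(s1, j) + at(s1, j + 1) + at(s2, j)
        # Step 3: store M'[i,*] on disk.
        disk_Mp[i] = out
        stores += 1

    print(fetches, stores)    # about 5n fetches and exactly n stores

The printed count is slightly below 5n only because border rows outside [1, n] are read as zeros without touching disk.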

1.6.2 Scenario 2

The problem in Scenario 1 was that we had to retrieve the rows corresponding to one stencil computation in two parts. Perhaps we can improve our performance if we devise a setup in which stencil computations need not be split. Assume that each block is now of size 2^16, so we can fit 16 blocks into our available main memory. This should allow us to compute an entire stencil in one part.

We assume that each row consists of four blocks (we will refer to quarters of rows to identify the four blocks). In this case, our algorithm proceeds as follows:

1. Compute the first quarter of M'[1,*].
   1.1 Fetch the first and second blocks of M[1,*], M[2,*], and M[3,*] and compute the entire stencil in the first quarter of M'[1,*].
   1.2 Store the first quarter of M'[1,*] on disk.
   1.3 Calculate the first two elements of the second quarter of M'[1,*] and store it on disk (eight resident blocks).

2. Compute the first quarter of M'[2,*].
   2.1 Fetch the first and second blocks of M[4,*] and compute the entire stencil in the first quarter of M'[2,*].
   2.2 Store the first quarter of M'[2,*] on disk.
   2.3 Calculate the first two elements of the second quarter of M'[2,*] and store it on disk (10 resident blocks).

3. Compute the first quarter of M'[3,*].
   3.1 Fetch the first and second blocks of M[5,*] and compute the entire stencil in the first quarter of M'[3,*].
   3.2 Store the first quarter of M'[3,*] on disk.
   3.3 Calculate the first two elements of the second quarter of M'[3,*] and store it on disk (12 resident blocks).

4. For i = 4 to n − 2, compute the first quarter of M'[i,*].
   4.1 Fetch the first and second blocks of row i + 2 of M, overwriting the respective blocks of row i − 3, and compute the entire stencil in the first quarter of M'[i,*].
   4.2 Store the first quarter of M'[i,*] on disk.
   4.3 Calculate the first two elements of the second quarter of M'[i,*] and store it on disk (12 resident blocks).

5. Compute the first quarter of M'[n − 1,*].
   5.1 Compute the entire stencil in the first quarter of M'[n − 1,*] and store it on disk.
   5.2 Calculate the first two elements of the second quarter of M'[n − 1,*] and store it on disk (10 resident blocks).

6. Compute the first quarter of M'[n,*].
   6.1 Compute the entire stencil in the first quarter of M'[n,*] and store it on disk.
   6.2 Calculate the first two elements of the second quarter of M'[n,*] and store it on disk (eight resident blocks).

The second quarter of each M'[i,*] is calculated in a similar manner, except that we go backwards, from i = n to i = 1, which saves us initially fetching a few blocks that are already in memory; of course, now we fetch the third quarter of each row, replacing all first quarters. Also, the second quarter of each M-row must be fetched from disk, because we will calculate all but the first two elements, which have already been computed in the previous round (first quarters). The third quarter round is analogous (precomputing again the first two elements of each fourth quarter). Finally, the fourth quarter is computed similarly, but there is no precomputing of elements of the next round.
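Before adding everything up, it may help to tally the first-quarter round (steps 1 through 6 above) explicitly. A minimal check in Python; the function name is ours:

    def first_quarter_round_io(n):
        """Tally block fetches/stores for the first-quarter round above."""
        fetches = 6              # step 1.1: blocks 1 and 2 of M rows 1, 2, 3
        stores = 2               # steps 1.2 and 1.3
        for i in range(2, n + 1):
            if i + 2 <= n:       # steps 2.1, 3.1, 4.1: blocks 1-2 of row i+2
                fetches += 2     # (rows n-1 and n fetch nothing: steps 5, 6)
            stores += 2          # steps x.2 and x.3 for every row i
        return fetches, stores

    n = 2 ** 18
    print(first_quarter_round_io(n), (2 * n, 2 * n))   # both (2n, 2n)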

To calculate the I/O complexity of this second algorithm, we first note that we have space for 16 blocks. Computing the first quarter of M'[i,*] requires us to have 10 blocks of M in memory, plus we need space (two blocks) for the first and second quarters of M'[i,*]. Therefore, the available memory is not exceeded. Adding up the fetches and stores in the first quarter round, we need a total of 2n block retrievals (portions of M) and 2n block stores (portions of M'). For the second quarter round, we need 3n retrievals (2n as in the first round, plus the retrieval of the second quarter of each M'[i,*], which had two elements precomputed in the first round) and 2n stores, and similarly for the third round. For the fourth quarter round, we need 3n fetches and only n stores, since there is no precomputation in this round. The grand total is therefore 11n block fetches and 7n block stores, for an I/O complexity of 18n blocks of size 2^16. Since each matrix now requires 4n blocks, the data movement with this more complicated scheme is somewhat smaller: 225% of the size of the two matrices, instead of the 250% of the much simpler scheme above.
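The grand total can be verified the same way; a quick computation under the same assumptions:

    n = 2 ** 18
    rounds = {"q1": (2 * n, 2 * n),   # (fetches, stores) per quarter round
              "q2": (3 * n, 2 * n),
              "q3": (3 * n, 2 * n),
              "q4": (3 * n, 1 * n)}
    fetches = sum(f for f, s in rounds.values())    # 11n
    stores = sum(s for f, s in rounds.values())     # 7n
    total_blocks = fetches + stores                 # 18n blocks of size 2**16
    matrices_blocks = 2 * 4 * n                     # M and M': 4n blocks each
    print(total_blocks / matrices_blocks)           # 2.25, i.e., 225%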

This somewhat disappointing result (we always seem to need significantly more transfers than the amount of memory the structures occupy) raises the question of whether this is the best we can do.18 Here is where the issue of lower bounds, to be taken up in Section 1.8, is of interest. We will return to this question there and derive a much better lower bound.

We will return to the I/O complexity of a task in more detail in Part 2. Here, we merely want to emphasize that there are important nontraditional measures of the performance of an algorithm, distinct from the usual time and space complexities. However, as we will see in Part 2, I/O performance is very intimately related to the time complexity of an algorithm when the memory space is not uniform.