ALGORITHMS FOR A LINEAR ARRAY - A Taste of Parallel Algorithms

A Taste of Parallel Algorithms

2.3. ALGORITHMS FOR A LINEAR ARRAY

Semigroup Computation. Let us consider first a special case of semigroup compu-tation, namely, that of maximum finding. Each of the p processors holds a value initially and our goal is for every processor to know the largest of these values. A local variable, max-thus-far, can be initialized to the processor’s own data value. In each step, a processor sends its max-thus-far value to its two neighbors. Each processor, on receiving values from its left and right neighbors, sets its max-thus-far value to the largest of the three values, i.e., max(left, own, right). Figure 2.6 depicts the execution of this algorithm for p = 9 processors.

The dotted lines in Fig. 2.6 show how the maximum value propagates from P₆ to all other processors. Had there been two maximum values, say in P and P₂ ₆, the propagation would have been faster. In the worst case, p – 1 communication steps (each involving sending a processor’s value to both neighbors), and the same number of three-way comparison steps, are needed. This is the best one can hope for, given that the diameter of a p-processor linear array is D = p – 1 (diameter-based lower bound).

Figure 2.6. Maximum-finding on a linear array of nine processors.

A TASTE OF PARALLEL ALGORITHMS 31 For a general semigroup computation, the processor at the left end of the array (the one with no left neighbor) becomes active and sends its data value to the right (initially, all processors are dormant or inactive). On receiving a value from its left neighbor, a processor becomes active, applies the semigroup operation ⊗ to the value received from the left and its own data value, sends the result to the right, and becomes inactive again. This wave of activity propagates to the right, until the rightmost processor obtains the desired result. The computation result is then propagated leftward to all processors. In all, 2 p – 2 communication steps are needed.

Parallel Prefix Computation. Let us assume that we want the ith prefix result to be obtained at the ith processor, 0 ≤ i ≤ p – 1. The general semigroup algorithm described in the preceding paragraph in fact performs a semigroup computation first and then does a broadcast of the final value to all processors. Thus, we already have an algorithm for parallel prefix computation that takes p – 1 communication/combining steps. A variant of the parallel prefix computation, in which Processor i ends up with the prefix result up to the ( i – 1)th value, is sometimes useful. This diminished prefix computation can be performed just as easily if each processor holds onto the value received from the left rather than the one it sends to the right. The diminished prefix sum results for the example of Fig. 2.7 would be 0, 5, 7, 15, 21, 24, 31, 40, 41.

Thus far, we have assumed that each processor holds a single data item. Extension of the semigroup and parallel prefix algorithms to the case where each processor initially holds several data items is straightforward. Figure 2.8 shows a parallel prefix sum computation with each processor initially holding two data items. The algorithm consists of each processor doing a prefix computation on its own data set of size n/p (this takes n/p – 1 combining steps), then doing a diminished parallel prefix computation on the linear array as above ( p – 1 communication/combining steps), and finally combining the local prefix result from this last computation with the locally computed prefixes (n /p combining steps). In all, 2n/p + p – 2 combining steps and p – 1 communication steps are required.

Packet Routing. To send a packet of information from Processor i to Processor j on a linear array, we simply attach a routing tag with the value j – i to it. The sign of a routing tag determines the direction in which it should move (+ = right, – = left) while its magnitude indicates the action to be performed (0 = remove the packet, nonzero = forward the packet).

With each forwarding, the magnitude of the routing tag is decremented by 1. Multiple packets

Figure 2.7. Computing prefix sums on a linear array of nine processors.

32 INTRODUCTION TO PARALLEL PROCESSING

Figure 2.8. Computing prefix sums on a linear array with two items per processor.

originating at different processors can flow rightward and leftward in lockstep, without ever interfering with each other.

Broadcasting. If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor. Any processor receiving an rbcast(a ) message, simply copies the value a and forwards the message to its right neighbor (if any). Similarly, receiving an lbcast( a ) message causes a to be copied locally and the message forwarded to the left neighbor. The worst-case number of communication steps for broadcasting is p – 1.

Sorting. We consider two versions of sorting on a linear array: with and without I/O.

Figure 2.9 depicts a linear-array sorting algorithm when p keys are input, one at a time, from the left end. Each processor, on receiving a key value from the left, compares the received value with the value stored in its local register (initially, all local registers hold the value +∞).

The smaller of the two values is kept in the local register and larger value is passed on to the right. Once all p inputs have been received, we must allow p – 1 additional communication cycles for the key values that are in transit to settle into their respective positions in the linear array. If the sorted list is to be output from the left, the output phase can start immediately after the last key value has been received. In this case, an array half the size of the input list would be adequate and we effectively have zero-time sorting, i.e., the total sorting time is equal to the I/O time.

If the key values are already in place, one per processor, then an algorithm known as odd–even transposition can be used for sorting. A total of p steps are required. In an odd-numbered step, odd-numbered processors compare values with their even-numbered right neighbors. The two processors exchange their values if they are out of order. Similarly, in an even-numbered step, even-numbered processors compare–exchange values with their right neighbors (see Fig. 2.10). In the worst case, the largest key value resides in Processor 0 and must move all the way to the other end of the array. This needs p – 1 right moves. One step must be added because no movement occurs in the first step. Of course one could use even–odd transposition, but this will not affect the worst-case time complexity of the algorithm for our nine-processor linear array.

Note that the odd–even transposition algorithm uses p processors to sort p keys in p compare–exchange steps. How good is this algorithm? Let us evaluate the odd–even

A TASTE OF PARALLEL ALGORITHMS

Figure 2.9. Sorting on a linear array with the keys input sequentially from the left.

Figure 2.10. Odd–even transposition sort on a linear array.

34 INTRODUCTION TO PARALLEL PROCESSING transposition algorithm with respect to the various measures introduced in Section 1.6. The best sequential sorting algorithms take on the order of p log p compare–exchange steps to sort a list of size p. Let us assume, for simplicity, that they take exactly p log₂p steps. Then, we have T (1) = W (1) = p log₂p, T (p) = p, W(p) = p²/2, S (p) = log p (Minsky’s conjecture?),₂ E(p) = (log

In most practical situations, the number n of keys to be sorted (the problem size) is greater

2 p)/p, R (p) = p /(2 log₂p ), U(p) = 1/2, and Q(p) = 2(log₂p)³/ p².

than the number p of processors (the machine size). The odd–even transposition sort algorithm with n/p keys per processor is as follows. First, each processor sorts its list of size n/p using any efficient sequential sorting algorithm. Let us say this takes ( n/p)log₂(n/p ) compare–exchange steps. Next, the odd–even transposition sort is performed as before, except that each compare–exchange step is replaced by a merge–split step in which the two communicating processors merge their sublists of size n/p into a single sorted list of size 2 n/p and then split the list down the middle, one processor keeping the smaller half and the other, the larger half. For example, if P₀ is holding (1, 3, 7, 8) and P₁has (2, 4, 5, 9), a merge–split step will turn the lists into (1, 2, 3, 4) and (5, 7, 8, 9), respectively. Because the sublists are sorted, the merge–split step requires n/p compare–exchange steps. Thus, the total time of the algorithm is (n/p)log₂(n /p ) + n . Note that the first term (local sorting) will be dominant if p < log₂n, while the second term (array merging) is dominant for p > log₂n . For p ≥ log₂n, the time complexity of the algorithm is linear in n; hence, the algorithm is more efficient than the one-key-per-processor version.

One final observation about sorting: Sorting is important in its own right, but occasion-ally it also helps us in data routing. Suppose data values being held by the p processors of a linear array are to be routed to other processors, such that the destination of each value is different from all others. This is known as a permutation routing problem. Because the p distinct destinations must be 0, 1, 2, . . . , p – 1, forming records with the destination address as the key and sorting these records will cause each record to end up at its correct destination.

Consequently, permutation routing on a linear array requires p compare–exchange steps. So, effectively, p packets are routed in the same amount of time that is required for routing a single packet in the worst case.

Dans le document Introduction to Parallel Processing (Page 55-59)