

From the document Introduction to Parallel Processing (Pages 65-70)

A Taste of Parallel Algorithms

2.6. ALGORITHMS WITH SHARED VARIABLES

Again, in this section, we focus on developing simple algorithms that are not necessarily very efficient. Shared-memory architectures and their algorithms will be discussed in more detail in Chapters 5 and 6.

Semigroup Computation. Each processor obtains the data items from all other processors and performs the semigroup computation independently. Obviously, all processors will end up with the same result. This approach is quite wasteful of the complex architecture of Fig. 2.5, because the linear time complexity of the algorithm is essentially comparable to that of the semigroup computation algorithm for the much simpler linear-array architecture and worse than that of the algorithm for the 2D mesh.
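The scheme can be sketched in Python, with threads standing in for the p processors sharing one list (the thread setup and the choice of operation are illustrative, not from the text):

```python
import threading

def semigroup_shared(data, op):
    """Each of the p 'processors' (threads here) reads every item in the
    shared list and reduces it on its own; all arrive at the same result."""
    p = len(data)
    results = [None] * p

    def worker(i):
        acc = data[0]                 # scan all shared items; no messaging
        for x in data[1:]:
            acc = op(acc, x)
        results[i] = acc

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Note that each thread performs p - 1 applications of op, which is where the linear-time wastefulness mentioned above shows up.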

Parallel Prefix Computation. Similar to the semigroup computation, except that each processor only obtains data items from processors with smaller indices.
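A minimal sequential model of this scheme, in which loop iteration i plays the role of processor i reading the items of processors with smaller indices:

```python
def prefix_shared(data, op):
    """Processor i (here, loop iteration i) reads only the items held by
    processors 0..i and combines them; processor 0's prefix is its own item."""
    results = []
    for i in range(len(data)):
        acc = data[0]
        for x in data[1:i + 1]:
            acc = op(acc, x)
        results.append(acc)
    return results
```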

Packet Routing. Trivial in view of the direct communication path between any pair of processors.

Broadcasting. Trivial, as each processor can send a data item to all processors directly.

In fact, because of this direct access, broadcasting is not needed; each processor already has access to any data item when needed.

Sorting. The algorithm to be described for sorting with shared variables consists of two phases: ranking and data permutation. Ranking consists of determining the relative order of each key in the final sorted list. If each processor holds one key, then once the ranks are determined, the jth-ranked key can be sent to Processor j in the data permutation phase, requiring a single parallel communication step. Processor i is responsible for ranking its own key xi. This is done by comparing xi to all other keys and counting the number of keys that are smaller than xi. In the case of equal key values, processor indices are used to establish the relative order: for example, if Processors 3 and 9 both hold the key value 23, the key associated with Processor 3 is deemed smaller for ranking purposes. It should be clear that each key will end up with a unique rank in the range 0 (no key is smaller) to p – 1 (all other p – 1 keys are smaller).
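The two phases can be sketched as follows (a sequential Python model of the parallel steps; the function name rank_sort is ours):

```python
def rank_sort(keys):
    """Two-phase sort: processor i ranks its key by counting strictly
    smaller keys, breaking ties on equal keys by processor index; the
    permutation phase then sends the rank-r key to position r."""
    p = len(keys)
    out = [None] * p
    for i in range(p):                    # all ranks computed in parallel
        rank = sum(1 for j in range(p)
                   if keys[j] < keys[i] or (keys[j] == keys[i] and j < i))
        out[rank] = keys[i]               # single parallel routing step
    return out
```

With the tie-breaking rule, the p ranks are distinct, so the permutation phase never sends two keys to the same processor.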

Again, despite the greater complexity of the shared-variable architecture compared with the linear-array or binary-tree architectures, the linear time required by the above sorting algorithm is comparable to the algorithms for these simpler architectures. We will see in Chapter 6 that logarithmic-time sorting algorithms can in fact be developed for the shared-variable architecture, leading to linear speed-up over sequential algorithms that need on the order of n log n compare–exchange steps to sort n items.

PROBLEMS

For each of the following problem/architecture pairs, find a lower bound based on the bisection width. State if the derived bound is useful.

a. Semigroup computation on linear array.

b. Parallel prefix computation on linear array.

c. Semigroup computation on 2D mesh.

d. Sorting on shared-variable architecture.

Semigroup or parallel prefix computation on a linear array

a. Semigroup computation can be performed on a linear array in a recursive fashion. Assume that p is a power of 2. First, semigroup computation is performed on the left and right halves of the array independently. Then the results are combined through two half-broadcast operations, i.e., broadcasting from each of the middle two processors to the other side of the array. Supply the details of the algorithm and analyze its complexity. Compare the result with that of the algorithm described in Section 2.3.
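One way to count the communication steps of this recursive scheme, assuming the two half-broadcasts on the two sides proceed concurrently so that combining a size-p instance costs p/2 steps (an assumption to be checked as part of the problem):

```python
def recursive_semigroup_steps(p):
    """Communication steps of the recursive scheme for p a power of 2,
    under the assumed recurrence
        T(p) = T(p/2) + p/2,  T(1) = 0,
    whose closed form is T(p) = p - 1."""
    if p == 1:
        return 0
    return recursive_semigroup_steps(p // 2) + p // 2
```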

b. Can an algorithm similar to that in part (a) be devised for parallel prefix computation? If so, how does its performance compare with the algorithm described in Section 2.3?

Parallel prefix computation on a linear array

Given n data items, determine the optimal number p of processors in a linear array such that if the n data items are distributed to the processors with each holding approximately n/p elements, the time to perform the parallel prefix computation is minimized.
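As a starting point, one plausible cost model is T(n, p) = 2n/p + p: about n/p steps for the local prefix computations, p steps to sweep the partial results along the array, and n/p steps for the final local updates. Calculus then predicts an optimum near p = sqrt(2n). A sketch that checks this numerically under that assumed model (the model itself is the thing the problem asks you to justify):

```python
def best_p(n):
    """Search for the p minimizing the assumed cost T(n, p) = 2n/p + p;
    the minimizer should land near sqrt(2n)."""
    return min(range(1, n + 1), key=lambda p: 2 * n / p + p)
```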

Multicasting on a linear array

Suppose processors in a linear array compose messages of the form mcast(x, a, b), with the meaning that the data value x must be sent (multicast) to all processors with indices in the interval [a, b]. Packet routing and broadcasting correspond to the special cases mcast(x, j, j) and mcast(x, 0, p – 1) of this more general mechanism. Develop the algorithm for handling such a multicast message by a processor.
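A sketch of one possible handling scheme, in which the source splits the multicast into a leftbound and a rightbound copy that are forwarded while destinations remain ahead (an illustrative model, not the unique answer to the problem; the data value itself is omitted for brevity):

```python
def route_mcast(src, a, b, p):
    """Model of mcast(x, a, b) handling on a p-processor linear array:
    the source at index src sends a rightbound and a leftbound copy; each
    processor keeps the value if its index lies in [a, b] and forwards
    while destinations remain ahead. Returns the indices reached."""
    delivered = set()
    i = src                       # rightbound copy
    while 0 <= i <= b:
        if a <= i:
            delivered.add(i)
        i += 1
    i = src                       # leftbound copy
    while a <= i < p:
        if i <= b:
            delivered.add(i)
        i -= 1
    return sorted(delivered)
```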


Determine the speed-up, efficiency, and other effectiveness measures defined in Section 1.6 for linear-array sorting with more than one data item per processor.

Parallel prefix computation

a. In determining the ranks of 1s in a list of 0s and 1s (Section 2.4), what happens if a diminished parallel prefix sum computation is performed rather than the regular one?

b. What is the identity element for the carry operator “¢” defined in Section 2.4?

c. Find another example of parallel prefix computation (besides carry computation) involving a noncommutative binary operation.
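For reference, the carry operator of Section 2.4 acts on the three carry signals g (generate), p (propagate), and a (annihilate). A sketch of its definition, from which part (b)'s identity element can be read off (encoding the signals as one-character strings is our choice):

```python
def carry_op(s1, s2):
    """The carry operator over the signals 'g' (generate), 'p' (propagate),
    'a' (annihilate): the right operand decides unless it propagates, in
    which case the left signal passes through unchanged."""
    return s1 if s2 == 'p' else s2
```

From this definition, 'p' behaves as a two-sided identity, and the operator is visibly noncommutative (g ¢ a differs from a ¢ g).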

Algorithms for a linear array

In Section 2.3, we assumed that the communication links between the processors in the linear array are full-duplex, meaning that they can carry data in both directions simultaneously (in one step). How should the algorithms given in Section 2.3 be modified if the communication links are half-duplex (they can carry data in either direction, but not in both directions in the same step)?

Algorithms for a ring of processors

Develop efficient algorithms for the five computations discussed in this chapter on a p-processor ring, assuming:

a. Bidirectional, full-duplex links between processors.

b. Bidirectional, half-duplex links between processors.

c. Unidirectional links between processors.

Measures of parallel processing effectiveness

Compute the effectiveness measures introduced in Section 1.6 for the parallel prefix computation algorithm on a linear array, binary tree, 2D mesh, and shared-variable architecture.

Compare and discuss the results.

Parallel prefix computation on a binary tree

Develop an algorithm for parallel prefix computation on a binary tree where the inner tree nodes also hold data elements.

Routing on a binary tree of processors

a. Modify the binary tree routing algorithm in Section 2.4 so that the variables maxl and maxr are not required, assuming that we are dealing with a complete binary tree.

b. Each processor in a tree can be given a name or label based on the path that would take us from the root to that node via right (R) or left (L) moves. For example, in Fig. 2.3, the root will be labeled Λ (the empty string), P3 would be labeled LR (left, then right), and P7 would be labeled RRL. Develop a packet routing algorithm from Node A to Node B if node labels are specified as above.
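One routing scheme suggested by such labels: climb from A to the lowest common ancestor of A and B, which corresponds to the longest common prefix of the two labels, then descend along the rest of B's label. A sketch (the 'up' token and the function name are our notation):

```python
def tree_route(a, b):
    """Moves taking a packet from the node labeled a to the node labeled b,
    where labels are L/R paths from the root (root = empty string): climb
    to the lowest common ancestor, then follow b's remaining label down."""
    k = 0                                       # longest common prefix length
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return ['up'] * (len(a) - k) + list(b[k:])  # retract, then descend
```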

Sorting on a binary tree of processors

a. Develop a new binary-tree sorting algorithm based on all-to-all broadcasting. Each leaf node broadcasts its key to all other leaves, which compare the incoming keys with their own and determine the rank of their keys. A final parallel routing phase concludes the algorithm.

Compare this new algorithm with the one described in Section 2.4 and discuss.

b. Modify the binary tree sorting algorithm in Section 2.4 so that it works with multiple keys initially stored in each leaf node.

2.13. Algorithms on 2D processor arrays

Briefly discuss how the semigroup computation, parallel prefix computation, packet routing, and broadcasting algorithms can be performed on the following variants of the 2D mesh architecture.

a. A 2D torus with wraparound links as in Fig. 2.4 (simply ignoring the wraparound links is not allowed!).

b. A Manhattan street network, so named because the row and column links are unidirectional and, like the one-way streets of Manhattan, go in opposite directions in adjacent rows or columns. Unlike the streets, though, each row/column has a wraparound link. Assume that both dimensions of the processor array are even, with links in even-numbered rows (columns) going from left to right (bottom to top).

c. A honeycomb mesh, which is a 2D mesh in which all of the row links are left intact but every other column link has been removed. Two different drawings of this architecture are shown below.

2.14. Shearsort on 2D mesh of processors

a. Write down the number of compare–exchange steps required to perform shearsort on a general (possibly nonsquare) 2D mesh with r rows and p/r columns.

b. Compute the effectiveness measures introduced in Section 1.6 for the shearsort algorithm based on the results of part (a).

c. Discuss the best aspect ratio for a p-processor mesh in order to minimize the sorting time.

d. How would shearsort work if each processor initially holds more than one key?
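For experimentation with parts (a) through (d), shearsort itself can be sketched in a few lines: snakelike row sorts (even rows ascending, odd rows descending) alternate with column sorts, for ⌈log2 r⌉ + 1 row-sort phases. This sequential model counts phases rather than individual compare–exchange steps:

```python
import math

def shearsort(grid):
    """Shearsort on an r x c mesh held as nested lists; the sorted
    result is left in snake (boustrophedon) order."""
    r, c = len(grid), len(grid[0])
    phases = max(1, math.ceil(math.log2(r))) + 1
    for phase in range(phases):
        for i in range(r):                      # row phase (snakelike)
            grid[i].sort(reverse=(i % 2 == 1))
        if phase < phases - 1:                  # column phase between row phases
            for j in range(c):
                col = sorted(grid[i][j] for i in range(r))
                for i in range(r):
                    grid[i][j] = col[i]
    return grid
```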



3

Parallel Algorithm
