
1.6. EFFECTIVENESS OF PARALLEL PROCESSING

Throughout the book, we will be using certain measures to compare the effectiveness of various parallel algorithms or architectures for solving desired problems. The following definitions and notations are applicable [Lee80]:

p      Number of processors

W(p)   Total number of unit operations performed by the p processors; this is often referred to as computational work or energy

T(p)   Execution time with p processors; clearly, T(1) = W(1) and T(p) ≤ W(p)

S(p)   Speed-up = T(1)/T(p)

E(p)   Efficiency = T(1)/(p T(p))

R(p)   Redundancy = W(p)/W(1)

U(p)   Utilization = W(p)/(p T(p))

Q(p)   Quality = T^3(1)/(p T^2(p) W(p))

The significance of each measure is self-evident from its name and defining equation given above. It is not difficult to establish the following relationships between these parameters.

The proof is left as an exercise.

1 ≤ S(p) ≤ p

U(p) = R(p)E(p)

E(p) ≤ U(p) ≤ 1

Q(p) ≤ S(p) ≤ p
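These definitions translate directly into code. The following Python sketch (an illustrative helper of our own devising, not from the original text) computes all five measures from p, T(1), W(p), and T(p), assuming T(1) = W(1) as above:

```python
def measures(p, t1, wp, tp):
    """Effectiveness measures of Section 1.6, assuming T(1) = W(1).

    p  -- number of processors
    t1 -- sequential time T(1), equal to the sequential work W(1)
    wp -- total parallel work W(p)
    tp -- parallel execution time T(p)
    """
    s = t1 / tp                       # speed-up     S(p) = T(1)/T(p)
    e = t1 / (p * tp)                 # efficiency   E(p) = T(1)/(p T(p))
    r = wp / t1                       # redundancy   R(p) = W(p)/W(1)
    u = wp / (p * tp)                 # utilization  U(p) = W(p)/(p T(p))
    q = t1 ** 3 / (p * tp ** 2 * wp)  # quality      Q(p) = T^3(1)/(p T^2(p) W(p))
    assert abs(u - r * e) < 1e-12     # sanity check of the identity U(p) = R(p)E(p)
    return {"S": s, "E": e, "R": r, "U": u, "Q": q}
```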


Figure 1.14. Computation graph for finding the sum of 16 numbers.

Example. Finding the sum of 16 numbers can be represented by the binary-tree computation graph of Fig. 1.14 with T(1) = W(1) = 15. Assume unit-time additions and ignore all else. With p = 8 processors, we have

W(8) = 15 T(8) = 4 E(8) = 15/(8 × 4) = 47%

S(8) = 15/4 = 3.75 R(8) = 15/15 = 1 Q(8) = 1.76

Essentially, the 8 processors perform all of the additions at the same tree level in each time unit, beginning with the leaf nodes and ending at the root. The relatively low efficiency is the result of limited parallelism near the root of the tree.

Now, assuming that addition operations that are vertically aligned in Fig. 1.14 are to be performed by the same processor and that each interprocessor transfer, represented by an oblique arrow, also requires one unit of work (time), the results for p = 8 processors become


W(8) = 22 T(8) = 7 E(8) = 15/(8 × 7) = 27%

S(8) = 15/7 = 2.14 R(8) = 22/15 = 1.47 Q(8) = 0.39

The efficiency in this latter case is even lower, primarily because the interprocessor transfers constitute overhead rather than useful operations.
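Feeding both scenarios into the measures sketch given earlier reproduces all of the figures above:

```python
# Tree summation, communication ignored: W(8) = 15, T(8) = 4
print(measures(p=8, t1=15, wp=15, tp=4))
# S = 3.75, E ≈ 0.47, R = 1.00, U ≈ 0.47, Q ≈ 1.76

# Unit-cost interprocessor transfers included: W(8) = 22, T(8) = 7
print(measures(p=8, t1=15, wp=22, tp=7))
# S ≈ 2.14, E ≈ 0.27, R ≈ 1.47, U ≈ 0.39, Q ≈ 0.39
```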

PROBLEMS

1.1. Ocean heat transport modeling

Assume continuation of the trends in Figs. 1.1 and 1.2:

a. When will a single microprocessor be capable of simulating 10 years of global ocean circulation, as described in Section 1.1, overnight (5:00 PM to 8:00 AM the following day), assuming a doubling of the number of divisions in each of the three dimensions?

b. When will a vector supercomputer be capable of the computation defined in part (a)?

c. When will a $240M massively parallel computer be capable of the computation of part (a)?

d. When will a $30M massively parallel computer be capable of the computation of part (a)?

You can assume that a microprocessor's FLOPS rating is roughly half of its MIPS rating.

1.2. Micros versus supers

Draw the performance trend line for microprocessors on Fig. 1.2, assuming that a microprocessor's FLOPS rating is roughly half of its MIPS rating. Compare and discuss the observed trends.

1.3. Sieve of Eratosthenes

Figure 1.6 shows that in the control-parallel implementation of the sieve of Eratosthenes algorithm, a single processor is always responsible for sieving the multiples of 2. For n = 1000, this is roughly 35% of the total work performed. By Amdahl’s law, the maximum possible speed-up for p = 2 and ƒ = 0.35 is 1.48. Yet, for p = 2, we note a speed-up of about 2 in Fig. 1.6. What is wrong with the above reasoning?
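For reference, the quoted bound is Amdahl’s formula evaluated at ƒ = 0.35 and p = 2: speed-up = 1/(ƒ + (1 − ƒ)/p) = 1/(0.35 + 0.65/2) = 1/0.675 ≈ 1.48.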

1.4. Sieve of Eratosthenes

Consider the data-parallel implementation of the sieve of Eratosthenes algorithm for n = 106. Assume that marking of each cell takes 1 time unit and broadcasting a value to all processors takes b time units.

a. Plot three speed-up curves similar to Fig. 1.8 for b = 1, 10, and 100 and discuss the results (a plotting sketch follows part (b)).

b. Repeat part (a), this time assuming that the broadcast time is a linear function of the number of processors: b = αp + β, with (α, β) = (5, 1), (5, 10), (5, 100).
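A minimal plotting sketch for this problem follows. The cost model is an assumption on our part (the book’s exact model accompanies Fig. 1.8, which is not reproduced here): before each marking pass, the current prime is broadcast at a cost of b time units, after which the pass’s markings are divided evenly among the p processors. For part (b), replace the constant b with αp + β.

```python
import math
import matplotlib.pyplot as plt  # assumed available for plotting

N = 10**6

def marking_passes(n):
    """Return (prime, number of markings) for each prime q <= sqrt(n);
    prime q marks the cells 2q, 3q, ..., i.e., n//q - 1 cells."""
    limit = math.isqrt(n)
    is_comp = bytearray(limit + 1)
    passes = []
    for q in range(2, limit + 1):
        if not is_comp[q]:
            for m in range(q * q, limit + 1, q):
                is_comp[m] = 1
            passes.append((q, n // q - 1))
    return passes

passes = marking_passes(N)
t1 = sum(marks for _, marks in passes)  # sequential time: markings only, no broadcasts

def t_parallel(p, b):
    # Assumed model: one broadcast (b units) per prime, then the pass's
    # markings are split evenly over the p processors.
    return sum(b + math.ceil(marks / p) for _, marks in passes)

for b in (1, 10, 100):
    ps = list(range(1, 33))
    plt.plot(ps, [t1 / t_parallel(p, b) for p in ps], label=f"b = {b}")
plt.xlabel("processors p"); plt.ylabel("speed-up S(p)"); plt.legend(); plt.show()
```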

1.5. Sieve of Eratosthenes

Consider the data-parallel implementation of the sieve of Eratosthenes algorithm for n = 106. Assume that marking of each cell takes 1 time unit and broadcasting m numbers to all processors takes b + cm time units, where b and c are constants. For each of the values 1, 10, and 100 for the parameter b, determine the range of values for c where it would be more cost-effective for Processor 1 to send the list of all primes that it is holding to all other processors in a single message before the actual markings begin.

1.6. Sieve of Eratosthenes

a. Noting that 2 is the only even prime, propose a modification to the sieve of Eratosthenes algorithm that requires less storage (one possible variant is sketched after part (d)).

b. Draw a diagram, similar to Fig. 1.6, for the control-parallel implementation of the improved algorithm. Derive the speed-ups for two and three processors.

c. Compute the speed-up of the data-parallel implementation of the improved algorithm over the sequential version.

d. Compare the speed-ups of parts (b) and (c) with those obtained for the original algorithm.
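For part (a), one standard storage-saving variant (a sketch of one possible answer, not necessarily the intended one) keeps flags only for the odd numbers, since 2 is the only even prime; this halves the storage:

```python
def sieve_odd_only(n):
    """Sieve of Eratosthenes over the odd numbers 3, 5, ..., n only;
    flag index i corresponds to the number 2*i + 3."""
    size = (n - 1) // 2            # count of odd numbers in 3..n
    is_comp = bytearray(size)
    for i in range(size):
        q = 2 * i + 3
        if q * q > n:              # remaining unmarked odds are prime
            break
        if not is_comp[i]:
            # mark odd multiples q*q, q*(q + 2), ...; even multiples need no flags
            for m in range(q * q, n + 1, 2 * q):
                is_comp[(m - 3) // 2] = 1
    return [2] + [2 * i + 3 for i in range(size) if not is_comp[i]]

# sieve_odd_only(30) -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```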

1.7. Amdahl’s law

Amdahl’s law can be applied in contexts other than parallel processing. Suppose that a numerical application consists of 20% floating-point and 80% integer/control operations (these are based on operation counts rather than their execution times). The execution time of a floating-point operation is three times as long as other operations. We are considering a redesign of the floating-point unit in a microprocessor to make it faster.

a. Formulate a more general version of Amdahl’s law in terms of selective speed-up of a portion of a computation rather than in terms of parallel processing.

b. How much faster should the new floating-point unit be for 25% overall speed improve-ment?

c. What is the maximum speed-up that we can hope to achieve by only modifying the floating-point unit?
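A worked preliminary that may help: the portion being accelerated must be measured in time, not in operation counts. Since each floating-point operation takes three time units, floating point accounts for (0.2 × 3)/(0.2 × 3 + 0.8 × 1) = 0.6/1.4 ≈ 43% of the execution time, and it is this time fraction that is selectively sped up.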

1.8. Amdahl’s law

a. Represent Amdahl’s law in terms of a task or computation graph similar to that in Fig. 1.13.

Hint: Use an input and an output node, each with computation time ƒ/2, where ƒ is the inherently sequential fraction.

b. Approximate the task/computation graph of part (a) with one having only unit-time nodes.

1.9. Parallel processing effectiveness

Consider two versions of the task graph in Fig. 1.13. Version U corresponds to each node requiring unit computation time. Version E/O corresponds to each odd-numbered node being unit-time and each even-numbered node taking twice as long.

a. Convert the E/O version to an equivalent V version where each node is unit-time.

b. Find the maximum attainable speed-up for each of the U and V versions.

c. What is the minimum number of processors needed to achieve the speed-ups of part (b)?

d. What is the maximum attainable speed-up in each case with three processors?

e. Which of the U and V versions of the task graph would you say is “more parallel” and why?

1.10. Parallel processing effectiveness

Prove the relationships between the parameters in Section 1.6.

1.11. Parallel processing effectiveness

An image processing application problem is characterized by 12 unit-time tasks: (1) an input task that must be completed before any other task can start and consumes the entire bandwidth of the single-input device available, (2) 10 completely independent computational tasks, and (3) an output task that must follow the completion of all other tasks and consumes the entire bandwidth of the single-output device available. Assume the availability of one input and one output device throughout.


a. Draw the task graph for this image processing application problem.

b. What is the maximum speed-up that can be achieved for this application with two processors?

c. What is an upper bound on the speed-up with parallel processing?

d. How many processors are sufficient to achieve the maximum speed-up derived in part (c)?

e. What is the maximum speed-up in solving five independent instances of the problem on two processors?

f. What is an upper bound on the speed-up in parallel solution of 100 independent instances of the problem?

g. How many processors are sufficient to achieve the maximum speed-up derived in part (f)?

h. What is an upper bound on the speed-up, given a steady stream of independent problem instances?

1.12. Parallelism in everyday life

Discuss the various forms of parallelism used to speed up the following processes:

a. Student registration at a university.

b. Shopping at a supermarket.

c. Taking an elevator in a high-rise building.

1.13. Parallelism for fame or fortune

In 1997, Andrew Beale, a Dallas banker and amateur mathematician, put up a gradually increasing prize of up to U.S. $50,000 for proving or disproving his conjecture that if a^q + b^r = c^s (where all terms are integers and q, r, s > 2), then a, b, and c have a common factor. Beale’s conjecture is, in effect, a general form of Fermat’s Last Theorem, which asserts that a^n + b^n = c^n has no integer solution for n > 2. Discuss how parallel processing can be used to claim the prize.
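To make the discussion concrete, here is a minimal brute-force kernel (the function name and bounds are illustrative assumptions, not a method actually used on the prize): the search space partitions cleanly by base a, so disjoint slices can be handed to different processors with no interaction, making the search embarrassingly parallel.

```python
import math

def beale_search(a_slice, max_base=200, max_exp=8):
    """Search one slice of the space for a counterexample to Beale's
    conjecture: a^q + b^r = c^s with q, r, s > 2 and gcd(a, b, c) = 1."""
    # Precompute pure powers c^s (s >= 3) up to the largest reachable sum;
    # collisions such as 2^9 = 8^3 overwrite each other, which is harmless
    # since any single representation suffices for detection.
    limit = 2 * max_base ** max_exp
    powers = {}
    for c in range(2, max_base + 1):
        v, s = c ** 3, 3
        while v <= limit:
            powers[v] = (c, s)
            v, s = v * c, s + 1
    hits = []
    for a in a_slice:
        for q in range(3, max_exp + 1):
            aq = a ** q
            for b in range(2, max_base + 1):
                for r in range(3, max_exp + 1):
                    match = powers.get(aq + b ** r)
                    if match and math.gcd(math.gcd(a, b), match[0]) == 1:
                        hits.append((a, q, b, r) + match)
    return hits

# Worker k of K would check a_slice = range(2 + k, max_base + 1, K); no
# counterexample is expected in small ranges, e.g., beale_search(range(2, 50)).
```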

REFERENCES

Bell, G., “Ultracomputers: A Teraflop Before Its Time,” Communications of the ACM, Vol. 35, No. 8, pp. 27–47, August 1992.

Flynn, M. J., and K. W. Rudd, “Parallel Architectures,” ACM Computing Surveys, Vol. 28, No. 1, pp. 67–70, March 1996.

Johnson, E. E., “Completing an MIMD Multiprocessor Taxonomy,” Computer Architecture News, Vol. 16, No. 3, pp. 44–47, June 1988.

Lee, R. B.-L., “Empirical Results on the Speed, Efficiency, Redundancy, and Quality of Parallel Computations,” Proc. Int. Conf. Parallel Processing, 1980, pp. 91–96.

Parhami, B., “The Right Acronym at the Right Time” (The Open Channel), IEEE Computer, Vol. 28, No. 6, p. 120, June 1995.

Quinn, M. J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, 1987.

Quinn, M. J., Parallel Computing: Theory and Practice, McGraw-Hill, 1994.

Schaller, R. R., “Moore’s Law: Past, Present, and Future,” IEEE Spectrum, Vol. 34, No. 6, pp. 52–59, June 1997.

Semiconductor Industry Association, The National Technology Roadmap for Semiconductors, 1994.
