
3. SYSTEM DESIGN

3.1. PERFORMANCE REQUIREMENTS

Super-Linear Speedups

Super-linear speedups have been reported by Molavan [14] for non-deterministic AI computations, especially search operations. In parallel processing of search operations, some paths leading to wrong results may be eliminated earlier than with sequential processing. By avoiding these unnecessary computations, the speedup can increase by more than the number of processors.

One example of an algorithm that exhibits a super-linear speedup is A*, a widely known branch-and-bound algorithm used to solve combinatorial optimisation problems. [3.4.2.2][34]
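To make the effect concrete, the following Python sketch (an illustration constructed here, not taken from the source) simulates a search space split between two processors working in lock-step. Because the target happens to lie near the start of the second block, the parallel search eliminates almost all of the work the sequential search must perform, and the measured speedup exceeds the processor count.

def sequential_steps(items, target):
    # Elements a single processor examines before finding the target.
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps
    return len(items)

def parallel_steps(items, target, workers=2):
    # Lock-step rounds needed when each worker scans its own contiguous block.
    size = (len(items) + workers - 1) // workers
    blocks = [items[i:i + size] for i in range(0, len(items), size)]
    for rounds in range(1, size + 1):
        for block in blocks:
            if rounds <= len(block) and block[rounds - 1] == target:
                return rounds
    return size

items = list(range(1000))
target = 501                                   # lies early in the second block
t1 = sequential_steps(items, target)           # 502 elements examined
tp = parallel_steps(items, target, workers=2)  # found in 2 rounds
print(f"speedup with 2 processors: {t1 / tp:.0f}x")   # 251x, i.e. super-linear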

3.1.3. Theoretical Performance of Parallel Computers

In this section, some basic mathematical models of parallel computation are presented. They are useful for understanding the limits of parallel computation.

The ideal speedup that can be achieved by a parallel computer with n identical processors working concurrently on a single problem is n, i.e. at most n times faster than a single processor. In practice the speedup is much less, due to the many factors outlined above.

This estimate is considered pessimistic, or a lower bound on the speedup. Depending on the application, the real speedup can range from log2 n to n. With respect to the upper bound n, however, the formula (and accompanying derivation) below tends to model the upper-bound performance more accurately.

Consider a computing problem that can be executed by a uniprocessor in unit time, T1 = 1. Let fi be the probability of assigning the problem to i processors, each carrying an equal average load di = 1/i per processor. Assume an equal probability for each operating mode using i processors, i.e. fi = 1/n, for the n operating modes i = 1, 2, ..., n. The average time required to solve the problem on an n-processor system is then given by the summation over the n operating modes:

Tn = ∑_{i=1}^{n} fi di = (1/n) ∑_{i=1}^{n} (1/i)

The corresponding average speedup is therefore:

S = T1 / Tn = n / ∑_{i=1}^{n} (1/i) ≈ n / ln n
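As a quick numerical check, the short Python sketch below (illustrative, not part of the source) evaluates the average time Tn and the resulting speedup T1/Tn = n/H(n), where H(n) is the n-th harmonic number, and compares it with the n/ln n approximation.

import math

def average_time(n):
    # Average execution time over the n equally likely operating modes.
    return sum(1.0 / i for i in range(1, n + 1)) / n

for n in (2, 8, 32, 128, 512):
    speedup = 1.0 / average_time(n)   # T1 = 1
    print(f"n = {n:4d}  speedup = {speedup:7.2f}  n/ln(n) = {n / math.log(n):7.2f}")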

The Weighted Harmonic Mean Principle relates the execution rate to a weighted harmonic mean. Let Tn be the time required to compute n tasks of k different types. Each type consists of ni tasks requiring ti seconds each, such that:

Tn = ∑_{i=1}^{k} ni ti

By definition, the execution rate R is the number of events or operations in unit time, so:

R = n / Tn = n / ∑_{i=1}^{k} ni ti

Let fi = ni/n be the fraction of tasks of type i, and let Ri = 1/ti be the execution rate of type i, where each task of type i takes ti seconds to generate a result. Then:

R = 1 / ∑_{i=1}^{k} fi ti = 1 / ∑_{i=1}^{k} (fi / Ri)     (Equation 3-3)
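The two forms of the execution rate are equivalent, as the small Python check below illustrates (the task counts ni and times ti are made-up sample values, not data from the source):

tasks = [(40, 0.001), (30, 0.004), (30, 0.010)]   # (ni, ti) pairs for k = 3 types

n = sum(ni for ni, _ in tasks)
Tn = sum(ni * ti for ni, ti in tasks)             # total time for the n tasks
R_direct = n / Tn                                 # R = n / sum(ni * ti)

# Weighted harmonic mean form: fi = ni / n and Ri = 1 / ti, so fi / Ri = fi * ti
R_harmonic = 1.0 / sum((ni / n) * ti for ni, ti in tasks)

print(R_direct, R_harmonic)                       # the two values are identical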

Equation 3-3 represents a basic principle in computer architecture, sometimes loosely referred to as the bottleneck principle, or the weighted harmonic mean principle. The importance of this principle and Equation 3-3 is the following:

If a single rate is out of balance (i.e. lower than the others), then it will dominate the overall performance of the machine.
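The following sketch (with assumed example rates, not figures from the source) shows how a single low rate drags the weighted harmonic mean down, even when it applies to only 10% of the results:

def overall_rate(fractions, rates):
    # Weighted harmonic mean of Equation 3-3: R = 1 / sum(fi / Ri)
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# 90% of results produced at a rate of 1000 (arbitrary units); the remaining
# 10% at a rate that degrades from 1000 down to 1.
for low in (1000, 100, 10, 1):
    print(f"low rate = {low:4d}  overall rate = {overall_rate((0.9, 0.1), (1000, low)):7.1f}")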

This is the case when designing general-purpose Beowulf cluster client nodes with homogeneous hardware, as opposed to using heterogeneous collections of nodes communicating over a variety of different networks. A heterogeneous network and node base is largely beyond the reach of mathematical modelling, making it impractical to determine performance prior to implementation.

Amdahl introduced Amdahl’s law in 1967. This law is a particular case of the weighted harmonic mean principle. Two rates are considered:

1. The high, or parallel execution rate RH

2. The low, or scalar execution rate RL

If f denotes the fraction of results generated at the high rate RH and 1 - f is the fraction generated at the low rate RL, then Equation 3-3 becomes:

R = 1 / (f/RH + (1 - f)/RL)
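A brief illustration of this two-rate form (the rates chosen below are assumptions for the example, not values from the source): even when the high rate is 100 times the low rate, the overall rate stays close to the low rate unless f is very near 1.

def amdahl_rate(f, r_high, r_low):
    # Overall rate when a fraction f of results is produced at r_high
    # and the remaining 1 - f at r_low.
    return 1.0 / (f / r_high + (1.0 - f) / r_low)

for f in (0.5, 0.9, 0.99, 1.0):
    print(f"f = {f:4.2f}  R = {amdahl_rate(f, r_high=100.0, r_low=1.0):6.1f}")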

This formula is known as Amdahl's law. It is useful for analysing system performance that results from two individual rates of execution, such as vector versus scalar operations or parallel versus serial operations. It is also useful for analysing a complete parallel system in which one rate is out of balance with the others; for example, the low rate may be caused by I/O or communication operations, while the high rate may come from vector, memory, or other operations.

Amdahl's law can also be applied to the execution of a particular program. From the law we see that a small fraction f of inherently sequential, or unparallelisable, computation severely limits the speedup that can be achieved with p processors. Consider a unit-time task for which the fraction f is unparallelisable (so it takes the same time f on both parallel and sequential machines) and the remaining 1 - f is fully parallelisable (so it runs in time (1 - f)/p on a p-processor machine). The speedup is then:

Speedup = 1 / (f + (1 - f)/p)

Of note is the fact that when f = 0, this formula reduces to the ideal case of a p-times speedup, whereas for f > 0 the speedup is bounded above by 1/f no matter how many processors are added.
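The program-level form derived above is easy to tabulate; the sketch below (illustrative, not from the source) shows how quickly the speedup saturates towards 1/f as processors are added.

def amdahl_speedup(f, p):
    # Speedup on p processors when a fraction f of the work is sequential:
    # 1 / (f + (1 - f) / p)
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.0, 0.05, 0.1):
    row = ", ".join(f"p={p}: {amdahl_speedup(f, p):6.2f}" for p in (4, 32, 512))
    print(f"f = {f:4.2f}  ->  {row}")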

Plotting these results gives an indicative idea of the performance a Beowulf cluster will produce depending on the number of processors used. In practice this will vary; however, if a good communications network is chosen, the speedup shown can be approached. Shown below are three views of the same data, derived from the principles above.

The three views show the overall trend, and the behaviour at medium and low numbers of processors.

[Three plots of Speedup versus Number of Processors, over the ranges 0–512, 0–32 and 0–8 processors, comparing log2 n, n/ln n, Amdahl's law with f = 0.05 and f = 0.1, and the ideal speedup.]

Figure 3-3 – Various Estimates of an n-processor Speedup
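The curves of Figure 3-3 can be regenerated with a few lines of Python, assuming numpy and matplotlib are available (this script is a reconstruction for illustration, not the original plotting code):

import numpy as np
import matplotlib.pyplot as plt

n = np.arange(2, 513)
curves = {
    "log2(n)":          np.log2(n),
    "n / ln(n)":        n / np.log(n),
    "Amdahl, f = 0.05": 1.0 / (0.05 + 0.95 / n),
    "Amdahl, f = 0.1":  1.0 / (0.10 + 0.90 / n),
    "Ideal":            n.astype(float),
}

for limit in (512, 32, 8):          # the three views shown in Figure 3-3
    mask = n <= limit
    for label, y in curves.items():
        plt.plot(n[mask], y[mask], label=label)
    plt.xlabel("Number of Processors")
    plt.ylabel("Speedup")
    plt.legend()
    plt.show()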

From the above analysis and plot it can be seen why typical commercial multiprocessor systems consist of only two or four processors: designers can achieve a better speedup with a small number of fast processors than with a large number of slower processors.

One important note, closely related to Amdahl's law, is that some applications lack inherent parallelism, which limits the speedup achievable when multiple processors are used. [6][9][14]

Locality of Reference

The success of a memory hierarchy is based upon assumptions that are critical to achieving the appearance of a large, fast memory. The foundation of these assumptions is termed locality of reference.

There are three components of locality of reference, which coexist in an active process (a brief illustration is given after the list below):

Temporal – A tendency for a process to reference in the near future the elements of instructions or data referenced in the recent past. Program constructs that lead to this concept are loops, temporary variables, or process stacks.

Spatial – A tendency for a process to reference instructions or data at locations in the virtual address space near that of the last reference.

Sequentiality – The principle that if the last reference was rj(t), then there is a likelihood that the next reference is to the immediate successor of the element rj(t).
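As a simple illustration of these three components (a constructed example, not from the source), consider an ordinary summation loop over a contiguous array:

values = list(range(1_000_000))    # elements laid out one after another

total = 0
for i in range(len(values)):       # temporal locality: total, i and len(values)
    total += values[i]             # are re-referenced on every iteration;
                                   # spatial/sequential locality: values[i] is the
                                   # immediate successor of the previous reference
print(total)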

It should be noted that each process exhibits an individual characteristic with respect to the
