
The multiprocessor simulations for this thesis are done using Proteus [Bre91]. Although this simulation tool allows varying many architectural parameters, the figures here are obtained for a fixed hardware model. The imaginary machine is composed of 64 nodes arranged in an 8×8 mesh with bidirectional links between nearest neighbors on the mesh. Each node contains a processor, a memory unit, and a hardware-supported coherent cache. The simulation uses the Alewife [Aga91] cache coherence protocol and also allows for explicit message sending between processors. Although most communication is accomplished through the shared-memory interface, some operations such as software-supported barrier synchronization are implemented using explicit messages.

A compiler which generates point-to-point synchronization for parallel programs has been implemented in C. The compiler accepts as input the syntax as given in the examples of this thesis and emits the augmented C code expected by Proteus. The different phases of compilation are illustrated in Figure 6-1. Observe that only the top two phases are implemented by conventional sequential compilers. Later phases correspond to analysis steps that are derived in this thesis.

Figure 6-1: Compiler structure. (Phases, from a parallel program to code for each processor: parsing and preliminary analysis; scalar flow analysis & SSA form; propagation of linear induction variables; array flow analysis; computation of statement dependences; computation of processor dependences; elimination of redundant dependences; code generation.)

The compilation times for several applications on a SparcStation IPC are shown in Figure 6-2. For the smaller applications, the almost instantaneous compiler response did not allow accurate measurement of the individual phases. Illustrating the efficiency of the algorithms presented here, the compiler finishes in under 35 seconds even on very large procedures. The expense of the array flow analysis phase stems primarily from the fact that reaching sets are represented as linked lists. If hash tables or binary trees were used instead, the time to search each set could be reduced from O(n) to O(log n) or O(1), which should significantly improve compiler running time. Note also that the time to compute processor dependences is significant despite the simplicity of the scheme presented here.
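To make the data-structure remark concrete, the following is a minimal sketch, not taken from the actual compiler, of how reaching-set membership tests could be answered in expected O(1) time with a simple open-addressing hash table instead of an O(n) linked-list traversal. The type and function names (reach_set, reach_insert, reach_member) are illustrative only.

    /* Minimal open-addressing hash set for reaching-set membership tests.
     * Assumes statement ids are non-negative, the capacity is a power of
     * two, and the table is never completely filled. */
    #include <stdlib.h>

    #define EMPTY_SLOT (-1)

    typedef struct {
        int *slots;      /* statement ids, or EMPTY_SLOT */
        int  capacity;   /* power of two */
    } reach_set;

    reach_set *reach_new(int capacity)
    {
        reach_set *s = malloc(sizeof *s);
        s->capacity = capacity;
        s->slots = malloc(capacity * sizeof *s->slots);
        for (int i = 0; i < capacity; i++)
            s->slots[i] = EMPTY_SLOT;
        return s;
    }

    /* Expected O(1) insert, versus O(n) traversal of a linked list. */
    void reach_insert(reach_set *s, int stmt_id)
    {
        int i = stmt_id & (s->capacity - 1);
        while (s->slots[i] != EMPTY_SLOT && s->slots[i] != stmt_id)
            i = (i + 1) & (s->capacity - 1);      /* linear probing */
        s->slots[i] = stmt_id;
    }

    /* Expected O(1) membership test. */
    int reach_member(const reach_set *s, int stmt_id)
    {
        int i = stmt_id & (s->capacity - 1);
        while (s->slots[i] != EMPTY_SLOT) {
            if (s->slots[i] == stmt_id)
                return 1;
            i = (i + 1) & (s->capacity - 1);
        }
        return 0;
    }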

Figure 6-2: Compilation time for some applications (per-phase breakdown, in seconds, for WaTor, Shallow, Simple, and MICCG3D; phases: parsing & prelim., scalar flow & SSA, propagation, array flow analysis, statement dependence, processor dependence, elim. redundant dep., code generation)

6.3 An example

In this section, we present an application which benefits greatly from point-to-point synchronization. Although this example is by no means representative of the benchmarks, it can be used to illustrate some strengths as well as weaknesses of the approach.

The WaTor program is adapted from an ecological simulation that appears in [Fox88]. Given a population of predators and prey with defined behavior, we wish to simulate the dynamics of the population in time. In this particular example, sharks form the predators and minnows form the prey. Both species inhabit a rectangular lake which is represented by a two-dimensional array. Each element in the array can either contain a shark or minnow or be empty. Each fish can move in one of four possible directions.

On each time step, a minnow moves randomly to an adjacent empty array element and leaves an offspring if the minnow is older than a specified breeding age. A shark first searches for adjacent cells with minnows. If one exists, then it randomly moves to one such cell and eats the minnow. Otherwise, it moves as a minnow, but can die if it has not eaten for a certain time.
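As a concrete reading of these rules, the following is a rough sequential sketch of a per-cell update. The cell layout, the breeding and starvation constants, and the neighbor-picking helpers are all hypothetical and are not taken from the WaTor sources.

    /* Hypothetical per-cell update rule; pick_empty_neighbor() and
     * pick_minnow_neighbor() are assumed to choose a random qualifying
     * neighbor and return 0 if none exists. */
    #define LAKE_N      32   /* assumed lake dimension */
    #define BREED_AGE    3   /* assumed minnow breeding age */
    #define STARVE_TIME  5   /* assumed shark starvation limit */

    enum kind { WATER, MINNOW, SHARK };
    typedef struct { enum kind k; int age; int hunger; } cell;

    int pick_empty_neighbor(cell lake[][LAKE_N], int i, int j, int *ni, int *nj);
    int pick_minnow_neighbor(cell lake[][LAKE_N], int i, int j, int *ni, int *nj);

    void update_cell(cell lake[][LAKE_N], int i, int j)
    {
        cell *c = &lake[i][j];
        int ni, nj;
        if (c->k == MINNOW) {
            if (pick_empty_neighbor(lake, i, j, &ni, &nj)) {
                lake[ni][nj] = *c;           /* move to an empty neighbor */
                lake[ni][nj].age++;
                if (c->age > BREED_AGE)
                    c->age = 0;              /* leave an offspring behind */
                else
                    c->k = WATER;
            }
        } else if (c->k == SHARK) {
            if (pick_minnow_neighbor(lake, i, j, &ni, &nj)) {
                lake[ni][nj] = *c;           /* move onto and eat the minnow */
                lake[ni][nj].hunger = 0;
                c->k = WATER;
            } else if (++c->hunger > STARVE_TIME) {
                c->k = WATER;                /* shark starves */
            } else if (pick_empty_neighbor(lake, i, j, &ni, &nj)) {
                lake[ni][nj] = *c;           /* otherwise move like a minnow */
                c->k = WATER;
            }
        }
    }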

Figure 6-3: Eliminating update conflicts for the WaTor benchmark

If one were to imagine a parallel simulation of the above lake, then potential update conflicts immediately arise. Imagine the situation where one processor p is updating an element containing a shark and another processor p′ is updating an adjacent element containing a minnow, as in Figure 6-3a. If p lets the shark consume the minnow and p′ moves the minnow to another array element, then an inconsistency arises. Thus two processors cannot be updating two adjacent elements. Furthermore, a conflict also occurs when two processors try to deposit a fish into the same array element, as shown in Figure 6-3b. Consequently, a correct solution must ensure that at no time can two processors be updating two array elements that are separated by a Manhattan distance of 2 or less. The implementation considered here satisfies this constraint by tiling the array with a 6-color pattern as shown in Figure 6-3c.† On each phase of the update routine, only cells with a particular color are updated. In a 6-color scheme, an update iteration must contain six phases. Semantically, a barrier synchronization occurs between each color phase, ensuring that all updates are free of conflicts. Of course, the nearest-neighbor array usage of the application makes it a prime candidate for implementing point-to-point synchronization.

† A 5-coloring can also be used to satisfy the constraints, but requires a larger tile pattern.
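A minimal sketch of this phased structure appears below. The coloring formula (2*i + j) mod 6 is one assignment with the required spacing (same-colored cells are at Manhattan distance at least 3), but it is not necessarily the tile pattern used by the benchmark; update_cell() and barrier() stand in for the per-cell update rule and the runtime barrier.

    /* One WaTor time step as six color phases separated by barriers.
     * The coloring (2*i + j) % 6 is an assumed pattern with the required
     * spacing; update_cell() and barrier() are stand-ins. */
    #define ROWS 32
    #define COLS 32

    extern void update_cell(int i, int j);
    extern void barrier(void);

    void wator_step(void)
    {
        for (int color = 0; color < 6; color++) {
            /* Each processor would scan only its own block; the full
             * array is shown here for clarity. */
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    if ((2 * i + j) % 6 == color)
                        update_cell(i, j);
            barrier();   /* semantically, a barrier between color phases */
        }
    }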

Using a block partitioning scheme, each processor is responsible for updating a block of the array. Array elements are shared at the boundary points of these blocks, and synchronization must be done to ensure that the accesses are performed in the correct order. If one imagines executing the code in Figure 6-4 on an 8×8 processor array, then the loop space can be partitioned among the processors as illustrated. In order to ensure correct execution order, synchronization must be performed between each set of nested DOALL loops. Focusing on the dependences between colors 1 and 2, one observes that each processor (y, x) executing color 2 must synchronize with processors (y, x−1), (y, x+1), (y+1, x), and (y+1, x−1). Note that this analysis must be done between every set of loops and not just between consecutive loops. Fortunately, the farther apart the color phases are, the more likely it is that the dependences become redundant due to other dependences between the phases. From this example, one can see that the task of computing point-to-point synchronization can be very tedious and is best done by a compiler rather than a programmer.
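To illustrate what such compiler-generated synchronization might look like, here is a hedged sketch using per-processor completion flags in shared memory; the flag layout and the post()/await() helpers are illustrative and are not the compiler's actual output.

    /* Sketch of flag-based point-to-point synchronization between color
     * phases.  In a real run the flags would be versioned or cleared
     * between time steps. */
    #define P 8                          /* 8 x 8 processor array */

    volatile int done[6][P][P];          /* done[phase][y][x] */

    void post(int phase, int y, int x)
    {
        done[phase][y][x] = 1;           /* assert completion of this phase */
    }

    void await(int phase, int y, int x)
    {
        if (y < 0 || y >= P || x < 0 || x >= P)
            return;                      /* no neighbor beyond the mesh edge */
        while (!done[phase][y][x])
            ;                            /* spin on the producer's flag */
    }

    /* Executed by processor (y, x): announce that color 1 is finished,
     * then wait only for the four producers whose data color 2 reads. */
    void sync_color1_to_color2(int y, int x)
    {
        post(0, y, x);                   /* phase index 0 corresponds to color 1 */
        await(0, y, x - 1);
        await(0, y, x + 1);
        await(0, y + 1, x);
        await(0, y + 1, x - 1);
    }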

Figure 6-4: Partitioning a 32×32 WaTor array on an 8×8 processor array

In the actual code produced by the compiler, each processor must synchronize with 4 to 6 other processors before each color phase. In comparison against a software barrier scheme, the more local synchronization approach would clearly perform better. Instead, if one were to employ a hardware barrier scheme, then the cost of reading 4 remote memory locations is probably similar to that of executing a hardware barrier. At first glance, one may not expect any difference in performance between the two implementations.

However, this view does not consider the second disadvantage of barrier synchronization: unnecessary idling. In this particular example, the variance in the execution time of each iteration can be very large. Much more processing must be done for array elements that contain fish than for those that are empty. This unbalanced loading also varies dynamically as fish move and regenerate. If global synchronization is performed between each phase, then the time to execute each phase is equal to the time of the busiest processor during the phase. If point-to-point synchronization is implemented instead, then the execution of different phases can overlap in time and the effects of busy processors can be minimized. This effect can be seen by comparing the execution profiles of the two schemes, as shown in Figure 6-5. In the no-cost barrier scheme, no time elapses between the last processor entering a barrier and the barrier exit by all processors. Even with such an ideal barrier, the illustration shows that the point-to-point synchronization scheme provides superior performance.

Figure 6-5: A comparison of synchronization schemes on the WaTor benchmark

One may wonder how the disadvantages of barrier synchronization are affected by problem size. As a problem becomes larger and each processor spends more time in each phase on computation, the constant overhead of software barriers becomes less significant. With very large problem sizes, one would also expect the variance in load on individual processors to decrease. In this particular example, the fish population on each processor can be represented by a binomial of n coefficients, where n is the number of elements per processor. As n increases, the variance of the fish population decreases, which in turn reduces the penalty for global synchronization. Indeed, for any distribution, the standard deviation of the average of n identical events scales as 1/√n [Fel68]. One would expect the relative overhead due to unnecessary idling to be related to this ratio.
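For reference, the standard argument behind this scaling: if the work contributed by each of the n elements on a processor is modeled as independent, identically distributed random variables X_1, ..., X_n with variance σ², then the per-processor load is proportional to their average, and

    \mathrm{Var}\left(\bar{X}\right)
      = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
      = \frac{\sigma^2}{n},
    \qquad
    \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \;\propto\; \frac{1}{\sqrt{n}},

so the spread of per-processor loads, and with it the idle time forced by a global barrier, shrinks as 1/√n while the mean work per processor stays fixed.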

Figure 6-6 shows the effects of the different synchronization schemes as problem size varies. Software barrier synchronization is accomplished by using a message-passing spanning tree, which requires around 450 processor cycles to execute on a 64-processor machine. For both barrier schemes, synchronization is inserted only where necessary, as computed by the flow analysis. No-cost point-to-point synchronization implies that no time elapses between a synchronization assertion by the source processor and the observation of that assertion by the sink processor. In studying the graph, one can view the difference between the software and no-cost barriers as the penalty due to global propagation. The difference between no-cost barriers and no-cost point-to-point synchronization can be viewed as the penalty due to unnecessary idling. All times are normalized with respect to the software barrier time. From the graph, we see that as problem size increases, the penalties due to both inefficiencies decrease relative to overall execution time.
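As a rough illustration of such a spanning-tree barrier (the message primitives send_to() and recv_from() are stand-ins, not Proteus or Alewife calls), consider:

    /* Sketch of a binary spanning-tree barrier over explicit messages. */
    #define NPROC 64

    extern void send_to(int dest);     /* send an empty barrier message */
    extern void recv_from(int src);    /* block until a message from src arrives */

    void tree_barrier(int me)
    {
        int left  = 2 * me + 1;
        int right = 2 * me + 2;

        /* Gather phase: wait for both children, then notify the parent. */
        if (left  < NPROC) recv_from(left);
        if (right < NPROC) recv_from(right);
        if (me != 0)       send_to((me - 1) / 2);

        /* Release phase: wait for the parent, then release the children. */
        if (me != 0)       recv_from((me - 1) / 2);
        if (left  < NPROC) send_to(left);
        if (right < NPROC) send_to(right);
    }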

Figure 6-6: WaTor performance for varying problem sizes (normalized execution time, relative to the software barrier, for software barrier, no-cost barrier, and no-cost point-to-point synchronization on 16×16 through 64×64 problems)

Although the WaTor application possesses characteristics that enable point-to-point synchronization to be advantageous, such characteristics cannot be readily extracted from every representation of the program. In order to allow the compiler to provide significant results, the application had to be written in a particular way. As the first of several examples, consider the code in Figure 6-4. Each color phase is separated into its own set of nested loops, thus allowing the compiler to be explicitly aware of the array elements that are active in each phase. This knowledge in turn enables processor synchronization targets to be computed intelligently and results in each processor only requiring synchronization with a few other processors between phases. If, instead, each phase is not represented by its own loop but simply by an outer loop as in Figure 6-7a, then the phases are no longer lexically distinguished. Any knowledge about the structure of the colors within the array is hidden. The compiler must treat the loop body as executable by any phase and conclude that each processor must synchronize with all 8 of its neighbors.
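The contrast can be sketched as follows; the DOALL loops of the thesis are written here as ordinary sequential C loops, and color_of() and update_cell() are assumed helpers rather than code from the benchmark.

    #define ROWS 32
    #define COLS 32

    extern int  color_of(int i, int j);      /* assumed tiling function */
    extern void update_cell(int i, int j);

    /* (a) One loop nest per color phase, in the spirit of Figure 6-4: the
     * elements touched by each phase are lexically visible, so the compiler
     * can compute a small set of synchronization partners per processor. */
    void step_per_color(void)
    {
        for (int i = 0; i < ROWS; i++)       /* color 1 phase */
            for (int j = 0; j < COLS; j++)
                if (color_of(i, j) == 1)
                    update_cell(i, j);
        /* ...one such loop nest for each of the remaining five colors... */
    }

    /* (b) A single outer loop over phases, in the spirit of Figure 6-7a:
     * the body may execute for any color, so the compiler must assume each
     * processor synchronizes with all eight of its neighbors between phases. */
    void step_outer_loop(void)
    {
        for (int c = 1; c <= 6; c++)
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    if (color_of(i, j) == c)
                        update_cell(i, j);
    }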

As another example, consider the code fragment in Figure 6-8b. For each of the four directions that a fish can move, a statement exists to modify the particular array element in that direction. This allows the compiler to deduce that the set of elements of a that can be changed for coordinate (i, j) is {(i−1, j), (i+1, j), (i, j−1), (i, j+1)}. Now consider the more cleanly written version in Figure 6-8c, where the update is done by one statement and the arrays dx and dy represent the changes in i and j for each direction.
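A sketch of the difference follows; the surrounding WaTor details are omitted, the array names follow the text, and the element type is simplified to an int.

    #define COLS 32

    /* Style of Figure 6-8b: one assignment per direction.  The compiler can
     * read off the four possible targets {(i-1,j), (i+1,j), (i,j-1), (i,j+1)}. */
    void move_explicit(int a[][COLS], int i, int j, int dir, int fish)
    {
        if      (dir == 0) a[i - 1][j] = fish;
        else if (dir == 1) a[i + 1][j] = fish;
        else if (dir == 2) a[i][j - 1] = fish;
        else               a[i][j + 1] = fish;
    }

    /* Style of Figure 6-8c: a single indexed update through dx and dy.
     * Without knowing the array contents, the compiler must assume any
     * element of a may be written. */
    static const int dx[4] = { -1, 1,  0, 0 };   /* change in i */
    static const int dy[4] = {  0, 0, -1, 1 };   /* change in j */

    void move_indexed(int a[][COLS], int i, int j, int dir, int fish)
    {
        a[i + dx[dir]][j + dy[dir]] = fish;
    }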

In the current compiler, nothing is deduced about array values, and the compiler must consequently assume that the update can happen to any possible array element. Any dependences with this statement can then only be satisfied by a barrier synchronization, since the relationship between the processor and data spaces has been lost. Even if one makes the reasonable assumption that the compiler can deduce that every element in dx and dy is in the range [−1, 1], this information only allows one to limit the range of updates to one of nine elements. In order to recover fully what the separate treatment of directions provided, the compiler must somehow realize the coupled relationship between elements of dx and dy. It must be able to infer that whenever the arrays are accessed together, only four possible values can result. However, this is too much to expect out of the analysis tools of today.