
do (i=1,100) {
  do (k=1,10) {
    doall (j=1,1000) {
      a[i,j] = ...;              /* S1 */
      sync[j] = i;               /* S3 */
  } }
  do (l=1,5) {
    doall (j=1,1000) {
      while (sync[j-1] < i);     /* S4 */
      ... = f(a[i-3,j-1]);       /* S2 */
} } }

Figure 4-1

The program in Figure 4-1 shows how the above scheme can be used to support the flow dependence between S1 and S2 without resorting to barrier synchronizations. After a value is written to an element of a, the synchronization variable for that array element is also updated in statement S3. Before the same element of a is read, statement S4 ensures that its synchronization variable has the proper value. Although this approach produces a correct program, it suffers from various inefficiencies which are addressed in the next section.
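To make the scheme concrete, the following is a minimal C sketch of the two operations implied by S3 and S4, assuming a shared-memory machine, C11 atomics, and one synchronization counter per column of a; the names sync_var, post_element, and wait_element are illustrative and do not come from the original program.

#include <stdatomic.h>

#define NCOLS 1000
static atomic_int sync_var[NCOLS + 1];   /* sync_var[j] = latest i for which a[i,j] has been written */

/* Source side (S3): announce that a[i,j] is available. */
static inline void post_element(int j, int i) {
    atomic_store_explicit(&sync_var[j], i, memory_order_release);
}

/* Sink side (S4): busy-wait until a[i,j] is available before reading it. */
static inline void wait_element(int j, int i) {
    while (atomic_load_explicit(&sync_var[j], memory_order_acquire) < i)
        ;   /* spin */
}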

4.2.2 Implementation issues

Consider the execution of the program in Figure 4-1 on a machine with 10 processors. Assume that each of the DOALL loops is distributed 100 consecutive iterations per processor, so that processor p_j is responsible for iterations 100j-99 to 100j. As implemented in the example, each processor needs to check 100 values of the sync array before it proceeds. Instead, the partitioning information can be used to observe that each processor p_j needs data written by iterations 100j-100 to 100j-1 of the DOALL loop for S1. Therefore, processor p_j only needs to synchronize with processor p_{j-1}, and synchronization can be accomplished by checking one value rather than 100. One can view the difference of the j indices of the array accesses as inducing a spatial relationship on the statements. Although the above case seems straightforward, this problem can be more complex when loop partitions are not uniform and multiple array dimensions involve multiple DOALL loops.
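The partition-aware check might be sketched as follows in C, under the blocked distribution assumed above (100 consecutive j iterations per processor, processors numbered 1 to 10); the helper names are illustrative. Because the only element a processor reads that it did not write itself lies at the boundary of its left neighbor's block, a single check replaces the 100 element-wise checks.

#include <stdatomic.h>

#define ITERS_PER_PROC 100
static atomic_int sync_var[1000 + 1];   /* element-wise counters, as in Figure 4-1 */

/* Processor that owns DOALL iteration j under the blocked distribution. */
static inline int owner(int j) {
    return (j + ITERS_PER_PROC - 1) / ITERS_PER_PROC;   /* j = 1..1000 maps to p = 1..10 */
}

/* Processor p reads a[..,j-1] only for j in its own block; every j-1 it
 * needs satisfies owner(j-1) == p except the block boundary, which is
 * owned by processor p-1.  One check therefore replaces 100. */
static inline void wait_on_left_neighbor(int p, int i) {
    int boundary_j = (p - 1) * ITERS_PER_PROC;           /* owner(boundary_j) == p-1 */
    if (p > 1)                                           /* processor 1 has no left neighbor */
        while (atomic_load_explicit(&sync_var[boundary_j], memory_order_acquire) < i)
            ;
}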

do (i=1,100) {
  do (k=1,10) {
    doall (j=1,1000) {
      a[i,j] = ...;              /* S1 */
  } }
  sync[p] = i;
  while (sync[p-1] < i-3);
  do (l=1,5) {
    doall (j=1,1000) {
      ... = f(a[i-3,j-1]);       /* S2 */
} } }

Figure 4-2

In addition to a spatial relationship between statements, a temporal relationship exists as well. In the above example of Figure 4-1, while it is certainly correct to synchronize with the current iteration of i, each definition of array a is not actually used until 3 iterations of i later. Consequently, rather than checking the sync variable for the current index of i, one can check for the value i-3 and allow for more variance in execution among the processors. The iteration distance can be viewed as a temporal relationship between the statements. In addition, observe that synchronization is unnecessarily performed for every iteration of the inner DO loops. Since the dependence really exists from the last iteration of the k loop to the first iteration of the l loop, synchronization calls can be moved outside of the loops. The improved program is shown in Figure 4-2, with the variable p representing the current processor number.
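The per-processor synchronization of Figure 4-2 might be realized as in the following C sketch, again assuming C11 atomics and processors numbered 1 to 10; the flag array plays the role of sync[p] in the figure, and the guard for processor 1 (which has no left neighbor) is an added assumption not shown in the figure.

#include <stdatomic.h>

static atomic_int proc_sync[10 + 1];   /* proc_sync[p] = last i for which processor p has finished S1 */

/* After the k loop: sync[p] = i in Figure 4-2. */
static inline void post_iteration(int p, int i) {
    atomic_store_explicit(&proc_sync[p], i, memory_order_release);
}

/* Before the l loop: while (sync[p-1] < i-3) in Figure 4-2.  The sink only
 * needs values produced 3 iterations of i earlier, so processors may drift
 * up to 3 iterations apart. */
static inline void wait_on_iteration(int p, int i) {
    if (p > 1)   /* assumed guard: processor 1 has no left neighbor */
        while (atomic_load_explicit(&proc_sync[p - 1], memory_order_acquire) < i - 3)
            ;
}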

Another example can be used to illustrate the difference between temporal and spatial relationships. In Figure 4-3, an anti-dependence exists between the use of the variable x and its definition. Here, we follow an assumption that iterations of sequential loops are not partitioned among different processors. In case (a), synchronization only needs to be done with the last iteration of the sequential loop, while in case (b), synchronization must be done with every iteration of the parallel loop. This can be explained by the observation that sequential loop iterations are ordered in time while parallel loop iterations are not.
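Figure 4-3 itself is not reproduced here, but the contrast it draws can be sketched in C as follows (illustrative names, not the thesis's code): in both cases x is read under a loop and overwritten afterwards, so the write must wait for the reads.

#include <stdatomic.h>

/* Case (a): the reads of x occur in a SEQUENTIAL loop run by one processor.
 * Its iterations are ordered in time, so the writer of x only needs to know
 * that the last iteration has completed. */
static atomic_int seq_loop_done;    /* set to 1 after the final iteration */

static inline void writer_wait_case_a(void) {
    while (!atomic_load_explicit(&seq_loop_done, memory_order_acquire))
        ;
}

/* Case (b): the reads of x occur in a PARALLEL loop whose iterations are
 * unordered, so the writer must know that every iteration has completed,
 * here by counting finished iterations. */
static atomic_int par_iters_done;   /* incremented once per parallel iteration */

static inline void writer_wait_case_b(int total_iters) {
    while (atomic_load_explicit(&par_iters_done, memory_order_acquire) < total_iters)
        ;
}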

The simple examples described above become much more complex upon examination of Figure 4-4a. The first index of the array a corresponds to a sequential loop in S1 but to a parallel loop in S2. Conversely, the second array index is a parallel loop index in S1 but a sequential one in S2. In addition, the last index corresponds to an unknown variable t in S1 and a sequential loop in S2, and it is not clear whether the parallel loop index k influences the value of t. If t can be shown to be invariant with respect to certain loops, then provisions can perhaps be made to treat t as a constant for those loops. Although synchronizing one array element at a time can still be made to work in this case, it is not clear which of the above improvements can be incorporated.

In particular, note that since the second index of array a in S2 is a sequential loop index, moving the synchronization checking out of that loop implies that all 100 indices of j in S1 must still be checked. In Figure 4-4b, the complexity comes from the fact that loop indices i and k appear multiple times in the array indices while index j is not used at all. As evident in these examples, spatial and temporal relationships are not necessarily straightforward and must be defined more clearly.


The above examples illustrate the fact that a general and efficient implementation of point-to-point synchronization cannot be accomplished through an ad-hoc method. Rather, a formal treatment of synchronization relationships as well as a parallel-loop execution model must be introduced to provide the proper background for considering algorithms which compute synchronization targets.

Throughout this chapter, synchronization relationships are computed with respect to the sink of the dependence rather than the source. As shown in the above examples, each synchronization is implemented using an array of values with one element for each processor. At the source, each processor asserts a synchronization by setting its own array element to a value indicating that it has reached the source statement. At the sink, the synchronization check requires computing which array elements to check and which values to check them against. These computations correspond to the two questions posed earlier in this chapter.
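The source/sink protocol just described can be summarized by the following C sketch, assuming one flag per processor; the struct sync_target, which records which array element to check and which value suffices, is exactly what the algorithms in the remainder of the chapter must compute, and its name is illustrative.

#include <stdatomic.h>

#define NPROCS 10
static atomic_int sync_flag[NPROCS + 1];   /* one element per processor */

/* Source side: a processor marks that it has reached the source statement. */
static inline void sync_post(int my_proc, int value) {
    atomic_store_explicit(&sync_flag[my_proc], value, memory_order_release);
}

/* Sink side: the two questions are answered by a list of (element, value)
 * pairs: which array elements to check, and what value each must reach. */
struct sync_target {
    int element;   /* which processor's flag to examine */
    int value;     /* smallest flag value that satisfies the dependence */
};

static inline void sync_wait(const struct sync_target *targets, int n) {
    for (int t = 0; t < n; t++)
        while (atomic_load_explicit(&sync_flag[targets[t].element],
                                    memory_order_acquire) < targets[t].value)
            ;
}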