

f(x) {
    y = x+1;
    doall (i=1,100) a[i,y] = ...;
}

(a)

f(5);           /* S1 */
do (j=10,100)
    f(j);       /* S2 */
f(z);           /* S3 */

(b)

Figure 3-16

from f are supported immediately after the call. This allows the compiler to generate only one version of the function f instead of potentially creating a different copy of the procedure for each call. Of course, any dependences occurring within f would be supported inside its body. With this assumption, correct reaching information can be derived by separately computing the gen sets for each call to f. For instance, the call to f in statement S1 of Figure 3-16b produces the gen set containing the subarray a[i,6], while the call in statement S2 produces the subarray a[i,j+1]. In both cases, the results are derived by resolving the gen set produced by the body of f, which contains a[i,x+1], with the arguments passed to f. For the call in statement S3, if nothing is known about the value of z, then the subarray returned is a[i,⊥], whose second index is unknown.
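This resolution step can be pictured with a small sketch. The representation below is my own illustration, not the thesis's data structure, and it handles only constant actual arguments; a call such as f(j) would substitute the linear function j in the same way.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical representation of one gen-set subscript that is a linear
 * function of the callee's formal parameter: index = coeff*formal + offset.
 * known == false stands for an index value the lattice cannot represent.   */
typedef struct {
    bool known;
    int  coeff;    /* coefficient of the formal parameter */
    int  offset;   /* constant term                       */
} LinIndex;

/* Instantiate the callee's gen-set index at a call site whose actual
 * argument is a known constant; an unknown actual yields the unknown
 * lattice element.                                                         */
static LinIndex instantiate(LinIndex formal, bool actual_known, int actual)
{
    LinIndex r = { false, 0, 0 };
    if (formal.known && actual_known) {
        r.known  = true;
        r.coeff  = 0;                                   /* now a constant */
        r.offset = formal.coeff * actual + formal.offset;
    }
    return r;
}

int main(void)
{
    LinIndex body = { true, 1, 1 };             /* a[i, x+1] inside f */
    LinIndex s1 = instantiate(body, true, 5);   /* f(5): a[i,6]       */
    LinIndex s3 = instantiate(body, false, 0);  /* f(z), z unknown    */
    printf("S1: a[i,%d]   S3: %s\n", s1.offset,
           s3.known ? "known" : "unknown");
    return 0;
}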

Note that much potentially derivable information is ignored by the above scheme.

In particular, the lack of specialization of procedure calls requires one to be overly pessimistic when generating code for the procedure. For example, if the procedure makes use of two integer parameters, analysis within the procedure must assume that their values are unknown and that any dependence tests involving them return true.

One can easily imagine scenarios where some calls to such a procedure are made with arguments that cause the dependences to not exist. For those cases, it can be beneficial to make two versions of the procedure, one that supports the dependence and one that does not. At the call site, analysis can be done to determine which procedure should be invoked.
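As a concrete sketch of this idea (my own construction, not code from the thesis, with OpenMP loops standing in for DOALL loops), consider a hypothetical procedure that adds a shifted copy of an array into itself. Whether the loop carries a dependence depends on the argument k, so two versions are generated and the call site selects between them; a runtime guard stands in here for the compile-time decision made when the argument value is known.

/* The loop writes a[0..n-1] and reads a[k..k+n-1] (k assumed nonnegative),
 * so a cross-iteration dependence exists only when the two regions overlap,
 * i.e. when 0 < k < n.                                                     */

/* Version that preserves the possible dependence: plain sequential loop.   */
static void add_shifted_serial(double *a, int n, int k)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + a[i + k];
}

/* Specialized version for call sites where analysis proves independence
 * (k == 0 or k >= n).                                                      */
static void add_shifted_parallel(double *a, int n, int k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = a[i] + a[i + k];
}

/* Call-site dispatch standing in for the compile-time choice.              */
void add_shifted(double *a, int n, int k)
{
    if (k == 0 || k >= n)
        add_shifted_parallel(a, n, k);
    else
        add_shifted_serial(a, n, k);
}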

3.6 Other applications of array flow analysis

Array flow analysis provides information on data usage relationships between statements. In addition to using this information for deriving point-to-point synchronization, other optimizations for parallel programs can benefit from the results of flow analysis. Such optimizations include parallelism detection, private variable detection, data and loop partitioning, and static data routing. Other optimizations that can benefit from array flow analysis are shown in [DGS93].

3.6.1 Parallelism detection

In compiling programs for multiprocessors, a very useful optimization involves the detection of parallelism in sequential DO loops [AK87][Wol89]. When a statement in a DO loop body does not depend on other statements in the body, then it can be vectorized by being moved out of the loop and placed in a DOALL loop. Array flow analysis provides more accurate information for determining whether a loop can be vectorized.

do (i=1,100) {
    a[i-1] = f1(b[i]);    /* S1 */
    c[i]   = f2(a[i-1]);  /* S2 */
    a[i]   = f3(c[i]);    /* S3 */
    d[i]   = f4(a[i-2]);  /* S4 */
}

Figure 3-17

Without array flow analysis, dependence testing is done on all definitions and uses in the loop, thus possibly producing some false dependences. Consider dependence testing on all definitions and uses of the loop in Figure 3-17. We must conclude that the loop cannot be vectorized, since there seems to be a cyclic dependence involving statements S2 and S3, generating the equation

c[i] = f2(f3(c[i-1]))

However, using array flow analysis, we can deduce that the definition of a in S1 actually kills the definition in S3. Thus there is no cyclic dependence, and each statement can be vectorized.
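One possible realization of this result is sketched below (my own construction, expressed with OpenMP loops rather than the thesis's DOALL notation; S4 is omitted for brevity). Since the kill shows that S2 only ever reads S1's values, the three statements of the apparent cycle can each run as a separate parallel loop once S3's writes are directed to a renamed copy, which keeps the remaining memory-based dependences on a from re-introducing a cycle.

#define N 100

/* f1, f2, f3 stand in for the opaque element-wise functions of Figure 3-17. */
static double f1(double x) { return x + 1.0; }
static double f2(double x) { return 2.0 * x; }
static double f3(double x) { return x - 1.0; }

/* a receives S1's values; a_s3 is the renamed target for S3's values.       */
void distributed(double a[N + 1], double a_s3[N + 1],
                 const double b[N + 1], double c[N + 1])
{
    #pragma omp parallel for              /* S1: no carried dependence     */
    for (int i = 1; i <= N; i++)
        a[i - 1] = f1(b[i]);

    #pragma omp parallel for              /* S2: reads only S1's values    */
    for (int i = 1; i <= N; i++)
        c[i] = f2(a[i - 1]);

    #pragma omp parallel for              /* S3: writes the renamed copy   */
    for (int i = 1; i <= N; i++)
        a_s3[i] = f3(c[i]);
}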

3.6.2 Private variable detection

When detecting parallelism from sequential DO loops, certain transformations can be performed on a loop to make it more easily parallelized. One of these transformations is the introduction of private variables to remove output and anti-dependences across loop iterations. A variable is private in some loop if every iteration of the loop can be viewed as possessing a private copy of that variable. Consider the programs for exchanging two arrays in Figure 3-18. In both cases, if the variable temp is privatized, then all iterations can be executed in parallel. The topic of privatizing arrays has only recently been discussed in the literature [EHLP91][MAL93].

do (i=1,100)                  do (i=1,100)
  do (j=1,100) {                do (j=1,100) {
    temp = a[i,j];                temp[j] = a[i,j];
    a[i,j] = b[i,j];              a[i,j] = b[i,j];
    b[i,j] = temp;                b[i,j] = temp[j];
  }                             }

(a)                           (b)

Figure 3-18

A variable v is a candidate for privatization within some loop l when certain conditions can be satisfied. First, any flow dependences involving v must occur only within single iterations of l. If one were to allow for copying, then flow dependences can also occur to statements outside of the loop, but never across iterations of the loop. Second, v must appear in some output and anti-dependences across loop iterations; otherwise there is no need for it to be privatized. While scalar flow analysis can verify these conditions for scalars, array flow analysis allows verification for arrays as well. In Figure 3-18, the variable temp can be privatized with respect to both loops in case (a) and can be privatized with respect to loop i in case (b).
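A minimal sketch of case (a) after privatization (my own construction, using OpenMP rather than the thesis's notation): declaring temp inside the loop body gives every iteration its own copy, so the output and anti-dependences on temp disappear and both loops can run in parallel.

#define N 100

void exchange(double a[N][N], double b[N][N])
{
    #pragma omp parallel for collapse(2)   /* both loops run as DOALLs     */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double temp = a[i][j];         /* private copy per iteration   */
            a[i][j] = b[i][j];
            b[i][j] = temp;
        }
    }
}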

3.6.3 Data and loop partitioning

Even when all potential parallelism is detected in a program, its performance can still be heavily affected by communication costs. In many cases, effective static allocation of tasks and data to processors can reduce these costs significantly. Data partitioning involves splitting and aligning data to minimize communication distance between processors and the data they access [KLS90][LC91][GB92][RS91]. In loop partitioning, nested loops can be mapped to processors to minimize non-local memory accesses [AH91]. In these techniques, constraints between arrays are formed from flow dependences and occurrences in common statements. A partitioning algorithm then applies heuristics to resolve cyclic constraints and produce a partitioning scheme. Using array flow analysis, more accurate flow dependences can be computed to produce improved partitioning results.
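As a minimal sketch of the flavor of such a scheme (entirely my own construction, not an algorithm from the cited work), the fragment below aligns a block loop partition with a block data partition so that each processor writes only the elements it owns and reads a non-local element only at a block boundary.

#define N 1024
#define P 8       /* processors; the block size is simply ceil(N/P)         */

/* Iterations [lo, hi) assigned to processor p under a block partition.     */
static void block_range(int p, int *lo, int *hi)
{
    int chunk = (N + P - 1) / P;
    *lo = p * chunk;
    *hi = (*lo + chunk < N) ? *lo + chunk : N;
}

/* Processor p executes only the iterations that write its own block of a;
 * with b partitioned the same way, the read of b[i+1] is non-local only
 * for the last iteration of the block.                                     */
void partitioned_update(double a[N], const double b[N], int p)
{
    int lo, hi;
    block_range(p, &lo, &hi);
    for (int i = lo; i < hi && i < N - 1; i++)
        a[i] = a[i] + b[i + 1];
}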

3.6.4 Static routing of data

In most multiprocessors, interprocessor communication is accomplished by sending messages through a network. In the conventional scheme of dynamic routing, a message is routed by examining its header, which identifies the destination processor for the message. In situations where two messages need to access the same resource, one message must be either blocked or buffered. Architectures such as iWarp [Bor90] or NuMesh [War93] seek to alleviate contention costs by introducing the idea of static routing. When the destinations of messages are known at compilation, then routing can be scheduled statically to avoid unnecessary contention [SA91]. Furthermore, hardware which supports static routing can avoid the latency associated with examining headers as in dynamic routing.

In loop-based parallel programs, communication between processors arises primarily from flow dependences between different processors. When these flow dependences involve arrays whose indices are constants or linear functions of loop indices, then static routing can be applied. In the program of Figure 3-17, let us assume a machine topology of 100 processors in a line where each processor is responsible for one loop iteration. Since statement S4 requires a read of a[i-2] and statement S3 writes a[i], each processor must send its result from S3 two processors to the right. Since the communication destination for each processor is known at compile time, static routing can be applied.
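The resulting communication pattern can be pictured with the toy sequential simulation below (my own construction, not from the thesis). Each step corresponds to one statically scheduled nearest-neighbour transfer on the linear network; no message carries a header, because every processor knows at compile time what arrives on each step.

#define P 100    /* processors on a line, one loop iteration each           */

/* produced[p] is the value processor p computes in S3; after two statically
 * scheduled hops to the right, delivered[p] holds the value produced by
 * processor p-2, which is what its statement S4 consumes.  Processors 0
 * and 1 receive nothing, since they have no neighbour two to the left.     */
void shift_right_two(const double produced[P], double delivered[P])
{
    double hop[P];                        /* channel state after step 1     */

    /* step 1: every processor forwards its value to its right neighbour    */
    for (int p = 0; p < P - 1; p++)
        hop[p + 1] = produced[p];

    /* step 2: forward once more                                            */
    for (int p = 1; p < P - 1; p++)
        delivered[p + 1] = hop[p];
}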

Compilation for systolic arrays is a particular approach towards static routing and has been heavily studied [Kun82][Che86][Cap87]. These works focus on the optimal execution of a set of nested DO loops without dynamic control flow such as conditionals. Since static routing allows very high network bandwidths to be available, one solution allows conditionals to be supported by performing all communication that can exist on any path through the program. Some of the ideas used in computing processor dependences in the next chapter can be used to support a scheme for static routing in general programs.

3.7 Summary

In order to provide intelligent support for synchronization, one must first be able to detect dependences between statements in a program. Although one can define dependences by searching the program text for any accesses that can overlap, more effective results can be obtained by performing flow analysis to detect the reaching span of each data access. In order to manage references to array elements effectively, we focus on array indices that are linear functions of loop indices. The task of deducing this information can be performed by an adaptation of known value propagation algorithms to the linear function lattice.

Because some data accesses can completely mask others, array flow analysis can be used to determine the region over which each array access is active. Rather than operating on array regions, the flow analysis done here preserves the index function and traversal path of the flow element to allow for accurate dependence testing. Dependences can then be computed between reaching accesses and current accesses for each statement. Since dependence testing has been thoroughly studied in the literature, this thesis proposes using only the simple GCD test to detect dependences. The result of this analysis yields dependence information between statements as well as the array accesses that generate those dependences.
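For concreteness, a sketch of the GCD test in its standard single-subscript form (my own code, not from the thesis): for a write to a[c1*i + d1] and a read of a[c2*j + d2] in the same loop, integer solutions of c1*i + d1 = c2*j + d2 can exist only if gcd(c1, c2) divides d2 - d1; when it does not, the two accesses are independent.

#include <stdbool.h>
#include <stdlib.h>

static int gcd(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Returns false only when the test proves independence; true means a
 * dependence may exist, since the GCD test is conservative.                */
bool gcd_test_may_depend(int c1, int d1, int c2, int d2)
{
    int g = gcd(c1, c2);
    if (g == 0)                /* both subscripts are constants             */
        return d1 == d2;
    return (d2 - d1) % g == 0;
}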