
8.4 Application Benchmarks

8.4.2 Gaussian Elimination

The final application tested here is Gaussian elimination, a technique for solving equations of the form Ax = B with matrix manipulations. The algorithm is discussed in more detail in Section 2.9, and the COP code is given in Figure 2-7, p. 39. A 210x210 array is used; the total time spent running this algorithm is thus much higher (on the order of millions of cycles) than for the matrix multiplication example.
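For reference, the sequential form of the algorithm that the mesh implementation parallelizes can be sketched as follows. This is a generic Gaussian elimination with partial pivoting, not the COP decomposition from Figure 2-7; all names are illustrative.

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a list of row lists; b is a list. Inputs are copied, not modified.
    """
    n = len(A)
    # Build the augmented matrix [A | b].
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring up the row with the largest pivot magnitude.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        # Eliminate column k from all rows below the pivot row.
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back-substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

In the parallel version, the per-column reduction/broadcast steps of this loop become the communication operators the compiler groups into phases.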

For the scheduled-routing backend, the COP compiler generates a single phase for the initial reduction/broadcast step (since both are low-traffic operators), then one phase per operator after that. Timings are shown in Figure 8-10.²

[Figure: grouped bars of CPU, write, read, and scheduled-routing (SR) overhead time, in msec (0-8), for static (S) and dynamic (D) routing at mesh sizes 2x2, 3x3, 5x5, 6x6, and 7x7.]

Figure 8-10: Gaussian elimination comparison, wall-clock time

These results are largely similar to the results for matrix multiply. Again, the I/O takes slightly more cycles, mostly accrued by the reads from the interface. The scheduled-routing overhead is still low, though it is just perceptible in the figure; a complete breakdown of the scheduled-routing cycle counts is given in Table 8.4.

Size   CPU       Write    Read     PSync   Router   Reprog
2x2    7177449   185754   332912    7958      144        0
3x3    3205323   123529   297307    8686      153        0
5x5    1164685   120804   335352   16145      149        0
6x6     812548    87413   285552    9386      148        0

Table 8.4: Time breakdown for Gaussian elimination with scheduled routing

A jump in read time and phase-synchronization time can be seen at the 5x5 mesh size; at this point, the mesh becomes larger than can be synchronized with message-bound operators for the phase with the column-zero broadcast (b5), and a barrier is introduced to coordinate the end of the phase. The phase-synchronization overhead is still only <1% of the total run-time, however. No reprogramming time is accrued, since the compiler determined that dense broadcasts would be most appropriate for this context.
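The <1% claim can be checked directly against the cycle counts in Table 8.4 (values copied from the table):

```python
# Cycle counts from Table 8.4: (CPU, Write, Read, PSync, Router, Reprog).
rows = {
    "2x2": (7177449, 185754, 332912, 7958, 144, 0),
    "3x3": (3205323, 123529, 297307, 8686, 153, 0),
    "5x5": (1164685, 120804, 335352, 16145, 149, 0),
    "6x6": (812548, 87413, 285552, 9386, 148, 0),
}
for size, counts in rows.items():
    psync = counts[3]
    total = sum(counts)
    frac = psync / total
    print(f"{size}: phase-sync is {100 * frac:.2f}% of run time")
    assert frac < 0.01  # under 1% in every configuration
```

The 5x5 mesh, where the barrier is introduced, comes closest to the bound (about 0.99%); the other sizes stay well below it.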

²Results for 7x7 and 10x10 scheduled-routing meshes are not currently available; Tadpole crashes.

Chapter 9 Conclusions

The goal of this thesis was to demonstrate that a reprogrammable scheduled router could be used for arbitrary applications, and to show significant speedups over traditional routing architectures. A wide range of techniques for performing communications compilation for a reprogrammable scheduled router were shown, primarily concerned with managing multiple communication phases and efficiently implementing data-dependent communications. As part of this effort, a communication language, COP, was designed and implemented to keep computation compilation separate from communication compilation while still providing much of the necessary compile-time information on communication patterns to the communications compiler. The remainder of this chapter presents the conclusions of the thesis in more detail; the chapter ends with a brief look at possible future work.

9.1 Managing Multiple Phases

A primary focus of the thesis is the management of multiple communication phases in a scheduled router.

Continuing Operators

One important caveat on changing phases is that there may be some ‘continuing operators’

that do not change. In the normal case where the next phase includes all the nodes that an implementation uses to route, a required path is simply passed to the stream router, then the same VFSMs are reused for that stream in the new phase. If the next phase is fragmented into smaller subphases (for example, because there are disjoint groups), the router is instead forced to use the exact same schedule for that operator, requiring it to work around those schedule slots when creating the schedule for the new phase.
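The second case, scheduling a new phase around the slots a continuing operator must keep, can be sketched as a first-fit assignment over the remaining slots. This is a simplified model (real schedules are per-wire as well as per-timeslot); all names are illustrative.

```python
def schedule_phase(new_ops, slot_count, continuing):
    """Assign schedule slots for a new phase, working around slots that
    continuing operators must keep unchanged.

    `continuing` maps operator name -> set of slot indices it already owns
    from the previous phase. Returns operator -> slot for the new operators,
    assigned first-fit from the free slots.
    """
    reserved = set()
    for slots in continuing.values():
        reserved |= slots
    free = (s for s in range(slot_count) if s not in reserved)
    assignment = {}
    for op in new_ops:
        try:
            assignment[op] = next(free)
        except StopIteration:
            raise RuntimeError(f"no free slot for {op}")
    return assignment
```

A continuing broadcast that owns slots 0 and 2 thus forces the new phase's operators into the remaining slots, exactly the "work around those schedule slots" behavior described above.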

Changing Phases

Changing phases is not as simple as just issuing a command to the router. The compiler must determine when each node has finished moving all the data belonging to the terminating operators from the phase. This thesis presents a procedure for choosing methods to close a phase that make that guarantee:

• using the synchronizing properties of long messages to guarantee that other nodes in the mesh are close to completing the phase;

• hijacking an existing set of streams in order to create a long-message synchronization across nodes before changing phases;

• using ‘processor-bound’ implementations to allow local phase changes (but restricting wire use in the next phase); or

• adding an explicit barrier operator to terminate the phase.

Determining which technique to apply requires finding out which operators in a phase are terminating, as well as which operators might run last—a non-trivial task, given the presence of optional operators and the possibility of multiple portions of a COP input hierarchy being present in a single phase. With that information, the message lengths of the operators determine whether long-message synchronizing is viable.
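The selection procedure can be sketched as a cascade over the four options, in the order listed above. The threshold and field names here are illustrative assumptions, not the thesis's actual criteria:

```python
# Assumed threshold for a message 'long enough' to act as a synchronizer.
LONG_MESSAGE_WORDS = 32

def choose_close_method(terminating_ops, hijackable_streams, local_change_ok):
    """Pick a phase-closing method, cheapest viable option first.

    `terminating_ops`: dicts with a 'msg_len' field for each terminating
    operator that might run last. Names and threshold are hypothetical.
    """
    # 1. Long messages already synchronize the mesh as a side effect.
    if terminating_ops and all(
        op["msg_len"] >= LONG_MESSAGE_WORDS for op in terminating_ops
    ):
        return "long-message-sync"
    # 2. Otherwise, borrow existing streams to build a synchronizing barrier.
    if hijackable_streams:
        return "hijack-streams"
    # 3. Processor-bound implementations permit purely local phase changes,
    #    at the cost of restricted wire use in the next phase.
    if local_change_ok:
        return "processor-bound"
    # 4. Fall back to an explicit barrier operator.
    return "explicit-barrier"
```

The point of the ordering is that each later option adds cost: hijacking consumes streams, processor-bound constrains the next phase's wires, and an explicit barrier adds a whole operator.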

If not, a minimum spanning tree algorithm can be used to find a set of streams which can be hijacked for use as a synchronizing barrier; multicast streams which reach multiple nodes are preferred for their lower latency and data-buffering capacity, but any set of streams that can reach all the nodes in each partition of the mesh is acceptable. If necessary for a given operator implementation, a provided function can be called to undo any implementation-specific modifications to the stream, making it available for synchronization purposes.
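The stream-selection step can be approximated by a greedy cover that prefers the streams reaching the most as-yet-uncovered nodes, which naturally favors multicast streams as the text suggests. This is a simplified stand-in for the spanning-tree computation; names are illustrative.

```python
def pick_sync_streams(nodes, streams):
    """Greedily pick streams covering every node for use as a barrier.

    `streams` maps stream name -> set of nodes it reaches. Returns the
    chosen stream names, or None if no combination reaches every node.
    """
    uncovered = set(nodes)
    chosen = []
    while uncovered:
        # Prefer the stream reaching the most still-uncovered nodes,
        # which biases the choice toward multicast streams.
        best = max(streams, key=lambda s: len(streams[s] & uncovered))
        if not streams[best] & uncovered:
            return None  # some node is unreachable by any stream
        chosen.append(best)
        uncovered -= streams[best]
    return chosen
```

In the full scheme, each chosen stream may first need its implementation-specific modifications undone (via the provided function mentioned above) before it can carry the synchronization traffic.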

Inter-Phase Wire Constraints

An important consideration is avoiding the re-use of inappropriate ⟨wire, timeslot⟩ pairs. If one phase is writing in a particular direction at a particular time, the compiler must be careful to ensure that, after a phase change, data cannot be transferred between phases to unrelated operators.

The thesis sets out the restrictions that are necessary to avoid this kind of problem. In general, a given ⟨wire, timeslot⟩ can only be safely reused if the node used to be reading from it, and the operator that was using it has terminated. However, when the previous phase ended with some form of global barrier, any ⟨wire, timeslot⟩ that was used for writing can be reused, since by the time the new phase starts the neighbor will have stopped reading on that ⟨wire, timeslot⟩. The effect of aliased valid/accept hardware lines is also considered for the case where two nodes might try to read simultaneously during a phase change.
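The reuse rule just stated reduces to a small predicate per ⟨wire, timeslot⟩ pair (a simplified model; the aliased valid/accept cases add further conditions not shown here):

```python
def slot_reusable(was_reading, op_terminated, prev_phase_had_barrier):
    """Conservative check for reusing a (wire, timeslot) pair in a new phase.

    was_reading: this node was the reader on the pair in the old phase.
    op_terminated: the operator that used the pair has terminated.
    prev_phase_had_barrier: the old phase ended with a global barrier.
    """
    if prev_phase_had_barrier:
        # After a global barrier the neighbor has stopped reading, so even
        # pairs that were used for writing may be reused.
        return True
    # Without a barrier: safe only if this node was the reader and the
    # owning operator is known to have finished.
    return was_reading and op_terminated
```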

The thesis also explores the notion of wire-use dependency when there is no global barrier.

A dependency graph can be set up for each operator/node pair, with two nodes each, representing ‘I/O’ and ‘data arrival’; dependency edges are then created between those nodes, and the resulting graph is used to reason about neighbor phase transitions based on the operators that are currently running. Once an operator has been shown to run only if the neighbor has changed phase, that operator can freely reuse edges as if the previous phase had ended with a global barrier.
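The key query on that graph is reachability: if an operator's vertices are reachable from the neighbor's phase-change event, the operator can only run after the change. A minimal sketch, assuming vertices are simple labels and an edge u -> v means v cannot happen until u has:

```python
from collections import defaultdict

def reachable(edges, src, dst):
    """Depth-first reachability over a dependency graph given as a list of
    (u, v) edges, where u -> v means v cannot happen until u has occurred.
    If dst is reachable from src, then dst depends (transitively) on src.
    """
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(graph[u])
    return False
```

For example, if the edges show that operator a's data arrival, and hence its I/O, depend on the neighbor's phase change, then a may reuse wires under the global-barrier rules.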

Managing a Scheduled Router as a Cache

Since the router has a finite amount of memory for schedules and VFSM annotations, the compiler must attempt to partition and use that space in a way that optimizes the total run time.

The router can be treated as a cache for routing data that is held in the processor’s memory. At run time, before starting a new phase, the COP code checks whether the schedule and VFSM annotations for that phase are currently in the router; if not, it downloads the data from the processor.

Determining the allocation of schedules and VFSMs to slots in the schedule memory and VFSM annotation memory is similar to the way that the ‘optimal’ cache strategy for conventional direct-mapped caches is managed, but with several complicating factors. First of all, the application does not provide complete knowledge of the future; there are both optional operators to consider, and operators with run-time arguments that induce different phases to be downloaded. Second, VFSMs are cached independently, but must all be present at the time a phase begins; thus, they must be managed as a unit. Ideally, when one VFSM is evicted to make room for another, the compiler also attempts to evict other VFSMs from the same evicted phase to hold the remaining VFSMs from the current phase, to minimize VFSM reloads.
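The run-time side of this caching scheme (the check performed before each phase start) can be modeled as a simple presence test over the phase's schedule and VFSM entries. A toy sketch, with the router's memory modeled as a dict and all names hypothetical:

```python
def ensure_phase_loaded(router, processor_mem, phase):
    """Before starting `phase`, download any schedule or VFSM annotation
    not already resident in the router.

    `router`: dict modeling the router's schedule/VFSM memory.
    `processor_mem`: phase -> {entry key -> routing data} held by the CPU.
    Returns the number of entries downloaded (0 means a full cache hit).
    """
    downloads = 0
    for key, data in processor_mem[phase].items():  # schedule + each VFSM
        if router.get(key) != data:
            router[key] = data  # may evict whatever occupied this key
            downloads += 1
    return downloads
```

The compile-time allocation problem is then to choose the keys (physical slots) so that, across the predicted phase sequence, calls like this download as little as possible.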

Finding Load Operators

Once all routing and caching decisions have been completed, the question remains of where to place the router load information in the application. The per-operator load function must be placed appropriately in the application code, and this function gives the COP compiler a hook on which to hang reprogramming information.

The thesis examines the question of which operator to use when a group of parallel and sequential operators are all included in a phase. Doing this requires taking into account that some nodes may be involved in the routing of some operators and not others, and thus that the load function for a given phase may vary from node to node. It also considers the case where some load function must be responsible for loading multiple phases of data: for example, a red/black-style broadcast on alternate phases would require each node to load two phases’ worth of routing data, running the appropriate barrier between the two so as to know when to switch.
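The red/black case can be sketched as a load function that downloads both phases' data once, then uses the barrier on every switch so all nodes flip together. The `load_phase` and `barrier` callbacks are assumed to be supplied by the runtime; the whole shape is illustrative, not the thesis's actual interface.

```python
def make_alternating_loader(load_phase, barrier):
    """Build a load function for a red/black-style phase alternation.

    On first use it loads both phases' routing data; on every call it runs
    the barrier before switching, so each node knows when it is safe to flip.
    """
    state = {"loaded": False, "current": "red"}

    def loader():
        if not state["loaded"]:
            load_phase("red")
            load_phase("black")
            state["loaded"] = True
        barrier()  # wait until all nodes are ready to switch
        state["current"] = "black" if state["current"] == "red" else "red"
        return state["current"]

    return loader
```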

Results

Two large applications were examined, both requiring multiple phases of communication for optimal run time. The two applications were matrix multiplication and Gaussian elimination.

In both applications the compiler was able to find an optimal grouping of operators into phases, choose barrier methods, and correctly allow for wire reuse in following phases. Caching techniques were applied to find mappings of virtual schedules and VFSMs to the physical hardware.

Running the applications showed roughly similar cycle counts, and factoring in the predicted speedup of scheduled routing showed significantly faster overall communications for almost all of the simulation runs.