
For some communication patterns, the most efficient implementation involves simply providing online routing. In particular, any communication pattern where the communication sources and destinations change rapidly relative to the message size, and where there is no constraint on the values of the sources and destinations, is likely to be most efficiently implemented by online routing.

Furthermore, providing support for online routing in a scheduled router is a guarantee of generality: by providing the same functionality as a dynamic router, the compiler ensures that any application that can be run on the dynamic router can be run on the scheduled router. For applications that require a great deal of online routing, scheduled routing will not perform as well as dynamic routing; however, it seems likely that a large class of applications has sufficient exploitable regularity in its message traffic that the fraction of online routing necessary will be low enough not to compromise the overall run time significantly.

4.2.1 Implementation

Online routing cannot be provided without intervention from the processor (or, if available, equivalent hardware in the router). This section assumes that the processor manages the routing using some form of interrupts to read and write data to the interface. If interrupts are not available, provisions for polling would need to be made in the generated code. Most of the discussion is also applicable to the case where the scheduled router has dynamic-routing hardware available.

The current implementation of COP relies on an interrupt-driven system that provides buffering in processor RAM for messages. A future implementation of NuMesh (or a similar system, such as RAW [3, 67]) could include traditional online-routing functionality in hardware.

Online routing is handled by providing one-hop streams between all adjacent nodes involved. One VFSM reads an interface address dedicated to, e.g., sending data to +x, and routes the data out the +x port. The neighboring processor similarly schedules a VFSM to read data from its −x port and deliver it to an interface address dedicated to reading online messages from its −x neighbor.

The online-routing code manages a number of queues on each node. Interrupts are initially enabled for data arriving on each active input stream. As data arrives, the code parses a header word holding a node number, ‘destination queue’ (discussed below), and length. A message buffer is allocated and queued for the appropriate output interface address, and interrupt notification for empty status on that address is requested. The input handler fills the message buffer at the same time that the output handler empties it. Messages are originated by creating a new message buffer with the passed data and placing it directly in the appropriate output queue. As each output queue empties, the code disables interrupts for that output. Messages terminating at the node are placed in the specified destination queues; to read them, the application code calls a routine that waits for the message buffer to be completely filled in, then returns the data.
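As a concrete illustration, the following C sketch shows one possible shape for the input side of this scheme. It is not the actual COP runtime code: the header layout, the interface-access routines (read_interface_word, enable_empty_interrupt), and the queue helpers are all hypothetical names.

    /* Sketch of the input interrupt handler described above; all names
     * and the header layout are illustrative, not the COP runtime's.   */
    #define HDR_NODE(h)   (((h) >> 20) & 0xfff)  /* assumed field layout */
    #define HDR_QUEUE(h)  (((h) >> 12) & 0xff)
    #define HDR_LENGTH(h) ((h) & 0xfff)

    typedef struct msg_buffer {
        int dest_node;              /* final destination node            */
        int dest_queue;             /* destination queue on that node    */
        int length;                 /* total payload words               */
        int filled;                 /* payload words received so far     */
        int *data;                  /* payload, filled as words arrive   */
        struct msg_buffer *next;    /* queue link                        */
    } msg_buffer;

    extern msg_buffer *current_input[];  /* message in progress per input */
    extern msg_buffer *dest_queues[], *output_queues[];
    extern int  my_node;
    extern msg_buffer *alloc_msg_buffer(int node, int queue, int length);
    extern void enqueue(msg_buffer **q, msg_buffer *m);
    extern int  read_interface_word(int addr);
    extern void enable_empty_interrupt(int addr);
    extern int  ecube_out(int dest_node);    /* next-hop output address */

    void input_interrupt(int in_addr)
    {
        msg_buffer *m = current_input[in_addr];
        if (m == NULL) {
            /* First word of a new message: parse the header and queue the
             * buffer either for forwarding or for local delivery.         */
            int header = read_interface_word(in_addr);
            m = alloc_msg_buffer(HDR_NODE(header), HDR_QUEUE(header),
                                 HDR_LENGTH(header));
            if (m->dest_node == my_node) {
                enqueue(&dest_queues[m->dest_queue], m);
            } else {
                int out = ecube_out(m->dest_node);
                enqueue(&output_queues[out], m);
                enable_empty_interrupt(out);     /* start draining        */
            }
            current_input[in_addr] = m;
        } else {
            /* Payload word: fill the buffer while the output handler (or
             * the application) concurrently empties it.                   */
            m->data[m->filled++] = read_interface_word(in_addr);
            if (m->filled == m->length)
                current_input[in_addr] = NULL;   /* message complete      */
        }
    }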

Each stream of an online operator is assigned a destination queue to deliver its messages to on the destination node. If memory and header-word bits are no object, each stream can simply have a distinct number, so that each queue is used only by a single operator. Thus, an operator can issue a read and be guaranteed to find only messages intended for it waiting on the queue it reads.

Given, however, that memory and header bits are both resources that should be managed efficiently, the same techniques can be used here as are used in Section 7.2.2, assigning destination queue numbers in the same manner as dynamic-router destination interface addresses are assigned. This would then guarantee that an application can read from a given local queue and read only the messages for the given operator. This approach also makes the scheduled-routing online code asymptotically faster, since reading requires only constant time once the data has arrived, regardless of the number of messages already queued for other operators.
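Under the same hypothetical names as the sketch above, the application-side read routine might then look as follows; because each operator owns its queue, the read never scans past messages belonging to other operators, which is where the constant-time behavior comes from.

    /* Blocking read of the next message on one operator's queue.  A real
     * implementation would sleep or poll the interface rather than spin. */
    extern int  queue_empty(msg_buffer **q);
    extern msg_buffer *dequeue(msg_buffer **q);

    int *online_read(int queue, int *length)
    {
        msg_buffer *m;
        while (queue_empty(&dest_queues[queue]))
            ;                             /* wait for a header to arrive */
        m = dequeue(&dest_queues[queue]);
        while (m->filled < m->length)
            ;                             /* wait for the input handler  */
        *length = m->length;
        return m->data;
    }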

4.2.2 Resource Allocation

Naïve objections to implementing online routing on top of scheduled routing include that it would require all the operators in a phase to be routed online; or, if not, that it might consume wire bandwidth out of proportion to the actual communication needs; or, for a 6-neighbor network, that every node would need 12 interface addresses dedicated to online routing. All of these statements turn out to be false.

While it is possible to dedicate all the timeslots in a phase to online routing, doing so is not necessary. If an operator that uses online routing shares a phase with an operator that does not, communications for the two operators can be scheduled in the same proportional manner that is used to schedule timeslots for offline-routed operators. Furthermore, fewer online-routing timeslots can be scheduled between nodes that have a lower volume of online communication.

To ascertain how much online-routing bandwidth to allocate and how many interface addresses are required, the compiler iterates through the operators that are using online routing in the phase. For each operator, it examines the possible source/destination combinations that might exist, and allocates bandwidth along the e-cube routing path taken for each case. Thus, each operator yields a set of links that may be expected to support a certain data volume from that operator in this phase. For each link, all the contributions for all the operators in the phase are included. Two operators running sequentially during the phase would contribute the MAX of their bandwidths to the link, whereas two operators running in parallel would contribute the sum of the bandwidths to the link.
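A sketch of this accumulation pass follows, with illustrative Operator and Phase records and an assumed ecube_path() helper; the combining rule, sum for parallel operators and MAX for sequential ones, is the one just described.

    /* Walk every possible source/destination pair of each online operator,
     * record the volume each e-cube link may have to carry, then fold the
     * per-operator demands into the phase totals.  Illustrative only.     */
    #define MAX_LINKS 1024
    #define MAX_HOPS  64

    typedef struct {
        int    n_sources, n_dests;
        int    sources[16], dests[16];    /* possible endpoints             */
        double volume;                    /* expected data volume           */
        int    parallel;                  /* runs in parallel with peers?   */
    } Operator;

    typedef struct {
        int       n_online_ops;
        Operator *online_ops[32];
        double    link_bw[MAX_LINKS];     /* requested bandwidth per link   */
        double    granted_bw[MAX_LINKS];  /* filled in by the stream router */
    } Phase;

    extern int ecube_path(int src, int dst, int path[]);  /* returns hops */

    void allocate_online_bandwidth(Phase *ph)
    {
        for (int i = 0; i < ph->n_online_ops; i++) {
            Operator *op = ph->online_ops[i];
            double demand[MAX_LINKS] = { 0 };

            for (int s = 0; s < op->n_sources; s++)
                for (int d = 0; d < op->n_dests; d++) {
                    int path[MAX_HOPS];
                    int hops = ecube_path(op->sources[s], op->dests[d], path);
                    for (int h = 0; h < hops; h++)
                        if (op->volume > demand[path[h]])
                            demand[path[h]] = op->volume;
                }

            /* Parallel operators add their demands; sequential operators
             * contribute only the MAX on each link.                        */
            for (int l = 0; l < MAX_LINKS; l++)
                if (op->parallel)
                    ph->link_bw[l] += demand[l];
                else if (demand[l] > ph->link_bw[l])
                    ph->link_bw[l] = demand[l];
        }
    }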

Consider the example of a 2D mesh with one stream (A) specified with unknown source and destination, and another stream (B) in parallel with A, whose destination is unknown but whose source is always processor zero.

(parallel
  (stream 'A (runtime) (runtime))
  (stream 'B 0 (runtime)))

In this case, a fixed bandwidth is allocated for every link corresponding to operator A; the bandwidth allocations on all the positive-going x links in the first row, and all positive-going y links, are then increased, corresponding to the e-cube delivery patterns of operator B.

The streams associated with online routing are associated with the phase itself, rather than any specific operator. When the stream information is passed to the stream router, the phase's online-routing streams are included along with any streams from operators using offline routing. On return from the stream router, the resulting bandwidth for each online-implemented operator can be computed by re-walking all the links, and, for each operator, setting its bandwidth to the MIN of all the links in all the paths its data might use.
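With the illustrative records from the earlier sketch, this post-routing pass reduces to taking the MIN of the granted per-link bandwidth over every link an operator's data might traverse:

    /* An online operator can move data no faster than its most constrained
     * link; re-walk all possible e-cube paths and take the minimum of the
     * bandwidths granted by the stream router.                            */
    #include <float.h>   /* DBL_MAX */

    double online_operator_bandwidth(const Operator *op, const Phase *ph)
    {
        double bw = DBL_MAX;
        for (int s = 0; s < op->n_sources; s++)
            for (int d = 0; d < op->n_dests; d++) {
                int path[MAX_HOPS];
                int hops = ecube_path(op->sources[s], op->dests[d], path);
                for (int h = 0; h < hops; h++)
                    if (ph->granted_bw[path[h]] < bw)
                        bw = ph->granted_bw[path[h]];
            }
        return bw;
    }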

Online routing would normally be considered to consume one interface address for read and one for write for each direction (up to twelve for a six-neighbor mesh). However, as for bandwidth, online-routing interface addresses can be assigned only where needed by the specific operator(s) implemented with online routing. For example, a stream operator with runtime source and destination on a one-dimensional subset of the network would only need a total of four interface addresses (and only two on the nodes at the ends of the subset). Assigning interface addresses on a per-node basis can thus noticeably reduce interface address pressure.

4.2.3 Future Extensions

A variety of extensions could be used to improve the performance of online routing over scheduled routing.

One extension not currently implemented in the COP compiler is to allow non-nearest-neighbor streams. One simple version of this would be express channels [19]. Some nodes would have an additional set of streams available to them beyond the basic nearest-neighbor sets; these streams would connect to other nodes distant in the mesh. This would decrease the latency experienced by non-local messages, though it would also decrease the bandwidth available on the other network streams, as well as increase the interface address pressure. The address pressure in this model can be decreased by skewing the set of nodes responsible for managing express channels in each direction, such that each node only has one in-express and one out-express (at most) in addition to its nearest-neighbor streams. Assuming that express channels are k links in length, using them would reduce the number of routing steps for a path of distance l from l to, on average, k/2 + l/k; however, the cost would be a bandwidth factor of two for co-scheduling the express links.
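To make the trade-off concrete: with express channels of length k = 4, a message crossing a distance of l = 16 would take roughly 4/2 + 16/4 = 6 routing steps instead of 16.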

A similar approach might be to abandon the mesh network for online routing and overlay a virtual hypercube-style network on the mesh, again trading off some bandwidth for latency.

Here, however, the number of streams rapidly exceeds the schedule RAM. For a 3D mesh of size N there are N virtual hypercube streams crossing the bisection, but only N^{2/3} wires. This gives an N^{1/3} slowdown, and more importantly bounds the size of mesh that can fit in a given schedule size. For example, with a maximum schedule length of 64, a full hypercube can only be done for meshes of about 4×4×4. However, a coarser-granularity hypercube with a nearest-neighbor step for final routing might still be of benefit.

Using fixed interface addresses for reading and writing maximizes bandwidth and minimizes latency for the online streams. However, particularly if using something like express channels, interface address pressure can be significant. A single address could be used to read all online messages, but as the messages could then be interleaved, every word would need to be tagged to dispatch it to the correct message, at some cost in network bandwidth as well as processing overhead. Similarly, a single address could be used to write all messages, reprogramming the router to connect the address to a different VFSM each time the interrupt handler switched to writing a new stream. Both solutions have a place when online routing is infrequently used and the interface addresses are a scarce resource for other co-existent operators; however, implementing and assessing this is left for future work.
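Purely for illustration, the single-read-address variant might take the following shape; the tag layout and all names here are assumptions, since this design was not implemented.

    /* Each word arriving on the single shared address carries a stream tag
     * in its high bits so the handler can dispatch it; the payload width
     * shrinks accordingly, which is the bandwidth cost noted above.       */
    #define DATA_BITS 24                       /* assumed tag layout       */

    extern int  read_interface_word(int addr);
    extern void append_word(msg_buffer *m, int data);
    extern msg_buffer *current_msg[];          /* message per stream tag   */
    extern const int SHARED_IN_ADDR;

    void shared_input_interrupt(void)
    {
        int word   = read_interface_word(SHARED_IN_ADDR);
        int stream = (unsigned) word >> DATA_BITS;     /* which message?   */
        int data   = word & ((1 << DATA_BITS) - 1);
        append_word(current_msg[stream], data);
    }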

A shared-memory model would be easy to add to processor-based routing. Messages destined to a particular virtual interface address could be interpreted as memory-read requests. For such messages, the interrupt handler would not queue the message for later reading, but instead read the specified memory location(s) and return the results to the given processor with a return message. More sophisticated memory semantics (such as Fetch-and-Op) could also be included at relatively low incremental cost in the handler. This implementation of shared memory could easily be used under a software shared memory implementation such as CRL [31].
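A minimal sketch of such a handler, assuming a hypothetical request layout (address, word count, reply node) and a send_message() helper:

    /* Service a memory-read request directly in the interrupt handler
     * instead of queueing it for the application.                         */
    #define REPLY_QUEUE 0    /* assumed well-known queue for replies       */

    extern void send_message(int node, int queue, const int *data, int n);
    extern void free_msg_buffer(msg_buffer *m);

    void handle_read_request(msg_buffer *m)
    {
        int *addr   = (int *) m->data[0];      /* requested location       */
        int  nwords = m->data[1];              /* how many words to read   */
        int  sender = m->data[2];              /* node to reply to         */

        send_message(sender, REPLY_QUEUE, addr, nwords);
        free_msg_buffer(m);
    }

A Fetch-and-Op variant would differ only in performing the operation on the addressed word before sending the reply.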

Online-routing hardware could also be included directly on the router. Online-routing timeslots would be scheduled just like any other data transfer. If a message were trying to go in the +x direction, the routing hardware would just wait until the VFSM responsible for moving data from the online-routing hardware to +x is scheduled, and then provide the appropriate word to the neighbor. Similarly, when the message has reached its final destination, once a VFSM to move data from the online-routing hardware to the processor interface were scheduled, the data would be written to the interface, with the online-routing engine providing an interface address (perhaps a cycle early) as well as the data word. Internal queues would be necessary to allow messages to bypass one another when accessing the interface addresses. Additionally, if desired, a few wires could be added between nodes to hold a virtual-lane indicator for routing between nodes.

Chapter 5

Managing Multiple Phases

This chapter looks at the question of managing multiple phases in a scheduled router. It starts with a discussion of the ‘continuing operators’ that traverse phases in Section 5.1. Section 5.2 discusses how to ensure all the data I/O is complete before changing phase, and Section 5.3 explores the restrictions placed on wire usage by consecutive phases. Section 5.4 discusses how to allocate router memory to the various phases, and Section 5.5 presents an algorithm for associating phase loads with operator load functions.

5.1 Continuing Operators

Before addressing the management of multiple phases in an application, this section will first discuss the one ‘exception’ to the phase-switching model: operators that continue from one phase to the next. Such operators require special handling. For example, as was discussed in Section 2.4.2, a user may wish to have a particular broadcast operator available throughout the application:

(parallel
  (broadcast 'stdin 0)
  (sequential OP OP OP ...))

If the compiler decides to break up the sequential clause into multiple phases, it becomes necessary to take special measures to ensure that the stdin operator is valid through all the phases. It is important to guarantee that any data that was in transit during a phase change remains valid, and that the operator is available for reading and writing in each phase.

5.1.1 Simple Continuation

The first issue to realize is that it is not sufficient to place a copy of the operator in each phase (e.g., to have a separate broadcast from node zero added to the set of operators in each phase from the example above). If this were done, there would be separate VFSMs associated with the operator in each phase, and the compiler (or application) would have to ensure that all the VFSMs from the previous phase had cleared before switching to the new phase; however, the semantics of a continuing operator is that the source may write data into it at any time, so it is difficult to guarantee that the operator's stream(s) from the previous phase are empty before switching to the new phase.

Instead, the compiler simply forces the stream to use the same VFSMs in the following phase(s). This way, the stream can be ignored during any phase changes; any buffered data in an intermediate VFSM will be re-launched once the schedule for the new phase is initiated, since the VFSM will be the same in both the new and old schedules.

In this simple case, continuing streams do not have to be scheduled in the same time slots. The buffering associated with a VFSM means that the compiler can safely stop scheduling a VFSM for some period of time, then reschedule it, without losing any data; the data will simply wait in the buffer. If no buffering were allocated to VFSMs (perhaps because the compiler was able to assert that the receiving processors could always keep up with the data flow), it would be impossible to switch phases like this, since data would be dropped as soon as the compiler stopped scheduling a VFSM belonging to a stream actively carrying data.

Furthermore, a continuing stream can be rescheduled to have different bandwidth if necessary. Consider the case where the continuing operator shares one phase with some operator with traffic 4T, and the next phase with an operator with traffic T. The continuing operator should end up with proportionately more bandwidth in the second phase, since the continuing operator is likely to be in the critical path unless its bandwidth is increased.
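For instance, if the continuing operator itself carries traffic T and timeslots are divided proportionally, it would receive roughly 1/5 of the shared bandwidth in the first phase but 1/2 in the second.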

Regardless of the scheduling, the exact sequence of VFSMs that carry data must be preserved. In particular, this requires that the stream follow the same route, use the same pipelines on each node, and even include delays at the same point(s) in the new routing. When routing the stream in the second and following phases, this is given as a requirement to the stream router. Doing so reduces the router's flexibility somewhat, but still allows it to choose arbitrary timeslots in the global schedule to place the stream in the new schedule.

5.1.2 Continuation to Multiple Phases

The situation is more complicated if the operator continues into a part of the application with multiple threads of control. In this case (as discussed in Section 2.4.3), there may be multiple simultaneous phases with disjoint spatial extents, and a somewhat more restrictive solution is required. For example, consider the case where the stdin operator starts in a phase with some setup operators, then continues to a phase where portions of the mesh are independently running through a several-phase loop:

(parallel
  (broadcast 'stdin 0)
  (sequential
    (parallel OP OP ...)       ; setup phase
    (parallel
      (loop OP OP OP ...)      ; left side of mesh
      (loop OP OP OP ...))))   ; right side of mesh

Since the stdin operator is global, but portions of its path may be in different phases at the same time, it is necessary to reuse not just the VFSMs but their exact scheduling as well. The precise set of schedule timeslots for the stream forms an ‘interface’ that all the phases must adhere to so that the continuing operator can run seamlessly despite any phase changes. Thus, the stream router is given not just a list of VFSMs to reuse, but a list of VFSMs and schedule slots to reuse.
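One way to picture the information handed to the stream router for a continuing stream is the following illustrative record (not the compiler's actual data structure); simple continuation fixes only the route fields, while the disjoint-phase case fixes the schedule slots as well.

    /* Continuation constraints for one stream, as a sketch. */
    typedef struct continuation {
        int n_hops;               /* hops along the preserved route        */
        int vfsm[MAX_HOPS];       /* exact VFSMs to reuse, in order        */
        int pipeline[MAX_HOPS];   /* same pipelines and delays per node    */
        int fixed_slots;          /* nonzero in the disjoint-phase case    */
        int slot[MAX_HOPS];       /* timeslot per hop, honored only when   */
                                  /* fixed_slots is set                    */
    } continuation;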

While this is necessary for correct operation in the disjoint-phase case, it is not used by default. It requires the operator to have a fixed bandwidth from phase to phase, and it also makes the remaining operators harder to schedule, since they must work around fixed ‘obstacles’ in the schedule where the continuing operator(s) have been nailed down.

5.1.3 Continuing Online-Routing Operators

Continuing online-routed operators are handled somewhat like offline-routed ones. A continuing online-routed operator requires some subset of the online streams to be present in each phase in which it appears. Those streams are routed normally in the first phase containing the operator. In the following phases, if there are continuing online-routed operators present, the possible links for each online-routed operator are traversed, and the routing information from the first phase where that operator occurred is copied in. As a result, the online-routed streams work just the same as regular offline streams: they will usually just be rerouted using the same VFSMs and different schedules, but may be fixed to their existing schedule when they cross phase boundaries (as discussed above). Where new online streams are present in a following phase (or where the following phase does not have any continuing streams), new VFSMs and new schedule slots are allocated as would be done for any offline-routed operator.
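A sketch of that copying step, again with illustrative names, assuming the operator and phase records track which links an operator may use and the per-link routing chosen:

    /* Propagate a continuing online operator's routing into a later phase
     * by copying the per-link records from the first phase in which the
     * operator appeared.                                                  */
    void continue_online_routing(const OnlineOp *op,
                                 const PhaseRouting *first,
                                 PhaseRouting *later)
    {
        for (int l = 0; l < MAX_LINKS; l++)
            if (op->uses_link[l])              /* link may carry op's data */
                later->link_route[l] = first->link_route[l];
    }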