Earlier NuMesh Prototypes - An Optimized Hardware Architecture and Communication Protocol for S

The NuMesh prototype FSM was a first-cut attempt at defining an architecture that could adequately support scheduled communications [Hono91]. The prototype implementation supports four neighbors in a two-dimensional mesh configuration. The boards were created from off-the-shelf TTL parts consisting of PALs, FIFOs, transceivers, small memories and some analog components to support system clocking. Each board was also connected to a TI DSP chip through two unidirectional FIFOs to allow words to be transferred back and forth between the FSM and the DSP. At compile time, each FSM is loaded with state information describing all of the communication that could go through the node. On each clock cycle, a single communication word can be transferred based on run-time checks that determine if input data is valid and if the output port is free.

(defstar all-pole
    (input in)
    (output out(max_delay 2))
    (param a1 0.0)
    (param a2 0.9)
    (function all-pole)
def_function all-pole()
    output(difference in (product a1 out) (product a2 out@1)) out))

Figure 3.15 A Filter Implementation in Gabriel

A picture of the prototype datapath is included in Figure 3.16. Each node has two transceivers that correspond to the node’s north and east ports. The south and west transceivers are located on the neighboring nodes. Two 1024-word FIFOs are used for communicating with the DSP processor.

A simple FSM model is used to control the datapath. Ten bits of addressing combined with four bits of condition code serve to index a RAM that holds the scheduled communication. Instructions desiring a transfer between ports spin until both the data becomes available and the destination port is free. Words being transferred in the north/south direction or the east/west direction take only a single cycle, while words needing to turn a corner suffer an additional cycle of delay due to a transfer between the two busses shown in Figure 3.16. The four condition code inputs to the RAM indicate whether each of the four ports contains data in the transceiver.
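As a rough illustration of the indexing scheme described above, the following Python sketch packs a 10-bit state address and a 4-bit condition code into one control-RAM index and models the one-cycle versus two-cycle transfer latency. The bit ordering (address in the high bits), the assignment of ports to condition-code bits, and all function names are assumptions made for illustration; the text does not specify them.

```python
ADDR_BITS = 10   # state address width, from the text
COND_BITS = 4    # one "data present" flag per port, from the text

# Assumed bit position of each port's flag (not given in the source).
PORT_BIT = {"N": 0, "E": 1, "S": 2, "W": 3}

# Number of control-RAM entries reachable with a 14-bit index.
RAM_WORDS = 1 << (ADDR_BITS + COND_BITS)


def condition_code(ports_with_data):
    """Build the 4-bit condition code from the set of ports holding data."""
    code = 0
    for port in ports_with_data:
        code |= 1 << PORT_BIT[port]
    return code


def control_ram_index(address, ports_with_data):
    """Concatenate the address (assumed high bits) with the condition code."""
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address must fit in 10 bits")
    return (address << COND_BITS) | condition_code(ports_with_data)


def transfer_cycles(src, dst):
    """One cycle along a single bus (N/S or E/W), two when turning a corner."""
    north_south = {"N", "S"}
    same_bus = (src in north_south) == (dst in north_south)
    return 1 if same_bus else 2
```

Under these assumptions the control RAM has 2^14 entries, and a west-to-south transfer costs two cycles because the word must cross between the two busses.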

[Diagram labels: ports N, E, W, S; UpFIFO; DownFIFO]

Figure 3.16 Prototype NuMesh Datapath from

Coding applications involves explicitly writing communication code for each FSM. These communications must be ordered such that one communication must be observed passing through a node before the FSM is ready to accept the next communication. The exact arrival time of data need not be known, since an instruction can spin until data arrives and early-arriving data can be stored in the transceiver until an instruction finally sees it. There is a limited ability to change the FSM during run time through a single access register that provides the control RAM with an updated address. This register can be accessed through a jump instruction that causes a neighboring node to start executing from a different point in the communication schedule.

An example of user code is included in Figure 3.17. In the example, data is transferred in either direction between the west and south ports. Flow control is handled explicitly in the program. Input data is checked for through the iFulls flag and a free output port is checked for through the oFulls flag. Since the data is turning a corner, the data must first be transferred through the transceiver connecting the two busses. In order for the DSP to communicate with the network, the DSP code must include commands that involve reading or writing the FIFOs.
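The explicit flow control described above (check for input, check for a free output, then move the word through the cross-bus transceiver) can be sketched in Python. This is a loose behavioral model, not the prototype's actual instruction semantics: the `Port` class, the one-word `inbox`/`outbox` registers, and the return strings are all hypothetical names introduced for illustration.

```python
class Port:
    """Assumed model of a port: one input register and one output register."""

    def __init__(self):
        self.inbox = None    # word waiting to be consumed by the FSM
        self.outbox = None   # word waiting to leave the node


def i_full(port):
    """Stand-in for the iFulls check: input data is present."""
    return port.inbox is not None


def o_full(port):
    """Stand-in for the oFulls check: the output is still occupied."""
    return port.outbox is not None


def loop_iteration(west, south):
    """One pass of a west<->south corner-turn loop.

    Returns "hold" when neither port has input, "retry" when the
    destination output is occupied, and "moved" when a word was transferred.
    """
    if i_full(south):
        src, dst = south, west
    elif i_full(west):
        src, dst = west, south
    else:
        return "hold"            # spin until data arrives
    if o_full(dst):
        return "retry"           # destination busy, go back and check again
    cross = src.inbox            # first cycle: source bus into cross transceiver
    src.inbox = None
    dst.outbox = cross           # second cycle: cross transceiver onto other bus
    return "moved"
```

Because every pass depends on the run-time state of the flags, the outcome of a given cycle cannot be fixed at compile time, which is the data dependency the later discussion identifies as a weakness.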

Loop:
    (case iFulls
        (South ())
        (West (goto FromWest))
        (else (Hold)))
    (case oFulls
        (West (goto Loop))
        (else (DriveS LoadXns)))
    (DriveXew LoadW)
    (goto Loop)
FromWest:
    (case oFulls
        (South (goto Loop))
        (else (DriveW LoadXew)))
    (DriveXns LoadW)
    (goto Loop)

Figure 3.17 Sample Code for NuMesh Prototype

While the NuMesh prototype is useful for illustrating proof of concept and for hand-coding small examples, it has several limitations that prevent it from being used in a complex programming system. One limitation comes from the FSM’s inability to be broken up into multiple smaller FSMs. The sequential nature of the hardware can cause a single communication thread to block all other communication threads that go through the node.

Code can be written that allows the FSM to skip threads that are not ready, but the size of the FSM grows as the product of all the states of the individual communication threads.

Even limited dynamic changes to any one communication thread must, by definition, change the state of the entire FSM and upset the entire communication schedule. While this can be programmed around by keeping multiple copies of slightly different FSM structures in memory with a jump instruction switching between them, the cost becomes prohibitive when more than a very small number of dynamic decisions are allowed.
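A short Python sketch makes the growth argument concrete. The product rule comes from the text. The comparison against a sum (what separate per-thread FSMs would need) and the assumption that each dynamic decision is an independent two-way choice, so that k decisions need 2^k schedule copies, are illustrative assumptions and not claims from the source.

```python
from math import prod


def combined_fsm_states(thread_states):
    """States in one monolithic FSM that can skip unready threads.

    Per the text, this grows as the product of the per-thread state counts.
    """
    return prod(thread_states)


def separate_fsm_states(thread_states):
    """Total states if each thread had its own small FSM (assumed baseline)."""
    return sum(thread_states)


def schedule_copies(binary_decisions):
    """Schedule copies needed, assuming independent two-way decisions."""
    return 2 ** binary_decisions
```

For three threads with 4, 5 and 6 states, the monolithic FSM needs 120 states where separate FSMs would need 15 in total, and under the two-way assumption three dynamic decisions already require 8 copies of the schedule.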

Flow control is handled through the use of iFulls and oFulls checks that occur each cycle. Since these checks involve data dependencies, a key benefit of scheduled communication is lost. Cycle times must be significantly greater than an internode transfer time because a significant amount of data-dependent calculation must occur. No buffering is present to allow data to be temporarily stored in order to free a port, so data must remain blocking a port until a communication thread manages to grab it. Another restriction is that only a single instruction can operate at a time, even though there may be multiple free ports through which an additional transfer can be made.

The early system taught a number of lessons for designing scheduled communication architectures. Data-dependent instructions were shown to be impractical since they greatly increase cycle times. The ability to support multiple FSMs, either by timeslicing or through actual replication, was shown to be vital in order to keep any one communication thread from becoming a bottleneck. Figure 3.18 demonstrates two types of communications that can benefit from timeslicing and physical replication of the FSM.
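The bottleneck argument can be illustrated with a small Python comparison between a single sequential FSM and a timesliced arrangement. This is a toy model under stated assumptions: threads are reduced to a count of words to send, readiness is supplied by a caller-provided `ready(name, cycle)` function, and timeslicing is modeled as plain round-robin. None of these names or policies come from the source.

```python
def run_sequential(threads, ready, cycles):
    """Single FSM: threads are served in order, so a blocked thread stalls all.

    threads: dict mapping thread name -> number of words still to send
    ready:   function(name, cycle) -> bool, True if the thread can transfer
    """
    remaining = dict(threads)
    order = list(threads)
    current = 0
    for cycle in range(cycles):
        # Advance past threads that have finished sending.
        while current < len(order) and remaining[order[current]] == 0:
            current += 1
        if current == len(order):
            break
        name = order[current]
        if ready(name, cycle):
            remaining[name] -= 1
        # Otherwise the FSM spins on this thread for the whole cycle.
    return remaining


def run_timesliced(threads, ready, cycles):
    """Timesliced FSMs: each cycle belongs to one thread, round-robin."""
    remaining = dict(threads)
    order = list(threads)
    for cycle in range(cycles):
        name = order[cycle % len(order)]
        if remaining[name] > 0 and ready(name, cycle):
            remaining[name] -= 1
        # A blocked thread wastes only its own slot.
    return remaining
```

With thread A permanently blocked and thread B always ready, the sequential model makes no progress at all, while the timesliced model lets B finish in its own slots. Physical replication would go further by letting several threads transfer in the same cycle, which this sketch does not model.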

The FIFO interface to the processor was slow and unwieldy and forced messages to have an ID of some kind since the order of words in the FIFO could be mixed. The ability for flow control was also thought to be useful since the rate at which words could be injected or extracted from the communication network was unpredictable.
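To see why interleaved words force an identifier, consider this minimal Python sketch of demultiplexing a shared FIFO. It assumes every word carries its own message ID as a `(message_id, word)` pair. The prototype's actual tagging format is not described in the text, so this layout and the function name are hypothetical.

```python
from collections import defaultdict, deque


def demux_by_id(fifo):
    """Drain a FIFO of (message_id, word) pairs into per-message word lists.

    Word order within each message is preserved even though words from
    different messages arrive interleaved.
    """
    messages = defaultdict(list)
    while fifo:
        message_id, word = fifo.popleft()
        messages[message_id].append(word)
    return dict(messages)
```

Without the ID field the processor would have no way to tell which words belong together, and the tag itself consumes bandwidth on an interface that was already slow.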

In his Master’s thesis, John Pezaris explored a number of more complicated FSM enhancements [Peza94]. Some of the more interesting ideas he explored include multiple single-threaded FSMs that can switch context very quickly, a central arbiter that decides which FSM threads can gain control of the ports on every cycle, and multiple processing units on the CFSM that allow computation to occur directly in the CFSM datapath.
