Cycle Time - An Optimized Hardware Architecture and Communication Protocol for Scheduled Commun

One of the goals of this thesis is to produce a communication architecture that allows for a very fast cycle time. Special care was spent to reduce all critical paths to either an internode transfer or a small RAM access. The architecture presented meets this goal. The implemented chip achieved a maximum frequency of just under 37 MHz. However, this chip is designed as a prototype for a scheduled communication system. Since a target architecture of a laser-cut gate array was chosen, performance was compromised. A more aggressive full-custom implementation could have drastically increased the frequency of the design.

Comparing the maximum clock frequency of the NuMesh CFSM to other dynamic routers is a difficult proposition since a wide variety of technologies and implementations are possible and can confuse the comparison. The main advantage of the NuMesh CFSM is the lack of data dependent run time decisions and the ability to perform flow control operations with a single internode transfer. A paper out of Illinois [Aoya93] made a com-parison to dynamic routers, all implemented in the same technology. In the paper, each contributor to the clock period was identified. Since the contributors to the delay for the

Figure 6.2 JTAG Boundary Scan (Input Pin)

D Q

to next pin

from previous pin from

input pin

JTAG shift control

JTAG update control

to circuit

NuMesh CFSM have been fully specified, it is possible to compare the performance of the Numesh CFSM to a variety of dynamic routers.

The paper from Illinois describes a base-line adaptive router that allows two degrees of freedom. While a traditional deterministic router routes messages along dimensions in a particular order, their base-line adaptive router allows the router to choose from between two dimensions to advance the message. Messages are routed among planes in a particular order, but not dimensions.

In the design of adaptive and deterministic routers, the delay can be broken down to two components. The first is the intranode, or internal delay a word has to undergo before being sent. For a deterministic router, this delay might simply be an examination of the header address and the overhead of choosing the appropriate channel for transfer. An adaptive router might have a few more decisions to make depending on how many degrees of freedom the adaptive router can choose between. This intranode delay can further be broken down into two components. The first is termed the path setup while the second is called the data through delay. The path setup delay refers to the time the router takes to set up the crossbar transfer based on the decoding of the header address. For the Illinois base-line adaptive router, the setup path delays are included in Figure 6.3.

Delay Component Description Delay

Address Decoder Chooses destination ports based on header address

3.3 ns Routing Decision Arbitrates among requests of address

decoder result

2.1 ns Packet Update Controls switch operation and updates

header packet based on decision

2.6 ns Crossbar Switch Time for data to go through crossbar 1.1 ns Virtual Channel Multiplexes virtual channels among

physical channel

1.2 ns

Total 10.3 ns

Figure 6.3 Components of Path Setup Delay

The second component of the intranode delay is termed the data through delay. The data through delay comes from the overhead of managing flow control on a transfer. For the Illinois baseline router, the delay time consists of the components illustrated in Figure 6.4. To allow for data synchronization and flow control management between nodes, it

takes two data through delays for a message to be transferred. For the baseline Illinois router, two data through delays is greater than the setup path delay so the intranode delay is 5.7*2 = 11.4 ns.

The components of the internode delay are more simple. They are described in Figure 6.5. Basically, data has to get through an output driver, and input receiver, and then is latched. Although this delay is a non-trivial 4.9 ns, the delay is far less than the 11.4 ns required by the intranode delay. The Illinois router cycle is dominated by the intranode delay, even though it is the most simple of the adaptive routers they examined. If more degrees of freedom are allowed in the adaptive router, the intranode delay becomes much worse.

The Illinois paper goes on to examine a deterministic router based on the same assumptions as their adaptive router. For the deterministic router, the path setup delay becomes 5.7 ns while the data through delay becomes 3.0 ns. The internode delay remains fixed at 4.9 ns. The savings in intranode delay come from the absence of any virtual chan-nel controller and the more simple routing decision that is made. However, two internode

Delay Component Description Delay

Internal Flow Control Time for internal flow control decisions

2.2 ns Crossbar Switch Time for flow control data to

propagate through node

1.0 ns Virtual Channel Time for negotiation of

vir-tual channel

2.5 ns

Total 5.7 ns

Figure 6.4 Components of Data Through Delay

transfer times are required per message word, so the cycle time is 4.9*2 = 9.8, only a small savings over the simple adaptive routing case.

The NuMesh CFSM can also be modeled based on these assumptions. By far, the big-gest savings comes from the fact that only a single internode transfer is required per word.

The intranode delay for the NuMesh router consists of only the path setup delay, since most of data through delay occurs in parallel with the path setup delay, and the rest occurs on a later stage of the pipeline. Compared to the path setup delay of the adaptive router, only the crossbar switch delay and some amount of the routing decision delay is charged to the NuMesh CFSM. Even if the routing decision delay is assumed to be equal to that of the adaptive case (a doubtful proposition, since no data dependent decisions need be made), the total intranode delay would be 1.1 + 2.1 = 3.2ns. Since the path setup delay in the NuMesh design occurs in a different pipeline stage than the internode transfer, the NuMesh cycle time will be the worse of the path setup delay and the internode delay. That gives the CFSM a clock period of 4.9ns. This is 50% of the clock period of the determinis-tic router, and 43% of the simple adaptive router. The Illinois router was implemented in a .8 micron CMOS gate array library from Mitsubishi Corporation.

Since the NuMesh cycle time is determined by an internode transfer, one could imag-ine using fancier signaling techniques or even pipelining the internode transfer itself to allow for an even faster clock cycle. Pipelining the internode transfer would add an addi-tional constraint involving the number of required bubbles between operation of the same

Delay Component Description Delay

Output Buffer Time for data to get off chip 2.5 ns Input Buffer Time for data to get on chip .6 ns Input Latch Setup time to latch data .8 ns Synchronizer Time charged for differences

between system clocks

1.0 ns

Total 4.9 ns

Figure 6.5 Components of Internode Delay

stream, but would allow for very high data throughput. Since no data dependent decisions need ever be made in the NuMesh design, the cycle time can be continually decreased as technology improves. A direct comparison of the three routers is included in Figure 6.6.

6.5 Network Congestion

A second goal of this thesis was to provide a communication routing chip that can effi-ciently support scheduled communication. In order for the architecture to be considered useful, it must be shown that scheduled communication operating on the NuMesh system can outperform dynamic communication. The previous section showed that the cycle time of the network can be improved by fifty percent or more. This section will show that scheduled communication can reduce congestion in the network enough to provide signif-icantly improved application performance.

Chris Metcalf[Metc97], will compare scheduled communication to a variety of dynamic routing schemes. In some of his early analysis, he takes several well-known application kernels and compares their performance for both off-line scheduled routing on the Numesh, and on-line dynamic communication methods. For the rest of this section, it should be noted that the cycle time for all routers is assumed to be the same. The added benefit of the increased potential frequency of the NuMesh CFSM is not taken into account. Three application kernels that Metcalf studied are parallel prefix, transpose, and prefix.

Intranode Delay Internode Delay Total Cycle Time Deterministic

Router

5.7 ns 4.9 ns 9.8 ns

Adaptive Router

11.4 ns 4.9 ns 11.4 ns

NuMesh Router

<=3.1 ns 4.9 ns 4.9 ns

Figure 6.6 Cycle Time Comparison of Routers

A parallel prefix algorithm works as follows. Assume there exists some operation . A group of nodes is labeled from n₀ - n_N. Each node n_i holds the value . The operator can be any associative function such as ADD, MIN, or OR. Parallel machines use parallel prefix for a variety of algorithms such as graph traversal, traveling salesman, and various min/max problems. Since the communications for a parallel prefix are known at compile time, they can be completely scheduled. The algorithm does require several phases of communications that must be loaded in during operation. Metcalf’s com-parison of different communication schemes for the parallel prefix algorithm is included in Figure 6.7.

The second application Metcalf looked at was the transpose benchmark. In a three-dimensional array, every node sends data to another node based on its address. Messages

⊗ n₀⊗…⊗n_i

Figure 6.7 Comparison of Routing Schemes for Parallel Prefix

deterministic on-line routing

off-line routing with NuMesh

are transferred from (x,y,z) to (z,y,x). Two Kilobytes of data are transferred from each node. For this application, both deterministic and adaptive schemes are compared to the NuMesh scheduled routing. The results are shown in Figure 6.8.

The final application kernel is a bit reverse. Again, all nodes are sending data to other nodes based on their address. In this case, every node is assigned a binary address of n_on₁...n_N. A two Kilobyte message is sent to the node with address n_Nn_k-1...n₀. Again, both deterministic and dynamic routing schemes are compared to NuMesh scheduled communication. Results are shown in Figure 6.9.

For all three applications, the scheduled communication model outperforms both the deterministic and adaptive schemes. The deterministic schemes suffer because nodes

Figure 6.8 Comparison of Routing Schemes for Transpose

deterministic on-line routing

attempt to force too much data over selected links while free nearby links remain unused.

The adaptive scheme suffers because bottlenecks are generated by some of the misroutes resulting in even longer delays than the simple deterministic router. If the frequency of the communication nodes was taken into account, the NuMesh scheduled communication results would fare significantly better, and the adaptive scheme would fall even further behind the deterministic scheme.

Dans le document An Optimized Hardware Architecture and Communication Protocol for Scheduled Communication (Page 160-167)