• Aucun résultat trouvé

Constraints in the Chip Express Design Flow

Chip Express constraints added two significant changes to the design of the CFSM.

The RAMs in the Chip Express design require that the read address be latched in on the clock cycle preceding the cycle in which the data is expected to be offered. The design of the pipeline assumes that the read operation of the RAM is combinational, and that the address can be created at the beginning of the cycle in which data is expected. In most cases, the difference is irrelevant since the read address in the design is coming straight

from a register. For these cases, the register is effectively moved into the RAM module. In some cases in the design, some combinational logic occurs between stages and the address is figured out early in one of the stages. For these cases, the addresses must be precom-puted on the previous cycle in order to be stored in the register of the RAM module. Fortu-nately, only minor modifications must be made to allow this to work, and the critical path

Signal Number

of Pins Direction Attached Resistor?

Data Ports 32 x 4 = 128 Bi-directional Pull-Up Resistor Valid/Accept Input 1 x 4 = 4 Input Pull-Down Resistor

Valid/Accept Output 1 x 4 = 4 Output None

Clock System Bits 1 x 4 = 4 Output None

Reset Lines 1 x 5 = 5 4 Bi-directional 1 Output (proc)

Pull-Down Resistor

proc JTAG port 1 x 4 3 Output

1 Input

None

AFX AB(address) 5 Input None

AFX DB(data) 33 Bi-directional None

AFX AEN(enable) 1 Input None

AFX

ADDR_LO(enable)

1 Input None

AFX SREPLY 2 Input None

AFX PREPLY 2 Output None

Processor Interrupt 1 Output None

Power_reset 1 Input None

LEDs 4 Output None

System Clk 1 Input None

Total Pins = 200

Figure 5.12 Pinout Listing of Chip

is not affected.

The second constraint is much more serious. The system design for clock distribution can minimize skew between the clock signals of any two nodes. On each node, the clock signal is an input to the CFSM chip. Additional and unpredictable skew delay can occur when the clock goes through a pin and a receiver. Traditionally, a phase-locked loop can force the external and the internal clocks to match. Unfortunately, the Chip Express logic family (CX2000) did not support this phase locked loop. As a result, the skew between the clocks became unacceptable. Consider the example in Figure 5.13. Node 0 is attempting to transmit data to node 1. Ideally, an entire clock cycle is allowed for the transfer. Data is latched on rising edges of the clock. In the example, if the rising edge of node 1 occurs much after the rising edge of node 0, the data being transmitted by node 0 may start to change before node 1 has a chance to latch the data.

Some amount of skew can be tolerated due to the hold time characteristics of the logic in the transmit path. The absence of a phase-locked loop causes enough skew potential to keep the hold time delays from solving the problem. As a result, a new clocking scheme was implemented. All nodes transmit on the rising edge of the clock, but receive data on the falling edge of the clock. Consider the same example as before, in Figure 5.14.

In this example, since the transmit and receive edges do not line up, clock skew can not cause the wrong data to be latched, although it can reduce the effective amount of time allowed for the transfer. The biggest drawback to this system is that only half a clock cycle is allowed for a data transfer. This causes the internode transfer time to be the critical path in the CFSM.

transmit

receive

Node 0

Node 1

Figure 5.13 Clock Skew Can Cause Transmit Errors

transmit

receive

Node 0

Node 1

Figure 5.14 Overcoming Clock Skew

Chapter 6

Conclusion and Results

The goal of this thesis is to show the design and implementation of an architecture capable of efficiently supporting scheduled communication. Virtual streams of communi-cation are extracted from an applicommuni-cation at compile time. These virtual streams indicate a source processor, a destination processor, and a requested bandwidth of communication. A directed graph of communication can be extracted from these virtual streams, and a global schedule of communication can be created that choreographs communication in the net-work. By taking advantage of scheduled communication information already present in many applications, the NuMesh system can minimize congestion by intelligently schedul-ing the virtual streams. In addition, the clock cycle of the network can be reduced due to the absence of data-dependent decisions and a simple flow control protocol. Although the communication network is static, the processors attached to each node are unpredictable in nature and can inject and remove messages at arbitrary times. A flow control protocol is supported to allow messages to back up in the network without affecting the choreo-graphed schedule of the network.

The rest of this chapter is devoted to summarizing the thesis. First the major architec-tural ideas are discussed. Then the design flow for the actual chip is described. Once the chips were received a variety of testing environments were created. Finally, some effort is made to evaluate the architecture based on the two fundamental goals of reducing the net-work router’s cycle time and minimizing congestion. The chapter ends with some ideas of future work.