
Even though the communication router sets up a static schedule of communication, the processors attached to the network can remove or send messages depending on the dynamic behavior of an application. If a processor is slow to remove messages from the system, but the sending node continuously sends new communication words, words in a particular virtual stream may start to back up in the network. In order to keep words at the end of the virtual stream from being overwritten, there must be some notion of flow control implemented. Flow control implies that a transfer can be made only if the receiving node can accept new data. If data starts to back up, either a communication word must be stalled on a link, or the word must be temporarily buffered. Stalling a word is unacceptable in the NuMesh system, because on every cycle a physical link may be used by a completely different virtual stream. This means that there must exist some way to buffer words on a NuMesh node. Flow control is particularly difficult in a scheduled communication architecture, because a node can not be stalled if a word needs to be buffered. Any unexpected clock cycle in one of the NuMesh nodes could throw off the global timing of the entire communication network. This section will explore a novel flow control protocol and will discuss a variety of buffering options for the NuMesh architecture.

4.3.1 Conventional Flow Control

Dynamic networks implement flow control in one of two ways. In the first scheme, every virtual channel has some amount of buffer storage. When a node wants to make a transfer to a neighboring node using a particular virtual channel, it sends a request over a set of control lines. The receiving node checks the virtual channel’s buffer storage to see if it can accept another word. The sending node may send the data along with this request. The receiving node then sends an acknowledgment if it can accept the new data.

An important point to realize is that this protocol takes two internode transfers to complete. The transmit and acknowledge must occur in series, because the receiving node has no idea what virtual channel the sending node might want to use.
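As a minimal sketch of this two-transfer handshake, consider the C fragment below. The names and sizes (vc_buffer, NUM_VCS, VC_DEPTH) are assumptions made for illustration and do not correspond to any particular router's implementation; the point is simply that the acknowledgment can only be produced after the request names a virtual channel.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_VCS  4      /* virtual channels per link (assumed)        */
    #define VC_DEPTH 4      /* buffer slots per virtual channel (assumed) */

    typedef struct {
        uint32_t slots[VC_DEPTH];
        int      count;     /* words currently buffered                   */
    } vc_buffer;

    typedef struct {        /* first internode transfer: the request      */
        int      vc;        /* which virtual channel the sender wants     */
        uint32_t data;      /* data may ride along with the request       */
        bool     valid;
    } link_request;

    /* Second internode transfer: the acknowledgment.  It cannot be made
     * earlier because the receiver does not know which virtual channel
     * will be requested until the request arrives. */
    bool receiver_ack(vc_buffer bufs[NUM_VCS], const link_request *req)
    {
        if (!req->valid)
            return false;
        vc_buffer *b = &bufs[req->vc];
        if (b->count == VC_DEPTH)
            return false;                    /* channel full: refuse       */
        b->slots[b->count++] = req->data;    /* room available: accept     */
        return true;                         /* ack travels back to sender */
    }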

In the second scheme, the dynamic network is able to reduce the protocol to a single internode transfer. A node transmits the availability of every virtual channel to each of its neighbors. For systems using a small number of virtual channels, this protocol is manageable, although the number of pins dedicated to the protocol is proportional to the number of virtual channels. Another drawback to this second scheme is that each virtual channel must be provided with an extra word of buffer storage that can not be used for normal buffering. This requirement comes about because the receiving node is effectively sending an acknowledge before it knows the current state of the virtual channel’s buffer storage. Consider the case in which the same virtual channel is being used for two consecutive clock cycles. As the first word is being transferred between nodes, the receiving node must indicate whether the virtual channel is available for the second transfer. In order to avoid two internode transfer times, the control line value must not depend on the result of the first cycle’s transfer. This means that a virtual channel must indicate that it is full when it still has one additional buffer spot empty. That extra buffer slot will only be used if consecutive clock cycle transfers occur over the same virtual channel. If the virtual channel is requested every other clock cycle, the sending node will see a full buffer queue even when there is still an empty slot. Between the wasted buffer word per virtual channel and the increased decision time needed to interpret the virtual channel control lines, this scheme too has its drawbacks.
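The reserved slot can be made concrete with a short sketch. Assuming the receiver recomputes one availability bit per virtual channel from the buffer occupancy it knew at the start of the cycle, a channel may only be advertised as available when at least two slots are free; the spare slot absorbs a back-to-back transfer whose predecessor has not yet been counted. The names and sizes below are illustrative.

    #include <stdbool.h>

    #define NUM_VCS  4
    #define VC_DEPTH 4   /* one slot per channel is effectively reserved */

    /* occupancy[] holds each channel's buffer count at the START of the
     * cycle.  Because the bit must not depend on whether the in-flight
     * transfer lands in the same channel, "available" requires at least
     * two free slots. */
    void compute_avail_bits(const int occupancy[NUM_VCS], bool avail[NUM_VCS])
    {
        for (int vc = 0; vc < NUM_VCS; vc++)
            avail[vc] = (occupancy[vc] <= VC_DEPTH - 2);
    }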

The NuMesh architecture is designed to support a large number of virtual streams. Since these virtual streams must follow a global schedule, if one stream gets blocked it must not affect the timing of any of the other virtual streams. A blocked communication word must get quickly buffered in order to free up the physical link for the next clock cycle, when a completely different virtual stream might be scheduled. The next time the blocked stream gets scheduled, the buffered word must be quickly extracted from buffer storage and sent into the communication network. To meet these constraints, it is essential that each virtual stream have its own buffer storage. The amount of this buffer storage will be discussed in the next section. One advantage of a scheduled communication architecture is that a unique flow control protocol can be utilized. While dynamic systems suffer from two internode transfers per word in one scheme and from increased pinout and wasted buffer storage in the other, the scheduled communication protocol escapes both penalties. The NuMesh protocol involves a single internode transfer, requires only a single bit line regardless of the number of virtual streams, and does not need to waste any buffer storage. The next subsection will describe the mechanics of this protocol.
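A minimal sketch of this per-stream buffering follows, with assumed stream counts and FIFO depths: every virtual stream owns a private FIFO, a refused word is pushed into it in one cycle, and the backlog is drained the next time that stream appears in the schedule, ahead of any newly accepted word.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_STREAMS 32   /* illustrative, not a NuMesh parameter */
    #define FIFO_DEPTH   2

    typedef struct {
        uint32_t word[FIFO_DEPTH];
        int      head, count;
    } stream_fifo;

    static stream_fifo fifo[NUM_STREAMS];

    static void push(stream_fifo *f, uint32_t w)
    {
        f->word[(f->head + f->count) % FIFO_DEPTH] = w;
        f->count++;
    }

    /* The next time a blocked stream is scheduled, its oldest buffered
     * word takes the link; a word accepted this cycle waits its turn. */
    bool word_to_send(int stream, uint32_t incoming, bool incoming_valid,
                      uint32_t *out)
    {
        stream_fifo *f = &fifo[stream];
        if (f->count > 0) {
            *out = f->word[f->head];             /* drain the backlog first */
            f->head = (f->head + 1) % FIFO_DEPTH;
            f->count--;
            if (incoming_valid)
                push(f, incoming);
            return true;
        }
        if (incoming_valid) {
            *out = incoming;                     /* no backlog: pass through */
            return true;
        }
        return false;
    }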

4.3.2 Scheduled Communication Flow Control Protocol

In a scheduled communication system, the virtual stream that is to be run on every clock cycle is known at compile time. This information is stored in some format in the communication router. As a result, much of the overhead of an actual transfer can occur well before the data is exchanged between nodes. On every cycle, only one virtual stream is allowed to accept new data, so only a single bit of information needs to be transferred to the source node. Since the receiving node knows ahead of time which virtual stream is going to be used, it can decide whether or not it can accept new data ahead of time. In the case of a dynamic router, either the acceptance of the data must occur serially, or the state of all virtual channels must be transmitted ahead of time. In the scheduled communication architecture, the acceptance for only the scheduled virtual stream can be sent at the same time the data is transferred.
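As a rough illustration of why the accept bit can be produced ahead of time, suppose the compile-time schedule is held in a table indexed by cycle number. The table, the occupancy counters, and the sizes below are assumptions for the sketch, not the actual NuMesh structures.

    #include <stdbool.h>

    #define SCHEDULE_LEN 16              /* length of the repeating schedule   */
    #define NUM_STREAMS  32
    #define FIFO_DEPTH    2

    static int schedule[SCHEDULE_LEN];   /* cycle slot -> virtual stream id    */
    static int occupancy[NUM_STREAMS];   /* words buffered per virtual stream  */

    /* During cycle t the receiver already knows which stream owns the link
     * on cycle t+1, so a single accept bit can be driven alongside that
     * cycle's data rather than after it. */
    bool accept_bit_for_next_cycle(int cycle)
    {
        int stream = schedule[(cycle + 1) % SCHEDULE_LEN];
        return occupancy[stream] < FIFO_DEPTH;   /* room for one more word? */
    }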

Figure 4.3 represents the exchange of information that occurs. The receiving node B can decide on the previous clock cycle whether or not it can accept new data from the sending node A. If that stream has been previously blocked and some number of buffered words exist for the stream, the receiving node may decide not to accept new words. Node B would indicate this by sending a low value on the accept line when the communication stream is scheduled. During the same clock cycle, the sending node A can send its data (along with a valid bit) to the receiving node. No decision needs to be made yet based on the receiving node’s ability to accept data. At some point in the future (possibly the same clock cycle or the next), the sending node can examine the accept bit received from the receiving node and can decide whether or not the transfer was successful. If it discovers that the receiving node was unable to accept the communication word, the word must be stored in a buffer that belongs to that particular communication stream.
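The sending node's half of this exchange might be sketched as follows; drive_link and transfer_refused are hypothetical helpers, and a refused word would be pushed into the stream's private FIFO sketched earlier.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t data;
        bool     valid;
    } link_word;

    /* Cycle t: drive the scheduled stream's word onto the link.  Nothing
     * here depends on whether the receiver will accept it. */
    link_word drive_link(uint32_t word, bool have_word)
    {
        link_word out = { .data = word, .valid = have_word };
        return out;
    }

    /* Same cycle or a later pipeline stage: examine the accept bit sampled
     * from node B.  A true result means the word was refused and must go
     * into the sending stream's own buffer storage. */
    bool transfer_refused(link_word sent, bool accept_bit)
    {
        return sent.valid && !accept_bit;
    }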

On the following clock cycle, the receiving node becomes the sending node for the next hop of the communication stream. If no flow control backup is occurring, the word transferred to node B on the previous cycle gets transmitted to the destination of node B’s instruction while at the same time node B receives an accept bit from the destination node.

Two important results should be noted from this scheme. First, only a single internode transfer time is required even though flow control is implemented. This allows unrelated virtual streams to operate on consecutive clock cycles. Second, the flow control overhead does not lengthen the cycle time of the node. Since the machine can be pipelined arbitrarily as long as the scheduled times of transfers are met, the handshake overhead can be pushed back to previous stages or future stages in the architecture. The goal of limiting the cycle time to a single internode transfer time or a small RAM lookup has been met.