In this section, we consider in further detail many of the issues relevant to achieving low-latency communications.

2.4.1 Network Latency

Ignoring protocol overhead at the sending and receiving ends of a network, the latency in an interconnection network comes from four basic factors:

1. Transit Latency ($T_t$): The amount of time the message spends traversing the interconnection media within the network

2. Switching Latency ($T_s$): The amount of time the message spends being switched or routed by switching elements inside the network

3. Transmission Time ($T_{transmit}$): The time required to transmit the entire contents of a message into or out-of the network

4. Contention Latency ($F_c$): The degradation in network latency due to resource contention in the network

Transit latency is generally dictated by physics and geometry: it is the quotient of the physical distance and the rate of signal propagation.

$T_t = \frac{d}{v}$ (2.2)

Basic physics limits the amount of time that it takes for a signal to traverse a given distance. Materials will affect the actual rate of signal propagation, but regardless of the material, the propagation speed will always be below the speed of light, $c \approx 3 \times 10^{10}$ cm/s. The rate of propagation is given by:

$v = \frac{1}{\sqrt{\mu\epsilon}}$ (2.3)

For most materials $\mu \approx \mu_0$, where $\mu_0$ is the permeability of free space. Conventional printed-circuit boards (PCBs) have $\epsilon = \epsilon_r \epsilon_0$, where $\epsilon_r \approx 4$ and $\epsilon_0$ is the permittivity of free space; thus, $v \approx \frac{c}{2}$. High-performance substrates have lower values of $\epsilon_r$. The physical geometry of the network determines the physical interconnection distances, $d$. Physical geometry is partially in the domain of packaging (Chapter 7), but is also determined by the network topology (Chapter 3). All networks are limited to exploiting, at most, three-dimensional space. Even in the best case, the total transit distance between two nodes in a network is at least limited by the physical distance between them in three-space. Additionally, since physical interconnection channels (e.g. wires, PCB traces, silicon) occupy physical space, the volume these channels consume within the network often affects the physical space into which the network and nodes may be packed.
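
As a rough illustration of Equations 2.2 and 2.3, the sketch below computes the propagation velocity and transit latency for a path on a conventional PCB. The distance and dielectric values are hypothetical, chosen only to match the $\epsilon_r \approx 4$ figure above.

```python
import math

C_CM_PER_NS = 30.0  # speed of light: ~3e10 cm/s = 30 cm/ns

def propagation_velocity(epsilon_r: float, mu_r: float = 1.0) -> float:
    """Equation 2.3, v = 1/sqrt(mu * epsilon), expressed relative to
    free space: v = c / sqrt(mu_r * epsilon_r). Returns cm/ns."""
    return C_CM_PER_NS / math.sqrt(mu_r * epsilon_r)

def transit_latency_ns(distance_cm: float, epsilon_r: float) -> float:
    """Equation 2.2: T_t = d / v."""
    return distance_cm / propagation_velocity(epsilon_r)

# Hypothetical example: a 60 cm path on conventional PCB (epsilon_r ~ 4)
# propagates at roughly c/2 (15 cm/ns), giving T_t of about 4 ns.
print(transit_latency_ns(60.0, epsilon_r=4.0))  # -> 4.0
```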

For networks with uniform switching nodes, switching latency is the product of the number of switching stages between endpoints, $s_n$, and the latency of each switching node, $t_{nl}$.

$T_s = s_n \cdot t_{nl}$ (2.4)

The network topology dictates the number of switching stages. The latency of each switching node is the sum of the signal i/o latency, $t_{io}$, and the switching node functional latency, $t_{switch}$.

$t_{nl} = t_{io} + t_{switch}$ (2.5)

The signal i/o latency, or the amount of time required to move signals into and out-of the switching node, is generally determined by the signalling discipline and the technologies used for the switching node (Chapter 6). The switch functional latency accounts for the time required to arbitrate for an appropriate output channel and move message data from the input channel to the output channel. In addition to technology, the switch functional latency will depend on the complexity of the routing and arbitration schemes and the complexity of the switching function (Chapter 4). Larger switches generally require more complicated arbitration and switching, resulting in larger inherent switching latencies.
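
The following minimal sketch evaluates Equations 2.4 and 2.5. The stage count and per-node timings are hypothetical, chosen only to illustrate the arithmetic.

```python
def switch_node_latency(t_io: float, t_switch: float) -> float:
    """Equation 2.5: t_nl = t_io + t_switch."""
    return t_io + t_switch

def switching_latency(s_n: int, t_nl: float) -> float:
    """Equation 2.4: T_s = s_n * t_nl, for uniform switching nodes."""
    return s_n * t_nl

# Hypothetical values: 5 switching stages, 2 ns of i/o latency and
# 8 ns of functional latency per node.
t_nl = switch_node_latency(t_io=2.0, t_switch=8.0)  # 10 ns per node
print(switching_latency(s_n=5, t_nl=t_nl))          # -> 50.0 ns
```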

The transmission time accounts for the amount of time required to move the entire message data into or out-of the network. In many networks, the amount of data transmitted in a message is larger than the width of a data channel. In these cases, the data is generally transmitted as a sequence of pieces, each limited to the width of the channel. Assuming we have a message of length $L$ to send over a channel $w$ bits wide which can accept new data every $t_c$ time units, we have the transmission time, $T_{transmit}$, given by:

$T_{transmit} = \left\lceil \frac{L}{w} \right\rceil \cdot t_c$ (2.6)

Here we see one of the places where low bandwidth has a detrimental effect on network latency: $T_{transmit}$ increases as the channel bandwidth decreases.
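
A small sketch of Equation 2.6 makes this effect visible: an $L$-bit message requires $\lceil L/w \rceil$ channel transfers of $t_c$ each, so narrowing the channel lengthens the transmission time proportionally. The message and channel parameters below are hypothetical.

```python
import math

def transmission_time(l_bits: int, w_bits: int, t_c: float) -> float:
    """Equation 2.6: T_transmit = ceil(L / w) * t_c. The channel accepts
    w bits every t_c time units, so an L-bit message takes ceil(L/w)
    transfers."""
    return math.ceil(l_bits / w_bits) * t_c

# Hypothetical: a 512-bit message over a 16-bit channel, 5 ns per transfer.
print(transmission_time(512, 16, 5.0))  # 32 transfers -> 160.0 ns
# Halving the channel width (i.e. the bandwidth) doubles T_transmit:
print(transmission_time(512, 8, 5.0))   # 64 transfers -> 320.0 ns
```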

Contention latency arises when resource conflicts occur and a message must wait until the necessary resources are available before it can be delivered to its designated destination. Such conflicts result when the network has insufficient bandwidth or the network bandwidth is used inefficiently.

In packet-switched networks, contention latency manifests itself in the form of queuing which must occur within switches when output channels are blocked. In circuit-switched networks, contention latency is incurred when multiple messages require the same channel(s) in the network and some messages must wait for others to complete. Contention latency is the effect which differentiates an architecture's theoretical minimum latency from its realized latency. The amount of contention latency is highly dependent on the manner in which an application utilizes the network. Contention latency is also affected by the routing protocol (Chapter 4) and network organization (Chapter 3).

One can think of contention latency as a derating factor on the unloaded network latency.

$T_{unloaded} = T_s + T_t$ (2.7)

$T_{net} = F_c(\text{application}, \text{topology}) \cdot T_{unloaded} + T_{transmit}$ (2.8)
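
Combining Equations 2.7 and 2.8, the sketch below shows how contention inflates the unloaded latency. The derating factor and timing values are hypothetical (carried over from the earlier sketches).

```python
def unloaded_latency(t_s: float, t_t: float) -> float:
    """Equation 2.7: T_unloaded = T_s + T_t."""
    return t_s + t_t

def network_latency(f_c: float, t_unloaded: float, t_transmit: float) -> float:
    """Equation 2.8: T_net = F_c * T_unloaded + T_transmit, where F_c is
    the contention derating factor (a function of application and
    topology; 1.0 means no contention)."""
    return f_c * t_unloaded + t_transmit

t_u = unloaded_latency(t_s=50.0, t_t=4.0)  # values from the earlier sketches
print(network_latency(f_c=1.0, t_unloaded=t_u, t_transmit=160.0))  # -> 214.0
print(network_latency(f_c=2.5, t_unloaded=t_u, t_transmit=160.0))  # -> 295.0
```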

One of the easiest ways to see this derating effect is when an application requires more bandwidth between two sets of processors than the network topology provides. In such a case, the effective latency will be increased by a factor equal to the ratio of the desired application bandwidth to the available network bandwidth. E.g., if $A_{bw}$ is the bandwidth needed by an application, and $N_{bw}$ is the bandwidth provided by the network for the required communication, we have:

$F_c = \frac{A_{bw}}{N_{bw}}$

In practice, the derating factor is generally larger than this simple ratio because the resource conflicts themselves may consume bandwidth. For example, on most local-area networks, when contention results in collisions, the time lost during each collision adds to the network latency, as does the time required to finally transmit the message.

The effects of contention latency make it clear why a bus is inefficient for multiprocessor operation. The bus provides a fixed bandwidth, $N_{bw}$. There is no switching latency and generally a small transit latency over the bus. However, as we add processors to the bus, the bandwidth potentially usable by the application, $A_{bw}$, generally increases while the network bandwidth stays fixed. This translates into a large contention derating factor, $F_c$, and consequently high network latency.
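
A toy model of this effect, under the assumption that each added processor presents a fixed, hypothetical bandwidth demand to a fixed-bandwidth bus:

```python
def bus_derating(n_procs: int, per_proc_bw: float, bus_bw: float) -> float:
    """Approximates F_c = A_bw / N_bw for a shared bus: the aggregate
    application demand grows with processor count while the bus
    bandwidth N_bw stays fixed."""
    a_bw = n_procs * per_proc_bw
    return max(1.0, a_bw / bus_bw)

# Hypothetical: each processor wants 100 MB/s from a 400 MB/s bus.
for n in (2, 4, 8, 16):
    print(n, bus_derating(n, per_proc_bw=100.0, bus_bw=400.0))
# 2 -> 1.0, 4 -> 1.0, 8 -> 2.0, 16 -> 4.0: once aggregate demand
# exceeds the bus bandwidth, the derating factor grows linearly.
```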

Unfortunately, it is hard to quantify the contention latency factor as cleanly as we can quantify the other network latency factors. The bandwidth required between any pair of processors is highly dependent on the application, the computational model in use, and the run-time system. Further, contention depends not just on the available bandwidth between a pair of processors, but on the bandwidth between any sets of processors which may wish to communicate simultaneously.

2.4.2 Locality

Often, physical and logical locality within a network can be exploited to minimize the average communication latency. In many networks, nodes are not equidistant. The transit latency and switching latency between a pair of nodes may vary greatly based on the choice of the pair of nodes. Logical distance is used to refer to the amount of switching required between two nodes ($T_s$), and physical distance is used to refer to the transit latency ($T_t$) required between two nodes. Thus, two nodes which are closer, or more local, to each other logically and physically may communicate with lower latency than two nodes which are further apart. Additionally, when logically close nodes communicate, they use fewer switching resources and hence contribute less to resource contention in the network.
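
To make the distinction concrete, the sketch below compares the unloaded latency of a logically and physically near pair of nodes against a far pair, using the latency components defined above. All distances and timings are hypothetical.

```python
def pair_latency(hops: int, t_nl: float, distance_cm: float,
                 v_cm_per_ns: float, t_transmit: float) -> float:
    """Latency between a node pair, T_s + T_t + T_transmit: logical
    distance (hops) sets T_s; physical distance sets T_t."""
    return hops * t_nl + distance_cm / v_cm_per_ns + t_transmit

# Hypothetical mesh: neighbors are 1 hop / 10 cm apart; distant nodes
# are 10 hops / 100 cm apart; 10 ns per switching node, v ~ c/2.
near = pair_latency(1, 10.0, 10.0, 15.0, 160.0)    # ~170.7 ns
far = pair_latency(10, 10.0, 100.0, 15.0, 160.0)   # ~266.7 ns
print(near, far)  # locality saves both the hop time and the wire time
```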

The extent to which locality can be exploited is highly dependent upon the application being run over the network. The exploitation of network locality to minimize the effective communication latency in a multiprocessor system is an active area of current research [KLS90] [ACD+91] [Wal92].

Exploiting network locality is of particular interest when designing scalable computer systems, since the latency of the interconnect will necessarily increase with network size. Assuming the physical and logical composition of the network remains unchanged when the network is grown, for networks without locality the physical distance between all nodes grows as the system grows due to spatial constraints; for networks with locality, only the physical distance between the farthest-separated nodes grows. Additionally, as long as bounded-degree switches (Section 2.7.1) are used to construct the network, the logical distance between nodes increases as well. Locality exploitation is one hope for mitigating the effects of this increase in latency.

It is necessary to keep the benefits due to locality in proper perspective with respect to the entire system. A small gain due to locality can often be dwarfed by the fixed overheads associated with communication over a multiprocessor network. Locality optimizations yield negligible rewards when the transmission latency benefit is small compared to the latency associated with launching and handling the message. Johnson demonstrated upper bounds on the benefits of locality exploitation using a simple mathematical model [Joh92]. For a specific system ([ACD+91]), he shows that even for machines as large as 1000 processors, the upper bound on the performance benefit due to locality exploitation is a factor of two.

2.4.3 Node Handling Latency

This document concentrates on designing the network for a high-performance multiprocessor.

Nonetheless, it is worthwhile to point out that the effective latency seen by the processors is also dependent on the latency associated with getting messages from the computation into the network, out-of the network, and back into the computation. Network input latency, $T_p$, is the amount of time after a processor decides to issue a transaction over the network before the message can be launched into the network, assuming network contention does not prevent the transaction from entering the network. Similarly, network output latency, $T_w$, is the amount of time between the arrival of a complete message at the destination node and the time the processor may begin actually processing the message. If not implemented carefully, large network input and output latencies can limit the extent to which low-latency networks can facilitate low-latency communication between nodes.

Combining these effects with our network latency we have the total processor to processor message latency:

$T_{message} = T_p + T_{net} + T_w$ (2.9)
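
Equation 2.9 composes these terms directly. In the sketch below, hypothetical node input and output latencies illustrate how node handling overhead can dominate the end-to-end message latency:

```python
def message_latency(t_p: float, t_net: float, t_w: float) -> float:
    """Equation 2.9: T_message = T_p + T_net + T_w."""
    return t_p + t_net + t_w

# Hypothetical: 295 ns of network latency (from the earlier sketch)
# bracketed by 1 microsecond of node overhead on each end.
print(message_latency(t_p=1000.0, t_net=295.0, t_w=1000.0))  # -> 2295.0 ns
# Node handling dominates here; halving T_net would improve the total
# end-to-end message latency by only about 6%.
```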

This document will not attempt to directly address how one minimizes node input and output latency. Node latencies such as these are highly dependent on the programming model, processor, controller, and memory system in use. [NPA92] and [D+92] describe processors which were designed to minimize these latencies. [E+92] and [CSS+91] describe a computational model intended to minimize these latencies. Here, we will devote some attention to assuring that the network itself does not impose limitations which require large node input and output latencies.