Robust, High-Speed Network Design for Large-Scale Multiprocessing

MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY

A.I. Technical Report No. 1445 September, 1993

André DeHon andre@ai.mit.edu

Abstract: Large-scale multiprocessing remains an elusive yet promising paradigm for achieving high-performance computation. As machine size scales upward, there are two important aspects of multiprocessor systems which will generally get worse rather than better: (1) interprocessor communication latency will increase and (2) the probability that some component in the system will fail will increase. Both of these problems can prevent us from realizing the potential benefits of large-scale multiprocessing. In this document we consider the problem of designing networks which simultaneously minimize communication latency while maximizing fault tolerance for large-scale multiprocessors. Using a synergy of techniques including connection topologies, routing protocols, signalling techniques, and packaging technologies we assemble integrated, system-level solutions to this network design problem. In particular, we recommend the use of multipath, multistage networks; simple, source-responsible routing protocols; stochastic fault-avoidance; dense three-dimensional packaging; low-voltage, series-terminated transmission line signalling; and scan-based diagnostics and reconfiguration.

Acknowledgements: This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the Laboratory’s Artificial Intelligence Research is provided in part by the Advanced Research Projects Agency under Office of Naval Research contract N00014-91-J-1698. This material is based upon work supported under a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation.

Acknowledgments

The ideas presented herein are the collaborative product of the MIT Transit Project under the direction of Dr. Thomas Knight, Jr. The viewpoint is my own and, as such, the description reflects my perspective and biases on the subject matter. Nonetheless, the whole picture presented here was shaped by the effort of the group and would definitely have been less complete without such a collective effort.

At the expense of slighting some whose influences and efforts I may not yet fully appreciate, I feel it only appropriate to acknowledge many by name. Knight provided many of the seed ideas which have formed the basis of our works. Preliminary work with the MIT Connection Machine Project and with the Symbolics VLSI group raised many of the issues and ideas which the Transit Project inherited. Collaboration with Tom Leighton, Bruce Maggs, and Charles Leiserson on interconnection topologies has also been instrumental in developing many of the ideas presented here.

Knight, of course, provided the project with the oversight and encouragement to make this kind of effort possible. Henry Minsky has been a key player in the Transit effort since its inception. Minsky has helped shape ideas on topology, routing, and packaging. Without his VLSI efforts, RN1 would not have existed and all of us would have suffered greatly on other VLSI projects.

Along with Knight, Alex Ishii and Thomas Simon have been invaluable in the effort to understand and develop high-performance signalling strategies. While Simon's viewpoint differs greatly from mine on most issues, the alternative perspective has generally been healthy and produced greater understanding. The packaging effort would remain arrested in the conceptual stage without the efforts of Fred Drenckhahn. Frederic Chong and Eran Egozy provided the cornerstone for our network organization evaluations and developments. Their efforts took vague notions and turned them into quantifiable entities which allowed us greater insight. Eran Egozy spearheaded the original METRO effort which also included Minsky, Simon, Samuel Peretz, and Matthew Becker.

Simon helped shape the MBTA effort. The MBTA effort has also benefitted from contributions by Minsky, Timothy Kutscha, David Warren, and Ian Eslick. Patrick Sobalvarro, Michael Bolotski, Neil Brock, and Saed Younis have also offered notable help during the course of this effort.

This work has also been shaped by numerous discussions with people in “rival” research groups here at MIT. Notably, discussions with Anant Agarwal, John Kubiatowicz, David Chaiken, and Kirk Johnson in the MIT Alewife Project have been quite useful. William Dally, Michael Noakes, Ellen Spertus, Larry Dennison, and Deborah Wallach of the Concurrent VLSI Architecture group have all provided many useful comments and feedback. Noakes and George Andy Boughton have provided much valuable technical support during the life of this project. I am also indebted to William Weihl, Gregory Papadopoulos, and Donald Troxel for their direction.

Our productivity during this effort has been enhanced by the availability of software in source form. This has allowed us to customize existing tools to suit our novel needs and correct deficiencies. Notably, we have benefitted from software from the GNU Project, the Berkeley CAD group, and the Active Messages group at Berkeley.

We have also been fortunate that several companies offer generous university programs through which they have made tools available to us. Intel has been most helpful in providing a rich suite of tools for working with the 80960 microprocessor series in source form. Actel provided sufficient discounts on their FPGA tools to make their FPGAs useful for our prototyping efforts. Exemplar has also provided generous discounts on their synthesis tools. Cadence provided us with Verilog-XL for simulation. Logic Modeling Corporation has made their library of models for Verilog available. MetaSoftware provided us with HSpice. Many of our efforts could not have succeeded without the support of these industrial strength tools.

This research is supported in part by the Defense Advanced Research Projects Agency under contracts N00014-87-K-0825 and N00014-91-J-1698. This material is based upon work supported under a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation.

Fabrication of printed-circuit boards and integrated circuits was provided by the MOSIS Project. We are greatly indebted to the careful handling and attention they have given to our projects. Particularly we are grateful for the help provided by Terry Dosek, Wes Hansford, Sam DelaTorre, and Sam Reynolds.

List of Figures

1.1 16×16 Multibutterfly Network
1.2 Area-Universal Fat-Tree with Constant Size Switches
1.3 Cross-Section of Stack Packaging
2.1 Multiprocessor Model
2.2 Standard IEEE TAP and Scan Architecture
3.1 Fully Connected Networks
3.2 Full 16×16 Crossbar
3.3 Distributed 16×16 Crossbar
3.4 Hypercube
3.5 Mesh – k-ary-n-cube with k = 2
3.6 Cube – k-ary-n-cube with k = 3
3.7 Torus – k-ary-n-cube with k = 2 and Wrap-Around Torus Connections
3.8 16×16 Omega Network Constructed from 2×2 Crossbars
3.9 16×16 Bidelta Network
3.10 Benes Network
3.12 Express Cube Network – k = 2
3.13 Replicated Multistage Network
3.14 Extra Stage Network
3.15 4×2 Crossbar with a dilation of 2
3.16 16×16 Multibutterfly Network with Radix-4 Routers in Final Stage
3.17 Left: Non-expansive Wiring of Processors to First Stage Routing Elements
3.18 Right: Expansive Wiring of Processors to First Stage Routing Elements
3.19 Pseudo-code for Deterministic Interwiring
3.20 16×16 Path Expansion Multibutterfly Network
3.21 Pseudo-code for Random Interwiring
3.22 Randomly-Interwired Network
3.23 Randomized Maximal-Fanout
3.24 Pseudo-code for Random, Maximal-Fanout Interwiring
3.25 16×16 Randomized, Maximal-Fanout Network
3.26 Completeness of (A) 3-stage and (B) 4-stage Multipath Networks
3.27 Comparative Performance of 3-Stage and 4-Stage Networks
3.28 Chong’s Fault-Propagation Algorithm for Reconfiguration
3.29 Fault-Propagation Node Loss and Performance for 1024-Node Systems
3.30 Cross-Sectional View of Up Routing Tree and Crossover
3.31 Connections in Down Routing Stages (left)
3.32 Up Routing Stage Connections with Lateral Crossovers (right)
3.33 Multibutterfly Style Cluster at Leaves of Fat-Tree
4.1 METRO Routing Protocol in the context of the ISO OSI Reference Model
4.2 Basic Router Configuration
4.3 MRP-ROUTER Connection States
4.5 Successful Route through Network
4.6 Connection Blocked in Network
4.7 Dropping a Network Connection
4.8 Reversing an Open Network Connection
4.9 Reversing a Blocked Network Connection
4.10 Reverse Connection Turn
4.11 Blocked Paths in a Multibutterfly Network
4.12 Example of Fast Path Reclamation
4.13 Backward Reclamation of Connection Stuck Open
4.14 Example Connection Open with Pipelined Routers
4.15 Example Turn with Pipelined Routers
4.16 Example of Pipelined Connection Setup
4.17 Example Turn with Wire Pipelining
4.18 Cascaded Router Configuration using Four Routing Elements
5.1 Mesh of Gridded Scan Paths
5.2 Scan Architecture for Dual-TAP Component
5.3 Propagating Reconfiguration Example
5.4 Propagating Reconfiguration Example
6.1 Initial Transmission Line Voltage Profile
6.2 Transmission Line Voltage: Open Circuit Reflection
6.3 Transmission Line Voltage: Z_term > Z_0 Reflection
6.4 Transmission Line Voltage: Matched Termination
6.5 Transmission Line Voltage: Z_term < Z_0 Reflection
6.6 Transmission Line Voltage: Short Circuit Reflection
6.7 Parallel Terminated Transmission Line
6.8 Serial Terminated Transmission Line
6.9 CMOS Transmission Line Driver
6.10 Functional View of Controlled Output Impedance Driver
6.11 CMOS Driver with Voltage Controlled Output Impedance
6.12 CMOS Driver with Digitally Controlled Output Impedance
6.13 CMOS Driver with Separate Impedance and Logic Controls
6.14 Controlled Impedance Driver Implementation
6.15 CMOS Low-voltage Differential Receiver Circuitry
6.16 CMOS Low-voltage, Differential Receiver Implementation
6.17 Bidirectional Pad Scan Architecture
6.18 Sample Register
6.19 Driver and Receiver Configuration for Bidirectional Pad
6.20 Ideal Source Transition
6.21 More Realistic Source Transitions
6.22 Impedance Selection Algorithm (Outer Loop)
6.23 Impedance Selection Algorithm (Inner Loop)
6.24 Impedance Matching: 6 Control Bits
6.25 Impedance Matching: 3 Control Bits
6.26 100 Ω Impedance Matching: 6 Control Bits
6.27 Multiplexor Based Variable Delay Buffer
6.28 Voltage Controlled Variable Delay Buffer
6.29 Adjustable Delay Bidirectional Pad Scan Architecture
6.30 Sample Register with Selectable Clock Input
6.31 Sample Register with Recycle Option
6.32 Sample Register with Overlapped Recycle
7.1 Stack Structure for Three-dimensional Packaging
7.2 DSPGA372
7.3 DSPGA372 Photos
7.4 BB372
7.5 Button
7.6 Cross-section of Routing Stack
7.7 Close-up Cross-section of Mated BB372 and DSPGA372 Components
7.8 Sample Clock Fanout on Horizontal PCB
7.9 Mapping of Network Logical Structure onto Physical Stack Packaging
7.10 Two Level Hollow-Cube Geometry
7.11 Two Level Hollow Cube with Top and Side Stacks of Different Sizes
7.12 Three Level Hollow Cube
8.1 RN1 Logical Configurations
8.2 RN1 Micro-architecture
8.3 Packaged RN1 IC
10.1 MBTA Routing Network
10.2 MBTA Node Architecture
11.1 MLINK Message Formats
12.1 Routing Board Arrangement for 64-processor Machine
12.2 Packaged MBTA Node
12.3 Layer of Packaged Nodes
12.4 Exploded Side View of 64-processor Machine Stack
12.5 Side View of 64-processor Machine Stack
A.1 Applications on 3-stage Random Networks
A.2 Applications on the 3-stage Deterministic Network
A.3 Comparative Performance of 3-Stage Networks
A.4 Comparative Performance of 4-Stage Networks

List of Tables

3.1 Network Comparison
3.2 Network Construction Parameters
3.3 Connections into Each Stage
3.4 Fault Tolerance of Multipath Networks
4.1 Control Word Encodings
6.1 Representative Sample Register Data
7.1 DSPGA372 Physical Dimensions
7.2 BB372 Physical Dimensions
7.3 Unit Tree Parameters
7.4 Unit Tree Component Summary
9.1 METRO Architectural Variables
9.2 METRO Router Configuration Options
A.1 Relative Transaction Frequencies for Shared-Memory Applications
A.2 Split Phase Transactions and Grain Sizes for Shared-Memory Applications
A.3 Message Lengths for Shared-Memory Applications

Part I

Introduction and Background

1. Introduction

The high capabilities and low costs of modern microprocessors have made it attractive from both economic and performance viewpoints to design and construct large-scale multiprocessors based on commodity processor technologies. Nonetheless, many challenges remain to effectively realize the potential performance promised by large-scale multiprocessing on a wide range of applications.

One key challenge is to provide sufficient inter-processor communication performance to allow efficient multiprocessor operation – and to provide such performance at a reasonable cost.

In order for processors to work effectively together in a computation, they must be able to communicate data with each other in a timely fashion. The exact nature and role of communication varies with the particular programming model, but the need is pervasive. Virtually all paradigms for parallel processing depend critically on low communication latency to effectively exploit parallel execution to reduce total execution time. Communication latency is a critical determinant of the amount of exploitable parallelism and the cost of synchronization. For shared-memory algorithms, latency affects the speed of cache-replacement and coherency operations. In message-passing programs, latency affects the delay between message transmission and reception. In dataflow programs, latency determines the delay between the computation of a data value and the time when the value can actually be used. Data parallel operations are limited by the rate at which processors can obtain access to the data on which they need to operate.

Multithreaded ([Smi78] [Jor83] [ALKK90] [SBCvE90] [CSS⁺91] [NPA92]) and dataflow ([ACM88] [AI87] [PC90]) architectures have been developed to mitigate communication latency by hiding its effects. These techniques all rely on an abundance of parallelism to provide useful processing to perform while waiting on slow communications. The limit to the usable parallelism, then, is determined by the nature of the problem and the algorithm used to solve it, the rate of computation on each processor, and the communication latency. Our challenge today is to provide sufficiently low-latency communications to match the computation rate provided by commodity processors while allowing the most effective use of the parallelism inherent in each problem.

Regardless of the exact network topology used for communications, both the number of switching components and the amount of wiring inside the network are at least linear in the number of processors supported by the network. The single component failure rate is also linear in the network size. If we do not engineer the network to operate properly when faults exist, the acceptable failure rate for any system will directly fix a ceiling on the maximum machine size. To avoid this ceiling we consider network designs which can operate properly in the presence of faults.
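The ceiling on machine size described above can be illustrated with a short calculation. The sketch below is only a first-order model under assumptions that are not taken from this report: component faults are independent, every component has the same fault probability, and any single fault disables the machine. The value of `p` in the demonstration is hypothetical.

```python
import math


def prob_any_fault(num_components: int, p_fault: float) -> float:
    """Probability that at least one of num_components is faulty,
    assuming independent faults with probability p_fault each."""
    return 1.0 - (1.0 - p_fault) ** num_components


def max_size_without_fault_tolerance(p_fault: float, acceptable: float) -> int:
    """Largest component count whose any-fault probability stays at or
    below `acceptable` when a single fault is fatal to the system."""
    return math.floor(math.log(1.0 - acceptable) / math.log(1.0 - p_fault))


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-component fault probability (assumption)
    for n in (100, 1_000, 10_000, 100_000):
        print(f"{n:>7} components: P(any fault) = {prob_any_fault(n, p):.4f}")
    print("size ceiling at 1% acceptable failure:",
          max_size_without_fault_tolerance(p, 0.01))
```

Because the component count grows at least linearly with the number of processors, the any-fault probability approaches one as the machine grows, which is why a network that tolerates faults is needed to lift the ceiling.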

In this document, we examine a class of processor interconnection networks which are designed to simultaneously minimize network latency while maximizing fault tolerance. A combination of organizational techniques, protocols, circuit techniques, and packaging technologies is employed to realize a class of integrated solutions to these problems.

1.1 Goals

Our goals in designing a high-performance network for large-scale multiprocessing are to optimize for:

Low Latency

High Bandwidth

High Reliability

Testability/Repairability

Scalability

Flexibility/Versatility

Reasonable Cost

Practical Implementation

As suggested above and developed further in Sections 2.3 and 2.5, latency and reliability are key properties which must be considered when designing a large-scale, high-performance multiprocessor network. Insufficient bandwidth will have a detrimental impact on latency (Section 2.4). Fault diagnosis and repair are key to limiting the impact of any faults in the network (Section 2.6). Scalability of the solution is important to maximize the period over which the solutions remain effective. Flexibility in the solutions allows the class of networks to remain applicable across a wide range of specific needs (Section 2.8).

1.2 Scope

This work only attempts to address issues directly related to the network for a large-scale multiprocessor. Attention is paid to providing efficient and robust interfaces between processing nodes and the network. Attention is also given to how the node interacts with the network. However, the fault-tolerance schemes presented here do not guard against failures of the processing nodes or in the memory system. The scheme detailed here may be suitable for a reliable network substrate for future work in processor and memory fault recovery.

1.3 Overview

In this section, we provide a quick overview of the network design at several levels. This section should give the reader a basic picture of the class of networks and technologies being considered.

Part II develops everything introduced here in detail.

A multibutterfly style interconnection network constructed from 4×2 (inputs × radix) dilation-2 crossbars and 2×2 dilation-1 crossbars. Each of the 16 endpoints has two inputs and outputs for fault tolerance. Similarly, the routers each have two outputs in each of their two logical output directions. As a result, there are many paths between each pair of network endpoints. Paths between endpoint 6 and endpoint 16 are shown in bold.

Figure 1.1: 16×16 Multibutterfly Network

1.3.1 Topology

A suitable network topology is the first essential ingredient to producing a reliable, high-performance network. The network topology will ultimately dictate:

Switching Latency – the number of switches, and to some extent the length of the wires, which must be traversed between nodes in the network

Underlying Reliability – the redundancy available to make fault-tolerant operation possible

Scalability – the characteristic growth of resource requirements with system size

Versatility – the extent to which the network can be adapted to a wide range of applications.

To simultaneously optimize these characteristics, we utilize multipath, multistage interconnection networks based on several key ideas from the theoretical community including multibutterflies [Upf89] [LM92] and fat trees [Lei85].

Using multibutterfly (See Figure 1.1) and fat-tree networks (See Figure 1.2), we minimize the number of routing switches which must be traversed in the network between any pair of nodes.

Figure 1.2: Area-Universal Fat-Tree with Constant Size Switches (Greenberg and Leiserson)

Using bounded degree routing nodes, the least possible number of switches between endpoints is logarithmic in the size of the network, a lower bound which these networks achieve. For small machine configurations the multibutterfly networks achieve the logarithmic lower bound with a multiplicative constant of one (e.g. routing switches traversed = $\log_r N$, where $N$ is the number of processing nodes in the network and $r$ is the radix of the routing component used for switching). For larger machine configurations, fat trees provide lower latency for local communication. Applications can take advantage of the locality inherent in the fat-tree topology to realize lower average communication latencies. To further minimize switching latency, our fat-tree networks make use of short-cut paths, keeping the worst-case switching latency down to $\frac{4}{3}\log_4 N$ when using radix-four routing components.
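As a quick numerical illustration of the two switching-latency expressions above, the following sketch evaluates the multibutterfly stage count and the fat-tree worst case of four-thirds of the base-4 logarithm of the node count. The function names are illustrative only, and the integer rounding-up for node counts that are not exact powers of the radix is an assumption of this sketch rather than something stated in the text.

```python
import math


def switches_traversed(num_nodes: int, radix: int) -> int:
    """Number of stages s such that radix**s >= num_nodes, i.e. the
    ceiling of log_r N, computed with integers to avoid rounding error."""
    stages, reach = 0, 1
    while reach < num_nodes:
        reach *= radix
        stages += 1
    return stages


def fat_tree_worst_case(num_nodes: int) -> float:
    """Worst-case switching latency (4/3) * log_4 N for fat trees with
    short-cut paths built from radix-four routing components."""
    log4_n = math.log2(num_nodes) / 2.0
    return 4.0 * log4_n / 3.0


if __name__ == "__main__":
    for n in (16, 64, 1024, 4096):
        print(f"N={n:>5}: radix-2 stages={switches_traversed(n, 2)}, "
              f"radix-4 stages={switches_traversed(n, 4)}, "
              f"fat-tree worst case={fat_tree_worst_case(n):.2f}")
```

The comparison shows the trade the text describes: the multibutterfly has the smaller constant, while the fat tree accepts a somewhat larger worst case in exchange for locality and scalable wiring.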

The multipath nature of these routing networks provides a basis for fault-tolerant operation, as well as providing high bandwidth operation. The multipath networks provide multiple, redundant paths between every pair of processing nodes. The alternative paths are also available for minimizing congestion within the network, resulting in increased effective bandwidth and decreased effective latency. When faults occur, the availability of alternative paths between endpoints makes it possible to route around faulty components in the network.

A high degree of scalability is achieved by using fat-tree organizations for large networks. The scalable properties of fat trees allow construction of arbitrarily large machines using the same basic network architecture. When organized properly, these large fat trees can be shown to minimize the total length of time that any message spends traversing wires within the routing network as compared to any other network. The hardware resources required for the fat-tree network grow linearly in the number of processors supported.

Further, these networks provide considerable versatility allowing them to be adapted to meet the specific needs of a particular application. By selecting the number of network ports into each processing node, we can customize the bandwidth and reliability within the network to meet the needs of the application. By controlling the width of the basic data channel, we can provide varying amounts of latency and bandwidth into a node. This flexibility makes it possible to use the same basic network solutions across a broad range of machines from low-cost workstations to high-bandwidth supercomputers by selecting the network parameters appropriately.
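The effect of channel width on latency can be sketched with a simple first-order model. The model below is an assumption made for illustration, not a formula from this report: an unloaded, pipelined network delivers a message in a number of clock cycles equal to the pipeline depth plus the number of channel-width words needed to carry the message. All parameter names are hypothetical.

```python
def transfer_cycles(message_bits: int, channel_width_bits: int,
                    pipeline_stages: int) -> int:
    """Cycles to deliver a message under the first-order model:
    pipeline fill time plus one cycle per channel-width word."""
    words = -(-message_bits // channel_width_bits)  # ceiling division
    return pipeline_stages + words


def peak_bandwidth_bits_per_cycle(channel_width_bits: int, ports: int) -> int:
    """Peak bits per cycle into a node with `ports` network ports."""
    return channel_width_bits * ports


if __name__ == "__main__":
    for width in (4, 8, 16, 32):
        print(f"width={width:>2} bits: "
              f"{transfer_cycles(160, width, 4)} cycles for a 160-bit message, "
              f"{peak_bandwidth_bits_per_cycle(width, 2)} bits/cycle with 2 ports")
```

Under this model, widening the channel shortens the serialization term but leaves the pipeline depth unchanged, so the benefit diminishes for short messages.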

1.3.2 Routing

While a good network topology is necessary for reliable, high-performance communications, it is by no means sufficient. We must also have a routing scheme capable of efficiently exploiting the features of the network. In developing a routing strategy for use with multiprocessor communications networks, we focussed on achieving a routing framework with the following properties:

1. Low-overhead routing – Low-overhead routing attempts to minimize the fraction of potential bandwidth consumed by protocol overhead and similarly minimize the latency associated with protocol processing.

2. Fault identification and localization with minimal overhead – To achieve fault tolerance, we must be able to detect when faults corrupt data in our system. Further, to minimize the impact of faults on system performance, we must be able to efficiently identify the source of any faults in the system.

3. Flexible protocol – To be suitable for use in a wide range of applications and environments, the protocol must be flexible allowing efficient layering of the required data transfer on top of the underlying communications.

4. Dynamic fault tolerance – For the network to scale robustly to very large implementations, it is critical that the network and routing components continue to operate properly as new faults arise in the system.

5. Distributed routing – In order to avoid single points of failure in the system, routing must proceed in a distributed fashion, requiring the correct operation of no central resources.

To this end, we have developed the METRO Routing Protocol, MRP, a simple, reliable, source-responsible router protocol suitable for use with multipath networks. MRP provides half-duplex, bidirectional data transmission over pipelined, circuit-switched routing channels. The simple protocol coupled with pipelined routing allows for high-bandwidth, low-latency implementations. The circuit-switched nature avoids the issues associated with buffering inside the network. Each routing component makes local routing decisions among equivalent outputs based on channel utilization, using randomization to choose among equivalent alternatives. Routing components further provide connection information and checksums back to the source node to allow error localization within the network. When errors or blocking occur, the source can retry data transmission. The randomization in path selection guarantees that any existing non-faulty path can eventually be found without global information.
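The interaction of random output selection and source-responsible retry can be sketched with a toy model. This is not the MRP implementation: the network is reduced to a list of stages, each offering a set of equivalent outputs whose states are fixed in advance, and the state names, function names, and retry limit are all assumptions made for illustration.

```python
import random


def try_route(stages, rng):
    """Attempt one connection through the toy network.

    stages[i] lists the states of the equivalent outputs at stage i:
    "ok", "busy", or "faulty". Each router picks at random among outputs
    that are not busy. Returns (succeeded, failing_stage_or_None).
    """
    for i, outputs in enumerate(stages):
        free = [state for state in outputs if state != "busy"]
        if not free:
            return False, i  # every equivalent output in use: blocked
        if rng.choice(free) == "faulty":
            return False, i  # data corrupted; the source would see a bad checksum
    return True, None


def send_with_retry(stages, rng, max_attempts=1000):
    """Source-responsible retry: repeat until a route succeeds.
    Returns the number of attempts used, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        succeeded, _ = try_route(stages, rng)
        if succeeded:
            return attempt
    return None


if __name__ == "__main__":
    network = [["ok", "faulty"], ["faulty", "ok"], ["ok", "ok"]]
    rng = random.Random(0)
    print("attempts needed:", send_with_retry(network, rng))
```

Because each retry draws a fresh random path, a fault-free path that exists is found with probability approaching one as the number of attempts grows, and no router needs global knowledge of which links are faulty.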

Figure 1.3: Cross-Section of Stack Packaging (Diagram courtesy of Fred Drenckhahn). Labels in the diagram: spacer, aluminum plates, window frame, heatsink, bus bar, debug connector, manifold, (1v, 5v, gnd), cover, vertical clock driver, horizontal clock driver, sma, horizontal board, button board, RN1 component.

1.3.3 Technology

Regardless of the advances we make in topology and routing, the ultimate performance of an implementation is limited by the implementation technology. Packaging density constrains the minimum lengths for interconnect and hence the minimum latency between routing components and nodes. Once our interconnection distances are fixed, data transmission latency is limited by the time taken to traverse the interconnect and to traverse component i/o pads.

Packaging

Our goal in packaging these networks is to minimize the interconnection distances between components. At the same time, we aim to utilize economical technologies and provide efficient cooling and repair of densely packaged components. The basic packaging unit is a three-dimensional stack of components and printed-circuit boards (see Figure 1.3). Computational, memory, and routing components are housed in dual-sided land-grid arrays and sandwiched between layers of conventional PCBs. The land-grid arrays, with pads on both sides of the package, serve to both house VLSI components and provide vertical interconnect in the stack structure. Button boards are used to provide reliable, solderless connection between land-grid array packages and adjacent PCBs. The land-grid array and button board packages provide channels for coolant flow. The composite stack structure is compatible with both air and liquid cooling. The stack structure provides the necessary dense interconnection in all three physical dimensions allowing for minimal wiring distances between components. Using this technology, we can package an entire 64-node multiprocessor including the network and nodes in roughly 1′ × 1′ × 5″.

Signalling

To minimize wire transit and component i/o time, we utilize series-terminated, matched-impedance, point-to-point transmission line signalling. Further, to reduce power consumption, the i/o structures use low-voltage signal swings. By integrating a series-terminated transmission line driver into the i/o pads, we avoid the need to wait for reflections to settle on the PCB traces without requiring additional external components. The low-voltage, series-terminated drivers can switch much faster than conventional 5V-swing drivers. Initial experience with this technology indicates we can drive a signal through an output pad, across 30 cm of wire, and into an input pad in less than 5 ns.
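To illustrate why a matched series termination removes the need to wait for reflections, the sketch below steps through a textbook reflection (lattice) calculation. It assumes an ideal lossless line, an ideal open-circuit (high-impedance) receiver, and illustrative values of 50 Ω and a 1 V swing; none of these numbers come from the measurements above, and the function name is hypothetical.

```python
def receiver_voltages(v_drive, r_source, z0, arrivals=4):
    """Voltage at an open-circuit receiver after each successive wave
    arrival (at 1, 3, 5, ... one-way flight times)."""
    gamma_load = 1.0                                  # open end reflects fully
    gamma_source = (r_source - z0) / (r_source + z0)  # 0 when matched
    wave = v_drive * z0 / (r_source + z0)             # initially launched wave
    v = 0.0
    history = []
    for _ in range(arrivals):
        v += wave * (1.0 + gamma_load)   # incident plus reflected wave
        history.append(v)
        wave *= gamma_load * gamma_source  # what returns after a round trip
    return history

if __name__ == "__main__":
    # Matched driver: half-amplitude wave doubles at the receiver, then
    # the source absorbs the reflection.
    print(receiver_voltages(1.0, 50.0, 50.0))
    # Under-terminated driver: the receiver overshoots and rings.
    print(receiver_voltages(1.0, 10.0, 50.0))
```

In the matched case the receiver reaches its final value on the first arrival, so the link delay in this idealized model is a single wire transit rather than several round trips.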

1.3.4 Fault Management

Performance in the presence of faulty components and wires can be further improved by hiding the effects of faulty components. Using some novel, fault-tolerant additions to baseline IEEE 1149.1-1990 JTAG scan functionality, we can realize an effective scan-based testing strategy. By configuring components with multiple test-access ports, the architecture is resilient to faults in the test system itself. With port-by-port deselection and scan capabilities, it is possible to diagnose potentially faulty network components online, i.e., while the rest of the system remains fully operational. Furthermore, these facilities allow faulty wires and components to be configured out of the system so that they do not degrade system performance. Once faults have been localized using boundary scan, the system can log faulty components for later repair and make an accurate assessment of the system integrity. For larger systems, these facilities allow online replacement of faulty subsystems.

1.4 Organization

Before developing strategies for addressing these problems, Chapter 2 develops the problems and issues in further detail. Part II takes a detailed look at the key components of robust, low-latency networks. Chapter 3 leads off by examining the network topology. Chapter 4 addresses the issue of low-latency, high-speed, reliable routing on the networks introduced in Chapter 3. Chapter 5 considers fault identification and system reconfiguration. Chapter 6 develops suitable, high-speed signalling techniques compatible with the router-to-router communications required by the routing protocol. Finally, Chapter 7 looks at packaging technologies for practical, high-performance networks. Part III contains a brief series of case studies from our experience designing and building reliable, low-latency networks. Chapter 8 reviews the RN1 routing component. Chapter 9 discusses RN1’s successor, the METRO router series. Chapter 11 describes METRO-LINK, a network interface suitable for connecting a processing node into a METRO-based network. Finally, Chapters 10 and 12 discuss MBTA, an experimental multiprocessor which puts most of the technology described in Part II and the components detailed in Part III together in a complete multiprocessor system.

Chapter 13 concludes by reviewing the techniques introduced in Part II and showing how they come together to achieve low-latency and fault-tolerant operation.


2. Background

This chapter provides background material to prepare the reader for the development in Parts II and III. Section 2.1 describes the fault model and multiprocessor model assumed throughout this document. Section 2.2 provides a brief review of standard scan-based testing practices. Sections 2.3 and 2.5 point out the importance of low latency and fault tolerance to large-scale multiprocessor systems. Section 2.4 reviews the composition of network latency. Section 2.6 looks at the requirements for fault tolerance. Finally, Sections 2.7 and 2.8 introduce several other key issues in the practical design of interconnection networks.

2.1 Models

2.1.1 Fault Model

Faults occurring in a network may be either static or dynamic and may be transient faults or permanent faults. While a permanent fault occurs and remains a fault, a transient fault may only persist for a short period of time. Transient faults which recur with notable frequency are termed intermittent. [SS92] indicate that transient and intermittent faults account for the vast majority of faults which occur in computer systems. For the purposes of this presentation, static faults are permanent or intermittent faults which have occurred at some point in the past and are known to the system as a whole. Dynamic faults are transient faults or any faults which the system has not yet detected.

Throughout this work, we assume that faults manifest themselves as:

1. Stuck-Values – a data or control line appears to be held exclusively high or low

2. Random bit flips – a data or control line has some incorrect, but random value

Faults may appear and disappear at any point in time. They may become permanent and remain in the system, they may be transient and disappear, or they may be intermittent and recurring.

Stuck-value errors may take on an arbitrary, but constant, logic value. Bit flips are assumed to take on random values. Specifically, we are not assuming an adversarial fault model (e.g. [MR91]) in which faulty portions of the system are allowed to take on arbitrary erroneous values.
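The two fault manifestations can be sketched as simple per-line fault functions. This is an illustrative model only; the helper names (`stuck_at`, `random_value`, `transmit`) and the representation of a word as a list of bits, one per wire, are assumptions made for this example.

```python
import random

def stuck_at(value):
    """Stuck-value fault: the line reads as a constant logic value."""
    return lambda bit, rng: value

def random_value():
    """Random fault: the line reads as a uniformly random logic value,
    independent of what was driven (non-adversarial)."""
    return lambda bit, rng: rng.randrange(2)

def transmit(word, faults, rng):
    """Send one word (list of bits, one per wire) across a set of wires.
    `faults` maps a wire index to the fault function on that wire."""
    return [faults[i](b, rng) if i in faults else b
            for i, b in enumerate(word)]

if __name__ == "__main__":
    rng = random.Random(1)
    faults = {0: stuck_at(0), 2: random_value()}
    print(transmit([1, 0, 1, 1], faults, rng))
```

Because the random fault draws its value without looking at the rest of the message, it cannot deliberately forge a consistent message, which is the distinction from the adversarial model drawn above.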

These fault manifestations are chosen to be consistent with fault expectations in digital hardware systems. Structural faults in the interconnect between components may give rise to floating or shorted nodes. With proper electrical design, floating i/o’s can appear as stuck-values to internal logic. The value observed on shorted nodes depends on the values driven onto them and may appear as random bit flips when those values differ. Clocking, timing, and noise problems which cause incorrect data to be sampled by a component will also appear as random bit errors. Opens and bridging faults within an IC may also leave nodes shorted or floating. For a good survey of physical faults and their manifestations see Chapter 2 in [SS92].


The manner in which we handle dynamic faults in this work relies on end-to-end checksums to make the likelihood that a corrupted message looks like a good message arbitrarily small. As long as faults produce random data, we can select a checksum which has the desired property. However, if we allow arbitrary, malicious intervention as in an adversarial fault model, the adversary could remove a corrupted message from the network and replace it with one which looks good, or remove a good message from the network and fake an acknowledgment. In order to handle this stronger fault model, one would have to replace our practice of guarding data with checksums with an end-to-end data encryption scheme. A properly chosen encryption scheme could make the chances that an adversary could fake any message sufficiently remote for any particular application.
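As a hedged illustration of guarding data with an end-to-end checksum, the sketch below appends a CRC-32 to each payload using Python's standard `zlib.crc32`. CRC-32 is only a stand-in chosen for this example; the text above does not specify which checksum is used. The estimate in `undetected_probability` assumes corruption produces uniformly random data, as in the fault model above, and the function names are hypothetical.

```python
import zlib

def guard(payload: bytes) -> bytes:
    """Append a 32-bit checksum to the payload before sending."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(message: bytes) -> bool:
    """Receiver-side test: recompute the checksum and compare."""
    payload, tag = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tag

def undetected_probability(checksum_bits: int) -> float:
    """Rough chance that uniformly random garbage passes a k-bit check."""
    return 2.0 ** -checksum_bits

if __name__ == "__main__":
    msg = guard(b"route me")
    corrupted = bytearray(msg)
    corrupted[0] ^= 0x01              # a single flipped bit in transit
    print(check(msg), check(bytes(corrupted)))
    print(undetected_probability(32))
```

Widening the checksum shrinks the undetected-error estimate geometrically, which is the sense in which the likelihood can be made arbitrarily small against random faults; it offers no protection against an adversary who can recompute the checksum.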

For the sake of the presentation here, we limit our concern to faults within the network itself.

The processing nodes are presumed to function correctly, if at all. A processing node may cease to function, but it may not provide erroneous data to the network. All network transactions requested by the node are presumed to be intentional. The computational implications of losing access to an ongoing computation or the memory stored at a failing node are important but beyond the scope of this work.

Without knowing the reliability design of the computational system as a whole, it is not clear whether a fault-tolerant network should be designed to optimize for harvest or yield. Yield is the term used to describe the likelihood that the system can be used to complete a given task. If we require that all nodes be fully connected to the network, then designing the network is a yield problem in which the network is only considered good when it provides full connectivity. In this case, we want to optimize for the highest yield at the fault levels of interest. Harvest rate is the term used to refer to the fraction of the total functional units which are usable in a system. If the computational model can cope with node loss, then designing the network is a harvest problem in which we attempt to optimize for the most connectivity at any fault level.
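The difference between the two metrics can be made concrete with a small sketch. It assumes each trial (for instance, one randomly faulted network instance) is summarized as a list of booleans saying whether each node is still connected; how that connectivity is determined is outside this example, and the function names are hypothetical.

```python
def harvest_rate(connected):
    """Fraction of nodes that remain usable in one faulted network."""
    return sum(connected) / len(connected)

def summarize(trials):
    """Return (yield, mean harvest rate) over a set of trials.
    Yield counts a trial as good only if every node is connected."""
    network_yield = sum(all(t) for t in trials) / len(trials)
    mean_harvest = sum(harvest_rate(t) for t in trials) / len(trials)
    return network_yield, mean_harvest

if __name__ == "__main__":
    trials = [
        [True, True, True, True],    # full connectivity
        [True, True, True, False],   # one node isolated
    ]
    print(summarize(trials))  # -> (0.5, 0.875)
```

In this example a single isolated node halves the yield while the harvest rate stays high, which is why the choice of metric depends on whether the computational model can tolerate node loss.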

2.1.2 Multiprocessor Model

For the purpose of discussion, we assume a homogeneous, distributed-memory multiprocessor model as shown in Figure 2.1. Each node is composed of a processor, some memory, and a network interface. In a hardware-supported shared-memory machine, this network interface might be the cache-controller [LLG⁺91] [ACD⁺91]; in a message-passing machine, it would be the network message interface [Cor91] [Thi91]. Increasingly, the network interface may be tightly integrated with the processor [D⁺92] [NPA91]. We explicitly assume the network interface has multiple connections both into the network and out of the network. Multiple connections are necessary to avoid having a potential single point of failure at the connection between each node and the network.

2.2 IEEE-1149.1-1990 TAP

In Part II, we introduce extensions to standard, scan-based testing practices to make them suitable for use in large-scale systems. This section reviews the major points of the existing standard upon which we are building.

Figure 2.1: Multiprocessor Model. Diagram labels: Network; nodes each containing a Processor, Memory, and Network Interface.

The IEEE Standard Test-Access Port (TAP) [Com90] defines a serial test interface requiring four dedicated I/O pins on each component. The standard allows components to be daisy-chained so that a single test path can provide access to many or all components in a system. The standard provides facilities for external boundary-scan testing, internal component functional testing, and internal scan testing. Additionally, the TAP provides access to component-specific testing and configuration facilities. Figure 2.2 shows the basic architecture for an IEEE scan-based TAP.

Figure 2.2: Standard IEEE TAP and Scan Architecture. Diagram labels: TDI, TDO, TMS, TCK, TAP Controller, Instruction Register, Instruction Decode, Boundary Register, Scan Register, Bypass Register, Mux.
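The daisy-chain arrangement can be sketched as one long shift register running from TDI to TDO. The example assumes every component has selected its bypass register, which the standard defines as a single bit, and it ignores the TAP controller state machine and clock-edge timing; `shift_chain` is a hypothetical helper, not part of any JTAG library.

```python
def shift_chain(registers, tdi_bits):
    """Shift bits through daisy-chained components.

    `registers` lists each component's currently selected register as a
    list of bits, ordered from the TDI end to the TDO end. One bit of
    `tdi_bits` enters per TCK cycle. Returns (bits seen at TDO, final
    chain contents)."""
    chain = [b for reg in registers for b in reg]   # concatenate registers
    tdo_bits = []
    for bit in tdi_bits:
        tdo_bits.append(chain[-1])       # last stage drives TDO
        chain = [bit] + chain[:-1]       # everything moves one stage along
    return tdo_bits, chain

if __name__ == "__main__":
    # Three components, all in bypass (one bit each, initially 0).
    tdo, chain = shift_chain([[0], [0], [0]], [1, 0, 0, 0])
    print(tdo)    # -> [0, 0, 0, 1]
```

A bit presented at TDI emerges at TDO after one shift per component in the chain, so bypassing uninteresting components keeps the serial path to a target component short.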

In a system in which all components comply with the standard, boundary-scan testing allows complete structural testing. Using the serial scan path, every I/O pin in the system can be configured
