MEMORY BUSES - BASIC SYSTEM - I ERROR I DETECTION

I ERROR I DETECTION

BUS 0 BASIC SYSTEM

2 MEMORY BUSES

2 MCUs

2 MEMORY BUSES

Figure 4-11. Flexible System ^F-0439

Figure 4-12 shows a system capable of a comprehensive deferred maintenance strategy. The system is using all of the detection mechanisms available in the BIU and MCU, and it has more than one of every resource in the system. When an error occurs, it will be detected and isolated by the hardware. Software can then rapidly reconfigure the system around the faulty module and be on-line again.

Overview of Fault Handling Mechanisms iAPX 432 Interconnect ARM

________ ~----__ ~ ______ ~---~--__ ~~~~2

LEGEND G - GDP B = BIU I ² IP R = MEMORY M = MCU

·SELF-HEALING SYSTEM LOGICAL CONFIGURATION

2 GDPs 2 IPs

2 MEMORY ARRAYS 2 MEMORY BUSES

PHYSICAL CONFIGURATION 4 GDPs

4 IPs 16 BlUs

2 40-BIT-WIDE RAM ARRAYS 4 MCUs

2 MEMORY BUSES

Figure 4-12. Self-Healing System F-0440

Figure 4-13 shows a small nonstop system. This system has been configured so that every resource in the system has a current and complete backup. The system can continue operation in the presence of any single failure. All of the detection and recovery mechanisms are enabled.

4-20

iAPX 432 Interconnect ARM Overview of Fault Handling Mechanisms

G - GDP B - BIU I - IP R - MEMORY K - KCU

SHALL NONSTOP SYSTEM

kOGICAL CONFIGURATION PHYSICAL CONFIGURATION 1 GDP

1 IP

1 MEMORY ARRAY 1 MEMORY BUS

4 GDPs 4 IPs 16 BIUs

2 40-BIT-WIDE RAM ARRAYS 4 KCUs

2 MEMORY BUSES

Figure 4-13. Small Nonstop System

Figure 4-14 shows a large nonstop system.

F-0441

Overview of Fault Handling Mechanisms

LEGEND G ... GDP

B ^{I :}BIU

I = IP R = MEMORY M = MCU

LARGE NONSTOP SYSTEM LOGICAL CONFIGURATION 4 GDPs

2 IPs

·12 MEMORY ARRAYS 4 MEMORY BUSES

iAPX 432 Interconnect ARM

PHYSICAL CONFIGURATION 16 GDPs

8 IPs 96 BIUs

24 40-BIT-WIDE RAM ARRAYS 48 MCUs

4 MEMORY BUSES

Figure 4-14. Large Nonstop System F-0442

EXAMPLE SYSTEM OPERATION

This summary provides an example of the operation sequence of a fault-tolerant machine. The example is intended to help tie together all of the mechanisms presented in the earlier portions of the document.

4-22

iAPX 432 Interconnect ARM Overview of Fault Handling Mechanisms

A DAY IN THE LIFE OF AN FT SYSTEM

1. The system is powered up, and the components receive the INIT signal. This places the components in a consistent state with each BIU and MCU that has a unique local address.

2. The initialization software sizes the system, runs confidence tests on all of the resources in the system, and reviews the past error history in the system health log. The software decides which resources are available for use in the system configuration.

3.

Based on the needs of the application and the available resources in the system, software decides on the optimal system configuration. Address ranges are assigned, modules are married, and buses are enabled with the correct redundant information. (See Chapter 6.) Timeout values are loaded, and any of the optional fault-tolerant capabilities are set to the desired state. (See Chapter 5.) The memory is loaded with the information required to run the basic operating system. The system is then ready for logical operation and is in a fully fault-tolerant state. The software that performs this step is probably split between the AP and the GDPs.

4. The system is passed to the operating system, and normal operation begins.

5. During normal operation a background task that is constantly checking for latent errors in the system is run. This software uses the various commands to the BIU and MCU to verify the correct operation of the detection and recovery mechanisms. (See the "Management and Testing" sections in Chapters 5 and 6.) This software also polls the error report logs to check on any transient error conditions that may have been reported. (See Chapters 7 and 8.) Infrequently used operations in the processors may also be tested during this time.

6. An error occurs and is detected by the hardware. (See Chapter 5.)

7. An error report message is broadcast throughout the system, identifying the type of error and the location at which the error was detected. (See Chapter 7.)

8. The system becomes quiescent and waits for any transient to subside. (See Chapter 8.)

9. All accesses outstanding in the system are retried. If the error does not recur, then normal operation will resume. If the error recurs, then the recovery operation proceeds to the next stage. (See Chapter 8.)

Overview of Fault Handling Mechanisms iAPX 432 Interconnect ARM

4-24

10. The error recurs, and an error report message is broadcast throughout the system.

11. The system becomes quiescent and waits for any transient to subside. During this time each node will take whatever recovery action is appropriate, based on the type and location of the error. The faulty resource is isolated from the system, and redundant resources are activated automatically by the hardware. (See Chapter 8.)

12. All accesses outstanding in the system are retried. Each

proce~sor in the system receives a reconfiguration IPC from its BIU. This sends an interrupt to each AP and will cause each GDP to suspend its current process at the earliest possible time. The GDPs will then go to their reconfiguration dispatching port. (See Chapter 8.)

13. A process is waiting at the reconfiguration dispatching port(s). This process sends a message to higher software authorities, notifying them of the system reconfiguration.

This process may do some basic housekeeping, but it will almost immediately send the processors back to their normal dispatching ports. (See Chapter 8.)

14. The system is running normally, but management software must make some decisions about the optimal configuration of the system (should resources be deallocated, should a spare be acti vated, etc.). Once these decision's are made, the software may alter the active configuration to put the system in its optimal state for continued operation. Normal operation may need to be suspended for a brief moment if a spare memory module is brought on-line. (See Chapter 10.)

15. The system returns to normal operation.

• 'i'

n

INTRODUCTION

CHAPTEH 5

Dans le document iAPX 432 ~:ERCONNECT (Page 71-77)