SPECIAL CASES OF INTEREST - ~CONFIGURATION TIME

~CONFIGURATION TIME

CHAPTER 9 SPECIAL CASES OF INTEREST

The fault handling mechanisms address two aspects of initialization:

checking that the critical initialization data was correctly transferred into the components, and providing an external indication that the component has received the initialization signal.

The initialization data is loaded in from the memory bus and the ACD or SLAD bus while the INIT signal is asserted. Appendix D, "External Initialization," describes all of the initialization parameters. Only two of these parameters are critical to the basic operation of the fault handling mechanisms: bus ID and module ID. This information must be correctly loaded, or the error reporting mechanisms will not operate correctly.

The bus ID is loaded in from the memory bus pins. The 2 parity bits on the memory bus are valid during initialization. If there is an error in the initialization data, it will be detected by parity. Once the bus ID has been correctly loaded in during initialization, any failure in the Interconnect Device ID register which corrupts the bus ID will result in a module failure (or be uncovered during testing of the error reporting circuits). The bus ID is used only to recognize requests on the processor bus and to send the node ID in error report messages.

Thus, the faulty bus ID will lead to an FRC error because of incorrect address recognition, or the software checking the error reporting network will notice that the node responded to a Test Report command with the wrong bus ID in the error report message.

The module ID not only needs to be loaded correctly, but also needs to be checked continuously during normal operation. The module ID is loaded from the ACD or SLAD bus during initialization. An even parity bi t is appended to the 6-bi tID. This parity bit is stored in the Interconnect Device ID register along with the module ID. Any single error in the ID during initialization or during normal operation will be detected by the parity bit. This more comprehensive coverage is required because a faulty module ID occurring during system operation would almost certainly lead to a system crash. It would most likely lead to a permanent module error being reported with the wrong module ID and thus the wrong module (possibly this module's spouse) would be killed.

These conditions call for total shutdown of the node. If a memory bus parity error is detected during initialization or if the module ID has a parity failure at any time, this node will be immediately locked into the initialization state. This disables outputs and turns off all address recognition. The node simply disappears from the system. The only way to recover the node is to assert the external INIT pin. If correct data is reloaded into the internal registers, then the component will return to normal operation.

Special Cases of Interest iAPX 432 Interconnect ARM These drastic measures are required because this information represents the fundamental base on which the rest of the system operation is moved between physical processors during execution. Thus, additional information is required to correctly interpret the flag. During normal operation the only way the flag can be set is if a permanent error occurs. (Timeouts will only occur during system configuration.)

9-2

iAPX 432 Interconnect ARM Special Cases of Interest This condition can be detected by the software residing in the reconfiguration dispatching port and be communicated to the software that may be performing interconnect accesses. The software could expand a single interconnect access to include a check on the physical processor being used. A single interconnect register operation would become: Read processor ID (IImy-BIUII), perform interconnect access, read bad access (IImy-BIUII), read processor ID ("my-BIU"). If the processor IDs were identical, then the bad access flag would be valid.

Because of the physical implementation of the BIU and MCU, only one process may access a node at any given time. If multiple nodes are accessing the same node, the results of the accesses are indeterminate. Software alone is responsible for guaranteeing that this mutual exclusion requirement is met.

INTERPROCESSOR COMMUNICATION

Interprocessor Communication (IPC) messages are handled in a special way by the BIUs. Logically, all IPCs use bus 0 and the BIUs on bus 0 are responsible for informing the processor about any IPC messages pending for that processor. This information must be duplicated to allow for recovery from bus 0 failures.

Bus 2 is the back-up bus for bus 0; thus, all IPC messages are sent over both bus 0 and bus 2. Because two buses are used for the message transmission, IPCs are treated like MMA accesses. This guarantees that the information in the IPC registers on both buses remains consistent.

While both BIUs participate on the memory bus side of the IPC message, only one of the BIUs actually responds to the request on the processor bus. When the processor reads its IPC register, the BIU on bus 0 responds if its primary address range is enabled, while the BIU on bus 2 responds if its back-up address range is enabled. During normal operation, the BIU on bus 0 will return data on IPC requests. The operation of the BIU on bus 2 can be checked by doing an Interchange command, which will then cause the BIU on bus 2 to return data on IPC requests. The BIU that does not respond on the processor bus updates its IPC register to maintain an accurate copy of the state of IPC messages.

Both IPC read and write use the memory bus and are handled as MMA accesses over bus 0 and bus 2. This approach utilizes the bus arbitration protocol to guarantee that the IPC information is always consistent. For example, an IPC read following behind an IPC write will return the same data from both IPC registers because the read cannot complete until the write has completed. The order of access will be the same on both buses.

MULTIPLE MODULE ACCESS

Multiple module accesses (MMA) are those accesses that span two memory buses. Because the two buses operate independently, an MMA may be in different states on each bus when an error occurs. This requires

Special Cases of Interest iAPX 432 Interconnect ARM special consideration during recovery. There are two cases of interest: Read-Modify-Write (read) requests and write requests. The RMW requests were described in Chapter 8.

Write requests are a problem because they may leave the memory data in an inconsistent state (part old data, part new data). This failure can occur only if a BIU fails before completing its write request, but after the other half of the MMA has been completed. If the error is permanent and there is no shadow module, then there is no way to correct the inconsistent data structure in the memory. A failure in the MCU or the memory array can never cause this problem. If the failure is in the memory module, no other processor will be allowed to access the memory array.

By monitoring the MMAL and MMAH signals on the processor bus, the BIUs can track the progress of the MMA operation on the other bus. If this error situation occurs (with the conditions: permanent module error, my module, not married, MMA write access, my half complete), the BIU that completed its part of the MMA write access must lock the memory location on its bus. This is done by issuing a Force Bad ECC (FBE) request on the memory bus. This request will cause the MCU to write a special ECC code (the complement of the correct ECC code) into the addressed location. Any requests to this location will be rejected because the special FBE code (an error syndrome of all ones) is interpreted as an uncorrectable error. The FBE code will prevent any further accesses to the corrupt location.

It is important to realize that only one of the two locations involved in the original MMA access receives an FBE command. The other location may be accessed without any problems. This does not cause any logical inconsistencies. The only inconsistency occurs when a processor tries to access both locations as a single unit. To prevent that, one of the locations is forced to have a bad ECC code.

ACTIVATION OF SPARE RESOURCES

The iAPX 432' s concept of shadowed, self-checking modules allows a number of spare strategies to be used in a system. The first tradeoff involves a decision between carrying the extra resources as spares or as extra horsepower available to the system. If it is possible to do some load shedding in the event of a failure, then spare resources probably are a bad idea. Spare resources are activated for three reasons: the system management has decided to maintain the old performance level, a totally fault-tolerant environment, or both.

Depending on the resources that are available, these decisions may require activating spare resources. Spare resources are activated on-line by using the procedures outlined for module shadowing startup in Chapter 8.

9-4

REGISTERS

CHAPTER 10

Dans le document iAPX 432 ~:ERCONNECT (Page 151-155)