• Aucun résultat trouvé

A hard stop status and a hard stop bit have been defined for the Model 155. The hard stop bit is located in control register 14 with

Dans le document A Guide to the IBM System/370 Model 155 (Page 112-116)

the other three mask bits discussed. If a hard stop condition occurs,

the Model 155 system ceases all operations immediately without the

occurrence of a logout to the fixed area. Hard stop is initiated by

hardware rather than by programming.

Generally speaking, a hard stop situation is caused by the occurrence of a hard machine check type of error during the processing of a

previous hard machine check error. Implementation of a hard stop prevents system operations from continuing when the nature of the machine malfunction prevents the system from presenting meaningful status data.

The state of the Model 155 after IPL or a system reset is:

1. Recovery reports are disabled. Successful CPU retries and single-bit processor storage error corrections by ECC hardware do not cause machine check interrupts.

2. Configuration reports are disabled. Buffer block deletions do not cause machine check interrupts.

3. External damage reports are enabled. Interval Timer Damage, Time of Day Clock Damage, or a double- or multiple-bit processor storage failure associated with an I/O operation causes a machine check interrupt.

4. PSW bit 13 normally is set to one by the IPL PSW (it is disabled by system reset) to enable Systems Damage and Instruction

Processing Damage interrupts. Therefore, an irreparable system error, an unretryable CPU failure, an unsuccessfully retried CPU failure, or a multiple-bit processor storage error associated with the CPU causes a hard machine check interrupt.

5. Hard stop is enabled.

6. CPU extended logout is enabled and control register 15 points to location 512 as the beginning of the CPU extended logout area.

These status settings cause the Model 155 to run in quiet mode for hardware-corrected errors. If the Model 155 is to operate in full

recording mode, the appropriate mask bits must be altered by the control program.

MACHINE CHECKS ON SYSTEM/360 MODELS 40 AND 50

A machine check situation in Models 40 and 50 results from hardware detection of a machine malfunction or of a parity error. Bad parity can occur in main storage, in local storage, in a register. in an adder, etc. Error correction is not attempted by Model 40 or 50 hardware when a machine check occurs. If the machine check mask in the current PSW (bit 13) is enabled, a machine check causes an interrupt and a diagnostic scan-out occurs, starting at location 128. The number of bytes logged is model dependent.

If an SER routine (OS) or the MCRR routine (DOS) is present in the Model 40 or 50 control program, i t gains CPU control after the machine check interrupt, and the error is logged. A retry of the failing operation is not provided by these routines and the affected program is terminated abnormally. If MCRR is not present in a DOS supervisor, the system is placed in a wait state when a machine check interrupt occurs. (An OS control program must contain a machine check handling routine, SERO or SER1 for Models 40 and 50, as of Release 11.)

RECOVERY MANAGEMENT SUPPORT (RMS) FOR OS MFT AND MVT

RMS for the Model 155 consists of extensions to the facilities offered by RMS routines currently provided for Models 65 and up. The

two RMS routines, machine check handler (MCH) to handle machine check interrupts and channel check handler (CCH) to handle certain channel errors, will be included automatically in MFT and MVT control programs generated for the Model 155. Approximately 1000 bytes of resident

(nucleus) processor storage is required for Model 155 recovery management. A Model 50 user with a 3K SERl routine included in the control program will require only a 4000-byte increase in resident control program storage as a result of the inclusion of Model 155 RMS.

The two primary objectives of RMS are (1) to reduce the number of system terminations that result from machine malfunctions and (2) to minimize the impact of such incidents. These objectives are

accomplished by programmed recovery to allow system operations to continue whenever possible and by the recording of system status for both transient (corrected) and permanent (uncorrected) hardware errors.

Machine Check Handler

After IPL of a control program containing Model 155 RMS routines, mask bits are enabled and control register values are set to permit all machine check interrupts and logouts to occur.

Soft Machine Checks

MCH receives control after the occurrence of both soft and hard machine check interrupts. When a System Recovery or an Automatic Configuration, accompanied by a System Recovery soft machine check, occurs (successful CPU retry, single-bit processor storage error corrected, or buffer block deleted), MCH formats a recovery report

record to be written in the system error recording data set SYS1.LOGREC.

This record contains pertinent information about the error, including the data in the fixed logout area, an indication of the recovery that occurred. identification of the job, job step, and program involved in the error, the date, and the time of day. MCH schedules the writing of the recovery report record and informs the operator that a soft machine check has occurred.

Prior to relinquishing CPU control, MCH determines whether or not an automatic mode switch from recording mode to quiet mode should take place i f a CPU retry or an ECC correction recovery report has just occurred. The determination of whether to switch to nonrecording

(quiet) mode is made on the basis of the number of soft machine checks of a specific type that occur during system operation. Error count thresholds are maintained separately for successful CPU retry and successful processor storage single-bit error corrections. The IBM-supplied threshold values can be altered when the control program is generated or by an operator command during system operation.

MCR switches the system to quiet mode for either ECC corrections only (the DIAGNOSE instruction is used to change the ECC mode bit from full recording to quiet mode) or both CPU retry and ECC corrections

(the System Recovery mask bit is disabled). Mode switching occurs if the number of soft machine checks that occur during system operation exceeds the specified error count threshold for that type. The operator is informed of the mode switch and can switch back to recording mode at any time thereafter.

Mode switching is implemented to attempt to prevent SYS1.LOGREC from being filled with recovery reports when a recurring correctable error condition exists that would cause many reports to be generated.

When a buffer block deletion recovery report occurs, MCH records the error and informs the operator but does not maintain a threshold value for implementing quiet mode for this interrupt. The total number of buffer blocks deleted since system operation began is indicated,

Page of GC20-1 729-0 (multiple-bit processor storage error during error recovery will be performed after the

(See Figure 50.10.5 for the general flow of soft machine checks.)

Soft Machine Check Interrupt ,

Successful

When an Instruction Processing Damage hard machine check occurs (uncorrectable or unretryable CPU error, multiple-bit processor storage error, or storage protect key failure), MeH determines w~ether the error is one that is correctable by programming, such a~ a multiple-bit storage error or a storage protect key failure.

The program damage assessment and repair (PDAR) routine of MCH can repair damaged control program storage areas by loading a new copy

of the affected module if the module is marked reentrant and refreshable

Dans le document A Guide to the IBM System/370 Model 155 (Page 112-116)