INTERMITTENT PROBLEMS

Some action should be taken to correct an intermittent problem whenever possible, even if the failure cannot be duplicated.

The purpose of the following is to assist in trying to duplicate the failure and, if that cannot be done, to provide some guidance as to possible corrective actions that can be taken.

1. If system type errors (Fault Symptom Codes) are being generated, determine if they are predominantly one symptom code or several related codes. Loop on these diagnostic routines/tests in an attempt to produce microdiagnostic errors. Looping routines/tests increases the testing frequency on specific areas of logic. If the microdiagnostics detect errors, follow the actions listed by the error stop in the error code dictionary.

2. If the microdiagnostics do not produce errors, use the most frequent Fault Symptom Codes to replace, swap, or check suspected items to correct the error.

3. Maintain a list of what has been done. This information may be valuable if additional action is required. A check of the customer's operation has to be made to determine if the problem has been corrected. If mass card replacement was used, make every attempt to determine which one caused the error by putting removed cards back in one or two at a time.

4. Other forms of stress testing, such as marginal voltages, raising and lowering temperature, and vibration, may be tried but have not proven too effective. A folded tab card raked across the cards while in a test loop sometimes helps find a bad card connection or a vibration-sensitive card. Moving cables and connectors under the same conditions also occasionally locates a problem.

5. It is essential to have all the information possible regarding failures. Use full dumps and analyze them fully. Understand how much of the system is working correctly as well as what is failing.

1-12 83337530 D

6. Determine if a failure is with one or multiple devices. With single device failures, determine the failing addresses. Determine if one or many tracks common to one head are failing.

7. For access failures, card swapping between two devices is effective in isolating card failures. Power amplifiers can be swapped between devices. Check the interconnecting cables and connectors. Check the voltages.

8. For intermittent head load and unload problems and for dropping Ready (Intervention Required), it is sometimes necessary to request that the customer leave the machine in the failing condition until the CE arrives.

9. If there are tag or bus parity problems between the controller and one string of devices, there could be a bad cable connector somewhere down the string.

10. Consider using a voltmeter to check voltage level and an oscilloscope to check voltage ripple on the power supplies.

11. When possible, note the time at which errors occur. An external noise source may be present only at certain times.

12. Question the customer about other possible environmental problems such as room temperature, static discharges possibly from low humidity, or other unusual occurrences.
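The tallying called for in the steps above (most frequent Fault Symptom Codes, one device versus several, tracks common to one head, time of occurrence) can be done by hand from the error log or with a short script. The following is a minimal Python sketch, assuming each failure has been transcribed by hand as a (symptom code, device, head, hour of day) record. The record layout, the sample values, and the function name are illustrative assumptions and do not correspond to any diagnostic or EREP output format.

```python
from collections import Counter

def summarize_failures(records):
    """Tally failures by symptom code, device, head, and hour of day.

    Each record is a hypothetical (symptom_code, device, head, hour) tuple
    transcribed by hand from the error log.
    """
    by_code = Counter(code for code, _, _, _ in records)
    by_device = Counter(dev for _, dev, _, _ in records)
    by_head = Counter((dev, head) for _, dev, head, _ in records)
    by_hour = Counter(hour for _, _, _, hour in records)
    # most_common() sorts each tally from most to least frequent.
    return {
        "codes": by_code.most_common(),
        "devices": by_device.most_common(),
        "heads": by_head.most_common(),
        "hours": by_hour.most_common(),
    }

# Made-up sample data, for illustration only.
sample = [
    ("4940", 1, 3, 14),
    ("4940", 1, 3, 14),
    ("4941", 1, 3, 15),
    ("9101", 2, 0, 9),
]
summary = summarize_failures(sample)
print(summary["codes"][0])   # most frequent symptom code and its count
print(summary["heads"][0])   # most frequently failing (device, head) pair
print(summary["hours"][0])   # hour of day with the most failures
```

A tally dominated by one (device, head) pair points toward a head or connector problem, while a tally dominated by one hour of the day points toward an external or environmental cause.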

PROBLEM NOT FOUND

The unit is failing now and the maintenance actions have not corrected the problem.

Return to the original entry, and replace, swap, and check items listed. Test the machine in the original manner to determine if the trouble is corrected.

NOTE

When replacing or swapping components, keep a list of what has been done. This is very valuable if the error is being propagated due to components being damaged.

At this point, understanding the failure becomes essential. A methodical approach must be developed and followed. Analyze all failure information: microdiagnostic error stops, messages, or anything else pertaining to the failure. Know what is failing and what is not. If the failure can be duplicated with the same failure information, you should be close to understanding the problem.

If a fairly solid error condition exists with a microdiagnostic routine or test, loop the routine or test and scope the inputs that set the error latch or line. Try to determine which input is at fault, or whether the output is at fault. At this point, you may be looking for an open or short on the board, back panel wire, or in a cable, rather than a card problem.

If the failures are random or the failure still has not been found, monitor the voltages with a voltmeter to be sure they are within specification. Check the power supplies for noise or high frequency ripple with an oscilloscope. Check grounding, cables, and connectors for bad crimps, shorts, or poor connections. Check other environmental conditions that may cause machine problems such as temperature, static, primary power, external noise, etc.

Access cards can be easily swapped between devices to help isolate access problems. This includes the power amplifier. Check the cabling and voltage.

If the problem is Data Checks, they must be isolated to the smallest element possible (one device, one HDA, all devices, etc.). Any HA or R0 failures identified with an HDA must be corrected. Advise the customer to rewrite the data or assign an alternate track if data cannot be rewritten due to a surface defect. If several defects appear, check the head addresses.

If the problem is common to one head, the head may be defective or the connector may be bad.
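Isolating Data Checks to the smallest element can be expressed as a simple narrowing rule. The sketch below is one hedged way to do it in Python, assuming each Data Check has been noted as a (device, head) pair and that the number of devices in the string is known. The function name and input layout are assumptions made for illustration, and HDA-level isolation is not modeled.

```python
def data_check_scope(failures, devices_installed):
    """Return the smallest element that accounts for all Data Checks.

    failures: list of hypothetical (device, head) pairs, one per Data Check.
    devices_installed: number of devices in the string.
    """
    if not failures:
        return "no data checks recorded"
    devices = {dev for dev, _ in failures}
    heads = {(dev, head) for dev, head in failures}
    if len(heads) == 1:
        # Every failure involves the same head on the same device.
        dev, head = next(iter(heads))
        return f"one head (device {dev}, head {head})"
    if len(devices) == 1:
        return f"one device (device {next(iter(devices))})"
    if len(devices) == devices_installed:
        return "all devices"
    return "several devices: " + ", ".join(str(d) for d in sorted(devices))

# Made-up observations, for illustration only.
print(data_check_scope([(1, 3), (1, 3)], 4))
print(data_check_scope([(0, 1), (1, 1), (2, 1), (3, 1)], 4))
```

A result of "one head" matches the case described above, where the head or its connector is suspect. A result of "all devices" suggests looking upstream of the devices instead.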

ON-LINE ERROR ANALYSIS METHODS

SOLEX

"SOLEX" stands for Standalone Online Executive. When EREP fault isolation is unsatisfactory or does not provide sufficient detail, run SOLEX and analyze the console error messages described in the SOLEX manual for the appropriate subsystem.

The preferred order of obtaining fault information in terms of impact on customer throughput is as follows:

1. Obtain EREP printout using IBM OLTEP. Refer to Fault Symptom Code Index for Fault Symptom Code meaning and action.

2. Run SOLEX test. If the system is capable of multi-tasking, the system and remaining drives are available for concurrent customer jobs. Perform suggested fault correction procedures.

3. Consider running standalone microdiagnostics from the storage control. Analyze maintenance panel display for error code.

4. Rerun failing customer job. Question the customer to determine the conditions at the time of the failure. If the system is capable of multi-tasking, the customer may run other jobs concurrently.

EREP PROGRAM

"EREP" stands for Error Recording, Editing, and Printing. The primary method of locating subsystem faults is by analysis of the customer-provided Error Log Record (SYS1.LOGREC) generated by EREP. The EREP utility program provides you with two basic functions:

1. It summarizes the statistical data generated by the subsystem.

2. It prints detailed unit check records in a format that is easily readable.

The operating system determines the format and context of the statistical summary.

The statistical summary information is used to quickly determine the condition of the subsystem. This information can be used to:

1. Assist you in deciding what action to take during preventive maintenance.

2. Point out detailed unit check records.

3. Separate device, controller, and storage control problems.

Note that the EREP program is configured for IBM equipment; the indicated errors on the EREP printout may not reflect the same error on CDC equipment. This is due to differences in meaning of selected sense bits. To determine the meaning of EREP-indicated error codes, refer to the sense byte translations listed in section 1C. For discussion purposes, OS/360 terms are used to describe operating system functions. Other IBM operating systems have equivalent EREP functions and terms.


SECTION 1A

POWER


INTRODUCTION

This section contains general information on power and troubleshooting power-related problems.

When using SAMs, refer to Using the SAM (section 1) for format and description.

POWER ON/OFF PROCEDURES

During maintenance or troubleshooting, it may be necessary to turn off power in one or more pieces of equipment. Turning off power is required if a board is to be removed or added.

POWERING OFF/ON A SINGLE CONTROLLER

Use this procedure when replacing a board or dc module in a single controller.

1. At the CPU, vary offline all device paths through the controller to be powered off.

2. At the operator panel, disable the controller to be powered off.

3. At the HSC PCU, turn off the appropriate CB (CB9 for controller 1 or CB10 for controller 2).

4. Reverse the above procedure to restore power.

NOTE

A Check-2 error occurring during power up will delay Power Up Sequence Complete approximately four minutes.


POWERING OFF/ON A SINGLE DEVICE

Use this procedure when replacing a board, power amp, or dc module in a single device.

NOTE

Install the carriage lock if the power amp is to be removed for longer than one-half hour.

1. At the CPU, vary offline the device to be powered off.

2. Remove the device logic chassis cover (upper logic chassis for devices 0 and 1; lower logic chassis for devices 2 and 3).

3. Stop the device to be powered off by pressing down on the appropriate board-mounted switch.

Device 0 - Board 8, upper chassis
Device 1 - Board 28, upper chassis
Device 2 - Board 8, lower chassis
Device 3 - Board 28, lower chassis

NOTE

Before going to the next step, ensure that the ONLINE status LED is off in the device to be powered off.

Device 0 - Board 7, upper chassis
Device 1 - Board 27, upper chassis
Device 2 - Board 7, lower chassis
Device 3 - Board 27, lower chassis

4. On the master/slave PCU, turn off the appropriate CB.

Device 0 - CB3 (master PCU)
Device 1 - CB4 (master PCU)
Device 2 - CB7 (slave PCU)
Device 3 - CB8 (slave PCU)


5. Reverse the above procedure to restore power.

NOTE

If more than one device is to be powered up, wait for the READY LED to come on before powering up the next device.
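The per-device board and breaker assignments in the single-device procedure lend themselves to a lookup table. The following Python sketch is one way to encode them, for example in a site checklist tool. The type and function names are hypothetical, and reading the Board 7/27 list as the ONLINE status LED locations is an assumption based on where that list appears in the procedure.

```python
from typing import NamedTuple

class DevicePowerInfo(NamedTuple):
    chassis: str       # logic chassis whose cover is removed
    switch_board: int  # board carrying the stop switch
    led_board: int     # board assumed to carry the ONLINE status LED
    breaker: str       # circuit breaker for the device
    pcu: str           # PCU holding that breaker

# Values transcribed from the procedure above.
DEVICE_POWER = {
    0: DevicePowerInfo("upper", 8, 7, "CB3", "master"),
    1: DevicePowerInfo("upper", 28, 27, "CB4", "master"),
    2: DevicePowerInfo("lower", 8, 7, "CB7", "slave"),
    3: DevicePowerInfo("lower", 28, 27, "CB8", "slave"),
}

def power_off_steps(device):
    """Return the power-off steps for one device as printable strings."""
    info = DEVICE_POWER[device]
    return [
        f"Vary device {device} offline at the CPU.",
        f"Remove the {info.chassis} logic chassis cover.",
        f"Press the stop switch on board {info.switch_board}.",
        f"Confirm the ONLINE status LED (board {info.led_board}) is off.",
        f"Turn off {info.breaker} on the {info.pcu} PCU.",
    ]

for step in power_off_steps(2):
    print(step)
```

Keeping the assignments in one table means the steps for any device are generated from the same data, which avoids reading across the wrong row of the lists.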

POWERING OFF A DRIVE STRING

Use this procedure to remove all power from a drive string.

1. At the CPU, vary offline all device paths through the HSC (two controllers) to be powered off.

2. At the operator panel disable both controllers.

3. At the HSC PCU turn off CB9 and CB10 (controller 1 and controller 2).

NOTE

Ensure all air mover fans in the drive string are stopped before going to the next step.

4. At the HSC PCU turn off CB5 (main breaker).

WARNING

Unswitched line voltages are still present inside the PCU.

POWERING ON A DRIVE STRING

1. At the HSC PCU turn on CB5 (main breaker).

2. At the HSC PCU turn on CB9 (controller 1).

3. Wait at least four seconds and then turn on CB10 (controller 2).


Use SAM DSUl when one or more devices do not power up· far
