Troubleshooting ECC Errors on RA90/RA92 Disk Drives


5.16 Troubleshooting ECC Errors on RA90/RA92 Disk Drives

Disks are getting bigger and faster. As disk bit and track density increases, the electronics and mechanical components of the subsystem operate under tighter constraints. This means that error recovery mechanisms within the architecture may be called upon more frequently to compensate for these narrow tolerances.

This is one of the significant advantages of a Digital storage solution. Digital integrates into the design of the controller and the drive error recovery attributes that enhance and ensure data integrity and delivery to the user. Plug-compatible manufacturers (PCMs) of storage devices, by not owning the design of both ends of the subsystem (controller and drive), are left with little capacity to implement such techniques.

The RA90/RA92 disk drive has 14 different error recovery mechanisms (reference Appendix B) and, therefore, affords excellent recovery potential for data errors. These error recovery mechanisms provide the margins necessary to protect customer data at increased densities and to ensure that the data is always delivered successfully.

In order to better determine the significance of logged correctable and uncorrectable ECC errors, and for assistance in troubleshooting either, note the discussions and error log examples in the sections that follow.

5.16.1 Uncorrectable ECC Errors--MSCP Status/Event E8

An uncorrectable ECC error is architecturally defined as the occurrence of a controller logging an MSCP status/event E8 as a result of a read data error. There are two uncorrectable ECC error types: hard and soft. Both types are reflected by a single MSCP status/event code.

The next two sections help the engineer determine whether the status/event was hard or soft, and whether it was significant or insignificant.

5.16.1.1 Hard Uncorrectable ECC Errors

A hard uncorrectable ECC error is the occurrence of an uncorrectable ECC error that renders the drive unable to recover data through any retry or recovery mechanism. An uncorrectable ECC error is not considered "hard" until all attempts at getting the data are exhausted and the controller has to terminate its attempts.

Example 5-3 shows a VMS error log error packet where the data was lost due to a hard error. The fields of note are emphasized in bold.

DIGITAL INTERNAL USE ONLY

******************************* ENTRY ERROR SEQUENCE 3885.

DATE/TIME 30-JAN-1989 19:54:03.77 SCS NODE: PICKUP


29. *******************************

LOGGED ON: SID 0200620E SYS_TYPE 00000000 ERL$LOGMESSAGE ENTRY KA750 REV# 14. UCODE REV# 98.

I/O SUB-SYSTEM, UNIT _HSC013$DUA36:

MESSAGE TYPE 0001

CONTROLLER DEPENDENT INFORMATION

ORIG ERR 8010

BAD BLK REPLACEMENT REQUEST OPERATION CONTINUING

OPERATION SUCCESSFUL DATA ERROR

UNCORRECTABLE ECC ERROR

UNIT SOFTWARE VERSION I l l .

LOGICAL BLOCK 1947645.

GOOD LOGICAL SECTOR

EDC ERROR ECC ERROR

LBN REPLACEMENT INDICATED ERR LOGGED TO CONSOLE AND HOST

******************************************************

Example 5-3 VMS Uncorrectable ECC Error Log-Hard


The disk subsystem will attempt to recover from an uncorrectable ECC error by retrying the transfer five times. For an RA90/RA92 disk drive, the controller would then invoke drive recovery level 14 and execute that recovery mechanism up to five times, then invoke drive recovery level 13, and so on, until executing the last recovery level (1).
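The escalation order described above can be sketched as follows. This is only an illustration of the sequence, not controller firmware: `read_block` is a hypothetical callback standing in for a single read attempt, and the sketch assumes every recovery level is tried up to five times, as the text states for level 14.

```python
from typing import Callable, Optional, Tuple

RETRIES_PER_STAGE = 5   # five attempts per stage, per the text
HIGHEST_LEVEL = 14      # RA90/RA92 recovery levels run from 14 down to 1


def recover(read_block: Callable[[int], bool]) -> Optional[Tuple[int, int]]:
    """Return (level, attempt) for the first successful read, or None.

    read_block(level) is a hypothetical stand-in for one read attempt.
    Level 0 means a plain retry with no recovery mechanism invoked.
    """
    # Plain retries first, then levels 14, 13, ... down to 1.
    for level in [0] + list(range(HIGHEST_LEVEL, 0, -1)):
        for attempt in range(1, RETRIES_PER_STAGE + 1):
            if read_block(level):
                return level, attempt
    return None  # every attempt exhausted: a hard uncorrectable ECC error


def max_attempts() -> int:
    """Upper bound on recovery attempts after the initial failed read."""
    return RETRIES_PER_STAGE * (1 + HIGHEST_LEVEL)
```

Under these assumptions the controller makes at most 5 plain retries plus 14 x 5 recovery-level attempts before the error is declared hard.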

Note that for UDA controllers, the reported recovery levels from the controller will differ from what the other controllers will report.

5.16.1.2 Soft Uncorrectable ECC Errors

A soft uncorrectable ECC error is the occurrence of an uncorrectable ECC error on the first read attempt, followed by a successful recovery level and/or retry in which the data was read successfully (with eight or fewer symbols in error). In such a case, the block is flagged as a BBR candidate for testing purposes by the HSC controller (or, in the case of a UDA/KDA/KDB controller, by the host operating system driver).

For uncorrectable ECC errors (MSCP status/event E8), the following items should be considered:

• For the RA90/RA92 disk drive, examine the error log and determine whether MSLG$_LEVEL and MSLG$_RETRY (for VMS) are being reported as follows:

If the recovery level is reported as 0 and the retry count is 1 for the uncorrectable ECC errors, an occasional error under high I/O rates may be considered normal. The normal recovery will occur on the first retry with a recovery level of 0. If more than a single retry is necessary, and especially if other levels of recovery are necessary, this indicates potentially more serious error conditions, including the legitimate condition whereby a block is going bad and needs replacement.

The RA90 short-arm HDA and the RA92 HDA will show improved (decreased) ECC error rates.

The nominal distribution of uncorrectable ECC errors for an RA90 disk drive with a long-arm HDA operating at very high I/O rates should appear as follows:

- Ninety percent of the errors occur in the top five heads (heads 0 through 4).

- One of the heads (in the 0-4 range) will have no errors logged.

- At least three of the top five heads will have errors of this type.

- You should have a sample size of at least 16 uncorrectable ECC errors for examination. If this distribution of errors is not met, then further analysis should be done.

For example, if 10 of the 13 heads are logging these data errors, then consider it a general read path problem and troubleshoot accordingly.

If distribution is to a single head, then consider the likelihood of a defective HDA.

If error log information indicates that data recovery was accomplished by utilizing a drive error recovery level of 7 through 14 (head offset mechanism), then consider HDA replacement (especially if 9A, 9B, or 9C errors are being logged in the drive as well).

• Each error log entry of an uncorrectable ECC error should be followed by a BBR packet (reference Section 5.16.2.1). The MSCP status/event code should reflect a 34, BBR replacement attempted but block tested okay. Blocks in a normal drive will be retired at a very low rate (less than 20 percent of the time) for the normal transient occurrence of uncorrectable ECC errors on RA90 disk drives.
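The recovery level and retry guidance above can be expressed as a small triage helper. This is a hedged sketch, not a DEC tool: the function name and the returned labels are hypothetical, and the two arguments are assumed to hold the values reported in MSLG$_LEVEL and MSLG$_RETRY.

```python
def classify_soft_error(level: int, retry: int) -> str:
    """Triage one soft uncorrectable ECC error (status/event E8).

    level and retry are assumed to be the MSLG$_LEVEL and MSLG$_RETRY
    values from the error log entry. The labels are illustrative only.
    """
    if level == 0 and retry <= 1:
        # Recovered on the first plain retry: an occasional occurrence
        # under high I/O rates may be considered normal.
        return "normal"
    if 7 <= level <= 14:
        # Recovered with a head offset mechanism: consider HDA
        # replacement, especially if 9A, 9B, or 9C errors are logged.
        return "consider HDA replacement"
    # Extra retries or other recovery levels: potentially more serious,
    # including a block that is going bad and needs replacement.
    return "investigate"
```

For example, an entry reporting level 0 and retry 1 is classified as normal, while an entry reporting level 9 points toward the HDA.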

Example 5-4 has three fields of note (emphasized in bold). The first emphasized field denotes the actual MSCP status/event logged (00E8), and a bit-to-text decode denoting that the read error was an uncorrectable ECC error.

The second field of note indicates how the subsystem recovered from the error condition; in this case, a single retry was successful with no special error recovery mechanism being invoked to aid in the recovery of the data.


The third emphasized field is the field within an error log packet that, for an ECC-type MSCP status/event packet, typically has no meaning and will in almost all cases contain zeros. This section of an error log packet will, however, contain significant information for the interpretation of MSCP status/event 6B error packets.

******************************* ENTRY ERROR SEQUENCE 3885.

DATE/TIME 30-JAN-1989 19:54:03.77 SCS NODE: PICKUP

29. *******************************

LOGGED ON: SID 0200620E SYS_TYPE 00000000 ERL$LOGMESSAGE ENTRY KA750 REV# 14. UCODE REV# 98.

I/O SUB-SYSTEM, UNIT _HSC013$DUA36:

MESSAGE TYPE 0001

CONTROLLER DEPENDENT INFORMATION

ORIG ERR 8010

BAD BLK REPLACEMENT REQUEST OPERATION CONTINUING

< Minimal impact event VOLUME SERIAL 1876.

LOGICAL BLOCK 1947645.

GOOD LOGICAL SECTOR

EDC ERROR ECC ERROR

LBN REPLACEMENT INDICATED ERR LOGGED TO CONSOLE AND HOST

< For data problems, these fields should contain 'zeros'.

***********************************************

Example 5-4 VMS Uncorrectable ECC Error Log-Soft


5.16.2 Correctable ECC Errors--MSCP Status/Event Codes 1A8, 1C8, 1E8

Correctable ECC errors are those where the data was read with symbols in error above the drive threshold (6-8 symbols for the RA90/RA92 disk drive). For ECC errors (MSCP status/event codes 1A8, 1C8, and 1E8), consider the following:

• For an RA90 disk drive with a long-arm HDA, an occasional ECC error (including 6-8 symbols in error and soft uncorrectable errors) may be considered normal when the drive has sustained or burst I/O rates of >30 I/Os per second.

The RA90 short-arm HDA and the RA92 HDA show a marked improvement (decrease) in ECC error rates.

The nominal distribution of correctable ECC errors for an RA90 disk drive with a long-arm HDA should appear as follows:

- Ninety percent of the errors occur in the top five heads (heads 0 through 4).

- One of the heads (in the 0-4 range) will have no errors logged.

- At least three of the top five heads will have errors of this type.

- You should have a sample size of at least 16 uncorrectable ECC errors for examination. If this distribution of errors is not met, then further analysis should be done.

For example, if 10 of the 13 heads are logging these data errors, then consider it a general read path problem and troubleshoot accordingly.

If distribution is to a single head, then consider the likelihood of a defective HDA.

• Each error log entry of an ECC (6-8 symbol) error should be followed by a BBR packet (reference Section 5.16.2.1). The MSCP status/event code should reflect a 34, BBR replacement attempted but block tested okay. Blocks in a normal drive will be retired at a very low rate (less than 20 percent of the time) for the normal transient occurrence of correctable ECC errors on RA90 disk drives.
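The nominal head distribution for a long-arm RA90 HDA, described above for both uncorrectable and correctable ECC errors, can be checked mechanically once the head number of each logged error has been collected. The sketch below is an assumption-laden illustration: the function name and messages are hypothetical, and "one of the heads will have no errors logged" is read as "at least one of heads 0-4 is error free".

```python
from collections import Counter
from typing import Iterable, List

TOP_HEADS = range(0, 5)   # heads 0 through 4
MIN_SAMPLE = 16           # minimum sample size given in the text


def check_distribution(heads: Iterable[int]) -> List[str]:
    """Return deviations from the nominal pattern (empty list if none).

    heads holds one head number per logged ECC error.
    """
    counts = Counter(heads)
    total = sum(counts.values())
    if total < MIN_SAMPLE:
        return ["sample too small (need at least 16 errors)"]

    findings = []
    top_total = sum(counts[h] for h in TOP_HEADS)
    if top_total / total < 0.90:
        findings.append("fewer than 90% of errors on heads 0-4")
    top_with_errors = sum(1 for h in TOP_HEADS if counts[h] > 0)
    if top_with_errors < 3:
        findings.append("fewer than three of heads 0-4 show errors")
    if top_with_errors == 5:
        findings.append("no error-free head among heads 0-4")
    if len(counts) == 1:
        findings.append("all errors on a single head: suspect HDA")
    return findings
```

An empty result means the sample matches the nominal pattern. A result reporting errors spread well beyond heads 0-4 corresponds to the general read path case in the text, and a single-head result corresponds to the defective HDA case.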

5.16.2.1 BBR Packet

ECC errors that exceed the drive threshold initiate BBR algorithms. The BBR algorithms are provided to test, verify, and replace (if needed) defective media spots or marginal media/head spot combinations (assuming no data path problems). In those instances where the BBR algorithms do not determine a need for block replacement, it may be due to a transient type error situation, or mechanisms not attributable to actual head/media margins. These above-drive-threshold ECC errors (or uncorrectable ECC errors) may be caused by drive phenomena other than bad media/heads.

The BBR packet, which is generated at the completion of the BBR algorithm, will contain several important clues about the nature of the ECC error. Included in the packet is whether the block tested good or bad, and whether the original data was recovered or restored with the FORCED ERROR flag set, indicating the data was lost.

The following MSCP status/event codes are applicable for a BBR packet:

MSCP status/event 14-Bad block successfully replaced.

MSCP status/event 34-Block verified okay; not a bad block.

MSCP status/event 54-Replacement failure; replace command failed.

MSCP status/event 74-Replacement failure; inconsistent RCT.

MSCP status/event 94-Replacement failure; drive access failure.


MSCP status/event B4-Replacement failure; no block available.

MSCP status/event D4-Replacement failure; two successive RBNs were bad.
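When scanning many BBR packets, the codes listed above lend themselves to a lookup table. The following sketch is illustrative only: the names `BBR_STATUS`, `describe_bbr`, and `retirement_rate` are hypothetical, and the event values are assumed to have already been extracted from the error log as integers.

```python
from typing import Iterable

# BBR packet status/event codes and meanings, as listed in this section.
BBR_STATUS = {
    0x14: "Bad block successfully replaced",
    0x34: "Block verified okay; not a bad block",
    0x54: "Replacement failure; replace command failed",
    0x74: "Replacement failure; inconsistent RCT",
    0x94: "Replacement failure; drive access failure",
    0xB4: "Replacement failure; no block available",
    0xD4: "Replacement failure; two successive RBNs were bad",
}


def describe_bbr(event: int) -> str:
    """Translate a BBR packet status/event code into its meaning."""
    return BBR_STATUS.get(event, "Not a BBR status/event code listed here")


def retirement_rate(events: Iterable[int]) -> float:
    """Fraction of BBR packets in which the block was actually replaced."""
    events = list(events)
    if not events:
        return 0.0
    return sum(1 for e in events if e == 0x14) / len(events)
```

A retirement rate well above 0.20 would be out of line with the "less than 20 percent" figure given earlier for a normal drive.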

Example 5-5 illustrates the resulting status of the BBR replacement algorithm. In this example, the block in question did go through BBR; however, the block was not replaced. Further in the example, the replace flags demonstrate that the block was not replaced because the block "verified good." The last segment of the BBR log packet reveals why the block was even tested. In this example, the block was thought to contain a data error with a severity level of "uncorrectable ECC."

5.17 Troubleshooting Controller-Detected Positioner Errors-MSCP
