
Troubleshooting Controller-Detected Positioner Errors-MSCP Status/Event 6B

From the document Disk Drive (Pages 139-143)


5.17 Troubleshooting Controller-Detected Positioner Errors-MSCP Status/Event 6B

MSCP status/event 6B is a positioner unintelligible header error (also referred to as a positioner mis-seek error). Several considerations must be weighed when troubleshooting the MSCP 6B event.

These include:

• For RA90/RA92 disk drives, what is the I/O rate on the drive?

• Is only one SDI path noting the problem?

• Are other errors being logged at or near the same frequency as the MSCP 6B?

• For RA92 disk drives, what is the write-to-read ratio?

• What recovery level/mechanism is the controller using in order to recover from the situation?

With the RA90/RA92 disk drive, if examination of the error log determines that:

• the Level A retry mechanism is successful on first retry, and

• the Level B retry mechanism is not being used (reported Level B retry count = 0), and

• "all" errors are being recovered on a single retry,

then an error rate of six per day may be considered nominal for the RA90/RA92 disk drives operating near or above 30 I/Os per second.
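The nominal-rate guideline above can be sketched as a small check. This is a minimal illustration, not a Digital tool: the `Event6B` record, its field names, and the function name are hypothetical stand-ins for values read manually from the error log. The thresholds (six events per day, 30 I/Os per second) come from the text, and "near or above 30" is simplified here to "at least 30".

```python
from dataclasses import dataclass

NOMINAL_EVENTS_PER_DAY = 6  # guideline from the text
MIN_IO_RATE = 30            # I/Os per second; "near or above" simplified to >=


@dataclass
class Event6B:
    """Hypothetical summary of one logged MSCP 6B event."""
    level_a_retries: int  # Level A retries needed to recover
    level_b_retries: int  # reported Level B retry count
    recovered: bool       # True if the controller recovered the error


def fits_nominal_guideline(events, days, io_rate):
    """Return True only if every condition of the guideline holds."""
    if days <= 0:
        raise ValueError("days must be positive")
    if io_rate < MIN_IO_RATE:
        # The guideline is stated only for drives near or above 30 I/Os per second.
        return False
    all_single_retry = all(
        e.recovered and e.level_a_retries == 1 and e.level_b_retries == 0
        for e in events
    )
    return all_single_retry and len(events) / days <= NOMINAL_EVENTS_PER_DAY
```

A result of False does not by itself identify a fault. It only means the event history falls outside the stated guideline and merits the evaluation described in the rest of this section.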

Example 5-6 illustrates a typical RA90 error log on a VMS system. The fields of note are emphasized in bold.

5.17.1 RA92 Disk Drive With MSCP Status/Event 6B

RA92 disk drives may log more occurrences of MSCP status/event 6B than RA90 disk drives in applications that perform long sequences of write activity. This phenomenon, as a contributor to 6B events, was recently discovered and identified. Though it occurs more often with the RA92 disk drive, heavy write-to-read ratios could also contribute to MSCP 6B events logged by RA90 disk drives.

The problem originates in the design of the heads and occurs while a head is involved in large sequential write transfers. When the head has to switch back to read (for the next header identification), noise can result in the head that disrupts the header signal as it is read. No identifiable damage to the actual header information is exhibited on the media. Customer data is not at risk. The noise merely disrupts the read chain momentarily as the header is being read. By the time the next sector comes around, the read chain will have stabilized.

This head phenomenon will result in additional 6B errors being logged when the write-to-read ratios are heavily weighted in favor of writes. Typical VMS environments may not provide this scenario. It has been noted that typical ULTRIX/UNIX applications appear to have a higher mix of write-to-read activity than VMS applications. However, regardless of the operating system, certain applications may increase the potential of this phenomenon occurring when those applications, by their nature, produce heavy write-to-read ratios.

DIGITAL INTERNAL USE ONLY

5-50 Troubleshooting and Error Codes

****** ENTRY 6., ERROR SEQUENCE 4709. LOGGED ON SID 05283914

ERL$LOGMESSAGE ENTRY KA820 REV# E PATCH REV# 28. UCODE REV# 20.

BI NODE # 2.

I/O SUB-SYSTEM, UNIT _HSC015$DUA36:

MESSAGE TYPE 0001

BAD BLOCK REPLACEMENT ATTEMPT OPERATION SUCCESSFUL

BAD BLOCK REPLACEMENT BLOCK VERIFIED GOOD

UNIQUE IDENTIFIER, 00000000FC15 (X) MASS STORAGE CONTROLLER

HSC70

CONTROLLER SOFTWARE VERSION #39.

CONTROLLER HARDWARE REVISION #0.

UNIQUE IDENTIFIER, 0000000003F6(X) DISK CLASS DEVICE (166)

ORCORUCTULJI ace DBOR

*******************************************

Example 5-5 VMS BBR Packet


DATE/TIME 26-JUL-1990 11:12:49.31

ERL$LOGMESSAGE ENTRY KA88 REV#

CPU # 0.

I/O SUB-SYSTEM, UNIT _HSC4$DUA39:

MESSAGE TYPE 0001

CONTROLLER DEPENDENT INFORMATION

ORIG ERR 1800 SEQUENCE NUMBER RESET OPERATION SUCCESSFUL

UNIQUE IDENTIFIER, 00000017F20D(X)

MASS STORAGE CONTROLLER

HSC50

HEADER COMPARE ERROR HEADER SYNC TIMEOUT

SUSPECTED LOW HEADER MISMATCH ERR LOGGED TO CONSOLE AND HOST

<--- 1 "A" RETRY

<--- NO "B" RETRIES

Example 5-6 Positioner Mis-Seek MSCP Status/Event 6B



The occurrence of 6B errors caused by this phenomenon has been more pronounced on the KDM/HSC controllers than on the KDA/KDB/UDA controllers. Since experience and engineering evaluation have shown that the occasional occurrence of the MSCP status/event 6B, when recovered on a single retry, is inconsequential, extra error management code has been implemented as follows:

• HSC software released after the 39x series will contain special 6B error management code that looks for this error signature and does not report events that match this characteristic of the RA90/RA92 product.

• The KDM70 controller with microcode at revision level 2 will also contain this enhanced error management code for 6B errors on RA90/RA92 disk drives.

This phenomenon is being aggressively pursued by Digital and resolution details will be communicated to the field.

5.17.2 Evaluating MSCP 6B Events

When converting a sample of the target LBNs (20 to 30 LBNs identified in MSCP 6B events) to head and cylinder addresses, look for the following:

• Single head but quite random cylinder addresses-consider the HDA

• Single head but narrow band of cylinder addresses-consider mapping out suspect LBNs with DKUTIL or HDA replacement. Before manually forcing replacement of a perceived bad block, make sure a current disk backup exists.

• Repeating LBNs-consider "mapping" out suspect LBNs with the BBR utility (DKUTIL).

• Random heads (10 of 13 heads)-consider data path including controller SDI module.
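The patterns above can be sketched as a simple classifier over a sample of events. This is an illustrative sketch only. It assumes the LBNs have already been converted to cylinder and head addresses by other means, and the function name, the input layout, and the `narrow_band` threshold of 50 cylinders are hypothetical choices, not values from this manual. The `many_heads` default of 10 follows the "10 of 13 heads" example in the text.

```python
from collections import Counter


def classify_6b_sample(sample, narrow_band=50, many_heads=10):
    """Classify a sample of 6B events.

    sample: list of (lbn, cylinder, head) tuples, one per logged event.
    Returns a list of findings, since more than one pattern may apply.
    """
    findings = []

    # Repeating LBNs: the same block is reported more than once.
    lbn_counts = Counter(lbn for lbn, _, _ in sample)
    if any(count > 1 for count in lbn_counts.values()):
        findings.append("repeating LBNs: consider mapping out suspect LBNs with DKUTIL")

    heads = {head for _, _, head in sample}
    cylinders = [cyl for _, cyl, _ in sample]

    if len(heads) == 1:
        # narrow_band is an assumed threshold, in cylinders.
        if max(cylinders) - min(cylinders) <= narrow_band:
            findings.append("single head, narrow band: consider DKUTIL or HDA replacement")
        else:
            findings.append("single head, random cylinders: consider the HDA")
    elif len(heads) >= many_heads:
        findings.append("random heads: consider the data path, including the controller SDI module")

    return findings
```

A sample that matches none of the patterns returns an empty list, which means the distribution alone does not point to a specific FRU.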

Troubleshoot MSCP status/event 6B as follows:

• Update the drive with the latest drive microcode version.

• If errors are only happening on one port, pursue a port path problem, including the ECM, the SDI cables (between drive and bulkhead, between the drive cabinet and the controller cabinet, and within the controller cabinet), and the port interface module in the controller.

• Note whether more than one drive on the requester is reporting consistent 6B events. This would more definitely suggest a port interface problem within the controller.

• If errors are clearly happening on both drive ports, pursue the problem as a drive problem first, when the event rate exceeds the guidelines indicated above and/or customer satisfaction dictates.
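The port-based steps above can be summarized as a decision sketch. This is a hedged illustration rather than a documented procedure: the function name, its parameters, the order in which the checks are applied, and the final fallback string are all assumptions made for the example. It also assumes the drive microcode has already been updated.

```python
def suggest_6b_focus(ports_with_errors, drives_reporting_on_requester,
                     rate_exceeds_guideline):
    """Suggest where to look first for MSCP 6B events.

    ports_with_errors: set of drive port names logging 6B events, e.g. {"A"}.
    drives_reporting_on_requester: number of drives on the same requester
        that report consistent 6B events.
    rate_exceeds_guideline: True if the event rate is above the nominal
        guideline, or customer satisfaction dictates action.
    """
    ports = set(ports_with_errors)

    # Several drives on one requester point at the controller side.
    if drives_reporting_on_requester > 1:
        return "port interface problem within the controller"

    # Errors confined to one port point at that port's path.
    if len(ports) == 1:
        return "port path: ECM, SDI cables, controller port interface module"

    # Errors on both ports point at the drive itself.
    if len(ports) >= 2 and rate_exceeds_guideline:
        return "drive problem"

    # Fallback wording is an assumption, not from the manual.
    return "no action indicated by these guidelines"
```

For example, `suggest_6b_focus({"A"}, 1, True)` returns the port path suggestion, while the same call with `{"A", "B"}` returns `"drive problem"`.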

5.18 Conclusion

The DSA architecture defines a very reliable and flexible storage subsystem. This subsystem can be maintained efficiently and effectively when consistent and methodical troubleshooting procedures are followed.

Poorly trained or untrained Customer Services engineers are at a serious disadvantage. The cost of supporting incorrectly identified FRUs is very high. Many FRUs are expensive to replace, and some very expensive FRUs are not repairable. The impact to a customer can be substantial. Impacts include:

• Necessity to back up and restore potentially large amounts of data on misdiagnosed HDA replacements.

• Loss of system availability when using standalone diagnostics with controllers such as UDA/KDA/KDB.

• Loss of drive availability when performing extensive subsystem diagnostics using an HSC controller.



• Increased frustration and inconvenience of dealing with repeated calls.

• Loss of confidence in Digital as a quality supplier of storage systems.

• Increased potential of data loss if improper diagnosis is made and the failure mode continues or gets worse.

SERVICE GOAL

The Customer Services engineer's number one goal in service efforts is to correctly diagnose a problem on the first call and replace the correct part so the customer's disk and data availability is minimally impacted.
