
Handling MyISAM Corruption


The following scenario is a detailed account of a MyISAM corruption situation that recently occurred, the steps taken to triangulate a possible recovery, and the ultimate solution.

The environment was a single MySQL production server with binary logging (which had been disabled earlier in the day). There was no MySQL replication server in place.

The Call for Help

While checking my inbox at breakfast, the following e-mail draws attention:

NOTE Disaster does not care if you are on vacation.

The Confirmation of a Serious Problem

I immediately contact the client, determine that the situation appears serious, and connect to a running production system finding the following errors in the MySQL error log:
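
Errors of this kind in the MySQL error log generally look similar to the following (the schema and table names are illustrative):

    [ERROR] /usr/sbin/mysqld: Table './schema_name/client_table' is marked
    as crashed and should be repaired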

This is the first obvious sign of MyISAM corruption. This generally occurs when the MySQL instance crashes or is not cleanly shut down. A quick review confirms that this occurred three minutes earlier. It is always recommended to try to find the cause so the situation is understood should it occur again.

The First Resolution Attempt

In this situation it is best to shut down MySQL and perform a myisamchk of the underlying MyISAM data. Depending on the size of your database, this can take some time. By default, myisamchk with no options performs a check only.
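
A representative invocation against every table in a schema, with an illustrative data directory path:

    $ cd /var/lib/mysql/schema_name
    $ myisamchk *.MYI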

This output is good:
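
A clean table typically produces a check report along these lines (the table name and record count are illustrative):

    Checking MyISAM file: client_table.MYI
    Data records:   50000   Deleted blocks:       0
    - check file-size
    - check record delete-chain
    - check key delete-chain
    - check index reference
    - check data record references index: 1
    - check record links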

This output is not good:
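
A corrupted table generally produces errors of this kind and instructs you to repair it (again illustrative):

    Checking MyISAM file: client_table.MYI
    Data records:   50000   Deleted blocks:       0
    myisamchk: warning: Table is marked as crashed
    - check file-size
    myisamchk: error: Size of datafile is: 0         Should be: 8388608
    MyISAM-table 'client_table.MYI' is corrupted
    Fix it using switch "-r" or "-o"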

To perform a repair on a MyISAM table the -r option is required. For example:
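
A sketch of such a repair, with an illustrative table name and the general shape of the output:

    $ myisamchk -r client_table.MYI
    - recovering (with sort) MyISAM-table 'client_table.MYI'
    Data records: 50000
    - Fixing index 1
    - Fixing index 2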

This shows a successful repair. The following shows an unsuccessful repair:
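
Illustrative only, a failed repair reports errors and flags the table as not fixed, for example:

    $ myisamchk -r client_table.MYI
    - recovering (with sort) MyISAM-table 'client_table.MYI'
    Data records: 50000
    - Fixing index 1
    MyISAM-table 'client_table.MYI' is not fixed because of errors
    Try fixing it by using the --safe-recover (-o), the --force (-f)
    option or by not using the --quick (-q) flag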

The following error on the client’s largest and most important table is that classic WTF moment:

As a side note, if myisamchk fails to complete, a temporary file is actually left behind.

(First time experienced by the author.)

The Second Resolution Attempt

One of the benefits of MyISAM is that the underlying data and indexes are simply flat files. These can be copied between database schemas along with the appropriate table definition file (i.e., .frm). The following steps were used to simulate a new table; a sketch of the commands involved follows the list.

1. Obtain the table definition from a backup file or SHOW CREATE TABLE command.

2. Create a new table in a different schema with the table definition and all indexes removed.

3. Copy the existing .MYD file over the newly created table's .MYD file. The new table does not need to have the same name as the old table; however, the .MYD file name must match the new table name.

4. Repair table (requires MySQL instance to be stopped).

5. Confirm the data is accessible.
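
A minimal sketch of these steps, assuming a corrupted table named client_table in a schema named prod, a scratch schema named recovery, and a default data directory (all names and paths are illustrative):

    mysql> SHOW CREATE TABLE prod.client_table\G
    mysql> CREATE SCHEMA recovery;
    mysql> CREATE TABLE recovery.client_table (
        ->   -- columns copied from SHOW CREATE TABLE,
        ->   -- all indexes and the PRIMARY KEY removed
        ->   id INT NOT NULL,
        ->   data VARCHAR(255)
        -> ) ENGINE=MyISAM;

    # With MySQL stopped, overwrite the new (empty) data file with the
    # corrupted data file; the file name must match the new table name.
    $ cp /var/lib/mysql/prod/client_table.MYD \
         /var/lib/mysql/recovery/client_table.MYD
    $ myisamchk -r /var/lib/mysql/recovery/client_table.MYI

    # Restart MySQL and confirm the data is accessible.
    mysql> SELECT COUNT(*) FROM recovery.client_table;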

As you can see, a repair of the table this time did not produce a core dump. A further check confirms the data is accessible.

At this time an attempt is made to re-create the indexes on the table so that this table and index structure can be copied back to the production schema.
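
Re-creating the indexes would normally be a straightforward ALTER TABLE; the index names and columns here are purely illustrative:

    mysql> ALTER TABLE recovery.client_table
        ->   ADD PRIMARY KEY (id),
        ->   ADD INDEX idx_data (data);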

It is clear that the table does not want to be repaired.

The Third Resolution Attempt

At this time, the decision whether to continue pursuing data recovery or to restore from the previous night's backup is considered, and both options are undertaken in parallel. A more detailed recovery is performed, initially with the -o option for the older recovery method, and then with the -e option for an extended recovery. It should be noted that the myisamchk documentation states “Do not use this option if you are not totally desperate.”
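
One way to invoke these (the path is illustrative; -e is combined with -r here for the extended recovery):

    $ myisamchk -o /var/lib/mysql/recovery/client_table.MYI
    $ myisamchk -r -e /var/lib/mysql/recovery/client_table.MYI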

The table is confirmed as crashed again.

A successfully reported repair is performed.

However, the table is still considered corrupt.

A more extensive recovery is performed.

The number of data records has decreased from 461,818 to 461,709.

The number of data records has decreased again, from the original 461,818 to 461,523, an indication that perhaps corrupted data has been removed.

At this time, the best approach is to try to obtain as much data as possible by extracting it directly from the table.
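
One practical way to pull out whatever remains readable is to select ranges of rows into flat files, stepping over the regions where reads fail (the column and output path are illustrative):

    mysql> SELECT * INTO OUTFILE '/tmp/client_table_1.tsv'
        ->   FROM recovery.client_table
        ->   WHERE id BETWEEN 1 AND 100000;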

This more in-depth approach to try and recover data has also failed.

A Failed Database Backup

There is now no other option than to perform a database restore from the previous night’s backup. This, however, failed with the following problems:
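
Assuming the per-schema backup files were gzip compressed (an assumption; the file names are illustrative), the failures were typical decompression errors such as:

    $ gunzip schema_01.sql.gz
    gzip: schema_01.sql.gz: invalid compressed data--crc error
    $ gunzip schema_02.sql.gz
    gzip: schema_02.sql.gz: unexpected end of file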

Some 25% of individual schema backup files failed to uncompress on both the production server and a remote server containing the backup files. What was interesting was the variety of different error messages. The customer was now forced to consider an older backup.

TIP Testing your backup on an external system is important to ensure corruption is not occurring at the time of your backup.
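
A simple integrity test of compressed backup files can be scripted on the external system; assuming gzip-compressed dumps (the backup path is illustrative):

    $ for f in /backup/mysql/*.sql.gz; do
    >   gunzip -t "$f" || echo "CORRUPT: $f"
    > done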

Detecting Hardware Faults

In isolated situations, and when other plausible explanations are exhausted, faulty hardware can be the issue. During this data restore process, other symptoms of slow performance (especially when compressing the original data) and some of the unexplained outcomes shown indicated a possible hardware error. The system log showed no indication of problems to validate this hypothesis; however, in previous situations the end result has been a hardware fault.

During the process of taking additional filesystem backups of the current data and configuration files for contingency, the system failed with a kernel panic. At this time the client was left with no production server, no current backup of data or binary logs, and an uneasy wait while the host provider's system engineers looked into the problem.

Almost an hour passes before the system is accessible again, and the host provider reports a memory fault on the console. MySQL is restarted, a myisamchk is performed on the entire database, and several tables require a recover process; all of this occurs without further incident. Another hour later, the database is in what is considered a stable state. A backup is then performed. The client is now convinced of the importance of the process.
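
A check of every MyISAM table in the instance can be run in one pass while the tables are not in use; a representative command with an illustrative data directory:

    $ myisamchk --silent /var/lib/mysql/*/*.MYI
    # Any table reporting errors is then repaired individually with -r.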

NOTE Any organization without an adequate backup and recovery process is at risk for serious business disruption. In this actual example, luck was on their side.

Conclusion

This client's backup process had two important flaws. The first was that the backup was not checked for any type of error; the uncompressing of backup files was producing errors.

The second flaw was that the binary logs were not being stored on a separate system. If the hardware failure had been a disk and not memory, data recovery may not have been possible. Not mentioned in any detail in this example is an additional restore issue where the binary log position was not recorded during the backup.

This is a good working example of the various approaches to attempting to correct a MyISAM database failure. All of these steps were performed with a client that had an emergency and no plan. If you do not have access to expert resources when attempting to resolve this type of problem, the likelihood increases that not all options will be exhausted.

