Database corruption

From SybaseWiki
Jump to: navigation, search

One of the tasks of a DBA is to frequently check databases for possible corruption and to repair the database when needed. This document outlines how a DBA can be prepared for a possible corruption and gives 10 how-to's for dealing with it.

Be prepared

With just a few simple measures you can prepare yourself:

Make frequent backups

Making regular database and transaction dumps is your first line of defense. Depending on the size of and activity in your databases, full database dumps can be taken daily, weekly or at another convenient moment. Transaction dumps should be taken at regular intervals like once or twice per hour. These backups can be invaluable when a corruption cannot be fixed and you need to restore a database.

Check your databases frequently for corruption

With ASE 11.9.2 the dbcc checkstorage command was introduced that allows you to perform online checks of a database at a high speed. For previous releases you had to use other dbcc commands that were much slower and caused more locking. These two measures should be standard practice for your servers. Making backups without checking for corruption will only give you limited protection. Some corruptions can exist for a long time in a database before causing any problems. This means that your database dumps also contain the corrupt data.

Checking for corruption

A corruption can be detected with the "dbcc checkstorage" command (version 11.9.2 or later). To use this utility you need to create a dbccdb database and initialize it with the installdbccdb script. Some additional configuration is needed for each database on the server, but since a default configuration can be used for all databases.

When you run dbcc checkstorage a database is checked for corruption and the results are stored into the dbccdb database. An additional "dbcc checkverify" is needed to check whether a corruption is real or is temporarily caused by active transactions. With the sp_dbcc_summaryreport procedure (located in the dbccdb database) a summary of the corruptions (faults) can be displayed. There are two types of faults: soft faults and hard faults. I generally ignore soft faults but take immediate action on the hard faults. An additional parameter @display_recommendations for the sp_dbcc_summaryreport procedure can be used to show a standard text how a fault can be resolved. This parameter can only be used in version or above.

Detecting corruption at runtime

When the server encounters a corruption in a database at runtime (without running checkstorage) this is logged into the errorlog. Continuous monitoring of the errorlog (another standard practice for a DBA) is needed to be notified of any corruption as soon as it causes a problem.

10 how-to's for dealing with corruption

Below are a number of "how-to's" for dealing with a corruption in the database.

Read the errormessage write-up

Most errors reported in the errorlog related to a corruption are given a message number. A description of the meaning and how you can resolve the problem can be found in the "Troubleshooting and Error Messages Guide" volume 2. Errors found by dbcc checkstorage and reported with sp_dbcc_summaryreport (like hard faults) are documented in volume 1.

Decide if users can be kept online

Depending on the type of error you should make a decision whether to halt activity on a database as soon as possible, or to wait for an appropriate moment to repair the corruption. There is no rule-of-thumb here since quite a lot of factors come into play, like: which table or database is involved, what's the type of error, etc, etc. Here are just two examples of a corruption detected at runtime. Example 1:

Error: 7939, Severity: 22, State: 1
Table Corrupt: The entry is missing from the OAM for object id 1136007078 indid 4 for allocation page 3828480.  

According to the Troubleshooting and Error Messages Guide production can continue until a suitable moment for maintenance. Example 2:

Starting diagnostics for read failure:
Device Information:
            Device number = 4
            Phyname = /var/sybase/data/data01.dat
Buffer Information:
Page Information from first read attempt:
Page Information from second readattempt:
SDES Information:
PSS Information:
            pstat = 0x10000, pcurdb = 4, pspid = 11
            p2stat = 0x40001010, p3stat = 0x800
            plasterror = 0, preverror = 0, pattention = 0
End diagnostics for read failure:
Run the DBCC command 'dbcc tablealloc' for both the objects 448004627, 1136007078

This undocumented error indicates a possible hardware failure. Production should be halted as soon as possible since a hardware failure might cause more corruptions. These examples illustrate that it is important to monitor the errorlog. The impact of the corruption can be estimated by checking the Sybase documentation. Undocumented messages, as shown in example 2, generally require a "common sense" approach.

A reboot might help

Some database corruptions are not for real, they only exist in memory of ASE and not on disk. These are called transient (or soft) errors. When that's the case a reboot of Sybase ASE should solve the corruption. If the problem still exists after the reboot the corruption is physically in the database and not only in memory.

Run full dbcc checks

When you investigate a database corruption the full extent of it will become visible when you run both dbcc checkdb in combination with a dbcc checkalloc. For large databases these operations can be very time-consuming. Also run a dbcc checkcatalog to check the system tables.

Make some extra backups

Before working on large or complex database corruptions, it makes sense to make a separate full database dump when you begin (when you don't have one already). Making some intermediate dumps along the way can be very useful when at some point of the database restoration something goes wrong. Keep the database dump you made just before you started fixing the corruption. This dump can be useful later for additional investigation when users suspect they lost some data or for training purposes for your colleagues.

Dropping clustered indexes

Be very careful when you want to drop the clustered indexes on all-pages-locked table when that index seems to be corrupt. Since the clustered index also contains the data there is a possibility you will lose data when dropping the index.

bcp-out and bcp-in

One of the simplest methods to resolve a corruption in a table is to copy the data to new table (or do a bcp out), drop the corrupt table, and recreate it. Another possible quick fix is to dump the database and reload it. Some database corruptions are not included in a database dump so reloading the dump into the database can clear them.

Rerun dbcc checks

Once you fixed the corruption indicated by dbcc checkstorage/checkdb/checkalloc or checkcatalog re-run the check again until they run clean and no errors are reported anymore. It's not uncommon that you need to run these checks several times followed by some repair work before they will run without reporting any errors.

Follow up

When you fixed your database corruption be sure to eliminate the cause of it. When you had hardware related problems be sure to replace the faulty parts. Consider upgrading to a newer Sybase version and/or latest EBF.

Take a course/do some reading

Invest some time into reading the Performance and Tuning manuals to become familiar with terms such as OAM, GAM, extents and pages.