DiskErrorRecovery.tioga
Bob Hagmann February 12, 1986 7:46:44 am PST
Disk Error Recovery
CEDAR 6.0 — FOR INTERNAL XEROX USE ONLY
Bob Hagmann
© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: A Short Course in how to get your Cedar file system back to a reasonable state after a hard disk error.
Keywords: Hard Disk Errors
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

1. The State To Be In
.. is California, of course.
But really, what you want is:
The FS BTree is not locked in update.
All bad pages are in the "Bad Page Table".
No file contains a bad page.
All bad pages look allocated (their bits are set in the VAM, the Volume Allocation Map).
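The four conditions above can be read as invariants to check. The following is only a sketch over a toy model: the VolumeModel class, its field names, and the violations function are assumptions made for illustration and do not correspond to real Cedar or Iago data structures.

```python
from dataclasses import dataclass, field


@dataclass
class VolumeModel:
    """Toy stand-in for a volume; not the real Cedar data structures."""
    btree_locked: bool = False
    known_bad: set = field(default_factory=set)       # pages known to be bad
    bad_page_table: set = field(default_factory=set)  # pages recorded as bad
    vam_allocated: set = field(default_factory=set)   # pages whose VAM bit is on
    files: dict = field(default_factory=dict)         # file name -> set of pages


def violations(vol: VolumeModel) -> list:
    """Return one message for each goal condition the volume fails."""
    problems = []
    if vol.btree_locked:
        problems.append("FS BTree is locked in update")
    missing = vol.known_bad - vol.bad_page_table
    if missing:
        problems.append(f"bad pages not in Bad Page Table: {sorted(missing)}")
    for name, pages in sorted(vol.files.items()):
        hit = pages & vol.known_bad
        if hit:
            problems.append(f"file {name} contains bad pages: {sorted(hit)}")
    free = vol.known_bad - vol.vam_allocated
    if free:
        problems.append(f"bad pages look free in the VAM: {sorted(free)}")
    return problems
```

An empty result means the modeled volume is in the state you want; each message otherwise corresponds to one of the steps in section 2.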
Although it doesn't really belong here, two symptoms look like hard disk errors but are not. The first occurs when the "file name table" for FS, the FS BTree, is locked in update; to find out whether you have this problem, see "FS BTree Locked in Update" below. The second occurs when the Volume Allocation Map (VAM) is badly out of date. The symptom is 100% utilization of the disk during file creation, with the creation never appearing to complete. You can wait it out (I think it fixes 10 pages a second), or you can boot to Iago and run the Recompute VAM command.
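To decide whether waiting is practical, a rough estimate helps. The sketch below takes the 10 pages a second figure from the text (itself only a guess); the function name and the number of stale pages in the example are assumptions for illustration.

```python
def vam_repair_wait_seconds(stale_pages: int, pages_per_second: float = 10.0) -> float:
    """Estimate how long the system needs to fix a stale VAM on its own.

    The default rate is the rough figure quoted in the text, not a measurement.
    """
    if stale_pages < 0:
        raise ValueError("stale_pages must be non-negative")
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return stale_pages / pages_per_second


# Hypothetical example: 30,000 stale pages at the quoted rate.
seconds = vam_repair_wait_seconds(30_000)
print(f"{seconds / 60:.0f} minutes")  # prints "50 minutes"
```

If the estimate runs to a large fraction of an hour or more, booting to Iago and running Recompute VAM is probably the better choice.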
2. How To Get There
The steps below must be done in order. On no account delete the file before running Check Drive (in 2.2) or the Read Named File and Add Physical Page combination (in 2.1). Many errors are data dependent, and the act of deleting the file may appear to fix the error. The page is still marginal, however, and unless it has been recorded in the bad page table it will get you later.
Choose 2.1 or 2.2 below depending on what you know about the error.
2.1 If You Know the File's Name
Thanks to Maureen Stone for suggesting this section.
Recover - If You Can and Want To
Ask Doug Wyatt to give you his RescueGoodPages program. It reads one page at a time, ignoring all errors, and makes a copy of the file. This works pretty well if the file is a Peanut file or an AIS file. Many errors are single bit errors, and you just get a bit of noise in the file.
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence.)
Find the Bad Page
Use Iago to run ExtraIago. It can be run by using the Iago command Run Diagnostic BCD with the default file name. If Iago prompts you with "Avoid using the local disk for caching the bcd file?", answer "yes".
Use the Read Named File command to read the file. For each page in error, take the physical page number from the header and write it down.
Get the Bad Page(s) into the Bad Page Table
Use the Add (Physical) Page to the BadPageTable command to add the page(s) to the bad page table.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
Free the Page
Local files do not have a server name; cached files do. For a local file, use the Delete Files command to get rid of the file. For a cached file, use the Flush Cache command to flush it. This will free the page(s) that are bad.
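The choice between the two commands depends only on whether the name has a server part. As a sketch, assume full names are written in the form "[server]<directory>file", with an empty "[]" for local files; this name format, the function names, and the example names are assumptions for illustration, not something Iago exposes.

```python
def has_server_name(full_name: str) -> bool:
    """True when a name of the assumed form "[server]<dir>file" names a server."""
    if not full_name.startswith("["):
        raise ValueError("expected a name beginning with '['")
    close = full_name.find("]")
    if close == -1:
        raise ValueError("expected a closing ']' after the server part")
    return close > 1  # "[]" has nothing between the brackets


def command_to_free(full_name: str) -> str:
    """Pick the Iago command that frees the file's pages."""
    return "Flush Cache" if has_server_name(full_name) else "Delete Files"


# Hypothetical names, for illustration only.
print(command_to_free("[]<>Foo.mesa"))                    # Delete Files
print(command_to_free("[SomeServer]<SomeDir>Foo.mesa"))   # Flush Cache
```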
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad pages will never again be allocated to a file.
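Why this step must follow the delete can be seen in a toy model of an allocation map. The sketch below assumes one bit per page with 1 meaning allocated; the Vam class and its methods are illustrative assumptions and say nothing about how Cedar actually lays out the VAM.

```python
class Vam:
    """Toy allocation map: one bit per page, 1 = allocated."""

    def __init__(self, page_count: int) -> None:
        self.page_count = page_count
        self.bits = bytearray((page_count + 7) // 8)

    def set_allocated(self, page: int, allocated: bool) -> None:
        byte, bit = divmod(page, 8)
        if allocated:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_allocated(self, page: int) -> bool:
        byte, bit = divmod(page, 8)
        return bool(self.bits[byte] & (1 << bit))

    def first_free(self):
        """Return the lowest free page, or None if every page is allocated."""
        for page in range(self.page_count):
            if not self.is_allocated(page):
                return page
        return None


def ensure_bad_pages_in_vam(vam: Vam, bad_page_table) -> None:
    """Mark every page in the bad page table as allocated."""
    for page in bad_page_table:
        vam.set_allocated(page, True)
```

In this model, deleting a file clears the bits of all its pages, including the bad one, so the bad page looks free again. Running ensure_bad_pages_in_vam afterwards sets its bit once more, and first_free will never hand it out.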
Recover the File
If you lost a cached file, no recovery is needed.
2.2 If You Don't Know Anything
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence.)
FS BTree Locked in Update
The FS BTree is the "file name table". To tell whether it is locked, boot Iago and do a List File of BootFile.DontDeleteMe (or anything reasonable). If the BTree is locked, Iago will tell you; otherwise it will list the file. If it is locked, you should Scavenge via Iago. You have to know one detail to scavenge: how many Alto partitions there are. If you don't already know, the ExtraIago command Describe Allocated Disk Pages should tell you (see the next section for how to run ExtraIago). The last line for each physical volume may look like "... (X partitions) for potential alto allocations"; X is the number of Alto regions to give Scavenge if it asks. Use zero if this line is omitted. If you trip over a bad page during the scavenge, read on. If not, it should work, and all (or at least the vast majority of) attachments, local files, and cached files should be in the new BTree. The worst that should happen is that some attachments are lost. Scavenge does not know which attachments were lost, so it will not tell you anything about them. If it asks to delete the checkpoint, let it.
Find the Bad Page
You are somewhat on your own here. You know you have a bad page because you got an event window, or you hung and the teledebugger tells you there is a problem (or ...). If you don't know, read on; you can probably figure it out below.
A good piece of software to have around is ExtraIago. It can be run by using the Iago command Run Diagnostic BCD with the default file name. Answer "yes" to "Avoid using the local disk for caching the bcd file?".
Get the Bad Page(s) into the Bad Page Table
Get to Iago, and do the Check Drive command. It should find the bad page and stick it in the bad page table. Write down everything Check Drive says. If you know the bad page, you can instead use the Add (Physical) Page to the BadPageTable command.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
Free the Page
If the FS BTree is locked in update, then you must scavenge. If this succeeds, the new BTree is not locked.
If you don't know the file (try Describe Bad Pages first), use ExtraIago's Read Logical Volume Pages command to read the page(s) reported by Check Drive. If the leader (also called the label) is OK, it will tell you the address of header page 0. Read the page following that one (header page 1), and print it via the "all" option. The output should be mostly junk, but there should be a file name somewhere in it. Local files do not have a server name, while cached files do. If you cannot find a file name, look for the file header address in the list of root files reported by Describe Logical Volume. If you still don't know what file the page is in, you need some help.
If you know the file name, select that which applies:
cached file: use the Flush Cache command to get rid of it
local file: use the Delete Files command to get rid of it
FS BTree: you must Scavenge. You will likely lose a few attachments from your name space.
VM Backing File: use the Create VM Backing command twice: once to shrink the file to free the page, and after the "Regain Consistency" step below, use Create VM Backing again to set the backing file to the proper size.
Root Files (Checkpoint, Boot, Cedar Microcode, ...): use the Delete Checkpoint command (or the corresponding command for the file in question) in ExtraIago.
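The case analysis above, together with the "Regain Consistency" and "Recover the File" steps that follow it, amounts to a small lookup table. The sketch below is only a memory aid: the command names come from the text, while FileKind, the two tables, and the plan function are assumptions made for illustration.

```python
from enum import Enum


class FileKind(Enum):
    # Labels chosen for this sketch; they are not Iago identifiers.
    CACHED = "cached file"
    LOCAL = "local file"
    FS_BTREE = "FS BTree"
    VM_BACKING = "VM Backing File"
    ROOT = "root file"


# Step that frees the bad page, per kind of file.
FREE_STEP = {
    FileKind.CACHED: "Flush Cache",
    FileKind.LOCAL: "Delete Files",
    FileKind.FS_BTREE: "Scavenge",
    FileKind.VM_BACKING: "Create VM Backing (shrink to free the page)",
    FileKind.ROOT: "Delete Checkpoint (or the matching command for the root file)",
}

# Step taken after consistency is regained; None where the text names none.
FOLLOW_UP = {
    FileKind.CACHED: None,    # no recovery is needed for a cached file
    FileKind.LOCAL: None,     # the text names no specific command
    FileKind.FS_BTREE: None,  # Scavenge already built the new BTree
    FileKind.VM_BACKING: "Create VM Backing (set the proper size)",
    FileKind.ROOT: "Re-install the root file (e.g., Install Boot File in Iago)",
}


def plan(kind: FileKind) -> list:
    """Return the ordered steps for one kind of file."""
    steps = [FREE_STEP[kind], "Ensure Bad Pages in VAM"]
    if FOLLOW_UP[kind] is not None:
        steps.append(FOLLOW_UP[kind])
    return steps


for step in plan(FileKind.VM_BACKING):
    print(step)
```

Keeping "Ensure Bad Pages in VAM" between the two Create VM Backing steps reflects the ordering the text requires: the backing file must not grow back until the bad page looks allocated.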
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad pages will never again be allocated to a file.
Recover the File
Re-install root files (e.g., Install Boot File in Iago), set the VM Backing file size, or do whatever else is needed to recover the lost file. If you lost a cached file, no recovery is needed. Do a full boot and checkpoint unless you are sure that it is OK not to do so.