DiskErrorRecovery.tioga
Bob Hagmann September 15, 1986 11:41:35 am PDT
Disk Error Recovery
CEDAR 6.1 — FOR INTERNAL XEROX USE ONLY
Disk Error Recovery
Bob Hagmann
© Copyright 1985, 1986 Xerox Corporation. All rights reserved.
Abstract: A Short Course in how to get your Cedar file system back to a reasonable state after a hard disk error.
Keywords: Hard Disk Errors
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

1. The State To Be In
.. is California, of course.
But really, what you want is:
FS BTree not locked in update
all bad pages to be in the "Bad Page Table"
no bad pages in any files
all the bad pages look allocated (bit set on in the VAM - the Volume Allocation Map).
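The four conditions above can be read as a checklist. The sketch below is only a model for illustration: the VolumeState record and all of its fields are hypothetical and do not correspond to any Cedar, Iago, or ExtraIago interface.

```python
from dataclasses import dataclass, field


@dataclass
class VolumeState:
    """Hypothetical snapshot of a volume; not a real Cedar data structure."""
    btree_locked_in_update: bool
    bad_pages: set = field(default_factory=set)       # pages known to be bad
    bad_page_table: set = field(default_factory=set)  # entries in the Bad Page Table
    file_pages: dict = field(default_factory=dict)    # file name -> set of pages it owns
    vam_allocated: set = field(default_factory=set)   # pages whose VAM bit is on


def problems(state):
    """Return the target conditions that the volume does not yet meet."""
    found = []
    if state.btree_locked_in_update:
        found.append("FS BTree locked in update")
    if not state.bad_pages <= state.bad_page_table:
        found.append("bad page missing from Bad Page Table")
    for name, pages in sorted(state.file_pages.items()):
        if pages & state.bad_pages:
            found.append("bad page in file " + name)
    if not state.bad_pages <= state.vam_allocated:
        found.append("bad page not allocated in VAM")
    return found
```

An empty result means the volume is in the state this document is trying to reach.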
2. Introduction
2.1 Other disk problems
Although it doesn't really belong here, there are two symptoms that appear to be hard disk errors, but are not. The first is when the "file name table" for FS, the FS B-Tree, is locked in update. To discover if you have this problem, see "FS BTree Locked in Update" below.
The other problem occurs when the Volume Allocation Map (VAM) is badly out of date. The symptom of this is 100% utilization of the disk during file creation, and the creation never completes. You can wait it out (I think it fixes 10 pages a second), or you can boot to Iago and run the Recompute VAM command.
2.2 Lots of errors
If you have all of a sudden discovered lots (> 25) of disk errors, then it is likely that the disk has a hardware problem. The most we can support is 128 hard errors on a disk. Pack your bags, you're going on a trip.
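The two numbers in this section can be combined into a rough triage rule. This is a minimal sketch; the function name and the wording of the results are made up for illustration, and only the thresholds (25 and 128) come from the text.

```python
SUSPECT_HARDWARE_ABOVE = 25    # more than this many sudden errors suggests bad hardware
MAX_SUPPORTED_HARD_ERRORS = 128  # the most hard errors supported on one disk


def assess(error_count):
    """Classify a count of hard disk errors using the thresholds above."""
    if error_count < 0:
        raise ValueError("error count cannot be negative")
    if error_count > MAX_SUPPORTED_HARD_ERRORS:
        return "more errors than can be supported"
    if error_count > SUSPECT_HARDWARE_ABOVE:
        return "likely hardware problem"
    return "try the recovery steps"
```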
2.3 Finding the bad page and file name after a disk error
Some disk errors are such that the system can open an error window and report them. Other disk errors lock the file table or are in files that are needed to open the error window. The symptoms of such a lock up are a partly opened window that never fully opens, any access to the file name table locking up, or the whole system seeming to be wedged.
For any wedged system, use the teledebugger. Depress shift-look-swat and observe the system go into 815. Type "Debug YourMachineName" to the commander on some other Cedar workstation (put your machine's name as YourMachineName). As a rule of thumb, teledebugging works only once per rollback on the debugger machine.
Enter "SignalsImpl" in the Context type-in entry, and click Context on the second line (the one with the box around it) to Freeze the processes that are doing signals. Look at the processes and see if any of them are there due to File.Error. If so, write down the diskPage associated with the error and (if you can figure it out) the file name. If you can find a FS.OpenFile for this file (call it openFile), then find the file's name by interpreting "openFile.a.nameBody".
2.4 Iago, ExtraIago, and FixBadPage
There are three utility programs that we can use. The first two are Iago and ExtraIago. To get to Iago, do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence.) If you are asked if you want to use Iago, answer "yes". Get ExtraIago going by using the Iago command Run Diagnostic BCD with the default file name. If Iago prompts you with "Avoid using the local disk for caching the bcd file?", answer "yes".
FixBadPage is in CedarChest. It has to be used with a fully running Cedar system. You may have to do a bringover.
3. How To Get There
There is a definite order to the steps below. On no account should you delete the file before running Check Drive (in 3.3), or the Read Named File and Add Physical Page combination (in 3.1). Many errors are data dependent, and the act of deleting the file may make the error go away for now. The page is still marginal, and it will get you later.
Choose 3.1, 3.2 or 3.3 below depending on what you know about the error and whether you can rollback.
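The choice among the three procedures can be sketched as a small decision function. The function and parameter names are hypothetical. Sending the case of a known page number without rollback to 3.3 is my reading of that section, which allows adding a known page to the bad page table directly.

```python
def choose_section(knows_file_name, knows_bad_page, can_rollback):
    """Return the number of the section in part 3 that applies."""
    if can_rollback and (knows_file_name or knows_bad_page):
        return "3.2"  # use FixBadPage on a running Cedar system
    if knows_file_name:
        return "3.1"  # work from the file name in Iago and ExtraIago
    return "3.3"      # start with Check Drive


# Example: you know the file's name but cannot rollback.
section = choose_section(knows_file_name=True, knows_bad_page=False, can_rollback=False)
```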
3.1 If You Know the File's Name but can't rollback
Thanks to Maureen Stone for suggesting this section.
Recover - If You Can and Want To
Ask Doug Wyatt to give you his RescueGoodPages program. It reads one page at a time, ignoring all errors, and makes a copy of the file. This works pretty well if the file is a Peanut file or an AIS file. Many errors are single bit errors and you just get a bit of noise in the file.
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence).
Find the Bad Page
Use Iago to run ExtraIago. It can be run by using the Iago command Run Diagnostic BCD with the default file name. If Iago prompts you with "Avoid using the local disk for caching the bcd file?", answer "yes".
Use the Read Named File command to read the file. Extract the physical page number from the header for pages in error. Write it down.
Get the Bad Page(s) into the Bad Page Table
Use the Add (Physical) Page to the BadPageTable command to add the page(s) to the bad page table.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
Free the Page
Use the Delete Files command to get rid of the file for local files, and the Flush Cache command to flush cached files. Local files do not have a server name, while cached files do have a server name. This will free the page(s) that are bad.
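The local versus cached distinction can be sketched as a check on the server part of the name. This assumes that full names are written in the form "[server]<directory>file" and that a local file has an empty server part; that syntax is an assumption here, since this document does not spell it out. The function names are hypothetical.

```python
def server_of(full_name):
    """Return the text between a leading '[' and the first ']', or '' if absent.

    Assumes names of the form '[server]<directory>file'.
    """
    if full_name.startswith("[") and "]" in full_name:
        return full_name[1:full_name.index("]")]
    return ""


def removal_command(full_name):
    """Cached files (with a server name) are flushed; local files are deleted."""
    return "Flush Cache" if server_of(full_name) else "Delete Files"
```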
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad page will never be allocated to a file ever again.
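What this step accomplishes can be modeled with a plain bitmap. The sketch below assumes one bit per page, eight pages per byte, lowest bit first; the real VAM layout is not described here, so treat the layout and the function name as illustrative only.

```python
def ensure_bad_pages_in_vam(vam, bad_pages):
    """Turn on the allocated bit for every bad page.

    vam is a bytearray used as a bitmap (assumed layout: page p is bit p % 8
    of byte p // 8). Returns the pages whose bit had been off.
    """
    fixed = []
    for page in sorted(bad_pages):
        byte_index, bit = divmod(page, 8)
        mask = 1 << bit
        if not vam[byte_index] & mask:
            vam[byte_index] |= mask
            fixed.append(page)
    return fixed
```

Once the bit is on, an allocator that only hands out pages with the bit off can never give the bad page to a file.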
Recover the File
If you lost a cached file, no recovery is needed.
3.2 If You Know the File's Name or the bad page number, and can rollback
Read FixBadPageDoc and use FixBadPage. Select the FixBadPage, RecoverBadPage, DeleteBadPage, FixFile, RecoverFile, or DeleteFile command as appropriate.
3.3 If You Don't Know Anything
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence).
FS BTree Locked in Update
The FS BTree is the "file name table". If it is locked, you should Scavenge via Iago. To tell whether it is locked, boot up Iago and do a List File of BootFile.DontDeleteMe (or anything reasonable). If the BTree is locked, Iago will tell you; otherwise it will list the file. You have to know one detail to scavenge: how many Alto partitions there are. If you don't already know, the ExtraIago command Describe Allocated Disk Pages should tell you (see section 2.4 for how to run ExtraIago). The last line for each physical volume may read something like "... (X partitions) for potential alto allocations"; X is the number of Alto regions to give scavenge if it asks. Use zero if this line is omitted. If you trip over a bad page during scavenge, read on. If not, scavenge should work, and all (or at least the vast majority of) attachments, local files, and cached files should be in the new BTree. The worst that should happen is that some attachments are lost. Scavenge does not know which attachments were lost, so it will not tell you anything about them. If it asks to delete the checkpoint, let it.
If your disk is badly fragmented, it is possible that scavenge will lock up. Scavenge should finish in 20 minutes on a T-80 and 80 minutes on an AMS-315. If it takes longer than this, worry. Lots of small files, or marginal hardware requiring re-tries, can make scavenge run even longer.
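The timing rule above amounts to a lookup. This is a minimal sketch with hypothetical names; only the two time limits come from the text. An overdue scavenge is a reason to check for a hang, not proof of one, since small files and re-tries also lengthen the run.

```python
EXPECTED_SCAVENGE_MINUTES = {"T-80": 20, "AMS-315": 80}  # limits quoted in the text


def scavenge_overdue(disk_model, elapsed_minutes):
    """Return True if scavenge has run longer than expected for this disk."""
    return elapsed_minutes > EXPECTED_SCAVENGE_MINUTES[disk_model]
```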
To see if you are hung, put the machine in 815 (control-look-swat) and teledebug. Freeze processes in FSFileOpsImpl (should be process 20B) and adjust to see the whole stack. If you see it doing a FileImpl.SetSize on the new BTree, you are probably in trouble. BTreeRead.Lock will be on top of the stack and the process will be waitingCV (forever). If you are not sure you are hung, look at something interesting in FSFileOpsImpl.Scavenge (such as the "nameBody" in the loop), start the machine back up (remembering to thaw all processes), wait a minute or so, and put it back in 815. If you are still at the same place, you are hung.
Deleting the checkpoint will give you enough non-fragmented room. [If possible, try to find Carl Hauser or Bob Hagmann before doing the delete. Remind us that we are interested in this problem. We think we have fixed it in Cedar 7.0.]
Sorry that this is somewhat vague. I don't have a machine with this problem available so that I can tell you exactly what button to push. -- Bob
If you have to delete your checkpoint, then consider erasing the volume. You are very fragmented. If you do not use the CacheKillExcessVersions command in the Commander, then you should learn about it.
If you've deleted the checkpoint and the system has lost the pages, then use the Recompute VAM command in ExtraIago to find some free pages. If you can't get enough free pages, you lose.
Find the Bad Page
You are somewhat on your own here. See 2.3 above. You know you have a bad page because you got an event window, or you hung and the teledebugger tells you there is a problem (or ...). If you don't know, read on — you can probably figure it out below.
Get the Bad Page(s) into the Bad Page Table
Get to Iago, and do the Check Drive command. It should find the bad page and stick it in the bad page table. Write down everything Check Drive says. If you know the bad page, you can instead use the Add (Physical) Page to the BadPageTable command.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
If you can rollback, make sure you have the bad page written down on paper, then rollback. Refer to section 3.2 and read FixBadPageDoc.
Free the Page
If the FS BTree is locked in update, then you must scavenge. If this succeeds, the new BTree is not locked.
If you don't know the file (try Describe Bad Pages first), then use ExtraIago's command Read Logical Volume Pages to read the page(s) reported by Check Drive. If the leader (also called the label) is OK, then it will tell you the address of the header page 0. Read the page following this one (header page 1), and print it via the "all" option. This should have a bunch of junk, but there should be a file name somewhere. Local files do not have a server name, while cached files do have a server name. If you cannot find a file name, look for the file header address in the list of root files reported by Describe Logical Volume. If you still don't know what file the page is in, you need some help.
If you know the file name, select that which applies:
cached file: use the Flush Cache command to get rid of it
local file: use the Delete Files command to get rid of it
FS BTree: you must Scavenge. You will likely lose a few attachments from your name space.
VM Backing File: use the Create VM Backing command twice: once to shrink the file to free the page, and after the "Regain Consistency" step below, use Create VM Backing again to set the backing file to the proper size.
Root Files: (Checkpoint, Boot, Cedar Microcode, ...): use the Delete Checkpoint (or whatever) command in ExtraIago.
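The list above is in effect a table from the kind of file to the command that frees the page. A minimal sketch follows; the dictionary keys and the function name are hypothetical labels, and the command names are the ones given in the list.

```python
FREE_PAGE_ACTION = {
    "cached file": "Flush Cache",
    "local file": "Delete Files",
    "FS BTree": "Scavenge",
    # Shrink now; restore the proper size after the Regain Consistency step.
    "VM backing file": "Create VM Backing",
    # Or the matching ExtraIago delete command for the root file in question.
    "root file": "Delete Checkpoint",
}


def action_for(kind):
    """Return the command that frees a bad page held by this kind of file."""
    try:
        return FREE_PAGE_ACTION[kind]
    except KeyError:
        raise ValueError("unknown kind of file: " + kind) from None
```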
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad page will never be allocated to a file ever again.
Recover the File
Re-install root files (e.g., Install Boot File in Iago), set the VM Backing file size, or do whatever else is needed to recover the lost file. If you lost a cached file, no recovery is needed. Do a full boot and checkpoint unless you are sure that it is OK not to do so.