3. How To Get There
There is a definite order to the steps below. On no account should one delete the file before running Check Drive (in 3.3), or the Read Named File and Add Physical Page combination (in 3.1). Many errors are data dependent, and the act of deleting the file may seem to fix the error. The page is still marginal, and it will get you later.
Choose 3.1, 3.2 or 3.3 below depending on what you know about the error and whether you can rollback.
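The choice between the three subsections can be sketched as a small decision function. This is only an illustration of the headings of 3.1, 3.2 and 3.3; the function name and its arguments are made up here, and sending every case the headings do not name to 3.3 is an assumption.

```python
def choose_section(knows_file_name: bool, knows_bad_page: bool, can_rollback: bool) -> str:
    """Return the subsection number to follow (hypothetical helper)."""
    # 3.2: you know the file's name or the bad page number, and can rollback.
    if can_rollback and (knows_file_name or knows_bad_page):
        return "3.2"
    # 3.1: you know the file's name but can't rollback.
    if knows_file_name:
        return "3.1"
    # Assumption: every remaining case is handled by 3.3.
    return "3.3"
```

For example, choose_section(True, False, False) gives "3.1".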
3.1 If You Know the File's Name but can't rollback
Thanks to Maureen Stone for suggesting this section.
Recover - If You Can and Want To
Ask Doug Wyatt to give you his RescueGoodPages program. It reads one page at a time, ignoring all errors, and makes a copy of the file. This works pretty well if the file is a Peanut file or an AIS file. Many errors are single bit errors, and you just get a bit of noise in the file.
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence.)
Find the Bad Page
Use Iago to run ExtraIago. It can be run by using the Iago command Run Diagnostic BCD with the default file name. If Iago prompts you with "Avoid using the local disk for caching the bcd file?", answer "yes".
Use the Read Named File command to read the file. Extract the physical page number from the header for pages in error. Write it down.
Get the Bad Page(s) into the Bad Page Table
Use the Add (Physical) Page to the BadPageTable command to add the page(s) to the bad page table.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
Free the Page
Use the Delete Files command to get rid of the file for local files, and the Flush Cache command to flush cached files. Local files do not have a server name, while cached files do have a server name. This will free the page(s) that are bad.
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad page will never be allocated to a file ever again.
Recover the File
If you lost a cached file, no recovery is needed.
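Since the order of the steps in 3.1 matters, it may help to treat them as a checklist. The sketch below is only an illustration: the list repeats the step headings above (leaving out the optional "Recover - If You Can and Want To" step), and the helper name is made up.

```python
STEPS_3_1 = [
    "Get To Iago",
    "Find the Bad Page",
    "Get the Bad Page(s) into the Bad Page Table",
    "Free the Page",
    "Regain Consistency",
    "Recover the File",
]


def first_out_of_order(done):
    """Return the first step in `done` performed out of order, or None.

    `done` is the list of step headings in the order they were carried out.
    A step is out of order if any earlier step was skipped.
    """
    for position, step in enumerate(done):
        if STEPS_3_1.index(step) != position:
            return step
    return None
```

Doing "Free the Page" right after "Get To Iago" is flagged, which is the mistake the start of this section warns about.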
3.2 If You Know the File's Name or the bad page number, and can rollback
Read FixBadPageDoc and use FixBadPage. Select the FixBadPage, RecoverBadPage, DeleteBadPage, FixFile, RecoverFile, or DeleteFile command as appropriate.
3.3 If You Don't Know Anything
Get To Iago
Do an "L" boot or a net boot. (An "L" boot is a boot with the "L" key held down from 812 to 845 in the boot sequence.)
FS BTree Locked in Update
The FS BTree is the "file name table". If it is locked, you should Scavenge via Iago. To tell whether it is locked, boot Iago and do a List File of BootFile.DontDeleteMe (or anything reasonable). If the BTree is locked, Iago will tell you; otherwise it will list the file.
You have to know one detail to scavenge: how many Alto partitions there are. If you don't already know, the ExtraIago command Describe Allocated Disk Pages should tell you (see the next section for how to run ExtraIago). The output for each physical volume may end with a line like "... (X partitions) for potential alto allocations"; X is the number of Alto partitions to give Scavenge if it asks. Use zero if this line is omitted.
If you trip over a bad page during scavenge, read on. If not, it should work, and all (or at least the vast majority of) attachments, local files, and cached files should be in the new BTree. The worst that should happen is that some attachments are lost. Scavenge does not know which attachments were lost, so it will not tell you anything. If it asks to delete the checkpoint, let it.
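If you have copied the output of Describe Allocated Disk Pages somewhere, the partition count can be pulled out mechanically. This is a rough sketch: the only thing taken from above is the "(X partitions) for potential alto allocations" wording, the function name is made up, and it assumes the text covers a single physical volume.

```python
import re

# Matches the "(X partitions) for potential alto allocations" wording.
_PARTITIONS = re.compile(
    r"\((\d+)\s+partitions?\)\s+for potential alto allocations",
    re.IGNORECASE,
)


def alto_partition_count(report: str) -> int:
    """Return X from the partitions line, or 0 if the line is omitted."""
    match = _PARTITIONS.search(report)
    return int(match.group(1)) if match else 0
```

The zero default mirrors the rule above: if the line is missing, tell Scavenge there are zero Alto partitions.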
If your disk is badly fragmented, it is possible that scavenge will lock up. Scavenge should finish in 20 minutes on a T-80 and 80 minutes on an AMS-315. If it takes longer than this, worry. Lots of small files, or marginal hardware requiring retries, can make scavenge run even longer.
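The timing rule of thumb can be written down as a tiny check. The two limits are the ones quoted above; the table and function names are made up, and an overdue scavenge is only a reason to look closer, not proof of a hang.

```python
# Expected scavenge times in minutes, as quoted above.
EXPECTED_SCAVENGE_MINUTES = {"T-80": 20, "AMS-315": 80}


def scavenge_overdue(drive: str, elapsed_minutes: float) -> bool:
    """True if scavenge has run longer than the expected time for this drive."""
    return elapsed_minutes > EXPECTED_SCAVENGE_MINUTES[drive]
```

A disk with lots of small files, or hardware that needs retries, can legitimately run past these numbers.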
To see if you are hung, put the machine in 815 (control-look-swat) and teledebug. Freeze processes in FSFileOpsImpl (should be process 20B) and adjust to see the whole stack. If you see it doing a FileImpl.SetSize on the new BTree, you are probably in trouble. BTreeRead.Lock will be on top of the stack and the process will be waitingCV (forever). If you are not sure you are hung, look at something interesting in FSFileOpsImpl.Scavenge (such as the "nameBody" in the loop), start the machine back up (remembering to thaw all processes), wait a minute or so, and put it back in 815. If you are still at the same place, you are hung.
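The "look twice, a minute apart" test can be sketched as follows. This assumes you have written down, by hand, the stack frames (top first) and the process state from each look; nothing here talks to the teledebugger, and the function names are made up.

```python
from typing import Sequence, Tuple

Sample = Tuple[Sequence[str], str]  # (stack frames with the top first, process state)


def probably_in_trouble(sample: Sample) -> bool:
    """The pattern described above: BTreeRead.Lock on top of the stack, the
    process waitingCV, and a FileImpl.SetSize somewhere on the stack."""
    frames, state = sample
    return (
        len(frames) > 0
        and frames[0] == "BTreeRead.Lock"
        and state == "waitingCV"
        and "FileImpl.SetSize" in frames
    )


def still_at_same_place(first: Sample, second: Sample) -> bool:
    """True if the two looks show the same stack and the same state."""
    return list(first[0]) == list(second[0]) and first[1] == second[1]
```

If both looks match, you are hung by the rule above; if the first look also matches the BTreeRead.Lock pattern, it is very likely this fragmentation problem.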
Deleting the checkpoint will give you enough non-fragmented room. [If possible, try to find Carl Hauser or Bob Hagmann before doing the delete. Remind us that we are interested in this problem. We think we have fixed it in Cedar 7.0]
Sorry that this is somewhat vague. I don't have a machine with this problem available so that I can tell you exactly what button to push. -- Bob
If you have to delete your checkpoint, then consider erasing the volume. You are very fragmented. If you do not use the CacheKillExcessVersions command in the Commander, then you should learn about it.
If you've deleted the checkpoint and the system has lost the pages, then use the Recompute VAM command in ExtraIago to find some free pages. If you can't get enough free pages, you lose.
Find the Bad Page
You are somewhat on your own here. See 2.3 above. You know you have a bad page because you got an event window, or you hung and the teledebugger tells you there is a problem (or ...). If you don't know, read on — you can probably figure it out below.
Get the Bad Page(s) into the Bad Page Table
Get to Iago, and do the Check Drive command. It should find the bad page and stick it in the bad page table. Write down everything Check Drive says. If you know the bad page, you can instead use the Add (Physical) Page to the BadPageTable command.
Two interesting commands in ExtraIago are Describe Bad Pages and List Bad Pages. The first one attempts to tell you which file the bad page is in. The second tells you the logical and physical disk addresses of the bad pages.
If you can rollback, make sure you have the bad page written down on paper, then rollback. Refer to section 3.2 and read FixBadPageDoc.
Free the Page
If the FS BTree is locked in update, then you must scavenge. If this succeeds, the new BTree is not locked.
If you don't know the file (try Describe Bad Pages first), then use ExtraIago's command Read Logical Volume Pages to read the page(s) reported by Check Drive. If the leader (also called the label) is OK, then it will tell you the address of header page 0. Read the page following this one (header page 1), and print it via the "all" option. This should have a bunch of junk, but there should be a file name somewhere. Local files do not have a server name, while cached files do have a server name. If that does not turn up a file name, look for the file header address in the list of root files reported by Describe Logical Volume. If you still don't know what file the page is in, you need some help.
If you know the file name, select that which applies:
cached file: use the Flush Cache command to get rid of it
local file: use the Delete Files command to get rid of it
FS BTree: you must Scavenge. You will likely lose a few attachments from your name space.
VM Backing File: use the Create VM Backing command twice: once to shrink the file to free the page, and after the "Regain Consistency" step below, use Create VM Backing again to set the backing file to the proper size.
Root Files: (Checkpoint, Boot, Cedar Microcode, ...): use the Delete Checkpoint (or whatever) command in ExtraIago.
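The list above is really a lookup table from kind of file to action. The sketch below just restates it; the dictionary keys and the helper name are made up, and the only rule it encodes beyond the list is that local files have no server name while cached files do.

```python
from typing import Optional

# Kind of file -> what to do to free the bad page (restating the list above).
FREE_PAGE_ACTION = {
    "cached file": "Flush Cache",
    "local file": "Delete Files",
    "FS BTree": "Scavenge",
    "VM Backing File": "Create VM Backing (shrink now, resize after Regain Consistency)",
    "Root File": "Delete Checkpoint (or whatever) in ExtraIago",
}


def kind_from_server_name(server_name: Optional[str]) -> str:
    """Cached files have a server name; local files do not."""
    return "cached file" if server_name else "local file"
```

For an ordinary file, FREE_PAGE_ACTION[kind_from_server_name(None)] gives "Delete Files". The server name is whatever you read out of header page 1.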
Regain Consistency
Use the Ensure Bad Pages in VAM command in ExtraIago to ensure all the bad pages look allocated in the VAM. This means that the bad page will never be allocated to a file ever again.
Recover the File
Re-install root files (e.g., Install Boot File in Iago), set the VM Backing file size, or do whatever else is needed to recover the lost file. If you lost a cached file, no recovery is needed. Do a full boot and checkpoint unless you are sure that it is OK not to do so.