Maxc Operations: Tenex Crashes
2. TENEX CRASHES
If a Maxc-Tenex maintainer is available at PARC, inform him of the crash; normally he will take over. You should try the office numbers of the maintainers, even at odd hours, because they work irregular hours. Otherwise, be brave and read on.
If you have not already done so, you should first familiarize yourself with the material in the Introduction to this manual (Section 1). There is a map of the machine room posted on the bulletin board, and most of the equipment is labelled.
There are obvious problems in attempting to describe what to do when a system crashes. This section outlines a few simple procedures whose purposes are twofold: (1) to collect data about the crash for subsequent analysis and (2) to restart the machine quickly, losing as little state as possible.
Begin by checking the log book and the whiteboard for any special instructions before proceeding with the following. The last few log book entries may describe a crash like the one that has just occurred. This may suggest a restart procedure for you to follow.
You should append an entry to the log book with your name, the date and time, a statement that Maxc crashed, and any other information that you discover while following the procedures below. Recent error typeouts (e.g., memory parity error messages) should be cut out and taped in the log book or, if relevant, copied from the display into the log.
Look carefully at the upper and lower windows on the Maxc Alto. (Type an "A" or something to activate the display.) The upper window will show the "flashing-register" display and error messages from AltIO; the lower window serves as a console teletype. Also look at the Maxc2 logging terminal, a Diablo printer located behind the Maxc2 Alto console.
Maxc Alto failures are generally manifested by an Alto Swat call or by the Alto hanging someplace. If it hangs, neither TODCLK nor any of the other registers in the flashing-register display will be updating.
The lower window will show error messages from Tenex. The logging terminal will have numerous messages typed out by Tenex before it crashed. The flashing-register display shows the names and contents of several Tenex core locations that are frequently updated during normal operation. For example, TODCLK contains the time of day and is maintained by AltIO, so it should be updating regularly unless AltIO has crashed. The Tenex scheduler will be frequently updating the NBRUN-NBPROC word if Tenex is running normally.
Information about the crash may be apparent to you when you read the printout on these. The logging terminal may have BUGCHK messages from Tenex; frequently the BUGCHK messages are irrelevant to the crash, but sometimes they are interesting. The log is normally filled with messages about network failures of various types and recoverable disk errors such as:
**PUPSRV date time FTP: Server timed out ...
***IMPBUG 8599 Header ...
*DSKERR: ...
These messages are usually irrelevant to the crash, so don’t be overly concerned about them. However, if the crash is caused by a network jamup of some kind, the BUGCHK messages on the logging terminal may give a clue to the nature of the problem (for Ethernet or ARPA network failures).
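The reasoning about the flashing-register display can be sketched as a comparison of two readings taken a few seconds apart. This is only an illustration, not a program that exists on Maxc: the function name `diagnose_registers` and the dictionary form of the readings are assumptions made for the example. Only TODCLK and the NBRUN-NBPROC word are named in this section; any other register names you pass in are whatever you read off the screen.

```python
def diagnose_registers(before, after):
    """Compare two readings of the flashing-register display.

    `before` and `after` are hypothetical dicts with the same keys,
    mapping a register name (e.g. "TODCLK", "NBRUN-NBPROC") to the
    value read off the upper window of the Maxc Alto.
    """
    todclk_moving = before["TODCLK"] != after["TODCLK"]
    others = [name for name in before if name != "TODCLK"]
    others_moving = any(before[name] != after[name] for name in others)

    if not todclk_moving and not others_moving:
        # AltIO maintains TODCLK, so a completely frozen display
        # points at the Alto (or AltIO) rather than at Tenex.
        return "Alto or AltIO appears hung"
    if todclk_moving and not others_moving:
        # AltIO is alive, but Tenex is not updating its own cells.
        return "AltIO alive; Tenex or the microprocessor appears hung"
    if todclk_moving and others_moving:
        return "registers updating normally"
    # Not described in the text; flagged rather than interpreted.
    return "unexpected: Tenex cells changing while TODCLK is frozen"
```

For example, two readings in which TODCLK advances but NBRUN-NBPROC stays the same yield the second result, which matches the microprocessor-hang symptom described later in this section.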
When system personnel are not present, Tenex is generally left in a mode in which it attempts recovery from Tenex-detected errors. Thus most crashes handled by non-system people will be of a more obscure (relative to Tenex) nature. The following paragraphs describe some of the more common crashes and suggested recovery procedures.
Your first objective in dealing with a crash is to determine what kind of failure caused the crash. To do this you will look at various symptoms and try to classify the failure. The following are plausible reasons for crashes:
a. Ethernet problems occasionally cause Tenex to be inaccessible, even though Tenex has not crashed. In this case, there will be no BUGHLT message from Tenex, no "Microcode halt" message from AltIO, and the flashing registers will be updating normally. This situation is normally accompanied by numerous network-related BUGCHK messages on the logging terminal. You can determine whether or not this has occurred by typing control-C to the lower window of the Alto; if Tenex responds to control-C with a login message, then you know that it is still alive, and you should find someone to fix the Ethernet. (If the cursor is not flashing in the lower window, you have to type the bottom unmarked key at the right of the Alto keyboard before typing control-C.)
b. Software or firmware bugs have been rare. Those seen have usually been related to the ARPA, Ether, and MCA networks, and they generally manifest themselves only when the network hardware is malfunctioning in some way. However, a Tenex BUGHLT or a microprocessor hang might be caused by a software/firmware bug.
c. Disk drive or disk controller failures are frequent causes of crashes. These might be manifested by one of the disk drives going into select-lock (a red light on the disk control panel turns on when this occurs). Building power glitches might also cause this. A disk drive/controller failure will normally show the symptom of a BUGHLT message in the lower window of the Alto. Disk failures are discussed below.
d. Main memory failures are generally manifested by a failure message from AltIO. Garden-variety uncorrectable double errors may result in the Tenex parity error sweep being invoked for a crash autorestart, and this may in turn be followed by a microprocessor halt as discussed below. Memory cabinet power supplies sometimes turn off due to shorts or other hardware failures; this is manifested by one of the four power supply lights on a memory cabinet turning off. An assortment of symptoms for various memory failures is discussed below.
e. Microprocessor failures are generally manifested by microprocessor halts, by peculiar Tenex BUGHLTs, or by the microprocessor hanging. If the microprocessor hangs, the TODCLK item in the flashing-register display will generally be updating normally (TODCLK is updated by AltIO, not Tenex), but other flashing-register display items will be unchanging. This could also occur when the core image of Tenex is smashed in some strange way.
The various symptoms which imply one or another of these kinds of failures are discussed below.
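The classification in items a through e above can be summarized as a lookup from observed symptoms to plausible causes. The following is a minimal sketch, assuming made-up symptom names (such as `select_lock_light_on`) chosen for this example; it restates the list and is not a diagnostic tool from the Maxc software.

```python
# Hypothetical symptom names mapped to the lettered causes above.
SYMPTOM_TO_CAUSES = {
    "select_lock_light_on": ["c. disk drive or controller"],
    "bughlt_message": [
        "c. disk drive or controller",
        "b. software or firmware bug",
        "e. microprocessor",
    ],
    "altio_memory_failure_message": ["d. main memory"],
    "memory_supply_light_off": ["d. main memory"],
    "microprocessor_halt": ["e. microprocessor", "d. main memory"],
    "tenex_registers_frozen": [
        "e. microprocessor",
        "b. software or firmware bug",
    ],
}


def plausible_causes(symptoms):
    """Return the plausible causes for a set of observed symptom names."""
    # Case a: none of the crash symptoms is present and Tenex still
    # answers control-C with a login message.
    if symptoms.isdisjoint(SYMPTOM_TO_CAUSES) and "control_c_gets_login" in symptoms:
        return ["a. Ethernet problem (Tenex has not crashed)"]
    causes = []
    for symptom in sorted(symptoms):
        for cause in SYMPTOM_TO_CAUSES.get(symptom, []):
            if cause not in causes:
                causes.append(cause)
    return causes
```

A symptom may map to several causes because the text itself says so: a BUGHLT, for instance, may come from a disk failure, a software or firmware bug, or the microprocessor. The result narrows the search and does not replace the detailed paragraphs that follow.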
A. BUGHLT or BUGCHK: A message of the form
BUGHLT at 73550
$8B>>BUGADR BUGHLT/ CAI UUOH+4
types out on the Maxc console. This occurs when the monitor is in debug mode, which should never be the case unless system maintainers are present. However, if you cannot find one, set the DBUGSW and DCHKSW cells to zero and proceed from the breakpoint:
DBUGSW/10<lf>
DCHKSW/10<cr>
<esc>P
Tenex will recover from the error if possible; otherwise it will restart automatically with no further intervention required.
Sometimes Tenex will hang in the autorestart code which follows a "BUGHLT at nnnnnn" message. This may be indicated by TODCLK in the Tenex register display updating normally but nothing else happening. In this case, look up the message associated with the BUGHLT in the BUGSTRINGS.MAXC2 listing on the table. This is frequently caused by a disk or microprocessor hardware problem. Unless something more creative occurs to you when you read the BUGHLT message, try the "Last Resort" procedure in paragraph I below.
B. "Trouble with System Pack nnn" prints out on the Maxc console, followed by "Type M to move pack, R to resume". This is caused by a disk unit going offline or failing in some equally catastrophic way. If the source of the problem is obvious (e.g. someone switched the unit off accidentally), rectify the problem, wait for the unit to be online (green light lit), and type "R" on the Maxc console. In other cases, it is usually better to move the pack to a free drive (if there is one--frequently the only free drive has a Bsys backup pack mounted on it, which you may remove). After waiting for the new drive to be online, type "M" followed by the letter corresponding to the drive you have moved the pack to (A through H). Tenex should now resume automatically.
Sometimes disk units have gone into select lock without any apparent reason (e.g., after a building power glitch). This can be cured sometimes by powering down the unit, letting it stop, then powering up again. After powering down the front panel switches and waiting for the disk unit to stop, you may have to turn off the AC power switch in back in order to clear select lock. If the disk unit does not stop when you power down the front switch, do not turn off the AC power in back because the heads may not have retracted, and you will destroy the disk pack by powering down. We have had several failures like this.
If moving the pack doesn’t succeed in restarting Tenex, or if another crash occurs later, you will have to restart by booting Micro-Exec. Then you will have to tell Micro-Exec what the new disk configuration is. This is done by using "Print.Disk.Configuration" and "Set.Disk.Configuration" commands as discussed in the Micro-Exec section of this document.
If by unfortunate chance the drive that fails is the first one in the old configuration, then Micro-Exec will be in the first save area on that drive, and you will have to boot it by typing nB to AltIO as discussed in the AltIO section. This is a little different from the normal boot procedure which defaults the drive for booting to drive A.
C. "Micro Breakpoint" prints out in the AltIO command window. This message means that the microprocessor hit a breakpoint, which is usually caused either by Tenex executing a HALT instruction or by the microcode detecting some serious internal inconsistency.
The only known HALT instructions in Tenex are associated with catastrophic disk errors, and a message such as
IRREC. READ ERROR IN DIRECTORY--BEWARE OF DISK WRITE FAILURE
TROUBLE WITH DISK PACK 000211
is typed out on the Maxc console. Errors of this nature should be handled only by knowledgeable system people, since the Tenex file system may be endangered.
A "Micro Breakpoint" not accompanied by a printed message on the Maxc console is usually due to a microprogram-detected inconsistency. Perform the following procedure:
1. Enter Midas by typing the following on the Alto keyboard:
Strike the middle unmarked key.
#3301P    (to "un-protect" AltIO)
:M...OK.    (to enter Midas)
2. Write down the contents of the following registers displayed by Midas:
NPC IMA P Q STK 0 PC PISTAT F INSTR
3. Execute the "Compare" command in the command menu (you must confirm it with Return). If this prints "No errors" or "1 errors on Midas.Errors" then the microcode is ok, so continue at step 6 below. Otherwise, exit to the Alto Executive with "Exit", issue the command "Type Midas.Errors", and write down anything interesting. (The microcode legitimately clobbers SM location IODEND, so if this is the only error in Midas.Errors then nothing is wrong.) If there are any real errors, most likely a bipolar memory chip has failed, and attempts to restart the system will probably be unsuccessful until the chip is replaced. Notify a hardware maintainer.
4. Type the commands "21;G" (which should end up within a few seconds at IMA=30), followed by "25;G", which checks the correctness of the microcode. If IMA=30, the microcode is ok and you should go on to step 6. If IMA=20, the microcode is incorrect. Write down the contents of LM 10. If you are ambitious, consult Section 13 for information on interpreting Checker failures. Run appropriate microprocessor diagnostics if you are familiar with them.
5. Successively select the menu items "Run-Program" and "Tenload" using the left mouse button.
6. Attempt to "soft-restart" Tenex as follows (see footnote 1):
Select menu items "AltIO", "Dont-Go", "Do-It". Then type:
:140Go [confirm] .
If this is successful, Tenex will within a minute or so broadcast the message "Maxc resumed from service interruption" to all terminals. If not, follow the instructions in "Last Resort" (paragraph I below).
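The two microcode checks in steps 3 and 4 above reduce to a pair of simple decisions, sketched below. This is an illustration under stated assumptions, not part of Midas: the function names are invented for the example, the inputs are the strings an operator would read off the screen, and the final fallback in `after_checker` (for an IMA value other than 30 or 20) is an assumption, since the procedure does not cover that case.

```python
def after_compare(compare_output):
    """Decide what to do after the Midas "Compare" command (step 3)."""
    text = compare_output.strip()
    # Per the text, one error is expected: the microcode legitimately
    # clobbers SM location IODEND.
    if text in ("No errors", "1 errors on Midas.Errors"):
        return "microcode ok: continue at step 6"
    return "exit Midas, type Midas.Errors, write down anything interesting"


def after_checker(ima):
    """Interpret the IMA value shown after "21;G" and "25;G" (step 4).

    `ima` is the value as read off the Midas display, kept as a string.
    """
    if ima == "30":
        return "microcode ok: go on to step 6"
    if ima == "20":
        return "microcode incorrect: write down the contents of LM 10"
    # Assumption: the procedure does not say what other values mean.
    return "not covered by the procedure: record the value in the log book"
```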
D. "Bipolar Memory Parity Error". Handle as in case C, except that interpreting the Checker failure is an especially desirable thing to do. Perform a "LMPEscan" before doing step 3, and write down any errors reported. If bipolar memory parity errors keep occurring, notify a hardware maintainer, since it is necessary to change a bipolar memory chip.
E. "Fatal Memory Error, Maxc Stopped" on Alto console. This indicates that the memory is very sick, and a hardware maintainer should be notified.
------------------------------
1. Note that it is generally possible to "soft-restart" Tenex even after running microprocessor diagnostics such as DGBASIS and DGIML (but not DGM or DGMR, which are memory diagnostics).
"Main memory error: DE q" where q = J, K, L, or M indicates a main memory storage problem in the indicated memory quadrant.
"Main memory error: DE q" where q = J, K, L, or M indicates ???
"Maxc halted with memory bus parity error" indicates a problem in the logic for generating or sending parity from the memory to the processor or in receiving parity by the processor, or in transmitting one of the data bits between the memory and the processor. It is normal for this to occur in conjunction with a "Main memory error."
"DIP in Q" (where Q = J, K, L, or M) means that the parity of the data on the bus from the port to the processor was incorrect. This will happen in conjunction with a "DE in Q" and isn’t significant in this case. In other cases it indicates a hardware problem in the port or in the transmission path from the port into the processor.
You should check the power supply lights before embarking on any other action. Enter the machine room and locate the bank of logic racks. You will see the Maxc2 Alto (there is a sign on it). To its right will be the cabinet containing the processor and port (labelled "Maxc2 processor"). The two power lights at the bottom of the cabinet should be lit.
To the right of the processor are memory cabinets. The right-most four lights at the bottom of each memory cabinet should be lit. (The two lights to the left of these are insignificant).
If any of the power lights is not lit, there are three possibilities: the light has burned out; some electrical short has legitimately invoked the safety circuits and shut down the supply; or a glitch has invoked the safety circuits, but the hardware is ok (a frequent source of failures). Because there may be an electrical failure, you should notify a maintainer if possible. However, if you can’t find one, hope that nothing fatal has occurred and proceed as follows:
First, put the system disk drives in Read-Only mode; they must stay in Read-Only mode from before power-down until after power-up. Then power down the processor, port, and memory as discussed in the Power Down section. This is done by running a program; do not turn off any hardware switches at this point. Then power up the processor, port, and memory as discussed in the Power Up section, starting at step F.
Note: You have to power down Maxc before powering up again.
If this succeeds in getting the power supplies on again, try the cold start procedure for restarting Tenex as discussed in Paragraph I. If it does not succeed, then the hardware is broken and has to be fixed.
If the lights are all on, and if you can’t locate a hardware maintainer, you should restart Tenex from scratch. If the failure is a double error caused by failure of storage components, then the restart procedure will zone out the bad storage region so that Tenex will not use that area and the failure will not recur. If the failure is more serious, such that the memory is unusable, then the hard restart will fail and the hardware will have to be fixed. Paragraph I below discusses the hard restart procedure.
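The power-light check described above amounts to a short decision, sketched here for clarity. The function name `power_light_action` and the list-of-booleans representation are assumptions made for this example; the two processor lights and the right-most four lights per memory cabinet are the ones named in the text.

```python
def power_light_action(processor_lights, memory_cabinet_lights,
                       maintainer_available):
    """Suggest the next action from the state of the power lights.

    processor_lights: two booleans, True if lit.
    memory_cabinet_lights: one list per memory cabinet holding four
        booleans for the right-most four lights, True if lit.
    maintainer_available: True if a maintainer can be found.
    """
    all_lit = all(processor_lights) and all(
        all(cabinet) for cabinet in memory_cabinet_lights
    )
    if maintainer_available:
        # In either case a maintainer is preferred over self-help.
        return "notify a maintainer"
    if not all_lit:
        return ("set system disks Read-Only, power down by program, "
                "then power up from step F")
    return "restart Tenex from scratch (hard restart)"
```

Note that an unlit light may only be a burned-out bulb, so the power-cycle suggestion is a best guess when no maintainer can be reached, exactly as the text cautions.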
F. Alto crash. If the Alto has fallen into Swat (the message "Swat" followed by a number and a date appears at the top of the screen), record in the log book the information below the lowest line of squiggles. Then press the boot button on the back of the Alto keyboard. Then issue the commands:
AltIO/H <cr>
140Go [confirm] .
This should result in a Tenex "soft restart", as in Paragraph C, step 6.
G. "Disk needs fixing" message from CHECKDSK. When Tenex autorestarts following a BUGHLT, it first runs the BSYS verify and CHECKDSK programs to determine whether or not the file system has been damaged by the crash. If either of these programs detects a problem, it will abort the autorestart. Fixing these problems is hazardous and should ordinarily be attempted only by a system maintainer. The procedures for recovering from CHECKDSK failures are discussed in a later section (Section 14).
H. No response from Tenex; i.e., no error messages have typed out and none of the above alternatives seems to apply, but nothing happens when you type control-C on the Maxc console. This type of crash is particularly hard to diagnose unless sufficient information is recorded.
First, note the numbers at the top of the Alto screen, and note whether any of them are changing over time.
Second, enter Midas by typing the following on the Alto keyboard:
Strike the middle unmarked key.
#3301P    (to "un-protect" AltIO)
:M...OK.    (to enter Midas)
(If the message "Unclean Micro Stop" prints out, note this as well.)
Now continue by carrying out the instructions given earlier beginning at paragraph C, step 2.
I. Last resort. It may happen that a crash does not fall into one of the above categories or that the restart procedure fails. In this case, the following procedure will always succeed if all the hardware is working:
1. Boot the Alto.
2. Type the command:
MIDAS TENGO <cr>
3. Wait about 30 seconds while the microcode loads and AltIO and Micro-Exec are started. Micro-Exec will execute an automatic "Go" command, after which you should enter date and time if Tenex requests it. Generally, you can ignore bad-chip messages printed out during memory testing--the regions of storage affected by bad chips are mapped out by Tenex. Save the printout in the log book, however.
If this doesn’t work, try to find any one of the people listed below at PARC. If none of them is around and the hour is between 9 AM and midnight, call one of the system maintainers. If between midnight and 9 AM, don’t bother, but leave a message on the telephone recording saying that the machine will be down until morning. Instructions for recording messages are posted in the back room.
People to notify: (use phone list on wall beside phone)
Software and general system operation
Ron Weaver       General, Tenex, and AltIO
Ed Fiala         General, microcode, Midas, Tenex
Hardware
Ed Fiala         Microprocessor
Mike Overton     Memories and Disks
Ed McCreight     Disks
Herb Yeary       Alto
Andy Hammonds    Alto
Ron Weaver       When any of the above aren’t available
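The rule for contacting people after the last-resort procedure fails can be sketched as follows. This is a small illustration with an invented function name; it assumes the hour is given on a 24-hour clock and treats exactly 9 AM as inside the calling window, a boundary the text does not specify.

```python
def after_hours_action(hour, maintainer_at_parc=False):
    """Suggest whom to contact when the last-resort restart has failed.

    hour: local hour of day, 0 through 23.
    maintainer_at_parc: True if one of the listed people is at PARC.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if maintainer_at_parc:
        return "find the maintainer at PARC"
    if 9 <= hour <= 23:
        # Between 9 AM and midnight.
        return "call one of the system maintainers"
    # Between midnight and 9 AM.
    return "leave a message on the telephone recording"
```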