Maxc Operations: Tenex Crashes
2. TENEX CRASHES
If a Maxc-Tenex maintainer is available at PARC, inform him of the crash and normally he will take over. You should try the office numbers of the maintainers, even at odd hours, because they work at irregular hours. Otherwise, be brave and read on.
If you have not already done so, you should first familiarize yourself with the material in the Introduction to this manual (Section 1). There is a map of the machine room posted on the bulletin board, and most of the equipment is labelled.
There are obvious problems in attempting to describe what to do when a system crashes. This section simply outlines a few simple procedures whose purposes are twofold: (1) to collect data about the crash for subsequent analysis and (2) to restart the machine quickly, with as little state lost as is possible.
Begin by checking the log book and the whiteboard for any special instructions before proceeding with the following. Note that there are separate log books for Maxc1 and Maxc2. The last few log book entries may describe a crash like the one that has just occurred. This may suggest a restart procedure for you to follow.
You should append a log entry to the log book with your name, the date and time, a statement that Maxc crashed, and any other information that you discover while following the procedures below. Recent error typeouts (e.g., memory parity error messages) should, if relevant, be cut out and taped into the log book or copied into it from the display.
Maxc1: Look carefully at the Infoton terminal (which has a sticker saying "Maxc1 Nova" pasted to it) and at the console teletype. The Infoton will show the "flashing-register" display and error messages from NVIO. Look at the logging terminal, which is a TI terminal off to the side of the Maxc room near the door.
Maxc2: Look carefully at the upper and lower windows on the Maxc2 Alto. The upper window will show the "flashing-register" display and error messages from AltIO; the lower window serves as a console teletype. Look at the Maxc2 logging terminal, a Diablo printer located behind the Maxc2 Alto console.
Maxc1 Nova or Maxc2 Alto failures are generally manifested by an NVIO punt (Maxc1) or Alto Swat call (Maxc2) or by the Nova or Alto hanging someplace. If it hangs, then neither TODCLK nor any of the other registers in the flashing-register display will be updating.
The console teletype will show error messages from Tenex. The logging terminal will have numerous messages typed out by Tenex before it crashed. The "flashing-register" display shows the names and contents of several Tenex core locations that are frequently updated during normal operation. For example, TODCLK contains the time-of-day and is updated by NVIO/AltIO, so it should be changing regularly unless NVIO/AltIO has crashed. The Tenex scheduler will be frequently updating the NBRUN-NBPROC word if Tenex is running normally.
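The reasoning about the flashing-register display can be summarized as a small decision sketch. The Python below is illustrative only: the function name, the dictionary representation of two successive readings of the display, and the returned strings are assumptions made for this example and are not part of NVIO, AltIO, or Tenex.

```python
def interpret_flashing_registers(before, after):
    """Compare two readings of the flashing-register display taken a few seconds apart.

    `before` and `after` are hypothetical dicts mapping register names
    (e.g. "TODCLK", "NBRUN-NBPROC") to the values read off the display.
    Both dicts are assumed to contain the same register names.
    """
    todclk_moving = before["TODCLK"] != after["TODCLK"]
    others_moving = any(
        before[name] != after[name] for name in before if name != "TODCLK"
    )
    if not todclk_moving:
        # TODCLK is maintained by NVIO/AltIO, so a frozen TODCLK points at the Nova or Alto.
        return "NVIO/AltIO (Nova or Alto) appears hung"
    if not others_moving:
        # TODCLK advances, but the words that Tenex updates do not.
        return "NVIO/AltIO running, Tenex not updating its registers"
    return "registers updating normally"
```

Reading the display twice, a few seconds apart, and comparing the values by eye amounts to the same check.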
Information about the crash may be apparent to you when you read the printout on the console teletype and the logging terminal. The logging terminal may have BUGCHK messages from Tenex; frequently the BUGCHK messages are irrelevant to the crash, but sometimes they are interesting. The log is normally filled with messages about network failures of various types and recoverable disk errors such as:
**PUPSRV date time FTP: Server timed out ...
***IMPBUG 8599 Header ...
*DSKERR: ...
These messages are usually irrelevant to the crash, so don’t be overly concerned about them. However, if the crash is caused by a network jamup of some kind, the BUGCHK messages on the logging terminal may give a clue to the nature of the problem (for Ethernet or ARPA network failures).
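As a rough aid for skimming a long logging-terminal printout, the sketch below groups lines by their leading keyword. The three prefixes come from the sample messages above; the category labels ("network", "disk", "other") and both function names are assumptions introduced for this example only.

```python
# Prefixes taken from the sample messages above; the category labels are assumptions.
PREFIX_CATEGORY = {
    "PUPSRV": "network",
    "IMPBUG": "network",
    "DSKERR": "disk",
}

def categorize_log_line(line):
    """Return a rough category for one logging-terminal line, or "other"."""
    text = line.lstrip("*").strip()  # drop the leading asterisks
    for prefix, category in PREFIX_CATEGORY.items():
        if text.startswith(prefix):
            return category
    return "other"

def summarize(lines):
    """Count lines per category so that a run of network messages stands out."""
    counts = {}
    for line in lines:
        category = categorize_log_line(line)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

A printout dominated by "network" lines just before the crash would be consistent with the network jamup case mentioned above. It does not by itself establish the cause.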
When system personnel are not present, Tenex is generally left in a mode in which it attempts recovery from Tenex-detected errors. Thus most crashes handled by non-system people will be of a more obscure (relative to Tenex) nature. The following paragraphs describe some of the more common crashes and suggested recovery procedures.
Your first objective in dealing with a crash is to determine what kind of failure caused the crash. To do this you will look at various symptoms and try to classify the failure. The following are plausible reasons for crashes:
a. Ethernet problems occasionally cause Tenex to be inaccessible, even though Tenex has not crashed. In this case, there will be no BUGHLT message from Tenex, no "Microcode halt" message from NVIO/AltIO, and the flashing registers will be updating normally. This situation is normally accompanied by numerous network-related BUGCHK messages on the logging terminal. You can determine whether or not this has occurred by typing control-C on the console teletype (Maxc1) or to the lower window of the Alto (Maxc2); if Tenex responds to control-C with a login message, then you know that it is still alive, and you should find someone to fix the Ethernet. (On Maxc2, if the cursor is not flashing in the lower window, you have to type the bottom unmarked key at the right of the Alto keyboard before typing control-C.)
b. Software or firmware bugs have been rare, usually related to the ARPA, Ether, and MCA networks, and these bugs generally manifest themselves only when the network hardware is malfunctioning in some way. However, a Tenex BUGHLT or microprocessor hanging condition might be caused by a software/firmware bug.
c. Disk drive or disk controller failures are frequent causes of crashes. These might be manifested by one of the disk drives going into select-lock (a red light on the disk control panel turns on when this occurs). Building power glitches might also cause this. A disk drive/controller failure will normally show the symptom of a BUGHLT message on the console teletype (Maxc1) or lower window of the Alto (Maxc2). Disk failures are discussed below.
d. Main memory failures are generally manifested by a failure message from NVIO/AltIO. Garden-variety uncorrectable double errors may result in the Tenex parity error sweep being invoked for a crash autorestart, and this may in turn be followed by a microprocessor halt as discussed below. Memory cabinet power supplies sometimes turn off due to shorts or other hardware failures, manifested by one of the four power supply lights on a memory cabinet turning off. An assortment of symptoms for various memory failures is discussed below.
e. Microprocessor failures are generally manifested by microprocessor halts, by peculiar Tenex BUGHLTs, or by the microprocessor hanging. If the microprocessor hangs, the TODCLK item in the flashing-register display will generally be updating normally (TODCLK is updated by NVIO/AltIO, not Tenex), but other flashing-register display items will be unchanging. The same symptom could also occur when the core image of Tenex is smashed in some strange way.
The various symptoms which imply one or another of these kinds of failures are discussed below.
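The classification in items a through e can be sketched as a function from observed symptoms to candidate failure classes. This is a simplified illustration, not a diagnostic tool: the `Symptoms` fields, their defaults, and the function name are all hypothetical. Several classes can match at once, because the text itself notes that, for example, a BUGHLT may stem from a disk, microprocessor, or software problem.

```python
from dataclasses import dataclass

@dataclass
class Symptoms:
    """Hypothetical checklist of what the operator has observed."""
    bughlt_message: bool = False              # BUGHLT on console teletype / lower window
    nvio_altio_failure_message: bool = False  # memory failure message from NVIO/AltIO
    microprocessor_halt: bool = False         # halt reported by NVIO/AltIO
    todclk_updating: bool = True
    other_registers_updating: bool = True
    control_c_gets_login: bool = False
    select_lock_light: bool = False           # red light on a disk control panel
    memory_power_light_off: bool = False

def candidate_causes(s):
    """Return the failure classes (a through e) consistent with the symptoms."""
    hang = s.todclk_updating and not s.other_registers_updating
    causes = []
    if (not s.bughlt_message and not s.microprocessor_halt
            and s.todclk_updating and s.other_registers_updating
            and s.control_c_gets_login):
        causes.append("a. Ethernet problem (Tenex still alive)")
    if s.bughlt_message or hang:
        causes.append("b. software or firmware bug")
    if s.select_lock_light or s.bughlt_message:
        causes.append("c. disk drive or controller failure")
    if s.nvio_altio_failure_message or s.memory_power_light_off:
        causes.append("d. main memory failure")
    if s.microprocessor_halt or s.bughlt_message or hang:
        causes.append("e. microprocessor failure")
    return causes
```

For instance, `candidate_causes(Symptoms(other_registers_updating=False))` yields classes b and e, matching the description of a microprocessor hang with TODCLK still advancing.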
A. BUGHLT or BUGCHK: A message of the form
BUGHLT at 73550
$8B>>BUGADR BUGHLT/ CAI UUOH+4
types out on the Maxc console. This occurs when the monitor is in debug mode, which should never be the case unless system maintainers are present. However, if you cannot find one, set the DBUGSW and DCHKSW cells to zero and proceed from the breakpoint:
DBUGSW/10<lf>
DCHKSW/10<cr>
<esc>P
Tenex will recover from the error if possible; otherwise it will restart automatically, with no further intervention required.
Sometimes Tenex will hang in the autorestart code which follows a "BUGHLT at nnnnnn" message. This may be indicated by TODCLK in the Tenex register display updating normally but nothing else happening. In this case, look up the message associated with the BUGHLT in the BUGSTRINGS.MAXC1/2 listing on the table. Such a hang is frequently caused by a disk or microprocessor hardware problem. Unless something more creative occurs to you when you read the BUGHLT message, try the "Last Resort" procedure in paragraph J below.
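Looking up a BUGHLT amounts to pulling the address out of the console message and finding it in the listing. The sketch below assumes, without confirmation from this manual, that the address is printed in octal digits and that the listing can be consulted by address. The `listing` dictionary is a hypothetical stand-in for whatever the operator copies from the printed BUGSTRINGS listing.

```python
import re

# Assumption: the address in "BUGHLT at nnnnnn" consists of octal digits.
BUGHLT_PATTERN = re.compile(r"BUGHLT at ([0-7]+)")

def bughlt_address(console_text):
    """Return the address from a 'BUGHLT at nnnnnn' message as a string, or None."""
    match = BUGHLT_PATTERN.search(console_text)
    return match.group(1) if match else None

def describe_bughlt(console_text, listing):
    """Look the address up in `listing`, a hypothetical dict of address -> message."""
    address = bughlt_address(console_text)
    if address is None:
        return None
    return listing.get(address, "address %s not found in listing" % address)
```

The address is kept as a string rather than converted to a number so that it can be compared directly with what is printed on paper.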
B. "Trouble with System Pack nnn" prints out on the Maxc console, followed by "Type M to move pack, R to resume". This is caused by a disk unit going offline or failing in some equally catastrophic way. If the source of the problem is obvious (e.g., someone switched the unit off accidentally), rectify the problem, wait for the unit to be online (green light lit), and type "R" on the Maxc console. In other cases, it is usually better to move the pack to a free drive, if there is one. (Frequently the only free drive has a Bsys backup pack mounted on it, which you may remove.) After waiting for the new drive to be online, type "M" followed by the letter corresponding to the drive you have moved the pack to (A through H). Tenex should now resume automatically.
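The choice between "R" and "M" described above can be written down as a short sketch. The function name and its two parameters are hypothetical. The returned string stands for the keys to type, and the code assumes only what the paragraph states: "R" resumes in place, and "M" is followed by a drive letter from A through H.

```python
VALID_DRIVES = "ABCDEFGH"

def pack_trouble_response(cause_rectified, new_drive=None):
    """Choose what to type after 'Type M to move pack, R to resume'.

    cause_rectified: the problem was obvious, has been fixed, and the unit is online.
    new_drive: letter of the drive the pack was moved to, or None if it was not moved.
    """
    if cause_rectified:
        return "R"
    if new_drive is None:
        raise ValueError("pack has not been moved to a free drive")
    letter = new_drive.upper()
    if len(letter) != 1 or letter not in VALID_DRIVES:
        raise ValueError("drive letter must be A through H")
    return "M" + letter
```

In both cases, wait for the green online light before typing the response.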
Sometimes disk units have gone into select lock without any apparent reason (e.g., after a building power glitch). This can be cured sometimes by powering down the unit, letting it stop, then powering up again. After powering down the front panel switches and waiting for the disk unit to stop, you may have to turn off the AC power switch in back in order to clear select lock. If the disk unit does not stop when you power down the front switch, do not turn off the AC power in back because the heads may not have retracted, and you will destroy the disk pack by powering down. We have had several failures like this.
If moving the pack doesn’t succeed in restarting Tenex, or if another crash occurs later, you will have to restart by booting Micro-Exec. Then you will have to tell Micro-Exec what the new disk configuration is. This is done by using "Print.Disk.Configuration" and "Set.Disk.Configuration" commands as discussed in the Micro-Exec section of this document.
If by unfortunate chance the drive that fails is the first one in the old configuration, then Micro-Exec will be in the first save area on that drive, and you will have to boot it by typing nB to NVIO (Maxc1) or AltIO (Maxc2) as discussed in the AltIO and NVIO sections. This is a little different from the normal boot procedure which defaults the drive for booting to drive A.
C. "Micro Breakpoint" prints out on the Nova console (Infoton) on Maxc1, or in the AltIO command window on Maxc2. (On Maxc1, this and similar NVIO messages will usually appear above the lowest line of text and two lines of numbers usually displayed by NVIO.) This message means that the microprocessor hit a breakpoint, which is usually caused either by Tenex executing a HALT instruction or by the microcode detecting some serious internal inconsistency.
The only known HALT instructions in Tenex are associated with catastrophic disk errors, and a message such as
IRREC. READ ERROR IN DIRECTORY--BEWARE OF DISK WRITE FAILURE
TROUBLE WITH DISK PACK 000211
is typed out on the Maxc console. Errors of this nature should be handled only by knowledgeable system people, since the Tenex file system may be endangered.
A "Micro Breakpoint" not accompanied by a printed message on the Maxc console is usually due to a microprogram-detected inconsistency. Perform the following procedure:
1. Enter Midas by typing the following on the Nova console or Alto keyboard:
Maxc2: Strike the middle unmarked key.
#3301P    (to "un-protect" NVIO/AltIO)
:M...OK.    (to enter Midas)
2. Write down the contents of the following registers displayed by Midas:
NPC IMA P Q STK 0 PC PISTAT F INSTR
3. Maxc2 only. Execute the "Compare" command in the command menu (you must confirm it with Return). If this prints "No errors" or "1 errors on Midas.Errors" then the microcode is ok, so continue at step 6 below. Otherwise, exit to the Alto Executive with "Exit", issue the command "Type Midas.Errors", and write down anything interesting. (The microcode legitimately clobbers SM location IODEND, so if this is the only error in Midas.Errors then nothing is wrong.) If there are any real errors, most likely a bipolar memory chip has failed, and attempts to restart the system will probably be unsuccessful until the chip is replaced. Notify a hardware maintainer.
4. Type the commands "21;G" (which should end up within a few seconds at IMA=30), followed by "25;G", which checks the correctness of the microcode. If IMA=30, the microcode is ok and you should go on to step 6. If IMA=20, the microcode is incorrect. Write down the contents of LM 10. If you are ambitious, consult Section 13 for information on interpreting Checker failures. Run appropriate microprocessor diagnostics if you are familiar with them.
5. Maxc1: Type control-A to return control to DOS. Then reload the microcode via the command:
MIDAS TENLOAD <cr>
This takes a while (about 2 minutes). Wait until all messages at the bottom of the screen disappear.
Maxc2: Successively select the menu items "Run-Program" and "Tenload" using the left mouse button.
6. Attempt to "soft-restart" Tenex as follows (see footnote 1):
Maxc1:
!NVIO.SV/H<cr>
NVIO
:140G...OK.
Maxc2:
Select menu items "AltIO", "Dont-Go", "Do-It". Then type:
:140Go [confirm] .
If this is successful, Tenex will within a minute or so broadcast the message "Maxc resumed from service interruption" to all terminals. If not, follow the instructions in "Last Resort" (paragraph J below).
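The branching in steps 3 and 4 of this procedure can be summarized as follows. This is a sketch under stated assumptions: both function names are hypothetical, and the inputs are the strings the operator reads off the screen. The fall-through case for an IMA value other than 30 or 20 is not covered by this manual and is labeled as such in the code.

```python
def compare_says_ok(compare_output):
    """Step 3 (Maxc2): True if the Midas "Compare" output means the microcode is ok."""
    return compare_output.strip() in ("No errors", "1 errors on Midas.Errors")

def next_step_after_checker(ima):
    """Step 4: map the IMA value seen after "25;G" to what to do next."""
    if ima == "30":
        return "microcode ok: go to step 6 (soft-restart)"
    if ima == "20":
        return "microcode incorrect: write down LM 10, then reload (step 5)"
    # Assumption: the manual does not say what other values mean.
    return "unexpected IMA value: record it in the log"
```

On Maxc2, a clean "Compare" result leads straight to step 6, so `next_step_after_checker` applies only when step 4 is actually carried out.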
D. "Bipolar Memory Parity Error" (Maxc2 only). Handle as in case C, except that interpreting the Checker failure is an especially desirable thing to do. Perform a "LMPEscan" before doing step 3, and write down any errors reported. If bipolar memory parity errors keep occurring, notify a hardware maintainer, since it is necessary to change a bipolar memory chip.
E. "Fatal Memory Error, Maxc Stopped" on Nova/Alto console. This indicates that the memory is very sick, and a hardware maintainer should be notified.
"Main memory error: DE q" (Maxc2 only) where q = J, K, L, or M indicates a main memory storage problem in the indicated memory quadrant.
"Maxc halted with memory bus parity error" (Maxc2 only) indicates a problem in the logic for generating or sending parity from the memory to the processor or in receiving parity by the processor, or in transmitting one of the data bits between the memory and the processor. It is normal for this to occur in conjunction with a "Main memory error."
"DIP in Q" (where Q = J, K, L, or M) means that the parity of the data on the bus from the port to the processor was incorrect. This will happen in conjunction with a "DE in Q" and isn’t significant in this case. In other cases it indicates a hardware problem in the port or in the transmission path from the port into the processor.
You should check the power supply lights before embarking on any other action. Enter the machine room and locate the bank of logic racks for the Maxc machine which has crashed. You will see the Maxc1 Nova or Maxc2 Alto (there are signs on them). To its right will be the cabinet containing the processor and port (labelled "Maxc1 processor" or "Maxc2 processor"). The two power lights at the bottom of the cabinet should be lit.
To the right of the processor are memory cabinets (presently 3 cabinets for each system). The right-most four lights at the bottom of each memory cabinet should be lit. (The two lights to the left of these are insignificant.)
If any of the power lights is not lit there are three possibilities: the light has burned out; some electrical short has legitimately invoked the safety circuits and shut down the supply; or a glitch has invoked the safety circuits, but the hardware is ok (a frequent source of failures). Because there may be an electrical failure, you should notify a maintainer if possible. However, if you can’t find one, hope that nothing fatal has occurred and proceed as follows:
[Footnote 1: It is generally possible to "soft-restart" Tenex even after running microprocessor diagnostics such as DGBASIS and DGIML (but not DGM or DGMR, which are memory diagnostics).]
First, put the system disk drives in Read-Only mode; they must stay in this mode from before power down until after power up. Then power down the processor, port, and memory as discussed in the Power Down section. This is done by running a program; do not turn off any hardware switches. Then power up the processor, port, and memory as discussed in the Power Up section, starting at step F.
Note: You have to power down Maxc before powering up again.
If this succeeds in getting the power supplies on again, try the cold start procedure for restarting Tenex as discussed in Paragraph J. If it does not succeed, then the hardware is broken and has to be fixed.
If the lights are all on, and if you can’t locate a hardware maintainer, you should restart Tenex from scratch. If the failure is a double error caused by failure of storage components, then the restart procedure will zone out the bad storage region so that Tenex will not use that area and the failure will not reoccur. If the failure is more serious, such that the memory is unusable, then the hard restart will fail and the hardware will have to be fixed. Paragraph J below discusses the hard restart procedure.
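The decision made after checking the power supply lights can be laid out as a sketch. The function name, its parameters, and the wording of the returned steps are assumptions for illustration; the order of steps follows the paragraphs above.

```python
def memory_error_plan(all_power_lights_lit, maintainer_available):
    """Return the ordered actions for paragraph E, given what the operator found."""
    if maintainer_available:
        return ["notify a hardware maintainer"]
    if not all_power_lights_lit:
        return [
            "put the system disk drives in Read-Only mode",
            "power down processor, port, and memory (Power Down section)",
            "power up processor, port, and memory (Power Up section, from step F)",
            "if the supplies come back on, cold start Tenex (paragraph J)",
        ]
    # Lights all on and nobody to call: go straight to the hard restart.
    return ["restart Tenex from scratch (paragraph J)"]
```

If the power-up step does not bring the supplies back, no further entry in the list applies: the hardware is broken and has to be fixed.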
F. "NVIO Punt" on Nova console (Maxc1 only). If this is an immediate punt after running DGM or DGMR, do "POWER ON" and try again. Otherwise, this indicates a serious inconsistency detected by NVIO. Crash data should be saved and the crash recovered as follows (type on the Nova console):
#3301P    ("un-protect" NVIO)
:D...OK.    (enter Nova debugger)
XPUNT/ junk :sssss+n = xxxxxx <lf>
PUNT0/ junk :sssss+n = xxxxxx <lf>
PUNT1/ junk :sssss+n = xxxxxx <lf>
PUNT2/ junk :sssss+n = xxxxxx <cr>
<esc>P    (Resume NVIO)
:R...OK.    (Resume Maxc)
If Tenex does not resume after this procedure, try a "soft restart" by typing:
:140G...OK.
on the Nova console.
Write down in the log book the data typed out by the debugger in response to the ":" and "=" characters you typed in.
G. Nova crash (Maxc1 only). If NVIO has stopped updating the bottom row of numbers on the Infoton screen, it is most likely that the Nova has crashed. Go into the machine room and record the Nova’s state, as follows:
1) If the Nova is still running ("Run" lit), press "Stop" followed by "Continue" a few times, recording the state of the "Address" lights after each "Stop". Leave the Nova stopped.
2) Record the state of the "Address" and "Data" lights.
3) Make sure the switches are set to 100040. Then press "Reset" followed by "Start". The Infoton should print out a row of numbers (if not, go to step 4). Write these down in the log. Then type <esc>P. After a few seconds, the message:
BREAK
R
should print out. Then type:
SAVE CRASH <cr>
which saves the crashed core image for later examination.
4) If the procedure described in step 3 failed, press "Reset" followed by "Program load" on the Nova front panel. Then, in either case, attempt the "soft-restart" procedure described above (paragraph C, steps 5 and 6).
We have had periods when the Nova disk gets smashed occasionally. A crash in which the disk is smashed might show up as an inability to boot the machine. If a disk crash is suspected:
1) Take down Portola (or some other two-disk Nova);
2) Put good disk in dp0, bad disk in dp1;
3) Boot the machine and then turn off write protect by pushing the red buttons on dp0 and dp1.
4) Run DKUTIL and type G↑c to start.
This will copy the good disk onto the bad disk.
Alto crash (Maxc2 only). If the Alto has fallen into Swat (the message "Swat" followed by a number and a date appears at the top of the screen), record in the log book the information below the lowest line of squiggles. Then press the boot button on the back of the Alto keyboard. Then issue the commands:
AltIO/H <cr>
140Go [confirm] .
This should result in a Tenex "soft restart", as in Paragraph C, step 6.
H. "Disk needs fixing" message from CHECKDSK. When Tenex autorestarts following a BUGHLT, it first runs the BSYS verify and CHECKDSK programs to determine whether or not the file system has been damaged by the crash. If either of these programs detects a problem, it will abort the autorestart. Fixing these problems is hazardous and should ordinarily be attempted only by a system maintainer. The procedures for recovering from CHECKDSK failures are discussed in a later section.
I. No response from Tenex; i.e., no error messages have typed out and none of the above alternatives seems to apply, but nothing happens when you type control-C on the Maxc console. This type of crash is particularly hard to diagnose unless sufficient information is recorded.
First, note the numbers on the last line of the Nova console (Maxc1) or at the top of the Alto screen (Maxc2), and note whether any of them are changing over time.
Second, enter Midas by typing the following on the Nova console or Alto keyboard:
Maxc2: Strike the middle unmarked key.
#3301P    (to "un-protect" NVIO/AltIO)
:M...OK.    (to enter Midas)
(If the message "Unclean Micro Stop" prints out, note this as well.)
Now continue by carrying out the instructions given earlier beginning at paragraph C, step 2.
J. Last resort. It may happen that a crash does not fall into one of the above categories or that the restart procedure fails. In this case, the following procedure will always succeed if all the hardware is working:
1. Maxc1: On the Nova front panel in the machine room, make sure the address switches are set to 100040. Then press "Reset" followed by "Program Load". On the Nova console (Infoton), you should see:
DOS REV 04
R
Then type "POWER ON <cr>"
Maxc2: Boot the Alto.
2. Type the command:
MIDAS TENGO <cr>
3. Wait about 2 minutes (Maxc1) or 30 seconds (Maxc2) while the microcode loads and NVIO/AltIO and Micro-Exec are started. Micro-Exec will execute an automatic "Go" command, after which you should enter date and time if Tenex requests it. Generally, you can ignore bad-chip messages printed out during memory testing, since the regions of storage affected by bad chips are mapped out by Tenex. Save the printout in the log book, however.
If this doesn’t work, try to find any one of the people listed below at PARC. If none of them is around and the hour is between 9 AM and midnight, call one of the system maintainers. If between midnight and 9 AM, don’t bother, but leave a message on the telephone recording saying that the machine will be down until morning. Instructions for recording messages are posted in the back room.
People to notify: (use phone list on wall beside phone)
Software and general system operation
    Taft        General, Tenex, NVIO, and AltIO
    Fiala       General, microcode, Midas, Tenex
    Boggs       General
    Geschke     General
Hardware
    Fiala       Microprocessor
    Lampson     Microprocessor
    Overton     Memories and Disks
    McCreight   Disks
    Yeary       Alto
    Quaterman   Nova
    Mann        Nova
    Winfield    Nova
    Thacker     Almost anything
    Taft        When any of the above aren’t available
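The notification rule at the end of paragraph J depends only on who is at PARC and on the hour. A minimal sketch, assuming a 24-hour clock in which "between 9 AM and midnight" means hours 9 through 23; the function name and return strings are hypothetical.

```python
def escalation_step(hour, maintainer_found_at_parc):
    """Return what to do when the Last Resort procedure has failed.

    hour: local hour of day, 0 through 23.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if maintainer_found_at_parc:
        return "hand the problem to the maintainer"
    if hour >= 9:
        # Assumption: 9 AM through 11:59 PM counts as "between 9 AM and midnight".
        return "call one of the system maintainers"
    return "leave a message on the telephone recording"
```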