Maxc Operations    Tenex Crashes

2. TENEX CRASHES

If a Maxc-Tenex maintainer is available at PARC, inform him of the crash and normally he will take over. You should try the office numbers of the maintainers, even at odd hours, because they work at irregular hours. Otherwise, be brave and read on.

If you have not already done so, you should first familiarize yourself with the material in the Introduction to this manual (Section 1). There is a map of the machine room posted on the bulletin board, and most of the equipment is labelled.

There are obvious problems in attempting to describe what to do when a system crashes. This section simply outlines a few simple procedures whose purposes are twofold: (1) to collect data about the crash for subsequent analysis and (2) to restart the machine quickly, with as little state lost as is possible.

Begin by checking the log book and the whiteboard for any special instructions before proceeding with the following. The last few log book entries may describe a crash like the one that has just occurred. This may suggest a restart procedure for you to follow.

You should append a log entry to the logbook with your name, the date and time, a statement that Maxc crashed, and any other information that you discover while following the procedures below. Recent error typeouts (e.g., memory parity error messages) should be cut out and taped in the logbook or copied from the display into the log if relevant.

Look carefully at the upper and lower windows on the Maxc Alto. (Type an "A" or something to activate the display.) The upper window will show the "flashing-register" display and error messages from AltIO; the lower window serves as a console teletype.
Look at the Maxc2 logging terminal, a Diablo printer located behind the Maxc2 Alto console.

Maxc Alto failures are generally manifested by an Alto Swat call or by the Alto hanging someplace. If it hangs, then neither TODCLK nor any of the other registers in the flashing-register display will be updating.

The upper window will show error messages from Tenex. The logging terminal will have numerous messages typed out by Tenex before it crashed. The "flashing-register" display shows the names and contents of several Tenex core locations that are frequently updated during normal operation. For example, TODCLK contains the time-of-day; it should be updated regularly by AltIO, if AltIO hasn't crashed. The Tenex scheduler will be frequently updating the NBRUN-NBPROC word if Tenex is running normally.

Information about the crash may be apparent to you when you read the printout on these. The logging terminal may have BUGCHK messages from Tenex; frequently the BUGCHK messages are irrelevant to the crash, but sometimes they are interesting. The log is normally filled with messages about network failures of various types and recoverable disk errors such as:

    **PUPSRV date time FTP: Server timed out ...
    ***IMPBUG 8599 Header ...
    *DSKERR: ...

These messages are usually irrelevant to the crash, so don't be overly concerned about them. However, if the crash is caused by a network jamup of some kind, the BUGCHK messages on the logging terminal may give a clue to the nature of the problem (for Ethernet or ARPA network failures).

When system personnel are not present, Tenex is generally left in a mode in which it attempts recovery from Tenex-detected errors.
Thus most crashes handled by non-system people will be of a more obscure (relative to Tenex) nature. The following paragraphs describe some of the more common crashes and suggested recovery procedures.

Your first objective in dealing with a crash is to determine what kind of failure caused the crash. To do this you will look at various symptoms and try to classify the failure. The following are plausible reasons for crashes:

a. Ethernet problems occasionally cause Tenex to be inaccessible, even though Tenex has not crashed. In this case, there will be no BUGHLT message from Tenex, no "Microcode halt" message from AltIO, and the flashing registers will be updating normally. This situation is normally accompanied by numerous network-related BUGCHK messages on the logging terminal. You can determine whether or not this has occurred by typing control-C to the lower window of the Alto; if Tenex responds to control-C with a login message, then you know that it is still alive, and you should find someone to fix the Ethernet. (If the cursor is not flashing in the lower window, you have to type the bottom unmarked key at the right of the Alto keyboard before typing control-C.)

b. Software or firmware bugs have been rare, usually related to the ARPA, Ether, and MCA networks, and these bugs generally manifest themselves only when the network hardware is malfunctioning in some way. However, a Tenex BUGHLT or microprocessor hanging condition might be caused by a software/firmware bug.

c. Disk drive or disk controller failures are frequent causes of crashes. These might be manifested by one of the disk drives going into select-lock (a red light on the disk control panel turns on when this occurs). Building power glitches might also cause this. A disk drive/controller failure will normally show the symptom of a BUGHLT message in the lower window of the Alto. Disk failures are discussed below.

d. Main memory failures are generally manifested by a failure message from AltIO.
Garden variety uncorrectable double errors may result in the Tenex parity error sweep being invoked for a crash autorestart, and this may in turn be followed by a microprocessor halt as discussed below. Memory cabinet power supplies sometimes turn off due to shorts or other hardware failures, manifested by one of the four power supply lights on a memory cabinet turning off. An assortment of symptoms for various memory failures is discussed below.

e. Microprocessor failures are generally manifested by microprocessor halts, by peculiar Tenex BUGHLT's, or by the microprocessor hanging. If the microprocessor hangs, the TODCLK item in the flashing-register display will generally be updating normally (TODCLK is updated by AltIO, not Tenex), but other flashing-register display items will be unchanging. This could also occur when the core image of Tenex is smashed in some strange way.

The various symptoms which imply one or another of these kinds of failures are discussed below.

A. BUGHLT or BUGCHK: A message of the form

    BUGHLT at 73550
    $8B>>BUGADR BUGHLT/ CAI UUOH+4

types out on the Maxc console. This occurs when the monitor is in debug mode, which should never be the case unless system maintainers are present. However, if you cannot find one, set the DBUGSW and DCHKSW cells to zero and proceed from the breakpoint:

    DBUGSW/10
    DCHKSW/10
    P

Tenex will recover from the error if possible, else restart automatically with no further intervention required.

Sometimes Tenex will hang in the autorestart code which follows a "BUGHLT at nnnnnn" message. This may be indicated by TODCLK in the Tenex register display updating normally but nothing else happening. In this case, look up the message associated with the BUGHLT in the BUGSTRINGS.MAXC2 listing on the table.
This frequently is caused by a disk or microprocessor hardware problem. Unless something more creative occurs to you when you read the BUGHLT message, try the "Last Resort" procedure in paragraph I below.

B. "Trouble with System Pack nnn" prints out on the Maxc console, followed by "Type M to move pack, R to resume". This is caused by a disk unit going offline or failing in some equally catastrophic way. If the source of the problem is obvious (e.g. someone switched the unit off accidentally), rectify the problem, wait for the unit to be online (green light lit), and type "R" on the Maxc console. In other cases, it is usually better to move the pack to a free drive (if there is one--frequently the only free drive has a Bsys backup pack mounted on it, which you may remove). After waiting for the new drive to be online, type "M" followed by the letter corresponding to the drive you have moved the pack to (A through H). Tenex should now resume automatically.

Sometimes disk units have gone into select lock without any apparent reason (e.g., after a building power glitch). This can be cured sometimes by powering down the unit, letting it stop, then powering up again. After powering down the front panel switches and waiting for the disk unit to stop, you may have to turn off the AC power switch in back in order to clear select lock. If the disk unit does not stop when you power down the front switch, do not turn off the AC power in back because the heads may not have retracted, and you will destroy the disk pack by powering down. We have had several failures like this.

If moving the pack doesn't succeed in restarting Tenex, or if another crash occurs later, you will have to restart by booting Micro-Exec. Then you will have to tell Micro-Exec what the new disk configuration is.
This is done by using "Print.Disk.Configuration" and "Set.Disk.Configuration" commands as discussed in the Micro-Exec section of this document.

If by unfortunate chance the drive that fails is the first one in the old configuration, then Micro-Exec will be in the first save area on that drive, and you will have to boot it by typing nB to AltIO as discussed in the AltIO section. This is a little different from the normal boot procedure which defaults the drive for booting to drive A.

C. "Micro Breakpoint" prints out in the AltIO command window. This message means that the microprocessor hit a breakpoint, which is usually caused either by Tenex executing a HALT instruction or by the microcode detecting some serious internal inconsistency.

The only known HALT instructions in Tenex are associated with catastrophic disk errors, and a message such as

    IRREC. READ ERROR IN DIRECTORY--BEWARE OF DISK WRITE FAILURE
    TROUBLE WITH DISK PACK 000211

is typed out on the Maxc console. Errors of this nature should be handled only by knowledgeable system people, since the Tenex file system may be endangered.

A "Micro Breakpoint" not accompanied by a printed message on the Maxc console is usually due to a microprogram-detected inconsistency. Perform the following procedure:

1. Enter Midas by typing the following on the Alto keyboard:

    Strike the middle unmarked key.
    #3301P      (to "un-protect" AltIO)
    :M...OK.    (to enter Midas)

2. Write down the contents of the following registers displayed by Midas:

    NPC IMA P Q STK 0 PC PISTAT F INSTR

3. Execute the "Compare" command in the command menu (you must confirm it with Return). If this prints "No errors" or "1 errors on Midas.Errors" then the microcode is ok, so continue at step 6 below.
Otherwise, exit to the Alto Executive with "Exit", issue the command "Type Midas.Errors", and write down anything interesting. (The microcode legitimately clobbers SM location IODEND, so if this is the only error in Midas.Errors then nothing is wrong.) If there are any real errors, most likely a bipolar memory chip has failed, and attempts to restart the system will probably be unsuccessful until the chip is replaced. Notify a hardware maintainer.

4. Type the commands "21;G" (which should end up within a few seconds at IMA=30), followed by "25;G", which checks the correctness of the microcode. If IMA=30, the microcode is ok and you should go on to step 6. If IMA=20, the microcode is incorrect. Write down the contents of LM 10. If you are ambitious, consult Section 13 for information on interpreting Checker failures. Run appropriate microprocessor diagnostics if you are familiar with them.

5. Successively select the menu items "Run-Program" and "Tenload" using the left mouse button.

6. Attempt to "soft-restart" Tenex as follows [1]: Select menu items "AltIO", "Dont-Go", "Do-It". Then type:

    :140Go [confirm] .

If this is successful, Tenex will within a minute or so broadcast the message "Maxc resumed from service interruption" to all terminals. If not, follow the instructions in "Last Resort" (paragraph I below).

D. "Bipolar Memory Parity Error". Handle as in case C, except that interpreting the Checker failure is an especially desirable thing to do. Perform a "LMPEscan" before doing step 3, and write down any errors reported. If bipolar memory parity errors keep occurring, notify a hardware maintainer, since it is necessary to change a bipolar memory chip.

E. "Fatal Memory Error, Maxc Stopped" on Alto console.
This indicates that the memory is very sick, and a hardware maintainer should be notified.

------------------------------
[1] Note that it is generally possible to "soft-restart" Tenex even after running microprocessor diagnostics such as DGBASIS and DGIML (but not DGM or DGMR, which are memory diagnostics).
------------------------------

"Main memory error: DE q" where q = J, K, L, or M indicates a main memory storage problem in the indicated memory quadrant.

"Main memory error: DE q" where q = J, K, L, or M indicates ???

"Maxc halted with memory bus parity error" indicates a problem in the logic for generating or sending parity from the memory to the processor or in receiving parity by the processor, or in transmitting one of the data bits between the memory and the processor. It is normal for this to occur in conjunction with a "Main memory error."

"DIP in Q" (where Q = J, K, L, or M) means that the parity of the data on the bus from the port to the processor was incorrect. This will happen in conjunction with a "DE in Q" and isn't significant in this case. In other cases it indicates a hardware problem in the port or in the transmission path from the port into the processor.

You should check the power supply lights before embarking on any other action. Enter the machine room and locate the bank of logic racks. You will see the Maxc2 Alto (there is a sign on it). To its right will be the cabinet containing the processor and port (labelled "Maxc2 processor"). The two power lights at the bottom of the cabinet should be lit.

To the right of the processor are memory cabinets. The right-most four lights at the bottom of each memory cabinet should be lit.
(The two lights to the left of these are insignificant.)

If any of the power lights is not lit there are three possibilities: the light has burned out; some electrical short has legitimately invoked the safety circuits and shut down the supply; or a glitch has invoked the safety circuits, but the hardware is ok (a frequent source of failures). Because there may be an electrical failure, you should notify a maintainer if possible. However, if you can't find one, hope that nothing fatal has occurred and proceed as follows:

First, put the system disk drives in Read-Only mode; they must stay in Read-Only mode from before power-down until after power-up. Then, power down processor, port, and memory, as discussed in the Power Down section. This is done by running a program. Do not turn off any hardware switches now. Then power up the processor, port, and memory as discussed in the Power Up section starting at step F.

Note: You have to power down Maxc before powering up again.

If this succeeds in getting the power supplies on again, try the cold start procedure for restarting Tenex as discussed in Paragraph I. If it does not succeed, then the hardware is broken and has to be fixed.

If the lights are all on, and if you can't locate a hardware maintainer, you should restart Tenex from scratch. If the failure is a double error caused by failure of storage components, then the restart procedure will zone out the bad storage region so that Tenex will not use that area and the failure will not recur. If the failure is more serious, such that the memory is unusable, then the hard restart will fail and the hardware will have to be fixed. Paragraph I below discusses the hard restart procedure.

F. Alto crash. If the Alto has fallen into Swat (the message "Swat" followed by a number and a date appears at the top of the screen), record in the log book the information below the lowest line of squiggles. Then press the boot button on the back of the Alto keyboard.
Then issue the commands:

    AltIO/H 140Go [confirm] .

This should result in a Tenex "soft restart", as in Paragraph C, step 6.

G. "Disk needs fixing" message from CHECKDSK. When Tenex autorestarts following a BUGHLT, it first runs the BSYS verify and CHECKDSK programs to determine whether or not the file system has been damaged by the crash. If either of these programs detects a problem, it will abort the autorestart. Fixing these problems is hazardous and should ordinarily be attempted only by a system maintainer. The procedures for recovering from CHECKDSK failures are discussed in a later section (Section 14).

H. No response from Tenex; i.e., no error messages have typed out and none of the above alternatives seems to apply, but nothing happens when you type control-C on the Maxc console. This type of crash is particularly hard to diagnose unless sufficient information is recorded.

First, note the numbers at the top of the Alto screen, and note whether any of them are changing over time.

Second, enter Midas by typing the following on the Alto keyboard:

    Strike the middle unmarked key.
    #3301P      (to "un-protect" AltIO)
    :M...OK.    (to enter Midas)

(If the message "Unclean Micro Stop" prints out, note this as well.)

Now continue by carrying out the instructions given earlier beginning at paragraph C, step 2.

I. Last resort. It may happen that a crash does not fall into one of the above categories or that the restart procedure fails. In this case, the following procedure will always succeed if all the hardware is working:

1. Boot the Alto.

2. Type the command:

    MIDAS TENGO

3. Wait about 30 seconds while the microcode loads and AltIO and Micro-Exec are started. Micro-Exec will execute an automatic "Go" command, after which you should enter date and time if Tenex requests it.
Generally, you can ignore bad-chip messages printed out during memory testing--the regions of storage affected by bad chips are mapped out by Tenex. Save the printout in the log book, however.

If this doesn't work, try to find any one of the people listed below at PARC. If none of them is around and the hour is between 9 AM and midnight, call one of the system maintainers. If between midnight and 9 AM, don't bother, but leave a message on the telephone recording saying that the machine will be down until morning. Instructions for recording messages are posted in the back room.

People to notify: (use phone list on wall beside phone)

Software and general system operation
    Ron Weaver       General, Tenex, and AltIO
    Ed Fiala         General, microcode, Midas, Tenex

Hardware
    Ed Fiala         Microprocessor
    Mike Overton     Memories and Disks
    Ed McCreight     Disks
    Herb Yeary       Alto
    Andy Hammonds    Alto
    Ron Weaver       When any of the above aren't available

MaxcOps2.Bravo    RWeaver.PA    April 24, 1984 2:46 PM
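
The symptom classification that this section asks the operator to perform (reasons a through e, leading to paragraphs A through I) can be summarized as a small decision procedure. The following is only a sketch in Python, not part of any Maxc software: the names Symptoms and classify, the field names, and the order in which the symptoms are checked are all assumptions made for illustration, and sending a hung Alto to paragraph F is likewise an assumption, since that paragraph describes only the Swat case. It covers the display and console symptoms only; the specific console messages of paragraphs B and G are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Observations made at the Maxc Alto (field names are hypothetical)."""
    swat_on_alto: bool = False             # "Swat" message at the top of the Alto screen
    todclk_updating: bool = True           # TODCLK is updated by AltIO, not Tenex
    other_registers_updating: bool = True  # e.g. the NBRUN-NBPROC word
    altio_message: str = ""                # e.g. "Micro Breakpoint"
    bughlt_on_console: bool = False        # "BUGHLT at nnnnnn" on the Maxc console
    answers_control_c: bool = False        # login message after typing control-C


def classify(s: Symptoms) -> str:
    """Suggest which paragraph of this section to read first.

    The order of the checks is an assumption, not taken from the manual.
    """
    if s.swat_on_alto:
        return "F: Alto crash"
    if not s.todclk_updating and not s.other_registers_updating:
        # Nothing in the flashing-register display moves: the Alto itself hangs.
        return "F: Alto appears hung"
    if s.altio_message:
        return "C-E: AltIO message, match its text against paragraphs C through E"
    if s.bughlt_on_console:
        return "A: BUGHLT"
    if s.todclk_updating and not s.other_registers_updating:
        # AltIO is alive but Tenex is not: microprocessor hang or smashed core image.
        return "H: no response from Tenex"
    if s.answers_control_c:
        return "a: Tenex is alive, suspect the Ethernet"
    return "H: no response from Tenex, then I: last resort"
```

For example, classify(Symptoms(altio_message="Micro Breakpoint")) points at paragraphs C through E, and classify(Symptoms(other_registers_updating=False)) points at paragraph H. The sketch only chooses a starting paragraph; the log book entries and the recovery steps themselves remain as described in the text.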