Tenex Crashes    Maxc Operations

2. TENEX CRASHES

If a Maxc-Tenex maintainer is available at PARC, inform him of the crash and normally he will take over. You should try the office numbers of the maintainers, even at odd hours, because they work at irregular hours. Otherwise, be brave and read on.

If you have not already done so, you should first familiarize yourself with the material in the Introduction to this manual (Section 1). There is a map of the machine room posted on the bulletin board, and most of the equipment is labelled.

There are obvious problems in attempting to describe what to do when a system crashes. This section simply outlines a few simple procedures whose purposes are twofold: (1) to collect data about the crash for subsequent analysis and (2) to restart the machine quickly, with as little state lost as is possible.

Begin by checking the log book and the whiteboard for any special instructions before proceeding with the following. Note that there are separate log books for Maxc1 and Maxc2. The last few log book entries may describe a crash like the one that has just occurred. This may suggest a restart procedure for you to follow.

You should append a log entry to the logbook with your name, the date and time, a statement that Maxc crashed, and any other information that you discover while following the procedures below. Recent error typeouts (e.g., memory parity error messages) should be cut out and taped in the logbook or copied from the display into the log if relevant.

Maxc1: Look carefully at the Infoton terminal (which has a sticker saying "Maxc1 Nova" pasted to it) and at the console teletype. The Infoton will show the "flashing-register" display and error messages from NVIO. Look at the logging terminal, which is a TI terminal off to the side of the Maxc room near the door.

Maxc2: Look carefully at the upper and lower windows on the Maxc2 Alto.
The upper window will show the "flashing-register" display and error messages from AltIO; the lower window serves as a console teletype. Look at the Maxc2 logging terminal, a Diablo printer located behind the Maxc2 Alto console.

Maxc1 Nova or Maxc2 Alto failures are generally manifested by an NVIO punt (Maxc1) or Alto Swat call (Maxc2) or by the Nova or Alto hanging someplace. If it hangs, then neither TODCLK nor any of the other registers in the flashing-register display will be updating.

The console teletype will show error messages from Tenex. The logging terminal will have numerous messages typed out by Tenex before it crashed. The "flashing register" display shows the names and contents of several Tenex core locations that are frequently updated during normal operation. For example, TODCLK contains the time-of-day, which is updated by NVIO/AltIO; this register should be getting updated regularly by NVIO/AltIO, if NVIO/AltIO hasn't crashed. The Tenex scheduler will be frequently updating the NBRUN-NBPROC word if Tenex is running normally.

Information about the crash may be apparent to you when you read the printout on these. The logging terminal may have BUGCHK messages from Tenex; frequently the BUGCHK messages are irrelevant to the crash, but sometimes they are interesting. The log is normally filled with messages about network failures of various types and recoverable disk errors such as:

    **PUPSRV date time FTP: Server timed out ...
    ***IMPBUG 8599 Header ...
    *DSKERR: ...

These messages are usually irrelevant to the crash, so don't be overly concerned about them. However, if the crash is caused by a network jamup of some kind, the BUGCHK messages on the logging terminal may give a clue to the nature of the problem (for Ethernet or ARPA network failures).

When system personnel are not present, Tenex is generally left in a mode in which it attempts recovery from Tenex-detected errors. Thus most crashes handled by non-system people will be of a more obscure (relative to Tenex) nature. The following paragraphs describe some of the more common crashes and suggested recovery procedures.

Your first objective in dealing with a crash is to determine what kind of failure caused the crash. To do this you will look at various symptoms and try to classify the failure. The following are plausible reasons for crashes:

a. Ethernet problems occasionally cause Tenex to be inaccessible, even though Tenex has not crashed. In this case, there will be no BUGHLT message from Tenex, no "Microcode halt" message from NVIO/AltIO, and the flashing registers will be updating normally. This situation is normally accompanied by numerous network-related BUGCHK messages on the logging terminal. You can determine whether or not this has occurred by typing control-C on the console teletype (Maxc1) or to the lower window of the Alto (Maxc2); if Tenex responds to control-C with a login message, then you know that it is still alive, and you should find someone to fix the Ethernet. (On Maxc2, if the cursor is not flashing in the lower window, you have to type the bottom unmarked key at the right of the Alto keyboard before typing control-C.)

b. Software or firmware bugs have been rare, usually related to the ARPA, Ether, and MCA networks, and these bugs generally manifest themselves only when the network hardware is malfunctioning in some way.
However, a Tenex BUGHLT or microprocessor hanging condition might be caused by a software/firmware bug.

c. Disk drive or disk controller failures are frequent causes of crashes. These might be manifested by one of the disk drives going into select-lock (a red light on the disk control panel turns on when this occurs). Building power glitches might also cause this. A disk drive/controller failure will normally show the symptom of a BUGHLT message on the console teletype (Maxc1) or lower window of the Alto (Maxc2). Disk failures are discussed below.

d. Main memory failures are generally manifested by a failure message from NVIO/AltIO. Garden-variety uncorrectable double errors may result in the Tenex parity error sweep being invoked for a crash autorestart, and this may in turn be followed by a microprocessor halt as discussed below. Memory cabinet power supplies sometimes turn off due to shorts or other hardware failures, manifested by one of the four power supply lights on a memory cabinet turning off. An assortment of symptoms for various memory failures is discussed below.

e. Microprocessor failures are generally manifested by microprocessor halts, by peculiar Tenex BUGHLTs, or by the microprocessor hanging. If the microprocessor hangs, the TODCLK item in the flashing-register display will generally be updating normally (TODCLK is updated by NVIO/AltIO, not Tenex), but other flashing-register display items will be unchanging. This could also occur when the core image of Tenex is smashed in some strange way.

The various symptoms which imply one or another of these kinds of failures are discussed below.

A. BUGHLT or BUGCHK: A message of the form

    BUGHLT at 73550$8B>>BUGADR BUGHLT/ CAI UUOH+4

types out on the Maxc console.
This occurs when the monitor is in debug mode, which should never be the case unless system maintainers are present. However, if you cannot find one, set the DBUGSW and DCHKSW cells to zero and proceed from the breakpoint:

    DBUGSW/10
    DCHKSW/10
    P

Tenex will recover from the error if possible, else restart automatically with no further intervention required.

Sometimes Tenex will hang in the autorestart code which follows a "BUGHLT at nnnnnn" message. This may be indicated by TODCLK in the Tenex register display updating normally but nothing else happening. In this case, look up the message associated with the BUGHLT in the BUGSTRINGS.MAXC1/2 listing on the table. This is frequently caused by a disk or microprocessor hardware problem. Unless something more creative occurs to you when you read the BUGHLT message, try the "Last Resort" procedure in paragraph J below.

B. "Trouble with System Pack nnn" prints out on the Maxc console, followed by "Type M to move pack, R to resume". This is caused by a disk unit going offline or failing in some equally catastrophic way. If the source of the problem is obvious (e.g., someone switched the unit off accidentally), rectify the problem, wait for the unit to be online (green light lit), and type "R" on the Maxc console. In other cases, it is usually better to move the pack to a free drive (if there is one--frequently the only free drive has a Bsys backup pack mounted on it, which you may remove). After waiting for the new drive to be online, type "M" followed by the letter corresponding to the drive you have moved the pack to (A through H). Tenex should now resume automatically.

Sometimes disk units have gone into select lock without any apparent reason (e.g., after a building power glitch). This can be cured sometimes by powering down the unit, letting it stop, then powering up again. After powering down the front panel switches and waiting for the disk unit to stop, you may have to turn off the AC power switch in back in order to clear select lock.
If the disk unit does not stop when you power down the front switch, do not turn off the AC power in back because the heads may not have retracted, and you will destroy the disk pack by powering down. We have had several failures like this.

If moving the pack doesn't succeed in restarting Tenex, or if another crash occurs later, you will have to restart by booting Micro-Exec. Then you will have to tell Micro-Exec what the new disk configuration is. This is done by using "Print.Disk.Configuration" and "Set.Disk.Configuration" commands as discussed in the Micro-Exec section of this document.

If by unfortunate chance the drive that fails is the first one in the old configuration, then Micro-Exec will be in the first save area on that drive, and you will have to boot it by typing nB to NVIO (Maxc1) or AltIO (Maxc2) as discussed in the AltIO and NVIO sections. This is a little different from the normal boot procedure, which defaults the drive for booting to drive A.

C. "Micro Breakpoint" prints out on the Nova console (Infoton) on Maxc1, or in the AltIO command window on Maxc2. (On Maxc1, this and similar NVIO messages will usually appear above the lowest line of text and two lines of numbers usually displayed by NVIO.) This message means that the microprocessor hit a breakpoint, which is usually caused either by Tenex executing a HALT instruction or by the microcode detecting some serious internal inconsistency.

The only known HALT instructions in Tenex are associated with catastrophic disk errors, and a message such as

    IRREC. READ ERROR IN DIRECTORY--BEWARE OF DISK WRITE FAILURE
    TROUBLE WITH DISK PACK 000211

is typed out on the Maxc console.
Errors of this nature should be handled only by knowledgeable system people, since the Tenex file system may be endangered.

A "Micro Breakpoint" not accompanied by a printed message on the Maxc console is usually due to a microprogram-detected inconsistency. Perform the following procedure:

1. Enter Midas by typing the following on the Nova console or Alto keyboard:

    Maxc2: Strike the middle unmarked key.
    #3301P        (to "un-protect" NVIO/AltIO)
    :M...OK.      (to enter Midas)

2. Write down the contents of the following registers displayed by Midas:

    NPC IMA P Q STK 0 PC PISTAT F INSTR

3. Maxc2 only. Execute the "Compare" command in the command menu (you must confirm it with Return). If this prints "No errors" or "1 errors on Midas.Errors" then the microcode is ok, so continue at step 6 below. Otherwise, exit to the Alto Executive with "Exit", issue the command "Type Midas.Errors", and write down anything interesting. (The microcode legitimately clobbers SM location IODEND, so if this is the only error in Midas.Errors then nothing is wrong.) If there are any real errors, most likely a bipolar memory chip has failed, and attempts to restart the system will probably be unsuccessful until the chip is replaced. Notify a hardware maintainer.

4. Type the commands "21;G" (which should end up within a few seconds at IMA=30), followed by "25;G", which checks the correctness of the microcode. If IMA=30, the microcode is ok and you should go on to step 6. If IMA=20, the microcode is incorrect. Write down the contents of LM 10. If you are ambitious, consult Section 13 for information on interpreting Checker failures. Run appropriate microprocessor diagnostics if you are familiar with them.

5. Maxc1: Type control-A to return control to DOS. Then reload the microcode via the command:

    MIDAS TENLOAD

This takes a while (about 2 minutes). Wait until all messages at the bottom of the screen disappear.

Maxc2: Successively select the menu items "Run-Program" and "Tenload" using the left mouse button.
6. Attempt to "soft-restart" Tenex as follows (see footnote 1):

    Maxc1:
    !NVIO.SV/H
    NVIO
    :140G...OK.

    Maxc2: Select menu items "AltIO", "Dont-Go", "Do-It". Then type:
    :140Go [confirm]

If this is successful, Tenex will within a minute or so broadcast the message "Maxc resumed from service interruption" to all terminals. If not, follow the instructions in "Last Resort" (paragraph J below).

D. "Bipolar Memory Parity Error" (Maxc2 only). Handle as in case C, except that interpreting the Checker failure is an especially desirable thing to do. Perform a "LMPEscan" before doing step 3, and write down any errors reported. If bipolar memory parity errors keep occurring, notify a hardware maintainer, since it is necessary to change a bipolar memory chip.

E. "Fatal Memory Error, Maxc Stopped" on Nova/Alto console. This indicates that the memory is very sick, and a hardware maintainer should be notified.

"Main memory error: DE q" (Maxc2 only) where q = J, K, L, or M indicates a main memory storage problem in the indicated memory quadrant.

"Maxc halted with memory bus parity error" (Maxc2 only) indicates a problem in the logic for generating or sending parity from the memory to the processor or in receiving parity by the processor, or in transmitting one of the data bits between the memory and the processor. It is normal for this to occur in conjunction with a "Main memory error."

"DIP in Q" (where Q = J, K, L, or M) means that the parity of the data on the bus from the port to the processor was incorrect. This will happen in conjunction with a "DE in Q" and isn't significant in this case. In other cases it indicates a hardware problem in the port or in the transmission path from the port into the processor.

You should check the power supply lights before embarking on any other action.
Enter the machine room and locate the bank of logic racks for the Maxc machine which has crashed. You will see the Maxc1 Nova or Maxc2 Alto (there are signs on them). To its right will be the cabinet containing the processor and port (labelled "Maxc1 processor" or "Maxc2 processor"). The two power lights at the bottom of the cabinet should be lit.

To the right of the processor are memory cabinets (presently 3 cabinets for each system). The rightmost four lights at the bottom of each memory cabinet should be lit. (The two lights to the left of these are insignificant.)

If any of the power lights is not lit there are three possibilities: the light has burned out; some electrical short has legitimately invoked the safety circuits and shut down the supply; or a glitch has invoked the safety circuits, but the hardware is ok (a frequent source of failures). Because there may be an electrical failure, you should notify a maintainer if possible. However, if you can't find one, hope that nothing fatal has occurred and proceed as follows:

------------------------------
Footnote 1: Note that it is generally possible to "soft-restart" Tenex even after running microprocessor diagnostics such as DGBASIS and DGIML (but not DGM or DGMR, which are memory diagnostics).
------------------------------

First, put the system disk drives in Read-Only mode; they must be in this mode from before power down until after power up. Then, power down processor, port, and memory, as discussed in the Power Down section. This is done by running a program. Do not turn off any hardware switches.
Then power up the processor, port, and memory as discussed in the Power Up section starting at step F. Note: You have to power down Maxc before powering up again.

If this succeeds in getting the power supplies on again, try the cold start procedure for restarting Tenex as discussed in Paragraph J. If it does not succeed, then the hardware is broken and has to be fixed.

If the lights are all on, and if you can't locate a hardware maintainer, you should restart Tenex from scratch. If the failure is a double error caused by failure of storage components, then the restart procedure will zone out the bad storage region so that Tenex will not use that area and the failure will not recur. If the failure is more serious, such that the memory is unusable, then the hard restart will fail and the hardware will have to be fixed. Paragraph J below discusses the hard restart procedure.

F. "NVIO Punt" on Nova console (Maxc1 only). If this is an immediate punt after running DGM or DGMR, do "POWER ON" and try again. Otherwise, this indicates a serious inconsistency detected by NVIO. Crash data should be saved and the crash recovered as follows (type on the Nova console):

    #3301P                           ("un-protect" NVIO)
    :D...OK.                         (enter Nova debugger)
    XPUNT/ junk :sssss+n = xxxxxx
    PUNT0/ junk :sssss+n = xxxxxx
    PUNT1/ junk :sssss+n = xxxxxx
    PUNT2/ junk :sssss+n = xxxxxx
    P                                (Resume NVIO)
    :R...OK.                         (Resume Maxc)

If Tenex does not resume after this procedure, try a "soft restart" by typing:

    :140G...OK.

on the Nova console.

Write down in the logbook the data typed out by the debugger in response to the ":" and "=" characters you typed in.

G. Nova crash (Maxc1 only). If NVIO has stopped updating the bottom row of numbers on the Infoton screen, it is most likely that the Nova has crashed. Go into the machine room and record the Nova's state, as follows:

1) If the Nova is still running ("Run" lit), press "Stop" followed by "Continue" a few times, recording the state of the "Address" lights after each "Stop".
Leave the Nova stopped.

2) Record the state of the "Address" and "Data" lights.

3) Make sure the switches are set to 100040. Then press "Reset" followed by "Start". The Infoton should print out a row of numbers (if not, go to step 4). Write these down in the log. Then type P. After a few seconds, the message:

    BREAK
    R

should print out. Then type:

    SAVE CRASH

which saves the crashed core image for later examination.

4) If the procedure described in step 3 failed, press "Reset" followed by "Program load" on the Nova front panel. Then, in either case, attempt the "soft-restart" procedure described above (paragraph C, steps 5 and 6).

We have had periods when the Nova disk gets smashed occasionally. A crash in which the disk is smashed might manifest itself as an inability to boot the machine. If a disk crash is suspected,

1) Take down Portola (or some other two-disk Nova);
2) Put good disk in dp0, bad disk in dp1;
3) Boot the machine and then turn off write protect by pushing the red buttons on dp0 and dp1.
4) Run DKUTIL and type G^c to start.

This will copy the good disk onto the bad disk.

Alto crash (Maxc2 only). If the Alto has fallen into Swat (the message "Swat" followed by a number and a date appears at the top of the screen), record in the log book the information below the lowest line of squiggles. Then press the boot button on the back of the Alto keyboard. Then issue the commands:

    AltIO/H
    140Go [confirm]

This should result in a Tenex "soft restart", as in Paragraph C, step 6.

H. "Disk needs fixing" message from CHECKDSK. When Tenex autorestarts following a BUGHLT, it first runs the BSYS verify and CHECKDSK programs to determine whether or not the file system has been damaged by the crash.
If either of these programs detects a problem, it will abort the autorestart. Fixing these problems is hazardous and should ordinarily be attempted only by a system maintainer. The procedures for recovering from CHECKDSK failures are discussed in a later section.

I. No response from Tenex; i.e., no error messages have typed out and none of the above alternatives seems to apply, but nothing happens when you type control-C on the Maxc console. This type of crash is particularly hard to diagnose unless sufficient information is recorded.

First, note the numbers on the last line of the Nova console (Maxc1) or at the top of the Alto screen (Maxc2), and note whether any of them are changing over time.

Second, enter Midas by typing the following on the Nova console or Alto keyboard:

    Maxc2: Strike the middle unmarked key.
    #3301P        (to "un-protect" NVIO/AltIO)
    :M...OK.      (to enter Midas)

(If the message "Unclean Micro Stop" prints out, note this as well.)

Now continue by carrying out the instructions given earlier beginning at paragraph C, step 2.

J. Last resort. It may happen that a crash does not fall into one of the above categories or that the restart procedure fails. In this case, the following procedure will always succeed if all the hardware is working:

1. Maxc1: On the Nova front panel in the machine room, make sure the address switches are set to 100040. Then press "Reset" followed by "Program Load". On the Nova console (Infoton), you should see:

    DOS REV 04
    R

Then type "POWER ON".

Maxc2: Boot the Alto.

2. Type the command:

    MIDAS TENGO

3. Wait about 2 minutes (Maxc1) or 30 seconds (Maxc2) while the microcode loads and NVIO/AltIO and Micro-Exec are started. Micro-Exec will execute an automatic "Go" command, after which you should enter date and time if Tenex requests it.
Generally, you can ignore bad-chip messages printed out during memory testing--the regions of storage affected by bad chips are mapped out by Tenex. Save the printout in the log book, however.

If this doesn't work, try to find any one of the people listed below at PARC. If none of them is around and the hour is between 9 AM and midnight, call one of the system maintainers. If between midnight and 9 AM, don't bother, but leave a message on the telephone recording saying that the machine will be down until morning. Instructions for recording messages are posted in the back room.

People to notify: (use phone list on wall beside phone)

Software and general system operation
    Taft        General, Tenex, NVIO, and AltIO
    Fiala       General, microcode, Midas, Tenex
    Boggs       General
    Geschke     General

Hardware
    Fiala       Microprocessor
    Lampson     Microprocessor
    Overton     Memories and Disks
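
As a compact recap of the symptom classification in paragraphs A through J, the lookup can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the Maxc tooling: the name `triage` and the table `SYMPTOMS` are hypothetical, and the message fragments are transcribed from the paragraph headings of this section. It is no substitute for reading the paragraph itself.

```python
# Hypothetical triage helper: maps a console message to the paragraph of
# this section that covers it. The fragments below are copied from the
# paragraph headings A-H; all names here are illustrative assumptions.

SYMPTOMS = [
    ("BUGHLT at", "A"),
    ("Trouble with System Pack", "B"),
    ("Micro Breakpoint", "C"),
    ("Bipolar Memory Parity Error", "D"),
    ("Fatal Memory Error", "E"),
    ("Main memory error", "E"),
    ("memory bus parity error", "E"),
    ("NVIO Punt", "F"),
    ("Swat", "G"),
    ("Disk needs fixing", "H"),
]


def triage(message: str) -> str:
    """Return the paragraph letter that discusses the given console message.

    Falls back to "I" (no recognizable message from Tenex) when nothing
    matches; paragraph J remains the last resort if a restart fails.
    """
    lowered = message.lower()
    for fragment, paragraph in SYMPTOMS:
        if fragment.lower() in lowered:
            return paragraph
    return "I"


if __name__ == "__main__":
    for line in ("BUGHLT at 73550", "Maxc halted with memory bus parity error", ""):
        print(repr(line), "-> paragraph", triage(line))
```

Matching is by case-insensitive substring, so a full console line can be passed in unchanged; an empty or unrecognized line is treated as the "no response" case of paragraph I.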