Maxc Operations: Hardware Diagnostic and Maintenance Procedures

13. HARDWARE DIAGNOSTIC and MAINTENANCE PROCEDURES

This section discusses operation of various hardware diagnostic software and firmware.

Microprocessor diagnostics are normally used to isolate suspected failures of the microprocessor or port hardware. Micro-Exec memory and disk testing commands are generally used to isolate suspected failures of disk units and main storage modules.

The microprocessor diagnostics require that a small nucleus of the microprocessor and its interface to the Alto be working before meaningful failure diagnosis can take place. When a failure has occurred in this essential nucleus, the Midas "Test-All" command and SMIDiag.run are used to isolate the failure.

MemBash and TM have been used to check out memory reference interference problems and failures in the Alto memory interface.

13.1. Running Microprocessor Diagnostics

The microprocessor diagnostics are normally kept on the Alto disk.

You must obtain the notebook labelled "Microprocessor Diagnostics" or "Maxc Diagnostics" in order to do any useful debugging. However, the diagnostics all fall into a common pattern and they can usually be run in a simple-minded way without understanding too much about what the programs are really doing. The following generalizations may be helpful.

Associated with each of the diagnostics described below is a command file whose name is the same as that of the diagnostic. If you type to Nova DOS or to the Alto Executive:

    MIDAS diagname

Midas will load the diagnostic and show registers pertinent to the diagnostic's operation on the display. You may wish to set some parameters before starting a diagnostic (for example, the low and high addresses for a memory test). However, the values as loaded are reasonable for thorough testing.

When you are ready, type START;G to start the diagnostic. (DGM has two alternate starting addresses for interrupt system testing.)
The diagnostic will halt after one pass at IMA=20 (DGIML is the only exception to this rule: it breaks at 3200 on Maxc1 and 7200 on Maxc2). This means that no failures were detected. To repeat, type ;P to Midas. To loop indefinitely, delete the breakpoint at 20 by typing 20;K and then ;P. All of the diagnostics except DGM loop in less than two seconds, so something is wrong if the machine hangs in the loop.

A breakpoint at any location except 20 indicates that an error was detected. The most common error breakpoint is at 25 or 26, the comparison error breakpoint. The location in the diagnostic from which the comparison routine was called is displayed at STK 0. You can get Midas to tell you the nearest label by selecting the value to the right of "STK 0" with the middle mouse button.

Following the .LS listing is the diagnostic listing, which contains general operating instructions in comments preceding the program itself. DBEG0 is assembled or loaded ahead of some diagnostics, so its listing and .LS file may also be relevant. Some diagnostics INSERT other files or are assembled from several sources, so you may have to look a bit to find the relevant listings. When tracing an error, find the tag nearest the address in STK 0 (if the breakpoint is at IMA=26) or IMA (if the breakpoint is elsewhere) and read the comments there from the listing.

If the breakpoint is at IMA=26, the diagnostic may be resumed from the point of error by the command ;P, which will return to the caller of the compare routine (this usually works but not always).

The diagnostic names and what they test are listed below:

DGBASIC.MB  Most basic diagnostic. Does not use the main memory, the interrupt system or the P-register input multiplexors.
Tests everything except the SM, DM, DM1, DM2, LM, and RM memories first, then tests these memories and the F-register. Cycle time < 1 second.

DGP.MB  Tests the P-register inputs and a few afterthoughts from DGBASIC.MB. Cycle time < 1 second.

DGALU.MB  Tests an assortment of P-register, Q-register, and ALU operations using random numbers. Cycle time < 1 second.

DGM.MB  Tests the memory interfaces. There are alternate entry points at START and XSTART for interrupt system testing. Be sure to read the listing comments before starting at XSTART. Cycle time ~ 20 minutes to test each 128K of main memory; correspondingly less if the address range is restricted. Note: After running DGM, you have to go through the "Power Up" procedure before running the PDP-10 microcode again, because certain low core locations that must be zero are not left zero by DGM. If you forget to do this, the result is an immediate NVIO punt.

DGI.MB  Tests the interrupt system and repeats parts of DGBASIC and DGP affected by interrupts. Cycle time < 1 second.

DGIML/DGIMH.MB  Tests the SM, DM, DM1, DM2, and MP memories using random numbers, then the instruction memory (IM), first by using each of the 72 patterns of a single 0 in a field of 1's, then the 72 patterns of a single 1 in a field of 0's, then random numbers. DGIML runs in the top of IM and tests the memory below it, while DGIMH runs in the bottom of IM and tests the memory above it. Cycle time < 3 seconds.

Alternate starting addresses exist to test only IM, only MP, or only SM/DM/DM1/DM2 (which are physically a single memory).

There are a number of special loops for repeating test sequences that have provoked failures in the past.

DGRL.MB  Tests the right register bank (RM) and the left register bank (LM) using random numbers. Cycle time < 1 second.

DGMR.MB  Tests the main memory using random numbers.
Cycle time ~ 5 seconds. Note: After running DGMR, you have to go through the "Power Up" procedure before running the PDP-10 microcode, to clear several low core locations that must be zero but are smashed by DGMR.

DGREG.MB  Tests assorted registers with random numbers. Like DGALU, DGREG is a reliability diagnostic that supplements the basic tests in DGBASIC and DGP. Cycle time < 1 second.

13.2. Running PDP-10 Diagnostics

PDP-10 instruction diagnostics 0A through 0N and 0R may be run on Maxc either in stand-alone mode or under Tenex. We use them only for checking out new PDP-10 emulator microcode and not for hardware diagnosis (which is what they were originally intended for). Some of them have been patched to account for Maxc incompatibilities.

There are two methods of running PDP-10 diagnostics stand-alone. The more convenient is to run them from Micro-Exec by means of the "run.diagnostic.program" command. However, if the microcode is working so poorly that Micro-Exec won't run, the other method is to load them from the Alto using AltIO's "Load" command.

Running diagnostics from Micro-Exec is simple. All the diagnostics that will run on Maxc are stored as programs on save area 2. They may be listed out by the "Print.Program.Directory" command. To execute a single pass of a given diagnostic, type:

    *Run.Diagnostic.Program

Control returns to Micro-Exec when the diagnostic is finished. To make the program loop forever, type:

    *Goto 4000

The iteration count (a decrementing negative number) may be displayed by examining Maxc location 1. To abort the loop, you have to either reboot Micro-Exec or halt the microprocessor with the AltIO "H" command and then restart Micro-Exec with "20G."

Running diagnostics using DMPLD or AltIO is somewhat messier. This procedure should be used only when, due to hardware or microcode problems, it is impossible to run the diagnostics from Micro-Exec.

The procedure is as follows:

1) Retrieve the diagnostics from the DIAGNOSTICS> directory on Ivy using FTP.
Only a few diagnostics will fit on the Alto disk at once.
2) Load the PDP-10 microcode by means of the "MIDAS TENLOAD" command.
3) Enter AltIO by selecting the "AltIO", "Dont-Go", and "Do-It" menu items.
4) Load the selected diagnostic into Maxc memory by means of AltIO's "Load" command. Then make sure all the patches have been made.
5) Start the diagnostic by issuing the "Go" command.

On Ivy, the DIAGNOSTICS> directory contains all the PDP-10 diagnostics, along with a set of RUNFILs for running them in user mode under Tenex (note that diagnostic 0D cannot be run in user mode). The script named .RUNFIL will cause the specified diagnostic to be started and run forever (type control-B to stop). Additionally, there are command files BASIC.RUNFIL and RELIABILITY.RUNFIL which execute one pass of diagnostics 0A-0H and 0I-0N respectively.

13.3. Memory Maintenance

Memory maintenance is scheduled periodically, every 3 months or so, to replace storage ic's that have failed. Because the memory is error-corrected, the system can be operated with a number of bad components, so memory maintenance is not scheduled until the number of failures becomes significant.

The procedure for doing this is as follows:

1) Schedule the system down using the procedure discussed in "Stopping Tenex."
2) At the appointed time, the system will halt; then unprotect AltIO with 3301P.
3) Boot MicroExec with the AltIO "B" command.
4) Run "Test.Memory.Slow" as discussed below. You want to get output on paper; you should issue the "Diablo Printer On" command to AltIO before starting the test.
5) Power down the system as discussed in the "Power Down" section.
It is possible to power off only the memory (so you don't have to turn off the disks).
6) Pull the cards affected by bad chips and mark the bad ic's with a magic marker; interpret the output of "Test.Memory.Slow" as discussed below. Record the serial numbers of the boards pulled from each position and attempt to restore the cards to their original positions after repair.
7) Replace the bad 1103 ic's or, if that isn't possible, use the spare memory cards in the rack above the Maxc2 Alto. (Refer to Appendix A for diagrams showing memory board and chip locations.)
8) Restore the repaired cards to their original positions (if using a spare card, mark the bad card with the complete failure information using a piece of paper and scotch tape).
9) Power up the system as discussed in the "Power Up" section.
10) Reload the microcode and restart AltIO and MicroExec using "Midas MExecGo," as discussed in the "Loading the PDP-10 Emulator" section; repeat Test.Memory.Slow to see if the repairs have been successful.
11) If everything is ok, restart Tenex.

Memory maintenance is performed with the aid of Micro-Exec (see footnote 1). After loading the microcode and starting up Micro-Exec, simply type:

    *Test.Memory.Slow

This requires about 6 minutes to test all 384K of memory. All data, tag, and parity bit failures cause an error count to be accumulated for the affected chip. The physical location of each chip with errors is reported at the end of testing each memory cabinet, along with the pages affected and the error count. Every word in memory is tested 82 times using various patterns, so every chip is written into and read from 83968 times.
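
The 83968 figure is consistent with each storage chip holding 1024 one-bit cells (82 x 1024 = 83968). That chip capacity is an assumption here, not something the text states, and the function names below are purely illustrative. Under that assumption, a minimal Python sketch of the access count, and of the rough error counts to expect when a failing cell reads wrong about half the time, is:

```python
# Assumption: each storage chip holds 1024 one-bit cells. This is not stated
# in the text, but 82 * 1024 = 83968 matches the quoted access count.
BITS_PER_CHIP = 1024
PASSES_PER_WORD = 82  # from the text: every word is tested 82 times


def accesses_per_chip(bits_per_chip=BITS_PER_CHIP, passes=PASSES_PER_WORD):
    """Number of write/read tests applied to one chip during a full run."""
    return bits_per_chip * passes


def expected_errors(failing_cells, passes=PASSES_PER_WORD, wrong_fraction=0.5):
    """Rough error count if each failing cell reads wrong wrong_fraction of the time."""
    return failing_cells * passes * wrong_fraction


print(accesses_per_chip())             # tests applied to each chip
print(expected_errors(BITS_PER_CHIP))  # every cell on the chip failing
print(expected_errors(1))              # one solid bad cell
```

These are order-of-magnitude estimates only; the actual counts depend on how the test patterns happen to interact with the failure.
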
Hence, a total chip failure will generate on the order of 40000 errors (since it will yield the wrong value approximately half the time), while a solid single bit failure will generate on the order of 40 errors.

Conclusively diagnosed check bit failures are also reported in this manner. For check bit failures that Micro-Exec cannot diagnose (due to an insufficient sample or lack of any obviously failing bit pattern), information is printed out as to the frequency of zeroes and ones in the correct values of each check bit. Since the check bits can't actually be read, the best that can be done is to deduce which check bit is failing on the basis of these frequencies.

The "Test.Memory.Verbose" command does everything that "Test.Memory.Slow" does, but also prints out the address, error type, correct data, incorrect data, and exclusive or for every error. For check bit failures, the correct data and the computed Hamming code are printed out. This command will generate reams of output unless the memory is pretty clean to begin with.

If the diagnostic crashes due to "fatal error from wrong place in code", one should re-boot Micro-Exec and issue the "Test.Memory.Write.Slow" command before running the memory test. This sets a mode switch that forces Micro-Exec to use a more conservative (but slower) method of writing data into the memory under test.

To obtain hardcopy of the memory error printout, issue the "Diablo Copy On" command in the AltIO command window before starting the memory test.

Because of error correction, Tenex can run ok with bad bits and even bad cards in the memory system. Exception: a double error in pages 0 to about 217 will prevent Tenex from running. Non-hardware-maintainers should ordinarily not attempt to repair the memory system.
However, the following instructions for locating failures reported by Test.Memory.Slow (or Test.Memory.Fast) are provided, just in case.

------------------------------
1 There is also a Maxc memory diagnostic called TMEM which runs on the Nova, but it does not test memory as thoroughly, does not find bad check bits, and requires the use of a second program, PER, to analyze failures. These programs are therefore not documented here.

The error printout is of the form:

    Quad q Cab c Card d Col h Row r Pages 1360-1367 bit 14, 10 errors

where q is J, K, L, or M; c is 0 to 3; d is 1 to 16; h is 0 to 11; and r is 0 to 7.

The cabinets are numbered consecutively 0, 1, 2, ... starting with the one next to the processor. The quadrants are each represented by a row of 16 cards starting at the top of the cabinet, so J is the top row, then K, then L, and M is the bottom row. In each row the cards are numbered 1 to 16 starting at the left as you look from the back of the cabinet. (Ref. Figure 4, Appendix.)

If you hold the cards with the ic's up, edge connector toward you, the storage chips form an array of 12 columns by 8 rows starting with 0,,0 in the lower right corner. The other parts on the board are 12 address drivers on the left and right and 6 sense amps and other stuff next to the edge connector. (Ref. Figure 5, Appendix.)

In an emergency, you can replace a card with one of the spares in the unused rack above the Maxc2 Alto. If you have to do this, be sure to attach the error printout and a note about where the card came from to the card you remove; leave the card on the table in the Maxc room so system maintainers can find it.

13.4. Disk Maintenance

Micro-Exec is used to diagnose most disk problems.
The assortment of commands for doing this is discussed in the Micro-Exec section. Two microdiagnostic programs called DSKD and EXAM are also available. However, these programs are only of interest to Ed McCreight; they are not intended for operation by novices.

Micro-Exec reports disk addresses in the following format:

    0FFPPPHHCCCS

where:

    FF  = function (04 = read, 10 = write)
    PPP = pack number
    HH  = head number
    CCC = cylinder number
    S   = sector number

Basic test procedures are as follows:

a) Scanning a system pack for errors:

    *Scan.Disk.Pack.For.Errors

This prints out both soft and hard errors, and provides no way of telling which errors are hard. To find only hard errors:

    *Set.Disk.Error.Retry.Count 20
    *Scan.Disk.Pack.For.Errors

b) Writing and verifying a test pack (note that pack 100 is reserved exclusively for this purpose):

    *Mount.Auxiliary.Pack
    *Write.Disk.Test.Pattern     (Writes and verifies test pattern)
    *Verify.Disk.Test.Pattern    (Verifies already written test pattern)
    *Dismount.Auxiliary.Pack     (when done)

c) Observing disk data on a scope:

    *Loop.On.Specified.Disk.Page

This repeatedly reads a single page, blinking the screen whenever an error is detected. Type DEL to terminate.

Following Tenex file system crashes, where fatal read errors in directories have occurred, the following general methods help to isolate the cause of the failure.

1. Turn on the "Read-Only" switches for all the system disk packs.
2. Use Micro-Exec S.D.P.F.E (Scan.Disk.Pack.For.Errors) to find out which drives/packs/controllers are causing trouble. If a particular pack suffers many failures, but the others seem ok, it is likely that the common controller is ok, but that something is wrong with either the unit controller or disk drive that is experiencing the failures. If the failures are confined to a particular head, it is likely that that head is misaligned or broken.
3. If a pack produces many irrecoverable read errors on one drive, try mounting it on another drive and using S.D.P.F.E. If it can be read ok on the other drive, then it is likely that there is some failure in the read electronics of the original unit controller or disk drive; otherwise, it is likely that there is a failure in the write electronics of the original controller/drive.
4. Try writing a test pattern on a test disk pack on a good drive (pack 100 is usually used for this purpose because it does not have any bad spots). Then mount the pack on the suspect drive and use S.D.P.F.E to see if it can be read correctly. If it cannot be read correctly, you have further indication of a read electronics failure; if it can be read correctly, then you have further confirmation of a write electronics or head failure.
5. If you determine that a particular drive/controller combination is broken, you can determine whether it is the drive or controller by recabling the suspect controller to a working drive and the suspect drive to a working controller. (Refer to the last two paragraphs in section 12.1 for a discussion of what to do if the disk configuration is changed.)

13.5. TM

TM is a Maxc memory diagnostic that runs on the Alto. It is used for basic debugging of the memory system, or when the memory is too sick to support Maxc programs such as Micro-Exec. TM is documented separately in a memo entitled "The Maxc2 Memory Test Program and Other Folklore", by Larry Clark.

13.6. MemBash

MemBash is an Alto program that beats on the Alto/Maxc memory interface while monitoring the state of the Maxc processor.
It is intended to be run at the same time as a Maxc main memory micro-diagnostic (DGM or DGMR) to provoke problems that arise under conditions of memory contention.

The Maxc micro-diagnostic should first be loaded using Midas and the end-of-test breakpoint at 20 removed by typing "20;K". Then exit Midas, start up MemBash, and issue the "Go" command. The program runs until any key is struck.

If any error occurs (in the processor, memory, or Alto memory interface), the state of the memory system and all the memory interfaces is displayed. The processor is then restarted at the beginning of the micro-diagnostic.

The Alto memory operations executed by MemBash are ordinarily sequential reads through the first 256K of memory. The "Alto Operation" command may be used to select write or RMW operations or to specify that the Alto beat on a single memory location.

MemBash may be exited by means of the "Quit" command.

13.7. SMIDiag

This program is a diagnostic for the Alto System Maintenance Interface (SMI). Its operation is very simple, and one should step through the various available commands using the "?" feature.

There are two main tests, one for the address register and the other for the data register. In either test, one may use data patterns consisting of all zeroes, all ones, alternating ones and zeroes, or random data. Important: Before running the address test with random data, it is necessary to disconnect the SMI cable to Maxc and install a terminator on the Alto interface.

13.8. AITest

Alto-IMP interface test.

MaxcOps13.Bravo  RWeaver  July 21, 1981 2:56 PM