Maxc Operations
Hardware Diagnostic and Maintenance Procedures

16. HARDWARE DIAGNOSTIC and MAINTENANCE PROCEDURES

This section discusses operation of various hardware diagnostic software and firmware.

Microprocessor diagnostics are normally used to isolate suspected failures of the microprocessor or port hardware. Micro-Exec memory and disk testing commands are generally used to isolate suspected failures of disk units and main storage modules.

The microprocessor diagnostics require that a small nucleus of the microprocessor and its interface to the Nova/Alto be working before meaningful failure diagnosis can take place. When a failure has occurred in this essential nucleus, the Midas "Test-All" command (Maxc2 only) and SMIDiag.run (Maxc2 only) are used to isolate the failure. TR has been used for this purpose on Maxc1, but for the last year or so we have not been able to recover TR from any of the old tapes on which it is stored. We have had to patch test loops into NVIO and use a scope to fix problems in the essential nucleus on Maxc1.

MemBash and TM (Maxc2 only) have been used to check out memory reference interference problems and failures in the Alto memory interface.

16.1. Running Microprocessor Diagnostics

The microprocessor diagnostics are normally kept on the Nova or Alto disk and (for Maxc1) on the 10SYS tape.

You must obtain the notebook labelled "Microprocessor Diagnostics" or "Maxc Diagnostics" in order to do any useful debugging. However, the diagnostics all fall into a common pattern and they can usually be run in a simple-minded way without understanding too much about what the programs are really doing. The following generalizations may be helpful.

Associated with each of the diagnostics described below is a command file whose name is the same as that of the diagnostic. If you type to Nova DOS or to the Alto Executive:

    MIDAS diagname

Midas will load the diagnostic and show registers pertinent to the diagnostic's operation on the display.
You may wish to set some parameters before starting a diagnostic (for example, the low and high addresses for a memory test). However, the values as loaded are reasonable for thorough testing.

When you are ready, type START;G to start the diagnostic. (DGM has two alternate starting addresses for interrupt system testing.) The diagnostic will halt after one pass at IMA=20 (DGIML is the only exception to this rule: it breaks at 3200 on Maxc1 and 7200 on Maxc2). This means that no failures were detected. To repeat, type ;P to Midas. To loop indefinitely, delete the breakpoint at 20 by typing 20;K and then ;P. All of the diagnostics except DGM loop in less than two seconds, so something is wrong if the machine hangs in the loop.

A breakpoint at any location except 20 indicates that an error was detected. The most common error breakpoint is at 25 or 26, the comparison error breakpoint. The location in the diagnostic from which the comparison routine was called is displayed at STK 0. On Maxc1 the .LS listing for each diagnostic gives address symbols and their values, sorted in order of value so that the symbol nearest the error address may be found readily (you normally are interested in IM memory addresses only). On Maxc2, you can get Midas to tell you the nearest label by selecting the value to the right of "STK 0" with the middle mouse button. Following the .LS listing is the diagnostic listing, which contains general operating instructions in comments preceding the program itself. DBEG0 is assembled or loaded ahead of some diagnostics, so its listing and .LS file may also be relevant. Some diagnostics INSERT other files or are assembled from several sources, so you may have to look a bit to find the relevant listings.
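
Looking up the symbol nearest an error address in a value-sorted listing amounts to a binary search. The following is only an illustrative sketch in Python, not part of any Maxc tool: the symbol names and values in SAMPLE_LS are invented, the function name is hypothetical, and "nearest" is assumed to mean the closest label at or below the address.

```python
from bisect import bisect_right

# Hypothetical excerpt of a value-sorted symbol listing: (octal value, name).
# These names and addresses are made up for illustration only.
SAMPLE_LS = [
    (0o20, "DONE"),
    (0o100, "TESTA"),
    (0o140, "TESTB"),
]


def nearest_symbol(symbols, address):
    """Return (name, offset) for the closest symbol at or below address.

    symbols must be sorted by value, as the .LS listing is described to be.
    Returns None when the address lies below every symbol.
    """
    values = [value for value, _ in symbols]
    index = bisect_right(values, address) - 1
    if index < 0:
        return None
    value, name = symbols[index]
    return name, address - value


if __name__ == "__main__":
    hit = nearest_symbol(SAMPLE_LS, 0o123)
    if hit is not None:
        name, offset = hit
        # Print the result in the octal notation used throughout this section.
        print(f"{name}+{offset:o}")
```

The offset from the label tells you how far into the listing to read past the tag.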
When tracing an error, find the tag nearest the address in STK 0 (if the breakpoint is at IMA=26) or IMA (if the breakpoint is elsewhere) and read the comments there from the listing.

If the breakpoint is at IMA=26, the diagnostic may be resumed from the point of error by the command ;P, which will return to the caller of the compare routine (this usually works but not always).

One common problem that occurs on Maxc1 when running diagnostics is that the microprocessor single steps and refuses to run. The usual cause of this is a memory interface hangup. This may be cured by performing the "Reset Memory" operation of NVIO, as follows:

    !NVIO.SV/H/R NVIO:M...OK.

If the machine still hangs, run POWER ON and see if that cures the problem. If that fails, run POWER OFF, wait for several seconds, and run POWER ON again.

The diagnostic names and what they test are listed below:

DGBASIC.MB
    Most basic diagnostic. Does not use the main memory, the interrupt system or the P-register input multiplexors. Tests everything except the SM, DM, DM1, DM2, LM, and RM memories first, then tests these memories and the F-register. Cycle time < 1 second.

DGP.MB
    Tests the P-register inputs and a few afterthoughts from DGBASIC.MB. Cycle time < 1 second.

DGALU.MB
    Tests an assortment of P-register, Q-register, and ALU operations using random numbers. Cycle time < 1 second.

DGM.MB
    Tests the memory interfaces. There are alternate entry points at START and XSTART for interrupt system testing. Be sure to read the listing comments before starting at XSTART. Cycle time ~ 20 minutes to test each 128K of main memory; correspondingly less if the address range is restricted. Note: After running DGM, you have to go through the "Power Up" procedure before running the PDP-10 microcode again, because certain low core locations that must be zero are not left zero by DGM. If you forget to do this, the result is an immediate NVIO punt.

DGI.MB
    Tests the interrupt system and repeats parts of DGBASIC and DGP affected by interrupts. Cycle time < 1 second.

DGIML/DGIMH.MB
    Tests the SM, DM, DM1, DM2, and MP memories using random numbers, then the instruction memory (IM) first by using each of the 72 patterns of a single 0 in a field of 1's, then the 72 patterns of a single 1 in a field of 0's, then random numbers. DGIML runs in the top of IM and tests the memory below it, while DGIMH runs in the bottom of IM and tests the memory above it. Cycle time < 3 seconds.

    Alternate starting addresses exist to test only IM, only MP, or only SM/DM/DM1/DM2 (which are physically a single memory). There are a number of special loops for repeating test sequences that have provoked failures in the past.

DGRL.MB
    Tests the right register bank (RM) and the left register bank (LM) using random numbers. Cycle time < 1 second.

DGMR.MB
    Tests the main memory using random numbers. Cycle time ~ 5 seconds. Note: After running DGMR, you have to go through the "Power Up" procedure before running the PDP-10 microcode to clear several low core locations that must be zeroed but are smashed by DGMR.

DGREG.MB
    Tests assorted registers with random numbers. Like DGALU, DGREG is a reliability diagnostic that supplements the basic tests in DGBASIC and DGP. Cycle time < 1 second.

16.2. Running PDP-10 Diagnostics

PDP-10 instruction diagnostics 0A through 0N and 0R may be run on Maxc either in stand-alone mode or under Tenex. We use them only for checking out new PDP-10 emulator microcode and not for hardware diagnosis (which is what they were originally intended for). Some of them have been patched to account for Maxc incompatibilities.

There are two methods of running PDP-10 diagnostics stand-alone.
The more convenient is to run them from Micro-Exec by means of the "run.diagnostic.program" command. However, if the microcode is working so poorly that Micro-Exec won't run, the other method is to load them from the Nova disk using DMPLD or from the Alto using AltIO's "Load" command.

Running diagnostics from Micro-Exec is simple. All the diagnostics that will run on Maxc are stored as programs on save area 2. They may be listed out by the "Print.Program.Directory" command. To execute a single pass of a given diagnostic, type:

    *Run.Diagnostic.Program

Control returns to Micro-Exec when the diagnostic is finished. To make the program loop forever, type:

    *Goto 4000

The iteration count (a decrementing negative number) may be displayed by examining Maxc location 1. To abort this, you have to either reboot Micro-Exec or halt the microprocessor with the NVIO/AltIO "H" command and then restart Micro-Exec with "20G."

Running diagnostics using DMPLD or AltIO is somewhat messier. This procedure should be used only when, due to hardware or microcode problems, it is impossible to run the diagnostics from Micro-Exec.

Maxc1: The procedure is as follows:

1) Load the diagnostics from the "PDP-10 Diagnostics" tape using the DOS "Load" command. (It is not possible simply to FTP them from Maxc2 because the file format used by DMPLD is different from that used by FTP.)
2) Load the PDP-10 microcode by means of the "MIDAS TENLOAD" command.
3) Load the selected diagnostic into Maxc memory by means of DMPLD (see Section 15).
4) Enter NVIO by typing:
       !NVIO.SV/H
5) Using ODT, make the patches listed on a sheet of paper at the beginning of the diagnostic listing.
6) Start the diagnostic by typing:
       :4000G...OK.

Maxc2: The procedure is as follows:

1) Retrieve the diagnostics from the directory on Maxc1 using FTP.
   Only a few diagnostics will fit on the Alto disk at once.
2) Load the PDP-10 microcode by means of the "MIDAS TENLOAD" command.
3) Enter AltIO by selecting the "AltIO", "Dont-Go", and "Do-It" menu items.
4) Load the selected diagnostic into Maxc memory by means of AltIO's "Load" command. Then make sure all the patches have been made.
5) Start the diagnostic by issuing the "Go" command.

On Maxc1, the directory contains all the PDP-10 diagnostics, along with a set of RUNFILs for running them in user mode under Tenex (note that diagnostic 0D cannot be run in user mode). The script named .RUNFIL will cause the specified diagnostic to be started and run forever (type control-B to stop). Additionally, there are command files BASIC.RUNFIL and RELIABILITY.RUNFIL which execute one pass of diagnostics 0A-0H and 0I-0N respectively.

16.3. Memory Maintenance

Memory maintenance is scheduled periodically every 3 months or so to replace storage ic's that have failed. Because the memory is error-corrected, the system can be operated with a number of bad components, so memory maintenance is not scheduled until the number of failures becomes significant.

The procedure for doing this is as follows:

1) Schedule the system down using the procedure discussed in "Stopping Tenex."
2) At the appointed time, the system will halt; then unprotect NVIO/AltIO with 3301P.
3) Boot MicroExec with the NVIO/AltIO "B" command.
4) Run "Test.Memory.Slow" as discussed below. You want to get output on paper; on Maxc1, the console terminal has paper output, so nothing special is required; on Maxc2, you should issue the "Diablo Printer On" command to AltIO before starting the test.
5) Power down the system as discussed in the "Power Down" section.
   On Maxc2, it is possible to power-off only the memory (so you don't have to turn off the disks).
6) Pull the cards affected by bad chips and mark the bad ic's with a magic marker; interpret the output of "Test.Memory.Slow" as discussed below. Record the serial numbers of the boards pulled from each position and attempt to restore the cards to their original positions after repair.
7) Replace the bad 1103 ic's or, if that isn't possible, use the spare memory cards in the rack above the Maxc2 Alto. (Refer to Appendix A for diagrams showing memory board and chip locations.)
8) Restore the repaired cards to their original positions (if using a spare card, mark the bad card with the complete failure information using a piece of paper and scotch tape).
9) Power up the system as discussed in the "Power Up" section.
10) Reload the microcode and restart AltIO/NVIO and MicroExec using "Midas MExecGo," as discussed in the "Loading the PDP-10 Emulator" section; repeat Test.Memory.Slow to see if the repairs have been successful.
11) If everything is ok, restart Tenex.

Memory maintenance is performed with the aid of Micro-Exec [1]. After loading the microcode and starting up Micro-Exec, simply type:

    *Test.Memory.Slow

This requires about 6 minutes to test all 384K of memory. All data, tag, and parity bit failures cause an error count to be accumulated for the affected chip. The physical location of each chip with errors is reported at the end of testing each memory cabinet, along with the pages affected and the error count. Every word in memory is tested 82 times using various patterns, so every chip is written into and read from 83968 times.
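
As a rough check on these figures: assuming each 1103 storage chip holds 1024 one-bit cells (a property of the part, not stated in this section), 82 passes over every word give 82 x 1024 accesses per chip. The sketch below, in Python purely for illustration, reproduces that count and estimates error totals under the assumption that a failing cell returns the wrong value about half the time. The function name is hypothetical.

```python
# Assumption: an Intel 1103 storage chip holds 1024 one-bit cells.
BITS_PER_CHIP = 1024
# From the text: every word in memory is tested 82 times.
PASSES_PER_WORD = 82

# Each pass touches every cell of the chip once.
accesses_per_chip = BITS_PER_CHIP * PASSES_PER_WORD


def expected_errors(failing_cells, wrong_fraction=0.5):
    """Estimate the error count for a chip with the given number of bad cells.

    wrong_fraction is the assumed share of tests in which a bad cell
    happens to disagree with the pattern written (about one half).
    """
    return failing_cells * PASSES_PER_WORD * wrong_fraction


if __name__ == "__main__":
    print(accesses_per_chip)               # accesses per chip
    print(expected_errors(BITS_PER_CHIP))  # whole chip dead
    print(expected_errors(1))              # one solid bad cell
```

Counts that fall between these two extremes suggest a partial failure, such as a bad row or column within the chip, though that reading is an inference rather than something the test reports directly.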
Hence, a total chip failure will generate on the order of 40000 errors (since it will yield the wrong value approximately half the time), while a solid single bit failure will generate on the order of 40 errors.

Conclusively diagnosed check bit failures are also reported in this manner. For check bit failures that Micro-Exec cannot diagnose (due to an insufficient sample or lack of any obviously failing bit pattern), information is printed out as to the frequency of zeroes and ones in the correct values of each check bit. Since the check bits can't actually be read, the best that can be done is to deduce which check bit is failing on the basis of these frequencies.

The "Test.Memory.Verbose" command does everything that "Test.Memory.Slow" does, but also prints out the address, error type, correct data, incorrect data, and exclusive or for every error. For check bit failures, the correct data and the computed Hamming code are printed out. This command will generate reams of output unless the memory is pretty clean to begin with.

If the diagnostic crashes due to "fatal error from wrong place in code", one should re-boot Micro-Exec and issue the "Test.Memory.Write.Slow" command before running the memory test. This sets a mode switch that forces Micro-Exec to use a more conservative (but slower) method of writing data into the memory under test.

Maxc2: To obtain hardcopy of the memory error printout, issue the "Diablo Copy On" command in the AltIO command window before starting the memory test.

Because of error correction, Tenex can run ok with bad bits and even bad cards in the memory system. Exception: a double error in pages 0 to about 217 will prevent Tenex from running. Non-hardware-maintainers should ordinarily not attempt to repair the memory system.
However, the following instructions for locating failures reported by Test.Memory.Slow (or Test.Memory.Fast) are provided, just in case.

The error printout is of the form:

    Quad q Cab c Card d Col h Row r Pages 1360-1367 bit 14, 10 errors

where q is J, K, L, or M; c is 0 to 3; d is 1 to 16; h is 0 to 11; and r is 0 to 7.

------------------------------
[1] There is also a Maxc memory diagnostic called TMEM which runs on the Nova, but it does not test memory as thoroughly, does not find bad check bits, and requires the use of a second program, PER, to analyze failures. These programs are therefore not documented here.
------------------------------

The cabinets are numbered consecutively 0, 1, 2, ... starting with the one next to the processor. The quadrants are each represented by a row of 16 cards starting at the top of the cabinet, so J is the top row, then K, then L, and M is the bottom row. In each row the cards are numbered 1 to 16 starting at the left as you look from the back of the cabinet. (Ref. Figure 4, Appendix.)

If you hold the cards with the ic's up, edge connector toward you, the storage chips form an array of 12 columns by 8 rows starting with 0,,0 in the lower right corner. The other parts on the board are 12 address drivers on the left and right and 6 sense amps and other stuff next to the edge connector. (Ref. Figure 5, Appendix.)

In an emergency, you can replace a card with one of the spares in the unused rack above the Maxc2 Alto. If you have to do this, be sure to attach the error printout and a note about where the card came from to the card you remove; leave the card on the table in the Maxc room so system maintainers can find it.

16.4. Disk Maintenance

Micro-Exec is used to diagnose most disk problems. The assortment of commands for doing this is discussed in the Micro-Exec section.
Two microdiagnostic programs called DSKD and EXAM are also available. However, these programs are only of interest to Ed McCreight; they are not intended for operation by novices.

Micro-Exec reports disk addresses in the following format:

    0FFPPPHHCCCS

where:

    FF  = function (04 = read, 10 = write)
    PPP = pack number
    HH  = head number
    CCC = cylinder number
    S   = sector number

Basic test procedures are as follows:

a) Scanning a system pack for errors:

       *Scan.Disk.Pack.For.Errors

   This prints out both soft and hard errors, and provides no way of telling which errors are hard. To find only hard errors:

       *Set.Disk.Error.Retry.Count 20
       *Scan.Disk.Pack.For.Errors

b) Writing and verifying a test pack: (Note that pack 100 is reserved exclusively for this purpose.)

       *Mount.Auxiliary.Pack
       *Write.Disk.Test.Pattern      (Writes and verifies test pattern)
       *Verify.Disk.Test.Pattern     (Verifies already written test pattern)
       *Dismount.Auxiliary.Pack      (when done)

c) Observing disk data on a scope:

       *Loop.On.Specified.Disk.Page

   This repeatedly reads a single page, ringing the bell (Maxc1) or blinking the screen (Maxc2) whenever an error is detected. Type DEL to terminate.

Following Tenex file system crashes, where fatal read errors in directories have occurred, the following general methods help to isolate the cause of the failure.

1. Turn on the "Read-Only" switches for all the system disk packs.

2. Use Micro-Exec S.D.P.F.E to find out which drives/packs/controllers are causing trouble. If a particular pack suffers many failures, but the others seem ok, it is likely that the common controller is ok, but that something is wrong with either the unit controller or disk drive that is experiencing the failures. If the failures are confined to a particular head, it is likely that that head is misaligned or broken.

3.
If a pack produces many irrecoverable read errors on one drive, try mounting it on another drive and using S.D.P.F.E. If it can be read ok on the other drive, then it is likely that there is some failure in the read electronics of the original unit controller or disk drive; otherwise, it is likely that there is a failure in the write electronics of the original controller/drive.

4. Try writing a test pattern on a test disk pack on a good drive (Pack 100 is usually used for this purpose because it does not have any bad spots). Then mount the pack on the suspect drive and use S.D.P.F.E to see if it can be read correctly. If it cannot be read correctly, you have further indication of a read electronics failure; if it can be read correctly, then you have further confirmation of a write electronics or head failure.

5. If you determine that a particular drive/controller combination is broken, you can determine whether it is the drive or controller by recabling the suspect controller to a working drive and the suspect drive to a working controller. (Refer to the last two paragraphs in section 14.1 for a discussion of what to do if the disk configuration is changed.)

16.5. TM

Maxc2 only. TM is a Maxc memory diagnostic that runs on the Alto. It is used for basic debugging of the memory system, or when the memory is too sick to support Maxc programs such as Micro-Exec. TM is documented separately in a memo entitled "The Maxc2 Memory Test Program and Other Folklore", by Larry Clark.

16.6. MemBash

Maxc2 only. MemBash is an Alto program that beats on the Alto/Maxc memory interface while monitoring the state of the Maxc processor.
It is intended to be run at the same time as a Maxc main memory micro-diagnostic (DGM or DGMR) to provoke problems that arise under conditions of memory contention.

The Maxc micro-diagnostic should first be loaded using Midas and the end-of-test breakpoint at 20 removed by typing "20;K". Then exit Midas, start up MemBash, and issue the "Go" command. The program runs until any key is struck.

If any error occurs (in the processor, memory, or Alto memory interface), the state of the memory system and all the memory interfaces is displayed. The processor is then restarted at the beginning of the micro-diagnostic.

The Alto memory operations executed by MemBash are ordinarily sequential reads through the first 256K of memory. The "Alto Operation" command may be used to select write or RMW operations or to specify that the Alto beat on a single memory location.

MemBash may be exited by means of the "Quit" command.

16.7. SMIDiag

Maxc2 only. This program is a diagnostic for the Alto System Maintenance Interface (SMI). Its operation is very simple, and one should step through the various available commands using the "?" feature.

There are two main tests, one for the address register and the other for the data register. In either test, one may use data patterns consisting of all zeroes, all ones, alternating ones and zeros, or random data. Important: Before running the address test with random data, it is necessary to disconnect the SMI cable to Maxc and install a terminator on the Alto interface.

16.8. AITest

Maxc2 only. Alto-IMP interface test.

16.9. TR

Maxc1 only. TR is a Nova program that tests the registers and memories of the Maxc1 processor. Since it uses the same routines as Midas to read and set the processor state, Midas will probably not run if TR doesn't run.

The program runs under DOS.
It takes six kinds of arguments:

W   The registers and memories are referenced by numbers as follows:

         0  PC         14  --
         1  IMA        15  LM
         2  P          16  --
         3  Q          17  RM
         4  X          18  --
         5  Y          19  SM
         6  AC         20  --
         7  F          21  DM
         8  MAR        22  --
         9  KMAR       23  MAP (0 to 511 only)
        10  MDR        24  --
        11  MDRL       25  IM[0:35]
        12  KMDR       26  --
        13  ARM        27  STACK
                       28  --
                       29  IM[36:72]
                       30  --
                       31  MAIN (0 to 64K-1 only)

    W > 18 refers to a memory and will be written M below.

V   Value, which is a positive integer < 2**48 and is taken as decimal unless suffixed by "R8". To add n zeroes to the end of the number suffix it with "En". Thus 64 = 100R8 = 1E2R8.

A   Address, which is exactly like a value.

I   Increment of the form i j k, where i, j, and k are Vs. The increment i j k is t for t := i step j until k. Hence the increment 1 3 10 produces the sequence 1, 4, 7, 10.

R   Rotate, of the form i j k as above, which produces the sequence t for t := i, 2t + j while t <= k. Hence the R's.

S   Sequence of addresses, which is exactly like an increment.

The simplest calls of TR are:

    TR W V        to write V into register W and read it back.
    TR M V A      to write V into address A of M and read it back.

Only errors are printed (in octal). The switch /T prints each value written and read, and /N suppresses printing. The switch /F sends the printing to a file which is given as the first argument. Thus:

    TR/F FOO 3 77R8

tests Q with 77 and writes the errors to FOO (which must not exist already).

    TR/I W I
    TR/I M I A
    TR/R W R
    TR/R M R A

test the register or memory location with each value of the increment in turn or with each value of the rotate in turn.

Two switches make sense for memories only:

    TR/S M V S

tests each address in the sequence. V may be replaced by I or R if the appropriate switch is used. Each location is tested with all the values of the I or R before going on to the next.

    TR/A M V S

writes V into all the addresses and then reads them back.
Again I or R may be used, in which case successive values are written into successive addresses. On successive cycles of the test the i and k of S are incremented by 1.

Thus TR/A/R 19 1 0 7 100R8 1 104R8 does:

    100  1        101  1
    101  3        102  3
    102  7        103  7
    103  1        104  1
    104  3        ---

Summary of flags:

    A   address range          TR/A/R 10 0 1 77R8 100 1 200
    F   write on file          TR/F FOO 3 77R8
    I   increment for value    TR 3 0 1 77R8
    N   no typing
    R   rotate                 TR/R 3 0 1 1E12R8
    S   sequence addresses     TR/S 10 77R8 100 1 200
    T   type everything

MaxcOps16.bravo    RWeaver    November 25, 1980 1:32 PM
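
The value, increment, and rotate notations described in section 16.9 can be sketched in a few lines of Python. This is an illustration of the rules as written there, not a reconstruction of TR itself: the function names are hypothetical, the count n in "En" is assumed to be decimal, and the guards against non-terminating sequences are additions of this sketch.

```python
import re

# digits, optional "En" (append n zeroes), optional "R8" (octal radix)
_VALUE = re.compile(r"^(\d+)(?:E(\d+))?(R8)?$")


def parse_value(text):
    """Parse a TR-style value such as 64, 100R8, or 1E2R8."""
    match = _VALUE.match(text)
    if match is None:
        raise ValueError(f"not a TR value: {text!r}")
    digits, zeroes, octal = match.groups()
    radix = 8 if octal else 10
    # Appending n zeroes in a radix multiplies by radix**n.
    value = int(digits, radix) * radix ** int(zeroes or 0)
    if value >= 2 ** 48:
        raise ValueError("value must be below 2**48")
    return value


def increment(i, j, k):
    """Yield t for t := i step j until k."""
    if j <= 0:
        raise ValueError("step must be positive in this sketch")
    t = i
    while t <= k:
        yield t
        t += j


def rotate(i, j, k):
    """Yield t for t := i, 2t + j while t <= k."""
    t = i
    while t <= k:
        yield t
        following = 2 * t + j
        if following <= t:
            break  # sequence would never advance (e.g. i = 0, j = 0)
        t = following


if __name__ == "__main__":
    print(parse_value("100R8"), parse_value("1E2R8"))
    print(list(increment(1, 3, 10)))
    print(list(rotate(0, 1, parse_value("77R8"))))
```

With a rotate of 0 1 77R8, as in the summary of flags, the rule produces the growing run of low-order one bits 0, 1, 3, 7, and so on up to 77 octal.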