Dorado Hardware Manual, Instruction Fetch Unit, 14 September 1981

Instruction Fetch Unit

The instruction fetch unit, or IFU, decodes a stream of bytes from memory into a sequence of 8-bit opcodes and operands using a writeable decoding memory, and presents the results to the processor for efficient interpretation. The next section contains an overview of IFU function, supplemented by details in later sections.

Read this chapter with Figure 12 in front of you.

Overview of Operation

The IFU handles four independent instruction sets. Opcodes are 8-bit bytes, which may be followed in memory by 0, 1, or 2 operand bytes. Hence, the total length of an operation is 1, 2, or 3 bytes. The first operand byte is called a, the second b.

One method of dealing with operations longer than 3 bytes is to encode them in IFUM as 1-byte jumps to the next operation. This gives up the possibility of referencing N, a, or b with _Id but avoids having to restart the IFU. The processor then must compute the proper place in the instruction stream and reference a, b, and any later operand bytes without help from the IFU.

The term PC refers to the displacement of an opcode byte from the codebase, which is BR 31. PCs are 16-bit items, where bits 0:14 are an unsigned word displacement relative to the codebase, and bit 15 selects the byte. In other words, codebase points at a 32k-word segment of virtual memory; a PC selects a byte in this segment. The PCs are named PCF, ..., PCM, and PCX, where the final letter in the name denotes the level in the IFU pipeline.

Since the IFU's PC is only 16 bits, overflowing either end of the code segment causes wraparound. This programming error is not detected by the hardware.

For Alto compatibility reasons, we currently have the following kludge. Instruction sets 0 and 1 treat byte 0 in the selected word as bits 0:7 and byte 1 as bits 8:15; instruction sets 2 and 3 treat byte 0 as bits 8:15 and byte 1 as bits 0:7.
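The PC format and the byte-order kludge can be illustrated with a short Python sketch. It is a hypothetical model, not Dorado microcode: `pc_to_word_and_byte`, `fetch_code_byte`, and the dictionary standing in for virtual memory are invented for illustration, and the sketch assumes the Xerox convention that bit 0 is the most significant bit of a 16-bit word.

```python
from typing import Dict, Tuple

# Hypothetical model (not Dorado microcode) of how a 16-bit IFU PC selects a code byte.
# Assumption: bit 0 is the most significant bit of a 16-bit word.

def pc_to_word_and_byte(codebase: int, pc: int) -> Tuple[int, int]:
    """Return (virtual word address, byte select) for a PC relative to codebase (BR 31)."""
    pc &= 0xFFFF            # the PC is only 16 bits, so overflow wraps silently
    word_disp = pc >> 1     # PC[0:14]: unsigned word displacement
    byte_sel = pc & 1       # PC[15]: selects byte 0 or byte 1 of that word
    return codebase + word_disp, byte_sel

def fetch_code_byte(memory: Dict[int, int], codebase: int, pc: int, ins_set: int) -> int:
    """Fetch the code byte at pc, honoring the per-instruction-set byte order."""
    word_addr, byte_sel = pc_to_word_and_byte(codebase, pc)
    word = memory[word_addr] & 0xFFFF
    bits_0_7, bits_8_15 = word >> 8, word & 0xFF
    if ins_set in (0, 1):
        # Instruction sets 0 and 1: byte 0 is bits 0:7, byte 1 is bits 8:15.
        return bits_0_7 if byte_sel == 0 else bits_8_15
    # Instruction sets 2 and 3: byte 0 is bits 8:15, byte 1 is bits 0:7.
    return bits_8_15 if byte_sel == 0 else bits_0_7

# Usage: one code word 0xABCD at codebase 0x1000.
code = {0x1000: 0xABCD}
print(hex(fetch_code_byte(code, 0x1000, 0, ins_set=0)))  # prints 0xab
print(hex(fetch_code_byte(code, 0x1000, 0, ins_set=2)))  # prints 0xcd
```

The same PC therefore names a different byte of the same word depending on the instruction set in use.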
Eventually, this may be changed so that all instruction sets use 0 for the byte in bits 0:7 and 1 for the byte in bits 8:15.

The IFU is started by first selecting an instruction set (InsSetOrEvent_B function) and then loading the F-level PC (PCF_B function). The IFU then starts fetching the byte stream at the word BR[31] + PCF[0:14], byte PCF[15], from the cache and prepares opcodes for interpretation by the processor.

Bytes from the cache then march through the IFU pipeline, beginning with the F and G full-word buffer registers on the MemD board; single bytes from F/G then move into J or H on the IFU board. InsSet[0:1] and the opcode byte in J address the decoding memory, IFUM, a 1024-word x 24-bit (+3 parity) RAM containing the information in the table below. Although IFUM is writeable, it will normally be loaded with the microprogram and not subsequently changed (diagnostics are, of course, an exception).
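A similar sketch shows how a 10-bit IFUM address could be formed and how B values for InsSetOrEvent_B and BrkIns_B line up with the bit assignments in Table 20. The helper names are hypothetical; the sketch assumes that InsSet supplies the two high-order IFUM address bits (the text says only that InsSet[0:1] and the opcode in J address IFUM) and that bit 0 is the most significant bit of a 16-bit word.

```python
# Hypothetical helpers (not part of any Dorado tool) for IFUM addressing and for the
# B values given to InsSetOrEvent_B and BrkIns_B, following the layouts in Table 20.
# Assumptions: InsSet supplies the two high-order IFUM address bits, and bit 0 is
# the most significant bit of a 16-bit word.

IFUM_WORDS = 4 * 256  # 4 instruction sets x 256 opcodes = 1024 words

def ifum_address(ins_set: int, opcode: int) -> int:
    """IFUM address formed from InsSet[0:1] and the 8-bit opcode in J."""
    if not (0 <= ins_set < 4 and 0 <= opcode < 256):
        raise ValueError("InsSet must be 0..3 and opcode 0..255")
    return (ins_set << 8) | opcode

def place_field(value: int, first: int, last: int) -> int:
    """Put value into bits first:last of a 16-bit word, with bit 0 the MSB."""
    width = last - first + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the field")
    return value << (15 - last)

def ins_set_b(ins_set: int) -> int:
    """B for InsSetOrEvent_B: B[0]=1 selects an InsSet load, B[6:7] is the set."""
    return place_field(1, 0, 0) | place_field(ins_set, 6, 7)

def brk_ins_b(opcode: int) -> int:
    """B for BrkIns_B: the opcode goes in B[0:7]."""
    return place_field(opcode, 0, 7)

print(ifum_address(2, 0x10), hex(ins_set_b(3)), hex(brk_ins_b(0xAB)))  # prints 528 0x8300 0xab00
```

Keeping the MSB-first numbering in one helper (`place_field`) avoids scattering the "15 - last" shift through the rest of the code.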
Table 18: IFUM Fields

Name      Size  Contents
Length'     2   Opcode length: 1, 2, or 3 bytes (0 length is illegal).
TPause'     1   The opcode is of type pause.
TJump'      1   The opcode is of type jump.
IFaddr'    10   TNIA[4:13] of the first instruction to be executed in interpreting this opcode
                (TNIA[14:15] come from the IFUJump in the exit of the previous opcode).
RBaseB'     1   RBase initialization, discussed below.
MemB        3   MemBase initialization, discussed below.
Sign        1   Operand sign extension, discussed below.
Packeda     1   Packed a, discussed below.
N           4   Operand encoded in the opcode, discussed below.

Length', TPause', TJump', Sign, Packeda, and N are used by the IFU to prepare operands and to sequence correctly to the next opcode; IFaddr' is passed to the control section; and the processor uses MemB and RBaseB' to initialize MemBase and RBase when the microcode for the opcode commences.

Length' determines the number of operand bytes; when the assembled instruction is ready to proceed, a for a two- or three-byte instruction will be in H, while b for a three-byte instruction will be in F/G. The assembled instruction and a then drop into the M level.

IFUJump[n] (see "Control Section") transfers control to the starting instruction for the opcode assembled in M, where TNIA[4:13]_IFaddr, TNIA[14:15]_n (n is 0 to 3) is the location of the entry instruction. A 4-long entry vector, rather than a single starting address, can be utilized for faster execution, as discussed later. IFaddr may be overruled by a trap address when appropriate.

At t0 of the starting instruction, the processor initializes RBase to RBaseB (i.e., to 0 or to 1), and initializes MemBase to 0..MemBX[0:1]..MemB[1:2] (a 0 bit concatenated with MemBX[0:1] and MemB[1:2]) if MemB[0] = 0, or to 34 (octal) + MemB[1:2] if MemB[0] = 1. MemBX is interpreted as a stack pointer to a 4-entry stack with 4 base registers in each entry, and MemB[1:2] in IFUM selects a particular base register from the current entry. The MemBX kludge may reduce computation on procedure call/return, as discussed later.
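The entry-address and MemBase rules above reduce to a little bit arithmetic, sketched here in Python with hypothetical helper names. The sketch assumes IFaddr and MemB are already in true form (IFUM itself stores IFaddr' low-true). One consequence worth noticing: MemB = 7 gives MemBase = 34 (octal) + 3 = 37 (octal), that is, base register 31, the codebase.

```python
# Hypothetical model of the entry-address and MemBase arithmetic described above.
# Assumption: IFaddr and MemB are supplied in true form (IFUM stores IFaddr' low-true).

def ifujump_target(ifaddr: int, n: int) -> int:
    """Microstore address of entry n: TNIA[4:13] = IFaddr, TNIA[14:15] = n."""
    if not (0 <= ifaddr < 1024 and 0 <= n < 4):
        raise ValueError("IFaddr is 10 bits and n is 0..3")
    return (ifaddr << 2) | n

def initial_membase(memb: int, membx: int) -> int:
    """MemBase loaded at t0 of the entry instruction (a 5-bit base register number)."""
    if not (0 <= memb < 8 and 0 <= membx < 4):
        raise ValueError("MemB is 3 bits and MemBX is 2 bits")
    memb_0 = memb >> 2       # MemB[0], the high-order bit
    memb_1_2 = memb & 0b11   # MemB[1:2]
    if memb_0 == 0:
        # 0, MemBX[0:1], MemB[1:2]: register MemB[1:2] of the current MemBX entry
        return (membx << 2) | memb_1_2
    return 0o34 + memb_1_2   # one of the fixed base registers 34-37 (octal)

print(ifujump_target(5, 2), initial_membase(0b011, 2), initial_membase(0b101, 3))  # prints 22 11 29
```

Because TNIA[14:15] come from the IFUJump clause, the four entries of one opcode always occupy four consecutive microstore locations.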
Other information about the opcode and a is copied into the X level.

Instructions that implement the opcode then reference operands in sequence using the A_Id, RisId, or TisId operations discussed in "Processor Section" or the IFetch_ operation discussed in "Memory Section," which read operands from the X level. The operand sequence delivered by the IFU in response to _Id is as follows:

and H and refills these pipe levels with bytes along the jump path.

The B_PCX' function reads PC (inverted) for the current opcode. Note that PCF_B does not affect the value of PCX; B_PCX' continues to read the displacement of the current opcode, which does not change until an IFUJump is done.

An opcode that conditionally jumps can be encoded in IFUM with type either jump or regular. If it is encoded as type jump, then when the condition is false, the program must issue PCF_B to restart the IFU at the fall-through address. Similarly, if it is encoded as regular, then when the condition is true, PCF_B must be issued to restart at the jump address.

The Length argument delivered by _Id after other operands have been referenced is useful in conditional jump calculations.
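The identity behind the fall-through computation can be checked numerically. B_PCX' delivers the ones' complement of PCX, and in 16-bit arithmetic the ones' complement of x equals -x-1, so (Id)-(PCX')-1 equals Length+PCX when Id supplies Length. The Python sketch below is a hypothetical model of that arithmetic only; the function names are invented.

```python
# Numeric check of the fall-through computation T_(Id)-(PCX')-1 with 16-bit wraparound.
# Assumption: Id delivers Length, and B_PCX' delivers the ones' complement of PCX.

MASK16 = 0xFFFF

def b_pcx_inverted(pcx: int) -> int:
    """Model of B_PCX': the current opcode's PC, inverted."""
    return ~pcx & MASK16

def fall_through(length: int, pcx: int) -> int:
    """(Id) - (PCX') - 1 computed in 16 bits; equals Length + PCX modulo 2**16."""
    return (length - b_pcx_inverted(pcx) - 1) & MASK16

print(fall_through(2, 100), fall_through(3, 0xFFFF))  # prints 102 2
```

The second case shows the result wrapping at the top of the 16-bit PC space, matching the wraparound behavior described earlier.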
Note that the fall-through address for a conditional jump is Length+PCX, so:

        T_(Id)-(PCX')-1;        *Id = Length for type jump
        PCF_T;
        Noop;
        IFUJump[0];

restarts the IFU at the fall-through address for type jump.

Following PCF_B, the IFU flushes its pipeline; it is illegal for either the instruction containing PCF_B or the one immediately after it to do an IFUJump, but any subsequent instruction can issue an IFUJump. However, the processor will spin uselessly at the IFU "NotReady" trap until the fifth cycle after PCF_B (earliest) or later (longer opcodes, cache misses, Mar traffic).

Table 20: IFU FF Decodes

Name            Action
IFUReset        Halt and clear the IFU pipeline and clear errors, testing features, and BrkPending
                (i.e., BrkIns); the Reschedule condition and instruction set are not cleared.
B_IFUMLH'       Read the high-order IFUM word, InsSet, and IdCnt onto B (low-true) as follows:
                  Field     B bits   Meaning
                  IdCnt     0:2      Count of _Id's since start of opcode
                  InsSet    3:4      Instruction set number
                  Packeda   5        Packed a
                  IFaddr'   6:15     Starting address
IFUMLH_B        Load the high-order IFUM word from B (t1 to t3), where the Packeda and IFaddr
                fields are in the same form as B_IFUMLH'.
                Must have at least one intervening instruction after a preceding BrkIns_ or
                InsSetOrEvent_.
IFUMRH_B        Load the low-order IFUM word from B (t1 to t3) in the format given below; must
                have at least one intervening instruction after a preceding BrkIns_ or
                InsSetOrEvent_:
                  Field     B bits   Meaning
                  Sign      0        Operand sign extension
                  IPar.0    1        Even parity over N, MemB[1:2], and IFAD[0:1]
                  IPar.1    2        Even parity over IFAD[2:9]
                  IPar.2    3        Even parity over Packeda, Sign, Length', MemB.0, RBaseB',
                                     TPause, and TJump
                  Length'   4:5      Instruction length (low true)
                  RBaseB'   6        1-bit RBase initialization
                  MemB      7:9      3-bit MemBase initialization
                  TPause'   10       Type pause (low true)
                  TJump'    11       Type jump (low true)
                  N         12:15    4-bit operand
B_IFUMRH'       Read IFUM fields in the same format as IFUMRH_B (inverted).
PCF_B           Load PCF at t3, clear and restart the pipeline.
B_PCX'          Read PC for the currently executing opcode (inverted).
BrkIns_B        Load BrkIns from B[0:7] at t3, and set BrkPending (ill-defined unless the IFU has
                been reset). BrkIns replaces the next opcode loaded into J; then BrkPending is
                cleared. BrkIns also addresses IFUM on IFUMLH/RH_ and B_IFUMLH'/RH'.
InsSetOrEvent_B If B[0]=1, then B[6:7] are loaded into the InsSet register at t3; if B[0]=0, then
                B[4:15] control event counters as discussed in the "Other IO and Event Counters"
                chapter. A following PCF_B starts the IFU interpreting using the new instruction
                set. Illegal except when the IFU is paused or reset or when PCF_ will be done
                before the next IFUJump.
Reschedule      Cause a reschedule trap on the second or third "successful" IFUJump.
                "Successful" means that an IFUJump is not trapped for some other reason such
                as not-ready.
                The second IFUJump will be trapped if it does not occur in the instruction
                immediately after the first successful IFUJump; otherwise, the third successful
                IFUJump will be trapped. The trap instruction is executed as though it were the
                first instruction of the rescheduled opcode, and _Id and IFUJump will work as
                though that opcode were in progress.
                Also set the Reschedule branch condition (emulator only) to true.
RescheduleNow   RescheduleNow is guaranteed to trap the next successful IFUJump, so long as the
                next IFUJump appears in the second cycle after RescheduleNow, or later. The
                Reschedule branch condition is not affected.
NoReschedule    Turn off the Reschedule trap and branch condition.
IFUTest_B       Load the test-control register from B (load with 0 or do IFUReset when not
                testing) as follows:
                  Field          B bits   Meaning
                  TestFG         0:7      Substituted for cache data
                  TestFGParity   8        Substituted for cache parity bit
                  TestFault      9        Substituted for memory fault signal
                  TestMemAck     10       Substituted for memory MemAck signal
                  TestMakeF_D    11       Substituted for memory MakeF_D signal
                  TestFH'        12       Enable FHCP and t1 when IFUTick executed
                  TestSH'        13       Enable SHCP and t2 when IFUTick executed
                  TestEn         14       Test enable
IFUTick         Tick the IFU's clock once according to TestFH and TestSH in the IFUTest register.

The IFUJump Entry Vector

An IFUJump[n], encoded in the JCN field of the instruction, sends control to an address partly determined by the IFU and partly by the IFUJump clause. The four possible targets of an IFUJump are called an "entry vector".

An opcode leaves its results in one of several convenient forms agreed to by convention, then chooses an entry instruction in its successor with IFUJump[n], where n = 0 to 3. Every opcode in the instruction set must have an entry vector of the same length. Careful choice of forms may reduce execution time by one cycle for some opcodes without increasing execution time for successor opcodes.

A true branch condition (FF-encoded) with IFUJump prevents starting the next opcode.
For example, IFUJump[2,condition] sends control to the next opcode's entry 2 if condition is false, or to entry 3 if condition is true. However, when condition is true, no other IFU activities associated with starting the new opcode take place, so entry 3 is executed in the context of the opcode that did the IFUJump[2,condition]. The processor does, however, initialize RBase and MemBase as though the next opcode were starting, so this part of the state is lost. Thus, at a cost of one entry instruction in every opcode of an instruction set, it may be possible to shorten the execution time of some opcodes using a conditional exit.

An opcode with common and uncommon exit cases, for example, can exit with IFUJump[2,condition], where entry 2, the common case, starts the next opcode, while entry 3 is reached for the uncommon case. Since IFUJump loads Link with .+1, entry 3 can either Return, to execute more code associated with the uncommon case, or it can do something more explicit, if an appropriate convention is followed by all opcodes.

The following example shows how an instruction set with four opcodes (Push, Add, Store, and JNZ) is implemented using a four-long entry vector.
The opcodes in this example deal with the stack like Mesa opcodes do, and the first three entry conventions are, in fact, ones which might be used by the current Mesa emulator.

        % Entry
        0:  Stk[StkP] holds top-of-stack (if any; garbage if stack empty), T holds garbage.
        1:  T and Stk[StkP-1] hold previous top of stack (garbage if stack empty),
            Stk[StkP] garbage, Md holds top-of-stack.
        2:  T and Stk[StkP+1] hold top-of-stack,
            Stk[StkP] holds previous top of stack (garbage if stack empty).
        3:  Results in same form as entry 2, but restart IFU at NewPC = (Id)-(PCX')-1.
        Note that Stack&+1 references must not check for underflow when the stack may
        legitimately be empty.
        %

        *Push the memory location pointed to by N.
        Push:   Fetch_Id, T_StackNoUFL&+1, IFUJump[1];
                Fetch_Id, T_StackNoUFL&+1_Md, IFUJump[1];
                Fetch_Id, StkP+2, IFUJump[1];
                T_(Id)-(PCX')-1, StkP+1, Return;

        *Replace the top two stack entries by their sum.
        Add:    T_Stack&-1, Branch[.+2];
                Stack_Md;
                T_Stack&-1_T+(Stack&-1), IFUJump[2];
                T_(Id)-(PCX')-1, StkP+1, Return;

        *Store the top-of-stack into the memory location pointed to by N and pop the stack.
        Store:  Store_Id, DBuf_Stack&-1, IFUJump[0];
                Stack_Md, Branch[Storex];
                Store_Id, DBuf_T, IFUJump[0];
                T_(Id)-(PCX')-1, StkP+1, Return;
        Storex: Store_Id, DBuf_Stack&-2, IFUJump[2];

        *Pop the stack and branch if the top-of-stack was zero, else fall through.
        *This opcode is of type jump.
        JNZ:    Pd_Stack&-1, Branch[ZTest];
                Pd_Md, StkP-1, Branch[ZTest];
                Pd_T, Branch[ZTest];
                T_(Id)-(PCX')-1, StkP+1, Return;
        ZTest:  T_Stack&-1, IFUJump[2,ALU#0];
                *Return here when the jump doesn't take.
                T_Stack&-1, PCF_T;
                IFUJump[2];

Push thus requires 1 execution cycle; Store and Add take either 1 or 2 cycles depending upon the entry point; JNZ takes 2 cycles when the jump takes or 9 cycles when the opcode falls through (because the IFU isn't ready until the fifth cycle after PCF_B).

Although every opcode in an instruction set must have an entry vector following the same conventions, it is not necessary that the vector be four-long.
In the above example, a single-entry scheme would probably use the entry 2 convention followed above. In that event, Push, Add, Store, and JNZ would require 2, 1, 2, and 3 cycles (common case), respectively, compared to 1, 1 or 2, 1 or 2, and 2 or 3 cycles for the four-entry scheme above.

Since Mesa requires about 120 IFU entries for its 256 opcodes, the cost of the second entry in the vector is between 0 and 120 locations, and 120 locations each for the third and fourth entries. Since Mesa is implemented by about 1044 instructions using entry vectors of length 1, a length-2 scheme would require ~1100, length 3 ~1220, and length 4 ~1340 instructions. The implementor of an instruction set should decide when the additional locations expended for larger entry vectors are no longer worth the additional speed.

Although we originally hoped for as much as 8% faster inner loops and 4% overall speed improvement, Gene McDaniel measured only 2% faster execution for Mesa (excluding disk wait) using a length-3 entry vector; microstore increased about 120 locations.
Investigation revealed that increased traffic on Mar (by overlapped Fetch_ and _Md) was causing IFU NotReady to occur more often, offsetting the fact that fewer processor cycles were needed. Forwarding saved about .2 cycles/opcode.

Note: IFU trap locations discussed below must also be entry vectors that follow the same convention.

Timing Summary

From the detailed timing discussion at the end of this chapter, the following generalizations about IFU timing can be drawn:

Assuming no misses and no delays because the processor uses Mar, IFUJump will successfully dispatch to the entry instruction of the next opcode on the fifth cycle after PCF_B if the new opcode either is one byte long or is two bytes long and starts at an even byte; otherwise it will succeed on the sixth cycle.

A jump opcode causes a 3-cycle gap in the IFU pipe. The effect of the gap would be a 3-cycle delay if each opcode were executed in exactly one cycle. However, the gap can overlap with extra cycles taken on the jump opcode itself or on either of the two preceding opcodes. As usual in timing considerations, a 3-byte opcode counts as two normal opcodes.

If a long stream of regular one-byte opcodes is being executed by the processor at the fastest possible rate (one instruction/opcode), and if the IFU neither misses nor faults nor waits for the processor's use of Mar or the cache, then it will always have the next opcode ready for IFUJump. If the IFU waits one cycle for the processor to use Mar, it will shortly fill its pipe again, so scattered Mar references by the processor will not result in IFU NotReady.

If a long stream of regular two-byte opcodes, each of which has an a but no N (this is the worst case), is being executed by the processor at the fastest possible rate (one instruction/opcode), and if the opcodes in the stream start at the even bytes in words, and if the IFU neither misses nor faults, and if the processor never uses Mar, then the IFU will give 25% NotReady.
Each cycle in which the processor uses Mar adds one cycle of delay. If the opcodes in the stream start at the odd bytes in words, then the processor will get NotReady 40% of the time.

Three-byte opcodes are not as bad as two-byte opcodes because, in the worst case, the processor cannot reference both a and b in fewer than 2 instructions. Hence, a stream of three-byte opcodes has timing approximately the same as a stream in which each three-byte opcode is replaced by a one-byte opcode followed by a two-byte opcode.

Mar traffic may be an important timing factor if many opcodes finish in one or two cycles. Whenever the processor is making a reference, the IFU cannot use Mar, and the IFU must make one reference for every two bytes in the instruction stream. Note that if a processor reference is held, the IFU will also be prevented from making references (but the IFU is not prevented from making references when _Md is held).

Use of MemBX and the Duplicate Stk Regions

The present Mesa implementation requires 34 cycles for a local XFER and 54 cycles for an external XFER, excluding memory wait, and measurements made on the Mesa compiler showed that 38% of all cycles were spent in XFER. For this reason, speed improvements in XFER are an important objective.

Since about 70% of all calls return before calling any other procedure, if a caller's base registers and stack were left untouched, then this information would neither have to be saved during the call nor restored during the return in most cases.

The hardware that supports this idea consists of the MemBX register, pointing at one of four blocks of 4 base registers each, and StkP, pointing at one of four stacks of 64 registers each.
During a procedure call, StkP and MemBX may be advanced by 1 region, leaving the caller's state intact; if the callee makes nested calls, then eventually the MemBX and Stk regions would be exhausted and some would have to be saved and (eventually) restored. However, if the callee returns without too many nested calls, then its caller's state would still be intact.

We have not constructed examples that use this idea, but a savings of 50% in average XFER timing has been projected for Mesa.

Traps

The IFU may trap for not ready, reschedule request, map faults, cache data errors, and IFUM parity errors. When a trap condition occurs, the IFU substitutes a trap address for IFaddr on the next IFUJump. Hence, the next IFUJump sends control to one of the entries in the trap vector. Locations assigned to these trap vectors are given in "Control Section"; note that each instruction set has independent trap locations.

Each trap vector is dispatched into by IFUJump exactly as though it were an opcode. B_PCX' reads the PC of the opcode that would have been executed if the trap had not occurred, and RBase, MemBase, and the _Id state are set according to that opcode (in every case except NotReady; all are undefined at a NotReady trap).

The relative priority of traps is as follows: IFUM parity error is highest, then NotReady, reschedule, cache data parity error, and map fault.

The NotReady trap occurs whenever the IFU does not have both an opcode and its associated operands (a, b) ready for the processor. Since PCX, MemBase, and RBase are invalid, the trap microcode must wait for the IFU to become ready.
The following code sequence will work for all instruction sets that do not use a conditional exit:

        NotReady:   FreezeBC, IFUJump[0];
                    FreezeBC, IFUJump[1];
                    FreezeBC, IFUJump[2];
                    FreezeBC, IFUJump[3];

For the sample instruction set given earlier, which uses entry 3 as a conditional exit, the following sequence would be appropriate:

        NotReady:   IFUJump[0];                     *Can't convert to IFUJump[2] because stack may be empty
                    T_Stack&-1_Md, IFUJump[2];      *Convert case 1 to case 2
                    IFUJump[2];
                    T_(Id)-(PCX')-1, StkP_StkP+1, Return;   *Resume the opcode which didn't really exit

If the IFU detects bad parity on any read of IFUM, the IFUJump to the opcode affected by this parity error will trap to the IFUM parity error trap location.

The IFU will trap at the cache data parity error location if it detected invalid parity on any byte sent by the memory system. PCX will always correctly point at the opcode that would have been executed next had the trap not occurred; however, the opcode and operands pointed at by PCX are not necessarily the ones that suffered the parity error. This occurs because the pipe has continued ahead of PCX. The most confusing case occurs when the opcode following PCX was a jump; in this case the opcode fetched by the jump may have caused the parity error, in which case PCX +/- jump displacement is limited to the range PCX-400 (octal) to PCX+377 (octal).

The IFU will hold an IFUJump in the cycle prior to a cache data parity error or IFUM parity error trap.

Note that IFUReset must be given after an IFUM or cache data parity error and before restarting the IFU.

The Reschedule function is used by io tasks to request service by the emulator. The IFU will honor this trap request on the second IFUJump after it is executed, as discussed in a later section. The RescheduleNow function is like the Reschedule function, but the IFU honors it on the first IFUJump after it is executed, rather than the second (RescheduleNow was intended for use when continuing an opcode which previously experienced a fault).
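The trap priority order and the parity-error PC window described above can be expressed as a small Python model. This is a hypothetical sketch: the trap labels are informal strings rather than Dorado identifiers, and the window is read as PCX-400 (octal) through PCX+377 (octal) with 16-bit PC wraparound.

```python
from typing import Optional, Set, Tuple

# Hypothetical model of two rules from the Traps discussion: which pending trap wins,
# and the PC window in which a byte with bad cache parity may lie after a jump.
# Trap names are informal labels, not Dorado identifiers.

TRAP_PRIORITY = (
    "IFUM parity error",        # highest priority
    "NotReady",
    "reschedule",
    "cache data parity error",
    "map fault",                # lowest priority
)

def select_trap(pending: Set[str]) -> Optional[str]:
    """Return the highest-priority pending trap, or None if nothing is pending."""
    for trap in TRAP_PRIORITY:
        if trap in pending:
            return trap
    return None

def parity_suspect_window(pcx: int) -> Tuple[int, int]:
    """PCs from PCX-400 (octal) through PCX+377 (octal), with 16-bit wraparound."""
    return (pcx - 0o400) & 0xFFFF, (pcx + 0o377) & 0xFFFF

print(select_trap({"map fault", "NotReady"}), parity_suspect_window(0o1000))  # prints NotReady (256, 767)
```

A fixed, ordered tuple keeps the priority rule in one place; a map fault is only reported once every higher-priority condition has been cleared.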
An IFU fetch may experience a map fault. The memory system does not report IFU map faults to the fault task. Instead, it signals the IFU that a map fault has occurred, and the IFU passes this indication through its pipeline. Eventually, the IFUJump that would have sent control to the opcode affected by the map fault will instead transfer to the map fault trap vector.

Although IFU map faults are not reported to the fault task, the fault task must be careful to pass over any pipe entries that were created by IFU map faults when it is woken for some other reason.

Erroneous bytes fetched after a pause or jump opcode might cause map faults, but the IFU discards these before they reach the end of the pipeline, so the processor is never informed. Consequently, erroneous references interfere with processor memory activity and delay the IFU's efforts to refill its pipe on a jump, but don't have any disastrous effect.

An IFU fetch may experience single or double storage failures. Unlike map faults, these are reported to the fault task just as on processor fetches. The memory system pipeline will finish loading the cache munch just as though the data were ok, and the cache entries will have valid byte parity. The IFU will continue running just as though no error had occurred. However, the fault task will be woken soon enough that it will run before the IFU's F register is loaded with a byte from the bad munch. Hence, the fault task will run before the emulator can possibly execute an IFUJump to the byte that suffered the error.

For a recoverable error, the fault task can simply carry out some logging action and block; no harm will occur because the IFU will actually have gotten valid data, and the cache will contain valid data.
For an irrecoverable error, the fault task must clear the bad cache munch and use the RescheduleNow function to trap the next IFUJump to code for dealing with the irrecoverable error.

Erroneous bytes fetched after a pause or jump opcode might suffer irrecoverable errors. The fault task has no reasonable way to distinguish these from bytes really in the instruction stream, so it will cause a Reschedule trap anyway.

Remark

Although independent trap vectors for each instruction set are probably inessential, performance should be better when the NotReady trap, which occurs frequently, is distinct for each instruction set. This allows the various IFUJump exits to be transformed into the form most likely to be convenient for the next opcode.

The other traps could have been implemented to use a common trap for all instruction sets. This would be more economical for IFUM and FG parity error traps, if these simply result in an uncontinuable crash when running system microcode. However, different trap vectors for each instruction set are probably more convenient for Reschedule and Map fault traps, which have to save the state of the emulator currently running.

In any case, reserving locations for these traps costs at most 5 traps * 4 instruction sets * 4 entries/trap = 80 locations, and realistically much less than this because many instruction sets will not need 4 entries and there will probably be fewer than 4 instruction sets concurrently active.

following the one with IFUMLH/RH_ must put the same data on B.
If this data comes from RM or T, the register must not have been loaded in the cycle preceding the IFUMLH/RH_ (because the bypass logic will change the B select from Pd or Md to RM or T, possibly glitching data on B). The following subroutines illustrate loading and reading back IFUM.

        WriteIFUM:  IFUReset;                       *Stop future IFU fetches and clear the pipe
                    T_41C;
                    Cnt_T;
                    IFUReset, GoTo[.,Cnt#0&-1];     *Reset after previously issued fetches complete
                    InsSetOrEvent_RMaddr0;          *Load 2 instruction set bits forming IFUM address
                    BrkIns_RMAddr1;                 *Load 8 opcode bits forming IFUM address
                    TaskingOff;                     *Ensure no B glitch below and let BrkIns_ settle for 1 cycle
                    IFUMLH_RMdataHi;                *Write high part of IFUM
                    B_RMdataHi;                     *Keep data good a little longer (mustn't glitch)
                    IFUMRH_RMdataLo;                *Write low part of IFUM
                    B_RMdataLo, TaskingOn;          *Keep data good a little longer
                    IFUReset, Return;               *Clear BrkIns

        ReadIFUM:   IFUReset;                       *Stop future IFU fetches and clear the pipe
                    T_41C;
                    Cnt_T;
                    IFUReset, GoTo[.,Cnt#0&-1];     *Reset after previously issued fetches complete
                    BrkIns_RMaddr1;                 *Load 8 opcode bits forming IFUM address
                    InsSetOrEvent_RMaddr0;          *Load 2 instruction set bits forming IFUM address
                    Noop;                           *Two instructions must elapse after loading BrkIns,
                                                    *one after loading InsSet (?Two noops after loading InsSet
                                                    *might be better since this is a tight path?)
                    RMdataHi_IFUMLH;                *Read IFUM into RM.
                    RMdataLo_IFUMRH;
                    IFUReset, Return;               *Clear BrkIns

Continuing from Processor Faults

Saving and restoring the state of an interrupted program requires some cleverness not only for the IFU, but also for the Control, Processor, and Memory sections. The emulator might fault for a data error, map fault, or stack overflow/underflow; for io tasks, stack overflow/underflow is impossible and map faults will probably be illegal, so only data error faults are legitimate. The discussion here will concentrate on map faults, though the same approach could be used for other fault conditions as well.

The fault task must use as few instructions as possible so that io tasks won't be preempted for too long.
The minimum is to copy all pipe entries that contain memory faults into RM or Stk buffers, preserve DBuf, and save the emulator's TPC; the fault task must itself deal with data error faults by io tasks; it then restarts the emulator at a trap address. The emulator microprogram then saves the rest of the emulator state and deduces the nature of the fault(s) using methods discussed in "Memory Section".

The emulator fault microcode first saves ALU branch conditions and task-specific registers, then other information of interest. The saved information is stored where the Mesa (or whatever) program can get at it; then the trap microcode restarts Mesa at a trap procedure that will service the map fault (probably swapping in a page from the disk); eventually, state will be restored and the opcode that faulted will be resumed at the instruction that faulted.

0, ALU<0, or ALU=0.

                WakeUp[ContTask];                   *Wakeup the special task reserved for continuation.
                Link_SavedLink, At[ConTab,0];       *Restore Link and ALU>0
                TaskingOn;
                Pd_Not(Pointers_SavedPointers), GoTo[COK];
                Link_SavedLink, At[ConTab,1];       *Restore Link and ALU<0
                TaskingOn;
                Pd_Pointers_SavedPointers, GoTo[COK];
                Link_SavedLink, At[ConTab,2];       *Restore Link and ALU=0
                TaskingOn;
                Pd_(SavedPointers) xor (Pointers_SavedPointers), GoTo[COK];
        COK:    FreezeBC, GoTo[.];

        *The special restart task needed for continuation
        ContinueInit:   RBase_RBase[SavedTPC];      *Initialization code for the task
        *First of two wakeups comes here; change emulator's TPC to Resume1 and block.
        Cont0:  Block;
                T_Resume1Loc;
                Link_T, TaskingOff;
                LdTPC_0C;                           *Restart emulator at Resume1
                TaskingOn;
                Block;
        *Second of two wakeups comes here.
        *Reload emulator TPC with continuation address.
        Cont1:  Link_SavedTPC;
                LdTPC_0C;                           *Restart emulator at saved continue address
                Branch[Cont0];

IFU Testing

The IFU test control register is loaded by the IFUTest_B function; when not testing, this register should contain 1, and it is loaded with 1 by the IFUReset function. IFUTest.15 disables the periodic wakeup request to the Junk task discussed in the "Slow IO" chapter; when IFUTest.15 is 0, the junk wakeups occur 60 times/sec and are dismissed by any IFUTest_ function.

IFUTest.14 (TestEn) enables IFU test mode; it is illegal for this bit to change from 0 to 1 when the IFU is active because, if this occurred in the same cycle that an IFU memory reference was issued, then the IFU would pollute the Mar bus indefinitely, making the memory system unusable by the processor.

t10:  Load M from IFUM; IFUJump will now succeed; load J from H (FOO+1); load G from F
      (FOO+2 and FOO+3); F and H become invalid; start reading the IFUM entry for J.
t11:  Load H from G (FOO+2); load F from the FOO+4 reference.
t12:  (The FOO+1 opcode would pop into M if IFUJump were done at t10.) IFU is quiescent;
      F has two useful bytes, G one byte, J/H has two bytes; M level is ready and waiting
      for IFUJump.

FOO is a two-byte regular opcode
t10:  Load M from IFUM and M[a] from H; IFUJump will now succeed; load J from F (FOO+2);
      load G from F (garbage and FOO+3); F and H become invalid; start reading the IFUM
      entry for J.
t11:  Load H from G (FOO+3); G becomes invalid; load F from the FOO+4 reference; reference
      the word containing FOO+6.
t12:  Load G from F; F
becomes invalid.t15:Load F from the FOO+6 reference; now quiescent.FOO is a three-byte regular opcodet10:Load M from IFUM and M[a] from H; IFUJump will now succeed; load G from F (FOO+2and FOO+3); H and F become invalid; J goes to special state (b in H).t11:Load H from G (FOO+2 = b); load F from the FOO+4 reference; now quiescent.t12: (The FOO+2 byte would pop from H into M[b] if IFUJump were done at t10.)FOO is a one-byte jump opcodet10:Load M from IFUM; IFUJump will now succeed; J, H, G, and F become invalid.t11:Discard the FOO+4 reference; reference the first word along the jump path.t13:Reference the second word along the jump path.t15:Load F from the first word along the jump path.t16:Load J from F, etc.FOO is a two-byte jump opcodet10:Load M from IFUM and M[a] from H; IFUJump will now succeed; G and F become invalid; Jand H are in a special jump state, computing the jump address.t11:Discard the FOO+4 reference; reference the first word along the jump path.t12:J and H become invalid.t13:Reference the second word along the jump path.t15:Load F from the first word along the jump path, etc.Second case: Restart IFU at odd bytet0:An instruction with PCF_FOO is started, where FOO is odd.t2:F, G, H, J, and M levels are invalid; IFUJump will trap at NotReady.t3:Reference the word containing FOO.t5:Reference word containing FOO+1.t7:Load F with data from the FOO reference; reference the word containing FOO+3.t8:Load the second byte from F into J; F becomes invalid; start reading the IFUM entry for J.t9:Load F from the FOO+1 reference.t10:Distinguish 3 cases below (and the one and two-byte jump cases which are not repeatedbelow). 
FOO is a one-byte opcode
t10: Load M from IFUM; IFUJump will now succeed; load J from F (FOO+1); load G from F (garbage and FOO+2); F becomes invalid; start reading the IFUM entry for J.
t11: Load H from G (FOO+2); G becomes invalid; load F with the FOO+3 reference; reference the word containing FOO+5.
t12: Load G from F; F becomes invalid.
t15: Load F from the FOO+5 reference; now quiescent.

FOO is a two-byte opcode
t10: Load G from F (FOO+1 and FOO+2); F becomes invalid.
t11: Load H from G (FOO+1); load F with the FOO+3 reference.
t12: Load M from IFUM and M[a] from H; IFUJump will now succeed; load J from G (FOO+2); load G from F; F and H become invalid; start reading the IFUM entry for J.
t13: Reference the word containing FOO+5; load H from G (FOO+3).
t17: Load F with data from the FOO+5 reference; now quiescent.

FOO is a three-byte opcode
t10: Load G from F (FOO+1 and FOO+2); F becomes invalid.
t11: Load H from G (FOO+1); load F from the FOO+3 reference.
t12: Load M from IFUM and M[a] from H; IFUJump will now succeed; H becomes invalid; J is in a special state (b in H).
t13: Load H from G (FOO+2); load G from F (FOO+3 and FOO+4); F becomes invalid; reference the word containing FOO+5.
t17: Load F from the FOO+5 reference; now quiescent.

Slow IO

The slow io facility allows data transfers between the processor and any of up to 256 independently addressed io registers. It is intended that the slow io facility will be used to load and read control information associated with high-speed io devices (> 20 x 10^6 bits/sec), which will then use the fast io system for their data transfers.
Low-speed devices (< 20 x 10^6 bits/sec) will use the slow io bus for all phases of their operation. Very slow or polled devices may be driven directly from an emulator.

Device controllers for Dorado interact with the processor by exchanging data over a 16-bit bidirectional bus, IOB ("Input/Output Bus"). There may be a total of up to 256 io registers in all controllers connected to a single system. The unique 8-bit device numbers assigned to particular devices, or to uses that appear in every system, are discussed in subsequent chapters and summarized in the table below.

Table 21: IO Register Addresses

    Number (octal)  Name         Comment
    10              DiskControl  Disk control register
    11              DiskMuff     Disk muffler control
    12              DiskData     Disk FIFO data
    13              DiskRam      Disk format RAM
    14              DiskTag      Disk tag register
    15              EData        Ethernet input or output data
    16              EControl     Ethernet control and status
    360             PixelClock   DDC pixel clock
    361             Mixer        DDC mixer
    362             CMap         DDC CMap
    363             DWTFlag*     (DispM analog of DWTFlag)
    364             DHTFlag*     (DispM analog of DHTFlag)
    365             BMap         DDC BMap
    366             NLCB*        (DispM analog of NLCB)
    367             Statics*     (DispM analog of Statics)
    370             Status       DDC muffler and OIS data
    372             MiniMixer    DDC MiniMixer
    373             DWTFlag      DDC word task control
    374             DHTFlag      DDC horizontal task control
    375             HRam         DDC horizontal waveform control
    376             NLCB         DDC next line control block
    377             Statics      DDC debugging control

[...]

        TIOA_a;
        Output_Stack&1, IFUJump[0];

These opcodes allow a Mesa program to have full access to the io system. The intent is that these instructions will be used to set up registers in firmware-driven devices, and to do all the service required by polled slow devices.
In many cases, the use of an INPUT or OUTPUT instruction is not sensible (doing io to a device normally driven by firmware, for example), but the capability should prove useful for testing and diagnostics.

Wakeup, Block, and Next

The "Control Section" chapter discussed task switching; the material that follows elaborates on that discussion.

Note that a task for which a wakeup request is issued at t0 cannot commence its next instruction until t4; i.e., at least two cycles elapse after a wakeup before the next instruction is executed. The task then runs until it does a Block; to avoid an erroneous extra wakeup, the task must lower its wakeup request at least one cycle before issuing Block. Consequently, an io device may turn off its wakeup request according to one of three strategies:

The first is to turn off the request when Next becomes equal to its task number; in this case the wakeup request is lowered at t0 of the first instruction executed for the task, and the task must not block until the second instruction, to prevent an erroneous second wakeup. Device controllers that do this must deal with the special situation in which Next is invalid ("Next Lies"). This situation occurs as follows:

    Suppose that a task blocks with the following instruction:

        Branch[Loop], Fetch_Address, Block;    *Fetch next word

    This generates Switch, and the task in Bnt is broadcast over the Next bus. If the Fetch_ causes hold and Bnt < Ctask, then no task switch will occur. However, the Next bus is incorrectly broadcasting Bnt.
    Since hold occurs after t1, there is insufficient time to change the Next bus back to Ctask in this case. Consequently, controllers using Next detect "Next Lies" and disable any actions that would otherwise be performed when it occurs.

    A pathological lockout problem should be noted: since task T's wakeup request was lowered at t2 when Next=T was noted at t0, the Next Lies condition will (correctly) result in repeating the held instruction at t2; however, some task of lower priority than T may erroneously execute at t4. This might be a problem if some high-demand task of higher priority is coded so that it always creates Next Lies (say, by doing Block and an immediate _Md in the instruction after a Fetch_).

    Another consequence of "Next Lies" is that IOAtten may be incorrect while it is occurring. Consequently, branching on IOAtten is illegal in an instruction that blocks and might be held.

The second strategy monitors TIOA becoming equal to a particular device value. In this case the wakeup request is lowered at t0 of the second instruction following a wakeup, and the task must not block until the third instruction. The disk controller has used this strategy, which has the drawback that if TIOA inadvertently assumes the particular device value for any other task, the hardware will malfunction. A consequence of any device using this strategy is that all tasks must be careful to initialize TIOA properly when first awakened.

The third strategy waits for some Output_B or Pd_Input operation to reset the wakeup condition.
This would reset the condition at t3 or t5 of the Output_B instruction, and the wakeup would be lowered at t4 or t6; in this case the task must not block until the third or fourth instruction after the Output_B or Pd_Input, to avoid an erroneous wakeup. The exact requirement depends upon the io controller; the disk controller, for example, lowers its wakeup request at t4 and can block in the third instruction after Output_B, while the display controller horizontal task lowers its wakeup request at t5 and can block in the fourth instruction.

If loops naturally run for at least three instructions, use of TIOA is more economical than use of Next, because TIOA decoding is mandatory in any case, while Next is needed only for short-loop devices, devices that use the fast io system, and devices that drive the SubTask lines.

SubTasks

When an io device sees Next becoming equal to its task, it can (optionally) present a two-bit SubTask number as well.

The processor, control, and memory sections clock SubTask into flipflops at t0. The processor ORs SubTask[0:1] into RBase[2:3] and into MemBase[2:3]. This allows the same firmware to control several identical io devices concurrently: each device, represented by a SubTask, gets its own RM region with 16 RM locations and its own pair of MemBase registers; if only SubTask[0] is driven, then two RM regions and four MemBase registers are available to each subtask. Note that the 16 change-RBase-for-write functions do not OR SubTask into the changed address, so they cannot be used; also, if RBase is read by the processor, the value read out has SubTask ORed in. However, the 16 change-RSTK-for-write functions do work.

Note also that when the debugging processor (Baseboard microcomputer or Alto running Midas) asserts the Freeze signal, the effect of SubTask on RBase[2:3] is disabled, but SubTask continues to affect MemBase[2:3].

In the memory section, the task and SubTask that issued an IOFetch_ are bussed to fast output devices along with the data from storage.
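The SubTask ORing described above can be checked with a small numeric model. This is a sketch under stated assumptions, not the hardware logic: it assumes RBase is the 4-bit field RBase[0:3] and MemBase the 5-bit field MemBase[0:4], with bit 0 the most significant bit as for the PC elsewhere in this manual; the function names are hypothetical.

```python
# Model of SubTask modifying RBase and MemBase (illustrative sketch).
# Assumed widths: RBase[0:3] (4 bits) and MemBase[0:4] (5 bits), with
# bit 0 the most significant bit of each field.
RBASE_WIDTH = 4
MEMBASE_WIDTH = 5


def shift_for_field(width: int, last_bit: int) -> int:
    """Left shift placing a field that ends at bit `last_bit` of a
    `width`-bit register, where bit 0 is the most significant."""
    return width - 1 - last_bit


def effective_rbase(rbase: int, subtask: int) -> int:
    """RBase as used by the processor: SubTask[0:1] ORed into RBase[2:3]."""
    return rbase | (subtask << shift_for_field(RBASE_WIDTH, 3))


def effective_membase(membase: int, subtask: int) -> int:
    """MemBase as used by the processor: SubTask[0:1] ORed into MemBase[2:3]."""
    return membase | (subtask << shift_for_field(MEMBASE_WIDTH, 3))


# Firmware coded with RBase = 1100 and MemBase = 10000 (binary). MemBase[4]
# remains under firmware control, giving each SubTask a pair of MemBase registers.
for subtask in range(4):
    rb = effective_rbase(0b1100, subtask)
    mb = effective_membase(0b10000, subtask)
    print(f"SubTask {subtask}: RBase {rb:04b}, MemBase {mb:05b} and {mb | 1:05b}")
```

Because the bits are ORed rather than added, firmware shared by several SubTasks must leave RBase[2:3] and MemBase[2:3] zero; otherwise different SubTasks would map onto the same RM region or MemBase registers.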
The receiving device identifies itself by means of this task and SubTask information. IOStore_'s are handled similarly.

A task presenting SubTask signals generally must Block at the same location each iteration, since there is only a single TPC value for all of the SubTasks. Hence, the full generality of tasking is unavailable: the microcode for these tasks must be coded as though the wakeup mechanism were a priority interrupt.

Illegal Things IO Tasks Must Not Do

(1) It is illegal to Block in an instruction that does B_ExternalSource, where ExternalSource is anything except one of the sources on the IFU board. This restriction is needed so that the emulator will be able to do arithmetic on B_PCX'.

(2) The IOAtten branch condition is illegal in an instruction that Blocks and might be held, because Next Lies might occur, as discussed above.

(3) A task may not Block in an instruction that might be held if its wakeup request might be dropped at t0 of the instruction. If this occurred, the instruction might inadvertently be repeated before the Block took effect.

(4) It is illegal to Block with TaskingOff in force.

(5) A task must not Block until one cycle after its wakeup request is turned off.

(6) It is illegal to issue Wakeup[n] if task n might run in the next cycle. Wakeup[n] must be executed with TaskingOff in such circumstances.

Fast IO

The fast input/output system provides high-bandwidth data transfers between storage and io devices. Transfers occur in units of one munch (= 16 words); the addresses of the 16 words must be i, i+1, ..., i+15, where i mod 16 = 0.
One word is transferred every clock, for a peak bandwidth of 533 x 10^6 bits/second. A fast device is also interfaced to the slow io system, from which it receives its control information, since there is no way for the device to communicate directly with the processor using the fast io system.

A single transaction of the fast io system transfers exactly one munch. As far as the io system is concerned, successive transactions are completely independent of each other, whether they involve the same or different devices. The only relationship between transactions is that the storage references of two transactions occur in the order in which they were issued.

Each fast io transaction is initiated by an IOFetch_ or IOStore_ reference coded in ASEL. Once this instruction has been executed, the transaction proceeds without further interaction with the processor (except for fault reporting). The transaction itself involves a storage reference and transport of the data between main storage and the device. In the case of a fetch, transport happens at the end of the reference, after the munch has been error-corrected. For a store, transport happens at the beginning of the reference, in parallel with mapping the VA and starting the storage chips. As a result of this difference, the transport for a fetch may overlap or even follow the transport for a following store.

Transport

The device is concerned only with the transport of the data, and has no way of knowing exactly how or when the storage reference takes place. The transport happens in 16 clocks, each transporting a single word using the Fin bus (IOFetch_'es) or the Fout bus (IOStore_'s). The two busses are independent, and transport can be happening on both of them simultaneously.

The two busses have much in common. Both have Task and Subtask lines, on which the memory presents the task and subtask involved in the transport about to begin, and a Next signal used for synchronization.
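The munch alignment rule and the peak-bandwidth figure at the start of this chapter can be checked with a little arithmetic. This is a sketch: `munch_addresses` and `clock_period_ns` are hypothetical helpers, and the clock period is derived from the 533 x 10^6 bits/second figure (16 data bits per clock) rather than stated in this chapter.

```python
# Munch addressing and peak fast-io bandwidth (illustrative sketch).
MUNCH_WORDS = 16      # one munch = 16 words
WORD_BITS = 16        # data bits per word (parity bits excluded)


def munch_addresses(va: int) -> list[int]:
    """Return the 16 word addresses of the munch containing `va`.

    A munch starts at an address i with i mod 16 == 0, so the base is
    found by clearing the low four address bits.
    """
    base = va - (va % MUNCH_WORDS)
    return [base + k for k in range(MUNCH_WORDS)]


def clock_period_ns(peak_bits_per_sec: float) -> float:
    """Clock period implied by transferring one word per clock."""
    return WORD_BITS / peak_bits_per_sec * 1e9


addrs = munch_addresses(0o1234)
print(oct(addrs[0]), oct(addrs[-1]))          # 0o1220 0o1237
print(round(clock_period_ns(533e6), 1))       # 30.0
print(MUNCH_WORDS * clock_period_ns(533e6))   # roughly 480 ns to transport one munch
```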
The Fout bus has a Fault line which is high at the time the last word of the transaction is delivered if there was a memory fault during the fetch (other than a corrected single error).

Both data busses are 18 bits wide: 16 data bits, numbered 0..15, and two byte parity bits, numbered 16 (for bits 0..7) and 17 (for bits 8..15). The parity bits have the same timing as the data bits. A device is invited to check the parity of data on Fin, and is required to generate parity for data on Fout.

Wakeups and Microcode

The normal interface between a device and its task involves one wakeup for each munch transferred. The device must keep track of the number of wakeups it has issued, since data may not arrive from storage for several microseconds, but there is no way to stop the data from arriving once the task has started the memory reference.

Typical microcode for a fast output device is given in the "Display Controller" chapter.

Latency

Suppose that the highest priority fast io task issues its wakeup request at t0; then it will execute its first instruction at t4. Some other task can cache fault with a clean victim in the cycle starting at t0, and another task can cache fault with a dirty victim in the cycle starting at t2. The first fault gives rise to one storage reference and the second to two storage references; each of these three storage references takes 8 cycles to handle, so the fast io reference will not begin for about 24 cycles. From the time it begins until the last data word is delivered to the device is 23.5 cycles, for a total of 47.5 cycles, to which 2 cycles must be added for the time between the wakeup and the first executed instruction. In this situation, the transport is not finished until 49.5 cycles after the wakeup.
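The worst-case arithmetic in the Latency paragraph can be written as a small function so the figures can be varied. This is a sketch that simply adds the terms given in the text; the function and parameter names are hypothetical, and the defaults are the numbers quoted above.

```python
# Worst-case fast-io fetch latency, adding up the terms given in the text.
def worst_case_fetch_latency(prior_refs: int = 3,
                             cycles_per_ref: float = 8,
                             transport_cycles: float = 23.5,
                             wakeup_to_first_instr: float = 2) -> float:
    """Cycles from the wakeup request until the last word reaches the device.

    prior_refs: storage references queued ahead of the fast-io reference
    (one for a clean-victim cache fault plus two for a dirty-victim fault).
    """
    wait_for_storage = prior_refs * cycles_per_ref       # 3 * 8 = 24 cycles
    return wakeup_to_first_instr + wait_for_storage + transport_cycles


print(worst_case_fetch_latency())   # 49.5
```

Keeping the per-reference time as a parameter makes it easy to redo the calculation for other storage configurations, although any such adjustment is only as good as the figures plugged in.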
Lower priority tasks are delayed by an additional 8 cycles for each reference that might be made by a higher priority task.

The above is one possible worst case. Another is the execution time of higher priority tasks; a wakeup might be delayed by the sum of the longest normal execution of the fault task and of other higher priority tasks. The fault task execution time is presently unknown.

A store reference is slightly better, since its transport is finished 8 cycles after the reference starts, for a total latency of 40 cycles.

All these numbers assume that a reference can be started every 8 cycles. If successive references are to 4k modules, however, they can happen only every 13 cycles, and the calculations must be adjusted accordingly. Also, data is returned from a 4k module 3.5 cycles later.

Disk Controller

This chapter describes the Dorado disk controller, which uses the Slow IO system to control up to four Century Data Trident disk drives. Either the 80 x 10^6-byte T-80 or the 300 x 10^6-byte T-300 drives can be used. An extension of the controller onto a second logic board (not designed) would allow control of up to 31 disk drives; alternatively, duplicating the present controller (with different TIOA, task, and muffler assignments) would allow independent control of four additional drives.

Keep Figure 13 in view while reading this chapter.

The disk controller uses task 14₈ and the first five values of the TIOA addresses in block 10₈ - 17₈ (the Ethernet controller, on the same logic board, uses two of the other three). Either the task or the TIOA block can be modified by changing a SIP component on the logic board.
TIOA assignments are as follows:

    10₈   DiskControl   Output_B to control register
    11₈   DiskMuff      Output_B muffler control and Pd_Input to read muffler
    12₈   DiskData      Pd_Input to read FIFO or Output_B to write FIFO data
    13₈   DiskRam       Output_B to format RAM
    14₈   DiskTag       Output_B to tag register

Note: other tasks must not select these TIOA addresses at any time; doing so may cause the disk controller to malfunction.

The controller is interfaced to the disk drives by a daisy-chain cable bussed to all drives and by an independent radial cable to each drive. The radial cables contain the following signals:

    data line (bidirectional, differentially driven)
    data clock (from drive, differentially driven)
    subsector/index line (from drive)
    selected line (from drive)
    select line (from controller)
    sequence line (from controller, controlled by the baseboard for drive 0 and grounded for other drives)
    two VCC lines and scope trigger (from controller)

The daisy-chain cable contains the following signals:

    16 control "tags" driven by the controller and received by the selected drive
    9 error and status signals from the drive, as follows:
        CylOffset'    ReadOnly'    NoTerminator
        HeadOvfl'     SeekInc'     DevCheck'
        NotOnLine     NotReady     Index'

The controller ORs the NoTerminator error (which means that the daisy-chain cable isn't terminated) into the NotOnLine error; the other error indications are discussed later.

Disk Addressing

The disk system is accessed through a multi-level addressing scheme. First a particular disk drive is selected. Then a data surface (head) and a cylinder are selected (5 surfaces and 815 cylinders on a T-80).
Each cylinder is further divided into sectors, which consist of blocks.

Firmware may control the following parameters:

    Sector size (1378 words max., limited by the 4-bit subsector counter)
    Number of blocks within one sector (1 to 4)
    Block sizes (2 to 2684 words)

Note: Various limits on the sizes of blocks and sectors will be discussed. The processor interface allows a six-bit subsector counter, of which only four bits are presently implemented, and this is the most significant length limit at present (1378 words). If the subsector counter were enlarged to six bits, then the block size limit imposed by the error correction algorithm (2684 data words) would apply. We are, however, unlikely to find any of these length limits significant unless we enlarge the memory page size to 4096 words. Jumpers in the disk unit could also be set to vary the spacing between subsector pulses.

Because sector formats are flexible, firmware can adjust the controller to system needs. The sector formats specifically envisioned in the design of the controller include 28 256-word sectors for Alto Diablo emulation and Pilot, 16 512-word sectors for Juniper, and 9 1024-word sectors for Alto Trident emulation.

Sector Layout Considerations

Each block within a sector can be either read, written, or checked. However, once any block is written, later blocks in that sector cannot be read during that disk revolution. (Later blocks should be readable on subsequent disk revolutions, though this is not guaranteed and no existing software depends on it.) Reading or writing must start with the first block in the sector and continue in order; since check bits are stored at the end of each block, the entire block must be read to verify its data or correct errors; however, one does not have to read or write subsequent blocks in the sector.
After a check-block operation is started, the controller inhibits writing later blocks within a sector without a specific "OK" from the firmware.

Our general plan is to use the first block in a sector as a header identifying the disk address; all headers will be written when a disk pack is initialized; subsequently, the disk task compares the header with the disk address it thinks it is accessing. The header not only provides a useful safeguard against positioning errors but also allows faster sector determination when switching to a new drive, as discussed later.

The second block might identify information stored in the sector (e.g., the Label block in Alto format). The third block might be the data block. The fourth block could hold reference, backup, or archiving information. All of these choices are a matter of programming convention.

[...]

Table 23: T-80 Specifications and Characteristics

    Capacity                     82.1 million 8-bit bytes unformatted
    Transfer rate                9.67 x 10^6 bits/sec (= one 16-bit word/1.65 µs)
    Cylinder positioning time    6 ms cylinder to cylinder maximum (3 ms typical)
                                 30 ms average
                                 55 ms maximum
    Rotational speed             3600 rpm (16.66 ms/revolution)
    Sector length selection      12-bit increments through jumpers on sector board
    Densities                    370 cylinders/inch
                                 6060 bits/inch max. recording density
    Disk pack characteristics    IBM 3336-type components
                                 5 recording surfaces plus 1 servo surface
                                 815 cylinders/surface
    Operating methods            Modified frequency modulation recording
                                 Linear positioning motor with cylinder following servo
    Mechanical specifications    Size - 17.8" wide x 10.5" high x 32" deep
                                 Weight - 230 lbs.
    Error rate                   Recoverable: 1 error/10^10 bits
                                 Irrecoverable: 1 error/10^13 bits
                                 Positioning: 1 error/10^6 seeks
    Pack start/stop time         20 sec start time
                                 20 sec stop time (with dynamic braking)
    Controls and indicators      Ready indicator
                                     Off = disk not spinning
                                     Flashing = spinning up/down
                                     On = Ready
                                 Fault indicator
                                 Start/Stop switch
                                 Read-only switch
                                 Degate switch (inside the drive; takes disk off-line for testing)

General Firmware Organization

This section gives a general overview of how the disk controller firmware is organized; more detailed descriptions follow later.

The disk drive generates subsector and index pulses on one line in the radial cable; the controller distinguishes these according to pulse width. In the normal Idle loop, the controller looks only at these pulses from the connected drives. A four-bit counter for each drive counts down subsector pulses and generates sector pulses. Upon either a sector or an index pulse from the selected drive, the controller generates a disk task wakeup. The disk task then either increments (sector wakeup) or zeroes (index wakeup) its firmware sector counter, clears the wakeup condition, checks for a new command, and blocks. Because there are no hardware sector counters, the disk task must maintain a sector counter itself; this implies that the rotational position is generally unknown on all deselected drives.
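The firmware sector counting described above can be modelled in a few lines. This is a hypothetical sketch, not Dorado microcode: the wakeup names SectorTW and IndexTW come from this chapter, but the class, the event list, and the use of `None` to represent "position unknown until an index wakeup" are assumptions made for illustration.

```python
# Sketch of the disk task's firmware sector counter (hypothetical model).
from typing import Iterable, Optional


class SectorTracker:
    def __init__(self) -> None:
        # Assumption: before the first index wakeup the position is unknown.
        self.sector: Optional[int] = None

    def on_wakeup(self, kind: str) -> None:
        if kind == "IndexTW":
            self.sector = 0                  # index wakeup zeroes the counter
        elif kind == "SectorTW":
            if self.sector is not None:      # sector wakeup increments it
                self.sector += 1
        else:
            raise ValueError(f"unexpected wakeup {kind!r}")


def track(events: Iterable[str]) -> Optional[int]:
    """Feed a sequence of wakeups and return the resulting sector number."""
    tracker = SectorTracker()
    for event in events:
        tracker.on_wakeup(event)
    return tracker.sector


print(track(["SectorTW", "SectorTW"]))              # None
print(track(["IndexTW", "SectorTW", "SectorTW"]))   # 2
```

Because only the selected drive produces wakeups, a model like this has to be reset whenever a different drive is selected, which is the situation the next paragraph addresses.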
When first selecting a drive, there are two strategies for determining the sector position: (1) wait for an index wakeup, at which time the sector position becomes known; (2) wait for a sector wakeup and then read the sector number stored in the header block (this can only be done if the disk is not moving to a new cylinder). The most efficient strategy appears to be a combination: select the drive and start a seek to the correct cylinder; if an index wakeup arrives before the seek is finished, then the sector position is synchronized with no loss of time. If the seek finishes first, then read the next header to determine the sector number.

When a new disk operation is noted, firmware will perform the following steps:

    Execute a drive-select command, if the drive differs.
    Load the sector size only if different, and block until index.
    Load the format RAM only if word count or commands differ.
    Execute a Control Tag (seek) command only if the cylinder differs, and wait (continuing to count sectors) until the drive becomes ready again.
    Execute a Head Tag command.
    Block until, at a sector wakeup, the next sector is the one wanted.
    Load the appropriate transfer command into the control register.
    Block until the next sector wakeup.

At the start of the next sector, the controller will become active and sequence through commands under control of the format RAM and two sequence PROMs (one for reading, one for writing).

The sequence PROMs define what operations the controller must go through, and the format RAM contains all parameters that might change from one implementation to another. Actual commands for the Trident disk are stored in the format RAM along with count values such as words/block, words of ECC, and words of delay before some operation; the commands are loaded into the tag register and executed by the controller during the transfer.

Once a transfer has started, the disk task will be woken according to the number of words in the FIFO, and it will send or receive the appropriate number of words. Read and compare operations are performed by firmware, as is detection of checksum errors at the end of reading. During writing, firmware must provide one word of sync bits (201₈ standard, 001₈ for Alto Trident emulation) followed by the specified number of words for that block (the controller will append 2 words of checksum). During reading, the controller will look for, and discard, the first word of sync bits; then firmware must accept the specified number of words for that block, followed by two words of checksum to be discarded, followed by the ECC remainder to be used for error detection/correction.

Task Wakeups

The controller may wake up the disk task for many conditions; the disk task must determine the cause and take appropriate action, which must in some way cause the wakeup to go away.

In general, there are two ways to determine the wakeup condition: read the wakeup condition, or assume the condition from the state of the disk task (which implies the state of the controller). When expecting a sector or index wakeup, the disk task must test carefully to count sectors reliably, but in the middle of word transfer operations, it will assume the wakeup reason to minimize overhead.
The various conditions are IndexTW, SectorTW, TagTW, RdFifoTW, and WrFifoTW; these wakeup conditions are detailed in the "Muffler Input" section.

Control Register

The DiskControl register is a collection of flip-flops defining the state of the controller; on Output to DiskControl, IOB is interpreted as follows:

    B[5]        Clear EnableRun
    B[6]        Set DebugMode
    B[7]        Set BlockTilIndex
    B[8:9]      Operation for first block of sector, where the operations are:
                    0 = Done (finished with all blocks in this sector)
                    1 = Write
                    2 = Read and check
                    3 = Read
    B[10:11]    Operation for second block of sector, as above.
    B[12:13]    Operation for third block of sector, as above.
    B[14:15]    Operation for fourth block of sector, as above.

EnableRun determines whether the controller is active at all. It is initially cleared by IOReset, and can only be set by completing the loading of the format RAM (see below).

DebugMode allows the controller to be exercised by diagnostics when no disk is present; in this case, diagnostic firmware provides fake disk bit-clocks and data. The flip-flop is cleared by DisableRun.

BlockTilIndex can be set to disable sector and index task wakeups until (a) the selected drive is ready, and (b) an index pulse is received from the drive. It is cleared by an index wakeup. This is useful after switching drives or executing a ReZero operation, either of which causes the controller to lose sector synchronization with the drive. BlockTilIndex prevents the wakeup conditions from being set until these conditions are met, but does not clear any such wakeups that have already occurred. To prevent races, it is necessary to clear SectorTW and IndexTW, then set BlockTilIndex, then clear SectorTW again.

A request for a sector transfer is initiated by loading bits 8 and 9 of the control register with a non-zero value. The controller will then wait until the next sector pulse to set the "Active" flip-flop and execute the transfer.
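The DiskControl bit assignments above can be turned into a word-building helper, which makes the big-endian bit numbering concrete. This is a hypothetical sketch (the helper names and the particular operation sequences are illustrative, not taken from Dorado firmware); it assumes B[0] is the most significant bit of the 16-bit IOB word, consistent with the manual's bit numbering.

```python
# Compose the IOB word for Output to DiskControl (hypothetical helper).
# B[0] is the most significant bit of the 16-bit word, B[15] the least.
DONE, WRITE, READ_AND_CHECK, READ = 0, 1, 2, 3


def bit(n: int) -> int:
    """Mask for B[n] of a 16-bit word."""
    return 1 << (15 - n)


def disk_control(ops: list[int], clear_enable_run: bool = False,
                 set_debug_mode: bool = False,
                 set_block_til_index: bool = False) -> int:
    """ops[i] is the operation for block i+1 of the sector (at most four);
    blocks without an entry are left 0 (Done)."""
    if len(ops) > 4 or any(op not in (DONE, WRITE, READ_AND_CHECK, READ) for op in ops):
        raise ValueError("at most four operations, each in 0..3")
    word = 0
    if clear_enable_run:
        word |= bit(5)
    if set_debug_mode:
        word |= bit(6)
    if set_block_til_index:
        word |= bit(7)
    for i, op in enumerate(ops):
        # Block i+1 occupies B[8+2i : 9+2i]; B[9+2i] is its low-order bit.
        word |= op << (15 - (9 + 2 * i))
    return word


print(oct(disk_control([READ, READ, READ])))              # 0o374
print(oct(disk_control([READ_AND_CHECK, READ, WRITE])))   # 0o264
```

For example, the second call requests "check the first block, read the second, write the third" and leaves the fourth block as Done, which ends the sector's activity after three blocks.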
Once a transfer has been started, it may be aborted by loading a new value into the control register twice. The first load clears the Active flip-flop, and the second loads the control register. (When Active, the control register is enabled for shifting commands rather than for loading io data.)

Format RAM and Sequence PROMs

The format RAM is a 16-word by 12-bit register that holds commands and delay counts used by the controller during a transfer. Words within the RAM are used according to the following table; the example values are appropriate for Alto Diablo disk emulation (2-word header, 8-word label, and 256-word data record).

    Addr  Description                                                           Example Value (octal)
    00    Word count of the first block                                         0001
    01    Word count of the second block                                        0007
    02    Word count of the third block                                         0377
    03    Word count of the fourth block                                        0000
    04    Control tag command for a read operation                              0104
    05    Control tag command for a write operation                             0204
    06    Control tag command to set Head Select                                0004
    07    Control tag command to zero the tag bus                               0000
    08    Word count to write zeroes before writing the 1st block of a sector   0033
    09    Word count to write zeroes before writing the successive blocks       0006
    10    Word count to wait before reading the 1st block of a sector           0011
    11    Word count to wait before reading the successive blocks               0002
    12    Word count of ECC words plus one                                      0002
    13    Word count of 2                                                       0001
    14    Word count of 1 (minimum count)                                       0000
    15    Not used                                                              0000

Notice that the format RAM contains both word counts and tag commands. Word counts are 1 less than the desired count. Tag commands will be loaded into the tag register (see below) and then used as a "control tag function" by the Trident disk. The values in the right column are those used for the Alto Diablo emulation format.
Notice that all but the first 4 values are determined by characteristics of the drive being used, as opposed to the specific sector format. The meaning of the tag command values can be found in the "Tag Register" section.

The format RAM is addressed in two ways. During a transfer, sequence PROMs move data from the RAM into either a tag register or a count register. At other times, the Dorado may address the RAM with the RAM Address register, which is zeroed when the control register is written. Executing an Output to the DiskRam register writes IOB into the RAM at the current address and then increments the address. Loading the last word in the format RAM turns on the EnableRun flip-flop, allowing normal disk control activity. The format RAM may be read via the muffler scheme discussed later.

There are two sequence PROMs, one for reading (or checking) and one for writing. The PROMs are addressed by a program counter that is initialized to zero at the beginning of a sector and is incremented upon completion of each PROM program action. Either the read PROM or the write PROM is selected according to the operation being performed on the current block.

The sequence PROMs are clocked by WordClock, which is derived from the disk bit clock, which in turn is derived from timing information pre-recorded on the disk pack. The subsector pulses generated by the drive are also derived from this timing information. This enables very precise placement of the data on the disk, independent of the disk's rotational velocity or the Dorado's clock rate.
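To make the counting convention and the loading protocol concrete, here is a small Python model. It builds the example Alto image from the format RAM table, storing one less than each desired count. It then simulates Output to DiskRam: writing DiskControl zeroes the address register, and loading the last word turns on EnableRun. The class and method names are hypothetical. Wrap-around of the address after word 15 is an assumption that the text does not state.

```python
def count_word(n):
    """Format RAM word count: the RAM holds one less than the desired count."""
    if not 1 <= n <= 0o10000:
        raise ValueError("count must fit in 12 bits after subtracting 1")
    return n - 1


def alto_format_image():
    """The 16-word example image from the table (Alto Diablo emulation)."""
    return [
        count_word(2),    # 00 header block
        count_word(8),    # 01 label block
        count_word(256),  # 02 data block
        0o0000,           # 03 fourth block (unused)
        0o0104,           # 04 control tag for a read
        0o0204,           # 05 control tag for a write
        0o0004,           # 06 control tag: Head Select
        0o0000,           # 07 control tag: zero the tag bus
        0o0033, 0o0006,   # 08-09 zero words before writing blocks
        0o0011, 0o0002,   # 10-11 wait before reading blocks
        count_word(3),    # 12 ECC words (2) plus one
        count_word(2),    # 13 word count of 2
        count_word(1),    # 14 minimum count
        0o0000,           # 15 not used
    ]


class FormatRamModel:
    """Hypothetical software model of loading the format RAM via DiskRam."""
    SIZE = 16

    def __init__(self):
        self.ram = [0] * self.SIZE
        self.address = 0
        self.enable_run = False

    def output_disk_control(self, iob):
        # Writing the control register zeroes the RAM Address register.
        self.address = 0

    def output_disk_ram(self, iob):
        self.ram[self.address] = iob & 0o7777       # 12-bit words
        if self.address == self.SIZE - 1:
            self.enable_run = True                   # last word loaded
        self.address = (self.address + 1) % self.SIZE  # wrap is assumed
```

The model shows why the RAM must be loaded in full, starting from address 0, before the controller will run.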
Bits 4 through 15 of the tag register are interpreted according to the following table:

Tag[0]  Drive select and subsector count
        Tag[4:15] are interpreted by the controller to effect drive select or subsector counter changes. The tag timing and wakeup circuit is not activated. Firmware must take care of the timing by first loading Tag[4:15] as desired but with Tag[0:3] equal to 0, then or-ing in the Tag[0] bit and outputting again.
  4:9     Subsector count. Divide the 117 subsector pulses from the disk by (subsector count + 1) to form Sector pulses (Tag[4:5] are presently unimplemented).
            Tag[4:9] = 3 yields 29 sectors large enough for 256-word data blocks.
            Tag[4:9] = 6 yields 16 sectors large enough for 512-word data blocks.
            Tag[4:9] = 14B yields 9 sectors large enough for 1024-word data blocks.
  10      Load the subsector count from Tag[4:9] for the drive selected prior to the execution of this tag instruction.
  11:15   Drive select. The basic controller handles up to 4 disk drives. Additional units may be accommodated by adding drive-dependent logic on an additional board and connecting it in place of drive 3. To allow this, the 5-bit drive select field is interpreted as follows:
            0 to 3      select drive 0 to 3, respectively
            4 to 36B    select drive 3
            37B         don't select any drive
Tag[1]  Head Tag
        Loads a register in the drive that selects the head to be used during subsequent read/write commands. A Tag wakeup occurs at completion (1.6 ms).
  4:7     Unused
  8       Off Cylinder: may be activated during a read to attempt recovery of unreadable data. It causes cylinder positioning to be offset 80 micro-inches.
  9       Determines the direction of offset if bit 8 is set.
  10:15   Head number: values from 0 to 4 are valid for a T-80, 0 to 19 for a T-300.
          The drive will turn on the "EndOfCylinder" (alias HeadOverflow) error if an invalid head address is issued.
Tag[2]  Cylinder Tag
        Causes the drive to seek to the specified cylinder. A Tag wakeup occurs after the tag timing sequence has completed (1.6 ms), and the NotReady status bit is raised until the seek has completed (3 to 55 ms depending on the seek distance).
  4:15    Cylinder number (0 to 814 for Trident disks presently in use). An illegal cylinder number will cause DeviceCheck to be raised.
Tag[3]  Control Tag
        A Tag wakeup occurs at command completion (1.6 ms) and upon completion of the last read/write operation in a sector. Generally, Control Tag commands are issued only by the controller itself (using tag commands from the format RAM) rather than by the microcode. Device Check Reset and ReZero are exceptions.
  4       AltoLeader: a special flag to the controller that allows disks written by an Alto Tricon controller to be read. This bit should only be used for the Alto Trident simulation.
  5       Unused
  6       Strobe Late: causes data recovery circuits within the drive to sample data early within the data bit time (for recovery when the drive is experiencing excessive read errors).
  7       Strobe Early: like Strobe Late except in the obvious way.
  8       Write: turns on the write circuits.
  9       Read: turns on the read circuits.
  10      Unused
  11      Reset Head Register: zeroes the head address register in the drive.
  12      Device Check Reset: resets all latched error conditions in the drive.
  13      Head Select: turns on the head selection circuits, in conjunction with a Read or Write.
  14      ReZero: repositions the heads to cylinder 0 (if the heads are loaded) and resets the head address register; resets the SeekIncomplete and DeviceCheck error conditions.
  15      Head Advance: increments the head address register in the drive.

FIFO Register

Data to/from the disk is buffered through a 16-word FIFO (25 µs of buffer), which is read/written with Pd_Input/Output_B when TIOA selects DiskData. Each FIFO word holds 16 data bits, 2 parity bits, and a 2-bit field indicating whether the next word to be read is write, read, or read-and-check data. During output to the disk, the controller checks parity both when receiving data on the io bus and again when reading the FIFO. During a disk read, parity is computed before writing into the FIFO, is passed through the FIFO, and is then written on the io bus for the processor to test.

Muffler Input

Dorado uses a multiplexor scheme called the muffler system for reading miscellaneous logic signals from the Alto or baseboard during debugging. The disk controller also allows a muffler address to be specified on an Output to the DiskMuff register. In this way, any DskEth board signal available through the multiplexors (mufflers) is also available for firmware sampling. Other bits of the DiskMuff register output specify other operations as follows:

B[0]  Simulate read data of 1 for 1 cycle (for use by diagnostic programs)
B[1]  Simulate read clock of 1 for 1 cycle (for use by diagnostic programs)
B[2]  Clear CompareErr; done by the disk task if a read&compare is found to be OK
B[3]  Set ReadDataErr; done by the disk task to inhibit future writes
B[4]  Clear the index wakeup flip-flop
B[5]  Clear the sector wakeup flip-flop
B[6]  Clear the tag wakeup flip-flop
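Returning to the tag register, this Python sketch composes tag words from the bit assignments in the Tag Register tables. It makes two assumptions. First, bit 0 is the most significant of 16 bits. Second, the 12-bit control tag commands stored in the format RAM occupy Tag[4:15]. Under these assumptions, the stored read command 0104 decodes as Read plus Head Select, and 0204 decodes as Write plus Head Select, which agrees with the format RAM table. All names are hypothetical.

```python
def tag_bit(n):
    """Mask for tag register bit n, assuming bit 0 is the most significant of 16."""
    return 1 << (15 - n)


# Tag[0:3] select which kind of command Tag[4:15] carries.
DRIVE_TAG, HEAD_TAG, CYLINDER_TAG, CONTROL_TAG = (tag_bit(i) for i in range(4))

# Control Tag (Tag[3]) command bits.
ALTO_LEADER = tag_bit(4)
STROBE_LATE = tag_bit(6)
STROBE_EARLY = tag_bit(7)
WRITE = tag_bit(8)
READ = tag_bit(9)
RESET_HEAD_REGISTER = tag_bit(11)
DEVICE_CHECK_RESET = tag_bit(12)
HEAD_SELECT = tag_bit(13)
REZERO = tag_bit(14)
HEAD_ADVANCE = tag_bit(15)


def cylinder_tag(cylinder):
    """Cylinder Tag: cylinder number in Tag[4:15]."""
    if not 0 <= cylinder <= 814:
        raise ValueError("cylinder out of range for current Trident disks")
    return CYLINDER_TAG | cylinder


def head_tag(head, offset=False, offset_direction=0):
    """Head Tag: head number in Tag[10:15], optional Off Cylinder bits 8 and 9."""
    if not 0 <= head <= 0o77:
        raise ValueError("head number must fit in Tag[10:15]")
    tag = HEAD_TAG | head
    if offset:
        tag |= tag_bit(8) | (tag_bit(9) if offset_direction else 0)
    return tag


def drive_tag_sequence(drive_field, subsector_count=0, load_subsector=False):
    """The two outputs for a Tag[0] command: first with Tag[0:3] = 0, then with Tag[0]."""
    body = ((subsector_count & 0o77) << (15 - 9)) | (drive_field & 0o37)
    if load_subsector:
        body |= tag_bit(10)
    return [body, body | DRIVE_TAG]


def sectors_per_revolution(subsector_count):
    """117 subsector pulses divided by (count + 1), keeping whole sectors only."""
    return 117 // (subsector_count + 1)


def drive_selected(field):
    """Decode the 5-bit drive select field Tag[11:15]; None means no drive."""
    field &= 0o37
    if field == 0o37:
        return None
    return min(field, 3)
```

The integer division in `sectors_per_revolution` reproduces the table's figures: 3 gives 29 sectors, 6 gives 16, and 14B gives 9.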
100      ShiftIn      The controller is currently shifting data into the FIFO.
101      ShiftOut     The controller is currently shifting data out of the FIFO.
102      ComputeECC   The controller is currently shifting data and computing the ECC checksum.
103      NextBlock    Occurs between blocks within a sector.
104      LoadTag      Indicates that the next word read from the format RAM should be loaded into the tag register as opposed to the count register.
105      CntDone'     Indicates that the count register is again zero, and a new value from the format RAM will be loaded next.
106      OutRegFull   The holding register on the input to the FIFO has been loaded, but not transferred into the FIFO.
107      InRegFull    The holding register out of the FIFO has been loaded, but not read via Pd_Input or loaded into the output shift register.
110:113  FifoWaddr    The 4-bit address indicating where the next word will be written into the FIFO.
114:117  FifoRaddr    The 4-bit address indicating where the next word will be read from the FIFO. If FifoWaddr equals FifoRaddr, then the FIFO is defined as empty.

Error Detection and Correction

To allow high data density and to tolerate a few surface imperfections from manufacture, Trident disk packs are not required to be perfect. A disk pack is defined as suitable when no more than three bad areas occur on any data surface. A bad area is defined as one that could potentially cause read errors of no more than 11 bits in length. To correct errors arising from these imperfections, as well as other (infrequent) read errors, the controller implements an error detection and correction scheme. This scheme will detect errors of any length (with very high probability), and will allow correction of any burst error of 11 bits or less.

Warning: If an error burst longer than 11 bits occurs, there is a significant possibility that the error correction algorithm detailed below will fail and double the number of bad bits!
Consequently, disk handling programs should try other methods of error recovery before invoking the error-correction algorithm.

To avoid problems, it is good practice to run diagnostic programs on new disk packs. Note bad sectors and don't use these during normal operation.

When an error does occur, the first step is to try rereading the offending sector several times. One of these reads may succeed. If not, try rereading with the cylinder position offset or with the data strobe early or late, as discussed in the "Tag Register" section. If these attempts all fail, then try error correction.

Error correction is accomplished through a mixture of disk controller hardware (for ECC generation and checking) and system software/firmware (for error recovery). This is a compromise between capability, speed, and cost. The basic capabilities and restrictions of the 32-check-bit scheme are summarized below.

1) A single error burst of length less than 12 data bits (i.e., a scattering of error bits within the bit stream, all of which fit within an 11-bit span) can be corrected in blocks shorter than 2685 data words. (Example: for the data "0001100101", the data "0000101101" contains a single burst error of length 4.) The code implemented will detect errors in arbitrarily long blocks, but not enough information exists to correct longer blocks.

For this particular application the polynomials chosen are:

$$P(X) = (X^{11} + X^{2} + 1)(X^{21} + 1)$$

During a write, the two polynomials are multiplied together and implemented by hardware in the form:

$$P(X) = X^{32} + X^{23} + X^{21} + X^{11} + X^{2} + 1$$

The data stream is premultiplied by $X^{32}$ to make room for the 2-word ECC and then reduced modulo $P(X)$.
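The write-side arithmetic can be checked in software by treating Python integers as GF(2) polynomials, where bit k holds the coefficient of X^k. This sketch computes the 32 check bits as (data * X^32) mod P(X). It then reduces the whole record separately modulo P0(X) = X^21 + 1 and P1(X) = X^11 + X^2 + 1, as in the read-side description that follows. The sketch models only the polynomial algebra, not the bit-serial shift-register hardware. Treating the first data bit as the highest-order coefficient is an assumption, and the function names are hypothetical.

```python
P0 = (1 << 21) | 1                       # X^21 + 1
P1 = (1 << 11) | (1 << 2) | 1            # X^11 + X^2 + 1
P = (1 << 32) | (1 << 23) | (1 << 21) | (1 << 11) | (1 << 2) | 1


def gf2_mul(a, b):
    """Carry-less (GF(2)) polynomial product; used to check P = P0 * P1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_mod(a, m):
    """Remainder of polynomial a modulo m over GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def bits_to_int(bits):
    """First bit becomes the highest-order coefficient."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def encode(data_bits):
    """Append 32 ECC bits: the remainder of (data * X^32) modulo P(X)."""
    ecc = gf2_mod(bits_to_int(data_bits) << 32, P)
    return list(data_bits) + [(ecc >> (31 - i)) & 1 for i in range(32)]


def syndromes(codeword_bits):
    """Read-side check: remainders modulo P0 and P1, both zero if error-free."""
    value = bits_to_int(codeword_bits)
    return gf2_mod(value, P0), gf2_mod(value, P1)
```

Because P0 and P1 both divide P, a correctly recorded block leaves both remainders zero. A single flipped bit makes both remainders non-zero.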
This is accomplished by the normal feedback shift register technique. The difference is that, to perform premultiplication, the output of the register is exclusive-or'd with the incoming data and then fed back. After all data bits have been shifted out, the contents of the ECC shift registers are appended to the disk block.

During a read, the feedback shift register is reconfigured so that the two original polynomials are implemented separately. The incoming data stream, including the 2 appended words of ECC, is independently reduced modulo $P_0(X)$ and $P_1(X)$, where

$$P_0(X) = X^{21} + 1$$

$$P_1(X) = X^{11} + X^{2} + 1$$

After all words have been read off the disk, the contents of the two polynomial shift registers are read out of the FIFO. If the data is recovered without error, then reducing it modulo $P_0(X)$ and $P_1(X)$ leaves both registers containing all zeroes.

If the data contains an error, then the two registers will be non-zero. If one but not both registers is non-zero, then the error is irrecoverable.

To recover from an error, a procedure is undertaken that determines the pattern of bits in error and the displacement of this pattern from the end of the record. I am simply going to present the magic equation to be solved, and some magic constants to be used for solving this equation. Much of the polynomial implementation and the equations, which use the "Chinese Remainder Theorem", are discussed in technical reports from CALCOMP (Calcomp Technical Report TR-1035-04, by Wesley Gee and David George) and XEROX (Xerox XDS preliminary report "Error Correction Code for the R.M.
Subsystem," by Greg Tsilikas, 28 March 1972).

The basic equation is:

$$D = Q \cdot \mathrm{LCM} - (A_0 M_0 S_0 + A_1 M_1 S_1)$$

where:
  $E_i$ = modulus of the polynomial
  LCM = least common multiple of $E_0$ and $E_1$
  $M_i = \mathrm{LCM}/E_i$
  $A_i$ = a constant such that $A_i M_i \bmod E_i = 1$
  $Q$ = smallest integer to make $D$ positive
  $S_i$ = number of shift operations to the appropriate polynomial remainders, as described below
  $D$ = displacement of the right-most incorrect bit from the end of the record

The values of $E_0$ and $E_1$ were found by programming the procedure outlined in the CALCOMP report, which yielded the following result:

$$E_0 = 21, \qquad E_1 = 2047$$

The least common multiple (LCM) of $E_0$ and $E_1$ is simply their product, since the two numbers have no factors in common. Thus the LCM, which is also the record length that can be corrected, is 42,987 bits, or about 2686.7 words.

Knowing LCM, $E_0$, and $E_1$, the values of $M_0$ and $M_1$ are easily found to be

$$M_0 = 2047, \qquad M_1 = 21$$

The values of $A_0$ and $A_1$ are next determined using a trial and error approach that I put in a small program. The results can easily be confirmed, and are given below:

Dorado Hardware Manual, Display Controller, 14 September 1981

Items can be treated in one of three ways. First, an Alto monitor can be driven. Second, items can be mapped through the 256-word x 4-bit MiniMixer into video data for a black-and-white or grey-level monitor.

Three separate interfaces are provided on the DispY board. An Alto monitor interface ORs one-bit items from the A and B channels with the cursor, and then XORs by polarity to produce one-bit pixels for an Alto display.
A seven-wire interface outputs 1 bit/pixel for a binary monitor. Finally, an 8-bit digital-to-analog converter (DAC) produces grey-level video.

Third, items may be mapped by the Mixer (or A color map), a 1024-word x 24-bit RAM, into signals for a color or grey-level monitor. A variety of modes determine which bits from the A and B items address the Mixer. Mixer output, consisting of 8 bits for each of the red, green, and blue guns, is then digital-to-analog converted for color monitors. Additionally, there is a 24-bit/pixel mode in which the Dorado supplies 8 bits for each of the three colors. In this mode, the colors are independently mapped through the Mixer and two additional 256-word x 8-bit RAMs called the BMap and the CMap.

The DDC is implemented on two Dorado main logic boards, called DispY and DispM. DispY contains all the logic necessary for vertical and horizontal sweep control, channel data paths, and video data for binary and grey-level monitors running at a fixed pixel clock rate. DispM contains the color maps, the programmable pixel clock, and the three DACs for driving a color monitor. Additionally, DispM contains an independent terminal controller that is structurally similar to a one-channel, one-bit/pixel DispY but is specialized to driving a 7-wire terminal.

Thus there are two principal DDC configurations. On a Dorado with only a 7-wire terminal and no color monitor, only the DispY board is present. It is programmed for Alto terminal emulation, and only a small subset of its capabilities is used.
However, on a Dorado with both a 7-wire terminal and a color monitor, the DispM board is also present. All of DispY and the color hardware on DispM are used to drive the color monitor, and the independent controller on DispM is used to drive the 7-wire terminal.

Video Data Path

Fast IO Interface and FIFO

The fast io system delivers data to the DDC at a rate of 16 bits/clock. Words are received alternately in the REven (t1) and ROdd (t2) registers shown in Figure 14. They are then written into the FIFO, a 256-word x 32-bit RAM, during the first half of the next Dorado cycle (t2 to t3), leaving the second half of the cycle free for read access by the video channels. In other words, the REven and ROdd registers widen the data path from 16 to 32 bits to allow sufficient time both to write and to read the FIFO in one cycle.

The 256 double-words in the FIFO are divided evenly between the two channels, so each has buffer storage for 16 munches. Each channel has write and read pointers that address the FIFO when appropriate.

Write pointers are initialized once during vertical retrace and then sequence through addresses for the entire display field. A write pointer is incremented after each double-word write for its channel, so that the next word to be written is addressed at all times. Since the fast io system delivers only one munch at a time, there is never any problem in deciding which of the two write pointers should address the FIFO.

Read pointers, however, are initialized during each horizontal retrace, so that the correct first double-word is read at the start of every scan line.
This is required because the fast io system always delivers complete munches, but unused double-words may appear at the end of the last munch for the previous scan line, or at the beginning of the first munch for the current scan line. The read pointer has to be reinitialized to skip over these. FIFO reads alternate between channels A and B, so the data rate for one channel is limited to 32 bits/2 cycles (= 16 bits/cycle).

Note that bitmaps are required to start at even addresses because the FIFO is 32 bits wide.

Item Formation

At the output end of the FIFO there is a multiplexor shared by both channels. For each channel, there are also two intermediate buffers (FIB and SIB) and a shift register SR. The multiplexor permutes the 32-bit quantity emerging from the FIFO so that when the double-word has marched through FIB and SIB and is finally loaded into SR, successive shifts will produce successive items of the selected size (8, 4, 2, or 1 bits).

The SR is tapped as follows:

SR.0                        Item[0] for item sizes 1, 2, 4, or 8
SR.16                       Item[1] for sizes 2, 4, or 8; gated to 0 for size 1
SR.8, SR.24                 Item[2:3] for sizes 4 or 8; gated to 0 for sizes 1 or 2
SR.4, SR.12, SR.20, SR.28   Item[4:7] for size 8; gated to 0 for sizes 1, 2, or 4

All eight Item bits are gated to 0 if the channel is off.
It is useful to think at this point that, regardless of a channel's item size, an 8-bit wide item is produced whose bits contain non-zero data only in those positions dictated by the item size. For size 1, only the most significant bit may be non-zero; size 2 allows data in the topmost two bits, and so on.

The SR loads on the item clock after its last item has been used. The item clock rate is the pixel clock rate divided by the resolution (1, 2, or 4 for full, half, or quarter, respectively). For 8, 4, 2, or 1-bit items, SR will be shifted 3, 7, 15, or 31 times, respectively, and then reloaded from SIB on the following item clock.

Synchronization of SR, which uses the item clock, with FIB and SIB, which use the Dorado system clock, is a little tricky. SIB_FIB will occur no later than (4.6 ns)+C+(1.1 ns)+C+C = 3*C+5.7 ns after SR_SIB. Here C is the period of the Dorado system clock, and 4.6 ns and 1.1 ns are the worst-case propagation delay and setup time of the components in the synchronizer. FIB_FIFO will occur at this time or on one of the next three Dorado clocks, depending upon which of these four clocks corresponds to t2 of the cycle in which this channel can read the FIFO. Allowing for propagation delay through SIB (5.0 ns) and setup time for SR (1.7 ns), the worst-case minimum spacing between loads of SR is 3*C+(5.7 ns)+(6.7 ns) = 3*C+12.4 ns. This must be less than the time for emptying SR, which is I*(32/ItemSize), where I is the period of the item clock. Hence, I > (3*C+12.4)/4 for ItemSize=8, or I > 25.6 ns for a Dorado clock period of C = 30 ns.

The 8-bit items from the two channels are then presented either to the Mixer section on the DispM board or to the MiniMixer or Alto video interface on the DispY board.
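The tap assignments above can be exercised with a small Python model. The text does not spell out the multiplexor permutation, so `load_sr` assumes one that is consistent with the taps: bit b of item k is placed at SR position tap_b + k, where tap_b is the SR bit feeding Item[b]. Under this assumption, reading the taps and shifting SR once per item clock (3, 7, 15, or 31 shifts per load) returns the items in order, each left-justified in an 8-bit item. The last function evaluates the synchronization bound I > (3*C+12.4)/(32/ItemSize). All names are hypothetical.

```python
# SR positions feeding Item[0], Item[1], ... for each item size.
TAPS = {
    1: [0],
    2: [0, 16],
    4: [0, 16, 8, 24],
    8: [0, 16, 8, 24, 4, 12, 20, 28],
}


def bit(word, i):
    """Bit i of a 32-bit word, numbering bit 0 as the most significant."""
    return (word >> (31 - i)) & 1


def load_sr(double_word, size):
    """Assumed multiplexor permutation: bit b of item k goes to SR.(tap_b + k)."""
    sr = 0
    for k in range(32 // size):
        for b, tap in enumerate(TAPS[size]):
            sr |= bit(double_word, k * size + b) << (31 - (tap + k))
    return sr


def items_from_sr(sr, size):
    """Read the taps once per item clock, shifting SR.(n+1) into SR.n between items."""
    items = []
    for k in range(32 // size):
        if k:
            sr = (sr << 1) & 0xFFFFFFFF   # 32/size - 1 shifts in total
        item = 0
        for b, tap in enumerate(TAPS[size]):
            item |= bit(sr, tap) << (7 - b)  # Item[0] is the MSB of the 8-bit item
        items.append(item)
    return items


def min_item_period_ns(clock_ns, size):
    """Lower bound on the item clock period: (3*C + 12.4 ns) / (32 / ItemSize)."""
    return (3 * clock_ns + 12.4) / (32 // size)
```

With 8-bit items, for example, the double-word 12345678H comes out as the items 12H, 34H, 56H, and 78H. With 4-bit items, each nibble appears in the top four bits of its 8-bit item.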
Mixer

The Mixer is controlled by the A8B2, BBypass, and 24Bit mode controls. It is a 1024-word x 24-bit RAM. The 10 address bits it requires may be obtained from two possible source distributions, depending upon the A8B2 mode. When A8B2 is true, the address consists of AItem[0:7] and BItem[0:1]. When A8B2 is false (called A6B4), the address is AItem[0:5] and BItem[0:3].

Another mode, the BBypass mode, can be enabled independently for the B channel. If B is bypassed, none of its bits contribute to the Mixer address. Instead, they bypass the Mixer and address a 256 x 8 RAM, the BMap, whose outputs are ORed with the Mixer outputs for the blue DAC. For example, consider ASize=8, BSize=4, BBypass true, and A8B2 true, with appropriate values in the Mixer RAM. In this case, the controller may be thought of as three 4-bit/pixel channels driving three color guns. One channel is bypassed data from B, while the other two are mapped through the Mixer.

24Bit mode, used in conjunction with BBypass mode, is used to run a three-channel color display directly from memory. In this mode, items from the A channel alternately address the Mixer (called the AMap in this mode) and another 256 x 8 RAM called the CMap. Meanwhile, the B channel runs at half the A channel rate and addresses the BMap as described above. (That is, the B channel must be set to one-half the resolution of the A channel.) With suitable values in the color maps, the AMap, BMap, and CMap independently generate outputs for the red, blue, and green DACs, respectively.

Note: When the A channel is turned on, the first AItem addresses the AMap and the second AItem addresses the CMap. For the A and B pixels to align properly on the display in 24Bit mode, the left margin counts must be set to start the B channel one pixel clock earlier than the A channel.
The blue and green portions of the AMap must be entirely zeroed, since the blue and green outputs are ORed with the BMap and CMap.

After routing as dictated by the mixer modes, the chosen items are loaded into the map address registers. This causes the color maps to produce a new video value every pixel clock (every two pixel clocks in 24Bit mode), and these values are latched in the three 8-bit mixer output registers. Three very fast DAC modules then produce a Red-Green-Blue triple of analog signals for a color monitor, or up to three grey-level video signals. In conjunction with the sync, blank, and composite waveforms produced by the monitor control circuitry, these signals can drive a wide variety of monitors attached to the Dorado.

Alto Video Interface

A small circuit on the DispY board produces video for an Alto monitor. This circuit ORs CursorData, AItem[0], and BItem[0], then XORs by the polarity, and finally ORs with the vertical and horizontal blanking signals. This interface is obsolete and is no longer in active use.

MiniMixer

A small video mixer on the DispY board, not to be confused with the large Mixer on the DispM board, can drive either a DAC or the seven-wire interface discussed later. The MiniMixer is a 256-word x 4-bit RAM addressed by a combination of AItem, BItem, and state bits, as shown in Figure 14. On every pixel clock, dDAC[0:3] are loaded from the MiniMixer output, while dDAC[4:7] are loaded directly from AItem[4:7]. The MiniMixer is intended for experiments with mixing channels and driving grey-level monitors.

Horizontal and Vertical Control

Every monitor requires horizontal synchronizing and blanking waveforms. Interlaced monitors must be able to distinguish fractions of a scan line to implement interlacing.
In general, the duration and phasing of the sync/blank waveforms are unique to a given monitor. The DDC uses the 1024-word x 3-bit HRam (Horizontal RAM) to control horizontal sync/blank.

The DDC has a set of registers called the CLCB (Current Line Control Block), which controls video generation for the current scan line. The DHT sets up parameters for the next scan line in the NLCB (Next Line Control Block), a 16-word x 12-bit RAM. The first 32 pixel clocks of horizontal blanking are called the HWindow. During HWindow, parameters for the next line are copied from NLCB into CLCB. Vertical control is also handled through the NLCB.

The interpretation of fields in NLCB and HRam is shown in Figure 15, and loading will be discussed in the "Slow IO Interface" section. The use of the different information is discussed here. The top part of Figure 14 shows how horizontal timing is controlled.

Line Control Blocks

The fields in NLCB/CLCB are interpreted as follows, where a denotes that the item is channel-specific (i.e., copies exist for both the A and B channels):

aPolarity. A single bit, used only for binary monitors, that inverts black and white (APolarity and BPolarity are or'ed by the hardware).

aResolution. A 2-bit field that controls item clock generation. Values of 0, 2, and 3 cause quarter, half, and full resolution, respectively.

aItemSize. A 4-bit field unary encoded as aSize1, aSize2, aSize4, or aSize8, denoting bits/pixel for the channel. Setting multiple bits is illegal.

aLeftMargin. A 12-bit field, in units of pixel clocks, specifying 31 less than the number of pixel clocks to wait after HWindow completes before turning the channel on. This value is not a straightforward constant, but depends upon the monitor-specific horizontal blanking time. If the horizontal blanking time is B pixel clocks and the desired beginning of data is L pixel clocks after the end of horizontal blanking, then aLeftMargin should be loaded with B+L-32-31 = B+L-63, independent of resolution.
Since L may be 0, this implies that the horizontal blanking time for the monitor must be greater than 63 pixel clocks. High-speed monitors typically have horizontal blanking times greater than 4 µs, and are this fast only with high-speed pixel clocks, so this restriction is not expected to be significant.

Note: For a monitor connected via the 7-wire interface, aLeftMargin must be B+L-68 rather than B+L-63, because video signals are delayed from the horizontal control waveforms by 5 pixel clocks.

Note: The value loaded into aLeftMargin must actually be the negative of the left margin count computed above.

aWidth. A 12-bit counter that counts at the pixel clock rate as soon as the channel turns on. When the counter runs out (or when horizontal retrace starts, whichever is earlier), the channel is turned off. Precisely, if the channel is to run for W pixel clocks, the width counter must be loaded with (W+255).

aFifoAddr. An 8-bit quantity pointing to the munch and word within the munch for the first FIFO read for the next scan line. This must be an even number because double-words are fetched from the FIFO. Firmware must keep track of the number of used munches for any given line and advance aFifoAddr by exactly the right amount, adjusting for munch boundaries, interlacing, and data breakage. The CLCB register for aFifoAddr is the channel read pointer itself.

MixerModes. A set of bits that control the mixer; these are not channel-specific. These will normally be changed infrequently, perhaps at the field rate or during display initialization. However, they are in the NLCB to allow modes to change on the fly.

Vertical Control Word (VCW).
A word controlling the vertical retrace operation of the monitor. It contains the vertical blank bit, vertical sync bit, and interlace field bit discussed in the "Vertical Waveform Generator" section below.

Cursor and CursorX. The 12-bit CursorX value is loaded into a counter that starts counting at the end of HWindow. When the counter runs out, the 16-bit Cursor value is shifted out onto the CursorVideo line. This is used by the Alto video interface and in the MiniMixer address. Precisely, if horizontal blanking is B pixels in duration, and the leftmost bit of the cursor is to appear X pixels beyond the end of horizontal blanking, then the CursorX register must be loaded with (B+X+226), or (B+X+221) when using the 7-wire interface.

Horizontal Waveform Generator

The 1024-word x 3-bit HRam contains control information for these waveforms. Under normal operation, HRam is addressed by a 12-bit counter (HRamAddr[0:11]). This counter is reset at the leading edge of horizontal sync and then increments every pixel clock until the next leading edge of horizontal sync. HRamAddr[1:10] address the RAM, and the output is loaded into the HRamOut register every other pixel clock. The three bits in HRamOut control horizontal sync, horizontal blank, and half-line. These three bits are combined and level shifted by a logic network appropriate for the monitor being driven.

The 1024-word HRam imposes the uninteresting restriction that there be fewer than 2048 pixels/scan line.

As shown in the diagram at the top of Figure 14, horizontal blanking (HBlank) is true from the end of one scan line to the beginning of the next. During horizontal blanking, HSync is turned on to initiate the horizontal retrace and turned off again when horizontal retrace is finished.
HBlank then continues for a monitor-specific interval. Note that if a channel's visible left margin is non-zero, then the horizontal scan will begin before that channel is producing any data. In this case, the video channel outputs zero items to the mixing stages until the channel is turned on.

Due to an implementation error, when the 7-wire interface is being driven from DispY, the value of HBlank[i] may differ from HBlank[i-1] only when i is even, where i is HRamAddr[1:10].

Vertical Waveform Generator

Only 2:1 interlaced monitors are supported in this design, but more complicated vertical control could be provided, if desired. To support 2:1 interlace, HRam contains a waveform called HalfLine, which is a pulse at the horizontal line frequency, 180° out of phase with HSync.

Vertical control is handled by DHT through the NVCW word in the NLCB. This word specifies whether or not vertical blank or retrace should begin or end during the next scan line. The DHT microcode must keep track of scan lines to enable vertical signals at the appropriate times.

The three VCW bits are called VBlank, VSync, and OddField. VSync enables vertical sync to begin on the next line. The OddField bit chooses either HSync or HalfLine on which to do vertical syncing (OddField=1 implies HalfLine phasing for vertical sync). This phase will alternate from the start of the line to the middle of the line and back for successive fields. The blanking signal for the monitor is VBlank ORed with HBlank.

Pixel Clock System

The programmable pixel clock on the DispM board, if present, determines the fundamental video data rate for a given monitor. The pixel clock is controlled by loading the PixelClk register via the slow io system. The pixel clock frequency is (312.5*(241-M))/(16-D) kHz, where M is PixelClk[4:11] and D is PixelClk[12:15].
Note that the pixel clock will not stabilize until about 1/2 second after the PixelClk register is loaded.

The parts of the DDC synchronized to the rest of Dorado do, of course, use the Dorado system clock. As discussed earlier, the synchronization logic for refilling SIB after SR_SIB puts a lower bound on the pixel clock period of (3*C+12.4)/4 ns (= 25.6 ns for a Dorado clock period of C = 30 ns), for an item size of 8 on either channel. We anticipate that pixel clock rates in the range 10 to 50 MHz (100 to 20 ns/pixel) will be required, so the lower bound is approximately consistent with this.

a field, then one keyboard word is reported instead of the mouse position change; thus, the correct state of the keyboard is eventually reported even if transitions are missed.

Table 24: Terminal Microcomputer Messages

Message Type  Comments
00B           Illegal; ignored
01B           Keyboard word 0 (corresponds to Alto memory location 177034B)
02B           Keyboard word 1 (Alto 177035B)
03B           Keyboard word 2 (Alto 177036B)
04B           Keyboard word 3 (Alto 177037B)
05B           Mouse buttons and keyset (Alto 177033B)
06B           8-bit changes in X-coordinate (0:7 of the message body) and
              Y-coordinate (8:15 of the message body), represented in
              excess-200B notation
07B           Illegal; ignored
10B           Keyboard word 4 (Star keyboards only; no Alto analogue)
11B           Keyboard word 5 (Star)
12B-16B       Illegal; ignored
17B           Boot message. Actually, depressing the boot button jams the
              data to one continuously, rather than generating a valid
              terminal message. Furthermore, when the boot button is let
              up, there may be as many as 8 bits of garbage following the
              last consecutive one bit; these must be ignored by the
              firmware.
              The firmware should also ignore boot button pushes less
              than 10 ms in duration, as these may be caused by noise or
              contact bounce.

Processor Task Management

This section outlines the implementation requirements of DHT and DWT and discusses the hardware associated with task wakeups and DWT subtask arbitration between the two channels.

Since DHT must do a lot of processing, it runs at low priority and is awakened once per scan line at the end of HWindow. When it runs, it must calculate all parameters for the next scan line (i.e., the one after the scan line that is just starting), load the NLCB appropriately for each channel, and set up the munch address and count for each channel in the RM registers aNextAddr and aNextCount referred to in the DWT sample code below; then it sets the aNextWCBFlag flags discussed below. The DHT wakeup will remain active until any NLCB output command is executed, so the DHT must execute at least one NLCB output command every time it wakes up, and this must occur at least three instructions prior to blocking.

DWT is a very high priority task which may run on behalf of either channel: channel A is subtask 0; channel B, subtask 2. Since it uses the subtask mechanism, DWT must always block at the same instruction each iteration. DWT does not explicitly know the channel for which it is executing at any given time; its two parameters, a start address and munch count, are received from DHT in RM registers specific to the subtask. In the normal case, DWT initiates an IOFetch and blocks. The following is the main-line DWT microcode presently in use:
The MiniMixer is loaded by a single output instruction that specifies both the address and data to be loaded. During the command pulse from t3 to t5 of the Output_B instruction, the video channel address to the MiniMixer is replaced by the address being loaded, so if the video channel is active, garbage may appear at the output during this cycle.

The 16-word x 12-bit NLCB is also loaded by single output instructions that specify both the address and data. For the NLCB, output instructions are only effective when HWindow is not occurring; during HWindow the RAM address is supplied by a counter that successively copies the NLCB words into CLCB. The format of each of the words in NLCB is shown in Figure 15. Note that any NLCB output operation will dismiss the wakeup request for DHT, and DHT must not block any sooner than the fourth instruction after the first NLCB output operation is issued.

The Statics output command is used for debugging and initialization. Two bits in the Statics register called DHTShutUp and DWTShutUp are discussed in the "DDC Initialization Requirements" section below. Three other fields called FakePClk, UseFakePClk, and MufAddr are used for debugging. When UseFakePClk is true, the regular pixel clock is degated; if FakePClk is true, then a pixel clock will occur at t5 of the Statics output command; otherwise no clock occurs. Every Statics command also loads the hardware signal addressed by MufAddr into a flipflop (at t5) which can be read by the Status input command discussed below.
In combination, the fake pixel clock and muffler readout features allow diagnostic firmware to check out most of the internal data paths in the DDC: by simulating a very slow pixel clock and "stepping" the DDC through various states, the diagnostic can check nearly all of the data paths between fake pixel clocks. The hardware signals selected by MufAddr[5:11] are given in the table below.

Table 25: DDC Muffler Signals

MufAddr   Signal              MufAddr   Signal
0         ACurrentWCBFlag     70        AFifoFull
01:07     AReaderPtr[1:7]     71        BFifoFull
10        ANextWCBFlag        72        ASize8
11:17     AWriterPtr[1:7]     73        ASize8-4
20        BCurrentWCBFlag     74        ASize8-4-2
21:27     BReaderPtr[1:7]     75        BSize8
30        BNextWCBFlag        76        BSize8-4
31:37     BWriterPtr[1:7]     77        BSize8-4-2
40:47     AItem[0:7]          100       AOn
50:57     BItem[0:7]          101       BOn
60:63     AServicePtr[1:4]    102:103   ARes[0:1]
64:67     BServicePtr[1:4]    104:105   BRes[0:1]
                              106       MonitorType

Muffler 106 (MonitorType) is the only one of interest during normal operation. It identifies the type of monitor connected via the 7-wire interface: zero denotes an Alto-style monitor; one denotes an LF (large format) monitor.

M        Munches/scan line that the fast io system can deliver.

The time required to fill the FIFO for both channels is a little longer than 30*8+20 cycles (= 276 cycles) or about 13.8 µs at a Dorado clock period of 25 ns; this follows from the fact that there are 15 munches/channel or a total of 30 munches of FIFO storage, and the fast io system can deliver one munch per 8 cycles with the first munch arriving 20 cycles after the first IOFetch_.
13.8 µs is much smaller than the vertical blanking time and longer than the horizontal blanking time, so the FIFO will start out full at the beginning of a field and will be actively refilling itself during HS+HB of each scan line. If the memory system keeps up with the demands of the video channels, then the FIFO will tend to refill itself after momentary transients in which it empties out a little.

Consequently, we know that HS+HB = (1/F - 2*VR)/S, and that M = (HS+HB)/T less corrections for refresh references, storage references by other tasks, hold, and delays for tasks of higher priority than DWT. At F = 30 frames/sec, VR = 800 µs, and S = 1000 scan lines, we get HS+HB = 31.7 µs and M = 31.7/0.4 = 79 munches less corrections. There will be an average of two refresh references/scan line, so we get an upper bound of 77 munches = 19,712 bits/scan line from storage.

However, the DWT will not get all storage bandwidth. The DWT wakeup spacing is controlled by a SIP; the smallest reasonable spacing would result in one IOFetch every 8 cycles; closer spacing would result in hold while a preceding IOFetch completed, so more processor cycles would be consumed without improving data rate. At this tightest spacing, DWT runs for 2 cycles out of every 8. Conceivably, worst case memory activity discussed in the "Fast IO" chapter could occur during these 6 cycles (a clean miss 3 cycles before the IOFetch, followed by a dirty miss 2 cycles before the IOFetch, each by a different task). However, the large amount of storage in the FIFO allows us to rely upon statistics to average out memory competition, so it is probably reasonable to allow DWT at least 80% of storage bandwidth or about 16,000 bits/scan line in the above example, which would accommodate 1000 lines x 1000 pixels/line x 16 bits/pixel. For HB = 5 µs this is equivalent to a pixel clock period of 26.7 ns.

This is only one speed limitation.
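The storage-bandwidth estimate above can be reproduced with a few lines of Python. The 256-bit munch size is inferred from the text's "77 munches = 19,712 bits", HS+HB is computed as the frame time less two vertical retraces, divided by S, and the function names are hypothetical; competition from other tasks and hold is not modeled.

```python
# Worked storage-bandwidth estimate (times in microseconds unless noted).
# Assumptions: 256-bit munches (from "77 munches = 19,712 bits"); names are
# hypothetical; competition from other tasks and hold is not modeled.

MUNCH_BITS = 256


def line_time_us(frames_per_sec: float, scan_lines: int, vr_us: float) -> float:
    """HS+HB: frame time less two vertical retraces, divided by S lines."""
    return (1e6 / frames_per_sec - 2 * vr_us) / scan_lines


def munches_per_line(hs_hb_us: float, munch_time_us: float = 0.4,
                     refresh_refs: int = 2) -> int:
    """Whole munches the fast io system can deliver per line, less refresh."""
    return int(hs_hb_us / munch_time_us) - refresh_refs


def pixel_period_ns(hs_hb_us: float, hb_us: float, pixels_per_line: int) -> float:
    """Pixel period when the visible scan HS = (HS+HB) - HB holds the pixels."""
    return (hs_hb_us - hb_us) * 1000 / pixels_per_line


hs_hb = line_time_us(30, 1000, 800)          # about 31.7 us
munches = munches_per_line(hs_hb)            # 79 - 2 = 77
bits_per_line = munches * MUNCH_BITS         # 19,712 bits
dwt_bits = 0.8 * bits_per_line               # about 16,000 bits at 80%
period = pixel_period_ns(hs_hb, 5.0, 1000)   # about 26.7 ns
print(f"{hs_hb:.1f} us, {munches} munches, {bits_per_line} bits, {period:.1f} ns")
```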
Since the 32-bit wide FIFO is accessed once per cycle alternately by the A and B channels (i.e., 16 bits/cycle/channel), and since exactly three doublewords are fetched before the horizontal scan begins for each channel, the maximum bits/scan line for each channel is about (3*32 bits)+[(26.7 ns/pixel)*(16 bits/50 ns)*(1000 pixels/line)] = 8640 bits/scan line. This means that unless both channels are running at the same data rate, the data rate will be significantly below the upper bound determined above. For example, in 24Bit mode, if the A channel runs at full resolution and gets 8640 bits/scan line, the B channel will run at half resolution and get only 4320 bits/scan line, so the maximum data rate would be about 1000 lines x 538 pixels/line x 24 bits/pixel.

Ethernet Controller

An Ethernet is the principal means of communication between a Dorado and the outside world. An Ethernet is a broadcast multi-access packet switched network which can connect up to 256 stations separated by as much as 1 kilometer with a 3 MHz channel. The 'Ether' is a passive coaxial cable to which each station is connected through a transceiver that is high-impedance when receiving, low-impedance when driving.

Readers unfamiliar with the general concepts behind the Ethernet should refer to "Ethernet: Distributed Packet Switching for Local Computer Networks," by R. M. Metcalfe and D. R. Boggs, CACM, 19(7):395-404, July 1976; or to Design and Performance of Local Computer Networks, by John Shoch, published by University Microfilms, August 1979.

Read this chapter with Figure 16 in view.

Ethernet Packets

Ethernet data are encoded in packets. Packets are preceded by a low signal (i.e., silence) on the Ether; they begin with a one-bit prefixed by the transmitter, called the start bit.
Bits in the packet are phase encoded, where the bit cell time is nominally 340 ns; phase encoded signals have one data transition per bit cell and its direction (low-to-high = 1) is the value of the bit. Midway between these there may be a setup transition, so that the next data transition can be in the correct direction.

Packets end when no transitions are detected for more than 1.5 bit times and the Ether is low. Collisions are transmissions that overlap in time and cause malformed and undecodable bits. Transmitters jam the Ether with a continuous high for several bit times after participating in a collision. Collisions are of four types: too many transitions, in which two transitions occur within 0.25 bit times; too few transitions, in which a transition occurs between 1.25 and 1.5 bit times after the last one; end-of-packet (EOP), in which no transitions occur for more than 1.5 bit times and the Ether is low; and jam, which is the same as EOP except that the Ether is high.

In a well-formed packet that does not experience a collision, the start bit is immediately followed by an 8-bit destination host number, then an 8-bit source host number. This is followed by an indefinite number of 16-bit data words, a 16-bit checksum, and finally silence.

Even when transmitted without a source-detected collision, a packet may fail to reach its destination; packets are delivered only with high probability. Stations requiring a lower residual error rate must follow mutually agreed upon communication protocols.

When the sender of a packet detects a collision, some method is needed to arbitrate (without communication) its use of the Ether with other stations contending for it. The algorithm used on the Ethernet, called the 'binary exponential backoff collision algorithm,' is discussed in the above references. It involves waiting a random interval and then reattempting transmission. The (ideal) distribution of the random intervals depends upon many factors.
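An idealized Python model of the phase encoding and of the interval rules for malformed bits is sketched below. It samples the line at half-bit-cell resolution and only illustrates the conventions in the text (low-to-high = 1; thresholds of 0.25, 1.25 and 1.5 bit times); it is not a model of the controller's PE or PD circuits, and all names are hypothetical.

```python
# Idealized phase encoding and malformed-bit classification (illustration only).
# Levels are sampled at half-bit-cell resolution: 0 = low, 1 = high.

def phase_encode(bits):
    """Each cell is two half-cells; the mid-cell data transition is
    low-to-high for 1 and high-to-low for 0. When two equal bits are
    adjacent, a setup transition appears at the cell boundary."""
    levels = []
    for b in bits:
        levels.extend((0, 1) if b else (1, 0))
    return levels


def phase_decode(levels):
    """Recover bits from the direction of each mid-cell transition."""
    if len(levels) % 2:
        raise ValueError("expected whole bit cells")
    bits = []
    for first, second in zip(levels[0::2], levels[1::2]):
        if first == second:
            raise ValueError("bit cell has no data transition")
        bits.append(1 if second > first else 0)
    return bits


def classify_interval(bit_times, ether_high):
    """Classify the time since the previous transition, in bit times.
    The 1.25 to 1.5 window is treated as inclusive at both ends."""
    if bit_times < 0.25:
        return "too many transitions"
    if bit_times > 1.5:
        return "jam" if ether_high else "end of packet"
    if bit_times >= 1.25:
        return "too few transitions"
    return "ok"


print(phase_encode([1, 1, 0]), phase_decode(phase_encode([1, 1, 0])))
```

In a clean stream the intervals between transitions are only 0.5 or 1 bit time, so anything falling in the other windows marks a malformed bit.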
Remarks

From the method of collision detection, it follows that in a noise-free Ether with ideal transmitters and receivers, a bit cell time between 0.75*T and 1.25*T, where T is the nominal bit cell time (340 ns), can be decoded correctly.

Phase encoding has the undesirable property that only 50% of the transmission medium's theoretical bandwidth is utilized. A number of reasonably simple encodings are known that more nearly approach the theoretical limit, though phase encoding is simple to implement. If at some time we were willing to abandon compatibility with the existing Ethernet, we should reconsider the use of phase encoding.

A promising alternative to phase encoding is bit-stuffing, which averages 67%, 86%, or 93% of theoretical bandwidth for 0th, 1st, and 2nd order codes. This encoding outputs data bits in a cell time equal to 1/2 of the phase-encoded cell time; when 1 (0th order), 2 (1st order), or 3 (2nd order) data bits have been output without a transition, then a non-data transition is inserted into the bit stream. The 1st order encoding (86%) could be implemented with a few changes to the current controller.

Controller Overview

The Ethernet controller is a slow IO device packaged with the disk controller on the DskEth logic board. These two devices require more edge pins than are available in an MSA-IO slot, so the board must be mounted in a Fast IO slot (see Figure 2).

It would be possible to package two Ethernet controllers on one logic board using different task and TIOA assignments for each.
This might be appropriate if Dorados are ever used as Ethernet gateways.

A cable connects the controller to a transceiver outside the Dorado enclosure; this transceiver is almost identical to the ones used for Altos and other computers, the difference being that it uses +12 volts rather than +15. Dorado transceivers are painted bright red and have large block lettering saying "Dorado only". Plugging in the wrong type of transceiver will not damage anything; it just won't work. The cable between the controller and the transceiver contains twisted-pair signals for receiver data, transmitter data, collision, +5 V, and +12 V.

The controller has independent transmitter and receiver sections. Because these two sections are completely independent, the Dorado can receive its own transmissions. This is an important aid in hardware and software debugging and simplifies the device driver, which need not check for sending to itself. Furthermore, the receiver can receive consecutive packets separated by the minimum inter-packet spacing (510 ns). This means that the Dorado can receive, without loss, streams of packets directed to it by multiple hosts and packets that immediately follow broadcasts. This capability is important for servers and other high-performance applications.

The controller uses two tasks, one for the transmitter (EOT for Ethernet Output Task) and one for the receiver (EIT for Ethernet Input Task). The receiver task is higher priority. To permit loops of two instructions per wakeup, a wakeup request is removed whenever the Next bus says the task is about to run. This simple strategy can be fooled into removing a request when NextLies occurs, but this is harmless since the required service rate is low. To avoid a spurious wakeup, a wakeup is not requested again until after the task has blocked. A debugging control bit can be set which prevents wakeups even when all other conditions are satisfied.
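How low the required service rate is can be quantified from figures given in the Remarks, Clocks, and Task Wakeups sections of this chapter: a 23.53 MHz clock at eight times the bit rate, 16-bit words, and 17 words of buffering per direction. The sketch below is back-of-the-envelope arithmetic with hypothetical names, not a timing model of the controller.

```python
# Back-of-the-envelope Ethernet service-rate figures (names are hypothetical).
CLOCK_MHZ = 23.53      # crystal oscillator, eight times the Ether bit rate
CLOCKS_PER_BIT = 8
BITS_PER_WORD = 16
BUFFER_WORDS = 17      # bus register + 15 Fifo words + shift register

bit_rate_mbps = CLOCK_MHZ / CLOCKS_PER_BIT            # about 2.94 Mbit/s
bit_cell_ns = 1000 / bit_rate_mbps                    # about 340 ns
word_time_us = BITS_PER_WORD * bit_cell_ns / 1000     # about 5.44 us per wakeup
max_delay_us = BUFFER_WORDS * word_time_us            # about 92 us of slack

# Bit cells that ideal hardware can still decode (0.75*T .. 1.25*T).
decode_window_ns = (0.75 * bit_cell_ns, 1.25 * bit_cell_ns)

print(f"bit cell {bit_cell_ns:.0f} ns, word every {word_time_us:.2f} us, "
      f"tolerable backlog {max_delay_us:.0f} us")
```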
A transmitter data late occurs when the TxFSM requests a Fifo read and the Fifo is empty but TxEOP is false. The PE sends one random bit and then stops. The resulting packet has an illegal length and probably a bad CRC.

The PE inverts and latches TxData at the start of each bit cell and inverts the latched value 1/2 bit time later. TxGo, synchronized to the beginning of a bit cell, enables the PE. The PE assumes that a data bit is available long before it is needed and acknowledges each bit after latching it by generating TxGotBit.

A collision may be detected by either the transceiver or PD. The occurrence of a collision is captured, synchronized, and used to abort the outgoing packet. The output of the first stage of the TxCollision synchronizer is wire-or'ed with PD output to jam the Ether after a collision. The jam lasts for one or two bit times, being the delay through the TxCollision synchronizer, TxFSM, and TxGo synchronizer.

Clocks

The controller needs a clock with a nominal frequency of eight times the Ether bit rate. The SingleStep control bit selects either the 23.53 MHz crystal oscillator or single Dorado clocks injected under program control. The clocks for the Ether-synchronous parts of the controller are constructed from this basic clock.

The slowest Dorado clock period at which the transmitter works is 42.5 ns. Disabling the Dorado system clocks while TxOn is true causes a transmitter data late.
If TxGo is true, the packet is chopped off, causing an incomplete transmission and probably a runt bit. When the clock is reenabled, the PE sends a few fragmentary bits and then the data late aborts the packet.

The slowest Dorado clock period at which the receiver works is 85 ns. Disabling the Dorado system clocks causes a receiver data late. The next packet that arrives after the clock is reenabled reports the data late.

Task Wakeups

The controller is designed for two completely independent tasks, with the receiver higher priority. Two IOAs select data and status/control registers. IOAtten may be tested to decide whether a wakeup request is just for another word or something special (ending status for the receiver, or PE aborted for the transmitter).

Task wakeups must, on the average, be serviced within 5.44 µs. The transmitter and receiver each have 17 words of buffering (bus register + 15 Fifo + shift register) so the variance can be quite large: accumulated delay of up to about 90 µs is tolerable, while longer delay will cause a data late error.

Muffler Input

All muffled signals on the DskEth board are accessible to Dorado firmware. The method by which a particular signal is selected and read out is discussed in the "Muffler Input" section of the "Disk Controller" chapter. Signal addresses 120 to 177 (octal) for the Ethernet controller are enumerated below.
Unless it is obvious, signals which are specific to the receiver or transmitter have Rx or Tx respectively somewhere in their names.

Table 26: Ethernet Muffler Signals

Word   Bit       Name               Meaning
ERX0   120       PDNew              1/8 bit time sample of PD input signal
       121       PDOld              PDNew delayed one sample time
       122:125   PDCnt[0:3]         Number of samples since last data transition
       126       PDCntCtrl          Increments or clears PDCnt
       127       ReportCollisions   Control register bit that enables PD collision reporting
       130       RxBOP              "Beginning Of Packet" enables receiver data wakeups
       131       EthData.18         Marks status word terminating a packet
       132
       133       RxCRCError         Output of receiver CRC checker
       134       RxDataLate         Receiver Fifo overflowed
       135       RxBusRegFull       Word in BusReg can be read with Pd_Input
       136       RxFifoFull         Receiver Fifo is full
       137       RxFifoEmpty        Receiver Fifo is empty
ETX    140:142   TxState[0:2]       State of transmitter FSM
       143       TxEOP              Transmitter data wakeups are disabled
       144       TxBusRegFull'      Word is waiting to be written into the transmitter Fifo
       145       TxGone             Transmitter FSM is shut down
       146       TxSREmpty'         Transmitter shift register is empty
       147       TxCntDwn'          Transmitter wakeups disabled until next pendulum clock
       150       TxCRCEnbl          Shift/compute control for transmitter CRC
       151       TxGo               Enable PE
       152       TxData             Serial data input to PE
       153:154   TxSRCtrl[0:1]      Transmitter shift register control
       155       PEOutput           Phase Encoder (PE) output
       156       TxFifoFull         Transmitter Fifo is full
       157       TxFifoEmpty        Transmitter Fifo is empty
ERX1   160:162   RxState[0:2]       State of receiver FSM
       163       RxCollision        Receiver-detected collision
       164       PDCarrier          The Ether is in use
       165:166   PDEvent[0:1]       PD output (no event, collision, 0, and 1)
       167       RxSRFull'          Receiver shift register is full
       170       RxEOP              Marks status word terminating a packet
       171       RxSync'            True for one cycle triggering write of SR into Fifo
       172       RxIncTrans         Receiver incomplete transmission
       173       RxCRCReset         Resets receiver CRC chip
       174       RxCRCClk           Clocks receiver CRC chip
       175       RxData             Serial data output from RxFSM
       176:177   RxSRCtrl[0:1]      Receiver shift register control
be false. Do not issue TestClock in an instruction that changes TestData. Cleared by IOReset.

ReportCollisions   allows the PD to report malformed bits as collisions. Cleared by IOReset.

Status Register

TIOA 16 (octal) also selects the (read-only) status register. The bits in this register are the most interesting to the microcode. Less interesting state is available from the mufflers.

Host Addr    the host address set by pullups on the backplane.
RxOn         the receiver is enabled.
TxOn         the transmitter is enabled.
LoopBack     the interface is looped back.
TxColl       the current output packet was aborted by a collision.
NoWakeups    all wakeups are disabled.
TxDataLate   the current output packet was aborted by a data late.
SingleStep   the 23.53 MHz oscillator is disabled.
TxFifoPE     the current output packet was aborted by a parity error.
Other IO and Event Counters

In addition to the disk, ethernet, and display controllers discussed in earlier chapters, Dorado contains a general input/output interface and a junk task wakeup located on the IFU board; the two registers used in this interface may alternatively be used as event counters in performance monitoring, and that use is also discussed here.

Since the IFU board is not interfaced to the IOB, it cannot use the slow io system to control these features, so functions are used instead.

Junk Task Wakeup

The IFU board contains a circuit which wakes up the junk task (task 1) every 32 µs. The wakeup is dismissed by the AckJunkTW_B function; this function interprets B[15] as follows: a 1 enables wakeups; a 0 disables them; B[0:14] are ignored. The junk task can dismiss the wakeup by doing IFUTest_B with any value on B (but B[15] must be 0 to reenable the wakeup at the next 32 µs tick).

Junk task microcode will, among other things, maintain a Real Time clock.

General IO

A 16-bit register called GenIn (synonym EventCntA) is used for general input; it can be read with the B_GenIn (synonym B_EventCntA) function but cannot be written by firmware. When used for general input, GenIn is written with information that is TTL-to-ECL converted from the backpanel.

A 16-bit register called GenOut (synonym EventCntB) is used for general output; it can be either read with the B_GenOut (synonym B_EventCntB) function or written with the GenOut_B (synonym EventCntB_B) function.
GenOut is connected to the backpanel through ECL-to-TTL converters.

The plan is that devices such as Diablo printers can be connected to the GenIn and/or GenOut signals via backpanel connectors.

The choice of using one of these registers for general io or for event counting is determined by the InsSetOrEvent_B function discussed below.

Event Counters

The GenIn and GenOut registers can alternatively be used as event counters. They cannot, of course, be used simultaneously for general io. The registers are set up for either io or event counting by the InsSetOrEvent_B function, where B[0:15] are interpreted as follows:

If B[0] is 1, then InsSet[0:1] are loaded as discussed in the "Instruction Fetch Unit" chapter.
If B[0] is 0, then B controls the general io/event counters as follows:

   B[4]      enables counting of EventCntA
   B[5]      enables counting of EventCntB
   B[8:10]   select the event type to be counted by EventCntA as follows:
                0     True (i.e., every cycle)
                1     Hold
                2     Processor memory reference (not held)
                3     Good IFUJump (i.e., not held and not an exception)
                4     Miss
                5-7   Backpanel events A, C, and E, respectively
   B[12:14]  select the event type to be counted by EventCntB as follows:
                0     True
                1     Hold
                2     Successful IFU memory reference
                3     IFUJump that wasn't ready
                4     Miss
                5-7   Backpanel events B, C, and D, respectively
   B[15]     causes the event to be counted for all cycles if 1 or only for
             emulator or fault task cycles if 0.

To use the event counters, you first stop them counting and read their current values; then you tell them what to count and start them counting and your system running.
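The sketch below, in Python, shows two things under stated assumptions: how a B value for InsSetOrEvent_B might be assembled for the B[0] = 0 case using Dorado's MSB-first bit numbering, and a behavioral model of the CountHi/CountLo/CountFlag scheme that this section's sample microcode uses to extend a 16-bit counter. All names are hypothetical, and the model ignores the TaskingOff/TaskingOn atomicity that the real read routine needs.

```python
# Hypothetical helpers: not Dorado microcode. MSB-first numbering: B[0] is
# the most significant bit of the 16-bit B value.

def dorado_bit(i: int) -> int:
    """Mask for bit i of a 16-bit word in MSB-first numbering."""
    return 1 << (15 - i)


def event_control_word(count_a: bool, count_b: bool, event_a: int,
                       event_b: int, all_tasks: bool) -> int:
    """B value for InsSetOrEvent_B with B[0] = 0 (counter setup)."""
    if not (0 <= event_a <= 7 and 0 <= event_b <= 7):
        raise ValueError("event selectors are 3-bit fields")
    word = 0                       # B[0] = 0 selects general io/event counters
    if count_a:
        word |= dorado_bit(4)      # B[4] enables EventCntA
    if count_b:
        word |= dorado_bit(5)      # B[5] enables EventCntB
    word |= event_a << 5           # B[8:10]; bit 10 is the field's LSB
    word |= event_b << 1           # B[12:14]; bit 14 is the field's LSB
    if all_tasks:
        word |= dorado_bit(15)     # B[15] = 1: count in all cycles
    return word


class ExtendedCounter:
    """Model of CountHi, CountLo and CountFlag kept in RM."""

    def __init__(self, hw_value: int):
        self.count_hi = 0
        self.count_lo = (-hw_value) & 0xFFFF   # minus the counter's value
        self.count_flag = 0                    # -1 while in 2nd half of cycle

    def _low(self, hw_value: int) -> int:
        return (self.count_lo + hw_value) & 0xFFFF

    def update(self, hw_value: int) -> None:
        """Periodic update, corresponding to the junk task's routine."""
        if self._low(hw_value) & 0x8000:       # 2nd half of the counter cycle
            self.count_flag = -1
        else:                                  # 1st half: bump CountHi if we
            if self.count_flag < 0:            # were in the 2nd half last time
                self.count_hi = (self.count_hi + 1) & 0xFFFF
            self.count_flag = 0

    def read(self, hw_value: int) -> int:
        """Full 32-bit count, as the read routine computes it."""
        low = self._low(hw_value)
        high = self.count_hi
        if self.count_flag < 0 and not (low & 0x8000):
            high = (high + 1) & 0xFFFF         # counter wrapped since update
        return (high << 16) | low


print(hex(event_control_word(True, True, 4, 2, True)))
```

The model is only correct if update() runs at least twice per 16-bit counter cycle (roughly every 1.96 ms when counting every 60 ns cycle), which is the job the text assigns to the junk task.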
Note that they never get reset, but just keep counting from wherever they are; it's up to the user to worry about counter turnover.

The expected mode of operation is that the junk task will detect counter overflow and update double or triple-precision vectors in RM that count events; even if the counter is counting once per 60 ns cycle, counter wraparound only occurs every 3.93 ms, so a double-precision vector could count for at least 255 seconds and triple-precision for about 195 days. Sample microcode for maintaining a double-precision counter is given in the example below:

*The double-precision vector consisting of two RM locations, CountHi and CountLo
*is initialized such that CountHi eq 0 and CountLo contains minus the value in
*the event counter, and another RM location called CountFlag is initialized to 0.
*The microcode below increments CountHi whenever the event counter cycles.
*At any instant, the high part of the total count is in CountHi and the low part
*is CountLo+event counter; CountHi has to be incremented by 1 if the counter
*just overflowed.
    (CountLo) - (EventCntB') - 1;         *CountLo + event counter
    Pd_CountFlag, Branch[.+2,alu>=0];
    CountFlag_T-T-1, Branch[.+3];         *Set CountFlag to -1 in 2nd half of the counter cycle.
    CountFlag_T-T, Branch[.+2,alu>=0];    *Set CountFlag to 0 in 1st half of the counter cycle,
    CountHi_(CountHi)+1;                  *and increment CountHi, if we were in the 2nd half
    . . .                                 *of the counter cycle last time.

The microcode for reading the counter when it is updated like this is as follows:

*Return to caller high part of event count in T, low part in Q.
    TaskingOff, Pd_CountFlag;
    T_(CountLo) - (EventCntB') - 1, Branch[.+3,alu>=0];  *CountLo + event counter = low part of result
    TaskingOn, Branch[.+3,alu<0];         *Low part ovf iff CountFlag<0 and low sum >=0
    T_(CountHi)+1, Q_T, Return;           *High part of result = CountHi+1
    TaskingOn;                            *High part of result = CountHi
    T_CountHi, Q_T, Return;
    . . .

Error Handling

In addition to single-error correction and double-error detection on data from storage, Dorado also generates, stores, and checks parity for a number of internal memories and data paths. The general concepts on handling various kinds of detected failures are as follows:

(1) Failures of the processor or control sections should generally halt Dorado because these sections must be operational before any kind of error analysis or recovery firmware can be effective.

(2) Failures arising from memory and io sections should generally result in a fault task wakeup and be handled by firmware. In some situations, such as map parity errors, it is especially important to report errors this way rather than immediately halting because firmware/software may be able to bypass the hardware affected by the failure and continue normal operation until a convenient time for repair occurs.
In other situations, the firmware may be able to diagnose the failure and leave more information for the hardware maintainers before halting.

(3) IFU section failures and memory section failures detected by the IFU should generally be buffered through to the affected IFUJump, then reported via a trap; in this way, if it is possible to recover from the failure, then it will be possible to restart the IFU at the next opcode and continue.

(4) Memories and data paths involving many parts should generally be parity checked. It is not obvious that this is always a good idea because extra parts in the parity logic will be an additional source of failures, but instantly detecting and localizing a failure seems preferable to continuing computation to an erroneous and undetected result.

(5) When Dorado halts due to a failure, information available on mufflers and in the 16 bits of passively available error status (ESTAT) should localize the cause of the error as precisely as possible.

Since the MECL-10K logic family has a fast 9-input parity ladder component, the hardware uses parity on 8-bit bytes in most places; there is usually insufficient time to compute parity over larger units. IM and MIR, two exceptions, compute parity over the 17 bits of data in each half of an instruction; and the cache address section computes parity over the 15 address bits and WP bit.

Odd parity is used throughout the machine, except that the cache address section and IFUM use even parity. Odd parity means that the number of ones in the data unit, including the parity bit, should be odd, if the data is ok.

The control processor (Midas or the baseboard microcomputer) independently enables various kinds of error-halt conditions by executing a manifold operation discussed in the "Dorado Debugging Interface" document.
It also has to initialize RM, T, the cache address and data sections, the Map, and IFUM to have valid parity before trying to run programs. Reasons for this will be apparent from the discussion below.

When Dorado halts, error indicators in ESTAT indicate the primary reason for the halt, and muffler signals available to the control processor further define the halt condition; ESTAT also shows the halt-enables. Midas will automatically prettyprint a message describing the reasons for an error halt. The exact conditions that cause error halts are detailed in the sections below; the table here shows the ESTAT and muffler information which is relevant.

Table 27: Error-Related Signals

ESTAT      ESTAT       Task
Error      Enable      Experiencing
Bit        Bit         Halt                Related Muffler Signals and Meaning
RAMPE      RAMPEen     Task2Bk             STK, RM, or T parity failure.
                                           RmPerr and TmPerr mufflers on each
                                           processor board indicate which byte
                                           of RM/STK or T had a parity failure.
StkSelSaved indicates that RmPerr appliesto STK rather than RM.MdPEMdPEenprocessor-detected Md parity failureTask2Bkif immediate _Md (_MDSaved false)Task3Bkif deferred _Md (_MDSaved true)MdPerr muffler on each processor boardshows which byte of Md failed.IMrhPEIMrhPEenCTDparity failure of IM[17:33]IMlhPEIMlhPEenCTDparity failure of IM[0:16]IOBPEIOBPEenTask2BkPd_Input parity failure if IOBoutSaved falseTask2BkOutput_B parity failure if IOBoutSaved trueIOPerr mufflers on each processor board showwhich byte failed.MemoryPEMemoryPEencache address section parity failure,cache data parity failure on write ofdirty victim or dirty Flush_ hit, orfast input bus parity failure.Processor ErrorsThe processor has parity ladders on each byte of the following:input to RM/STKgenerate parity for write of RM/STKinput to Tgenerate parity for write of TBgenerate parity for DBuf_B, MapBuf_B, Output_B, IM_BIOBcheck parity for Pd_Input and Output_BMdcheck parity for _MdRcheck parity for _RM/STK (unless bypassed from Pd orMd or replaced by _Id)Tcheck parity for _T (unless bypassed from Pd or Md orreplaced by _Id)Input ladders to RM/STK and T generate parity stored with data in the RAM; these laddersare not used for detecting errors.The processor computes parity on its internal B bus (alub). The generated parity may betransmitted onto IOB when an Output_B function is executed; Store_ references write Bdata and parity in the cache; parity for IM writes and map writes is computed from B parity.None of the other B destinations either check or store B parity. External B sources do not fp#q 2pFMf b[ `S6% ^; \2'NXxsXyUMt yS  (yR (yOu ((Ntutu(M, (Kt utu(JjyH ($G?(!E((D}tu(CyAR (y? 
(y= (t u<\(t u(:tu&(9y7 (%(6o%(5$(3 .s +Ep?:'R:& R:$>R4:"sR&: R:R4R:IR5R~  A B" ; M ;A pJ )=\QDorado Hardware ManualError Handling14 September 1981138generate parity.Parity on the R/T ladders is checked only when the R/T data path is sourced from theRAM, not when bypassing from Md or Pd is occurring, and not when R/T is sourced fromId. A detected failure causes the RAMPE error halt, which indicates that some byte of RM,STK, or T had bad parity. The muffler signals that further describe this error are in thePERR word: StkSelSaved is true if the source for R was STK, false if the source for R wasRM; each processor board has RmPerr and TmPerr signals; RmPerr is true if the RM/STKbyte on that board had bad parity, TmPerr if the T byte had bad parity. Note that if aninstruction beginning at t0 suffered an error, Dorado halts immediately after t4; the mufflersignals apply to the instruction starting at t0. The Task2Bk muffler signals show the taskthat executed the instruction at t0.Md parity is checked whenever _Md is done; a failure causes the MdPE error-halt whenenabled. The _MDSaved muffler signal in PERR is true when a deferred _Md caused theerror (T_Md, RM/STK_Md), false when an immediate _Md (A_Md, B_Md, or ShMdxx)caused the error. On a deferred _Md error, Dorado halts after t6 and Task3Bk shows thetask that executed the instruction starting at t0; on an immediate _Md, Dorado halts aftert4, and Task2Bk shows the task. The MDPerr muffler signals on each processor boardshow which byte of Md was in error.Io devices (optionally) compute and send odd parity with each byte of data; the processorchecks parity when the Pd_Input function is executed, but not when the Pd_InputNoPEfunction is executed. When enabled, an IOBPE error halts the processor at t4 of theinstruction that suffered the error; Task2Bk shows the task that executed the instruction.The processor also checks IOB parity on Output_B, and an error halts at t4 as forPd_Input. 
The IOBoutSaved muffler signal distinguishes Pd_Input from Output_B errors; an IOPerr muffler signal on each processor board shows which byte of IOB was in error; all of these are in the PERR muffler word.

The processor generally does not pass parity at one stage through multiplexing to the next stage, so any failure in the multiplexing between one stage and the next will go undetected (exception: B parity is passed through to IOB).

For example, the processor could write the Md parity sent by the cache into the T RAM when T is being written from Md. Instead, however, it checks Md parity independently and then recomputes the parity written into T with the input ladder. Hence, a parity failure detected on a byte of T can only indicate a failure in one of: (1) the input parity ladder; (2) the output parity flipflop; (3) the output parity ladder; (4) one of three 16x4 T RAM's; (5) one of two 4-bit latches clocked at t1 (Figure 3) through which the output of the T RAM passes; (6) one of two 4-bit latches clocked by preSHC'.

Parity is handled similarly for writes of RM/STK. Parity is similarly recomputed on B.

The processor does not generate or check parity on the A, Mar, or Pd data paths. Any failures of the A, Mar, B, Pd, or shifter multiplexing or of the ALU go undetected; failures of Q, Cnt, RBase, MemBase, ALUFM, or branch conditions also go undetected.

Remark

Since 256x4 and 16x4 RAM's are used for RM, STK, and T, and since the processor is implemented with the high byte (0:7) on ProcH and the low byte (8:15) on ProcL, byte parity requires an additional 4-bit storage element on each board, of which only 1 bit is used. We could conceivably have used all 4 bits to implement a full error-correcting code for each byte of R and T data. However, there is insufficient time to correct the data. (Also, we use 256x1 RAM's instead of 256x4 RAM's for the RM and STK parity bits.)
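The byte-parity convention described in this section (odd parity on 8-bit bytes in most places, even parity in the cache address section and IFUM, and a 9-input ladder covering 8 data bits plus the stored parity bit) can be modeled in a few lines. The Python sketch below models only the logical function, not the MECL-10K ladder hardware; the function names `parity_bit`, `ladder_ok`, and `word_parity` are hypothetical.

```python
from typing import Tuple


def parity_bit(byte: int, odd: bool = True) -> int:
    """Parity bit to store with an 8-bit byte.

    odd=True: ones in the byte plus the parity bit is odd (most of Dorado).
    odd=False: the total is even (cache address section and IFUM).
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must fit in 8 bits")
    even_bit = bin(byte).count("1") & 1  # 1 iff the byte has an odd count of ones
    return even_bit ^ 1 if odd else even_bit


def ladder_ok(byte: int, stored_parity: int, odd: bool = True) -> bool:
    """Model a 9-input ladder: 8 data bits plus the stored parity bit."""
    total = bin(byte & 0xFF).count("1") + (stored_parity & 1)
    return total % 2 == (1 if odd else 0)


def word_parity(word: int, odd: bool = True) -> Tuple[int, int]:
    """Byte parity for a 16-bit word: (high byte 0:7, low byte 8:15)."""
    return parity_bit((word >> 8) & 0xFF, odd), parity_bit(word & 0xFF, odd)
```

Because the ladder only counts ones, flipping any single one of the nine bits is detected, while two flips within the same byte cancel and go undetected.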
Alternatively, parity could be computed over each 4-bit nibble rather than each 8-bit byte; the MC170 component allows nibble parity to be computed just as economically as byte parity. If this were done, a parity failure would be isolated to a particular nibble. With byte parity, a detected failure could be in any of 9+ components; with nibble parity, it would be isolated to one of 6+ components. Implementing nibble parity for RM/STK and T would require about 4 more ic's per board than byte parity.

It is hard to say whether the additional precision of nibble parity would be worth the additional parts.

Control Section Errors

The control section stores parity with each 17-bit half of data in IM. When IM is written, the two byte-parity bits on B are xor'ed with the 17th data bit to compute the odd parity bit written into IM. It is possible to specify that bad (even) parity be written into IM, and this artifice is used to create breakpoints; Midas assumes that bad parity in both halves of an IM word is a deliberately set breakpoint.

IM RAM output is loaded into MIR, and parity ladders on each 17-bit half give rise to error indicators that, when enabled, halt the processor after t2 of the instruction suffering an error. For testing purposes, halt-on-error can be enabled independently for each half of MIR. Both the unbuffered outputs of the MIR parity ladders and values buffered at t2 appear in ESTAT.
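The IM parity rule and the breakpoint convention can be sketched the same way. This Python model assumes that the two byte-parity bits on B are odd-parity bits, as elsewhere in the processor. Under that assumption, xor'ing them with the 17th bit gives the xor of all 17 data bits, which is an even-parity bit, so the sketch inverts it to obtain odd parity; the text does not say where the hardware performs that inversion. The function names are hypothetical.

```python
def odd_parity_bit(value: int, width: int) -> int:
    """Bit that makes the ones in the low `width` bits of value, plus itself, odd."""
    ones = bin(value & ((1 << width) - 1)).count("1")
    return 1 - (ones & 1)


def im_half_parity(b_word: int, bit16: int) -> int:
    """Odd parity bit for one 17-bit IM half: 16 bits from B plus a 17th bit.

    Assumes B supplies odd byte-parity bits p_high and p_low. Their xor with
    bit16 equals the xor of all 17 data bits (the even-parity bit), so this
    model inverts it to obtain odd parity.
    """
    p_high = odd_parity_bit(b_word >> 8, 8)
    p_low = odd_parity_bit(b_word, 8)
    return 1 ^ p_high ^ p_low ^ (bit16 & 1)


def half_ok(data17: int, parity: int) -> bool:
    """MIR-style check of one half: 17 data bits plus parity must have odd ones."""
    return (bin(data17 & 0x1FFFF).count("1") + (parity & 1)) % 2 == 1


def classify_im_word(left_ok: bool, right_ok: bool) -> str:
    """Bad parity in both halves is taken (by Midas) to be a breakpoint."""
    if left_ok and right_ok:
        return "ok"
    if not left_ok and not right_ok:
        return "breakpoint"
    return "parity error"
```

Storing `im_half_parity(...) ^ 1` in both halves of a word models the deliberately bad parity that Midas interprets as a breakpoint.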
The values buffered at t2 show the cause of an error halt, and the unbuffered ladder outputs allow Midas to detect parity errors in MIR before executing instructions or when displaying the contents of IM.

The special MIRDebug feature discussed in the "Dorado Debugging Interface" document prevents MIR from being loaded at t2 when MIR parity is bad. In other words, when the MIRDebug feature is being used, all of the t2 clocks in the machine will occur except the ones to MIR. This feature prevents the instruction that suffered an error from being overwritten, at the expense of being unable to continue execution after the error. MIRDebug can be enabled or disabled by the control processor.

IFU Errors

The IFU never halts the processor; any errors it detects are buffered until an IFUJump transfers control to a trap location. The errors it detects, discussed in "IFU Section", are parity failures on bytes from the cache, IFUM parity failures, and map parity failures on IFU fetches.

Memory System Errors

There is no parity checking on Mar or on data in BR, so any failure in the address computation for a reference goes undetected. However, valid parity is stored with VA in the cache, and any failure detected will cause the MemoryPE error, halting the system (if MemoryPE is enabled).

Parity is also stored in the Map (computed from B parity), and an error causes a fault task wakeup in most situations. (Exceptions: IFU references and Map_ references do not wake up the fault task when a map parity error occurs.)

The cache data section stores valid parity with each byte of data. When a munch is loaded from storage, the error corrector carries out single-error correction and double-error detection using the syndrome and recomputes parity on each 8-bit byte of data stored in the cache.
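The error corrector's job, single-error correction and double-error detection using a syndrome, can be illustrated with a textbook extended Hamming code over one quadword: 64 data bits protected by 8 check bits, the same ratio the Dorado EC uses. This Python sketch is a generic construction, not the Dorado's actual check-bit assignment, which this section does not give; the bit layout and the names `encode` and `decode` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# Assumed layout: position 0 holds overall parity, positions that are powers
# of two hold the 7 Hamming check bits, the other 64 positions in 1..71 hold data.
CODE_BITS = 72
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [pos for pos in range(1, CODE_BITS) if pos & (pos - 1)]


def encode(data: int) -> List[int]:
    """Encode 64 data bits as a 72-bit SEC-DED codeword (list of 0/1)."""
    bits = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    for check in CHECK_POSITIONS:
        # Each check bit makes the parity even over all positions whose
        # index has that bit set.
        bits[check] = sum(bits[pos] for pos in range(1, CODE_BITS)
                          if pos & check and pos != check) & 1
    bits[0] = sum(bits[1:]) & 1  # overall parity: the whole word is even
    return bits


def decode(bits: List[int]) -> Tuple[str, Optional[int]]:
    """Correct a single error or detect a double error.

    Returns (status, data); data is None when the error is uncorrectable.
    """
    syndrome = 0
    for pos in range(1, CODE_BITS):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) & 1
    fixed = list(bits)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < CODE_BITS:
        # Odd number of flips: treat as one error at position `syndrome`
        # (syndrome 0 means the overall-parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome, e.g. a double error.
        return "uncorrectable", None
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= fixed[pos] << i
    return status, data
```

In this code the syndrome names the position of a single flipped bit directly, and the extra overall-parity bit distinguishes one flip (odd) from two (even).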
When a word from B is Store_'d in the cache, byte parity on B is stored with the data.

A MemoryPE error occurs if, when storing a dirty victim back into storage, the memory system detects bad parity on data from the cache.

The IFU and processor also check the parity of data from the cache, as discussed previously.

Sources of Failures

In a full 4-module storage configuration, Dorado will have 1173 MOS storage, about 700 Schottky-TTL, 3000 MECL-10K, and 60 MECL-3 DIPs, and about 1500 SIPs (7-resistor packages). This logic is connected with over 100,000 stitch-welded or multiwire connections to sockets into which the parts plug; logic boards connect to sidepanels through about 2500 edge pins. Sockets are used for all the RAM DIPs in the machine; other parts are soldered in. Given all these potential sources of failure, reliable operation has been a surprising achievement.

Initial debugging of new machines has been slow and difficult, requiring expertise not easily available in a production environment. In addition to mechanical assembly, board stuffing, and testing for shorts and opens both before and after stuffing, each machine has averaged about one man-month of expert technician time to repair other malfunctions before it could be released to users.

Once released, the Dorados have been pretty reliable. During a 100-day period (6 October 1980 to 14 January 1981) the CSL technicians kept records of service calls made for approximately 15 Dorados in service at that time.
The following summarizes the 43 service calls that were made:

  37 days mean time between service calls per machine.
  45 days mean time between failures (some service calls were for microcode or software problems).
  2.5 hours per machine per month average service time.
  13% of failures and 5% of time reseating logic boards in the chassis (connectors not making contact).
  11% of failures and 17% of time on open nets.
  13% of failures and 12% of time repairing 16k MOS RAM failures (the standard configuration was 2 modules).
  37% of failures and 28% of time replacing other DIPs and SIPs.
  5% of failures and 10% of time on T80 problems.

of future failures will do this.

Some interesting questions are: How does MTBF vary with the EC arrangement? MTBF is pertinent if we let Dorados run until they fail. Alternatively, how likely is a failure in the next day, week, or month, if we test the memory that often and replace bad RAMs? These questions can be asked assuming perfect testing (no failures at t = 0) or imperfect testing (some likelihood of failures at t = 0 because diagnostics didn't find them).

To answer them, MOS RAM failures are modelled as one of two types: those affecting a single address in the RAM (called SF's), and those affecting all addresses (called TF's). We assume that TF's occur about 1/4 as often as SF's in 4Kx1 RAM's. RAM failures are assumed to be exponentially distributed, which is correct if the failure rate doesn't change with time; over the time range of interest, this is reasonable. Finally, perfect testing is assumed, so there are 0 failures at t = 0.
These assumptions give rise to the following:

  let $p = \Pr[\text{an ic has a TF}] = 1 - e^{-at}$
  let $q = \Pr[\text{an ic has an SF}] = 1 - e^{-bt}$
  let $n$ = number of MOS RAMs in the memory

Without error correction, MTBF is the integral from 0 to infinity of $[(1-p)(1-q)]^n$:

$$\text{MTBF} = \int_0^\infty \left[(1-p)(1-q)\right]^n dt = \int_0^\infty e^{-n(a+b)t}\,dt = \frac{1}{n(a+b)}$$

With $b = 4a$, in our 4-module system with $n = 1024$, this is $1/5120a \approx .000195/a$.

With error correction, failure occurs when, in a single EC unit, a TF coincides with either another TF or an SF. This ignores two coinciding SF's, which is about 4000 (16k RAMs) or 16000 (64k RAMs) times less likely.

  let $n$ = number of RAMs in an error correction unit
  then Prob[no failure] = Prob[no TF] + Prob[1 TF and 0 SF]

$$\Pr[\text{no TF}] = (1-p)^n$$

Since failure modes are independent,

$$\Pr[\text{1 TF and 0 SF}] = np\left[(1-p)(1-q)\right]^{n-1}$$

$$P_{ok} = \Pr[\text{no failure}] = (1-p)^n + np\left[(1-p)(1-q)\right]^{n-1} = e^{-nat} + n\left(1-e^{-at}\right)e^{-(a+b)(n-1)t}$$

This is the probability for a single EC unit, so the mean time to failure for all MOS storage is the integral of $P_{ok}$ raised to a power equal to the number of EC units. In other words, the argument of the integral for a 4-module x 4 quadwords/module system is $P_{ok}^{16}$ with $n = 64+8$; it is $P_{ok}^{4}$ with $n = 256+10$ for a one-munch EC unit.

Then the expected time to failure for our $16 \times (n = 64+8)$ memory system is about:

$$\frac{1}{n}\left(\frac{1}{16a} + \frac{16a}{(16a+b)^2} + \frac{240a^2}{(16a+2b)^3} + \frac{3360a^3}{(16a+3b)^4}\right)$$
$$= \frac{1}{an}\left(\frac{1}{16} + \frac{1}{25} + \frac{5}{288} + \frac{105}{19208}\right)$$
$$= \frac{1}{16an}\left(1 + .64 + .28 + .087\right) \approx \frac{2.0}{16an}$$
$$= \frac{2.0}{16 \cdot 72\,a} \approx \frac{.00174}{a}$$

In other words, mean time to failure is about 2.0 times longer than the time to the first TF, or about 8.9 times better than with no error correction; the corrected memory fails about as often as 1024/8.9 = 115 uncorrected storage ic's.

The results don't change much when imperfect testing is assumed. The effect of this is to replace the distributions for p and q by the form $1 - Ae^{-at}$, where A would be .999 if there were a 1/1000 chance of a MOS ic being bad at t = 0.

Remarks

On each storage board, data from MemD is transported to a shift register consisting of 8 flipflops, which are then written into the MOS RAM's after transport has been completed. This arrangement is unfortunate: any failure in one of these components will cause a multiple error, and there are about 250 of these parts in a full storage configuration.

One way to eliminate this problem while simultaneously reducing the part count on each storage board would be to make modules consist of four storage boards, rather than two, so that only four flipflops receive data on each bit path during transport; since each of these is in a different quadword, single failures would not cause multiple errors.

The Dorado EC operates on quadwords, requiring 8 check bits/64 data bits, or a 12.5% storage penalty. Alternative schemes are: 10 check bits/256 data bits (3.9%); 9 check bits/128 data bits (7.0%); 7 check bits/32 data bits (22%); and no error correction at all (0%).

The implementation of the EC pipeline is such that wider correction units significantly increase the time for a miss. The current quadword error corrector requires 7 clocks (3 clocks for setup and correction, 1 clock per word of the quadword); this would become 11 clocks with a 128-bit EC scheme or 19 clocks with a 256-bit EC scheme. Although the cache hit rate seems to be above 99%, some implementation avoiding this delay would still be needed to make larger correction units attractive.

If our quadword correction unit were replaced by a $4 \times (n = 256+10)$ scheme, the corresponding expression is

$$\frac{1}{4na} + \frac{4a}{n(4a+b)^2} + \frac{3a^2}{2n(2a+b)^3}$$

which for $b = 4a$ is

$$\frac{1}{4na}\left(1 + \frac{1}{4} + \frac{1}{36}\right) = \frac{1.28}{4na} = \frac{.0012}{a}$$

In other words, MTBF is about 1.28 times longer than the time to the first TF. So error correction has increased MTBF by a factor of 6.2 over no error correction; alternatively, a 1064-RAM corrected memory fails as frequently as a 1064/6.2 = 172 RAM uncorrected memory.

Surprisingly, the 64+8 EC scheme has only about 45% longer MTBF than a 256+10 EC scheme. This improvement may not be worth the 96 additional MOS RAMs and 80 other DIPs required for address buffering; the 80 additional DIPs might cause more failures than they save, making the change a net loss.

The other method of maintaining our systems is to regularly test storage and replace bad RAMs. Then the likelihood of no double error before replacement is simply the value of the probability distribution ($P_{ok}^4$ and $P_{ok}^{16}$ above) at the selected instant. Using $b = 4a$ and $n - 1 \approx n$, this reduces to an approximation of the form

$$P_{ok} \approx \left[e^{-x} + x\,e^{-5x}\right]^m$$

where $x = nat$, m is 4 or 16, and n = 266 for m = 4 or 72 for m = 16. If this is evaluated at t = 1/mna, 1/2mna, 1/4mna, etc., the following results are obtained:

Table 28: Double Error Incidence vs. Repair Rate

   m    t = 1/mna   1/2mna   1/4mna   1/8mna
   4       .52        .81      .94      .98
  16       .79        .84      .98      .99

The interpretation of this table is as follows: measure the mean time to total failure (TF) of a MOS RAM and call this time 1/a; then assume 4 SF's per TF. Then the mean time between TF's in storage will be 1/mna. So the table shows the probability that the Dorado hasn't suffered a double error when it is tested and fixed at intervals of 1, 1/2, 1/4, or 1/8 times the mean time between TF's (that is, as often as TF's occur on average, or two, four, or eight times as often).

Performance Issues

This chapter discusses two issues:

(1) How rapidly will Dorado be able to execute Mesa, Lisp, SmallTalk, etc. macroprograms?
(2) What relationship do some of the design parameters bear to performance?

Cycle Time

The first issue is cycle time.
Dorado was designed for a 50 ns cycle time. The first three prototypes used stitchweld technology for interconnections and operated correctly at a 55 ns cycle time; however, subsequent machines are being built using multiwire technology and will not operate faster than about a 60 ns cycle time. At present the baseboard initializes the clock period to 64 ns for all machines during a boot, although there is some indication that recent design changes and the repair of a few lingering slow-path problems would permit operation 5 to 10 ns faster.

With respect to achievable cycle time, the two important differences between stitchweld and multiwire technology are that stitchweld uses point-to-point wiring and has a wire impedance of about 100 ohms (which is ideal), whereas multiwire uses Manhattan (square-corner) wiring with a wire impedance of about 50 ohms on the inner layer and 70 ohms on the outer layer (most signals are in the outer layer). Longer wires and imperfect impedance matching result in slower speed.

Emulator Performance

Gene McDaniel's measurements of the Alto Mesa compiler have been adjusted to make them compatible with Pilot Mesa and are summarized below. It must be pointed out that the compiler makes heavier use of short pointers than do Pilot Mesa programs; programs being developed now are heavily biased toward long pointers and would be slower than the execution rate below indicates. The average execution rate was about 5.6 cycles/opcode, excluding disk wait. About 38% of all cycles are consumed by XFER opcodes (i.e., subroutine call or return), which account for about 6% of opcodes executed. If these are excluded, the remaining 94% average about 3.1 cycles/opcode; if jumps and conditional jumps are also excluded (about 14% of executions), the others average 2.5 cycles/opcode. These times include all memory and IFU delays.

These excellent results indicate that there are no unusual delays due to problems with the memory or IFU and that the processor is completing most opcodes quickly. Since XFER opcodes take 34 (local) to 54 (external) cycles/opcode excluding memory delays, speeding up, respecifying, or reducing executions of XFER seem to be the most promising ways of improving performance.

In the above results, instruction forwarding has saved an average of about .25 cycles/opcode, or about 4% overall, in agreement with our expectations.

For SmallTalk and Lisp instruction sets, performance is much worse than for Mesa (averaging over 30 cycles/opcode on Smalltalk 76). Careful studies should be made to understand the reasons for this fully, but one reason is that the 16-bit word size is a serious limitation. Long storage pointers are used extensively, so execution would be substantially faster on a machine with, say, 32-bit data paths.

IFU Not-Ready Wait

For the Mesa compiler, 19.5% of all cycles were in IFU not-ready wait: 16% due to incorrectly predicted jumps, 2.5% to cache miss wait, and 1% to other causes. The 16% due to incorrectly predicted jumps might be reduced.

The Mesa microcode presently predicts that all conditional jumps will not jump; because of the overhead of restarting the IFU an extra time, it is desirable to predict not-jump unless more than 75% of executions jump.
40% of the time the prediction is wrong and a jump occurs, so it seems that the microcode is doing the best it can.

However, some loops ("while J ne 0 do," for example) are compiled as a normally-false conditional jump at the beginning of the loop and an unconditional jump from the end of the loop back to the beginning; a faster sequence is a normally-true conditional jump at the end of the loop, eliminating the unconditional jump altogether. The general objectives in changing the compiler would be as follows: (1) eliminate unnecessary jumps and conditional jumps; (2) make the jump/not-jump execution of conditional jumps as predictable as possible; and (3) make the not-jump path the most likely, unless this conflicts with objective (1).

Microstore Requirements

Speed is not the only issue: some reduction in microstore requirements might be possible through design changes. Space requirements for a 1981 release of the Alto/Mesa emulator system were as follows:

Table 29: Utilization of the Microstore

  Component                      Words (octal)
  Mesa basic opcode set              2024
  Cedar allocator & collector         576
  Floating point                      457
  Alto opcode set                    1163
  Alto BCPL Runtime                   226
  BitBlt subroutine                   416
  Fault handling                       65
  Ethernet driver                     255
  Disk driver                         430
  Display driver                      500
  Junk io driver                       76
  LoadRam                             100
  Initialization                      150
  Total                              7673

leaving 105 (octal) free locations.

Since we do not require that more than two emulators be loaded in the microstore at one time, there is presently a little space left for extensions. MicroD is able to utilize well over 99% of the available microstore.

The third performance issue is cache efficiency and miss wait; the fourth is available io bandwidth and io task cycle consumption.
These are discussed in the sections below.

Cache Efficiency and Miss Wait

The value of shortening the wait for a storage read is roughly proportional to miss likelihood. Suppose that the prototypical opcode was a one-byte opcode implemented by the following microcode:

    Fetch_Id, StkP+1;
    Stack_Md, IFUJump[0];

For this example, execution time on a hit is 2 cycles; on a miss, 28 cycles. Delay for IFU misses must be added to this. Since the IFU is 6 bytes ahead of the current opcode, its misses delay 28 cycles less the execution time for the preceding 6 bytes; if any of those 6 bytes itself causes a miss, IFU delay will be 0 because the IFU will catch up; the IFU never gets two misses (in this example) because it crosses at most one munch boundary. Hence, execution time will be $2 + 26(1-H) + (28-12)H^6(1-H)$, with the following results:

Table 30: Execution Time vs. Cache Efficiency

  Hit %   Execution Cycles   IFU Wait Cycles   % Miss Wait
   100          2.00               .00              0
    99          2.26               .15             17
    98          2.52               .28             29
    96          3.04               .50             44
    94          3.56               .67             53
    92          4.08               .79             59

This crude analysis shows the importance of cache efficiency in determining system performance. Fortunately, measurements made by Doug Clark and Gene McDaniel indicated the following surprisingly high cache hit statistics:

  Overall cache hit rate on three Mesa programs was 99.2% to 99.8%.
  4.9% to 8.1% of all cycles were held.
  10% to 19% of references were Store_'s, the rest fetches.
  16% to 66% of misses had dirty victims, which cause additional cycles to be held while the cache address section is busy.
  Another measurement showed a 99.7% hit rate for IFU references.
  The processor obtains a word from the cache in 16% of all cycles and the IFU in 32% of all cycles; the processor actually shuts out the IFU by making its own reference about 20% of the time.

Provision has been made to expand the Dorado cache to 16k words when 4k x 1 MECL RAM's are economically available, but the existing cache is so efficient that this may never be necessary.

Performance Degradation Due to IO Tasks

To a first approximation, only the display controller word task (DWT) uses enough storage bandwidth to interfere significantly with emulators. Since it uses the fast io system, DWT requires service once per munch and will require two instructions per wakeup in the ordinary case. In addition, if the next instruction (by another task) issues a memory reference, it will always be held one cycle while the DWT's IOFetch_ advances ASRN.

A quick calculation shows that at an io bandwidth of 256 x 10^6 bits/sec (10^6 munches/sec) the display controller will use 48% of storage bandwidth and 12% of processor cycles at 60 ns/cycle.

The earlier example showed that with no io interference and a 99% hit rate, the emulator spent 17% of cycles in miss wait and 83% in useful execution.
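The arithmetic behind Table 30 and the display-controller estimate can be reproduced with a short Python sketch. It uses the same idealized assumptions as the text: a 2-cycle one-byte opcode, a 28-cycle miss, an IFU running 6 bytes ahead, 16-word munches of 16-bit words, a storage bandwidth of about 533 x 10^6 bits/sec (the figure given for 16-word munches), two display-task instructions per munch, and a 60 ns cycle. It further assumes that every cycle beyond the 2 useful ones counts as miss wait. The function names are hypothetical.

```python
from typing import Tuple


def opcode_time(hit: float, hit_cycles: int = 2, miss_cycles: int = 28,
                ifu_lead_bytes: int = 6) -> Tuple[float, float, float]:
    """Return (execution cycles, IFU wait cycles, % of cycles in miss wait)."""
    miss = 1.0 - hit
    # The opcode's own fetch: 2 cycles on a hit, 28 on a miss.
    execution = hit_cycles + (miss_cycles - hit_cycles) * miss
    # An IFU miss overlaps execution of the 6 preceding one-byte opcodes
    # (6 * 2 = 12 cycles) and only costs anything if none of them missed.
    overlap = ifu_lead_bytes * hit_cycles
    ifu_wait = (miss_cycles - overlap) * hit ** ifu_lead_bytes * miss
    total = execution + ifu_wait
    miss_wait_pct = 100.0 * (total - hit_cycles) / total
    return execution, ifu_wait, miss_wait_pct


def display_load(display_bits_per_sec: float = 256e6,
                 munch_bits: int = 16 * 16,
                 storage_bits_per_sec: float = 533e6,
                 instructions_per_munch: int = 2,
                 cycle_ns: float = 60.0) -> Tuple[float, float, float]:
    """Return (munches/sec, share of storage bandwidth, share of cycles)."""
    munches_per_sec = display_bits_per_sec / munch_bits
    storage_share = display_bits_per_sec / storage_bits_per_sec
    cycle_share = munches_per_sec * instructions_per_munch * cycle_ns * 1e-9
    return munches_per_sec, storage_share, cycle_share


if __name__ == "__main__":
    for hit in (1.00, 0.99, 0.98, 0.96, 0.94, 0.92):
        e, w, pct = opcode_time(hit)
        print(f"{hit:4.0%}  {e:5.2f}  {w:4.2f}  {pct:3.0f}")
    print(display_load())
```

The printed table rounds these values; the sketch's IFU-wait figures agree with it to within a few hundredths of a cycle.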
With a 256 x 10^6 bit/sec display active, emulator misses are slowed about 2 cycles each, so the overall effect of the display would be that about 78% of all cycles are emulator executions, 12% display task executions, and 16% hold; the one-cycle holds for IOFetch_ would make performance somewhat worse than this.

An IOFetch_ by the display task to the same cache row as an emulator miss will remain in the address section, increasing display task latency and requiring more buffering. However, this won't degrade emulator performance.

The Alto monitor only uses 14.7 x 10^6 bits/sec (1/17 of the above) and would not interfere appreciably with emulators.

The disk controller is the fastest "slow" io device among standard peripherals. When running, its word interrupt task reads a double word from the cache every 3.2 μs in a 3-instruction/interrupt inner loop, consuming about 5.6% of all cycles at 60 ns/cycle. Its memory references consume the cache at a rate of .04 munches/μs, low enough that storage interference with the emulator isn't significant. However, a 256-word disk transfer displaces about 1/16 of the cache entries, so the emulator may experience a lower hit rate.

Cache and Storage Geometry

The current geometry was chosen without measurements or simulation of programs, but measurements made since then have indicated surprisingly good cache performance, so not much could be gained through changes.

The following parameters are relevant:

  1 word as the unit of storage inside the memory pipeline;
  16-word munch;
  256 munches in the cache (expandable to 1024);
  4 columns in the cache.

Munch Size

A 16-word munch size was chosen primarily because 8 cycles for transport balances 10 cycles for storage access, avoiding loss of bandwidth.
The use of 256x4 RAM's to implement the cache address section allows the original 4k-word cache (implemented with 1kx1 RAM's) to be expanded to 8k words or 16k words when 4kx1 RAM's are economically available; this is possible because only 64 of the 256 words in the address section are being used with the 4k-word cache. Miss wait is about 28 cycles and storage bandwidth about 533 x 10^6 bits/sec with 16-word munches.

8-word munches would lower the storage bandwidth to about 262 x 10^6 bits/sec, probably unacceptable. Also, 8-word munches would limit cache expansion to 8k words. However, miss wait would be reduced to about 24 cycles because transport would require only 4 cycles. 32-word munches would not allow greater storage bandwidth to fast io devices, because bandwidth is already limited by transport with 16-word munches. Nor would they allow expansion to a larger cache data section, because we have no way to build a data section larger than 16k words. Also, miss wait would increase to 36 cycles, so this munch size does not seem attractive.

For a given size of the cache data section, with smaller munches the cache will tend to stabilize with a larger amount of useful information; however, when a program is changing contexts, larger munches might bring the new context into the cache more quickly. Also, fast io tasks will interfere less with the emulator with larger munches because fewer wakeups and IOFetch_'es will be required. However, the extra buffering and longer miss wait offset this advantage somewhat.

Considered together, these factors suggest that the 16-word munch we are using is substantially better than either 8- or 32-word munches.

Data Path Width

Having only 16-bit wide data paths slows misses. Doubling the paths to 32 bits would reduce EC time by 1 cycle and transport time into the cache by 4 cycles (i.e., delay on misses would be 23 cycles instead of 28).
There were not enough edge pins to do this. However, if a method of doubling the path width were found, the storage system would probably be arranged as two modules of four storage boards each rather than four modules of two boards each, and 32-word munches might be better than 16-word munches.

Cache Columns

The reason for multiple columns is to approximate LRU reloading; the columns are moderately expensive because separate hit logic has to be provided for each one; the V-NV stuff also costs a few ic's with more than two columns. Altogether the current 64x4 cache is about 40 ic's larger than a 128x2 cache. (Because of its 50-50 LRU behavior on the fourth column, our cache is somewhere between the 64x4 and the 128x2 or 128x3 caches below.) The table below shows the likelihood that the Nth LRU munch is no longer in the cache for various geometries (rows x columns):

Table 31: Cache Geometry vs. LRU Behavior

     N   32x4   64x2   128x2   32x3   64x3   128x3   64x4   128x4
     4   .000   .001   .000    .000   .000   .000    .000   .000
     8   .000   .006   .002    .002   .000   .000    .000   .000
    16   .001   .025   .007    .013   .002   .000    .000   .000
    32   .017   .089   .026    .077   .014   .002    .002   .000
    64   .140   .264   .090    .323   .079   .014    .018   .002
   128   .570   .596   .264    .767   .323   .080    .141   .019
   256   .960   .910   .595    .987   .764   .323    .568   .142
   512                                        .763    .959   .567

These numbers are computed from a binomial distribution using the following formulae:

  let R = rows in cache
  let C = columns in cache
  then q = 1/R = probability that another munch maps to the same row as a given munch
  and p = (R-1)/R = 1 - q = probability that it maps to a different row
  then the probability of a miss for the nth element is:

$$C = 1:\quad P(\text{miss}) = 1 - p^n$$
$$C = 2:\quad P(\text{miss}) = 1 - p^n - nqp^{n-1}$$
$$C = 3:\quad P(\text{miss}) = 1 - p^n - nqp^{n-1} - \frac{n(n-1)}{2!}q^2p^{n-2}$$
$$C = 4:\quad P(\text{miss}) = 1 - p^n - nqp^{n-1} - \frac{n(n-1)}{2!}q^2p^{n-2} - \frac{n(n-1)(n-2)}{3!}q^3p^{n-3}$$

and so on.

Without extensive measurements on programs, it is impossible to know how much better, say, a 32x4 cache is than a 64x2 cache, or whether a 128x2 cache is better or worse than a 32x4 cache.
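The binomial formulae for Table 31 can be evaluated directly. The Python sketch below computes, for an R-row, C-column cache with ideal LRU replacement within each row, the probability that at least C of the n more recently used munches share a given munch's row; `p_miss` and `GEOMETRIES` are hypothetical names, and the sum is the exact binomial one rather than an approximation.

```python
from math import comb


def p_miss(rows: int, columns: int, n: int) -> float:
    """Probability that the nth LRU munch has left an ideal-LRU cache.

    q = 1/R is the chance that another munch maps to the same row, p = 1 - q.
    The munch survives while fewer than C of the n more recently used munches
    share its row, so P(miss) = 1 - sum_{k<C} comb(n, k) q^k p^(n-k).
    """
    q = 1.0 / rows
    p = 1.0 - q
    survive = sum(comb(n, k) * q ** k * p ** (n - k) for k in range(columns))
    return 1.0 - survive


# (rows, columns) in the order of the Table 31 headings.
GEOMETRIES = [(32, 4), (64, 2), (128, 2), (32, 3),
              (64, 3), (128, 3), (64, 4), (128, 4)]

if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64, 128, 256, 512):
        cells = "  ".join(f"{p_miss(r, c, n):.3f}" for r, c in GEOMETRIES)
        print(f"{n:4d}  {cells}")
```

The loop prints one row per N in the column order of Table 31. Because the real cache's fourth column uses 50-50 replacement rather than true LRU, the 64x4 figures from this sketch are an idealization.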
If a particular program is confining itself to a very small set of munches, then more closely approximating LRU reloading is most important. However, if the likelihood of reference flattens out after a small N, then it won't matter much that LRU reloading isn't very well approximated; the total size of the cache will be a more important determinant of performance.

Glossary

a - the first 8-bit operand of a two-byte or longer opcode.

b - the second 8-bit operand of a three-byte or longer opcode.

bypassing - a number of memories and task-specific registers in Dorado (RM, STK, and T, for example) are written with data that might be needed before the write occurs. These are implemented so that data about to be written is substituted for data read from the register or memory when appropriate. This substitution is called bypassing and enables Dorado to run considerably faster than would otherwise be possible.

cache entry - a munch together with the VA of the munch and 4 flag bits. For a 64-row x 4-column cache, VA[28:31] select the word in the munch, VA[22:27] address the row, and VA[7:21] are stored in the cache entry.

column - one of 4 groups of 64 (expandable to 256) cache entries. The cache column in which a word with VA resides is determined by comparing VA[7:21] with the corresponding bits stored in the four columns at row VA[22:27]. Thus a memory word may occupy any one of 4 locations in the cache.

control processor - the microcomputer on Dorado's baseboard, or the Midas program operating Dorado from an Alto.

dirty - a cache entry is dirty if the information in it differs from the information in storage because a store has been done into the cache and storage has not yet been updated.
A page is dirty if a store has been done into the page since its map dirty bit was cleared.

emulator - the lowest priority task, number 0, which is always awake. The emulator is distinguished by the fact that it cannot block, can use Stk, and has a private pipe entry. The emulator task primarily implements instruction sets.

entry vector - the exit microinstruction of an opcode sends control to the first microinstruction of the next opcode by means of IFUJump[n] (n = 0 to 3), where n chooses one of 4 entry microinstructions for the next opcode; these four microinstructions are the next opcode's entry vector.

fault task - the highest priority task, number 15, woken whenever a memory fault or stack error occurs.

hit - a reference that finds the desired word in the cache.

Midas - the Alto program used for loading and debugging Dorado remotely.

miss - a reference that does not find the desired word in the cache.

module - the unit in which storage is packaged, either 64K, 256K, or 1M words. A machine may have 1 to 4 modules.

tag - the extra bit in Md readout, which complements for successive Fetch_'es and Store_'s by the same task. Agreement of the bit in Md with the current value indicates that the reference has finished.

task - one of the 16 priority-scheduled tasks. Special tasks are the emulator (task 0, lowest priority) and the fault task (task 15, highest priority). Other tasks are paired with io controllers.

VA - virtual address.

Vacant - a cache entry or map entry which does not contain valid data.

Victim (Vic) memory - stores 4 bits for each cache row.
Two of the bits specify the victim, which will be chosen if a reference to that row results in a miss, and the other two specify the next victim.

victim - on a processor reference that causes a cache miss, the cache entry chosen to be replaced by the referenced data.

WP - write protected. Map entries and cache entries have bits with this name.

DoradoManual-B.press  Fiala  17-Sep-81 9:34:26 PDT
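The cache entry and column definitions above amount to splitting the VA into fields and then comparing one field against four stored tags. The Python sketch below assumes that VA bits are numbered from 0 at the most significant end of a 32-bit word (the same convention as the 16-bit PC, whose bit 15 selects the byte), a 64-row x 4-column cache with 16-word munches, and a hypothetical tags[row][column] array holding the VA[7:21] bits of each entry. Flag bits, vacancy beyond a None marker, and victim selection are left out.

```python
from __future__ import annotations


def field(value: int, first: int, last: int, width: int = 32) -> int:
    """Extract bits first..last of value, numbering bit 0 as the MSB."""
    length = last - first + 1
    shift = width - 1 - last
    return (value >> shift) & ((1 << length) - 1)


def split_va(va: int) -> tuple[int, int, int]:
    """Split a VA into (tag, row, word) for a 64-row x 4-column cache."""
    tag = field(va, 7, 21)    # VA[7:21]: stored in the cache entry
    row = field(va, 22, 27)   # VA[22:27]: selects one of 64 rows
    word = field(va, 28, 31)  # VA[28:31]: word within the 16-word munch
    return tag, row, word


def lookup(tags: list[list[int | None]], va: int) -> int | None:
    """Return the column holding VA's munch, or None on a miss.

    tags[row][column] is a hypothetical array of the VA[7:21] bits stored
    in each cache entry (None marks a vacant entry).
    """
    tag, row, _ = split_va(va)
    # The hardware compares all four columns at once; here we loop.
    for column, stored in enumerate(tags[row]):
        if stored == tag:
            return column
    return None


if __name__ == "__main__":
    tags = [[None] * 4 for _ in range(64)]
    va = (0x1234 << 10) | (5 << 4) | 9   # tag 0x1234, row 5, word 9
    tags[5][2] = 0x1234
    print(split_va(va), lookup(tags, va))
```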