A Processor for a High-Performance Personal Computerby Butler W. Lampson and Kenneth A. PierJanuary 1981ABSTRACTThis paper describes the design goals, microarchitecture, and implementation of themicroprogrammed processor for a compact high performance personal computer. Thismachine supports a range of high level language environments and high bandwidth I/Odevices. It also has a cache, a memory map, main storage, and an instruction fetch unit;these are described in other papers. The processor can be shared among 16 microcodedtasks, performing microcode context switches on demand with essentially no overhead.Conditional branches are done without any lookahead or delay. Microinstructions are fairlytightly encoded, and use an interesting variant on control field sharing. The processorimplements a large number of internal registers, hardware stacks, a cyclic shifter/masker,and an arithmetic-logic unit, together with external data paths for instruction fetching,memory interface, and I/O, in a compact, pipelined organization.The machine has a 60 ns microcycle, and can execute a simple macroinstruction in onecycle; the I/O bandwidth is 530 megabits/sec. The entire machine, including disk, displayand network interfaces, is implemented with approximately 3000 MSI components, mostly ECL10K; the processor is about 35% of this. In addition there are up to 4 storage modules, eachwith about 300 16K or 64K RAMs and 200 MSI components, for a maximum of 8 megabytes.The total volume, including power and cooling, is about .14 m3 (4.5 ft3). A number ofmachines are currently running.A version of this paper appeared in Proc. 7th Symposium on Computer Architecture,SigArch/IEEE, La Baule, May 1980, 146-160.CR CATEGORIES6.34, 6.21KEY WORDS AND PHRASESarchitecture, controller, emulation, input/output, microprogram, pipeline, processor.c Copyright 1981 by Xerox Corporation.XEROXPALO ALTO RESEARCH CENTER3333 Coyote Hill Road / Palo Alto / California 94304M'pI"EqX&@2 :Er7s4524@tst2sH1 $,/ C. G,~()* N)v@'tsts'$ E#CtstsD!G< K;C9#78 C3<{wX{0wO .zw(- 7*.+)-$&7%Q". 7v7'&7nzwzw#3 ;:;zw% <3*zwJ+zwL zw= # R tH 0l !,+,dB HcSEC. 2HISTORY3microcoding the Model 0, together with the significant improvements in memory technology sincethe Model 0 design was frozen, to redesign and reimplement nearly every section of the Dorado.We fixed some serious design errors and a number of annoyances to the microcoder, substantiallyexpanded all the memories of the machine, and speeded up the basic cycle time. Dorado Model 1came up in the spring of 1979. During the next year several copies of this machine were built in the stitchweld technology used forthe prototypes. Stitchwelding worked very well for prototypes, but is too expensive for evenmodest quantities. Its major advantages are packaging density and signal propagation characteristicsvery similar to those of the production technology, very rapid turnaround during development(three days for a complete 300-chip board, a few hours for a modest change), and completecompatibility with our design automation system.At the same time, the design was transferred to multiwire circuit boards; the Manhattan wirerouting and lower impedance of this technology slowed the machine down by about 15%. Doradosare now assembled with very little in-house labor, since boards and backpanels are manufacturedand loaded by subcontractors. We do 100% continuity testing of the boards both before and afterthey are loaded with components and soldered. Checkout of an assembled machine is still non-trivial, but is a fairly predictable operation done entirely by technicians.3. GoalsThis section of the paper describes the overall design goals for the Dorado. The high levelarchitecture of the processor, described in the next section, follows from these goals and thecharacteristics of the available technology.The Dorado is intended to be a powerful but personal computing system. It supports a single userwithin a programming system which may extend from the microinstruction level to a fullyintegrated programming environment for a high-level language; programming at all levels must berelatively easy. The machine must be physically small and quiet enough to occupy space near itsusers in an office or laboratory setting, and cheap enough to be acquired in considerable numbers.These constraints on size, noise, and cost have a major effect on the design.In order for the Dorado to quickly become useful in the existing CSL environment, it had to becompatible with the Alto software base. High-performance Alto emulation is not a requirement,however; since the existing software is also obsolescent and due to be replaced, the Dorado onlyneeds to run it somewhat faster than the Alto can.Instead, the Dorado is optimized for the execution of languages that are compiled into a stream ofbyte codes; this execution is called emulation. Such byte code compilers exist for Mesa [3, 6],Interlisp [2, 7] and Smalltalk [4]. An instruction fetch unit (IFU) in the Dorado fetches bytes fromsuch a stream, decodes them as instructions and operands, and provides the necessary control anddata information to the processor; it is described in another paper [5]. Further support for this goalcomes from a very fast microcycle, and a microinstruction powerful enough to allow interpretationof a simple macroinstruction in a single microinstruction. There is also a cache which has a latencyof two cycles, and can deliver a word every cycle. The goal of fast execution affects the choices ofimplementation technology, microstore organization, and pipeline organization. It also mandates anumber of specific features, for example, stacks built with high speed memory, and hardware baseregisters for addressing software contexts.Another major goal for the Dorado is to support high-bandwidth input/output. In particular, colormonitors, raster scanned printers, and high speed communications are all part of the researchactivities within CSL; one of these devices typically has a bandwidth of 20 to 400 megabits/second.Fast devices should not slow down the emulator too much, even though the two functions competefor many of the same resources. Relatively slow devices must also be supported, without tying upthe high bandwidth I/O system. These considerations clearly suggest that I/O activity and emulationshould proceed in parallel as much as possible. Also, it must be possible to integrate as yet^[zG0<8wX IV@TWSxNQN^ME,.K"=J=DH HG5 #D VBRA N?~ S=*/OMNYL IjUG BFbMA{X>dw"9<#2;\ C96?5)Pzw3zw?2! G0=$/T-!6, R(/04'b zwzw%)zw z$Zw E "#c}"w#c}"w?!RP'[; zw1 :W"@'9TY+ X,zwzw% =zwzw PD5tD |w/l )zwzw <dzwzw  6H]SEC. 4HIGH LEVEL ARCHITECTURE5<==PanelWiringSideSideWiringPanel-5-2+5+12VVVVxxxx250757025AAAAFigure 1a: Dorado chassisStorage Storage Storage Storage Storage Storage Memory addressingInstruction fetch unitProcessor, high byteProcessor, low byteControl sectionMicroinstruction memoryBaseboardStorage Storage Disk/Ethernet controllerDisplay controllerI/OI/OI/OI/OI/O10.5 inPipe, map and storage controlCache data, error correctionInstructionProcessorCacheStorageSlow input/outputFast input/outputEthernetDiskDisplayKeyboardFetch Unit8K-32KbytesFigure 1b: Dorado block diagram<==<<265 MBits/sec16 bits/60 ns265 MBits/sec16 bits/60 ns120 ns access530 MBits/sec256 bits/480 ns1.7 us accessAir plenumBoard Area288 16-pin DIPs (logic)and144 8-pin SIPs (terminators)per board512K-16M bytesfzG%z<8wtd24 Hk r# 6#'p$:('$)& $(%@$(%$@(V<$)UC$(U @$(U $@J V<$K&UC$J U @$J U $@J '$K&& $J %@$J %$@'VHVH^$:1U$7T'$1T@$1T$U2%|O3O5zO=O?{OA%OKU$K&s$KQoK#Q&zp_KK$K|GK&pPK|FK+K)LB>$ LB.^$ K&p^K$?{K$@K$AK$BK$CK$E K$F%K$GBK$H^K$I{K$H.;$G.;$F.;$E.;$D{.;$C^.;$BB.;$A%.;$@ .;$>.;$=.;$<.;$;.;$:z.;$9^.;$8A.;$7%.;$6 .;$4.;$3.;$2.;$1.;$0z.;$/^.;$.A.;$-%.;$,.;$*.;$).;$2R|$4^R|$6 R|$>^R|$@ R|$AR|$*=&$ E=E:-E; ?Q!$@!$B_!$H|!$F!$Em!$C!$J!$K!$M!$N!$V^ $k r   r  ]9$ z$9] V]$] V$]@ V$]@ V$ z$9@r$ $$ z$9  V@$  V$]GyGG $2 G,Vk#$ z$9# V$# V$]<%O9G %$%, k5zpV6 !zky (%tdp  (% (% ' (% ' (%V 6 rO 5zE 2pB;8@1>7%=; B4lO-c'A PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER6Second, when the processor is available to each device, complex device interfaces can beimplemented with relatively little dedicated hardware, since most of the control does nothave to be duplicated in each interface. For low bandwidth devices, the force of thisargument is reduced by the availability of LSI controller chips, but for data rates above onemegabit/second no such chips exist as yet. Of course, to make this sharing feasible, switching the processor must be nearly free of overhead,and devices must be able to make quick use of the processor resources available to them.Many design decisions are based on the need for speed. Raw circuit speed is a beginning. Thus,the Dorado is implemented using the fastest commercially available technology which has areasonable level of integration and is not too hard to package. In 1976, the obvious choice was theECL 10K family of circuits; probably it still is. Secondly, the processor is organized around twopipelines. One allows a microinstruction to be started in each cycle, though it takes three cycle tocomplete execution. Another allows a processor context switch in each cycle, though it takes twocycles to occur. Thirdly, independent busses communicate with the memory, IFU, and I/O systems,so that the processor can both control and service them with minimal overhead.Finally, the design makes the processor both accessible and flexible for users at the microcode level,so that when new needs arise for fast primitives, they can easily be met by new microcode. Inparticular, the hardware eliminates constraints on microcode operations and sequencing often foundin less powerful designs, e.g., delay in the delivery of intermediate results to registers or incalculating and using branch conditions, or pipeline delays that require padding of microinstructionsequences without useful work. We also included an ample supply of resources: 256 generalregisters, four hardware stacks, a fast barrel shifter, and fully writeable microstore, to make theDorado reasonably easy to microcode. 5. Low level architectureThis section describes in some detail the key ideas of the architecture. Implementation techniquesand details are for the most part deferred to the next section; readers may want to jump ahead tosee the application of these ideas in the processor. Along with each key idea is a reference to theplaces in the processor where it is used.5.1 TasksThere are 16 priority levels associated with microcode execution. These levels are called microtasks,or simply tasks. Each task is normally associated with some hardware and microcode whichtogether implement a device controller. The tasks have a fixed priority, from task 0 (lowest) to task15 (highest). Device hardware can request that the processor be switched to the associated task;such a wakeup request will be honored when no requests of higher priority are outstanding. The setof wakeup requests is arbitrated within the processor, and a task switch from one task to anotheroccurs on demand, typically every ten or twenty microcycles when a high-speed device is running.When a device acquires the processor (that is, the processor is running at the requested prioritylevel and executing the microcode for that task), the device will presumably receive service from itsmicrocode. Eventually the microcode will block, thus relinquishing the processor to lower prioritytasks until it next requires service. While a given task is running, it has the exclusive attention ofthe processor. This arrangment is similar in many ways to a conventional priority interrupt system.An important difference is that the tasks are like coroutines or processes, rather than subroutines:when a task is awakened, it continues execution at the point where it blocked, rather than restartingat a fixed point. This ability to capture part of the state in the program counter is very powerful.Task 0 is not associated with a device controller; its microcode implements the emulators currentlyresident in the Dorado. Task 0 requests service from the processor at all times, but with the lowestpriority.]zwXzwzwzwzwz wzwzwWwLU JTo%-R zw/Qg N<5+LUIJH 6F EEzw[C} 5&A<@u%zwzwzw>L;W:B N8 K 7:C5 H42B2 Y1*,W{X),wD'V&$&;$# |Xvw?| w|wCn//Lf| w9?|w^Z3JG+ |w. K #O M.3clGNdR dHbSEC. 5LOW LEVEL ARCHITECTURE75.2 Task schedulingWhenever resources (in this case, the processor) are multiplexed, context switching must onlyhappen when the state being temporarily abandoned can be restored. In most multiplexedmicrocoded systems, this requires the microcode itself to explicitly poll for requests, save andrestore state, and initiate context switches. A certain amount of overhead results. Furthermore,the presence of a cache introduces large and unpredictable delays in the execution of microcode(because of misses). A polling system would leave the processor idle during these delays, eventhough the work of another task can usually proceed in parallel. To avoid these costs, the Doradodoes task switching on demand of a higher priority device, much like a conventional interruptsystem. That is, if a lower priority task is executing and a higher priority device requests a wakeup,the lower priority task will be preempted; the higher priority device will be serviced without theconsent or even the knowledge of the currently active task. The polling overhead is absorbed bythe hardware, which also becomes responsible for resuming a preempted task once the processor isrelinquished by the higher priority device.A controller will continue to request a wakeup until notified by the processor that it is about toreceive service; it then removes the request, unless it needs more than one unit of service. Whenthe microcode is done, it executes an operation called Block which releases the processor. Theeffect is that requesting service is done explicitly by device controllers, but scheduling of a giventask is invisible to the microcode (and nearly invisible to the device hardware).5.3 Task specific stateIn order to allow the immediate task switching described above, the processor must be able to saveand restore state within one microcycle. This is accomplished by keeping the vital state informationthroughout the processor not in a single rank of registers but in task specific registers. These areactually implemented with high speed memory that is addressed by a task number. Examples oftask specific registers are the microcode program counter, the branch condition register, themicrocode subroutine link register, the memory data register, and a temporary storage register foreach task. The number of the task which will execute in the next microcycle is broadcastthroughout the processor and used to address the task specific registers. Thus, data can be fetchedfrom the high speed task specific memories and be available for use in the next cycle.Not all registers are task specific. For example, COUNT and Q are normally used only by task 0.However, they can be used by other tasks if their contents are explicitly saved and restored.5.4 PipeliningThere are two distinct pipelines in the Dorado processor. The main one fetches and executesmicroinstructions. The other handles task switching, arbitrates wakeup requests and broadcasts thenext task number to the rest of the Dorado. Each structure is synchronous, and there is no waitingbetween stages.The instruction pipeline, illustrated in Figure 2, requires three cycles (divided into six half cycles) tocompletely execute a microinstruction. The first cycle is used to fetch it from microstore (time t-2 tot0). The result of the fetch is loaded into the microinstruction register MIR at t0. The second cycleis split; in the first half, operand fetches (as dictated by the contents of MIR) are performed and theresults latched at t1 in two registers (A and B) which form inputs to the next stage. In the secondhalf cycle, the ALU operation is begun. It is completed in the first half cycle of cycle three, and theresult is latched in register RESULT (at t3). The second half of cycle three (t3 to t4) is used to loadresults from RESULT into operand registers.XozG%<8wR|XNw(-Mi/"K %1JaN HO GY.)E%7DQO B SAI|w$?<>A.|w< 9$=8D64~w#5I3M/|X,Xw=#*.4)P 8| w '+)&H:$*/#@8! Z 8R " zwzw"U|X _wJ >W'8 >|w| w $ W~  $w 6~ 6w4zw~ 6wHGzw ~7wzwzw ) zwQRzw~Rw$~Rw~Rwdzw H]eA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER8<==MIRFetch fromInstruction memoryoperandfetchABoperandmodification>>>resultstorefirst cyclesecond cyclethird cycleT4T3T2T1T0T0T1T2T3T4T4T3T2T1T0T0T1T2T3T4T4T3T2T1T0T0T1T2T3T4T-2T-1T-1T-2T-2T-1T-1T-2T-2T-1T-1T-2<>>><<>>>>ResultInstruction PipelineTiming OverlapFigure 2: Instruction pipeline and timing overlap<==<>>>>>>>WakeUpTaskfetch Next task specific statefetch Next microinstructionbroadcast Next taskCurrentTaskCurrent<>>+$p1"d $ ]$ k $+ ]$z ]$z ]$ $=9$54]$54]$1 ]$1t$A$99$9_]$1-$Pq?]E#IBr/;9z9G9 lG V %G %G A G lG  % G %G d AGV G G GzG lG V %G %G G G: VGs GGsG Ge|' CC C9G!zG8PGr!G!G! %G! %G8 lGr! GAG G  C:zGA lG V: %G: %G :G8A9G8 ^9GV3$V 3$V99rIH : e##&W 0W$;t < ;te l$ |BV,tp  C;3$V>>>bypass pathnormal pathmemoryPrevious Result Addressoperand fetchFigure 4: Bypassing examplemultiplexor switched ifCurrent Operand Address =<==<<^zwXzwzwzwzwz wzwzwW A~W1Ww~W1WwUPTLzwzw-zw&RQQD ~wHO&6N<~w ?LH|XEwT D 7B.+AT ?~.(=|w8G5/~w)D VB CA;?~;|X8Tw_61zw5L I3C2Dzwzw+0zwzw?/<7 zwzw-  G,4O*zw30),Lzw'Azw$}zwzw@zwz"wzwzwzwzw,!uEzwzw  UmC|wzwzwzwzw|wzwzw(e zwzw/3zwzw|X w|w*3 <5| w+zw O F | AP tY &~w0 lO\d~w H^A PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER12<==>>>>rrrrCBrIMOutCPRegTLink*TPC*+1rrProcessorControlTPIMOutBestNextPCBestNextTaskNEXTIMAddressThisTaskNextPCLinkNextControlFFASelectLoadControlBSelectALUOPRAddressLink>>>*{PE}{Switch}{StartCycle}{StartCycle}{StartCycle}{InstructionDecode}{TPCByPass}{Switch}{UseCPReg}{ReadIM}{RAddr}{ReadLink}{Link_B}r{ }NextCtrlBlockmultiplexormultiplexor select signaltask specificFigure 5: Control sectionThisTaskLastTaskThisPCThisPC+1MemoryInstructionIMThisPCMIRt-2t0t2t-4Priority EncoderTLinkByPassLink_B{}<==<<>rr>rBBBBTLinkAddrTPCAddr`MzwXzwzwzwzwz wzwzwtd2L HeC'2R-R(HR*:O$9.O$92O$96sO$9*G,sI $*>X,sFfr$0;$ .5P$9,sJ$35P$.O$94: <O$9:O$99O$98O$9 R 9O$9 KD $K I GCU$2$95 $8$,s$:W$03$92Tn$ 9Tn$ 9M|$G$,sM$9OGVV$V$1PB$rO$]rR$&$+W&$V4$V6I+$2$9 O$]6$$&| 9  &`/1r1r5rDa F 9F  J6 L L rL Q()L.eL2L5L:L?(?)Gr  Kf$ r9 $H7B$ r7 $ r7$ ) +$'$ ' O$ $ ,$ ,$5W$4$*:>$*:>$*:Hr$.G_$*:G;$*:G;$K$ MX$IQ$9 9IQ$95 $9+C+W;G-5 G,s;$9O$ d&A G.e9$(G8 $U52$*:Dtr$2 z9G5AG G.2$*:$*:,$$*:B$ Kf$$AGr*$k*$$*A$G/H$@*e$$*Dt$k*Dr$.Ck$*D$k*E$/:C$@.Ck$,sE$],se$ 4G/:k$/Hk$7G%O$]%:L$pR$OGKf@$8,s *s*C/#0W|!. d$-B"sp "s K$K$MX$5:0!P$)7 7 27$)7f$1I '$-1I =O$r'sK*:2X. .X.!VX.X. X. X.UX.;tO$=W9Gr|_9$ H9$9+${p2vH4 d rF O#; (47 )$:W+,$:W$ =r(>'$D'$5 :\$$9;,lG0%$,sV$6I9$ +r$:H$ O$9V$27ft7r(et(r,trKtK94$ 9 pE9Yn$ VYn$rYn$Yn$$Yn$*:Yn$.Yn$6sYn$9Z/:$H$G;tV y$G%$.=TJ $R9$vl H%3$3$3 $9,s$) $- G#pd|'sH$.94$.56sTJ$y 9U&W$ 9 8ey$8,|>;&`)O$UO$U,s$*:O$/:O$($p('s1s+: +$ @0)$])$ C:W$,s?$>'%$ :[$Lj*H\LSEC. 5LOW LEVEL ARCHITECTURE13<==>>>>+Shifter.......................RM..>..r..COUNTALUMemBase>>>>>>>TIOAIOAddressAdder..ALUFMrLSH 1RSH 1.......StackPtrRBaseABExternalBIODatadevicesRESULT,....>>>>>>>memorymemorydevicesto I/OsmallconstantconstantRESULTBdata from>IFUAqFigure 6: Data sectionRAddrRaddr>register or memorylatchmultiplexor latchmultiplexornarrower data pathFFfrom t1 to t2from t2 to t3.>RT*task specific****LoadcontrolcontrolLoadALUOPaddress toRRTTMDQMDQRTMemData (MD)FFFFFF(copy of A)MemAdMDMemBALUFMALUALUALU.IFUconstconst(copies of B)IODataCOUNTIOAdBaseRegsto/fromcontrol,memory,IFU,to/from,XB{FF, mask}{bypass}{bypass}}{}{{control}{control}rother 16 bit pathpointerStackData readyshortly after t1Latches followLatches followData readyshortly before t3RAMs readat t1, load at t4load at t3RegistersBalso to RESULTshift 1 bit left or rightShiftCtl)(also toto RESULT<==<
$>$!3%W$  $$!$'s/$-($'s(f@$'s(f$!z*$ ],$(r+--+$/+$0tkd-kU-fk-k21k2d1k2d4$kA1$k$GU$GUA'kGd$kG"$3$.d$3_$.$/4$N33_d$ 39$ 3_9$.$.@$H1m$$1& $$/4$$1m$& $$$ r$# $ & $H& $#$9$$;$'& $ { $#$ d $ &$ !z%W$G(B$,.$2B$ < $$%($ d!z*{$ ],$!z*$ ]'$ ]-$G< k$2B $94{ k$/+$-+$$A&- y$$1 $$1m $/1m$/5-$ ]$A5 $$A5Q d$/5Q$ 's.@G's(f@G's(G-l(G23;k2$k2$^kH6$;kH"x+"*D"("&" D""C ]&$(Bk$.d$/4$!z0$!309$ 2B$2 $%3G$%44G$#d$#Bd$!z%4$$ $%(G$z+`z)'V z *D ,}  '%$%"|1 )n1 r|!$ 4z4 y9N$V9$8$$A! $"r2$$]$$9!zI$!z%k$"x$$z@$$$A;$'^$$e $?Q$9?-$?-$]Af$H$Fg$]Fg$F$9TK$R$]R$/B$ kB*$+W$$% $$:Wz=oF$B$UB$B$y|?E$G_$E$E$End$Du$*/4$d x>Dr?a G$ G_$kF9$FC$*:|+kL9$kL9$ yKC$ yK$rxCE DKC%$JJ$vK 3Br+-<N$%4X$5Q$UAvR 9|P 9O~ 9N>O~ yOJ9$K 9J 9LPC$N.$N.$]Pg$]vNO'/:G%:rO 2YG7TnG2T'GG2T'G)eY$k.eUk)Uk.U$)Y$)eU$k/U$.Xu$.X.$2lVR$]N.$]0WxP00WMi/Ug@$=$$9)$9*$9.$27G65G925tNG25tG2v6m43$4;|0=7%,m $9*{$=#r$="%$="%$$A"Ik$="I$9*$d9)$d9($9($29,&y$7%,&$>t+ >)^AxA A"|A$'A%A'}A)'9.;$9-$9-$= :$%<]$:/$:{/$:{/{$:/$G%^$3G_#l$;t1&$A;t1&$;1m$;1$K&W$<2$L$Jn$9JJ$]<3$VS.(%$='I@$;%44O$7WY $7W $9=$4$%4 $9'$ 9'm k$'&$-IS$[C$9R$9[C;$3$V2 $$ $%W$ $O1I$9 3_$ 1$U V1$y V$$y $$U V'$ V'$ & $#$9 V$$C;, $C;, r$L$dM;3$LB$M_$%:r2B%:&%:  %:"Fp!PFr,D4xSDODGDD`=%$=6J$B|2B0B/DB#B!S.y$L$@kF$ 9F 9DFpZDFV<FNuFOC;#G$.G $eG @ VEGU VIGU VMG VQ`G r@J ?- Fp[` VJn$$VNQ$VCX$V@J$|< vT H|$ H r:$ V8GU+S.$+|Q yS.]$p rv7 O*{V$6P|H$ $U $ $yp 9 $O$9,$9,$,z $ $]z $z $]G9 , U $s ,+$s +$s H$s,$U $,s $,s $,s $,s e$,s$3,$vyQ$Vt V !$!$"x 9$&%$9@J$5$ 5r$#|2 r2%0pd3 r%9NS. dt# "l / d047$1,94$1,v9{FpWY !t2!'&!1&!%{!34{!/{!(!#%:-B%:+ dr\ tC@%:7 0W$ %:r5=6m$?t._>6>;5 ?{,^$^>;1?43;*rX.*:VC;,m $L$dM3$C;,& $L+$zLB$3M_$$3$Vkz$k%4$ 2'm$ 2'&$-I$$:W $%k $%B$]kB$]ke*$+W$F$kL$@94$=$S$R$[`C$9ZC;$/B$ 03$ 03%$$A& y$$A"%k$$A! $$e $$e^$!3$U!z$';$>"%$,s z$,s 3$Fp^FFF3Ax`Fp"Ax$1 $$1I $ Tn$ O*4y$!Vt {A8; V/XV"+0t k0t+# k#  4;|43p '$!3$v)^*W%:t %:VVEt Et   V5-$ V1$!3 49$0W%@$01I$rSu$/X$ Ot,m]+t$Oe {$ &U$$% $&2B &3_$7$7 'spd3 (dtG<<0 4O ]R A PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER146. ImplementationIn this section we describe, at the block diagram level, the actual implementation of the Doradoprocessor. There is only space to cover the most interesting points and to illustrate the key ideasfrom 5.6.1 ClocksThe Dorado has a fully synchronous clock system, with a clock tick every 30 nanoseconds. A cycleconsists of two successive clock ticks; it begins on an even tick, which is followed by an odd tick,and completes coincident with the beginning of a new cycle on the next even tick. Even ticks maybe labeled with names like t-2, t0, t2, t4 to denote events within a microinstruction execution or apipeline, relative to some convenient origin. Odd ticks are similarly labeled t-1, t1, t3.6.2 The control sectionThe processor can be divided into two distinct sections, called control and data. The control sectionfetches and broadcasts the microinstructions to the data section (and the remainder of the Dorado),handles task switching, maintains a subroutine link, and regulates the clock system. It also has aninterface to a console and monitoring microcomputer which is used for initialization and debuggingof the Dorado. Figure 5 is a block diagram of the control section.6.2.1 Task pipelineThe task pipeline consists of an assortment of registers and a priority encoder. All the registers areloaded on even clocks. Wakeup requests are latched at t0 in WAKEUP, one bit per task; READY hascorresponding bits for preempted and explicitly readied tasks. The requests in WAKEUP and READYcompete. A task can be explicitly made ready by a microcode function. The priority encoderproduces the number of the highest priority task, which is loaded into BESTNEXTTASK and also usedto read the TPC of this task into BESTNEXTPC; these registers are the interface between the twostages in this pipeline. The NEXT bus normally gets the larger of BESTNEXTTASK and THISTASK.THISTASK is loaded from NEXT, and LASTTASK is loaded from THISTASK, as the pipeline progresses.This method of priority scheduling means that once a task is initiated, it must explicitly relinquishthe processor before a lower priority task can run. A bit in the microword, Block, is used toindicate that NEXT should get BESTNEXTTASK unconditionally (unless the instruction is held).Note that it takes a minimum of two cycles from the time a wakeup changes to the time thischange can affect the running task (one for the priority encoding, one to fetch the microinstruction).This implies that a task must execute at least two microinstructions after its wakeup is removedbefore it blocks; otherwise it will continue to run, since the effects of its wakeup will not have beencleared from the pipe. The device cannot remove the wakeup until it knows that the task will run(by seeing its number on NEXT). Hence the earliest the wakeup can be removed is t0 of the firstinstruction (NEXT has the task number in the previous cycle, and the wakeup is latched at t0); thusthe grain of processor allocation is two cycles for a task waking up after a Block.Some trouble was taken to keep the grain small, for the following reason. Since the memory isheavily pipelined and contains a cache which does not interact with high bandwidth I/O, the I/Omicrocode often needs to execute only two instructions, in which a memory reference is started anda count is decremented. The processor can then be returned to another task. The maximum rate atwhich storage references can be made is one every eight cycles (this is the cycle time of the mainstorage RAMs). A two cycle grain thus allows the full memory bandwidth of 530 megabits/secondto be delivered to I/O devices using only 25% of the processor.[ zwXzwzwzwzwz wzwzwT{QwP P SNJ|XGYw;|w|Ew/|w|wDQ7'B~B@Bw~B@Bw~B@Bw~B@Bw:@F~@R@w~@R@w~@R@w<|X9w 3|w|w811+6V5)V3A/|X ,{w`*.~*j*wzwzw) Bzwz'w H&)z w $} zwz w3"zw z wzw!uzwzwzwzwJV I~w Bzw z w2FC@.3--zw4~w  zwA~w |wD~w |$6.zwzwzwztw=%;l?zwOdzwzw) %H`SEC. 6IMPLEMENTATION15A simpler design would require the microcode to explicitly notify its device when the wakeupshould be removed; it would then be unnecessary to broadcast NEXT to the devices. Since thisnotification could not be done earlier than the first instruction, however, the grain would be threecycles rather than two, and 37.5% of the processor would be needed to provide the full memorybandwidth. Other simplifications in the implementation would result from making the pipelinelonger; in particular, squeezing the priority encoding and reading of TPC into one cycle is quitedifficult. Again, however, this would increase the grain.6.2.2 Fetching microinstructionsRefer to the right hand side of Figure 5. At t0 of every instruction, the microinstruction registerMIR is loaded from the outputs of IM, the microinstruction memory, and the THISPC register isloaded with IMADDRESS. The NEXTPC is quickly calculated based on the NextControl field in MIR,which encodes both the instruction type and some bits of NEXTPC; see Figure 7 for details. Thiscalculation produces THISTASKNEXTPC, so called because if a task switch occurs it is not used as thenext IMADDRESS. Instead, the BESTNEXTPC computed in the task pipeline is used as IMADDRESS.<==0P$<0P$:X0P$80P$50P$#t/%;/'s/)/+/./0W/2/4;/6t/8/:/=/?X/?X*=*:*8*6t*4;*2*0W*.*+*)*'s*%;*#*5+P$8+P$:X+P$<+P$>+P$A+P$1t+P$/;+P$-+P$*+P$(+P$$+P$!+,$]!+,A$A+P$9!-e$!(e$A&P$9!&,A$!&,$]$&P$(&P$*&P$-&P$/;&P$1t&P$A&P$>&P$<&P$:X&P$5&P$3&P$#%%;%'s%)%+%.%0W%2%4;%6t%8%:%=%?X%?X = : 8 6t 4; 2 0W . + ) 's %; # 5!$$<$:X$8$5$3$#%;'s)+.0W24;6t8:=?X?X,=,:,8,6t,4;,2,0W,.,+,),'s,%;,#,3l$5l$8l$:Xl$>l$Al$1tl$/;l$-l$*l$(l$$l$!H$]!HA$Al$9!$'s+P$9's&P$9'sl$9.v &W0P$30P$93+P$94t,6,8,;-,=f,?,8&P$9-v'&W!$3!$98!9$>!$98!$9:X!$9?"8"<l$9?Xt=:86t4;20W.+)'s%;#5$8$:X$<$>$A$1t$/;$-$*$($&W$$$!$]!A$A$9! $3$93$!p ,u  t -r6I)v181",,+W,",'",:')"3":"Bu# B" +v<u(vr6r !4W4;4W:u1 :, Wp sr6I k$ :H $k$ :k$ :3 $6I 3G$HG$Gk$ :Bt1 u!tS  !J$pdG7xA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER16TPC is written with the previous value of THISTASKNEXTPC every cycle (at t3), and read for the taskin BESTNEXTTASK every cycle as well. Thus, TPC is constantly recording the program counter valuefor the current task, and also constantly preparing the value for the next task in case there is a taskswitch.6.2.3 Miscellaneous featuresThere is a task specific subroutine linkage register, LINK, shown in Figure 5, which is loaded withthe value in THISPC+1 on every microcode call or return. Thus each task can have its ownmicrocoded coroutines. LINK can also be loaded from a data bus, so that control can be sent to anarbitrary computed address; this allows a microprogram to implement a stack of subroutine links,for example. In addition to conditional branches, which select one of two NEXTPC values, there arealso eight-way and 256-way dispatches, which use a value on the B bus to select one of eight, or oneof 256 NEXTPC values. Since the Dorado's microstore is writeable, there are data paths for reading and writing it. Relatedpaths allow reading and writing TPC. These paths (through the register TPIMOUT) are folded intoalready existing data paths in the control section and are somewhat tortuous, but they are usedinfrequently and hence have been optimized for space. In addition, another computer (either aseparate microcomputer or an Alto) serves as the console processor for the Dorado; it is interfacedvia the CPREG and a very small number of control signals.6.3 The data sectionFigure 6 is a block diagram of the data section, which is organized around an arithmetic/logic unit(ALU). It implements most of the registers accessible to the programmer and the microcodefunctions for selecting operands, doing operations in the ALU and shifter, and storing results. It alsocalculates branch conditions, decodes MIR fields and broadcasts decoded signals to the rest of theDorado, supplies and accepts memory addresses and data, and supplies I/O data and addresses.6.3.1 The microinstruction registerMIR (which actually belongs to the control section) is 34 bits wide and is partitioned into thefollowing fields:RAddress4Addresses the register bank RM.ALUOp4Selects the ALU operation or controls the shifter.BSelect3Selects the source for the B bus, including constants.LoadControl3Controls loading of results into RM and T.ASelect3Selects the source for the A bus, and starts memory references.Block1Blocks an I/O task, selects a stack operation for task 0.FF 8Catchall for specifying functions.NextControl8Specifies how to compute NEXTPC.6.3.2 BussesThe major busses are A, B (ALU sources), RESULT, EXTERNALB, MEMADDRESS, IOADDRESS, IODATA,IFUDATA, and MEMDATA .The ALU accepts two inputs (A and B) and produces one output (RESULT). The input busses have avariety of sources, as shown in the block diagram. RESULT usually gets the ALU output, but it isalso sourced from many other places, including a one bit shift in either direction of the ALU output.A copy of A is used for MEMADDRESS; two copies of B are used for EXTERNALB and IODATA.MEMADDRESS provides a sixteen bit displacement, which is added to a 28 bit base register in thememory system to form a virtual addresses. EXTERNALB is a copy of B which goes to the control,memory, and IFU sections, and IODATA is another copy which goes to the I/O system; the sources of^zwXzwzwzwzwz wzwzwX(zw'z w~WX(wV:z wzw2TRS2O3|XLw1zw$JzwFI zwFG|@E?zwDt| wzw#Bzw?5+>Azw%zw<#5;9 >9(381zw,42|X1wD/zwF -,zw+,{ zw&*>zwzw&|X#zwX"I~w]zwmz~w] zw#~w]zwe~ w]!zwzw~w]zw#]~w] zwzw,~w]"U~ w]zwV|X+w zwzwzw zwzwz wzwzw zwzw |zwzwzwzwzwzwtDzwzw z wzw zwzwlz wN&zw zwdzwzw#zwzw L HcuSEC. 6IMPLEMENTATION17B can thus be sent to the entire processor. Both are bidirectional and can serve as a source for B aswell. IOADDRESS is driven from a task specific register; it specifies the particular device and registerwhich should source or receive IODATA.IFUDATA and MEMDATA allow the processor to receive data from the IFU and memory in parallelwith other data transfers. MEMDATA has the value of the memory word most recently fetched bythe current task; if the fetch is not complete, the processor is held when it tries to use MEMDATA.IFUDATA has an operand of the current macroinstruction; as each operand is used, the IFU presentsthe next one on IFUDATA.6.3.3 RegistersHere is a list and brief description of registers seen by the microprogrammer. All are one word (16bits) wide.RM:a bank of 256 general purpose registers; a register can be read onto A, B, or theshifter, and loaded from RESULT under the control of LoadControl. Normally, thesame register is both read and loaded in a given microinstruction, but loading of adifferent register can be specified by FF.STACK:a memory addressed by the STACKPTR register. A word can be read or written,and STACKPTR adjusted up or down, in one microinstruction. If STACK is used in amicroinstruction, it replaces any use of RM, and the RAddress field in the microwordtells how much to increment or decrement STACKPTR. The 256 word memory isdivided into four 64 word stacks, with independent underflow and overflowchecking.T:a task specific register used for working storage; like RM, it can be read onto A, B,or the shifter, and loaded from RESULT under the control of LoadControl..COUNT:a counter; it can be decremented and tested for zero in one microinstruction, usingonly the NextControl or FF field. It is loaded from B or with small constants fromFF.SHIFTCTL:a register which controls the direction and amount of shifting and the width of leftand right masks; it is loaded from B or with values useful for field extraction fromFF.Q:a hardware aid for multiply and divide instructions; it can be read onto A or B, andloaded from B, and is automatically shifted in useful ways during multiply anddivide step microinstructions.The next group of registers vary in width. They are used as control or address registers, changeddynamically but infrequently by microcode.RBASE:RM addressing requires eight bits. Four come from the RAddress field in themicroword, and the other four are supplied from RBASE. It is loaded from B or FF,and can be read onto RESULT.STACKPTR:an eight bit register used as a stack pointer. Two bits of STACKPTR select a stack,and the least significant six bits a word in the stack. The latter bits areincremented or decremented under control of the RAddress field whenever a stackoperation is specified.MEMBASE:a five bit register which selects one of 32 base registers in the memory to be usedfor virtual address calculation. It is loaded from FF field or from B, and can beloaded from the IFU at the start of a macroinstruction.ALUFM:a 16 word memory which maps the four-bit ALUOp field into the six bits requiredto control the ALU.IOADDRESS:a task specific register which drives the IOADDRESS bus, and is loaded by I/Omicrocode to specify a device address for subsequent Input and Output operations.It may be loaded from B or FF.\zG+I ;wVzwT zwUzwYSzwPpzwzw$ zwNzw/ Mh,,zwKzwMzwJ` zwFa|X C6wLA?zw Dzwzw >zw~ w <&) ;~w9zw zw 7zw 'zw 5zw ~w 4x$zw 2 5 1p/_zw zwzwzw -zw~ w+zw *( *F~ w~w zw (~w&zw E %-zw( #~w!zw Hzwzw zw( e(7 zw zw5~w L zwzw~w zwzw zw 33  ~w + zw /# 1~wzw  zw$zw z~w! } zwlzw )zwzwz w,~w~w dzw~w  8HaA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER186.3.4 The shifterThe Dorado has a 32 bit barrel shifter for handling bit-aligned data. It takes 32 bits of input fromRM and T, performs a left cycle of any number of bit positions, and places the result on A. TheALU output may be masked during a shift instruction, either with zeroes or with data fromMEMDATA.The shifter is controlled by the SHIFTCTL register. To perform a shift operation, SHIFTCTL is loaded(in one of a variety of ways) with control information, and then one of a group of "shift and mask"microoperations is executed.6.4 Physical organizationOnce the goal of a physically small but powerful machine was established, engineering design andmaterial lead times forced us to develop the Dorado package before the implementation was morethan partially completed, and the implementation then had to fit the package. The data section ispartitioned onto two boards, eight bits on each; the boards are about 70% identical. The controlsection divides naturally into one board consisting of all the IM chips (high speed 1K x 1 bit ECLRAMs) and their associated address drivers, and a second board with the task switch pipeline,NEXTPC logic, and LINK register.The sidepanel pins are distributed in clusters around the board edges to form the major busses.The remaining edge pins are used for point to point connections between two specific boards. TheI/O busses go uniformly to all the I/O slots, but all the other boards occupy fixed slots specificallywired for their needs. Half the pins available on the sideplanes are grounded, but wire lengths arenot controlled except in the clock distribution system, and no twisted pair is used in the machineexcept for distribution of one copy of the master clock to each board.We were very concerned throughout the design of the Dorado to balance the pipelines so that noone pipe stage is significantly longer than the others. Furthermore, we worked hard to make thelongest stage (which limits the speed of this fully synchronous machine) as short as possible. Thelongest stage in the processor, as one might have predicted, is the IMADDRESS calculation andmicroinstruction fetch in the control slice. There is about a 50 nanosecond limit for reliableoperation in a stitchwelded machine, and 60 ns in a multiwired machine. There are pipe stages ofabout the same length in the memory and IFU.We also worked hard to get the most out of the available real estate, by hand tailoring theintegrated circuit layout and component usage, and by incrementally adding function until nearlythe entire board was in use. We also found that performance could be significantly improved bycareful layout of critical paths for minimum loading and wiring delay. Although this was a verylabor intensive operation, we believe it pays off.7. PerformanceFour emulators have been implemented for the Dorado, interpreting the BCPL, Lisp, Mesa andSmalltalk instruction sets. A typical microinstruction sequence for a load or store instruction takesonly one or two microinstructions in Mesa (or BCPL), and five in Lisp. The Mesa opcode can senda 16 bit word to or from memory in one microinstruction; Lisp deals with 32 bit items and keepsits stack in memory, so two loads and two stores are done in a basic data transfer operation. Morecomplex operations (such as read/write field or array element) take five to ten microinstructions inMesa and ten to twenty in Lisp. Note that Lisp does runtime checking of parameters, while inMesa most checking is done at compile time. Function calls take about 50 microinstructions forMesa and 200 for Lisp.The Dorado supports raster scan displays which are refreshed from a full bitmap in main memory;this bitmap has one bit for each picture element (dot) on the screen, for a total of .51 megabits]zwXzwzwzwzwz wzwzwW|TowFRzwzw@zwQgzw1%OzwL zw*zw K4LI E|XBw4(A3#?~@= : LD7$E|wd5) AHbSEC. 7PERFORMANCE19(more for gray-scale or color pictures). A special operation called BitBlt (bit boundary blocktransfer) makes it easier to create and update bitmaps; for more information about BitBlt consult [9],where it is called RasterOp. BitBlt makes extensive use of the shifting/masking capabilities of theprocessor, and attempts to prefetch data so that it will always be in the cache when needed. TheDorado's BitBlt can move display objects around in memory at 34 megabits/sec for simple cases likeerasing or scrolling a screen. More complex operations, where the result is a function of the sourceobject, the destination object and a filter, run at 24 megabits/sec.I/O devices with transfer rates up to 10 megabits/sec are handled by the processor via the IODATAand IOADDRESS busses. The microcode for the disk takes three cycles to transfer two words in thisway; thus the 10 megabit/sec disk consumes 5% of the processor. Higher bandwidth devices usethe fast I/O system, which does not interact with the cache. The fast I/O microcode for the displaytakes only two instructions to transfer a 16 word block of data from memory to the device. Thiscan consume the available memory bandwidth for I/O (530 megabits/sec) using only one quarter ofthe available microcycles (that is, two I/O instructions every eight cycles).Recall that the NEXTPC scheme ( 5.5 and 6.2.2) imposes a rather complicated structure on themicrostore, because of the pages, the odd/even branch addresses, and the special subroutine calllocations We were concerned about the amount of microstore which might be wasted by automaticplacement of instructions under all these constraints. In fact, however, the automatic placer can use99.9% of the available memory when called upon to place an essentially full microstore.AcknowledgementsThe early design of the Dorado processor was done by Chuck Thacker and Don Charnley. Thedata section was redesigned and debugged by Roger Bates and Ed Fiala. Peter Deutsch wrote themicrocode assembler and instruction placer, and Ed Fiala wrote the Dorado assembler macros, themicroprogram debugger, and the hardware manual. Willie-Sue Haugeland, Nori Suzuki, BruceHorn, Peter Deutsch, Ed Taft and Gene McDaniel are responsible for production and diagnosticmicrocode.References1.Clark, D.W. et. al. The memory system of a high-performance personal computer. Technical Report CSL-81-1, Xerox PaloAlto Research Center, January 1981. Revised version to appear in IEEE Transactions on Computers.2.Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, PacificGrove, Nov. 1979.3.Geschke, C.M. et. al. Early experience with Mesa. Comm ACM 20, 8, Aug 1977, 540-5524.Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles ofProgramming Languages, Tucson, Jan 1978, 9-16.5. Lampson, B.W. et. al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1,Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.6.Mitchell, J.G. et. al. Mesa Language Manual, Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.7. Teitelman, W. Interlisp Reference Manual, Xerox Palo Alto Research Center, Oct. 1978.8. Thacker, C.P. et. al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek, Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August1979.9. Newman, W.M. and Sproull, R.F. Principles of Interactive Computer Graphics, 2nd ed. McGraw-Hill, 1979.=KzG- ;w61~w5p.~w 3 ~w~w,2h !60~w=/`^-=*zwzwSz)-wzw0%',-&%zwzw)zwzw$,/#,zwzw!%zwzw"n zw, PfT-0^R{`wPZ X L $) PE { dyw{zG  (HBA z)KpGqOrq{( Bsqpq&ftu{qDp)q{%(#tu{q pqpsqtq!tu{qOpsp{ qtuX{qGpqYrq{Gtu{qpqpqrq3ituX{q Gpq-tuX{qGpqp*q{Drq/{Ftq{p+q 6 HA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER20dquXququququq uququ8 eHZ$ HELVETICA HELVETICA  HELVETICA HELVETICA  HELVETICA HELVETICA MATH  TIMESROMAN LOGO  TIMESROMAN  TIMESROMAN  TIMESROMAN   TIMESROMAN   TIMESROMAN TIMESROMAN  TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN  HELVETICA  HELVETICA  HELVETICA HELVETICA HELVETICATEMPLATE@ GATES  f n W%- e8 RB L NW ` 2k 1t ! n  ^ W ~K Z;[Y'.RUN.ZZ;'.RUN.Z[Y'presseditB( ZoZ'presseditZZ'presseditZZ'prj/7procFinal.presspiera13-Jan-81 16:15:51 PST