A Processor for a High-Performance Personal Computerby Butler W. Lampson and Kenneth A. PierJanuary 1981ABSTRACTThis paper describes the design goals, microarchitecture, and implementation of themicroprogrammed processor for a compact high performance personal computer. Thismachine supports a range of high level language environments and high bandwidth I/Odevices. It also has a cache, a memory map, main storage, and an instruction fetch unit;these are described in other papers. The processor can be shared among 16 microcodedtasks, performing microcode context switches on demand with essentially no overhead.Conditional branches are done without any lookahead or delay. Microinstructions are fairlytightly encoded, and use an interesting variant on control field sharing. The processorimplements a large number of internal registers, hardware stacks, a cyclic shifter/masker,and an arithmetic-logic unit, together with external data paths for instruction fetching,memory interface, and I/O, in a compact, pipelined organization.The machine has a 60 ns microcycle, and can execute a simple macroinstruction in onecycle; the I/O bandwidth is 530 megabits/sec. The entire machine, including disk, displayand network interfaces, is implemented with approximately 3000 MSI components, mostly ECL10K; the processor is about 35% of this. In addition there are up to 4 storage modules, eachwith about 300 16K or 64K RAMs and 200 MSI components, for a maximum of 8 megabytes.The total volume, including power and cooling, is about .14 m3 (4.5 ft3). A number ofmachines are currently running.A version of this paper appeared in Proc. 7th Symposium on Computer Architecture,SigArch/IEEE, La Baule, May 1980, 146-160.CR CATEGORIES6.34, 6.21KEY WORDS AND PHRASESarchitecture, controller, emulation, input/output, microprogram, pipeline, processor.c Copyright 1981 by Xerox Corporation.XEROXPALO ALTO RESEARCH CENTER3333 Coyote Hill Road / Palo Alto / California 94304 \p Y>" TkqX& O Ir Fs4 D2 Cx@tst AsH @p$, > C =h G ;() :` N 8@ 7Xtsts' 4- E 2tstsD 1%J zwL 6zw=  R . H 0 !, w+, BTVk(SEC. 2HISTORY3microcoding the Model 0, together with the significant improvements in memory technology sincethe Model 0 design was frozen, to redesign and reimplement nearly every section of the Dorado.We fixed some serious design errors and a number of annoyances to the microcoder, substantiallyexpanded all the memories of the machine, and speeded up the basic cycle time. Dorado Model 1came up in the spring of 1979. During the next year several copies of this machine were built in the stitchweld technology used forthe prototypes. Stitchwelding worked very well for prototypes, but is too expensive for evenmodest quantities. Its major advantages are packaging density and signal propagation characteristicsvery similar to those of the production technology, very rapid turnaround during development(three days for a complete 300-chip board, a few hours for a modest change), and completecompatibility with our design automation system.At the same time, the design was transferred to multiwire circuit boards; the Manhattan wirerouting and lower impedance of this technology slowed the machine down by about 15%. Doradosare now assembled with very little in-house labor, since boards and backpanels are manufacturedand loaded by subcontractors. We do 100% continuity testing of the boards both before and afterthey are loaded with components and soldered. Checkout of an assembled machine is still non-trivial, but is a fairly predictable operation done entirely by technicians.3. GoalsThis section of the paper describes the overall design goals for the Dorado. The high levelarchitecture of the processor, described in the next section, follows from these goals and thecharacteristics of the available technology.The Dorado is intended to be a powerful but personal computing system. It supports a single userwithin a programming system which may extend from the microinstruction level to a fullyintegrated programming environment for a high-level language; programming at all levels must berelatively easy. The machine must be physically small and quiet enough to occupy space near itsusers in an office or laboratory setting, and cheap enough to be acquired in considerable numbers.These constraints on size, noise, and cost have a major effect on the design.In order for the Dorado to quickly become useful in the existing CSL environment, it had to becompatible with the Alto software base. High-performance Alto emulation is not a requirement,however; since the existing software is also obsolescent and due to be replaced, the Dorado onlyneeds to run it somewhat faster than the Alto can.Instead, the Dorado is optimized for the execution of languages that are compiled into a stream ofbyte codes; this execution is called emulation. Such byte code compilers exist for Mesa [3, 6],Interlisp [2, 7] and Smalltalk [4]. An instruction fetch unit (IFU) in the Dorado fetches bytes fromsuch a stream, decodes them as instructions and operands, and provides the necessary control anddata information to the processor; it is described in another paper [5]. Further support for this goalcomes from a very fast microcycle, and a microinstruction powerful enough to allow interpretationof a simple macroinstruction in a single microinstruction. There is also a cache which has a latencyof two cycles, and can deliver a word every cycle. The goal of fast execution affects the choices ofimplementation technology, microstore organization, and pipeline organization. It also mandates anumber of specific features, for example, stacks built with high speed memory, and hardware baseregisters for addressing software contexts.Another major goal for the Dorado is to support high-bandwidth input/output. In particular, colormonitors, raster scanned printers, and high speed communications are all part of the researchactivities within CSL; one of these devices typically has a bandwidth of 20 to 400 megabits/second.Fast devices should not slow down the emulator too much, even though the two functions competefor many of the same resources. Relatively slow devices must also be supported, without tying upthe high bandwidth I/O system. These considerations clearly suggest that I/O activity and emulationshould proceed in parallel as much as possible. Also, it must be possible to integrate as yet f2zG<Gw _ I ^W@ \W [ON Y V^ U,. S"= RD P H O # KV J]R H N GU S E*/ DMD ?z{X  \M [OY Y VU U B SM N{X Kw"9 J#2 H C G C? B_Pzw @zw? ?W G ==$  H <B ; Y 9 4{X 1wD 0 V .&; -# )|X %w?| w $V|wC "// !NL | w9 F?|w Z J G  |w.  K O M .3 c G LN RTVk(SEC. 5LOW LEVEL ARCHITECTURE75.2 Task schedulingWhenever resources (in this case, the processor) are multiplexed, context switching must onlyhappen when the state being temporarily abandoned can be restored. In most multiplexedmicrocoded systems, this requires the microcode itself to explicitly poll for requests, save andrestore state, and initiate context switches. A certain amount of overhead results. Furthermore,the presence of a cache introduces large and unpredictable delays in the execution of microcode(because of misses). A polling system would leave the processor idle during these delays, eventhough the work of another task can usually proceed in parallel. To avoid these costs, the Doradodoes task switching on demand of a higher priority device, much like a conventional interruptsystem. That is, if a lower priority task is executing and a higher priority device requests a wakeup,the lower priority task will be preempted; the higher priority device will be serviced without theconsent or even the knowledge of the currently active task. The polling overhead is absorbed bythe hardware, which also becomes responsible for resuming a preempted task once the processor isrelinquished by the higher priority device.A controller will continue to request a wakeup until notified by the processor that it is about toreceive service; it then removes the request, unless it needs more than one unit of service. Whenthe microcode is done, it executes an operation called Block which releases the processor. Theeffect is that requesting service is done explicitly by device controllers, but scheduling of a giventask is invisible to the microcode (and nearly invisible to the device hardware).5.3 Task specific stateIn order to allow the immediate task switching described above, the processor must be able to saveand restore state within one microcycle. This is accomplished by keeping the vital state informationthroughout the processor not in a single rank of registers but in task specific registers. These areactually implemented with high speed memory that is addressed by a task number. Examples oftask specific registers are the microcode program counter, the branch condition register, themicrocode subroutine link register, the memory data register, and a temporary storage register foreach task. The number of the task which will execute in the next microcycle is broadcastthroughout the processor and used to address the task specific registers. Thus, data can be fetchedfrom the high speed task specific memories and be available for use in the next cycle.Not all registers are task specific. For example, COUNT and Q are normally used only by task 0.However, they can be used by other tasks if their contents are explicitly saved and restored.5.4 PipeliningThere are two distinct pipelines in the Dorado processor. The main one fetches and executesmicroinstructions. The other handles task switching, arbitrates wakeup requests and broadcasts thenext task number to the rest of the Dorado. Each structure is synchronous, and there is no waitingbetween stages.The instruction pipeline, illustrated in Figure 2, requires three cycles (divided into six half cycles) tocompletely execute a microinstruction. The first cycle is used to fetch it from microstore (time t-2 tot0). The result of the fetch is loaded into the microinstruction register MIR at t0. The second cycleis split; in the first half, operand fetches (as dictated by the contents of MIR) are performed and theresults latched at t1 in two registers (A and B) which form inputs to the next stage. In the secondhalf cycle, the ALU operation is begun. It is completed in the first half cycle of cycle three, and theresult is latched in register RESULT (at t3). The second half of cycle three (t3 to t4) is used to loadresults from RESULT into operand registers. f2zG1Gw _|X \w(- [,/" Y %1 X$N VO U.) S%7 RO P S O |w$ M< L.|w J  GU$= ED DM4~w# BI AEM =F|X :w=# 8.4 7 8| w 5+) 4 : 2*/ 18 / Z -R *" zwzw" )LU %M|X ""wJ > '8  k>|w| w  W~Zw ~lw4zw~lw  Gzw  ~wzwzw )  zwQ zw~w$~w~w 'zwTVk(A PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER8<==4+ <\*+! |X w +) P B D HB -2 @!; U 8D M 05 C ., TVk(MSEC. 5LOW LEVEL ARCHITECTURE11start memory references can be affected by the state of the memory. When such an instruction isexecuted, the memory checks to see whether it can be allowed to proceed. If so, no action is taken.But if the memory is busy, or the data being used is not ready, the memory responds by activatingthe signal Hold. The effect of Hold is to stop any state changes specified by the current instruction.However, all the clocks in the system keep running. This is important, because task switching mustnot be inhibited during memory delays. In effect, Hold converts the currently executing instructioninto a "no operation, jump to self" instruction. If no task switch occurs, the instruction is executedagain, and a new calculation is made to see whether it can proceed. Meanwhile, the memorypipeline is running, and sooner or later, the need for Hold will be gone as the pipeline progresses.Note that if a task switch occurs while an instruction is held, the state is such that the heldinstruction may simply be restarted when the lower priority task is resumed by the processor.Cycles which would otherwise be dead time are consumed instead by higher priority tasks doinguseful work.5.8 Separate external interfacesIf most macroinstructions (byte codes) are to execute in a small number of cycles, hardware mustbe provided to make communication among processor, IFU, and memory very quick in the commoncases. The Dorado provides a number of data paths and control structures for this purpose,detailed in the block diagrams, Figures 5 and 6. All the busses are a full word wide and can beaccessed in one cycle or less. The B input to the ALU is extended to the remainder of the Dorado(except I/O devices, which have their own busses) for the transfer of status and control between theprocessor and the other subsystems. The memory address bus is a copy of the A side ALU input.Memory data comes directly into the processor and is routed to a variety of destinationssimultaneously, to make such operations as field manipulations and indirect addressing fast. TheIFU can directly supply operand data to the processor, and any microinstruction can specify that it isthe last of a macroinstruction, in which case the successor address is supplied by the IFU. Thisrequires a microstore address bus and operand data bus directly from the IFU to the processor.It is also desirable to make I/O transfers through the processor fast. To this end there is an I/Oaddress bus and an I/O data bus for direct access to I/O controllers. The data bus can transfer oneword per cycle, or 265 megabits/second, and both the memory reference and the I/O transfer canbe specified in a single instruction, so that it is possible to move a sequence of words between the cache and a device at this rate. However, this subsystem is called the slow I/O system. There isalso a more direct memory access I/O subsystem, the fast I/O system; it allows data to move directlybetween storage and I/O devices, in blocks of 16 words, without polluting the cache. Figure 1bshows a display controller that uses both slow and fast I/O systems.5.9 ConstantsNotice that there is no source for 16 bit constants within the processor. Such constants arenecessary, particularly in device controller microcode where they often are used as commands,addresses or literal data. It would be possible to include a constant box, addressed perhaps with anFF function, as a source for constants. However, such a box would have a limited size and,experience tells us, would not hold enough constants to satisfy a growing world.Fortunately, a large fraction of the constants used in microcoding are either small positive or smallnegative (2's complement) integers, or sparsely populated bit vectors, with the property that one ofthe two eight bit fields in the constant is all zeroes or all ones. Thus a useful subset of constantscan be specified using the eight bits of FF for one byte of the constant and two other bits to specifythe other byte value and position. Using this technique, most 16 bit constants can be specified inone microinstruction, and any constant can be assembled in two microinstructions. (The "other"two bits come from the BSelect field in the microword). f2zG1G?w _6% ^W P \@ [O~w~w 8 YI XG/~w- Vc U?> S/~w) PV O C M; L H|X Dw_ CV1zw A I @NC >zwzw+ =Fzwzw? ;7 zwzw :>  G 8O 76zw30 5Lzw 4.Azw 1zwzw@zwz /wzwzwzwzw, -Ezwzw ,w U *C|wzwzw )ozwzw|wzwzw( ' zwzw/ &g3zwzw "h|X =w|w*  < 55| w zw O - F  A ~P Y v&~w0 O n\ ~wTVk(A PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER12<==A :|X 7w` 6.~56wzwzw 4. Bzwz 2w H 1&)z w / zwz w3 .zw z wzw ,zwzwzwzw )oV 'I~w &gzw z w2 #<F !C 4@ .3 ,-- zw4~w  zwA~-w |wD~w $6 .zwzwzwz w= %; ? zwO zwzw)TVk(SEC. 6IMPLEMENTATION15A simpler design would require the microcode to explicitly notify its device when the wakeupshould be removed; it would then be unnecessary to broadcast NEXT to the devices. Since thisnotification could not be done earlier than the first instruction, however, the grain would be threecycles rather than two, and 37.5% of the processor would be needed to provide the full memorybandwidth. Other simplifications in the implementation would result from making the pipelinelonger; in particular, squeezing the priority encoding and reading of TPC into one cycle is quitedifficult. Again, however, this would increase the grain.6.2.2 Fetching microinstructionsRefer to the right hand side of Figure 5. At t0 of every instruction, the microinstruction registerMIR is loaded from the outputs of IM, the microinstruction memory, and the THISPC register isloaded with IMADDRESS. The NEXTPC is quickly calculated based on the NextControl field in MIR,which encodes both the instruction type and some bits of NEXTPC; see Figure 7 for details. Thiscalculation produces THISTASKNEXTPC, so called because if a task switch occurs it is not used as thenext IMADDRESS. Instead, the BESTNEXTPC computed in the task pipeline is used as IMADDRESS.<== Ah(3 ?zw, ;|X 8wD 76zwF 5,zw+ 4. zw& 2>zwzw .|X +zwX )'~Rwzw& z~Rw zw#$~Rwzw#~ Rw!zwzw!~Rwzw# ~Rw zwzw,~wR"~ Rwzw  |X w zwzwzw zwzwz wzwzw Zzwzw /zwzwzwzw zwzw 'Dzw zw z wzw zwzw z wN &zw zw zwzw#zwzw LTVk(SEC. 6IMPLEMENTATION17B can thus be sent to the entire processor. Both are bidirectional and can serve as a source for B aswell. IOADDRESS is driven from a task specific register; it specifies the particular device and registerwhich should source or receive IODATA.IFUDATA and MEMDATA allow the processor to receive data from the IFU and memory in parallelwith other data transfers. MEMDATA has the value of the memory word most recently fetched bythe current task; if the fetch is not complete, the processor is held when it tries to use MEMDATA.IFUDATA has an operand of the current macroinstruction; as each operand is used, the IFU presentsthe next one on IFUDATA.6.3.3 RegistersHere is a list and brief description of registers seen by the microprogrammer. All are one word (16bits) wide.RM:a bank of 256 general purpose registers; a register can be read onto A, B, or theshifter, and loaded from RESULT under the control of LoadControl. Normally, thesame register is both read and loaded in a given microinstruction, but loading of adifferent register can be specified by FF.STACK:a memory addressed by the STACKPTR register. A word can be read or written,and STACKPTR adjusted up or down, in one microinstruction. If STACK is used in amicroinstruction, it replaces any use of RM, and the RAddress field in the microwordtells how much to increment or decrement STACKPTR. The 256 word memory isdivided into four 64 word stacks, with independent underflow and overflowchecking.T:a task specific register used for working storage; like RM, it can be read onto A, B,or the shifter, and loaded from RESULT under the control of LoadControl..COUNT:a counter; it can be decremented and tested for zero in one microinstruction, usingonly the NextControl or FF field. It is loaded from B or with small constants fromFF.SHIFTCTL:a register which controls the direction and amount of shifting and the width of leftand right masks; it is loaded from B or with values useful for field extraction fromFF.Q:a hardware aid for multiply and divide instructions; it can be read onto A or B, andloaded from B, and is automatically shifted in useful ways during multiply anddivide step microinstructions.The next group of registers vary in width. They are used as control or address registers, changeddynamically but infrequently by microcode.RBASE:RM addressing requires eight bits. Four come from the RAddress field in themicroword, and the other four are supplied from RBASE. It is loaded from B or FF,and can be read onto RESULT.STACKPTR:an eight bit register used as a stack pointer. Two bits of STACKPTR select a stack,and the least significant six bits a word in the stack. The latter bits areincremented or decremented under control of the RAddress field whenever a stackoperation is specified.MEMBASE:a five bit register which selects one of 32 base registers in the memory to be usedfor virtual address calculation. It is loaded from FF field or from B, and can beloaded from the IFU at the start of a macroinstruction.ALUFM:a 16 word memory which maps the four-bit ALUOp field into the six bits requiredto control the ALU.IOADDRESS:a task specific register which drives the IOADDRESS bus, and is loaded by I/Omicrocode to specify a device address for subsequent Input and Output operations.It may be loaded from B or FF. f2zG7 G?w _zwT zw ^WzwY \zw Yzwzw$ zw X$zw/ V,,zw UzwMzw S zw O|X LnwL J HzwDzwzwGUzw~ w E&)DM~w BE =H ;|zwzwzwzw@ 9@ 8tM 6@ 3;! 2A,1 0D /9zw -F ,19 *#zw '(1 % (. $z\ "@ !r- {X tw3zw H l zw. G d"> L \D 7$ T )E|w 5)TVk(SEC. 7PERFORMANCE19(more for gray-scale or color pictures). A special operation called BitBlt (bit boundary blocktransfer) makes it easier to create and update bitmaps; for more information about BitBlt consult [9],where it is called RasterOp. BitBlt makes extensive use of the shifting/masking capabilities of theprocessor, and attempts to prefetch data so that it will always be in the cache when needed. TheDorado's BitBlt can move display objects around in memory at 34 megabits/sec for simple cases likeerasing or scrolling a screen. More complex operations, where the result is a function of the sourceobject, the destination object and a filter, run at 24 megabits/sec.I/O devices with transfer rates up to 10 megabits/sec are handled by the processor via the IODATAand IOADDRESS busses. The microcode for the disk takes three cycles to transfer two words in thisway; thus the 10 megabit/sec disk consumes 5% of the processor. Higher bandwidth devices usethe fast I/O system, which does not interact with the cache. The fast I/O microcode for the displaytakes only two instructions to transfer a 16 word block of data from memory to the device. Thiscan consume the available memory bandwidth for I/O (530 megabits/sec) using only one quarter ofthe available microcycles (that is, two I/O instructions every eight cycles).Recall that the NEXTPC scheme ( 5.5 and 6.2.2) imposes a rather complicated structure on themicrostore, because of the pages, the odd/even branch addresses, and the special subroutine calllocations We were concerned about the amount of microstore which might be wasted by automaticplacement of instructions under all these constraints. In fact, however, the automatic placer can use99.9% of the available memory when called upon to place an essentially full microstore.AcknowledgementsThe early design of the Dorado processor was done by Chuck Thacker and Don Charnley. Thedata section was redesigned and debugged by Roger Bates and Ed Fiala. Peter Deutsch wrote themicrocode assembler and instruction placer, and Ed Fiala wrote the Dorado assembler macros, themicroprogram debugger, and the hardware manual. Willie-Sue Haugeland, Nori Suzuki, BruceHorn, Peter Deutsch, Ed Taft and Gene McDaniel are responsible for production and diagnosticmicrocode.References1.Clark, D.W. et. al. The memory system of a high-performance personal computer. Technical Report CSL-81-1, Xerox PaloAlto Research Center, January 1981. Revised version to appear in IEEE Transactions on Computers.2.Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, PacificGrove, Nov. 1979.3.Geschke, C.M. et. al. Early experience with Mesa. Comm ACM 20, 8, Aug 1977, 540-5524.Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles ofProgramming Languages, Tucson, Jan 1978, 9-16.5. Lampson, B.W. et. al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1,Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.6.Mitchell, J.G. et. al. Mesa Language Manual, Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.7. Teitelman, W. Interlisp Reference Manual, Xerox Palo Alto Research Center, Oct. 1978.8. Thacker, C.P. et. al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek, Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August1979.9. Newman, W.M. and Sproull, R.F. Principles of Interactive Computer Graphics, 2nd ed. McGraw-Hill, 1979. f2zG8 G?w _1~w ^W.~w \ ~w~w, [O !6 Y~w= XG^ V= SzwzwSz Rwzw0% P,- O zwzw)zwzw M,/ L,zwzw J%zwzw" GU zw, E P DMT B-0 AER