An Instruction Fetch Unit for a High-Performance Personal Computer

by Butler W. Lampson, Gene A. McDaniel and Severo M. Ornstein

January 1981

ABSTRACT

The instruction fetch unit (IFU) of the Dorado personal computer speeds up the emulation of instructions by pre-fetching, decoding, and preparing later instructions in parallel with the execution of earlier ones. It dispatches the machine's microcoded processor to the proper starting address for each instruction, and passes the instruction's fields to the processor on demand. A writeable decoding memory allows the IFU to be specialized to a particular instruction set, as long as the instructions are an integral number of bytes long. There are implementations of specialized instruction sets for the Mesa, Lisp, and Smalltalk languages. The IFU is implemented with a six-stage pipeline, and can decode an instruction every 60 ns. Under favorable conditions the Dorado can execute instructions at this peak rate (16 mips).

This paper has been submitted for publication.

CR CATEGORIES
6.34, 6.21

KEY WORDS AND PHRASES
cache, emulation, instruction fetch, microcode, pipeline.

© Copyright 1981 by Xerox Corporation.

XEROX
PALO ALTO RESEARCH CENTER
3333 Coyote Hill Road / Palo Alto / California 94304

1. Introduction

This paper describes the instruction fetch unit (IFU) for the Dorado, a powerful personal computer designed to meet the needs of computing researchers at the Xerox Palo Alto Research Center. These people work in many areas of computer science: programming environments, automated office systems, electronic filing and communication, page composition and computer graphics, VLSI design aids, distributed computing, etc. There is heavy emphasis on building working prototypes.

The Dorado preserves the important properties of an earlier personal computer, the Alto [13], while removing the space and speed bottlenecks imposed by that machine's 1973 design. The history, design goals, and general characteristics of the Dorado are discussed in a companion paper [8], which also describes its microprogrammed processor. A second paper [1] describes the memory system.

The Dorado is built out of ECL 10K circuits. It has 16-bit data paths, 28-bit virtual addresses, 4K-16K words of high-speed cache memory, writeable microcode, and an I/O bandwidth of 530 Mbits/sec. Figure 1 shows a block diagram of the machine. The microcoded processor can execute a microinstruction every 60 ns. An instruction of some high level language is performed by executing a suitable succession of these microinstructions; this process is called emulation.

The EU demands instructions from the IFU at an irregular rate, depending on how fast it is able to absorb the previous ones. A simple machine must completely process an instruction before demanding the next one. In a machine with multiple functional units, on the other hand, the first stage in the EU waits until the basic resources required by the instruction (adders, result registers, etc.) are available, and then hands it off to a functional unit for execution.
Beyond this point the operation cannot be described by a single pipeline, and complete execution of the instruction may be long delayed, but even in this complicated situation the IFU still sees the EU as a single consumer of instructions, and is unaware of the concurrency which lies beyond.

Under this umbrella definition for an IFU, a lot can be sheltered. To illustrate the way an IFU can accommodate specific language features, we draw an example from Smalltalk [5]. In this language, the basic executable operation is applying a function f (called a method) to an object o: f(o, . . .). The address of the code for the function is not determined solely by the static program, but depends on a property of the object called its class. There are many implementation techniques for finding the class and then the function from the object. One possibility is to represent a class as a hash table which maps function names (previously converted by a compiler into numbers) into code addresses, and to store the address of this table in the first word of the object. The rather complex operation of obtaining the hash table address and searching the table for the code address associated with f is in the proper domain of an IFU, and removes a significant amount of computation from the processor. No such specialization is present in the Dorado's IFU, however.

2.3 Pipelining instruction fetches

For the sake of definiteness, we will assume henceforth that

    the smallest addressable unit in the code is a byte;

    the memory delivers data in units called words, which are larger than bytes;

    an instruction (and its addresses, immediate operands, and other fields) may occupy one or more bytes, and the first byte determines its essential properties (length, number of fields, etc.).

Matters are somewhat simplified if the addressable unit is the unit delivered by the memory or if instructions are all the same length, and somewhat complicated if instructions may be any number of bits long. However, these variations are inessential and distracting.

The operation of instruction fetching divides naturally into four stages:

    Generating addresses of instruction words in the code, typically by sequentially advancing a program counter, one memory word at a time.

    Fetching data from the code at these addresses. This requires interactions with the machine's memory in general, although recently used code may be cached within the IFU. Such a cache looks much like main memory to the rest of the IFU.

    Decoding instructions to determine their length and internal structure, and perhaps whether they are branches which the IFU should execute. Decoding changes the representation of the instruction, from one which is compact and convenient for the compiler, to one which is convenient for the EU and IFU.

    Formatting the fields of each instruction (addresses, immediate operands, register numbers, mode control fields, or whatever) for the convenience of the EU; e.g., extracting fields onto the EU's data busses.

Buffering may be introduced between any pair of these stages, either the minimum of one item required to separate the stages, or a larger amount to increase the elasticity. Note that an item must be a word early in the pipe (at the interface to the memory), must be an instruction late in the pipe (at the interface to the EU), and may need to be a byte in the middle.
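As a concrete (and purely illustrative) picture of this structure, the C sketch below models the four stages connected by one-item buffers with empty flags; a stage moves an item forward only when it has an input item and the buffer it writes is empty, and the consumer takes items at an irregular rate. The stage names follow the text, but the data, the timing, and every other detail are invented and have nothing Dorado-specific about them.

/* Toy model of the four-stage fetch pipeline: ADDRESS -> FETCH -> DECODE -> FORMAT,
   with a one-item buffer after each stage.  Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool empty; unsigned value; } Buffer;      /* one-item buffer */

static unsigned codeMemory[] = { 0x11, 0x22, 0x33, 0x44 };  /* fake code words */

int main(void) {
    Buffer addr = { true, 0 }, word = { true, 0 }, inst = { true, 0 }, fields = { true, 0 };
    unsigned pc = 0;
    for (int cycle = 0; cycle < 12; cycle++) {
        /* Stages run back to front so an item can advance one stage per cycle. */
        if (!inst.empty && fields.empty) {                  /* FORMAT: extract fields for the EU */
            fields.value = inst.value & 0xFF; fields.empty = false; inst.empty = true;
        }
        if (!word.empty && inst.empty) {                    /* DECODE: turn a word into an "instruction" */
            inst.value = word.value; inst.empty = false; word.empty = true;
        }
        if (!addr.empty && word.empty) {                    /* FETCH: read the code at the address */
            word.value = codeMemory[addr.value % 4]; word.empty = false; addr.empty = true;
        }
        if (addr.empty) {                                   /* ADDRESS: advance the program counter */
            addr.value = pc++; addr.empty = false;
        }
        if (!fields.empty && cycle % 2) {                   /* the EU consumes at an irregular rate */
            printf("cycle %2d: EU receives %02x\n", cycle, fields.value); fields.empty = true;
        }
    }
    return 0;
}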
There are three sources of irregularity (see Section 2.1) in the pipeline, even when no wrong branches are taken:

    The instruction length is irregular, as noted in the previous paragraph; hence a uniform flow of instructions to the EU implies an irregular flow of bytes into the decoder, and vice versa.

    The memory takes an irregular amount of time to fetch data; if it contains a cache, the amount of time may vary by more than an order of magnitude.

    The EU demands instructions at an irregular rate.

These considerations imply that considerable elasticity is needed in order to meet the EU's demands without introducing delays.

2.4 Hand-off to the EU

From the IFU's viewpoint, handing off an instruction to the EU is a simple producer-consumer relationship. The EU demands a new instruction. If one is ready, the IFU delivers it as a pile of suitably formatted bits, and forgets about the instruction. Otherwise the IFU notifies the EU that it is not ready; in this case the EU will presumably repeat the request until it is satisfied. Thus at this level of abstraction, hand-off is a synchronized transfer of one data item (a decoded instruction) from one process (the IFU) to another (the EU).

Usually the data in the decoded instruction can be divided into two parts: information about what to do, and parameters. If the EU is a microprogrammed processor, for example, what to do can conveniently be encoded as the address of a microinstruction to which control should go (a dispatch address), and indeed this is done in the Dorado. Since microinstructions can contain immediate constants, and in general can do arbitrary computations, it is possible in principle to encode all the information in the instruction into a microinstruction address; thus the instructions PushConstant(3) and PushConstant(4356) could send control to different microinstructions. In fact, however, microinstructions are expensive, and it is impractical to have more than a few hundred, or at most a few thousand of them. Hence we want to use the same microcode for as many instructions as possible, representing the differences in parameters which are treated as data by the microcode. These parameters are presented to the EU on some set of data busses; Section 4 has several examples.

Half of the IFU-EU synchronization can also be encoded in the dispatch address: when the IFU is not ready, it can dispatch the EU to a special NotReady location. Here the microcode can do any background processing it might have, and then repeat the demand for another instruction. The same method can be used to communicate other exceptional conditions to the EU, such as a page fault encountered in fetching an instruction, or an interrupt signal from an I/O device. The Dorado's IFU uses this method (see Section 3.4).

Measurements of typical programs [7, 11] reveal that most of the instructions executed are simple, and hence can be handled quickly by the EU. As a result, it is important to keep the cost of hand-off low, since otherwise it can easily dominate the execution time for such instructions. As the EU gets faster, this point gets more important; there are many instructions which the Dorado, for instance, can execute in one cycle, so that one cycle of hand-off overhead would be 50%. This point is discussed further in Sections 3 and 4.
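The hand-off just described can be caricatured in a few lines of C: the IFU's answer to each demand is a microcode dispatch address plus parameter data, and "not ready" is simply a distinguished dispatch address. The constants, the structure, and the stand-in for the IFU below are all invented for the sketch; only the idea of encoding the synchronization in the dispatch address comes from the text.

/* Sketch: hand-off as a dispatch address plus parameters, with NotReady as a
   special address.  All names and values are invented. */
#include <stdio.h>

enum { NOTREADY = 0 };                           /* special dispatch address */

typedef struct { unsigned dispatch; unsigned data; } Decoded;

static void ifu_demand(Decoded *d, int tick) {   /* stand-in for the IFU */
    if (tick % 3 == 0) { d->dispatch = NOTREADY; d->data = 0; return; }
    d->dispatch = 0x40 + tick % 4;               /* pretend dispatch address */
    d->data = (unsigned)tick;                    /* pretend operand, e.g. an alpha byte */
}

int main(void) {
    Decoded d;
    for (int tick = 0; tick < 6; tick++) {
        ifu_demand(&d, tick);
        if (d.dispatch == NOTREADY)
            printf("tick %d: dispatched to NotReady; do background work and ask again\n", tick);
        else
            printf("tick %d: dispatched to %#x with operand %u\n", tick, d.dispatch, d.data);
    }
    return 0;
}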
2.5 Autonomy

Perhaps the most important parameter in the design of an IFU is the extent to which it functions independently of the execution unit, which is the master in their relationship. At one extreme we can have an IFU which is entirely independent of the EU after it is initialized with a code address (it might also receive information about the outcome of branches); this initialization would only occur on a process switch, complex procedure call, or indexed or indirect jump. At the other extreme is an IFU which simply buffers one word of code and delivers successive bytes to the EU; when the buffer is empty, the IFU dispatches the EU to a piece of microcode which fetches another memory word's worth of code into the buffer. The first IFU must decode instruction lengths, follow jumps, and provide the program counter for each instruction to the EU (e.g., so that it can be saved as a

same time. The processor actually has access to all the information needed to maintain its own PC, but the time required to do this in microcode would be prohibitive (at least one cycle per instruction).

The IFU can also follow branches, provided they are PC-relative, have displacements specified entirely in the instruction, and are encoded in certain limited ways. These restrictions ensure that only information from the code (plus the current PC value) is needed to compute the branch address, so that no external dependencies are introduced. It would be possible to handle absolute as well as PC-relative branches, but this did not seem useful, since none of the target instruction sets use absolute branches. The decoding table specifies for each opcode whether it branches and how to obtain the displacement. On a branch, DECODE resets the earlier stages of the pipe and passes the branch PC back to ADDRESS. The branch instruction is also passed on to the processor. If it is actually a conditional branch which should not have been taken, the processor will reset the IFU to continue with the next instruction; the work done in following the branch is wasted. If the branch is likely not to be taken, then the decoding table should be set up so that it is treated as an ordinary instruction by the IFU, and if the branch is taken after all, the processor will reset the IFU to continue with the branch path; in this case the work done in following the sequential path is wasted. Even unconditional jumps are passed on to the processor, partly to avoid another case in the IFU, and partly to prevent infinite loops in the IFU without any processor intervention.
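To make the branch-following rule concrete, here is a small C sketch of the decision DECODE makes: the decoding table marks an opcode as a branch, the new PC is computed from the current PC and a displacement taken from the instruction, and the earlier stages are restarted there, while the instruction is still handed to the processor. The table layout, the displacement encoding, and the opcode values are invented; only the rule itself is taken from the text.

/* Sketch of PC-relative branch following in the IFU.  Illustrative only. */
#include <stdio.h>

typedef struct { int isBranch; int length; } DecodeEntry;

static DecodeEntry table[256];                   /* one entry per opcode byte */
static int pc;                                   /* PC of the current instruction */

static void restart_pipe_at(int newPc) {         /* stand-in for resetting ADDRESS/MEMORY/BYTES */
    pc = newPc;
    printf("earlier stages reset; refilling from PC=%d\n", newPc);
}

static void decode(unsigned char opcode, int displacement) {
    DecodeEntry e = table[opcode];
    if (e.isBranch)
        restart_pipe_at(pc + displacement);      /* target computed from the code and PC alone */
    else
        pc += e.length;                          /* normal sequential advance */
    /* in either case the instruction itself is still passed on to the processor */
}

int main(void) {
    table[0x10] = (DecodeEntry){ 0, 1 };         /* an ordinary one-byte opcode */
    table[0x90] = (DecodeEntry){ 1, 2 };         /* a PC-relative jump; its alpha byte is the displacement */
    pc = 100;
    decode(0x10, 0);                             /* PC advances to 101 */
    decode(0x90, -5);                            /* IFU follows the jump: PC becomes 96 */
    printf("final PC = %d\n", pc);
    return 0;
}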
4. IFU-processor hand-off

With a microcoded execution unit like the Dorado's processor, efficient emulation depends on smooth interaction between the IFU and the processor, and on the right kind of concurrency in the processor itself. These considerations are less critical in a low-performance machine, where many microcycles are used to execute each instruction, and the loss of a few is not disastrous. A high-performance machine, however, executes many instructions in one or two microcycles. Adding one or two more cycles because of a poorly chosen interface with the IFU, or because a very common pair of operations cannot be expressed in a single microinstruction, slows the emulator down by 50-200%. The common operations are not very complex, and require only a modest amount of hardware for an efficient implementation. The examples in this section illustrate these points.

Good performance depends on two things:

    An adequate set of data busses, so that it is physically possible to perform the frequent combinations of independent data transfers in a single cycle. We shall be mainly concerned with the busses which connect the IFU and the processor, rather than with the internal details of the latter. These are summarized in Figure 4.

    A microinstruction encoding which makes it possible to specify these transfers in a single microinstruction. A horizontal encoding does this automatically; a vertical one requires greater care to ensure that all the important combinations can still be specified.

We shall use the term folding for the combination of several independent operations in a single microinstruction. Usually folding is done by the microprogrammer, who surveys the operations to be done and the resources of the processor, and arranges the operations in the fewest possible number of microinstructions.

PushConstantWord:
    temp _ LeftShift[IFUData, 8];    -- put alpha into the left half of temp
    Push[temp or IFUData], IFUJump;  -- or in beta, push the result on the stack, and dispatch to the next instruction

Notice that the first microinstruction uses the IFU to acquire data from the code stream. Then the second microinstruction simultaneously retrieves the second data byte and dispatches to the next instruction.

These examples illustrate several points.

    Any number of microinstructions can be executed to emulate an instruction, i.e., between IFUJumps.

    Within an instruction, any number of IFUData requests are possible; see Table 3 for a summary of the data delivered to successive requests.

    IFUJump and IFUData may be done concurrently. The IFUData will reference the current instruction's data, and then the IFUJump will dispatch the processor to the first microinstruction of the next instruction (or to NotReady); the sketch below illustrates this interface.
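The sketch is a toy model of the processor's side of this interface, written to match the PushConstantWord code above: successive IFUData calls deliver the alpha and then the beta byte of the current instruction, and IFUJump ends the instruction and yields the next dispatch address. The real rules for successive IFUData requests are those of Table 3 (not reproduced here), and everything in the C below is invented for illustration.

/* Toy model of IFUData/IFUJump as seen from the microcode.  Illustrative only. */
#include <stdio.h>

static unsigned char curInst[3] = { 0x07, 0x12, 0x34 };  /* opcode, alpha, beta */
static int nextField = 1;                                /* which byte IFUData returns next */

static unsigned IFUData(void) { return curInst[nextField++]; }

static unsigned IFUJump(void) { nextField = 1; return 0x40; /* next dispatch address */ }

int main(void) {
    unsigned temp  = IFUData() << 8;             /* temp _ LeftShift[IFUData, 8]   (gets alpha)  */
    unsigned value = temp | IFUData();           /* Push[temp or IFUData]          (ors in beta) */
    printf("pushed %#x, next dispatch %#x\n", value, IFUJump());
    return 0;
}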
Suppose analysis of programs indicates that the most common PushConstant instruction pushes the constant 0. Suppose further that 1 is the next most common constant, and 2 the next beyond that, and that all other constants occur much less frequently. A lot of code space can probably be saved by dedicating three one-byte opcodes to the most frequent PushConstant instructions, and using a two-byte instruction for the less frequent cases, as in the PushConstantByte example above, where the opcode byte designates a PushConstantByte opcode and alpha specifies the constant. A third opcode, PushConstantWord, provides for 16-bit constants, and still others are possible.

Pursuing this idea, we define five instructions to push constants onto the stack: PushC0, PushC1, PushC2, PushCB, PushCW. Any five distinct values can be assigned for the opcode bytes of these instructions, since the meaning of an opcode is completely defined by its decoding table entry. The entries for these instructions are as follows (N is a constant encoded in the opcode, Length is the instruction length in bytes, and Dispatch is the microcode dispatch address; for details, see Section 5.4):

Opcode    Partial decoding table contents    -- Remarks
PushC0    Dispatch_PushC, N_0, Length_1      -- push 0 onto the stack
PushC1    Dispatch_PushC, N_1, Length_1      -- push 1 onto the stack
PushC2    Dispatch_PushC, N_2, Length_1      -- push 2 onto the stack
PushCB    Dispatch_PushC, Length_2           -- push alpha onto the stack
PushCW    Dispatch_PushCWord, Length_3       -- push the concatenation of alpha and beta onto the stack

Here is the microcode to implement these instructions; we have seen it before:

PushC:       -- PushC0/1/2 (IFUData=N), PushCB (IFUData=alpha)
    Push[IFUData], IFUJump;
PushCWord:   -- PushCW
    temp _ LeftShift[IFUData, 8];    -- (IFUData=alpha here)
    Push[temp or IFUData], IFUJump;  -- (IFUData=beta here)

Observe that the same, single line of microcode (at the label PushC) implements four different opcodes, for both one- and two-byte instructions. Only PushConstantWord requires two separate microinstructions.

4.2 Initializing state

A standard method for reducing the size and increasing the usefulness of an instruction is to parameterize it. For example, we may consider an instruction with a base register field to be parameterized by that register: the "meaning" of the instruction depends on the contents of the register. Thus the same instruction can perform different functions, and also perhaps can get by with a smaller address field. This idea is also applicable to microcode, and is used in the Dorado. For example, there are 32 memory base registers. A microinstruction referencing memory does not specify one of these explicitly; instead, there is a MemBase register, loadable by the microcode, which tells which base register to use. Provided the choice of register changes infrequently, this is an economical scheme.

For emulation it presents some problems, however. Consider the microcode to push a local variable; the address of the variable is given by the alpha byte plus the contents of the base register localData, whose number is localDataRegNo:

PushLocalVar:
    MemBase _ localDataRegNo;        -- Make memory references relative to the local data.
    Fetch[IFUData];                  -- Use contents of PC+1 as offset.
    Push[memoryData], IFUJump;       -- Push variable onto stack, begin next instruction

This takes three cycles, one of which does nothing but initialize MemBase. The point is clear: such parametric state should be set from the IFU at the start of an instruction, using information in the decoding table. This is in fact done on the Dorado. The decoding table entry for PushLocalVar specifies localData as the initial value for MemBase, and the microcode becomes:

PushVar:
    Fetch[IFUData];                  -- IFU initializes MemBase to the local data
    Push[memoryData], IFUJump;       -- Push variable onto stack, begin next instruction

One microinstruction is saved. Furthermore, the same microcode can be used for a PushGlobalVar instruction, with a decoder entry which specifies the same dispatch address, but globalData as the initial value of MemBase.
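One way to picture the decoding table entries used so far is as a small record per opcode: a dispatch address, an optional constant N, the instruction length, and initial values for parametric state such as MemBase. The C rendering below is only a picture; the real format of the table is defined in Section 5.4 and certainly differs in detail, and the register numbers and dispatch addresses shown are invented.

/* Invented rendering of a decoding table entry; see Section 5.4 for the real format. */
#include <stdio.h>

typedef struct {
    unsigned short dispatch;    /* microcode dispatch address              */
    signed   char  N;           /* constant encoded in the opcode, or -1   */
    unsigned char  length;      /* instruction length in bytes             */
    unsigned char  memBase;     /* initial MemBase value for this opcode   */
} DecodeEntry;

enum { localData = 5, globalData = 6 };                    /* invented base register numbers */
enum { PushC = 0x40, PushCWord = 0x42, PushVar = 0x44 };   /* invented dispatch addresses    */

static DecodeEntry table[256];

int main(void) {
    table[0x01] = (DecodeEntry){ PushC,      0, 1, 0 };          /* PushC0                       */
    table[0x02] = (DecodeEntry){ PushC,      1, 1, 0 };          /* PushC1                       */
    table[0x03] = (DecodeEntry){ PushC,     -1, 2, 0 };          /* PushCB: alpha is the operand */
    table[0x04] = (DecodeEntry){ PushCWord, -1, 3, 0 };          /* PushCW                       */
    table[0x10] = (DecodeEntry){ PushVar,   -1, 2, localData };  /* PushLocalVar                 */
    table[0x11] = (DecodeEntry){ PushVar,   -1, 2, globalData }; /* PushGlobalVar                */
    printf("opcodes 0x10 and 0x11 share dispatch 0x%x but differ in MemBase (%d vs %d)\n",
           (unsigned)table[0x10].dispatch, table[0x10].memBase, table[0x11].memBase);
    return 0;
}

The last two entries make the point of the example: PushLocalVar and PushGlobalVar share their microcode because the table, not the microinstruction, supplies the parameter that differs.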
Thus there are two ways in which parameterization saves space over specifying everything in the microinstruction: each microinstruction can be shorter, and fewer are needed. The need for initialization, however, makes the idea somewhat less attractive, since it complicates both the IFU and the EU, and increases the size of the decoding table.

A major reduction in the size of the decoding table can be had by using the opcode itself as the dispatch address. This has a substantial cost in microcode, since typically the number of distinct dispatch addresses is about one-third of the 256 opcodes. If this price is paid and parameterization eliminated, however, the IFU can be considerably simplified, since not only the decoding table space is saved, but also the buffers and busses needed to hand off the parameters to the processor, and the parameterization mechanism in the processor itself. On the Dorado, the advantages of parameterization were judged to be worth the price, but the decision is a fairly close one. The current memory base register and the current group of processor registers are parameters of the microinstruction which are initialized from the IFU. The IFU also supplies the dispatch address at the same time. The remainder of the information in the decoding table describes the data fields and instruction length; it is buffered in EXECUTE and passed to the processor on demand.

4.3 Forwarding

Earlier we mentioned folding of independent operations into the same microinstruction as an important technique for speeding up a microprogram. Often, however, we would like to fold the emulation of two successive instructions, deferring some of the work required to finish emulation of one instruction into the execution of its successor, where we hope for unused resources. This cannot be done in the usual way, since we have no a priori information about what instruction comes next. However, there is a simple trick (due to Ed Fiala) which makes it possible in many common cases.

We define for an entire instruction set a small number n of cleanup actions which may be forwarded to the next instruction for completion; on the Dorado up to four are possible, but one must usually be the null action. For each dispatch address we had before, we now define n separate ones, one for each cleanup action. Thus if there were D addresses to which an IFUJump might dispatch, there are now nD. At each one, there must be microcode to do the proper cleanup action in addition to the work required to emulate the current instruction. The choice of cleanup action is specified by the microcode for the previous instruction; to make this convenient, the Dorado actually has four kinds of IFUJump operations (written IFUJump[i] for i=0, 1, 2, 3), instead of the one described above. The two bits thus supplied are ORed with the dispatch address supplied by the IFU to determine the microinstruction to which control should go. To avoid any assumptions about which pairs of successive instructions can occur, all instructions in the same instruction set must use the same cleanup actions and must be prepared to handle all the cleanup actions. In spite of this limitation, measurements show that forwarding saves about 8% of the execution time in straight-line code (see Section 6.4); since the cost is very small, this is a bargain.
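The address arithmetic behind this scheme is simple enough to show directly. In the C sketch below (with an invented address width, and assuming the two cleanup bits occupy the low end of the address), the bits named in the previous instruction's IFUJump[i] are ORed into the dispatch address the IFU supplies for the next instruction, selecting one of the n = 4 entry points.

/* Sketch of dispatch-address selection with forwarding: base address ORed with
   the cleanup action chosen by the previous instruction's IFUJump[i]. */
#include <stdio.h>

static unsigned next_microinstruction(unsigned ifuDispatch, unsigned cleanup) {
    /* ifuDispatch is assumed to have its low two bits zero, so cleanup = 0..3
       simply selects one of four adjacent entry points. */
    return ifuDispatch | (cleanup & 3);
}

int main(void) {
    unsigned pushLocalVar = 0x140;               /* invented base dispatch address */
    for (unsigned i = 0; i < 4; i++)
        printf("cleanup %u: control goes to the entry point at %#x\n",
               i, next_microinstruction(pushLocalVar, i));
    return 0;
}

Under this (assumed) encoding, the nD entry points of the text are simply D groups of four adjacent microinstruction addresses.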
We illustrate this feature by modifying the implementation of PushLocalVar given above, to show how one instruction's memory fetch operation can be finished by its successor, reducing the cost of a PushLocalVar from two microinstructions to one. We use two cleanup actions. One is null (action 0), but the other (action 2) finds the top of the stack not on the hardware stack but in the memoryData register. Thus, any instruction can leave the top of stack in memoryData and do an IFUJump[2]. Now the microcode looks like this:

PushLocalVar[0]:
    Fetch[IFUData], IFUJump[2];                  -- this entry point assumes normal stack, and leaves top of stack in memoryData.
PushLocalVar[2]:
    Push[memoryData], Fetch[IFUData], IFUJump[2];  -- this entry point assumes top of stack is in memoryData and leaves it there.

In both cases, the microcode executes IFUJump[2], since the top of stack is left in the memoryData register, rather than on the stack as it should be. In the case of PushLocalVar[2], the previous instruction has done the same thing. Thus, the microcode at this entry point must move that data into the stack at the same time it makes the memory reference for the next stack value. The reader can see that successive Push instructions will do the right thing. Of course there is a payoff only because the first microinstruction of PushLocalVar[0] is not using all the resources of the processor.

It is instructive to look at the code for Add with this forwarding convention:

Add[0]:
    temp _ Pop[];                                -- this entry point assumes and leaves normal stack
    StackTop _ StackTop+temp, IFUJump[0];
Add[2]:
    StackTop _ StackTop+memoryData, IFUJump[0];  -- this entry point assumes top of stack is in memoryData, leaves normal stack.

This example shows that the folding enabled by forwarding can actually eliminate data transfers which are necessary in the unfolded code. At Add[2] the second operand of the Add is not put on the stack and then taken off again, but is sent directly to the adder. The common data bus of the 360/91 [15] obtains similar, but more sweeping, effects at considerably greater cost. It is also possible to do a cleanup after a NotReady dispatch; this allows some useful work to be done in an otherwise wasted cycle.

4.4 Conditional branches

We conclude our discussion of IFU-processor interactions, and give another example of forwarding, with the example of a conditional branch instruction. Suppose that there is a BranchNotZero instruction that takes the branch if the current top of the stack is not zero. Assume that its decoding table entry tells the IFU to follow the branch, and specifies the instruction length as the first IFUData value. Straightforward microcode for the instruction is:

BranchNotZero:      -- IFU jumps come here. IFU assumed result#0.
    if stack=0 then goto InsFromIFUData, Pop;    -- Test result in this microinstruction.
    IFUJump;                                     -- Result was non-zero, IFU did right thing.
InsFromIFUData:     -- Result was zero. Do the instruction at PC+IFUData.
    temp _ PC+IFUData;                           -- PC should be PC+Instruction length.
    PC _ temp;                                   -- Redirect the IFU
    IFUJump;                                     -- This will be dispatched to NotReady, where the code will loop until the IFU refills starting at the new location.

The most likely case (the top of the stack non-zero) simply makes the test specified by the instruction and does an IFUJump (two cycles).
If the value is zero (the IFU took the wrong path), the microcode computes the correct value for the new PC and redirects the IFU accordingly (four cycles, plus the IFU's latency of five cycles; guessing wrong is painful). If we think that BranchNotZero will usually fail to take the branch, we can program the decoding table to treat it as an ordinary instruction and deliver the branch displacement as IFUData, and reverse the sense of the test.

A slight modification of the forwarding trick allows further improvement. We introduce a cleanup action (say action 1) to do the job of InsFromIFUData above (it must be action 1 or 3, since a successful test in the Dorado ORs a 1 into the next microinstruction address). Now we write the microcode (including for completeness the action 2 of Section 4.3):

BranchNotZero[0]:   -- IFU jumps come here. Expect result#0.
    Test[stack=0], Pop, IFUJump[0];              -- Test result in this microinstruction; if the test succeeds, we do IFUJump[1].
BranchNotZero[2]:
    Test[memoryData=0], IFUJump[0];
EveryInstruction[1]:  -- Branch was wrong. Do the instruction at PC+IFUData.
    temp _ PC+IFUData;
    PC _ temp;                                   -- Redirect the IFU
    IFUJump[0];                                  -- This will be dispatched to NotReady, where the code will loop until the IFU refills starting at the new location.

Now a branch which was predicted correctly takes only one microinstruction. For this to work, the processor must keep the IFU from advancing to the next instruction if there is a successful test in the IFUJump cycle. Otherwise, the PC and IFUData of the branch instruction would be lost, and the cleanup action could not do its job. Note that the first line at EveryInstruction[1] must be repeated for each distinct dispatch address; all these can jump to a common second line, however.

5. Implementation

In this section we describe the implementation of the Dorado IFU in some detail. The primary focus of attention is the pipeline structure, discussed within the framework established in Sections 2 and 3.3, but in addition we give (in Section 5.4) the format of the decoding table, which defines how the IFU can be specialized to the needs of a particular instruction set. Figure 3 gives the big picture of the pipeline. Table 1 summarizes the characteristics of each stage; succeeding subsections discuss each row of the table in turn. The first row gives the properties of an ideal stage, and the rest of the table describes departures from this ideal. This information is expanded in the remainder of this section; the reader may wish to use the table to compare the behavior of the different stages.

The entire pipe is synchronous, running on a two-phase clock which defines a 60 ns cycle; some parts of the pipe use both phases and hence are clocked every 30 ns. An "ideal" stage is described by the first line of the table. There is a buffer following each stage which can hold one item (b=1), and may be empty (represented by an empty flag); this is also the input buffer for the next stage. The stage takes an item from its input buffer every cycle (tinput=1) and delivers an item to its output buffer every cycle (toutput=1); the item taken is the one delivered (l=1).
The buffer is loaded on the clock edge which defines the end of one cycle and the start of the next. The stage handles an item if and only if there is space in the output buffer for the output at the end of the cycle; hence if the entire pipe is full and an item is taken by the processor, every stage will process an item in that cycle. This means that information about available buffer space must propagate all the way through the pipe in one cycle. Furthermore, this propagation cannot start until it is known that the processor is accepting the item, and it must take account of the various irregularities which allow a stage to accept an item without delivering one or vice versa. Thus, the pipe has global control. Note that a stage delivers an output item whether or not its input buffer is empty; if it is, the special empty item is delivered. Thus the space bookkeeping is done entirely by counting empty items.

Implementing global control within the available time turned out to be hard. It was considered crucial because of the minimal buffering between stages. The alternative, much easier approach is local control: deliver an item to the buffer only if there is space for it there at the start of the cycle. This decouples the different stages completely within a cycle, but it means that if the pipe is full (best case) and the processor suddenly starts to demand one instruction per cycle (worst case), the pipe can only deliver at half this rate, even though each stage is capable of running at the full rate;

Stage | Size | Input | Output | Reset | Remarks
"ideal" |  | t=1; takes one item if output is possible | t=l=1; delivers one item if buffer will be empty; b=1 | Clears buffer to empty on PC_ . . . | All state is in the buffer after the stage.
ADDRESS | word | No input | Not if paused, MAR contention, or mem busy; OK if space in any later buffer. | and jump; also accepts new PC value | Pass PC by incrementing; a source, hence has state (PC).
MEMORY | word | Internal complications | l>2; output is unconditional; b=2 | and jump; discards output of fetches in progress | Must enforce FIFO; not really part of IFU; has state of 0-2 fetches in progress.
BYTES | byte | t=.5 | t=l=.5 | and jump | Break byte feature.
DECODE | instr | t>.5; rate depends on instruction length |  | only | Recycling to vary rate; splits beta byte; encodes exceptions; does jumps.
DISPATCH | instr | On IFUJump |  | only | NotReady is default delay; IFUHold is panic delay.
EXECUTE | byte | On IFUData | No output buffer | Reset unnecessary | 

Table 1: Summary of the pipeline stages
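The rule that drives the global control described above Table 1 can be written down compactly: a stage advances exactly when its output buffer is empty or is being emptied this same cycle by its successor (ultimately by the processor), so the decision ripples backward through the whole pipe within one cycle. The C sketch below computes those decisions for a made-up buffer state; it models only the rule, not the Dorado's control logic or the irregularities noted in the table.

/* Sketch of the global-control decision for one cycle.  Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define STAGES 6                      /* ADDRESS .. EXECUTE, say */

int main(void) {
    /* outEmpty[i]: is the buffer following stage i empty at the start of the cycle? */
    bool outEmpty[STAGES] = { false, false, true, false, false, false };
    bool processorTakes = false;      /* does the processor accept an item this cycle? */
    bool advance[STAGES];

    advance[STAGES - 1] = outEmpty[STAGES - 1] || processorTakes;
    for (int i = STAGES - 2; i >= 0; i--)
        advance[i] = outEmpty[i] || advance[i + 1];   /* space will appear behind this stage */

    for (int i = 0; i < STAGES; i++)
        printf("stage %d %s\n", i, advance[i] ? "advances" : "waits");
    return 0;
}

With the processor taking nothing and a single empty buffer after stage 2, the hole lets stages 0 through 2 move up while the later stages wait; if processorTakes were true, every stage would advance, which is the full-pipe case described in the text.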
problem can be attacked by introducing a sub-stage within DECODE; unfortunately, this delays the reading of the decode table by half a cycle, so that its output is not available together with the alpha byte. To solve the problem it is necessary to provide a second output buffer for BYTES, and to feed back its contents into the main buffer if the instruction turns out to be only one byte long, as in Figure 7c. Some care must be taken to keep the PCs straight. This ugly backward dependency seems to be an unavoidable consequence of the variable-width items.

In fact, a three-byte instruction is not handled exactly as shown in Figure 7. Since the bandwidth of BYTES prevents it from being done in one cycle anyway, space is saved by breaking it into two sub-instructions, each two bytes long; for this purpose a dummy opcode byte is supplied between alpha and beta. Each sub-instruction is treated as an instruction item. The second one contains beta and is slightly special: DECODE ignores its dummy opcode byte and treats it as a two-byte instruction, and DISPATCH passes it on to EXECUTE after the alpha byte has been delivered.

6. Performance

The value of an instruction fetch unit depends on the fraction of total emulation time that it saves (over doing instruction fetching entirely in microcode). This in turn clearly depends on the amount of time spent in executing each instruction. For a language like Smalltalk-76 [5], a typical instruction requires 30-40 cycles for emulation, so that the half-dozen cycles saved by the IFU are not very significant. At the other extreme, an implementation language like Mesa [9, 11] is compiled into instructions which can often be executed in a single cycle; except for function calls and block transfers, no Mesa instruction requires more than half a dozen cycles. For this reason, we give performance data only for the Mesa emulator.

The measurements reported were made on the execution of the Mesa compiler, translating a program of moderate size; data from a variety of other programs is very similar. All the operating system functions provided in this single-user system are included. Disk wait time is excluded, since it would tend to bias the statistics. Some adjustments to the raw data have been made to remove artifacts caused by compatibility with an old Mesa instruction set. Time spent in the procedure call and return instructions (about 15%) has been excluded; these instructions take about 10 times as long to execute as ordinary instructions, and hence put very little demand on the IFU.

The Dorado has a pair of counters which can record events at any rate up to one per machine cycle. Together with supporting microcode, these counters provide sufficient precision that overflow requires days of execution. It is possible to count a variety of interesting events; some are permanently connected, and others can be accessed through a set of multiplexors which provide access to several thousand signals in the machine, independently of normal microprogram execution.
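A back-of-the-envelope check on the "days of execution" remark, under invented assumptions (the paper does not give the counter width): say the hardware counter is 16 bits wide, counts at most one event per 60 ns cycle, and is extended by microcode to a 48-bit total.

/* Rough arithmetic for a microcode-extended event counter.  The widths are assumptions. */
#include <stdio.h>

int main(void) {
    const double cycleNs = 60.0;
    double hwWrapMs      = (double)(1ULL << 16) * cycleNs / 1e6;             /* about 3.9 ms  */
    double totalWrapDays = (double)(1ULL << 48) * cycleNs / 1e9 / 86400.0;   /* about 195 days */
    printf("16-bit hardware counter wraps after %.1f ms at one event per cycle\n", hwWrapMs);
    printf("48-bit extended count wraps after about %.0f days\n", totalWrapDays);
    return 0;
}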
6.1 Performance limits

The maximum performance that the IFU can deliver is limited by certain aspects of its implementation; these limitations are intrinsic, and do not depend on the microcode of the emulator or on the program being executed. The consequences of a particular limitation, of course, depend on how frequently it is encountered in actual execution.

Latency: after the microcode supplies the IFU with a new PC value, an IFUJump will go to NotReady until the fifth following cycle (in a few cases, until the sixth cycle). Thus there are at least five cycles of latency before the first microinstruction of the new instruction can be executed. Of course, it may be possible to do useful work in these cycles. This latency is quite important, since every instruction for which the IFU cannot compute the next PC will pay it; these are wrongly guessed conditional branches, indexed branches, subroutine calls and returns, and a few others of negligible importance.

A branch correctly executed by the IFU causes a three-cycle gap in the pipeline. Hence if the processor spends one cycle executing it and each of its two predecessors, it will see three NotReady cycles on the next IFUJump. Additional time spent in any of these three instructions, however, will reduce this latency, so it is much less important than the other.

Bandwidth: In addition to these minimum latencies, the IFU is also limited in its maximum throughput by memory bandwidth and its limited buffering. A stream of one-byte instructions can be handled at one per cycle, even with some processor references to memory. A stream of two-byte instructions, however (which would consume all the memory bandwidth if handled at full speed), results in 33% NotReady even if the processor makes no memory references. The reason is that the IFU cannot make a reference in every cycle, because its buffering is insufficient to absorb irregularity in the processor's demand for instructions. As we shall see, these limitations are of small practical importance.

6.2 NotReady dispatches

Our measurements show that the average instruction takes 3.1 cycles to execute (including all IFU delays). Jumps are 26% of all instructions, and incorrectly predicted jumps (40% of all conditional jumps) are 10%. The average non-jump instruction takes 2.5 cycles.

The performance of the IFU must be judged primarily on the frequency with which it fails to satisfy the processor's demand for an instruction, i.e., the frequency of NotReady dispatches. It is instructive to separate these by their causes:

    latency,
    cache misses by the IFU,
    dearth of memory bandwidth,
    insufficient buffering in the IFU.

The first dominates with 16% of all cycles, which is not surprising in view of the large number of incorrectly predicted jumps. Note that since these NotReady cycles are predictable, unlike all the others, they can be used to do any background tasks which may be around.

Although the IFU's hit rate is 99.7%, the 25-cycle cost of a miss means that 2.5% of all cycles are NotReady dispatches from this cause. This is computed as follows: one cycle in three is a dispatch, and .3% of these must wait for a miss to complete. The average wait is near the maximum, unfortunately, since most misses are caused by resetting the IFU's PC. This yields 33% of .3%, or .1%, times 25, or 2.5%.

The other causes of NotReady account for only 1%. This is also predictable, since more than half the instructions are one byte, and the average instruction makes only one memory reference in three cycles. Thus the average memory bandwidth available to the IFU is two words, or three instructions, per instruction processed, or about three times what is needed. Furthermore, straight-line instructions are demanded at less than half the peak rate on the average, and jumps are so frequent that when the first instruction after a jump is dispatched, the pipe usually contains half the instructions that will be executed before the next jump.
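The 2.5% figure above can be reproduced directly from the numbers in the text: one cycle in three is a dispatch, 0.3% of dispatches wait for an IFU miss, and the wait is near the maximum of about 25 cycles.

/* Recomputing the NotReady cycles attributed to IFU misses from the figures above. */
#include <stdio.h>

int main(void) {
    double dispatchFraction = 1.0 / 3.0;     /* fraction of all cycles that are dispatches */
    double missPerDispatch  = 0.003;         /* 0.3% of dispatches wait for a miss         */
    double missWaitCycles   = 25.0;          /* near-maximum cost of a miss                */
    double notReady = dispatchFraction * missPerDispatch * missWaitCycles;
    printf("NotReady cycles due to IFU misses: %.1f%% of all cycles\n", 100.0 * notReady);
    return 0;
}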
6.3 Memory bandwidth

As we have seen, there is no shortage of memory bandwidth, in spite of the narrow data path between the processor and the IFU. Measurements show that the processor obtains a word from the memory in 16% of the cycles, and the IFU obtains a word in 32% of the cycles. Thus data is supplied by the memory in about half the cycles. The processor actually shuts out the IFU by making its own reference about 20% of the time, since some of its references are rejected by the memory and must be retried. The IFU makes a reference for each word transferred, and makes unsuccessful references during its misses, for a total of 35%. There is no memory reference about 45% of the time.

6.4 Forwarding

The forwarding trick saves a cycle in about 25% of the straight-line instructions, and hence speeds up straight-line execution by 8%. Jumps take longer and benefit less, so the speed-up within a procedure is 5%. Like the IFU itself, forwarding pays off only when instructions are executed very quickly, since it can save at most one cycle per instruction.

6.5 Size

A Dorado board can hold 288 standard 16-pin chips. The IFU occupies about 85% of a board; these 240 chips are devoted to the various stages as shown in Table 4.

Function              Chips    %
ADDRESS-BYTES         40       17
DECODE                86       35
DISPATCH              24       10
EXECUTE               18       8
Processor interface   27       11
Clocks                18       8
Testing               27       11

Table 4: Size of various parts of the IFU

In addition, about 25 chips on another board are part of MEMORY and BYTES. The early stages are mostly devoted to handling several PC values. DECODE is large because of the decoding table (27 RAM chips) and its address drivers and data registers, as well as the branch address calculation.

Table 5 shows the amount of microcode in the various emulators, and in some functions common to all of them. In addition, each emulator uses one quarter of the decode table. Of course they are not all resident at once.

System               Words    Comments
Mesa                 1300
Smalltalk            1150
Lisp                 1500
Alto BCPL            700
I/O                  1000     Disk, keyboard, regular and color display, Ethernet
Floating point       300      IEEE standard; there is no special hardware support
Bit block transfer   270

Table 5: Size of various emulators

Acknowledgements

The preliminary design of the Dorado IFU was done by Tom Chang, Butler Lampson and Chuck Thacker. Final design and checkout were done by Will Crowther and the authors. Ed Fiala reviewed the design, did the microassembler and debugger software, and wrote the manual. The emulators mentioned were written by Peter Deutsch, Willie-Sue Haugeland, Nori Suzuki and Ed Taft.
References

1. Clark, D.W. et al. The memory system of a high performance personal computer. Technical Report CSL-81-1, Xerox Palo Alto Research Center, January 1981. Revised version to appear in IEEE Transactions on Computers.
2. Connors, W.D. et al. The IBM 3033: An inside look. Datamation, May 1979, 198-218.
3. Deutsch, L.P. A Lisp machine with very compact programs. Proc. 3rd Int. Joint Conf. Artificial Intelligence, Stanford, 1973, 687-703.
4. Ibbett, R.N. and Capon, P.C. The development of the MU5 computer system. Comm. ACM 21, 1, Jan. 1978, 13-24.
5. Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles of Programming Languages, Tucson, Jan. 1978, 9-16.
6. Intel Corp. MCS-86 User's Manual, Feb. 1979.
7. Knuth, D.E. An empirical study of Fortran programs. Software-Practice and Experience 1, 1971, 105-133.
8. Lampson, B.W. and Pier, K.A. A processor for a high performance personal computer. Proc. 7th Int. Symp. Computer Architecture, SigArch/IEEE, La Baule, May 1980, 146-160. Also in Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981.
9. Mitchell, J.G. et al. Mesa Language Manual. Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.
10. Russell, R.M. The CRAY-1 computer system. Comm. ACM 21, 1, Jan. 1978, 63-72.
11. Tanenbaum, A.S. Implications of structured programming for machine architecture. Comm. ACM 21, 3, March 1978, 237-246.
12. Teitelman, W. Interlisp Reference Manual. Xerox Palo Alto Research Center, Oct. 1978.
13. Thacker, C.P. et al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek, Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August 1979.
14. Thornton, J.E. The Control Data 6600, Scott, Foresman & Co., New York, 1970.
15. Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. R&D 11, 1, Jan. 1967, 25-33.
16. Anderson, D.W. et al. The System/360 Model 91: Machine philosophy and instruction handling. IBM J. R&D 11, 1, Jan. 1967, 8-24.
17. Widdoes, L.C. The S-1 project: Developing high performance digital computers. Proc. IEEE Compcon, San Francisco, Feb. 1980, 282-291.