The Dragon Computer System
An Early Overview
Edward M. McCreight
Draft of December 7, 1984
Release as [Indigo]<Dragon>Documentation>DragonOverview.tioga, .press

© Copyright 1984 Xerox Corporation. All rights reserved.
Abstract: Dragon is a 32-bit computer system being designed at Xerox PARC. This document is an introduction to its architecture. A version of this document is being presented as a half-day lecture at the NATO Advanced Study Institute on Microarchitecture of VLSI Computers in Urbino, Italy, July 9-20, 1984, and as a chapter of the book that customarily issues from these Institutes.
Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
Contents
Introduction
Key Ideas of Dragon
Multiprocessing, faster execution, code density
The M-bus
Caches, Virtual-to-Real Mapping
Dragon Processors
Cedar
Procedures and local frames, Strong typing and "Safety", Modules and Procedure Calls, Processes
Conclusions
Acknowledgments
Bibliography
Figures
Introduction
Dragon is a new computer system being designed at the Xerox Palo Alto Research Center. In many ways Dragon is the newest descendant in the line of research personal computers that began in 1973 with the Alto and ends today in the Dorado. Like those machines, Dragon will be a personal computer. To us, this means a high-performance display controller and a virtual memory with only rudimentary protection hardware. Dragon and Dorado both execute programs compiled from the Cedar language. (Cedar is an extension of the Xerox Mesa language, which is a derivative of Pascal.) We plan to build modest numbers of Dragons for ourselves to allow us to explore useful ways of exploiting such performance in a personal setting before such machines are economical in widespread use.
Dragon differs in significant ways from its predecessors. Most importantly, it is a tightly-coupled multiprocessor. We see multiprocessing as an important wave of the future, and we need to understand its implications for software design. Dragon is also a 32-bit machine, both in its arithmetic path and in its virtual addresses. It has a new instruction set, considerably reduced in semantic complexity from previous machines. Dragon can handle 16-bit data, but at a noticeable penalty in code size and execution speed compared with 32-bit data. These facts will encourage our systems programmers to eliminate vestiges of 16-bit-ness that survive from Alto days. Dragon is also our first machine to include custom integrated circuits; previous machines were built of commercial integrated circuits.
Dragon is not even close to being finished. Our logic simulations are about two-thirds complete, and floor planning and layout have begun on some chips, but modest architectural refinements are still a weekly event. We can fairly claim to have made a solid beginning, but a mountain of hard work remains. We hope by publishing this paper to stimulate discussions that could make Dragon a better machine.
Key Ideas of Dragon
Our present workhorse machine, the Dorado, is a personal computer in the same sense that a Rolls-Royce is a personal automobile. It is expensive, power-hungry, and noisy. Those facts alone would be enough to make us want a better machine. But we also had some new ideas that we wanted to investigate through implementation.
Multiprocessing
The first idea is tightly-coupled MIMD (multiple instruction streams operating on multiple data streams) multiprocessing. As we considered the declining fraction of total system cost represented by a single processor, the slow advance in processor circuit speed, and the knee in every processor speed vs. cost curve, it became apparent to us that the most straightforward way to obtain an order-of-magnitude increase in processing power was not to attempt an order-of-magnitude increase in circuit speed or architectural complexity, but rather simply to use multiple processors.
This approach has two major drawbacks. The first is that as the number of processors increases, so does the difficulty of routing information among them in general ways. If the processors are all attached to a common shared memory bus, then access to that bus becomes the resource that limits system performance. A partial solution to this problem is to interpose a cache memory between every processor and the shared memory bus. Until recently this solution has suffered from the "cache consistency problem." When a processor stores new data into its cache, should that store always appear on the shared memory bus? If so, serious contention results, because typically one-third of memory operations are stores. If not, then what prevents two caches from having different opinions about the contents of the shared memory? Fortunately a new architectural element, the two-port cache, reduces each processor's average need to store to the shared memory bus, thereby allowing modest numbers of processors to coexist without crippling interference.
The second drawback is more serious, especially in a personal computer. To take advantage of Dragon-style multiprocessing, there must usually be several processes ready to run, and the communication and synchronization among them must be modest. While this is the normal case for a heavily-loaded timesharing system, maintaining many ready processes in a personal computer requires breaking up compute-bound programs such as compilers (and many others) into multiple processes. Our existing compilers, for example, were written as single processes because we knew that they would be executed on single-processor systems. To have written them otherwise would have complicated the programming task and made the compiler run more slowly.
Dragon introduces us to two interesting software problems. First, how should we adapt existing software for the new multiprocessing hardware? Second, how should the coming wave of multiprocessor hardware change how we write programs, or how we think about programming? We at Xerox suppose that it is harder to write programs as multiple processes than as single processes, but that supposition is based more on second-hand reports than on first-hand experience. We intend to use Dragon as a tool for investigating how Dragon-style multiprocessing can be used to best advantage.
Faster execution
Speed isn't everything, but it's kilometers ahead of whatever is in second place. Processor speed can be put to use directly in those few places in a program where it's needed, and in other places traded away for clarity of expression, compactness of representation, an extra level of interpretation, delayed binding, garbage collection, or a host of other worthwhile things. We hoped to improve the speed of Dragon, compared with our earlier machines, primarily in two areas. The first is serial execution of basic blocks, where pipelining and rich pipeline bypass paths reduce the average number of cycles per instruction. The goal, like that of the IBM 801, was a simple instruction set and a storage hierarchy that would permit the machine to execute an instruction nearly every machine cycle. The second area of improvement is the procedure call, where simplified parameter passing, a cached control stack, and elimination of several indirections in common cases would provide a dramatic speed improvement over our earlier implementations.
Code density
One of the most notable achievements of the Mesa machine instruction set, implemented in microcode on such machines as the Dorado and the Xerox Star, was the compactness of compiled Mesa programs. As one extreme example, Baskett's Puzzle benchmark compiled into only one-fourth as many code bytes in an early version of the Mesa instruction set as in the IBM 370/4331 instruction set. This compactness has two benefits. The most obvious benefit is that a given program could be stored in fewer dollars' worth of memory. That benefit will remain significant for some time to come, since the declining cost per byte of memory has been nearly balanced by the need to store more ambitious programs. A less obvious benefit is that the smaller a program is, the less bandwidth is required to carry it from one place to another: from disk to memory, from memory to cache, from cache to instruction decoder.
Our challenge was to make the Dragon go faster, and to deal with a much larger address space, without throwing away the compactness that the Mesa architects had worked so hard to achieve.
The M-bus
Figure 1 shows an overview of a typical Dragon system. The central data pathway is the M-bus. We may eventually design Dragon systems with more than one M-bus, but for now there is one M-bus in a system. The M-bus is a synchronous time-multiplexed bus that carries data among the processor caches, the main memory, the map processor, and the display controller. M-bus mastership is assigned by a single round-robin arbiter. M-bus transactions are time-sequenced on a 32-bit address/data bus as dictated by the operation specified by the current M-bus master. Addresses on the M-bus are real (not virtual), and there are separate 32-bit address spaces for I/O and memory.
The following M-bus transactions are defined. Each line in the descriptions below takes one machine cycle. Master commands are in boldface. Slave responses are in lightface. Address/data bus contents are in italics. Flow control is the insertion of optional wait states in the protocol by a slave that needs more time; w refers to the number of wait states inserted.
i. ReadQuad reads four words from memory or a willing cache. The transport order is cyclic, beginning with the requested word, so the requesting processor can continue after cycle w+1. (A short sketch of this ordering appears after the transaction list.)
0: ReadQuad address
{anything but ReadQuadReady}*
w+1: ReadQuadReady data[address]
w+2: data[address - (address MOD 4) + ((address+1) MOD 4)]
w+3: data[address - (address MOD 4) + ((address+2) MOD 4)]
w+4: data[address - (address MOD 4) + ((address+3) MOD 4)]
ii. WriteQuad writes four words to memory. The address must be zero MOD 4. If the memory asserts WriteQuadReady in cycle 4 as shown, then the next transaction can commence in the next cycle. Otherwise the next transaction waits until the cycle after the memory asserts WriteQuadReady. Flow-controlling after data transport requires the memory to have a four-word write data buffer, but it allows more time for the memory to decide whether wait states are needed.
0: WriteQuad address
1: data[address]
2: data[address+1]
3: data[address+2]
4: WriteQuadReady data[address+3]
iii. WriteSingle broadcasts a single word in the memory address space. The master cache does this because it wants to change a word whose Shared flag (to be described shortly) is set, indicating that a copy of that word may exist in other caches. This word is not written to memory, and there is no flow control.
0: WriteSingle address
1: data[address]
iv. IORead reads a single word in the I/O address space.
0: IORead address
{anything but IOReadDone}*
w+1: IOReadDone data
v. IOWrite writes a single word in the I/O address space. If the addressed Slave responds with IOWriteDone in cycle 1 as shown, then the next transaction can commence in the next cycle. Otherwise the next transaction waits until the cycle after the Slave responds.
0: IOWrite address
1: IOWriteDone data
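
The cyclic transport order of ReadQuad is easy to state in code. The following C fragment illustrates only the ordering rule, not any Dragon implementation; the function name is ours.

    #include <stdio.h>

    /* Which word ReadQuad transports in each data cycle: the requested
       word first, then the rest of its quad in cyclic order. */
    unsigned quadWord(unsigned address, unsigned cycle /* 0..3 */)
    {
        unsigned base = address - (address % 4);   /* first word of the quad */
        return base + ((address + cycle) % 4);     /* cyclic order */
    }

    int main(void)
    {
        /* For address 0x1002 the order is 0x1002, 0x1003, 0x1000, 0x1001. */
        for (unsigned c = 0; c < 4; c++)
            printf("cycle w+%u: data[0x%X]\n", c + 1, quadWord(0x1002, c));
        return 0;
    }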
Interprocessor atomicity is implemented by allowing the M-bus master to lock the arbiter. There is no automatic timeout, so processors (or caches) must be designed so they cannot hold the lock too long.
Caches
Dragon caches, like other caches, have two related purposes. A cache provides its processor with faster access to its most frequently accessed data than the memory system alone could provide. This is doubly effective in Dragon's case, because even when a Dragon cache does not contain the data requested by its processor, it can often implement by itself the virtual-to-real mapping that is required in order to find that data. It can do this because our solution to the cache consistency problem requires that a Dragon cache entry contain not only virtual address and data, but real page number as well, which is nearly all the information in the mapping table entry. By adding the rest of the mapping information to each cache entry, we enabled a cache to map any virtual address that lies in the same page as some entry it already contains.
The second purpose of the caches is to reduce the M-bus bandwidth needed by a single processor. This is the only way that modest numbers of processors can share a common M-bus without crippling mutual interference.
Cache Architecture
A Dragon cache intermediates between a processor's P-bus (the cache's "front" door) and the system's common M-bus (the cache's "back" door). A cache contains 64 associative entries, each of which comprises the virtual page number, the real page number, the address of the block within the page, the block of 16 32-bit words of data, and various property bits for the page and for each group of four words within the block.
A cache's P-bus consists of a 32-bit phase-multiplexed virtual address/data field, a parity bit, a command field, a ready bit, and an error field. There is only one processor on a P-bus, and it is always master. One or more caches, and the floating-point accelerator chip(s), are slaves on the P-bus. A P-bus slave decodes its instructions from a combination of the P-bus command and virtual address fields. Commands to caches include { Normal | Locked | IO }{ Read | Write }. A slave responds by setting the ready bit and the error field. When a cache contains data being read or written, it asserts ready in the phase following the read or write operation, so that in the best (and, we hope, most likely) case, a cache read or write takes one machine cycle. Cache errors include PageFault and WriteProtectFault.
At the moment, the cache replacement algorithm replaces entries in strictly sequential order without regard to whether their data is dirty, although this may change. A cache's command decoding can be restricted to a subset of virtual addresses via a bit-serial initial setup path. This mechanism allows multiple cache chips to be attached to a P-bus, thereby implementing a larger but set-associative cache, and improving the hit ratio.
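
For concreteness, here is one way to render a cache entry's contents as a C structure. The field names and widths are our guesses for illustration; the actual chip layout is certainly different.

    #include <stdint.h>

    /* A reader's model of one Dragon cache entry, per the description
       above; names and widths are illustrative guesses. */
    typedef struct {
        uint32_t virtualPage;    /* virtual page number                    */
        uint32_t realPage;       /* real page number (mapping + snooping)  */
        uint32_t blockInPage;    /* address of this 16-word block in page  */
        uint32_t data[16];       /* the cached block                       */
        uint8_t  quadFlags[4];   /* property bits per four-word group      */
        uint8_t  pageFlags;      /* property bits for the page             */
    } CacheEntry;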
Cache Consistency
A system of caches is consistent, or coherent, if all copies of the value in each memory location are identical. An obvious way to maintain consistency is to broadcast all stores to the M-bus. Since typically one-third of data references are stores, at best this solution reduces the M-bus bandwidth requirements of a processor by a factor of three, independent of cache size. This factor is not large enough for Dragon.
A better solution is to broadcast on the M-bus only those stores that affect data existing in more than one cache. Dashiell and Thacker at Xerox were among the first to discover a way to implement this solution, and Goodman at the University of Wisconsin discovered a similar one at about the same time. Both solutions depend on having a second associator in each cache, called the back-door associator, or the "snoopy" associator. This second associator monitors M-bus reads, and volunteers data that it knows. If the data comes from another cache, instead of from the main memory, then all caches involved mark their copies as shared. Any change to cache data that is marked shared is broadcast to the M-bus.
The Dragon implementation uses a SharedData line in the M-bus, and a Shared flag in each cache entry. During the ReadQuad M-bus transaction, if the real address of some entry in a non-master cache matches the real address on the M-bus, that cache sets that entry's Shared bit, asserts SharedData on the M-bus, and provides that entry's data during data transport. During WriteSingle (caused by a processor storing to a cache entry whose Shared flag is set), if the real address of some entry in a slave cache matches the real address on the M-bus, that cache asserts SharedData, and overwrites that entry's data during data transport. The master cache entry always loads its Shared flag from the value of the SharedData M-bus line. In this manner a cache is constantly aware of shared data, and eventually discovers when sharing stops.
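
These snooping rules can be summarized as the following C sketch of a slave cache's behavior. All names are ours, and the real logic is of course parallel hardware, not sequential code.

    #include <stdint.h>

    typedef struct {
        int      valid, shared;
        uint32_t realAddr;
        uint32_t data;
    } Entry;

    /* Slave-side snoop on a ReadQuad: match, mark shared, supply data. */
    int snoopRead(Entry *e, uint32_t realAddr, uint32_t *busData)
    {
        if (e->valid && e->realAddr == realAddr) {
            e->shared = 1;        /* the line is now known to be shared */
            *busData  = e->data;  /* assert SharedData, transport data  */
            return 1;
        }
        return 0;
    }

    /* Slave-side snoop on a WriteSingle: match, take the new value. */
    int snoopWrite(Entry *e, uint32_t realAddr, uint32_t busData)
    {
        if (e->valid && e->realAddr == realAddr) {
            e->data = busData;    /* overwrite our copy                 */
            return 1;             /* assert SharedData                  */
        }
        return 0;
    }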
Cache Performance
The ultimate goal is for the entire system to execute a set of programs as fast as possible. System performance is determined by the interaction of processors, caches, and M-bus, so we cannot discuss cache performance in isolation. Although we haven't talked much about a Dragon processor yet, we concentrate our attention on how our caches behave when attached to a Dragon processor, since we believe that that behavior will have the greatest influence on total system performance. Other contributions will come from other processors: the map processor, the display controller, and the I/O processor.
A Dragon processor attaches to two caches. It fetches its instructions from one, and fetches and stores its data in the other. We expect these two caches to differ significantly in such performance statistics as hit ratio and average machine cycles between references, because the processor uses the two caches quite differently. At the moment each Dragon cache is planned to contain 64 entries, each entry including a virtual address (aligned to a four-word boundary), a comparator, and four data words. That is a fairly small cache, and we are considering enlarging it, but for definiteness we will assume that size throughout the rest of this section.
Our own data about cache performance are from two sources. One is a cycle-by-cycle Dragon simulation. From this simulation we know quite a lot about the Dragon's performance on a few simple benchmark programs, notably Baskett's widely-circulated Puzzle. Some statistics for Puzzle are likely to apply as well for more "typical" programs (like compilers or text formatters), while others probably will not. We expect that Puzzle's average of 7 machine cycles per data cache reference and 1.8 machine cycles per instruction cache reference is fairly robust. On the other hand, we expect that cache hit ratios will vary considerably from benchmark to benchmark. As an extreme example, the Puzzle instruction cache hit ratio was very high because nearly the entire program fit within the instruction cache. One would not expect such behavior from a more complex program.
Ideally, cycle-by-cycle simulation would tell us every performance statistic we need. Practically, it suffers from two limitations. Our Dragon Cedar compiler is not ready, so not much Dragon code exists, and what does exist is hand-coded. Furthermore, even if we had plenty of interesting programs expressed as Dragon code, our simulator's speed limits the number of instructions we can simulate.
Our other performance data are measurements taken by N. Suzuki two years ago. He modified our standard Dorado microcode to collect statistics on Dragon cache performance. He pretended that the Dragon was executing the Dorado Mesa instruction set, and he collected statistics only for data references.
The two kinds of performance data make quite different assumptions, so the fact that they agree very closely on Puzzle's average data cache hit ratio (89.7% vs. 89.8%) should be regarded as accidental. Suzuki's data also indicates that the Mesa compiler (a "typical" program) has a data cache hit ratio that is 7% worse than Puzzle. This suggests that a "typical" program might achieve an average 83% data cache hit ratio. If we assume that a cache operation happens once every seven cycles, and that the average cache miss costs six cycles, then Dragon should run 14% slower on a "typical" program with our data cache than with an infinite data cache that never misses.
Performance will likely be worse for the instruction cache than for the data cache. Unfortunately, at the moment all our instruction cache performance data for "typical" programs is second-hand. J. Smith and J. Goodman measured average instruction cache hit ratios over three programs, NROFF, CACHE, and COMPACT, all compiled from C and running on a VAX-11 under UNIX. Their results indicate a 90% hit ratio for a Dragon cache. Based on their measurements, they argue that cache size is the most important factor in determining hit ratio, and that for caches the size of ours, random replacement is better than LRU or FIFO, and direct mapping is nearly as good as fully associative lookup. D. Patterson, et al., measured the hit ratio of a simulated RISC II executing the Portable C compiler with a cache similar to (but with a narrower entry than) ours. Their results indicated a hit ratio of 87%. This suggests that our instruction cache hit ratio will be somewhat better than our data cache hit ratio. On the other hand, we will likely spend many more cycles waiting for instruction cache misses than for data cache misses, because our simulation indicates that Dragon accesses the instruction cache almost four times as often as the data cache.
If our cache hit ratios prove to be unacceptably low, we have the board space and the architectural flexibility to attach two more cache chips to each processor. At the moment, we expect to achieve instruction/data cache balance (such that giving a new cache entry to either cache saves equal numbers of cycles) with twice as many entries for instructions as for data.
An important issue is how many Dragon processors an M-bus can support. The numbers above suggest that a single Dragon processor, with one cache chip for instructions and one for data, in steady state on a typical program, is likely to miss the cache about once every 13 machine cycles. Since a quadword read takes a minimum (and the average is probably quite near the minimum) of five machine cycles, the limit of system throughput is 2.6 processors' worth, independent of how many processors are available. According to Smith and Goodman's data, this limit would increase to 7.2 if a second cache chip for instructions were added to each processor. This is the main argument for increasing the size of Dragon's caches.
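
The arithmetic behind the last two estimates fits in a few lines of C; the inputs are exactly the figures quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Slowdown for a "typical" program, from the data-cache estimate. */
        double missRatio = 1.0 - 0.83;  /* 83% hit ratio                   */
        double missCost  = 6.0;         /* cycles per miss                 */
        double headway   = 7.0;         /* cycles between cache operations */
        printf("slowdown: %.1f%%\n", 100.0 * missRatio * missCost / headway);
        /* roughly 14%, the figure quoted above */

        /* How many processors one M-bus can serve before saturating. */
        double cyclesPerMiss = 13.0;    /* one miss per 13 cycles          */
        double busPerMiss    = 5.0;     /* a quadword read takes ~5 cycles */
        printf("processor limit: %.1f\n", cyclesPerMiss / busPerMiss);
        /* about 2.6 processors' worth of throughput                       */
        return 0;
    }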
Virtual-to-Real Mapping
As mentioned above, addresses on the M-bus are real. Caches can intercept and implement many of the virtual-to-real mapping operations required by their processors. But what happens when they can't?
When it needs to know map information, a cache does an IORead from an address that includes the virtual page number to be mapped. (This is one reason why we need a 32-bit I/O address space.) The map processor responds to the IORead by providing the map information. While formulating its response, the map processor may do with the M-bus as it pleases, as long as it does not assert IOReadDone until the response is ready.
This means that the real memory itself can hold the map table entries for those pages of virtual memory that are present in real memory. One idea is to organize the resident map as a hash table occupying a constant small fraction of real memory. There would be an entry in the map corresponding to each virtual page present in real memory. The hash function would probably be the low-order bits of virtual page number. Collisions would be resolved by rooting a balanced tree of map entries, keyed by virtual page number, at the site of a collision in the hash table. Other nodes of the balanced tree would be stored in otherwise empty locations in the hash table. This yields an expected number of probes arbitrarily close to 1, depending on how big and empty the hash table is. It also yields log (virtual page count/hash table size) probes in the worst case. Other ideas are also under consideration, and the prototype machine will probably have a map processor that implements the identity function.
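
To make the hash-table idea concrete, here is a sketch of the lookup path only, in C. It assumes each map entry carries the slot indices of its tree children, that the table size is a power of two, and that insertion and rebalancing (not shown) maintain the tree; all names are ours.

    #include <stdint.h>

    #define NIL 0xFFFFFFFFu

    typedef struct {
        uint32_t vpn;          /* virtual page number; NIL if slot empty */
        uint32_t rpn;          /* real page number plus map flags        */
        uint32_t left, right;  /* slot indices of tree children, or NIL  */
    } MapEntry;

    /* Probe the hash slot; on collision, walk the balanced tree rooted
       there, whose other nodes live in otherwise-empty slots. */
    MapEntry *mapLookup(MapEntry t[], uint32_t tableSize, uint32_t vpn)
    {
        uint32_t slot = vpn & (tableSize - 1);   /* low-order bits of vpn */
        while (slot != NIL && t[slot].vpn != NIL) {
            if (t[slot].vpn == vpn)
                return &t[slot];                 /* expected ~1 probe     */
            slot = (vpn < t[slot].vpn) ? t[slot].left : t[slot].right;
        }
        return 0;                                /* page not resident     */
    }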
It is sometimes necessary to notify all caches that the map has changed, and ChangeFlags is a special M-bus transaction for this purpose. ChangeFlags behaves like an IOWrite that carries no data and requires no acknowledgment. The most common use of ChangeFlags is to break the virtual/real association of a real page in all caches. This is done just before writing a page to disk, for example, to prevent processors from changing data as it is being written.
Suzuki's measurements indicate that about 3% of the data references of a typical program require a map transaction on the M-bus. If the map processor is designed to cache any significant fraction of recently active map entries, it will respond to most inquiries in two cycles. From this, and the average of 7 machine cycles per memory operation, we expect the average cost of mapping to be less than 1% in execution speed.
Dragon Processors
In previous sections we have dealt with the M-bus that binds Dragon together, and with the caches that attach to the M-bus. In this section, we deal with the processors, seen in figure 1, that attach to the caches. One of these is the I/O processor, a commercial microprocessor that interfaces the M-bus to a VME bus (an industry-standard 32-bit microcomputer bus) for dealing with purchased I/O controllers. One is the mapping processor, and one is the display controller. The remaining ones are identical Dragon processors, and they are the topic of this section.
Dragon Instruction Set
The Dragon processor has a new instruction set. This instruction set had to be considerably reduced in semantic complexity from previous machines, so that we could implement it in a fast pipelined two-chip engine, and handle the resulting inter-instruction pipeline hazards. At the same time, we wanted not to lose the code density that Mesa had achieved.
The variable state of a Dragon processor can be viewed as consisting of the following registers:
PC, a 32-bit byte program counter.
R, a 128-word x 32-bit operand stack cache.
L, a 7-bit register that indexes R. L is the base of a window of sixteen registers in R that are easy to address.
S, a 7-bit register that indexes R. S is the top of the expression evaluation stack in R. Many instructions implicitly (or explicitly) address R[S-2]..R[S+1], and increment S by [-2..1].
SLimit, a 7-bit register that is compared against S to know when the operand cache is getting close to full.
A, a 16-word x 32-bit group of easily-addressed global variables.
IStack, a control stack cache up to 16 elements long containing saved PC and L values.
a few other special-purpose registers that contain field descriptors, quotient or product, faulted memory addresses, whether the Reschedule interrupt is enabled, etc.
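
As a reader's model (not the chip's actual organization), this state can be pictured as a C structure; names follow the list above.

    #include <stdint.h>

    typedef struct {
        uint32_t pc;        /* 32-bit byte program counter                 */
        uint32_t r[128];    /* R: operand stack cache                      */
        uint8_t  l;         /* L: 7 bits, base of 16-register window in R  */
        uint8_t  s;         /* S: 7 bits, top of evaluation stack in R     */
        uint8_t  sLimit;    /* SLimit: compared with S to warn of overflow */
        uint32_t a[16];     /* A: easily-addressed globals                 */
        struct { uint32_t pc; uint8_t l; } iStack[16]; /* control stack cache */
        /* ...plus the special-purpose registers mentioned above           */
    } DragonState;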
Dragon instructions consist of a one-byte opcode and zero, one, two, or four bytes of additional operands. The opcodes now defined fall into several general categories:
Unconditional jumps or calls to { I | [S] }, returns.
Conditional relative jumps, decided by comparing [S] against { I | [S] | [L] | [A] }.
Memory reads
{ [S] | [L] | [A] } := [ { [S] | [L] | [A] }+I ]^
Memory writes
[ { [S] | [L] | [A] }+I ]^ := { [S] | [L] | [A] }
Densely-encoded operations that produce a new top-of-stack
[S] := { I | C | [L] | [S] op I | [S] op [S-1] }
Densely-encoded operations that copy top-of-stack to register window
[L] := [S].
General three-operand instructions with less density
{ [S] | [L] | [A] } := { [S] | [L] | [A] } op { [S] | [L] | [A] }
In this table, I stands for immediate data (of several lengths) within the instruction, and C is a group of 12 registers containing popular constants like 0 and -1. The notation { a | b } means "either alternative a or alternative b", [ a ] means the contents of register a, and a^ means the value in memory location a.
The remaining opcodes are called XOP's and treated as procedure calls to a vector of runtime support routines. Such a round-trip procedure call costs four machine cycles, so it is reasonable to implement many complex operations by XOP's instead of by special control within the processor, except for the purgative effects of their implementations on the instruction cache.
Dragon's stack cache R is similar to the stack cache in the C machine proposed by Ditzel, et al., and reminiscent of (but more flexible, efficient, and complicated than) that in the RISC. The most interesting (as estimated by the compiler) parameters, local variables, and temporaries of the several most-recently-called and still-active procedures of the executing process are kept within R. On overflow, the least recently used registers in R are flushed to memory, and on underflow, they are reloaded from memory. This overflow and underflow detection is automatic (using SLimit and IStack), and flushing and reloading is implemented by code in trap handlers.
A calling procedure can send parameters to a called procedure simply by pushing them on the stack. The first instruction of the called procedure loads L with S-(number of parameters-1), and the parameters become its first few local variables. The called procedure returns results to its calling procedure by storing them in its first few local variables. The Return instruction then loads S with L+(number of results-1), placing the results on top of the caller's stack.
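
The index arithmetic of this calling convention is small enough to show directly. A sketch in C, using the DragonState model above (overflow and underflow traps omitted):

    /* On procedure entry: the nParams words the caller pushed become
       the callee's first local variables. */
    void callEntry(DragonState *d, int nParams)
    {
        d->l = (uint8_t)((d->s - (nParams - 1)) & 0x7F);
    }

    /* On return: the callee's first nResults locals become the top
       of the caller's evaluation stack. */
    void returnExit(DragonState *d, int nResults)
    {
        d->s = (uint8_t)((d->l + (nResults - 1)) & 0x7F);
    }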
IStack is a cache for control as R is for data. It is also flushed and refilled from memory via trap handlers, and its purpose is to accelerate Call and Return instructions.
Conditional jumps come in two forms, one predicted to jump, and one predicted not to jump. The processor initially behaves as if the conditional jump will happen in the predicted direction, and uses the pipeline backup hardware (that must be present to handle page faults) to recover from mis-predictions. Correctly-predicted control transfers of all kinds take one machine cycle, plus instruction buffer refill time. Mis-predicted conditional jumps cost four cycles plus refill time. This puts a real premium on correct prediction, an interesting challenge for compiler writers.
To check whether dense code was still possible in Dragon's instruction set, we translated the Cedar version of Puzzle by hand, trying to simulate average compiler technology. Here is the result, compared with similar results for other machines (32-bit addresses unless noted):
Code bytes   Machine      Language       Remarks
1214         Alto         Mesa (early)   16-bit address, no checks
1369         Lilith       Modula-2       16-bit address, no checks
1534         Dorado       Cedar          Mixed 16/32-bit address, range & NIL checks
1938         Dragon       Cedar          Range & NIL checks
2080         VAX 11/780   C              Pointers instead of subscripts, no checks
2500         68000        C              No checks
2736         RISC         C              No checks
3592         MIPS         Pascal         Global optimization, range checks
4744         IBM 370      Pascal
At this moment only about half of the 256 possible Dragon opcodes are defined; the rest are implemented as XOP's. As we further tune Dragon's instruction set we expect code density to improve.
Structure
A standard Dragon processor cluster consists of five or six large chips. Two of these chips are caches that mediate between the processor and the M-bus, one for instructions and one for data. One or two are floating-point accelerator chips. The remaining two chips form the heart of the Dragon fixed-point execution engine.
The two chips that constitute the fixed-point execution engine form a single logical entity, shown schematically in Figure 2. Instructions and data flow downward in this diagram. Dragon uses a two-phase clock; the heavy horizontal lines marked A indicate latches that are open during the A phase, while those marked B show latches open during B. One of the chips is the Instruction Fetch Unit (IFU), which reads machine instructions, decodes them, expands them into microinstructions, and controls the pipelining of these microinstructions. The other chip is the Execution Unit (EU), which contains a 160-register 3-port data memory, a pipelined ALU/Masker/Shifter with hardware to assist multiply and divide, the address and data pathway to the data cache (but not its control), and several multiplexers to select operands and implement pipeline short-circuits. The heavy jagged diagonal line denotes the interface between the two chips. The line is shown as diagonal because as a microinstruction progresses through the pipeline, at each stage there is less work being done in the IFU and more being done in the EU for that instruction.
As we scan figure 2 from top to bottom, the first major block we encounter is IFetch. This block contains an asynchronously-filled 6-word buffer of instructions. When a whole instruction is ready, it is byte-aligned by IFetch and made available to Level 0.
Pipeline Level 0 generates from each instruction a sequence (usually of length one) of fairly wide microinstructions. The registers holding L and S are at this level, and midway through this level the two fetch addresses for the EU RAM are formed.
Unconditional jumps, calls to literal addresses, and returns are completely implemented at Level 0. These operations take one machine cycle at that level, plus the time needed to refill the instruction buffer from the target address, which is one cycle in the best case.
In Level 1 the two EU RAM fetch addresses pass to the EU, and somewhat later literal data from the instruction, and information about the source of the ALU operands, pass to the EU.
In Level 2 the ALU operation and information about what Boolean condition is of interest passes to the EU, and somewhat later the EU does the ALU operation and computes the Boolean condition.
In Level 3 a data cache operation is initiated if appropriate. The address came out of the ALU in the previous level. The wide data path from EU to IFU at the end of Level 3 can carry new contents for an IFU register, such as the program counter. We chose to put data cache operations in a pipeline stage following the ALU, rather than in the same pipeline stage, even though the pipeline is longer that way. We did this because a very large fraction of memory instructions use an address formed as "base register plus offset", so the cache operation is preceded by address arithmetic. If ALU and data cache operations were in the same pipeline stage, such memory instructions would require an extra cycle.
A microinstruction can fail due to a page fault, an arithmetic overflow, or the realization that a conditional jump has been mis-predicted. Between Level 3 and Level 4, a crucial decision is made. Only by this point is it certain whether a microinstruction will complete successfully. If it cannot, then not only that microinstruction, but all previous microinstructions for the instruction that generated it, as well as all microinstructions for following instructions, must be aborted. The processor must attain a state as if all microinstructions for previous instructions had successfully exited the pipeline (which they have), and all microinstructions for the instruction currently at Level 3-4 had never entered the pipeline (a condition, alas, quite contrary to fact).
In Level 4, successful instructions store new data back in the EU RAM (see Level 1), and successful calls or returns are recorded in the IStack.
Pipelining
Dragon derives much of its speed from its IFU and EU pipelining, and the fact that average instructions (in Puzzle, for example) operate with a pipeline headway of less than two cycles, even though the pipeline is four cycles long. Unfortunately this speed is not free. We must detect and resolve inter-stage dependencies (hazards). In addition, to deal with faults in data cache references, arithmetic faults of various kinds, and mis-predicted conditional jumps, we need a way to abort instructions. This involves recovering much of the processor state to its value three pipeline steps ago.
Hazards
Our design deals with inter-microinstruction pipeline hazards in three ways. The simplest is to avoid them. MIPS avoids hazards by a set of programming rules that are followed by the compiler. With this solution comes the choice of a very smart compiler or a lot of extraneous No-Op's. We didn't want to introduce such rules (and their attendant increase in code size) at the machine instruction level. We do avoid some hazards by implementing a few infrequently-executed instructions with awful side effects (Store IFU Register, for example) by a carefully coded sequence of microinstructions that clears microinstructions for preceding instructions out of sensitive places in the pipeline before the killer microinstruction does its dirty deed, and then prevents microinstructions for succeeding instructions from entering sensitive places in the pipeline while the deed is still in progress.
Another case of hazard avoidance is delay-matching in parallel paths. Dragon has a simple loop-free pipeline in which nearly every computational resource (gate, bus, etc.) is used at only one pipeline stage. This greatly simplifies pipeline scheduling, but it also means that some paths through the pipeline are longer than they would otherwise need to be. For example, consider the store-back path to the EU RAM. For ordinary arithmetic, this could be located at Level 3. But because we put the ALU operations and data cache references in tandem, rather than in parallel, for data cache references the EU RAM store-back path has to be at Level 4. A hazard is avoided by inserting an extra pipeline stage between ALU output and EU RAM store-back path, so that it is always used at Level 4.
Another way Dragon deals with hazards is to detect them, and to prevent some upper endangered segment of the pipeline from advancing, while allowing the lower danger-free segment to advance normally. A microcode No-Op, which changes no state and participates in no hazards, is inserted in the pipeline gap thus created. The boundary between the stationary initial pipeline segment and the advancing final pipeline segment is chosen as high in the pipeline as possible, but so that at least one of the two participants in any hazard is on the stationary side of the boundary. In Dragon, there are three such boundaries. One is at Level 0, and deals with next instruction not ready, or a Call-Return hazard. One is at Level 1, and deals with those cases of EU RAM Write-Read hazard that cannot be short-circuited. And one is at Level 3, and deals with data cache not ready.
The final way Dragon deals with hazards is to detect them, and to achieve the desired effect in another way, without introducing delay. The data path round-trip, from the EU RAM, through the ALU, through a best-case data cache operation (or the parallel collision-avoiding pipeline stage), and back into the EU RAM, takes three pipeline stages. During that time a potential Write-Read hazard exists on the EU register that is specified as the destination of a microinstruction, because a following microinstruction can read the wrong value. Many of these hazards can be handled without wasted cycles by means of EU pipeline short-circuits, which are additional data pathways in the EU that carry data backward one or two pipeline stages. For example, there is a short-circuit from the output of the ALU back to either input, so that an instruction may use the arithmetic result of the previous instruction. The use of these short-circuit paths is governed by the IFU. It matches the source register addresses of an instruction about to enter pipeline stage 1 against the destination register addresses of instructions further down the pipeline. Short-circuit paths are expensive, both in area and complexity, so we have included them only where they will speed up common instruction sequences.
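
A sketch of the forwarding decision, as the IFU might make it; the register numbering and stage layout here are our simplification, not the actual pipeline.

    #include <stdint.h>

    /* dest < 0 means the stage writes no register. */
    typedef struct { int dest; uint32_t value; } StageResult;

    /* Select the operand for a microinstruction entering Level 1:
       take it from a short-circuit path if a later stage is about to
       write the same register, otherwise read the EU RAM. */
    uint32_t operandValue(int src,
                          const StageResult *oneAhead,  /* ALU output  */
                          const StageResult *twoAhead,  /* cache stage */
                          const uint32_t euRam[160])
    {
        if (src == oneAhead->dest) return oneAhead->value; /* 1-stage bypass */
        if (src == twoAhead->dest) return twoAhead->value; /* 2-stage bypass */
        return euRam[src];                                 /* no hazard      */
    }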
Instruction Abort
Dragon has a way of attaining a state as if all microinstructions beyond pipeline stage 3 ran to completion, while all those at stage 3 and earlier never started. This mechanism, combined with careful microcoding of multi-microinstruction instructions, allows a data cache fault, or an arithmetic fault, to behave as if an XOP (extended op, or very short procedure call) had been inserted before the instruction that caused the fault, and allows the instruction to be re-executed after the fault has been remedied. This same mechanism is also used to speed up correctly-predicted conditional jumps.
This state restoration is probably the one mechanism responsible for the most logical complexity in the processor. The mechanism is implemented in different ways for the different state variables involved. To recover state in the EU, where permanent state changes only take place at pipeline stage 2 and beyond, the IFU converts destination register addresses to an unimplemented register number, converts the ALU operation in Level 1 to a harmless OR, and notifies the ALU to recover Carry and other small amounts of state from internal saved versions. Within the IFU, the current program counter, S, and L registers at Level 1 are reloaded from shadow copies at Level 3.
The worst problem is the IFU's control stack cache, IStack. This cache contains the caller's return PC and L values for the most recent few pending procedure calls, XOP's, Reschedule interrupts, or fault traps in the current process. To avoid a complex recovery process, this cache is located at Level 4, which is attained only by microinstructions that do not fault. Unfortunately, this means that if the matching Return appears fewer than three pipeline stages after a Call, the control stack cache does not yet contain the caller's return PC, so a Call-Return hazard exists and an interlock must be activated.
Multiprocess primitives
Dragon will execute a distributed process scheduling algorithm. This algorithm is implemented by two hardware features and some software. One hardware feature is a global interrupt signal called Reschedule that is wired to all processors in a system. The other is a Dragon instruction called Conditional Store that implements a particularly nice form of atomic memory update:
{[S+1] := [S-1]^; IF [S+1]=[S] THEN [S-1]^ := [S-2]; S := S+1}.
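
Stack notation aside, Conditional Store is a compare-and-swap that leaves the observed old value on top of the stack. Here is a C rendering of its semantics; on the real machine the whole sequence is atomic, which this sketch cannot be.

    #include <stdint.h>

    /* observed := *addr; if observed = expected then *addr := newValue;
       the observed value is pushed so the caller can test for success. */
    uint32_t conditionalStore(uint32_t *addr, uint32_t expected,
                              uint32_t newValue)
    {
        uint32_t observed = *addr;     /* [S+1] := [S-1]^          */
        if (observed == expected)      /* IF [S+1] = [S]           */
            *addr = newValue;          /* THEN [S-1]^ := [S-2]     */
        return observed;               /* left on top of the stack */
    }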
The scheduling algorithm is triggered when Reschedule is asserted, either by a Dragon processor or by the I/O processor. When that happens, each processor waits until the code it is running is willing to accept a process switch (handlers for stack overflow/underflow are not, for example), and then does a procedure call to the Reschedule handler code. That code attempts to acquire the global scheduling lock. If that fails, the code checks the processor's control block to ensure that this processor is still executing the process that the scheduler had in mind, and then returns. If the handler acquires the scheduling lock, it unloads the current process context to memory, loads the context of the scheduler process, runs the scheduler process (reassigning processes to processors), activates the Reschedule line once more while holding the scheduling lock, then unloads the scheduler's context to memory, releases the lock, and finally picks up and runs the process in the processor's control block.
The Dragon data and control caches only work well if processors do not switch processes too often. One way of avoiding process switches is to cause the Reschedule handler to reach for the scheduling lock somewhat faster if the processor is running the idle process than if it is not. This means that if there are any idle processors, one of them becomes the scheduler, and the others will only switch processes if forced by the scheduler code.
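
The scheduling lock itself can be built directly on Conditional Store. A minimal sketch using the conditionalStore rendering above; the lock variable and its encoding are our assumptions.

    static uint32_t schedulingLock = 0;        /* 0 = free, 1 = held */

    /* Succeeds only if we observed the lock free and atomically
       stored "held"; otherwise another processor is scheduling. */
    int tryAcquireSchedulingLock(void)
    {
        return conditionalStore(&schedulingLock, 0, 1) == 0;
    }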
Cedar
The designers of Dragon had the Cedar language uppermost in mind as the language from which Dragon programs would be compiled. Much of the organization of Dragon is intended to optimize the execution of common Cedar constructs. Of course, similar constructs can be found in other languages as well. Hill argues that the Bell Labs C machine, with which Dragon shares many features, is especially well suited to executing a wide variety of languages.
Procedures and local frames
Cedar is a successor of Pascal. It depends heavily on the notion that abstraction is implemented by procedure call. A process can be viewed as a LIFO stack of suspended activation records and one current activation record. Each activation record (or "frame") typically contains a program counter and some variables. To call a procedure, the currently-active procedure generates a new activation record for the called procedure, computes some arguments, stores them in the new activation record, suspends its own activation record, and makes the new activation record current. When a current activation finishes, it sends a number of results to the (suspended) caller's activation record, destroys its own activation record, and makes the caller's activation record current.
It is this computational paradigm for which Dragon is heavily optimized. Within the IFU chip is a LIFO cache of program counters and register indexes within the EU chip where the variables of suspended activation records begin. Most activation record variables can be implemented as EU registers, and parameter and result passing is done by index adjustment rather than data copying.
Such other languages as Lisp and Smalltalk also fit this computational paradigm more or less, and Dragon should be an effective engine for them as well. But there are other computational paradigms, such as Prolog, for which an equivalent amount of hardware, organized differently, would probably do significantly better than Dragon.
Strong typing and "Safety"
Cedar is a strongly-typed language like Pascal. This means that the compiler can know a great deal about what might happen during execution, and what cannot. Cedar also has a garbage collector, like Lisp. This means that it must be possible, at run-time, to follow all pointers, and to know the syntactic types of their targets. In "safe" Cedar (an automatically-checked subset of the language) the use of the general "Address-Of" operator is prohibited, as are overlaid variant records, and other devices that can confuse the compiler about the semantics of execution. The result is that the compiler knows many situations when a variable in an active process does not need a memory address. Such variables can be implemented in machine registers without fear that a program might reference them by memory address "alias" and get an inconsistent copy.
A compiler cannot know in general whether alias references will happen in weakly-typed languages like BCPL or C. One solution proposed by Patterson is to simulate such alias references by loading or storing the appropriate register. If needed, Dragon will implement this solution by using the mapping hardware to trap ranges of memory addresses. This implementation would be unacceptably slow if such alias references were very common.
This is one illustration of the general observation that a smart compiler working with a strongly-typed language can do well if it has at least two coding templates available: one that is very efficient for common situations that it understands perfectly, and another one that always works but may be less efficient, for uncommon situations that it cannot predict with certainty.
Modules and Procedure Calls
Cedar, and its precursor Mesa, have very rich control structures including independent compile units, coroutines, and procedure variables. In our earlier Mesa instruction set we had just one kind of cross-compile-unit procedure call instruction, EXTERNALCALL, which favored compactness over speed. Every cross-module procedure call required at least four indirect memory references, accompanied by considerable unpacking and indexing, just to find the new program counter. Every callee needed a record in memory for its local variables, and that record was allocated from a special-purpose allocator, at a typical cost of two memory references. The benefit was that an external procedure call without arguments could be coded in just one or two bytes; the drawback was that where they needed performance, programmers would avoid using EXTERNALCALL by expanding procedures inline, thereby defeating EXTERNALCALL's dense coding.
In Dragon we relax the density of coding of EXTERNALCALL, and we permit varying degrees of indirection, corresponding to tightness of binding, as circumstances permit. A general Dragon EXTERNALCALL is five bytes long. In those five bytes the EXTERNALCALL can be represented in at least three different ways. Tightest of all is a simple procedure call to a 32-bit byte address. This requires no indirect memory references to find the new program counter, but it puts considerable pressure on the loader. Loosest of all is an XOP with a four-byte argument. The XOP interpreter might treat the argument as a pointer to a string variable containing the name of the procedure to be called, locate the appropriate routine via the loader tables, and complete the call. Somewhere in between is a three-byte data reference that loads a new top of data stack, followed by a two-byte procedure call indirect through the stack top.
Processes
Cedar already contains a formally complete set of primitives for dealing with multiple tasks in a single address space: PROCESS, FORK, JOIN, WAIT, NOTIFY, MONITOR. We have considerable experience using these primitives in contexts where multiple processes seem natural. These will suffice as Dragon's multiprocess primitives until something better evolves.
Conclusions
We have presented the current state of the design of a high-performance general-purpose personal computer system for use in programming research. We are optimistic that its performance will be quite good in a single-processor configuration, and depending on the tasks at hand and the cache sizes used, modest numbers of processors interacting in the same virtual address space may not seriously impede each other. Further conclusions await further experience with the machine.
Acknowledgments
Dragon was conceived in 1981 by C. Thacker, B. Lampson, P. Petit, N. Wilhelm, and F. Baskett. Today's architects and designers include R. Atkinson, R. Barth, D. Curry, J. Gasbarro, L. Monier, M. Overton, and myself. When the final story is told, there will be many others.
Good IC design tools from PARC's Computer Science Laboratory, and a cooperative spirit and functional silicon from PARC's Integrated Circuit Laboratory, have been and will continue to be essential in whatever success we may achieve.
Figure 2 was provided by D. Curry.
Bibliography
Censier, L. M., and Feautrier, P.
A New Solution to Coherence Problems in Multicache Systems
IEEE Transactions on Computers, C-27(12), pages 1112-1118, December, 1978.

Ditzel, D. R., and H. R. McLellan.
Register Allocation for Free: The C Machine Stack Cache
ACM SIGPLAN Notices, XVII(4): 48-56, 1982.
Goodman, James R.
Using Cache Memory to Reduce Processor-Memory Traffic
Computer Architecture Symposium Proceedings, IEEE, pages 124-131, 1983.
Hennessy, John, with Norman Jouppi, Steven Przybylski, Christopher Rowen, and Thomas Gross.
Design of a High-Performance VLSI Processor
Third Caltech Conference on Very Large Scale Integration, pages 33-54. Edited by Randal Bryant, Computer Science Press, 1983.
Describes the Stanford MIPS machine.
Hill, Dwight D.
An Analysis of C Machine Support for Other Block-Structured Languages
Computer Architecture News, published by ACM Special Interest Group on Computer Architecture, XI(4): 6-16, 1982.
Kogge, Peter M.
The Architecture of Pipelined Machines
McGraw-Hill, 1981.
Patterson, David A., and Carlo H. Sequin.
A VLSI RISC
Computer, IEEE, XV(9): 8-21, September, 1982.
Patterson, David A., with Phil Garrison, Mark Hill, Dimitris Lioupis, Chris Nyberg, Tim Sippel, and Korbin Van Dyke.
Architecture of a VLSI Instruction Cache for a RISC
Computer Architecture Symposium Proceedings, IEEE, pages 108-116, 1983.
Radin, George.
The 801 Minicomputer
IBM Journal of Research and Development, XXVII(3): 237-246, 1983.
Sites, Richard.
How to Use 1000 Registers
Caltech Conference on VLSI, pages 527-532. Edited by Charles Seitz, 1983.
Smith, Alan Jay
Cache Memories
ACM Computing Surveys, XIV(3): 473-530, 1982.
Smith, James E., and James R. Goodman.
A Study of Instruction Cache Organizations and Replacement Policies
Computer Architecture Symposium Proceedings, IEEE, pages 132-137, 1983.
[Figure 1: Overview of a typical Dragon system]
[Figure 2: Schematic of the Dragon fixed-point execution engine (IFU and EU)]