Abstract
The Dragon is a 32-bit research workstation currently being designed at the Xerox Palo Alto Research Center. It is a closely-coupled multiprocessor with a novel architecture and is built using custom VLSI. We expect its performance to be in the 5-50 MIPS range depending on the configuration. Its primary use will be in investigating the applications of multiprocessing to the office environment.
Introduction
Dragon is a new computer system being designed at Xerox PARC. It belongs to a line of personal computers (the D-machines) that started in 1973 with the Alto and includes the Dolphin, the Dandelion, and the Dorado [7]. All these machines are characterized by a high-performance display, a sixteen-bit processor operating in a single address space, and only rudimentary protection hardware.
Dragon is a significant departure from its predecessors. It is a tightly-coupled multiprocessor with a novel cache architecture; it is a real 32-bit machine, both in its data and address paths; and it uses a new instruction set which is a considerably simplified version of the Mesa byte codes [3]. Dragon is also much more powerful than its predecessors: the performance of a single Dragon processor should exceed twice that of a Dorado, and we expect to produce machines with up to 10 processors each. Finally, Dragon is our first machine that includes custom integrated circuits. We expect this move to yield a significant reduction in size, cost, and electrical power.
We plan to build a modest number of Dragons for ourselves to explore useful ways of exploiting this level of performance in a personal setting before such machines are economical in widespread use. We see multiprocessing as a key element in future systems and need to understand its implications for software design. At this writing we have almost completed the logic simulation of the processors and have a fully functional arbiter. This is encouraging, but much work remains to be done. We must first cast details of the architecture in concrete before doing further work on chip floor planning and layout.
Key Ideas behind Dragon
Everybody wants a faster machine; we want a fast personal machine. Given that increases in performance purely through circuit speed are becoming harder to come by, we believe the best way to build such machines is to use tightly-coupled multiprocessing in which the individual processors are themselves high-speed devices built out of custom ICs. A number of ideas made this approach promising, and we wanted to try them out through implementation.
Multiprocessing
Dragon will have a modest number (1-10) of identical processors attached to a common shared memory bus, the M-bus. A key problem in such structures is that access to the bus is a bottleneck that limits system performance; the obvious use of caches [9] to resolve this problem has until recently been limited by the cache consistency problem [1, 2]. Our solution is to use a special two-port snoopy cache (described below) that maintains consistency and reduces bus traffic.
The software will use this parallelism by mapping processes to processors dynamically. Given that Dragon is a personal machine, we are forced to break up large compute-bound programs, such as compilers and VLSI design aids, into multiple processes. This decomposition is a difficult problem, but we have made a modest beginning. The primitives for multiprocessing have been part of the Cedar language [10] for a while, and software written recently has been structured with a view towards exploiting the parallelism in Dragon.
Faster Processors
There are two areas in which we expect single Dragon processors to be faster than their predecessors. The first is in instruction execution time. As in the IBM 801 [8], a key goal is to execute an instruction nearly every machine cycle. This is achieved by use of a simple instruction set, pipelining with rich pipeline bypass paths [4], the use of caching, and instruction prefetch. These architectural devices are feasible in the context of a personal computer only through the use of custom VLSI. Our technology is an in-house 2-micron CMOS process with polysilicide gates and two levels of metal. The machine's target clock rate is 10 MHz.
The second area of improvement is the procedure call, where simplified parameter passing, a cached control stack, and the elimination of parameter copying and of several indirections in common cases are expected to provide a large speed improvement over our earlier implementations.
Code Density
Dragon will retain the compactness of compiled programs exhibited by the Mesa instruction set, for two reasons: it is obviously cheaper to store smaller programs, but even more important is the reduction in code traffic from disk to memory, from memory to cache, and from cache to instruction decoder. In particular, any decrease in M-bus utilization by a single processor will contribute to good performance in a system with shared memory. An undesirable consequence of a compact instruction set is that it tends to make instruction decoding hardware more complex. For the Dragon at least, we expect the gains to outweigh this disadvantage.
General Structure of Dragon
As shown in Figure 1, the Dragon is organized around a central memory bus, the M-bus. Up to 10 identical processors share memory via this bus. Also on the M-bus are the bus arbiter, the mapping processor, the display controller, a few high-speed I/O devices (Ethernet and disk controllers), and a VME bus coupler. This access to a standard bus will let us use a large selection of commercial devices.
Figure 1: Structure of the Dragon
The Memory System
The M-bus
The M-bus is a synchronous time-multiplexed bus that carries data among the processor caches, main memory, the map processor, and the display controller. M-bus transactions are time-sequenced on a 32-bit address/data bus as dictated by the operation specified by the current bus master. Addresses on the bus are real, not virtual, and there are separate 32-bit address spaces for I/O and memory.
Interprocessor atomicity is implemented by allowing the M-bus master to lock the arbiter. There is no automatic timeout, so processors (or caches) must be designed so that they cannot hold the lock for too long.
Main Memory
Physical memory will have from 4 to 16 Mbytes with error correction. Memory page frames will consist of 256 32-bit words. Transfers to and from memory will occur in 4-word chunks, with the requested word delivered first for reads to decrease latency.
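The delivery order for reads can be sketched as follows, assuming wrap-around ordering within the aligned 4-word chunk (an assumption; the text states only that the requested word arrives first):

```python
def delivery_order(requested_word, chunk_words=4):
    """Word addresses of a 4-word memory read, in delivery order.

    Assumes a wrap-around ("requested word first") ordering within the
    aligned chunk, which minimizes latency for the word the processor
    is actually waiting on.
    """
    base = requested_word - (requested_word % chunk_words)  # chunk start
    offset = requested_word - base
    return [base + ((offset + i) % chunk_words) for i in range(chunk_words)]
```

For example, a read of word 6 would deliver words 6, 7, 4, 5 from the chunk starting at word 4, so the processor can resume after the first bus cycle of the transfer.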
The Arbiter
M-bus mastership is granted by a single-chip synchronous arbiter with 32 request/grant ports. A first version of this chip is working.
The Cache
A Dragon cache mediates between a processor's P-bus and the system's common M-bus. Its main purpose is to decrease the amount of M-bus bandwidth consumed by each processor, thus allowing multiple processors to run efficiently.
The cache is a single chip. It is fully associative, and holds (at least) 64 entries with each entry containing a virtual block address, a block of 4 32-bit words of data, a real page number, and various property bits. Several caches can be connected in parallel to increase size by having each one respond to a subset of virtual addresses.
The cache also translates virtual addresses to real by maintaining mapping information on a per-page basis. If the requested word is present, the reference is satisfied in only one cycle. If the word is missing, the cache issues a reject signal that freezes the processor. Two cases are now possible: (i) the virtual to physical translation for the target page is present because another word from this page is in the cache; and (ii) this information is not in the cache, in which case it is supplied by the map processor.
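The hit/miss behavior just described can be modeled as a toy sketch. The class names, data layout, and the identity of the map-processor interface below are illustrative assumptions, not the real chip design; only the block size, page size, and the two miss cases come from the text:

```python
class CacheSketch:
    """Toy model of the cache's read path: fully associative lookup by
    virtual block, with per-page virtual-to-real translation."""
    BLOCK = 4     # 4 32-bit words per block (from the paper)
    PAGE = 256    # 256 words per page frame (from the paper)

    def __init__(self, memory):
        self.memory = memory   # flat list of words standing in for real memory
        self.blocks = {}       # virtual block number -> 4-word block of data
        self.map = {}          # virtual page number -> real page number

    def read(self, vaddr, map_processor):
        vblock = vaddr // self.BLOCK
        if vblock not in self.blocks:            # miss: "reject" freezes processor
            vpage = vaddr // self.PAGE
            if vpage not in self.map:            # case (ii): ask the map processor
                self.map[vpage] = map_processor(vpage)
            # case (i) falls through: translation already present for this page
            rbase = (self.map[vpage] * self.PAGE
                     + (vaddr % self.PAGE) // self.BLOCK * self.BLOCK)
            self.blocks[vblock] = self.memory[rbase:rbase + self.BLOCK]
        return self.blocks[vblock][vaddr % self.BLOCK]
```

A hit returns in the first branch-free path, corresponding to the one-cycle case; both miss cases end by filling a whole 4-word block.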
To maintain inter-cache data consistency, the cache monitors physical addresses on the M-bus to determine sharing, writes through if an entry is shared, and writes back if an entry is not shared. On a write-through other caches update their copies of the value, if any. The monitoring is done by a second associator in each cache called the "snoopy" associator. This idea was independently discovered by C. Thacker of PARC, Steve Dashiell from Xerox Electronic Division, and J. Goodman [2]. A related scheme was published even earlier [1].
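The write policy above can be sketched in a few lines. This is an illustration of the shared/not-shared decision only; the structure and names are assumptions, and the real protocol involves the snoopy associator matching physical addresses on live bus cycles:

```python
class SnoopySketch:
    """Toy model of the write policy: write through when a block is
    shared (so other caches can snoop the bus and update their copies),
    write back when it is not (saving M-bus bandwidth)."""

    def __init__(self):
        self.data = {}     # block address -> value (one word per block here)
        self.shared = {}   # block address -> shared bit, set by bus monitoring
        self.dirty = {}    # block address -> needs eventual write-back

    def write(self, addr, value, other_caches, bus_writes):
        self.data[addr] = value
        if self.shared.get(addr):
            bus_writes.append((addr, value))   # write-through on the M-bus
            for c in other_caches:             # snoopy associators see the write
                if addr in c.data:
                    c.data[addr] = value
        else:
            self.dirty[addr] = True            # write-back: no bus traffic now
```

Note how an unshared write generates no bus traffic at all, which is the property that lets multiple processors coexist on one M-bus.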
The P-bus
The P-bus connects one or more caches to a processor. The bus consists of a 32-bit multiplexed virtual address/data field, a parity bit, a command field, a ready bit, and an error field. The P-bus is twice as fast as the M-bus, permitting a cache access within one cycle for a hit.
There is only one processor on a P-bus, and it is always master. One or more caches and the floating-point accelerator chips are slaves. A P-bus slave decodes its instructions from a combination of the command and virtual address fields.
Map processor
The map processor is a single M-bus device that implements the mapping from virtual addresses to real. It also cooperates with the processor caches to implement per-page memory protection and permits multiple virtual address spaces (that may share pages) to coexist simultaneously.
Mapping information for all spaces is maintained in a single table kept in main memory. This table keeps information only about pages currently in main memory in order that it may occupy a small, fixed fraction of real memory. It is organized as a hash table with digital search to resolve hash conflicts. Multiple address spaces are implemented by storing the address space id along with the virtual address in each table entry, and by maintaining a small table within the map processor that gives the dynamic correspondence between processors and address spaces.
The map processor also caches frequently used mapping entries. This allows it to respond to most cache map misses in two cycles. If there is a miss within the map cache also, the map processor reads the main memory map. Caching combined with the above hashing scheme permits very good average response and a worst case response that is bounded by a reasonable value.
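The two-level lookup described above can be sketched as follows. The hash function, bucket count, and table layout here are illustrative assumptions; in particular, the real table resolves hash conflicts by digital search, which is modeled below by a simple bucket scan:

```python
class MapProcessorSketch:
    """Toy model of the map processor: a small map cache in front of a
    hashed in-memory table keyed by (address-space id, virtual page)."""
    BUCKETS = 64   # illustrative; the real table is a fixed fraction of memory

    def __init__(self, entries):
        # entries: {(space_id, virtual_page): real_page} for resident pages only
        self.table = [[] for _ in range(self.BUCKETS)]
        for (sid, vpage), rpage in entries.items():
            self.table[hash((sid, vpage)) % self.BUCKETS].append((sid, vpage, rpage))
        self.cache = {}   # frequently used entries (~2000 in the hardware)

    def translate(self, sid, vpage):
        key = (sid, vpage)
        if key in self.cache:                      # map-cache hit: fast path
            return self.cache[key]
        for s, v, r in self.table[hash(key) % self.BUCKETS]:
            if (s, v) == key:                      # stand-in for digital search
                self.cache[key] = r
                return r
        return None                                # page not in main memory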
The map processor will be implemented by two custom chips: a map cache containing around 2000 mapping entries and a map cache controller.
Figure 2: A Dragon Processor
The Processor
A Dragon processor consists of 4 custom chips—the Instruction Fetch Unit, the Execution Unit, and their respective caches—and a commercial floating-point chip set. Early benchmarks predict a minimum of 5 MIPS per processor.
The Instruction Fetch Unit (IFU)
The IFU is the most complicated chip. It fetches a byte stream through its cache, assembles variable-length instructions in an asynchronously filled 4-word buffer, and then decodes them using several large PLAs. Unlike its predecessors, Dragon has no microcode in RAM or ROM; the IFU is thus specialized for one particular instruction set.
The product of this decoding provides the EU with three addresses for its register file (two source and one destination), a 32-bit literal, various opcodes, and condition select. The IFU also detects impending hazards in the EU pipeline and issues pipeline bypass commands. This avoids wasted cycles while preventing registers from being updated inconsistently.
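The hazard detection can be sketched as a comparison of the decoded instruction's source registers against the destinations of instructions still in flight. Stage names and the list-based representation are assumptions for illustration:

```python
def bypass_controls(sources, in_flight):
    """For each source register of the instruction being decoded, decide
    whether to read the register file or bypass a result still in the
    pipeline. in_flight is ordered oldest to youngest, so the most
    recent writer of a register wins."""
    controls = []
    for src in sources:
        choice = "regfile"
        for stage, dest in in_flight:
            if dest == src:
                choice = "bypass-from-" + stage   # youngest match overwrites
        controls.append(choice)
    return controls
```

Issuing these controls in the decode stage is what avoids the wasted cycles that a stall-only design would incur on back-to-back dependent instructions.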
The IFU also has a stack for PC and procedure linkage information. It processes jumps, traps, calls and returns, relying on the EU for conditional tests. It also maintains the local frame model which consists of a stack pointer and a frame pointer. Finally, the IFU controls the floating-point unit and both the instruction and data caches.
Most of the complexity of the IFU, aside from the sheer size of the instruction set, stems from the treatment of exceptions in the pipeline, which include conditional jumps, instruction aborts, rejects from the cache, and faults.
The Execution Unit (EU)
The EU is a single-chip slave of the IFU. Its main features are a three-ported register file holding 160 32-bit words for stack variables, runtime variables, and constants; a fast 32-bit ALU; a field unit able to extract/insert/shift in one cycle; special multiply hardware (2 bits/cycle) and divide hardware (1 bit/cycle).
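The 2-bits-per-cycle rate quoted for multiplication implies 16 steps for a full 32-bit multiply. The sketch below illustrates that rate with a radix-4 shift-and-add loop; it shows the arithmetic, not the actual EU datapath:

```python
def multiply_2bits_per_cycle(a, b, width=32):
    """Shift-and-add multiply retiring 2 bits of the multiplier per step.
    Returns (product, steps); a full 32-bit multiply takes 16 steps,
    matching the paper's 2 bits/cycle figure."""
    product, steps = 0, 0
    for step in range(width // 2):
        digit = (b >> (2 * step)) & 0b11          # next 2 multiplier bits
        product += (digit * a) << (2 * step)      # one partial-product add
        steps += 1
    return product & ((1 << (2 * width)) - 1), steps
```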
Logically, the EU and IFU form a single entity. Both are pipelined, with the two pipelines interacting at virtually every stage.
Floating-point Unit
A survey of potential users showed that fast floating-point is mandatory for graphics and simulation applications. Every Dragon processor will use a set of two high-speed commercial chips which operate in a non-pipelined mode under direct control of the IFU. Communication with the EU occurs over the P-bus, with floating-point operations looking to the EU very much like cache accesses.
The floating-point chip set in conjunction with low level software will implement the full IEEE floating-point format with an average throughput of about 1.5 MFlops.
Display System
For us a personal machine requires a high-performance display. Dragon will accommodate both color and very high-resolution black-and-white displays. The color display will support 1024 × 768 8-bit pixels, with extension to 24 bits per pixel. Two Mbytes of display memory will be dual-ported, permitting both high-bandwidth refresh and access over the M-bus. The display controller will mediate access to the display memory over the M-bus, control display refresh, and provide some local display update functions. The implementation will use a mix of custom and commercial chips.
On the Software Side
Instruction Set
To a certain extent, Dragon is a RISC-like machine [6]. Nearly every hardware feature is directly accessible through the instruction set, which contains both stack-oriented and 3-address instructions. Most instructions are simple, executing in just one cycle.
However, certain features are a definite departure from the RISC philosophy. Instructions are byte-encoded for density; there are 7 addressing modes; and the roughly 150 instructions come in 1-, 2-, 3-, and 5-byte sizes. Instructions include arithmetic/logical operations; calls and jumps with static branch prediction; memory access; compact stack-oriented instructions; 1-, 2-, and 4-byte immediate constants; adjustment of special registers; floating-point operations; map operations; and the only source of atomicity, the Conditional Store instruction. Roughly half the opcode space is free and may be used to implement new instructions in software invoked via extended operation traps, or XOPs.
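Since Conditional Store is the machine's only atomicity primitive, all higher-level synchronization must be built on it. The sketch below shows the conventional construction of a spin lock from such a compare-and-store primitive; the interface names are assumptions, and the hardware achieves atomicity by locking the M-bus arbiter as described earlier:

```python
class Word:
    """A shared memory word offering a Conditional Store primitive:
    store `new` only if the word still holds `expected`."""

    def __init__(self, value):
        self.value = value

    def conditional_store(self, expected, new):
        # Atomic in hardware (the cache holds the M-bus arbiter lock for
        # the read-compare-write); modeled here as one uninterrupted step.
        if self.value == expected:
            self.value = new
            return True
        return False

def acquire(lock):
    """Spin until the lock word flips from 0 (free) to 1 (held)."""
    while not lock.conditional_store(0, 1):
        pass

def release(lock):
    lock.value = 0
```

On top of this, software can build the usual monitors and condition variables that the Cedar multiprocess primitives require.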
The frame model allows procedures to be called with low overhead at register speed: local frames are cached in EU registers, and procedure linkage information is kept in the IFU stack as explained earlier.
Languages
The original designers of Dragon intended Cedar as the language from which Dragon programs would be compiled. Cedar is characterized by strong typing, call-by-value, and a runtime garbage collector. The language also contains a complete set of multiprocess primitives. However, we intend that Dragon will also run Interlisp and Smalltalk and other languages not local to PARC.
Preliminary Simulation Results
We have run some hand-coded benchmarks, which simulate average compiler technology, on a Dragon instruction simulator. Assuming our target clock rate of 10 MHz, Dragon exceeds twice the performance of a Dorado, making it roughly equivalent to four VAX 11/780's. The simulated cache hit ratio was very close to 100% for instructions and close to 90% for data. The hit ratio for instructions is not particularly revealing because the benchmarks thus far have been small programs. The benchmarks also revealed a dynamic average of 2.5 bytes per instruction.
Conclusions
We have presented the architecture of a high-performance general-purpose personal computer system for use in programming research. We are optimistic that its performance will be quite good in a single-processor configuration, and that, depending on the tasks at hand and the cache sizes used, modest numbers of processors will not seriously impede each other. More definite conclusions await further experience with the machine.
Acknowledgments
Dragon was conceived in 1981 by C. Thacker, B. Lampson, P. Petit, N. Wilhelm, and F. Baskett. Today's architects and designers include R. Atkinson, R. Barth, C. Black, F. Crow, D. Curry, J. Gasbarro, J. Hoel, Ed McCreight, L. Monier, M. Overton, P. Sindhu, and T. Williams. When the final story is told, there will be many others. Ed McCreight also deserves special thanks since his earlier paper on the Dragon [5] formed the basis for this paper.
Good IC design tools from PARC's Computer Science Laboratory and a cooperative spirit from PARC's Integrated Circuit Laboratory have been and will continue to be essential in whatever success we may achieve.
Bibliography
[1] Censier, L.M., and Feautrier P.
A New Solution to Coherence Problems in Multicache Systems.
IEEE Transactions on Computers C-27, No 12, December 1978, pages 1112-1118.
[2] Goodman, James R.
Using Cache Memory to Reduce Processor-Memory Traffic
Computer Architecture Symposium Proceedings, IEEE, pp 124-131, 1983.
[3] Johnsson, R. K., and Wick, J. D.
An Overview of the Mesa Processor Architecture
Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Notices 17, 4 (March 1982), pp 20-29.
[4] Kogge, Peter M.
The Architecture of Pipelined Machines
McGraw-Hill, 1981.
[5] McCreight, Edward M.
The Dragon Computer System,
Proceedings of the NATO Advanced Study Institute on Microarchitecture of VLSI Computers, Urbino, July, 1984.
[6] Patterson, David A., and Carlo H. Sequin.
A VLSI RISC
Computer, IEEE, XV(9): pp 8-21, September, 1982.
[7] Pier, Ken.
A Retrospective on the Dorado, A High-Performance Personal Computer
10th Annual International Symposium on Computer Architecture, December 1983, pp 252-269.
[8] Radin, George.
The 801 Minicomputer
IBM Journal of Research and Development, XXVII(3): pp 237-246, 1983.
[9] Smith, Alan Jay
Cache Memories
ACM Computing Surveys, XIV(3): 1982, pp 473-530.
[10] Teitelman, W.
A Tour Through Cedar
IEEE Software, Vol 1, No 2, April 1984, pp 44-73.