CSL Notebook Topic

To: CSL^.pa                                  Date: September 30, 1983
From: Russ Atkinson                          Location: PARC
Subject: Dragon Introduction                 Organization: CSL

XEROX

Release as: ??
Came from: [Ivy]<Dragon>DragonOverview.tioga
Last edited by: Russ Atkinson, September 30, 1983 2:36 pm

Abstract: This note gives an overview of Dragon, a high-speed, low-cost multiprocessor computer. A VLSI implementation is in progress.

Buzzwords: computer architecture, Mesa, pipelined processor, VLSI, cache, multiprocessor

An Overview of the Dragon Architecture

Introduction

In early 1982, a group of people (names?? {Thacker, Petit, Lampson, ??}) in the Computer Science Laboratory (CSL) of the Xerox Palo Alto Research Center (PARC) began the design of Dragon, a high-speed, low-cost multiprocessor computer. Although Dragon has not yet been constructed, its design is firm enough and interesting enough to warrant description. A VLSI implementation is in progress.

A single Dragon machine supports a single virtual address space of 2**32 (??) 32-bit words (16 gigabytes). A single Dragon machine may have as many as ten?? processors and as few as one.

The Dragon Processor

Each Dragon processor consists of an Instruction Fetch Unit (IFU), an Instruction Fetch Cache (IFC), an Execution Unit (EU), and an Execution Cache (EC). The IFU and EU are each a single VLSI chip. The number of chips for the IFC and EC is determined by the desired capacity of the caches; each cache can be as small as one chip and as large as four?? chips.

The Execution Unit

The EU handles the arithmetic and logical processing of data under the control of the IFU. Its design is similar to the arithmetic section of a conventional medium-scale mainframe. There is a 32-bit ALU, logic for a multiplication step (2 bits per cycle), and a shifter that can extract an arbitrary 32-bit field from a 64-bit quantity. There are 128 general registers, 16 auxiliary registers, and 12 constant registers.

A typical sequence followed in the EU is:
  cycle 0: read two registers
  cycle 1: 32-bit signed add
  cycle 2: write the result to a register

The EU is pipelined, so that an addition can be performed every cycle. Bypass logic allows the result of one add to be used as an input to the next.
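To make the pipelining concrete, the following is a minimal C sketch (not Dragon hardware or microcode) of the three-stage sequence above, including a bypass of the previous result. The register count is taken from the description above; the instruction format and the example program are purely illustrative.

    /* Minimal sketch (illustrative only): a 3-stage EU-style pipeline
     * (read, add, write) with bypass of the previous ALU result. */
    #include <stdio.h>
    #include <stdint.h>

    #define NREGS 128                          /* 128 general registers */

    typedef struct { int a, b, dest; } Inst;   /* dest := reg[a] + reg[b] */

    static int32_t reg[NREGS];

    static void Run(const Inst *prog, int n)
    {
        int prevDest = -1;                     /* destination of the add now in stage 1 */
        int32_t prevResult = 0;
        for (int i = 0; i < n; i++) {
            /* cycle 0: read two registers, with bypass from the previous add */
            int32_t a = (prog[i].a == prevDest) ? prevResult : reg[prog[i].a];
            int32_t b = (prog[i].b == prevDest) ? prevResult : reg[prog[i].b];
            /* cycle 1: 32-bit signed add */
            int32_t sum = a + b;
            /* cycle 2: the previous result is only now written back */
            if (prevDest >= 0) reg[prevDest] = prevResult;
            prevDest = prog[i].dest;
            prevResult = sum;
        }
        if (prevDest >= 0) reg[prevDest] = prevResult;   /* drain the pipeline */
    }

    int main(void)
    {
        Inst prog[] = { {1, 2, 3}, {3, 3, 4} };          /* r3 := r1+r2; r4 := r3+r3 */
        reg[1] = 5; reg[2] = 7;
        Run(prog, 2);
        printf("r3 = %d, r4 = %d\n", reg[3], reg[4]);    /* r3 = 12, r4 = 24 */
        return 0;
    }

The second add reads r3 before the first add's result has been written back, which is exactly the situation the bypass logic is meant to handle.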
The Instruction Fetch Unit

The IFU processes variable-length (1-4 byte) instructions into signals that the EU obeys. The IFU is heavily pipelined, with at least 4 cycles between the time when the IFU receives the word containing a given instruction and the time when the EU starts executing that instruction. The IFU processes instructions slightly faster than the EU does. This enables some instructions to be processed entirely in the IFU without wasted cycles in the EU. For example, the IFU often performs unconditional jumps, procedure calls, and procedure returns without unused EU cycles.

Most programming languages make use of a call stack to implement procedure invocations. In Dragon, the IFU and the EU cooperate to keep the topmost frames in registers. The procedure linkage is kept in the IFU stack, and the local variables and procedure arguments are kept in the EU general registers.

The Memory System

The memory system is composed of main memory (a minimum of 1 Mword), a memory bus (Mbus) that is shared among all of the processors, and a map processor that translates between virtual and physical addresses.

Bus, caches, & consistency

The Mbus normally transfers 4 32-bit words at a time, which matches the size of cache entries. The caches are write-back for normal cache entries and write-through for shared cache entries. A cache entry is marked as shared if the cache "sees" a request on the bus for an address that it also holds. To reduce latency, the first word requested in a cache miss arrives first. To reduce memory traffic for writing shared words, they are written onto the Mbus one word at a time as they are updated. The memory controller does not notice these updates; the words reach main memory only when a dirty cache entry is flushed.

Map processor

The map processor takes care of virtual-to-physical memory translation. It translates a 24-bit virtual page number (pages are 256 32-bit words) into a 16-bit real page number. Because of the large virtual address space, the map uses a combination of hash coding and tree search to perform this translation in a small number of probes (usually 1). Multiple map processors, one for each 16 Mwords, can be used to handle an arbitrary amount of physical memory (up to the limits of the virtual address space). (A sketch of such a lookup appears at the end of this note.)

Fast procedure call

In well-structured programs many of the structural boundaries are achieved through procedure calls. This makes procedure call performance fairly important.

The easy case

In the easy case, which should be the dynamically most common, there are no stack overflows or underflows. All of the arguments and local variables fit within the easily addressed range of registers (0-15). Before the call, the arguments are pushed. At the call instruction, the return PC and the current RL are pushed onto the IFU stack, and control transfers to the destination PC. The first byte in the procedure gives the number of arguments, which is subtracted from SP+1 to give RL. The arguments to the procedure thus become its locals without extra register movement. Before the return, the return values are written to the lowest words in the frame (where the arguments were).

Large argument records and frames

If the number of argument words exceeds some fixed limit (e.g. 12), then the compiler generates code to place the arguments into storage, and a pointer to that storage is placed on the stack in place of the arguments. The compiler also generates code in the called procedure to access the arguments through this storage. Large return records are handled in a similar manner. Similarly, if the number of words for local variables (and arguments that are in registers) exceeds 15, then at least some of the local variables are placed in auxiliary storage.

Features of the instruction set

The instruction set is byte-coded to reduce space and IFU fetches. It has an unusual mix of small instructions (1-2 bytes) that expect their operands on the stack, and large instructions that are used primarily for speed. Each conditional jump has two opcodes, one that predicts that the condition will be true and one that predicts that the condition will be false. The IFU uses this prediction to prefetch instructions after the conditional jump instruction.

Mapping processes onto multiple processors

Having multiple processors is of little use unless they can be made to work together on a problem. Dragon processors communicate through shared memory (CStore) and a single interrupt line. This is sufficient to support many processes on a few processors. We do not require that process-to-processor associations be fixed. Scheduling is performed by all of the processors.
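The section above describes communication through shared memory via CStore. This note does not spell out the exact semantics of CStore; assuming it behaves like an atomic conditional (compare-and-swap style) store on a shared word, a spin lock could be built roughly as in the following C sketch, which emulates the primitive with C11 atomics.

    /* Minimal sketch of mutual exclusion over shared memory.  CStore is
     * assumed here to behave like an atomic compare-and-swap on a shared
     * word; its actual semantics are not specified in this note. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef atomic_int Lock;               /* one shared word per lock */
    enum { FREE = 0, HELD = 1 };

    /* CStore(addr, old, new): store new at addr only if it still holds old. */
    static bool CStore(Lock *addr, int old, int new)
    {
        return atomic_compare_exchange_strong(addr, &old, new);
    }

    static void Acquire(Lock *l)
    {
        while (!CStore(l, FREE, HELD))
            ;                              /* spin until the lock word is free */
    }

    static void Release(Lock *l)
    {
        atomic_store(l, FREE);             /* shared cache entries are written
                                              through, so the release is seen
                                              on the Mbus by the other caches */
    }

Such a lock might, for example, guard a shared ready queue, since scheduling is performed by all of the processors.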
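The map processor section above mentions a combination of hash coding and tree search that usually succeeds in one probe. The following C sketch shows one plausible organization of such a lookup; the table size, hash function, and entry layout are assumptions made for illustration and are not the actual map data structure.

    /* Minimal sketch of a hashed map lookup: a 24-bit virtual page number
     * is hashed into a table, and colliding entries are kept in a small
     * binary search tree.  All sizes and the hash function are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    #define MAP_BUCKETS 4096                   /* illustrative table size */

    typedef struct MapEntry {
        uint32_t vpage;                        /* 24-bit virtual page number */
        uint16_t rpage;                        /* 16-bit real page number */
        struct MapEntry *left, *right;         /* overflow tree for collisions */
    } MapEntry;

    static MapEntry *buckets[MAP_BUCKETS];

    static unsigned Hash(uint32_t vpage)
    {
        return (vpage ^ (vpage >> 12)) % MAP_BUCKETS;
    }

    /* Returns the real page number, or -1 for a map fault (page not present). */
    static int Translate(uint32_t vpage)
    {
        MapEntry *e = buckets[Hash(vpage)];    /* usually the first probe hits */
        while (e != NULL) {
            if (e->vpage == vpage) return e->rpage;
            e = (vpage < e->vpage) ? e->left : e->right;
        }
        return -1;
    }

Since pages are 256 32-bit words, the physical word address is then the real page number times 256 plus the low 8 bits of the virtual word address.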