Dragon Architecture - SunShow
27 January 1988

DRAGON System Architecture

I    Introduction
II   The DynaBus
III  VLSI Chip Set
IV   Packaging
V    Computer Architectures
VI   Conclusions

I. INTRODUCTION
· MOTIVATIONS
· DRAGON TECHNOLOGY
· APPLICATIONS

Motivations
Foundations for building architectures for a wide range of document processing machines
· controllers: network, scanners...
· servers: database servers, printers, gateways
· workstations: mid, high & very high end
=> high bandwidth of data
Parallel processing architecture research
· Project started in 81
· Follows a first generation of shared-memory multiprocessors on a conventional bus, currently implemented in a Xerox product
· New generation using more advanced concepts & technology: VLSI, packaging
· Compatible with operating systems like Mach or SunOS phase3 & languages like Cedar

Dragon Technology
· Communications studies: VLSI bus [Figure]
· VLSI chip set [Figure]
· Packaging [Figure]

APPLICATIONS
Applications & markets
· High-end parallel computers
· Desktop multiprocessors
· High-end workstations
· High-end printer servers
· File and database servers
· Add-in boards in standard platforms
· Industrial control
· OEM multiprocessors
· Chip set & packaging

II.
The DynaBus
· HIGH SPEED BUS ARCHITECTURE
    Pipeline
    Bus configurations
· ELECTRICAL CONSIDERATIONS
    Termination
    Clock skew
    Packaging
· LOGIC OF THE BUS
    64 bits
    DBus
    Performance
· PROTOCOL

HIGH SPEED BUS ARCHITECTURE
Principle
Cycle of the bus = time to transmit information from one device to another
[Figure]
Tcycle = TckQ + Tprop + Tsetup + Tskew
8 ns   = 1 ns + 4 ns  + 1 ns   + 2 ns

Pipeline
[Figure]
Only one bidirectional segment in a pipelined bus => the backpanel in a multi-board system

Bus Configurations
Level 1: Mono-Board Computer [Figure]
Level 2: Multi-Board Computer [Figure]
Level 3: Multi-Board & Multi-Module Computer [Figure]

ELECTRICAL CONSIDERATIONS
Bus termination is imperative for balancing the lines
For the CMOS version of the BIC
· Open drain, so that power is dissipated in the resistors
· 50 Ohms at each end
· Power dissipation = U^2 / R
    2 Volt swing => 80 mW / resistor
    128 resistors => 10 Watts / bus
10 Watts / backpanel
15 Watts per board

Clock Skew
Clock distribution is critical; low skew is crucial
Tcycle > (TckQ)max + Tprop + (Tsetup)max - Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min
CMOS chips need a huge, uncontrolled clock amplifier to drive the high-capacitance internal clock
Hierarchical clock distribution, with the BIC generating the clock of each chip
[Figure]

Packaging & Transmission Lines
To obtain short cycles, lines must be balanced and act as perfect transmission lines
[Figure]
Standard PGA packages are difficult to use because of stubs, but SMD FQPC packages are very good
[Figure]
Next step is using hybrid modules
[Figure]

LOGIC OF THE BUS
Minimal number of wires for a 64-bit data path.
All commands are coded on 64 bits
[Figure]

Performance
Tcycle    Raw bandwidth    Usable bandwidth (raw * 4/7)
25 ns     320 MB/sec       182 MB/sec
10 ns     800 MB/sec       457 MB/sec

DBus
Seven wires. Used for initialization & debugging

PROTOCOL
Protocol oriented toward shared-memory multiprocessors
· Hardware data consistency
· Split-cycle for very high speed
· Supports multiple banks of memory
· Bridges to industry-standard buses
· Supports multi-level caches
· Mathematical model and proof of coherency

III. VLSI Chip Set
· CURRENT FAMILY OF SEVEN CHIPS
    BIC
    Arbiter
    Small Cache
    Memory Controller
    IOBridge
    Display/Printer
    Map Cache

Bus Interface Chip: BIC
Contains all the electrical specifics of the bus
· Slice of pipelined register (2 * 24 bits)
· Controls access to the bus
· Contains low-voltage drivers & receivers
· Clock skew regeneration
· Current implementation for hybrid modules

Arbiter
Controls bus access for up to 64 masters
· Distributed arbiter. Current implementation: one arbiter chip controls 8 masters; up to eight arbiters can be connected
· 7 priority levels, round robin inside one level
· Hold management, for locking of requestors
· System Stop generation
· DBus predecoder

Small Cache
Interface between a processor and the bus
· Contains the snoopy algorithm for consistency
· Fully associative dual-port memory: virtual CAM on the processor side & real CAM on the bus side
· Efficient first-level cache of the virtual-to-real table, built in for free
· Implements Conditional Write, for efficient multiprocessor locking on a split-cycle bus
· Entry point on the bus for all devices playing the consistency game. Example: IOBridge
· Current implementation in 2 micron: 5 KB; with 0.8 micron => 32 KB / chip

IOBridge
Interface between the DynaBus and an industry-standard Slow Bus
· Uses a Small Cache for access to the DynaBus.
  In a future implementation in 1.2 micron: merging of both IOB & Cache
· Provides transparent access & maps from the Slow Bus to the DynaBus
· Provides transparent access from the DynaBus to the Slow Bus
· Choice of virtual address or real address for IO
· Current implementation for the PC/AT, but easily adaptable to other standards: Micro-Channel, NuBus... or a special internal bus

Memory Controller
Controls a bank of memory from 8 MByte up to 128 MByte
· Implements the consistency algorithm
· Uses memory in nibble mode for fast access
· Implements ECC on 64 bits + 8: corrects one error, detects two
· Multi-bank, initialized by the DBus

Display/Printer
Refresh of the display/printer from the bus in the background
· Always low priority, except when its FIFO is empty
· Fully configurable: from B&W to 8 bits/pixel and up to 200 MByte/sec
· Up to three controllers for 24-bit color
· Architecture adapted for multiprocessors, avoiding flushing of caches
· Perfect with a second-level cache, in which the frame buffer is a multi-megabyte cache

Map Cache
Provides a second level of cache for the virtual-to-real address table
· Because the first-level cache is inside the Small Cache, this map is not used on every miss
· Replacement algorithm done completely in software by the processor which has a map miss. Control of victimization for IO
· Multiple Map chips possible
· First implementation: 256 entries

IV. Packaging
· HYBRID PACKAGE
· ASSEMBLY & CONNECTOR

HYBRID PACKAGE
Principles
· Design of a "chip carrier" containing many chips
· Intermediate level between chips and boards
· Perfect integration into our architecture
· Silicon substrate for experimentation
· Low to very low cost using a large-area process
[Figure]
Advantages for a lot of applications
· increase in speed
· gain in space
· solves power dissipation problems
· gain in cost

HYBRID ASSEMBLY
[Figure]

V.
Computer Architectures
· Mono-Board Computer
· Add-on Multiprocessor
· Multi-Board parallel computer using conventional packaging
· High-performance parallel computer using hybrid packaging

Mono-Board Computer
[Figure]

Add-on Multiprocessor
[Figure]

Multi-Board Parallel Computer with Conventional Packaging
Current implementation allows, for example, 24 processors and 8 memory banks
[Figure]

Multi-Board Parallel Computer with Hybrid Modules

VI. Conclusions
· Key Features of the Architecture
· Results and Milestones
· Transfer of Research
· Future Architecture

Key Features
· Unique VLSI bus => SPEED & PERFORMANCE
    Order of magnitude in speed, multiprocessor oriented & good for standardization for VLSI
· Unique chip set implementation => COST & SIMPLICITY
    Only seven LSI chips for all the family; chips replace the functionality of complete boards
· Unique packaging use => COST, SIZE & PERFORMANCE
    Advanced packaging used at the architecture level for mid & low-end computers
· Unique architecture open to industrial standards => EASY INTEGRATION
    Any standard microprocessor, bridges to existing standard buses, standard operating system
Order of magnitude in price/performance
2 years ahead of the competition

Results & Milestones
(Sparc Softcard:
· Wire-wrapped prototype in March)
Chip set:
· 5 chips will return from fab in February 88 (BIC, Arbiter, IOBridge, MemController, Simplex)
· 3 chips will be sent in 1Q 88 (Cache, Map Cache, Display/Printer)
Wire-Wrapped June87 Prototype:
· 3Q 88 with 4 Sparcs (depends on the Cache)
High Speed Bus Prototype & Packaging:
· 2Q 88

Transfer of Research
Try to standardize the bus: a VLSI bus does not exist yet
Partnership
    with equipment makers like: SUN
    with semiconductor companies: Motorola, National, AMD, Cypress, ...
Good chances, because the open architecture is compatible with industrial standards: any microprocessor, any add-in boards & possibly multiple operating systems...

Future Directions
This architecture is only at its beginning. A lot of evolution is expected: progressive use of the advanced packaging, second-level cache & other topologies...

Future Directions on Parallelism
Explore future parallel architectures while keeping this general model of "Shared Memory with Caches"
Big opportunities with new operating systems like Mach or SunOS phase3, which include this model of communication in the kernel, and languages like Cedar...
A high-speed bus is crucial for supporting high-performance VLSI operators:
high-speed vector processors, high-speed network controllers, graphics controllers, high-speed disk controllers, compression and decompression of images...
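
Worked check of the DynaBus figures
The cycle-time sum, the bandwidth table and the termination power quoted in the DynaBus section can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the function and parameter names are hypothetical, and the 4/7 factor is applied exactly as the slide states it, without assuming what the 4 and the 7 count. 1 MB is taken as 10^6 bytes.

```python
# Illustrative sketch: all names here are hypothetical, not from the slides.
# Only the numeric inputs (1/4/1/2 ns, 64 bits, 4/7, 2 V, 50 Ohms, 128) come
# from the DynaBus section.

def cycle_time_ns(t_ckq, t_prop, t_setup, t_skew):
    """Tcycle = TckQ + Tprop + Tsetup + Tskew, all in nanoseconds."""
    return t_ckq + t_prop + t_setup + t_skew

def raw_bandwidth_mb_s(t_cycle_ns, width_bits=64):
    """One full-width transfer per bus cycle, in MB/sec (1 MB = 10**6 bytes)."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * 1000 / t_cycle_ns  # bytes/ns * 1000 = MB/sec

def usable_bandwidth_mb_s(t_cycle_ns, factor=4 / 7):
    """Raw bandwidth scaled by the slide's 4/7 factor."""
    return raw_bandwidth_mb_s(t_cycle_ns) * factor

def termination_power_w(swing_v=2.0, r_ohm=50.0, resistors=128):
    """Total termination power: resistors * U^2 / R, in watts."""
    return resistors * swing_v ** 2 / r_ohm

print("Tcycle (ns):", cycle_time_ns(1, 4, 1, 2))
for t in (25, 10):
    print(t, "ns:", raw_bandwidth_mb_s(t), "raw,",
          int(usable_bandwidth_mb_s(t)), "usable (MB/sec, truncated)")
print("Termination power per bus (W):", termination_power_w())
```

The truncated usable figures match the table (182 and 457 MB/sec), and the termination power comes out slightly above the rounded 10 Watts quoted per bus.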