Dragon Architecture, 2 February 1988

DRAGON System Architecture

I    Introduction
II   The DynaBus
III  VLSI Chip Set
IV   Packaging
V    Computer Architectures
VI   Conclusions

I. INTRODUCTION
· MOTIVATIONS
· DRAGON TECHNOLOGY
· APPLICATIONS

Motivations

Foundations for building architectures for a wide range of document processing machines
· controllers : Network, Scanners...
· servers : Database servers, Printers, Gateways
· workstations : mid, high & very high end
=> high data bandwidth

Parallel processing architecture research
· Project started in 1981
· Follows a first generation of shared memory multiprocessors on a conventional Bus, currently implemented in a Xerox product
· New generation using more advanced concepts & technology : VLSI, packaging
· Compatible with operating systems like Mach or SunOS phase3 & languages like Cedar

Dragon Technology
· Communications studies : VLSI BUS [Figure]
· VLSI Chip Set [Figure]
· Packaging [Figure]
· Computer Architecture [Figure]

APPLICATIONS

Applications & markets
· High end parallel computers
· Desk-Top multiprocessors
· High end Workstations
· High end Printer servers
· File and Database servers
· Add-in boards in standard platforms
· Industrial control
· OEM multiprocessors
· Chip set & packaging

II.
The DynaBus
· HIGH SPEED BUS ARCHITECTURE
    Pipeline
    Bus configurations
· ELECTRICAL CONSIDERATIONS
    Termination
    Clock skew
    Packaging
· LOGIC OF THE BUS
    64 bits
    DBus
    Performances
· PROTOCOL

HIGH SPEED BUS ARCHITECTURE

Principle
Bus Cycle = time to transmit information from one device to another
[Figure]
Tcycle = TckQ + Tprop + Tsetup + Tskew
8 ns   = 1 ns + 4 ns  + 1 ns   + 2 ns

Pipeline
[Figure]
Only one bidirectional segment in a pipelined Bus => the Backpanel in a Multi-Board system

Bus Configurations
Level 1 : Mono-Board Computer [Figure]
Level 2 : Multi-Board Computer [Figure]
Level 3 : Multi-Board & Multi-Module Computer [Figure]

ELECTRICAL CONSIDERATIONS

Termination
Bus termination is required for balancing the lines
For the CMOS version of the BIC
· open Drain, for dissipation in resistors
· 50 Ohms at each end
· Power dissipation = U^2 / R
    2 Volts swing => 80 mW / resistance
    128 resistances => 10 Watts / Bus
10 Watts / BackPanel
15 Watts per Board

Clock Skew
Clock distribution is critical, low Skew is crucial
Tcycle > (TckQ)max + Tprop + (Tsetup)max - Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min
CMOS Chips need a huge uncontrolled Clock amplifier, for driving the internal High capacitance Clock
Hierarchical Clock distribution, with the BIC generating the Clock of each Chip
[Figure]

Packaging & Transmission Lines
To obtain short cycles, lines must be balanced and act as perfect transmission lines
[Figure]
Using standard PGAs is difficult because of stubs, but SMD FQPC are very good
[Figure]
Next step is using Hybrid modules
[Figure]

LOGIC OF THE BUS
Minimal number of wires for a 64 bits data path.
All commands are coded on 64 bits
[Figure]

Performances

Tcycle    Raw Bandwidth    Usable Bandwidth (* 4/7)
25 ns     320 MB/sec       182 MB/sec
10 ns     800 MB/sec       457 MB/sec

DBus
Seven wires. Used for initialization & debugging

PROTOCOL
Protocol oriented for multiprocessors with shared memory
· Hardware data consistency
· Split-cycle for very high speed
· Supports multi-bank memory
· Bridges with industry standard Buses
· Supports Multi-level Caches
· Mathematical model and proof of coherency

III. VLSI Chip Set
· CURRENT FAMILY OF SEVEN CHIPS
    BIC
    Arbiter
    Small Cache
    Memory Controller
    IOBridge
    Display/Printer
    Map Cache

Bus Interface Chip : BIC
Contains all the electrical specificity of the bus
· Slice of Pipelined register (2 * 24 bits)
· Controls access to the Bus
· Contains low voltage driver & receiver
· Clock Skew regeneration
· Current implementation for Hybrid Modules

Arbiter
Controls Bus access for up to 64 masters
· Distributed arbiter. Current implementation : one arbiter chip controls 8 masters, with up to eight arbiters
· Works with all pipeline configurations
· 7 priority levels, Round robin inside one level
· Hold management, for lock of requestors
· System Stop generation
· DBus predecoder

Small Cache
Interface between a Processor and the Bus
· Contains the Snoopy algorithm for consistency
· Fully associative Dual port memory : Virtual CAM on the Processor side & Real CAM on the Bus side
· Efficient first Cache of the Virtual to Real Table built in for free
· Implements Conditional Write, for efficient multiprocessor locking on a split-cycle Bus
· Entry point on the Bus for all devices playing the consistency game. Example : IOBridge.
· Current implementation in 2 micron : 5 KB; with 0.8 micron => 32 KB / Chip

IOBridge
Interface between the DynaBus and an industry standard Slow Bus
· Uses a Small Cache for access to the DynaBus.
In a future implementation in 1.2 micron : merging of both IOB & Cache
· Provides transparent access & Maps from the Slow Bus to the DynaBus
· Provides transparent access from the DynaBus to the Slow Bus
· Choice of Virtual Address or Real Address for IO
· Current implementation for the PC/AT, but easily adaptable to other standards : Micro-Channel, NuBus... or a special internal Bus

Memory Controller
Controls a Bank of memory from 8 MByte up to 1 GByte
· Implements the consistency algorithm
· Uses Memory in nibble-mode for fast access
· Implements ECC on 64 bits + 8 : corrects one error, detects two
· Multi-Bank, init by DBus

Display/Printer
Refresh of the Display/Printer from the Bus in Background
· Always low-priority, except when its Fifo is empty
· Fully configurable : from B&W to 8 bits/pixel and up to 200 MByte/sec
· Up to three controllers for 24 bits Color
· Architecture adapted for multiprocessors, avoiding cache flushing
· Perfect with a Second level Cache, in which the Frame Buffer is a multi-MegaByte Cache

Map Cache
Provides a second level of Cache of the Virtual to Real address Table
· Because the first level cache is inside the Small Cache, this Map is not used at every miss
· Replacement algorithm done completely in software by the processor which has a map miss. Control of victimization for IO
· Multiple MapCache Chips possible
· First implementation : 256 entries

IV. Packaging
· HYBRID PACKAGE
· ASSEMBLY & CONNECTOR

HYBRID PACKAGE

Principles
· Design of a "Chip Carrier" containing many Chips
· Intermediate level between Chips and Boards
· Perfect integration into our architecture
· Substrate in Silicon for experimentation
· Low to very low cost using large area process
[Figure]

Advantages for lots of applications
· increase of speed
· gain in space
· solves power dissipation problems
· gain in cost

HYBRID ASSEMBLY
[Figure]

V.
Computer Architectures
· Multi-Board parallel Computer using conventional packaging
· Mono-Board Computer using conventional packaging
· Add-on Multiprocessor using Hybrid packaging
· High performance parallel Computer using Hybrid packaging

MultiBoard Parallel Computer with conventional Packaging
Current implementation allows, for example, 24 processors and 8 memory Banks
[Figure]

Add-on Multiprocessor
[Figure]

MultiBoard Parallel Computer with Hybrid Modules
[Figure]

VI. Conclusions
· Key Features of the Architecture
· Results and Milestones
· Transfer of Research
· Future Architecture

Key Features
· Unique VLSI Bus => SPEED & PERFORMANCE
    order of magnitude in speed, multiprocessor oriented & good for Standardization for VLSI
· Unique Chip Set implementation => COST & SIMPLICITY
    only seven LSI for all the family, Chips replace functionalities of complete boards
· Unique Packaging use => COST, SIZE & PERFORMANCE
    Advanced packaging used at the architecture level for mid & low end computers
· Unique open architecture to industrial standards => EASY INTEGRATION
    Any standard microprocessor, Bridge with existing standard buses, Standard operating system
order of magnitude in price/performance
1-2 years of advance on competition

Results & Milestones
(Sparc Softcard :
· Wire-wrapped prototype in March)
Chip Set :
· 5 chips will return from fab in February 88 (BIC, Arbiter, IOBridge, MemController, Simplex)
· 3 chips will be sent 1Q88 (Cache, MAP Cache, Display/Printer)
Wire-Wrapped June87 Prototype :
· 3Q 88 with 4 Sparcs (depends on the Cache)
High Speed Bus Prototype & Packaging :
· 2Q 88

Transfer of Research
Try to standardize the BUS : a VLSI BUS does not exist yet
Partnership
    with Equipment makers like : SUN
    with Semi-conductor companies : Motorola, National, AMD, Cypress, ...
Good chances, because the open architecture is compatible with industrial standards : any microprocessor, any add-in boards & possibly multiple operating systems...

Future Directions
This architecture is only at its beginning. Many evolutions are expected : progressive use of the Advanced packaging, Second level of Cache & other topologies...

Future Directions on parallelism
Explore future parallel architectures while keeping this general model of "Shared Memory with Caches"
Big opportunities with new operating systems like Mach or SunOS phase3, which include this model of communication in the kernel, and languages like Ada or Cedar...

High Speed Bus is crucial for supporting high performance VLSI operators
High speed vector processor, High speed network controller, graphic controller, High speed disk controller, Compression/Decompression of images...
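
Worked check of the DynaBus figures (Section II)
The Python sketch below recomputes three numbers quoted on the DynaBus slides: the 8 ns cycle budget, the raw and usable bandwidths at 25 ns and 10 ns, and the termination power per Bus. The function names are illustrative only and are not part of the Dragon design. It assumes that one 64-bit word is transferred per bus cycle and that 1 MB means 10^6 bytes. The 4/7 usable fraction and the count of 128 resistances are taken from the slides as given, without deriving them.

```python
# Illustrative arithmetic only; every function name here is hypothetical.

def bus_cycle_ns(t_ckq, t_prop, t_setup, t_skew):
    """Cycle budget from the slides: Tcycle = TckQ + Tprop + Tsetup + Tskew (ns)."""
    return t_ckq + t_prop + t_setup + t_skew


def raw_bandwidth_mb_s(t_cycle_ns, data_bits=64):
    """Raw bandwidth in MB/s, assuming one data word per cycle and 1 MB = 10**6 bytes."""
    bytes_per_cycle = data_bits / 8
    # bytes per nanosecond is GB/s; multiply by 1000 to get MB/s.
    return bytes_per_cycle * 1000 / t_cycle_ns


def usable_bandwidth_mb_s(t_cycle_ns, data_bits=64):
    """Usable bandwidth, applying the 4/7 factor quoted on the Performances slide."""
    return raw_bandwidth_mb_s(t_cycle_ns, data_bits) * 4 / 7


def termination_power_w(swing_v=2.0, resistance_ohm=50.0, n_resistors=128):
    """Total termination power: n * U^2 / R, with the slide's values as defaults."""
    return n_resistors * swing_v ** 2 / resistance_ohm


if __name__ == "__main__":
    print("Tcycle (ns):", bus_cycle_ns(1, 4, 1, 2))
    for t_cycle in (25, 10):
        print(t_cycle, "ns:",
              raw_bandwidth_mb_s(t_cycle), "MB/s raw,",
              usable_bandwidth_mb_s(t_cycle), "MB/s usable")
    print("Termination power (W):", termination_power_w())
```

Under these assumptions the usable bandwidths come out slightly above 182 and 457 MB/sec, so the slide values correspond to the results truncated to whole MB/sec. The termination power comes out at 10.24 W, which the slide quotes as 10 Watts / Bus.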