Dragon Architecture May 1988

DRAGON System Architecture

I Introduction
II The DynaBus
III VLSI Chip Set
IV Packaging
V Address Mapping
VI Input-Output & Interruption
VII Debug Bus & BootStrap processor
VIII Computer Configurations
IX Conclusions

I. INTRODUCTION
· MOTIVATIONS
· DRAGON TECHNOLOGY
· APPLICATIONS

Motivations
Foundations for building architectures for a wide range of document processing machines
· controllers : Network, Scanners...
· servers : Database servers, Printers, Gateway
· workstations : mid, high & very high end => high data bandwidth
Parallel processing architecture research
· Project started in 1981
· Follows a first generation of shared-memory multiprocessors on a conventional Bus, currently implemented in a Xerox product
· new generation using more advanced concepts & technology : VLSI, packaging
· Compatible with Operating systems like Mach or SunOS phase3 & languages like Cedar

Dragon Technology
· Communications studies : VLSI BUS
· VLSI Chip Set
· Packaging

Computer Architecture

Configurations
Monoboard system
Computer Server : 24 processors and 8 memory Banks

APPLICATIONS
Applications & markets
· High end parallel computer
· Desk-Top multiprocessor
· High end Workstations
· High end Printer servers
· File and Database servers
· Add-in boards in standard platforms
· Industrial control
· OEM multiprocessors
· Chip set & packaging

II.
The DynaBus
· HIGH SPEED BUS ARCHITECTURE
    Pipeline
    Bus configurations
· ELECTRICAL CONSIDERATIONS
    Termination
    Clock skew
    Packaging
· LOGIC OF THE BUS
    64 bits
    DBus
    Performances
· PROTOCOL

HIGH SPEED BUS ARCHITECTURE
Principle
Bus Cycle = Time to transmit information from one device to another
Tcycle = TckQ + Tprop + Tsetup + Tskew
8 ns   = 1 ns + 4 ns  + 1 ns   + 2 ns
Pipeline
Only one bidirectional segment in a pipelined Bus => Backpanel in a Multi-Board system

Bus Configurations
Level 1 : Mono-Board Computer
Level 2 : Multi-Board Computer
Level 3 : Multi-Board & Multi-Module Computer

ELECTRICAL CONSIDERATIONS
Bus termination is required to balance the lines
For the CMOS version of the BIC
· open Drain for dissipation in resistors
· 50 Ohms at each end
· Power dissipation = U^2 / R
    2 Volts swing => 80 mW / resistor
    128 resistors => 10 Watts / Bus
    10 Watts / BackPanel
    15 Watts per Board

Clock Skew
Clock distribution is critical, low Skew is crucial
Tcycle > (TckQ)max + Tprop + (Tsetup)max - Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min
CMOS Chips need a huge uncontrolled Clock amplifier to drive the high-capacitance internal Clock
Hierarchical Clock distribution with the BIC generating the Clock of each Chip

Packaging & Transmission Lines
To obtain short cycles, lines must be balanced and act as perfect transmission lines
Using standard PGA is difficult because of stubs, but SMD FQPC are very good
Next step is using Hybrid modules

LOGIC OF THE BUS
Minimal number of wires for a 64-bit data path.
All commands are coded on 64 bits

Performances
Tcycle    Raw BdWidth    Usable BdWidth (*4/7)
25 ns     320 MB/sec     182 MB/sec
10 ns     800 MB/sec     457 MB/sec

DBus
Seven wires. Used for initialization & debugging

BUS INTERFACE
Common logical connection
Figure: DynaBus Logical Interface

Protocol oriented for multiprocessor with shared memory
· Hardware data consistency
· Split-cycle for very High speed
· Supports multi-bank memory
· Bridges with industrial standard Bus
· Supports Multi-level Caches
· Mathematical model and proof of coherency

· Each operation is atomic
· Operations are serialized
· Real Time ordering respected

Single-Level Operation
· Invariants
    I1. > 1 cached copies => Shared is set in each
    I2. At most one cache has Owner set
    I3. Copy last written has Owner set
    I4. Cached copies have identical values

Two-Level Operation
Invariants for two level
    I1. Every copy in a cache is also in its parent
    I2. The parent of a copy has ExistsBelow set
    I3. >1 brother copies => Shared set in each
    I4. The son of a Shared copy has Shared set
    I5. The parent of an Owner copy has Owner set
    I6. At most one brother has Owner set
    I7. Copy last written by a processor has Owner set
    I8. Shared copies have identical values

III.
VLSI Chip Set
· CURRENT FAMILY OF SEVEN CHIPS
    BIC
    Arbiter
    Small Cache
    Memory Controller
    IOBridge
    Display/Printer
    Map Cache

Bus Interface Chip : BIC
Contains all the electrical specificity of the bus
· Slice of Pipelined register (2 * 24 bits)
· Controls access of the Bus
· Contains low voltage driver & receiver
· Clock Skew regeneration
· Current implementation for Hybrid Modules

Arbiter
Controls Bus access for up to 64 masters
· Distributed arbiter. Current implementation : one arbiter chip controls 8 masters, up to eight arbiters
· Works with all pipeline configurations
· 7 priority levels, Round robin inside one level
· Hold management, for lock of requestors
· System Stop generation
· DBus predecoder

Small Cache
Interface between a Processor and the Bus
· Contains the Snoopy algorithm for consistency
· Fully associative Dual port memory
    Virtual Cam on the Processor Side & Real Cam on the Bus side
· Efficient first Cache of the Virtual to Real Table built-in for free
· Implements Conditional Write, for efficient multiprocessor locking on a split-cycle Bus
· Entry point on the Bus for all devices playing the consistency game. Example : IOBridge.
· Current implementation in 2 micron : 5 KB; with 0.8 micron => 32 KB / Chip

Cache Block Diagram

Functional Specifications: MemOps
· PRead 32 bit address, 32 bit data
· PWrite 32 bit address, 32 bit data
· PByteWrite 32 bit address, 32 bit data
    4 write enable bits
    any of 16 patterns allowed
· CWS 32 bit address, 64 bit data

Functional Specifications: CWS
· CWS[addr, old, new] RETURNS [sample] = { sample <- addr^; IF sample=old THEN addr^ <- new }
· Implemented in caches
· No bus traffic for private data
· No locks anywhere
· Maximum possible overlap

Functional Specifications: IO
· Single common IO address space
· Much like memory (i.e.
  hit/miss)
· Local locations (accessible via P or B)
    CWSOld (32)
    CWSNew (32)
    AidReg (32)
    FaultCode (32)
    InterruptStatus (32)
    InterruptMask (32)
    Operating Mode (32)

Functional Specifications: Mapping
· One address space at a time
· Specified by AidReg
· No data flush on space change
· Writing to AidReg clears all VPValid
· Demap[realPage]
    Cache initiates DeMap transaction
    On reply all caches match and ClrVPValid
· Aliasing avoided automatically
    Match on RA before writing

IOBridge
Interface between the DynaBus and an industrial standard Slow Bus
· Uses a Small Cache for access to the DynaBus. Future implementation in 1.2 micron : merging of both IOB & Cache
· Provides transparent access & Maps from the Slow Bus to the DynaBus
· Provides transparent access from the DynaBus to the Slow Bus
· Choice of Virtual Address or Real Address for IO
· Current implementation for the PC/AT, but easily adaptable to other standards : Micro-Channel, NuBus.. or special internal Bus

Memory Controller
Controls a Bank of memory from 8 MByte up to 1 GByte
· Implements consistency algorithm
· Uses Memory in nibble-mode for fast access
· Implements ECC on 64 bits + 8
    Corrects one error, detects two
· Multi-Bank, init by DBus

DISPLAY ARCHITECTURE
· Complex problem for multiprocessor
· Many different architectures depending on the performance
· Solution studied is focused on only one particular application : low performance/low cost
· the use of a high speed bus with consistency can provide some innovative solutions for the high end...
Conventional Display Architecture

Display/Printer Chip
Refresh of the Display/Printer from the Bus in Background
· Always low-priority, except when its Fifo is empty
· Fully configurable : from B&W to 8 bits/pixel and up to 200 MByte/sec
· Up to three controllers for 24 bits Color
· Architecture adapted for multiprocessor, avoiding cache flushing
· Perfect with Second level Cache, in which the Frame Buffer is a multi-MegaByte Cache

Map Cache
Provides a second level of Cache of the Virtual to Real address Table
· Because the first level cache is inside the Small Cache, this Map is not used at every miss
· Replacement algorithm completely done in software by the processor which has a map miss. Control of victimization for IO
· Multiple MapCache Chips possible
· First implementation 256 entries

IV. Packaging
· HYBRID PACKAGE
· ASSEMBLY & CONNECTOR

HYBRID PACKAGE
Principles
· Design of a "Chip Carrier" containing many Chips
· Intermediate level between Chips and Boards
· Perfect integration into our architecture
· Substrate in Silicon for experimentation
· Low to very low cost using large area process
Advantages for lots of applications
· increase of speed
· gain in space
· solves power dissipation problems
· gain in cost

HYBRID ASSEMBLY

V.
Address Mapping
· Principles
· Mapping Process
· Address Spaces and Sharing areas
· Map Cache functions
· Paging in and Paging Out

Principles
· DynaBus deals exclusively with real addresses, which is required for snooping
· Small Cache does Virtual -> Real mapping
· Map-Cache(s) provide a system wide TLB
· Hierarchical search for Mapping
    first try inside current cache entries
    then ask the Map cache
    then trap for software handling

Mapping Process Inside the Cache

Mapping Process (cont)
Mapping in three steps

Pages structure
Constants
· Pages are 4 KByte, unit of mapping & protection
· Virtual Page (VP) 22 bits, Real Page (RP) 22 bits now
· Address Space Number ASN on 32 bits
· Flags : KWe, UWe, URe, Dirty, (Used ?)
Map Table
· Resides in main memory
· Structured by software
    hash table (aid,vp)->(rp,flags) or hierarchical tables

Translation mechanism
Three exceptions for the translation
    if ASN = -1
        Identity Map
        Useful for StartUp & IO
    if VP inside Map Bypass Area
        RP is computed
        if vp rp <- (BypassBase
    if VP inside Shared Area (Kernel..)
        we use ASN=0
        if vp [rp, flags] <- LookUp [vp, ASN=0]
    Else (normal case)
        [rp, flags] <- LookUp [vp, ASN]

Map Cache Functions
· Map [aid, vp] -> [rp, flags]
    MapFault error if entry not present
· ReadEntry [vp] -> [rp, flags]; implicit ASN
    MapFault error if entry not present
· WriteEntry [vp, rp, flags, valid]; implicit ASN
    IOAccessFault error if not in kernel mode
· Read & Write Internal Register
    ASN
    SharedPattern, SharedMask
    BypassPattern, BypassMask, BypassBase
    SubSetMask, SubSetPattern

Paging In & Paging Out
Paging In
    Only requires inserting the new VP->RP entry in the Memory Table
Paging Out
    Remove all virtual page entries for that physical page from the Memory Table
    Remove all virtual page entries from the Map Cache
    Generate a DynaBus DeMap Request
    Issue the disk IO if the Page was dirty

VI INPUT-OUTPUT & Interruptions
· Address encoding
· DynaBus IO Commands
· DynaBus IO from a Processor
· DynaBus IO from a DMA device
· Bridges to commercial Buses
· Interruptions

Address encoding
· 32-bit addresses (now)
    addressable item is a 32-bit word
· split in 3 fields: DevType, DevNum, DevOffset
    DevType -> type of IO device (SmallCache, IOBridge, DisplayController, MapCache ...)
    DevNum -> number of device within type (unique)
    DevOffset -> word offset within device
· 3 different encodings
    Large:  14 types, 16 devices per type, 24-bit offset
        ddddnnnn aaaaaaaa aaaaaaaa aaaaaaaa
    Medium: 31 types, 256 devices per type, 16-bit offset
        000ddddd nnnnnnnn aaaaaaaa aaaaaaaa
    Small:  31 types, 1024 devices per type, 10-bit offset
        0000000d ddddnnnn nnnnnnaa aaaaaaaa

DynaBus IO Commands
Processor is always the transaction requestor
· IORead
    32-bit transfer from IO device to processor
· IOWrite
    32-bit transfer from processor to IO device
· BIOWrite
    32-bit transfer from processor to IO device with broadcast on device type
    unsafe since reply is generated by bus terminator, not by target device

DynaBus IO from a DMA device
· WriteBlockRequest overrides consistency protocol:
    value sent is taken as definitive independently of current ownership
    simpler for the IO device than full cache emulation
    more efficient as read-before-write is not needed
· ReadBlockRequest
    IO devices use regular transaction for DMA reads
    simple for IO device if late consistency not required
· Address translation must be provided by the IO device
    as only physical addresses are carried on the DynaBus
    IO devices may use MapCache services or have loadable translation tables

Bridges to commercial Buses
· Provides two-way transaction mapping
    DynaBus -> commercial bus
    commercial bus -> DynaBus
    commercial bus interrupts (ITs) -> DynaBus
· Address mapping
    address space size mismatches
    virtual address IO
· Interrupt mapping
    depends highly on commercial bus
· Flexibility at the cost of performance
· No support of consistency for RAM on commercial bus
· Support for multiple commercial buses
· Support for multiple bridges to single bus

Performances
Bridges offer easy extensibility at a price:
· Throughput is limited
    - by allocation time on commercial bus
    - by usage of the SmallCache on DynaBus
· In current implementation, effective throughput may reach -
      20 MBytes/sec commercial bus -> DynaBus
    - 12 MBytes/sec DynaBus -> commercial bus
    Both limitations due to single outstanding request from cache
· Latency may be high
    - up to 2
    - may be a problem for certain devices

DynaBus Interruptions
· No specific interrupt transaction
    ITs are transmitted as IOWrites from IO devices to caches at a well-known address
· 32 edge-triggered non-prioritized interrupts
    Priority management left to processor
· Interrupts may be broadcast or directed
    Permits dedicated IOP(s) or dynamic IOP allocation
· Interrupts may be generated by processor
    Permits interprocessor task scheduling
· 2 registers per SmallCache
    Interrupt status : Set bit(s), Clear bit(s), Read
    Interrupt mask   : Write, Read

VII Debug Bus & Bootstrap Processor
· Principles
· Summary specification
· DBus and system bootstrap
· DBus and debugging

Principles
· Independent from DynaBus
· Controlled by bootstrap/debug processor
· All other system components are DBus slaves
· Used to
    tune clock
    read chip identification
    test chips
    initialize chips (software configuration)
    start/stop system
    debug hardware failures
· Permits auto-configured systems
· Permits self-testing systems

Debug Bus specification
· Low-speed asynchronous serial bus
    6 wires (+1)
    approx.
    1 MBit/sec
· Serial 16-bit addressing
· Geographical addressing and hierarchical decoding
    board -> hybrid -> chip -> register
· Functions
    Load/Unload chip register
    Request chip to execute a function
    Reset control
· Very simple implementation
    1 chip (PAL) on PC/AT bus

DBus and system bootstrap
· Bootstrap is controlled by bootstrap processor
    currently the PC/AT that also handles IO
    bootstrap processor may be very simplistic
· Typical bootstrap sequence
    Assert Reset (DBus)
    Analyze machine configuration (DBus)
    Tune clock (DBus)
    Perform chip sanity checking (DBus)
    Initialize DynaBus Device IDs (DBus)
    Initialize chip-specific parameters (DBus)
    Synchronous reset of arbiters
    Rescind Reset (DBus)
    Start arbiters (SStop)
    Load bootstrap code
    Start a processor (through cache)

DBus and debugging
· Provides freeze function
    Freeze is asynchronous and non-restartable
    DynaBus provides packet-synchronous freeze by arbiter stop
· Permits reading chip status for post-mortem dump
    As provided by chip designer
    Full LSSD possible
· May initiate chip self-test

VIII. Computer Configurations
· Multi-Board parallel Computer using conventional packaging
· Mono-Board Computer using conventional packaging
· Add-on Multiprocessor using Hybrid packaging
· High performance parallel Computer using Hybrid packaging

MultiBoard Parallel Computer with conventional Packaging
Current implementation allows 24 processors and 8 memory Banks

Monoboard system

Add-on Multiprocessor

MultiBoard Parallel Computer with Hybrid Modules

IX.
Conclusions
· Key Features of the Architecture
· Results and Milestones
· Transfer of Research
· Future Architecture

Key Features
· Unique VLSI Bus => SPEED & PERFORMANCE
    order of magnitude in speed, multiprocessor oriented & good for Standardization for VLSI
· Unique Chip Set implementation => COST & SIMPLICITY
    only seven LSI for all the family, Chips replace functionalities of complete boards
· Unique Packaging use => COST, SIZE & PERFORMANCE
    Advanced packaging used at the architecture level for mid & low end computers
· Unique open architecture to industrial standards => EASY INTEGRATION
    Any standard microprocessor, Bridge with existing standard buses, Standard operating system
order of magnitude in price/performance
1-2 years ahead of the competition

Results & Milestones
(Sparc Softcard :
· Wire-wrapped prototype in June)
Chip Set :
· 5 chips returned from fab in February 88 (BIC, Arbiter, IOBridge, MemController, Simplex)
· 3 chips will be sent 2Q88 (Cache, MAP Cache, Display/Printer)
Wire-Wrapped June87 Prototype :
· 4Q 88 with 4 Sparcs (depends on the Cache)
High Speed Bus Prototype & Packaging :
· 2Q 88

Transfer of Research
Try to standardize the BUS
    a VLSI BUS does not exist yet
Partnership
    with Equipment makers like : SUN
    with Semi-conductor companies : Motorola, National, AMD, Cypress, ...
good chances because open architecture compatible with industrial standards : any microprocessor, any add-in boards & possible multi-operating system...

Future Directions
This architecture is only at its beginning.
Many evolutions are expected : progressive use of the Advanced packaging, Second level of Cache & other topologies...

Future Directions on parallelism
Explore future parallel architectures while keeping this general model of "Shared Memory with Caches"
Big opportunities with new operating systems like Mach or SunOS phase3, which include this model of communication in the kernel, and languages like Ada or Cedar...
High Speed Bus is Crucial for supporting high performance VLSI operators
High speed vector processor, High speed network controller, graphics controller, High speed disk controller, Compression/Decompression of images...
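
The bus figures quoted in the DynaBus section (cycle time budget, raw and usable bandwidth, termination power) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the 4/7 usable fraction is taken from the performance table as given (the slides do not derive it), and MB is assumed to mean 10^6 bytes.

```python
from fractions import Fraction

BUS_WIDTH_BYTES = 8               # 64-bit data path, from the slides
USABLE_FRACTION = Fraction(4, 7)  # "*4/7" factor quoted in the table; not derived in the source


def cycle_time_ns(t_ckq, t_prop, t_setup, t_skew):
    """Tcycle = TckQ + Tprop + Tsetup + Tskew (all in ns)."""
    return t_ckq + t_prop + t_setup + t_skew


def raw_bandwidth_mb_s(t_cycle_ns):
    """One 8-byte transfer per cycle; bytes/ns * 1000 gives MB/sec (10^6 bytes)."""
    return Fraction(BUS_WIDTH_BYTES * 1000, t_cycle_ns)


def usable_bandwidth_mb_s(t_cycle_ns):
    """Raw bandwidth scaled by the usable fraction from the table."""
    return raw_bandwidth_mb_s(t_cycle_ns) * USABLE_FRACTION


def termination_power_w(swing_volts, resistance_ohms):
    """Power dissipated in one termination resistor, P = U^2 / R (integer inputs)."""
    return Fraction(swing_volts ** 2, resistance_ohms)


if __name__ == "__main__":
    print("Tcycle:", cycle_time_ns(1, 4, 1, 2), "ns")
    for t in (25, 10):
        print(t, "ns:", int(raw_bandwidth_mb_s(t)), "MB/sec raw,",
              int(usable_bandwidth_mb_s(t)), "MB/sec usable")
    per_resistor = termination_power_w(2, 50)
    print("Termination:", float(per_resistor * 1000), "mW per resistor,",
          float(per_resistor * 128), "W for 128 resistors")
```

Truncating the usable bandwidth to whole MB/sec reproduces the 182 and 457 entries of the table, and 128 resistors at 80 mW each come to 10.24 W, which the slides round to 10 Watts per Bus.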