Dragon  Architecture
May 1988
           DRAGON  System Architecture
            

I    Introduction

II    The DynaBus 
III    VLSI Chip Set
IV    Packaging

V    Address Mapping
VI    Input-Output & Interrupts
VII    Debug Bus & Bootstrap Processor

VIII    Computer Configurations
IX    Conclusions
I.  INTRODUCTION


      ·  MOTIVATIONS
      
      ·  DRAGON TECHNOLOGY
      
      ·  APPLICATIONS
     

Motivations
Foundations for building architectures  for a wide range of document processing machines 

      ·  controllers: network, scanners...
      
      ·  servers: database servers, printers, gateways
      
      ·  workstations: mid, high & very high end
     
           => high data bandwidth 
     
Parallel processing architecture research

      ·  Project started in 1981
      
      ·  Follows a first generation of shared-memory
           multiprocessors built on a conventional bus,
           currently implemented in a Xerox product
           
      ·  New generation uses more advanced concepts
          & technology: VLSI, packaging
          
      ·  Compatible with operating systems like Mach
          or SunOS phase3 & languages like Cedar

Dragon Technology


· Communications studies: VLSI bus
[Figure]
· VLSI chip set
[Figure]
· Packaging
[Figure]
Computer Architecture

[Figure]
Configurations
Monoboard system


[Figure]

Computer Server
 24 processors and 8 memory Banks  
[Figure]
APPLICATIONS
  
  
Applications & markets

    · High-end parallel computers
 
 
    · Desktop multiprocessors
    · High-end workstations
 
 
    · High-end printer servers
    · File and database servers
 
 
    · Add-in boards for standard platforms
    · Industrial control
    · OEM multiprocessors
 
 
    · Chip sets & packaging
II.  The DynaBus


· HIGH SPEED BUS ARCHITECTURE
    Pipeline
    Bus configurations
    
    
· ELECTRICAL CONSIDERATIONS
    Termination
    Clock skew
    Packaging
     
     
· LOGIC OF THE BUS
    64 bits
    DBus
    Performance
    
· PROTOCOL
HIGH SPEED BUS ARCHITECTURE
Principle  
Bus Cycle = time to transmit information from one device to another
[Figure]
    Tcycle = TckQ + Tprop + Tsetup + Tskew
     8 ns  = 1 ns + 4 ns  + 1 ns   + 2 ns
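The cycle budget above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are illustrative, and the reading of each term (clock-to-output delay, propagation, setup, skew) is an assumption based on the symbol names on the slide.

```python
# Bus cycle budget, using the figures quoted on the slide (all in nanoseconds).
# The meaning given to each term is an assumption drawn from its name.
T_CKQ = 1.0    # clock-to-output delay of the sending register
T_PROP = 4.0   # propagation time along the bus segment
T_SETUP = 1.0  # setup time of the receiving register
T_SKEW = 2.0   # clock skew between sender and receiver


def cycle_time_ns(t_ckq, t_prop, t_setup, t_skew):
    """Minimum bus cycle: the sum of the four delays."""
    return t_ckq + t_prop + t_setup + t_skew


t_cycle = cycle_time_ns(T_CKQ, T_PROP, T_SETUP, T_SKEW)
max_clock_mhz = 1000.0 / t_cycle  # 1000 / ns gives MHz

print(f"Tcycle = {t_cycle} ns, max clock = {max_clock_mhz} MHz")
```

With the slide's numbers the budget is 8 ns, which corresponds to a 125 MHz clock.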
Pipeline 
[Figure]
Only one bidirectional segment is allowed in a pipelined bus
=> the backpanel in a multi-board system
Bus Configurations
Level 1: Mono-Board Computer
[Figure]
Level 2: Multi-Board Computer
[Figure]
Level 3: Multi-Board & Multi-Module Computer
[Figure]
ELECTRICAL CONSIDERATIONS


Bus termination is required to balance the lines

For the CMOS version of the BIC:
 
· Open drain, so that power is dissipated in the resistors
    
· 50 Ohms at each end

· Power dissipation = U^2 / R

   2 Volt swing => 80 mW / resistor
   128 resistors => 10 Watts / bus
   10 Watts / backpanel
   15 Watts per board
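The termination figures follow directly from P = U^2 / R. The sketch below reproduces the per-resistor and per-bus numbers. The constant and function names are illustrative, and it makes no assumption about how the 128 resistors are distributed.

```python
# Termination power, from the values quoted on the slide.
SWING_VOLTS = 2.0      # voltage swing across a termination resistor
RESISTANCE_OHMS = 50.0
N_RESISTORS = 128      # count quoted on the slide


def resistor_power_watts(volts, ohms):
    """Power dissipated in one resistor: P = U^2 / R."""
    return volts * volts / ohms


per_resistor_w = resistor_power_watts(SWING_VOLTS, RESISTANCE_OHMS)
per_bus_w = per_resistor_w * N_RESISTORS

print(f"{per_resistor_w * 1000:.0f} mW per resistor, {per_bus_w:.2f} W per bus")
```

The exact total is 10.24 W, which the slide rounds to 10 Watts per bus.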
Clock Skew
Clock distribution is critical, low Skew is  crucial

Tcycle > (TckQ)max+Tprop+(Tsetup)max-Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min

CMOS chips need a huge, uncontrolled clock amplifier to drive their high-capacitance internal clock

Hierarchical clock distribution, with the BIC generating the clock of each chip

[Figure]
Packaging & Transmission Lines
To obtain short cycles, lines must be balanced and act as perfect transmission lines

[Figure]
Standard PGA packages are difficult to use because of stubs, but SMD FQPC packages are very good
[Figure]

The next step is to use hybrid modules

[Figure]
LOGIC OF THE BUS
Minimal number of wires for a 64-bit data path. All commands are coded on 64 bits
[Figure]
Performance
Tcycle    Raw bandwidth    Usable bandwidth (raw * 4/7)
25 ns     320 MB/sec       182 MB/sec
10 ns     800 MB/sec       457 MB/sec
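The table entries can be reproduced from the bus width and the cycle time. The sketch below assumes one 64-bit word per cycle and 1 MB = 10^6 bytes, and its function names are illustrative. One plausible reading of the 4/7 factor, not stated on the slide, is that a Read Block takes 7 cycles (a 2-cycle request and a 5-cycle reply) to carry 4 data words.

```python
BUS_WIDTH_BYTES = 8       # 64-bit data path
USABLE_FRACTION = 4 / 7   # factor quoted in the table header


def raw_bandwidth_mb_s(t_cycle_ns):
    """Peak rate in MB/s, assuming one 64-bit word per bus cycle."""
    return BUS_WIDTH_BYTES * 1000 / t_cycle_ns


def usable_bandwidth_mb_s(t_cycle_ns):
    """Raw rate scaled by the 4/7 factor from the slide."""
    return raw_bandwidth_mb_s(t_cycle_ns) * USABLE_FRACTION


for t_cycle in (25, 10):
    raw = raw_bandwidth_mb_s(t_cycle)
    usable = usable_bandwidth_mb_s(t_cycle)
    # The slide truncates the usable figure to a whole number of MB/s.
    print(f"{t_cycle} ns: raw {raw:.0f} MB/s, usable {int(usable)} MB/s")
```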
DBus
Seven wires. Used for initialization &  debugging
BUS INTERFACE
           
           
           Common logical connection

[Figure: DynaBus Logical Interface]
Bus Transactions

Transaction     Request packet      Reply packet
Read Block      RA,VRA              RA,D0,D1,D2,D3
Write Block     RA,D0,D1,D2,D3      RA,X
Flush Block     RA,D0,D1,D2,D3      RA,X
Write Single    RA,D                RA,D
CondWS          RA,D                RA,D,D,D,D
IORead          IOA,X               IOA,D
IOWrite         IOA,D               IOA,X
BIOW            IOA,D               IOA,X
Map             VA,X                RA,X
DeMap           RA,X                RA,X

PROTOCOL

Protocol designed for shared-memory multiprocessors


· Hardware data consistency

· Split-cycle for very high speed

· Supports multi-bank memory

· Bridges to industry-standard buses

· Supports multi-level caches

· Mathematical model and proof of coherency
Shared Memory Model


[Figure]


· Each operation is atomic
· Operations are serialized
· Real-time ordering is respected

Single-Level Operation


[Figure]

· Invariants
I1. > 1 cached copies => Shared is set in each
I2. At most one cache has Owner set
I3. Copy last written has Owner set
I4. Cached copies have identical values
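The four single-level invariants can be expressed as a small checker over the cached copies of one block. This is an illustrative sketch, not the proof model mentioned on the protocol slide: the `Copy` record, the `last_written` bookkeeping flag used for I3, and the function name are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """State of one cached copy of a block (hypothetical record)."""
    value: int
    shared: bool = False
    owner: bool = False
    last_written: bool = False  # bookkeeping flag, used only to check I3


def violated_invariants(copies):
    """Return the names of the invariants broken by these copies of one block."""
    broken = []
    # I1: more than one cached copy => Shared is set in each.
    if len(copies) > 1 and not all(c.shared for c in copies):
        broken.append("I1")
    # I2: at most one cache has Owner set.
    if sum(c.owner for c in copies) > 1:
        broken.append("I2")
    # I3: the copy last written has Owner set.
    if any(c.last_written and not c.owner for c in copies):
        broken.append("I3")
    # I4: cached copies have identical values.
    if len({c.value for c in copies}) > 1:
        broken.append("I4")
    return broken


good = [Copy(7, shared=True, owner=True, last_written=True), Copy(7, shared=True)]
bad = [Copy(7), Copy(8, owner=True)]
print(violated_invariants(good), violated_invariants(bad))
```

A checker of this kind could be run after every simulated bus transaction to catch a protocol step that leaves the caches inconsistent.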
Two-Level Operation

[Figure]
Invariants for two level


I1. Every copy in a cache is also in its parent
I2. The parent of a copy has ExistsBelow set
I3. >1 brother copies => Shared set in each
I4. The son of a Shared copy has Shared set
I5. The parent of an Owner copy has Owner set
I6. At most one brother has Owner set
I7. Copy last written by a processor has Owner set
I8. Shared copies have identical values


III.  VLSI Chip Set


· CURRENT FAMILY OF SEVEN CHIPS

     BIC
     Arbiter
     Small Cache
     Memory Controller
     IOBridge
     Display/Printer
     Map Cache
             
Bus Interface Chip : BIC


Contains everything that is electrically specific to the bus


· Slice of the pipeline register (2 * 24 bits)

· Controls access to the bus

· Contains low-voltage drivers & receivers

· Clock skew regeneration

· Current implementation is for hybrid modules
Arbiter


Controls bus access for up to 64 masters


· Distributed arbiter. Current implementation:
      one arbiter chip controls 8 masters
      up to eight arbiter chips
      
· Works with all pipeline configurations

· 7 priority levels, round robin within each level

· Hold management, for locking requestors

· System Stop generation

· DBus predecoder
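The "7 priority levels, round robin within each level" policy can be sketched for a single arbiter chip with 8 masters. This is a minimal model under stated assumptions: level 0 is taken as the highest priority, the class and method names are hypothetical, and hold management, System Stop and the cooperation between arbiter chips are left out.

```python
class Arbiter:
    """Toy model of one arbiter chip: priority levels with round robin inside a level."""

    LEVELS = 7

    def __init__(self, n_masters=8):
        self.n_masters = n_masters
        # Last master granted at each level; initialised so master 0 goes first.
        self.last_granted = [n_masters - 1] * self.LEVELS

    def grant(self, requests):
        """requests maps master number -> priority level (0 = highest, an assumption).

        Returns the master that wins this arbitration round, or None.
        """
        if not requests:
            return None
        level = min(requests.values())
        candidates = [m for m, p in requests.items() if p == level]
        last = self.last_granted[level]
        # Round robin: pick the first candidate after the last master granted.
        winner = min(candidates, key=lambda m: (m - last - 1) % self.n_masters)
        self.last_granted[level] = winner
        return winner


arbiter = Arbiter()
same_level = {0: 3, 1: 3, 5: 3}
order = [arbiter.grant(same_level) for _ in range(4)]
print(order)
```

Masters at the same level are served in rotation, while any request at a numerically lower level is granted first.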
Small Cache


Interface between a processor and the bus


· Contains the snoopy algorithm for consistency

· Fully associative dual-port memory:
      virtual CAM on the processor side
      & real CAM on the bus side
      
· Efficient first-level cache of the virtual-to-real table,
    built in for free
    
· Implements Conditional Write, for efficient
    multiprocessor locking on a split-cycle bus

· Entry point on the bus for all devices taking part
     in the consistency protocol, for example the IOBridge
     
· Current implementation in 2 micron: 5 KB;
    with 0.8 micron => 32 KB / chip
Cache Block Diagram
[Figure]
Functional Specifications: MemOps


· PRead          32-bit address, 32-bit data
· PWrite         32-bit address, 32-bit data
· PByteWrite     32-bit address, 32-bit data
                 4 write-enable bits
                 any of 16 patterns allowed
· CWS            32-bit address, 64-bit data


Functional Specifications: CWS


· CWS[addr, old, new] RETURNS [sample] = {
    sample _ addr^;
    IF sample = old THEN addr^ _ new
  }
· Implemented in caches
· No bus traffic for private data
· No locks anywhere
· Maximum possible overlap
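The CWS definition above (where `_` denotes assignment) behaves like a compare-and-swap that returns the sampled value. The Python sketch below models it and shows one way software could build an atomic counter on top of it. The `SharedMemory` class and helper names are hypothetical, and the `threading.Lock` only stands in for the atomicity that the caches provide in hardware. It is not a lock visible to the program.

```python
import threading


class SharedMemory:
    """Toy word-addressed memory; the lock only models hardware atomicity."""

    def __init__(self):
        self._words = {}
        self._atomic = threading.Lock()

    def read(self, addr):
        with self._atomic:
            return self._words.get(addr, 0)

    def cws(self, addr, old, new):
        """Conditional write: store `new` only if the word still equals `old`.

        Always returns the value sampled before any store.
        """
        with self._atomic:
            sample = self._words.get(addr, 0)
            if sample == old:
                self._words[addr] = new
            return sample


def atomic_increment(mem, addr):
    """Retry the conditional write until no other writer intervened."""
    while True:
        old = mem.read(addr)
        if mem.cws(addr, old, old + 1) == old:
            return old + 1


def run_demo(n_threads=4, n_increments=1000):
    """Several threads increment one word; no increment may be lost."""
    mem = SharedMemory()

    def worker():
        for _ in range(n_increments):
            atomic_increment(mem, 0)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.read(0)


print(run_demo())
```

A failed CWS simply reports the current value, so the caller retries instead of blocking, which is consistent with the "no locks anywhere" point on the slide.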

Functional Specifications: IO

· Single common IO address space
· Much like memory (i.e. hit/miss)
· Local locations (accessible via P or B)
CWSOld        (32)
CWSNew        (32)
AidReg            (32)
FaultCode        (32)
InterruptStatus        (32)
InterruptMask        (32)
Operating Mode    (32)

Functional Specifications: Mapping


· One address space at a time
· Specified by AidReg
· No data flush on space change
· Writing to AidReg clears all VPValid
· DeMap[realPage]
    Cache initiates a DeMap transaction
    On reply, all caches match and clear VPValid
· Aliasing avoided automatically
    Match on RA before writing
IOBridge


Interface between the DynaBus and an industry-standard slow bus


· Uses a Small Cache for access to the DynaBus.
   In a future 1.2 micron implementation:
   IOB & Cache merged into one chip

· Provides transparent access & mapping from the slow bus to the DynaBus

· Provides transparent access from the DynaBus to the slow bus

· Choice of virtual or real addresses for IO

· Current implementation is for the PC/AT, but easily adaptable to other standards: Micro-Channel, NuBus... or a special internal bus
Memory Controller


Controls a bank of memory, from 8 MByte up to 1 GByte


· Implements the consistency algorithm

· Uses memory in nibble mode for fast access

· Implements ECC on 64 bits + 8:
  corrects one error, detects two

· Multi-bank, initialized through the DBus
DISPLAY ARCHITECTURE

· Complex problem for a multiprocessor

· Many different architectures, depending
   on the performance required

· The solution studied focuses on one
  particular application: low performance / low cost
· Using a high-speed bus with consistency can provide some innovative solutions for the high end...
Conventional Display Architecture
[Figure]
Local Bus Configuration
[Figure]
Interim Solution
[Figure]
Display/Printer Chip

Refreshes the display/printer from the bus, in the background


· Always low priority, except when its FIFO is empty

· Fully configurable: from B&W to 8 bits/pixel,
 and up to 200 MByte/sec

· Up to three controllers for 24-bit color

· Architecture adapted to multiprocessors, avoiding
   cache flushing

· Well suited to a second-level cache, in which the
   frame buffer is a multi-megabyte cache
Map Cache


Provides a second-level cache of the
virtual-to-real address table


· Because the first-level cache is inside the Small
   Cache, this map is not used at every miss

· Replacement algorithm done entirely in
  software, by the processor that has the map miss.
  Control of victimization for IO
 
· Multiple MapCache chips possible

· First implementation: 256 entries


IV.  Packaging


· HYBRID PACKAGE
        
· ASSEMBLY & CONNECTOR
HYBRID PACKAGE
Principles 

· Design of a "chip carrier" containing many chips
· Intermediate level between chips and boards
· Perfect integration into our architecture
· Silicon substrate for experimentation
· Low to very low cost using a large-area process

[Figure]
Advantages for many applications

· increased speed
· reduced space
· solves power dissipation problems
· reduced cost
HYBRID ASSEMBLY
 
 
[Figure]
V. Address Mapping


· Principles
· Mapping Process
· Address Spaces and Sharing areas
· Map Cache functions
· Paging In and Paging Out
Principles

· DynaBus deals exclusively with real addresses,
    which is required for snooping

· Small Cache does virtual -> real mapping

· Map Cache(s) provide a system-wide TLB

· Hierarchical search for mapping:
    first try the current cache entries,
    then ask the Map Cache,
    then trap for software handling
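The hierarchical search can be sketched as a three-stage lookup. This is an illustration only: the dictionaries standing in for the cache entries and the Map Cache, the `software_handler` callback, and the choice to refill both levels after a software trap are assumptions, not details given on the slide.

```python
def translate(vp, cache_entries, map_cache, software_handler):
    """Map a virtual page to a real page, returning (rp, where_it_was_found)."""
    # Step 1: try the entries already held in the Small Cache.
    if vp in cache_entries:
        return cache_entries[vp], "cache"
    # Step 2: ask the Map Cache, and remember the answer locally.
    if vp in map_cache:
        rp = map_cache[vp]
        cache_entries[vp] = rp
        return rp, "map cache"
    # Step 3: trap to software, which walks the map table in main memory.
    rp = software_handler(vp)
    map_cache[vp] = rp        # assumed: the handler's result refills the Map Cache
    cache_entries[vp] = rp
    return rp, "software"


# Hypothetical map table kept by software in main memory.
map_table = {5: 50, 6: 60}
cache_entries = {}
map_cache = {5: 50}

results = [
    translate(5, cache_entries, map_cache, map_table.__getitem__),
    translate(5, cache_entries, map_cache, map_table.__getitem__),
    translate(6, cache_entries, map_cache, map_table.__getitem__),
    translate(6, cache_entries, map_cache, map_table.__getitem__),
]
print(results)
```

Each level is consulted only when the faster one misses, so repeated accesses to the same page stay inside the Small Cache.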
Mapping Process
Inside the Cache
[Figure]
Mapping Process (cont)
Mapping in three steps
[Figure]
Page structure

Constants 

· Pages are 4 KByte, the unit of mapping & protection
· Virtual Page (VP) 22 bits, Real Page (RP) 22 bits now
· Address Space Number (ASN) on 32 bits
· Flags: KWe, UWe, URe, Dirty, (Used ?)

Map Table 

· Resides in main memory
· Structured by software 
    hash table (aid,vp)->(rp,flags)
    or hierarchical tables
Translation mechanism

Three special cases in the translation

if ASN = -1 Identity Map
     Useful for StartUp & IO

if  VP inside Map Bypass Area  RP is computed
             if  vp
           rp_(BypassBase

if  VP inside Shared Area (Kernel..) we use ASN=0
             if  vp
            [rp, flags] _ LookUp [vp, ASN=0]
            
Else (normal case)      
            [rp, flags] _ LookUp [vp, ASN]
Map Cache Functions

·  Map [aid, vp] -> [rp, flags]
             MapFault error if entry not present
    
·  ReadEntry [vp] -> [rp, flags]; implicit ASN
              MapFault error if entry not present

·  WriteEntry [vp, rp, flags, valid]; implicit ASN
             IOAccessFault error if not in kernel mode

·  Read & Write internal registers
             ASN 
               SharedPattern, SharedMask
               BypassPattern, BypassMask, BypassBase
               SubSetMask, SubSetPattern
Paging In & Paging Out
Paging In

      Only requires inserting the new VP->RP entry
      in the Memory Table
      
Paging Out

       Remove all virtual page entries for that physical
       page from the Memory Table
       
       Remove all virtual page entries from the 
       Map Cache
       
       Generate a DynaBus DeMap request
       
       Issue the disk IO if the page was dirty

VI.  Input-Output & Interrupts


· Address encoding
· DynaBus IO Commands
· DynaBus IO from a Processor
· DynaBus IO from a DMA device
· Bridges to commercial buses
· Interrupts
Address encoding
· 32-bit addresses (now)
    addressable item is a 32-bit word
· Split in 3 fields: DevType, DevNum, DevOffset
    DevType -> type of IO device (SmallCache, IOBridge, DisplayController, MapCache ...)
    DevNum -> number of device within type (unique)
    DevOffset -> word offset within device
· 3 different encodings
    Large:  14 types, 16 devices per type, 24-bit offset
        ddddnnnn aaaaaaaa aaaaaaaa aaaaaaaa
    Medium:  31 types, 256 devices per type, 16-bit offset
        000ddddd nnnnnnnn aaaaaaaa aaaaaaaa
    Small:  31 types, 1024 devices per type, 10-bit offset
        0000000d ddddnnnn nnnnnnaa aaaaaaaa
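The three layouts can be packed and unpacked with shifts and masks. In the sketch below the field widths are read off the bit patterns above, while the `FORMATS` table and the `encode`/`decode` helpers are hypothetical names. The caller states which encoding applies, because the slide does not spell out how the hardware tells the three formats apart.

```python
# Field widths per encoding: (leading zero bits, DevType, DevNum, DevOffset).
FORMATS = {
    "large":  (0, 4, 4, 24),
    "medium": (3, 5, 8, 16),
    "small":  (7, 5, 10, 10),
}


def encode(fmt, dev_type, dev_num, offset):
    """Pack the three fields into a 32-bit IO address."""
    _, t_bits, n_bits, a_bits = FORMATS[fmt]
    for value, bits in ((dev_type, t_bits), (dev_num, n_bits), (offset, a_bits)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in this encoding")
    return (dev_type << (n_bits + a_bits)) | (dev_num << a_bits) | offset


def decode(fmt, addr):
    """Unpack a 32-bit IO address into (DevType, DevNum, DevOffset)."""
    zeros, t_bits, n_bits, a_bits = FORMATS[fmt]
    if not 0 <= addr < (1 << (32 - zeros)):
        raise ValueError("leading bits must be zero for this encoding")
    offset = addr & ((1 << a_bits) - 1)
    dev_num = (addr >> a_bits) & ((1 << n_bits) - 1)
    dev_type = (addr >> (a_bits + n_bits)) & ((1 << t_bits) - 1)
    return dev_type, dev_num, offset


print(decode("large", 0x2A000010))
print(decode("medium", 0x05030004))
print(decode("small", encode("small", 1, 2, 3)))
```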
DynaBus IO Commands

Processor is always the transaction requestor
· IORead
    32-bit transfer from IO device to processor
· IOWrite
    32-bit transfer from processor to IO device
· BIOWrite
    32-bit transfer from processor to IO device, with broadcast on device type
    unsafe, since the reply is generated by the bus terminator, not by the target device

DynaBus IO from a DMA device

· WriteBlockRequest
    overrides the consistency protocol: the value sent is taken as definitive, independently of current ownership
    simpler for the IO device than full cache emulation
    more efficient, as read-before-write is not needed
· ReadBlockRequest
    IO devices use the regular transaction for DMA reads
    simple for the IO device if late consistency is not required
· Address translation
    must be provided by the IO device, as only physical addresses are carried on the DynaBus
    IO devices may use MapCache services or have loadable translation tables
Bridges to commercial buses
· Provide two-way transaction mapping
    DynaBus -> commercial bus
    commercial bus -> DynaBus
    commercial bus interrupts -> DynaBus
· Address mapping
    address space size mismatches
    virtual address IO
· Interrupt mapping
    depends highly on the commercial bus
· Flexibility at the cost of performance
· No support for consistency of RAM on the commercial bus
· Support for multiple commercial buses
· Support for multiple bridges to a single bus

[Figure]
Performance
Bridges offer easy extensibility at a price:
· Throughput is limited
    - by allocation time on the commercial bus
    - by usage of the SmallCache on the DynaBus
· In the current implementation, effective throughput may reach
    - 20 MBytes/sec commercial bus -> DynaBus
    - 12 MBytes/sec DynaBus -> commercial bus
  Both limitations are due to the single outstanding request from the cache
· Latency may be high
    - up to 2 
    - may be a problem for certain devices
DynaBus Interrupts
· No specific interrupt transaction
    Interrupts are transmitted as IOWrites from IO devices to caches at a well-known address
· 32 edge-triggered, non-prioritized interrupts
    Priority management is left to the processor
· Interrupts may be broadcast or directed
    Permits dedicated IOP(s) or dynamic IOP allocation
· Interrupts may be generated by a processor
    Permits interprocessor task scheduling
· 2 registers per SmallCache
    Interrupt status
        Set bit(s), Clear bit(s), Read
    Interrupt mask
        Write, Read
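The two per-cache registers can be modeled as 32-bit words with the operations listed above. The sketch is hypothetical in several respects: the class and method names are invented, a mask bit of 1 is assumed to mean "enabled", and the lowest-numbered-bit policy in `next_interrupt` is just one example of the priority management the slide leaves to the processor.

```python
WORD_MASK = 0xFFFFFFFF  # 32 interrupt lines, one bit each


class InterruptRegisters:
    """Toy model of the interrupt status and mask registers of one SmallCache."""

    def __init__(self):
        self.status = 0
        self.mask = 0

    def set_bits(self, bits):
        """Set bit(s): what an IOWrite to the well-known address would do."""
        self.status |= bits & WORD_MASK

    def clear_bits(self, bits):
        """Clear bit(s): acknowledge interrupts that have been handled."""
        self.status &= ~bits & WORD_MASK

    def write_mask(self, mask):
        self.mask = mask & WORD_MASK

    def pending(self):
        """Interrupts that are both raised and enabled (mask bit 1 assumed = enabled)."""
        return self.status & self.mask


def next_interrupt(regs):
    """Example software policy: serve the lowest-numbered pending interrupt."""
    pending = regs.pending()
    if pending == 0:
        return None
    return (pending & -pending).bit_length() - 1


regs = InterruptRegisters()
regs.write_mask(0b1010)
regs.set_bits(0b0110)
first = next_interrupt(regs)   # only bit 1 is both raised and enabled
regs.clear_bits(1 << first)
second = next_interrupt(regs)  # bit 2 is raised but masked off
print(first, second)
```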
VII.  Debug Bus & Bootstrap Processor


· Principles
· Summary specification
· DBus and system bootstrap
· DBus and debugging
Principles
· Independent of the DynaBus
· Controlled by the bootstrap/debug processor
· All other system components are DBus slaves
· Used to
    tune the clock
    read chip identification
    test chips
    initialize chips (software configuration)
    start/stop the system
    debug hardware failures
· Permits auto-configured systems
· Permits self-testing systems
Debug Bus specification
· Low-speed asynchronous serial bus
    6 wires (+1)
    approx. 1 MBit/sec
· Serial 16-bit addressing
· Geographical addressing and hierarchical decoding
    board -> hybrid -> chip -> register
· Functions
    Load/Unload chip register
    Request chip to execute a function
    Reset control
· Very simple implementation
    1 chip (PAL) on the PC/AT bus
DBus and system bootstrap
· Bootstrap is controlled by the bootstrap processor
    currently the PC/AT that also handles IO
    the bootstrap processor may be very simple
· Typical bootstrap sequence
    Assert Reset (DBus)
    Analyze machine configuration (DBus)
    Tune clock (DBus)
    Perform chip sanity checking (DBus)
    Initialize DynaBus Device IDs (DBus)
    Initialize chip-specific parameters (DBus)
    Synchronous reset of arbiters
    Rescind Reset (DBus)
    Start arbiters (SStop)
    Load bootstrap code
    Start a processor (through cache)
DBus and debugging
· Provides a freeze function
    Freeze is asynchronous and non-restartable
    DynaBus provides a packet-synchronous freeze by arbiter stop
· Permits reading chip status for a post-mortem dump
    As provided by the chip designer
    Full LSSD possible
· May initiate chip self-test
VIII.  Computer Configurations

· Multi-Board parallel Computer
   using conventional packaging
        
· Mono-Board Computer
   using conventional packaging

· Add-on Multiprocessor
   using Hybrid packaging
        
· High performance parallel Computer
   using Hybrid packaging
MultiBoard Parallel Computer
with conventional Packaging
Current implementation allows 24 processors and 8 memory banks
[Figure]
Monoboard system


[Figure]
Add-on Multiprocessor
[Figure]
MultiBoard Parallel Computer
with Hybrid Modules
[Figure]
IX.  Conclusions 


· Key Features of the Architecture

· Results and Milestones

· Transfer of Research

· Future Architecture
Key Features

 ·  Unique VLSI bus
        =>  SPEED & PERFORMANCE
An order of magnitude in speed, multiprocessor oriented & a good candidate for standardization as a VLSI bus

 ·  Unique chip set implementation
        => COST & SIMPLICITY
Only seven LSI chips for the whole family; chips replace the functions of complete boards
 ·  Unique use of packaging
        => COST, SIZE & PERFORMANCE
Advanced packaging used at the architecture level
 for mid & low end computers
 
 ·  Unique architecture, open to industry standards
        => EASY INTEGRATION
 Any standard microprocessor, bridges to existing standard buses, standard operating systems
 
 
  An order of magnitude in price/performance
  1-2 years ahead of the competition
Results & Milestones


(Sparc Softcard:
      · Wire-wrapped prototype in June)
      
Chip set:
      · 5 chips returned from fab in February 88
       (BIC, Arbiter, IOBridge, MemController, Simplex) 
       
      · 3 chips will be sent in 2Q88
       (Cache, Map Cache, Display/Printer)

       
Wire-Wrapped June87 Prototype:
       
       · 4Q 88      with 4 Sparcs
                  (depends on the Cache)
                  
                  
High Speed Bus Prototype & Packaging:
       · 2Q 88
       
Transfer of Research


Try to standardize the bus
   a VLSI bus does not exist yet
   
Partnerships
  
   with equipment makers like:
        SUN
    
   with semiconductor companies:
       Motorola, National, AMD, Cypress, ...
         good chances, because the architecture is open and compatible with
    industry standards: any microprocessor, any add-in boards &
    possibly multiple operating systems...
       
Future Directions
        
 This architecture is only at its beginning. Many evolutions are expected: progressive use of the advanced packaging, a second
level of cache & other topologies...

Future Directions on parallelism

Explore future parallel architectures while keeping this general model of
                "Shared Memory with Caches"
Big opportunities with new operating systems like Mach or SunOS phase3, which include this model of communication in the kernel,
and languages like Ada or Cedar...


A high-speed bus is crucial for supporting high-performance VLSI operators
 
High-speed vector processors, high-speed network controllers, graphics controllers, high-speed disk controllers,
compression/decompression of images...