Dragon Architecture
May 1988
DRAGON System Architecture

I Introduction
II The DynaBus
III VLSI Chip Set
IV Packaging
V Address Mapping
VI Input-Output & Interruption
VII Debug Bus & BootStrap processor
VIII Computer Configurations
IX Conclusions
I. INTRODUCTION





· MOTIVATIONS

· DRAGON TECHNOLOGY

· APPLICATIONS
Motivations
Foundations for building architectures for a wide range of document processing machines

· controllers : Network, Scanners...

· servers : Data base servers, Printers, Gateway

· workstations : mid, high & very high end

=> high data bandwidth
Parallel processing architecture research

· Project started in 1981

· Follows a first generation of shared-memory
multiprocessors on a conventional bus,
currently implemented in a Xerox product.

· new generation using more advanced concepts
& technology : VLSI, packaging

· Compatible with Operating systems like Mach
or SunOS phase3 & languages like Cedar
Dragon Technology
· Communications studies : VLSI BUS
· VLSI Chip Set
· Packaging
Computer Architecture
Configurations
Monoboard system
Computer Server
24 processors and 8 memory Banks
APPLICATIONS

Applications & markets

 · High-end parallel computers
 · Desk-top multiprocessors
 · High-end workstations
 · High-end printer servers
 · File and database servers
 · Add-in boards in standard platforms
 · Industrial control
 · OEM multiprocessors
 · Chip set & packaging
II. The DynaBus




· HIGH SPEED BUS ARCHITECTURE
 Pipeline
 Bus configurations


· ELECTRICAL CONSIDERATIONS
 Termination
 Clock skew
 Packaging


· LOGIC OF THE BUS
 64 bits
 DBus
 Performance

· PROTOCOL
HIGH SPEED BUS ARCHITECTURE
Principle
Bus Cycle = Time to transmit information from one device to another
Tcycle = TckQ + Tprop + Tsetup + Tskew
8 ns   = 1 ns + 4 ns  + 1 ns   + 2 ns
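The cycle-time budget above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and reading TckQ as the clock-to-output delay of the sending register is an assumption based on common usage, not something the slide spells out.

```python
# Timing terms from the slide, in nanoseconds.
# The interpretation of each term in the comments is an assumption.
T_CKQ = 1.0    # clock-to-output delay of the sending register
T_PROP = 4.0   # propagation time along the bus segment
T_SETUP = 1.0  # setup time of the receiving register
T_SKEW = 2.0   # clock skew between sender and receiver


def cycle_time_ns(t_ckq, t_prop, t_setup, t_skew):
    """Sum of the four terms, as in Tcycle = TckQ + Tprop + Tsetup + Tskew."""
    return t_ckq + t_prop + t_setup + t_skew


t_cycle = cycle_time_ns(T_CKQ, T_PROP, T_SETUP, T_SKEW)
clock_mhz = 1000.0 / t_cycle  # 1000 / ns gives MHz

print(f"Tcycle = {t_cycle} ns, clock = {clock_mhz} MHz")
```

With these figures propagation accounts for half of the budget and skew for a quarter.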
Pipeline
Only one Bidirectional segment in a pipelined Bus
=> Backpanel in a Multi-Board system
Bus Configurations
Level 1 : Mono-Board Computer
Level 2 : Multi-Board Computer
Level 3 : Multi-Board & Multi-Module Computer
ELECTRICAL CONSIDERATIONS

Bus termination is required to balance the lines
For CMOS version of BIC

· Open drain, for dissipation in the resistors

· 50 Ohms at each end

· Power dissipation = U^2 / R

2 Volts swing => 80 mW / resistance
128 resistances => 10 Watts / Bus
10 Watts / BackPanel
15 Watts per Board
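The termination figures follow directly from P = U^2 / R. The sketch below reproduces the arithmetic. It treats the quoted power as a worst-case continuous figure, which is an assumption; the function name is illustrative.

```python
def resistor_power_w(swing_v, r_ohm):
    """Power dissipated in one termination resistor, P = U^2 / R."""
    return swing_v ** 2 / r_ohm


SWING_V = 2.0       # voltage swing from the slide
R_TERM_OHM = 50.0   # termination resistance at each end
N_RESISTORS = 128   # resistor count per bus, from the slide

per_resistor_w = resistor_power_w(SWING_V, R_TERM_OHM)
per_bus_w = N_RESISTORS * per_resistor_w

print(f"{per_resistor_w * 1000:.0f} mW per resistor, {per_bus_w:.2f} W per bus")
```

The exact product is 10.24 W, which the slide rounds to 10 Watts.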
Clock Skew
Clock distribution is critical, low Skew is crucial

Tcycle > (TckQ)max+Tprop+(Tsetup)max-Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min
CMOS chips need a large, uncontrolled clock amplifier to drive the high-capacitance internal clock
Hierarchical clock distribution, with the BIC generating the clock of each chip
Packaging & Transmission Lines
To obtain short cycles, lines must be balanced and act as perfect transmission lines
Using standard PGAs is difficult because of stubs, but SMD FQPC packages are very good
The next step is to use hybrid modules
LOGIC OF THE BUS
Minimal number of wires for a 64-bit data path. All commands are coded on 64 bits
Performance
Tcycle   Raw bandwidth   Usable bandwidth (raw * 4/7)
25 ns    320 MB/sec      182 MB/sec
10 ns    800 MB/sec      457 MB/sec
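The bandwidth figures can be reproduced from the 64-bit data path and the cycle time. The sketch below assumes 1 MB = 10^6 bytes. The reading of the 4/7 factor as four data cycles out of a seven-cycle block read (a two-cycle request plus a five-cycle reply, as in the Read Block line of the Bus Transactions slide) is an interpretation, not something the slide states.

```python
BUS_WIDTH_BYTES = 8  # 64-bit data path


def raw_bandwidth_mb_s(t_cycle_ns):
    """One 8-byte transfer per cycle; bytes per ns times 1000 gives MB/s."""
    return BUS_WIDTH_BYTES * 1000.0 / t_cycle_ns


def usable_bandwidth_mb_s(t_cycle_ns, data_cycles=4, total_cycles=7):
    """Fraction of cycles that carry data (4 of 7 is assumed to be a block read)."""
    return raw_bandwidth_mb_s(t_cycle_ns) * data_cycles / total_cycles


for t_cycle in (25, 10):
    raw = raw_bandwidth_mb_s(t_cycle)
    usable = usable_bandwidth_mb_s(t_cycle)
    print(f"{t_cycle} ns: raw {raw:.0f} MB/s, usable {usable:.1f} MB/s")
```

The slide's usable figures (182 and 457) are the truncated values of 182.9 and 457.1 MB/s.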
DBus
Seven wires. Used for initialization & debugging
BUS INTERFACE


Common logical connection
DynaBus Logical Interface
Bus Transactions

Transaction    Request           Reply
Read Block     RA,VRA            RA,D0,D1,D2,D3
Write Block    RA,D0,D1,D2,D3    RA,X
Flush Block    RA,D0,D1,D2,D3    RA,X
Write Single   RA,D              RA,D
CondWS         RA,D              RA,D,D,D,D
IORead         IOA,X             IOA,D
IOWrite        IOA,D             IOA,X
BIOW           IOA,D             IOA,X
Map            VA,X              RA,X
DeMap          RA,X              RA,X
PROTOCOL

Protocol designed for shared-memory multiprocessors


· Hardware data consistency

· Split-cycle for very High speed

· Supports multi-bank memory

· Bridges to industry-standard buses

· Supports Multi-level Caches

· Mathematical model and proof of coherency
Shared Memory Model


· Each operation is atomic
· Operations are serialized
· Real Time ordering respected
Single-Level Operation
· Invariants
I1. > 1 cached copies => Shared is set in each
I2. At most one cache has Owner set
I3. Copy last written has Owner set
I4. Cached copies have identical values
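The four single-level invariants can be written as executable checks over the cached copies of one address. This is a minimal sketch for illustration only: the `CacheLine` record, the `check_invariants` function and the way the last writer is identified are all hypothetical, not part of the Dragon design.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    """One cached copy of a given address (hypothetical record)."""
    value: int
    shared: bool = False
    owner: bool = False


def check_invariants(copies, last_writer=None):
    """Return the names of the violated invariants (empty list if none).

    copies: dict mapping a cache name to its CacheLine for one address.
    last_writer: name of the cache that wrote the address last, if known.
    """
    violated = []
    lines = list(copies.values())
    # I1: more than one cached copy => Shared is set in each.
    if len(lines) > 1 and not all(line.shared for line in lines):
        violated.append("I1")
    # I2: at most one cache has Owner set.
    if sum(line.owner for line in lines) > 1:
        violated.append("I2")
    # I3: the copy last written has Owner set.
    if last_writer in copies and not copies[last_writer].owner:
        violated.append("I3")
    # I4: cached copies have identical values.
    if len({line.value for line in lines}) > 1:
        violated.append("I4")
    return violated


consistent = {
    "cache0": CacheLine(value=7, shared=True, owner=True),
    "cache1": CacheLine(value=7, shared=True),
}
broken = {
    "cache0": CacheLine(value=7, owner=True),
    "cache1": CacheLine(value=8, owner=True),
}
print(check_invariants(consistent, last_writer="cache0"))
print(check_invariants(broken))
```

A checker of this kind is the sort of thing one would run after every simulated bus transaction when testing a protocol model.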
Two-Level Operation
Invariants for two level


I1. Every copy in a cache is also in its parent
I2. The parent of a copy has ExistsBelow set
I3. >1 brother copies => Shared set in each
I4. The son of a Shared copy has Shared set
I5. The parent of an Owner copy has Owner set
I6. At most one brother has Owner set
I7. Copy last written by a processor has Owner set
I8. Shared copies have identical values

III. VLSI Chip Set




· CURRENT FAMILY OF SEVEN CHIPS

  BIC
  Arbiter
  Small Cache
  Memory Controller
  IOBridge
  Display/Printer
  Map Cache
   
Bus Interface Chip : BIC

Contains all the electrical specifics of the bus


· Slice of Pipelined register ( 2 * 24 bits)

· Controls access to the Bus

· Contains low voltage driver & receiver

· Clock Skew regeneration

· Current implementation for Hybrid Modules
Arbiter

Controls Bus access for up to 64 masters


· Distributed arbiter. Current implementation :
one arbiter chip controls 8 masters
up to eight arbiters

· Works with all pipeline configurations

· 7 priority levels, Round robin inside one level

· Hold management, for lock of requestors

· System Stop generation

· DBus predecoder
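The arbitration policy (seven priority levels, round robin inside a level) can be sketched as follows. The sketch models a single centralized arbiter for eight masters, not the distributed implementation. The class, its method names and the convention that level 0 is the highest priority are assumptions made for illustration.

```python
NUM_LEVELS = 7  # priority levels, from the slide


class Arbiter:
    """Hypothetical single-chip model: strict priority, round robin per level."""

    def __init__(self, num_masters=8):
        self.num_masters = num_masters
        # Per level, the master granted most recently at that level.
        self.last_granted = [num_masters - 1] * NUM_LEVELS

    def grant(self, requests):
        """requests: dict master -> level (0 assumed highest). Returns a master or None."""
        for level in range(NUM_LEVELS):
            contenders = {m for m, lvl in requests.items() if lvl == level}
            if not contenders:
                continue
            # Scan masters starting just after the last one granted at this level.
            start = self.last_granted[level] + 1
            for offset in range(self.num_masters):
                master = (start + offset) % self.num_masters
                if master in contenders:
                    self.last_granted[level] = master
                    return master
        return None


arbiter = Arbiter()
same_level = {0: 3, 2: 3, 5: 3}  # three masters requesting at level 3
print([arbiter.grant(same_level) for _ in range(4)])
```

Within one level the grant rotates among the requesting masters, while any request at a higher-priority level is served first.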
Small Cache

Interface between a Processor and the Bus


· Contains the Snoopy algorithm for consistency

· Fully associative dual-port memory
Virtual CAM on the processor side
& Real CAM on the bus side


· Efficient first-level cache of the Virtual-to-Real table,
built in for free

· Implements Conditional Write, for efficient
multiprocessor locking on a split-cycle Bus

· Entry point on the Bus for all devices playing
the consistency game. Example: IOBridge.

· Current implementation in 2 micron : 5 KB
with 0.8 micron => 32 KB / Chip
Cache Block Diagram
Functional Specifications: MemOps


· PRead  32 bit address, 32 bit data
· PWrite  32 bit address, 32 bit data
· PByteWrite  32 bit address, 32 bit data
4 write enable bits
any of 16 patterns allowed
· CWS   32 bit address, 64 bit data
Functional Specifications: CWS
· CWS[addr, old, new] RETURNS [sample] = {
sample ← addr^;
IF sample = old THEN addr^ ← new
}
· Implemented in caches
· No bus traffic for private data
· No locks anywhere
· Maximum possible overlap
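The CWS semantics above are those of a compare-and-swap, and a typical use is a retry loop. The sketch below is a software model only: the `Memory` class is hypothetical, and its internal Python lock merely stands in for the atomicity that the caches provide in hardware.

```python
import threading


class Memory:
    """Hypothetical word-addressed memory with a CWS operation."""

    def __init__(self):
        self._words = {}
        self._atomic = threading.Lock()  # stand-in for hardware atomicity

    def read(self, addr):
        with self._atomic:
            return self._words.get(addr, 0)

    def cws(self, addr, old, new):
        """Store new at addr only if it still holds old; always return the sample."""
        with self._atomic:
            sample = self._words.get(addr, 0)
            if sample == old:
                self._words[addr] = new
            return sample


def atomic_increment(mem, addr):
    """Retry until the CWS observes the value that was read."""
    while True:
        old = mem.read(addr)
        if mem.cws(addr, old, old + 1) == old:
            return old + 1


def run_demo(num_threads=4, increments=1000):
    mem = Memory()

    def worker():
        for _ in range(increments):
            atomic_increment(mem, 0x100)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.read(0x100)


print(run_demo())
```

No increment is lost because a CWS that observes a stale value leaves memory unchanged and the caller retries.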
Functional Specifications: IO
· Single common IO address space
· Much like memory (i.e. hit/miss)
· Local locations (accessible via P or B)
CWSOld  (32)
CWSNew  (32)
AidReg   (32)
FaultCode  (32)
InterruptStatus  (32)
InterruptMask  (32)
Operating Mode (32)
Functional Specifications: Mapping
· One address space at a time
· Specified by AidReg
· No data flush on space change
· Writing to AidReg clears all VPValid
· Demap[realPage]
Cache initiates DeMap transaction
On reply all caches match and ClrVPValid
· Aliasing avoided automatically
Match on RA before writing
IOBridge

Interface between the DynaBus and an industry-standard slow bus


· Uses a Small Cache for access to the DynaBus.
In a future 1.2 micron implementation :
merging of both IOB & Cache

· Provides transparent access & mapping from the Slow Bus to the DynaBus


· Provides transparent access from the DynaBus to the Slow Bus

· Choice of Virtual Address or Real Address for IO

· Current implementation for the PC/AT, but easily adaptable to other standards : Micro-Channel, NuBus... or a special internal bus
Memory Controller

Controls a bank of memory from 8 MByte up to 1 GByte


· Implements consistency algorithm

· Uses Memory in nibble-mode for fast access

· Implements ECC on 64 bits + 8
Corrects one error, detects two

· Multi-Bank, init by DBus
DISPLAY ARCHITECTURE
· Complex problem for multiprocessors

· Many different architectures, depending
on the required performance

· The solution studied focuses on one
particular application : low performance/low cost
· The use of a high speed bus with consistency can provide some innovative solutions for the high end...
Conventional Display Architecture
Local Bus Configuration
Interim Solution
Display/Printer Chip
Refresh of the Display/Printer from the Bus in Background


· Always low-priority, except when its Fifo is empty

· Fully configurable : from B&W to 8 bits/pixel
and up to 200 MByte/sec

· Up to three controllers for 24-bit color

· Architecture adapted for multiprocessor, avoiding
cache flushing

· Perfect with Second level Cache, in which the
Frame Buffer is a multi-MegaByte Cache
Map Cache

Provides a second-level cache of the
Virtual-to-Real address table


· Because the first level cache is inside the Small
Cache, this Map is not used at every miss

· Replacement algorithm completely done by
software by the processor which has a map miss.
Control of victimization for IO

· Multiple MapCache Chips possible


· First implementation 256 entries


IV. Packaging




· HYBRID PACKAGE

· ASSEMBLY & CONNECTOR
HYBRID PACKAGE
Principles

· Design of a "Chip Carrier" containing many Chips
· Intermediate level between Chips and Boards
· Perfect integration into our architecture
· Substrate in Silicon for experimentation
· Low to very low cost using large area process
Advantages for lots of applications

· increase of speed
· gain in space
· solves power dissipation problems
· gain in cost
HYBRID ASSEMBLY
V. Address Mapping
· Principles
· Mapping Process
· Address Spaces and Sharing areas
· Map Cache functions
· Paging in and Paging Out
Principles
· DynaBus deals exclusively with real addresses,
which is required for snooping
· Small Cache does Virtual -> Real mapping
· Map-Cache(s) provide a system-wide TLB
· Hierarchical search for Mapping
first try the current cache entries
then ask the Map Cache
then trap for software handling
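The three-step search can be sketched as a lookup that falls through from the cache's own entries to the Map Cache and finally to a software handler. Everything named here is hypothetical. In particular, refilling the faster levels after a miss is an assumption about how such a hierarchy would usually behave, not a statement about the chips.

```python
def map_page(vp, cache_entries, map_cache, software_handler):
    """Translate a virtual page, searching the three levels in order.

    cache_entries and map_cache are plain dicts vp -> rp standing in for the
    Small Cache's built-in map entries and the Map Cache; software_handler
    models the trap taken when both miss.
    """
    # Step 1: current cache entries.
    if vp in cache_entries:
        return cache_entries[vp]
    # Step 2: ask the Map Cache.
    if vp in map_cache:
        rp = map_cache[vp]
        cache_entries[vp] = rp  # assumed refill of the first level
        return rp
    # Step 3: trap for software handling.
    rp = software_handler(vp)
    map_cache[vp] = rp          # assumed refill by the handler
    cache_entries[vp] = rp
    return rp


traps = []


def handler(vp):
    """Hypothetical software handler: records the trap and invents a mapping."""
    traps.append(vp)
    return vp + 0x1000


small_cache_entries, map_cache_entries = {}, {0x20: 0x99}
print(hex(map_page(0x20, small_cache_entries, map_cache_entries, handler)))
print(hex(map_page(0x30, small_cache_entries, map_cache_entries, handler)))
print(traps)
```

Only the second lookup reaches the software handler; the first is satisfied by the Map Cache.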
Mapping Process
Inside the Cache
Mapping Process (cont)
Mapping in three steps
Pages structure

Constants
· Pages are 4 KByte, unit of mapping & protection
· Virtual Page (VP) 22 bits, Real Page (RP) 22 bits now
· Address Space Number ASN on 32 bits
· Flags : KWe, UWe, URe, Dirty, (Used ?)
Map Table
· Resides in main memory
· Structured by software
 hash table (aid,vp)->(rp,flags)
 or hierarchical tables
Translation mechanism

Three exceptions for the translation
if ASN = -1 Identity Map
  Useful for StartUp & IO
if VP inside Map Bypass Area RP is computed
if vp'BypassMask=BypassPattern'BypassMask
rp←(BypassBase'BypassMask)V(vp'~BypassMask)
if VP inside Shared Area (Kernel..) we use ASN=0
if vp'SharedMask=SharedPattern'SharedMask
[rp, flags] ← LookUp [vp, ASN=0]
Else (normal case)
[rp, flags] ← LookUp [vp, ASN]
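The translation rules above can be sketched as one function. The sketch reads the mask operators in the rules as bitwise AND and OR, which is an assumption about the notation. It also assumes ASN = -1 is stored as all ones in a 32-bit register and that the identity and bypass cases return no flags. `MapRegisters`, `translate` and the `lookup` callback are illustrative names.

```python
from dataclasses import dataclass

VP_MASK = (1 << 22) - 1     # 22-bit page numbers
IDENTITY_ASN = 0xFFFFFFFF   # ASN = -1 in a 32-bit register (assumed encoding)
SHARED_ASN = 0


@dataclass
class MapRegisters:
    asn: int
    bypass_pattern: int
    bypass_mask: int
    bypass_base: int
    shared_pattern: int
    shared_mask: int


def translate(vp, regs, lookup):
    """Return (rp, flags); lookup(vp, asn) models the normal table search."""
    # Exception 1: identity map.
    if regs.asn == IDENTITY_ASN:
        return vp, None
    # Exception 2: map bypass area, RP is computed from the base register.
    if (vp & regs.bypass_mask) == (regs.bypass_pattern & regs.bypass_mask):
        rp = (regs.bypass_base & regs.bypass_mask) | (vp & ~regs.bypass_mask & VP_MASK)
        return rp, None
    # Exception 3: shared area, searched under ASN 0.
    if (vp & regs.shared_mask) == (regs.shared_pattern & regs.shared_mask):
        return lookup(vp, SHARED_ASN)
    # Normal case.
    return lookup(vp, regs.asn)


# Hypothetical register contents and map table, for illustration only.
table = {(0, 0x200010): (0x77, "KWe"), (5, 0x000042): (0x99, "URe")}
regs = MapRegisters(asn=5,
                    bypass_pattern=0x3F0000, bypass_mask=0x3F0000,
                    bypass_base=0x120000,
                    shared_pattern=0x200000, shared_mask=0x300000)


def table_lookup(vp, asn):
    return table[(asn, vp)]


print(translate(0x3F0123, regs, table_lookup))  # bypass area
print(translate(0x200010, regs, table_lookup))  # shared area
print(translate(0x000042, regs, table_lookup))  # normal case
```

The order of the tests follows the order in which the slide lists the exceptions; whether the hardware applies the same precedence is not stated.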
Map Cache Functions
· Map [aid, vp] -> [rp, flags]
MapFault error if entry not present

· ReadEntry [vp] -> [rp, flags]; implicit ASN
MapFault error if entry not present
· WriteEntry [vp, rp, flags, valid]; implicit ASN
IOAccessFault error if not in kernel mode
· Read & Write Internal Register
ASN
SharedPattern, SharedMask
BypassPattern, BypassMask, BypassBase
SubSetMask, SubSetPattern
Paging In & Paging Out
Paging In

Only requires inserting the new VP->RP entry
in the Memory Table
Paging Out

Remove all virtual page entries for that physical
page from the Memory Table

Remove all virtual page entries from the
Map Cache

Generate a DynaBus DeMap Request

Issue the disk IO if the page was dirty
VI. Input-Output & Interruptions
· Address encoding
· DynaBus IO Commands
· DynaBus IO from a Processor
· DynaBus IO from a DMA device
· Bridges to commercial Busses
· Interruptions
Address encoding
· 32-bit addresses (now)
addressable item is a 32-bit word
· split in 3 fields: DevType, DevNum, DevOffset
DevType -> type of IO device (SmallCache, IOBridge, DisplayController, MapCache ...)
DevNum -> number of device within type (unique)
DevOffset -> word offset within device
· 3 different encodings
Large: 14 types, 16 devices per type, 24-bit offset
ddddnnnn aaaaaaaa aaaaaaaa aaaaaaaa
Medium: 31 types, 256 devices per type, 16-bit offset
000ddddd nnnnnnnn aaaaaaaa aaaaaaaa
Small: 31 types, 1024 devices per type, 10-bit offset
0000000d ddddnnnn nnnnnnaa aaaaaaaa
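The three layouts can be expressed as field widths and packed or unpacked with shifts. The slide does not say how a decoder tells the encodings apart when their leading bits overlap, so this sketch takes the encoding as an explicit argument rather than guessing. The function and table names are illustrative.

```python
# Field widths (DevType, DevNum, DevOffset) in bits, read off the bit patterns.
# Unused leading bits are zero.
LAYOUTS = {
    "large": (4, 4, 24),
    "medium": (5, 8, 16),
    "small": (5, 10, 10),
}


def pack_io_address(encoding, dev_type, dev_num, dev_offset):
    """Build a 32-bit IO address from its three fields."""
    type_bits, num_bits, offset_bits = LAYOUTS[encoding]
    for value, bits in ((dev_type, type_bits), (dev_num, num_bits), (dev_offset, offset_bits)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return (dev_type << (num_bits + offset_bits)) | (dev_num << offset_bits) | dev_offset


def unpack_io_address(encoding, address):
    """Split a 32-bit IO address into (DevType, DevNum, DevOffset)."""
    type_bits, num_bits, offset_bits = LAYOUTS[encoding]
    dev_offset = address & ((1 << offset_bits) - 1)
    dev_num = (address >> offset_bits) & ((1 << num_bits) - 1)
    dev_type = (address >> (num_bits + offset_bits)) & ((1 << type_bits) - 1)
    return dev_type, dev_num, dev_offset


address = pack_io_address("medium", 0x1F, 0x80, 0xBEEF)
print(hex(address), unpack_io_address("medium", address))
```

The device counts on the slide match these widths: 2^4 = 16, 2^8 = 256 and 2^10 = 1024 devices per type.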
DynaBus IO Commands
Processor is always transaction requestor
· IORead
32-bit transfer from IO device to processor
· IOWrite
32-bit transfer from processor to IO device
· BIOWrite
32-bit transfer from processor to IO device with broadcast on device type
unsafe since reply is generated by bus terminator, not by target device
DynaBus IO from a DMA device

· WriteBlockRequest
overrides consistency protocol: value sent is taken as definitive independently of current ownership
simpler for the IO device than full cache emulation
more efficient as read-before-write is not needed
· ReadBlockRequest
IO devices use regular transaction for DMA reads
simple for IO device if late consistency not required
· Address translation
must be provided by the IO device as only physical addresses are carried on the DynaBus
IO devices may use MapCache services or have loadable translation tables
Bridges to commercial Busses
· Provides two-way transaction mapping
DynaBus -> commercial bus
commercial bus -> DynaBus
commercial bus ITs -> DynaBus
· Address mapping
address space size mismatches
virtual address IO
· Interrupt mapping
depends highly on commercial bus
· Flexibility at the cost of performance
· No support of consistency for RAM on commercial bus
· Support for multiple commercial buses
· Support for multiple bridges to single bus
Performance
Bridges offer easy extensibility at a price:
· Throughput is limited
- by allocation time on commercial bus
- by usage of the SmallCache on DynaBus
· In current implementation, effective throughput may reach
- 20 MBytes/sec commercial bus -> DynaBus
- 12 MBytes/sec DynaBus -> commercial bus
Both limitations due to single outstanding request from cache
· Latency may be high
- up to 2 ms commercial bus -> DynaBus
- may be a problem for certain devices
DynaBus Interruptions
· No specific interrupt transaction
ITs are transmitted as IOWrites from IO devices to caches at a well-known address
· 32 edge-triggered non-prioritized interrupts
Priority management left to processor
· Interrupts may be broadcast or directed
Permits dedicated IOP(s) or dynamic IOP allocation
· Interrupts may be generated by processor
Permits interprocessor task scheduling
· 2 registers per SmallCache
Interrupt status
Set bit(s), Clear bit(s), Read
Interrupt mask
Write, Read
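The two per-cache registers and the broadcast or directed delivery can be modelled in a few lines. This is a sketch under assumptions: a mask bit of 1 is taken to mean "enabled", delivery is modelled as a direct method call rather than an IOWrite on the bus, and all names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF  # 32 interrupt lines


class InterruptRegisters:
    """Hypothetical model of the status and mask registers of one SmallCache."""

    def __init__(self):
        self.status = 0
        self.mask = 0

    def set_bits(self, bits):
        self.status |= bits & WORD_MASK

    def clear_bits(self, bits):
        self.status &= ~bits & WORD_MASK

    def write_mask(self, mask):
        self.mask = mask & WORD_MASK

    def pending(self):
        """Interrupts that are both raised and enabled (mask bit 1 assumed to enable)."""
        return self.status & self.mask


def post_interrupt(caches, bits, target=None):
    """Broadcast to every cache, or direct to one cache when target is given."""
    destinations = caches if target is None else [caches[target]]
    for cache in destinations:
        cache.set_bits(bits)


caches = [InterruptRegisters() for _ in range(3)]
for cache in caches:
    cache.write_mask(0b0110)

post_interrupt(caches, 0b0010)            # broadcast
post_interrupt(caches, 0b0101, target=1)  # directed to cache 1
print([bin(cache.pending()) for cache in caches])
```

Because the interrupts are not prioritized in hardware, the order in which a processor services its pending bits is left to software.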
VII. Debug Bus & Bootstrap Processor
· Principles
· Summary specification
· DBus and system bootstrap
· DBus and debugging
Principles
· Independent from DynaBus
· Controlled by bootstrap/debug processor
· All other system components are DBus slaves
· Used to
tune clock
read chip identification
test chips
initialize chips (software configuration)
start/stop system
debug hardware failures
· Permits auto-configured systems
· Permits self-testing systems
Debug Bus specification
· Low-speed asynchronous serial bus
6 wires (+1)
approx. 1 MBit/sec
· Serial 16-bit addressing
· Geographical addressing and hierarchical decoding
board -> hybrid -> chip -> register
· Functions
Load/Unload chip register
Request chip to execute a function
Reset control
· Very simple implementation
1 chip (PAL) on PC/AT bus
DBus and system bootstrap
· Bootstrap is controlled by bootstrap processor
currently the PC/AT that also handles IO
bootstrap processor may be very simplistic
· Typical bootstrap sequence
Assert Reset (DBus)
Analyze machine configuration (DBus)
Tune clock (DBus)
Perform chip sanity checking (DBus)
Initialize DynaBus Device IDs (DBus)
Initialize chip-specific parameters (DBus)
Synchronous reset of arbiters
Rescind Reset (DBus)
Start arbiters (SStop)
Load bootstrap code
Start a processor (through cache)
DBus and debugging
· Provides freeze function
Freeze is asynchronous and non-restartable
DynaBus provides packet-synchronous freeze by arbiter stop
· Permits reading chip status for post-mortem dump
As provided by chip designer
Full LSSD possible
· May initiate chip self-test
VIII. Computer Configurations
· Multi-Board parallel Computer
using conventional packaging

· Mono-Board Computer
using conventional packaging

· Add-on Multiprocessor
using Hybrid packaging

· High performance parallel Computer
using Hybrid packaging
MultiBoard Parallel Computer
with conventional Packaging
Current implementation allows 24 processors and 8 memory Banks
Monoboard system
Add-on Multiprocessor
MultiBoard Parallel Computer
with Hybrid Modules
IX. Conclusions


· Key Features of the Architecture
· Results and Milestones
· Transfer of Research
· Future Architecture
Key Features
· Unique VLSI Bus
  => SPEED & PERFORMANCE
order of magnitude in speed, multiprocessor oriented & good for Standardization for VLSI
· Unique Chip Set implementation
  => COST & SIMPLICITY
only seven LSI chips for the whole family; chips replace the functionality of complete boards
· Unique Packaging use
  => COST, SIZE & PERFORMANCE
Advanced packaging used at the architecture level
for mid & low end computer
· Unique open architecture to industrial standards
  => EASY INTEGRATION
Any standard microprocessor, Bridge with existing standard busses, Standard operating system

order of magnitude in price/performance
1-2 years ahead of the competition
Results & Milestones

(Sparc Softcard :
· Wire-wrapped prototype in June)
Chips Set :
· 5 chips returned from fab in February 88
(BIC, Arbiter, IOBridge, MemController, Simplex)

· 3 chips will be sent 2Q88
(Cache, MAP Cache, Display/Printer)

Wire-Wrapped June87 Prototype :

· 4Q 88 with 4 Sparcs
   (depends on the Cache)
  
  
High Speed Bus Prototype & Packaging :
· 2Q 88
Transfer of Research

Try to standardize the BUS
a VLSI BUS does not exist yet
Partnership

with Equipment makers like :
SUN
with Semi-conductor companies :
Motorola, National, AMD, Cypress, ...
good chances because open architecture compatible with
industrial standards : any microprocessor, any add-in boards &
possible multi-operating system...
Future Directions
This architecture is only at its beginning. Many evolutions are expected : progressive use of the advanced packaging, a second level of cache & other topologies...
Future Directions on parallelism

Explore future parallel architectures while keeping this general model of
"Shared Memory with Caches"
Big opportunities with new operating systems like Mach, or SunOS phase3 which include this model of communication in the kernel, and languages like Ada or Cedar...

High Speed Bus is Crucial for supporting high performance VLSI operators

High speed vector processor, High speed network controller, graphic controller, High speed disk controller, Compression/Decompression of images...