
Dragon Architecture

May 1988

DRAGON System Architecture

I Introduction

II The DynaBus

III VLSI Chip Set

IV Packaging

V Address Mapping

VI Input-Output & Interruption

VII Debug Bus & BootStrap processor

VIII Computer Configurations

IX Conclusions

I. INTRODUCTION

· MOTIVATIONS

· DRAGON TECHNOLOGY

· APPLICATIONS

Motivations

Foundations for building architectures for a wide range of document processing machines

· controllers : Network, Scanners...

· servers : Data base servers, Printers, Gateway

· workstations : mid, high & very high end

=> high data bandwidth

Parallel processing architecture research

· Project started in 1981

· Follows a first generation of shared-memory
multiprocessors on a conventional bus,
currently implemented in a Xerox product.

· new generation using more advanced concepts
& technology : VLSI, packaging

· Compatible with Operating systems like Mach
or SunOS phase3 & languages like Cedar

Dragon Technology

· Communications studies : VLSI BUS


· VLSI Chip Set


· Packaging


Computer Architecture


Configurations

Monoboard system


Computer Server
24 processors and 8 memory Banks


APPLICATIONS

Applications & markets

· High end parallel computer

· Desk-Top multiprocessor
· High end Workstations

· High end Print servers
· File and Database servers

· Add-in boards for standard platforms
· Industrial control
· OEM multiprocessors

· Chip set & packaging

II. The DynaBus

· HIGH SPEED BUS ARCHITECTURE
Pipeline
Bus configurations

· ELECTRICAL CONSIDERATIONS
Termination
Clock skew
Packaging

· LOGIC OF THE BUS
64 bits
DBus
Performance

· PROTOCOL

HIGH SPEED BUS ARCHITECTURE

Principle
Bus Cycle = Time to transmit information from one device to another


Tcycle = TckQ + Tprop + Tsetup + Tskew
8 ns   = 1 ns + 4 ns  + 1 ns   + 2 ns
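
The cycle budget above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and parameter names are illustrative and do not come from the Dragon documentation.

```python
def bus_cycle_ns(t_ckq, t_prop, t_setup, t_skew):
    """Minimum bus cycle in ns: clock-to-output + propagation + setup + skew."""
    return t_ckq + t_prop + t_setup + t_skew

# Values from the slide, in nanoseconds.
cycle = bus_cycle_ns(t_ckq=1, t_prop=4, t_setup=1, t_skew=2)
freq_mhz = 1000 / cycle  # a cycle of N ns corresponds to 1000/N MHz
print(f"Tcycle = {cycle} ns ({freq_mhz:.0f} MHz)")
```

With the figures given, propagation time accounts for half of the 8 ns cycle.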

Pipeline


Only one Bidirectional segment in a pipelined Bus

=> Backpanel in a Multi-Board system

Bus Configurations

Level 1 : Mono-Board Computer


Level 2 : Multi-Board Computer


Level 3 : Multi-Board & Multi-Module Computer


ELECTRICAL CONSIDERATIONS

Bus termination is required to balance the lines

For CMOS version of BIC

· open Drain for dissipation in resistors

· 50 Ohms at each end

· Power dissipation=U^2 /R

2 Volts swing => 80 mW/resistance
128 resistances => 10 Watts / Bus
10 Watts / BackPanel
15 Watts per Board
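
The termination figures follow from P = U^2 / R. A minimal sketch of that arithmetic, assuming the 2 V swing appears across each 50 Ohm resistor (names are illustrative):

```python
def termination_power_w(swing_v, resistance_ohm, n_resistors):
    """Return (power per resistor, total power) in watts, using P = U^2 / R."""
    per_resistor = swing_v ** 2 / resistance_ohm
    return per_resistor, per_resistor * n_resistors

# Slide values: 2 V swing, 50 Ohm terminations, 128 resistors per bus.
per_r, total = termination_power_w(swing_v=2.0, resistance_ohm=50.0, n_resistors=128)
print(f"{per_r * 1000:.0f} mW per resistor, {total:.2f} W per bus")
```

The exact total is 10.24 W, which the slide rounds to 10 Watts per bus.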

Clock Skew

Clock distribution is critical, low Skew is crucial

Tcycle > (TckQ)max+Tprop+(Tsetup)max-Tskew
Tskew < (TckQ)min + Tprop + (Tsetup)min

CMOS chips need a large, uncontrolled clock amplifier to drive the high-capacitance internal clock

Hierarchical Clock distribution with BIC generating the Clock of each Chip


Packaging & Transmission Lines

To obtain short cycles, lines must be balanced and act as perfect transmission lines


Standard PGA packages are difficult to use because of stubs, but SMD FQPC packages are very good


The next step is to use Hybrid modules


LOGIC OF THE BUS

Minimal number of wires for a 64-bit data path. All commands are coded on 64 bits


Performance
Tcycle    Raw Bandwidth    Usable Bandwidth (raw * 4/7)
25 ns     320 MB/sec       182 MB/sec
10 ns     800 MB/sec       457 MB/sec
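
The bandwidth figures can be reproduced as follows. This sketch assumes 1 MB = 10^6 bytes and reads the 4/7 factor as 4 data cycles out of a 7-cycle block transfer; that reading is an interpretation, not something the slide states.

```python
def raw_bandwidth_mb_s(cycle_ns, width_bits=64):
    """Raw bandwidth in MB/s for one bus-wide transfer per cycle."""
    bytes_per_cycle = width_bits // 8
    return bytes_per_cycle * 1000 / cycle_ns  # bytes per ns * 1000 = MB/s

def usable_bandwidth_mb_s(cycle_ns, data_cycles=4, total_cycles=7):
    """Usable bandwidth, scaling raw bandwidth by the fraction of data cycles."""
    return raw_bandwidth_mb_s(cycle_ns) * data_cycles / total_cycles

for t in (25, 10):
    print(t, "ns:", raw_bandwidth_mb_s(t), "MB/s raw,",
          int(usable_bandwidth_mb_s(t)), "MB/s usable")
```

The usable figures in the table are the truncated values (182.86 and 457.14 MB/sec).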

DBus
Seven wires. Used for initialization & debugging

BUS INTERFACE

Common logical connection


DynaBus Logical Interface

Bus Transactions

Transaction     Request            Reply
Read Block      RA,VRA             RA,D0,D1,D2,D3
Write Block     RA,D0,D1,D2,D3     RA,X
Flush Block     RA,D0,D1,D2,D3     RA,X
Write Single    RA,D               RA,D
CondWS          RA,D               RA,D,D,D,D
IORead          IOA,X              IOA,D
IOWrite         IOA,D              IOA,X
BIOW            IOA,D              IOA,X
Map             VA,X               RA,X
DeMap           RA,X               RA,X

PROTOCOL

Protocol designed for shared-memory multiprocessors

· Hardware data consistency

· Split-cycle for very High speed

· Supports multi-bank memory

· Bridges to industry-standard buses

· Supports Multi-level Caches

· Mathematical model and proof of coherency

Shared Memory Model


· Each operation is atomic

· Operations are serialized

· Real Time ordering respected

Single-Level Operation


· Invariants

I1. > 1 cached copies => Shared is set in each

I2. At most one cache has Owner set

I3. Copy last written has Owner set

I4. Cached copies have identical values
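
The four single-level invariants can be expressed as a small checker over the cached copies of one block. This is an illustrative sketch only; the `Copy` record and `check_single_level` function are hypothetical names, not part of the DynaBus specification.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    """One cache's copy of a given block."""
    value: int
    shared: bool = False
    owner: bool = False

def check_single_level(copies, last_writer=None):
    """Return the names of violated invariants for the copies of ONE block.

    last_writer is the index of the copy last written, or None if unknown.
    """
    violations = []
    if len(copies) > 1 and not all(c.shared for c in copies):
        violations.append("I1")  # more than one copy => Shared set in each
    if sum(c.owner for c in copies) > 1:
        violations.append("I2")  # at most one cache has Owner set
    if last_writer is not None and not copies[last_writer].owner:
        violations.append("I3")  # copy last written has Owner set
    if len({c.value for c in copies}) > 1:
        violations.append("I4")  # cached copies have identical values
    return violations

good = [Copy(7, shared=True, owner=True), Copy(7, shared=True)]
print(check_single_level(good, last_writer=0))
```

A checker of this kind could be run after each simulated bus transaction when testing a model of the protocol.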

Two-Level Operation


Invariants for two level

I1. Every copy in a cache is also in its parent

I2. The parent of a copy has ExistsBelow set

I3. >1 brother copies => Shared set in each

I4. The son of a Shared copy has Shared set

I5. The parent of an Owner copy has Owner set

I6. At most one brother has Owner set

I7. Copy last written by a processor has Owner set

I8. Shared copies have identical values

III. VLSI Chip Set

· CURRENT FAMILY OF SEVEN CHIPS

BIC
Arbiter
Small Cache
Memory Controller
IOBridge
Display/Printer
Map Cache

Bus Interface Chip : BIC

Contains all the electrical specifics of the bus

· Slice of Pipelined register ( 2 * 24 bits)

· Controls access to the Bus

· Contains low voltage driver & receiver

· Clock Skew regeneration

· Current implementation for Hybrid Modules

Arbiter

Controls Bus access for up to 64 masters

· Distributed arbiter. Current implementation :
one arbiter chip controls 8 masters
up to eight arbiters

· Works with all pipeline configurations

· 7 priority levels, Round robin inside one level

· Hold management, for lock of requestors

· System Stop generation

· DBus predecoder
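
The "priority levels with round robin inside one level" policy can be sketched as below. This is not the arbiter chip's logic: the class name, the convention that level 0 is the highest priority, and the dictionary interface are all assumptions made for illustration.

```python
class Arbiter:
    """Toy model: fixed priority across levels, round robin within a level."""

    def __init__(self, n_levels=7):
        # Per-level record of the master granted most recently.
        self.last_granted = [-1] * n_levels

    def grant(self, requests):
        """requests maps master id -> priority level (0 = highest, assumed).

        Returns the granted master id, or None if nobody is requesting.
        """
        if not requests:
            return None
        level = min(requests.values())
        candidates = sorted(m for m, p in requests.items() if p == level)
        last = self.last_granted[level]
        # First candidate after the last one granted at this level, wrapping around.
        winner = next((m for m in candidates if m > last), candidates[0])
        self.last_granted[level] = winner
        return winner

arb = Arbiter()
reqs = {3: 1, 5: 1, 9: 2}          # masters 3 and 5 at level 1, master 9 at level 2
print([arb.grant(reqs) for _ in range(3)])
```

Masters 3 and 5 alternate while both keep requesting; master 9 is served only once no higher-priority request is pending.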

Small Cache

Interface between a Processor and the Bus

· Contains the Snoopy algorithm for consistency

· Fully associative Dual-port memory
Virtual CAM on the Processor side
& Real CAM on the Bus side

· Efficient first-level Cache of the Virtual to Real Table
built in at no extra cost

· Implements Conditional Write, for efficient
multiprocessor locking on a split-cycle Bus

· Entry point on the Bus for all devices playing
the consistency game. Example IOBridge.

· Current implementation in 2 micron : 5 KB
with 0.8 micron => 32 KB / Chip

Cache Block Diagram


Functional Specifications: MemOps

· PRead 32 bit address, 32 bit data

· PWrite 32 bit address, 32 bit data

· PByteWrite 32 bit address, 32 bit data

4 write enable bits

any of 16 patterns allowed

· CWS 32 bit address, 64 bit data
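
The effect of the four write enable bits in PByteWrite can be illustrated as a byte-wise merge. This is a sketch; the assumption that enable bit 0 selects the least significant byte is made here and is not stated on the slide.

```python
def byte_write(old_word, new_word, enables):
    """Merge new_word into old_word under a 4-bit write enable pattern.

    Enable bit i replaces byte i of the 32-bit word
    (bit 0 = least significant byte, an assumed convention).
    """
    result = old_word
    for i in range(4):
        if (enables >> i) & 1:
            byte_mask = 0xFF << (8 * i)
            result = (result & ~byte_mask) | (new_word & byte_mask)
    return result & 0xFFFFFFFF

print(hex(byte_write(0x11223344, 0xAABBCCDD, 0b0101)))
```

All 16 enable patterns are handled uniformly, including 0b0000 (no change) and 0b1111 (full word write).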

Functional Specifications: CWS

· CWS[addr, old, new] RETURNS [sample] = {

sample ← addr^;

IF sample = old THEN addr^ ← new

}

· Implemented in caches

· No bus traffic for private data

· No locks anywhere

· Maximum possible overlap
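
The CWS semantics, and the way software can build a lock-free update on top of it, can be sketched in Python. The `Memory` class and `atomic_increment` helper are hypothetical; in the real system the atomicity comes from the caches and the bus protocol, which this single-threaded model does not reproduce.

```python
class Memory:
    """Toy word-addressed memory offering Conditional Write Single."""

    def __init__(self):
        self.words = {}

    def cws(self, addr, old, new):
        """Store new at addr only if it currently holds old; return the sampled value."""
        sample = self.words.get(addr, 0)
        if sample == old:
            self.words[addr] = new
        return sample

def atomic_increment(mem, addr):
    """Retry loop: re-read and retry if another writer changed the word in between."""
    while True:
        old = mem.words.get(addr, 0)          # ordinary read
        if mem.cws(addr, old, old + 1) == old:
            return old + 1

mem = Memory()
mem.words[0x10] = 5
print(mem.cws(0x10, 4, 9), mem.words[0x10])   # comparison fails, word unchanged
print(atomic_increment(mem, 0x10))
```

The caller detects success by comparing the returned sample with the expected old value, so no separate lock is needed.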

Functional Specifications: IO

· Single common IO address space

· Much like memory (ie. hit/miss)

· Local locations (accessible via P or B)

CWSOld (32)

CWSNew (32)

AidReg (32)

FaultCode (32)

InterruptStatus (32)

InterruptMask (32)

Operating Mode (32)

Functional Specifications: Mapping

· One address space at a time

· Specified by AidReg

· No data flush on space change

· Writing to AidReg clears all VPValid

· Demap[realPage]

Cache initiates DeMap transaction

On reply, all caches match and clear VPValid

· Aliasing avoided automatically

Match on RA before writing

IOBridge

Interface between the DynaBus and an industry-standard Slow Bus

· Uses a Small Cache for access to the DynaBus.
A future implementation in 1.2 micron will
merge the IOB & the Cache

· Provides transparent access & mapping from the Slow Bus to the DynaBus

· Provides transparent access from the DynaBus to the Slow Bus

· Choice of Virtual Address or Real Address for IO

· Current implementation for the PC/AT, but easily adaptable to other standards : Micro-Channel, NuBus, or a special internal Bus

Memory Controller

Controls a Bank of memory from 8 MBytes up to 1 GByte

· Implements consistency algorithm

· Uses Memory in nibble-mode for fast access

· Implements ECC on 64 bits + 8
Corrects one error, detects two

· Multi-Bank, init by DBus

DISPLAY ARCHITECTURE

· Complex problem for multiprocessor

· Many different architectures, depending
on the required performance

· The solution studied focuses on only one
particular application : low performance/low cost

· the use of a high speed bus with consistency can provide some innovative solutions for the high end...

Conventional Display Architecture


Local Bus Configuration


Interim Solution


Display/Printer Chip

Refresh of the Display/Printer from the Bus in Background

· Always low-priority, except when its Fifo is empty

· Fully configurable : from B&W to 8 bits/pixel
and up to 200 MByte/sec

· Up to three controllers for 24-bit Color

· Architecture adapted for multiprocessor, avoiding
cache flushing

· Perfect with Second level Cache, in which the
Frame Buffer is a multi-MegaByte Cache

Map Cache

Provides a second-level Cache of the
Virtual to Real address Table

· Because the first level cache is inside the Small
Cache, this Map is not used at every miss

· Replacement algorithm done entirely in
software, by the processor that takes the map miss.
Control of victimization for IO

· Multiple MapCache Chips possible

· First implementation 256 entries

IV. Packaging

· HYBRID PACKAGE

· ASSEMBLY & CONNECTOR

HYBRID PACKAGE

Principles

· Design of a "Chip Carrier" containing many Chips
· Intermediate level between Chips and Boards
· Perfect integration into our architecture
· Substrate in Silicon for experimentation
· Low to very low cost using large area process


Advantages for lots of applications

· increase of speed
· gain in space
· solves power dissipation problems
· gain in cost

HYBRID ASSEMBLY


V. Address Mapping

· Principles

· Mapping Process

· Address Spaces and Sharing areas

· Map Cache functions

· Paging in and Paging Out

Principles

· DynaBus deals exclusively with real addresses

this is required for snooping

· Small Cache does Virtual -> Real mapping

· Map-Cache(s) provide a system-wide TLB

· Hierarchical search for Mapping

first try inside current cache entries

then ask the Map Cache

then trap for software handling
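
The three-stage search can be sketched as follows. The dictionaries standing in for the Small Cache entries and the Map Cache, the `software_handler` callback, and the refill of both levels after a miss are assumptions made for illustration.

```python
def translate(vp, cache_entries, map_cache, software_handler):
    """Return (rp, where_found) for virtual page vp using the hierarchical search."""
    # 1. First try the mappings already held in the cache's own entries.
    if vp in cache_entries:
        return cache_entries[vp], "cache"
    # 2. Then ask the Map Cache.
    if vp in map_cache:
        rp = map_cache[vp]
        cache_entries[vp] = rp            # keep the mapping locally (assumed)
        return rp, "map cache"
    # 3. Finally trap to software, which consults the map table in memory.
    rp = software_handler(vp)
    map_cache[vp] = rp                    # software refills the Map Cache (assumed)
    cache_entries[vp] = rp
    return rp, "software"

cache_entries = {1: 100}
map_cache = {2: 200}
handler = lambda vp: vp + 1000            # stand-in for the software map table walk
for vp in (1, 2, 3, 3):
    print(vp, translate(vp, cache_entries, map_cache, handler))
```

The second lookup of page 3 is satisfied from the cache entries, which is why the Map Cache is not consulted on every miss.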

Mapping Process

Inside the Cache


Mapping Process (cont)

Mapping in three steps


Pages structure

Constants

· Pages are 4 KByte, unit of mapping & protection

· Virtual Page (VP) 22 bits, Real Page (RP) 22 bits now

· Address Space Number ASN on 32 bits

· Flags : KWe, UWe, URe, Dirty, (Used ?)

Map Table

· Resides in main memory

· Structured by software
hash table (aid,vp)->(rp,flags)
or hierarchical tables

Translation mechanism

Three exceptions for the translation

if ASN = -1 Identity Map
Useful for StartUp & IO

if VP inside Map Bypass Area, RP is computed
if vp∧BypassMask = BypassPattern∧BypassMask

rp ← (BypassBase∧BypassMask) ∨ (vp∧~BypassMask)

if VP inside Shared Area (Kernel..) we use ASN=0
if vp∧SharedMask = SharedPattern∧SharedMask

[rp, flags] ← LookUp [vp, ASN=0]

Else (normal case)

[rp, flags] ← LookUp [vp, ASN]
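
The translation rules can be sketched in Python, reading the mask operators in the formulas as bitwise AND, OR and NOT. The register dictionary, the `lookup` callback, the encoding of ASN = -1 as all ones, and returning no flags for the identity and bypass cases are assumptions of this sketch.

```python
PAGE_BITS = 22                       # VP and RP are 22 bits wide
PAGE_MASK = (1 << PAGE_BITS) - 1
IDENTITY_ASN = 0xFFFFFFFF            # "ASN = -1" read as all ones in 32 bits (assumed)

def translate_page(vp, asn, regs, lookup):
    """Return (rp, flags); flags is None where the slide does not define them."""
    # Exception 1: identity map.
    if asn == IDENTITY_ASN:
        return vp, None
    # Exception 2: map bypass area, RP computed from the bypass registers.
    mask = regs["BypassMask"]
    if (vp & mask) == (regs["BypassPattern"] & mask):
        rp = (regs["BypassBase"] & mask) | (vp & ~mask & PAGE_MASK)
        return rp, None
    # Exception 3: shared area, looked up in address space 0.
    mask = regs["SharedMask"]
    if (vp & mask) == (regs["SharedPattern"] & mask):
        return lookup(vp, 0)
    # Normal case.
    return lookup(vp, asn)

# Made-up register values and table contents, for illustration only.
regs = {"BypassMask": 0x3FF000, "BypassPattern": 0x3FF000, "BypassBase": 0x001000,
        "SharedMask": 0x200000, "SharedPattern": 0x200000}
table = {(0, 0x200005): (0x42, "KWe"), (5, 0x000007): (0x99, "URe")}
lookup = lambda vp, asn: table[(asn, vp)]
print(translate_page(0x3FF123, 5, regs, lookup))
```

The order of the tests matters: a page that matches both the bypass and the shared patterns is treated as a bypass page here, following the order in which the exceptions are listed.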

Map Cache Functions

· Map [aid, vp] -> [rp, flags]
MapFault error if entry not present

· ReadEntry [vp] -> [rp, flags]; implicit ASN
MapFault error if entry not present

· WriteEntry [vp, rp, flags, valid]; implicit ASN
IOAccessFault error if not in kernel mode

· Read & Write Internal Register
ASN
SharedPattern, SharedMask
BypassPattern, BypassMask, BypassBase
SubSetMask, SubSetPattern

Paging In & Paging Out

Paging In

Only requires inserting the new VP->RP entry
in the Memory Table

Paging Out

Remove all virtual page entries for that physical
page from the Memory Table

Remove all virtual page entries from the
Map Cache

Generate a DynaBus DeMap Request

Issue the disk IO if the Page was dirty

VI INPUT-OUTPUT & Interruptions

· Address encoding

· DynaBus IO Commands

· DynaBus IO from a Processor

· DynaBus IO from a DMA device

· Bridges to commercial Busses

· Interruptions

Address encoding

· 32-bit addresses (now)

addressable item is a 32-bit word

· split in 3 fields: DevType, DevNum, DevOffset

DevType -> type of IO device (SmallCache, IOBridge, DisplayController, MapCache ...)

DevNum -> number of device within type (unique)

DevOffset -> word offset within device

· 3 different encodings

Large: 14 types, 16 devices per type, 24-bit offset

ddddnnnn aaaaaaaa aaaaaaaa aaaaaaaa

Medium: 31 types, 256 devices per type, 16-bit offset

000ddddd nnnnnnnn aaaaaaaa aaaaaaaa

Small: 31 types, 1024 devices per type, 10-bit offset

0000000d ddddnnnn nnnnnnaa aaaaaaaa
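
Field extraction for the three layouts can be sketched as below. The caller names the encoding explicitly, because the rule a device uses to tell the encodings apart is not spelled out here; the `FORMATS` table and function name are illustrative.

```python
# Field widths (DevType, DevNum, DevOffset) read off the bit patterns above.
FORMATS = {
    "large":  (4, 4, 24),    # ddddnnnn aaaaaaaa aaaaaaaa aaaaaaaa
    "medium": (5, 8, 16),    # 000ddddd nnnnnnnn aaaaaaaa aaaaaaaa
    "small":  (5, 10, 10),   # 0000000d ddddnnnn nnnnnnaa aaaaaaaa
}

def split_io_address(addr, fmt):
    """Split a 32-bit IO address into (dev_type, dev_num, dev_offset)."""
    type_bits, num_bits, offset_bits = FORMATS[fmt]
    dev_offset = addr & ((1 << offset_bits) - 1)
    dev_num = (addr >> offset_bits) & ((1 << num_bits) - 1)
    dev_type = (addr >> (offset_bits + num_bits)) & ((1 << type_bits) - 1)
    return dev_type, dev_num, dev_offset

print(split_io_address(0xA3000010, "large"))
print(split_io_address(0x05120040, "medium"))
print(split_io_address((3 << 20) | (0x155 << 10) | 0x2AA, "small"))
```

In each layout the leading zero bits plus the three fields add up to 32 bits.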

DynaBus IO Commands

Processor is always transaction requestor

· IORead

32-bit transfer from IO device to processor

· IOWrite

32-bit transfer from processor to IO device

· BIOWrite

32-bit transfer from processor to IO device with broadcast on device type

unsafe since reply is generated by bus terminator, not by target device

DynaBus IO from a DMA device

· WriteBlockRequest

overrides consistency protocol: value sent is taken as definitive independently of current ownership

simpler for the IO device than full cache emulation

more efficient as read-before-write is not needed

· ReadBlockRequest

IO devices use regular transaction for DMA reads

simple for IO device if late consistency not required

· Address translation

must be provided by the IO device as only physical addresses are carried on the DynaBus

IO devices may use MapCache services or have loadable translation tables

Bridges to commercial Busses

· Provides two-way transaction mapping

DynaBus -> commercial bus

commercial bus -> DynaBus

commercial bus interrupts -> DynaBus

· Address mapping

address space size mismatches

virtual address IO

· Interrupt mapping

depends highly on commercial bus

· Flexibility at the cost of performance

· No support of consistency for RAM on commercial bus

· Support for multiple commercial buses

· Support for multiple bridges to single bus


Performance

Bridges offer easy extensibility at a price:

· Throughput is limited

- by allocation time on commercial bus

- by usage of the SmallCache on DynaBus

· In current implementation, effective throughput may reach

- 20 MBytes/sec commercial bus -> DynaBus

- 12 MBytes/sec DynaBus -> commercial bus

Both limitations due to single outstanding request from cache

· Latency may be high

- up to 2 ms commercial bus -> DynaBus

- may be a problem for certain devices

DynaBus Interruptions

· No specific interrupt transaction

Interrupts are transmitted as IOWrites from IO devices to caches at a well-known address

· 32 edge-triggered non-prioritized interrupts

Priority management left to processor

· Interrupts may be broadcast or directed

Permits dedicated IOP(s) or dynamic IOP allocation

· Interrupts may be generated by processor

Permits interprocessor task scheduling

· 2 registers per SmallCache

Interrupt status

Set bit(s), Clear bit(s), Read

Interrupt mask

Write, Read
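
The two per-cache interrupt registers can be modelled as below. The class, the method names, and the convention that a mask bit of 1 enables the corresponding interrupt are assumptions of this sketch.

```python
WORD = 0xFFFFFFFF   # 32 interrupt lines, one bit each

class InterruptRegs:
    """Toy model of a SmallCache's interrupt status and mask registers."""

    def __init__(self):
        self.status = 0
        self.mask = 0

    def set_bits(self, bits):
        """Post interrupts, as an IOWrite to the status register would."""
        self.status |= bits & WORD

    def clear_bits(self, bits):
        """Acknowledge interrupts."""
        self.status &= ~bits & WORD

    def write_mask(self, mask):
        self.mask = mask & WORD

    def pending(self):
        """Interrupts that are both posted and enabled (1 = enabled, assumed)."""
        return self.status & self.mask

regs = InterruptRegs()
regs.set_bits(0b101)        # two devices post interrupts 0 and 2
regs.write_mask(0b100)      # processor enables only interrupt 2
print(bin(regs.pending()))
```

Since the interrupts are non-prioritized, choosing which pending bit to service first is left to the processor's software.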

VII Debug Bus & Bootstrap Processor

· Principles

· Summary specification

· DBus and system bootstrap

· DBus and debugging

Principles

· Independent from DynaBus

· Controlled by bootstrap/debug processor

· All other system components are DBus slaves

· Used to

tune clock

read chip identification

test chips

initialize chips (software configuration)

start/stop system

debug hardware failures

· Permits auto-configured systems

· Permits self-testing systems

Debug Bus specification

· Low-speed asynchronous serial bus

6 wires (+1)

approx. 1 MBit/sec

· Serial 16-bit addressing

· Geographical addressing and hierarchical decoding

board -> hybrid -> chip -> register

· Functions

Load/Unload chip register

Request chip to execute a function

Reset control

· Very simple implementation

1 chip (PAL) on PC/AT bus

DBus and system bootstrap

· Bootstrap is controlled by bootstrap processor

currently the PC/AT that also handles IO

bootstrap processor may be very simple

· Typical bootstrap sequence

Assert Reset (DBus)

Analyze machine configuration (DBus)

Tune clock (DBus)

Perform chip sanity checking (DBus)

Initialize DynaBus Device IDs (DBus)

Initialize chip-specific parameters (DBus)

Synchronous reset of arbiters

Rescind Reset (DBus)

Start arbiters (SStop)

Load bootstrap code

Start a processor (through cache)

DBus and debugging

· Provides freeze function

Freeze is asynchronous and non-restartable

DynaBus provides packet-synchronous freeze by arbiter stop

· Permits reading chip status for post-mortem dump

As provided by chip designer

Full LSSD possible

· May initiate chip self-test

VIII. Computer Configurations

· Multi-Board parallel Computer
using conventional packaging

· Mono-Board Computer
using conventional packaging

· Add-on Multiprocessor
using Hybrid packaging

· High performance parallel Computer
using Hybrid packaging

MultiBoard Parallel Computer
with conventional Packaging

Current implementation allows 24 processors and 8 memory Banks


Monoboard system


Add-on Multiprocessor


MultiBoard Parallel Computer
with Hybrid Modules


IX. Conclusions

· Key Features of the Architecture

· Results and Milestones

· Transfer of Research

· Future Architecture

Key Features

· Unique VLSI Bus
=> SPEED & PERFORMANCE
order of magnitude in speed, multiprocessor oriented & good for Standardization for VLSI

· Unique Chip Set implementation
=> COST & SIMPLICITY
only seven LSI chips for the whole family; Chips replace the functionality of complete boards

· Unique Packaging use
=> COST, SIZE & PERFORMANCE
Advanced packaging used at the architecture level
for mid & low end computer

· Unique open architecture to industrial standards
=> EASY INTEGRATION
Any standard microprocessor, Bridge with existing standard busses, Standard operating system

order of magnitude in price/performance
1-2 years ahead of the competition

Results & Milestones

(Sparc Softcard :
· Wire-wrapped prototype in June)

Chip Set :
· 5 chips returned from fab in February 88
(BIC, Arbiter, IOBridge, MemController, Simplex)

· 3 chips will be sent 2Q88
(Cache, MAP Cache, Display/Printer)

Wire-Wrapped June87 Prototype :

· 4Q 88 with 4 Sparcs
(depends on the Cache)

High Speed Bus Prototype & Packaging :
· 2Q 88

Transfer of Research

Try to Standardize the BUS
a standard VLSI BUS does not exist yet

Partnership

with Equipment makers like :

SUN

with Semi-conductor companies :

Motorola, National, AMD, Cypress, ...

good chances because open architecture compatible with
industrial standards : any microprocessor, any add-in boards &
possible multi-operating system...

Future Directions

This architecture is only at its beginning. Many evolutions are expected : progressive use of Advanced packaging, a Second level of Cache & other topologies...

Future Directions on parallelism

Explore future parallel architectures while keeping this general model of
"Shared Memory with Caches"
Big opportunities with new operating systems like Mach, or SunOS phase3 which include this model of communication in the kernel, and languages like Ada or Cedar...

High Speed Bus is Crucial for supporting high performance VLSI operators

High speed vector processor, High speed network controller, graphic controller, High speed disk controller, Compression/Decompression of images...