THE DYNABUS: A VLSI BUS FOR USE IN MULTI-PROCESSOR SYSTEMS
VERSION 1.0

The DynaBus
A VLSI Bus for use in Multiprocessor Systems
L. Bland, J.C. Cuenod, D. Curry, J.M. Frailong, J. Gasbarro, J. Gastinel, B. Gunning, J. Hoel, E. McCreight, M. Overton, E. 
Richley, M. Ross,  and P. Sindhu
Dragon-88-08    Written 4 September 88        Revised 15 February 89 

©  Copyright 1986, 1987, 1988, 1989 Xerox Corporation.  All rights reserved. 
Abstract: The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data 
consistency, and VLSI implementation within the memory system of a shared memory multiprocessor.  Each DynaBus transaction consists 
of a request packet followed an arbitrary time later by a reply packet, with the bus being free to be used by other transactions in 
the interim.  Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows 
arbitrarily slow devices to be connected without degrading performance, and simplifies data consistency in systems with multiple 
levels of caching.  The bus provides a usable bandwidth of several hundred megabytes per second, permitting the construction of 
machines executing several hundred MIPS while providing high IO throughput.  An efficient protocol ensures that multiple copies of 
read/write data within processor caches are kept consistent and that IO devices stream data into and out of a consistent view of
memory.  Both the physical structure of the DynaBus and its protocol are designed specifically to allow a high level of system 
integration.  Complex functions such as memory and graphics controllers that traditionally required entire boards can be 
implemented in a single VLSI chip that is directly connected to the DynaBus. 
Keywords: VLSI DynaBus, Backpanel DynaBus, pipelined bus, timing, arbitration, DynaBus transactions, write-back cache, snoopy 
cache, data consistency, DynaBus signals, memory interconnect, multiprocessor bus, packet switched bus.  
FileName: [Dragon]<Dragon7.0>Documentation>DynaBus>DynaBusDoc.tioga, .ip

XEROX            Xerox Corporation
                Palo Alto Research Center
                3333 Coyote Hill Road
                Palo Alto, California 94304
Xerox Private Data
Contents
1.  Overview
2.  Definition of Terms
3.  Interconnection Schemes
4.  Chip Level Signals
5.  Arbitration and Flow Control
6.  Transactions
7.  Data Consistency
8.  Atomic Operations
9.  Input Output
10.  Address Mapping
11.  Error Detection and Reporting
Appendix I.  DynaBus Command Field Encoding

1.  Overview
    The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data consistency, and 
    VLSI implementation within the memory system of a shared memory multiprocessor.  Each DynaBus transaction consists of a request 
    packet followed an arbitrary time later by a reply packet, with the bus being free to be used by other transactions in the 
    interim.  Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows 
    arbitrarily slow devices to be connected without degrading performance, and simplifies data consistency in systems with 
    multiple levels of caching.  The bus provides a usable bandwidth of many hundreds of megabytes per second, permitting the 
    construction of machines spanning a wide range of cost and performance.  
    Because DynaBus is intended for use in high performance shared memory multiprocessors, there is an efficient protocol for ensuring 
    that processors see a consistent view of memory in the face of caching and IO.  With this protocol the hardware ensures that 
    multiple copies of read/write data in caches are consistent, and that both input and output devices are able to take cached 
    data into account.  The consistency protocol also provides a model of shared memory that is both conceptually simple and
    natural. 
    The DynaBus's physical structure and its protocol are designed to promote a high level of system integration.  Complex devices
    that traditionally required entire boards, including memory controllers, graphics controllers, high speed network controllers,
    and external bus controllers, can be implemented using a single chip connected directly to the DynaBus.  The result is a
    high performance, but compact system.  Within a computer, the DynaBus may be used both as a VLSI interconnect to tie chips 
    together on a single board and as a backplane bus to tie boards together over a backpanel.  Figure 1 shows an application with 
    two boards. 
    Figure 1:  The DynaBus is a VLSI interconnection system.  Its efficient, compact design promotes a high level of integration.
    Key to Abbreviations in Figure 1:
    C: Cache     Arb: Arbiter     IOB: Input/Output Bridge
    P: Processor    D/P: Display     MC: Memory Controller
    MAP: Map Cache    Mem: Memory
The DynaBus design is flexible enough to allow its use in a wide variety of configurations.  For example, in Figure 1, DynaBus A 
may be connected to DynaBuses B and C in two quite different ways.  In the first, the buses are connected by pipeline registers.  
Here there is logically one DynaBus but three electrically separate bus segments, and all traffic on one segment is propagated to  
the others.  In the second, the buses are connected by second level caches.  Here there are three logically distinct DynaBuses, and 
traffic from one bus may or may not go to the others.   Another configuration, not shown in the figure, is to use multiple 
DynaBuses operating independently and in parallel with one another to provide very high bandwidths.
    The DynaBus has 80 signals, 64 of which form a multiplexed data/address path (Data, Figure 2).  HeaderCycle indicates whether
    the information carried by Data is a packet header or not, while DParity is parity computed over Data and HeaderCycle.  Shared
    and Owner are signals used for data consistency.  RequestOut, Grant, and LongGrant constitute the interface to the DynaBus
    arbiter.  AParity provides a single bit parity check over the consistency and arbitration wires. The clock signal Clock 
    provides global timing, while ClockOut allows the skew of Clock to be controlled.  At the pins of a package that interfaces to 
    DynaBus, the Data port signals can be provided optionally with inputs and outputs separated for added flexibility in building 
    high performance pipelined bus configurations.  The pin BidEn allows a given die to be used in either the bidirectional mode, 
    or the higher performance unidirectional mode.

    Figure 2:  Chip Level DynaBus Signals.
    The DynaBus's operation can be understood best in terms of three layers: cycles, packets, and transactions (these layers correspond 
    to the electrical, logical, and functional levels, respectively). A bus cycle is simply one complete period of the bus clock; 
    it forms the unit of time and information transfer on the bus; the information is typically either an address or data. A packet 
    is a contiguous sequence of cycles; it is the mechanism by which one-way logical information transfer occurs on the bus. The 
    first cycle of a packet carries address and control information; subsequent cycles typically carry data. There are two 
    different packet sizes: 2 cycles and 9 cycles. A transaction consists of a request packet and a corresponding reply packet that 
    together perform some logical function (such as a memory read).
    Each DynaBus has an arbiter that permits the bus to be multiplexed amongst contending devices, each of which is identified by a
    unique deviceID.  Before a transaction can begin, the requesting device must get bus mastership from the arbiter. Once it has the bus,
    the device puts its request packet on the bus one cycle at a time, and then waits for the reply packet.  Packet transmission is
    uninterruptible in that no other device can take the bus away during this time, regardless of its priority.  The transaction is
    complete when another device gets bus mastership and sends a reply packet. Request and reply packets may be separated by an 
    arbitrary number of cycles, provided timeout limits are not exceeded (see Section 11.2).  In the interval between request and 
    reply, the bus is free to be used by other devices.  The arbiter is able to grant requests in such a way that no cycles are 
    lost between successive packets.
    A request packet contains at least the transaction type, the requester's deviceID, a small number of control bits, and an address;
    it may contain additional transaction dependent information. The reply packet contains the same transaction type, the original
    requester's deviceID, the original address, some control bits, and transaction dependent data. This replication of type,
    deviceID, and address information allows request and reply packets to be paired unambiguously. Normally, the protocol ensures a
    one-to-one correspondence between request packets and reply packets; however, because of errors, some request packets may not 
    get a reply. Thus, devices must not depend on the number of request and reply packets being equal since this invariant will not 
    in general be maintained. The protocol requires devices to provide a simple, but crucial guarantee: they must service request 
    packets in arrival order. This guarantee forms the basis for the DynaBus's data consistency scheme.
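    The pairing rule above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the
    names Header and OutstandingRequests, the use of a cycle counter, and the timeout handling are hypothetical and are not part of
    the DynaBus specification.  The sketch keys each outstanding request on the replicated fields (type, deviceID, address) and does
    not assume that every request eventually receives a reply.

```python
from typing import Dict, List, NamedTuple


class Header(NamedTuple):
    """Fields replicated in both request and reply (hypothetical layout)."""
    kind: str        # transaction type, e.g. "ReadBlock"
    device_id: int   # deviceID of the original requester
    address: int     # address carried in the header cycle


class OutstandingRequests:
    """Tracks requests that have not yet been matched with a reply."""

    def __init__(self) -> None:
        self._pending: Dict[Header, int] = {}  # header -> cycle when sent

    def sent(self, header: Header, cycle: int) -> None:
        self._pending[header] = cycle

    def reply(self, header: Header) -> bool:
        """Return True if the reply matches an outstanding request."""
        return self._pending.pop(header, None) is not None

    def expire(self, now: int, timeout: int) -> List[Header]:
        """Drop and return requests whose reply never arrived (assumed policy)."""
        late = [h for h, t in self._pending.items() if now - t > timeout]
        for h in late:
            del self._pending[h]
        return late
```

    Because a request may go unanswered after an error, the sketch removes stale entries explicitly rather than counting requests
    against replies.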
    The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, 
    input output, and address mapping. The ReadBlock transaction allows a device to read a block of data from memory or another 
    cache. WriteBlock allows new data to be introduced into the memory system (for example, disk reads). FlushBlock allows caches to
    write back dirty data to memory. KillBlock allows a block to be removed from all but one of the caches. WriteSingle is a short 
    transaction used by caches to update multiple copies of shared data without affecting main memory. IOReadSingle and 
    IOWriteSingle initiate and check IO operations, while IOReadBlock and IOWriteBlock allow block transfer of data between IO 
    devices, completely bypassing the consistency mechanism. The Map and DeMap transactions permit the implementation of high speed
    address mapping in a multiple address space environment. Finally, the Interrupt transaction provides the mechanism for 
    signalling interrupts to processors.  The encoding space leaves room for defining five other transactions.
    The DynaBus has a maximum data transport efficiency of 8/11, or 73%. In other words, at least 3/11 of the overall bandwidth of the
    bus is consumed by protocol overhead such as deviceID, address, and transaction type. This number derives from the fact that in 
    all of the block transfer transactions 8 cycles of data are transported for a total of 11 cycles. For example, the request 
    packet for a ReadBlock transaction is 2 cycles while the reply is 9 cycles, of which 8 are data. In typical applications, most 
    of the transactions on the bus are block transfer transactions so that the 73% efficiency is, in fact, close to what one would 
    actually obtain.
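    The 8/11 figure can be checked with a short calculation.  The sketch below is illustrative only; the function names are
    hypothetical, and the 25 ns cycle time is the typical value given in the definition of "cycle" in Section 2, not a guaranteed
    parameter of every implementation.

```python
from fractions import Fraction

SHORT_PACKET_CYCLES = 2    # e.g. a ReadBlock request
LONG_PACKET_CYCLES = 9     # e.g. a ReadBlock reply: header + 8 data cycles
DATA_CYCLES_PER_BLOCK = 8  # 512-bit block / 64 bits per cycle
BYTES_PER_CYCLE = 8        # one doubleWord per cycle


def block_transfer_efficiency() -> Fraction:
    """Fraction of cycles in a block transfer transaction that carry data."""
    total_cycles = SHORT_PACKET_CYCLES + LONG_PACKET_CYCLES
    return Fraction(DATA_CYCLES_PER_BLOCK, total_cycles)


def peak_data_bandwidth_mbytes(cycle_ns: float = 25.0) -> float:
    """Data bandwidth in MB/s (10**6 bytes) for back-to-back block transfers."""
    raw = BYTES_PER_CYCLE / (cycle_ns * 1e-9) / 1e6
    return raw * float(block_transfer_efficiency())
```

    With the assumed 25 ns cycle, the raw rate is 320 MB/s and the data rate for a bus saturated with block transfers is 8/11 of
    that.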

2.  Definition of Terms
    This section defines commonly used terms within this document. Definitions appear in bold and uses appear in italics.
arbiter
    an entity that allows multiple devices contending for the same DynaBus to use the bus in a time multiplexed fashion.
alignment
    an n-bit quantity is aligned within a container if the quantity is located starting at a position that is a multiple of n. This 
    assumes big-endian numbering (see below).
BIC
    bus interface chip.  A chip containing two pipeline registers, one input and one output, used to connect two DynaBus segments.
big-endian numbering
    a numbering system for data where the most significant unit (bit, byte, halfWord, word, doubleWord, or block) within a container is 
    placed leftmost and numbered 0. The DynaBus uses big-endian numbering (Figure 3). 
Figure 3:  Big-endian numbering as it is used on the DynaBus.
     
block
    512 bits of data. Within the real and IO address spaces block data is always aligned.
bus
    a collection of one or more bus segments connected by pipeline registers.
bus segment
    the portion of a bus that is traversed in one clock period.
byte
    8 bits of data. Within the real and IO address spaces byte data is always aligned.
cycle
    one complete period of the DynaBus clock. It is the unit of time and information transfer on the DynaBus. Generally, a cycle is 25 
    ns, and carries one doubleWord (64 bits) of data.
DBus
    a serial bus used for system initialization, testing, and debugging.
device
    an entity that can arbitrate for the bus and place packets on it.
deviceID
    a 10-bit unique identifier for DynaBus devices.  This number is loaded into a device over the DBus during system initialization. 
doubleWord
    64 bits of data. Within the real and IO address spaces doubleWord data is always aligned. The DynaBus transfers one doubleWord
    every cycle.
halfWord 
    16 bits of data. Within the real and IO address spaces halfWord data is always aligned.  
header 
    the first cycle of a packet.  This cycle contains address and control information.  
hold 
    a state in which the arbiter grants requests for reply packets but does not grant requests for request packets.  
IO address 
    a 37-bit quantity used to address IO devices. An IO address consists of the address of an aligned doubleWord in IO address space 
    concatenated with a 4-bit single specifier that identifies a single (aligned byte, halfWord, word, or doubleWord) within the 
    doubleWord. An IO address may also be used to specify a block, in which case the single specifier identifies a single within 
    the target block. When the block is transported over the DynaBus, this single is sent first, with the remaining doubleWords
    sent in cyclic order.
Figure 4: Format of an IO Address
IO address space 
    the set of all IO addresses.  
IOBridge 
    a chip that allows the DynaBus to be connected to an industry standard bus.  
MapCache
    a device that provides virtual to real address translation on the DynaBus.
master
    a device that has been granted the DynaBus.
module
    a unit of packaging intermediate between a chip package and a board.
packet
    a contiguous sequence of bus cycles.  The DynaBus supports packets of length 2 and 9.
packet switched
    a dissociation between the request and reply packets of a transaction to allow the bus to be used for other transactions between 
    request and reply.  Same as split transaction.

packet type
    a 5-bit field in the header cycle indicating one of 32 possible kinds of packet.
real address
    a 37-bit quantity used to address real memory. An addressed location may reside both in main memory and in caches. A real address 
    consists of the address of an aligned doubleWord in real address space concatenated with a 4-bit single specifier that 
    identifies a single (aligned byte, halfWord, word, or doubleWord) within the doubleWord. A real address may also be used to 
    specify a block, in which case the single specifier identifies a single within the target block. When the block is transported 
    over the DynaBus, this single is sent first, with the remaining doubleWords being sent in cyclic order.
Figure 5: Format of a Real Address

real address space
    the set of all real addresses.
requester
    the device that sends the request packet of a transaction.
responder
    the device that sends the reply packet of a transaction.
single
    a byte, halfWord, word or doubleWord of data. Within the real and IO address spaces a single is always aligned. A single is 
    transported on the DynaBus in the same relative position within a 64-bit cycle as the position the single occupies within its
    containing doubleWord in the real or IO address spaces. Non significant bits of the cycle containing the single are undefined.
slave
    a device that is listening to an incoming packet on the DynaBus.
snoopy cache
    a two port cache that watches transactions on the DynaBus port to maintain a consistent view of data as seen from the processor 
    port.
split transaction
    a dissociation between the request and reply packets of a transaction to allow the bus to be used for other transactions between 
    request and reply.  Same as packet switched. 
transaction
    a pair of packets, the first a request and the second a reply, that together performs some logical function.
virtual address
    a 32-bit quantity used by a processor to address memory.
virtual address space
    the set of all virtual addresses.
word
    32 bits of data. Within the real and IO address spaces word data is always aligned.
write-back cache
    a cache that updates cached data upon a processor write without immediately updating main memory.  
write-through cache
    a cache that does a write on the bus side for each write it receives from its processor side.
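    The definitions of real address, IO address, and block can be tied together with a small sketch.  Two points are assumptions
    made for illustration and are not confirmed by the text: that the 4-bit single specifier occupies the low-order bits of the
    37-bit address, and that "cyclic order" means ascending doubleWord addresses with wraparound inside the aligned block.  The
    encoding of the single specifier itself is not modelled.

```python
from typing import List, Tuple

ADDRESS_BITS = 37
SPECIFIER_BITS = 4          # 4-bit single specifier
DOUBLEWORDS_PER_BLOCK = 8   # 512-bit block / 64-bit doubleWord


def split_address(addr: int) -> Tuple[int, int]:
    """Split a 37-bit address into (doubleWord address, single specifier).

    Assumes the specifier is held in the low-order 4 bits.
    """
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 37 bits")
    return addr >> SPECIFIER_BITS, addr & ((1 << SPECIFIER_BITS) - 1)


def block_transfer_order(doubleword_addr: int) -> List[int]:
    """DoubleWord addresses in the order they would cross the bus.

    The addressed doubleWord goes first; the rest of its aligned block
    follows in (assumed) ascending order with wraparound.
    """
    first = doubleword_addr % DOUBLEWORDS_PER_BLOCK
    base = doubleword_addr - first   # blocks are always aligned
    return [base + (first + k) % DOUBLEWORDS_PER_BLOCK
            for k in range(DOUBLEWORDS_PER_BLOCK)]
```

    For example, a request that names doubleWord 13 would, under these assumptions, see doubleWords 13, 14, 15, 8, 9, 10, 11, 12
    in that order.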

3.  Interconnection Schemes
    A unique aspect of DynaBus is that it can be used as an interconnection component in machines spanning a wide range of cost and 
    performance.  At the low end are low cost single board systems of up to a few hundred MIPS, while at the high end are more 
    expensive multi-board systems capable of approaching 1 GIPS and sustaining high IO throughput.  However, in all these systems, 
    the logical and much of the electrical specification of the bus stays the same.  This allows the same chip set to be employed 
    across an entire family of machines and results in economies of scale not permitted by other buses.
3.1 Low to Medium Performance Systems
    Low performance systems typically cannot afford high pin count packages because of increased package cost and the need for more 
    expensive high density interconnection on board.  With the bidirectional option, the DynaBus requires just 80 pins per package, 
    providing an attractive solution for low end systems.
    With the DynaBus confined to a single board, it is possible to build a high performance, compact 64-bit  bus consisting of just one 
    segment (Figure 6).  Each DynaBus chip has an input and an output register connected to the bidirectional data port.  These 
    registers make a shorter cycle time possible, eliminating any computation (decoding, gating) during the transmission of data 
    between chips.
    Figure 6:  A Single-Board System contains only one bus segment.  A special pin allows the input and output pins of a DynaBus chip 
    to be connected resulting in a bidirectional interface with only 80 wires.
    Low cost midrange systems can also be built using a non-pipelined bidirectional DynaBus that spans multiple boards.  Each board 
    would have bidirectional buffers at its interface, much like VME or FUTUREBUS (see Figure 7 left).  Such an implementation of 
    DynaBus would not cycle as fast as a single board version or a pipelined version, but it would nonetheless provide an
    attractive low cost multi-board alternative. 

3.2 High Performance Pipelined Systems
    One of the most interesting features of DynaBus is that it allows pipelining: a single DynaBus can be broken up into multiple bus 
    segments separated by pipeline registers.  These registers are placed at the input and output of each chip, module and board 
    connecting to a DynaBus.  During one clock cycle a signal starts out in one pipeline register, traverses one bus segment, and 
    ends up in another pipeline register.  The principal advantage of such pipelining is that the signal transit times on carefully 
    designed short bus segments are a fraction of those on a single long segment whose length is the sum of the shorter segments.  
    Small signal transit times in turn mean that the bus can be operated at a higher frequency and therefore deliver more bandwidth.
    Figure 7: In a nonpipelined system the segment transit time (the clock period lower bound) is T = T1 + T2 + T3.  In a pipelined 
    version the segment transit time is MAX[T1, T2, T3], or about T/3 if comparable transit times for the backpanel and the board 
    are assumed.  Thus the bandwidth of the pipelined version is up to three times higher. (This is an upper bound, as the 
    additional setup and hold times will decrease the speed.)
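    The relationship in the Figure 7 caption can be written out as a small calculation.  The function name and the sample transit
    times used with it are hypothetical; as the caption notes, the resulting speedup is an upper bound because register setup and
    hold times are ignored.

```python
from typing import Sequence, Tuple


def clock_period_bounds(transit_times_ns: Sequence[float]) -> Tuple[float, float, float]:
    """Return (nonpipelined bound, pipelined bound, ideal speedup).

    Nonpipelined: one long segment, so the transit times add up.
    Pipelined: each segment is traversed in its own cycle, so the
    slowest segment sets the clock period.
    """
    nonpipelined = sum(transit_times_ns)
    pipelined = max(transit_times_ns)
    return nonpipelined, pipelined, nonpipelined / pipelined
```

    With three equal segments the ideal speedup is 3; with unequal segments it is smaller, since the slowest segment dominates.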


Figure 8 illustrates a low cost multi-board system in which all three segments of the DynaBus are bidirectional. In this
configuration, the on-board DynaBuses require 66 fewer wires than the unidirectional configuration, resulting in lower cost
packaging. The price to pay for this decreased cost is lower performance. In all of the other examples in this section, the DynaBus
can be utilized fully, while in this configuration it cannot because incoming packets would collide with outgoing ones on the
on-board bidirectional buses. Two cycles are lost for each packet transferred, so that the transport efficiency here is around 73%
of the transport efficiency of a fully utilizable configuration. For a detailed explanation of these lost cycles, or "bubbles", see
Section 5.3.
Figure 8:  A Low-Cost Multi-Board System with 3 DynaBus segments.
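The text does not show how the 73% figure for this configuration is obtained.  One reading that happens to reproduce it, offered
here purely as an assumption, is to take a block transfer transaction (a 2-cycle request and a 9-cycle reply) and charge two lost
cycles to each of its two packets.  The function name below is hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def relative_efficiency(packet_lengths: Sequence[int] = (2, 9),
                        bubble_cycles: int = 2) -> Fraction:
    """Useful cycles divided by occupied cycles when every packet
    is followed by `bubble_cycles` lost cycles (assumed model)."""
    useful = sum(packet_lengths)
    occupied = useful + bubble_cycles * len(packet_lengths)
    return Fraction(useful, occupied)
```

Under this reading the ratio is 11/15, or about 73%, relative to a configuration with no bubbles.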

Figure 9 illustrates a multi-board system with three DynaBus segments.  The Backpanel is the only bidirectional segment.  The 
boards have two unidirectional input and output buses. 

Figure 9:  A Multi-Board System with 3 DynaBus segments.

Finally, Figure 10 illustrates a multi-module multi-board system where the DynaBus has 5 pipelined segments. 
Figure 10:  A Multi-Board Multi-Module System with 5 DynaBus segments.

    In all these configurations care must be taken in the physical layout of bus segments to minimize reflections in order to increase 
    the clock rate.  Additionally, great care must be taken in distributing the clock to reduce skew.  The DynaBus uses balanced 
    transmission lines for bus segments and a special clock distribution scheme that minimizes clock skew.

4.  Chip Level Signals
    The signals comprising a DynaBus interface for a chip are divided into five groups: Control, Arbitration, Consistency, Data, and 
    optionally DataIn.  Control contains input and output versions of the Clock, and a BidEn pin that is used to either tie the 
    Data and DataIn groups together or allow them to be used separately.  The Arbitration group provides the signals used by the 
    chip to request the bus and also the signals used by the arbiter to grant the bus.  The consistency group contains input and 
    output versions of Shared and Owner.  Data provides a bidirectional (or optionally a unidirectional output) path for 64 bits of 
    data, header information, and parity.  Finally, the optional group DataIn provides a unidirectional input path for signals in 
    the Data group when that group is being used in unidirectional output mode.
    Figure 11:  The DynaBus Signals.
4.1 Control Signals
Clock
    This input signal provides the global timing signal for the DynaBus.    
ClockOut
    This output signal provides an internal, loaded version of the Clock that is used to deskew Clock. 
BidEn
    This signal is used to place the Data signals in the optional unidirectional mode. When BidEn is asserted, the Data signals function
    in a unidirectional output mode, and the DataIn signals are used in a unidirectional input mode.  When BidEn is deasserted, the Data signals
    are used in a bidirectional mode, and the DataIn signals are not used.  This feature can be used to reduce the number of 
    DynaBus pins either for building low end systems or to simplify chip testing.
4.2 Arbitration Signals
LongGrant
    LongGrant is defined one cycle before the first cycle of a grant, and at other times its value is undefined.  It is asserted if the 
    arbiter is responding to a request for a long packet (9-cycle) and deasserted if it is responding to a short packet (2-cycle).    
Grant
    Grant is asserted by the arbiter once for each bus cycle that has been granted to a requesting device.  The duration of Grant is 2 
    or 9 cycles, depending on the length of the packet.
RequestOut[0..2]
    The RequestOut wires are used by a device to signal its Arbiter that it wants the bus. A device uses the RequestOut wires for 
    either one cycle or two consecutive cycles. The first cycle always communicates the priority of a request. For some requests, 
    the device uses the second cycle to communicate a length (the number of cycles for which it wants the bus) and a color that is 
    used by the arbiter to provide fair service. The encoding for the two cycles is as follows:
    First Cycle
        7: Stop Arbitration
        6: Reply High
        5: Reply Low
        4: Hold
        3: Request High
        2: Request Normal
        1: Request Low
        0: NoOp
    Second Cycle: xCL
        C: Color
        L: Packet Length (0=>2 cycles, 1=>9 cycles)
    For priorities corresponding to Stop, Hold, and NoOp, a request consists of one cycle, while for the remainder a request consists 
    of two cycles.
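    The request encoding above can be sketched as follows.  The priority values are taken from the table; the bit positions in
    the second cycle are an assumption for illustration (the pattern xCL is read with x as the most significant bit, driven as 0,
    and L as the least significant bit), as are all of the Python names.

```python
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    NOOP = 0
    REQUEST_LOW = 1
    REQUEST_NORMAL = 2
    REQUEST_HIGH = 3
    HOLD = 4
    REPLY_LOW = 5
    REPLY_HIGH = 6
    STOP = 7


# Stop, Hold, and NoOp requests occupy a single cycle.
ONE_CYCLE = {Priority.NOOP, Priority.HOLD, Priority.STOP}


def encode_request(priority: Priority, long_packet: bool = False,
                   color: int = 0) -> List[int]:
    """Return the 3-bit RequestOut values for each cycle of a request."""
    first = int(priority)
    if priority in ONE_CYCLE:
        return [first]
    # Second cycle "xCL": x assumed 0, C = color, L = 1 for a 9-cycle packet.
    second = ((color & 1) << 1) | (1 if long_packet else 0)
    return [first, second]
```

    For instance, a Reply High request for a 9-cycle packet with color 1 would be sent as the values 6 and then 3 under these
    assumptions.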
4.3 Consistency Signals
OwnerOut
    OwnerOut is asserted by a cache when it is the owner of the address specified in a ReadBlockRequest. The OwnerOut signal is needed 
    because the memory system uses write-back caches. When the main memory copy of a block is stale, OwnerOut signals the memory to 
    not respond to a ReadBlockRequest because the owning cache will respond instead.
SharedOut
    SharedOut is asserted by a cache to indicate that it holds a cached copy of the data whose address appears on the DynaBus.  When a 
    cache initiates a WriteSingle, ReadBlock or KillBlock, all caches that contain the datum except the one that initiated the 
    transaction assert SharedOut.  
OwnerIn
    OwnerIn is the logical OR of the OwnerOut wires of all caches.  It is used by the Memory Controller to determine if memory should 
    respond to a ReadBlockRequest.   If the value of the Memory Controller's OwnerIn wire is TRUE, memory does not respond because 
    one of the caches owns the datum and will issue the reply. 
SharedIn
    The SharedIn wire is used to compute the value of the Shared flag for a cache that initiates a WriteSingle, ReadBlock, or 
    KillBlock.  This wire is the logical OR of the SharedOut wires of all the caches.  When a cache initiates one of the above
    transactions, all caches that contain the datum except the one that initiated the transaction assert SharedOut.  The Memory 
    Controller receives the logical OR of all the caches' SharedOut wires as SharedIn and reflects this value in its reply to the 
    transaction.  If none of the caches asserted SharedOut, the Memory Controller's reply indicates that the datum is no longer 
    shared. The cache that initiated the transaction then sets its Shared flag to false. 
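    The way OwnerIn and SharedIn are formed can be modelled in software as a logical OR over the per-cache outputs.  The sketch
    below is a simplified illustration: the Cache class, its fields, and the snoop function are hypothetical, bus timing is
    ignored, and the initiating cache is simply left out of the OR, as the text describes for SharedOut.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence, Tuple


@dataclass
class Cache:
    name: str
    # address -> True if this cache owns the block, False if it only holds a copy
    lines: Dict[int, bool] = field(default_factory=dict)

    def shared_out(self, address: int) -> bool:
        return address in self.lines

    def owner_out(self, address: int) -> bool:
        return self.lines.get(address, False)


def snoop(caches: Sequence[Cache], initiator: Cache,
          address: int) -> Tuple[bool, bool, bool]:
    """Return (SharedIn, OwnerIn, memory_replies) for one request."""
    others = [c for c in caches if c is not initiator]
    shared_in = any(c.shared_out(address) for c in others)
    owner_in = any(c.owner_out(address) for c in others)
    # Memory stays silent when some cache owns the block.
    return shared_in, owner_in, not owner_in
```

    When no other cache holds the block, SharedIn is false and the initiating cache can clear its Shared flag; when some cache
    asserts OwnerOut, that cache rather than memory supplies the reply.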

4.4 AParity Signals
AParityOut
    This wire carries single bit parity computed over the signals RequestOut, OwnerOut, and SharedOut. Parity is generated by a sending 
    device, and checked by the arbiter.
AParityIn
    This wire carries single bit parity computed over the signals LongGrant, Grant, OwnerIn, and SharedIn. It is generated by the 
    arbiter and checked by a receiving device.
4.5 Data/DataOut Signals
Data[0..63]
    These 64 signals carry the bulk of the information being transmitted from one chip to another.  During header cycles they carry a 
    packet type, some control bits, a deviceID, and an address, and during other cycles they carry data.  These signals are driven 
    only after receiving Grant from the Arbiter, otherwise they remain in a high impedance state. 
HeaderCycle/HeaderCycleOut
    This signal indicates the beginning of a packet.  It is asserted during the first cycle of a packet, which is the header. It is 
    generated by the device sending the packet, and is driven only during cycles in which the device has Grant from the Arbiter. 
    During other cycles it remains in a high impedance state.
DParity/DParityOut
    This signal carries parity computed over the HeaderCycle/HeaderCycleOut and Data lines.
4.6 DataIn Signals
DataIn[0..63]
    These 64 wires carry a possibly delayed version of the information on the DataOut wires.
HeaderCycleIn
    This wire carries a possibly delayed version of the information on the HeaderCycleOut wire. HeaderCycleIn is asserted if and only 
    if the header cycle of a packet is being received.
DParityIn
    This wire carries the parity computed by the source of the data.  It is used to check if transmission of Data and HeaderCycle 
    encountered an error.
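    A single parity bit over Data and HeaderCycle can be computed as in the sketch below.  The text does not state whether the
    DynaBus uses even or odd parity, so the sense is left as a parameter and the even-parity default is an assumption; the function
    names are hypothetical.

```python
def dparity(data: int, header_cycle: bool, even: bool = True) -> int:
    """Parity bit computed over the 64 Data bits and HeaderCycle."""
    if not 0 <= data < (1 << 64):
        raise ValueError("Data must fit in 64 bits")
    ones = bin(data).count("1") + int(header_cycle)
    bit = ones & 1            # makes the total number of ones even
    return bit if even else bit ^ 1


def check(data: int, header_cycle: bool, parity_bit: int,
          even: bool = True) -> bool:
    """Receiver-side check: recompute parity and compare with DParityIn."""
    return dparity(data, header_cycle, even) == parity_bit
```

    As with any single-bit parity scheme, this detects an odd number of bit errors in a cycle but not an even number.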

5.  Arbitration and Flow Control
    Each DynaBus has an arbiter that permits the bus to be time multiplexed amongst contending devices. Whenever a device has a packet 
    to send, it makes a request to the arbiter using dedicated request lines, and the arbiter grants the bus using dedicated grant 
    lines.  Different devices may have different priority levels, and the arbiter guarantees fair (bounded-time) service to each 
    device within its priority level. Bus allocation is non-preemptive, however, in that the transmission of a single packet is 
    noninterruptible. When making an arbitration request, a device indicates both the priority and the length of the packet it
    wants to send.
    Two aspects of DynaBus arbitration ensure good performance. The first is that arbitration is overlapped with transmission, so that 
    no bus cycles are wasted during arbitration and it is possible to fill up the bus completely with packets. The second is that a 
    device may make multiple requests before the first request has been granted; this allows a single device to use the bus to its 
    maximum potential.
Figure 12:  Arbitration is overlapped with packet transmission so that it is possible to fill up the bus completely with packets.

    The arbiter is also used to implement flow control, which is a mechanism to avoid packet congestion. To understand why congestion 
    can occur, recall that the DynaBus is packet switched: a device may get new requests while it is servicing older ones, so that
    requests can pile up faster than a device is able to service them.
5.1 Arbitration
    Each device interacts with the arbiter via a dedicated port consisting of three request wires RequestOut[0..2] and one Grant wire. 
    One other wire, LongGrant, is shared by all devices connected to the arbiter. A device communicates requests by using the 
    RequestOut wires for either one cycle or two consecutive cycles. In the first cycle it always communicates the priority of its 
    request. For some of the requests the device uses a second cycle in which it indicates a length (number of cycles for which it 
    wants the bus) and a color that is used by the arbiter to provide fair service. The encoding for the two cycles is as follows:
    First Cycle: P2P1P0
        7: Stop Arbitration
        6: Reply High
        5: Reply Low
        4: Hold
        3: Request High
        2: Request Normal
        1: Request Low
        0: NoOp
    Second Cycle: xCL
        C: Color
        L: Packet Length (0=>2 cycles, 1=>9 cycles)
    The five priorities: Request Low, Request Normal, Request High, Reply Low, and Reply High correspond to "normal" requests for the 
    bus: they are used when the device actually intends to send a packet on the bus upon receiving a grant from the arbiter. Each 
    normal request consists of two cycles, with the first cycle indicating priority and the second the length and color. A device 
    may issue multiple requests back to back, but the number of non-granted requests may not exceed the implementation limit 
    imposed by the arbiter. A separate request is registered for each pair of cycles constituting a request. Higher priority 
    requests are served before lower priority ones, and requests within a priority level are serviced in approximately round-robin 
    order. These five priority levels are used as follows in a typical DynaBus system: cache replies would use Reply High; memory 
    replies would use Reply Low; requests from caches would use Request Normal. Other devices sending request packets would use one 
    of the request priorities depending on the urgency of the request. For instance, a block transfer IO device doing output, such 
    as a display controller, could normally use Request Low for ReadBlockRequests that pull data out of the memory system, but 
    switch to Request High when its internal FIFO gets close to empty.
    The remaining priorities, NoOp, Hold, and Stop are different in that a device uses them to request special service from the 
    arbiter. Each such request consists of one cycle that specifies the priority. A device uses NoOp when it does not want to 
    request any service at all. It uses Hold when it wants to prevent the arbiter from granting any requests for request packets 
    (priorities below Hold). The arbiter stays in the Hold state for only as many cycles as the device asserts the Hold code. 
    Finally, Stop is used when a device wants to stop all arbitration: the arbiter simply stops granting the bus for as many cycles 
    as the device asserts the Stop code. However, while in Stop mode, the arbiter continues to accumulate requests from devices.
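    As an illustration, the request encoding described above can be sketched in Python. This is a minimal sketch rather than part 
    of the specification: the names Priority and request_cycles are hypothetical, and the sketch assumes that P2 (and the unused 
    bit x of the second cycle) is the most significant of the three RequestOut bits and that x is driven as 0.

```python
from enum import IntEnum


class Priority(IntEnum):
    """First-cycle codes P2P1P0, as listed in the encoding above."""
    NO_OP = 0
    REQUEST_LOW = 1
    REQUEST_NORMAL = 2
    REQUEST_HIGH = 3
    HOLD = 4
    REPLY_LOW = 5
    REPLY_HIGH = 6
    STOP = 7


# The five "normal" priorities that are followed by a second cycle.
NORMAL = {Priority.REQUEST_LOW, Priority.REQUEST_NORMAL, Priority.REQUEST_HIGH,
          Priority.REPLY_LOW, Priority.REPLY_HIGH}


def request_cycles(priority, packet_cycles=2, color=0):
    """Return the 3-bit RequestOut values a device drives, one per cycle.

    Assumption: the bits are ordered x, C, L from most to least significant
    in the second cycle, and the don't-care bit x is driven as 0.
    """
    priority = Priority(priority)
    if priority not in NORMAL:
        return [int(priority)]            # NoOp, Hold, Stop take one cycle
    if packet_cycles not in (2, 9):
        raise ValueError("packets are 2 or 9 cycles long")
    length_bit = 0 if packet_cycles == 2 else 1
    second = ((color & 1) << 1) | length_bit
    return [int(priority), second]
```

    For example, a memory controller asking to send a 9-cycle reply with color 1 would drive the values 5 and then 3 on its 
    RequestOut wires under these assumptions.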
    Grant is used by the arbiter to signal that a device has been granted the bus. Grant is asserted for as many cycles as the packet is long.  If 
    Grant is asserted in cycle i then the device can drive its outgoing bus segment in cycle i+1. LongGrant describes a grant that 
    is about to take place.  In the cycle before Grant is asserted, LongGrant tells the device whether or not the next grant will 
    correspond to a 9 cycle packet. Figures 13 and 14 show the timing of important signals at the pins of a requester during the 
    arbitration and transmission of a 2 cycle and a 9 cycle packet, respectively. It is helpful to refer to the schematic of Figure 
    15 when reading the timing diagrams.
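    The relationship between LongGrant, Grant, and the cycles in which the device drives the bus can be written down as a short 
    Python sketch. The function name grant_timing and the dictionary keys are hypothetical; the sketch only restates the rules 
    given above and does not model arbitration latency.

```python
def grant_timing(first_grant_cycle, packet_cycles):
    """Cycle numbers of the signals seen by a requester for one packet."""
    if packet_cycles not in (2, 9):
        raise ValueError("packets are 2 or 9 cycles long")
    grant = list(range(first_grant_cycle, first_grant_cycle + packet_cycles))
    return {
        # LongGrant is valid in the cycle just before Grant is asserted ...
        "long_grant_cycle": first_grant_cycle - 1,
        # ... and tells whether the coming grant is for a 9-cycle packet.
        "long_grant": packet_cycles == 9,
        # Grant is asserted for as many cycles as the packet is long.
        "grant": grant,
        # Grant in cycle i lets the device drive its bus segment in cycle i+1.
        "drive": [cycle + 1 for cycle in grant],
    }
```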

    Figure 13: Timing diagram for a two cycle packet assuming an arbitration latency of 6 cycles.  All signals are at the pins of the 
    requesting device (see Figure 15).  Note that LongGrant is valid in the cycle just before Grant, and that Grant is asserted for 
    two cycles.
    
    
    Figure 14: Timing diagram for a 9-cycle packet assuming an arbitration latency of 6 cycles.  All signals are at the pins of the 
    requesting device (see Figure 15). Note that LongGrant is valid in the cycle just before Grant, and that Grant is asserted for 
    nine cycles.
    Figure 15:  Schematic of the standard interface used by devices to connect to the DynaBus.
5.2 Flow Control
    The arbiter provides two mechanisms for flow control, the first being arbitration priorities. Devices making arbitration requests 
    to send reply packets always use priorities higher than devices making arbitration requests to send request packets. This 
    mechanism alone would eliminate the congestion problem if devices were always ready to reply before the onset of congestion, 
    but it may not be possible for all devices to satisfy this requirement: a device must either be able to service packets at the 
    maximum arrival rate, or it must have an input queue that is long enough so that it does not overflow even during the longest 
    service time for a packet. For certain slow devices like the memory controller, servicing packets at arrival rate clearly is 
    impossible, and the queue lengths required to ensure no overflow are prohibitive.
    The arbiter therefore provides a second mechanism suitable for slow devices. This mechanism involves the use of the special request 
    priority called Hold described earlier. As long as the arbiter receives Hold from a device, it refuses to grant arbitration 
    requests for sending request packets, but continues to grant requests for sending reply packets. This has the effect of choking 
    off new request packets as long as Hold is being asserted by some device, and allows the device asserting Hold to clear the 
    congestion. Because the effect of Hold can never be instantaneous, especially in pipelined configurations, devices still need 
    to provide headroom within their input queues to tolerate a few request packets while Hold takes effect. Devices must not use 
    Hold with abandon, however, because this would decrease bus throughput.
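    The effect of Hold and Stop on granting can be illustrated with a toy arbiter in Python. This is a sketch under stated 
    assumptions, not the arbiter's actual design: the class ToyArbiter is hypothetical, a simple FIFO per priority level stands 
    in for the approximately round-robin service within a level, and the color bit and the limit on outstanding requests are 
    ignored.

```python
from collections import deque

REQUEST_LOW, REQUEST_NORMAL, REQUEST_HIGH = 1, 2, 3
HOLD, REPLY_LOW, REPLY_HIGH, STOP = 4, 5, 6, 7


class ToyArbiter:
    def __init__(self):
        # One queue of waiting device numbers per normal priority level.
        self.queues = {p: deque() for p in (REQUEST_LOW, REQUEST_NORMAL,
                                            REQUEST_HIGH, REPLY_LOW, REPLY_HIGH)}

    def request(self, device, priority):
        self.queues[priority].append(device)

    def grant(self, hold=False, stop=False):
        """Pick the next (device, priority) to grant, or None."""
        if stop:
            return None              # Stop: nothing is granted, requests are kept
        for priority in (REPLY_HIGH, REPLY_LOW,
                         REQUEST_HIGH, REQUEST_NORMAL, REQUEST_LOW):
            if hold and priority < HOLD:
                continue             # Hold: request packets are not granted
            if self.queues[priority]:
                return self.queues[priority].popleft(), priority
        return None
```

    While Hold is asserted, only the reply queues are drained, which is what lets a congested device work off its input queue.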
5.3  Arbitration in Pipelined Configurations
    In pipelined DynaBus configurations, the bus segments form a tree rooted at the bidirectional backpanel segment (Figure 16). All 
    ICs are connected on the leaf segment, labeled A. The arbiter controls access to this segment independently of any board or 
    module structure. Note that the existence of a bidirectional on-board bus means that the backpanel bus segment cannot be fully 
    utilized (Figure 17). Assume that device 1 wants to send a packet and gets a grant in cycle 0. In cycles (1, 2), the packet is 
    sent on segment A1, in cycles (2, 3) it traverses segment B, and in cycles (3, 4) it moves to segment A2. If device 2 wants to 
    send a packet, the earliest it can transmit information is in cycles (5, 6) because A2 is occupied in cycles (3, 4). This 
    results in a bubble of two cycles on the backpanel bus B. Note also that in this configuration devices receive packets at 
    different times. Devices on board 2 receive a packet sent from a device on board 1 two cycles later. Since the Owner and Shared 
    signals from all devices need to be ORed together, this means that Owner and Shared for a board transmitting a packet must be 
    delayed by two cycles compared to boards that are receiving the packet.
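    The cycle counts in this example can be reproduced with a few lines of Python. The sketch below is only a restatement of the 
    example, with hypothetical function names; it assumes one cycle of delay per segment, as in the text.

```python
def occupancy(grant_cycle, packet_cycles=2):
    """Cycles in which one packet occupies each segment on its way across."""
    def span(start):
        return list(range(start, start + packet_cycles))

    first = grant_cycle + 1          # grant in cycle i, drive in cycle i+1
    return {
        "sender_leaf": span(first),        # A1 in the example
        "backpanel": span(first + 1),      # B
        "receiver_leaf": span(first + 2),  # A2
    }


def backpanel_bubble(packet_cycles=2):
    """Idle cycles on B between a packet from board 1 and one from board 2."""
    first = occupancy(0, packet_cycles)
    # Board 2's bidirectional leaf segment is busy receiving the first
    # packet, so device 2 can start driving it only in the following cycle.
    second_leaf_start = first["receiver_leaf"][-1] + 1
    second_backpanel_start = second_leaf_start + 1
    return second_backpanel_start - first["backpanel"][-1] - 1
```

    With the default 2-cycle packet, occupancy(0) gives cycles (1, 2), (2, 3), and (3, 4) for the three segments, and 
    backpanel_bubble() gives the two idle cycles mentioned above.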
Figure 16:  A low-cost DynaBus system with three segments. The arbiter makes grants on the leaf bus segment, labeled A, without 
knowledge of the number of pipelined stages in the System.

Figure 17:  Timing diagram showing the transmission of data for Grants 1 and 2 over the three segments of the DynaBus pictured in 
Figure 16.

    Figure 18 shows a high performance version of the above configuration. Here the backpanel bus is still bidirectional, but each 
    board has an input bus Ci, separate from the output bus Ai. In addition to allowing the DynaBus bandwidth to be utilized to its 
    full potential, separate buses facilitate the computation of the shared and owner bits because all devices receive a packet at 
    the same time. Shared and owner bits no longer need to be delayed differently as in the above configuration.

Figure 18:  A DynaBus System with three segments.   The arbiter makes grants on the leaf bus segment, labeled A, without knowledge 
of the number of pipelined stages in the System.
Figure 19:  Timing diagram showing the transmission of data for Grants 1 and 2 over the three segments of the DynaBus pictured in 
Figure 18.

6.  Transactions
    Transactions form the top layer of the DynaBus protocol, with the two lower layers being packets and cycles. Each transaction 
    consists of a pair of request-reply packets, which are independently arbitrated. A transaction begins when the requester asks 
    the arbiter to grant the bus to send its request packet (Figure 20). Upon receiving bus grant, the requester sends the packet 
    one cycle at a time, with the cycle containing packet control information going first.  This first cycle, called the packet 
    header, contains all the information needed to identify the packet and select the set of devices that need to service the 
    packet. Subsequent cycles contain data that is dependent on the type of transaction in progress. All DynaBus devices (including 
    the requester) receive the request packet, and each device examines the header to decide whether or not it needs to take 
    action. 
Figure 20:  A transaction on the DynaBus consists of a request and a reply.
    Exactly one of the receiving devices elects to generate a reply, typically after the action requested by the request packet is 
    complete. The mechanism by which a unique device is selected to respond is different for different transactions, but most 
    transactions use an address contained in the header cycle for this purpose. The responding device then requests the arbiter to 
    grant the bus for sending its reply packet. On receiving grant, this device sends the reply packet one cycle at a time, with 
    the header cycle going first. As before, the header cycle contains all the information needed to identify the packet, and in 
    particular to link it unambiguously to the corresponding request packet.  All DynaBus devices receive this reply packet as 
    well, and each device examines the header to see what action, if any, it needs to take. Typically, the initiating device 
    behaves somewhat differently than other devices. The transaction is complete when the initiating device receives the reply.
    Normally, this protocol ensures a one-to-one correspondence between request and reply packets; however, because of errors, some 
    request packets may not get a reply. Thus, devices must not depend on the number of request and reply packets being equal since 
    this invariant will not in general be maintained. The protocol does require devices to provide a simple, but crucial guarantee 
    that is central to the data consistency scheme: devices must service request packets in arrival order.  To understand why 
    arrival order must be maintained, see Section 7.1.
    The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, 
    input output, and address mapping. Twelve of the sixteen possible transactions are currently defined. They are: ReadBlock, 
    WriteBlock, FlushBlock, KillBlock, WriteSingle, IOReadSingle, IOWriteSingle, IOReadBlock, IOWriteBlock, Map, DeMap, and 
    Interrupt. The ReadBlock transaction allows a cache to read a block from memory or another cache. WriteBlock allows new data to 
    be introduced into the memory system (for example disk reads). FlushBlock allows caches to write back dirty data to memory. 
    KillBlock allows a block to be removed from all but one of the caches. WriteSingle is a short transaction used by caches to 
    update multiple copies of shared data without affecting main memory. IOReadSingle and IOWriteSingle initiate and check IO 
    operations, while IOReadBlock and IOWriteBlock allow block transfer of data between IO devices, completely bypassing the 
    consistency mechanism. The Map and DeMap transactions permit the implementation of high speed address mapping in a multiple 
    address space environment. Finally, the Interrupt transaction provides the mechanism for signalling interrupts to processors. 
    The encoding space leaves room for defining four other transactions.
6.1 Header Cycle Format
    The first cycle of a request packet, the header cycle, contains a Command, a Flavor bit, a Mode bit, a deviceID, and an Address (Figure 
    21). The Command identifies the transaction and indicates that the packet is a request rather than a reply packet. The Flavor 
    bit is used for a few of the transactions to indicate one of two possible semantics. The Mode bit is used in protection 
    checking by receiving devices. The deviceID identifies the initiator of the transaction, while the Address serves as a selector 
    for a memory location or IO device register.
    Figure 21:  The header cycle of a request packet is transmitted on the Data wires. It contains a Command, a Flavor bit, a Mode 
    bit, a deviceID, and an Address.
    Most of the information in the header cycle of the request packet is replicated in the header cycle of the reply packet (Figure 
    22).  In fact, only bits [4..6] may be different. Bit 4, which is part of the Command field, identifies the packet as a 
    reply. Bit 5 indicates whether the transaction encountered an error, while bit 6 tells whether the addressed location is 
    shared.
    Figure 22:  The first cycle of a DynaBus reply packet is transmitted on the Data wires. It contains a Command, a Fault bit, a 
    ReplyShared bit, a deviceID, and an Address. All bits other than bits 4, 5, and 6 are the same as those in the header for the 
    corresponding request packet.
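    The header layout can be made concrete with a small Python sketch that packs and unpacks a 64-bit header word and derives a 
    reply header from a request header. Several details are assumptions made for illustration: bit 0 is taken to be the most 
    significant bit of the word, the deviceID is placed in bits 7 through 16, and the Address in bits 17 through 63. The function 
    and field names are hypothetical.

```python
# (first bit, width) of each field, counting bit 0 as the leftmost bit.
FIELDS = {
    "command": (0, 5),
    "flavor_or_fault": (5, 1),
    "mode_or_shared": (6, 1),
    "device_id": (7, 10),       # assumed position
    "address": (17, 47),        # assumed position
}


def set_field(word, first_bit, width, value):
    """Store value in bits first_bit..first_bit+width-1 of a 64-bit word."""
    shift = 64 - first_bit - width
    mask = ((1 << width) - 1) << shift
    return (word & ~mask) | ((value << shift) & mask)


def get_field(word, first_bit, width):
    return (word >> (64 - first_bit - width)) & ((1 << width) - 1)


def pack_header(**values):
    word = 0
    for name, value in values.items():
        word = set_field(word, *FIELDS[name], value)
    return word


def unpack_header(word):
    return {name: get_field(word, *pos) for name, pos in FIELDS.items()}


def reply_header(request_word, fault=0, shared=0):
    """Copy a request header, changing only bits 4, 5, and 6."""
    word = set_field(request_word, 4, 1, 1)     # bit 4: packet is a reply
    word = set_field(word, 5, 1, fault)         # bit 5: Fault
    return set_field(word, 6, 1, shared)        # bit 6: ReplyShared
```

    A responder built this way cannot accidentally disturb the deviceID or Address, which is what lets the requester link the 
    reply to its request.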
6.1.1 The Command Field
    The Command field in a header cycle is 5 bits. Four bits encode up to 16 different transactions, while the fifth bit encodes 
    whether the packet is a request (0) or a reply (1). Twelve of the sixteen transactions are currently defined, as shown in Table 
    1 below.

Table 1: Encoding of the Command Field of a Packet Header
Transaction Name        Abbreviation    Encoding    Length (cycles)
ReadBlockRequest        RBRqst          0000 0      2
ReadBlockReply          RBRply          0000 1      9
WriteBlockRequest       WBRqst          0001 0      9
WriteBlockReply         WBRply          0001 1      2
FlushBlockRequest       FBRqst          0010 0      9
FlushBlockReply         FBRply          0010 1      2
KillBlockRequest        KBRqst          0011 0      2
KillBlockReply          KBRply          0011 1      2
WriteSingleRequest      WSRqst          0100 0      2
WriteSingleReply        WSRply          0100 1      2
Unused                                  0101 0
                                        0101 1
Unused                                  0110 0
                                        0110 1
Unused                                  0111 0
                                        0111 1
IOReadBlockRequest      IORBRqst        1000 0      2
IOReadBlockReply        IORBRply        1000 1      9
IOWriteBlockRequest     IOWBRqst        1001 0      9
IOWriteBlockReply       IOWBRply        1001 1      2
IOReadSingleRequest     IORRqst         1010 0      2
IOReadSingleReply       IORRply         1010 1      2
IOWriteSingleRequest    IOWRqst         1011 0      2
IOWriteSingleReply      IOWRply         1011 1      2
InterruptRequest        IntRqst         1100 0      2
InterruptReply          IntRply         1100 1      9
Unused                                  1101 0
                                        1101 1
MapRequest              MapRqst         1110 0      2
MapReply                MapRply         1110 1      2
DeMapRequest            DeMapRqst       1111 0      2
DeMapReply              DeMapRply       1111 1      2
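    Table 1 can be transcribed into a lookup table, as in the Python sketch below. The names TRANSACTIONS, command, and 
    packet_length are hypothetical. The sketch assumes that the four transaction bits are the high-order bits of the 5-bit 
    Command field and that the request/reply bit is the low-order bit, which matches the way the encodings are written in the 
    table.

```python
# name: (4-bit transaction code, request length, reply length), from Table 1
TRANSACTIONS = {
    "ReadBlock":     (0b0000, 2, 9),
    "WriteBlock":    (0b0001, 9, 2),
    "FlushBlock":    (0b0010, 9, 2),
    "KillBlock":     (0b0011, 2, 2),
    "WriteSingle":   (0b0100, 2, 2),
    "IOReadBlock":   (0b1000, 2, 9),
    "IOWriteBlock":  (0b1001, 9, 2),
    "IOReadSingle":  (0b1010, 2, 2),
    "IOWriteSingle": (0b1011, 2, 2),
    "Interrupt":     (0b1100, 2, 9),
    "Map":           (0b1110, 2, 2),
    "DeMap":         (0b1111, 2, 2),
}


def command(transaction, reply=False):
    """5-bit Command field: transaction code followed by the reply bit."""
    code, _, _ = TRANSACTIONS[transaction]
    return (code << 1) | int(reply)


def packet_length(transaction, reply=False):
    """Packet length in cycles (2 or 9)."""
    _, request_cycles, reply_cycles = TRANSACTIONS[transaction]
    return reply_cycles if reply else request_cycles
```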
6.1.2 The Flavor/Fault Bit
    This bit has different interpretations for request and reply packets. In a request packet it supplies an additional command bit 
    that is used to indicate which of two semantics to use for the WriteSingle transaction. If Flavor=0, then the Memory Controller 
    does not update memory, while if Flavor=1 then the Memory Controller does update memory. In either case, it generates the same 
    reply packet.
    In a reply packet the same bit is used to encode whether the device servicing the request packet encountered a fault or not. A 0 
    indicates no fault, and a 1 indicates a fault. When the fault bit is set in a reply packet, the 32 low order bits of the second 
    cycle supply a FaultCode, while bits 7 through 16 supply the deviceID of the device that detected the fault. This format is 
    shown in Figure 23.  
    Figure 23:  Format of the second cycle of an error reply packet.
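    A device that receives an error reply might pick the second cycle apart as in the following Python sketch. The function names 
    are hypothetical, and the sketch assumes that bit 0 is the most significant bit of the 64-bit cycle, so that the 32 low order 
    bits are bits 32 through 63.

```python
def encode_fault_cycle(device_id, fault_code):
    """Build the second cycle of an error reply."""
    if not 0 <= device_id < (1 << 10):
        raise ValueError("deviceID must fit in bits 7..16 (10 bits)")
    if not 0 <= fault_code < (1 << 32):
        raise ValueError("FaultCode must fit in 32 bits")
    # Bits 7..16 counted from the left end at bit 47 counted from the right.
    return (device_id << 47) | fault_code


def decode_fault_cycle(word):
    """Return (deviceID of the reporting device, FaultCode)."""
    device_id = (word >> 47) & 0x3FF
    fault_code = word & 0xFFFFFFFF
    return device_id, fault_code
```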

6.1.3 The Mode/ReplyShared Bit
    This bit has different interpretations for request and reply packets. In a request packet it supplies the privilege mode (kernel=0, 
    user=1) of the device that issued the request. When the requesting device is a cache, for example, this bit indicates whether 
    the processor is in kernel or user mode. The Mode bit is used by devices servicing a request packet to check if the requestor 
    has sufficient privileges to perform the requested operation.
    In the header of a reply packet the bit indicates whether the data whose address appears in the packet was shared at the time the 
    request packet corresponding to the reply was received. This bit has a meaning only for the transactions ReadBlock, WriteSingle 
    and KillBlock, and may be safely ignored by devices that do not participate in the consistency protocol.
    Caches use the value of ReplyShared within an RBRply to set the shared bit for the block being fetched. They use the value of 
    ReplyShared within a WSRply to learn whether the block is still shared, and clear the shared bit of the cached block if it is 
    not.
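    The two uses of ReplyShared by a cache can be summarized in a few lines of Python. This is a simplified sketch with a 
    hypothetical function name; it covers only the shared bit of a single cached block and leaves out the rest of the consistency 
    protocol.

```python
def update_shared_bit(shared, reply, reply_shared):
    """Return the new shared bit of a cached block after a reply arrives.

    reply is the abbreviation from Table 1 ("RBRply" or "WSRply").
    """
    if reply == "RBRply":
        return bool(reply_shared)      # block being fetched: adopt ReplyShared
    if reply == "WSRply" and not reply_shared:
        return False                   # block is no longer shared: clear the bit
    return shared                      # otherwise the bit is left unchanged
```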
6.1.4 The deviceID Field
    For request packets, the deviceID field of the header carries the unique identity of the device that sent the request packet. For 
    reply packets, the deviceID field of the header is the unique identity of the intended recipient (that is, the identity of the 
    device that sent the request packet). A deviceID is needed in reply packets because the address alone is not sufficient to 
    disambiguate replies.
    Devices that either can have only one outstanding reply at a time, or that can have multiple outstanding replies but can somehow 
    discriminate between them, need only one deviceID. Other devices must be allocated multiple deviceIDs to allow them to 
    disambiguate replies. These deviceIDs must be contiguous and must be a power of two in number. 
    The deviceID(s) for a device are loaded at system initialization time via the Debug bus (see [DBusSpec] for details).
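    The rule for allocating multiple deviceIDs to one device can be checked mechanically. The Python sketch below uses a 
    hypothetical function name and assumes a 10-bit deviceID, the width implied by the use of bits 7 through 16 in the fault 
    format.

```python
def valid_device_id_block(ids):
    """True if ids are contiguous and a power of two in number."""
    ids = sorted(set(ids))
    count = len(ids)
    if count == 0 or count & (count - 1):
        return False                       # not a power of two
    if ids[0] < 0 or ids[-1] >= (1 << 10):
        return False                       # outside the assumed 10-bit range
    return ids[-1] - ids[0] == count - 1   # no gaps
```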
6.1.5 The Address Field
    The address field within a header cycle is 47 bits, of which the top 10 bits are reserved for virtual address disambiguation, and 
    the bottom 37 bits represent either a real address or an IO address, depending on the transaction. The disambiguation bits will 
    be used by virtual caches in which it is not possible to use the real address alone to locate the data.
Figure 24: Format of the Address Field
    The 37 address bits in turn consist of a 33 bit doubleWord address, and a 4-bit single specifier. The doubleWord address identifies 
    an aligned doubleWord location in IO address space or real address space, while the single specifier identifies one of 15 
    aligned singles within the doubleWord location. Of the fifteen singles 8 are bytes, 4 are halfWords, 2 are words, and one is a 
    doubleWord. The fifteen cases are encoded in four bits as follows:
    Code    Datum
    bbb0    byte bbb
    ww01    halfWord ww
    d011    word d
    0111    entire doubleWord
    1111    unused
    The code assumes that data is numbered left to right, in big-endian order. Thus byte 0 is the leftmost byte in the doubleWord, 
    halfWord 0 is the leftmost halfWord in the doubleWord, and so on.
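    The four-bit single specifier can be decoded by testing its low order bits, as the following Python sketch shows. The function 
    name and the shape of the returned tuple are hypothetical. Sizes are in bytes and assume a 64-bit doubleWord, so that a 
    halfWord is 2 bytes and a word is 4 bytes.

```python
def decode_single(spec):
    """Decode a 4-bit single specifier.

    Returns (kind, index, size_in_bytes, byte_offset), where byte_offset is
    counted from the leftmost byte of the doubleWord.
    """
    if not 0 <= spec <= 0b1111:
        raise ValueError("specifier is 4 bits")
    if spec & 0b0001 == 0b0000:            # bbb0: byte bbb
        index = spec >> 1
        return ("byte", index, 1, index)
    if spec & 0b0011 == 0b0001:            # ww01: halfWord ww
        index = spec >> 2
        return ("halfWord", index, 2, 2 * index)
    if spec & 0b0111 == 0b0011:            # d011: word d
        index = spec >> 3
        return ("word", index, 4, 4 * index)
    if spec == 0b0111:                     # 0111: entire doubleWord
        return ("doubleWord", 0, 8, 0)
    raise ValueError("code 1111 is unused")
```

    Of the sixteen codes, fifteen decode successfully (8 bytes, 4 halfWords, 2 words, and the doubleWord), and 1111 is rejected.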
    For packets that transmit a block of data, doubleWords constituting the block are transmitted in cyclic order, with the doubleWord 
    containing the addressed single being transmitted first (Figure 25).

    Figure 25:  When memory replies with a block of data, the 8 doubleWords appear on the bus in cyclic order starting with the 
    doubleWord containing the addressed single.  Cyclic order decreases the latency of requested data for cache misses.
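    Cyclic order can be computed from the doubleWord address alone. The Python sketch below uses a hypothetical function name and 
    assumes that blocks are aligned on 8-doubleWord boundaries, so that the three low order bits of the doubleWord address select 
    the doubleWord within the block, and that cyclic order means ascending order with wrap-around.

```python
def cyclic_order(double_word_address):
    """DoubleWord addresses of a block in the order they appear on the bus."""
    start = double_word_address & 0b111        # position within the block
    base = double_word_address & ~0b111        # first doubleWord of the block
    return [base | ((start + i) % 8) for i in range(8)]
```

    For a request addressed to doubleWord 5 of a block, the order under these assumptions is 5, 6, 7, 0, 1, 2, 3, 4, so the 
    requested data arrives in the first data cycle.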

6.2. ReadBlock
    The ReadBlock transaction is used to read a block of data from the memory system.  If a cache is owner then that cache replies, 
    otherwise memory replies.
Request (2 cycles)
    A ReadBlockRequest packet requests a block to be read from the memory system.  The first cycle contains the packet type, the 
    sender's deviceID, and the address of a single in the block.  The second cycle contains the address of a single in the block 
    (the victim) that the requested block will replace within the requesting device. Bits 27-63 contain the victim address, while 
    bit 0 indicates whether this address is valid (1=> valid, 0=> invalid). The victim address is invalid for non-cache devices and 
    for a cache when there is no victim (for example just after initialization).
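    The second cycle of a ReadBlockRequest can be assembled as in the Python sketch below. The function names are hypothetical, 
    and the sketch assumes that bit 0 is the most significant bit of the 64-bit cycle, so that bits 27 through 63 are the 37 low 
    order bits. Bits 1 through 26 are left as 0 here, which is a choice of the sketch.

```python
VALID_BIT = 1 << 63            # bit 0, counting from the left
ADDRESS_MASK = (1 << 37) - 1   # bits 27..63


def victim_cycle(victim_address=None):
    """Second cycle of a ReadBlockRequest; None means there is no victim."""
    if victim_address is None:
        return 0                               # valid bit clear
    if not 0 <= victim_address <= ADDRESS_MASK:
        raise ValueError("victim address is 37 bits")
    return VALID_BIT | victim_address


def decode_victim_cycle(word):
    """Return the victim address, or None if the valid bit is clear."""
    if word & VALID_BIT:
        return word & ADDRESS_MASK
    return None
```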
Normal Reply (9 cycles)
    A ReadBlockReply packet returns the block data requested by an earlier ReadBlockRequest.  The first cycle reflects most of the 
    information in the request header, with the shared bit indicating whether the block is shared.  The remaining eight cycles 
    contain the eight doubleWords of block data in cyclic order, with the doubleWord containing the addressed single appearing 
    first.
Error Reply (9 cycles)
    The first cycle of an error ReadBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). 
     The second cycle contains the deviceID of the reporting device and a code describing the error.  Remaining cycles are 
    undefined.

6.3 WriteBlock
    The WriteBlock transaction is used to write a block of data into the memory system.  Memory is overwritten, as are any cached 
    copies. This transaction is used by producers of data outside the memory system to inject new data into the memory system.
Request (9 cycles)
    A WriteBlockRequest packet requests a block to be written to the memory system.  The first cycle contains the packet type, the 
    sender's deviceID, and the address of a single in the block. The remaining eight cycles carry the eight doubleWords of block data in 
    cyclic order, with the cycle containing the addressed single appearing first.  
Normal Reply (2 cycles)
    A WriteBlockReply packet acknowledges an earlier WriteBlockRequest (WriteBlockReply is generated by memory).  The first cycle 
    reflects most of the information in the request header; the second cycle is undefined.
Error Reply (2 cycles)
    The first cycle of an error WriteBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 
    1).  The second cycle contains the deviceID of the reporting device and a code describing the error.

6.4 FlushBlock
    The FlushBlock transaction is used by caches to write a dirty block being victimized back to memory.  Because caches are kept 
    up-to-date, only memory is updated by this transaction.
Request (9 cycles)
    A FlushBlockRequest packet requests a block to be written to main memory.  The first cycle contains the packet type, the sender's 
    deviceID, and the address of a single in the block.  The remaining eight cycles carry the eight doubleWords of block data in 
    cyclic order, with the cycle containing the addressed single appearing first.
Normal Reply (2 cycles)
    A FlushBlockReply packet acknowledges an earlier FlushBlockRequest (FlushBlockReply is generated by memory).  The first cycle 
    reflects most of the information in the request header.  The second cycle is undefined.
Error Reply (2 cycles)
    The first cycle of an error FlushBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 
    1).  The second cycle contains the deviceID of the reporting device and a code describing the error.

6.5 KillBlock
    The KillBlock transaction is used to remove all but one of the cached copies of a block. When a KillBlock completes, the cached copy 
    belonging to the initiator is normally the only one that remains. This operation does not guarantee removal of other cached 
    copies when those copies are being actively written into by their processors.
Request (2 cycles)
    A KillBlockRequest requests all cached copies except the one in the initiator to be removed.  The first cycle contains the packet 
    type, the sender's deviceID, and the address of a single in the block.  The second cycle is undefined.
Normal Reply (2 cycles)
    A KillBlockReply performs the work requested by an earlier KillBlockRequest.  The first cycle reflects most of the information in 
    the request header, with the shared bit indicating whether the block is shared.  The second cycle is undefined. A cache 
    receiving a foreign KillBlockReply when it has a KillBlockReply or WriteSingleReply pending must not kill its copy of the 
    block. 
Error Reply (2 cycles)
    The first cycle of an error KillBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  The 
    second cycle contains the deviceID of the reporting device and a code describing the error.

6.6 WriteSingle
    The WriteSingle transaction is used to write a single to the memory system. There are two versions of the operation, one in which 
    only cached copies of the single are updated, and the other in which main memory is also updated.  This transaction is used by 
    caches to keep multiple copies of cached read/write data consistent.
Request (2 cycles)
    A WriteSingleRequest requests a write to all cached copies of a single.  The first cycle contains the packet type, the sender's 
    deviceID, and the address of the single.  The second supplies the data. If the Flavor bit in the header is 1 then the main 
    memory copy of the single is also updated.
Normal Reply (2 cycles)
    A WriteSingleReply packet performs the work requested by an earlier WriteSingleRequest.  The first cycle reflects most of the 
    information in the request header, with the shared bit indicating whether the datum is shared.  The second cycle supplies the 
    64 bits of data just as in the request.  WriteSingleReply is generated by the memory controller. 
Error Reply (2 cycles)
    The first cycle of an error WriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  
    The second cycle contains the deviceID of the reporting device and a code describing the error.

6.7 IOReadBlock
    The IOReadBlock transaction is used to read a block of data from an IO device.
Request (2 cycles)
    An IOReadBlockRequest packet requests a block of data to be read from an IO device.  The first cycle contains the packet type, the 
    sender's deviceID, and the IO address of a single in the block; the IO address specifies both a device and a location in that 
    device.  The second cycle is undefined.

Normal Reply (9 cycles)
    An IOReadBlockReply packet returns the block requested by an earlier IOReadBlockRequest.  The first cycle reflects most of the 
    information in the request header, while the remaining eight cycles carry the eight doubleWords of block data in cyclic order, 
    with the doubleWord containing the addressed single appearing first.
Error Reply (9 cycles)
    The first cycle of an error IOReadBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  The 
    second cycle contains the deviceID of the reporting device and a code describing the error.  Remaining cycles are undefined.

6.8 IOWriteBlock
    The IOWriteBlock transaction is used to write a block of data to an IO device.
Request (9 cycles)
    An IOWriteBlockRequest packet requests that a block of data be written to an IO device.   The first cycle contains the packet type, 
    the sender's deviceID, and the IO address of a single in the block; this IO address specifies both the device and a location in 
    the device.  The remaining eight carry the eight doubleWords of block data in cyclic order, with the cycle containing the 
    addressed single appearing first.
Normal Reply (2 cycles)
    An IOWriteBlockReply packet acknowledges the write requested by an earlier request packet. The first cycle reflects most of the 
    information in the request header; the second cycle is undefined.
Error Reply (2 cycles)
    The first cycle of an error IOWriteBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  The 
    second cycle contains the deviceID of the reporting device and a code describing the error.

6.9 IOReadSingle
    The IOReadSingle transaction is used to read a single from an IO device.
Request (2 cycles)
    An IOReadSingleRequest packet requests a single to be read from an IO device.  The first cycle contains the packet type, the 
    sender's deviceID, and the IO address of the single; this IO address specifies both the device and a location in the device.  
    The second cycle is undefined.
Normal Reply (2 cycles)
    An IOReadSingleReply returns the single requested by an earlier IOReadSingleRequest.  The first cycle reflects most of the 
    information in the request header, while the second carries the requested data aligned as specified by the single specifier bits of the 
    IO address.
Error Reply (2 cycles)
    The first cycle of an error IOReadSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  The 
    second cycle contains the deviceID of the reporting device and a code describing the error.

6.10 IOWriteSingle
    The IOWriteSingle transaction is used to write a single to an IO device.
Request (2 cycles)
    An IOWriteSingleRequest packet requests a single to be written to an IO device.  The first cycle contains the packet type, the 
    sender's deviceID, and the IO address of the single; this IO address specifies both the device and a location in the device.  
    The second cycle contains the data aligned as specified by the single specifier bits of the IO address.
Normal Reply (2 cycles)
    An IOWriteSingleReply packet acknowledges the write requested by a corresponding IOWriteSingleRequest. The first cycle reflects 
    most of the information in the request header, while the second cycle is undefined.
Error Reply (2 cycles)
    The first cycle of an error IOWriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  
    The second cycle contains the deviceID of the reporting device and a code describing the error.

6.11 Interrupt
    The Interrupt transaction is used to signal an interrupt to one or more processors on the DynaBus.
Request (2 cycles) 
    An InterruptRequest packet requests that one or more processors on the DynaBus be interrupted.  The first cycle contains the packet 
    type, the sender's deviceID, and the IO address of a single; this IO address specifies an interrupt register within one or more 
    caches. The second cycle contains the single.
Normal Reply (9 cycles)
    An InterruptReply packet performs the work requested by an earlier InterruptRequest. The first cycle of the reply reflects most of 
    the information in the request header, while the remaining eight cycles are identical to the second cycle of the request 
    packet. The reply is nine cycles long to give a cache time to do the read-modify-write required to update its interrupt 
    registers.
Error Reply (9 cycles)
    The first cycle of an error InterruptReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The 
    second cycle contains the deviceID of the reporting device and a code describing the error. The remaining cycles are undefined.

6.12 Map
    The Map transaction is used to translate a 16-bit address space identifier and a 20-bit virtual page number to a 24-bit real page 
    number and associated protection flags.
Request (2 cycles)
    A MapRequest packet requests that a virtual page be translated to the corresponding real page.  The first cycle contains the packet 
    type, the sender's deviceID, and the 20 bits of the virtual page number (in bits 31 through 50).  The second cycle contains 
    the address space id.
Normal Reply (2 cycles)
    A MapReply returns the translation requested by an earlier MapRequest.  The first cycle contains the packet type, the deviceID of 
    the transaction initiator, the 22-bit real page and four Flags: Dirty, KWtEnable, UWtEnable, and URdEnable.  The second cycle 
    is unused.  Note that this is one reply packet whose address part is not the same as that of the corresponding request packet.
Error Reply (2 cycles)
    An error MapReply is used to indicate that the responding device (MapCache) could not perform the translation.  The first cycle 
    contains the packet type and the deviceID of the transaction initiator.  The second cycle contains the deviceID of the 
    reporting device and a code describing the error (MapFault, for example).
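    The request and reply behaviour of Map can be sketched in Python as a lookup keyed by address space id and virtual page.  This is 
    only an illustration: the MapCache class, its method names, the dictionary layout, and the numeric value of MAP_FAULT are 
    assumptions made for the example, and the packet bit layout is not modelled.

```python
from dataclasses import dataclass

MAP_FAULT = 1  # hypothetical error code; the real encoding is not given here


@dataclass(frozen=True)
class Flags:
    dirty: bool = False
    kwt_enable: bool = False   # KWtEnable
    uwt_enable: bool = False   # UWtEnable
    urd_enable: bool = False   # URdEnable


class MapCache:
    """Toy model of the device that answers MapRequest packets."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.table = {}  # (asid, virtual_page) -> (real_page, Flags)

    def install(self, asid, vpage, rpage, flags):
        self.table[(asid, vpage)] = (rpage, flags)

    def translate(self, requester_id, asid, vpage):
        """Return a dict standing in for a normal or an error MapReply."""
        entry = self.table.get((asid, vpage))
        if entry is None:
            # Error reply: second cycle names the reporting device and a code.
            return {"to": requester_id, "fault": True,
                    "reporter": self.device_id, "code": MAP_FAULT}
        rpage, flags = entry
        # Normal reply: the address part now holds the real page, not the
        # virtual page that was carried in the request.
        return {"to": requester_id, "fault": False,
                "real_page": rpage, "flags": flags}


mc = MapCache(device_id=7)
mc.install(asid=3, vpage=0x12345, rpage=0x0ABCD, flags=Flags(urd_enable=True))
hit = mc.translate(requester_id=1, asid=3, vpage=0x12345)
miss = mc.translate(requester_id=1, asid=3, vpage=0x54321)
```

    In this sketch a missing table entry is the only source of an error reply; a real MapCache may report other conditions.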

6.13 DeMap
    The DeMap transaction is used to remove all cached virtual to real translations that correspond to a given real page. 
Request  (2 cycles)
    A DeMapRequest packet requests that all cached virtual to real translations for a given real page be removed from processor caches. 
    The first cycle contains the packet type, the sender's deviceID, and the 22 bits of the real page.  The second cycle is 
    undefined.
Normal Reply (2 cycles)
    A DeMapReply actually performs the action requested by the corresponding DeMapRequest.  The first cycle reflects most of the 
    information in the header of the request packet, while the second cycle is undefined.
Error Reply (2 cycles)
    The first cycle of an error DeMapReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).  The second 
    cycle contains the deviceID of the reporting device and a code describing the error.

6.14 NoOps
    Occasionally, a device that has made a request of its arbiter has nothing to send when it gets a grant. In this situation the 
    device is expected to send a NoOp packet of the same length as the packet it had originally intended to send. It does this 
    simply by putting a 0 value for HeaderCycleOut during its allocated header cycle. Thus, there is no special command to indicate 
    a NoOp.


7.  Data Consistency
    The DynaBus supports an efficient protocol for maintaining cache coherency in a multiprocessor environment.  Using the transactions 
    just described, it is possible to build a high performance multiprocessor system that offers a simple model of shared memory to 
    the programmer.  In this system, processors are connected to the DynaBus via write-back caches. The caches are allowed to keep 
    multiple copies of read/write data as needed, and the consistency of this data is maintained automatically and transparently by 
    the hardware. Caches detect when a datum becomes shared by watching bus traffic, and they initiate a broadcast write when a 
    processor issues a write to shared data. IO devices are permitted direct access to the memory system while preserving a 
    consistent view of memory for the processors.  A measure of the efficiency of this coherency protocol is that it requires just 
    one more write to a shared datum than the absolute minimum. 
7.1 Definition of Data Consistency
    A useful definition of data consistency must satisfy three criteria: it must allow interesting programs to be written; it must be 
    simple to understand; and it must be practical to implement. A common way to define consistency is to say that all copies of 
    any given location have the same value during each clock cycle. While this definition is adequate for writing programs and easy 
    to understand, it is hard to implement efficiently when the potential number of cached copies is large. Fortunately, there is a 
    weaker definition that is still sufficient for programming, but is much easier to implement in a large system.  It is based on 
    the notion of serializability.
    Figure 26 shows an abstract model of a shared memory multiprocessor that will be used to define serializability. Each processor has 
    a private line to shared memory over which it issues the commands Fetch(A) and Store(A, D), where A is an address and D is 
    data. For Fetch(A) the memory returns the value currently stored at A; for Store(A, D) it writes the value D into A and returns 
    an indication that the write has completed.  Let the starting time of an operation be the moment a request is sent to shared 
    memory, and the ending time the moment a response is received by the processor.
    Figure 26:  A number of processors connected to a shared memory.
    A computation C on this abstract model consists of N sequences of fetches and stores, one sequence for each of the processors.  A 
    computation transforms the initial state I of the shared memory into a final state F, but does not have any other visible 
    effect.  The Fetches and Stores of C are said to be serializable if there exists some global serial order of all the N 
    sequences such that if the operations were performed in this order, without overlap, the same final state F would be reached 
    starting from the same initial state I (two operations p and q overlap if the starting time of p is before the ending time of q 
    and the starting time of q is before the ending time of p).  The serial order must, of course, also preserve the semantics of 
    Fetch and Store: the value returned by a Fetch(A) in this global sequence must have been the value stored by the most recent 
    Store(A, .), or A's initial value in I if no such Store exists.
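    The definition above can be checked by brute force for very small computations.  The Python sketch below is an illustration 
    only: the tuple encoding of Fetch and Store, the function names, and the exhaustive search are assumptions of the example, 
    not part of the DynaBus design.  The search tries every interleaving that keeps each processor's own order and accepts one 
    in which every Fetch sees the most recent Store and the final state F is reached.

```python
def overlap(p, q):
    """p and q are (start_time, end_time) pairs; True if the operations overlap."""
    return p[0] < q[1] and q[0] < p[1]


def serializable(sequences, initial, final):
    """Search for a global serial order that explains the computation.

    sequences: one list per processor of ("store", addr, value) or
               ("fetch", addr, observed_value) tuples, in program order.
    initial, final: dicts mapping address -> value (states I and F).
    """
    def search(positions, memory):
        if all(p == len(s) for p, s in zip(positions, sequences)):
            return memory == final
        for i, seq in enumerate(sequences):
            p = positions[i]
            if p == len(seq):
                continue  # this processor has no operations left
            kind, addr, value = seq[p]
            nxt = positions[:i] + (p + 1,) + positions[i + 1:]
            if kind == "fetch":
                # A Fetch must return the value of the most recent Store.
                if memory.get(addr) == value and search(nxt, memory):
                    return True
            elif search(nxt, {**memory, addr: value}):
                return True
        return False

    return search((0,) * len(sequences), dict(initial))


# Processor 1 stores 1 and later fetches 2; processor 2 stores 2.
ok = serializable([[("store", 73, 1), ("fetch", 73, 2)],
                   [("store", 73, 2)]], {73: 0}, {73: 2})
# A processor that fetches the old value after its own store cannot be explained.
bad = serializable([[("store", 73, 1), ("fetch", 73, 0)]], {73: 0}, {73: 1})
```

    The search is exponential in the number of operations, so it is useful only for reasoning about small examples.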
    Given this definition of serializability, a shared memory multiprocessor is said to maintain data consistency if there is an 
    algorithmic procedure for serializing the Fetches and Stores for any computation C on this machine. This procedure takes the N 
    sequences of Fetches and Stores and produces a single global sequence that has the same effect on shared memory. The procedure, 
    of course, depends on concrete implementation details of the multiprocessor.  For example, if the multiprocessor has a single 
    port memory with no caches, the transformation of the N sequences to the global sequence is trivial.  For a DynaBus based 
    multiprocessor that has processor caches, the procedure depends on details of the cache consistency algorithm and certain 
    synchronization properties enforced by caches and memory controllers.
    This definition also has a simple and intuitive interpretation.  If a shared memory multiprocessor maintains data consistency 
    according to the above definition,  the memory model the programmer needs to know is the very simple one illustrated in Figure 
    26, regardless of the actual complexity of the machine's memory system.  The real machine behaves for programming purposes as 
    though its processors were directly connected to a simple read write memory with a single port that is able to service exactly 
    one Fetch or Store operation at a time.
    Figure 27: The model of shared memory illustrated here is sufficient for programmers writing for DynaBus-based systems.
     
7.2 An Example
    The simplest way to understand how the DynaBus consistency protocol works is to look at an example (a more careful specification 
    useful for reference will be given in the following section). Consider the five-processor system shown in Figure 28.  The 
    example below describes a sequence of events for a particular location (address 73) starting from the state where none of the 
    five caches has the block that contains this location. Numbers in the figure correspond to the numbers in the text below.
    For the example, it is sufficient to know that a cache maintains two state bits Shared and Owner for each block of data.  When a 
    block has Shared=1 it means that there may be other cached copies of this block; Shared=0 means this is the only cached copy. 
    When Owner=1 it means that this cache's processor was the last one to update this block and any copies it has in other caches.  
    At most one cached copy of a block may have Owner=1. The protocol uses the DynaBus lines shared and owner defined in Section 
    4.5.

1.    Processor1 reads Address 73.
Cache1 misses and does a ReadBlock on the bus.
Memory provides the data.
The block is marked Shared1 = 0, Owner1 = 0.
2.    Processor2 reads Address 73.
Cache2 misses and does a ReadBlock on the bus.
Cache1 pulls the shared line to signal shared.
Memory still provides the data.
The block is marked Shared1 = Shared2  = 1, Owner2  = 0.
3.    Processor3 reads Address 73.
Cache3 misses and does a ReadBlock on the bus.
Cache1  and Cache2 pull the shared line to signal shared.
Memory still provides the data.
The block is marked Shared1 = Shared2  = Shared3 = 1, Owner3  = 0.
4.    Processor2 writes Address 73.
Because the data is shared, Cache2 does a WriteSingle on the DynaBus.
Cache1 and Cache3 pull the shared line to signal shared.
Cache1,  Cache2 and Cache3 update their values, but Memory does not.
Cache2 becomes owner (Owner2  = 1).
5.    Processor4 reads Address 73.
Cache4 misses and does a ReadBlock on the bus.
Cache1, Cache2 and Cache3 pull the shared line to signal shared.
Cache2 pulls the owner line to keep Memory from responding and provides the data.
The block is marked Shared4 = 1, Owner4 = 0.
6.    Processor4 now writes Address 73.
Because the data is shared, Cache4 does a WriteSingle on the DynaBus.
Cache1, Cache2 and Cache3 pull the shared line to signal shared.
Ownership changes from Cache2 to Cache4 (Owner2 = 0, Owner4  = 1).
7.    Processor5 writes Address 73.
Cache5 misses and does a ReadBlock on the bus.
Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared.
Cache4, the current owner, pulls the owner line and supplies the data.
The block is marked Shared5 = 1, Owner5 = 0.
Cache5 then does a WriteSingle because the data is shared.
Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared.
Ownership switches from Cache4 to Cache5 (Owner4 = 0, Owner5  = 1).
Figure 28:  An example illustrating the DynaBus consistency protocol.
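    The seven steps of the example can be replayed with a small Python model.  The sketch makes several simplifying assumptions 
    that are not in the protocol itself: each transaction is treated as atomic (so no pendingState is needed), a single value 
    stands in for the whole block, and the class and method names are invented for the illustration.

```python
class Line:
    def __init__(self, value):
        self.value = value
        self.shared = False
        self.owner = False


class Bus:
    """Toy single-level system for one address, with atomic transactions."""

    def __init__(self, n_caches, initial_value=0):
        self.memory = initial_value
        self.caches = [None] * (n_caches + 1)  # index 0 unused; None = no copy
        self.log = []

    def _holders(self, requester):
        return [i for i, line in enumerate(self.caches)
                if line is not None and i != requester]

    def read_block(self, c):
        others = self._holders(c)          # these caches pull the shared line
        owners = [i for i in others if self.caches[i].owner]
        for i in others:
            self.caches[i].shared = True
        if owners:                         # owner pulls the owner line
            value, source = self.caches[owners[0]].value, f"Cache{owners[0]}"
        else:
            value, source = self.memory, "Memory"
        line = Line(value)
        line.shared = bool(others)
        self.caches[c] = line
        self.log.append(("ReadBlock", c, source))

    def read(self, c):
        if self.caches[c] is None:
            self.read_block(c)
        return self.caches[c].value

    def write(self, c, value):
        if self.caches[c] is None:
            self.read_block(c)
        line = self.caches[c]
        if line.shared:                    # shared data: broadcast the write
            others = self._holders(c)
            for i in others:
                self.caches[i].value = value
                self.caches[i].owner = False
                self.caches[i].shared = True
            line.shared = bool(others)
            self.log.append(("WriteSingle", c))
        line.value = value                 # memory is deliberately not updated
        line.owner = True


bus = Bus(5, initial_value=10)
bus.read(1)        # step 1
bus.read(2)        # step 2
bus.read(3)        # step 3
bus.write(2, 11)   # step 4
bus.read(4)        # step 5
bus.write(4, 12)   # step 6
bus.write(5, 13)   # step 7
owners = [i for i in range(1, 6) if bus.caches[i].owner]
```

    After step 7 only Cache5 has Owner set, every cache holds the latest value, and main memory still holds the stale initial 
    value, which matches the narrative above.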
7.3 Protocol Description for Single Level Systems
    A single level system consists of one or more processors connected to the DynaBus through caches, and a single main memory. The 
    first thing to note about this configuration is that it is sufficient to maintain consistency between cached copies. The main 
    memory copy can be stale with respect to the caches without causing incorrect behavior because processors have no way to access 
    data except through caches. 
    The protocol requires that for each block of data a cache keep two additional bits, shared and owner. For a given block, the shared 
    bit indicates whether there are multiple copies of that block or not. This indication is not accurate, but conservative: if 
    there is more than one copy then the bit is 1; if there is only one copy then the bit is probably 0, but may be 1. We will see 
    later that this conservative indication is sufficient. The owner bit is set in a given cache if and only if the cache's 
    processor wrote into the block last; thus at most one copy of a datum can have owner set. A cache is also required to maintain 
    some pendingState for a transaction the cache has initiated but that has not yet received a reply; this state allows a cache 
    to correctly compute the value of the shared bit for the block addressed in the pending transaction, and to take special 
    actions for certain "dangerous" packets that arrive while the reply is pending. In addition to this state, the protocol uses 
    two lines on the DynaBus, Shared and Owner, that were described earlier in Section 4.5.
    Generally, a cache initiates a ReadBlock transaction when its processor does a Fetch or Store to a block and the block is not in 
    the cache; it initiates a FlushBlock when a block needs to be removed from the cache to make room for another one (only blocks 
    with owner set are written out); and it initiates a WriteSingle when its processor does a write to a block that has the shared 
    bit set. Caches do a match only if they see one of the following packet types: RBRqst, RBRply, WSRqst, WSRply, and WBRqst. In 
    particular, note that no match is done either for a FBRqst or a FBRply. This is because FB is used only to flush data from a 
    cache to memory, not to notify other caches that data has changed. No match is done for a WBRply, because this packet is only 
    used to acknowledge that the memory has processed the WBRqst.
    When a cache issues an RBRqst or WSRqst, all other caches match the block address to see if they have the block. Each cache that 
    matches asserts Shared to signal that the block is shared and also sets its own copy of the shared bit for that block. The 
    requesting cache uses pendingState to compute the value of the shared bit. It cannot simply copy the value of Shared into the 
    shared bit, because the status of the block might change from not shared to shared between request and 
    reply due to an intervening packet with the same address. This ensures that the shared bit is TRUE for a block whenever there 
    are multiple copies, and that the shared bit is eventually cleared if there is only one copy.  The shared bit will be cleared 
    when only one copy is left and that copy's processor does a store. The store turns into a WSRqst, no one asserts Shared, and so 
    the value the requestor computes for the shared bit is FALSE.
    The manipulation of the owner bit is simpler. This bit is set each time a processor stores into one of the singles of the block; it 
    is cleared each time a WSRply arrives on the bus (except for the cache whose processor initiated the WSRqst). There are two 
    cases to consider when a processor does a store. If the shared bit for the block is FALSE, then the cache updates the 
    appropriate single and sets the owner bit right away. If the shared bit is TRUE, the cache puts out a WSRqst. When the memory 
    sees the WSRqst, it turns it around as a WSRply with the same address and data, making sure that the shared bit in the reply is 
    set to the value of the Shared line an appropriate number of cycles after the appearance of the WSRqst's header cycle. When the 
    requestor sees the WSRply, it updates the single and also sets owner. Other caches that match on the WSRply update the single 
    and clear owner. This guarantees that at most one copy of a block can ever have owner set. Owner may not be set at all, of 
    course, if the block has not been written into since it was read from memory.
    When an RBRqst appears on the bus, two distinct cases are possible. Either some cache has owner set for the block or none has. In 
    the first case the owner (and possibly other caches) assert Shared. The owner also asserts Owner, which prevents memory from 
    responding, and then proceeds to supply the block via an RBRply. The second case breaks down into two subcases. In the first 
    subcase no other cache has the block, Shared does not get asserted, and the block comes from memory. In the second subcase at 
    least one other cache has the data, Shared does get asserted, but the block still comes from memory because no cache asserted 
    Owner. Because the bus is packet switched, it is possible for the ownership of a block to change between the request and its 
    reply. Suppose for instance that a cache does an RBRqst at a time when memory was owner of the block, and before memory could 
    reply, some other cache issues a WSRqst which generates a WSRply which in turn makes the issuing cache the owner. Since Owner 
    wasn't asserted for the RBRqst, memory still believes it is owner, so it responds with the RBRply. To avoid using this stale 
    data, the cache that did the RBRqst uses pendingState to either compute the correct value of the data or to retry the ReadBlock 
    when the RBRply is received. Dangerous transactions for a pending ReadBlock are the ones that modify data: WSRply and WBRqst.
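    One way to picture pendingState for an outstanding ReadBlock is sketched below in Python.  The class, its fields, and the 
    choice to patch the block for a WSRply but retry for a WBRqst are assumptions made to show both alternatives mentioned 
    above; the text does not say which strategy a real cache uses, and the four-single block is an arbitrary size for the example.

```python
DANGEROUS = {"WSRply", "WBRqst"}  # packets that modify data, per the text


class PendingRead:
    """Hypothetical pendingState for one outstanding ReadBlock."""

    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.saw_same_address = False  # another packet named this block
        self.patches = []              # (offset, value) taken from WSRply packets
        self.must_retry = False

    def snoop(self, ptype, block_addr, offset=None, value=None):
        """Call for every packet seen between the RBRqst and its RBRply."""
        if block_addr != self.block_addr:
            return
        self.saw_same_address = True
        if ptype not in DANGEROUS:
            return
        if ptype == "WSRply":
            self.patches.append((offset, value))   # fix up the stale single later
        else:                                      # WBRqst: whole block changed
            self.must_retry = True

    def complete(self, reply_block, reply_shared):
        """Return (block, shared_bit), or None if the ReadBlock must be retried."""
        if self.must_retry:
            return None
        block = list(reply_block)
        for offset, value in self.patches:
            block[offset] = value
        return block, (reply_shared or self.saw_same_address)


p = PendingRead(0x40)
p.snoop("WSRply", 0x40, offset=2, value=99)   # dangerous: same block
p.snoop("WSRply", 0x80, offset=0, value=5)    # different block, ignored
patched = p.complete([0, 0, 0, 0], reply_shared=False)

q = PendingRead(0x40)
q.snoop("WBRqst", 0x40)
retry = q.complete([0, 0, 0, 0], reply_shared=False)
```

    The same record also yields the shared bit: it is set if the reply says so or if any packet for the same block was seen 
    while the reply was pending.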
    It is interesting to note that in the above protocol the Shared and Owner lines are output only for caches and input only for 
    memory. This is because the caches never need the value on the Owner line, and the value on the Shared line is provided in the 
    reply packet so they don't need to look at the Shared line either.
    Finally, from the point of view of the memory system, the WBRqst is identical to the FBRqst.  From the point of view of the caches, 
    the two requests are different: caches take no action for a FBRqst, but overwrite their data and clear the owner bit for a 
    matching WBRqst.

7.4 Protocol Description for Two Level Systems
    Figure 29 illustrates a 2-level DynaBus-based system.  A two-level system consists of a number of one-level systems called clusters 
    connected by a main DynaBus that also has the system's main memory. Each cluster contains a single large cache that connects 
    the cluster to the main DynaBus, and a private DynaBus that connects the large cache to the small caches in the cluster. This 
    private DynaBus is electrically and logically distinct from the DynaBuses of other clusters and from the main DynaBus. From the 
    standpoint of a private DynaBus, its large cache looks identical to the main memory in a single-level system. From the 
    standpoint of the main DynaBus, a large cache looks and behaves very much like a small cache in a single-level system. Further, 
    the design of the bus protocol and the consistency protocol is such that a small cache cannot even discover whether it is in a 
    one-level or a two-level system.  The response from its environment is the same in either case. Thus, the behavior of a small 
    cache in a two level system is identical to what was described in the previous section.
    Figure 29:  A 2-level DynaBus-based System.  The small open boxes pictured in each cluster might represent any of the devices that 
    are pictured below them, including: a Small Cache, a Processor, an I/O Bridge, a Display Controller, a Printer or a LAN.
    The protocol requires the large cache to keep all of the state bits a small cache maintains, plus some additional ones. These 
    additional bits are the existsBelow bits, kept one bit per block of the large cache. The existsBelow bit for a block is set 
    only if some small cache in that cluster also has a copy of the block. This bit allows a large cache to filter packets that 
    appear on the main bus and put only those packets on the private bus for which the existsBelow bit is set. Without such 
    filtration, all of the traffic on the main bus would appear on every private bus, defeating the purpose of a two-level 
    organization.
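    The filtering role of existsBelow can be sketched in a few lines of Python.  The class and method names are hypothetical, and 
    the rule used here for setting the bit (a private-bus read brings the block into a small cache) is an assumption; the text 
    only states what the bit means, not exactly when it is set or cleared.

```python
class LargeCacheFilter:
    """Toy model of the existsBelow bits kept by one large cache."""

    def __init__(self):
        self.exists_below = {}  # block address -> bool

    def note_private_read(self, block_addr):
        """A small cache in this cluster has fetched the block."""
        self.exists_below[block_addr] = True

    def should_forward(self, block_addr):
        """Put a main-bus packet on the private bus only if the bit is set."""
        return self.exists_below.get(block_addr, False)


f = LargeCacheFilter()
f.note_private_read(0x100)
main_bus = [("WSRply", 0x100), ("WSRply", 0x200), ("WSRply", 0x100)]
private_bus = [pkt for pkt in main_bus if f.should_forward(pkt[1])]
```

    Only the two packets for block 0x100 reach the private bus; traffic for blocks that no small cache holds stays on the main bus.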
    The behavior of a small cache in a two-level system is identical to its behavior in a one-level system. In addition, a large cache 
    behaves like main memory at its private bus interface and a small cache at its main bus interface. The following paragraphs 
    will describe the internal functioning of a large cache and describe how packets on a private bus relate to those on the main 
    bus and vice-versa.
    When a large cache receives a RBRqst from its private bus, two cases are possible: either the block is there or it's not. If it's 
    there, the cache returns the data via an RBRply, making sure that it sets the shared bit in the reply packet to the OR of the 
    value on the bus and its current state in the cache. (In the single-level system main memory returned the value on the Shared 
    line for this bit.) If the block is not in the cache, the cache puts out a RBRqst on the main bus. When the RBRply comes back 
    the cache updates itself with the new data and its shared bit and puts the RBRply on the private bus. When a large cache gets a 
    WSRqst on its private bus, it checks to see if the shared bit for the block is set. If it is not set, then it updates the data, 
    sets owner, and puts a WSRply (with shared set to the value of the Shared line at the appropriate time) on the private bus. If 
    shared is set, then it puts out a WSRqst on the main bus. The memory responds some time later with a WSRply. At this time the 
    large cache updates the single, sets the owner bit, and puts a WSRply on the private bus with shared set to one. When a large 
    cache gets a FBRqst, it simply updates the block and sends back an FBRply.
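    The private-bus side of a large cache, as just described, might be sketched as follows.  This is a simplified illustration: 
    the dictionary representation of a cache line, the tuple encoding of packets, and the function names are all assumptions, a 
    single value stands in for the block, and the write path assumes the block is already present in the large cache.

```python
def private_rbrqst(large_cache, addr, shared_line):
    """Handle an RBRqst from the private bus; return the packet to send next."""
    line = large_cache.get(addr)
    if line is None:
        return ("main", "RBRqst", addr)          # miss: ask the main bus
    shared = shared_line or line["shared"]       # OR of bus value and cached state
    return ("private", "RBRply", addr, line["data"], shared)


def private_wsrqst(large_cache, addr, value, shared_line):
    """Handle a WSRqst from the private bus (block assumed present)."""
    line = large_cache[addr]
    if line["shared"]:
        return ("main", "WSRqst", addr, value)   # other clusters may hold copies
    line["data"] = value
    line["owner"] = True
    return ("private", "WSRply", addr, value, shared_line)


lc = {
    0x10: {"data": 5, "shared": True, "owner": False},
    0x20: {"data": 7, "shared": False, "owner": False},
}
miss = private_rbrqst(lc, 0x30, shared_line=False)
hit = private_rbrqst(lc, 0x10, shared_line=False)
forwarded = private_wsrqst(lc, 0x10, 9, shared_line=False)
local = private_wsrqst(lc, 0x20, 9, shared_line=True)
```

    In the forwarded case the large cache's own copy is left untouched until the WSRply returns from the main bus, which the 
    sketch does not model.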
    When a large cache gets an RBRqst on its main bus, it matches the address to determine if it has the block. If there is a match and 
    owner is set, then it responds with the data. However, there are two cases. If existsBelow is set, then the data must be 
    retrieved from the private bus by placing a RBRqst. Otherwise, the copy of the block it has is current, and it can return it 
    directly. When a large cache gets a WSRqst on the main bus, it matches the address to see if the block is there and asserts 
    shared as usual, but takes no other action. When the WSRply comes by, however, and there is a match, it updates the data it 
    has. In addition, if the existsBelow bit for that block happens to be set, it also puts WSRply on the private bus. Note that 
    this WSRply appears out of the blue on the private bus; that is, it has no corresponding request packet. This is another reason 
    why the number of reply packets on a bus may exceed the number of request packets.
 
8.  Atomic Operations
    The DynaBus WriteSingle transaction can be used to implement an atomic Swap operation. Typical implementations of Swap in 
    multiprocessors require the bus or specific memory locations to be locked. It is impractical to lock the DynaBus because it is 
    packet switched. Memory locks entail performance compromises because it is impractical to have a lock for each location, 
    and the alternative, sharing one lock among many locations, imposes unnecessary conflicts. The use of WriteSingle to perform a Swap does not require bus or memory 
    locks, so that Swaps to the same location by different processors are limited only by the maximum rate at which WriteSingles 
    can be placed on the bus. Swap has the following semantics:

    Swap[address, value] Returns[sample] = 
        {<begin critical section>
        sample ← address^;
        address^ ← value;
        <end critical section>
        }
    These semantics are implemented by a cache in the following manner: When a processor requests a Swap, the cache first determines if 
    the location is shared. If it is, then the cache issues a WriteSingleRequest to that location and waits for the reply. Upon 
    receiving the reply, it reads the current value of the location and updates the location in one atomic action. If the location 
    is not shared, then the cache simply reads the current value and updates the location in one atomic action. In either case, the 
    final action is to return the value read to the waiting processor. Note that this implementation generates no bus traffic for 
    Swaps to non-shared locations.
    Figure 30: Using WriteSingle to implement Swap.  Because the data is shared, traffic is generated on the DynaBus.  
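    The cache-side behaviour described above can be sketched in Python.  The CacheLine class, the lock that stands in for the 
    cache's atomic action, and the write_single callback are assumptions of this illustration; in particular, write_single is a 
    hypothetical callable that issues a WriteSingleRequest and returns once the matching reply has been seen.

```python
import threading


class CacheLine:
    def __init__(self, value, shared):
        self.value = value
        self.shared = shared
        self.lock = threading.Lock()  # stands in for the cache's atomic action


def swap(line, new_value, write_single):
    """Store new_value and return the value it replaced."""
    if line.shared:
        write_single(new_value)   # bus traffic only for shared locations
    with line.lock:               # read and update as one atomic action
        sample = line.value
        line.value = new_value
    return sample


traffic = []                                   # records WriteSingles "on the bus"
shared_line = CacheLine(5, shared=True)
old_shared = swap(shared_line, 9, traffic.append)

private_line = CacheLine(1, shared=False)
old_private = swap(private_line, 2, traffic.append)
```

    Only the Swap to the shared location produces an entry in traffic, mirroring the observation that Swaps to non-shared 
    locations generate no bus traffic.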

      
9.  Input Output
    All interactions with IO devices fall into one of two categories: control or data transfer. Control interactions are used to 
    initiate IO and to determine whether an earlier request has completed.  Data transfer interactions are used to move the data to 
    and from the memory system, or between IO devices. In most applications, the bandwidth requirements of control interactions are 
    small compared to those of data transfer, so that the transport efficiency of data transfer is much more important than that of 
    control. When an IO device requires a low rate of data transfer, control interactions can also be used to transfer data.
9.1 Control
    All control interactions are carried out through the use of IOReadSingle, IOWriteSingle and Interrupt transactions directed to a 
    common, unmapped 36-bit IO address space. This address space is common in the sense that all processors see the same space, and 
    it is unmapped in the sense that addresses issued by processors are the ones seen by the IO devices. Generally, each type of IO 
    device is allocated a unique, contiguous chunk of IO space at system design time, and the device responds only if an 
    IOReadSingle, IOWriteSingle, or Interrupt is directed to its chunk. The term IO device is being used here not just for real IO 
    devices, but any device (such as a cache) that responds to a portion of the IO address space.
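    Address decoding in the common IO space can be illustrated with a short Python sketch.  The device names, base addresses, 
    and chunk sizes below are invented for the example; only the 36-bit width of the space and the rule that a device responds 
    solely to its own contiguous chunk come from the text.

```python
IO_ADDRESS_BITS = 36  # width of the IO address space, from the text

# Hypothetical allocation fixed at system design time: name -> (base, size).
IO_MAP = {
    "display": (0x0000, 0x1000),
    "disk":    (0x1000, 0x0100),
}


def responder(io_address):
    """Return the device whose chunk contains io_address, or None."""
    if not 0 <= io_address < (1 << IO_ADDRESS_BITS):
        raise ValueError("IO address out of range")
    hits = [name for name, (base, size) in IO_MAP.items()
            if base <= io_address < base + size]
    if len(hits) > 1:
        raise RuntimeError("IO chunks must not overlap")
    return hits[0] if hits else None


display_hit = responder(0x0800)
disk_hit = responder(0x1080)
nobody = responder(0x2000)
```

    Because the chunks are disjoint, at most one device claims any given address.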
9.1.1  IOWriteSingle
    The IOWriteSingle transaction is used to set up IO transfers and to start IO. The address cycle of the request packet carries an IO 
    address, while the data cycle carries a single of data whose interpretation depends upon the IO address.  For block transfer 
    devices, a processor typically does a number of IOWriteSingles to set up the transfer, and then a final IOWriteSingle to 
    initiate the transfer.
    An IOWriteSingle starts out at a small cache as an IOWRqst packet. The large cache of the cluster puts the IOWRqst on the main 
    DynaBus, where it is picked up by all the other large caches. These caches put the IOWRqst on their private buses. Thus, the 
    IOWRqst is broadcast throughout the system. Broadcasting eliminates the need for requestors to know the location of devices in 
    the hierarchy and makes a simpler protocol possible. When the IOWRqst reaches the intended device, the device performs the 
    requested operation and sends an IOWRply. The IOWRply is broadcast in the same manner as the IOWRqst, so it eventually makes 
    its way to the requesting small cache. When the reply arrives, the small cache lets its processor proceed.
9.1.2 IOReadSingle
    The IOReadSingle transaction reads a single of data from an IO device.  This data may either be status bits that are useful in 
    controlling the device, or data being transferred directly from the device to the processor.
    The mechanics of an IOReadSingle are the same as those of an IOWriteSingle: an IOReadSingle starts out at a small cache as an IORRqst 
    packet. The large cache of the cluster puts the IORRqst onto the main DynaBus, where it is picked up by other large caches and 
    put on the private buses. Once the intended IO device receives the request, it reads the data from its registers and sends it 
    along via an IORRply. The IORRply is broadcast in exactly the same way as the IORRqst, and eventually makes its way to the 
    cache that initiated the transaction. Note that for both IOReadSingles and IOWriteSingles exactly one device responds to a 
    given IO address.
9.1.3 Interrupt
    The Interrupt transaction is used by IO devices to generate interrupts for one or more processors. Each processor's cache has a set 
    of interrupt registers, each of which responds to two IO addresses, a directed address and a broadcast address. A directed 
    address is unique to one cache, while a broadcast address is recognized by all caches. When an IO device wants to send an 
    interrupt to one processor, it uses the directed address of that processor's cache in the Interrupt transaction. When an IO 
    device wants to interrupt all processors, it uses the broadcast address.
    An Interrupt starts at some device as an InterruptRqst packet. The large cache of the cluster puts the InterruptRqst on the main 
    bus. The memory then generates an InterruptRply with the same parameters as the InterruptRqst. An InterruptRply packet is nine 
    cycles long, with all the data cycles being identical. All the large caches put this InterruptRply on their private DynaBuses. 
    Thus the InterruptRply is broadcast throughout the system. Depending on the IO address parameter of the InterruptRply, either 
    all caches interrupt their processors or just one cache does. When the InterruptRply reaches the requesting device, the 
    transaction is complete. Note that the reply is not generated by the IO device, but by main memory. The reason is that there is 
    no unique IO device that can generate the reply packet. It is important to point out that errors that occur during an 
    InterruptRply may not be caught by the requesting device's timeout mechanism. If one of the intended recipients of the 
    Interrupt is broken, for instance, the requestor will not get any indication. This is a fundamental problem with broadcast 
    operations, however, and there is no simple solution. 
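    The difference between a directed and a broadcast interrupt comes down to an address comparison in each cache. The sketch 
    below illustrates this under stated assumptions: the address values, the BROADCAST_ADDRESS constant, and the class and 
    function names are hypothetical and are not taken from the DynaBus specification.

```python
# Illustrative model only: the address values and all names are assumptions.

BROADCAST_ADDRESS = 0xFF          # hypothetical address recognised by all caches


class CacheInterruptRegisters:
    def __init__(self, directed_address):
        self.directed_address = directed_address   # unique to this cache
        self.pending = False

    def on_interrupt_reply(self, io_address):
        """Interrupt the processor if the reply is addressed to this cache."""
        if io_address in (self.directed_address, BROADCAST_ADDRESS):
            self.pending = True


def deliver_interrupt_reply(caches, io_address):
    """Model the InterruptRply being seen by every cache in the system."""
    for cache in caches:
        cache.on_interrupt_reply(io_address)
    return [cache.pending for cache in caches]


caches = [CacheInterruptRegisters(a) for a in (0x10, 0x11, 0x12)]
directed = deliver_interrupt_reply(caches, 0x11)

caches = [CacheInterruptRegisters(a) for a in (0x10, 0x11, 0x12)]
everyone = deliver_interrupt_reply(caches, BROADCAST_ADDRESS)
```

    Note that nothing in this model tells the requesting device whether a recipient actually took the interrupt, which mirrors 
    the limitation of broadcast operations noted above.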
9.2 Data Transfer
    IO devices connected to the DynaBus via a cache automatically participate in the data consistency algorithm. If performance were 
    not a problem, all devices could be connected to the DynaBus in this way, freeing designers of IO devices from having to build 
    special chips to interface to the DynaBus.  Unfortunately, this approach is insufficient for high speed input devices, which 
    would cause a cache to needlessly transfer blocks from memory to cache each time the cache got a miss. The protocol provides 
    the WriteBlock transaction to write directly to memory without going through a cache. Similarly, a high speed output device 
    can use ReadBlocks to transfer data directly out of consistent memory without going through a cache.
    In addition, the DynaBus provides the transactions IOReadBlock and IOWriteBlock to transfer data between IO devices without 
    disturbing the contents of real address space. These operations are useful when one IO device wants to stream data to 
    another over the DynaBus without processing the data in any way.

10.  Address Mapping
    Figure 31 shows how address mapping information is organized in a DynaBus system. There is a three-level hierarchy, with the first 
    level residing in the processor cache, the second in the Map Cache, and the third in a Map Table kept in main memory. The Map 
    Table keeps translation entries for all pages that are actually used in Main Memory. The Map Cache contains the subset of the 
    translations in the Map Table that are used frequently by the current computation. A processor cache in turn keeps the subset 
    of the entries in the Map Table that are frequently used by its processor. The Map Cache contains many more entries than a 
    processor cache and acts as a performance accelerator, avoiding frequent accesses to the main memory Map Table.
    Figure 31:  The organization of address mapping information.
10.1 The MapRequest/MapReply Transactions
    Figure 32 illustrates the Map transaction.  When the cache of translations within a processor cache encounters a miss, the 
    processor cache issues a MapRequest packet on the bus. This packet contains the virtual page number to be translated and an 
    address space identifier (aid). The Map Cache checks to see if it has an entry for the requested page, and if it does it 
    returns the translation via a MapReply. A MapReply contains the number of the real page and four flags: Dirty, KWtEnable 
    (kernel write enable), UWtEnable (user write enable), and URdEnable (user read enable). If the Map Cache does not have the 
    entry, it sends a MapReply indicating a Map Fault. When the processor cache receives a Map Fault, it signals a TRAP to its 
    processor. The TRAP handler looks up the translation in the Map Table (the translation is guaranteed to be there if the real 
    page is resident in main memory), writes it to the Map Cache, rewinds the instruction being executed at the time of the TRAP 
    and returns. When the instruction is reexecuted, the processor cache gets another map miss, but this time the Map Cache has the 
    entry, so the miss is satisfied.
    Figure 32:  The Map transaction.  If the Processor Cache does not contain a map entry, it does a MapRequest to the Map Cache. If 
    the Map Cache contains the requested translation, it replies via MapReply.  If not, it uses MapReply to indicate a fault which 
    causes the Processor to TRAP to Map Fault handling code. 
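    The miss, fault, and retry sequence of the Map transaction can be sketched as follows. This is a minimal model under stated 
    assumptions: Python dictionaries stand in for the Map Table, the Map Cache, and the translation cache inside a processor 
    cache, and all function names are hypothetical. Only the packet names, the address space identifier, and the four flags come 
    from the text.

```python
# Illustrative model only: dictionaries stand in for the Map Table, the Map
# Cache and the per-processor translation cache; all names are assumptions.

MAP_FAULT = "MapFault"


def map_request(map_cache, aid, vpage):
    """Map Cache side: reply with the translation or with a Map Fault."""
    return map_cache.get((aid, vpage), MAP_FAULT)


def trap_handler(map_table, map_cache, aid, vpage):
    """Software side: copy the entry from the Map Table into the Map Cache."""
    map_cache[(aid, vpage)] = map_table[(aid, vpage)]   # KeyError = page not resident


def translate(proc_cache, map_cache, map_table, aid, vpage):
    """Return (entry, number_of_traps) for one translation."""
    traps = 0
    while (aid, vpage) not in proc_cache:
        reply = map_request(map_cache, aid, vpage)       # MapRequest / MapReply
        if reply == MAP_FAULT:
            trap_handler(map_table, map_cache, aid, vpage)
            traps += 1                                   # instruction is re-executed
            continue
        proc_cache[(aid, vpage)] = reply
    return proc_cache[(aid, vpage)], traps


# An entry holds the real page number and the four flags named in the text.
entry = {"real_page": 42, "Dirty": False, "KWtEnable": True,
         "UWtEnable": False, "URdEnable": True}
map_table = {(1, 7): entry}
map_cache, proc_cache = {}, {}

first, first_traps = translate(proc_cache, map_cache, map_table, 1, 7)
second, second_traps = translate(proc_cache, map_cache, map_table, 1, 7)
```

    In this run the first translation takes one trap, because the Map Cache starts empty, and the second is satisfied directly 
    from the processor cache.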
10.2 DeMapRequest/DeMapReply
    The DeMap transaction is used to invalidate all translations from virtual pages to a given real page contained within processor 
    caches. DeMap is used whenever a mapping entry needs to be modified. Changes to a map entry need to be made carefully because 
    of the three levels in the mapping hierarchy. The system software must perform the following steps in order:
    1.    Delete the mapping entry from the Map Table.
    2.    Delete the mapping entry from the Map Cache.
    3.    Initiate DeMap to remove the entry from the processor caches.
    Other sequences are not correct because old copies of the translation being modified could remain in caches for arbitrarily long 
    periods of time and cause unwanted behavior. A DeMap is issued by sending a DeMapRequest containing the number of the real page 
    whose translations are to be invalidated (Figure 33). Main memory turns the request around as a DeMapReply, and it is during the 
    reply that the mapping entries for the real page are removed.


    Figure 33:  The DeMap Transaction requests that all translations from some virtual page to a given real page be removed from 
    processor caches. The work of DeMap is done during the reply packet.
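    The required ordering for changing a map entry can be sketched as below. The model is an assumption-laden illustration: 
    dictionaries keyed by virtual page number stand in for the three levels of the mapping hierarchy, and the function names are 
    hypothetical. The point it shows is that the entry is removed from the top of the hierarchy downward, so no level can refill 
    from a stale copy held by the level above it.

```python
# Illustrative model only: dictionaries stand in for the three levels of the
# mapping hierarchy; function and variable names are assumptions.

def demap(proc_caches, real_page):
    """DeMapReply: every processor cache drops translations to real_page."""
    for cache in proc_caches:
        stale = [key for key, entry in cache.items() if entry == real_page]
        for key in stale:
            del cache[key]


def remove_mapping(map_table, map_cache, proc_caches, vpage, real_page):
    """Apply the three steps in the order required by the text."""
    map_table.pop(vpage, None)          # 1. delete from the Map Table
    map_cache.pop(vpage, None)          # 2. delete from the Map Cache
    demap(proc_caches, real_page)       # 3. DeMap the processor caches


# Virtual pages 3 and 9 both map to real page 42; virtual page 5 maps elsewhere.
map_table = {3: 42, 5: 17, 9: 42}
map_cache = {3: 42, 5: 17}
proc_caches = [{3: 42, 5: 17}, {9: 42}]

remove_mapping(map_table, map_cache, proc_caches, vpage=3, real_page=42)
```

    Because DeMap is keyed by real page, the second processor cache also loses its translation for virtual page 9, even though 
    that entry is still valid in the Map Table; it would simply be fetched again on the next miss.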

11.  Error Detection and Reporting
    The DynaBus specifies two aspects of dealing with errors: detection and reporting. Each device is expected to provide its own 
    facilities for detecting errors, whether the errors are internal to the device or result from interactions with other devices. 
    The bus provides parity to help check transport errors. Once an error is detected, a device must decide if it can handle the 
    error on its own or needs to report the error to some other party. Errors that the device can handle on its own are 
    uninteresting because the bus needs to provide no facilities. Errors that a device cannot handle are divided into recoverable 
    errors and catastrophic ones, and the bus provides facilities to handle each kind.
11.1 Bus Parity
    The DynaBus provides a single parity wire to check transport on the 64 Data wires. A device that sends a packet is expected to 
    generate the parity bit, and all receiving devices are expected to check the parity bit. Whether a device considers a DynaBus 
    parity error to be recoverable or catastrophic is not specified.
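    A single parity bit over 64 data wires can be generated and checked as sketched below. The text does not state whether the 
    DynaBus uses odd or even parity, so even parity is assumed here purely for illustration, and the function names are 
    hypothetical.

```python
# Illustrative sketch: the text does not say whether parity is odd or even;
# even parity is assumed here, and the function names are hypothetical.

MASK64 = (1 << 64) - 1


def parity_bit(data):
    """Even-parity bit for a 64-bit data word (XOR of all 64 bits)."""
    return bin(data & MASK64).count("1") & 1


def check_cycle(data, parity):
    """Receiver-side check: True if the cycle arrived without a parity error."""
    return parity_bit(data) == parity


word = 0x0123456789ABCDEF
sent_parity = parity_bit(word)
corrupted = word ^ (1 << 17)          # flip a single data wire in transit

ok_clean = check_cycle(word, sent_parity)
ok_corrupted = check_cycle(corrupted, sent_parity)
```

    A single parity bit detects any odd number of flipped wires in a cycle but cannot detect an even number of flips, which is 
    one reason a device may choose to treat a parity error conservatively.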
11.2 Time Outs
    The DynaBus requires each device to implement a timeout facility to detect devices that do not respond or are not connected. 
    Each device must maintain a counter that starts counting bus cycles when the device issues a request to the arbiter to send a 
    request packet. If the system-wide constant maxWaitCycles cycles have elapsed before the corresponding reply packet is 
    received, the device must assume that an error has occurred. Whether a device considers a DynaBus timeout to be recoverable or 
    catastrophic is not specified.
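    The timeout counter described above can be sketched as a small state machine. Only maxWaitCycles and the start and stop 
    conditions come from the text; the class name, the method names, and the example limit used here are assumptions.

```python
# Illustrative sketch: the class and method names are assumptions; only
# maxWaitCycles and the start/stop conditions come from the text.

class TimeoutCounter:
    def __init__(self, max_wait_cycles):
        self.max_wait_cycles = max_wait_cycles
        self.cycles = None                 # None means no request outstanding

    def request_issued(self):
        """Called when the device asks the arbiter to send a request packet."""
        self.cycles = 0

    def reply_received(self):
        """Called when the matching reply packet arrives."""
        self.cycles = None

    def tick(self):
        """Advance one bus cycle; return True if a timeout error is detected."""
        if self.cycles is None:
            return False
        self.cycles += 1
        return self.cycles >= self.max_wait_cycles


counter = TimeoutCounter(max_wait_cycles=2048)   # example value
counter.request_issued()
timed_out_early = any(counter.tick() for _ in range(100))
counter.reply_received()

counter.request_issued()
timed_out_late = any(counter.tick() for _ in range(2048))
```

    The counter starts at the arbiter request rather than at the bus grant, so time spent waiting for the bus counts toward the 
    limit.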
    The determination of a system-wide value for maxWaitCycles is difficult because of the wide variance in expected service times. For 
    example, a low priority device might take a long time to receive a bus grant, while a higher priority device would get a grant 
    relatively quickly. A low priority device might in fact be forced to wait for an arbitrarily long time if a higher priority 
    device decides to hog the bus. Whether the possibility of freezing out low-priority devices should be interpreted as an error is 
    debatable.
    To avoid getting entangled in this issue, the DynaBus specifies a system-wide lower bound on the limit maxWaitCycles and lets the 
    device implementor decide the exact value. Such a lower limit is needed to avoid generating frequent false alarms. A 
    conservative lower limit can be arrived at by computing the worst-case service time for a cache request and increasing it by an 
    order of magnitude for safety (caches are taken since they are the lowest priority devices that do not change their request 
    priority). Assuming there are 8 caches and only one memory bank, the worst case service time is at most
        8 * (#cycles to service one request in an unloaded system)
        = 8 * 25 cycles
        = 200 cycles.
    Increasing this by an order of magnitude gives about 2000 cycles, which is rounded up to 2048, so each device is required to 
    have maxWaitCycles of at least 2048.
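    The arithmetic behind this lower bound can be checked with a short calculation. The figures of 8 caches, 25 cycles, and a 
    factor of ten come from the text; rounding up to a power of two is an assumption made here to explain how 8 * 25 * 10 = 2000 
    becomes 2048, and the function name is hypothetical.

```python
# Illustrative arithmetic: the rounding to a power of two is an assumption
# made to explain how 8 * 25 * 10 = 2000 becomes 2048.

def min_max_wait_cycles(num_caches, cycles_per_request, safety_factor=10):
    worst_case = num_caches * cycles_per_request     # one memory bank, all caches queued
    padded = worst_case * safety_factor              # order-of-magnitude margin
    return 1 << (padded - 1).bit_length()            # round up to a power of two


worst_case_service = 8 * 25
lower_bound = min_max_wait_cycles(num_caches=8, cycles_per_request=25)
```

    A system with more caches or a slower memory would need to recompute the bound with its own figures.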
11.3 Recoverable Errors
    When a device encounters a recoverable error while servicing a request packet, it uses the DynaBus Mode/Fault bit in the reply 
    packet to report the error. The least significant 32 bits of the first data single of the reply packet are set aside for the 
    FaultCode, while bits 7 through 16 are set aside for the deviceID of the reporting device.
11.4 Catastrophic Errors
    When a device encounters a catastrophic error it makes a Stop request to the arbiter.  Upon receiving this request, the arbiter 
    stops granting requests for the bus, bringing the system to a halt.  The service processor detects the lack of activity on 
    the DynaBus and initiates recovery.  

Appendix I.  DynaBus Command Field Encoding
    The table below gives the encoding for the Command field within the header cycle of a DynaBus packet.
    Transaction Name        Abbreviation    Encoding    Length
    ReadBlockRequest        RBRqst          0000 0      2
    ReadBlockReply          RBRply          0000 1      9
    WriteBlockRequest       WBRqst          0001 0      9
    WriteBlockReply         WBRply          0001 1      2
    FlushBlockRequest       FBRqst          0010 0      9
    FlushBlockReply         FBRply          0010 1      2
    KillBlockRequest        KBRqst          0011 0      2
    KillBlockReply          KBRply          0011 1      2
    WriteSingleRequest      WSRqst          0100 0      2
    WriteSingleReply        WSRply          0100 1      2
    Unused                                  0101 0
                                            0101 1
    Unused                                  0110 0
                                            0110 1
    Unused                                  0111 0
                                            0111 1
    IOReadBlockRequest      IORBRqst        1000 0      2
    IOReadBlockReply        IORBRply        1000 1      9
    IOWriteBlockRequest     IOWBRqst        1001 0      9
    IOWriteBlockReply       IOWBRply        1001 1      2
    IOReadSingleRequest     IORRqst         1010 0      2
    IOReadSingleReply       IORRply         1010 1      2
    IOWriteSingleRequest    IOWRqst         1011 0      2
    IOWriteSingleReply      IOWRply         1011 1      2
    InterruptRequest        IntRqst         1100 0      2
    InterruptReply          IntRply         1100 1      9
    Unused                                  1101 0
                                            1101 1
    MapRequest              MapRqst         1110 0      2
    MapReply                MapRply         1110 1      2
    DeMapRequest            DeMapRqst       1111 0      2
    DeMapReply              DeMapRply       1111 1      2
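
    The table above can be turned into a small decoder. The sketch below is illustrative rather than normative: it treats the 
    low-order bit of the five-bit encoding as a request/reply flag, which is read off the table rather than stated in the text, 
    and the names COMMANDS and decode_command are hypothetical.

```python
# Illustrative decoder built from the table above. Treating the low bit as a
# request/reply flag is read off the table; the helper names are assumptions.

COMMANDS = {
    # 4-bit code: (transaction, request length, reply length) in cycles
    0b0000: ("ReadBlock",     2, 9),
    0b0001: ("WriteBlock",    9, 2),
    0b0010: ("FlushBlock",    9, 2),
    0b0011: ("KillBlock",     2, 2),
    0b0100: ("WriteSingle",   2, 2),
    0b1000: ("IOReadBlock",   2, 9),
    0b1001: ("IOWriteBlock",  9, 2),
    0b1010: ("IOReadSingle",  2, 2),
    0b1011: ("IOWriteSingle", 2, 2),
    0b1100: ("Interrupt",     2, 9),
    0b1110: ("Map",           2, 2),
    0b1111: ("DeMap",         2, 2),
}


def decode_command(field):
    """Decode a 5-bit Command field into (name, packet length in cycles)."""
    if not 0 <= field <= 0b11111:
        raise ValueError("Command field is 5 bits wide")
    transaction, is_reply = field >> 1, field & 1
    if transaction not in COMMANDS:
        return ("Unused", None)
    name, request_len, reply_len = COMMANDS[transaction]
    if is_reply:
        return (name + "Reply", reply_len)
    return (name + "Request", request_len)
```

    For example, decode_command(0b11001) yields ("InterruptReply", 9), matching the nine-cycle InterruptRply described in 
    Section 9.1.3.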