THE DYNABUS: A VLSI BUS FOR USE IN MULTI-PROCESSOR SYSTEMS
PRELIMINARY VERSION
The DynaBus
A VLSI Bus for use in Multiprocessor Systems
L. Bland, J.C. Cuenod, D. Curry, J.M. Frailong, J. Gasbarro, J. Gastinel, B. Gunning, J. Hoel, E. McCreight, M. Overton, E. Richley, M. Ross, and P. Sindhu
Dragon-88-08 Written 4 September 88 Revised 21 November 88
© Copyright 1986, 1987, 1988 Xerox Corporation. All rights reserved.
Abstract: The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data consistency, and VLSI implementation within the memory system of a shared memory multiprocessor. Each DynaBus transaction consists of a request packet followed, an arbitrary time later, by a reply packet; the bus is free to be used by other transactions in the interim. Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows arbitrarily slow devices to be connected, and simplifies data consistency in systems with multiple levels of cache. The bus provides a usable bandwidth of several hundred megabytes per second, permitting the construction of machines executing several hundred MIPS while providing high IO throughput. An efficient protocol ensures that multiple copies of read/write data within processor caches are kept consistent and that IO devices stream data into and out of a consistent view of memory. Both the physical structure of the DynaBus and its protocol are designed specifically to allow a high level of system integration. Complex functions such as memory and graphics controllers that traditionally required entire boards can be implemented in a single VLSI chip that is directly connected to the DynaBus.
Keywords: VLSI DynaBus, Backpanel DynaBus, pipelined bus, timing, arbitration, DynaBus transactions, write-back cache, snoopy cache, data consistency, DynaBus signals, memory interconnect, multiprocessor bus, packet switched bus.
FileName: [Dragon]<Dragon7.0>Documentation>DynaBus>DynaBusDoc.tioga, .ip
Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
Xerox Private Data
Contents
1. Overview
2. Definition of Terms
3. Interconnection Schemes
4. Chip Level Signals
5. Arbitration and Flow Control
6. Transactions
7. Data Consistency
8. Atomic Operations
9. Input Output
10. Address Mapping
11. Error Detection and Reporting
Appendix I. DeviceType Encodings
Appendix II. DynaBus Command Field Encoding
Appendix III. Format of FaultCode
1. Overview
The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data consistency, and VLSI implementation within the memory system of a shared memory multiprocessor. Each DynaBus transaction consists of a request packet followed an arbitrary time later by a reply packet, with the bus being free to be used by other transactions in the interim. Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows arbitrarily slow devices to be connected, and simplifies data consistency in systems with multiple levels of cache. The bus provides a usable bandwidth of many hundreds of megabytes per second, permitting the construction of machines spanning a wide range of cost and performance.
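The quoted bandwidth can be checked with simple arithmetic, using the nominal 25 ns cycle and the 64-bit data path described later in this document (the sketch below is a back-of-the-envelope peak figure, not a measured number):

```python
# Peak bandwidth check, assuming the nominal 25 ns cycle (Section 2) and
# one 64-bit doubleWord transferred per cycle.
cycle_ns = 25
bits_per_cycle = 64

cycles_per_second = 1e9 / cycle_ns               # 40 MHz bus clock
bytes_per_second = cycles_per_second * bits_per_cycle / 8

assert cycles_per_second == 40e6
assert bytes_per_second == 320e6                 # 320 MB/s peak
```

Usable bandwidth is lower than this peak, since header cycles and arbitration overheads consume some cycles.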
Because DynaBus is intended for use in high performance shared memory multiprocessors, there is an efficient protocol for ensuring that processors see a consistent view of memory in the face of caching and IO. This protocol allows the hardware to ensure that multiple copies of read/write data in caches are kept consistent, and that both input and output devices are able to take cached data into account. Despite its efficiency, the consistency protocol provides a model of shared memory that is conceptually both simple and natural.
The DynaBus's physical structure and its protocol are designed to promote a high level of system integration. Complex functions such as memory controllers, graphics controllers, high speed network controllers, and external bus controllers that traditionally required entire boards can be implemented using a single chip connected directly to the DynaBus. The result is a high performance, but compact system. Within a computer, the DynaBus may be used both as a
VLSI interconnect to tie chips together on a single board and as a backplane bus to tie boards together over a backpanel. Figure 1 shows an application with two boards.
The DynaBus's design is flexible enough to allow its use in a wide variety of configurations. In Figure 1, for example, DynaBus A may be connected to DynaBuses B and C in two quite different ways. In the first, the buses are connected by pipeline registers. Here there is logically one DynaBus but three electrically separate bus segments, and all traffic that goes on one segment also goes on the others. In the second, the buses are connected by second level caches. Here there are three logically distinct DynaBuses, and traffic from one bus may or may not go to the others.
The DynaBus has 82 signals, 64 of which form a multiplexed data/address path, Data (Figure 2). HeaderCycle indicates whether the information carried by Data is a packet header or not. Parity and two Spares account for three other signals. Shared and Owner are signals used for data consistency. The Clock signal provides global timing, while SStop is for synchronously starting and stopping a DynaBus system. At the pins of a package that interfaces to the DynaBus, the data port signals can optionally be provided with separate inputs and outputs for added flexibility in building high performance pipelined bus configurations. The pin BidEn allows a single die to be used in either the bidirectional mode or the higher performance unidirectional mode.
Figure 2: Chip Level DynaBus Signals.
The DynaBus's operation can be understood best in terms of three layers: cycles, packets, and transactions (these layers correspond to the electrical, logical, and functional levels, respectively). A bus cycle is simply one complete period of the bus clock; it forms the unit of time and electrical information transfer on the bus; the information is typically either an address or data. A packet is a contiguous sequence of cycles; it is the mechanism by which one-way logical information transfer occurs on the bus. The first cycle of a packet carries address and control information; subsequent cycles typically carry data. A transaction consists of a request packet and a corresponding reply packet that together perform some logical function (such as a memory read).
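The three-layer decomposition above can be sketched as a minimal data model (illustrative only; these class and field names are my own, not part of the specification):

```python
# A toy model of the DynaBus layering: cycles carry 64 bits each, packets
# are contiguous cycle sequences headed by an address/control cycle, and
# transactions pair a request packet with a reply packet.
from dataclasses import dataclass
from typing import List

@dataclass
class Cycle:
    """One bus clock period: 64 bits of address/control or data."""
    bits: int          # the doubleWord transferred during this cycle
    is_header: bool    # mirrors the HeaderCycle signal

@dataclass
class Packet:
    """A contiguous sequence of cycles; the first cycle is the header."""
    cycles: List[Cycle]

    @property
    def header(self) -> Cycle:
        return self.cycles[0]

@dataclass
class Transaction:
    """A request packet and its corresponding reply packet."""
    request: Packet
    reply: Packet

# A 2-cycle packet: header cycle followed by one data cycle.
pkt = Packet([Cycle(0xDEADBEEF, True), Cycle(0x31415926, False)])
assert pkt.header.is_header and len(pkt.cycles) == 2
```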
Each DynaBus has an arbiter that permits the bus to be multiplexed amongst contending devices, which are identified by a unique deviceId. Before a transaction can begin, the requesting device must get mastership from the arbiter. Once it has the bus, the device puts its request packet on the bus one cycle at a time, and then waits for the reply packet. Packet transmission is uninterruptable in that no other device can take the bus away during this time, regardless of its priority. The transaction is complete when another device gets bus mastership and sends a reply packet. Request and reply packets may be separated by an arbitrary number of cycles, provided timeout limits are not exceeded (see Section 11.2). In the interval between request and reply, the bus is free to be used by other devices. The arbiter is able to grant requests in such a way that no cycles are lost between successive packets.
A request packet contains at least the transaction type, the requester's deviceId, a small number of control bits, and an address; it may contain additional transaction dependent information. The reply packet contains the same transaction type, the original requester's deviceId, the original address, some control bits, and transaction dependent data. This replication of type, deviceId, and address information allows request and reply packets to be paired unambiguously. Normally, the protocol ensures a one-to-one correspondence between request packets and reply packets; however, because of errors, some request packets may not get a reply. Thus, devices must not depend on the number of request and reply packets being equal, since this invariant will not in general be maintained. The protocol requires devices to provide a simple but crucial guarantee: they must service request packets in arrival order. This guarantee forms the basis for the DynaBus's data consistency scheme.
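The pairing rule implied by the replicated header fields can be sketched as follows (the dictionary field names are assumptions for illustration, not the actual header layout):

```python
# Illustrative pairing of request and reply packets by the replicated
# (transaction type, deviceId, address) triple.
def matches(request_hdr, reply_hdr):
    """A reply pairs with a request when type, deviceId, and address agree."""
    return (request_hdr["type"] == reply_hdr["type"] and
            request_hdr["deviceId"] == reply_hdr["deviceId"] and
            request_hdr["address"] == reply_hdr["address"])

req   = {"type": "ReadBlock", "deviceId": 0x12, "address": 227}
rep   = {"type": "ReadBlock", "deviceId": 0x12, "address": 227}
other = {"type": "ReadBlock", "deviceId": 0x07, "address": 512}

assert matches(req, rep)        # same triple: unambiguous pairing
assert not matches(req, other)  # different requester and address
```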
The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, input output, and address mapping. The ReadBlock transaction allows a device to read a block from memory or another cache. FlushBlock allows caches to write back dirty data to memory. WriteBlock allows new data to be introduced into the memory system (for example, disk reads). WriteSingle is a short transaction used by caches to update multiple copies of shared data without affecting main memory. IORead, IOWrite and BIOWrite initiate and check IO operations. The Map and DeMap transactions permit the implementation of high speed address mapping in a multiple address space environment. Finally, the ConditionalWriteSingle transaction provides the ability to do atomic read-modify-writes to shared data at the maximum rate transactions can be initiated on the bus. The encoding space leaves room for defining six other transactions.
2. Definition of Terms
Arbiter
an entity that allows multiple devices contending for the same DynaBus to use the bus in a time multiplexed fashion.
BIC
bus interface chip. A chip for connecting two DynaBus segments, containing two pipeline registers.
big-endian numbering
a numbering system where the most significant bit and byte of a word are placed leftmost and numbered 0. The DynaBus uses big-endian numbering (Figure 3).
Figure 3: Big-endian numbering as it is used on the DynaBus.
block
8 contiguous 32-bit words aligned in real address space such that the 32-bit word address of the first word is 0 MOD 8.
bus
a collection of one or more bus segments connected together by pipeline registers.
bus segment
the portion of a bus that is traversed in one clock period.
cycle
one complete period of the DynaBus clock. It is the unit of time and information transfer on the DynaBus. Nominally, a cycle is 25 ns.
DBus
a serial bus used for system initialization, testing, and debugging.
Device
an entity that can arbitrate for the bus and place packets on it.
DeviceID
a 10-bit unique identifier for DynaBus devices. This number is loaded into a device over the DBus during system initialization.
doubleWord
a 64-bit quantity. The DynaBus is capable of transferring one doubleWord of data each cycle.
header
the first cycle of a packet. This cycle contains address and control information.
Hold
a state in which the Arbiter grants requests for reply packets but does not grant requests for request packets.
IO address
a 32-bit quantity used to address IO devices. IO addresses consist of three fields: DeviceType, DeviceNumber and DeviceOffset.
IO address space
the set of all IO addresses.
IOBridge
a chip that allows the DynaBus to be connected to an industry standard bus.
MapCache
a device that provides virtual to real address translation on the DynaBus.
master
a device that has been granted the DynaBus.
module
a unit of packaging intermediate between a chip package and a board.
packet
a contiguous sequence of bus cycles. The DynaBus supports packets of length 2 and 5.
packet switched
a dissociation between the request and reply packets of a transaction that allows the bus to be used for other transactions between request and reply. Same as split transaction.
real address
a 32-bit quantity used to address real memory. The location addressed may reside in main memory as well as in caches. All addresses on the DynaBus are word addresses.
real address space
the set of all real addresses.
requester
the device that sends the request packet of a transaction.
responder
the device that sends the reply packet of a transaction.
slave
a device that is listening to an incoming packet on the DynaBus.
snoopy cache
a two port cache that watches transactions on the DynaBus port to maintain a consistent view of data as seen from the processor port.
split transaction
a dissociation between the request and reply packets of a transaction that allows the bus to be used for other transactions between request and reply. Same as packet switched.
transaction
a pair of packets, the first a request and the second a reply, that together performs some logical function.
virtual address
a 32-bit quantity used by a processor to address memory.
virtual address space
the set of all virtual addresses.
write-back cache
a cache that updates cached data upon a processor write without immediately updating main memory.
write-through cache
a cache that updates main memory and the cache contents whenever a processor does a write.
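Two of the definitions above, big-endian numbering and block alignment, reduce to simple bit arithmetic. The following sketch illustrates both (helper names are mine, not part of the specification):

```python
# Bit arithmetic implied by the definitions above: big-endian bit numbering
# (bit 0 is the most significant) and block alignment (the first word
# address of a block is 0 MOD 8).
WORD_BITS = 32

def get_bit(word: int, n: int) -> int:
    """Bit n of a 32-bit word, numbered big-endian: bit 0 is the MSB."""
    return (word >> (WORD_BITS - 1 - n)) & 1

def block_base(word_address: int) -> int:
    """First word address of the 8-word block containing word_address."""
    return word_address & ~7          # clear the low three address bits

assert get_bit(0x80000000, 0) == 1    # the MSB is bit 0
assert get_bit(0x00000001, 31) == 1   # the LSB is bit 31
assert block_base(227) == 224         # word 227 lies in block 224..231
assert block_base(224) == 224         # 224 MOD 8 == 0: block aligned
```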
3. Interconnection Schemes
A unique aspect of DynaBus is that it can be used as an interconnection component in machines spanning a wide range of cost and performance. At the low end are low cost single board systems of up to a few hundred MIPS, while at the high end are more expensive multi-board systems capable of approaching 1 GIPS and sustaining high IO throughput. However, in all these systems, the logical and much of the electrical specification of the bus stays the same. This allows the same chip set to be employed across an entire family of machines and results in economies of scale not permitted by other buses.
3.1 Low to Medium Performance Systems
Low performance systems typically cannot afford high pin count packages because of increased package cost and the need for more expensive high density interconnection on board. With the bidirectional option, the DynaBus requires just 82 pins per package, providing an attractive solution for low end systems.
With the DynaBus confined to a single board, it is possible to build a high performance, compact 64-bit bus consisting of just one segment (Figure 5). Each DynaBus chip has an input and an output register connected to the bidirectional data port. These registers make a shorter cycle time possible, eliminating any computation (decoding, gating) during the transmission of data between chips.
Figure 5: A Single-Board System contains only one bus segment. A special pin allows the input and output pins of a DynaBus chip to be connected resulting in a bidirectional interface with only 82 wires.
Low cost mid range systems can also be built using a non-pipelined bidirectional DynaBus that spans multiple boards. Each board would have bidirectional buffers at its interface, much like VME or FUTUREBUS (see Figure 4 left). Such an implementation of the DynaBus would not cycle as fast as a single board version or a pipelined version, but it would nonetheless provide an attractive low cost multi-board alternative.
3.2 High Performance Pipelined Systems
One of the most interesting features of DynaBus is that it allows pipelining: a single DynaBus can be broken up into multiple bus segments separated by pipeline registers. These registers are placed at the input and output of each chip, module and board connecting to a DynaBus. During one clock cycle a signal starts out in one pipeline register, traverses one bus segment, and ends up in another pipeline register. The principal advantage of such pipelining is that the signal transit times on carefully designed short bus segments are a fraction of those on a single long segment whose length is the sum of the shorter segments. Small signal transit times in turn mean that bus clock frequency and therefore bus bandwidth is increased.
Figure 4: In a nonpipelined system the segment transit time (the clock period lower bound) is T = T1 + T2 + T3. In a pipelined version the segment transit time is MAX[T1, T2, T3], or about T/3 if comparable transit times for the backpanel and the board are assumed. Thus the bandwidth of the pipelined version is up to three times higher. (This is an upper bound, as the additional setup and hold times will decrease the speed.)
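The arithmetic in the caption can be checked directly; the transit times below are made up for illustration, under the caption's assumption of comparable segment transit times:

```python
# Cycle-time arithmetic from Figure 4, with illustrative transit times.
transits_ns = [8.0, 8.0, 8.0]            # T1, T2, T3 for the three segments

nonpipelined_period = sum(transits_ns)   # T = T1 + T2 + T3
pipelined_period = max(transits_ns)      # MAX[T1, T2, T3]

assert nonpipelined_period == 24.0
assert pipelined_period == 8.0           # up to 3x the clock rate, before
                                         # setup and hold overhead is added
```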
Figure 6 illustrates a multi-board system with three DynaBus segments. The Backpanel is the only bidirectional segment. The boards have two unidirectional input and output buses.
Figure 6: A Multi-Board System with three DynaBus segments.
Finally, Figure 7 illustrates a multi-module multi-board system where the DynaBus has 5 pipelined segments.
Figure 7: A Multi-Board Multi-Module System with five DynaBus segments.
In all these configurations care must be taken in the physical layout of bus segments to minimize reflections and thereby increase the clock rate. Additionally, great care must be taken in distributing the clock to reduce skew. The DynaBus uses balanced transmission lines for bus segments and a special clock distribution scheme that minimizes clock skew.
4. Chip Level Signals
The signals comprising a DynaBus interface for a chip are divided into five groups: Control, Arbitration, Consistency, Data, and optionally DataIn. Control contains input and output versions of the Clock, input and output versions of SStop, and a BidEn pin that is used to either tie the Data and DataIn groups together or allow them to be used separately. The Arbitration group provides the signals used by the chip to request the bus and also the signals used by the arbiter to grant the bus. The Consistency group contains input and output versions of Shared and Owner. Data provides a bidirectional (or optionally a unidirectional output) path for 64 bits of data, header information, and parity. Finally, the optional group DataIn provides a unidirectional input path for signals in the Data group when that group is being used in unidirectional output mode.
Figure 8: The DynaBus Signals.
4.1 Control Signals
Clock
This input signal provides the global timing signal for the DynaBus.
ClockOut
This output signal provides an internal, loaded version of the Clock that is used to deskew Clock.
SStopIn
Synchronous Stop In. SStopIn is the logical OR of the SStopOut lines of all devices on the DynaBus. When the Arbiter sees SStopIn, it stops granting requests, bringing the system to a halt. Additionally, each device may use this signal to bring its operation to a synchronous stop.
SStopOut
Synchronous Stop Out. A device on the DynaBus asserts SStopOut to bring the system to a synchronous stop. SStopOut is eventually seen as SStopIn by all devices on the DynaBus. The Arbiter stops granting the bus until this signal is deasserted.
BidEn
This signal is used to place the Data/DataOut signals in bidirectional mode, eliminating the need for DataIn. When BidEn is deasserted, Data/DataOut go into unidirectional output mode. This feature can be used to reduce the number of DynaBus pins either for building low end systems or to simplify chip testing.
4.2 Arbitration Signals
LongGrant
LongGrant is asserted one cycle before the first cycle of a grant if the arbiter is responding to a long packet (5-cycle) request from the requesting device. At other times the state of LongGrant is undefined.
HiPGrant
HiPGrant is asserted one cycle before the first cycle of a sequence of grant cycles if the arbiter is responding to a high priority request from the requesting device. At other times the state of HiPGrant is undefined.
Grant
Grant is asserted by the arbiter once for each bus cycle that has been granted to a requesting device. The duration of Grant is 2 or 5 cycles, depending on the length of the packet.
RequestOut[0..1]
The RequestOut wires are used by devices to communicate with the Arbiter for allocating the bus. Requesting devices are of two types, normal and memory. Normal devices make requests at two priority levels, high and low. Memory devices always make requests at the same priority level, but the request may be for sending a two or a five cycle packet. The four combinations for RequestOut[0..1] have the following meanings:
Normal Device
0: Release this device's demand for Hold, if any
1: Demand Hold
2: Add a low priority request
3: Add a high priority request
Memory Device
0: Release this device's demand for Hold, if any
1: Demand Hold
2: Add a request for a 2-cycle packet
3: Add a request for a 5-cycle packet
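The two RequestOut wires encode one of four values per cycle; the encodings above can be captured as follows (the enum names are mine; the values come from the lists above):

```python
# RequestOut[0..1] encodings for the two port types.
from enum import IntEnum

class NormalRequest(IntEnum):
    RELEASE_HOLD = 0   # release this device's demand for Hold, if any
    DEMAND_HOLD  = 1   # demand Hold
    ADD_LOW_PRI  = 2   # add a low priority request
    ADD_HIGH_PRI = 3   # add a high priority request

class MemoryRequest(IntEnum):
    RELEASE_HOLD = 0   # release this device's demand for Hold, if any
    DEMAND_HOLD  = 1   # demand Hold
    ADD_2_CYCLE  = 2   # add a request for a 2-cycle packet
    ADD_5_CYCLE  = 3   # add a request for a 5-cycle packet

# Two wires carry the value as two bits, e.g. 3 is both wires asserted.
assert NormalRequest.ADD_HIGH_PRI == (1 << 1) | 1
assert MemoryRequest.ADD_2_CYCLE == 2
```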
4.3 Consistency Signals
OwnerOut
OwnerOut is asserted by a cache when it is the owner of the address specified in a ReadBlockRequest. Because the memory system uses write-back caches, the OwnerOut signal is necessary to assure that the owning cache, not memory, responds to ReadBlockRequest's when the main memory copy is stale.
SharedOut
SharedOut is asserted by a cache to indicate that it holds a cached copy of the data whose address appears on the DynaBus. When a cache initiates a WriteSingle, ConditionalWriteSingle or ReadBlockRequest, all caches that contain the datum except the one that initiated the transaction assert SharedOut.
OwnerIn
OwnerIn is the logical OR of the OwnerOut wires of all caches. It is used by the MemoryController to determine if memory should respond to a ReadBlockRequest. If the value of the MemoryController's OwnerIn wire is true, memory does not respond because one of the caches owns the datum and will issue a reply.
SharedIn
The SharedIn wire is used to maintain the value of the several caches' Shared flags. It is the logical OR of the SharedOut wires of all caches. When a cache initiates a WriteSingle, ConditionalWriteSingle or ReadBlockRequest, all caches that contain the datum except the one that initiated the transaction assert SharedOut. The Memory Controller receives the logical OR of the several caches' SharedOut wires as SharedIn and reflects this value in its reply to the transaction. If none of the caches asserted SharedOut, the MemoryController's reply indicates that the datum is no longer shared. The cache that initiated the transaction then sets its Shared flag to false.
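The wired-OR computation described above can be modeled in software as follows (an illustrative sketch; the real SharedIn is a hardware OR of the SharedOut wires, and the data structures here are my own):

```python
# Model of SharedIn: the OR of every cache's SharedOut for the transaction
# address, excluding the cache that initiated the transaction.
def shared_in(caches, address, initiator):
    """True if any cache other than the initiator holds the address."""
    return any(address in cache["contents"]
               for cache in caches
               if cache["id"] != initiator)

caches = [{"id": 0, "contents": {100}},
          {"id": 1, "contents": {100, 200}},
          {"id": 2, "contents": set()}]

# Cache 0 initiates a WriteSingle to 100: cache 1 asserts SharedOut,
# so the memory controller's reply indicates the datum is still shared.
assert shared_in(caches, 100, initiator=0)

# Cache 1 initiates for 200: no other cache holds it, so the reply lets
# the initiator clear its Shared flag.
assert not shared_in(caches, 200, initiator=1)
```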
4.4 Data/DataOut Signals
Data/DataOut[0..63]
These 64 signals carry the bulk of the information being transmitted from one chip to another. During header cycles they carry a packet type, some control bits, a DeviceID, and an address, and during other cycles they carry data. These signals are active only after receiving Grant from the Arbiter, otherwise they remain in a high impedance state.
Parity/ParityOut
This signal carries parity computed over the Data/DataOut lines.
HeaderCycle/HeaderCycleOut
This signal indicates the beginning of a packet. It is asserted during the first cycle of a packet, which is the header. It is generated by the device sending the packet. This signal is active only during cycles in which the device has Grant from the Arbiter, otherwise it remains in a high impedance state.
Spare/SpareOut[0..1]
These signals are currently unused. They are provided for future expansion.
4.5 DataIn Signals
DataIn[0..63]
These 64 signals carry a possibly delayed version of the information on the DataOut signals.
ParityIn
This signal carries the parity computed by the source of the data. It is used to check if transmission of Data encountered an error.
HeaderCycleIn
HeaderCycleIn is asserted if and only if the header cycle of a packet is being received.
SpareIn[0..1]
These signals are currently unused. They are provided for future expansion.
5. Arbitration and Flow Control
Each DynaBus has an arbiter that permits the bus to be time multiplexed amongst contending devices. Whenever a device has a packet to send, it makes a request to the arbiter using dedicated request lines, and the arbiter grants the bus using dedicated grant lines. Different devices may have different priority levels, and the arbiter guarantees fair (bounded-time) service to each device within its priority level. Bus allocation is non-preemptive, however, in that the transmission of a single packet is noninterruptable. When making an arbitration request, a device indicates both the priority and the length of the packet it wants to send.
Two aspects of DynaBus arbitration ensure good performance. The first is that arbitration is overlapped with transmission, so that no bus cycles are wasted during arbitration and it is possible to fill up the bus completely with packets. The second is that a device may make multiple requests before the first request has been granted; this allows a single device to use the bus to its maximum potential.
Figure 9: Arbitration is overlapped with packet transmission so that it is possible to fill up the bus completely with packets.
The arbiter is also used to implement flow control, which is a mechanism to avoid packet congestion. To understand why congestion can occur, it is important to distinguish between request and reply packets. A request packet is normally answered in a bounded time by a matching reply packet. The canonical example of a request-reply pair is a memory read: "What does location 227 contain?" -> "It contains 31415926." Since each of the two packets is independently arbitrated, some mechanism is necessary to avoid request packets piling up at the memory. Thus congestion occurs when a device receives too many request packets before it has had the opportunity to reply to any one.
5.1 Arbitration
Each device interacts with the arbiter via a dedicated port consisting of two request wires Request[0..1] and one Grant wire. Two other wires, HiPGrant and LongGrant, are shared by all devices connected to the arbiter. The arbiter contains two types of ports, normal and memory, that a requesting device can be connected to. A normal port has two request counters, one for high priority requests and the other for low priority requests. In addition, each counter has an associated packet length. A device indicates whether it wants to make a low or high priority request via its Request wires. It may also use these wires to request and release Hold. Recall that while in the Hold state, the arbiter refuses to grant requests for sending request packets but continues to grant requests for sending reply packets. The Request wires for a normal port are encoded as follows:
Encoding Interpretation
0 Release this device's demand for Hold, if any
1 Demand Hold
2 Add a low priority request
3 Add a high priority request
A memory port consists of a single request FIFO, with a single priority for all incoming requests. A device indicates the length of a packet via its Request wires. It may also use these wires to seize and release the bus. The Request wires for a memory port are encoded as follows:
Encoding Interpretation
0 Release this device's demand for Hold, if any
1 Demand Hold
2 Add a request for a 2-cycle packet
3 Add a request for a 5-cycle packet
A device is permitted to have several pending requests within the arbiter, subject to an upper limit imposed by the arbiter implementation. A separate request is registered for each cycle in which the RequestOut wires are in the "add a request" state. For memory ports the FIFO guarantees that grants will be given in the order in which requests arrived.
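The FIFO grant-order guarantee for a memory port can be modeled as follows (a toy sketch with an unbounded queue; the real arbiter bounds the number of pending requests):

```python
# Model of a memory port: one request is registered per "add" cycle, and
# grants are issued strictly in arrival order.
from collections import deque

class MemoryPort:
    def __init__(self):
        self.fifo = deque()

    def add_request(self, length):
        """Register a request for a 2-cycle or 5-cycle packet."""
        self.fifo.append(length)

    def next_grant(self):
        """Grant the oldest pending request; returns its packet length."""
        return self.fifo.popleft()

port = MemoryPort()
for length in (2, 5, 2):
    port.add_request(length)

# Grants come back in exactly the order the requests arrived.
assert [port.next_grant() for _ in range(3)] == [2, 5, 2]
```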
The type of a port, as well as type-dependent characteristics (the lengths for the high and low priorities for a normal port, and the priority for a memory port) are provided at initialization time using the DBus. For a normal port the priority number of the high priority port (0, 1, 3, 4, 5, or 6) must be no greater than that of the low priority port, since lower priority numbers take precedence.
Grant is used by the arbiter to signal that a device has grant. Grant is asserted for as many cycles as the packet is long. If Grant is asserted in cycle i, then the device can drive its outgoing bus segment in cycle i+1. HiPGrant and LongGrant describe a grant that is about to take place. In the cycle before Grant is asserted, HiPGrant tells the device whether the next grant will correspond to a high priority request or not, and LongGrant tells the device whether the next grant will correspond to a 5-cycle packet or not. Figures 10 and 11 show the timing of important signals at the pins of a requester during the arbitration and transmission of a two cycle and a five cycle packet, respectively. It is helpful to refer to the schematic of Figure 12 when reading the timing diagrams.
The arbiter supports six distinct priority levels. Highest priority requests are served first and requests within a level are served approximately round-robin. The current assignment of levels to devices is as follows:
Value of Priority Meaning
0 (highest) cache reply priority
1 memory priority
3 display high priority
4 I/O priority
5 cache low priority
6 (lowest) display low priority
The two highest priorities are assigned to reply packets. Memory uses its priority to send both five cycle and two cycle packets. The two priorities assigned to the display are the lowest and the highest for request packets. Normally, the display uses its low priority to satisfy requests; since this is the lowest priority, the display will end up using otherwise unusable bandwidth on the bus. Occasionally, when the display's queue gets dangerously empty, it adds a few high priority requests.
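The selection rule implied by the table above can be sketched in software (a rough approximation of the hardware arbiter; the queue structures are my own, and the round-robin behavior within a level is simplified to FIFO order):

```python
# Grant selection over the six priority levels: the lowest priority number
# wins, and requests within a level are served in order.
from collections import deque

PRIORITY_NAMES = {0: "cache reply", 1: "memory", 3: "display high",
                  4: "I/O", 5: "cache low", 6: "display low"}

def select_grant(queues):
    """queues maps priority level -> deque of pending requests.
    Returns (level, request) for the winning request, or None if idle."""
    for level in sorted(queues):
        if queues[level]:
            return level, queues[level].popleft()
    return None

queues = {0: deque(), 1: deque(["mem"]), 5: deque(["cacheA", "cacheB"])}

assert select_grant(queues) == (1, "mem")      # memory beats cache low
assert select_grant(queues) == (5, "cacheA")   # then in-order within level
assert select_grant(queues) == (5, "cacheB")
assert select_grant(queues) is None            # no pending requests
```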
The synchronous stop signal is used to enable and disable arbitration. After machine reset it is used to start arbitration synchronously. Thereafter it may be asserted and de-asserted at will. While it is asserted, no new packets will be granted, but the arbiter will continue to count requests, and grant them later when synchronous stop is de-asserted.
5.2 Flow Control
The arbiter provides two mechanisms for flow control. The first is arbitration priorities: An arbitration request to send a reply packet is always given precedence over an arbitration request to send a request packet. This mechanism alone would eliminate the congestion problem if devices were always ready to reply before the onset of congestion, but this is a severe requirement to place on all devices. To satisfy this requirement, a device must either be able to service packets at the maximum arrival rate, or it must have an input queue that is long enough so that it does not overflow even during the longest service time for a packet. For certain slow devices like the memory controller, servicing packets at arrival rate is clearly impossible and the queue lengths required to ensure no overflow are prohibitive.
The arbiter therefore provides a second mechanism suitable for slow devices. This mechanism involves the use of a special request called Hold to the arbiter. When the arbiter receives Hold from a device, it refuses to grant any further arbitration requests for sending request packets, but continues to grant requests for sending reply packets. This has the effect of choking off new request packets as long as Hold is being asserted by some device, and allows the device asserting Hold to clear the congestion. Of course, since the effect of Hold can never be instantaneous, especially in pipelined configurations, devices still need to provide headroom within their input queues to tolerate a few request packets while Hold takes effect. Devices must not use Hold with abandon, however, because the use of Hold decreases bus throughput.
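Hold's effect on granting can be modeled as follows (a toy sketch; the class and method names are mine, and the propagation delay discussed above is not modeled):

```python
# Model of Hold-based flow control: while any device demands Hold, the
# arbiter grants only reply packets; new request packets must wait.
class Arbiter:
    def __init__(self):
        self.holders = set()

    def demand_hold(self, device):
        self.holders.add(device)

    def release_hold(self, device):
        self.holders.discard(device)

    def may_grant(self, is_reply):
        """Replies are always grantable; requests only when no Hold."""
        return is_reply or not self.holders

arb = Arbiter()
assert arb.may_grant(is_reply=False)        # no congestion yet
arb.demand_hold("memory")                   # memory's input queue is filling
assert not arb.may_grant(is_reply=False)    # new requests are choked off
assert arb.may_grant(is_reply=True)         # replies still flow, clearing
arb.release_hold("memory")                  # the congestion
assert arb.may_grant(is_reply=False)
```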
5.3 Arbitration in Pipelined Configurations
In pipelined DynaBus configurations, the bus segments form a tree rooted at the bidirectional backpanel segment (Figure 13). All IC's are connected on the leaf segment, labeled A in Figure 13. The arbiter controls access to this segment independently of any board or module structure. Note that the existence of an input bus Ci, separate from Ai, means that all IC's in the system receive information from the DynaBus at the same time. This is important because it facilitates the computation of shared and owner bits from the values asserted by individual devices. If different devices received information at slightly different times, this computation would become considerably more complex.
Figure 13: A DynaBus System with three segments. The arbiter makes grants on the leaf bus segment, labeled A, without knowledge of the number of pipelined stages in the System.
Figure 14: Timing diagram showing the transmission of data for Grants 1 and 2 over the three segments of the DynaBus pictured in Figure 13.
6. Transactions
Transactions form the top layer of the DynaBus protocol, with the two lower layers being packets and cycles. Each transaction consists of a request-reply packet pair, the two packets being independently arbitrated. A transaction begins when the requester asks the arbiter to grant the bus for sending its request packet (Figure 15). Upon receiving bus grant, the requester sends the packet one cycle at a time, with the cycle containing packet control information going first. This first cycle, called the packet header, contains all the information needed to identify the packet and select the set of devices that need to service the packet. Subsequent cycles contain data that is dependent on the type of transaction in progress. All DynaBus devices (including the requester) receive the request packet, and each device examines the header to decide whether it needs to take action or not.
Figure 15: A transaction on the DynaBus consists of a request and a reply.
Exactly one of the receiving devices elects to generate a reply, typically after the action requested by the request packet is complete. The mechanism by which a unique device is selected to respond is different for different transactions, but most transactions use an address contained in the header cycle for this purpose. The responding device then requests the arbiter to grant the bus for sending its reply packet. On receiving grant, this device sends the reply packet one cycle at a time, with the header cycle going first. As before, the header cycle contains all the information needed to identify the packet, and in particular to link it unambiguously to the corresponding request packet. All DynaBus devices receive this reply packet as well, and each device examines the header to see what action, if any, it needs to take. Typically, the initiating device behaves somewhat differently than other devices. The transaction is complete when the initiating device receives the reply.
Normally, this protocol ensures a one-to-one correspondence between request and reply packets; however, because of errors, some request packets may not get a reply. Thus, devices must not depend on the number of request and reply packets being equal since this invariant will not in general be maintained. The protocol does require devices to provide a simple, but crucial guarantee that is central to the data consistency scheme: devices must service request packets in arrival order. To understand why arrival order must be maintained, see Section 7.1.
The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, input/output, and address mapping. Ten transactions are currently defined. They are: ReadBlock, FlushBlock, WriteBlock, WriteSingle, IORead, IOWrite, BIOWrite, Map, DeMap, and ConditionalWriteSingle. The ReadBlock transaction allows a cache to read a block from memory or another cache. FlushBlock allows caches to write back dirty data to memory. WriteBlock allows new data to be introduced into the memory system (for example disk reads). WriteSingle is a short transaction used by caches to update multiple copies of shared data without affecting main memory. IORead, IOWrite and BIOWrite initiate and check IO operations. The Map and DeMap transactions permit the implementation of high speed address mapping in a multiple address space environment. Finally, the ConditionalWriteSingle transaction provides the ability to do atomic read-modify-writes to shared data at the maximum rate transactions can be initiated on the bus. The encoding space leaves room for defining six other transactions.
6.1 Header Cycle Format
The first, or header cycle, of a request packet contains a Command, a Mode bit, a DeviceID, and an Address (Figure 16). The Command identifies the transaction and indicates that the packet is a request rather than a reply packet. The Mode bit is used in protection checking by the receiving devices. The DeviceID identifies the initiator of the transaction, while the Address serves as a selector for a memory location or IO device register.
Figure 16: The header cycle of a request packet is transmitted on the Data wires. It contains a Command, a Mode bit, a DeviceID, and an Address.
Most of the information in the header cycle of the request packet is replicated in the header cycle of the reply packet (Figure 17). In fact, only the fourth, fifth, and sixth bits may be different. The fourth bit, which is part of the Command field, identifies the packet as a reply. The fifth bit indicates whether the transaction encountered an error, while the sixth bit tells whether the addressed location is shared or not.
Figure 17: The first cycle of a DynaBus reply packet is transmitted on the Data wires. It contains a Command, a Fault bit, a ReplyShared bit, a DeviceID, and an Address. All bits other than bits 4, 5, and 6 are the same as those in the header for the corresponding request packet.
6.1.1 The Command Field
The Command field in a header cycle is 5 bits. Four bits encode up to 16 different transactions, while the fifth bit encodes whether the packet is a request (0) or a reply (1). Ten of the transactions are currently defined, as shown in Table 1 below.
Table 1: Encoding of the Command Field of a Packet Header
Transaction Name Abbreviation Encoding Length
ReadBlockRequest RBRqst 0000 0 2
ReadBlockReply RBRply 0000 1 5
WriteBlockRequest WBRqst 0001 0 5
WriteBlockReply WBRply 0001 1 2
WriteSingleRequest WSRqst 0010 0 2
WriteSingleReply WSRply 0010 1 2
ConditionalWriteSingleRequest CWSRqst 0011 0 2
ConditionalWriteSingleReply CWSRply 0011 1 5
FlushBlockRequest FBRqst 0100 0 5
FlushBlockReply FBRply 0100 1 2
Unused 0101 0 through 0111 1
IOReadRequest IORRqst 1000 0 2
IOReadReply IORRply 1000 1 2
IOWriteRequest IOWRqst 1001 0 2
IOWriteReply IOWRply 1001 1 2
BIOWriteRequest BIOWRqst 1010 0 2
BIOWriteReply BIOWRply 1010 1 2
MapRequest MapRqst 1110 0 2
MapReply MapRply 1110 1 2
DeMapRequest DeMapRqst 1111 0 2
DeMapReply DeMapRply 1111 1 2
6.1.2 The Mode/Fault Bit
This bit has different interpretations for request and reply packets. In a request packet it supplies the privilege mode (kernel=0, user=1) of the device that issued the request. When the requesting device is a cache, for example, this bit indicates whether the processor is in kernel or user mode. The Mode bit is used by devices servicing a request packet to check whether the requestor has sufficient privileges to perform the requested operation.
In a reply packet the same bit is used to encode whether the device servicing the request packet encountered a fault or not. A 0 indicates no fault, and a 1 indicates a fault. When the fault bit is set in a reply packet, the 32 low order bits of the second cycle supply a FaultCode. The format of FaultCode is defined in Figure 18.
Figure 18: Format of FaultCode.
The high order 10 bits identify the device that is reporting the fault; the low order 3 bits indicate the MajorCode, which divides faults into eight possible categories, of which those currently defined appear in Table 2. The remaining 19 bits supply a device specific MinorCode.
Table 2: The Major FaultCodes
Fault Name Encoding Meaning
MemAccessFault 000 first write to page or insufficient privilege
IOAccessFault 001 insufficient privilege to read or write IO location
MapFault 010 map cache miss
ProcessorFault 011
DynaBusTimeOut 100 transaction timeout on DynaBus
DynaBusOtherFault 111 some other DynaBus fault reported via reply packet
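A sketch of FaultCode packing follows from the stated field widths: device identity in the high 10 bits, MajorCode in the low 3, and the 19-bit MinorCode in between. That the three fields are laid out contiguously in exactly this way is an assumption drawn from the widths given in the text, and the function names are illustrative:

```python
# Major FaultCodes from Table 2.
MAJOR_CODES = {
    "MemAccessFault": 0b000, "IOAccessFault": 0b001, "MapFault": 0b010,
    "ProcessorFault": 0b011, "DynaBusTimeOut": 0b100,
    "DynaBusOtherFault": 0b111,
}

def pack_fault_code(device_id10, minor19, major3):
    """Pack a 32-bit FaultCode: 10-bit device, 19-bit minor, 3-bit major.
    Assumed layout: device in bits 31..22, minor in 21..3, major in 2..0."""
    assert device_id10 < (1 << 10) and minor19 < (1 << 19) and major3 < (1 << 3)
    return (device_id10 << 22) | (minor19 << 3) | major3

def unpack_fault_code(fc):
    """Return (device, minor, major) from a packed FaultCode."""
    return (fc >> 22) & 0x3FF, (fc >> 3) & 0x7FFFF, fc & 0b111
```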
6.1.3 The ReplyShared Bit
This bit is unused in the header of a request packet. In the header of a reply packet it indicates whether the data whose address appears in the packet was shared at the time the corresponding request packet was received. This bit has meaning only for the ReadBlock, WriteSingle, and ConditionalWriteSingle transactions, and may be safely ignored by devices that do not participate in the consistency protocol.
Caches use the value of ReplyShared within a RBRply to set the shared bit for the block being fetched. They use the value of ReplyShared within a WSRply or CWSRply to determine whether the block is still shared, clearing the shared bit of the cached block if it is not.
6.1.4 The DeviceID Field
For request packets, the DeviceID field of the header carries the unique identity of the device that sent the request packet. For reply packets, the DeviceID field of the header is the unique identity of the intended recipient (that is, the identity of the device that sent the request packet). A DeviceID is needed in reply packets because the address alone is not sufficient to disambiguate replies.
Devices that either can have only one outstanding reply at a time, or that can have multiple outstanding replies but can somehow discriminate between them, need only one DeviceID. Other devices must be allocated multiple DeviceID's to allow them to disambiguate replies. These DeviceID's must be contiguous and must be a power of two in number.
The DeviceID(s) for a device are loaded in at system initialization time via the Debug bus (see [DBusSpec] for details).
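A hypothetical sketch of why contiguous, power-of-two DeviceID blocks are convenient: if the block is additionally naturally aligned (an alignment assumption made here, not stated in the text), a device can decide whether a reply is addressed to it with a single mask-and-compare. The function name is illustrative:

```python
def owns_reply(base_id, num_ids, reply_device_id):
    """True if reply_device_id falls within this device's contiguous,
    power-of-two-sized, naturally aligned block of DeviceIDs."""
    assert num_ids & (num_ids - 1) == 0, "block size must be a power of two"
    assert base_id % num_ids == 0, "block assumed naturally aligned"
    # Clearing the low-order bits of the reply's DeviceID yields the
    # block base, so matching reduces to one mask and one compare.
    return (reply_device_id & ~(num_ids - 1)) == base_id
```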
6.1.5 The Address Field
The address field within a header cycle is 47 bits. In the current implementation, only 32 of these bits are used and the high-order 15 must be 0. For reasons of extensibility, devices must check that these high order bits are 0 and do nothing if they are not. For non-IO transactions, the 32 bits represent the address of a 32-bit word in real address space; for IO transactions the 32 bits represent the IO address of a 32-bit word in the IO address space.
For all transactions other than Map, the contents of the address field in the request packet and its corresponding reply must be identical. The Map transaction is an exception to simplify cache implementation.
For packets that transmit a block of data, doubleWords constituting the block are transmitted in cyclic order, with the doubleWord containing the addressed word being transmitted first (Figure 19).
Figure 19: When memory replies with a block of data, the 4 doublewords appear on the bus in cyclic order relative to the specified address. The addressed word is transmitted in the first data cycle, thus decreasing the latency of the requested word when the cache misses.
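The cyclic ordering can be computed directly. The sketch below assumes a doubleWord is two 32-bit words and a block is four doubleWords (eight words), consistent with Figure 19's four doublewords; the constant and function names are illustrative:

```python
WORDS_PER_DOUBLEWORD = 2    # assumed: a doubleWord is two 32-bit words
DOUBLEWORDS_PER_BLOCK = 4   # per Figure 19

def cyclic_order(word_address):
    """Return block-relative doubleWord indices in bus-transmission order:
    the doubleWord containing the addressed word goes first, then the
    remaining doubleWords in cyclic (wrap-around) order."""
    first = (word_address // WORDS_PER_DOUBLEWORD) % DOUBLEWORDS_PER_BLOCK
    return [(first + i) % DOUBLEWORDS_PER_BLOCK
            for i in range(DOUBLEWORDS_PER_BLOCK)]
```

Transmitting the addressed doubleWord first is what lets a missing cache forward the requested word to its processor before the rest of the block arrives.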
6.2. ReadBlock
The ReadBlock transaction is used to read a block of data from the memory system. If a cache is owner then that cache replies, otherwise memory replies.
Request (2 cycles)
A ReadBlockRequest packet requests a block to be read from the memory system. The first cycle contains the packet type, the sender's DeviceID, and the address of a word in the block. The second cycle is either empty or contains the address of a block that the requested block will replace within the requesting device. This "victim" address is meaningful only for caches.
Normal Reply (5 cycles)
A ReadBlockReply packet returns the block data requested by an earlier ReadBlockRequest. The first cycle reflects most of the information in the request header, with the shared bit indicating whether the block is shared. The remaining four cycles contain the block data in cyclic order, with the doubleWord containing the addressed word appearing first.
Error Reply (5 cycles)
The first cycle of an error ReadBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error. Remaining cycles are undefined.
6.3 WriteBlock
The transaction WriteBlock is used to write a block of data into the memory system. Memory is overwritten, as are any cached copies. This transaction is used by producers of data outside the memory system to inject new data.
Request (5 cycles)
A WriteBlockRequest packet requests a block to be written to the memory system. The first cycle contains the packet type, the sender's DeviceID, and the address of a word in the block. The remaining four carry the four cycles of the block, in cyclic order, with the cycle containing the addressed word appearing first.
Normal Reply (2 cycles)
A WriteBlockReply packet acknowledges an earlier WriteBlockRequest (WriteBlockReply is generated by memory). The first cycle reflects most of the information in the request header; the second cycle is undefined.
Error Reply (2 cycles)
The first cycle of an error WriteBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.4 FlushBlock
The FlushBlock transaction is used by caches to write dirty data being victimized back to memory. Only memory listens to FlushBlocks; caches do not, since their copies are already up to date.
Request (5 cycles)
A FlushBlockRequest packet requests a block to be written to main memory. The first cycle contains the packet type, the sender's DeviceID, and the address of a word in the block. The remaining four cycles carry the four cycles of the block, in cyclic order, with the cycle containing the addressed word appearing first.
Normal Reply (2 cycles)
A FlushBlockReply packet acknowledges an earlier FlushBlockRequest (FlushBlockReply is generated by memory). The first cycle reflects most of the information in the request header. The second cycle is undefined.
Error Reply (2 cycles)
The first cycle of an error FlushBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.5 WriteSingle
The transaction WriteSingle is used to write a 32-bit word of data to the memory system. Only cached copies of the word are updated, main memory is not. This transaction is used by caches to keep multiple copies of cached read/write data consistent.
Request (2 cycles)
A WriteSingleRequest requests a write to all cached copies of a 32-bit word. The first cycle contains the packet type, the sender's DeviceID, and the address of the word. The second supplies the data.
Normal Reply (2 cycles)
A WriteSingleReply packet performs the work requested by an earlier WriteSingleRequest. The first cycle reflects most of the information in the request header, with the shared bit indicating whether the datum is shared. The second cycle supplies the 32-bits of data just as in the request. WriteSingleReply is generated by memory.
Error Reply (2 cycles)
The first cycle of an error WriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.6 ConditionalWriteSingle
The transaction ConditionalWriteSingle is used to perform an atomic read-modify-write to a 32-bit location in the memory system. Two 32-bit values old and new define the semantics as follows: if the contents of the target location equal old then new is written into the target. The previous value is always returned as the result (see Section 8.2.3). Only cached copies are affected, main memory is not. This transaction is used to synchronize the actions of multiple processors.
Request (2 cycles)
A ConditionalWriteSingleRequest requests a read-modify-write to all cached copies of a 32-bit word. The first cycle contains the packet type, the sender's DeviceID, and the address of the word. The second carries the old and new values, with old appearing in the most significant 32 bits and new in the least significant 32 bits.
Normal Reply (5 cycles)
A ConditionalWriteSingleReply packet performs the work requested by the corresponding request. The first cycle of the reply reflects most of the information in the request header, with the shared bit indicating whether the datum is shared. The four data cycles are identical, each containing a copy of the second cycle of the request packet.
Error Reply (5 cycles)
The first cycle of an error ConditionalWriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error. The remaining cycles are undefined.
6.7 IORead
The IORead transaction is used to read a 32-bit word from an IO device.
Request (2 cycles)
An IOReadRequest packet requests a 32-bit read from an IO device. The first cycle contains the packet type, the sender's DeviceID, and the IO address of the word; this IO address specifies both the device and a location in the device. The second cycle is undefined.
Normal Reply (2 cycles)
An IOReadReply packet returns the 32-bit data requested by an earlier IOReadRequest. The first cycle reflects most of the information in the request header, while the second carries the data in the least significant 32 bits.
Error Reply (2 cycles)
The first cycle of an error IOReadReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.8 IOWrite
The IOWrite transaction is used to write a 32-bit word to an IO device.
Request (2 cycles)
An IOWriteRequest packet requests a 32-bit write to an IO device. The first cycle contains the packet type, the sender's DeviceID, and the IO address of the word; this IO address specifies both the device and a location in the device. The second cycle contains the data in the least significant 32 bits.
Normal Reply (2 cycles)
An IOWriteReply packet acknowledges the write requested by an earlier IOWriteRequest. The first cycle reflects most of the information in the request header, while the second cycle is undefined.
Error Reply (2 cycles)
The first cycle of an error IOWriteReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.9 BIOWrite
The BIOWrite transaction is used to broadcast a write of a 32-bit word to all IO devices of a given type.
Request (2 cycles)
A BIOWriteRequest packet requests a 32-bit write to all IO devices of a given type. The first cycle contains the packet type, the sender's DeviceID, and the IO address of the word; this IO address specifies both the device type and a location in that device type. The second cycle contains the data in the least significant 32 bits.
Normal Reply (2 cycles)
A BIOWriteReply packet acknowledges the write requested by an earlier BIOWriteRequest. The first cycle of the reply reflects most of the information in the request header, while the second cycle is undefined. The reply is generated by the memory.
Error Reply (2 cycles)
The first cycle of an error BIOWriteReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.10 Map
The Map transaction is used to translate a 16-bit address space identifier and a 22-bit virtual page number to a 22-bit real page number and associated protection flags.
Request (2 cycles)
A MapRequest packet requests that a virtual page be translated to the corresponding real page. The first cycle contains the packet type, the sender's DeviceID, and the 22 bits of the virtual page number. The second cycle contains the address space id.
Normal Reply (2 cycles)
A MapReply returns the translation requested by an earlier MapRequest. The first cycle contains the packet type, the DeviceID of the transaction initiator, the 22-bit real page number, and four flags: Dirty, KWtEnable, UWtEnable, and URdEnable. The second cycle is unused. Note that this is one reply packet whose address part is not the same as the address part of the corresponding request packet.
Error Reply (2 cycles)
An error MapReply is used to indicate that the responding device (MapCache) could not perform the translation. The first cycle contains the packet type, and the DeviceID of the transaction initiator. The second cycle contains the DeviceID of the reporting device and a code describing the error (the code shown below corresponds to MapFault).
6.11 DeMap
The DeMap transaction is used to remove all cached virtual to real translations that correspond to a given real page.
Request (2 cycles)
A DeMapRequest packet requests that all cached virtual to real translations for a given real page be removed from processor caches. The first cycle contains the packet type, the sender's DeviceID, and the 22 bits of the real page number. The second cycle is undefined.
Normal Reply (2 cycles)
A DeMapReply actually performs the action requested by the corresponding DeMapRequest. The first cycle reflects most of the information in the header of the request packet, while the second cycle is undefined.
Error Reply (2 cycles)
The first cycle of an error DeMapReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the DeviceID of the reporting device and a code describing the error.
6.12 NoOps
Occasionally, a device that has made an arbiter request has nothing to send when it gets grant. In this situation the device is expected to send a NoOp packet of the same length as the packet it had originally intended to send. It does this simply by driving a 0 value on HeaderCycleOut during its allocated header cycle; thus, there is no special command to indicate NoOp.
7. Data Consistency
The DynaBus supports an efficient protocol for maintaining cache coherency in a multiprocessor environment. Using the transactions just described, it is possible to build a high performance multiprocessor system that offers a simple model of shared memory to the programmer. In this system, processors are connected to the DynaBus via write-back caches. The caches are allowed to keep multiple copies of read/write data as needed, and the consistency of this data is maintained automatically and transparently by the hardware. Caches detect when a datum becomes shared by watching bus traffic, and they initiate a broadcast write when a processor issues a write to shared data. IO devices are permitted direct access to the memory system while preserving a consistent view of memory for the processors. A measure of the efficiency of this coherency protocol is that it requires just one more write to a shared datum than the absolute minimum.
7.1 Definition of Data Consistency
A useful definition of data consistency must satisfy three criteria: it must allow interesting programs to be written; it must be simple to understand; and it must be practical to implement. A common way to define consistency is to say that all copies of any given location have the same value during each clock cycle. While this definition is adequate for writing programs and easy to understand, it is hard to implement efficiently when the potential number of cached copies is large. Fortunately, there is a weaker definition that is still sufficient for programming, but is much easier to implement in a large system. It is based on the notion of serializability.
Figure 20 shows an abstract model of a shared memory multiprocessor that will be used to define serializability. Each processor has a private line to shared memory over which it issues the commands Fetch(A) and Store(A, D), where A is an address and D is data. For Fetch(A) the memory returns the value currently stored at A; for Store(A, D) it writes the value D into A and returns an indication that the write has completed. Let the starting time of an operation be the moment a request is sent to shared memory, and the ending time the moment a response is received by the processor.
Figure 20: A number of processors connected to a shared memory.
A computation C on this abstract model consists of N sequences of fetches and stores, one sequence for each of the processors. A computation transforms the initial state I of the shared memory into a final state F, but does not have any other visible effect. The Fetches and Stores of C are said to be serializable if there exists some global serial order of all the N sequences such that if the operations were performed in this order, without overlap, the same final state F would be reached starting from the same initial state I (two operations p and q overlap if the starting time of p is before the ending time of q and the starting time of q is before the ending time of p). The serial order must, of course, also preserve the semantics of Fetch and Store: the value returned by a Fetch(A) in this global sequence must have been the value stored by the most recent Store(A, .), or A's initial value in I if no such Store exists.
Given this definition of serializability, a shared memory multiprocessor is said to maintain data consistency if there is an algorithmic procedure for serializing the Fetches and Stores for any computation C on this machine. This procedure takes the N sequences of Fetches and Stores and produces a single global sequence that has the same effect on shared memory. The procedure, of course, depends on concrete implementation details of the multiprocessor. For example, if the multiprocessor has a single port memory with no caches, the transformation of the N sequences to the global sequence is trivial. For a DynaBus based multiprocessor that has processor caches, the procedure depends on details of the cache consistency algorithm and certain synchronization properties enforced by caches and memory controllers.
This definition also has a simple and intuitive interpretation. If a shared memory multiprocessor maintains data consistency according to the above definition, the memory model the programmer needs to know is the very simple one illustrated in Figure 20, regardless of the actual complexity of the machine's memory system. The real machine behaves for programming purposes as though its processors were directly connected to a simple read write memory with a single port that is able to service exactly one Fetch or Store operation at a time.
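The serializability definition admits a direct, if exponential, brute-force check. The sketch below enumerates interleavings of the per-processor sequences (preserving each processor's own order) and verifies that some interleaving explains every Fetch value; a complete check would also compare the resulting final state against the observed F. All names and the operation tuple format are illustrative:

```python
def serializations(sequences):
    """Yield every interleaving of the per-processor operation sequences
    that preserves each processor's own order."""
    if all(not s for s in sequences):
        yield []
        return
    for i, seq in enumerate(sequences):
        if seq:
            rest = [s[:] for s in sequences]  # copy, then consume one op
            op = rest[i].pop(0)
            for tail in serializations(rest):
                yield [op] + tail

def replays_correctly(order, initial):
    """Replay one serial order; each ("F", addr, v) must see value v."""
    mem = dict(initial)
    for kind, addr, value in order:
        if kind == "S":
            mem[addr] = value
        elif mem.get(addr) != value:  # a Fetch saw an inexplicable value
            return False
    return True

def is_serializable(sequences, initial):
    """True if some serial order explains all observed Fetch values.
    Exponential in the number of operations; a didactic sketch only."""
    return any(replays_correctly(order, initial)
               for order in serializations(sequences))
```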
Figure 21: The model of shared memory illustrated here is sufficient for programmers writing for DynaBus-based systems.
7.2 An Example
The simplest way to understand how the DynaBus consistency protocol works is to look at an example (a more careful specification, useful for reference, is given in the following section). Consider the five-processor system shown in Figure 22. The example below describes a sequence of events for a particular location (address 73), starting from the state in which none of the five caches has the block that contains this location. Numbers in the figure correspond to the numbers in the text below.
For the example, it is sufficient to know that a cache maintains two state bits Shared and Owner for each block of data. When a block has Shared=1 it means that there may be other cached copies of this block; Shared=0 means this is the only cached copy. When Owner=1 it means that this cache's processor was the last one to update this block and any copies it has in other caches. At most one cached copy of a block may have Owner=1. The protocol uses the DynaBus lines shared and owner defined in Section 4.5.
1. Processor1 reads Address 73.
Cache1 misses and does a ReadBlock on the bus.
Memory provides the data.
The block is marked Shared1 = 0, Owner1 = 0.
2. Processor2 reads Address 73.
Cache2 misses and does a ReadBlock on the bus.
Cache1 pulls the shared line to signal shared.
Memory still provides the data.
The block is marked Shared1 = Shared2 = 1, Owner2 = 0.
3. Processor3 reads Address 73.
Cache3 misses and does a ReadBlock on the bus.
Cache1 and Cache2 pull the shared line to signal shared.
Memory still provides the data.
The block is marked Shared1 = Shared2 = Shared3 = 1, Owner3 = 0.
4. Processor2 writes Address 73.
Because the data is shared, Cache2 does a WriteSingle on the DynaBus.
Cache1 and Cache3 pull the shared line to signal shared.
Cache1, Cache2 and Cache3 update their values, but Memory does not.
Cache2 becomes owner (Owner2 = 1).
5. Processor4 reads Address 73.
Cache4 misses and does a ReadBlock on the bus.
Cache1, Cache2 and Cache3 pull the shared line to signal shared.
Cache2 pulls the owner line to keep Memory from responding and provides the data.
The block is marked Shared4 = 1, Owner4 = 0.
6. Processor4 now writes Address 73.
Because the data is shared, Cache4 does a WriteSingle on the DynaBus.
Cache1, Cache2 and Cache3 pull the shared line to signal shared.
Ownership changes from Cache2 to Cache4 (Owner2 = 0, Owner4 = 1).
7. Processor5 writes Address 73.
Cache5 misses and does a ReadBlock on the bus.
Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared.
Cache4, the current owner, pulls the owner line and supplies the data.
The block is marked Shared5 = 1, Owner5 = 0.
Cache5 then does a WriteSingle because the data is shared.
Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared.
Ownership switches from Cache4 to Cache5 (Owner4 = 0, Owner5 = 1).
Figure 22: An example illustrating the DynaBus consistency protocol.
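The state transitions in the example can be replayed with a minimal model of the per-block shared and owner bits. The sketch below omits timing, pendingState, memory's copy, and the shared-bit clearing path, and the class and method names are illustrative:

```python
class Cache:
    """Per-block state kept by one cache: present, shared, owner."""
    def __init__(self):
        self.present = False
        self.shared = False
        self.owner = False

class Bus:
    """Toy snoopy bus for a single block address."""
    def __init__(self, n):
        self.caches = [Cache() for _ in range(n)]

    def read_block(self, i):
        # Every other cache holding the block asserts Shared and
        # sets its own shared bit.
        others = [c for j, c in enumerate(self.caches)
                  if j != i and c.present]
        for c in others:
            c.shared = True
        me = self.caches[i]
        me.present = True
        me.shared = bool(others)  # value of the Shared line seen by requestor
        me.owner = False

    def write(self, i):
        me = self.caches[i]
        if not me.shared:         # private copy: update locally, take ownership
            me.owner = True
            return
        # Shared copy: a WriteSingle; the reply sets owner in the
        # requesting cache and clears it everywhere else.
        for j, c in enumerate(self.caches):
            if c.present:
                c.owner = (j == i)
```

Replaying the seven steps (reads by P1, P2, P3; write by P2; read and write by P4; read and write by P5) leaves exactly one owner, Cache5, matching the example.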
7.3 Protocol Description for Single Level Systems
A single level system consists of one or more processors connected to the DynaBus through caches, and a single main memory. The first thing to note about this configuration is that it is sufficient to maintain consistency between cached copies. The main memory copy can be stale with respect to the caches without causing incorrect behavior because processors have no way to access data except through caches.
The protocol requires that for each block of data a cache keep two additional bits, shared and owner. For a given block, the shared bit indicates whether there are multiple copies of that block or not. This indication is not accurate, but conservative: if there is more than one copy then the bit is 1; if there is only one copy then the bit is probably 0, but may be 1. We will see later that this conservative indication is sufficient. The owner bit is set in a given cache if and only if the cache's processor wrote into the block last; thus at most one copy of a datum can have owner set. A cache is also required to maintain some pendingState for a transaction the cache has initiated but that hasn't been replied to as yet; this state allows a cache to correctly compute the value of the shared bit for the block addressed in the pending transaction, and to take special actions for certain "dangerous" packets that arrive while the reply is pending. In addition to this state, the protocol uses two lines on the DynaBus, Shared and Owner that were described earlier in Section 4.5.
Generally, a cache initiates a ReadBlock transaction when its processor does a Fetch or Store to a block and the block is not in the cache; it initiates a FlushBlock when a block needs to get kicked out of the cache to make room for another one (only blocks with owner set are written out); and it initiates a WriteSingle when its processor does a write to a block that has the shared bit set. Caches do a match only if they see one of the following packet types: RBRqst, RBRply, WSRqst, WSRply, CWSRqst, CWSRply, and WBRqst. In particular, note that no match is done either for a FBRqst or a FBRply. This is because FB is used only to flush data from a cache to memory, not to notify other caches that data has changed. No match is done for a WBRply either, because all this packet does is acknowledge that the memory has processed the WBRqst.
When a cache issues a RBRqst or WSRqst, all other caches match the block address to see if they have the block. Each cache that matches asserts Shared to signal that the block is shared and also sets its own copy of the shared bit for that block. The requesting cache uses pendingState to compute the value of the shared bit. The reason it can't just copy the value of Shared into the shared bit like the other caches is that the status of the block might change from not shared to shared between request and reply due to an intervening packet with the same address. This ensures that the shared bit is TRUE for a block only if there are multiple copies, and that the shared bit is cleared eventually if there is only one copy. The clearing happens when only one copy is left and that copy's processor does a store. The store turns into a WSRqst, no one asserts Shared, and so the value the requestor computes for the shared bit is FALSE.
The manipulation of the owner bit is simpler. This bit is set each time a processor stores into one of the words of the block; it is cleared each time a WSRply arrives on the bus (except for the cache whose processor initiated the WSRqst). There are two cases to consider when a processor does a store. If the shared bit for the block is FALSE, then the cache updates the appropriate word and sets the owner bit right away. If the shared bit is TRUE, the cache puts out a WSRqst. When the memory sees the WSRqst, it turns it around as a WSRply with the same address and data, making sure that the shared bit in the reply is set to the value of the Shared line an appropriate number of cycles after the appearance of the WSRqst's header cycle. When the requestor sees the WSRply, it updates the word and also sets owner. Other caches that match on the WSRply update the word and clear owner. This guarantees that at most one copy of a block can ever have owner set. Owner may not be set at all, of course, if the block has not been written into since it was read from memory.
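The two store cases above can be sketched as follows; the function names are ours, and a cache is modeled as a plain dict from address to block state:

```python
# Sketch of the store path described above (names are illustrative).
# A cache is a dict: address -> {"data": [...], "shared": b, "owner": b}.

def store(cache, address, word, value, send_ws_rqst):
    """Processor store: local if not shared, else a WSRqst goes out and
    the update is deferred until the matching WSRply arrives."""
    block = cache[address]
    if not block["shared"]:
        block["data"][word] = value   # update right away
        block["owner"] = True         # this processor wrote the block last
    else:
        send_ws_rqst(address, word, value)

def on_ws_rply(cache, address, word, value, i_requested_it):
    """Every matching cache updates the word; only the requestor sets
    owner, the rest clear it, so at most one copy ever has owner set."""
    block = cache.get(address)
    if block is None:
        return
    block["data"][word] = value
    block["owner"] = i_requested_it
```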
When an RBRqst appears on the bus, two distinct cases are possible. Either some cache has owner set for the block or none has. In the first case the owner (and possibly other caches) assert Shared. The owner also asserts Owner, which prevents memory from responding, and then proceeds to supply the block via an RBRply. The second case breaks down into two subcases. In the first subcase no other cache has the block, Shared does not get asserted, and the block comes from memory. In the second subcase at least one other cache has the data, Shared does get asserted, but the block still comes from memory because no cache asserted Owner. Because the bus is packet switched, it is possible for ownership of a block to change between the request and its reply. Suppose for instance that a cache does an RBRqst at a time when memory was owner of the block, and before memory could reply, some other cache issues a WSRqst which generates a WSRply which in turn makes the issuing cache the owner. Since Owner wasn't asserted for the RBRqst, memory still believes it is owner, so it responds with the RBRply. To avoid taking this stale data, the cache that did the RBRqst uses pendingState to either compute the correct value of the data or to retry the ReadBlock when the RBRply is received. Dangerous transactions for a pending ReadBlock are the ones that modify data: WSRply, CWSRply, and WBRqst.
It is interesting to note that in the above protocol the Shared and Owner lines are output only for caches and input only for memory. This is because the caches never need the value on the Owner line, and the value on the Shared line is provided in the reply packet so they don't need to look at the Shared line either.
In the discussion above, we did not say how the transactions CWS and WB work. We will do this now. CWS is identical in its manipulation of the Shared and Owner bits and the Shared and Owner lines to WS, so as far as consistency is concerned these transactions can be treated the same. WB, on the other hand, is identical to FB as far as memory is concerned. Caches ignore FB, but overwrite their data for a matching WBRqst and clear the owner bit for this block.
7.4 Protocol Description for Two Level Systems
Figure 23 illustrates a 2-level DynaBus-based system. A two-level system consists of a number of one-level systems called clusters connected by a main DynaBus that also carries the system's main memory. Each cluster contains a single large cache that connects the cluster to the main DynaBus, and a private DynaBus that connects the large cache to the small caches in the cluster. This private DynaBus is electrically and logically distinct from the DynaBuses of other clusters and from the main DynaBus. From the standpoint of a private DynaBus, its large cache looks identical to the main memory in a single-level system. From the standpoint of the main DynaBus, a large cache looks and behaves very much like a small cache in a single-level system. Further, the design of the bus and consistency protocols is such that a small cache cannot even discover whether it is in a one-level or a two-level system; the response from its environment is the same in either case. Thus, the behavior of a small cache in a two-level system is identical to what was described in the previous section.
The protocol requires the large cache to keep all of the state bits a small cache maintains, plus some additional ones. These additional bits are the existsBelow bits, kept one bit per block of the large cache. The existsBelow bit for a block is set only if some small cache in that cluster also has a copy of the block. This bit allows a large cache to filter packets that appear on the main bus and put only those packets on the private bus for which the existsBelow bit is set. Without such filtration, all of the traffic on the main bus would appear on every private bus, defeating the purpose of a two-level organization.
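The filtering role of the existsBelow bit can be sketched in a few lines; the helper name and data layout are ours:

```python
# Sketch of the existsBelow filter: a large cache puts a main-bus packet
# on its private bus only when some small cache below holds the addressed
# block, keeping the bulk of main-bus traffic off the private bus.

def filter_to_private_bus(exists_below, packet, private_bus):
    """exists_below maps block address -> bool (one bit per block)."""
    if exists_below.get(packet["address"], False):
        private_bus.append(packet)   # a small cache below has a copy
    # otherwise the packet is absorbed by the large cache
```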
We have already stated that the behavior of a small cache in a two-level system is identical to its behavior in a one-level system. We have also said that a large cache behaves like main memory at its private bus interface and a small cache at its main bus interface. What remains to be described is what the large cache does internally, and how packets on a private bus relate to those on the main bus and vice-versa.
When a large cache gets an RBRqst from its private bus, two cases are possible: either the block is there or it's not. If it's there, the cache simply returns the data via an RBRply, making sure that it sets the shared bit in the reply packet to the OR of the value on the bus and its current state in the cache (recall that in the single-level system main memory returned the value on the Shared line for this bit). If the block is not in the cache, the cache puts out an RBRqst on the main bus. When the RBRply comes back the cache updates itself with the new data and its shared bit and puts the RBRply on the private bus. When a large cache gets a WSRqst on its private bus, it checks to see if the shared bit for the block is set. If it is not set, then it updates the data, sets owner, and puts a WSRply (with shared set to the value of the Shared line at the appropriate time) on the private bus. If shared is set, then it puts out a WSRqst on the main bus. The memory responds some time later with a WSRply. At this time the large cache updates the word, sets the owner bit, and puts a WSRply on the private bus with shared set to one. When a large cache gets a FBRqst, it simply updates the block and sends back an FBRply.
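The RBRqst case above might look like the sketch below; `handle_private_rb_rqst` and `fetch_from_main_bus` are hypothetical names, and the OR of the Shared line with the cache's own state mirrors what main memory does in the single-level system:

```python
# Sketch of a large cache servicing an RBRqst from its private bus.
# Hit -> reply directly; miss -> fetch the block over the main bus first.

def handle_private_rb_rqst(large_cache, address, shared_line, fetch_from_main_bus):
    block = large_cache.get(address)
    if block is None:
        # Miss: issue an RBRqst on the main bus; the RBRply carries the
        # data and the shared bit computed at the main-bus level.
        data, shared = fetch_from_main_bus(address)
        block = {"data": data, "shared": shared}
        large_cache[address] = block
    # The reply's shared bit is the OR of the private-bus Shared line
    # and the large cache's own state for the block.
    return {"data": block["data"], "shared": shared_line or block["shared"]}
```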
When a large cache gets an RBRqst on its main bus, it matches the address to see if it has the block. If there is a match and owner is set, then it responds with the data. However, there are two cases. If existsBelow is set, then the data must be retrieved from the private bus by issuing an RBRqst there. Otherwise the copy of the block it has is current, and it can return it directly. When a large cache gets a WSRqst on the main bus, it matches the address to see if the block is there and asserts Shared as usual, but takes no other action. When the WSRply comes by, however, and there is a match, it updates the data it has. In addition, if the existsBelow bit for that block happens to be set, it also puts the WSRply on the private bus. Note that this WSRply appears out of the blue on the private bus; that is, it has no corresponding request packet. This is another reason why the number of reply packets on a bus may exceed the number of request packets.
8. Atomic Operations
The DynaBus provides an atomic read-modify-write transaction called ConditionalWriteSingle. Typical implementations of read-modify-write operations in multiprocessors require locking the bus or locking memory locations. It is impractical to lock the DynaBus because it is packet switched. On the other hand, memory locks require performance compromises because it is impractical to have one lock per memory location, and the alternative imposes unnecessary conflicts. ConditionalWriteSingle, as defined on the DynaBus, neither requires bus locking nor uses memory locks. It is implemented directly by caches in somewhat the same way that the FetchAdd primitive is implemented by the processor to memory switching network in the NYU ULTRACOMPUTER.
The semantics of ConditionalWriteSingle are precisely those of the IBM CompareAndSwap. It takes three arguments, an address, an old value, and a new value:
ConditionalWriteSingle[address, oldval, newval] Returns[sample] =
{<begin critical section>
sample ← address^;
IF sample=oldval THEN address ← newval
<end critical section>
}
ConditionalWriteSingle is integrated completely into the data consistency scheme. Thus, when a processor initiates a ConditionalWriteSingle on non-shared data, the entire operation is performed locally by the cache and no bus traffic is generated. If the data is shared, then the cache initiates a ConditionalWriteSingle transaction on the bus, and the read-modify-write is done on all cached copies exactly as with a normal processor write. Figure 24 illustrates this, showing an example where the ownership of the data changes with the ConditionalWriteSingleReply.
Figure 24: A ConditionalWriteSingle. Because the data is shared, traffic is generated on the DynaBus.
Direct implementation of ConditionalWriteSingle by caches has several key advantages. First, it allows the maximum possible concurrency for read-modify-writes to a particular location; in fact, the bus may be saturated completely with ConditionalWriteSingles to one location. Second, the cost of a single read-modify-write as seen by a processor is small, especially when the location is not shared. And third, it is easy to show that this scheme functions correctly, because the proof of correctness is identical to that for WriteSingle.
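Since CWS has CompareAndSwap semantics, the usual lock-free idioms apply directly. The sketch below restates the pseudocode of Section 8 in runnable form over a dict that stands in for memory, and uses it to build an atomic increment by retry; both helpers are illustrative, not part of the bus specification:

```python
def conditional_write_single(memory, address, oldval, newval):
    """CompareAndSwap semantics of CWS: sample the location and store
    newval only if the sample equals oldval; return the sample."""
    sample = memory[address]           # <begin critical section>
    if sample == oldval:
        memory[address] = newval
    return sample                      # <end critical section>

def atomic_increment(memory, address):
    """Classic retry loop built on compare-and-swap."""
    while True:
        old = memory[address]
        if conditional_write_single(memory, address, old, old + 1) == old:
            return old + 1
```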
9. Input Output
All interactions with IO devices fall into one of two categories: control or data transfer. Control interactions are used to initiate IO and to determine whether an earlier request has completed. Data transfer interactions are used to move data to and from the memory system. In most applications, the bandwidth requirements of control interactions are small compared to those of data transfer, so the transport efficiency of data transfer is important while that of control is not. When an IO device requires a low rate of data transfer, control interactions can also be used to transfer data.
9.1 Control
All control interactions are carried out through the use of IORead, IOWrite and BIOWrite transactions directed to a common, unmapped 32-bit IO address space. This address space is common in the sense that all processors see the same space, and it is unmapped in the sense that addresses issued by processors are the ones seen by the IO devices. Generally, each type of IO device is allocated a unique, contiguous chunk of IO space at system design time, and the device responds only if an IORead, IOWrite, or BIOWrite is directed to its chunk. The term IO device is used here not just for real IO devices, but for any device (such as a cache) that responds to a portion of the IO address space.
9.1.1 Structure of IO Address Space
An IO address consists of three fields: a device type DeviceType, which is different for each type of device (e.g., Cache, IOP, Map Cache); a device number DeviceNumber, which is different for each instance of a given type; and a DeviceOffset that is the address of an IO location within a particular device instance. Having an explicit concept of device type is convenient because it allows us to address all devices of a given type via the broadcast operation BIOWrite. The sizes of these three fields depend on the address space requirements of the device. Devices with modest address space requirements, termed small devices, are given a DeviceOffset of 10 bits, providing 2^10 contiguous addresses. Devices with somewhat greater address space requirements are given a DeviceOffset of 16 bits, resulting in 2^16 contiguous addresses. Devices with large address space requirements are given a DeviceOffset of 24 bits, resulting in 2^24 contiguous addresses. Although IO addresses are currently 32 bits wide, future implementations might extend them to as many as 47 bits. Figure 25 illustrates IOAddress, showing the size of the DeviceType, DeviceNumber, and DeviceOffset fields for small, medium, and large devices.
Figure 25: Format of IO Address
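The field layout can be made concrete with a small decoder; the field widths come from Appendix I, while the helper itself is our own illustration:

```python
# Decoding a 32-bit IO address into its three fields, using the widths
# given in Appendix I:
#   small:  12-bit DeviceType | 10-bit DeviceNumber | 10-bit DeviceOffset
#   medium:  8-bit DeviceType |  8-bit DeviceNumber | 16-bit DeviceOffset
#   large:   4-bit DeviceType |  4-bit DeviceNumber | 24-bit DeviceOffset

FIELD_WIDTHS = {
    "small":  (12, 10, 10),
    "medium": (8, 8, 16),
    "large":  (4, 4, 24),
}

def decode_io_address(addr, size):
    type_bits, num_bits, off_bits = FIELD_WIDTHS[size]
    offset = addr & ((1 << off_bits) - 1)
    number = (addr >> off_bits) & ((1 << num_bits) - 1)
    dev_type = (addr >> (off_bits + num_bits)) & ((1 << type_bits) - 1)
    return dev_type, number, offset
```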
9.1.2 IOWrite
The IOWrite transaction is used to set up IO transfers and to start IO. The address cycle of the request packet carries an IO address, while the data cycle carries 32 bits of data whose interpretation depends upon the IO address. For block transfer devices, a processor typically will do a number of IOWrites to set up the transfer, and then a final IOWrite to initiate the transfer.
An IOWrite starts out at a small cache as an IOWRqst packet. The large cache of the cluster puts the IOWRqst on the main DynaBus, where it is picked up by all the other large caches. These caches put the IOWRqst on their private buses. Thus the IOWRqst is broadcast throughout the system. Broadcasting eliminates the need for requestors to know the location of devices in the hierarchy and makes for a simpler protocol. When the IOWRqst reaches the intended device, the device performs the requested operation and sends an IOWRply on its way. The IOWRply is broadcast in the same way as the IOWRqst, so it eventually makes its way to the requesting small cache. When the reply arrives, the small cache lets its processor proceed.
9.1.3 IORead
The IORead transaction reads 32 bits of data from an IO device. This data may either be status bits that are useful in controlling the device, or data being transferred directly from the device to the processor.
IOReads work the same as IOWrites: An IORead starts out at a small cache as an IORRqst packet. The large cache of the cluster puts the IORRqst onto the main DynaBus, where it is picked up by other large caches and put on the private buses. Once the intended IODevice receives the request, it reads the data from its registers and sends it on its way via an IORRply. The IORRply gets broadcast in exactly the same way as the IORRqst, and eventually makes its way to the cache that initiated the transaction. Note that for both IOReads and IOWrites exactly one device responds to a given IO address.
9.1.4 BIOWrite
BIOWrites are used in cases where a processor needs to have more than one IO device act upon a command without having to explicitly send multiple IOWrites (interprocessor interrupts and map updates are examples where BIOWrites are useful).
A BIOWrite starts out at a small cache as a BIOWRqst packet. The large cache of the cluster puts the BIOWRqst on the main bus. The memory then generates a BIOWRply with the same parameters as the BIOWRqst, and all large caches put this BIOWRply on their private DynaBuses. Thus the BIOWRply is broadcast throughout the system. When the BIOWRply reaches the requesting small cache, the cache lets its processor proceed. Note that the reply is not generated by the IO device, but by main memory. The reason is that there is no unique IO device that can generate the reply packet. It is important to point out that errors that occur during a BIOWrite may not be caught by the requesting device's time out mechanism. If one of the intended recipients of the BIOWrite is broken, for instance, the requestor won't get any indication. This is a fundamental problem with broadcast operations, however, and there is no simple solution.
9.2 Data Transfer
Devices connected to the DynaBus via a cache automatically participate in the data consistency algorithm. If performance were not a problem, all devices could be connected to the DynaBus this way, freeing designers of IO devices from having to build special chips to interface to the DynaBus. Unfortunately, this approach is insufficient for high speed input devices, which would cause a cache to needlessly transfer blocks from memory to cache each time the cache got a miss. The protocol therefore provides the WriteBlock transaction to write directly to memory without going through a cache. Of course, a high speed output device can use ReadBlocks to transfer data directly out of consistent memory without going through a cache.
10. Address Mapping
Figure 26 illustrates the architecture of the memory system. Note the following: Processors are connected to the DynaBus via a Processor Cache. A Map Table resides in Main Memory; it contains virtual-to-real page translations for all pages that are in Main Memory. The Map Cache is a cache containing the most frequently used subset of entries from the complete Map Table. It acts as a performance accelerator for the Processor Cache, enhancing the Processor Cache's ability to return data to the processor by translating virtual pages to real pages when the Cache has insufficient information to perform this translation. (There is also a complete table that translates from virtual pages to disk pages and resides on the disk. All virtual pages are guaranteed to be included in this table.)
Figure 26: The Architecture of the Memory System.
10.1 The MapRequest/MapReply Transactions
Figure 27 illustrates the MapRequest/MapReply transactions. The MapRequest is used to perform virtual to real page translation when the data requested by the Processor is not contained in the Processor Cache. The Processor Cache sends an address space identifier (aid) and virtual page number to the Map Cache. The Map Cache uses the MapReply transaction to return information to the Processor Cache. If the Map Cache contains the requested entry, it returns the corresponding real page and 4 flags: Dirty, KWtEnable (kernel write enable), UWtEnable (user write enable), and URdEnable (user read enable). The Processor Cache then issues a ReadBlockRequest using the real address which it constructs from the real page. Main Memory returns the data. (See transactions 1a and 1b of Figure 27.) If the Map Cache does not contain the requested entry, it uses the MapReply to return the MapFault FaultCode. This fault initiates a software trap to load the requested entry from the complete Map Table. (See transactions 2a and 2b of Figure 27.)
Figure 27: The MapRequest transaction. If the Map Cache contains the requested virtual to real page translation, the Processor Cache constructs the real address by concatenating the real page and the offset. This address is used in a ReadBlockRequest. If the Map Cache does not contain the virtual page, a software trap is initiated by the Processor to load the requested entry from the complete Map Table in Main Memory.
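The Map Cache's side of this exchange can be sketched as follows; the function and key names are ours, and the MapFault code is the MajorCode value given in Appendix III:

```python
# Sketch of the Map Cache servicing a MapRequest: a hit returns the real
# page and the four flags; a miss returns the MapFault code so that the
# Processor Cache can trap to software to load the entry.

MAP_FAULT = 0b010   # MajorCode for "map cache miss" (see Appendix III)

def map_request(map_cache, aid, virtual_page):
    entry = map_cache.get((aid, virtual_page))
    if entry is None:
        return {"fault": MAP_FAULT}        # cases 2a/2b: miss -> trap
    real_page, dirty, kwt, uwt, urd = entry
    return {"realPage": real_page,         # cases 1a/1b: hit
            "Dirty": dirty, "KWtEnable": kwt,
            "UWtEnable": uwt, "URdEnable": urd}
```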
10.2 DeMapRequest/DeMapReply
The DeMap transaction is used to invalidate virtual to real page translations. Whenever the mapping information for a page must be modified, the DeMap transaction is performed to invalidate all cached copies of the mapping entry in the processor caches. Before the DeMap is initiated system software must:
1. Delete the mapping entry from the Map Table.
2. Delete the mapping entry from the Map Cache.
Then, a processor initiates the DeMapRequest[realPage] transaction. Main Memory responds with the DeMapReply[realPage] transaction. This transaction marks the valid bits for the corresponding virtualPages false in all of the caches (Figure 28).
Figure 28: The DeMap Transaction. Memory marks the valid bits for the corresponding virtual pages false in all cached copies.
11. Error Detection and Reporting
The DynaBus specifies two aspects of dealing with errors: detection and reporting. Each device is expected to provide its own facilities for detecting errors, regardless of whether the errors are internal to the device or result from interactions with other devices. The bus provides parity to help check transport errors. Once an error is detected, a device must decide if it can handle the error on its own or needs to report the error to some other party. Errors that the device can handle on its own are uninteresting because the bus needs to provide no facilities. Errors that a device cannot handle are divided into recoverable errors and catastrophic ones, and the bus provides facilities to handle each kind.
11.1 Bus Parity
The DynaBus provides a single parity wire to check transport on the 64 Data wires. A device that sends a packet is expected to generate the parity bit, and all receiving devices are expected to check the parity bit. Whether a device considers a DynaBus parity error to be recoverable or catastrophic is not specified.
11.2 Time Outs
The DynaBus requires each device to implement a timeout facility to detect devices that do not respond, or devices that aren't there in the first place. Each device must maintain a counter that starts counting bus cycles when the device issues a request to the arbiter to send a request packet. If the system-wide constant maxWaitCycles cycles have elapsed before the corresponding reply packet is received, the device must assume that an error has occurred. Whether a device considers a DynaBus timeout to be recoverable or catastrophic is not specified.
The determination of a system-wide value for maxWaitCycles is tricky because of the wide variance in expected service times. For example, a low priority device might take a long time to just get bus grant, while a higher priority device would get grant relatively quickly. A low priority device might in fact be forced to wait arbitrarily long if a higher priority device decides to hog the bus. The question of whether this ought to be considered an error is debatable.
To avoid getting entangled in these issues, the bus specification simply specifies a system-wide lower bound on the limit maxWaitCycles and leaves it up to the device implementor to decide the exact value. Such a lower limit is needed to avoid generating frequent false alarms. A conservative lower limit can be arrived at by computing the worst-case service time for a cache request and increasing it by an order of magnitude for safety (caches are taken since they are the lowest priority devices that do not change their request priority). Assuming there are 8 caches and only one memory bank, the worst-case service time is at most

  8 * (cycles to service one request in an unloaded system) = 8 * 25 = 200 cycles.

Increasing this by an order of magnitude and rounding up to a power of two gives 2048 cycles, so each device is required to have maxWaitCycles ≥ 2048.
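The arithmetic behind this bound, spelled out (treating the rounding of 2000 up to the power of two 2048 as the spec's intent):

```python
# Worst-case service time for a cache request, per the derivation above.
caches = 8
cycles_per_request = 25                   # unloaded-system service time
worst_case = caches * cycles_per_request  # 200 cycles
safety = 10 * worst_case                  # order-of-magnitude margin: 2000

# Round up to the next power of two to get the required lower bound.
max_wait_cycles_lower_bound = 1 << (safety - 1).bit_length()
```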
11.3 Recoverable Errors
When a device encounters a recoverable error while servicing a request packet, it uses the DynaBus Mode/Fault bit in the reply packet to report the error. The least significant 32 bits of the first data word of the reply packet are set aside for the FaultCode (the format of FaultCode was described earlier, and appears again in Appendix III).
11.4 Catastrophic Errors
When a device encounters a catastrophic error it uses the DynaBus SStopOut signal to halt all DynaBus activity and signal the service processor. The service processor then uses the DBus to locate the failing device and take appropriate action.
Appendix I. DeviceType Encodings
Devices connected to the DynaBus are divided into 3 categories by size: small, medium and large. Small devices have a 12-bit DevType, ranging from 01H to 1FH, a 10-bit DevNum, and a 10-bit DevOffset. The following table describes the current allocation:
Type Device Comments
01 Cache Access to Cache registers
02 Display Printer/Controller Access to Display control registers
03 Memory Controller
04-1F free
Medium devices have an 8-bit DevType, ranging from 02H to 1FH, an 8-bit DevNum, and a 16-bit DevOffset. The following table describes the current allocation:
Type Device Comments
02 IO Bridge Access to PC/AT I/O space and IOB registers
03-1F free
Large devices have a 4-bit DevType, ranging from 02H to 0FH, a 4-bit DevNum, and a 24-bit DevOffset. The following table describes the current allocation:
Type Device Comments
02 IO Bridge Access to PC/AT memory space per byte
03 IO Bridge Access to PC/AT memory space per halfword/fullword
04 free
05 MapCache Access to Map Cache entries and control registers
Appendix II. DynaBus Command Field Encoding
The table below gives the encoding for the Command field within the header cycle of a DynaBus packet.
Transaction Name Abbreviation Encoding Length
ReadBlockRequest RBRqst 0000 0 2
ReadBlockReply RBRply 0000 1 5
WriteBlockRequest WBRqst 0001 0 5
WriteBlockReply WBRply 0001 1 2
WriteSingleRequest WSRqst 0010 0 2
WriteSingleReply WSRply 0010 1 2
ConditionalWriteSingleRequest CWSRqst 0011 0 2
ConditionalWriteSingleReply CWSRply 0011 1 5
FlushBlockRequest FBRqst 0100 0 5
FlushBlockReply FBRply 0100 1 2
Unused 0101 0 to 0111 1
IOReadRequest IORRqst 1000 0 2
IOReadReply IORRply 1000 1 2
IOWriteRequest IOWRqst 1001 0 2
IOWriteReply IOWRply 1001 1 2
BIOWriteRequest BIOWRqst 1010 0 2
BIOWriteReply BIOWRply 1010 1 2
MapRequest MapRqst 1110 0 2
MapReply MapRply 1110 1 2
DeMapRequest DeMapRqst 1111 0 2
DeMapReply DeMapRply 1111 1 2
Appendix III. Format of FaultCode
The error reporting mechanism on the DynaBus includes a Fault bit and 32 bits of information about the fault, FaultCode. This section defines the format of FaultCode.
FaultCode is divided into a 3-bit MajorCode, which appears in the low-order three bits, and 29 bits of MinorCode, which comprise the rest of the word. MajorCode divides all faults into 8 categories that are important to distinguish quickly, while MinorCode provides a way to encode a large number of infrequent subcases.
The encoding of MajorCode is as follows:
Encoding Name Meaning
000 MemAccessFault first write to page or insufficient privilege
001 IOAccessFault insufficient privilege to read or write IO location
010 MapFault map cache miss
011 AUFault arithmetic unit fault
100 DynaBusTimeOut transaction timeout on DynaBus
111 DynaBusOtherFault some other DynaBus fault reported via reply packet
The top 10 bits of MinorCode give the DynaBus DeviceID of the reporting device, while the remaining 19 bits indicate the fault. The encoding of these 19 bits is left up to the designers of individual devices.
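The layout above can be captured in a small decoder; the helper name is ours, but the field positions follow the text directly:

```python
# Unpacking FaultCode per the layout above: a 3-bit MajorCode in the
# low-order bits, then 29 bits of MinorCode whose top 10 bits are the
# DynaBus DeviceID of the reporting device.

def decode_fault_code(fault_code):
    major = fault_code & 0b111            # low-order 3 bits: MajorCode
    minor = fault_code >> 3               # remaining 29 bits: MinorCode
    device_id = (minor >> 19) & 0x3FF     # top 10 bits of MinorCode
    detail = minor & ((1 << 19) - 1)      # device-defined 19 bits
    return major, device_id, detail
```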