THE DYNABUS: A VLSI BUS FOR USE IN MULTI-PROCESSOR SYSTEMS
VERSION 1.0

The DynaBus
A VLSI Bus for use in Multiprocessor Systems

L. Bland, J.C. Cuenod, D. Curry, J.M. Frailong, J. Gasbarro, J. Gastinel, B. Gunning, J. Hoel, E. McCreight, M. Overton, E. Richley, M. Ross, and P. Sindhu

Dragon-88-08
Written 4 September 88
Revised 15 February 89

© Copyright 1986, 1987, 1988, 1989 Xerox Corporation. All rights reserved.

Abstract: The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data consistency, and VLSI implementation within the memory system of a shared memory multiprocessor. Each DynaBus transaction consists of a request packet followed an arbitrary time later by a reply packet, with the bus being free to be used by other transactions in the interim. Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows arbitrarily slow devices to be connected without degrading performance, and simplifies data consistency in systems with multiple levels of caching. The bus provides a usable bandwidth of several hundred megabytes per second, permitting the construction of machines executing several hundred MIPS while providing high IO throughput. An efficient protocol ensures that multiple copies of read/write data within processor caches are kept consistent and that IO devices stream data into and out of a consistent view of memory. Both the physical structure of the DynaBus and its protocol are designed specifically to allow a high level of system integration. Complex functions such as memory and graphics controllers that traditionally required entire boards can be implemented in a single VLSI chip that is directly connected to the DynaBus.
Keywords: VLSI DynaBus, Backpanel DynaBus, pipelined bus, timing, arbitration, DynaBus transactions, write-back cache, snoopy cache, data consistency, DynaBus signals, memory interconnect, multiprocessor bus, packet switched bus.

FileName: [Dragon]Documentation>DynaBus>DynaBusDoc.tioga, .ip

Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304

Xerox Private Data

Contents
1. Overview
2. Definition of Terms
3. Interconnection Schemes
4. Chip Level Signals
5. Arbitration and Flow Control
6. Transactions
7. Data Consistency
8. Atomic Operations
9. Input Output
10. Address Mapping
11. Error Detection and Reporting
Appendix I. DynaBus Command Field Encoding

1. Overview

The DynaBus is a synchronous, packet switched bus designed to address the requirements of high bandwidth, data consistency, and VLSI implementation within the memory system of a shared memory multiprocessor. Each DynaBus transaction consists of a request packet followed an arbitrary time later by a reply packet, with the bus being free to be used by other transactions in the interim. Besides making more efficient use of the bus, such packet switching enables the use of interleaved memory, allows arbitrarily slow devices to be connected without degrading performance, and simplifies data consistency in systems with multiple levels of caching. The bus provides a usable bandwidth of many hundreds of megabytes per second, permitting the construction of machines spanning a wide range of cost and performance.

Because DynaBus is intended for use in high performance shared memory multiprocessors, there is an efficient protocol for ensuring that processors see a consistent view of memory in the face of caching and IO. With this protocol the hardware ensures that multiple copies of read/write data in caches are consistent, and that both input and output devices are able to take cached data into account.
In addition, the consistency protocol provides a model of shared memory that is both conceptually simple and natural.

The DynaBus's physical structure and its protocol are designed to promote a high level of system integration. Complex devices that traditionally required entire boards, including memory controllers, graphics controllers, high speed network controllers, and external bus controllers, can be implemented using a single chip connected directly to the DynaBus. The result is a high performance but compact system.

Within a computer, the DynaBus may be used both as a VLSI interconnect to tie chips together on a single board and as a backplane bus to tie boards together over a backpanel. Figure 1 shows an application with two boards.

Figure 1: The DynaBus is a VLSI interconnection system. Its efficient, compact design promotes a high level of integration.

Key to Abbreviations in Figure 1:
C: Cache
Arb: Arbiter
IOB: Input/Output Bridge
P: Processor
D/P: Display
MC: Memory Controller
MAP: Map Cache
Mem: Memory

The DynaBus design is flexible enough to allow its use in a wide variety of configurations. For example, in Figure 1, DynaBus A may be connected to DynaBuses B and C in two quite different ways. In the first, the buses are connected by pipeline registers. Here there is logically one DynaBus but three electrically separate bus segments, and all traffic on one segment is propagated to the others. In the second, the buses are connected by second level caches. Here there are three logically distinct DynaBuses, and traffic from one bus may or may not go to the others. Another configuration, not shown in the figure, is to use multiple DynaBuses operating independently and in parallel with one another to provide very high bandwidths.

The DynaBus has 80 signals, 64 of which consist of a multiplexed data/address path (Data, Figure 2).
HeaderCycle indicates whether the information carried by Data is a packet header or not, while DParity is parity computed over Data and HeaderCycle. Shared and Owner are signals used for data consistency. RequestOut, Grant, and LongGrant constitute the interface to the Dynabus arbiter. AParity provides a single bit parity check over the consistency and arbitration wires. The clock signal Clock provides global timing, while ClockOut allows the skew of Clock to be controlled. At the pins of a package that interfaces to DynaBus, the Data port signals can be provided optionally with inputs and outputs separated for added flexibility in building high performance pipelined bus configurations. The pin BidEn allows a given die to be used in either the bidirectional mode, or the higher performance unidirectional mode. << [Artwork node; type 'Artwork on' to command tool] >> Figure 2: Chip Level DynaBus Signals. The DynaBus's operation can be understood best in terms of three layers: cycles, packets, and transactions (these layers correspond to the electrical, logical, and functional levels, respectively). A bus cycle is simply one complete period of the bus clock; it forms the unit of time and information transfer on the bus; the information is typically either an address or data. A packet is a contiguous sequence of cycles; it is the mechanism by which one-way logical information transfer occurs on the bus. The first cycle of a packet carries address and control information; subsequent cycles typically carry data. There are two different packet sizes: 2 cycles and 9 cycles. A transaction consists of a request packet and a corresponding reply packet that together perform some logical function (such as a memory read). Each DynaBus has an arbiter that permits the bus to be multiplexed amongst contending devices, which are identified by a unique deviceId. Before a transaction can begin, the requesting device must get bus mastership from the arbiter. 
Once it has the bus, the device puts its request packet on the bus one cycle at a time, and then waits for the reply packet. Packet transmission is uninterruptible in that no other device can take the bus away during this time, regardless of its priority. The transaction is complete when another device gets bus mastership and sends a reply packet. Request and reply packets may be separated by an arbitrary number of cycles, provided timeout limits are not exceeded (see Section 11.2). In the interval between request and reply, the bus is free to be used by other devices. The arbiter is able to grant requests in such a way that no cycles are lost between successive packets.

A request packet contains at least the transaction type, the requester's deviceId, a small number of control bits, and an address; it may contain additional transaction dependent information. The reply packet contains the same transaction type, the original requester's deviceId, the original address, some control bits, and transaction dependent data. This replication of type, deviceId, and address information allows request and reply packets to be paired unambiguously. Normally, the protocol ensures a one-to-one correspondence between request packets and reply packets; however, because of errors, some request packets may not get a reply. Devices therefore must not depend on the number of request and reply packets being equal, since this invariant will not in general be maintained.

The protocol requires devices to provide a simple but crucial guarantee: they must service request packets in arrival order. This guarantee forms the basis for the DynaBus's data consistency scheme.

The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, input output, and address mapping. The ReadBlock transaction allows a device to read a block of data from memory or another cache.
WriteBlock allows new data to be introduced into the memory system (for example disk reads). FlushBlock allows caches to write back dirty data to memory. KillBlock allows a block to be removed from all but one of the caches. WriteSingle is a short transaction used by caches to update multiple copies of shared data without affecting main memory. IOReadSingle and IOWriteSingle initiate and check IO operations, while IOReadBlock and IOWriteBlock allow block transfer of data between IO devices, completely bypassing the consistency mechanism. The Map and DeMap transactions permit the implementation of high speed address mapping in a multiple address space environment. Finally, the Interrupt transaction provides the mechanism for signalling interrupts to processors. The encoding space leaves room for defining five other transactions.

The DynaBus has a maximum data transport efficiency of 8/11, or 73%. In other words, at least 3/11 of the overall bandwidth of the bus is consumed by protocol overhead such as deviceID, address, and transaction type. This number derives from the fact that in all of the block transfer transactions 8 cycles of data are transported for a total of 11 cycles. For example, the request packet for a ReadBlock transaction is 2 cycles while the reply is 9 cycles, of which 8 are data. In typical applications, most of the transactions on the bus are block transfer transactions, so the 73% efficiency is, in fact, close to what one would actually obtain.

2. Definition of Terms

This section defines commonly used terms within this document. Definitions appear in bold and uses appear in italics.

arbiter: an entity that allows multiple devices contending for the same DynaBus to use the bus in a time multiplexed fashion.

alignment: an n-bit quantity is aligned within a container if the quantity is located starting at a position that is a multiple of n. This assumes big-endian numbering (see below).

BIC: bus interface chip.
A chip containing two pipeline registers, one input and one output, used to connect two DynaBus segments.

big-endian numbering: a numbering system for data where the most significant unit (bit, byte, halfWord, word, doubleWord, or block) within a container is placed leftmost and numbered 0. The DynaBus uses big-endian numbering (Figure 3).

Figure 3: Big-endian numbering as it is used on the DynaBus.

block: 512 bits of data. Within the real and IO address spaces block data is always aligned.

bus: a collection of one or more bus segments connected by pipeline registers.

bus segment: the portion of a bus that is traversed in one clock period.

byte: 8 bits of data. Within the real and IO address spaces byte data is always aligned.

cycle: one complete period of the DynaBus clock. It is the unit of time and information transfer on the DynaBus. Generally, a cycle is 25 ns, and carries one doubleWord (64 bits) of data.

DBus: a serial bus used for system initialization, testing, and debugging.

device: an entity that can arbitrate for the bus and place packets on it.

deviceID: a 10-bit unique identifier for DynaBus devices. This number is loaded into a device over the DBus during system initialization.

doubleWord: 64 bits of data. Within the real and IO address spaces doubleWord data is always aligned. The DynaBus transfers one doubleWord every cycle.

halfWord: 16 bits of data. Within the real and IO address spaces halfWord data is always aligned.

header: the first cycle of a packet. This cycle contains address and control information.

hold: a state in which the arbiter grants requests for reply packets but does not grant requests for request packets.

IO address: a 37-bit quantity used to address IO devices. An IO address consists of the address of an aligned doubleWord in IO address space concatenated with a 4-bit single specifier that identifies a single (aligned byte, halfWord, word, or doubleWord) within the doubleWord.
An IO address may also be used to specify a block, in which case the single specifier identifies a single within the target block. When the block is transported over the DynaBus, this single is sent first, with the remaining doubleWords sent in cyclic order.

Figure 4: Format of an IO Address

IO address space: the set of all IO addresses.

IOBridge: a chip that allows the DynaBus to be connected to an industry standard bus.

MapCache: a device that provides virtual to real address translation on the DynaBus.

master: a device that has been granted the DynaBus.

module: a unit of packaging intermediate between a chip package and a board.

packet: a contiguous sequence of bus cycles. The DynaBus supports packets of length 2 and 9.

packet switched: a dissociation between the request and reply packets of a transaction to allow the bus to be used for other transactions between request and reply. Same as split transaction.

packet type: a 5-bit field in the header cycle indicating one of 32 possible kinds of packet.

real address: a 37-bit quantity used to address real memory. An addressed location may reside both in main memory and in caches. A real address consists of the address of an aligned doubleWord in real address space concatenated with a 4-bit single specifier that identifies a single (aligned byte, halfWord, word, or doubleWord) within the doubleWord. A real address may also be used to specify a block, in which case the single specifier identifies a single within the target block. When the block is transported over the DynaBus, this single is sent first, with the remaining doubleWords being sent in cyclic order.

Figure 5: Format of a Real Address

real address space: the set of all real addresses.

requester: the device that sends the request packet of a transaction.

responder: the device that sends the reply packet of a transaction.
single: a byte, halfWord, word or doubleWord of data. Within the real and IO address spaces a single is always aligned. A single is transported on the DynaBus in the same relative position within a 64-bit cycle as the position the single occupies within its containing doubleWord in the real or IO address spaces. Non-significant bits of the cycle containing the single are undefined.

slave: a device that is listening to an incoming packet on the DynaBus.

snoopy cache: a two port cache that watches transactions on the DynaBus port to maintain a consistent view of data as seen from the processor port.

split transaction: a dissociation between the request and reply packets of a transaction to allow the bus to be used for other transactions between request and reply. Same as packet switched.

transaction: a pair of packets, the first a request and the second a reply, that together perform some logical function.

virtual address: a 32-bit quantity used by a processor to address memory.

virtual address space: the set of all virtual addresses.

word: 32 bits of data. Within the real and IO address spaces word data is always aligned.

write-back cache: a cache that updates cached data upon a processor write without immediately updating main memory.

write-through cache: a cache that does a write on the bus side for each write it receives from its processor side.

3. Interconnection Schemes

A unique aspect of DynaBus is that it can be used as an interconnection component in machines spanning a wide range of cost and performance. At the low end are low cost single board systems of up to a few hundred MIPS, while at the high end are more expensive multi-board systems capable of approaching 1 GIPS and sustaining high IO throughput. However, in all these systems, the logical and much of the electrical specification of the bus stays the same. This allows the same chip set to be employed across an entire family of machines and results in economies of scale not permitted by other buses.
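As a rough check on these performance ranges, the figures given earlier (a nominal 25 ns cycle carrying one 64-bit doubleWord, and the 8/11 maximum transport efficiency of block transfers) can be combined into a bandwidth estimate. The sketch below is illustrative only: the function and constant names are assumptions made for this example, and an actual system may run at a different clock rate.

```python
# Rough DynaBus bandwidth estimate using figures quoted in this document.
# All names here are hypothetical and chosen only for this illustration.

CYCLE_NS = 25          # nominal cycle time in nanoseconds (Section 2, "cycle")
BYTES_PER_CYCLE = 8    # one 64-bit doubleWord is transferred per cycle
DATA_CYCLES = 8        # data cycles in a block transfer transaction
TOTAL_CYCLES = 2 + 9   # 2-cycle request packet plus 9-cycle reply packet


def raw_bandwidth_mb_s(cycle_ns=CYCLE_NS, bytes_per_cycle=BYTES_PER_CYCLE):
    """Peak rate in MB/s (1 MB = 10**6 bytes): bytes per ns times 1000."""
    return bytes_per_cycle * 1000 / cycle_ns


def usable_bandwidth_mb_s(cycle_ns=CYCLE_NS):
    """Peak rate scaled by the 8/11 block-transfer transport efficiency."""
    return raw_bandwidth_mb_s(cycle_ns) * DATA_CYCLES / TOTAL_CYCLES


if __name__ == "__main__":
    print(f"raw:    {raw_bandwidth_mb_s():.0f} MB/s")
    print(f"usable: {usable_bandwidth_mb_s():.0f} MB/s")
```

With the nominal 25 ns cycle this works out to 320 MB/s raw and roughly 233 MB/s usable, which is of the same order as the bandwidth quoted in the abstract.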
3.1 Low to Medium Performance Systems Low performance systems typically cannot afford high pin count packages because of increased package cost and the need for more expensive high density interconnection on board. With the bidirectional option, the DynaBus requires just 80 pins per package, providing an attractive solution for low end systems. With the DynaBus confined to a single board, it is possible to build a high performance, compact 64-bit bus consisting of just one segment (Figure 6). Each DynaBus chip has an input and an output register connected to the bidirectional data port. These registers make a shorter cycle time possible, eliminating any computation (decoding, gating) during the transmission of data between chips. << [Artwork node; type 'Artwork on' to command tool] >> Figure 6: A Single-Board System contains only one bus segment. A special pin allows the input and output pins of a DynaBus chip to be connected resulting in a bidirectional interface with only 80 wires. Low cost midrange systems can also be built using a non-pipelined bidirectional DynaBus that spans multiple boards. Each board would have bidirectional buffers at its interface, much like VME or FUTUREBUS (see Figure 7 left). Such an implementation of Dynabus would not cycle as fast as a single board version or a pipelined version, but it would nonetheless provide an attractive low cost multi-board alternative. 3.2 High Performance Pipelined Systems One of the most interesting features of DynaBus is that it allows pipelining: a single DynaBus can be broken up into multiple bus segments separated by pipeline registers. These registers are placed at the input and output of each chip, module and board connecting to a DynaBus. During one clock cycle a signal starts out in one pipeline register, traverses one bus segment, and ends up in another pipeline register. 
The principal advantage of such pipelining is that the signal transit times on carefully designed short bus segments are a fraction of those on a single long segment whose length is the sum of the shorter segments. Small signal transit times in turn mean that the bus can be operated at a higher frequency and therefore deliver more bandwidth. << [Artwork node; type 'Artwork on' to command tool] >> Figure 7: In a nonpipelined system the segment transit time (the clock period lower bound) is T = T1 + T2 + T3. In a pipelined version the segment transit time is MAX[T1, T2, T3], or about T/3 if comparable transit times for the backpanel and the board are assumed. Thus the bandwidth of the pipelined version is up to three times higher. (This is an upper bound, as the additional setup and hold times will decrease the speed.) Figure 8 illustrates a low cost multi-board system in which all three segments of the Dynabus are bidirectional. In this configuration, the on-board Dynabuses require 66 fewer wires than the unidirectional configuration, resulting in lower cost packaging. The price to pay for this decreased cost is lower performance. In all of the other examples in this section, the Dynabus can be utilized fully, while in this configuration it cannot because incoming packets would collide with outgoing ones on the on-board bidirectional buses. Two cycles are lost for each packet transferred, so that the transport efficiency here is around 73% of the transport efficiency of a fully utilizable configuration. For a detailed explanation of these lost cycles, or "bubbles" see Section 5.3. << [Artwork node; type 'Artwork on' to command tool] >> Figure 8: A Low-Cost Multi-Board System with 3 Dynabus segments. Figure 9 illustrates a multi-board system with three DynaBus segments. The Backpanel is the only bidirectional segment. The boards have two unidirectional input and output buses. 
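The arithmetic behind Figures 7 and 8 can be sketched as follows. The transit times used in the usage note are made-up values, the function names are hypothetical, and the lost-cycle calculation is only one reading that reproduces the quoted 73% figure: it assumes a block transfer transaction of 2 + 9 cycles that loses two cycles for each of its two packets.

```python
# Sketch of the pipelining and lost-cycle arithmetic described above.
# Function names and sample values are assumptions for illustration only.

def nonpipelined_period(transit_times):
    """Clock period lower bound for one long segment: T = T1 + T2 + T3."""
    return sum(transit_times)


def pipelined_period(transit_times):
    """Clock period lower bound with pipeline registers: MAX[T1, T2, T3]."""
    return max(transit_times)


def pipelining_speedup(transit_times):
    """Upper bound on the bandwidth gain; ignores setup and hold times."""
    return nonpipelined_period(transit_times) / pipelined_period(transit_times)


def relative_efficiency(packet_lengths, lost_per_packet=2):
    """Useful cycles divided by useful plus lost cycles.

    Assumes each packet costs `lost_per_packet` extra bus cycles, as in
    the all-bidirectional configuration of Figure 8.
    """
    useful = sum(packet_lengths)
    lost = lost_per_packet * len(packet_lengths)
    return useful / (useful + lost)


if __name__ == "__main__":
    print(pipelining_speedup([10, 10, 10]))        # equal transit times
    print(round(relative_efficiency([2, 9]), 3))   # request + reply packets
```

With three equal transit times the speedup bound is 3, matching the "about T/3" estimate, and a 2-cycle request plus a 9-cycle reply gives 11/15, or about 73%.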
<< [Artwork node; type 'Artwork on' to command tool] >> Figure 9: A Multi-Board System with 3 Dynabus segments. Finally, Figure 10 illustrates a multi-module multi-board system where the DynaBus has 5 pipelined segments. << [Artwork node; type 'Artwork on' to command tool] >> Figure 10: A Multi-Board Multi-Module System with 5 Dynabus segments. In all these configurations care must be taken in the physical layout of bus segments to minimize reflections in order to increase the clock rate. Additionally, great care must be taken in distributing the clock to reduce skew. The DynaBus uses balanced transmission lines for bus segments and a special clock distribution scheme that minimizes clock skew. 4. Chip Level Signals The signals comprising a DynaBus interface for a chip are divided into five groups: Control, Arbitration, Consistency, Data, and optionally DataIn. Control contains input and output versions of the Clock, and a BidEn pin that is used to either tie the Data and DataIn groups together or allow them to be used separately. The Arbitration group provides the signals used by the chip to request the bus and also the signals used by the arbiter to grant the bus. The consistency group contains input and output versions of Shared and Owner. Data provides a bidirectional (or optionally a unidirectional output) path for 64 bits of data, header information, and parity. Finally, the optional group DataIn provides a unidirectional input path for signals in the Data group when that group is being used in unidirectional output mode. << [Artwork node; type 'Artwork on' to command tool] >> Figure 11: The DynaBus Signals. 4.1 Control Signals Clock This input signal provides the global timing signal for the DynaBus. ClockOut This output signal provides an internal, loaded version of the Clock that is used to deskew Clock. BidEn This signal is used to place the Data signals in the optional unidirectional mode. 
When BidEn is asserted, the Data signals function in a unidirectional output mode, and the DataIn signals are used in a unidirectional input mode. When BidEn is deasserted, the Data signals are used in a bidirectional mode, and the DataIn signals are not used. This feature can be used to reduce the number of DynaBus pins either for building low end systems or to simplify chip testing.

4.2 Arbitration Signals

LongGrant: LongGrant is defined one cycle before the first cycle of a grant, and at other times its value is undefined. It is asserted if the arbiter is responding to a request for a long packet (9-cycle) and deasserted if it is responding to a request for a short packet (2-cycle).

Grant: Grant is asserted by the arbiter once for each bus cycle that has been granted to a requesting device. The duration of Grant is 2 or 9 cycles, depending on the length of the packet.

RequestOut[0..2]: The RequestOut wires are used by a device to signal its Arbiter that it wants the bus. A device uses the RequestOut wires for either one cycle or two consecutive cycles. The first cycle always communicates the priority of a request. For some requests, the device uses the second cycle to communicate a length (the number of cycles for which it wants the bus) and a color that is used by the arbiter to provide fair service. The encoding for the two cycles is as follows:

First Cycle
7: Stop Arbitration
6: Reply High
5: Reply Low
4: Hold
3: Request High
2: Request Normal
1: Request Low
0: NoOp

Second Cycle: xCL
C: Color
L: Packet Length (0=>2 cycles, 1=>9 cycles)

For priorities corresponding to Stop, Hold, and NoOp, a request consists of one cycle, while for the remainder a request consists of two cycles.

4.3 Consistency Signals

OwnerOut: OwnerOut is asserted by a cache when it is the owner of the address specified in a ReadBlockRequest. The OwnerOut signal is needed because the memory system uses write-back caches.
When the main memory copy of a block is stale, OwnerOut signals the memory not to respond to a ReadBlockRequest because the owning cache will respond instead.

SharedOut: SharedOut is asserted by a cache to indicate that it holds a cached copy of the data whose address appears on the DynaBus. When a cache initiates a WriteSingle, ReadBlock or KillBlock, all caches that contain the datum except the one that initiated the transaction assert SharedOut.

OwnerIn: OwnerIn is the logical OR of the OwnerOut wires of all caches. It is used by the Memory Controller to determine if memory should respond to a ReadBlockRequest. If the value of the Memory Controller's OwnerIn wire is TRUE, memory does not respond because one of the caches owns the datum and will issue the reply.

SharedIn: The SharedIn wire is used to compute the value of the Shared flag for a cache that initiates a WriteSingle, ReadBlock, or KillBlock. This wire is the logical OR of the SharedOut wires of all the caches. When a cache initiates one of the above transactions, all caches that contain the datum except the one that initiated the transaction assert SharedOut. The Memory Controller receives the logical OR of all the caches' SharedOut wires as SharedIn and reflects this value in its reply to the transaction. If none of the caches asserted SharedOut, the Memory Controller's reply indicates that the datum is no longer shared. The cache that initiated the transaction then sets its Shared flag to false.

4.4 AParity Signals

AParityOut: This wire carries single bit parity computed over the signals RequestOut, OwnerOut, and SharedOut. Parity is generated by a sending device, and checked by the arbiter.

AParityIn: This wire carries single bit parity computed over the signals LongGrant, Grant, OwnerIn, and SharedIn. It is generated by the arbiter and checked by a receiving device.

4.5 Data/DataOut Signals

Data[0..63]: These 64 signals carry the bulk of the information being transmitted from one chip to another.
During header cycles they carry a packet type, some control bits, a deviceID, and an address, and during other cycles they carry data. These signals are driven only after receiving Grant from the Arbiter; otherwise they remain in a high impedance state.

HeaderCycle/HeaderCycleOut: This signal indicates the beginning of a packet. It is asserted during the first cycle of a packet, which is the header. It is generated by the device sending the packet, and is driven only during cycles in which the device has Grant from the Arbiter. During other cycles it remains in a high impedance state.

DParity/ParityOut: This signal carries parity computed over the HeaderCycle/HeaderCycleOut and Data lines.

4.6 DataIn Signals

DataIn[0..63]: These 64 wires carry a possibly delayed version of the information on the DataOut wires.

HeaderCycleIn: This wire carries a possibly delayed version of the information on the HeaderCycleOut wire. HeaderCycleIn is asserted if and only if the header cycle of a packet is being received.

DParityIn: This wire carries the parity computed by the source of the data. It is used to check if transmission of Data and HeaderCycle encountered an error.

5. Arbitration and Flow Control

Each DynaBus has an arbiter that permits the bus to be time multiplexed amongst contending devices. Whenever a device has a packet to send, it makes a request to the arbiter using dedicated request lines, and the arbiter grants the bus using dedicated grant lines. Different devices may have different priority levels, and the arbiter guarantees fair (bounded-time) service to each device within its priority level. Bus allocation is non-preemptive, however, in that the transmission of a single packet is noninterruptible. When making an arbitration request, a device indicates both the priority and the length of the packet it wants to send.

Two aspects of DynaBus arbitration ensure good performance.
The first is that arbitration is overlapped with transmission, so that no bus cycles are wasted during arbitration and it is possible to fill up the bus completely with packets. The second is that a device may make multiple requests before the first request has been granted; this allows a single device to use the bus to its maximum potential.

Figure 12: Arbitration is overlapped with packet transmission so that it is possible to fill up the bus completely with packets.

The arbiter is also used to implement flow control, which is a mechanism to avoid packet congestion. To understand why congestion can occur, recall that DynaBus is packet switched: a device may get new requests while it is servicing older ones, so that requests can pile up faster than a device is able to service them.

5.1 Arbitration

Each device interacts with the arbiter via a dedicated port consisting of three request wires RequestOut[0..2] and one Grant wire. One other wire, LongGrant, is shared by all devices connected to the arbiter.

A device communicates requests by using the RequestOut wires for either one cycle or two consecutive cycles. In the first cycle it always communicates the priority of its request. For some of the requests the device uses a second cycle in which it indicates a length (number of cycles for which it wants the bus) and a color that is used by the arbiter to provide fair service. The encoding for the two cycles is as follows:

First Cycle: P2P1P0
7: Stop Arbitration
6: Reply High
5: Reply Low
4: Hold
3: Request High
2: Request Normal
1: Request Low
0: NoOp

Second Cycle: xCL
C: Color
L: Packet Length (0=>2 cycles, 1=>9 cycles)

The five priorities Request Low, Request Normal, Request High, Reply Low, and Reply High correspond to "normal" requests for the bus: they are used when the device actually intends to send a packet on the bus upon receiving a grant from the arbiter.
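The request encoding above can be sketched in a few lines. This is an illustration rather than a definitive implementation: the names `Priority`, `encode_request`, and `packet_length` are hypothetical, the unspecified `x` bit of the second cycle is assumed to be driven as 0, and each cycle is represented as a 3-bit integer with C and L in the two low-order positions.

```python
# Sketch of the RequestOut encoding described above.
# Names are hypothetical; the 'x' bit is assumed to be 0.
from enum import IntEnum


class Priority(IntEnum):
    NO_OP = 0
    REQUEST_LOW = 1
    REQUEST_NORMAL = 2
    REQUEST_HIGH = 3
    HOLD = 4
    REPLY_LOW = 5
    REPLY_HIGH = 6
    STOP = 7


# Stop, Hold, and NoOp requests occupy a single cycle; all others use two.
ONE_CYCLE = {Priority.NO_OP, Priority.HOLD, Priority.STOP}


def encode_request(priority, color=0, long_packet=False):
    """Return the 3-bit RequestOut values to drive, one per cycle."""
    first = int(priority)
    if priority in ONE_CYCLE:
        return [first]
    # Second cycle is xCL: color in the middle bit, length in the low bit.
    second = ((color & 1) << 1) | int(long_packet)
    return [first, second]


def packet_length(second_cycle):
    """Decode the L bit: 0 => 2-cycle packet, 1 => 9-cycle packet."""
    return 9 if second_cycle & 1 else 2


if __name__ == "__main__":
    cycles = encode_request(Priority.REQUEST_NORMAL, color=1, long_packet=True)
    print(cycles, packet_length(cycles[1]))
    print(encode_request(Priority.HOLD))
```

Under these assumptions a Request Normal for a 9-cycle packet with color 1 is driven as the two values 2 and 3, while Hold is driven as the single value 4 for each cycle in which the device wants the arbiter held.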
Each normal request consists of two cycles, with the first cycle indicating priority and the second the length and color. A device may issue multiple requests back to back, but the number of non-granted requests may not exceed the implementation limit imposed by the arbiter. A separate request is registered for each pair of cycles constituting a request. Higher priority requests are served before lower priority ones, and requests within a priority level are serviced in approximately round-robin order.

These five priority levels are used as follows in a typical DynaBus system: cache replies would use Reply High; memory replies would use Reply Low; requests from caches would use Request Normal. Other devices sending request packets would use one of the request priorities depending on the urgency of the request. For instance, a block transfer IO device doing output normally could use Request Low for ReadBlockRequests that pull data out of the memory system, but switch to Request High when the internal FIFO in the display gets close to empty.

The remaining priorities, NoOp, Hold, and Stop, are different in that a device uses them to request special service from the arbiter. Each such request consists of one cycle that specifies the priority. A device uses NoOp when it does not want to request any service at all. It uses Hold when it wants to prevent the arbiter from granting any requests for request packets (priorities below Hold). The arbiter stays in the Hold state for only as many cycles as the device asserts the Hold code. Finally, Stop is used when a device wants to stop all arbitration: the arbiter simply stops granting the bus for as many cycles as the device asserts the Stop code. However, while in Stop mode, the arbiter continues to accumulate requests from devices.

Grant is used by the arbiter to signal that a device has been granted the bus. Grant is asserted for as many cycles as the packet is long.
If Grant is asserted in cycle i then the device can drive its outgoing bus segment in cycle i+1. LongGrant describes a grant that is about to take place. In the cycle before Grant is asserted, LongGrant tells the device whether or not the next grant will correspond to a 9 cycle packet. Figures 13 and 14 show the timing of important signals at the pins of a requester during the arbitration and transmission of a 2 cycle and a 9 cycle packet, respectively. It is helpful to refer to the schematic of Figure 15 when reading the timing diagrams. << [Artwork node; type 'Artwork on' to command tool] >> Figure 13: Timing diagram for a two cycle packet assuming an arbitration latency of 6 cycles. All signals are at the pins of the requesting device (see Figure 15). Note that LongGrant is valid in the cycle just before Grant, and that Grant is asserted for two cycles. << [Artwork node; type 'Artwork on' to command tool] >> Figure 14: Timing diagram for a 9-cycle packet assuming an arbitration latency of 6 cycles. All signals are at the pins of the requesting device (see Figure 15). Note that LongGrant is valid in the cycle just before Grant, and that Grant is asserted for nine cycles. << [Artwork node; type 'Artwork on' to command tool] >> Figure 15: Schematic of the standard interface used by devices to connect to the DynaBus. 5.2 Flow Control The arbiter provides two mechanisms for flow control, the first being arbitration priorities. Devices making arbitration requests to send reply packets always use priorities higher than devices making arbitration requests to send request packets. 
This mechanism alone would eliminate the congestion problem if devices were always ready to reply before the onset of congestion, but it may not be possible for all devices to satisfy this requirement: a device must either be able to service packets at the maximum arrival rate, or it must have an input queue that is long enough so that it does not overflow even during the longest service time for a packet. For certain slow devices like the memory controller, servicing packets at arrival rate clearly is impossible, and the queue lengths required to ensure no overflow are prohibitive. The arbiter therefore provides a second mechanism suitable for slow devices. This mechanism involves the use of the special request priority called Hold described earlier. As long as the arbiter receives Hold from a device, it refuses to grant arbitration requests for sending request packets, but continues to grant requests for sending reply packets. This has the effect of choking off new request packets as long as Hold is being asserted by some device, and allows the device asserting Hold to clear the congestion. Because the effect of Hold can never be instantaneous, especially in pipelined configurations, devices still need to provide headroom within their input queues to tolerate a few request packets while Hold takes effect. Devices must not use Hold with abandon, however, because this would decrease bus throughput. 5.3 Arbitration in Pipelined Configurations In pipelined DynaBus configurations, the bus segments form a tree rooted at the bidirectional backpanel segment (Figure 16). All IC's are connected on the leaf segment, labeled A. The arbiter controls access to this segment independently of any board or module structure. Note that the existence of a bidirectional on-board bus means that the backpanel bus segment cannot be fully utilized (Figure 17). Assume that device 1 wants to send a packet and gets a grant in cycle 0. 
In cycles (1, 2), the packet is sent on segment A1, in cycles (2, 3) it traverses segment B, and in cycles (3, 4) it moves to segment A2. If device 2 wants to send a packet, the earliest it can transmit information is in cycles (5, 6) because A2 is occupied in cycles (3, 4). This results in a bubble of two cycles on the backpanel bus B.

Note also that in this configuration devices receive packets at different times. Devices on board 2 receive a packet sent from a device on board 1 two cycles later. Since the Owner and Shared signals from all devices need to be ORed together, this means that Owner and Shared for a board transmitting a packet must be delayed by two cycles compared to boards that are receiving the packet.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 16: A low-cost DynaBus system with three segments. The arbiter makes grants on the leaf bus segment, labeled A, without knowledge of the number of pipelined stages in the System.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 17: Timing diagram showing the transmission of data for Grants 1 and 2 over the three segments of the DynaBus pictured in Figure 16.

Figure 18 shows a high performance version of the above configuration. Here the backpanel bus is still bidirectional, but each board has an input bus Ci, separate from the output bus Ai. In addition to allowing the DynaBus bandwidth to be utilized to its full potential, separate buses facilitate the computation of the shared and owner bits because all devices receive a packet at the same time. Shared and owner bits no longer need to be delayed differently as in the above configuration.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 18: A DynaBus System with three segments. The arbiter makes grants on the leaf bus segment, labeled A, without knowledge of the number of pipelined stages in the System.
<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 19: Timing diagram showing the transmission of data for Grants 1 and 2 over the three segments of the DynaBus pictured in Figure 18.

6. Transactions

Transactions form the top layer of the DynaBus protocol, with the two lower layers being packets and cycles. Each transaction consists of a pair of request-reply packets, which are independently arbitrated. A transaction begins when the requester asks the arbiter to grant the bus to send its request packet (Figure 20). Upon receiving bus grant, the requester sends the packet one cycle at a time, with the cycle containing packet control information going first. This first cycle, called the packet header, contains all the information needed to identify the packet and select the set of devices that need to service the packet. Subsequent cycles contain data that is dependent on the type of transaction in progress. All DynaBus devices (including the requester) receive the request packet, and each device examines the header to decide whether or not it needs to take action.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 20: A transaction on the DynaBus consists of a request and a reply.

Exactly one of the receiving devices elects to generate a reply, typically after the action requested by the request packet is complete. The mechanism by which a unique device is selected to respond is different for different transactions, but most transactions use an address contained in the header cycle for this purpose. The responding device then requests the arbiter to grant the bus for sending its reply packet. On receiving grant, this device sends the reply packet one cycle at a time, with the header cycle going first. As before, the header cycle contains all the information needed to identify the packet, and in particular to link it unambiguously to the corresponding request packet.
All DynaBus devices receive this reply packet as well, and each device examines the header to see what action, if any, it needs to take. Typically, the initiating device behaves somewhat differently than other devices. The transaction is complete when the initiating device receives the reply.

Normally, this protocol ensures a one-to-one correspondence between request and reply packets; however, because of errors, some request packets may not get a reply. Thus, devices must not depend on the number of request and reply packets being equal since this invariant will not in general be maintained. The protocol does require devices to provide a simple, but crucial guarantee that is central to the data consistency scheme: devices must service request packets in arrival order. To understand why arrival order must be maintained, see Section 7.1.

The DynaBus defines a complete set of transactions for data transfer between caches and memory, data consistency, synchronization, input output, and address mapping. Twelve of the sixteen transactions are defined. They are: ReadBlock, WriteBlock, FlushBlock, KillBlock, WriteSingle, IOReadSingle, IOWriteSingle, IOReadBlock, IOWriteBlock, Map, DeMap, and Interrupt. The ReadBlock transaction allows a cache to read a packet from memory or another cache. WriteBlock allows new data to be introduced into the memory system (for example disk reads). FlushBlock allows caches to write back dirty data to memory. KillBlock allows a block to be removed from all but one of the caches. WriteSingle is a short transaction used by caches to update multiple copies of shared data without affecting main memory. IOReadSingle and IOWriteSingle initiate and check IO operations, while IOReadBlock and IOWriteBlock allow block transfer of data between IO devices, completely bypassing the consistency mechanism. The Map and DeMap transactions permit the implementation of high speed address mapping in a multiple address space environment.
Finally, the Interrupt transaction provides the mechanism for signalling interrupts to processors. The encoding space leaves room for defining four other transactions.

6.1 Header Cycle Format

The first, or header cycle, of a request packet contains a Command, a Flavor bit, a Mode bit, a deviceID, and an Address (Figure 21). The Command identifies the transaction and indicates that the packet is a request rather than a reply packet. The Flavor bit is used for a few of the transactions to indicate one of two possible semantics. The Mode bit is used in protection checking by receiving devices. The deviceID identifies the initiator of the transaction, while the Address serves as a selector for a memory location or IO device register.

<< [Artwork node; type 'Artwork on' to command tool] >>

Most of the information in the header cycle of the request packet is replicated in the header cycle of the reply packet (Figure 22). In fact, only bits [4..6] may be different. Bit 4, which is part of the Command field, identifies the packet as a reply. Bit 5 indicates whether the transaction encountered an error, while bit 6 tells whether the addressed location is shared or not.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.1.1 The Command Field

The Command field in a header cycle is 5 bits. Four bits encode up to 16 different transactions, while the fifth bit encodes whether the packet is a request (0) or a reply (1). Twelve of the sixteen transactions are currently defined, as shown in Table 1 below.

Table 1: Encoding of the Command Field of a Packet Header

Transaction Name        Abbreviation  Encoding  Length
ReadBlockRequest        RBRqst        0000 0    2
ReadBlockReply          RBRply        0000 1    9
WriteBlockRequest       WBRqst        0001 0    9
WriteBlockReply         WBRply        0001 1    2
FlushBlockRequest       FBRqst        0010 0    9
FlushBlockReply         FBRply        0010 1    2
KillBlockRequest        KBRqst        0011 0    2
KillBlockReply          KBRply        0011 1    2
WriteSingleRequest      WSRqst        0100 0    2
WriteSingleReply        WSRply        0100 1    2
Unused                                0101 0
Unused                                0101 1
Unused                                0110 0
Unused                                0110 1
Unused                                0111 0
Unused                                0111 1
IOReadBlockRequest      IORBRqst      1000 0    2
IOReadBlockReply        IORBRply      1000 1    9
IOWriteBlockRequest     IOWBRqst      1001 0    9
IOWriteBlockReply       IOWBRply      1001 1    2
IOReadSingleRequest     IORRqst       1010 0    2
IOReadSingleReply       IORRply       1010 1    2
IOWriteSingleRequest    IOWRqst       1011 0    2
IOWriteSingleReply      IOWRply       1011 1    2
InterruptRequest        IntRqst       1100 0    2
InterruptReply          IntRply       1100 1    9
Unused                                1101 0
Unused                                1101 1
MapRequest              MapRqst       1110 0    2
MapReply                MapRply       1110 1    2
DeMapRequest            DeMapRqst     1111 0    2
DeMapReply              DeMapRply     1111 1    2

6.1.2 The Flavor/Fault Bit

This bit has different interpretations for request and reply packets. In a request packet it supplies an additional command bit that is used to indicate which of two semantics to use for the WriteSingle transaction. If Flavor=0, then the Memory Controller does not update memory, while if Flavor=1 then the Memory Controller does update memory. In either case, it generates the same reply packet. In a reply packet the same bit is used to encode whether the device servicing the request packet encountered a fault or not. A 0 indicates no fault, and a 1 indicates a fault.
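The Command encoding of Table 1 lends itself to a small lookup. The sketch below is illustrative only: the names TRANSACTIONS and decode_command are hypothetical, and it assumes the 5-bit Command is handled as an integer written as in Table 1, with the four transaction bits first and the request/reply bit as the least significant bit.

```python
# Transaction code -> (name, request length, reply length) in cycles,
# taken from Table 1. Codes 0101, 0110, 0111 and 1101 are unused.
TRANSACTIONS = {
    0b0000: ("ReadBlock", 2, 9),
    0b0001: ("WriteBlock", 9, 2),
    0b0010: ("FlushBlock", 9, 2),
    0b0011: ("KillBlock", 2, 2),
    0b0100: ("WriteSingle", 2, 2),
    0b1000: ("IOReadBlock", 2, 9),
    0b1001: ("IOWriteBlock", 9, 2),
    0b1010: ("IOReadSingle", 2, 2),
    0b1011: ("IOWriteSingle", 2, 2),
    0b1100: ("Interrupt", 2, 9),
    0b1110: ("Map", 2, 2),
    0b1111: ("DeMap", 2, 2),
}

def decode_command(command):
    """Split a 5-bit Command into (transaction name, is_reply, length)."""
    transaction = (command >> 1) & 0b1111   # four transaction bits
    is_reply = bool(command & 1)            # 0 = request, 1 = reply
    if transaction not in TRANSACTIONS:
        raise ValueError(f"unused transaction code {transaction:04b}")
    name, request_len, reply_len = TRANSACTIONS[transaction]
    return name, is_reply, reply_len if is_reply else request_len
```

Because the packet length follows from the Command alone, a receiver modelled this way knows from the header cycle how many cycles to expect.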
When the fault bit is set in a reply packet, the 32 low order bits of the second cycle supply a FaultCode, while bits 7 through 16 supply the deviceID of the device that detected the fault. This format is shown in Figure 23. << [Artwork node; type 'Artwork on' to command tool] >> Figure 23: Format of the second cycle of an error reply packet. 6.1.3 The Mode/ReplyShared Bit This bit has different interpretations for request and reply packets. In a request packet it supplies the privilege mode (kernel=0, user=1) of the device that issued the request. When the requesting device is a cache, for example, this bit indicates whether the processor is in kernel or user mode. The Mode bit is used by devices servicing a request packet to check if the requestor has sufficient privileges to perform the requested operation. In the header of a reply packet the bit indicates whether the data whose address appears in the packet was shared at the time the request packet corresponding to the reply was received. This bit has a meaning only for the transactions ReadBlock, WriteSingle and KillBlock, and may be safely ignored by devices that do not participate in the consistency protocol. Caches use the value of ReplyShared within a RBReply to set the shared bit for the block being fetched. They use the value of ReplyShared within WSReply to know if the block is no longer shared and to clear the shared bit of the cached block if it is not. 6.1.4 The deviceID Field For request packets, the deviceID field of the header carries the unique identity of the device that sent the request packet. For reply packets, the deviceID field of the header is the unique identity of the intended recipient (that is, the identity of the device that sent the request packet). A deviceID is needed in reply packets because the address alone is not sufficient to disambiguate replies. 
Devices that either can have only one outstanding reply at a time, or that can have multiple outstanding replies but can somehow discriminate between them, need only one deviceID. Other devices must be allocated multiple deviceID's to allow them to disambiguate replies. These deviceID's must be contiguous and must be a power of two in number. The deviceID(s) for a device are loaded in at system initialization time via the Debug bus (see [DBusSpec] for details).

6.1.5 The Address Field

The address field within a header cycle is 47 bits, of which the top 10 bits are reserved for virtual address disambiguation, and the bottom 37 bits represent either a real address or an IO address, depending on the transaction. The disambiguation bits will be used by virtual caches in which it is not possible to use the real address alone to locate the data.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 24: Format of the Address Field

The 37 address bits in turn consist of a 33 bit doubleWord address, and a 4-bit single specifier. The doubleWord address identifies an aligned doubleWord location in IO address space or real address space, while the single specifier identifies one of 15 aligned singles within the doubleWord location. Of the fifteen singles 8 are bytes, 4 are halfWords, 2 are words, and one is a doubleWord. The fifteen cases are encoded in four bits as follows:

Code    Datum
bbb0    byte bbb
ww01    halfWord ww
d011    word d
0111    entire doubleWord
1111    unused

The code assumes that data is numbered left to right, in big-endian order. Thus byte 0 is the leftmost byte in the doubleWord, halfWord 0 is the leftmost halfWord in the doubleWord, and so on. For packets that transmit a block of data, doubleWords constituting the block are transmitted in cyclic order, with the doubleWord containing the addressed single being transmitted first (Figure 25).
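The single specifier encoding and the cyclic transmission order can be illustrated with a short sketch. The function names decode_single and cyclic_order are hypothetical. The sketch assumes that the 4-bit code is read with its leftmost bit as the most significant, that blocks are aligned on 8-doubleWord boundaries, and that cyclic order means increasing doubleWord index with wraparound; none of these is stated outright in the text above.

```python
def decode_single(code):
    """Decode a 4-bit single specifier into (kind, index within doubleWord)."""
    if code & 0b1 == 0b0:
        return ("byte", code >> 1)          # bbb0 -> byte bbb
    if code & 0b11 == 0b01:
        return ("halfWord", code >> 2)      # ww01 -> halfWord ww
    if code & 0b111 == 0b011:
        return ("word", code >> 3)          # d011 -> word d
    if code == 0b0111:
        return ("doubleWord", 0)            # 0111 -> entire doubleWord
    return ("unused", None)                 # 1111

def cyclic_order(doubleword_address):
    """Order in which the 8 doubleWords of a block appear on the bus."""
    start = doubleword_address % 8          # position within the block
    return [(start + i) % 8 for i in range(8)]
```

For instance, code 0110 selects byte 3, code 1011 selects word 1, and a request whose addressed single lies in doubleWord 5 of its block would see the block arrive in the order 5, 6, 7, 0, 1, 2, 3, 4.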
<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 25: When memory replies with a block of data, the 8 doubleWords appear on the bus in cyclic order starting with the doubleWord containing the addressed single. Cyclic order decreases the latency of requested data for cache misses.

6.2 ReadBlock

The ReadBlock transaction is used to read a block of data from the memory system. If a cache is owner then that cache replies, otherwise memory replies.

Request (2 cycles)
A ReadBlockRequest packet requests a block to be read from the memory system. The first cycle contains the packet type, the sender's deviceID, and the address of a single in the block. The second cycle contains the address of a single in the block (the victim) that the requested block will replace within the requesting device. Bits 27-63 contain the victim address, while bit 0 indicates whether this address is valid (1 => valid, 0 => invalid). The victim address is invalid for non-cache devices and for a cache when there is no victim (for example just after initialization).

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (9 cycles)
A ReadBlockReply packet returns the block data requested by an earlier ReadBlockRequest. The first cycle reflects most of the information in the request header, with the shared bit indicating whether the block is shared. The remaining eight cycles contain the eight doubleWords of block data in cyclic order, with the doubleWord containing the addressed single appearing first.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (9 cycles)
The first cycle of an error ReadBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error. Remaining cycles are undefined.
<< [Artwork node; type 'Artwork on' to command tool] >>

6.3 WriteBlock

The WriteBlock transaction is used to write a block of data into the memory system. Memory is overwritten, as are any cached copies. This transaction is used by producers of data outside the memory system to inject new data into the memory system.

Request (9 cycles)
A WriteBlockRequest packet requests a block to be written to the memory system. The first cycle contains the packet type, the sender's deviceID, and the address of a single in the block. The remaining eight cycles carry the eight doubleWords of block data in cyclic order, with the cycle containing the addressed single appearing first.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
A WriteBlockReply packet acknowledges an earlier WriteBlockRequest (WriteBlockReply is generated by memory). The first cycle reflects most of the information in the request header; the second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error WriteBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.4 FlushBlock

The FlushBlock transaction is used by caches to write a dirty block being victimized back to memory. Because caches are kept up-to-date, only memory is updated by this transaction.

Request (9 cycles)
A FlushBlockRequest packet requests a block to be written to main memory. The first cycle contains the packet type, the sender's deviceID, and the address of a single in the block. The remaining eight cycles carry the eight doubleWords of block data in cyclic order, with the cycle containing the addressed single appearing first.
<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
A FlushBlockReply packet acknowledges an earlier FlushBlockRequest (FlushBlockReply is generated by memory). The first cycle reflects most of the information in the request header. The second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error FlushBlockReply is the same as that for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.5 KillBlock

The KillBlock transaction is used to remove all but one of the cached copies of a block. When a KillBlock completes, the cached copy belonging to the initiator is normally the only one that remains. This operation does not guarantee removal of other cached copies when those copies are being actively written into by their processors.

Request (2 cycles)
A KillBlockRequest requests all cached copies except the one in the initiator to be removed. The first cycle contains the packet type, the sender's deviceID, and the address of a single in the block. The second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
A KillBlockReply performs the work requested by an earlier KillBlockRequest. The first cycle reflects most of the information in the request header, with the shared bit indicating whether the block is shared. The second cycle is undefined. A cache receiving a foreign KillBlockReply when it has a KillBlockReply or WriteSingleReply pending must not kill its copy of the block.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error KillBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1).
The second cycle contains the deviceID of the reporting device and a code describing the error. << [Artwork node; type 'Artwork on' to command tool] >> 6.6 WriteSingle The WriteSingle transaction is used to write a single to the memory system. There are two versions of the operation, one in which only cached copies of the single are updated, and the other in which main memory is also updated. This transaction is used by caches to keep multiple copies of cached read/write data consistent. Request (2 cycles) A WriteSingleRequest requests a write to all cached copies of a single. The first cycle contains the packet type, the sender's deviceID, and the address of the single. The second supplies the data. If the Flavor bit in the header is 1 then main memory copy of the single is also updated. << [Artwork node; type 'Artwork on' to command tool] >> Normal Reply (2 cycles) A WriteSingleReply packet performs the work requested by an earlier WriteSingleRequest. The first cycle reflects most of the information in the request header, with the shared bit indicating whether the datum is shared. The second cycle supplies the 64 bits of data just as in the request. WriteSingleReply is generated by the memory controller. << [Artwork node; type 'Artwork on' to command tool] >> Error Reply (2 cycles) The first cycle of an error WriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error. << [Artwork node; type 'Artwork on' to command tool] >> 6.7 IOReadBlock The IOReadBlock transaction is used to read a block of data from an IO device. Request (2 cycles) An IOReadBlockRequest packet requests a block of data to be read from an IO device. The first cycle contains the packet type, the sender's deviceID, and the IO address of a single in the block; the IO address specifies both a device and a location in that device. The second cycle is undefined. 
<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (9 cycles)
An IOReadBlockReply packet returns the block requested by an earlier IOReadBlockRequest. The first cycle reflects most of the information in the request header, while the remaining eight cycles carry the eight doubleWords of block data in cyclic order, with the doubleWord containing the addressed single appearing first.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (9 cycles)
The first cycle of an error IOReadBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error. Remaining cycles are undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.8 IOWriteBlock

The IOWriteBlock transaction is used to write a block of data to an IO device.

Request (9 cycles)
An IOWriteBlockRequest packet requests that a block of data be written to an IO device. The first cycle contains the packet type, the sender's deviceID, and the IO address of a single in the block; this IO address specifies both the device and a location in the device. The remaining eight cycles carry the eight doubleWords of block data in cyclic order, with the cycle containing the addressed single appearing first.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
An IOWriteBlockReply packet acknowledges the write requested by an earlier request packet. The first cycle reflects most of the information in the request header; the second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error IOWriteBlockReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.
<< [Artwork node; type 'Artwork on' to command tool] >>

6.9 IOReadSingle

The IOReadSingle transaction is used to read a single from an IO device.

Request (2 cycles)
An IOReadSingleRequest packet requests a single to be read from an IO device. The first cycle contains the packet type, the sender's deviceID, and the IO address of the single; this IO address specifies both the device and a location in the device. The second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
An IOReadSingleReply returns the single requested by an earlier IOReadSingleRequest. The first cycle reflects most of the information in the request header, while the second carries the requested data aligned as specified by the single specifier bits of the IO address.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error IOReadSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.10 IOWriteSingle

The IOWriteSingle transaction is used to write a single to an IO device.

Request (2 cycles)
An IOWriteSingleRequest packet requests a single to be written to an IO device. The first cycle contains the packet type, the sender's deviceID, and the IO address of the single; this IO address specifies both the device and a location in the device. The second cycle contains the data aligned as specified by the single specifier bits of the IO address.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
An IOWriteSingleReply packet acknowledges the write requested by a corresponding IOWriteSingleRequest. The first cycle reflects most of the information in the request header, while the second cycle is undefined.
<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error IOWriteSingleReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.11 Interrupt

The Interrupt transaction is used to signal an interrupt to one or more processors on the DynaBus.

Request (2 cycles)
An InterruptRequest packet requests that one or more processors on the DynaBus be interrupted. The first cycle contains the packet type, the sender's deviceID, and the IO address of a single; this IO address specifies an interrupt register within one or more caches. The second cycle contains the single.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (9 cycles)
An InterruptReply packet performs the work requested by an earlier InterruptRequest. The first cycle of the reply reflects most of the information in the request header, while the remaining eight cycles are identical to the second cycle of the request packet. The reply is nine cycles long to give a cache time to do the read-modify-write required to update its interrupt registers.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (9 cycles)
The first cycle of an error InterruptReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error. The remaining cycles are undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.12 Map

The Map transaction is used to translate a 16-bit address space identifier and a 20-bit virtual page number to a 24-bit real page number and associated protection flags.

Request (2 cycles)
A MapRequest packet requests that a virtual page be translated to the corresponding real page.
The first cycle contains the packet type, the sender's deviceID, and the 20 bits of the virtual page number (in bits 31 through 50). The second cycle contains the address space id.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
A MapReply returns the translation requested by an earlier MapRequest. The first cycle contains the packet type, the deviceID of the transaction initiator, the 22-bit real page and four Flags: Dirty, KWtEnable, UWtEnable, and URdEnable. The second cycle is unused. Note that this is one reply packet whose address part is not the same as that of the corresponding request packet.

<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
An error MapReply is used to indicate that the responding device (MapCache) could not perform the translation. The first cycle contains the packet type, and the deviceID of the transaction initiator. The second cycle contains the deviceID of the reporting device and a code describing the error (the code shown below corresponds to MapFault).

<< [Artwork node; type 'Artwork on' to command tool] >>

6.13 DeMap

The DeMap transaction is used to remove all cached virtual to real translations that correspond to a given real page.

Request (2 cycles)
A DeMapRequest packet requests that all cached virtual to real translations for a given real page be removed from processor caches. The first cycle contains the packet type, the sender's deviceID, and the 22 bits of the real page. The second cycle is undefined.

<< [Artwork node; type 'Artwork on' to command tool] >>

Normal Reply (2 cycles)
A DeMapReply actually performs the action requested by the corresponding DeMapRequest. The first cycle reflects most of the information in the header of the request packet, while the second cycle is undefined.
<< [Artwork node; type 'Artwork on' to command tool] >>

Error Reply (2 cycles)
The first cycle of an error DeMapReply is the same as for a normal reply except that it has the Fault bit set (bit 5 = 1). The second cycle contains the deviceID of the reporting device and a code describing the error.

<< [Artwork node; type 'Artwork on' to command tool] >>

6.14 NoOps

Occasionally, a device that has made a request of its arbiter has nothing to send when it gets a grant. In this situation the device is expected to send a NoOp packet of the same length as the packet it had originally intended to send. It does this simply by putting a 0 value for HeaderCycleOut during its allocated header cycle. Thus, there is no special command to indicate a NoOp.

7. Data Consistency

The DynaBus supports an efficient protocol for maintaining cache coherency in a multiprocessor environment. Using the transactions just described, it is possible to build a high performance multiprocessor system that offers a simple model of shared memory to the programmer. In this system, processors are connected to the DynaBus via write-back caches. The caches are allowed to keep multiple copies of read/write data as needed, and the consistency of this data is maintained automatically and transparently by the hardware. Caches detect when a datum becomes shared by watching bus traffic, and they initiate a broadcast write when a processor issues a write to shared data. IO devices are permitted direct access to the memory system while preserving a consistent view of memory for the processors. A measure of the efficiency of this coherency protocol is that it requires just one more write to a shared datum than the absolute minimum.

7.1 Definition of Data Consistency

A useful definition of data consistency must satisfy three criteria: it must allow interesting programs to be written; it must be simple to understand; and it must be practical to implement.
A common way to define consistency is to say that all copies of any given location have the same value during each clock cycle. While this definition is adequate for writing programs and easy to understand, it is hard to implement efficiently when the potential number of cached copies is large. Fortunately, there is a weaker definition that is still sufficient for programming, but is much easier to implement in a large system. It is based on the notion of serializability. Figure 26 shows an abstract model of a shared memory multiprocessor that will be used to define serializability. Each processor has a private line to shared memory over which it issues the commands Fetch(A) and Store(A, D), where A is an address and D is data. For Fetch(A) the memory returns the value currently stored at A; for Store(A, D) it writes the value D into A and returns an indication that the write has completed. Let the starting time of an operation be the moment a request is sent to shared memory, and the ending time the moment a response is received by the processor. << [Artwork node; type 'Artwork on' to command tool] >> Figure 26: A number of processors connected to a shared memory. A computation C on this abstract model consists of N sequences of fetches and stores, one sequence for each of the processors. A computation transforms the initial state I of the shared memory into a final state F, but does not have any other visible effect. The Fetches and Stores of C are said to be serializable if there exists some global serial order of all the N sequences such that if the operations were performed in this order, without overlap, the same final state F would be reached starting from the same initial state I (two operations p and q overlap if the starting time of p is before the ending time of q and the starting point of q is before the ending time of p). 
The serial order must, of course, also preserve the semantics of Fetch and Store: the value returned by a Fetch(A) in this global sequence must have been the value stored by the most recent Store(A, .), or A's initial value in I if no such Store exists. Given this definition of serializability, a shared memory multiprocessor is said to maintain data consistency if there is an algorithmic procedure for serializing the Fetches and Stores for any computation C on this machine. This procedure takes the N sequences of Fetches and Stores and produces a single global sequence that has the same effect on shared memory. The procedure, of course, depends on concrete implementation details of the multiprocessor. For example, if the multiprocessor has a single port memory with no caches, the transformation of the N sequences to the global sequence is trivial. For a DynaBus based multiprocessor that has processor caches, the procedure depends on details of the cache consistency algorithm and certain synchronization properties enforced by caches and memory controllers. This definition also has a simple and intuitive interpretation. If a shared memory multiprocessor maintains data consistency according to the above definition, the memory model the programmer needs to know is the very simple one illustrated in Figure 26, regardless of the actual complexity of the machine's memory system. The real machine behaves for programming purposes as though its processors were directly connected to a simple read write memory with a single port that is able to service exactly one Fetch or Store operation at a time. << [Artwork node; type 'Artwork on' to command tool] >> Figure 27: The model of shared memory illustrated here is sufficient for programmers writing for DynaBus-based systems. 7.2 An Example The simplest way to understand how the DynaBus consistency protocol works is to look at an example (a more careful specification useful for reference will be given in the following section). 
Consider the five processor system shown in Figure 28. The example below describes a sequence of events for a particular location (address 73) starting from the state where none of the five caches has the block that contains this location. Numbers in the figure correspond to the numbers in the text below. For the example, it is sufficient to know that a cache maintains two state bits, Shared and Owner, for each block of data. When a block has Shared=1 it means that there may be other cached copies of this block; Shared=0 means this is the only cached copy. When Owner=1 it means that this cache's processor was the last one to update this block and any copies of it in other caches. At most one cached copy of a block may have Owner=1. The protocol uses the DynaBus lines shared and owner defined in Section 4.5.

1. Processor1 reads Address 73. Cache1 misses and does a ReadBlock on the bus. Memory provides the data. The block is marked Shared1 = 0, Owner1 = 0.

2. Processor2 reads Address 73. Cache2 misses and does a ReadBlock on the bus. Cache1 pulls the shared line to signal shared. Memory still provides the data. The block is marked Shared1 = Shared2 = 1, Owner2 = 0.

3. Processor3 reads Address 73. Cache3 misses and does a ReadBlock on the bus. Cache1 and Cache2 pull the shared line to signal shared. Memory still provides the data. The block is marked Shared1 = Shared2 = Shared3 = 1, Owner3 = 0.

4. Processor2 writes Address 73. Because the data is shared, Cache2 does a WriteSingle on the DynaBus. Cache1 and Cache3 pull the shared line to signal shared. Cache1, Cache2 and Cache3 update their values, but Memory does not. Cache2 becomes owner (Owner2 = 1).

5. Processor4 reads Address 73. Cache4 misses and does a ReadBlock on the bus. Cache1, Cache2 and Cache3 pull the shared line to signal shared. Cache2 pulls the owner line to keep Memory from responding and provides the data. The block is marked Shared4 = 1, Owner4 = 0.

6. Processor4 now writes Address 73. Because the data is shared, Cache4 does a WriteSingle on the DynaBus. Cache1, Cache2 and Cache3 pull the shared line to signal shared. Ownership changes from Cache2 to Cache4 (Owner2 = 0, Owner4 = 1).

7. Processor5 writes Address 73. Cache5 misses and does a ReadBlock on the bus. Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared. Cache4, the current owner, pulls the owner line and supplies the data. The block is marked Shared5 = 1, Owner5 = 0. Cache5 then does a WriteSingle because the data is shared. Cache1, Cache2, Cache3 and Cache4 pull the shared line to signal shared. Ownership switches from Cache4 to Cache5 (Owner4 = 0, Owner5 = 1).

Figure 28: An example illustrating the DynaBus consistency protocol.

7.3 Protocol Description for Single Level Systems

A single level system consists of one or more processors connected to the DynaBus through caches, and a single main memory. The first thing to note about this configuration is that it is sufficient to maintain consistency between cached copies. The main memory copy can be stale with respect to the caches without causing incorrect behavior because processors have no way to access data except through caches.

The protocol requires that for each block of data a cache keep two additional bits, shared and owner. For a given block, the shared bit indicates whether or not there are multiple copies of that block. This indication is not accurate, but conservative: if there is more than one copy then the bit is 1; if there is only one copy then the bit is probably 0, but may be 1. We will see later that this conservative indication is sufficient. The owner bit is set in a given cache if and only if the cache's processor wrote into the block last; thus at most one copy of a datum can have owner set.
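As a minimal sketch, the Python model below traces the shared and owner bits of a single block through the seven steps of the example in Section 7.2. The class and method names (BlockState, SingleLevelModel, fetch, store) are hypothetical and do not come from the DynaBus specification. The model also assumes that every transaction completes before the next one begins, so it ignores pendingState and any packet that might intervene between a request and its reply.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    present: bool = False  # this cache holds a copy of the block
    shared: bool = False   # other cached copies may exist
    owner: bool = False    # this cache's processor wrote the block last


class SingleLevelModel:
    """Tracks one block across caches numbered 1..n (illustrative only)."""

    def __init__(self, n):
        self.caches = {k: BlockState() for k in range(1, n + 1)}

    def _other_holders(self, k):
        return [j for j, c in self.caches.items() if j != k and c.present]

    def owners(self):
        return [j for j, c in self.caches.items() if c.owner]

    def read_block(self, k):
        """ReadBlock by cache k; returns the name of the data supplier."""
        holders = self._other_holders(k)
        supplier = "Memory"
        for j in holders:
            self.caches[j].shared = True      # holder pulls the shared line
            if self.caches[j].owner:
                supplier = f"Cache{j}"        # owner pulls the owner line
        self.caches[k] = BlockState(True, bool(holders), False)
        return supplier

    def write_single(self, k):
        """WriteSingle by cache k; other holders update, k becomes owner."""
        holders = self._other_holders(k)
        for j in holders:
            self.caches[j].shared = True
            self.caches[j].owner = False
        self.caches[k].shared = bool(holders)
        self.caches[k].owner = True

    def fetch(self, k):
        """Processor k reads; returns the supplier on a miss, None on a hit."""
        if self.caches[k].present:
            return None
        return self.read_block(k)

    def store(self, k):
        """Processor k writes; returns the supplier if a miss occurred."""
        supplier = self.fetch(k)
        if self.caches[k].shared:
            self.write_single(k)
        else:
            self.caches[k].owner = True       # private copy: no bus traffic
        return supplier


# Replay the seven steps of the example for address 73.
model = SingleLevelModel(5)
steps = [("fetch", 1), ("fetch", 2), ("fetch", 3), ("store", 2),
         ("fetch", 4), ("store", 4), ("store", 5)]
for op, k in steps:
    supplier = getattr(model, op)(k)
    print(op, k, "supplier:", supplier, "owners:", model.owners())
```

In this simplified model the owner list never holds more than one cache, which mirrors the statement that at most one copy of a datum can have owner set.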
A cache is also required to maintain some pendingState for a transaction the cache has initiated but that has not yet received a reply; this state allows a cache to correctly compute the value of the shared bit for the block addressed in the pending transaction, and to take special actions for certain "dangerous" packets that arrive while the reply is pending. In addition to this state, the protocol uses two lines on the DynaBus, Shared and Owner, that were described earlier in Section 4.5.

Generally, a cache initiates a ReadBlock transaction when its processor does a Fetch or Store to a block and the block is not in the cache; it initiates a FlushBlock when a block needs to be removed from the cache to make room for another one (only blocks with owner set are written out); and it initiates a WriteSingle when its processor does a write to a block that has the shared bit set.

Caches do a match only if they see one of the following packet types: RBRqst, RBRply, WSRqst, WSRply, and WBRqst. In particular, note that no match is done for either an FBRqst or an FBRply. This is because FB is used only to flush data from a cache to memory, not to notify other caches that data has changed. No match is done for a WBRply, because this packet is only used to acknowledge that the memory has processed the WBRqst.

When a cache issues an RBRqst or WSRqst, all other caches match the block address to see if they have the block. Each cache that matches asserts Shared to signal that the block is shared and also sets its own copy of the shared bit for that block. The requesting cache uses pendingState to compute the value of the shared bit. It cannot simply copy the value of Shared into the shared bit as the other caches do, because the status of the block might change from not shared to shared between request and reply due to an intervening packet with the same address.
This ensures that the shared bit is TRUE for a block only if there are multiple copies, and that the shared bit is eventually cleared if there is only one copy. The shared bit will be cleared when only one copy is left and that copy's processor does a store. The store turns into a WSRqst, no one asserts Shared, and so the value the requestor computes for the shared bit is FALSE. The manipulation of the owner bit is simpler. This bit is set each time a processor stores into one of the singles of the block; it is cleared each time a WSRply arrives on the bus (except for the cache whose processor initiated the WSRqst). There are two cases to consider when a processor does a store. If the shared bit for the block is FALSE, then the cache updates the appropriate single and sets the owner bit right away. If the shared bit is TRUE, the cache puts out a WSRqst. When the memory sees the WSRqst, it turns it around as a WSRply with the same address and data, making sure that the shared bit in the reply is set to the value of the Shared line an appropriate number of cycles after the appearance of the WSRqst's header cycle. When the requestor sees the WSRply, it updates the single and also sets owner. Other caches that match on the WSRply update the single and clear owner. This guarantees that at most one copy of a block can ever have owner set. Owner may not be set at all, of course, if the block has not been written into since it was read from memory. When an RBRqst appears on the bus, two distinct cases are possible. Either some cache has owner set for the block or none has. In the first case the owner (and possibly other caches) assert Shared. The owner also asserts Owner, which prevents memory from responding, and then proceeds to supply the block via an RBRply. The second case breaks down into two subcases. In the first subcase no other cache has the block, Shared does not get asserted, and the block comes from memory. 
In the second subcase at least one other cache has the data, Shared does get asserted, but the block still comes from memory because no cache asserted Owner.

Because the bus is packet switched, it is possible for the ownership of a block to change between the request and its reply. Suppose for instance that a cache does an RBRqst at a time when memory is owner of the block and, before memory can reply, some other cache issues a WSRqst, which generates a WSRply, which in turn makes the issuing cache the owner. Since Owner wasn't asserted for the RBRqst, memory still believes it is owner, so it responds with the RBRply. To avoid using this stale data, the cache that did the RBRqst uses pendingState to either compute the correct value of the data or to retry the ReadBlock when the RBRply is received. Dangerous transactions for a pending ReadBlock are the ones that modify data: WSRply and WBRqst.

It is interesting to note that in the above protocol the Shared and Owner lines are output only for caches and input only for memory. This is because the caches never need the value on the Owner line, and the value on the Shared line is provided in the reply packet so they don't need to look at the Shared line either.

Finally, from the point of view of the memory system, the WBRqst is identical to the FBRqst. From the point of view of the caches, the two requests are different: caches take no action for an FBRqst, but overwrite their data and clear the owner bit for a matching WBRqst.

7.4 Protocol Description for Two Level Systems

Figure 29 illustrates a 2-level DynaBus-based system. A two-level system consists of a number of one-level systems called clusters connected by a main DynaBus that also has the system's main memory. Each cluster contains a single large cache, which connects the cluster to the main DynaBus, and a private DynaBus, which connects the large cache to the small caches in the cluster.
This private DynaBus is electrically and logically distinct from the DynaBuses of other clusters and from the main DynaBus. From the standpoint of a private DynaBus, its large cache looks identical to the main memory in a single-level system. From the standpoint of the main DynaBus, a large cache looks and behaves very much like a small cache in a single-level system. Further, the design of the protocol and the consistency protocol is such that a small cache cannot even discover whether it is in a one-level or a two-level system. The response from its environment is the same in either case. Thus, the behavior of a small cache in a two level system is identical to what was described in the previous section. << [Artwork node; type 'Artwork on' to command tool] >> Figure 29: A 2-level DynaBus-based System. The small open boxes pictured in each cluster might represent any of the devices that are pictured below them, including: a Small Cache, a Processor, an I/O Bridge, a Display Controller, a Printer or a LAN. The protocol requires the large cache to keep all of the state bits a small cache maintains, plus some additional ones. These additional bits are the existsBelow bits, kept one bit per block of the large cache. The existsBelow bit for a block is set only if some small cache in that cluster also has a copy of the block. This bit allows a large cache to filter packets that appear on the main bus and put only those packets on the private bus for which the existsBelow bit is set. Without such filtration, all of the traffic on the main bus would appear on every private bus, defeating the purpose of a two-level organization. The behavior of a small cache in a two-level system is identical to its behavior in a one-level system. In addition, a large cache behaves like main memory at its private bus interface and a small cache at its main bus interface. 
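The filtering role of the existsBelow bit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the large cache's actual logic: the names LargeCacheBlock, LargeCache, note_private_read and forward_to_private_bus are hypothetical, and the sketch assumes that existsBelow is set when a small cache in the cluster reads the block over the private bus. The conditions under which the bit is cleared are left out.

```python
from dataclasses import dataclass, field


@dataclass
class LargeCacheBlock:
    shared: bool = False        # same meaning as in a small cache
    owner: bool = False         # same meaning as in a small cache
    exists_below: bool = False  # some small cache in this cluster has a copy


@dataclass
class LargeCache:
    # Maps a block address to its state bits (hypothetical layout).
    blocks: dict = field(default_factory=dict)

    def note_private_read(self, addr):
        """Assumed rule: a read from the private bus sets existsBelow."""
        self.blocks.setdefault(addr, LargeCacheBlock()).exists_below = True

    def forward_to_private_bus(self, addr):
        """Decide whether a main-bus packet for addr goes onto the private bus."""
        block = self.blocks.get(addr)
        return block is not None and block.exists_below


cache = LargeCache()
cache.blocks[0x80] = LargeCacheBlock()   # held here, but by no small cache
cache.note_private_read(0x49)            # a small cache fetched this block
print(cache.forward_to_private_bus(0x49))
print(cache.forward_to_private_bus(0x80))
```

Only packets for blocks with existsBelow set reach the private bus, which is what keeps main-bus traffic from flooding every cluster.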
The following paragraphs will describe the internal functioning of a large cache and describe how packets on a private bus relate to those on the main bus and vice-versa. When a large cache receives a RBRqst from its private bus, two cases are possible: either the block is there or it's not. If it's there, the cache returns the data via an RBRply, making sure that it sets the shared bit in the reply packet to the OR of the value on the bus and its current state in the cache. (In the single-level system main memory returned the value on the Shared line for this bit.) If the block is not in the cache, the cache puts out a RBRqst on the main bus. When the RBRply comes back the cache updates itself with the new data and its shared bit and puts the RBRply on the private bus. When a large cache gets a WSRqst on its private bus, it checks to see if the shared bit for the block is set. If it is not set, then it updates the data, sets owner, and puts a WSRply (with shared set to the value of the Shared line at the appropriate time) on the private bus. If shared is set, then it puts out a WSRqst on the main bus. The memory responds some time later with a WSRply. At this time the large cache updates the single, sets the owner bit, and puts a WSRply on the private bus with shared set to one. When a large cache gets a FBRqst, it simply updates the block and sends back an FBRply. When a large cache gets an RBRqst on its main bus, it matches the address to determine if it has the block. If there is a match and owner is set, then it responds with the data. However, there are two cases. If existsBelow is set, then the data must be retrieved from the private bus by placing a RBRqst. Otherwise, the copy of the block it has is current, and it can return it directly. When a large cache gets a WSRqst on the main bus, it matches the address to see if the block is there and asserts shared as usual, but takes no other action. 
When the WSRply comes by, however, and there is a match, it updates the data it has. In addition, if the existsBelow bit for that block happens to be set, it also puts the WSRply on the private bus. Note that this WSRply appears out of the blue on the private bus; that is, it has no corresponding request packet. This is another reason why the number of reply packets on a bus may exceed the number of request packets.

8. Atomic Operations

The DynaBus WriteSingle transaction can be used to implement an atomic Swap operation. Typical implementations of Swap in multiprocessors require the bus or specific memory locations to be locked. It is impractical to lock the DynaBus because it is packet switched. Memory locks, moreover, entail performance compromises because it is impractical to have a lock for each location, and the alternative imposes unnecessary conflicts. The use of WriteSingle to perform a Swap does not require bus or memory locks, so that Swaps to the same location by different processors are limited only by the maximum rate at which WriteSingles can be placed on the bus.

Swap has the following semantics:

    Swap[address, value] Returns[sample] = {
        sample ← address^;
        address^ ← value;
    }

These semantics are implemented by a cache in the following manner: When a processor requests a Swap, the cache first determines if the location is shared. If it is, then the cache issues a WriteSingleRequest to that location and waits for the reply. Upon receiving the reply, it reads the current value of the location and updates the location in one atomic action. If the location is not shared, then the cache simply reads the current value and updates the location in one atomic action. In either case, the final action is to return the value read to the waiting processor. Note that this implementation generates no bus traffic for Swaps to non-shared locations.

Figure 30: Using WriteSingle to implement Swap.
Because the data is shared, traffic is generated on the DynaBus.

9. Input Output

All interactions with IO devices fall into one of two categories: control or data transfer. Control interactions are used to initiate IO and to determine whether an earlier request has completed. Data transfer interactions are used to move the data to and from the memory system, or between IO devices. In most applications, the bandwidth requirements of control interactions are small compared to those of data transfer, so the transport efficiency of data transfer is much more important than that of control. When an IO device requires a low rate of data transfer, control interactions can also be used to transfer data.

9.1 Control

All control interactions are carried out through the use of IOReadSingle, IOWriteSingle and Interrupt transactions directed to a common, unmapped 36-bit IO address space. This address space is common in the sense that all processors see the same space, and it is unmapped in the sense that addresses issued by processors are the ones seen by the IO devices. Generally, each type of IO device is allocated a unique, contiguous chunk of IO space at system design time, and the device responds only if an IOReadSingle, IOWriteSingle, or Interrupt is directed to its chunk. The term IO device is being used here not just for real IO devices, but for any device (such as a cache) that responds to a portion of the IO address space.

9.1.1 IOWriteSingle

The IOWriteSingle transaction is used to set up IO transfers and to start IO. The address cycle of the request packet carries an IO address, while the data cycle carries a single of data whose interpretation depends upon the IO address. For block transfer devices, a processor typically does a number of IOWriteSingles to set up the transfer, and then a final IOWriteSingle to initiate the transfer. An IOWriteSingle starts out at a small cache as an IOWRqst packet.
The large cache of the cluster puts the IOWRqst on the main DynaBus, where it is picked up by all the other large caches. These caches put the IOWRqst on their private buses. Thus, the IOWRqst is broadcast throughout the system. Broadcasting eliminates the need for requestors to know the location of devices in the hierarchy and makes a simpler protocol possible. When the IOWRqst reaches the intended device, the device performs the requested operation and sends an IOWRply. The IOWRply is broadcast in the same manner as the IOWRqst, so it eventually makes its way to the requesting small cache. When the reply arrives, the small cache lets its processor proceed. 9.1.2 IOReadSingle The IOReadSingle transaction reads a single of data from an IO device. This data may either be status bits that are useful in controlling the device, or data being transferred directly from the device to the processor. The mechanics of the IOReadSingles are the same as IOWriteSingles: An IOReadSingle starts out at a small cache as an IORRqst packet. The large cache of the cluster puts the IORRqst onto the main DynaBus, where it is picked up by other large caches and put on the private buses. Once the intended IODevice receives the request, it reads the data from its registers and sends it along via an IORRply. The IORRply is broadcast in exactly the same way as the IORRqst, and eventually makes its way to the cache that initiated the transaction. Note that for both IOReadSingles and IOWriteSingles exactly one device responds to a given IO address. 9.1.3 Interrupt The Interrupt transaction is used by IO devices to generate interrupts for one or more processors. Each processor's cache has a set of interrupt registers each of which respond to two IO addresses, a directed address and a broadcast address. A directed address is unique to one cache, while a broadcast address is recognized by all caches. 
When an IO device wants to send an interrupt to one processor, it uses the directed address of that processor's cache in the Interrupt transaction. When an IO device wants to interrupt all processors, it uses the broadcast address.

An Interrupt starts at some device as an InterruptRqst packet. The large cache of the cluster puts the InterruptRqst on the main bus. The memory then generates an InterruptRply with the same parameters as the InterruptRqst. An InterruptRply packet is nine cycles long, with all the data cycles being identical. All the large caches put this InterruptRply on their private DynaBuses. Thus the InterruptRply is broadcast throughout the system. Depending on the IO address parameter of the InterruptRply, either all caches interrupt their processors or just one cache does. When the InterruptRply reaches the requesting device, the transaction is complete. Note that the reply is not generated by the IO device, but by main memory. The reason is that there is no unique IO device that can generate the reply packet.

It is important to point out that errors that occur during an InterruptRply may not be caught by the requesting device's timeout mechanism. If one of the intended recipients of the Interrupt is broken, for instance, the requestor will not get any indication. This is a fundamental problem with broadcast operations, however, and there is no simple solution.

9.2 Data Transfer

IO devices connected to the DynaBus via a cache automatically participate in the data consistency algorithm. If performance were not a problem, all devices could be connected to the DynaBus in this way, freeing designers of IO devices from having to build special chips to interface to the DynaBus. Unfortunately, this approach is insufficient for high speed input devices, which would cause a cache to needlessly transfer blocks from memory to cache each time the cache got a miss.
The protocol provides the WriteBlock transaction to write directly to memory without going through a cache. Of course, a high speed output device could use ReadBlock's to directly transfer data out of consistent memory without going through a cache. In addition, the Dynabus provides the transactions IOReadBlock and IOWriteBlock to transfer data between IO devices without disturbing the contents of real address space. These operations would be useful when one IO device wants to stream data to another over the Dynabus without processing the data in any way. 10. Address Mapping Figure 31 shows how address mapping information is organized in a Dynabus system. There is a three-level hierarchy, with the first level residing in the processor cache, the second in the Map Cache, and the third in a Map Table kept in main memory. The Map Table keeps translation entries for all pages that are actually used in Main Memory. The Map Cache contains the subset of the translations in the Map Table that are used frequently by the current computation. A processor cache in turn keeps the subset of the entries in the Map Table that are frequently used by its processor. The Map Cache contains many more entries than a processor cache and acts as a performance accelerator, avoiding frequent accesses to the main memory Map Table. << [Artwork node; type 'Artwork on' to command tool] >> Figure 31: The organization of address mapping information. 10.1 The MapRequest/MapReply Transactions Figure 32 illustrates the Map transaction. When the cache of translations within a processor cache encounters a miss, the processor cache issues a MapRequest packet on the bus. This packet contains the virtual page number to be translated and an address space identifier (aid). The Map Cache checks to see if it has an entry for the requested page, and if it does it returns the translation via a MapReply. 
A MapReply contains the number of the real page and four flags: Dirty, KWtEnable (kernel write enable), UWtEnable (user write enable), and URdEnable (user read enable). If the MapCache does not have the entry, it sends a MapReply indicating a Map Fault. When the processor cache receives a Map Fault, it signals a TRAP to its processor. The TRAP handler looks up the translation in the Map Table (the translation is guaranteed to be there if the real page is resident in main memory), writes it to the Map Cache, rewinds the instruction being executed at the time of the TRAP and returns. When the instruction is reexecuted, the processor cache gets another map miss, but this time the Map Cache has the entry, so the miss is satisfied. << [Artwork node; type 'Artwork on' to command tool] >> Figure 32: The Map transaction. If the Processor Cache does not contain a map entry, it does a MapRequest to the Map Cache. If the Map Cache contains the requested translation, it replies via MapReply. If not, it uses MapReply to indicate a fault which causes the Processor to TRAP to Map Fault handling code. 10.2 DeMapRequest/DeMapReply The DeMap transaction is used to invalidate all translations from virtual pages to a given real page contained within processor caches. DeMap is used whenever a mapping entry needs to be modified. Changes to a map entry need to be made carefully because of the three levels in the mapping hierarchy. The system software must follow the following sequence: 1. Delete the mapping entry from the Map Table. 2. Delete the mapping entry from the Map Cache. 3. Initiate DeMap to remove the entry from the processor caches. Other sequences are not correct because old copies of the translation being modified could remain in caches for arbitrarily long periods of time and cause unwanted behavior. A DeMap is issued by sending a DeMapRequest containing the number of the real page whose translations are to be invalidated (Figure 33). 
Main memory turns the request around as a DeMapReply, and it is during the reply that the mapping entries for the real page are removed. << [Artwork node; type 'Artwork on' to command tool] >> Figure 33: The DeMap Transaction requests that all translations from some virtual page to a given real page be removed from processor caches. The work of DeMap is done during the reply packet. 11. Error Detection and Reporting The DynaBus specifies two aspects of dealing with errors: detection and reporting. Each device is expected to provide its own facilities for detecting errors, whether the errors are internal to the device or result from interactions with other devices. The bus provides parity to help check transport errors. Once an error is detected, a device must decide if it can handle the error on its own or needs to report the error to some other party. Errors that the device can handle on its own are uninteresting because the bus needs to provide no facilities. Errors that a device cannot handle are divided into recoverable errors and catastrophic ones, and the bus provides facilities to handle each kind. 11.1 Bus Parity The DynaBus provides a single parity wire to check transport on the 64 Data wires. A device that sends a packet is expected to generate the parity bit, and all receiving devices are expected to check the parity bit. Whether a device considers a DynaBus parity error to be recoverable or catastrophic is not specified. 11.2 Time Outs The DynaBus requires each device to implement a timeout facility to detect devices that do not respond, or unconnected devices. Each device must maintain a counter that starts counting bus cycles when the device issues a request to the arbiter to send a request packet. If the system-wide constant maxWaitCycles cycles have elapsed before the corresponding reply packet is received, the device must assume that an error has occurred. 
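The timeout rule above can be sketched as a per-device cycle counter. The Python below is a minimal illustration: the class TimeoutCounter and its method names are hypothetical, and the value passed for max_wait_cycles in the usage lines is an arbitrary small number chosen for readability, not a value taken from the specification.

```python
class TimeoutCounter:
    """Counts bus cycles from arbiter request until the reply arrives."""

    def __init__(self, max_wait_cycles):
        self.max_wait_cycles = max_wait_cycles
        self.elapsed = None  # None means no request is outstanding

    def request_issued(self):
        # Counting starts when the device asks the arbiter for the bus.
        self.elapsed = 0

    def reply_received(self):
        # The corresponding reply packet arrived; stop counting.
        self.elapsed = None

    def tick(self):
        """Advance one bus cycle; return True if the request has timed out."""
        if self.elapsed is None:
            return False
        self.elapsed += 1
        return self.elapsed >= self.max_wait_cycles


counter = TimeoutCounter(max_wait_cycles=4)   # illustrative value only
counter.request_issued()
timed_out = [counter.tick() for _ in range(4)]
print(timed_out)
```

What the device does once tick reports a timeout is deliberately left open here, since the specification does not say whether a timeout is recoverable or catastrophic.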
Whether a device considers a DynaBus timeout to be recoverable or catastrophic is not specified.

The determination of a system-wide value for maxWaitCycles is difficult because of the wide variance in expected service times. For example, a low priority device might take a long time to receive a bus grant, while a higher priority device would get a grant relatively quickly. A low priority device might in fact be forced to wait for an arbitrarily long time if a higher priority device decides to hog the bus. Whether the possibility of freezing out low priority devices should be interpreted as an error is debatable. To avoid getting entangled in this issue, the DynaBus specifies a system-wide lower bound on maxWaitCycles and lets the device implementor decide the exact value. Such a lower bound is needed to avoid generating frequent false alarms.

A conservative lower bound can be arrived at by computing the worst-case service time for a cache request and increasing it by an order of magnitude for safety (caches are used since they are the lowest priority devices that do not change their request priority). Assuming there are 8 caches and only one memory bank, the worst-case service time is at most

    8 * (cycles to service one request in an unloaded system) = 8 * 25 = 200 cycles.

Increasing this by an order of magnitude and rounding up to a power of two gives 2048 cycles, so each device is required to have maxWaitCycles of at least 2048.

11.3 Recoverable Errors

When a device encounters a recoverable error while servicing a request packet, it uses the DynaBus Mode/Fault bit in the reply packet to report the error. The least significant 32 bits of the first data single of the reply packet are set aside for the FaultCode, while bits 7 through 16 are set aside for the deviceID of the reporting device.

11.4 Catastrophic Errors

When a device encounters a catastrophic error, it makes a Stop request to the arbiter. Upon receiving this request, the arbiter stops issuing all requests for the bus, bringing the system to a halt.
The service processor detects the lack of activity on the DynaBus and initiates recovery.

Appendix I. DynaBus Command Field Encoding

The table below gives the encoding for the Command field within the header cycle of a DynaBus packet.

Transaction Name        Abbreviation   Encoding   Length
ReadBlockRequest        RBRqst         0000 0     2
ReadBlockReply          RBRply         0000 1     9
WriteBlockRequest       WBRqst         0001 0     9
WriteBlockReply         WBRply         0001 1     2
FlushBlockRequest       FBRqst         0010 0     9
FlushBlockReply         FBRply         0010 1     2
KillBlockRequest        KBRqst         0011 0     2
KillBlockReply          KBRply         0011 1     2
WriteSingleRequest      WSRqst         0100 0     2
WriteSingleReply        WSRply         0100 1     2
Unused                                 0101 0
Unused                                 0101 1
Unused                                 0110 0
Unused                                 0110 1
Unused                                 0111 0
Unused                                 0111 1
IOReadBlockRequest      IORBRqst       1000 0     2
IOReadBlockReply        IORBRply       1000 1     9
IOWriteBlockRequest     IOWBRqst       1001 0     9
IOWriteBlockReply       IOWBRply       1001 1     2
IOReadSingleRequest     IORRqst        1010 0     2
IOReadSingleReply       IORRply        1010 1     2
IOWriteSingleRequest    IOWRqst        1011 0     2
IOWriteSingleReply      IOWRply        1011 1     2
InterruptRequest        IntRqst        1100 0     2
InterruptReply          IntRply        1100 1     9
Unused                                 1101 0
Unused                                 1101 1
MapRequest              MapRqst        1110 0     2
MapReply                MapRply        1110 1     2
DeMapRequest            DeMapRqst      1111 0     2
DeMapReply              DeMapRply      1111 1     2
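
The encoding table lends itself to a simple lookup, sketched below. The sketch assumes the five encoding bits are held in an integer with the four transaction bits uppermost and the request/reply bit (0 for request, 1 for reply, as the table shows) in the least significant position. That packing, and the names TRANSACTIONS and decode_command, are illustrative assumptions and not part of the DynaBus definition.

```python
# 4-bit transaction code -> (base name, request length, reply length),
# transcribed from the Command field table. Codes absent here are unused.
TRANSACTIONS = {
    0b0000: ("ReadBlock", 2, 9),
    0b0001: ("WriteBlock", 9, 2),
    0b0010: ("FlushBlock", 9, 2),
    0b0011: ("KillBlock", 2, 2),
    0b0100: ("WriteSingle", 2, 2),
    0b1000: ("IOReadBlock", 2, 9),
    0b1001: ("IOWriteBlock", 9, 2),
    0b1010: ("IOReadSingle", 2, 2),
    0b1011: ("IOWriteSingle", 2, 2),
    0b1100: ("Interrupt", 2, 9),
    0b1110: ("Map", 2, 2),
    0b1111: ("DeMap", 2, 2),
}


def decode_command(command):
    """Return (transaction name, length) for a 5-bit command value,
    or None if the encoding is marked Unused in the table."""
    if not 0 <= command < 32:
        raise ValueError("command must fit in 5 bits")
    code = command >> 1      # upper four bits select the transaction
    is_reply = command & 1   # assumed low bit: 0 = request, 1 = reply
    entry = TRANSACTIONS.get(code)
    if entry is None:
        return None
    base, request_length, reply_length = entry
    if is_reply:
        return (base + "Reply", reply_length)
    return (base + "Request", request_length)
```

For example, decode_command(0b00001) yields ("ReadBlockReply", 9), and decode_command(0b01010) yields None because 0101 is an unused code.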