DynaBusLogicalSpecifications.tioga
Written by: Sindhu, May 22, 1986 12:55:46 pm PDT
Last Edited by:
Pradeep Sindhu June 25, 1986 5:16:40 pm PDT
DYNABUS LOGICAL SPECIFICATIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
The Dragon DynaBus
Logical Specifications
Release as [Indigo]<Dragon>Documentation>DynaBusLogicalSpecifications.tioga, .press

© Copyright 1984 Xerox Corporation. All rights reserved.
Abstract: The DynaBus is a high bandwidth, synchronous, packet switched bus that will be used to interconnect the components of the Dragon memory system. This document provides the logical specifications for this bus and explains how it operates. Electrical specifications are not covered here, but are taken up in the document "DynaBus Electrical Specifications".
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
1. Introduction
2. Bus Overview
3. Bus Interface
4. Bus Protocol
5. Data Consistency
6. Input Output
7. Error Handling
Appendix A. Encoding of the Command Field
Appendix B. Encoding of the Result Field
Appendix C. Definition of IO Address Space
1. Introduction
The DynaBus is a high bandwidth, synchronous, packet switched bus that forms the backbone of the Dragon memory system. This bus is used at two levels in the memory hierarchy: At the first level, it interconnects the components of a Dragon cluster. These components include one or more Dragon processors attached to the bus via Small Caches; some memory, connected via a Big Cache; a Map Cache; and one or more IO devices (Figure 1). At the second level, the DynaBus interconnects clusters and main memory. The principal feature of this two level architecture is that it provides sufficient bandwidth to support a few tens of Dragon processors, high speed IO devices, and a high resolution color display, assuming that processors have a cycle time close to that of the bus. In our estimate, this assumption of balanced bus and processor cycle times most closely reflects what technology can deliver within the next five years.
This document describes the structure, protocol, and operation of the DynaBus at the logical level. It begins in the following section with an overview of the bus's main characteristics. Section 3 defines the bus interface, which encapsulates all of the knowledge about the bus that a device needs to have in order to connect to it. Section 4 uses this interface definition to specify the bus protocol: it discusses arbitration, lists the set of bus commands currently defined, and shows how these commands are used in the various transactions. Section 5 describes how the protocol is used to keep multiple copies of cached data consistent. Section 6 takes up the topic of IO. Finally, Section 7 discusses error handling. Electrical characteristics are not covered here at all, but appear in a companion document titled "DynaBus Electrical Specifications". The appendices contain details that do not fit into the flow of the main document.
2. Bus Overview
A key requirement for a multiprocessor bus is that it be able to deliver a large usable bandwidth. The cycle time of such a bus must be correspondingly small since the usable bandwidth is at most as large as the electrical bandwidth. Traditional circuit switched bus designs, which tie up the bus for the duration of each transaction, rapidly become unreasonable as the cycle time of the bus is decreased. To see this, we first need to note that reads to memory constitute most of the traffic on a memory bus. If we assume that memory access time is around an order of magnitude larger than bus cycle time, we see that the bus will be idle for a significant fraction of the time. One way to get around this difficulty is to increase the size of the transport unit to amortize the cost of memory accesses, but this greatly compromises cache efficiency. To avoid these idle bus cycles, the DynaBus is packet switched: the request and reply for each transaction are dissociated and the bus is free to be used by other requesters in between. Although such dissociation is strictly necessary only for transactions that involve main memory, for uniformity the DynaBus uses it for all transactions. This decision also simplifies the solutions to a number of other problems (bus deadlock, data consistency) that are somewhat difficult to handle if the bus is circuit switched.
The DynaBus can be understood best in terms of three layers: cycles, packets, and transactions (these layers correspond to the electrical, logical, and functional levels of operation, respectively). A bus cycle is simply one complete period of the bus clock; it forms the unit of time and electrical information transfer on the bus; the information is typically either an address or data. A packet is a contiguous sequence of cycles; it is the mechanism by which one-way logical information transfer (including broadcast) occurs on the bus. The first cycle of a packet carries address and control information; subsequent cycles typically carry data. A transaction consists of a pair of request-reply packets which together perform some logical function (such as a memory read); the pair may be separated by an arbitrary number of cycles, modulo timeout.
There are around 70 signals on the DynaBus. One of these is the bus clock, which provides synchronization. Several others serve the control functions of sending and receiving packets, of indicating bus error, and of resetting the system. A multiplexed data/address path accounts for the majority of the signals: 64. During data cycles this path carries 64 bits of data; during address cycles it carries a command, a device id, and an address; the address part is 40 bits and represents either a real address or an IO address, depending on the command. The unit of addressing is a 32-bit word. However, for efficient bus utilization, data is usually transferred in multiple word chunks called blocks. Finally, the signals shared and owner are used by the data consistency algorithm.
Each DynaBus has an arbiter that permits the bus to be time multiplexed amongst contending devices. These devices make requests to the arbiter using dedicated request lines, and the arbiter grants the bus using dedicated grant lines. Different devices may have different levels of priority, and the arbiter guarantees fair (bounded-time) service to each device within its priority level. Bus allocation is non-preemptive, however. A device that has been granted the bus is called bus master. The sequence of cycles initiated by a master will typically involve at least one other device on the bus; such a device is called a slave. Since the DynaBus is packet switched, every device must be capable of serving both as a master and as a slave. The notions of master and slave are adequate to describe the transmission of a single packet, but fail to capture transaction level interaction. The terms requestor and responder are used in this context to indicate the devices that place a request and the corresponding reply packet on the bus, respectively.
The following transactions are currently defined for the DynaBus: ReadBlock, WriteBlock, WriteSingle, IORead, IOWrite, Map, and ConditionalStore. ReadBlock allows a cache to read a block from memory or another cache. WriteBlock allows caches to write back dirty data to memory. WriteSingle is used by caches to update multiple copies of shared data. IORead and IOWrite do reads and writes to IO devices to perform the control part of IO. Map allows caches to fetch missing map entries from the Map Cache. ConditionalStore implements the Dragon ConditionalStore instruction.
3. Bus Interface
All devices connect to the DynaBus through a standard bus interface, shown in Figure 2. The design of this interface is such that a device can interact with the bus entirely in terms of packets, or if it prefers, in terms of bus cycles. This degree of flexibility is necessary if the interface is to serve the needs of caches, memory controllers, display controllers, as well as IO controllers. Of course, each device type could have its own interface to the bus, but that would defeat the purpose of having a standard interface.
The interface consists of four groups of signals: clocks, control signals, consistency signals, and data. The clocks provide synchronization signals for the device. The control signals permit bus acquisition, sending and receiving of packets, system initialization, and error notification. The consistency signals allow caches and main memory to maintain multiple copies of cached data in a consistent state. The data group carries addresses and data.
This section simply names the signals in the interface and gives their functions without saying how these signals are used. Usage will be taken up in the bus protocol section, which immediately follows this one. The description uses the following conventions: S[a..b) denotes a group of b-a lines that encode the bits representing the signal S; the most significant bits of the signal are written leftmost. A signal represented by a single wire is written unencumbered with the [..) notation. The term device will be used throughout to indicate an entity connected to the bus. A block will denote a chunk of 8 contiguous 32-bit words aligned in real address space such that the address of the first word is 0 MOD 8.
Clock Group
BClock (Input)
This signal provides the basic timing for the bus. For many devices, the cycle time of this signal will be smaller than the time resolution of the device, but it is included in the interface to allow fast devices to use the interface at the level of bus cycles.
BDeviceClock (Input)
This signal is obtained by dividing down BClock by a suitable integer. It provides the device with its basic timing.
Control Group
BBOP (Input)
This signal indicates Beginning Of Packet. It is asserted when the first cycle of a packet has arrived and is ready to be read from the interface; in other words, BBOP can be used to latch the first cycle of a packet.
BEOP (Input)
This signal indicates End of Packet. It is asserted when the last cycle of a packet is ready to be read from the interface; in other words BEOP can be used to latch the last cycle.
BSend (OutPut)
Asserting this signal causes the interface to send a packet out onto the bus. The contents of the packet are whatever happens to be on the BDataOut lines. BSend must not be asserted until BReadyToSend (see below) is asserted.
BReadyToSend (Input)
This signal indicates that the interface is ready to send another packet. It is used to prevent a device from sending a packet while the transmission of an earlier one is still in progress.
BRequest (OutPut)
This signal is meant for devices that want to request the bus before they have a packet to send so that they may absorb the request-to-grant delay. If BRequest is asserted in cycle T, BSend must be asserted at or before cycle T+3. Thus, only devices with deterministic service times can make use of BRequest.
BError (Input)
This signal indicates that a bus error has occurred.
BReset (Input)
This signal puts all devices on the bus into a known state. This state is reached some (still to be determined) number of cycles after the signal is asserted. The signal is asserted after power up and after an unrecoverable error.
Consistency Group
BShared (Input/Output)
This signal is one of two used by caches to maintain consistency amongst multiple copies of read/write data. When an address cycle appears on the bus, all caches match the address with addresses they have stored. Caches that find a match assert BShared, signaling that the datum is shared.
BOwner (Input/Output)
This signal is also used by caches to maintain consistency. It is asserted by a cache when it is the owner of a block and a read request or response for that block appears on the bus. A cache is the owner of a block when its processor was the last one that did a write into the block.
Data Group
There are 640 data signals, 320 for input and 320 for output, corresponding to a maximum packet length of 5 cycles. The interpretation of the bits in cycle 0 is fixed, while that of subsequent cycles depends on the command in cycle 0.
DataIn/Out[0..64) (Input/Output)
The first data cycle always carries the bus command, a result, the sender's id, and an address (the result field is relevant only for reply packets). Depending on the command, the address may be either in real address space or in IO address space. The fields within the 64 bits are arranged as follows:
 Cmd  DAL[0..6)
 Result  DAL[6..14)
 DeviceID  DAL[14..24)
 Address  DAL[24..64)
For machines with a real address smaller than 40 bits, the contents of the unused high order bits must be 0.
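The field layout above can be made concrete with a short sketch. The following Python fragment is illustrative only: the function names are ours, and the bit positions assume that DAL[0) is the most significant bit of the 64-bit cycle, per the leftmost-MSB convention stated earlier.

```python
def pack_header(cmd, result, device_id, address):
    """Pack the four header fields into one 64-bit address cycle."""
    assert 0 <= cmd < 1 << 6          # Cmd      occupies DAL[0..6)
    assert 0 <= result < 1 << 8       # Result   occupies DAL[6..14)
    assert 0 <= device_id < 1 << 10   # DeviceID occupies DAL[14..24)
    assert 0 <= address < 1 << 40     # Address  occupies DAL[24..64)
    return (cmd << 58) | (result << 50) | (device_id << 40) | address

def unpack_header(cycle):
    """Split a 64-bit address cycle back into its fields."""
    return {
        "cmd":      (cycle >> 58) & 0x3F,
        "result":   (cycle >> 50) & 0xFF,
        "deviceID": (cycle >> 40) & 0x3FF,
        "address":  cycle & ((1 << 40) - 1),
    }
```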
DataIn/Out[64..320) (Input/Output)
These four cycles carry a block. The 32-bit words within the block may be cyclically permuted since the cycle containing the requested word is delivered first. One or more of the trailing cycles may be insignificant, depending on packet length.
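The cyclic permutation can be illustrated with a small sketch (Python; the function is ours, assuming 64-bit cycles each carrying two consecutive 32-bit words of the block):

```python
def block_cycle_order(word_address):
    """Return, as pairs of word addresses, the order in which the four
    64-bit data cycles of a block are delivered.  The cycle containing
    the requested word comes first; the rest follow cyclically."""
    base = word_address & ~7            # blocks are aligned 0 MOD 8
    first = (word_address & 7) >> 1     # index of cycle holding the word
    order = []
    for i in range(4):
        c = (first + i) & 3             # cyclic permutation of cycles
        order.append((base + 2 * c, base + 2 * c + 1))
    return order
```

For example, a request for word 13 delivers the cycle holding words 12 and 13 first, followed by (14, 15), (8, 9), and (10, 11).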
4. Bus Protocol
The DynaBus's protocol can be described best in terms of bus operation at three levels: cycles, packets, and transactions. These terms were defined earlier, but we will review them quickly. A cycle is one complete period of the bus clock; a packet is a contiguous sequence of cycles; and a transaction is a pair of request-reply packets. The first cycle of a packet always carries a device id, a command, and an address; subsequent cycles may carry data or address, depending on the command.
Before a transaction can begin, the requesting device must get bus mastership from the arbiter. Once it has the bus, the device puts a request packet on the bus one cycle at a time, releases the bus, and then waits for the reply packet. Packet transmission is uninterruptible in the sense that no other device can take the bus away during this time, regardless of its priority. The transaction is complete when another device sends a reply packet. Note that between the request and reply packets there may be an arbitrary number of cycles (subject to timeouts and the need to do flow control, but more on these topics later). Although the protocol requires that every request packet have a corresponding reply packet, it does not require the inverse: it is permitted to have a reply packet on a given bus for which there is no unique request packet on that bus. The protocol requires devices to provide a simple but crucial guarantee: that they service request packets in the order in which they arrive.
The detailed description of the protocol below first takes up arbitration and flow control. It then discusses timeouts, the mechanism used to detect the absence of a response. Next, it lists the various bus commands and explains the purpose of the device id. The last subsection provides the detailed makeup of the different transactions. The description of the cache consistency algorithm is deferred to the following section.
Arbitration and Flow Control
The arbiter serves two related purposes. First, it allows multiple contending devices to use the bus without stepping on one another. Second, it ensures that the number of request packets seen on the bus never exceeds the number of reply packets by more than some predetermined number. This second function will be referred to as flow control; it ensures that devices whose service times prevent them from keeping up with the maximum rate of requests don't have to worry about their input buffers overflowing. Let us call such devices "slow devices".
Each device has two request lines and one grant line running between it and the arbiter. One of the request lines, RqRqst, is for getting the bus to send a request packet, and the other, RqRply, is for getting the bus to send a reply packet. The two requests need to be distinguished to allow the arbiter to do flow control. A device requests the bus by asserting the appropriate request line. If the bus is free and there are no other requestors, the arbiter asserts the grant line for the device two cycles later. Otherwise, grant may take several more cycles depending on how many devices are ahead of the one asserting request. The arbiter may implement dynamic priority to allow devices with large buffers to indicate the importance of their requests. The details of dynamic priority are currently unspecified, but a constraint that any scheme will have to live with is the availability of only a small number of lines between each device and the arbiter.
To do flow control the arbiter maintains an internal register RRDiff. At system startup this register is set to zero. Then, every time the bus is granted via RqRqst, RRDiff is incremented; every time it is granted via RqRply RRDiff is decremented (but never below zero). If RRDiff ever reaches maxRRDiff, an integer determined at system design time, the arbiter stops granting the bus for RqRqst's but continues to grant it for RqRply's. This scheme has the effect of limiting the number of outstanding request packets to maxRRDiff. If slow devices have input buffers with at least maxRRDiff entries, their buffers will never overflow.
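The RRDiff bookkeeping can be sketched as follows (an illustrative Python model, not a description of the actual arbiter hardware; the class and method names are ours):

```python
class FlowControl:
    """Sketch of the arbiter's RRDiff flow-control rule."""
    def __init__(self, max_rr_diff):
        self.rr_diff = 0               # requests granted minus replies granted
        self.max_rr_diff = max_rr_diff # maxRRDiff, fixed at design time

    def may_grant_request(self):
        # Stop granting via RqRqst once maxRRDiff requests are outstanding.
        return self.rr_diff < self.max_rr_diff

    def grant_request(self):
        """Bus granted via RqRqst: one more request outstanding."""
        assert self.may_grant_request()
        self.rr_diff += 1

    def grant_reply(self):
        """Bus granted via RqRply: one fewer outstanding (never below zero)."""
        self.rr_diff = max(0, self.rr_diff - 1)
```

With maxRRDiff = 2, two granted requests block further RqRqst grants until a reply is granted, which is exactly the buffer guarantee described above.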
This flow control scheme imposes an important constraint on devices: they must respond to request packets relatively quickly. If they did not, outstanding requests would accumulate toward the maxRRDiff limit and the utilization of the bus would be low.
Time Outs
To detect devices that do not respond, or devices that aren't there in the first place, the protocol defines a limit maxWaitCycles on the number of cycles that may elapse between a request packet and its reply. If this limit is exceeded, the device that generated the request can assume that an error has occurred and notify the appropriate party.
It should be pointed out that this mechanism could impose a bandwidth cost if it is triggered frequently. For the mechanism not to give false alarms, the number maxWaitCycles must be chosen to be an order of magnitude or so larger than the longest time a device takes to respond. While a device is failing to respond, however, the number of other requests that can be outstanding is cut down by one, and this can prevent other requests from appearing on the bus.
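A requestor's timeout check might look like the following sketch (Python; the class and its interface are ours, invented for illustration):

```python
class ReplyWatchdog:
    """Count cycles between a request packet and its reply, flagging an
    error once the maxWaitCycles limit is exhausted."""
    def __init__(self, max_wait_cycles):
        self.max_wait_cycles = max_wait_cycles
        self.waiting = False
        self.count = 0

    def request_sent(self):
        self.waiting, self.count = True, 0

    def reply_seen(self):
        self.waiting = False

    def tick(self):
        """Advance one bus cycle; return True if the request timed out."""
        if not self.waiting:
            return False
        self.count += 1
        if self.count > self.max_wait_cycles:
            self.waiting = False       # give up and notify error handling
            return True
        return False
```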
Commands 
The packet command field is 5 bits, which gives a maximum of 32 different packet types. One of the command bits is used to encode whether the packet is a request or a reply; the other four encode sixteen different transactions. Seven transactions are currently defined, which means that 14 of the 32 decodes are used. The rest are reserved for future use.
The description below lists commands in request/reply pairs. For convenience, each command is given an abbreviation, which appears in parentheses. The decimal numbers in brackets give the encoding for the command.
ReadBlockRequest (RBRqst), ReadBlockReply (RBRply) [0, 1]
These commands are used by devices to read a block from the memory system.
WriteBlockRequest (WBRqst), WriteBlockReply (WBRply) [2, 3]
These commands are used by devices to write a block to the memory system. The first does the write, while the second is just an acknowledgement that the write has been performed.
WriteSingleRequest (WSRqst), WriteSingleReply (WSRply) [4, 5]
These commands are used by caches to update multiple copies of shared data. The first merely requests a write, while the second actually performs it.
IOReadRequest (IORRqst), IOReadReply (IORRply) [6, 7]
These commands are used by devices to read a single word from the IO address space.
IOWriteRequest (IOWRqst), IOWriteReply (IOWRply) [8, 9]
These commands are used by devices to write a single word out to the IO address space.
MapRequest (MapRqst), MapReply (MapRply) [10, 11]
These commands are used by caches to fetch missing map entries.
ConditionalStoreRequest (CSRqst), ConditionalStoreReply (CSRply) [12, 13]
These commands are used by caches to perform a read-modify-write operation on shared data. The first command merely requests the read-modify-write while the second actually performs it.
Unused [14..31]
These commands are not defined.
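Since the low-order bit of each encoding distinguishes a request from its reply, a decoder can be sketched as follows (Python; the table restates the encodings listed above, and the function names are ours):

```python
# Currently defined command encodings, by request/reply pair.
COMMANDS = {
    0: "RBRqst",   1: "RBRply",    # ReadBlock
    2: "WBRqst",   3: "WBRply",    # WriteBlock
    4: "WSRqst",   5: "WSRply",    # WriteSingle
    6: "IORRqst",  7: "IORRply",   # IORead
    8: "IOWRqst",  9: "IOWRply",   # IOWrite
    10: "MapRqst", 11: "MapRply",  # Map
    12: "CSRqst",  13: "CSRply",   # ConditionalStore
}

def is_reply(cmd):
    """One command bit encodes whether the packet is a request or a reply."""
    return (cmd & 1) == 1

def transaction_of(cmd):
    """The remaining four bits select one of sixteen transactions."""
    return cmd >> 1
```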
Device ID
Every packet on the bus carries a device identifier in its first cycle. For request packets, this DeviceID carries the unique number of the device making the request. For reply packets it is the number of the intended recipient. Note that such an identifier is needed because the address part of a packet alone is not sufficient to disambiguate replies.
Transactions
The following transactions are currently defined: ReadBlock, WriteBlock, WriteSingle, IORead, IOWrite, Map, and ConditionalStore.
ReadBlock
ReadBlock reads a block from main memory or from a cache, depending on whether main memory is consistent with the caches. Recall that a block is eight contiguous 32-bit words aligned in real address space such that the address of the first word is 0 MOD 8.
The request packet for ReadBlock consists of two address cycles. The first cycle provides the address of the block to be read; the second provides the address of the block being victimized within the cache requesting the ReadBlock.
The reply packet for ReadBlock consists of five cycles, the first of which is the address of the block being returned. The remaining four cycles carry the block. The data within the block is cyclically permuted such that the cycle containing the addressed word is delivered first.
WriteBlock
WriteBlock writes a block to main memory or to a willing cache. The request packet for WriteBlock is five cycles. The first cycle carries the address, and the remaining four cycles carry the four cycles of block data. The block address must be 0 MOD 8, unlike the address for a ReadBlock.
The reply packet, which is two cycles, serves as the acknowledgement that the write has been performed. The first cycle carries the address; the second is empty.
WriteSingle
WriteSingle writes a 32-bit word out to real address space. This transaction is used by caches to maintain multiple copies of read/write data in a consistent state. Unlike WriteBlock, WriteSingle does not write the word out to memory, but is directed exclusively to caches.
The request packet for WriteSingle consists of two cycles, address and data, in that order. The reply packet is identical, except that the command is WriteSingleReply.
IORead
The IORead and IOWrite transactions are used by caches to communicate with IO devices. The 32-bit address of an IORead or IOWrite defines a common, unmapped IO address space. Each IO device has a unique piece of the space allocated to itself, and is free to interpret IOReads and IOWrites to that piece in any manner it sees fit. Further details appear in Section 6 and in the appendices.
IORead reads a single word from the IO address space. The request packet consists of two cycles, the first of which carries the IO address. The second cycle is empty.
The reply packet also consists of two cycles. The first carries the IO address in the request, while the second carries the data read from the IO device.
IOWrite
IOWrite writes a single word to the IO address space. The request packet is two cycles, the first carries the IO address, the second the data.
The reply packet is identical to the request packet, except that it contains IOWriteReply as the command.
Map
Map reads a virtual page to real page mapping entry from the Map Cache. The request packet is two cycles. The first contains the address space and virtual page number, and the second is empty. The reply packet is also two cycles. The first contains the address space and virtual page number, while the second contains the corresponding real page.
ConditionalStore
ConditionalStore does a read-modify-write on a given location in real memory. The request packet is two cycles. The first carries the address, and the second carries the old and new values. The reply packet is also two cycles, and is identical to the request except for the command.
5. Data Consistency
A basic decision in the Dragon architecture is that multiple copies of read/write data are allowed to exist when data is cached. An immediate consequence of this is that the copies must be kept consistent if computations are to produce correct results.
This section describes how the transactions just described can be used to maintain consistency in a system with multiple levels of cache. The description begins with a definition of data consistency. It then describes how consistency is maintained in a single level system. The two-level case is taken up last. Although only the single-level and two-level cases are described here, it should be noted that the algorithm works equally well for the N-level case.
Definition of Data Consistency
One way to define consistency for a particular datum A is to say that during each clock cycle all copies of A have the same value. While this definition is adequate and easy to understand, it is hard to implement efficiently when the potential number of copies is large. Fortunately, there is a weaker definition that is still adequate for correctness, but is much easier to implement in a large system. It is based on the notion of serializability.
To define serializability, consider the abstract model of a shared memory multiprocessor shown in Figure xx. There are some number of processors connected to a shared memory. Each processor has a private line to shared memory; over this line the processor issues the commands Fetch(A) and Store(A, D), where A is an address and D is data. For Fetch(A) the memory returns the value currently stored at A; for Store(A, D) it writes the value D into A and returns an indication that the write has completed. Let the starting time of an operation be the moment the request is sent over the line and the ending time be the moment either the return value or an indication of completion is received by the processor.
Let S be the state of the shared memory before the start of some computation C, and F be the state after C has completed. The Fetches and Stores of C are said to be serializable if there exists some serial order such that if these operations were performed in this order, without overlap, the same final state F would be reached starting from the initial state S. Two operations are said to overlap if the starting time of one is before the ending time of the other and the starting time of the other is before the ending time of the one.
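The overlap condition translates directly into code. In the following sketch (Python; representing each operation as a (start, end) pair is our convention), two operations overlap exactly when each starts before the other ends:

```python
def overlap(op1, op2):
    """True iff the two operations overlap in time.
    Each operation is a (start, end) pair of times."""
    (s1, e1), (s2, e2) = op1, op2
    return s1 < e2 and s2 < e1
```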
Single Level Operation
For the purposes of explaining the consistency algorithm, a single level system consists of one or more processors connected to the DynaBus through caches, and a single main memory. The first thing to note about this configuration is that it is sufficient to maintain consistency between cached copies. Since processors have no way to access data except through their caches, the main memory copy can be stale without causing incorrect behavior.
The algorithm requires that for each block of data a cache keep two additional bits, shared and owner. For a given block, the shared bit indicates whether there are multiple copies of that block or not. This indication is not accurate, but conservative: if there is more than one copy then the bit is TRUE; if there is only one copy then the bit is probably FALSE, but may be TRUE. We will see later that this conservative indication is sufficient. The owner bit is set in a given cache if and only if the cache's processor wrote into the block last; thus at most one copy of a datum can have owner set. A cache is also required to keep a single state bit called sharedAccumulator, and a register rqstAdrs. The bit is used in computing the value to be put into the shared bit for some block during certain transactions, while the register is used to hold the address for a request packet for which a reply hasn't yet been received. In addition to this state, the algorithm requires two lines on the DynaBus, BShared and BOwner.
Generally, a cache initiates a ReadBlock transaction when its processor does a Fetch or Store to a block and the block is not in the cache; it initiates a WriteBlock when a block needs to get kicked out of the cache to make room for another one (only blocks with owner set are written out); and it initiates a WriteSingle when its processor does a write to a block that has the shared bit set. Caches do a match only when they see one of the following packet types: RBRqst, RBRply, WSRqst, WSRply. In particular, note that no match is done either for a WBRqst or a WBRply.
When a cache issues a RBRqst or WSRqst, all other caches match the block address to see if they have the block. Each cache that matches, asserts BShared to signal that the block is shared and also sets its own copy of the shared bit for that block. The requesting cache on the other hand first clears sharedAccumulator, initializes rqstAdrs to the block address, and then waits. A fixed number of cycles later, it reads BShared and OR's it into sharedAccumulator. In between the request packet and its reply, the requesting cache watches all packets on the bus just like other caches. If the address on any of these packets matches rqstAdrs, the cache sets sharedAccumulator. When the requesting cache sees its reply packet, it OR's the sharedAccumulator and the shared bit from the packet and puts the result into the shared bit for the block. This ensures that the shared bit is set for a block only if there are multiple copies, and that the shared bit is cleared eventually if there is only one copy. The clearing happens when only one copy is left and that copy's processor does a store. The store turns into a WSRqst, no one asserts BShared, and so the value the requestor computes for the shared bit is zero.
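The requestor's computation of the new shared bit is simply the OR of the three sources named above; as a sketch (Python; the function name and argument names are ours):

```python
def new_shared_bit(bshared_sample, matched_rqst_adrs, reply_shared_bit):
    """Compute the requestor's new shared bit for a block: the OR of the
    BShared sample taken a fixed number of cycles after the request, any
    address match against rqstAdrs while waiting for the reply, and the
    shared bit carried in the reply packet itself."""
    return bshared_sample or matched_rqst_adrs or reply_shared_bit
```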
The manipulation of the owner bit is simpler. The owner bit is set each time a processor stores into one of the words of the block; it is cleared each time a WSRply arrives on the bus. There are two cases to consider when a processor does a store. If the shared bit for the block is zero, then the cache updates the appropriate word and sets the owner bit right away. If the shared bit is set, the cache puts out a WSRqst. When the memory sees the WSRqst, it immediately turns it around as a WSRply with the same address and data, making sure that the shared bit in the reply is set to zero. When the requestor sees the WSRply, it updates the word and also sets owner. Other caches that match on the WSRply update the word and clear owner. This guarantees that at most one copy of a block can ever have owner set. Owner may not be set at all, of course, if the block has not been written into since it was read from memory.
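The store path can be sketched as follows (an illustrative Python model; the class, function names, and bus representation are ours, and many details of a real cache are elided):

```python
class CacheLine:
    """Per-block consistency state kept by each cache: the shared and
    owner bits described above, plus the cached words themselves."""
    def __init__(self):
        self.shared = False
        self.owner = False
        self.data = {}

def processor_store(line, word, value, bus_out):
    """Handle a processor Store to a cached block.  bus_out collects
    packets the cache would place on the DynaBus."""
    if not line.shared:
        line.data[word] = value            # update locally and take ownership
        line.owner = True
    else:
        bus_out.append(("WSRqst", word, value))  # memory echoes a WSRply

def ws_reply_seen(line, word, value, we_requested):
    """All matching caches update the word; only the requestor sets
    owner, so at most one copy of a block ever has owner set."""
    line.data[word] = value
    line.owner = we_requested
```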
When an RBRqst appears on the bus, two distinct cases are possible. Either some cache has owner set for the block or none has. In the first case the owner (and possibly other caches) assert BShared. The owner also asserts BOwner, which prevents memory from responding, and then proceeds to supply the block via an RBRply. The second case breaks down into two subcases. In the first subcase no other cache has the block, BShared does not get asserted, and the block comes from memory. In the second subcase at least one other cache has the data, BShared does get asserted, but the block still comes from memory because no cache asserted BOwner. Because the bus is packet switched, it is possible for ownership of a block to shift in between the request and a reply. Suppose for instance that a cache does an RBRqst at a time when memory was owner of the block, and before memory could reply, some other cache issues a WSRqst which generates a WSRply which in turn makes the issuing cache the owner. Since BOwner wasn't asserted for the RBRqst, memory still believes it is owner, so it responds with the RBRply. To prevent this stale data from being taken by the cache that did the RBRqst, all caches match RBRply packets as well. If a cache gets a match and has owner set, it asserts BOwner to indicate stale data. The cache waiting for the RBRply sees BOwner, throws away this block and waits for another RBRply. This reply is generated by the cache that asserted BOwner (this explains one of the reasons why the number of reply packets may exceed the number of request packets).
Two-Level Operation
For the purposes of the consistency algorithm, a two-level system consists of a number of one-level systems called clusters connected by a main DynaBus, to which the system's main memory is also attached. Each cluster contains a single big cache, which connects the cluster to the main bus, and a private DynaBus, which connects the big cache to the small caches in the cluster. This private DynaBus is electrically and logically distinct from the DynaBuses of other clusters. From the standpoint of a private DynaBus, its big cache looks identical to the main memory in a single-level system. From the standpoint of the main DynaBus, a big cache looks and behaves very much like a small cache in a single-level system. Further, the design of the protocol and the consistency algorithm is such that a small cache cannot even discover whether it is in a one-level or a two-level system—the response from its environment is the same in either case. Thus the small cache's behavior is identical to what was described in the previous subsection.
The algorithm requires the big cache to keep all of the state bits a small cache maintains, plus some additional ones. These additional bits are the existsBelow bits, kept one per block of the big cache. The existsBelow bit for a block is set only if some small cache in that cluster also has a copy of the block. This bit allows a big cache to filter packets that appear on the main bus and put only those packets on the private bus for which the existsBelow bit is set. Without such filtering, all of the traffic on the main bus would appear on every private bus, defeating the purpose of a two-level organization.
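The filtering role of the existsBelow bits can be sketched directly. The function name and packet representation below are invented for illustration.

```python
# Sketch of existsBelow filtering at a big cache.
# Packets are modeled as dicts with a "block" field; exists_below maps
# block -> bool, one bit per block of the big cache.

def filter_to_private_bus(main_bus_packets, exists_below):
    """Forward only those main-bus packets whose block has a copy below."""
    return [pkt for pkt in main_bus_packets
            if exists_below.get(pkt["block"], False)]
```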
We have already stated that the behavior of a small cache in a two-level system is identical to its behavior in a one-level system. We have also said that a big cache behaves like main memory at its private bus interface and a small cache at its main bus interface. What remains to be described is what the big cache does internally, and how packets on a private bus relate to those on the main bus and vice-versa.
When a big cache gets an RBRqst from its private bus, two cases are possible: either the block is there or it is not. If it is there, the cache simply returns the data via an RBRply, making sure that it sets the shared bit in the reply packet to its current state in the cache (recall that in the single-level system main memory always returned a zero for this bit). If the block is not in the cache, the cache puts out an RBRqst on the main bus. When the RBRply comes back, the cache updates itself with the new data and its shared bit and puts the RBRply on the private bus. When a big cache gets a WSRqst on its private bus, it checks to see if the shared bit for the block is set. If it is not set, then it updates the data, sets owner, and puts a WSRply (with shared set to zero) on the private bus. If shared is set, then it puts out a WSRqst on the main bus. The memory responds some time later with a WSRply. At this time the big cache updates the word, clears the owner bit, and puts a WSRply on the private bus with shared set to one. When a big cache gets a WBRqst, it simply updates the block and sends back a WBRply.
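The RBRqst half of this private-bus behavior can be sketched as follows, with invented names (handle_private_rbrqst, a big cache modeled as a dictionary) standing in for the hardware.

```python
# Sketch of a big cache servicing an RBRqst from its private bus.
# The big cache is modeled as a dict: block -> {"data", "shared"}.

def handle_private_rbrqst(big_cache, main_bus, block):
    """Return an RBRply on a hit; otherwise forward the request upward."""
    line = big_cache.get(block)
    if line is not None:
        # Hit: reply immediately, echoing the block's current shared bit
        # (main memory in a one-level system always replies with 0 here).
        return ("RBRply", line["data"], line["shared"])
    # Miss: fetch the block over the main bus; the eventual RBRply will
    # update the big cache and then be relayed onto the private bus.
    main_bus.append(("RBRqst", block))
    return None
```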
When a big cache gets an RBRqst on its main bus, it matches the address to see if it has the block. If there is a match and owner is set, then it responds with the data. However, there are two cases. If existsBelow is set, then the data must be retrieved from the private bus by placing an RBRqst there. Otherwise the copy of the block it has is current, and it can return it directly. When a big cache gets a WSRqst on the main bus, it matches the address to see if the block is there and asserts shared as usual, but takes no other action. When the WSRply comes by, however, and there is a match, it updates the data it has. In addition, if the existsBelow bit for that block happens to be set, it also puts the WSRply on the private bus. Note that this WSRply appears out of the blue on the private bus; that is, it has no corresponding request packet. This is another reason why the number of reply packets on a bus may exceed the number of request packets.
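The main-bus RBRqst cases can likewise be sketched. The names below (handle_main_rbrqst and the dictionary model of a big cache) are illustrative only.

```python
# Sketch of a big cache reacting to an RBRqst seen on the main bus.
# The big cache is modeled as a dict:
#   block -> {"data", "owner", "exists_below"}.

def handle_main_rbrqst(big_cache, private_bus, block):
    """Reply if owner; fetch from below first when existsBelow is set."""
    line = big_cache.get(block)
    if line is None or not line["owner"]:
        return None                      # memory (or another owner) replies
    if line["exists_below"]:
        # The current data may live in a small cache below: retrieve it
        # from the private bus before replying on the main bus.
        private_bus.append(("RBRqst", block))
        return ("pending", block)
    return ("RBRply", line["data"])      # our copy is current; reply directly
```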
This completes the description of the consistency algorithm. The description has been sketchy, but intentionally so, since it is meant to be an introduction rather than a rigorous definition. Such a definition is contained in Appendix B for those interested.
6. Input Output
The DynaBus protocol was written with a particular model of IO devices in mind. In this model, all interactions with IO devices fall into one of two categories: control and data transfer. Control interactions are used to initiate IO and discover whether an earlier request has completed, while data transfer interactions actually move the data to and from memory. It is assumed that the bus bandwidth requirements of control interactions are small compared to those of data transfer. The model permits control and data interactions to be combined for devices with low data transfer rates.
Following the dictates of the model, this section first describes how control interactions work and then turns to data transfer.
Control
All control interactions are carried out through the use of IORead and IOWrite transactions directed to a common, unmapped 32-bit IO address space. This address space is common in the sense that all processors see the same space, and it is unmapped in the sense that addresses issued by processors are the ones seen by the IO devices. Generally, each IO device is allocated a unique, contiguous chunk of IO space at system design time, and the device responds only if an IORead or IOWrite is directed to its chunk. The term IO device is being used here not just for real IO devices, but for any device (such as a cache) that responds to a portion of the IO address space.
IOWrite transactions are typically used to set up IO transfers and to start IO. The address cycle of the IOWRqst packet carries an IO address, while the data cycle carries 32 bits of data whose interpretation depends on the IO address. For block transfer devices, a processor typically will do a number of IOWrites to set up the transfer, and then a final IOWrite to initiate the transfer.
An IOWrite starts out at a small cache as an IOWRqst packet. The big cache of the cluster puts the IOWRqst on the main bus. The memory then generates an IOWRply with the same parameters as the IOWRqst, and all big caches simply put this IOWRply on their private DynaBuses. Thus an IOWRply is broadcast throughout the system. Broadcasting eliminates the need for requestors to know the location of devices in the hierarchy and makes for a simpler protocol. When the IOWRply reaches the requesting small cache, the cache lets its processor proceed. Note that the reply packet is not generated by the device, but by main memory. There is an important reason for this, and it has to do with the need for non-directed, or broadcast IOWrites. Certain operations (such as reschedule or map updates) need to be performed by more than one device. For such operations there is no unique device that can generate the reply packet, so some unique device such as main memory must do it.
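The broadcast structure of an IOWrite can be sketched as below; the function name and the modeling of buses as lists are assumptions made for the sketch.

```python
# Sketch of the IOWrite broadcast flow: the request reaches the main
# bus, memory turns it into an IOWRply, and every big cache relays the
# reply onto its private DynaBus. Buses are modeled as lists of packets.

def io_write(main_bus, private_buses, addr, data):
    """Broadcast an IOWrite; returns the IOWRply seen by all clusters."""
    main_bus.append(("IOWRqst", addr, data))
    reply = ("IOWRply", addr, data)        # generated by main memory
    main_bus.append(reply)
    for private_bus in private_buses:      # each big cache relays the reply
        private_bus.append(reply)
    return reply
```

Because the reply, not the device, closes the transaction, several devices may act on the same IOWRply, which is exactly what broadcast operations such as reschedule require.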
IORead transactions are used to read 32 bits of data from a device. This data may either be status bits that are useful in controlling the device, or data being transferred directly from the device to the processor.
An IORead starts out at a small cache as an IORRqst packet. The big cache of the cluster puts the IORRqst onto the main DynaBus, where it is picked up by all of the other big caches. These caches put the IORRqst on their private buses. Unlike an IOWrite, note that it is the request packet that gets broadcast for an IORead. Once the intended IODevice receives the request packet, it looks up the appropriate data from its registers and sends it on its way via an IORRply. The IORRply gets broadcast in exactly the same way as the IORRqst, and eventually makes its way to the cache that initiated the transaction. Note that for IOReads it does not make sense for more than one device to respond to a given IO address, although for IOWrites it does.
Data Transfer
A simple way to accomplish data transfer is to require that every IO device connect to the DynaBus via a cache. This way, data going in and out of the memory system automatically participates in the consistency algorithm, and no additional mechanism is needed.
Unfortunately, there is a disadvantage to this way of doing things. For input, the bus bandwidth required for data transfer is effectively doubled, since a block must first be fetched to service a miss and then be written out when it is victimized. This doubling may be avoided by defining a new transaction, IOWriteStream, which writes directly to memory. The caches on the bus, of course, must also watch for IOWriteStreams and update their copies on a match.
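The effect of an IOWriteStream can be sketched as follows; the function name and the dictionary models of memory and caches are illustrative assumptions.

```python
# Sketch of IOWriteStream: input data goes straight to memory, and
# snooping caches that hold the block update their copies, so no
# fetch-then-victimize round trip is needed. Memory and caches are
# modeled as dicts mapping block -> data.

def io_write_stream(memory, caches, block, data):
    """Write a block of input data to memory and to any cached copies."""
    memory[block] = data
    for cache in caches:
        if block in cache:
            cache[block] = data            # match: update the cached copy
    return data
```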
7. Error Handling
To be written.
Appendix A. Encoding of the Command Field
To be written.
Appendix B. Encoding of the Result Field
To be written.
Appendix C. Definition of IO Address Space
To be written.