The Processor Cache

Pradeep Sindhu and Lissy Bland

Dragon-88-10    August 1988

© Copyright 1988 Xerox Corporation. All rights reserved.

Keywords: Multiprocessor Cache, Snoopy Cache, Single Chip Cache, Multicopy Consistency, Integrated Mapping, Read-Modify-Writes, Input Output, Packet Switched Bus.

Maintained by: Bland.pa, Sindhu.pa

XEROX  Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304

For Internal Xerox Use Only

The Processor Cache

1.0 Brief Description

The Processor Cache is the interface between Dragon processors and the DynaBus. It provides high speed memory access to a processor and enables multiple processors on a DynaBus to share memory efficiently and transparently. A processor can read or write data from on-chip memory in a single cycle in the case of a hit, while misses are less than ten times as expensive. The cache implements a simple but efficient multicopy consistency algorithm that automatically detects sharing by monitoring bus activity and generates broadcast writes when a processor updates shared data. A read-modify-write operation allows processors to synchronize without using ordinary reads and writes.

The cache also provides functions for address mapping, I/O, and interrupt handling. It translates virtual addresses from the processor to real addresses on the DynaBus, does page-level protection checking for kernel and user, solves the virtual address aliasing problem, and provides support for multiple address spaces. The cache performs two functions related to I/O: it implements operations for high bandwidth data transfer between the consistent memory and I/O devices, and it allows the processor to perform I/O control operations by sending commands over the DynaBus. Finally, the cache translates certain DynaBus I/O commands into interrupt and reset signals for the processor.

The cache is implemented as a single integrated circuit chip, providing a considerably smaller and potentially faster solution than traditional multi-chip alternatives in which a cache controller is connected to off-the-shelf RAMs. Multiple cache chips may be connected in parallel to increase the total amount of cache memory available to a processor. Each chip operates on a single 40 MHz clock provided by the DynaBus; the processor clock is synchronous to the DynaBus clock and operates at 10 MHz.

2.0 Pin-Out

The chip's 205 pins are divided into three main groups. One group connects to the DynaBus; the second group connects to the DBus; and the third group connects to the processor bus, or PBus. The bus side clocks, the processor side clocks, and processor reset make up the remaining pins.

<< [Artwork node; type 'Artwork on' to command tool] >>

3.0 Block Diagram of the Chip

<< [Artwork node; type 'Artwork on' to command tool] >>

4.0 Architectural Specifications

The cache is a single chip implementation of a multiprocessor snoopy cache. The chip has two primary interfaces: one connects to the processor and the other to the DynaBus. From the processor side, the cache responds to reads and writes to a 32-bit virtual address space. This interface is not pipelined. The transfer unit is a 32-bit word, although byte writes are permitted. On the DynaBus side, the cache generates reads and writes to a 32-bit physical address space, and also responds to reads and writes from the DynaBus when appropriate.
This interface is pipelined, so that more than one DynaBus request can be active within the chip at a time. The transfer unit is either a single line of eight 32-bit words or a single 32-bit word. A copy back scheme is used to keep main memory consistent with caches, and a write broadcast scheme is used to keep caches consistent with each other.

4.1 Key Features

There are separate directories for virtual and real addresses that operate independently. The separate directories minimize the impact of DynaBus traffic on processor throughput by eliminating contention from irrelevant bus transactions. Both directories are fully associative and employ an algorithm that approximates least-frequently-used in selecting victims for replacement. A fully associative implementation was chosen for the following reasons: the hit rate is higher and more stable than for a direct mapped or small-way set associative cache with the same amount of data; address aliasing (multiple virtual addresses pointing to the same physical address) can be handled easily; demapping, the operation of breaking all virtual links for a physical page, is trivial to implement; and finally, the structure provides an address translation table at no extra cost in area, since partial matches on the page part of a virtual address can be used to look up the corresponding real page. The fully associative structure requires some discipline for line replacement; the one used is an approximation to least-frequently-used.

The translation table is a cache of virtual-to-real translations for pages that have at least one block within the cache. When there is a miss to such a page, the translation is performed without any DynaBus references; otherwise the cache requests a device on the DynaBus to perform the translation. For each page, the cache keeps tag bits that allow it to implement simple read/write protection checks for user and kernel.

The cache also supports multiple address spaces, although at any given time virtual addresses from only one address space may be present within the cache. At address space switch time the virtual addresses for the old space are invalidated and those for the new one brought in on demand. Because of the independent physical directory, an address space switch does not require a data flush but only a virtual address flush.

The cache implements a multicopy consistency algorithm that has the effect of globally serializing reads and writes from multiple processors connected to a DynaBus. Serialization means that the course of any computation running on a real machine is identical to that of the same computation running on an abstract machine in which only one memory operation is allowed to execute at a time. Each cache chip on a DynaBus detects the onset and termination of sharing for memory locations by watching DynaBus transactions. A cache generates a broadcast write when its processor does a write to shared data. All caches, including the initiator, process the broadcast write, thereby keeping the various copies consistent. The transfer unit for broadcast writes is kept small for efficiency. The consistency algorithm also incorporates a read-modify-write that can be used by processors for atomic updates or synchronization, and block transfer operations that can be used by high speed IO devices to transfer to and from memory while maintaining consistency.
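The write-update decision just described can be summarized in a short C sketch. It is illustrative only, not the chip's logic: the line structure and the dynabus_write_single hook are assumptions made for the example.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     shared;    /* copies of this line may exist in other caches */
    bool     owner;     /* this cache's processor wrote the line last    */
    uint32_t data[8];   /* one line holds 8 32-bit words                 */
} CacheLine;

/* Stand-in for starting a WriteSingle transaction on the DynaBus; in the
 * real chip all copies, including the local one, are updated when the
 * WriteSingleReply returns.                                              */
void dynabus_write_single(uint32_t realAddr, uint32_t value)
{
    (void)realAddr;
    (void)value;
}

/* Decide how a processor write is handled, per the policy described above. */
void processor_write(CacheLine *line, uint32_t realAddr,
                     unsigned wordInLine, uint32_t value)
{
    if (!line->shared) {
        /* Not shared: update only the local copy; completes in one cycle. */
        line->data[wordInLine] = value;
        line->owner = true;
    } else {
        /* Shared: broadcast the write so every cached copy stays consistent. */
        dynabus_write_single(realAddr, value);
    }
}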
Recall that the DynaBus is packet switched, so that implementing the scheme sketched above is trickier than on a circuit switched bus, where transactions are atomic and the bus can be used as a global serializer.

In addition to its basic function as a cache, the chip also serves as a conduit for all non-memory related interactions with the processor. There are a small number of local IO registers within the cache that are used for some of these interactions. When the processor invokes an IO operation that addresses one of these registers, the chip processes the request locally; otherwise it sends a request out on the DynaBus, much like a miss for memory references. The processor is allowed to proceed when the register has been read or written, or when a reply is received on the DynaBus. Conversely, when an IO operation on the bus addresses one of the cache's IO registers, the cache performs a read or write as appropriate and causes whatever side effect happens to be defined for that register. For example, the processor interrupt and reset lines are manipulated via such side effects.

4.2 Cache/Processor Interface

The cache-processor interface is implemented via the PBus, a low-latency synchronous bus consisting of 48 wires (Figure 1). Of these wires, 32 comprise a multiplexed data/address path; 4 define the bus command; 4 define the byte enables for writes; one wire supplies the processor mode; a reject wire holds the processor for long operations; 4 wires provide a fault indication; and 2 wires, PhA and PhB, supply the two phases of the processor clock.

A typical operation begins in PhA when the processor transfers a 32-bit address to the cache. For a read type operation, the cache responds in PhB with either the 32-bit data or a reject indicating that the operation will take more than one processor clock. For a write type operation, the processor transfers the 32 bits of data during the first PhB, while the cache uses reject during that same PhB to indicate whether the operation is complete or additional cycles are needed.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 1: The Processor/Cache interface.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 2: The timing of a processor read when the data is present in the cache. Data is latched at the falling edge of Phase B.
<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 3: The timing of a processor write when the address is present in the cache.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 4: The timing of a processor read when the cache asserts reject for one cycle.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 5: The timing of a processor write when the cache asserts reject for one cycle.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 6: The timing of processor cache interactions when PFault is asserted.

4.2.1 Memory Transactions

The cache implements three memory transactions for the processor: Read, Write, and ConditionalWriteSingle (CWS).

4.2.1.1 Read

Read takes a 32-bit virtual address and returns 32 bits of data if the operation succeeds. The Read fails if the processor does not have sufficient privileges to perform a read into the target page, in which case the cache terminates the operation by indicating a fault (see the section on fault handling for details). There are three possibilities for a successful Read: (1) the data is already in the cache (a data hit), in which case the cache returns it in one processor cycle; (2) the data is not in the cache, but the virtual to real translation for the target is (a data miss, map hit), in which case the cache brings in the block containing the target word via a ReadBlock and returns the data to the waiting processor; (3) neither the data nor the translation is in the cache, in which case the cache first fetches the translation information from the Map Cache via the Map DynaBus transaction and then gets the target block via a ReadBlock as before. The Map transaction itself may fail if the Map Cache does not have the translation, in which case the cache terminates the Read by causing a map fault. The processor is expected to put the missing mapping entry into the Map Cache and then retry the instruction that issued the cache Read.

4.2.1.2 Write

Write takes a 32-bit virtual address, a 4-bit byte enable specifier, and 32 bits of data properly byte aligned. If successful, the Write updates the addressed location. The location is updated locally as well as in any other caches that may have copies, and only bytes corresponding to byte enable bits that are 1 are written. A Write fails if the processor does not have sufficient privilege to write into the target page. A Write for which the cache gets a data miss is treated exactly like a Read that misses followed immediately by a Write. If the Write is to a location that is not shared, the location is updated locally and the operation completes within one cycle. If the location is shared, the cache initiates a WriteSingle transaction on the DynaBus, causing the local copy and copies in other caches to be updated.

4.2.1.3 ConditionalWriteSingle

ConditionalWriteSingle takes a 32-bit virtual address, a 4-bit byte write specifier, and two 32-bit data values called old and new. If the processor has sufficient privilege to write into the target page, this operation does the following: it samples the current value of the location and compares it with old. If the two are equal, it writes new into the location and returns the sampled value to the processor. The comparison is done on the entire 32 bits of the two words, while the write is done according to the byte write specifier. If the processor has insufficient privilege, the cache terminates the operation by indicating a fault.
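The following C fragment is a sketch of these semantics, not the hardware itself. The byte numbering within the word, and the assumption that the sampled value is returned even when the comparison fails, are choices made for the example.

#include <stdint.h>

/* Model of ConditionalWriteSingle: compare the full 32-bit word with old,
 * and on success write new under control of the 4-bit byte write specifier. */
uint32_t conditional_write_single(uint32_t *location, uint32_t oldValue,
                                  uint32_t newValue, unsigned byteEnables)
{
    uint32_t sampled = *location;           /* sample the current value     */
    if (sampled == oldValue) {              /* full 32-bit comparison       */
        uint32_t merged = sampled;
        for (unsigned b = 0; b < 4; b++) {
            if (byteEnables & (1u << b)) {  /* write only the enabled bytes */
                uint32_t mask = 0xFFu << (8 * b);
                merged = (merged & ~mask) | (newValue & mask);
            }
        }
        *location = merged;
    }
    return sampled;    /* the sampled value is what the processor sees      */
}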
Because the ConditionalWriteSingle needs more parameters than can be passed by the PBus protocol in a single transaction, it must be invoked with multiple transactions. The typical sequence takes three transactions: the first two are IOWrites to set up registers in the cache reserved for the old and new values, while the third is a special read that actually performs the ConditionalWriteSingle. If the ConditionalWriteSingle is to a location that is not shared, the location is checked and conditionally updated locally, and the operation completes within a small number of processor cycles. If the location is shared, the cache generates a ConditionalWriteSingle transaction on the DynaBus, causing the local copy and copies in other caches to be updated. A ConditionalWriteSingle for which the cache gets a data miss is treated exactly like a Read that misses followed immediately by a ConditionalWriteSingle.

4.2.2 IO Transactions

The cache implements three IO transactions for the processor: IORead, IOWrite and BIOWrite. IO transactions are fundamentally different from memory transactions because there is no notion of consistency. The DynaBus IO architecture requires precisely one device to respond to a given IO address and forbids caching of IO data, so the consistency problem does not arise. Also, protection checking is implemented within IO devices rather than the processor. The low 256 locations of IO address space are reserved for local cache IO registers.

4.2.2.1 IORead

IORead takes a 32-bit IO address and returns 32 bits of data. There are two possibilities for a successful IORead:
1. The IO address refers to one of the internal IO registers of the cache, in which case the cache returns the contents of the appropriate register in one processor cycle.
2. The IO address does not refer to an internal register, in which case the cache initiates an IORead transaction on the DynaBus.
In both cases, the cache returns the data to the waiting processor when the IORead completes. An IORead may fail either because the device indicated by the IO address is non-existent or because the processor has insufficient privilege. The cache indicates either case by a fault on the PBus.

4.2.2.2 IOWrite

IOWrite takes a 32-bit IO address and 32 bits of data and updates the specified location. There are two possibilities for a successful IOWrite:
1. The IO address refers to one of the internal IO registers of the cache, in which case the cache updates the appropriate register in one processor cycle.
2. The IO address does not refer to an internal register, in which case the cache initiates an IOWrite transaction on the DynaBus.
In both cases, the cache releases the waiting processor when the IOWrite completes. An IOWrite may fail either because the device indicated by the IO address is non-existent or because the processor has insufficient privilege. The cache indicates either case by a fault on the PBus.

4.2.2.3 BIOWrite

BIOWrite takes a 32-bit IO address and 32 bits of data and updates a particular location in all devices of a given type. An IO address consists of three parts: a device type, a device address, and an offset. When a cache receives a BIOWrite on the PBus, it initiates a BIOWrite transaction with the same IO address and data on the DynaBus. All devices of the type specified in the IO address write the data into the location specified by the offset part of the IO address.
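The dispatch rule that IORead and IOWrite share (local registers in the low 256 IO addresses, everything else on the DynaBus) can be illustrated with a small sketch; read_local_io_register and dynabus_io_read are hypothetical helpers standing in for the cache's internal register file and a blocking DynaBus IORead transaction.

#include <stdbool.h>
#include <stdint.h>

#define LOCAL_IO_REGISTERS 256u   /* low 256 IO addresses belong to the cache */

/* Hypothetical helpers, declared but not defined here. */
uint32_t read_local_io_register(uint32_t ioAddr);
bool     dynabus_io_read(uint32_t ioAddr, uint32_t *data);  /* false => fault */

/* Dispatch an IORead from the processor, per the rule described above. */
bool cache_io_read(uint32_t ioAddr, uint32_t *data)
{
    if (ioAddr < LOCAL_IO_REGISTERS) {
        /* Internal register: answered in one processor cycle. */
        *data = read_local_io_register(ioAddr);
        return true;
    }
    /* Anything else goes out on the DynaBus; a non-existent device or
     * insufficient privilege comes back as a fault, reported on the PBus. */
    return dynabus_io_read(ioAddr, data);
}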
If BIOWrite is directed to a non-existent device type or device register, or if there is insufficient privilege, there is no indication to the requesting processor. No error is reported because of the fundamental difficulty of reporting faults for broadcast operations while still maintaining efficiency in the common case of no fault.

4.2.2.4 IO Registers

Table 1 below lists the IO registers for the cache. All registers are 32 bits wide.

Table 1: The IO Registers
IOAddr  Reg Name         AccessMode   Function
1       CWSOld           kernel/user  old value used by ConditionalWriteSingle
3       CWSNew           kernel/user  new value used by ConditionalWriteSingle
9       AID              kernel       id of currently loaded address space
11      FaultCode        kernel       provides information about fault to processor
13      InterruptStatus  kernel       interrupt status register
15      InterruptMask    kernel       interrupt mask register
16      ClrStatusBits    kernel       a write clears selected InterruptStatus bits. The selected bits are specified by the value written. These locations do not actually store any data.
24      SetStatusBits    kernel       a write sets selected InterruptStatus bits. The selected bits are specified by the value written. These locations do not actually store any data.
37      Modes            kernel       a register containing miscellaneous mode bits

4.2.3 Mapping and Protection

The cache supports a paged virtual memory architecture by implementing a first-level cache of translations from virtual to real addresses, by performing protection checks for operations initiated by the processor, and by providing transactions that allow processors to flush translations in order to modify them. At any given time the cache contains translations from a single address space. The number of this address space is kept in the internal 32-bit AID register.

Address translation assumes 4 KByte pages and works as follows. A 32-bit virtual address is broken up into a 22-bit page number and a 10-bit in-page offset (recall that the unit of addressing is a 32-bit word, not a byte). The internal translation table is used to look up the 22-bit virtual page number and map it to a 22-bit physical page number, which is concatenated with the 10-bit offset to produce the real address. If the lookup fails, however, the cache sends the contents of the AID register and the 22-bit virtual page number on the DynaBus via a MapRequest. The MapReply either provides a 22-bit page number along with protection flags that the cache enters into the translation table, or it indicates a map miss, in which case the cache aborts the current processor operation by signaling a map fault.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 7: The address translation mechanism.

If another block in the cache is from the same virtual page, the real address can be constructed by concatenating the real page and the offset. If the translation fails, the cache initiates a MapRequest on the DynaBus. The MapReply either provides a 22-bit physical page number with protection flags that the cache enters into the translation table, or it indicates a map miss in which case the cache aborts the current processor operation by signaling a map fault (Figure 8). Protection is implemented with three flags per page: KernelWriteEnable, UserReadEnable, and UserWriteEnable. Kernel reads are allowed to any page.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 8: The process of address translation for the cached memory system.
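The translation path of Figures 7 and 8 can be sketched as follows. The protection-flag check is omitted, and lookup_translation_table and dynabus_map_request are hypothetical stand-ins for the on-chip table lookup and the Map transaction.

#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 10u                        /* 10-bit word offset within a page */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

typedef struct {
    uint32_t realPage;   /* 22-bit real page number    */
    bool     valid;      /* translation found          */
    bool     mapFault;   /* Map Cache reported a miss  */
} Translation;

/* Hypothetical helpers, declared but not defined here. */
Translation lookup_translation_table(uint32_t virtualPage);
Translation dynabus_map_request(uint32_t aid, uint32_t virtualPage);

/* Translate a 32-bit, word-addressed virtual address as described above. */
bool translate(uint32_t aid, uint32_t virtualAddr, uint32_t *realAddr)
{
    uint32_t virtualPage = virtualAddr >> OFFSET_BITS;  /* 22-bit page number */
    uint32_t offset      = virtualAddr & OFFSET_MASK;   /* 10-bit offset      */

    Translation t = lookup_translation_table(virtualPage);
    if (!t.valid) {
        /* Miss in the on-chip table: send AID and virtual page in a MapRequest. */
        t = dynabus_map_request(aid, virtualPage);
        if (t.mapFault)
            return false;            /* abort the operation with a map fault */
    }
    *realAddr = (t.realPage << OFFSET_BITS) | offset;    /* concatenate       */
    return true;
}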
The cache provides two transactions to flush translations, DeMap and ClrAllVPValid. DeMap takes a real page as a parameter and flushes all cached translations that map a virtual page to this real page. When a cache receives a DeMap request on the PBus, it puts a DeMapRequest with the same real page on the DynaBus. The flush is actually performed when the DeMapReply is received from the DynaBus (Figure 9). Note that the data part of an entry is not affected by DeMap.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 9: The DeMap request.

ClrAllVPValid is different from DeMap in two respects: first, it affects only the cache whose processor issued the ClrAllVPValid; second, all translations are flushed in this cache rather than just the ones for a given real page.

4.2.4 Fault Handling

Whenever the cache encounters a fault, either one that it detects by itself (for example a protection violation) or one that is explicitly reported to it over the DynaBus (for example a map fault), it stores a 32-bit code specifying the fault into its local FaultCode register and aborts the operation currently in progress. The FaultCode register is divided into three fields. The 10 high-order bits give the DeviceID of the device reporting the fault. The 19 following bits constitute a device dependent MinorCode. The 3 least significant bits are the MajorCode. For faults reported over the DynaBus, the FaultCode register is set to the code that came in over the DynaBus. For a locally detected fault, the cache stores its own fault code into the MajorCode field of the FaultCode register. Table 2 gives the MajorCodes for the cache.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 10: Format of FaultCode.

Table 2: The MajorCodes used by the Cache
Encoding  Name               Meaning
000       undefined
001       MemAccessFault     insufficient privilege for memory operation
010       reserved
011       IOAccessFault      insufficient privilege for IO operation
100       MapFault           map cache miss
101       DynaBusTimeOut     DynaBus timeout
110       undefined
111       DynaBusOtherFault  explicitly reported DynaBus fault

There is one fault that can arise even when no processor operation is in progress: an overflow of the cache's output FIFO. When this happens, the cache asserts the DynaBus SStopOut line, indicating an unrecoverable error and bringing the system to a halt a small number of cycles later.

4.3 DynaBus Interface

The memory side of the cache connects to the DynaBus, a high bandwidth, synchronous, 64-bit packet-switched bus. Transactions on this bus consist of pairs of request-reply packets (Figure 11). A packet contains a header and some number of data words. The header specifies the transaction type, whether the packet is a request or a reply, the id of the transaction initiator, and a real memory address or an IO address, depending on the transaction.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 11: A transaction on the DynaBus consists of a request and a reply.

Caches are required to service packets on the DynaBus within a fixed delay. This requirement has two implications: first, the cache must give priority to DynaBus requests over processor requests, and second, it cannot have a FIFO at its input to buffer packets but must service them in real time.

The cache is organized in lines (Figure 12).

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 12: A line in the cache contains 8 flags, a virtual address, a real address and a block of data.

The 8 flags are as follows:
shared indicates that there may be copies of the line in other caches. (It is possible for shared to be set when sharing has stopped, because the end of sharing is not detected immediately.)
owner indicates that this cache's processor last wrote into the line. At most one cache may have owner set for a given line.
VPValid indicates whether the address contained in the corresponding virtual address field is valid. The set of valid virtual addresses constitutes the virtual directory.
RPValid indicates whether the address contained in the corresponding real address field is valid. The set of valid real addresses constitutes the real directory.
KernelWriteEnable when TRUE, the kernel has write permission.
UserReadEnable when TRUE, the user has read permission.
UserWriteEnable when TRUE, the user has write permission.
Spare this flag is unused.

On a packet switched bus, transactions involving a particular address may occur between the time that a given cache initiates a request and when it receives a reply. Therefore, there is an auxiliary line that contains a real address, an address valid bit, a shared bit, and a replyStale bit. This line is used to monitor packets in the interval between a cache's own ReadBlockRequest and the corresponding ReadBlockReply so as to correctly implement the consistency algorithm on a packet switched bus.

When a cache receives a packet, the packet may be one it launched itself, or one that comes from some other device. These two cases are discriminated by comparing a cache's own DeviceID, myId, with the DeviceID in the packet. We will use the shorthand RMatch[RA] to mean check if the real address RA is in the directory and return a boolean result, and PartialRMatch[RP] to mean check if the real page RP is in the directory and return a boolean result. The Victim line refers to the line selected by the replacement algorithm to be replaced when it is time to fetch data on a miss.

Memory Transactions

A ReadBlockRequest packet requests that some device in the memory system reply with current data for the addressed block. There are two cases for a cache: the packet is its own (idMatch), and the packet is from some other device (~idMatch). For the idMatch case, the cache enables its AuxLine by setting the AuxLine's valid bit and clears the cache's victim line. For the ~idMatch case, if the AuxLine address matches the packet address, the cache sets the AuxLine's shared bit and pulls the DynaBus shared line. Next, if the packet address is in the directory, it reads the data part and, if the packet is not its own, it sets the shared bit for the entry and pulls the DynaBus shared line. Finally, if the owner bit is set for the entry read out, the cache generates a ReadBlockReply corresponding to the ReadBlockRequest and puts it into its output FIFO.

A ReadBlockReply packet returns the data requested by an earlier ReadBlockRequest. A cache does nothing if the packet is not in response to a request that it sent (the ~idMatch case). Otherwise, it clears the AuxLine and then checks whether the incoming packet's address is already in the directory. If it is, the cache writes the addresses and data into the matching line; otherwise it writes them into the victim line.

A WriteBlockRequest packet injects a new block of data into the memory system, overwriting all previous copies. A cache enables the AuxLine if there is an idMatch. If not, it checks whether the packet address matches the AuxLine contents, and if so, sets the ReplyStale bit in the AuxLine.
Finally, if the packet address is in the directory, the cache updates the matching line with data from the packet and clears the owner bit for the line.

A WriteBlockReply packet simply acknowledges that the earlier WriteBlockRequest has been serviced. In the idMatch case the cache disables the AuxLine and marks the transaction not in progress.

A WriteSingleRequest packet requests that all cached copies of a single 32-bit word be updated. In the idMatch case the cache enables the AuxLine. In the ~idMatch case, it pulls the DynaBus shared line if the packet address is in either the directory or the AuxLine.

A WriteSingleReply packet does the work requested by the matching request packet. In the idMatch case the cache disables the AuxLine, marks the transaction not in progress, and, if the packet address is in the directory, writes the data, sets the owner bit, and updates the shared bit; the value written into the shared bit is the OR of ReplyShared in the incoming packet and the shared bit in the AuxLine. In the ~idMatch case it sets replyStale in the AuxLine if the AuxLine matches, and if the directory matches it writes the data and clears owner for the matching line.

A ConditionalWriteSingleRequest packet requests that all cached copies of a single 32-bit word be conditionally updated. The condition is that the current contents of the word are equal to the old value parameter. In the idMatch case the cache enables the AuxLine. In the ~idMatch case, it pulls the DynaBus shared line if either the directory or the AuxLine match.

A ConditionalWriteSingleReply packet does the work requested by the matching request packet. In the idMatch case the cache disables the AuxLine, marks the transaction not in progress, and, if the address exists in the directory, writes either the new value or the current value back, sets the owner bit, and puts the correct value into the shared bit. In the ~idMatch case, if the AuxLine matches it sets replyStale, and if the directory matches it clears owner and writes either the new value or the current value back.

A FlushBlockRequest packet requests that the block of data it carries be written to main memory. In the idMatch case a cache enables the AuxLine, while in the ~idMatch case it does nothing.

A FlushBlockReply packet simply acknowledges that the corresponding FlushBlockRequest has been serviced. In the idMatch case, the cache marks the transaction no longer in progress, while in the ~idMatch case it does nothing.

IO Transactions

An IOReadRequest packet requests that the addressed device reply with the value contained within the addressed location. Recall that the addressed device type, device id and the addressed location are all specified by the IO address. If the device type, the device id, and the target location all match, then the cache returns the contents of the target location.

An IOReadReply packet returns the value requested by the matching request packet. In the idMatch case, the cache marks the transaction no longer in progress.

An IOWriteRequest packet requests that the addressed device update the addressed location with the value contained in the packet. If the device type, the device id and the target location all match, the cache updates the target location.

An IOWriteReply packet simply acknowledges that the matching IOWriteRequest has been processed. In the idMatch case, the cache marks the transaction no longer in progress.

A BIOWriteRequest packet requests that all devices of the specified type update the addressed location with the value contained in the packet.
A cache takes no action on a BIOWriteRequest.

A BIOWriteReply packet causes the write requested by the corresponding BIOWriteRequest packet to be actually performed. If the device type and the target location match, the cache updates the target location.

Mapping Transactions

A MapRequest packet requests that a MapCache map the virtual page contained in the packet. Caches take no action on a MapRequest.

A MapReply packet contains either the real page corresponding to the virtual page contained in the corresponding MapRequest, or an indication of a map fault. In the idMatch case a cache marks the transaction no longer in progress and puts the returned value into an internal register.

A DeMapRequest packet requests that all cached map entries that point to the real page in the packet be flushed. Caches take no action on this packet.

A DeMapReply packet causes the action requested by the corresponding DeMapRequest to be completed. A cache clears the VPValid bits for all lines that contain the real page in the incoming packet.

4.4 Debugging Support

The general strategy for in-system debugging of DynaBus devices is for a device that detects an error to assert SStopOut in order to bring the system to a halt. The DBus then allows the critical state within devices to be examined. The ability to read out state alone is sufficient to diagnose predictable errors.

4.4.1 Synchronous Stop

The cache uses the DynaBus signal SStopIn to turn off requests from the processor. Any requests that are currently in progress are allowed to complete normally, but requests that come in after SStopIn is asserted are blocked via the PReject signal. To the processor, therefore, the effect of SStopIn is no different from that of a miss that never completes. Now, consider several cache chips embedded within a DynaBus system. Some time after SStopIn arrives, the processor ports of all caches will be shut off, turning off the source of cache requests. SStopIn also disables the arbiter so that it stops granting any more requests, and DynaBus activity also eventually stops. Thus, some small but indeterminate time after SStopIn is asserted, all activity ceases on both the processor and bus interfaces of each cache, and it is possible to examine the frozen, internal state of the cache.

4.4.2 Debug Hardware and Microcode

The frozen state of the cache is read over the DBus. This serial bus allows a chip to define a number of scan paths that can be used to read out state from selected registers and/or reload these registers with new values. Most of the useful state of the cache is in the array of lines. This state is accessed by providing microroutines to read the array in 32-bit chunks and store the values in a register that is in one of the scan paths. The register and the data paths from the array to the register are all that is needed to examine the cache's normal functions. The remaining state, which consists of about a dozen bits, is collected together into another scan path register.

4.4.3 DBus Parameterization

The cache has 5 registers that may be initialized and/or read via the DBus. Their addresses (path numbers) are as follows:
0: Chip Identification. 16 bits, RO. Indicates the type and version of the cache. The Chip Identification number for the current version is 1010 0010 1011 1010.
1: DeviceID. 10 bits, R/W. Specifies the DynaBus Device ID for the cache.
2: Data Register. 32 bits, R/W. Array data is routed to this register.
3: PC Register. 8 bits for microPC, 3 bits for RAM word address. R/W.
This register holds the microPC that specifies the microroutine to be executed and the word within a block to be read.
4: Miscellaneous Register. 11 bits, R/W. Contains miscellaneous control bits, including: ArrayShared, ArrayOwner, Flags, ArrayVirtualMatch, ArrayRealMatch, AuxLine.Shared, and AuxLine.ReplyStale.

5.0 Detailed Description of Functional Blocks

Figure 13 is a functional block diagram for the Cache. It has nine major functional blocks: VCam, RCam, Array Control, Ram, Auxiliary Line (AuxLine), Output Section, BControl, PControl, and Interlock Control. The VCam matches incoming virtual addresses from the processor to detect hits. The RCam performs the same match for the real addresses from the DynaBus side. Array Control contains multiplexers to connect either the virtual or real sides to the Ram, a small number of control bits per cache line, and logic used in victimization. The Ram, which is single ported, contains the cache data. The AuxLine contains logic to detect events that are important to pending transactions. The Output Section handles all of the buffering, arbitration, and data transfer involved in sending data on the DynaBus. Finally, BControl and PControl implement the control machinery for the two buses respectively, while Interlock Control contains logic to resolve resource conflicts between the processor and bus sides.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 13: Cache Functional Block Diagram.

5.1 VCam

The VCam is a content addressable memory that is divided into two sections. The first section addresses cached data. It contains one 31-bit entry for each line of cached data. Figure 14 illustrates its format. There is a valid bit, an address type bit, a 22-bit page address, and a 7-bit block address. The valid bit indicates whether or not the VCam entry contains significant data. The address type bit is set to 0, indicating that this is a memory entry rather than an IO entry. The page and block addresses together specify the virtual address of a 32-byte block. (The word address within the block is specified by three other bits that do not participate in VCam matches.) A partial match facility allows address comparison at the page level (see Figure 7).

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 14: The format of the VCam entries for cached data.

The second section of the VCam addresses IO registers. It contains six 31-bit entries providing access to the cache's IO registers. Figure 15 illustrates the format of these entries. There is a valid bit, which is always 1 (since the IO part is never modified); an address type bit, which is always 1 to indicate an IO entry; a 22-bit page address, which is all zero; and a 7-bit block address, which specifies 8 contiguous locations in IO address space. This part of the cam is never operated in partial match mode.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 15: The format of the VCam entries for IO registers.

When an address is presented, the cached data and IO sections perform the comparison in parallel, but only one of them ends up matching for any given address. The VCam operates in one of two modes: full match, where all 31 bits of the stored entries are compared, and partial match, where only the high 24 bits are compared. Partial match is used only if the incoming address is a memory address and a full match has failed.

5.2 RCam

The RCam is a content addressable memory that is organized like the VCam.
In fact, there is a one-to-one correspondence between entries in the memory and IO parts of the VCam and the RCam. The memory part consists of one 31-bit entry for each line of cached data, while the IO part consists of six 31-bit entries. As with the VCam, the RCam can do full and partial matches. During partial matches only the high-order 24 bits of the stored entries are compared with the incoming address (Figure 7).

Figure 16 illustrates the format for an RCam entry that addresses cached data. There is a valid bit, an address type bit, a 22-bit page address, and a 7-bit block address. The valid bit indicates if the RCam entry contains significant data. The address type bit is set to 0, indicating that this is a memory entry. The page and block addresses together specify the real address of a 32-byte block. (The address of the word in the block is specified by three other bits that do not participate in RCam matches.) A partial match facility makes it possible to find all the entries for a particular real page. This feature is used in implementing DeMap, which involves clearing the VPValid bits for all entries whose real page part matches a given real page.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 16: The format of the RCam entries for cached data.

The second section of the RCam addresses IO registers. Each entry contains a valid bit, which is always 1 (since the IO part is never modified); an address type bit, which is always 1 to indicate an IO entry; a 22-bit page address, which is all zero; and a 7-bit block address, which specifies 8 contiguous locations in IO address space. This part of the cam is never operated in partial match mode.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 17: The format of the RCam entries for IO registers.

5.3 Array Control

Array Control consists of several sub-blocks, each with a distinct function.

The victim logic subblock computes which line to replace when the cache encounters a miss. This computation is done using an approximation to the least-frequently-used algorithm. The algorithm keeps an array of use bits, one per line, and a victim pointer that points to one of the lines. During each cycle the victim logic proceeds as follows: if the victim line has its use bit set, that bit is cleared and the victim pointer is advanced; if the victim line does not have its use bit set, the victim pointer stays where it is. When a miss occurs, the location pointed to by the victim pointer is deleted from the cache, whether or not the use bit is set. During miss servicing, the victim pointer is frozen to keep it from moving to another line. The use bit is set each time the processor references a line, including the time when a line is just brought in. This last case needs more explanation. When a line is brought in, the victim pointer is frozen and the line's use bit is clear. First the victim pointer is unfrozen, but this does not result in any change because the use bit is zero. Then the use bit is set, which causes the victim pointer to advance and the use bit to return to zero. If the line is used frequently, there will be plenty of time for the processor to have set the use bit before the victim pointer wraps around and reaches this line again.

The shared/owner subblock contains a shared bit and an owner bit for each cache line. The shared bit for a line is 1 if there may be multiple cached copies of the line and 0 otherwise. The owner bit is 1 only if this line is the one last written into (amongst multiple copies, if any) by its processor.
The shared bit is read during processor writes and, when set, inhibits processor writes into the Ram. Processor writes to the Ram are inhibited when the data is shared because the data must be updated in all cached copies; this is done by having the cache whose processor wants to update the data initiate a WriteSingleRequest on the DynaBus. The shared bit is written from the bus side, but never read. The owner bit, on the other hand, is never read from the processor side, but is set from the processor side. It is both read and written from the bus side.

The remaining components of Array Control are two 3:1 muxes, called RamSelectMux and CamSelectMux. RamSelectMux connects either the virtual match lines, the real match lines, or the victim lines to the ram select lines in the array. For example, following a processor read or write that hits, the RamSelectMux will be set to select the virtual match lines, and the ram line enabled by the virtual match line that went high will be read or written. The CamSelectMux connects either the virtual match lines, the real match lines, or the victim lines to the cam select lines in the array. There is only one set of select lines for both the RCam and the VCam, since they are read and written together. For example, during a miss, the contents of the victim line need to be read out and written to memory if the owner bit is set. The victim line is read out by setting both the RamSelectMux and the CamSelectMux to select victim as the source.

5.4 Ram

The Ram contains the data for the cache and, like the two cams, is divided into a memory part and an IO part. The memory part contains some number of 256-bit lines, which may be read and written in 256-bit and 32-bit units, with support for writing individual bytes within a 32-bit word. The IO part contains the data storage for the cache's IO registers and hardware for manipulating bits in the interrupt registers. The IO registers are always read and written in 32-bit units.

5.5 Auxiliary Line

Because the DynaBus is packet-switched, transactions on a particular address may occur between the time that a given cache initiates a request and when it receives a reply to that request. The auxiliary line is used to detect this situation. It contains a 32-bit address register, a valid bit to indicate when the contents of the address register are significant, and 2 control bits, shared and replyStale. The auxiliary line operates between the time that the cache sees its own ReadBlockRequest and the time it sees the reply to that request. The address register and valid bit are loaded as the cache sees its own request, thus activating the auxiliary line's matching hardware. The matcher looks for certain transactions whose address field matches the contents of the auxiliary line's address register. If there is an address match on a WriteSingleReply, ConditionalWriteSingleReply, or a WriteBlockRequest, the replyStale bit is set, indicating that the reply to the ReadBlockRequest is bad. For a ReadBlockRequest from another cache, the auxiliary line sets its shared bit and pulls the DynaBus shared line. It also pulls the DynaBus shared line if it sees a WriteSingleRequest or a ConditionalWriteSingleRequest from another cache.

5.6 DynaBus Output Section

The Output Section is an autonomous piece of hardware that handles the details associated with sending packets on the DynaBus, including requesting the DynaBus, buffering, and providing the control for actually outputting a packet.
Internally, the Output Section is divided into three parts: a control part that does the sequencing, a two-cycle request buffer, and a five-cycle FIFO. The control part is a small state machine that contains logic for requesting the bus and for steering packet data to the bus when a grant arrives. The request buffer contains two 64-bit registers for the two cycles of a request packet and some state bits to indicate that a request is pending. A single packet buffer is sufficient because the cache never needs to send more than one two-cycle request packet at a time. The FIFO contains space for a small number of five-cycle packets and some control bits to indicate whether it is empty or full. A FIFO is needed because the cache may have to reply at bus speed to several two-cycle ReadBlockRequest packets, each of which requires a five-cycle reply. The FIFO is also used for two other purposes: to send a five-cycle FlushBlockRequest when a cache wants to write dirty data back to memory, and to send two-cycle replies to IORequests.

5.7 DynaBus Control

DynaBus Control provides the control machinery to handle incoming packets from the DynaBus. It decodes the command field of an incoming packet and generates the control signals to initiate whatever actions the cache needs to take for this packet. Signals meant for hardware dedicated to the DynaBus side are generated directly by DynaBus Control, whereas signals meant for hardware shared between the DynaBus and the processor side (such as the Ram) are sent to Interlock Control, which in turn produces the actual control signals.

5.8 PBus Control

PBus Control provides the control to handle requests from the processor side and to generate replies when the cache is ready. The requirements on PBus Control are quite different from those on DynaBus Control. Most operations must be serviced very quickly, while the remainder consist of long sequences of control signals. In contrast, DynaBus Control sequences do not have the short latency requirement, are heavily pipelined, and are intermediate in length between the very short and the very long sequences of PBus Control.

5.9 Interlock Control

There are several portions of the cache for which there is contention from both the DynaBus and the processor side. The most important of these are the Ram and the RCam. Interlock Control takes signals from DynaBus Control and PBus Control and generates controls for these blocks. The general strategy for arbitrating between the processor and DynaBus sides is based on the following externally imposed constraints:
1. Bus side requests must be serviced within a fixed time, but this fixed time does not need to be very small.
2. Most processor requests, on the other hand, must be serviced with very low latency, but the processor may be stalled when needed.
DynaBus side requests are therefore given priority, but they are delayed by a short time to allow an on-going processor side request to get out of the way. This delay allows the processor to retrieve data with the minimum latency in the case when a request hits in the cache.

6.0 Pin Descriptions

6.1 DynaBus Pins

The following table describes the DynaBus pins.

Pin Name         I/O  Pin Description
RequestOut       O    2 bits, indicates the arbiter request code as follows: 00: release bus hold; 01: assert bus hold; 10: make low priority two cycle request; 11: make high priority five cycle request.
SpareOut         O    2 bits, floated; present only for compatibility.
SStopOut         O    when asserted, this signal indicates that the cache intends to bring the system to a halt.
SharedOut        O    asserted by the cache to indicate that it holds a cached copy of the address that appeared on the DynaBus some cycles earlier. SharedOut is central to the cache consistency algorithm: each cache "watches" all addresses on the DynaBus and compares each address to the addresses it contains.
OwnerOut         O    asserted by a cache when it is the owner of the block specified in a ReadBlockRequest. This is used to prevent the memory from replying to a ReadBlockRequest when a cache is owner.
HeaderCycleOut   O    asserted during the first (header) cycle of a packet.
ParityOut        O    floated, present only for compatibility.
DataOut          O    64-bit wide DynaBus data output. Floated except when Grant delayed by one cycle is asserted.
Grant            I    indicates that the requester can use the DynaBus during the next cycle.
HiPGrant         I    ignored, present only for compatibility.
LongGrant        I    used to determine whether the grant was for a 2 cycle or a 5 cycle packet.
SpareIn          I    2 bits, ignored, present only for compatibility.
SStopIn          I    ignored, present only for compatibility.
SharedIn         I    used to accurately maintain the value of the several caches' Shared flags. When a cache initiates a WriteSingle, ConditionalWriteSingle or ReadBlockRequest, all caches that contain the datum (but not the cache that initiated the transaction) assert SharedOut. The Memory Controller receives the logical OR of the several caches' SharedOut wires as SharedIn and reflects this value in its reply to the transaction. If none of the caches asserted SharedOut, the Memory Controller's reply indicates that the datum is no longer shared. The cache that initiated the transaction then sets its Shared flag to false.
OwnerIn          I    the logical OR of the several caches' OwnerOut wires. It is present only for compatibility.
HeaderCycleIn    I    indicates header cycles coming from the DynaBus.
ParityIn         I    ignored, present only for compatibility.
DataIn           I    64-bit wide DynaBus data input.
Clock            I    DynaBus clock input.
CkOut            O    clock feedback output. Used to adjust the clock skew based on internal clock buffering delay.

6.2 PBus Pins

Here is a description of the PBus pins of the Cache. See the PBus specification for details.

Pin Name           I/O  Pin Description
PhA                I    this signal is high during the A phase of the two-phase non-overlapping processor clock.
PhB                I    this signal is high during the B phase of the two-phase non-overlapping processor clock.
PCmd[0..4)         I    4-bit wide processor command that indicates the operation being requested. It is asserted only during the first PhA of an operation. PCmd[0]=0 for a NoOp and 1 for a request, while PCmd[1..4) encodes the operation as follows:
                        0 0 0  MemoryRead
                        0 0 1  MemoryWrite
                        0 1 0  ConditionalWriteSingle
                        0 1 1  DeMap
                        1 0 0  IORead
                        1 0 1  IOWrite
                        1 1 0  FlushCache
                        1 1 1  BIOWrite
PByteSelect[0..4)  I    4-bit wide field that indicates which bytes within a 32-bit word should be written. It is asserted only during the first PhA of an operation. PByteSelect[0] corresponds to the byte in position [0..8), PByteSelect[1] to the byte in position [8..16), and so on.
PMode              I    indicates processor mode. It is asserted only during the first PhA of an operation. 1 => user, 0 => kernel.
PData[0..32)       IO   32-bit data/address lines. During the first PhA these wires carry the address. During the first PhB they carry data, the direction being indicated by PCmd[3]. For a read, if the cache is not able to respond in one processor cycle, it drives the data during the first PhB for which it doesn't assert PReject.
PReject            O    used by the cache to indicate that it cannot complete the requested operation during this processor cycle. This signal is driven to 0 by the cache during every PhA and to 1 during PhBs where the cache wants to indicate non-completion.
PFault             O    used by the cache to indicate that the requested operation encountered a fault. This signal is driven to 0 by the cache during every PhA and to 1 during the last PhB of a faulting operation. Recall that PReject is also asserted during this last PhB.
PFaultCode[0..3)   O    3-bit field that specifies the fault when PFault is asserted. Its timing is the same as that for PFault. Its encoding is as follows:
                        000  undefined
                        001  insufficient privilege for memory operation
                        010  undefined
                        011  insufficient privilege for IO operation
                        100  map cache miss
                        101  DynaBus timeout
                        110  undefined
                        111  explicitly reported DynaBus fault
PReschedule        O    used by the cache to cause the processor to take an interrupt. This signal is synchronous to the DynaBus clock; it is not synchronized to the two-phase processor clock.
PReset             O    used by the cache to reset the processor.

6.3 DBus Pins

Here is a description of the DBus pins of the Cache. See the DBus specification for details.

Pin Name     I/O  Pin Description
DSelect      I    DBus selection. This line is asserted (high) when the Small Cache should perform a non-address operation on the DBus.
DSerialOut   O    data emitted serially on the DBus by the Small Cache. This line is floated except when DSelect is asserted.
DSerialIn    I    DBus serial input data and address.
nDReset      I    system reset (active low).
nDFreeze     I    ignored by the Small Cache.
DExecute     I    asks the cache to perform an execute cycle instead of a data/address transfer on the next positive edge of DShiftCk.
DAddress     I    when high, address bits are presented serially on DSerialIn. The last three bits presented specify the DBus internal register selected.
DShiftCK     I    DBus shift clock.

7.0 DC Characteristics

This section lists the Direct Current characteristics of each signal. The signal types may be: input, output, tristate, open drain, etc. Signals with the same characteristics are grouped together. A couple of sample entries from the arbiter are given; copy more table entries as needed.

Pin Name     Signal Type    Voltage    Current
Group D      5V output      L: 0.5     2 mA
                            H: 4.0     0
DSerialOut   5V Tri-state   L: 0.5     2 mA
                            H: 4.5     2 mA

Pin Type   Pin Name
Group D    nDHybridSel, DBdSel, CkOut

8.0 AC Characteristics

A. Definitions

The timing characteristics of each port are described in this section. It is generally assumed that the characteristics of all the wires connecting a chip to a particular component are the same, so that DynaBus signals, DBug Bus signals, Backpanel signals, etc. may be characterized as a group.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 4: Input Signal Characteristics

Ts (setup time) = the minimum time a signal must be stable before the rising edge of the clock.
Th (hold time) = the minimum time a signal must be stable after the rising edge of the clock.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 5: Output Signal Characteristics

Tcycle = the time interval between successive rising edges of the clock.
Tpd (propagation delay) = the waiting time after the clock is high until an output becomes valid.
Tm (maintenance of old data) = the time after the rising edge of the next clock cycle that old data remains valid.

B. Values
Qualified Pin Name                               Tmin   Ttypical   Tmax
Tcycle                                           20ns   25ns       27ns
Ts.DynaBus In (setup.DynaBus In)                 3ns
Th.DynaBus In (hold.DynaBus In)                  1ns
Tpd.DynaBus Out (propagation delay.DynaBus Out)  5ns
Tm.DynaBus Out (maintain.DynaBus Out)            2ns

9.0 Application Schematics of the Circuit

<< [Artwork node; type 'Artwork on' to command tool] >>

10.0 Physical Pin-Out For Each Package

This section includes a table of the pin numbers/pin names and a diagram indicating the range of pins that are located on each side of the PGA. If the chip has been implemented in more than one packaging technology, there should be a table and diagram for each implementation. Some sample entries are given from the arbiter.

No   Name             No   Name           No   Name     No    Name
4    nGrant.0         37   nOwnerOut      73   TIOvdd   110   ArbReqOut.0
5    nRequestOut.0.0  38   nSharedOut.0   74   TIOgnd   111   OtherArbIn.0.0

<< [Artwork node; type 'Artwork on' to command tool] >>