The Processor Cache

Pradeep Sindhu and Lissy Bland

Dragon-88-10    August 1988

© Copyright 1988 Xerox Corporation. All rights reserved.

Keywords: Multiprocessor Cache, Snoopy Cache, Single Chip Cache, Multicopy Consistency, Integrated Mapping, Read-Modify-Writes, Input Output, Packet Switched Bus.

Maintained by: Bland.pa, Sindhu.pa

XEROX  Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304

For Internal Xerox Use Only

The Processor Cache

1.0 Brief Description

The Processor Cache is the interface between Dragon processors and the DynaBus. It provides high speed memory access to a processor and enables multiple processors on a DynaBus to share memory efficiently and transparently. A processor can read or write data from on-chip memory in a single cycle in the case of a hit, while misses are less than ten times as expensive. The cache implements a simple but efficient multicopy consistency algorithm that automatically detects sharing by monitoring bus activity and generates broadcast writes when a processor updates shared data. A read-modify-write operation allows processors to synchronize without using ordinary reads and writes.

The cache also provides functions for address mapping, I/O, and interrupt handling. It translates virtual addresses from the processor to real addresses on the DynaBus, does page-level protection checking for kernel and user, solves the virtual address aliasing problem, and provides support for multiple address spaces. The cache performs two functions related to I/O: it implements operations for high bandwidth data transfer between the consistent memory and I/O devices, and it allows the processor to perform I/O control operations by sending commands over the DynaBus. Finally, the cache translates certain DynaBus I/O commands into interrupt and reset signals for the processor.

The cache is implemented as a single integrated circuit chip, providing a considerably smaller and potentially faster solution than traditional multi-chip alternatives in which a cache controller is connected to off-the-shelf RAMs. Multiple cache chips may be connected in parallel to increase the total amount of cache memory available to a processor. Each chip operates on a single 40 MHz clock provided by the DynaBus; the processor clock is synchronous to the DynaBus clock and operates at 10 MHz.

2.0 Pin-Out

The chip's 205 pins are divided into three main groups. One group connects to the DynaBus; the second group connects to the DBus; and the third group connects to the processor bus, or PBus. The bus side clocks, the processor side clocks, and processor reset make up the remaining pins.

<< [Artwork node; type 'Artwork on' to command tool] >>

3.0 Block Diagram of the Chip

<< [Artwork node; type 'Artwork on' to command tool] >>

4.0 Architectural Specifications

The cache is a single chip implementation of a multiprocessor snoopy cache. The chip has two primary interfaces: one connects to the processor and the other to the DynaBus. From the processor side, the cache responds to reads and writes to a 32-bit virtual address space. This interface is not pipelined. The transfer unit is a 32-bit word, although byte writes are permitted. On the DynaBus side, the cache generates reads and writes to a 32-bit physical address space, and also responds to reads and writes from the DynaBus when appropriate.
This interface is pipelined, so that more than one DynaBus request can be active within the chip at a time. The transfer unit is either a single line of eight 32-bit words or a single 32-bit word. A copy back scheme is used to keep main memory consistent with caches, and a write broadcast scheme is used to keep caches consistent with each other.

4.1 Key Features

There are separate directories for virtual and real addresses that operate independently. The separate directories minimize the impact of DynaBus traffic on processor throughput by eliminating contention from irrelevant bus transactions. Both directories are fully associative and employ an algorithm that approximates least-frequently-used in selecting victims for replacement. A fully associative implementation was chosen for the following reasons: the hit rate is higher and more stable than for a direct mapped or small-way set associative cache with the same amount of data; address aliasing (multiple virtual addresses pointing to the same physical address) can be handled easily; demapping, the operation of breaking all virtual links for a physical page, is trivial to implement; and finally, the structure provides an address translation table at no extra cost in area, since partial matches on the page part of a virtual address can be used to look up the corresponding real page. The fully associative structure requires some discipline for line replacement; the one used is an approximation to least-frequently-used.

The translation table is a cache of virtual-to-real translations for pages that have at least one block within the cache. When there is a miss to such a page, the translation is performed without any DynaBus references; otherwise the cache requests a device on the DynaBus to perform the translation. For each page, the cache keeps tag bits that allow it to implement simple read/write protection checks for user and kernel.

The cache also supports multiple address spaces, although at any given time virtual addresses from only one address space may be present within the cache. At address space switch time the virtual addresses for the old space are invalidated and those for the new one brought in on demand. Because of the independent physical directory, an address space switch does not require a data flush but only a virtual address flush.

The cache implements a multicopy consistency algorithm that has the effect of globally serializing reads and writes from multiple processors connected to a DynaBus. Serialization means that the course of any computation running on a real machine is identical to that of the same computation running on an abstract machine in which only one memory operation is allowed to execute at a time. Each cache chip on a DynaBus detects the onset and termination of sharing for memory locations by watching DynaBus transactions. A cache generates a broadcast write when its processor does a write to shared data. All caches, including the initiator, process the broadcast write, thereby keeping the various copies consistent. The transfer unit for broadcast writes is kept small for efficiency. The consistency algorithm also incorporates a read-modify-write that can be used by processors for atomic updates or synchronization, and block transfer operations that can be used by high speed IO devices to transfer to and from memory while maintaining consistency.
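The write-update decision just described can be summarized in a short C sketch. It is illustrative only, not the chip's logic: the line structure and the dynabus_write_single hook are assumptions made for the example.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     shared;    /* copies of this line may exist in other caches */
    bool     owner;     /* this cache's processor wrote the line last    */
    uint32_t data[8];   /* one line holds 8 32-bit words                 */
} CacheLine;

/* Stand-in for starting a WriteSingle transaction on the DynaBus; in the
 * real chip all copies, including the local one, are updated when the
 * WriteSingleReply returns.                                              */
void dynabus_write_single(uint32_t realAddr, uint32_t value)
{
    (void)realAddr;
    (void)value;
}

/* Decide how a processor write is handled, per the policy described above. */
void processor_write(CacheLine *line, uint32_t realAddr,
                     unsigned wordInLine, uint32_t value)
{
    if (!line->shared) {
        /* Not shared: update only the local copy; completes in one cycle. */
        line->data[wordInLine] = value;
        line->owner = true;
    } else {
        /* Shared: broadcast the write so every cached copy stays consistent. */
        dynabus_write_single(realAddr, value);
    }
}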
Recall that the DynaBus is packet switched, so that implementing the scheme sketched above is trickier than on a circuit switched bus, where transactions are atomic and the bus can be used as a global serializer.

In addition to its basic function as a cache, the chip also serves as a conduit for all non-memory related interactions with the processor. There are a small number of local IO registers within the cache that are used for some of these interactions. When the processor invokes an IO operation that addresses one of these registers, the chip processes the request locally; otherwise it sends a request out on the DynaBus, much like a miss for memory references. The processor is allowed to proceed when the register has been read or written, or when a reply is received on the DynaBus. Conversely, when an IO operation on the bus addresses one of the cache's IO registers, the cache performs a read or write as appropriate and causes whatever side effect happens to be defined for that register. For example, the processor interrupt and reset lines are manipulated via such side effects.

4.2 Cache/Processor Interface

The cache-processor interface is implemented via the PBus, a low-latency synchronous bus consisting of 48 wires (Figure 1). Of these wires, 32 comprise a multiplexed data/address path; 4 define the bus command; 4 define the byte enables for writes; one wire supplies the processor mode; a reject wire holds the processor for long operations; 4 wires provide a fault indication; and 2 wires, PhA and PhB, supply the two phases of the processor clock.

A typical operation begins in PhA when the processor transfers a 32-bit address to the cache. For a read type operation, the cache responds in PhB with either the 32-bit data or a reject indicating that the operation will take more than one processor clock. For a write type operation, the processor transfers the 32 bits of data during the first PhB, while the cache uses reject during that same PhB to indicate whether the operation is complete or additional cycles are needed.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 1: The Processor/Cache interface.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 2: The timing of a processor read when the data is present in the cache. Data is latched at the falling edge of Phase B.
<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 3: The timing of a processor write when the address is present in the cache.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 4: The timing of a processor read when the cache asserts reject for one cycle.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 5: The timing of a processor write when the cache asserts reject for one cycle.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 6: The timing of processor cache interactions when PFault is asserted.

4.2.1 Memory Transactions

The cache implements three memory transactions for the processor: Read, Write, and ConditionalWriteSingle (CWS).

4.2.1.1 Read

Read takes a 32-bit virtual address and returns 32 bits of data if the operation succeeds. The Read fails if the processor does not have sufficient privileges to perform a read into the target page, in which case the cache terminates the operation by indicating a fault (see the section on fault handling for details). There are three possibilities for a successful Read: (1) the data is already in the cache (a data hit), in which case the cache returns it in one processor cycle; (2) the data is not in the cache, but the virtual to real translation for the target is (a data miss, map hit), in which case the cache brings in the block containing the target word via a ReadBlock and returns the data to the waiting processor; (3) neither the data nor the translation is in the cache, in which case the cache first fetches the translation information from the Map Cache via the Map DynaBus transaction and then gets the target block via a ReadBlock as before. The Map transaction itself may fail if the Map Cache does not have the translation, in which case the cache terminates the Read by causing a map fault. The processor is expected to put the missing mapping entry into the Map Cache and then retry the instruction that issued the cache Read.

4.2.1.2 Write

Write takes a 32-bit virtual address, a 4-bit byte enable specifier, and 32 bits of data properly byte aligned. If successful, the Write updates the addressed location. The location is updated locally as well as in any other caches that may have copies, and only bytes corresponding to byte enable bits that are 1 are written. A Write fails if the processor does not have sufficient privilege to write into the target page. A Write for which the cache gets a data miss is treated exactly like a Read that misses followed immediately by a Write. If the Write is to a location that is not shared, the location is updated locally and the operation completes within one cycle. If the location is shared, the cache initiates a WriteSingle transaction on the DynaBus, causing the local copy and copies in other caches to be updated.

4.2.1.3 ConditionalWriteSingle

ConditionalWriteSingle takes a 32-bit virtual address, a 4-bit byte write specifier, and two 32-bit data values called old and new. If the processor has sufficient privilege to write into the target page, this operation does the following: it samples the current value of the location and compares it with old. If the two are equal, it writes new into the location and returns the sampled value to the processor. The comparison is done on the entire 32 bits of the two words, while the write is done according to the byte write specifier. If the processor has insufficient privilege, the cache terminates the operation by indicating a fault.
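The following C fragment is a sketch of these semantics, not the hardware itself. The byte numbering within the word, and the assumption that the sampled value is returned even when the comparison fails, are choices made for the example.

#include <stdint.h>

/* Model of ConditionalWriteSingle: compare the full 32-bit word with old,
 * and on success write new under control of the 4-bit byte write specifier. */
uint32_t conditional_write_single(uint32_t *location, uint32_t oldValue,
                                  uint32_t newValue, unsigned byteEnables)
{
    uint32_t sampled = *location;           /* sample the current value     */
    if (sampled == oldValue) {              /* full 32-bit comparison       */
        uint32_t merged = sampled;
        for (unsigned b = 0; b < 4; b++) {
            if (byteEnables & (1u << b)) {  /* write only the enabled bytes */
                uint32_t mask = 0xFFu << (8 * b);
                merged = (merged & ~mask) | (newValue & mask);
            }
        }
        *location = merged;
    }
    return sampled;    /* the sampled value is what the processor sees      */
}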
Because the ConditionalWriteSingle needs more parameters than can be passed by the PBus protocol in a single transaction, it must be invoked with multiple transactions. The typical sequence takes three transactions: the first two are IOWrites to set up registers in the cache reserved for the old and new values, while the third is a special read that actually performs the ConditionalWriteSingle. If the ConditionalWriteSingle is to a location that is not shared, the location is checked and conditionally updated locally, and the operation completes within a small number of processor cycles. If the location is shared, the cache generates a ConditionalWriteSingle transaction on the DynaBus, causing the local copy and copies in other caches to be updated. A ConditionalWriteSingle for which the cache gets a data miss is treated exactly like a Read that misses followed immediately by a ConditionalWriteSingle.

4.2.2 IO Transactions

The cache implements three IO transactions for the processor: IORead, IOWrite and BIOWrite. IO transactions are fundamentally different from memory transactions because there is no notion of consistency. The DynaBus IO architecture requires precisely one device to respond to a given IO address and forbids caching of IO data, so the consistency problem does not arise. Also, protection checking is implemented within IO devices rather than the processor. The low 256 locations of IO address space are reserved for local cache IO registers.

4.2.2.1 IORead

IORead takes a 32-bit IO address and returns 32 bits of data. There are two possibilities for a successful IORead:
1. The IO address refers to one of the internal IO registers of the cache, in which case the cache returns the contents of the appropriate register in one processor cycle.
2. The IO address does not refer to an internal register, in which case the cache initiates an IORead transaction on the DynaBus.
In both cases, the cache returns the data to the waiting processor when the IORead completes. An IORead may fail either because the device indicated by the IO address is non-existent or because the processor has insufficient privilege. The cache indicates either case by a fault on the PBus.

4.2.2.2 IOWrite

IOWrite takes a 32-bit IO address and 32 bits of data and updates the specified location. There are two possibilities for a successful IOWrite:
1. The IO address refers to one of the internal IO registers of the cache, in which case the cache updates the appropriate register in one processor cycle.
2. The IO address does not refer to an internal register, in which case the cache initiates an IOWrite transaction on the DynaBus.
In both cases, the cache releases the waiting processor when the IOWrite completes. An IOWrite may fail either because the device indicated by the IO address is non-existent or because the processor has insufficient privilege. The cache indicates either case by a fault on the PBus.

4.2.2.3 BIOWrite

BIOWrite takes a 32-bit IO address and 32 bits of data and updates a particular location in all devices of a given type. An IO address consists of three parts: a device type, a device address, and an offset. When a cache receives a BIOWrite on the PBus, it initiates a BIOWrite transaction with the same IO address and data on the DynaBus. All devices of the type specified in the IO address write the data into the location specified by the offset part of the IO address.
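The dispatch rule that IORead and IOWrite share (local registers in the low 256 IO addresses, everything else on the DynaBus) can be illustrated with a small sketch; read_local_io_register and dynabus_io_read are hypothetical helpers standing in for the cache's internal register file and a blocking DynaBus IORead transaction.

#include <stdbool.h>
#include <stdint.h>

#define LOCAL_IO_REGISTERS 256u   /* low 256 IO addresses belong to the cache */

/* Hypothetical helpers, declared but not defined here. */
uint32_t read_local_io_register(uint32_t ioAddr);
bool     dynabus_io_read(uint32_t ioAddr, uint32_t *data);  /* false => fault */

/* Dispatch an IORead from the processor, per the rule described above. */
bool cache_io_read(uint32_t ioAddr, uint32_t *data)
{
    if (ioAddr < LOCAL_IO_REGISTERS) {
        /* Internal register: answered in one processor cycle. */
        *data = read_local_io_register(ioAddr);
        return true;
    }
    /* Anything else goes out on the DynaBus; a non-existent device or
     * insufficient privilege comes back as a fault, reported on the PBus. */
    return dynabus_io_read(ioAddr, data);
}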
If BIOWrite is directed to a non-existent device type or device register, or if there is insufficient privilege, there is no indication to the requesting processor. No error is reported because of the fundamental difficulty of reporting faults for broadcast operations while still maintaining efficiency in the common case of no fault.

4.2.2.4 IO Registers

Table 1 below lists the IO registers for the cache. All registers are 32 bits wide.

Table 1: The IO Registers
IOAddr  Reg Name         AccessMode   Function
1       CWSOld           kernel/user  old value used by ConditionalWriteSingle
3       CWSNew           kernel/user  new value used by ConditionalWriteSingle
9       AID              kernel       id of currently loaded address space
11      FaultCode        kernel       provides information about fault to processor
13      InterruptStatus  kernel       interrupt status register
15      InterruptMask    kernel       interrupt mask register
16      ClrStatusBits    kernel       a write clears selected InterruptStatus bits. The selected bits are specified by the value written. These locations do not actually store any data.
24      SetStatusBits    kernel       a write sets selected InterruptStatus bits. The selected bits are specified by the value written. These locations do not actually store any data.
37      Modes            kernel       a register containing miscellaneous mode bits

4.2.3 Mapping and Protection

The cache supports a paged virtual memory architecture by implementing a first-level cache of translations from virtual to real addresses, by performing protection checks for operations initiated by the processor, and by providing transactions that allow processors to flush translations in order to modify them. At any given time the cache contains translations from a single address space. The number of this address space is kept in the internal 32-bit AID register.

Address translation assumes 4 KByte pages and works as follows. A 32-bit virtual address is broken up into a 22-bit page number and a 10-bit in-page offset (recall that the unit of addressing is a 32-bit word, not a byte). The internal translation table is used to look up the 22-bit virtual page number and map it to a 22-bit physical page number, which is concatenated with the 10-bit offset to produce the real address. If the lookup fails, however, the cache sends the contents of the AID register and the 22-bit virtual page number on the DynaBus via a MapRequest. The MapReply either provides a 22-bit page number along with protection flags that the cache enters into the translation table, or it indicates a map miss, in which case the cache aborts the current processor operation by signaling a map fault.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 7: The address translation mechanism.

If another block in the cache is from the same virtual page, the real address can be constructed by concatenating the real page and the offset. If the translation fails, the cache initiates a MapRequest on the DynaBus. The MapReply either provides a 22-bit physical page number with protection flags that the cache enters into the translation table, or it indicates a map miss in which case the cache aborts the current processor operation by signaling a map fault (Figure 8). Protection is implemented with three flags per page: KernelWriteEnable, UserReadEnable, and UserWriteEnable. Kernel reads are allowed to any page.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 8: The process of address translation for the cached memory system.
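The translation path of Figures 7 and 8 can be sketched as follows. The protection-flag check is omitted, and lookup_translation_table and dynabus_map_request are hypothetical stand-ins for the on-chip table lookup and the Map transaction.

#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 10u                        /* 10-bit word offset within a page */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

typedef struct {
    uint32_t realPage;   /* 22-bit real page number    */
    bool     valid;      /* translation found          */
    bool     mapFault;   /* Map Cache reported a miss  */
} Translation;

/* Hypothetical helpers, declared but not defined here. */
Translation lookup_translation_table(uint32_t virtualPage);
Translation dynabus_map_request(uint32_t aid, uint32_t virtualPage);

/* Translate a 32-bit, word-addressed virtual address as described above. */
bool translate(uint32_t aid, uint32_t virtualAddr, uint32_t *realAddr)
{
    uint32_t virtualPage = virtualAddr >> OFFSET_BITS;  /* 22-bit page number */
    uint32_t offset      = virtualAddr & OFFSET_MASK;   /* 10-bit offset      */

    Translation t = lookup_translation_table(virtualPage);
    if (!t.valid) {
        /* Miss in the on-chip table: send AID and virtual page in a MapRequest. */
        t = dynabus_map_request(aid, virtualPage);
        if (t.mapFault)
            return false;            /* abort the operation with a map fault */
    }
    *realAddr = (t.realPage << OFFSET_BITS) | offset;    /* concatenate       */
    return true;
}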
The cache provides two transactions to flush translations, DeMap and ClrAllVPValid. DeMap takes a real page as a parameter and flushes all cached translations that map a virtual page to this real page. When a cache receives a DeMap request on the PBus, it puts a DeMapRequest with the same real page on the DynaBus. The flush is actually performed when the DeMapReply is received from the DynaBus (Figure 9). Note that the data part of an entry is not affected by DeMap.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 9: The DeMap request.

ClrAllVPValid is different from DeMap in two respects: first, it affects only the cache whose processor issued the ClrAllVPValid; second, all translations are flushed in this cache rather than just the ones for a given real page.

4.2.4 Fault Handling

Whenever the cache encounters a fault, either one that it detects by itself (for example a protection violation) or one that is explicitly reported to it over the DynaBus (for example a map fault), it stores a 32-bit code specifying the fault into its local FaultCode register and aborts the operation currently in progress. The FaultCode register is divided into three fields. The 10 high-order bits give the DeviceID of the device reporting the fault. The 19 following bits constitute a device dependent MinorCode. The 3 least significant bits are the MajorCode. For faults reported over the DynaBus, the FaultCode register is set to the code that came in over the DynaBus. For a locally detected fault, the cache stores its own fault code into the MajorCode field of the FaultCode register. Table 2 gives the MajorCodes for the cache.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 10: Format of FaultCode.

Table 2: The MajorCodes used by the Cache
Encoding  Name               Meaning
000       undefined
001       MemAccessFault     insufficient privilege for memory operation
010       reserved
011       IOAccessFault      insufficient privilege for IO operation
100       MapFault           map cache miss
101       DynaBusTimeOut     DynaBus timeout
110       undefined
111       DynaBusOtherFault  explicitly reported DynaBus fault

There is one fault that can arise even when no processor operation is in progress: an overflow of the cache's output FIFO. When this happens, the cache asserts the DynaBus SStopOut line, indicating an unrecoverable error and bringing the system to a halt a small number of cycles later.

4.3 DynaBus Interface

The memory side of the cache connects to the DynaBus, a high bandwidth, synchronous, 64-bit packet-switched bus. Transactions on this bus consist of pairs of request-reply packets (Figure 11). A packet contains a header and some number of data words. The header specifies the transaction type, whether the packet is a request or a reply, the id of the transaction initiator, and a real memory address or an IO address, depending on the transaction.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 11: A transaction on the DynaBus consists of a request and a reply.

Caches are required to service packets on the DynaBus within a fixed delay. This requirement has two implications: first, the cache must give priority to DynaBus requests over processor requests, and second, it cannot have a FIFO at its input to buffer packets but must service them in real time.

The cache is organized in lines (Figure 12).

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 12: A line in the cache contains 8 flags, a virtual address, a real address and a block of data.

The 8 flags are as follows:
shared indicates that there may be copies of the line in other caches. (It is possible for shared to be set when sharing has stopped, because the end of sharing is not detected immediately.)
owner indicates that this cache's processor last wrote into the line. At most one cache may have owner set for a given line.
VPValid indicates whether the address contained in the corresponding virtual address field is valid. The set of valid virtual addresses constitutes the virtual directory.
RPValid indicates whether the address contained in the corresponding real address field is valid. The set of valid real addresses constitutes the real directory.
KernelWriteEnable when TRUE, the kernel has write permission.
UserReadEnable when TRUE, the user has read permission.
UserWriteEnable when TRUE, the user has write permission.
Spare this flag is unused.

On a packet switched bus, transactions involving a particular address may occur between the time that a given cache initiates a request and when it receives a reply. Therefore, there is an auxiliary line that contains a real address, an address valid bit, a shared bit, and a replyStale bit. This line is used to monitor packets in the interval between a cache's own ReadBlockRequest and the corresponding ReadBlockReply so as to correctly implement the consistency algorithm on a packet switched bus.

When a cache receives a packet, the packet may be one it launched itself, or one that comes from some other device. These two cases are discriminated by comparing a cache's own DeviceID, myId, with the DeviceID in the packet. We will use the shorthand RMatch[RA] to mean check if the real address RA is in the directory and return a boolean result, and PartialRMatch[RP] to mean check if the real page RP is in the directory and return a boolean result. The Victim line refers to the line selected by the replacement algorithm to be replaced when it is time to fetch data on a miss.

Memory Transactions

A ReadBlockRequest packet requests that some device in the memory system reply with current data for the addressed block. There are two cases for a cache: the packet is its own (idMatch), and the packet is from some other device (~idMatch). For the idMatch case, the cache enables its AuxLine by setting the AuxLine's valid bit and clears the cache's victim line. For the ~idMatch case, if the AuxLine address matches the packet address, the cache sets the AuxLine's shared bit and pulls the DynaBus shared line. Next, if the packet address is in the directory, it reads the data part and, if the packet is not its own, it sets the shared bit for the entry and pulls the DynaBus shared line. Finally, if the owner bit is set for the entry read out, the cache generates a ReadBlockReply corresponding to the ReadBlockRequest and puts it into its output FIFO.

A ReadBlockReply packet returns the data requested by an earlier ReadBlockRequest. A cache does nothing if the packet is not in response to a request that it sent (the ~idMatch case). Otherwise, it clears the AuxLine and then checks whether the incoming packet's address is already in the directory. If it is, the cache writes the addresses and data into the matching line; otherwise it writes them into the victim line.

A WriteBlockRequest packet injects a new block of data into the memory system, overwriting all previous copies. A cache enables the AuxLine if there is an idMatch. If not, it checks whether the packet address matches the AuxLine contents, and if so, sets the ReplyStale bit in the AuxLine.
Finally, if the packet address is in the directory, the cache updates the matching line with data from the packet and clears the owner bit for the line.

A WriteBlockReply packet simply acknowledges that the earlier WriteBlockRequest has been serviced. In the idMatch case the cache disables the AuxLine and marks the transaction not in progress.

A WriteSingleRequest packet requests that all cached copies of a single 32-bit word be updated. In the idMatch case the cache enables the AuxLine. In the ~idMatch case, it pulls the DynaBus shared line if the packet address is in either the directory or the AuxLine.

A WriteSingleReply packet does the work requested by the matching request packet. In the idMatch case the cache disables the AuxLine, marks the transaction not in progress, and, if the packet address is in the directory, writes the data, sets the owner bit, and updates the shared bit; the value written into the shared bit is the OR of ReplyShared in the incoming packet and the shared bit in the AuxLine. In the ~idMatch case it sets replyStale in the AuxLine if the AuxLine matches, and if the directory matches it writes the data and clears owner for the matching line.

A ConditionalWriteSingleRequest packet requests that all cached copies of a single 32-bit word be conditionally updated. The condition is that the current contents of the word are equal to the old value parameter. In the idMatch case the cache enables the AuxLine. In the ~idMatch case, it pulls the DynaBus shared line if either the directory or the AuxLine match.

A ConditionalWriteSingleReply packet does the work requested by the matching request packet. In the idMatch case the cache disables the AuxLine, marks the transaction not in progress, and, if the address exists in the directory, writes either the new value or the current value back, sets the owner bit, and puts the correct value into the shared bit. In the ~idMatch case, if the AuxLine matches it sets replyStale, and if the directory matches it clears owner and writes either the new value or the current value back.

A FlushBlockRequest packet requests that the block of data it carries be written to main memory. In the idMatch case a cache enables the AuxLine, while in the ~idMatch case it does nothing.

A FlushBlockReply packet simply acknowledges that the corresponding FlushBlockRequest has been serviced. In the idMatch case, the cache marks the transaction no longer in progress, while in the ~idMatch case it does nothing.

IO Transactions

An IOReadRequest packet requests that the addressed device reply with the value contained within the addressed location. Recall that the addressed device type, device id and the addressed location are all specified by the IO address. If the device type, the device id, and the target location all match, then the cache returns the contents of the target location.

An IOReadReply packet returns the value requested by the matching request packet. In the idMatch case, the cache marks the transaction no longer in progress.

An IOWriteRequest packet requests that the addressed device update the addressed location with the value contained in the packet. If the device type, the device id and the target location all match, the cache updates the target location.

An IOWriteReply packet simply acknowledges that the matching IOWriteRequest has been processed. In the idMatch case, the cache marks the transaction no longer in progress.

A BIOWriteRequest packet requests that all devices of the specified type update the addressed location with the value contained in the packet.
A cache takes no action on a BIOWriteRequest.

A BIOWriteReply packet causes the write requested by the corresponding BIOWriteRequest packet to be actually performed. If the device type and the target location match, the cache updates the target location.

Mapping Transactions

A MapRequest packet requests that a MapCache map the virtual page contained in the packet. Caches take no action on a MapRequest.

A MapReply packet contains either the real page corresponding to the virtual page contained in the corresponding MapRequest, or an indication of a map fault. In the idMatch case a cache marks the transaction no longer in progress and puts the returned value into an internal register.

A DeMapRequest packet requests that all cached map entries that point to the real page in the packet be flushed. Caches take no action on this packet.

A DeMapReply packet causes the action requested by the corresponding DeMapRequest to be completed. A cache clears the VPValid bits for all lines that contain the real page in the incoming packet.

4.4 Debugging Support

The general strategy for in-system debugging of DynaBus devices is for a device that detects an error to assert SStopOut in order to bring the system to a halt. The DBus then allows the critical state within devices to be examined. The ability to read out state alone is sufficient to diagnose predictable errors.

4.4.1 Synchronous Stop

The cache uses the DynaBus signal SStopIn to turn off requests from the processor. Any requests that are currently in progress are allowed to complete normally, but requests that come in after SStopIn is asserted are blocked via the PReject signal. To the processor, therefore, the effect of SStopIn is no different from that of a miss that never completes. Now, consider several cache chips embedded within a DynaBus system. Some time after SStopIn arrives, the processor ports of all caches will be shut off, turning off the source of cache requests. SStopIn also disables the arbiter so that it stops granting any more requests, and DynaBus activity also eventually stops. Thus, some small but indeterminate time after SStopIn is asserted, all activity ceases on both the processor and bus interfaces of each cache, and it is possible to examine the frozen, internal state of the cache.

4.4.2 Debug Hardware and Microcode

The frozen state of the cache is read over the DBus. This serial bus allows a chip to define a number of scan paths that can be used to read out state from selected registers and/or reload these registers with new values. Most of the useful state of the cache is in the array of lines. This state is accessed by providing microroutines to read the array in 32-bit chunks and store the values in a register that is in one of the scan paths. The register and the data paths from the array to the register are all that is needed to examine the cache's normal functions. The remaining state, which consists of about a dozen bits, is collected together into another scan path register.

4.4.3 DBus Parameterization

The cache has 5 registers that may be initialized and/or read via the DBus. Their addresses (path numbers) are as follows:
0: Chip Identification. 16 bits, RO. Indicates the type and version of the cache. The Chip Identification number for the current version is 1010 0010 1011 1010.
1: DeviceID. 10 bits, R/W. Specifies the DynaBus Device ID for the cache.
2: Data Register. 32 bits, R/W. Array data is routed to this register.
3: PC Register. 8 bits for microPC, 3 bits for RAM word address. R/W.
This register holds the microPC that specifies the microroutine to be executed and the word within a block to be read.
4: Miscellaneous Register. 11 bits, R/W. Contains miscellaneous control bits, including: ArrayShared, ArrayOwner, Flags, ArrayVirtualMatch, ArrayRealMatch, AuxLine.Shared, and AuxLine.ReplyStale.

5.0 Detailed Description of Functional Blocks

Figure 13 is a functional block diagram for the Cache. It has nine major functional blocks: VCam, RCam, Array Control, Ram, Auxiliary Line (AuxLine), Output Section, BControl, PControl, and Interlock Control. The VCam matches incoming virtual addresses from the processor to detect hits. The RCam performs the same match for the real addresses from the DynaBus side. Array Control contains multiplexers to connect either the virtual or real sides to the Ram, a small number of control bits per cache line, and logic used in victimization. The Ram, which is single ported, contains the cache data. The AuxLine contains logic to detect events that are important to pending transactions. The Output Section handles all of the buffering, arbitration, and data transfer involved in sending data on the DynaBus. Finally, BControl and PControl implement the control machinery for the two buses respectively, while Interlock Control contains logic to resolve resource conflicts between the processor and bus sides.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 13: Cache Functional Block Diagram.

5.1 VCam

The VCam is a content addressable memory that is divided into two sections. The first section addresses cached data. It contains one 31-bit entry for each line of cached data. Figure 14 illustrates its format. There is a valid bit, an address type bit, a 22-bit page address, and a 7-bit block address. The valid bit indicates whether or not the VCam entry contains significant data. The address type bit is set to 0, indicating that this is a memory entry rather than an IO entry. The page and block addresses together specify the virtual address of a 32-byte block. (The word address within the block is specified by three other bits that do not participate in VCam matches.) A partial match facility allows address comparison at the page level (see Figure 7).

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 14: The format of the VCam entries for cached data.

The second section of the VCam addresses IO registers. It contains six 31-bit entries providing access to the cache's IO registers. Figure 15 illustrates the format of these entries. There is a valid bit, which is always 1 (since the IO part is never modified); an address type bit, which is always 1 to indicate an IO entry; a 22-bit page address, which is all zero; and a 7-bit block address, which specifies 8 contiguous locations in IO address space. This part of the cam is never operated in partial match mode.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 15: The format of the VCam entries for IO registers.

When an address is presented, the cached data and IO sections perform the comparison in parallel, but only one of them ends up matching for any given address. The VCam operates in one of two modes: full match, where all 31 bits of the stored entries are compared, and partial match, where only the high 24 bits are compared. Partial match is used only if the incoming address is a memory address and a full match has failed.

5.2 RCam

The RCam is a content addressable memory that is organized like the VCam.
In fact, there is a one-to-one correspondence between entries in the memory and IO parts of the VCam and the RCam. The memory part consists of one 31-bit entry for each line of cached data, while the IO part consists of six 31-bit entries. As with the VCam, the RCam can do full and partial matches. During partial matches only the high-order 24 bits of the stored entries are compared with the incoming address (Figure 7).

Figure 16 illustrates the format for an RCam entry that addresses cached data. There is a valid bit, an address type bit, a 22-bit page address, and a 7-bit block address. The valid bit indicates if the RCam entry contains significant data. The address type bit is set to 0, indicating that this is a memory entry. The page and block addresses together specify the real address of a 32-byte block. (The address of the word in the block is specified by three other bits that do not participate in RCam matches.) A partial match facility makes it possible to find all the entries for a particular real page. This feature is used in implementing DeMap, which involves clearing the VPValid bits for all entries whose real page part matches a given real page.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 16: The format of the RCam entries for cached data.

The second section of the RCam addresses IO registers. Each entry contains a valid bit, which is always 1 (since the IO part is never modified); an address type bit, which is always 1 to indicate an IO entry; a 22-bit page address, which is all zero; and a 7-bit block address, which specifies 8 contiguous locations in IO address space. This part of the cam is never operated in partial match mode.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 17: The format of the RCam entries for IO registers.

5.3 Array Control

Array Control consists of several sub-blocks, each with a distinct function.

The victim logic subblock computes which line to replace when the cache encounters a miss. This computation is done using an approximation to the least-frequently-used algorithm. The algorithm keeps an array of use bits, one per line, and a victim pointer that points to one of the lines. During each cycle the victim logic proceeds as follows: if the victim line has its use bit set, that bit is cleared and the victim pointer is advanced; if the victim line does not have its use bit set, the victim pointer stays where it is. When a miss occurs, the location pointed to by the victim pointer is deleted from the cache, whether or not the use bit is set. During miss servicing, the victim pointer is frozen to keep it from moving to another line. The use bit is set each time the processor references a line, including the time when a line is just brought in. This last case needs more explanation. When a line is brought in, the victim pointer is frozen and the line's use bit is clear. First the victim pointer is unfrozen, but this does not result in any change because the use bit is zero. Then the use bit is set, which causes the victim pointer to advance and the use bit to return to zero. If the line is used frequently, there will be plenty of time for the processor to have set the use bit before the victim pointer wraps around and reaches this line again.

The shared/owner subblock contains a shared bit and an owner bit for each cache line. The shared bit for a line is 1 if there may be multiple cached copies of the line and 0 otherwise. The owner bit is 1 only if this line is the one last written into (amongst multiple copies, if any) by its processor.
The shared bit is read during processor writes and, when set, inhibits processor writes into the Ram. Processor writes to the Ram are inhibited when the data is shared because the data must be updated in all cached copies; this is done by having the cache whose processor wants to update the data initiate a WriteSingleRequest on the DynaBus. The shared bit is written from the bus side, but never read. The owner bit, on the other hand, is never read from the processor side, but is set from the processor side. It is both read and written from the bus side.

The remaining components of Array Control are two 3:1 muxes, called RamSelectMux and CamSelectMux. RamSelectMux connects either the virtual match lines, the real match lines, or the victim lines to the ram select lines in the array. For example, following a processor read or write that hits, the RamSelectMux will be set to select the virtual match lines, and the ram line enabled by the virtual match line that went high will be read or written. The CamSelectMux connects either the virtual match lines, the real match lines, or the victim lines to the cam select lines in the array. There is only one set of select lines for both the RCam and the VCam, since they are read and written together. For example, during a miss, the contents of the victim line need to be read out and written to memory if the owner bit is set. The victim line is read out by setting both the RamSelectMux and the CamSelectMux to select victim as the source.

5.4 Ram

The Ram contains the data for the cache and, like the two cams, is divided into a memory part and an IO part. The memory part contains some number of 256-bit lines, which may be read and written in 256-bit and 32-bit units, with support for writing individual bytes within a 32-bit word. The IO part contains the data storage for the cache's IO registers and hardware for manipulating bits in the interrupt registers. The IO registers are always read and written in 32-bit units.

5.5 Auxiliary Line

Because the DynaBus is packet-switched, transactions on a particular address may occur between the time that a given cache initiates a request and when it receives a reply to that request. The auxiliary line is used to detect this situation. It contains a 32-bit address register, a valid bit to indicate when the contents of the address register are significant, and 2 control bits, shared and replyStale. The auxiliary line operates between the time that the cache sees its own ReadBlockRequest and the time it sees the reply to that request. The address register and valid bit are loaded as the cache sees its own request, thus activating the auxiliary line's matching hardware. The matcher looks for certain transactions whose address field matches the contents of the auxiliary line's address register. If there is an address match on a WriteSingleReply, ConditionalWriteSingleReply, or a WriteBlockRequest, the replyStale bit is set, indicating that the reply to the ReadBlockRequest is bad. For a ReadBlockRequest from another cache, the auxiliary line sets its shared bit and pulls the DynaBus shared line. It also pulls the DynaBus shared line if it sees a WriteSingleRequest or a ConditionalWriteSingleRequest from another cache.

5.6 DynaBus Output Section

The Output Section is an autonomous piece of hardware that handles the details associated with sending packets on the DynaBus, including requesting the DynaBus, buffering, and providing the control for actually outputting a packet.
Internally, the Output Section is divided into three parts: a control part that does the sequencing, a two-cycle request buffer, and a five-cycle FIFO. The control part is a small state machine that contains logic for requesting the bus and for steering packet data to the bus when a grant arrives. The request buffer contains two 64-bit registers for the two cycles of a request packet and some state bits to indicate that a request is pending. A single packet buffer is sufficient because the cache never needs to send more than one two-cycle request packet at a time. The FIFO contains space for a small number of five-cycle packets and some control bits to indicate whether it is empty or full. A FIFO is needed because the cache may have to reply at bus speed to several two-cycle ReadBlockRequest packets, each of which requires a five-cycle reply. The FIFO is also used for two other purposes: to send a five-cycle FlushBlockRequest when a cache wants to write dirty data back to memory, and to send two-cycle replies to IORequests.

5.7 DynaBus Control

DynaBus Control provides the control machinery to handle incoming packets from the DynaBus. It decodes the command field of an incoming packet and generates the control signals to initiate whatever actions the cache needs to take for this packet. Signals meant for hardware dedicated to the DynaBus side are generated directly by DynaBus Control, whereas signals meant for hardware shared between the DynaBus and the processor side (such as the Ram) are sent to Interlock Control, which in turn produces the actual control signals.

5.8 PBus Control

PBus Control provides the control to handle requests from the processor side and to generate replies when the cache is ready. The requirements on PBus Control are quite different from those on DynaBus Control. Most operations must be serviced very quickly, while the remainder consist of long sequences of control signals. In contrast, DynaBus Control sequences do not have the short latency requirement, are heavily pipelined, and are intermediate in length between the very short and the very long sequences of PBus Control.

5.9 Interlock Control

There are several portions of the cache for which there is contention from both the DynaBus and the processor side. The most important of these are the Ram and the RCam. Interlock Control takes signals from DynaBus Control and PBus Control and generates controls for these blocks. The general strategy for arbitrating between the processor and DynaBus sides is based on the following externally imposed constraints:
1. Bus side requests must be serviced within a fixed time, but this fixed time does not need to be very small.
2. Most processor requests, on the other hand, must be serviced with very low latency, but the processor may be stalled when needed.
DynaBus side requests are therefore given priority, but they are delayed by a short time to allow an on-going processor side request to get out of the way. This delay allows the processor to retrieve data with the minimum latency in the case when a request hits in the cache.

6.0 Pin Descriptions

6.1 DynaBus Pins

The following table describes the DynaBus pins.

Pin Name         I/O  Pin Description
RequestOut       O    2 bits, indicates the arbiter request code as follows: 00: release bus hold; 01: assert bus hold; 10: make low priority two cycle request; 11: make high priority five cycle request.
SpareOut         O    2 bits, floated; present only for compatibility.
SStopOut         O    when asserted, this signal indicates that the cache intends to bring the system to a halt.
SharedOut        O    asserted by the cache to indicate that it holds a cached copy of the address that appeared on the DynaBus some cycles earlier. SharedOut is central to the cache consistency algorithm: each cache "watches" all addresses on the DynaBus and compares each address to the addresses it contains.
OwnerOut         O    asserted by a cache when it is the owner of the block specified in a ReadBlockRequest. This is used to prevent the memory from replying to a ReadBlockRequest when a cache is owner.
HeaderCycleOut   O    asserted during the first (header) cycle of a packet.
ParityOut        O    floated, present only for compatibility.
DataOut          O    64-bit wide DynaBus data output. Floated except when Grant delayed by one cycle is asserted.
Grant            I    indicates that the requester can use the DynaBus during the next cycle.
HiPGrant         I    ignored, present only for compatibility.
LongGrant        I    used to determine whether the grant was for a 2 cycle or a 5 cycle packet.
SpareIn          I    2 bits, ignored, present only for compatibility.
SStopIn          I    ignored, present only for compatibility.
SharedIn         I    used to accurately maintain the value of the several caches' Shared flags. When a cache initiates a WriteSingle, ConditionalWriteSingle or ReadBlockRequest, all caches that contain the datum (but not the cache that initiated the transaction) assert SharedOut. The Memory Controller receives the logical OR of the several caches' SharedOut wires as SharedIn and reflects this value in its reply to the transaction. If none of the caches asserted SharedOut, the Memory Controller's reply indicates that the datum is no longer shared. The cache that initiated the transaction then sets its Shared flag to false.
OwnerIn          I    the logical OR of the several caches' OwnerOut wires. It is present only for compatibility.
HeaderCycleIn    I    indicates header cycles coming from the DynaBus.
ParityIn         I    ignored, present only for compatibility.
DataIn           I    64-bit wide DynaBus data input.
Clock            I    DynaBus clock input.
CkOut            O    clock feedback output. Used to adjust the clock skew based on internal clock buffering delay.

6.2 PBus Pins

Here is a description of the PBus pins of the Cache. See the PBus specification for details.

Pin Name           I/O  Pin Description
PhA                I    this signal is high during the A phase of the two-phase non-overlapping processor clock.
PhB                I    this signal is high during the B phase of the two-phase non-overlapping processor clock.
PCmd[0..4)         I    4-bit wide processor command that indicates the operation being requested. It is asserted only during the first PhA of an operation. PCmd[0]=0 for a NoOp and 1 for a request, while PCmd[1..4) encodes the operation as follows:
                        0 0 0  MemoryRead
                        0 0 1  MemoryWrite
                        0 1 0  ConditionalWriteSingle
                        0 1 1  DeMap
                        1 0 0  IORead
                        1 0 1  IOWrite
                        1 1 0  FlushCache
                        1 1 1  BIOWrite
PByteSelect[0..4)  I    4-bit wide field that indicates which bytes within a 32-bit word should be written. It is asserted only during the first PhA of an operation. PByteSelect[0] corresponds to the byte in position [0..8), PByteSelect[1] to the byte in position [8..16), and so on.
PMode              I    indicates processor mode. It is asserted only during the first PhA of an operation. 1 => user, 0 => kernel.
PData[0..32)       IO   32-bit data/address lines. During the first PhA these wires carry the address. During the first PhB they carry data, the direction being indicated by PCmd[3]. For a read, if the cache is not able to respond in one processor cycle, it drives the data during the first PhB for which it doesn't assert PReject.
PReject            O    used by the cache to indicate that it cannot complete the requested operation during this processor cycle. This signal is driven to 0 by the cache during every PhA and to 1 during PhBs where the cache wants to indicate non-completion.
PFault             O    used by the cache to indicate that the requested operation encountered a fault. This signal is driven to 0 by the cache during every PhA and to 1 during the last PhB of a faulting operation. Recall that PReject is also asserted during this last PhB.
PFaultCode[0..3)   O    3-bit field that specifies the fault when PFault is asserted. Its timing is the same as that for PFault. Its encoding is as follows:
                        000  undefined
                        001  insufficient privilege for memory operation
                        010  undefined
                        011  insufficient privilege for IO operation
                        100  map cache miss
                        101  DynaBus timeout
                        110  undefined
                        111  explicitly reported DynaBus fault
PReschedule        O    used by the cache to cause the processor to take an interrupt. This signal is synchronous to the DynaBus clock; it is not synchronized to the two-phase processor clock.
PReset             O    used by the cache to reset the processor.

6.3 DBus Pins

Here is a description of the DBus pins of the Cache. See the DBus specification for details.

Pin Name     I/O  Pin Description
DSelect      I    DBus selection. This line is asserted (high) when the Small Cache should perform a non-address operation on the DBus.
DSerialOut   O    data emitted serially on the DBus by the Small Cache. This line is floated except when DSelect is asserted.
DSerialIn    I    DBus serial input data and address.
nDReset      I    system reset (active low).
nDFreeze     I    ignored by the Small Cache.
DExecute     I    asks the cache to perform an execute cycle instead of a data/address transfer on the next positive edge of DShiftCk.
DAddress     I    when high, address bits are presented serially on DSerialIn. The last three bits presented specify the DBus internal register selected.
DShiftCK     I    DBus shift clock.

7.0 DC Characteristics

This section lists the Direct Current characteristics of each signal. The signal types may be: input, output, tristate, open drain, etc. Signals with the same characteristics are grouped together. A couple of sample entries from the arbiter are given; copy more table entries as needed.

Pin Name     Signal Type    Voltage    Current
Group D      5V output      L: 0.5     2 mA
                            H: 4.0     0
DSerialOut   5V Tri-state   L: 0.5     2 mA
                            H: 4.5     2 mA

Pin Type   Pin Name
Group D    nDHybridSel, DBdSel, CkOut

8.0 AC Characteristics

A. Definitions

The timing characteristics of each port are described in this section. It is generally assumed that the characteristics of all the wires connecting a chip to a particular component are the same, so that DynaBus signals, DBug Bus signals, Backpanel signals, etc. may be characterized as a group.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 4: Input Signal Characteristics

Ts (setup time) = the minimum time a signal must be stable before the rising edge of the clock.
Th (hold time) = the minimum time a signal must be stable after the rising edge of the clock.

<< [Artwork node; type 'Artwork on' to command tool] >>
Figure 5: Output Signal Characteristics

Tcycle = the time interval between successive rising edges of the clock.
Tpd (propagation delay) = the waiting time after the clock is high until an output becomes valid.
Tm (maintenance of old data) = the time after the rising edge of the next clock cycle that old data remains valid.

B. Values
Qualified Pin Name                               Tmin   Ttypical   Tmax
Tcycle                                           20ns   25ns       27ns
Ts.DynaBus In (setup.DynaBus In)                 3ns
Th.DynaBus In (hold.DynaBus In)                  1ns
Tpd.DynaBus Out (propagation delay.DynaBus Out)  5ns
Tm.DynaBus Out (maintain.DynaBus Out)            2ns

9.0 Application Schematics of the Circuit

<< [Artwork node; type 'Artwork on' to command tool] >>

10.0 Physical Pin-Out For Each Package

This section includes a table of the pin numbers/pin names and a diagram indicating the range of pins that are located on each side of the PGA. If the chip has been implemented in more than one packaging technology, there should be a table and diagram for each implementation. Some sample entries are given from the arbiter.

No   Name             No   Name           No   Name     No    Name
4    nGrant.0         37   nOwnerOut      73   TIOvdd   110   ArbReqOut.0
5    nRequestOut.0.0  38   nSharedOut.0   74   TIOgnd   111   OtherArbIn.0.0

<< [Artwork node; type 'Artwork on' to command tool] >>