BigCacheNotes.tioga
Don Curry February 9, 1988 5:13:52 pm PST
Notes
White Board
4 blocks per assoc in small cache
CWS turns around with correct data
Wts turn around at cache with owner and ~shared
Data always valid at and below cache with owner and shared below
RB snooper can record valid data. Retry not needed
Kill Block
To enable >2 levels
Assoc limits
last level isn't doing work unless assoc limits reached (and kill needed)
Instead, larger blocks allow more time for external matches
Size Notes
          words  blocks  smlLines  bigLines
block         8
smlLine      32       4
bigLine    1024      32         8
bigCache                              256
32 blocks/bigLine * 7 data bits/block = 7 data words/bigLine
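The same numbers as C constants, a minimal sketch just to re-check the arithmetic (the identifier names are illustrative, not from the spec):

#include <assert.h>

enum {
  WordsPerBlock       = 8,
  BlocksPerSmlLine    = 4,
  BlocksPerBigLine    = 32,
  SmlLinesPerBigLine  = 8,
  BigLinesPerBigCache = 256
};

int main(void) {
  assert(WordsPerBlock * BlocksPerSmlLine == 32);     /* smlLine  =   32 words */
  assert(WordsPerBlock * BlocksPerBigLine == 1024);   /* bigLine  = 1024 words */
  assert(BlocksPerSmlLine * SmlLinesPerBigLine == BlocksPerBigLine);
  return 0;
}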
Flush block - Executing the Victim
After locating a victim block, you must pick some time to look at the owner bit to decide whether to send an FBRqst. If the bit is set, then from that time until the FBRqst actually appears on the bus, the bus must be monitored for WSRply, CWSRply, or WB. If one of these write packets is seen, then ownership has been lost and the FBRqst must be aborted. There is still the time period, bus to snooper plus headerEnable to bus, which, if long enough for a matching WBRply or WSRply/FB pair to sneak in, could cause problems. In the first implementation, this is taken care of by requiring memories to throw away any FBs which occur immediately following WBs. (The time period in question is too short for a WSRply/FB pair to sneak in.)
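A minimal sketch of that abort rule in C (packet and field names are illustrative; this models the decision only, not the hardware timing):

#include <stdbool.h>
#include <stdint.h>

typedef enum { RBRqst, RBRply, WSRqst, WSRply, CWSRqst, CWSRply,
               WBRqst, WBRply, FBRqst, FBRply } Cmd;
typedef struct { Cmd cmd; uint32_t blockAddr; } Packet;

typedef struct {
  bool     pendingFB;    /* victim.owner was set when examined              */
  bool     abortFB;      /* ownership lost before the FBRqst reached the bus */
  uint32_t victimAddr;
} FlushState;

/* Run on every bus packet seen while the FBRqst is still queued. */
void SnoopForFlushAbort(FlushState *fs, Packet p) {
  if (!fs->pendingFB || p.blockAddr != fs->victimAddr) return;
  if (p.cmd == WSRply || p.cmd == CWSRply ||   /* the write packets named above: */
      p.cmd == WBRqst || p.cmd == WBRply)      /* WSRply, CWSRply, WB            */
    fs->abortFB = true;
}

/* Checked when the FBRqst is about to go out on the bus. */
bool MayIssueFB(const FlushState *fs) {
  return fs->pendingFB && !fs->abortFB;
}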
Read Block
Between the time of sending out an RBRqst and receiving the reply, the cache is responsible for monitoring the requested address just as if the data had already been received. That is: setting shared if someone else makes a memory reference to that location, and either recording the new data from a WSRply, CWSRply, or WBRply, or just marking the transaction as invalid using the replyStale bit of the Snooper.
Snooper
The Snooper is used in both cases described above:
Disable sending a FBRqst if the victim is no longer owned.
The Snooper is enabled with the victim addr when the victim.owned bit is read.
While waiting for the FBRqst to be sent, abortFB is set if: WSRply, CWSRply, WBRply
Notice new clients and values while waiting for RBRply
The Snooper is enabled with the requested addr when the RBRqst is seen on the bus.
While waiting for the RBRply, replyStale is set if: WSRply, CWSRply, WBRply
While waiting for the RBRply, shared is set if: RB, WS, CWS, WB
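A minimal sketch of these two uses as one routine (names are illustrative; packets from the cache itself are filtered out, per the SnoopMatch ~myId note further down):

#include <stdbool.h>
#include <stdint.h>

typedef enum { RBRqst, RBRply, WSRqst, WSRply, CWSRqst, CWSRply,
               WBRqst, WBRply } Cmd;
typedef struct { Cmd cmd; uint32_t blockAddr; bool mine; } Packet;

typedef struct {
  bool     enabled;
  bool     forVictim;    /* true: FB-abort use; false: RBRply-monitor use */
  uint32_t addr;
  bool     abortFB;      /* FB use                                        */
  bool     replyStale;   /* RB use: a write beat the RBRply to the block  */
  bool     shared;       /* RB use: another client referenced the block   */
} Snooper;

/* Run on every bus packet from another client (~myId). */
void Snoop(Snooper *s, Packet p) {
  if (!s->enabled || p.mine || p.blockAddr != s->addr) return;
  bool writeRply = (p.cmd == WSRply || p.cmd == CWSRply || p.cmd == WBRply);
  if (s->forVictim) {
    if (writeRply) s->abortFB = true;
  } else {
    if (writeRply) s->replyStale = true;
    if (p.cmd == RBRqst || p.cmd == WSRqst ||
        p.cmd == CWSRqst || p.cmd == WBRqst)
      s->shared = true;
  }
}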
Write Block
WriteBlockRqst moves up to the root, where the write occurs. WriteBlockRply follows IsPresent down, clearing owner and updating data on the way. Caches do not pull shared/owner for a WBRqst. WBRply clears owner and leaves shared unchanged.
WriteSingle
WriteSingleRqst moves up to the point where the block is no longer sharedAbove. Then it is turned around and follows IsPresent down. The sender path sets owner and the previous owner is cleared. Caches pull shared for the Rqst and set their shared values and data based on the reply.
ReadBlock
ReadBlockRqst moves up to a level where the block is cached (the top by default), where it is turned around and follows the requestor path back down. Intermediate caches set their shared bits on this new entry. Caches pull shared/owner and the requestor sets shared based on the reply.
FlushBlock
FlushBlockRqst moves ownership up one level. At the top, this moves valid data to memory.
Victim pointer.
There is a use bit associated with each block.
Each cycle the victim pointer examines the use bit. IF set
THEN clear and advance
ELSE hold position
To execute victim, prevent movement for two cycles. This clears the use bit and since the processor can't do any more requests, the use bit will stay cleared.
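As a sketch (field names illustrative), the per-cycle rule is:

#include <stdbool.h>

#define NBLOCKS 50   /* associators per small cache, from the size notes */

typedef struct {
  bool useBit[NBLOCKS];   /* one use bit per block     */
  int  victim;            /* revolving victim pointer  */
} VictimState;

/* One cycle of the rule above: clear-and-advance past a used block,
   otherwise hold position. */
void VictimTick(VictimState *v) {
  if (v->useBit[v->victim]) {
    v->useBit[v->victim] = false;
    v->victim = (v->victim + 1) % NBLOCKS;
  }
}

/* To execute a victim the pointer is held for two cycles; with the
   processor stalled, the cleared use bit stays cleared, so this entry
   is the one replaced. */
int CurrentVictim(const VictimState *v) { return v->victim; }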
Ensuring that an owner path gets initialized in a multi-level system.
In order to ensure that ownership of a block is communicated all the way up the tree when a block is first written, it is necessary for all intermediate caches to set their shared bits true for a block when it first gets read into the cache. If, after the first write into this block, it truly is not shared, these shared bits will be cleared. In the process, though, an owner path from the top will have been defined.
Change Notes.
Multiple (4) blocks per associator in Bottom Cache.
50   associators per small cache
400 associators per second level Cache
8 Wds per Block
32 Wds per associator QuadBlock
1024 Wds per page
CWSRply returns the correct data (computation only done once at point of reflection).
All write requests (WS, CWS, WB) are reflected by the cache with the block present and ~shared.
This is always true for the top cache server (memory).
All write replies (WS, CWS, WB) are 5 cycles and contain the entire new block.
While waiting for a RBRply:
Array Data is maintained as if the block were already present.
Any write reply to that block is performed and staleReply is set.
Array Shared is maintained as if the block were already present.
Bus shared is pulled on other requests.
Array Owner will always be false.
The arrival of the RBRply updates data only if NOT staleReply.
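A minimal sketch of the reply-arrival rule (names and the word width are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 8

typedef struct {
  uint32_t data[WORDS_PER_BLOCK];
  bool     shared, owner, valid;
} BlockEntry;

typedef struct {
  bool sharedSeen;    /* other clients referenced the block while waiting   */
  bool staleReply;    /* a write reply to the block was seen and performed  */
} PendingRB;

/* Reply arrival: install the reply's data only if no write beat it. */
void OnRBRply(BlockEntry *b, const PendingRB *p, bool replyShared,
              const uint32_t replyData[WORDS_PER_BLOCK]) {
  if (!p->staleReply)
    for (int i = 0; i < WORDS_PER_BLOCK; i++) b->data[i] = replyData[i];
  /* else the snooped write already updated the array data */
  b->shared = p->sharedSeen || replyShared;
  b->owner  = false;     /* owner is always false while waiting */
  b->valid  = true;
}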
A cache which pulls owner on a BlockRead and does not currently have shared set must issue a modified RBRply (MRBRply) instead of an RBRply, in order to notify the cache above to update its data for the block and to ensure that the parent caches of shared caches have valid data.
This replaces FB.
Consider a revolving Blank in addition to victim (instead of Snooper)
Consider: the bottom cache may issue an RB for an address it already has (VM page alias)
Victim can't be touched until reply received
The client does an array match on all?
The server only receives:
Request where owner is not set.
Reply Addressed to it.
MRBRply.
FBRqst.
FB is only done by caches victimizing owned unshared blocks
If the block was shared, the parent's copy would already be valid
The FB snooper only needs to watch for WBRply (WSRply and CWSRply can't happen)
Big cache requirements:
Number of blocks matched in a big cache is >= the sum of the blocks in all the caches below.
Big Cache can maintain data off chip to make room for more matchers.
Since the data is maintained off-chip on big caches, high-speed consistency maintenance operations triggered by monitoring the bus above must be simple. Specifically, CWS and WS are difficult to implement, since CWS requires a read-compare-write operation on off-chip data in 125 ns and WS requires a 50 ns write (assuming a 25 ns bus cycle).
Proposed Changes:
SmallCaches store 4 blocks per matcher rather than one.
This makes it possible for a big cache to cover 8 to 16 small caches.
Simplify write protocols.
Old bus write operation protocols:
CWS Rqst 2 Cycle Rply 5 Cycle  mirrors request (3 dummy cycles to provide time)
WS Rqst 2 Cycle Rply 2 Cycle  mirrors request
WB Rqst 5 Cycle Rply 2 Cycle  mirrors request address
Proposed bus write operation protocols:
CWS Rqst 2 Cycle Rply 5 Cycle  update block  replies with new block
WS Rqst 2 Cycle Rply 5 Cycle  update block  replies with new block
WB Rqst 5 Cycle Rply 5 Cycle  update block  replies with new block
Use Bit
Victim
Each cycle the victim pointer examines the use bit. IF set
THEN clear and advance
ELSE hold
To execute victim, prevent movement for two cycles. This clears the use bit and since the processor can't do any more requests, the use bit will stay cleared.
The first time an intermediate cache hands back an RBRqst, it sets shared to ensure that a write will propagate to the top.
shared
owner
sharedBelow -- may contain exact details rather than BOOL
ownedBelow -- may contain exact details rather than BOOL
shared  =>  shared for all caches equal or below.
shared  =>  sharedBelow.
sharedBelow #> shared.
ownedBelow =>  ownedBelow for all caches above.
ownedBelow =>  owned.
owned  #> ownedBelow.
Caches reply to  botWrites to NOT shared blocks.
Caches relay up  botWrites to   shared blocks.
Caches reply to  topReads from owner   blocks.
Caches relay down topReads from ownedBelow blocks.
Caches reply to   botReads from NOT ownedBelow blocks.
Caches relay up  botReads from NOT ownedBelow AND UnCached blocks.
Caches update from topWrites to NOT sharedBelow blocks.
Caches relay down topWrites to   sharedBelow.
A victim block for which owner is set must be flushed.
bottom WSRqst ~ sharedAbove => WSRply bottom
bottom WSRqst  sharedAbove => WSRqst top
top  WSRply  ownedBelow => WSRply bottom
top  RBRqst  owned  => RBRply top
top  RBRqst  ownedBelow => RBRqst bottom
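The reply/relay rules above as one decision function, a minimal sketch (the enum and field names are illustrative; "bot" means a packet arriving from the bus below, "top" from the bus above; cases the list doesn't cover are marked NotCovered):

#include <stdbool.h>

typedef enum { BotWrite, BotRead, TopWrite, TopRead } Op;
typedef enum { Reply, RelayUp, RelayDown, UpdateLocal, NotCovered } Action;

typedef struct {
  bool cached, shared, owner, sharedBelow, ownedBelow;
} LineState;

Action Decide(Op op, LineState s) {
  switch (op) {
  case BotWrite: return s.shared      ? RelayUp   : Reply;
  case TopWrite: return s.sharedBelow ? RelayDown : UpdateLocal;
  case TopRead:  if (s.owner)      return Reply;
                 if (s.ownedBelow) return RelayDown;
                 return NotCovered;
  case BotRead:  if (!s.cached)     return RelayUp;
                 if (!s.ownedBelow) return Reply;
                 return NotCovered;   /* ownedBelow: not in the listed rules */
  }
  return NotCovered;
}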
SnoopMatch[snoop, rAddr] should probably be enabled by ~myId
Snooper only really needed as FB enabler and RBRply shared and replyStale
Snooper initialized by RBRqst
Snooper.shared set if matching mem ref noticed RB, WS, CWS, WB. (not FB)
Snooper.rplyStal set if matching mem Wt ref noticed WS, CWS, WB.
Difficulty:
When victimizing a block, you must pick some instant to look at the owner bit to decide whether to send an FBRqst. From that instant until the FBRqst actually appears on the bus, the bus must be monitored for WSRply, CWSRply, or WB, and if one is seen, the FB must be cancelled.
Header:
(Cmd, ModeOrFault, ReplyShared, DeviceID, Address)
ReadBlock 0000 Cache to Memory or Cache
RBRqst  2 Header; ValidVictim, VictimAddress
RBRply  5 Header; CyclicOrderData
WriteBlock 0001 External to Memory and Cache
WBRqst  5 Header; CyclicOrderData
WBRply  2 Header; x
WriteSingle 0010 Cache to (only) Caches
WSRqst  2 Header; Data
WSRply  2 Header; Data
CWriteSingle 0011 xxx
CWSRqst  2 Header; Old, New
CWSRply  5 Header; Old, New; x; x; x
FlushBlock 0100 Cache to (only) Memory
FBRqst  5 Header; CyclicOrderData
FBRply  2 Header; x
KillBlock  ---- BigCache to SmlCaches
KBRqst  5 Header; CyclicOrderData
FBRply  2 Header; x
IORead  1000 Caches to IO
IORRqst  2 Header; x
IORRply  2 Header; Data
IOWrite  1001 Caches to IO
IOWRqst  2 Header; Data
IOWRply  2 Header; x
BIOWrite  1010 Caches to IO Type
BIOWRqst 2 Header; Data
BIOWRply 2 Header; x
Map   1110 SmlCache to MapCache
MapRqst  2 VPage; AddrSpaceID
MapRply  2 RPage,  Flags; XXX
DeMap  1111 Clears VPValid in all Caches
DMapRqst 2 RPage; x
DMapRply 2 RPage; x
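The header fields and Cmd encodings above as C declarations, a minimal sketch (the field widths are guesses; KillBlock's encoding is left as ---- in the notes, so it is omitted):

#include <stdint.h>
#include <stdbool.h>

typedef enum {
  ReadBlock    = 0x0,   /* 0000 */
  WriteBlock   = 0x1,   /* 0001 */
  WriteSingle  = 0x2,   /* 0010 */
  CWriteSingle = 0x3,   /* 0011 */
  FlushBlock   = 0x4,   /* 0100 */
  IORead       = 0x8,   /* 1000 */
  IOWrite      = 0x9,   /* 1001 */
  BIOWrite     = 0xA,   /* 1010 */
  Map          = 0xE,   /* 1110 */
  DeMap        = 0xF    /* 1111 */
} DynaBusCmd;

typedef struct {
  DynaBusCmd cmd;
  uint8_t    modeOrFault;
  bool       replyShared;
  uint16_t   deviceID;
  uint32_t   address;     /* block address */
} DynaBusHeader;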
RBRqst  - Header; ValidVictim, VictimAddress
RBRply  5 Header; CyclicOrderData
WBRqst  5 Header; CyclicOrderData
WBRply  - Header; x
WSRqst  - Header; Data
WSRply  - Header; Data
CWSRqst  - Header; Old, New
CWSRply  5 Header; Old, New; x; x; x
FBRqst  5 Header; CyclicOrderData
FBRply  - Header; x
KBRqst  ? Header; CyclicOrderData
FBRply  - Header; x
IORRqst  - Header; x
IORRply  - Header; Data
IOWRqst  - Header; Data
IOWRply  - Header; x
BIOWRqst - Header; Data
BIOWRply - Header; x
MapRqst  - VPage;   AddrSpaceID
MapRply  - RPage, Flags; x
DMapRqst 2 RPage;   x
DMapRply 2 RPage;   x
     Rqstr/Rplyr Lstnr
RBRqst   C   C M
RBRply    M  C
WBRqst   C   C M
WBRply    M
WSRqst   C   C
WSRply    T  C
CWSRqst   C   C
CWSRply    T  C
FBRqst 
FBRply 
KBRqst 
FBRply 
MapRqst   C
MapRply    mc C
DMapRqst 
DMapRply 
IORRqst   IO   IO
IORRply   IO T  IO
IOWRqst   IO   IO
IOWRply   IO T  IO
BIOWRqst  IO   IO
BIOWRply   T  IO
Fetch or Store and miss =>  ReadBlock
Store and shared =>    WriteSingle
victim and owner =>   FlushBlock
RBRqst and ~self and owner =>  RBRply
waiting on RB and (WSRply, CWSRply, WBRqst) =>  RBRqst again  (RplyStale)
For each block
shared
owner
For each outstanding request
sharedAccumulator
rplyStale  ReadBlockReply
For each block
existsBelow bit for a block is set only if some small cache also has a copy of the block
allows a big cache to filter packets that appear on the main bus
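A minimal sketch of that filter (the lookup and sizes are illustrative):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NLINES 256   /* illustrative number of big-cache entries */

typedef struct { uint32_t addr; bool valid, existsBelow; } BigLineEntry;
static BigLineEntry lines[NLINES];

/* Fully associative lookup, purely illustrative. */
static BigLineEntry *Match(uint32_t blockAddr) {
  for (int i = 0; i < NLINES; i++)
    if (lines[i].valid && lines[i].addr == blockAddr) return &lines[i];
  return NULL;
}

/* Forward a snooped main-bus packet onto the sub-bus only when some
   small cache below holds a copy of the block. */
bool ForwardToSubBus(uint32_t blockAddr) {
  BigLineEntry *e = Match(blockAddr);
  return e != NULL && e->existsBelow;
}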
BigCache.tioga
Don Curry December 10, 1987 4:08:11 pm PST
Open BigCacheNotes.tioga DynaBusLogicalSpecifications.tioga DynaBusGuidelines.tioga
All times referenced to received packets
Multi level
Shared at top implies shared at bottom
Owner at Bottom implies Owner at top
Two choices
one line in the big cache for each small cache line xxx
watch victims/ decode cache id/ keep track of who has what
Dynabus:
64 bits Per 25ns x 4/7 < 200 MBytes/Sec
Memory
average 80ns/bit = 12 Mbits/Sec
8*200/12 = 128 bit bus to memory
32 4x256K rams => 2 MBytes
128 bits = 16 Bytes per mem access
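Re-deriving those figures, a minimal sketch (reading 4/7 as the fraction of data-carrying bus cycles, which is an assumption, and rounding as the notes do):

#include <stdio.h>

int main(void) {
  double busBytesPerSec  = (64.0 / 8.0) / 25e-9;        /* 64 bits per 25 ns = 320 MB/s  */
  double dataBytesPerSec = busBytesPerSec * 4.0 / 7.0;  /* ~183 MB/s, under 200 MB/s     */
  double dramBitsPerSec  = 1.0 / 80e-9;                 /* 80 ns per bit ~ 12.5 Mbit/s   */
  double memBusBits = (200e6 * 8.0) / dramBitsPerSec;   /* ~128-bit bus to memory        */
  printf("%.0f MB/s data, %.0f-bit memory bus\n",
         dataBytesPerSec / 1e6, memBusBits);
  return 0;
}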
A hit must be encoded into Ram Address
Ops
32 bit real address =>
Assume .8 micron technology => 1/6 the 2 micron area