4 blocks per assoc in small cache
CWS turns around with correct data
Wts turn around at cache with owner and ~shared
Data always valid at and below cache with owner and shared below
RB snooper can record valid data. Retry not needed

Kill Block
    To enable >2 levels
    Assoc limits
    last level isn't doing work unless assoc limits reached (and kill needed)
    Instead, larger blocks
    more time
    external matches

            words  blocks  smlLine  bigLine
blocks      8
smlLine     32     4
bigLine     1024   32      8
bigCache    256
32 blocks/bigLine * 7 data bits/block
7 data words/bigLine

<< WriteSingleRqst moves up to the point where the block is no longer sharedAbove. Then it is turned around and follows IsPresent down. The sender path sets owner and the previous owner is cleared. Caches pull shared for the Rqst and set their shared values and data based on the reply.>>

<< ReadBlockRqst moves up to a level where it is cached (default top) where it is turned around and follows the requestor path back down. Intermediate caches set their shared bits on this new entry. Caches pull shared/owner and the requestor sets shared based on the reply.>>

<< FlushBlockRqst moves ownership up one level. At the top, this moves valid data to memory.>>

<<50 associators per small cache>>
<<400 associators per second level Cache>>
<<8 Wds per Block>>
<<32 Wds per associator QuadBlock>>
<<1024 Wds per page>>

Big cache requirements:
    Number of blocks matched in a big cache is >= sum of all the caches below.
    Big Cache can maintain data off chip to make room for more matchers.
    Since the data is maintained off chip on big caches, high speed consistency maintenance operations triggered by monitoring the bus above must be simple.
Specifically, CWS and WS are difficult to implement since CWS requires a read, compare, write operation on off chip data in 125ns and WS requires a 50ns write (assuming 25 ns bus cycle).

Proposed Changes:
    SmallCache stores 4 blocks per matcher rather than one. This makes it possible for a big cache to cover 8 to 16 small caches.
    Simplify write protocols.

Old bus write operation protocols:
    CWS   Rqst 2 Cycle   Rply 5 Cycle   mirrors request (3 dummy cycles to provide time)
    WS    Rqst 2 Cycle   Rply 2 Cycle   mirrors request
    WB    Rqst 5 Cycle   Rply 2 Cycle   mirrors request address

Proposed bus write operation protocols:
    CWS   Rqst 2 Cycle   Rply 5 Cycle   update block replies with new block
    WS    Rqst 2 Cycle   Rply 5 Cycle   update block replies with new block
    WB    Rqst 5 Cycle   Rply 5 Cycle   update block replies with new block

Use Bit Victim
    Each cycle the victim pointer examines the use bit.
        IF set THEN clear and advance ELSE hold
    To execute victim, prevent movement for two cycles. This clears the use bit and, since the processor can't do any more requests, the use bit will stay cleared.

The first time an Intermediate cache hands back a RBRqst it sets shared to ensure that a write will propagate to the top.

shared
owner
sharedBelow -- may contain exact details rather than BOOL
ownedBelow -- may contain exact details rather than BOOL

shared => shared for all caches equal or below.
shared => sharedBelow.
sharedBelow #> shared.
ownedBelow => ownedBelow for all caches above.
ownedBelow => owned.
owned #> ownedBelow.

Caches reply to botWrites to NOT shared blocks.
Caches relay up botWrites to shared blocks.
Caches reply to topReads from owner blocks.
Caches relay down topReads from ownedBelow blocks.
Caches reply to botReads from NOT ownedBelow blocks.
Caches relay up botReads from NOT ownedBelow AND UnCached blocks.
Caches update from topWrites to NOT sharedBelow blocks.
Caches relay down topWrites to sharedBelow.

A victim block for which owner is set must be flushed.
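The use-bit victim pointer and the flush-on-owner rule above can be sketched in a few lines of Python. This is only one reading of the notes, not the hardware design: the names Line, VictimPointer, tick and take_victim are hypothetical, and "prevent movement for two cycles" is modelled simply as clearing the use bit without advancing the pointer.

```python
from dataclasses import dataclass


@dataclass
class Line:
    use: bool = False    # set whenever the processor references the line
    owner: bool = False  # set when this cache holds the owning copy


class VictimPointer:
    """Hypothetical model of the use-bit victim pointer (an assumption)."""

    def __init__(self, lines):
        self.lines = lines
        self.pos = 0

    def tick(self):
        """One cycle: IF use bit set THEN clear and advance ELSE hold."""
        line = self.lines[self.pos]
        if line.use:
            line.use = False
            self.pos = (self.pos + 1) % len(self.lines)

    def take_victim(self):
        """Hold the pointer, clear the use bit, and report the victim.

        Returns (index, needs_flush). needs_flush is True when owner is set,
        because a victim block for which owner is set must be flushed.
        """
        line = self.lines[self.pos]
        line.use = False  # the pointer does not move, so the bit stays clear
        return self.pos, line.owner


lines = [Line(use=True), Line(owner=True), Line(use=True)]
pointer = VictimPointer(lines)
pointer.tick()  # line 0 was in use: clear its bit and advance to line 1
pointer.tick()  # line 1 is not in use: hold
index, needs_flush = pointer.take_victim()
```

In this run the pointer settles on line 1, whose owner bit is set, so the caller would have to issue a FlushBlock before reusing the line.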
bottom  WSRqst  ~sharedAbove  =>  WSRply  bottom
bottom  WSRqst  sharedAbove   =>  WSRqst  top
top     WSRply  ownedBelow    =>  WSRply  bottom
top     RBRqst  owned         =>  RBRply  top
top     RBRqst  ownedBelow    =>  RBRqst  bottom

SnoopMatch[snoop, rAddr] should probably be enabled by ~myId
Snooper only really needed as FB enabler and RBRply shared and replyStale
Snooper initialized by RBRqst
Snooper.shared set if matching mem ref noticed: RB, WS, CWS, WB. (not FB)
Snooper.rplyStale set if matching mem Wt ref noticed: WS, CWS, WB.

Difficulty: When victimizing a block, you must pick some instant to look at the owner bit to decide whether to send a FBRqst. From that instant until the FBRqst actually appears on the bus, the bus must be monitored for WSRply, CWSRply, WB and, if seen, cancel the FB.

Header: (Cmd, ModeOrFault, ReplyShared, DeviceID, Address)

ReadBlock     0000  Cache to Memory or Cache
    RBRqst    2  Header; ValidVictim, VictimAddress
    RBRply    5  Header; CyclicOrderData
WriteBlock    0001  External to Memory and Cache
    WBRqst    5  Header; CyclicOrderData
    WBRply    2  Header; x
WriteSingle   0010  Cache to (only) Caches
    WSRqst    2  Header; Data
    WSRply    2  Header; Data
CWriteSingle  0011  xxx
    CWSRqst   2  Header; Old, New
    CWSRply   5  Header; Old, New; x; x; x
FlushBlock    0100  Cache to (only) Memory
    FBRqst    5  Header; CyclicOrderData
    FBRply    2  Header; x
KillBlock     ----  BigCache to SmlCaches
    KBRqst    5  Header; CyclicOrderData
    FBRply    2  Header; x
IORead        1000  Caches to IO
    IORRqst   2  Header; x
    IORRply   2  Header; Data
IOWrite       1001  Caches to IO
    IOWRqst   2  Header; Data
    IOWRply   2  Header; x
BIOWrite      1010  Caches to IO Type
    BIOWRqst  2  Header; Data
    BIOWRply  2  Header; x
Map           1110  SmlCache to MapCache
    MapRqst   2  VPage; AddrSpaceID
    MapRply   2  RPage, Flags; XXX
DeMap         1111  Clears VPValid in all Caches
    DMapRqst  2  RPage; x
    DMapRply  2  RPage; x

RBRqst    -  Header; ValidVictim, VictimAddress
RBRply    5  Header; CyclicOrderData
WBRqst    5  Header; CyclicOrderData
WBRply    -  Header; x
WSRqst    -  Header; Data
WSRply    -  Header; Data
CWSRqst   -  Header; Old, New
CWSRply   5  Header; Old, New; x; x; x
FBRqst    5  Header; CyclicOrderData
FBRply    -  Header; x
KBRqst    ?  Header; CyclicOrderData
FBRply    -  Header; x
IORRqst   -  Header; x
IORRply   -  Header; Data
IOWRqst   -  Header; Data
IOWRply   -  Header; x
BIOWRqst  -  Header; Data
BIOWRply  -  Header; x
MapRqst   -  VPage; AddrSpaceID
MapRply   -  RPage, Flags; x
DMapRqst  2  RPage; x
DMapRply  2  RPage; x

          Rqstr/Rplyr Lstnr
RBRqst    C C M
RBRply    M C
WBRqst    C C M
WBRply    M
WSRqst    C C
WSRply    T C
CWSRqst   C C
CWSRply   T C
FBRqst
FBRply
KBRqst
FBRply
MapRqst   C
MapRply   mc C
DMapRqst
DMapRply
IORRqst   IO IO
IORRply   IO T IO
IOWRqst   IO IO
IOWRply   IO T IO
BIOWRqst  IO IO
BIOWRply  T IO

Fetch or Store and miss => ReadBlock
Store and shared => WriteSingle
victim and owner => FlushBlock
RBRqst and ~self and owner => RBRply
waiting on RB and (WSRply, CWSRply, WBRqst) => RBRqst again (RplyStale)

For each block
    shared
    owner
For each outstanding request
    sharedAccumulator
    rplyStale
ReadBlockReply
For each block
    existsBelow
        bit for a block is set only if some small cache also has a copy of the block
        allows a big cache to filter packets that appear on the main bus

BigCache.tioga
Open
    BigCacheNotes.tioga
    DynaBusLogicalSpecifications.tioga
    DynaBusGuidelines.tioga

Time all referenced to Received Packets

Multi level
    Shared at top implies shared at bottom
    Owner at Bottom implies Owner at top

Two choices
    one line in the big cache for each small cache line
    xxx
    watch victims/ decode cache id/ keep track of who has what

Dynabus: 64 bits Per 25ns x 4/7 < 200 MBytes/Sec
Memory average 80ns/bit = 12 Mbits/Sec
8*200/12 = 128 bit bus to memory
32 4x256K rams => 2 MBytes
120 = 16 Bytes Per Mem access
A hit must be encoded into Ram Address
Ops 32 bit real address =>
Assume technology = .8 => 1/6 2 micron area
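
The bandwidth estimate above can be checked with a short calculation. This is a sketch under assumptions: the function names are hypothetical, the 4/7 factor is taken from the notes and read here as the fraction of bus cycles that carry data, and the final step picks the power of two nearest to 8*200/12, which matches the 128 bit figure the notes settle on.

```python
import math

BUS_BITS = 64          # DynaBus width, from the notes
CYCLE_NS = 25          # bus cycle time, from the notes
DATA_FRACTION = 4 / 7  # from the notes; assumed to be data cycles / total cycles


def dynabus_mbytes_per_sec():
    """Effective DynaBus data rate in MBytes/s."""
    bytes_per_cycle = BUS_BITS / 8
    peak = bytes_per_cycle * 1000 / CYCLE_NS  # bytes per ns, scaled to MBytes/s
    return peak * DATA_FRACTION


def ram_pin_mbits_per_sec(ns_per_bit=80):
    """Rate of one RAM data pin in Mbits/s at the given average access time."""
    return 1000 / ns_per_bit


def memory_bus_bits(mbytes_per_sec=200, pin_mbits_per_sec=12):
    """Data pins needed to sustain the target rate (the notes' 8*200/12)."""
    return 8 * mbytes_per_sec / pin_mbits_per_sec


effective = dynabus_mbytes_per_sec()   # about 182.9, below the 200 bound
pin_rate = ram_pin_mbits_per_sec()     # 12.5, which the notes round to 12
needed = memory_bus_bits()             # about 133.3
bus_width = 2 ** round(math.log2(needed))
```

Under these assumptions the effective rate stays under the 200 MBytes/Sec bound, and roughly 133 pins are needed, which rounds to a 128 bit memory bus.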