Map Cache
· Structure of Address Spaces
· Background on Address mapping
· Implementation and Performance
· DynaBus Requests
· Mapping algorithm
· Block Diagram and details of blocks
· Simulation
· Future Implementations
Structure of Address Spaces
· Identity map (aid=-1)
Useful for startup and IO, when caches are empty
· Bypass Area
vp'BypassMask=BypassPattern'BypassMask
rp is computed, no miss possible
Access to map table and fault-handling code
· Shared Area and Switched Area
vp'SharedMask=SharedPattern'SharedMask
if shared, use aid=0
kernel area; communication between address spaces
Terminology
· Address space id (aid): 16 bits (4 bits on Sun4)
· A Page is 4 Kbytes
Unit of mapping and protection
· Virtual Page (vp): 22 bits
· Real Page (rp): 22 bits
· Flags:
KWtEnable: kernel write permission
UWtEnable: user write permission
URdEnable: user read permission
Dirty: page has been written
· Map Table:
resides in main memory
hash table: (aid, vp) -> (rp, flags)
only entries for pages in main memory (< 0.2%)
structure known only to software
Address Mapping
· Small cache hit: va -> ra
· Small cache has a word in same virtual page
partial match on vp -> rp -> ra
· Small cache miss: map request (1 to 2% of ref.)
· Map hit: return rp
· Map miss:
map cache fault -> trap
faulting proc. handles trap (idle)
trap handler uses map table in memory,
writes entry to map cache
retries faulting instruction
· Map Cache is only an accelerator
1 entry would suffice (50% slower system)
Circuit Implementation
· One man-month!
It is a Prototype: simplicity first
Must be a no-trouble circuit
· Direct map, 256 entries
Pure cache (no updating in place)
Heavy-handed flush
Address is hash function on (vp, aid)
· It is only an accelerator
A single-entry map cache would work
Support for multiple map caches
Replacement by software
DynaBus Requests
Recognizes Map Request, IORead, IOWrite, BIOWrite
· Map [aid, vp] -> [rp, flags]
Map Request
MapFault error if entry not present
· ReadEntry [vp] -> [rp, flags]; implicit aid
IORead
DynaBusOtherFault error if entry not present
· WriteEntry [vp, rp, flags, valid]; implicit aid
IOWrite, BIOWrite
used to write / flush entry
IOAccessFault error if not in kernel mode
· ReadRegister [number] -> [value]
IORead
· WriteRegister [number, value]
IOWrite, BIOWrite
IOAccessFault error if not in kernel mode
Mapping algorithm
WriteEntry: no write if aid=-1 or Bypass Area.
Map and ReadEntry:
· If aid=-1, identity mapping:
rp ← vp; flags ← constant
· Else If vp in Bypass Area, compute rp:
rp ← (BypassBase'BypassMask) V (vp'~BypassMask)
flags ← constant
· Else If vp in Shared Area, use aid=0:
[rp, flags] ← LookUp [vp, aid=0]
· Else simple LookUp:
[rp, flags] ← LookUp [vp, aid]
· Constant flags are
Dirty, KWtEnable, ~UWtEnable, ~URdEnable
Internal Registers
8 registers; all are 22-bit wide except AID
· AID (16 bits)
Implicit aid for ReadEntry and WriteEntry
· SharedPattern
, SharedMask
Specify the shared area
· BypassPattern
, BypassMask
, BypassBase
Specify the bypass area and the mapping
· SubSetMask
, SubSetPattern
Specify which subset of vp is recognized
Allows use of multiple Map Caches
(uneven split of vp space is OK)
RAMs
Conservative dual-ported static RAM
Completely separate read and write logic
· Input FIFO
16 words of 78 bits
Overflow => requestor times out
Simulations show that FIFO is usually empty
Replies at Cache Reply priority (highest)
· Map Cache Table
256 words of 57 bits
22(vp)+22(rp)+16(aid)+4(flags)+1(valid)—8(h)
address = fold and XOR (aid, vp)
maps 1M bytes
Performance
· Cycle time
Designed for 40 MHz
FIFO decouples Table from bus
Command processing block is combinatorial; speed affects only latency, not cycle time
· Latency: 10 cycles minimum pin to pin
1 cycle: latch DynaBus data
2 cycles: header decoding
e 3 cycles: write and read FIFO
3 cycles: access table and form result
e 1 cycle: drive DynaBus if Grant
· Arbiter latency is currently dominant
Simulation
· Blocks
Individually simulated with test vectors
· Transaction level
Circuit at transistor, gate, or macro level
Instruction-level model of circuit (program)
Smart Comparator takes care of timing
Model of Arbiter and random requestors
· System simulation
To be done soon
Future implementations
· Larger table
512 entries possible right now
~300K transistors: yield problem ?
· Better Technology
2K entries possible in 1m
4 Map Caches would cover 32 Mbytes
· Set associative
No major gain
A lot of side effects: tags, reset, timing ...
· Fully-associative (Error 33 ?
)
Borrow cells from small cache
Implement a selective flush
· External memory
Trivial change in one FSM
No problem with timing or pin count