Map Cache Early 1988 1 Map Cache · Structure of Address Spaces · Background on Address mapping · Implementation and Performance · DynaBus Requests · Mapping algorithm · Block Diagram and details of blocks · Simulation · Future Implementations Structure of Address Spaces · Identity map (aid=-1) Useful for startup and IO, when caches are empty · Bypass Area vp rp is computed, no miss possible Access to map table and fault-handling code · Shared Area and Switched Area vp if shared, use aid=0 kernel area; communication between address spaces Terminology · Address space id (aid): 16 bits (4 bits on Sun4) · A Page is 4 Kbytes Unit of mapping and protection · Virtual Page (vp): 22 bits · Real Page (rp): 22 bits · Flags: KWtEnable: kernel write permission UWtEnable: user write permission URdEnable: user read permission Dirty: page has been written · Map Table: resides in main memory hash table: (aid, vp) -> (rp, flags) only entries for pages in main memory (< 0.2%) structure known only to software Address Mapping · Small cache hit: va -> ra · Small cache has a word in same virtual page partial match on vp -> rp -> ra · Small cache miss: map request (1 to 2% of ref.) · Map hit: return rp · Map miss: map cache fault -> trap faulting proc. handles trap (idle) trap handler uses map table in memory, writes entry to map cache retries faulting instruction · Map Cache is only an accelerator 1 entry would suffice (50% slower system) Circuit Implementation · One man-month! It is a Prototype: simplicity first Must be a no-trouble circuit · Direct map, 256 entries Pure cache (no updating in place) Heavy-handed flush Address is hash function on (vp, aid) · It is only an accelerator A single-entry map cache would work Support for multiple map caches Replacement by software DynaBus Requests Recognizes Map Request, IORead, IOWrite, BIOWrite · Map [aid, vp] -> [rp, flags] Map Request MapFault error if entry not present · ReadEntry [vp] -> [rp, flags]; implicit aid IORead DynaBusOtherFault error if entry not present · WriteEntry [vp, rp, flags, valid]; implicit aid IOWrite, BIOWrite used to write / flush entry IOAccessFault error if not in kernel mode · ReadRegister [number] -> [value] IORead · WriteRegister [number, value] IOWrite, BIOWrite IOAccessFault error if not in kernel mode Mapping algorithm WriteEntry: no write if aid=-1 or Bypass Area. Map and ReadEntry: · If aid=-1, identity mapping: rp _ vp; flags _ constant · Else If vp in Bypass Area, compute rp: rp _ (BypassBase flags _ constant · Else If vp in Shared Area, use aid=0: [rp, flags] _ LookUp [vp, aid=0] · Else simple LookUp: [rp, flags] _ LookUp [vp, aid] · Constant flags are Dirty, KWtEnable, ~UWtEnable, ~URdEnable Internal Registers 8 registers; all are 22-bit wide except AID · AID (16 bits) Implicit aid for ReadEntry and WriteEntry · SharedPattern, SharedMask Specify the shared area · BypassPattern, BypassMask, BypassBase Specify the bypass area and the mapping · SubSetMask, SubSetPattern Specify which subset of vp is recognized Allows use of multiple Map Caches (uneven split of vp space is OK) RAMs Conservative dual-ported static RAM Completely separate read and write logic · Input FIFO 16 words of 78 bits Overflow => requestor times out Simulations show that FIFO is usually empty Replies at Cache Reply priority (highest) · Map Cache Table 256 words of 57 bits 22(vp)+22(rp)+16(aid)+4(flags)+1(valid)8(h) address = fold and XOR (aid, vp) maps 1M bytes Layout · Description Parametrized schematics Libraries, FSM generator · Assembly 2700 Standard Cells (4 hr placement on a Sun4) Standard RAMs: 16x78 and 2x128x58 Standard pad frame: 13.6 mm by 13.4 mm Total layout time: 8hr · Layout 2 170K transistors in a 100 mm2 core Performance · Cycle time Designed for 40 MHz FIFO decouples Table from bus Command processing block is combinatorial; speed affects only latency, not cycle time · Latency: 10 cycles minimum pin to pin 1 cycle: latch DynaBus data 2 cycles: header decoding 3 cycles: access table and form result · Arbiter latency is currently dominant Simulation · Blocks Individually simulated with test vectors · Transaction level Circuit at transistor, gate, or macro level Instruction-level model of circuit (program) Smart Comparator takes care of timing Model of Arbiter and random requestors · System simulation To be done soon Future implementations · Larger table 512 entries possible right now ~300K transistors: yield problem ? · Better Technology 2K entries possible in 1 4 Map Caches would cover 32 Mbytes · Set associative No major gain A lot of side effects: tags, reset, timing ... · Fully-associative (Error 33 ?) Borrow cells from small cache Implement a selective flush · External memory Trivial change in one FSM No problem with timing or pin count