Map Cache Early 1988 1 Map Cache · Structure of Address Spaces · Background on Address mapping · Implementation and Performance · DynaBus Requests · Mapping algorithm · Block Diagram and details of blocks · Simulation · Future Implementations Structure of Address Spaces · Identity map (aid=-1) Useful for startup and IO, when caches are empty · Bypass Area vp&BypassMask=BypassPattern&BypassMask rp is computed, no miss possible Access to map table and fault-handling code · Shared Area and Switched Area vp&SharedMask=SharedPattern&SharedMask if shared, use aid=0 kernel area; communication between address spaces Terminology · Address space id (aid): 16 bits (4 bits on Sun4) · A Page is 4 Kbytes Unit of mapping and protection · Virtual Page (vp): 22 bits · Real Page (rp): 22 bits · Flags: KWtEnable: kernel write permission UWtEnable: user write permission URdEnable: user read permission Dirty: page has been written · Map Table: resides in main memory hash table: (aid, vp) -> (rp, flags) only entries for pages in main memory (< 0.2%) structure known only to software Address Mapping · Small cache hit: va -> ra · Small cache has a word in same virtual page partial match on vp -> rp -> ra · Small cache miss: map request (1 to 2% of ref.) · Map hit: return rp · Map miss: map cache fault -> trap faulting proc. handles trap (idle) trap handler uses map table in memory, writes entry to map cache retries faulting instruction · Map Cache is only an accelerator 1 entry would suffice (50% slower system) Circuit Implementation · One man-month! It is a Prototype: simplicity first Must be a no-trouble circuit · Direct map, 256 entries Pure cache (no updating in place) Heavy-handed flush Address is hash function on (vp, aid) · It is only an accelerator A single-entry map cache would work Support for multiple map caches Replacement by software DynaBus Requests Recognizes Map Request, IORead, IOWrite, BIOWrite · Map [aid, vp] -> [rp, flags] Map Request MapFault error if entry not present · ReadEntry [vp] -> [rp, flags]; implicit aid IORead DynaBusOtherFault error if entry not present · WriteEntry [vp, rp, flags, valid]; implicit aid IOWrite, BIOWrite used to write / flush entry IOAccessFault error if not in kernel mode · ReadRegister [number] -> [value] IORead · WriteRegister [number, value] IOWrite, BIOWrite IOAccessFault error if not in kernel mode Mapping algorithm WriteEntry: no write if aid=-1 or Bypass Area. Map and ReadEntry: · If aid=-1, identity mapping: rp _ vp; flags _ constant · Else If vp in Bypass Area, compute rp: rp _ (BypassBase&BypassMask) V (vp&~BypassMask) flags _ constant · Else If vp in Shared Area, use aid=0: [rp, flags] _ LookUp [vp, aid=0] · Else simple LookUp: [rp, flags] _ LookUp [vp, aid] · Constant flags are Dirty, KWtEnable, ~UWtEnable, ~URdEnable Internal Registers 8 registers; all are 22-bit wide except AID · AID (16 bits) Implicit aid for ReadEntry and WriteEntry · SharedPattern, SharedMask Specify the shared area · BypassPattern, BypassMask, BypassBase Specify the bypass area and the mapping · SubSetMask, SubSetPattern Specify which subset of vp is recognized Allows use of multiple Map Caches (uneven split of vp space is OK) RAMs Conservative dual-ported static RAM Completely separate read and write logic · Input FIFO 16 words of 78 bits Overflow => requestor times out Simulations show that FIFO is usually empty Replies at Cache Reply priority (highest) · Map Cache Table 256 words of 57 bits 22(vp)+22(rp)+16(aid)+4(flags)+1(valid)8(h) address = fold and XOR (aid, vp) maps 1M bytes Layout · Description Parametrized schematics Libraries, FSM generator · Assembly 2700 Standard Cells (4 hr placement on a Sun4) Standard RAMs: 16x78 and 2x128x58 Standard pad frame: 13.6 mm by 13.4 mm Total layout time: 8hr · Layout 2m CMOS 170K transistors in a 100 mm2 core Performance · Cycle time Designed for 40 MHz FIFO decouples Table from bus Command processing block is combinatorial; speed affects only latency, not cycle time · Latency: 10 cycles minimum pin to pin 1 cycle: latch DynaBus data 2 cycles: header decoding > 3 cycles: write and read FIFO 3 cycles: access table and form result > 1 cycle: drive DynaBus if Grant · Arbiter latency is currently dominant Simulation · Blocks Individually simulated with test vectors · Transaction level Circuit at transistor, gate, or macro level Instruction-level model of circuit (program) Smart Comparator takes care of timing Model of Arbiter and random requestors · System simulation To be done soon Future implementations · Larger table 512 entries possible right now ~300K transistors: yield problem ? · Better Technology 2K entries possible in 1m 4 Map Caches would cover 32 Mbytes · Set associative No major gain A lot of side effects: tags, reset, timing ... · Fully-associative (Error 33 ?) Borrow cells from small cache Implement a selective flush · External memory Trivial change in one FSM No problem with timing or pin count Κ8–"slides" style–#PressFonts 1000.0 firstVisibleFolio˜Iunleaded•Mark insideFooteršΡdis ˜ K– outsideFooterš ˜ Iblock– centerFooter–e"XeroxLogotypes" family 48 pt size 48 pt leading 48 pt topLeading 40 pt topIndent 24 pt bottomLeading˜title˜ IraggedšΟbœ˜Nšžœ˜Nšžœ˜ Nšžœ˜Nšžœ˜Nšžœ#˜%Nšžœ ˜ Nšžœ˜—˜šž˜N˜0—šž ˜ NšœΟmœŸœ ˜&N˜ N˜+—šž˜NšœŸœŸœ ˜&Nšœ˜N˜1——˜ Nšžœ1˜3šžœ˜N˜—Nšžœ˜Nšžœ˜šžœ˜Nšœ"˜"Nšœ ˜ Nšœ˜Nšœ˜—šžœ ˜ N˜N˜$N˜.N˜ ——šœ˜Nšžœ ˜šž-˜-N˜ —Nšž2˜2Nšž˜šž ˜ N˜N˜"N˜&N˜N˜—N˜šž"˜"N˜)——˜šž˜N˜#N˜—šž˜Nšœ!˜!Nšœ˜N˜%—šž˜N˜#N˜N˜——˜Nšœ1˜1šž˜N˜ NšΟiœ˜#—šž-˜-N˜Nš œ˜,—šž1˜1N˜N˜Nš  œ˜)—šž"˜"Nšœ˜—šž˜N˜Nš  œ˜)——˜Nšž.˜.Nšž˜šž˜Nšœ˜—šž(˜(NšœŸœŸœ ˜/Nšœ˜—šž'˜'Nšœ ˜ —šž˜Nšœ˜N˜—šž˜Nšœ(˜(——˜Nšž+˜+šž˜Nšœ)˜)—šžœ žœ ˜Nšœ˜—šžœ žœ žœ ˜'Nšœ'˜'—šžœ žœ ˜N˜(N˜"N˜ ——˜Nšž#˜#Nšž(˜(šž ˜ N˜N˜N˜+N˜)—šž˜Nšœ˜Nšœ,˜,N˜ N˜ ——˜šž ˜ Nšœ˜N˜—šž ˜ N˜.N˜!N˜&N˜—šž˜NšœΟgœΟs˜NšœΟuœ˜#——šœ ˜ šž ˜ N˜N˜N˜U—šž'˜'N˜N˜NšŸœ˜N˜&NšŸœ!˜"—Nšž'˜'—˜ šž˜N˜(—šž˜N˜+N˜,N˜%N˜&—šž˜N˜——˜šž˜N˜N˜"—šž˜Nšœ‘˜N˜"—šž˜N˜ N˜.—šžœ ž˜ N˜N˜—šž˜N˜N˜#———…—h¦