CacheSpecs.tioga
Last Edited by: Barth, May 9, 1985 12:26:23 pm PDT
DRAGON CACHE SPECIFICATIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
The Dragon Cache
Description and Specifications
Release as [Indigo]<Dragon>Documentation>CacheSpecs.press

© Copyright 1984 Xerox Corporation. All rights reserved.
Abstract: This memo describes the Dragon cache. It is intended to be used both as a convenient source for information about the cache and as a reference manual for cache specifications.
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
1. Introduction
2. Architectural Considerations
3. Pin Description
4. Functional Description
1. Introduction
The cache serves two purposes. First, it reduces traffic on the M-Bus, since most processor requests can be satisfied from the cache. Second, it caches mapping information in addition to data (although the mechanism is different).
2. Architectural Considerations
2.1 Cache Size
It has become clear from recent simulations that unless the amount of cached data per processor is increased substantially, we will not be able to support more than three or four processors on a single M-Bus. One way to get more data is simply to use as many cache chips per processor as are required. This is unacceptable because it requires too much board area per processor and also increases the number of M-Bus loads to the point where bus timing requirements are difficult to meet. A more workable solution is a two-level cache: the first-level cache would be a custom chip with about the same amount of data as the old cache, while the second-level cache would consist of a custom chip and several off-the-shelf memory chips; it would be slower than the first-level cache but contain much more data.
There is quite a variety of possible two-level schemes, but most of them are either too hard to implement, throw away too much work already completed, or do not provide reasonable intermediate implementations. The one we've settled on seems to have none of these disadvantages while providing all of the performance advantages we're after.
2.2 The Two-Level Cache Architecture
The two-level cache consists of three components: the first-level cache VC (for virtual cache), the second-level cache RC (for real cache), and 8-16K words of memory Mem. VC and RC are both custom chips, while Mem consists of four or eight 8Kx8 off-the-shelf memory chips.
The VC's connected to a processor sit on an internal bus called the MI-Bus (for M-Bus, Internal) whose protocol is a strict superset of the M-Bus's. VC is virtually identical to the old cache: its front-end connects to the processor and traffics in reads and writes of virtual addresses; its back-end connects to the MI-Bus and traffics in real addresses using the usual M-Bus transactions. VC's are designed so that it is also possible to plug them directly into the M-Bus and have them work (indeed this is essential, since it allows Dragons to be brought up without putting RC in the critical path).
The front-end of RC is connected to the MI-Bus, and its back end to the main M-Bus. Mem connects directly to RC from the side. We will have to revisit this decision if pin count becomes a problem for RC; Mem may then have to be moved to the MI-Bus. There is only one RC per MI-Bus and the arbiter for the MI-Bus sits inside RC. The arbiter for the M-Bus stays the same.
2.3 Design Decisions
->
There are two types of transactions on the MI-Bus: those for which VC knows it will need to get the M-Bus (WS, IOR, IOW, IORF, IOWF, MR, MRD) and those for which it cannot be sure that the M-Bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions VC always gets the M-Bus before getting the MI-Bus. For plocal transactions it asks for the MI-Bus alone.
The motivation for this distinction is that only plocal transactions need to be made restartable in order to avoid deadlock over bus allocation. Global transactions need not be restarted since the M-Bus is always acquired first.
->
Each VC has two request wires running to the arbiter inside RC: Rq and GRq (for global request). For plocal transactions VC asserts Rq alone; for global transactions it asserts both Rq and GRq. A GRq request from VC flows through RC and comes out on the M-Bus as Rq in one phase.
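As a rough illustration (not part of the specification), the request-wire rule can be written down as follows. The transaction names come from the list above; the type and function names are invented here.

/* Sketch of the VC arbitration-request rule. */
typedef enum {WS, IOR, IOW, IORF, IOWF, MR, MRD, RQ, WQ} Transaction;

typedef struct {int rq, gRq;} RequestWires;  /* wires from a VC to the arbiter in RC */

/* Global transactions assert both Rq and GRq, so the M-Bus is acquired before the
   MI-Bus; plocal transactions (RQ, WQ) assert Rq alone and must be restartable. */
RequestWires RequestFor(Transaction t) {
  RequestWires w;
  w.rq  = 1;
  w.gRq = (t != RQ && t != WQ);
  return w;
}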
->
Within RC, requests from the M-Bus are given priority over requests from the MI-Bus. That is, if an MI-Bus transaction has started and an M-Bus transaction arrives within RC then the MI-Bus transaction is forced to restart.
->
Restarts on MI are signalled by "grant yanking"; i.e., the arbiter in RC deasserts Gnt prematurely to signal that the current transaction must be restarted. The semantics of restart for the plocal transactions are given below.
->
RQ restart. For a processor read we can let the processor go ahead as soon as the requested word is fetched by VC. For a processor write we must hold up the processor until the RQ is complete. The cost of this should be small, since the number of writes that cause RQ's should be small: a data miss happens around 15% of the time, and a write reference happens about once every 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit.
->
WQ restart. Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by flush or launder the processor is not involved at all, so the assertion is true. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway, since the WQ is part of a WQ, RQ sequence.
Inside VC don't clear master until after commit.
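A minimal C sketch of the two commit rules above; the bit names (vpValid, rpValid, quadValid, master) follow the text, and everything else is invented for illustration.

#include <stdbool.h>

/* Per-line state in VC relevant to restartable transactions. */
typedef struct {
  bool vpValid, rpValid;  /* virtual-page and real-page valid */
  bool quadValid;         /* the quad's data is valid */
  bool master;            /* this VC holds the current copy */
} VCLine;

/* RQ: the valid bits are set only once the MI transaction has committed, so a
   restarted RQ leaves the existing line (or victim) unchanged. */
void CommitRQ(VCLine *line, bool intoVictim) {
  if (intoVictim) { line->vpValid = true; line->rpValid = true; }
  line->quadValid = true;
}

/* WQ: master is cleared only after commit, so a restarted WQ still leaves VC
   marked as holding the current copy. */
void CommitWQ(VCLine *line) {
  line->master = false;
}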
->
To keep RC as simple as possible, make it fully associative. Back-of-the-envelope calculations show that we can get over 600 associaters. With 16K words in Mem, this means that each associater has control over 4 or 8 quads. Note that it's a good idea to use a power-of-two number of associaters, since otherwise space in Mem will be wasted.
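A quick check of that arithmetic, assuming a quad is four words (an assumption of this sketch, not stated in this memo):

#include <stdio.h>

/* Back-of-the-envelope sizing of the RC associative array.  Assumes a quad is
   4 words; the 600-associater figure is the rough area estimate from the text. */
int main(void) {
  int wordsPerQuad = 4;                       /* assumed quad size */
  int memWords[2]  = {8 * 1024, 16 * 1024};   /* the 8-16K word range for Mem */
  for (int i = 0; i < 2; i++) {
    int quads = memWords[i] / wordsPerQuad;             /* 2048 or 4096 quads */
    int quadsPerAssociater = (quads + 599) / 600;       /* 4 or 7 before rounding */
    int p = 1; while (p < quadsPerAssociater) p *= 2;   /* round up to 4 or 8 */
    printf("%dK words: %d associaters x %d quads\n",
           memWords[i] / 1024, quads / p, p);           /* 512 associaters either way */
  }
  return 0;
}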
->
There will need to be a one-cycle delay between M and MI in RC to provide enough time to acquire MI when a request comes in from M. This can be seen by considering a WS coming in from M: WS on M; match; yank Gnt; WS on MI.
The extra cycle is needed since MI becomes free only in the cycle following the yank. To keep things simple we should always insert the extra cycle, regardless of whether MI is busy or not.
->
For each quad, RC keeps the following control bits:
master: when TRUE, RC or one of the VC's connected to it has the current copy of the quad
shared: when TRUE there may be a copy of this quad in other RC's; when FALSE no other RC contains a copy
inVC: when TRUE there may be a copy of this quad in some VC connected to this RC's MI-Bus; when FALSE none of the VC's on this MI-Bus contains this quad. This bit is needed to prevent RQ's, WQ's and WS's on the M-Bus from penetrating the MI-Bus when not absolutely necessary. Note that like shared, this bit is a hint.
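Collected in a C sketch (the bit names are from the text; the forwarding test is the one motivated above for inVC):

#include <stdbool.h>

/* Per-quad control bits kept by RC.  shared and inVC are hints: they may be
   TRUE even when no copy actually exists elsewhere, but never falsely FALSE. */
typedef struct {
  bool master;   /* RC or one of its VC's has the current copy */
  bool shared;   /* another RC may have a copy */
  bool inVC;     /* some VC on this RC's MI-Bus may have a copy */
} RCQuadState;

/* An RQ, WQ, or WS arriving on the M-Bus needs to penetrate onto the MI-Bus
   only if the quad might be held by a VC there. */
bool MustForwardToMI(const RCQuadState *q) {
  return q->inVC;
}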
->
The transition of a quad in some VC from not shared to shared is tricky. The way it will work is as follows: when RC hands a quad to VC in response to an RQ, VC sets the shared bit for this quad. Now the first time VC's processor writes into this quad, VC puts out a WS. This WS will go out on the M-Bus (this is not necessary but doesn't hurt; it is done to avoid having WS's that can stay within RC and therefore have to be restartable). In the meantime, RC asserts the shared line on MI if the shared bit for this quad in RC is set. If none of the other VC's on MI have a copy of this quad, the VC that does have a copy will clear its shared bit. Also, RC must clear the master bit for the quad to signal that it may no longer have the current copy from this point on (since VC will write into the quad without telling RC).
->
Victim selection in RC will be interesting given that every quad in VC must also be in RC. Consider the case where all of the quads in RC have inVC set. This could happen when a large amount of data is being read sequentially by the processor. Clearly, before RC can kick out a quad it has to check whether it's in VC or not. It'd be much better if RC victimized only those quads that don't exist in any VC. One way is for RC to do an RQ on MI every time it tries to victimize a quad with inVC set. If someone pulls shared, RC knows not to kick this quad out. Note that the RQ on MI can be aborted as soon as RC has its answer, so it needn't take up too much bandwidth on MI. Hmmm. It looks like having most inVC bits set is the common case! Consider read-only data and code, for example. Every time such a quad gets into a VC its inVC bit gets set; this bit won't get cleared when the host VC victimizes this quad, since the quad will be clean.
Now, this suggests a solution to our dilemma. Suppose that every time a VC did an RQ, it put the address of the victim quad into one of the two idle cycles of the RQ. All of the VC's on MI would match on the address of the target of the RQ and then on the address of the victim. If any of them matches on the victim address it pulls down the shared line (or a new dedicated line called "nInVC"). RC would watch this line and set the inVC bit for the quad being victimized by the RQ according to the state of nInVC. Note that with this addition the inVC bit would no longer be a "hint".
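A sketch of what victim selection in RC might then look like. The probe function is a stand-in for the RQ (or the victim-address match) described above, and all of the names here are invented.

#include <stdbool.h>

typedef struct { bool valid, master, shared, inVC; } RCQuad;

/* Stand-in for probing the MI-Bus: returns TRUE if some VC still holds the quad
   (i.e. it pulls the shared or nInVC line). */
extern bool QuadStillInSomeVC(int quadIndex);

int SelectVictim(RCQuad quads[], int n) {
  /* Prefer a quad that no VC claims to hold. */
  for (int i = 0; i < n; i++)
    if (quads[i].valid && !quads[i].inVC) return i;
  /* Otherwise probe: an inVC bit that is only a stale hint can be cleared. */
  for (int i = 0; i < n; i++) {
    if (!quads[i].valid) continue;
    if (!QuadStillInSomeVC(i)) { quads[i].inVC = false; return i; }
  }
  return -1;  /* every quad really is in some VC */
}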
->
It appears to be a really bad idea to put more than one processor on an MI. The reason is that the latency seen by each processor depends quite strongly on the load on MI (since the miss rate from the VC's is quite high). We'd therefore like the MI's to be as lightly loaded as possible, which of course means one processor per MI.
2.4 Consistency Invariants
M-Bus
> If a quad is present in more than one RC all of its copies are identical.
> If a quad is present in more than one RC all copies have shared set.
> At most one RC copy of a quad may have master set.
MI-Bus
> If a quad is present in more than one VC all of its copies are identical.
> If a quad is present in more than one VC all copies have shared set.
> At most one VC copy of a quad may have master set.
M-Bus/MI-Bus
> Every quad in a VC is also present in the corresponding RC.
> If a quad is in VC then its RC copy must have inVC set.
> If a quad in RC has shared set then any VC copies must have shared set.
> If a quad in RC has shared set then any VC copies must be identical.
2.5 Control Bit Transition Rules
RC Control Bits
The transition rules for the control bits in RC are as follows:
master:  set on WS from MI
  clear on WS from M
inVC:  set on RQ from MI
  clear on {WS, RQ} to MI for which shared line on MI is deasserted
  (assuming victim addr issued on RQ: clear on RQ according to the state
  of the nInVC line on MI)
shared: set on {WS, RQ} on M for which shared line on M is asserted
  clear on {WS, RQ} on M for which shared line on M is deasserted
VC Control Bits
The transition rules for the control bits in VC are as follows:
master: set on write from processor
  clear on WS from MI
shared: set on {WS, RQ} on MI for which shared line on MI is asserted
  clear on {WS, RQ} on MI for which shared line on MI is deasserted
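For reference, the same transition rules written as a C sketch (bit names from the text; the event functions are invented):

#include <stdbool.h>

typedef struct { bool master, shared, inVC; } RCBits;
typedef struct { bool master, shared; } VCBits;

/* RC transitions.  sharedLine is the state of the shared (or nInVC) line
   observed during the transaction. */
void RCWSFromMI(RCBits *q)                     { q->master = true; }
void RCWSFromM(RCBits *q)                      { q->master = false; }
void RCRQFromMI(RCBits *q)                     { q->inVC = true; }
void RCWSorRQToMI(RCBits *q, bool sharedLine)  { if (!sharedLine) q->inVC = false; }
void RCWSorRQOnM(RCBits *q, bool sharedLine)   { q->shared = sharedLine; }

/* VC transitions. */
void VCProcessorWrite(VCBits *q)               { q->master = true; }
void VCWSFromMI(VCBits *q)                     { q->master = false; }
void VCWSorRQOnMI(VCBits *q, bool sharedLine)  { q->shared = sharedLine; }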
2.6 Statistics Needed
We need to get the following two statistics:
(a) How often do processor writes result in VC misses? This is needed to verify that the three cycles of latency added when a write misses in VC are not a performance problem.
(b) How often does a processor write into a clean quad in VC? This is needed to verify that the spurious WS generated by VC in communicating to RC that a quad has been written into is not a performance problem.
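Once measured, statistic (a) slots directly into the latency estimate from section 2.3. A trivial check of that arithmetic, using the current guesses for the miss rate and write frequency:

#include <stdio.h>

/* Write-miss overhead estimate from section 2.3: three extra cycles per write
   that misses in VC.  Statistic (a) would replace missRate with a measured value. */
int main(void) {
  double missRate       = 0.15;        /* fraction of references that miss */
  double writesPerCycle = 1.0 / 21.0;  /* one write reference every 21 cycles */
  double extraCycles    = 3.0;
  double overhead = extraCycles * missRate * writesPerCycle;
  printf("added cost: %.1f%% of all cycles\n", overhead * 100.0);  /* prints about 2.1 */
  return 0;
}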
2.7 Things To Think About
line size
number of associators
flushing and laundering, filter on display addresses?, programmed filter? command programmed? 32 bit filter, base and bounds, IOWrite base, IOWrite bounds, IOWrite start
i/o commands
atomicity
dual vs. single port RAM
foreign processor support, save pins to indicate bus type so that 68020 and 80x86 compatible bus interfaces can be put into the cache.
local arbitration
multiple cache P buses with and without multiple M buses, how many ways of interleaving are enough?
How is the processor number initialized during reset so that minimal systems can be booted without the baseboard?
Should there be programmable page size? Over what range? Could implement it by running enable and enable' lines down through the low order bits of page so that the cells can drive the page or block match line low.
Why not a standard component implementation?
When the M controller sees its own command it must return a bus fault to the processor and release the bus.
How can we have two virtual addresses for a single real address at the same time in one cache?
3. Pin Description
Most of the cache pins are devoted to the M, P, and D buses. The corresponding bus specifications should be consulted for descriptions of those pins.
Cache Specific Pins
MDaisyIn
This pin ???
MDaisyOut
This pin ???
BusParitySelect
This pin controls a multiplexor that selects PParity as the source for store parity when the pin is high and selects internal parity computation logic when the pin is low.
WriteEnable[0..4)
These pins control the byte write enables within the cache. They allow the P bus to be extended to byte-level addressing.
AddressMask[0..4)
These pins initialize the low order page address mask bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.
AddressMatch[0..4)
These pins initialize the low order page address match bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.
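The memo defers the details of multiple-cache P-bus systems to section 4, so the following is only one plausible reading of how the mask and match pins divide the address space: each cache responds to a P-bus reference only when the low-order page-address bits agree with its match value under its mask. All names here are invented.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t mask, match; } AddressDivision;  /* loaded from the pins at reset */

/* A cache claims a reference when the low-order page-address bits, under its
   4-bit mask, equal its match value.  Caches with disjoint match values can
   thus interleave a single P bus. */
bool CacheRespondsTo(const AddressDivision *d, uint32_t lowPageBits) {
  return (lowPageBits & d->mask) == (d->match & d->mask);
}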
Package Pin Assignment
This section describes the reasoning behind the cache pin assignment shown in Figure 2. The initial assumption is that the control logic is all at the top of the array and that the data buses run across the top of the chip and then narrow down the two sides, the M bus on the right and the P bus on the left, meeting in the middle of the bottom of the chip. The initial pin count constraints are as follows:
Power   12
Clocks   4
P    42
M    43
D    6
Reset   1
Byte Enables  4
Parity Select  1
Address Division 8
Daisy chain  2
This requires a total of 123 pins out of the 144 available. The power pins are the obvious place to use some of the remaining 21 pins. One would like to minimize the internal wiring length from the closest power pads to any point on the chip. The added power pins should also be chosen from those with the lowest inductance in the package. The initial power pin count was chosen simply to use up all the low inductance pins in the package. They were then arbitrarily divided up as follows:
PadVdd   4
PadGnd   4
LogicVdd   2
LogicGnd   2
The pad power pins were assigned to the lowest inductance pins. The logic power pins were assigned to the next lowest inductance pins. The arrays in the logic are likely to use quite a bit of current during their precharge and discharge cycles, so let's add 2 pins to each of the logic supplies, leaving 17 pins. The M bus has between about 5 and 20 times as much capacitance as the P bus, but it runs at half the rate, so it should get between 2 and 10 times more pins. Arbitrarily choosing 2 as the multiplier and 8 pins for the M bus means 4 additional pins for the P bus, leaving 5 pins uncommitted. They will be left uncommitted so that if any extra functionality is required in the controller there are spare pins to implement it; e.g., a mechanism for determining when a cache flush or launder process is finished may require a pin or so. The M bus has about 39 pins that all swing at the same time, so 39/8 means a power pin about every 5 pins, i.e., a PadVdd/PadGnd pair about every 10 pins. They should also be constrained to be on the inner row of pins in the package. For the M bus this means pins 60, 66, 69, 77, 81, 85, 96 and 102 are the obvious choices. The P bus has 37 pins, so every 9 pins should have a power or ground. The possible pins in the range of 20 through 53 are 24, 30, 41, 45, and 49. Eliminating 49 leaves 4 pins.
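A quick tally of the budget above:

#include <stdio.h>

/* Pin budget for the 144-pin package, as allocated in the text. */
int main(void) {
  int base = 12 + 4 + 42 + 43 + 6 + 1 + 4 + 1 + 8 + 2;  /* 123 pins: power through daisy chain */
  int extraLogicPower = 4;   /* 2 more pins each for LogicVdd and LogicGnd */
  int extraMBusPower  = 8;   /* PadVdd/PadGnd pairs along the M bus */
  int extraPBusPower  = 4;   /* power and ground along the P bus */
  int committed = base + extraLogicPower + extraMBusPower + extraPBusPower;
  printf("committed %d of 144, %d spare\n", committed, 144 - committed);  /* 139 and 5 */
  return 0;
}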
To Do
A little more engineering needs to be applied to the pin assignment after the capacitance and path delay of the M bus and P bus are determined more tightly, and after the peak current requirements of the internal logic of the cache are better known.
4. Functional Description
This section describes the internal state of the cache and describes how the buses interact with each other and the state.
CacheBlockDiagram.press leftmargin: 1 in, topmargin: 1 in, width: 6.5 in, height: 8 in
CachePinOut.press leftmargin: 1 in, topmargin: 1 in, width: 6.5 in, height: 8 in
CacheTiming.press leftmargin: 1 in, topmargin: 1 in, width: 6.5 in, height: 8 in