NewCacheNotes.tioga
Written by: Sindhu, April 12, 1985 11:20:10 am PST
Last Edited by: Sindhu, April 23, 1985 4:29:34 pm PST
THE DRAGON CACHE ARCHITECTURE
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
The Dragon Cache Architecture
Design Notes
Release as [Indigo]<Dragon>Documentation>NewCacheNotes.tioga, .press
© Copyright 1984 Xerox Corporation. All rights reserved.
Abstract: This memo is a relatively disorganized collection of notes about the new two-level cache architecture being considered for Dragon. For the moment its main purpose is to record architectural decisions as they are made, along with the reasons for each decision. Consequently, introductory handwaves are kept to a minimum and descriptions intentionally terse. It is hoped that eventually these notes will develop into the design document for the Dragon Caches.
XEROX Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
For Internal Xerox Use Only
Contents
1. Preliminaries
2. The Two-Level Cache Architecture
3. Design Decisions
4. Consistency Invariants
5. Control Bit Transition Rules
6. Statistics Needed
1. Preliminaries
It has become clear from recent simulations that unless the amount of cached data per processor is increased substantially, we will not be able to support more than three or four processors on a single M-Bus. One way to get more data is to put as many cache chips per processor as are required. This is unacceptable because it requires too much board area per processor and also increases the number of M-Bus loads to the point where bus timing requirements are difficult to meet. A more workable solution is two-level caches: the first-level cache would be a custom chip with about the same amount of data as the old cache; the second-level cache would consist of a custom chip and several off-the-shelf memory chips, and would be slower than the first-level cache but contain much more data.
There is quite a variety of two-level schemes possible but most of them are either too hard to implement, involve throwing away too much work already completed, or do not provide reasonable intermediate implementations. The one we've settled on seems to have none of these disadvantages while providing all of the performance advantages we're after.
2. The Two-Level Cache Architecture
The two-level cache consists of three components: the first-level cache VC (for virtual cache), the second-level cache RC (for real cache), and 8-16K words of memory Mem; VC and RC are both custom chips, while Mem consists of four or eight 8Kx8 off-the-shelf memory chips.
The VC's connected to a processor sit on an internal bus called the MI-Bus (for M-Bus, Internal) whose protocol is a strict superset of the M-Bus protocol. VC is virtually identical to the old cache: its front end connects to the processor and traffics in reads and writes to virtual addresses; its back end connects to the MI-Bus and traffics in real addresses using the usual M-Bus transactions. VC's are designed so that it is also possible to plug them directly into the M-Bus and have them work (indeed this is essential, since it allows Dragons to be brought up without putting RC in the critical path).
The front end of RC is connected to the MI-Bus, and its back end to the main M-Bus. Mem connects directly to RC from the side. We will have to revisit this decision if pin count becomes a problem for RC; Mem may then have to be moved to the MI-Bus. There is only one RC per MI-Bus, and the arbiter for the MI-Bus sits inside RC. The arbiter for the M-Bus stays the same.
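As a rough sketch, the topology above might be written down as follows (all type and field names here are invented for illustration; a quad is assumed to be four words):

    /* Illustrative sketch of the two-level topology; not a real interface.  */
    struct Processor;                    /* the Dragon processor              */
    struct MBus;                         /* the main M-Bus                    */
    struct MIBus;                        /* the internal MI-Bus               */

    #define QUAD_WORDS 4                 /* assumed quad size: four words     */

    typedef struct {                     /* first-level (virtual) cache       */
        struct Processor *processor;     /* front end: virtual addresses      */
        struct MIBus     *mi;            /* back end: M-Bus transactions      */
    } VC;

    typedef struct {                     /* second-level (real) cache         */
        struct MIBus *mi;                /* front end, with the MI arbiter    */
        struct MBus  *m;                 /* back end: the main M-Bus          */
        unsigned     *mem;               /* 8-16K words of off-the-shelf RAM  */
    } RC;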
3. Design Decisions
->
There are two types of transactions on the MI-Bus: those for which VC knows it will need to get the M-Bus (WS, IOR, IOW, IORF, IOWF, MR, MRD) and those for which it cannot be sure that the M-Bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions VC always gets the M-Bus before getting the MI-Bus. For plocal transactions it asks for the MI-Bus alone.
The motivation for this distinction is that only plocal transactions need to be made restartable in order to avoid deadlock over bus allocation. Global transactions need not be restartable since the M-Bus is always acquired first.
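The classification and the resulting order of bus acquisition can be written down as a short sketch (transaction names follow the memo; everything else is invented for illustration):

    /* Sketch: global vs. plocal transactions and the bus acquisition rule.  */
    typedef enum {
        WS, IOR, IOW, IORF, IOWF, MR, MRD,   /* global: M-Bus surely needed   */
        RQ, WQ                               /* plocal: M-Bus maybe needed    */
    } Transaction;

    static int isGlobal(Transaction t) {
        return !(t == RQ || t == WQ);
    }

    extern void acquireMBus(void);           /* stand-ins for the arbiters    */
    extern void acquireMIBus(void);

    static void startTransaction(Transaction t) {
        if (isGlobal(t)) {
            acquireMBus();    /* M-Bus first, so the transaction never restarts */
            acquireMIBus();
        } else {
            acquireMIBus();   /* MI-Bus alone; must be prepared to restart      */
        }
    }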
->
Each VC has two request wires running to the arbiter inside RC: Rq and GRq (for global request). For plocal transactions VC asserts Rq alone; for global transactions it asserts both Rq and GRq. A GRq request from VC flows through RC and comes out on the M-Bus as Rq in one phase.
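In terms of the two request wires this reduces to (illustrative only):

    /* Sketch: request wires a VC drives toward the arbiter in RC.           */
    typedef struct { int Rq, GRq; } RequestWires;

    static RequestWires requestFor(int isGlobal) {
        RequestWires w;
        w.Rq  = 1;            /* Rq is asserted for every transaction         */
        w.GRq = isGlobal;     /* GRq only when the M-Bus is surely needed;    */
                              /* RC passes it through to the M-Bus as Rq      */
        return w;
    }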
->
Within RC, requests from the M-Bus are given priority over requests from the MI-Bus. That is, if an MI-Bus transaction has started and an M-Bus transaction arrives within RC then the MI-Bus transaction is forced to restart.
->
Restarts on MI are signalled by "grant yanking": that is, the arbiter in RC deasserts Gnt prematurely to signal that the current transaction must be restarted. The semantics of restart for the plocal transactions are given below.
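One plausible rendering of restart-on-yank on the VC side (all the externs are stand-ins, not real interfaces):

    /* Sketch: a plocal transaction is retried from scratch whenever the     */
    /* arbiter deasserts Gnt before the transaction has committed.           */
    extern void beginOnMI(void);      /* (re)issue the transaction on MI      */
    extern int  committed(void);      /* past the point of no return?         */
    extern int  gntYanked(void);      /* Gnt deasserted prematurely?          */
    extern void driveNextCycle(void);

    static void runPlocal(void) {
        for (;;) {
            beginOnMI();
            while (!committed()) {
                if (gntYanked()) break;     /* forced restart                 */
                driveNextCycle();
            }
            if (committed()) return;        /* ran to completion              */
        }
    }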
->
RQ restart. For a processor read we can let the processor go ahead as soon as the requested word is fetched by VC. For a processor write we must hold up the processor till the RQ is complete. The cost of this should be small since the number of writes that cause RQ's should be small: a data miss happens around 15% of the time, and a write reference happens in 1 out of 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit.
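The ~2% figure follows directly from the numbers quoted above; a one-line check:

    #include <stdio.h>

    int main(void) {
        double missRate    = 0.15;        /* fraction of data refs that miss    */
        double writeRate   = 1.0 / 21.0;  /* write references per cycle         */
        double missPenalty = 3.0;         /* extra cycles per write that misses */
        printf("added cost = %.1f%%\n",
               100.0 * missPenalty * missRate * writeRate);   /* about 2.1%     */
        return 0;
    }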
->
WQ restart. Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by flush or launder the processor is not involved at all, so the assertion is true. For WQ's caused by a miss resulting from a processor reference the processor must be forced to wait anyway, since the WQ is part of a WQ, RQ sequence.
Inside VC don't clear master until after commit.
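A minimal sketch of the commit rule inside VC ("commit" meaning the point after which the WQ can no longer be yanked; names invented):

    /* Sketch: keep master set across a restartable WQ; clear it only once   */
    /* the WQ has committed, so a restart still finds the quad owned.        */
    typedef struct { unsigned master : 1; } VCQuad;

    static void endWQ(VCQuad *q, int committed) {
        if (committed)
            q->master = 0;
        /* else: leave master alone and reissue the WQ from the beginning    */
    }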
->
To keep RC as simple as possible, make it fully associative. Back-of-the-envelope calculations show that we can get over 600 associaters. With 16K words in Mem, this means that each associater has control over 4 or 8 quads. Note that it's a good idea to use a power-of-two number of associaters; otherwise space in Mem will be wasted.
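The 4-or-8 figure works out as follows, assuming four words per quad and rounding the associater count down to 512, the nearest power of two below 600:

    #include <stdio.h>

    int main(void) {
        int quadWords   = 4;        /* assumed quad size                       */
        int associaters = 512;      /* nearest power of two below ~600         */
        int words;
        for (words = 8 * 1024; words <= 16 * 1024; words *= 2) {
            int quads = words / quadWords;
            printf("%5d words -> %4d quads -> %d quads per associater\n",
                   words, quads, quads / associaters);
        }
        return 0;   /* 4 quads per associater at 8K words, 8 at 16K            */
    }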
->
There will need to be a one cycle delay between M and MI in RC to provide enough time to acquire MI when a request comes in from M. This can be seen by considering a WS coming in from M: WS on M; match; yank Gnt; WS on MI;
The extra cycle is needed since MI becomes free only in the cycle following the yank. To keep things simple we should always insert the extra cycle regardless of whether MI is busy or not.
->
For each quad, RC keeps the following control bits:
master: when TRUE, RC or one of the VC's connected to it has the current copy of the quad
shared: when TRUE, there may be a copy of this quad in other RC's; when FALSE, no other RC contains a copy
inVC: when TRUE, there may be a copy of this quad in some VC connected to this RC's MI-Bus; when FALSE, none of the VC's on this MI-Bus contains this quad. This bit is needed to prevent RQ's, WQ's and WS's on the M-Bus from penetrating the MI-Bus when not absolutely necessary. Note that like shared, this bit is a hint.
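Written down as a struct (field names follow the memo; the struct itself is just illustrative):

    /* Per-quad control state kept by RC.                                     */
    typedef struct {
        unsigned master : 1;   /* RC or one of its VC's has the current copy  */
        unsigned shared : 1;   /* hint: other RC's may also hold the quad     */
        unsigned inVC   : 1;   /* hint: some VC on this MI-Bus may hold it    */
    } RCQuadBits;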
->
The transition of a quad in some VC from not shared to shared is tricky. The way it will work is as follows: when RC hands a quad to VC in response to an RQ, VC sets the shared bit for this quad. Now, the first time VC's processor writes into this quad, VC puts out a WS. This WS will go out on the M-Bus (this is not strictly necessary but doesn't hurt; it is done to avoid having WS's that can stay within RC and would therefore have to be restartable). In the meantime, RC asserts the shared line on MI if the shared bit for this quad in RC is set. If none of the other VC's on MI has a copy of this quad (that is, if the shared line stays deasserted), the VC that issued the WS clears its shared bit. Also, RC must clear the master bit for the quad to signal that it may no longer have the current copy from this point on (since VC will write into the quad without telling RC).
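An illustrative sketch of the VC side of this first-write handshake (sharedSeenOnMI stands in for sampling the MI shared line during the WS; it is not a real signal name):

    /* Sketch: first processor write into a quad that arrived via RQ.         */
    typedef struct { unsigned shared : 1, master : 1; } VCQuadState;

    static void firstWrite(VCQuadState *q, int sharedSeenOnMI) {
        /* shared was set when RC handed the quad over, so this write goes    */
        /* out as a WS on MI (and on through to the M-Bus).                   */
        q->master = 1;                /* this VC now owns the current copy    */
        if (!sharedSeenOnMI)
            q->shared = 0;            /* nobody else has a copy: subsequent   */
                                      /* writes can stay local                */
    }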
->
Victim selection in RC will be interesting given that every quad in VC must also be in RC. Consider the case where all of the quads in RC have inVC set. This could happen when a large amount of data is being read sequentially by the processor. Clearly, before RC can kick out a quad it has to check whether it's in a VC or not. It'd be much better if RC victimized only those quads that don't exist in any VC. One way is for RC to do an RQ on MI every time it tries to victimize a quad with inVC set. If someone pulls shared, RC knows not to kick this quad out. Note that the RQ on MI can be aborted as soon as RC has its answer, so it needn't take up too much bandwidth on MI. Hmmm. It looks like having most inVC bits set is the common case! Consider read-only data and code, for example. Every time such a quad gets into a VC its inVC bit gets set; this bit won't get cleared when the host VC victimizes this quad since this quad will be clean.
Now, this suggests a solution to our dilemma. Suppose every time a VC did an RQ, it put the address of the victim quad into one of the two idle cycles of the RQ. All of the VC's on MI would match on the address of the target of the RQ and then on the address of the victim. If any of them matches on the victim address it pulls down the shared line (or a new dedicated line called "nInVC"). RC would watch this line and set the inVC bit for the quad being victimized by the RQ according to the state of nInVC. Note that with this addition the inVC bit would no longer be a "hint".
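A sketch of victim selection in RC under this scheme (purely illustrative; it assumes the victim-address reporting above, so inVC is exact rather than a hint):

    /* Sketch: prefer victims that no VC currently holds.                     */
    typedef struct { unsigned valid : 1, inVC : 1; } RCEntry;

    static int pickVictim(RCEntry entry[], int n) {
        int i;
        for (i = 0; i < n; i++)              /* first choice: not in any VC    */
            if (entry[i].valid && !entry[i].inVC) return i;
        for (i = 0; i < n; i++)              /* otherwise evict from a VC too  */
            if (entry[i].valid) return i;
        return -1;                           /* a free slot exists; no victim needed */
    }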
->
It appears to be a really bad idea to put more than one processor on an MI. The reason is that the latency seen by each processor depends quite strongly on the load on MI (since the miss rate from VC's is quite high). We'd therefore like the MI's to be as lightly loaded as possible, which of course means one processor per MI.
4. Consistency Invariants
M-Bus
> If a quad is present in more than one RC all of its copies are identical.
> If a quad is present in more than one RC all copies have shared set.
> At most one RC copy of a quad may have master set.
MI-Bus
> If a quad is present in more than one VC all of its copies are identical.
> If a quad is present in more than one VC all copies have shared set.
> At most one VC copy of a quad may have master set.
M-Bus/MI-Bus
> Every quad in a VC is also present in the corresponding RC.
> If a quad is in VC then its RC copy must have inVC set.
> If a quad in RC has shared set then any VC copies must have shared set.
> If a quad in RC has shared set then any VC copies must be identical to the RC copy.
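The M-Bus/MI-Bus group can be checked mechanically over a snapshot of one quad; a sketch (all types invented; "identical" is read as identical to the RC copy):

    #include <assert.h>

    typedef struct { int present, shared, data; } Copy;

    /* rc: the RC copy of a quad; rcInVC: RC's inVC bit for it;               */
    /* vc[0..nVC): the copies, if any, in the VC's on that RC's MI-Bus.       */
    static void checkQuad(Copy rc, int rcInVC, Copy vc[], int nVC) {
        int i;
        for (i = 0; i < nVC; i++) {
            if (!vc[i].present) continue;
            assert(rc.present);                 /* every VC quad is also in RC   */
            assert(rcInVC);                     /* ...and RC's inVC bit says so  */
            if (rc.shared) {
                assert(vc[i].shared);           /* sharing propagates to VC      */
                assert(vc[i].data == rc.data);  /* shared copies stay identical  */
            }
        }
    }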
5. Control Bit Transition Rules
RC Control Bits
The transition rules for the control bits in RC are as follows:
master: set on WS from MI
clear on WS from M
inVC: set on RQ from MI
clear on {WS, RQ} to MI for which shared line on MI is deasserted
(assuming victim addr is issued on RQ: clear on RQ according to the state of the nInVC line on MI)
shared: set on {WS, RQ} on M for which shared line on M is asserted
clear on {WS, RQ} on M for which shared line on M is deasserted
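The same rules rendered as a sketch (a plausible reading only; the rules above leave some direction details implicit, and all names are invented):

    /* Sketch of the RC transition rules for one quad's control bits.         */
    typedef struct { unsigned master : 1, shared : 1, inVC : 1; } RCBits;

    /* A WS or RQ issued by some VC on the MI-Bus and matching this quad.     */
    static void rcSeesMIRequest(RCBits *q, int isWS, int isRQ) {
        if (isWS) q->master = 1;
        if (isRQ) q->inVC   = 1;
    }

    /* A WS or RQ that RC itself drives onto the MI-Bus; sharedMI is the      */
    /* state of the MI shared line at the end of the transaction.             */
    static void rcDrivesMI(RCBits *q, int sharedMI) {
        if (!sharedMI) q->inVC = 0;     /* no VC answered: quad not below us   */
    }

    /* A WS or RQ observed on the M-Bus and matching this quad; sharedM is    */
    /* the M shared line; otherWS means the WS came from another cache.       */
    static void rcSeesM(RCBits *q, int otherWS, int sharedM) {
        if (otherWS) q->master = 0;
        q->shared = sharedM;
    }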
VC Control Bits
The transition rules for the control bits in VC are as follows:
master: set on write from processor
clear on WS from MI
shared: set on {WS, RQ} on MI for which shared line on MI is asserted
clear on {WS, RQ} on MI for which shared line on MI is deasserted
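And the corresponding sketch for VC (same caveats as the RC sketch above):

    /* Sketch of the VC transition rules for one quad's control bits.         */
    typedef struct { unsigned master : 1, shared : 1; } VCBits;

    static void vcOnProcessorWrite(VCBits *q) {
        q->master = 1;
    }

    /* A WS or RQ observed on the MI-Bus and matching this quad; sharedMI     */
    /* is the MI shared line; isWS distinguishes the two transactions.        */
    static void vcSeesMI(VCBits *q, int isWS, int sharedMI) {
        if (isWS) q->master = 0;
        q->shared = sharedMI;
    }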
6. Statistics Needed
We need to get the following two statistics:
(a) How often do processor writes result in VC misses? This is needed to verify that the three cycles of latency added when a write misses in VC are not a performance problem.
(b) How often does a processor write into a clean quad in VC? This is needed to verify that the spurious WS generated by VC in communicating to RC that a quad has been written into is not a performance problem.
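In the simulator these amount to a pair of counters; an illustrative sketch (hook and counter names invented):

    #include <stdio.h>

    static long writes, writeMisses, writesIntoCleanQuad;

    /* To be called from the VC model on every processor write.               */
    static void noteWrite(int hitInVC, int quadWasClean) {
        writes++;
        if (!hitInVC)     writeMisses++;           /* statistic (a)           */
        if (quadWasClean) writesIntoCleanQuad++;   /* statistic (b)           */
    }

    static void report(void) {
        printf("(a) write misses in VC:    %ld of %ld writes\n", writeMisses, writes);
        printf("(b) writes to clean quads: %ld of %ld writes\n", writesIntoCleanQuad, writes);
    }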