SCDesignDecisions.tioga
Written by: Pradeep Sindhu, November 21, 1985 2:03:52 pm PST
Last Edited by:
Pradeep Sindhu, April 22, 1986 4:30:42 pm PST
DRAGON SMALL CACHE DESIGN DECISIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
Dragon Small Cache
Design Decisions
Release as [Indigo]<Dragon>SmallCache>Documentation>SCDesignDecisions.tioga, .press
© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: This document contains a record of design decisions for the Dragon small cache; where possible it also tries to provide the rationale for each decision. It is currently used by the cache designers to preserve their sanity, and is expected to be useful in the future in writing documentation and/or papers about the Dragon memory system. Some information in here may be duplicated elsewhere (e.g., in the design decisions document for the big cache, and in the design decisions document for the memory system).
Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
For Internal Xerox Use Only
Contents
Introduction
Buzzword Definition
Internal Interfaces
Restart Semantics
Request/Grant
Notification of First Write
Signaling Victimization
Response to Done Command
RAM Quad Interface?
P-Bus Byte Interface
WQ and WQS
Cache Entry Size
IORead (IOR)
Transport Order
Latches for RMatch and VMatch
Array Value Caching
Flush, Launder and DeMap Implementation
Bypassing
Layout
Timing
Unresolved Issues
Introduction
This document is one in a series that attempts to record design decisions for the Dragon Memory System as they are made, and especially to outline the reasons for each decision. Given the overall complexity of the Memory System, the designers found it necessary to keep such a running record to avoid revisiting old decisions and having to unearth the reasons for them each time a contentious issue came up.
The structure of this series is as follows. The Memory System document forms the top level, confining itself to decisions that affect at least two of the components comprising the Memory System: the Small Cache, the Big Cache, the Map Cache, and Main Memory. This document confines itself to decisions that affect the Small Cache.
The section immediately below defines terms that are used frequently throughout this document. Each of the sections that follow covers one design decision.
Buzzword Definition
There is a fair amount of shorthand in this document. This section defines the terms before they are used.
BC Big Cache
SC Small Cache
WS WriteSingle
RQ ReadQuad
WQ WriteQuad
RQS ReadQuadStream
WQS WriteQuadStream
IOR IORead
IOW IOWrite
IORF IOReadFlow
IOWF IOWriteFlow
RMSR ReadMapSetRef
RMSRSD ReadMapSetRefSetDirty
MRD Map Read
MIBus Internal MBus
Entry smallest SC unit that can match independently
Line horizontally repeating geometric unit (may contain 1 or more entries)
Internal Interfaces
The cache has two external interfaces: the P-Bus, and the M-Bus. The P-Bus interface has been under our control, but has nonetheless remained relatively stable. The M-Bus interface is not under our control and has changed enough to cause one design to be thrown away. Added to this instability is the fact that the M-Bus is outmoded and needs to be replaced by a cleaner, faster bus.
To finesse M-Bus problems, it is critically important to define stable internal interfaces, and then to design pieces of the cache to these interfaces. Then, if the M-Bus changes, only a portion of the cache needs to be redesigned. For such a strategy to work, the internal interfaces must not have any knowledge of the M-Bus, and they must be designed with the highest possible performance to allow faster buses to be substituted at either the P or the M side; of course P-Bus side speed is more important, so it will be given priority in the design. At the top level, we see three internal interfaces: the P-Control/Array (PA) Interface, the M-Control/Array (MA) Interface, and the P-Control/M-Control (PM) Interface.
Restart Semantics
RQ Restart
For an RQ generated by a PReadMiss we can let the processor go ahead as soon as the requested word is fetched by SC. For a PWriteMiss we must hold up the processor until the RQ is complete. The cost of this should be small since the number of write misses is small. Write misses are around 1% of EU references. Since there are 17 EURefs per 100 cycles, we get 0.17 write misses per 100 cycles. Three cycles are added for each write miss, so the added cost is 3*0.17 ≈ 0.5 cycles per 100, i.e. 0.5%.
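To sanity-check this arithmetic, here is the computation as a minimal C sketch (the figures are the ones quoted above; everything else is ours):
	/* Added processor cycles per 100 cycles from holding the processor on write misses. */
	#include <stdio.h>
	int main(void) {
		double euRefsPer100 = 17.0;	/* EU references per 100 cycles */
		double writeMissRatio = 0.01;	/* write misses as a fraction of EU refs */
		double cyclesPerMiss = 3.0;	/* cycles the processor is held per write miss */
		double cost = euRefsPer100 * writeMissRatio * cyclesPerMiss;
		printf("added cost: %.2f cycles per 100, i.e. about 0.5%%\n", cost);
		return 0;
	}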
It would be nice to set VValid and RValid after commit, so we don't have to do anything to abort an RQ. Unfortunately, to make bypassing simpler we need to set VValid when the very first word is fetched. Therefore aborting an RQ will require clearing VValid (and RValid, if it was set).
WQ Restart
Unlike RQ, we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by flush or launder, the processor is not involved at all, so the assertion holds trivially. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway, since the WQ is part of a WQ, RQ sequence.
Inside SC, don't clear Master until after commit.
Request/Grant
Each SC has two request wires running to the arbiter inside BC: Rq and GRq (for Global Request). For local transactions SC asserts Rq alone; for global transactions it asserts both Rq and GRq. A GRq request from SC flows through BC and comes out on the M-Bus as Rq in one phase; conversely, a Grant from the M-Bus arbiter is passed through BC and comes out on the MI-Bus in one phase.
Notification of First Write
When a quad is written into for the first time by an SC, the BC connected to that SC must be notified. If this is not done, BC has no way of knowing whether or not it must respond to an RQ on the M-Bus.
Here is how it will work: whenever SC gets a quad in response to an RQ it initiated, it initializes the quad's bits to Master=0, Shared=1. Now the first time SC's processor writes into this quad, SC will put out a WS onto MI (since Shared is set) and will also set Master to 1. In the meantime BC checks whether its copy of the Shared bit for the quad is set, and if it is, BC pulls the shared line on MI. If BC's copy of Shared is not set, and if none of SC's brother caches pull shared, then SC will clear its Shared bit. In this latter case future writes will stay entirely within SC. The WS initiated by SC will of course go out to the main M-Bus (like all good WS's). This isn't necessary if Shared wasn't set in BC, but it doesn't hurt (logically speaking). To see that performance is not a problem, see the section titled Main M-Bus Load—WS Contribution in MSDesignDecisions.tioga.
It may seem that the case Master=0, Shared=0 cannot happen, because these bits are initialized to 01 and then change to 10 or 11. In fact the 00 case will happen when a Flush or a Launder is done to data that isn't shared but has been written into.
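For concreteness, here is a minimal C sketch of the per-quad state transitions just described (the names are ours; sharedLinePulled abstracts the result of sampling the shared line on MI):
	/* Per-quad state bits in SC and the transitions described above. */
	typedef struct { int master, shared; } QuadState;
	/* On filling a quad via an RQ that SC initiated: Master=0, Shared=1. */
	void onQuadFill(QuadState *q) { q->master = 0; q->shared = 1; }
	/* On the processor's first write: a WS goes out on MI since Shared is set;
	   Master becomes 1, and Shared clears if nobody pulled the shared line. */
	void onFirstWrite(QuadState *q, int sharedLinePulled) {
		q->master = 1;
		if (!sharedLinePulled) q->shared = 0;	/* future writes stay within SC */
	}
	/* Flush or Launder writes the quad back and clears Master; this is how
	   the Master=0, Shared=0 state arises for written, unshared data. */
	void onFlushOrLaunder(QuadState *q) { q->master = 0; }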
Signaling Victimization
The second idle cycle in an RQ is used to transfer the address of the victim into which the read quad is going. This is needed so that BC can maintain the ExistsBelow bit as an absolute rather than as a hint. As far as SC is concerned, the M-Bus has two shared lines, one called MnShared and the other MnVictimShared. SC must watch the first bus cycle of an RQ and assert MnShared in the usual way, but it must also watch the third cycle (the one containing the victim address) and assert MnVictimShared if there was a match. This is needed so that BC can do its victimization without violating the invariants that BC and its SC's must maintain (see the design decisions document for BC).
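In software terms the snoop amounts to the following minimal C sketch (the names are ours; camMatch stands in for SC's address match):
	/* SC's snoop on another cache's RQ: match the requested address in
	   cycle 1 and the victim address in cycle 3, asserting the
	   corresponding shared line on a match. */
	typedef int (*MatchFn)(unsigned adrs);	/* does SC hold this quad? */
	typedef struct { int MnShared, MnVictimShared; } SnoopOut;
	void snoopRQ(unsigned requestedAdrs, unsigned victimAdrs,
	             MatchFn camMatch, SnoopOut *out) {
		out->MnShared = camMatch(requestedAdrs);	/* first bus cycle */
		out->MnVictimShared = camMatch(victimAdrs);	/* third cycle: victim address */
	}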
This design decision locks us into having at least one dead cycle in the M-Bus protocol, which is not such a hot idea. There are many ways to fix this—a good one being to have separate data and address lines on the M-Bus.
Response to Done Command
SC's must respond to all Dones, including ones corresponding to WriteMap and WriteMapFlags, which SC's don't issue. This was agreed upon at the May 29th meeting with ED.
RAM Quad Interface?
Arguments for Word Interface Only
I. A word interface is required anyway because the P-Bus deals in words, but a quad interface is not necessary.
II. Providing a quad interface increases power consumption since more bit lines get charged and discharged; peak currents are also larger.
III. Providing a quad interface increases complexity because of the need to multiplex the RAM bit lines between the quad interface and the word interface.
IV. There is virtually no performance advantage associated with providing a quad interface, assuming the Common M-Bus. Here's why. In most of the cases where the M-Bus side is accessing the RAM (a processor reference that misses), the processor is waiting anyway, so its accesses can't conflict with those of the M-Bus; and since the M-Bus provides words one at a time, there is no performance loss associated with storing them one at a time. Now let's consider the cases where processor/M-Bus conflicts can arise over the RAM. The first is a ReadQuad on the M-Bus that matches a quad that has Master set; in this case the cache must provide the quad. For the IFU, Master bits will never be set, so the question of conflicts never arises. The EU makes many fewer references than the IFU, so although there is conflict, it doesn't happen very often. Let's make an estimate: the M-Bus (or MI-Bus) will be run 60% utilized, and most of this traffic is ReadQuads. Let's suppose 10% of these ReadQuads are directed to shared data, and that half of these hit quads in our cache that have Master set. So in 100 cycles there are 0.5 ReadQuads on the M-Bus that may conflict, occupying (1/2)*4 = 2 cycles within the cache. Assuming our standard 17 EU references per 100 cycles, we get an average of 0.34 references per 100 cycles that will conflict (this analysis works only for small conflict probabilities). The average amount of time the processor has to wait as a result of a conflict is 2 cycles, so the total cost in 100 cycles is 0.34*2 = 0.68 cycles, or 0.68%—end of problem. The other case is WQS's that hit. We have no data about the frequency of WQS's, but it is probably small.
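Here is the estimate as a minimal C computation (all the figures are the assumptions stated above):
	/* Argument IV's estimate of processor/M-Bus conflicts over the RAM. */
	#include <stdio.h>
	int main(void) {
		double rqPer100 = 0.5;	/* conflicting ReadQuads per 100 cycles */
		double ramCyclesPerRQ = 4.0;	/* RAM cycles each one occupies */
		double euRefsPer100 = 17.0;	/* EU references per 100 cycles */
		double waitPerConflict = 2.0;	/* average processor wait per conflict */
		double busyFrac = rqPer100 * ramCyclesPerRQ / 100.0;	/* 0.02 */
		double conflicts = euRefsPer100 * busyFrac;	/* 0.34 refs per 100 cycles */
		printf("cost: %.2f cycles per 100\n", conflicts * waitPerConflict);	/* 0.68 */
		return 0;
	}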
Arguments for Quad Interface
If we take the Common M-Bus as given, then it's clearly preferable not to have a quad interface. As it stands, however, this is hardly the case. The only reason for designing to the Common M-Bus is to avoid the time penalty of going through yet another bus design before having a working cache.
Thus, to permit us the flexibility of upgrading to a faster bus, SC will provide a quad interface. As pointed out by arguments II and III above, this will involve some penalties. I believe these penalties are well worth the benefits.
P-Bus Byte Interface
The cache will provide a byte interface for the processor. For processor byte reads the interface does nothing. That is, it simply returns the whole word; it's up to the processor to figure out which byte to extract. For writes, the processor asserts the appropriate one of four wires provided by the cache: Byte0, Byte1, Byte2, and Byte3. Note that the processor can, in fact, cause any subset of the four bytes in a word to get written.
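A minimal C sketch of the write merge (ours; we've arbitrarily taken Byte0 to be the most significant byte, which the text above doesn't specify):
	/* Merge a processor write into a cached word under the four byte-enable wires. */
	typedef unsigned int Word;	/* 32 bits */
	Word mergeByteWrite(Word old, Word data, const int byteEnable[4]) {
		Word result = old;
		for (int i = 0; i < 4; i++)
			if (byteEnable[i]) {
				Word mask = (Word)0xFF << (8 * (3 - i));	/* lane for byte i */
				result = (result & ~mask) | (data & mask);
			}
		return result;	/* any subset of the four bytes may be written */
	}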
WQ and WQS
SC will swallow matching WQS's that appear on its M-Bus side, but will not swallow matching WQ's. This is to avoid increasing P/M conflicts over the RAM.
Cache Entry Size
A basic choice is whether or not cache entries should contain multiple quads. Multiple quads increase the amount of data in the cache but add complexity because of presence bits. Furthermore, with multiple quads per entry substantial fractions of the RAM in the cache may be wasted because of non-locality, so the increased amount of data may not result in a better hit rate.
There are several competing organizations, some of which have only a marginal possibility of fitting into a 10x10mm die. The summary data for several organizations appears below. The notation E/Q denotes a cache organization, where E is the number of cache entries and Q is the number of quads per entry. The table lists, for each organization, the number of added cycles due to misses in 100 processor cycles (the figures below assume 5 cycles to satisfy a miss; this is in fact an error, since the number is more like 7, but it doesn't change the rank ordering because all the numbers get inflated by 7/5).
Program / Ordering   1              2              3              4              5
CompSmall            100/2 (20.1)   128/1 (21.7)   100/1 (24.6)   50/4 (25.9)    50/2 (31.8)
CompMedium           100/2 (18.8)   128/1 (20.3)   100/1 (25.6)   50/4 (26.4)    50/2 (34.5)
Spinifex             100/2 (16.6)   128/1 (23.2)   50/4 (26.6)    100/1 (31.8)   50/2 (42.4)
WritePlainText       100/2 (4.5)    128/1 (4.9)    50/4 (5.8)     100/1 (6.4)    50/2 (8.6)
Waterlily            100/2 (4.4)    128/1 (4.4)    100/1 (6.9)    50/4 (8.3)     50/2 (15.3)
It's clear from the table that 100/2 is just slightly better than 128/1. Unfortunately, 100/2 is significantly more complex and furthermore may not fit on a 10x10 die. 128/1 is a clear winner over the others, so we decided to go with a one-quad-per-entry organization.
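As a quick check that the 7/5 correction mentioned above doesn't disturb the ordering, here is the CompSmall row rescaled (a minimal computation; since every entry scales by the same factor, the ranks cannot change):
	/* Rescale added-cycle figures from 5 to 7 cycles per miss. */
	#include <stdio.h>
	int main(void) {
		double compSmall[] = {20.1, 21.7, 24.6, 25.9, 31.8};	/* row from the table */
		for (int i = 0; i < 5; i++)
			printf("%.1f -> %.1f\n", compSmall[i], compSmall[i] * 7.0 / 5.0);
		return 0;
	}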
IORead (IOR)
SC will assume that IOR is flow controlled using the MnDV line. As long as one of our caches is the requestor of the IOR, the IOR will work no matter who responds and no matter where the requestor or responder is located in the two-level system. If an ED cache issues an IOR, the IOR will work only if the responder is an ED cache or one of our caches located on the main M-Bus. Note that no timeout mechanism will be provided within SC's to catch IOR's to non-existent locations, so such IOR's will hang the system. We expect that IOR's will be issued only from kernel space, and this code is supposed to not screw up, so this is ok for starters. In the long run, the M-Bus monitor chip will provide the timeout function, so the problem goes away.
Transport Order
Here is the latest information about transport order. The memory controller does not do alternate order, and it does only the following two (cyclic) cases of primary order (let's call this half-cyclic):
0 1 2 3
2 3 0 1
The ED caches do both primary and alternate order.
The consequence of this is that our caches also need to know about only these two orders. On a miss, the LSB of the address for the RQ should be zeroed (since we don't know whether a cache or the memory controller will respond), regardless of whether the processor is requesting an odd word or not. Also, when responding to an RQ on the M-Bus, our caches should take the LSB of the address of the RQ and put it out as alternate order; this is needed because the ED caches understand alternate order and must be told that they are not getting the words in the requested order in the case of odd addresses.
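A minimal C sketch of half-cyclic order (the function is ours):
	/* Half-cyclic transport order: delivery starts at word 0 or 2 (the LSB
	   of the 2-bit starting index is zeroed) and proceeds cyclically,
	   giving exactly the two orders 0 1 2 3 and 2 3 0 1. */
	void halfCyclicOrder(unsigned requestedWord, unsigned order[4]) {
		unsigned start = requestedWord & 2u;	/* zero the LSB: 0..3 -> 0 or 2 */
		for (unsigned i = 0; i < 4; i++)
			order[i] = (start + i) & 3u;
	}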
Since the requested word won't be delivered first for odd addresses, there will be a performance hit. Let's compute the magnitude. Assume that odd and even words are requested equally often (in fact word 0 will probably be the most frequent, but this assumption errs in the pessimistic direction, so it's ok). If the memory controller did all four orders, our delay number for 100 ideal processor cycles would be 56*0.05*7 + 17*0.08*7 = 29.1. With the current memory controller the delay number is 56*0.05*7.5 + 17*0.08*7.5 = 31.2. So the difference is (131.2-129.1)/129.1, or around 1.6%. Given that RQ's for word 0 will be more frequent than others, this is not worth worrying about.
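The same computation in C (56 IFU and 17 EU references per 100 cycles, 5% and 8% miss rates, and 7 cycles per miss are the figures used above; the extra half cycle is the average penalty when odd and even requests are equally likely):
	/* Delay per 100 ideal processor cycles: full-cyclic vs half-cyclic order. */
	#include <stdio.h>
	int main(void) {
		double misses = 56 * 0.05 + 17 * 0.08;	/* misses per 100 cycles */
		double full = misses * 7.0;	/* 29.1: requested word delivered first */
		double half = misses * 7.5;	/* 31.2: half a cycle extra on average */
		printf("slowdown: %.1f%%\n", 100.0 * (half - full) / (100.0 + full));
		return 0;
	}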
Latches for RMatch and VMatch
VMatch must be latched for the following reason. Consider the case of PRead-DataMissAdrsHit: here, when trying to do the page match, the CAM bit lines are driven during PhA and any matching real page is read out of the real CAM in PhB. The question is when CAMSelect should be driven by VMatch. Clearly this cannot be done during PhA, because the real and virtual CAMs share the select line, and driving the line during PhA would cause stuff to get written during PhA—BAD! This has three consequences: 1. CAMSelect is low during nPhB, and is driven during PhB; 2. the CAM bit lines are driven during PhB only if we want to write stuff into the CAM; and 3. VMatch must be latched during PhA so it can be used during PhB to drive CAMSelect.
Array Value Caching
A value that is read out of the cache array at time t and used later at some time t+x may be invalid at the later time, because operations on the M-Bus could have modified the value between t and t+x. The only way we can be sure that the values at t and t+x are the same is if we were master (had grant) in between. Now, for a number of cases it's better (for performance) to read out the value from the array at the same time we request the bus; this way we can start whatever operation we intend to do on the M-Bus in the very next cycle after getting grant. The problem, of course, is that the value we have in hand may have been updated in the meantime.
The simple fix is to always wait for grant, giving up the performance advantage.
Since this involves a performance hit for the most common case, here's the fix that will probably be used. Read out the value as early as possible and put it in a holding register. This register also has a state bit called "valid" which is set when the value is read out. The M-Bus controller clears the valid bit whenever a "dangerous" operation appears on the M-Bus. The instant we get grant we check the valid bit: if it's set, no problem; if it's not, we read out the value once more. This way we pay the penalty only when necessary. The simplest definition of a "dangerous" operation is any M-Bus operation. This may be too simplistic, because the most frequent operation is RQ, and it is innocuous, so it will pay to filter out RQ's.
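A minimal C sketch of the scheme (types and names are ours; the dangerous-operation filter is passed in as a flag since its exact definition is still open):
	/* Holding register with a valid bit, as described above. */
	typedef unsigned (*ReadArrayFn)(void);	/* read the value out of the array */
	typedef struct { unsigned value; int valid; } Holding;
	/* At request time: read early so the M-Bus operation can start the
	   cycle right after grant. */
	void onRequest(Holding *h, ReadArrayFn readArray) { h->value = readArray(); h->valid = 1; }
	/* Snooped on every M-Bus operation while waiting for grant; the
	   simplest filter calls everything dangerous, a better one exempts RQ. */
	void onMBusOp(Holding *h, int opIsDangerous) { if (opIsDangerous) h->valid = 0; }
	/* At grant: use the held value if still valid, otherwise reread. */
	unsigned onGrant(Holding *h, ReadArrayFn readArray) {
		if (!h->valid) { h->value = readArray(); h->valid = 1; }
		return h->value;
	}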
Flush, Launder and DeMap Implementation
The idea here is to devote as little hardware to these operations as possible, since they aren't done frequently. A general problem is that while a cache is doing one of these operations it must keep listening and responding to the M-Bus side if there is activity (the processor is shut off). One way to achieve this is to grab M-Bus grant before attempting to do any operation:
DO
	Do Rq;	-- request the bus; hold it for the whole pass
	Grant => MatchMask, RealAdrsReg ← TRIT;	-- on grant, load the mask and real address from the TRIT registers
	Do RealMatch;
	IF NoMatch THEN EXITLOOP;	-- nothing left to process
	entry ← SelectOne[matchingLines];	-- pick the first matching entry
	DoOp[entry];	-- flush, launder, or demap it
	MatchMask ← PageMaskReg;
	Drop Grant;	-- release the bus between passes
ENDLOOP;
There are several things to note. First, this loop will terminate for Flush and Launder because no Master bits can get set once the processor is held up, and every pass through the loop clears at least one bit. Second, the function SelectOne[matchingLines] can be implemented in several ways: either by stepping the victim pointer serially, or by building a special serial arbiter (find first one). Given that we're holding the bus, the serial arbiter solution is better because it's faster: a serial arbiter will probably be able to step through 100 entries in one or two cycles, while the victim pointer could take up to 100 cycles, which is unreasonably long to hold the bus.
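In software terms SelectOne is just find-first-one over the match vector, as in the minimal C sketch below; the hardware version is a combinational priority chain over the entries rather than a loop.
	/* Find-first-one: index of the first matching entry, or -1 if none.
	   A serial arbiter computes this in one or two cycles. */
	int selectOne(const int matching[], int nEntries) {
		for (int i = 0; i < nEntries; i++)
			if (matching[i]) return i;
		return -1;
	}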
Hmm. Why do you believe that DeMap will terminate if done in the same way? Looks like if someone is busily setting RPDirty bits while you are just as busily clearing them, the cache doing DeMap could be at it forever.
Bypassing
The general idea behind bypassing is to complete a processor reference as quickly as possible and let the processor go, rather than wait for the array to finish doing its thing. Since writes are difficult to handle (and are not that frequent anyway) it was decided not to be very clever about bypassing writes.
The computation for bypassing can be conveniently broken into two parts. The first is to figure out whether the VAdrs from the processor corresponds to the quad being processed. The second is to figure out whether the word requested can be delivered at this instant. For the reference that causes a miss, the first part of the computation is unnecessary since the answer is yes, but the second part is still needed. For references that immediately follow the missing reference we must do both calculations. Now, it's possible to do the first part of the calculation (quad matching) using the array itself if we agree to write the CAM entry at the same time the first word is being written. If we do this, we can provide an extra vertical array match line, VictimMatch, derived by ANDing VMatch with Victim for each line and then wire-ORing the results. VictimMatch simply tells us whether the match done in this cycle hit the entry we're writing into. The problem with this is that it forces us to clear VPValid when an abort happens (the plan earlier was to write the CAM along with the very last word of the ReadQuad, in which case a restart wasn't necessary).
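A minimal C sketch of the resulting bypass decision (names are ours; wordsPresent abstracts which words of the in-flight quad have arrived so far):
	/* The reference can bypass the array if it is directed at the quad
	   currently being filled (part 1, via VictimMatch) and the requested
	   word has already arrived (part 2). */
	typedef struct {
		int victimMatch;	/* this cycle's match hit the entry being filled */
		unsigned wordsPresent;	/* bit i set when word i of the quad has arrived */
	} FillState;
	int canBypass(const FillState *f, unsigned wordInQuad /* 0..3 */) {
		return f->victimMatch && ((f->wordsPresent >> wordInQuad) & 1u);
	}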
There was an alternative proposal for doing the first part of the computation: have the IFU tell us whether the current VAdrs is the previous VAdrs+1. This proposal wasn't exactly a hit with the IFU crew. They also pointed out a problem with the plan: none of the other processors that we might attach to the cache is likely to generate such a wire, and building a piece of discrete hardware to do this function would be a pain. Soooo, I've decided to bite the bullet and put the damn thing in.
Layout
The latest word on the layout is the following (see SCDataPath.dale for details). The array will consist of three columns: ArrayCAMLeft, ArrayRAM and ArrayCAMRight. ArrayCAMLeft and ArrayCAMRight are identical, with each row corresponding to one cache entry. ArrayRAM contains the RAM corresponding to ArrayCAMLeft and ArrayCAMRight. Each row of ArrayRAM contains one quad of data for the left side and one for the right side.
In ArrayCAM, the bits of VA and RA are interleaved to make it easier to transfer data from VA to RA (for operations like IORead etc).
Timing
See SCTiming.tioga
Unresolved Issues
1. Should SC's have address filters on the low bits of virtual address to permit SC's to be put in parallel? A simple way to implement this would be to put the filter between the PBus pads and the array, and have the cache see a NoOp when there is no match.
2. After an abort should the P side restart the operation from the beginning, or should the M side restart just the transaction that was aborted?
3. Should we return the requested word in PhB of the cycle in which it arrives on the MBus or in the next cycle? This one cycle difference is worth 3% or so, but it may be quite difficult to get.
4. Given that we have a serial arbiter present (see Flush, Launder and DeMap Implementation).
5. Should we add counters to the cache to record significant numbers? The counters would be accessible via the Fast IO space, just like the TRIT registers for flush and launder.
6. A good way to reduce the number of first writes from SC is to include the master bit in victim selection. This will tend to keep an entry that's been written into in the cache for somewhat longer, and thereby amortize the cost of the first write. However, you need to be careful how this is implemented since it could easily backfire if it tends to keep non-frequently referenced entries in the cache.