SmallCacheDesignDecisions.tioga
Written by: Pradeep Sindhu, November 21, 1985 2:03:52 pm PST
Last Edited by:
Pradeep Sindhu, April 22, 1986 4:30:42 pm PST
DRAGON SMALL CACHE DESIGN DECISIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
Dragon Small Cache
Design Decisions
Release as [Indigo]<Dragon>SmallCache>Documentation>SmallCacheDesignDecisions.tioga, .press

© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: This document contains a record of design decisions for the Dragon Small Cache; where possible it also tries to provide the rationale for each decision. It is currently being used by the cache designer to preserve his sanity, but is expected to be useful in the future in writing documentation and/or papers about the Dragon memory system. Some information in here may be duplicated elsewhere (e.g., in the design decisions document for the Big Cache and in the design decisions document for the memory system).
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
Introduction
Buzzword Definition
P-Bus Byte Interface
Cache Entry Size
Latches for RMatch and VMatch
Array Value Caching
Bypassing
Protection Checking
Unresolved Issues
Introduction
This document is one in a series that attempts to record design decisions for the Dragon Memory System as they are made, and especially to outline the reasons for each decision. Given the overall complexity of the Memory System, the designers found it necessary to keep such a running record to avoid revisiting old decisions and trying to unearth the reasons for them each time.
The structure of this series is as follows. The Memory System document forms the top level, confining itself to decisions that affect at least two of the components comprising the Memory System: the Small Cache, the Big Cache, the Map Cache, and Main Memory. This document confines itself to decisions that affect the Small Cache.
The section immediately below defines terms that are used frequently throughout this document. Each of the following sections covers one design decision.
Buzzword Definition
There will be a fair amount of shorthand in this document. This section makes an attempt to define some of the terms before they are used.
 BC  Big Cache
 SC  Small Cache
 Entry  smallest SC unit that can match independently
 Line  geometric horizontally repeating unit (may contain 1
   or more entries)
P-Bus Byte Interface
The cache will provide a byte interface for the processor. For processor byte reads the interface does nothing special: it simply returns the whole word, and it's up to the processor to figure out which byte to extract. For writes, the processor asserts the appropriate ones of four wires provided by the cache: Byte0, Byte1, Byte2, and Byte3. Note that the processor can, in fact, cause any subset of the four bytes in a word to get written.
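As an illustration only (not taken from the cache logic; the function name and the byte ordering are assumptions), the following C sketch shows the merge semantics implied by the four byte-enable wires:

#include <stdint.h>

/* Hypothetical model of the P-Bus byte write enables.  Bit i of
   byteEnables corresponds to the ByteI wire (bit 0 = Byte0).  Byte0 is
   taken to be the most significant byte of the word; the actual byte
   ordering is an assumption, not something this document specifies. */
static uint32_t MergeByteWrite(uint32_t oldWord, uint32_t writeData,
                               unsigned byteEnables)
{
    uint32_t result = oldWord;
    for (int i = 0; i < 4; i++) {
        if (byteEnables & (1u << i)) {
            uint32_t mask = 0xFFu << (8 * (3 - i));   /* select byte i */
            result = (result & ~mask) | (writeData & mask);
        }
    }
    return result;
}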
Cache Entry Size
A basic choice is whether cache entries should contain multiple quads or not. Multiple quads increase the amount of data in the cache, but they add complexity because of the presence bits they require. Furthermore, with multiple quads per entry substantial fractions of the RAM in the cache may be wasted because of non-locality, so the increased amount of data may not result in a better hit rate.
There are several competing organizations, some of which have only a marginal possibility of fitting into a 10x10mm die. The summary data for several organizations appears below. The notation E/Q is used to denote a cache organization, where E is the number of cache entries and Q is the number of quads per entry. The table lists, for each organization, the number of cycles added by misses per 100 processor cycles (the figures below assume 5 cycles to satisfy a miss; this is in fact an error since the number is more like 7, but the error doesn't change the rank ordering, since all the numbers simply get inflated by 7/5).
Program          1st           2nd           3rd           4th           5th
CompSmall        100/2 (20.1)  128/1 (21.7)  100/1 (24.6)  50/4 (25.9)   50/2 (31.8)
CompMedium       100/2 (18.8)  128/1 (20.3)  100/1 (25.6)  50/4 (26.4)   50/2 (34.5)
Spinifex         100/2 (16.6)  128/1 (23.2)  50/4 (26.6)   100/1 (31.8)  50/2 (42.4)
WritePlainText   100/2 (4.5)   128/1 (4.9)   50/4 (5.8)    100/1 (6.4)   50/2 (8.6)
Waterlily        100/2 (4.4)   128/1 (4.4)   100/1 (6.9)   50/4 (8.3)    50/2 (15.3)
It's clear from the table that 100/2 is only slightly better than 128/1. Unfortunately, 100/2 is significantly more complex and, furthermore, may not fit on a 10x10 die. 128/1 is a clear winner over the remaining organizations, so we decided to go with a 1 quad per entry organization.
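To make the scaling argument concrete, here is a small C sketch (illustration only; the two miss counts are simply the first two CompSmall figures divided by 5) showing that multiplying every figure by the same 7/5 factor cannot change the rank ordering:

#include <stdio.h>

/* Added processor cycles per 100 cycles = misses per 100 cycles times
   the per-miss penalty.  The table was computed with penalty = 5; the
   true penalty is closer to 7, which inflates every entry by 7/5 and
   therefore leaves the relative ordering unchanged. */
static double AddedCycles(double missesPer100, double cyclesPerMiss)
{
    return missesPer100 * cyclesPerMiss;
}

int main(void)
{
    double misses100_2 = 20.1 / 5.0;   /* CompSmall, 100/2 organization */
    double misses128_1 = 21.7 / 5.0;   /* CompSmall, 128/1 organization */
    printf("penalty 5: 100/2 = %.1f, 128/1 = %.1f\n",
           AddedCycles(misses100_2, 5), AddedCycles(misses128_1, 5));
    printf("penalty 7: 100/2 = %.1f, 128/1 = %.1f\n",
           AddedCycles(misses100_2, 7), AddedCycles(misses128_1, 7));
    return 0;   /* 100/2 stays ahead of 128/1 under both penalties */
}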
Latches for RMatch and VMatch
VMatch must be latched for the following reason. Consider the case of PRead-DataMissAdrsHit: here, when trying to do the page match, the CAM bit lines are driven during PhA and any matching real page is read out of the real CAM in PhB. The question is when CAMSelect should be driven by VMatch. Clearly this cannot be done during PhA, because the real and virtual CAMs share the select line, and driving the line during PhA would cause stuff to get written during PhA, which would be BAD! This has three consequences:
 1. CAMSelect is low during nPhB and is driven during PhB;
 2. the CAM bit lines are driven during PhB only if we want to write into the CAM; and
 3. VMatch must be latched during PhA so it can be used during PhB to drive CAMSelect.
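A minimal sketch of the resulting timing discipline, assuming a simple two-phase software model with hypothetical names (the real circuit is a latch, not C code):

#include <stdbool.h>

/* Hypothetical model of the VMatch latch and the shared select line. */
typedef struct {
    bool vMatchLatched;   /* captured during PhA */
    bool camSelect;       /* driven only during PhB */
} MatchPath;

/* PhA: the CAM bit lines are driven and VMatch is evaluated.  We only
   latch the result; CAMSelect stays low so nothing can be written. */
static void PhaseA(MatchPath *p, bool vMatchNow)
{
    p->vMatchLatched = vMatchNow;
    p->camSelect = false;
}

/* PhB: the latched VMatch drives CAMSelect, selecting the matching
   entry so the real CAM can be read, or written if the bit lines are
   driven during PhB. */
static void PhaseB(MatchPath *p)
{
    p->camSelect = p->vMatchLatched;
}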
Array Value Caching
A value that is read out of the cache array at time t and used later at some time t+x may be invalid at the later time, because operations on the DynaBus could have modified the value between t and t+x. The only way we can be sure that the values at t and t+x are the same is if we were master (had grant). Now, for a number of cases it's better (for performance) to read the value out of the array at the same time we request the bus; this way we can start whatever operation we intend to do on the DynaBus in the very next cycle after getting grant. The problem, of course, is that the value we have in hand may have been updated in the meantime.
The simple fix for this is to always wait for grant before reading out the value, giving up the performance advantage.
Since this involves a performance hit for the most common case, here is the fix that will probably be used. Read out the value as early as possible and put it in a holding register. This register also has a state bit called "valid", which is set when the value is read out. The DynaBus controller then clears the valid bit whenever a "dangerous" operation appears on the DynaBus. The instant we get grant we check the valid bit: if it's set there is no problem; if it's not, we read out the value once more. This way we pay the penalty only when necessary. The simplest definition of a "dangerous" operation is any DynaBus operation. This may be too simplistic, however: the most frequent operation is RQ, and it is innocuous, so it will pay to filter out RQs.
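A minimal sketch of the holding-register scheme, with hypothetical names for the register fields and operations:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holding register for a value read out ahead of grant. */
typedef struct {
    uint32_t value;
    bool     valid;
} HoldingReg;

static void ReadOutEarly(HoldingReg *h, uint32_t arrayValue)
{
    h->value = arrayValue;
    h->valid = true;
}

/* Called by the DynaBus controller for every operation it observes.
   The simplest policy treats every operation as dangerous; filtering
   out RQ is the obvious refinement, since RQ is both the most frequent
   operation and an innocuous one. */
static void ObserveDynaBusOp(HoldingReg *h, bool opIsRQ)
{
    if (!opIsRQ)
        h->valid = false;
}

/* Called the instant grant arrives: reuse the held value if it is still
   valid, otherwise read the array again and pay the penalty. */
static uint32_t ValueAtGrant(HoldingReg *h, uint32_t (*readArray)(void))
{
    if (!h->valid)
        ReadOutEarly(h, readArray());
    return h->value;
}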
Bypassing
The general idea behind bypassing is to complete a processor reference as quickly as possible and let the processor go, rather than making it wait for the array to finish doing its thing. Since writes are difficult to handle (and are not that frequent anyway), it was decided not to be very clever about bypassing writes.
The computation for bypassing can be conveniently broken into two parts. The first is to figure out whether the VAdrs from the processor corresponds to the quad being processed. The second is to figure out whether the word requested can be delivered at this instant. For the reference that causes a miss, the first part of the computation is unnecessary since the answer is yes, but the second part is still needed. For references that immediately follow the missing reference we must do both calculations. Now, it is possible to do the first part of the calculation (quad matching) using the array itself if we agree to write the CAM entry at the same time the first word is being written. If we do this, we can provide an extra vertical array match line, VictimMatch, derived by ANDing VMatch with Victim for each line and then wire-ORing the results onto VictimMatch. VictimMatch simply tells us whether the match done in this cycle hit the entry that we're writing into. The problem with this is that it forces us to clear VPValid when an abort happens (the plan earlier was to write the CAM along with the very last word of the ReadQuad, in which case a restart wasn't necessary).
There was an alternative proposal for doing the first part of the computation: have the IFU tell us whether the current VAdrs is the previous VAdrs+1. This proposal wasn't exactly a hit with the IFU crew. They also pointed out a problem with the plan: none of the other processors we might want to attach to the cache is likely to generate such a wire, and building a piece of discrete hardware to do this function would be a pain. So I've decided to bite the bullet and put the damn thing in.
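As a rough illustration of the adopted scheme (hypothetical names and data layout; the real VictimMatch is a wire-ORed line in the array, and the word-availability test assumes the quad arrives in order starting at word 0), the two parts of the bypass decision look roughly like this:

#include <stdbool.h>

#define NUM_LINES 128   /* hypothetical; one entry per line in a 128/1 organization */

/* First part: did this cycle's virtual match hit the entry currently
   being filled?  VictimMatch is the OR over all lines of
   (VMatch AND Victim). */
static bool VictimMatch(const bool vMatch[NUM_LINES],
                        const bool victim[NUM_LINES])
{
    for (int i = 0; i < NUM_LINES; i++)
        if (vMatch[i] && victim[i])
            return true;
    return false;
}

/* Second part: can the requested word be delivered this instant?
   wordsArrived is the number of words of the quad written so far. */
static bool WordAvailable(int wordsArrived, int wordInQuad)
{
    return wordInQuad < wordsArrived;
}

/* Bypass succeeds only if both parts say yes. */
static bool CanBypass(const bool vMatch[NUM_LINES], const bool victim[NUM_LINES],
                      int wordsArrived, int wordInQuad)
{
    return VictimMatch(vMatch, victim) && WordAvailable(wordsArrived, wordInQuad);
}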
Protection Checking
Per-page protection is implemented by SC. SC keeps three bits per entry:
KWED = KernelWriteEnable'Dirty
UWED = UserWriteEnable'Dirty
URE = UserReadEnable
If the cache is to implement user read and write protection, then three bits are the minimum. Two bits (one for user read, one for user write) are not enough, because SC also needs to signal when a page is first written into after it has been brought into memory. Since we need at least three bits anyway, we encode them so as to get the most checking.
The three bits are kept in the VCam and are matched along with the virtual address. The state of the bit lines for the various PBus operations is as follows:
 Mode R/W      KWED             UWED             URE
 Kernel Read   always match     always match     always match
 Kernel Write  match only if 1  always match     always match
 User Read     always match     always match     match only if 1
 User Write    always match     match only if 1  always match
There is a single match line that connects the virtual address cells and the protection bit cells. This means that a mismatch could happen either because the entry is missing or because there has been a protection violation. These cases are discriminated when the RP and protection bits are read out following partial virtual match.
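A minimal sketch of the matching rule implied by the table above (hypothetical names; the real check happens in the CAM cells themselves, not in separate logic):

#include <stdbool.h>

/* Per-entry protection bits as kept in the VCam. */
typedef struct {
    bool kwed;   /* KWED = KernelWriteEnable'Dirty */
    bool uwed;   /* UWED = UserWriteEnable'Dirty */
    bool ure;    /* URE  = UserReadEnable */
} ProtBits;

typedef enum { KernelRead, KernelWrite, UserRead, UserWrite } PBusOp;

/* True if the protection bits allow the entry to match for this
   operation, per the bit-line states tabulated above.  A mismatch here
   looks the same on the match line as a missing entry; the two cases
   are told apart only when the RP and protection bits are read out
   after the partial virtual match. */
static bool ProtMatch(ProtBits p, PBusOp op)
{
    switch (op) {
    case KernelRead:  return true;     /* all three bits always match */
    case KernelWrite: return p.kwed;   /* KWED must be 1 */
    case UserRead:    return p.ure;    /* URE must be 1 */
    case UserWrite:   return p.uwed;   /* UWED must be 1 */
    }
    return false;
}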
Unresolved Issues
1. Should SCs have address filters on the low bits of the virtual address, to permit several SCs to be put in parallel? A simple way to implement this would be to put the filter between the PBus pads and the array, and have the cache see a NoOp when there is no match.
5. Should we add counters to the cache to record significant numbers? The counters would be accessible via the Fast IO space, just like the TRIT registers for flush and launder.
6. A good way to reduce the number of first writes from SC is to include the master bit in victim selection. This will tend to keep an entry that has been written into in the cache somewhat longer, and thereby amortize the cost of the first write. However, care is needed in how this is implemented, since it could easily backfire if it tends to keep infrequently referenced entries in the cache.
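Purely as an illustration of this unresolved suggestion (the names and the specific policy are hypothetical), victim selection might prefer entries whose master bit is clear, falling back to the default victim only when every candidate has been written into:

#include <stdbool.h>

#define NUM_ENTRIES 128   /* hypothetical */

/* Hypothetical victim selection biased by the master bit: prefer an
   entry that has not been written into, so that written entries stay
   resident longer and the cost of the first write is amortized.  A real
   policy would also have to avoid pinning rarely referenced entries
   forever, which is the backfire risk noted above. */
static int ChooseVictim(const bool master[NUM_ENTRIES], int defaultVictim)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        int candidate = (defaultVictim + i) % NUM_ENTRIES;
        if (!master[candidate])
            return candidate;
    }
    return defaultVictim;   /* every entry is master; take the default */
}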