ProcessorCacheSpecs.tioga
Last Edited by: Barth, June 17, 1985 2:59:41 pm PDT
DRAGON PROCESSOR CACHE SPECIFICATIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
The Dragon Processor Cache
Description and Specifications
Release as [Indigo]<Dragon>Documentation>ProcessorCacheSpecs.press

© Copyright 1984 Xerox Corporation. All rights reserved.
Abstract: This memo describes the Dragon processor cache. It is intended to be used both as a convenient source for information about the processor cache and as a reference manual for processor cache specifications.
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
1. Introduction
2. Architectural Considerations
3. Pin Description
4. Functional Description
1. Introduction
The primary purpose of the processor cache is to reduce the average memory latency seen by the execution unit (EU) and instruction fetch unit (IFU) of a Dragon processor. It also reduces the amount of bandwidth each unit requires from the bus which connects the processor cache to the next level of the memory system, allowing a single bus to service multiple EU and IFU processor caches.
In addition to servicing virtual address references, the processor cache can service real address references issued by devices attached to the M bus. This is the basis for the cache consistency mechanism.
Since the processor cache contains both the real and virtual addresses of cached data, it also performs address translation.
2. Architectural Considerations
2.1 Processor Cache Size
Simulations have shown that the amount of data which can be put into a processor cache is insufficient to support more than two processors on a single M bus. One way to get more data is to use as many processor cache chips per processor as are required. This is unacceptable because it requires too much board area per processor and also increases the number of M bus loads to the point where bus timing requirements are difficult to meet. A more workable solution is a two-level cache scheme. The first-level cache reduces the average latency sufficiently that each processor's performance is not impacted severely. The second-level cache reduces the amount of bandwidth required by each processor so that a single M bus may support approximately 10 processors.
Quite a variety of two-level schemes is possible. The scheme described here is a compromise among implementation complexity, compatibility with previous implementations, and the performance advantages we are after.
2.2 A Two Level Cache Architecture
The two-level cache scheme consists of three components: the processor cache, the bus cache, and approximately 64KB of local memory. The processor cache and bus cache are both custom chips, while the local memory consists of off-the-shelf memory chips.
The processor caches connected to a processor sit on an internal bus called the MI bus (for M bus, internal), whose protocol is a strict superset of the M bus protocol. The processor cache front end connects to the processor and traffics in reads and writes with virtual addresses; its back end connects to the MI bus and traffics in real addresses using the usual M bus transactions. Processor caches are designed so that it is also possible to plug them directly into the M bus and have them work (indeed this is essential, since it allows Dragons to be brought up without putting the bus cache in the critical path).
The front end of a bus cache is connected to an MI bus, and its back end to the main M bus. The local memory either connects to the bus cache from the side or connects directly to the MI bus, depending on timing and bus capacitance issues. There is only one bus cache per MI bus, and the arbiter for the MI bus sits inside the bus cache.
2.3 Performance Estimates
This section gives latency and bandwidth estimates based on Dragoman simulations. We believe that the cache models are correct and that Dragoman executes PrincOps correctly. I do not know of any verification ensuring that the reference stream which connects the Dragoman emulator to the cache model is correct.
Latency: wait cycles / cycle = (IFU refs / cycle * M refs / IFU ref * IFU cache wait cycles / M ref) + (EU refs / cycle * M refs / EU ref * EU cache wait cycles / M ref). I believe these terms are additive since the M bus serializes EU and IFU misses.
M bus bandwidth: M cycles / cycle = (IFU refs / cycle * M refs / IFU ref * IFU cache M cycles / M ref) + (EU refs / cycle * M refs / EU ref * EU cache M cycles / M ref).
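As an illustration of how the terms combine, the sketch below evaluates both estimates; every numeric input is a hypothetical placeholder, not a Dragoman measurement.

    # Sketch of the latency and bandwidth estimates above.  All of the
    # numeric inputs are hypothetical placeholders, not simulation results.
    ifu_ref_per_cycle  = 0.8    # IFU references per processor cycle (assumed)
    eu_ref_per_cycle   = 0.3    # EU references per processor cycle (assumed)
    ifu_m_per_ref      = 0.05   # M refs per IFU ref, i.e. IFU cache miss rate (assumed)
    eu_m_per_ref       = 0.15   # M refs per EU ref, i.e. EU cache miss rate (assumed)
    ifu_wait_per_m_ref = 6.0    # IFU cache wait cycles per M ref (assumed)
    eu_wait_per_m_ref  = 6.0    # EU cache wait cycles per M ref (assumed)
    ifu_mcyc_per_m_ref = 5.0    # M bus cycles per IFU-caused M ref (assumed)
    eu_mcyc_per_m_ref  = 5.0    # M bus cycles per EU-caused M ref (assumed)

    # Latency: the terms add because the M bus serializes EU and IFU misses.
    wait_cycles_per_cycle = (ifu_ref_per_cycle * ifu_m_per_ref * ifu_wait_per_m_ref
                             + eu_ref_per_cycle * eu_m_per_ref * eu_wait_per_m_ref)

    # M bus bandwidth: M cycles consumed per processor cycle.
    m_cycles_per_cycle = (ifu_ref_per_cycle * ifu_m_per_ref * ifu_mcyc_per_m_ref
                          + eu_ref_per_cycle * eu_m_per_ref * eu_mcyc_per_m_ref)

    print(wait_cycles_per_cycle, m_cycles_per_cycle)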
Effective processor power:
2.4 Design Decisions
Transaction types
There are two types of transactions on the MI bus: those for which the processor cache knows it will need to get the M bus (WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD) and those for which it cannot be sure that the M bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions the processor cache always gets the M bus before getting the MI bus. For plocal transactions it asks for the MI bus alone.
The motivation for this distinction is that only plocal transactions need to be made restartable in order to avoid deadlock over bus allocation. Global transactions need not be restarted since the M bus is always acquired first.
MI request wires
Each processor cache has two request wires running to the arbiter inside the bus cache: MnRqst and MnGRqst (for global request). For plocal transactions a processor cache asserts MnRqst alone; for global transactions it asserts both MnRqst and MnGRqst. An MnGRqst request from a processor cache flows through the bus cache and comes out on the M bus as MnRqst in one phase.
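A minimal sketch of the request rule implied by the last two paragraphs; the transaction mnemonics are from this memo, while the set and function names are invented for illustration.

    # Sketch of the global/plocal request rule.
    GLOBAL_TRANSACTIONS = {"WS", "IOR", "IOW", "IORF", "IOWF", "RMSR", "RMSRSD"}
    PLOCAL_TRANSACTIONS = {"RQ", "WQ"}

    def request_wires(transaction):
        # Global transactions acquire the M bus before the MI bus and therefore
        # never need to be restarted; plocal transactions ask for MI alone and
        # must remain restartable in case the M bus turns out to be needed.
        if transaction in GLOBAL_TRANSACTIONS:
            return ("MnRqst", "MnGRqst")   # assert both request wires
        if transaction in PLOCAL_TRANSACTIONS:
            return ("MnRqst",)             # assert MnRqst alone
        raise ValueError("unknown MI transaction: " + transaction)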
Bus priority
Within a bus cache, requests from the M bus are given priority over requests from the MI bus. That is, if an MI bus transaction has started and an M bus transaction arrives within a bus cache then the MI bus transaction is forced to restart.
MI abort mechanism
Restarts on MI are signalled by "grant yanking": that is, the arbiter in the bus cache deasserts MnGrant prematurely to signal that the current transaction must be restarted. The semantics of restart for the plocal transactions are given below.
RQ restart
For a processor read we can let the processor go ahead as soon as the requested word is fetched by the processor cache. For a processor write we must hold up the processor until the RQ is complete. The cost of this should be small, since the number of writes that cause RQ's should be small. A data miss happens around 15% of the time, and a write reference happens about once every 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
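The arithmetic, transcribed directly from the numbers above:

    # Added cost of holding the processor on writes that cause RQ's.
    data_miss_rate       = 0.15      # fraction of data references that miss
    write_refs_per_cycle = 1.0 / 21  # one write reference about every 21 cycles
    added_cycles         = 3         # extra wait cycles per write that misses

    added_cost = added_cycles * data_miss_rate * write_refs_per_cycle
    print(added_cost)                # about 0.021, i.e. roughly a 2% slowdown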
If the fetched quad goes into an existing line then quadValid should not be set until after commit. If it goes into a victim then vpValid, rpValid, and quadValid should not be set until after commit.
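A sketch of this commit rule, using the bit names above; the CacheLine class and helper function are purely illustrative.

    class CacheLine:
        def __init__(self):
            self.vpValid = False     # virtual page tag valid
            self.rpValid = False     # real page tag valid
            self.quadValid = False   # quad data valid

    def commit_rq(line, into_victim, committed):
        # Validity bits are set only after the RQ commits, so a restart
        # (grant yank) before commit leaves the line's old state intact.
        if not committed:
            return
        if into_victim:
            line.vpValid = True
            line.rpValid = True
        line.quadValid = True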
WQ restart
Note that, unlike RQ, the processor does not have to wait any longer than it would under a single-level cache scheme. To see this, there are two cases to consider. For WQ's caused by a flush or launder, the processor is not involved at all, so the assertion is trivially true. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway, since the WQ is part of a WQ, RQ sequence.
The processor cache must not clear master until after commit.
Bus cache associators
To keep the bus cache as simple as possible, make it fully associative. Back of the envelope calculations show that we can get over 600 associators. With 16K words in the local memory this means that each associator has control over 4 or 8 quads. Note that it is a good idea to use a power of two associators; otherwise space in the local memory will be wasted.
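The sizing arithmetic, assuming 4-word quads (an assumption of this sketch, not something stated here); the two power-of-two counts shown bracket the 600-associator estimate.

    local_memory_words = 16 * 1024                 # 16K words of local memory
    words_per_quad     = 4                         # assumed quad size
    quads = local_memory_words // words_per_quad   # 4096 quads to cover

    # A power-of-two associator count divides the local memory evenly;
    # any other count strands part of it.
    for associators in (512, 1024):
        print(associators, "associators ->", quads // associators, "quads each")
    # 512  associators -> 8 quads each
    # 1024 associators -> 4 quads each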
M to MI delay
There will need to be a one cycle delay between M and MI in the bus cache to provide enough time to acquire MI when a request comes in from M. This can be seen by considering a WS coming in from M: WS on M; match; yank grant; WS on MI.
The extra cycle is needed since MI becomes free only in the cycle following the yank. To keep things simple we should always insert the extra cycle regardless of whether MI is busy or not.
Bus cache control bits
For each quad, a bus cache keeps the following control bits:
master: when TRUE the bus cache or one of the processor caches connected to it has the current data for the quad.
shared: when TRUE there may be a copy of this quad in other bus caches; when FALSE no other bus cache contains a copy.
inPC: when TRUE there may be a copy of this quad in some processor cache connected to this bus cache's MI bus; when FALSE none of the processor caches on this MI bus contains this quad. This bit is needed to prevent RQ's, WQ's and WS's on the M bus from being broadcast on the MI bus when not absolutely necessary. Note that like shared, this bit is a hint.
Processor cache shared
The transition of a quad in some processor cache from not shared to shared is tricky. The way it works is as follows: when a bus cache hands a quad to a processor cache in response to an RQ, the processor cache sets the shared bit for this quad. Now the first time the processor writes into this quad, the processor cache puts out a WS. This WS will go out on the M bus (this is not necessary but doesn't hurt; it is done to avoid having WS's that must be restartable). In the meantime, the bus cache asserts the shared line on MI if the shared bit for this quad in the bus cache is set. If none of the other processor caches on MI has a copy of this quad, the processor cache that does have a copy will clear its shared bit. Also, the bus cache must clear the master bit for the quad to signal that it may no longer have the current copy from this point on (since the processor cache can now write into the quad without telling the bus cache).
If the processor cache did not set shared initially, there would be no way for the bus cache to determine whether an RQ request on the M bus must be satisfied from a processor cache, and so it would have to assume that it must read the data from MI before satisfying the request on M.
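The handshake can be sketched as follows; only the bit transitions follow the text above, and the record and function names are illustrative.

    class Bits:
        # Minimal record holding the control bits used below.
        def __init__(self):
            self.master = False
            self.shared = False

    def fill_from_bus_cache(pc_line):
        # A quad handed to a processor cache in response to an RQ starts out
        # marked shared.
        pc_line.shared = True

    def first_write(pc_line, mi_shared_line, bus_cache_quad):
        # The first write into the quad goes out as a WS; mi_shared_line is
        # the state of the MI shared line observed during that WS.
        pc_line.master = True
        if not mi_shared_line:
            # Nobody else claimed a copy, so later writes can proceed silently.
            pc_line.shared = False
        # The bus cache may no longer hold the current data for this quad.
        bus_cache_quad.master = False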
Bus cache victim selection
Victim selection in a bus cache will be interesting, given that every quad in a processor cache must also be in a bus cache. Consider the case where all of the quads in a bus cache have inPC set; this could happen when a large amount of data is being read sequentially by the processor. Clearly, before the bus cache can kick out a quad, it has to check whether the quad is in a processor cache or not. It would be much better if the bus cache victimized only those quads that do not exist in any processor cache. One way is for the bus cache to do an RQ on MI every time it tries to victimize a quad with inPC set. If someone pulls shared, the bus cache knows not to kick this quad out. Note that the RQ on MI can be aborted as soon as the bus cache has its answer, so it need not take up too much bandwidth on MI.
It looks like having most inPC bits set is the common case! Consider read-only data and code, for example. Every time such a quad gets into a processor cache its inPC bit gets set; this bit won't get cleared when the host processor cache victimizes the quad, since the quad will be clean.
Now, this suggests a solution to our dilemma. Suppose that every time a processor cache did an RQ, it put the address of the victim quad into one of the two idle cycles of the RQ. All of the processor caches on MI would match on the address of the target of the RQ and then on the address of the victim. If any of them matches on the victim address it pulls down a new dedicated line called "MnInPC". The bus cache would watch this line and set the inPC bit for the quad being victimized by the RQ according to the state of MnInPC. Note that with this addition the inPC bit is no longer a "hint". One might think that it would still be a hint because of other ways to eliminate quads from caches, e.g. flush, but one would be wrong: flush removes the quad from all bus caches as well as all processor caches.
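A sketch of that mechanism, modeling each processor cache as a set of quad addresses and the bus cache inPC bits as a dictionary (both deliberate simplifications):

    def rq_with_victim_address(processor_caches, victim_addr, bus_cache_inPC):
        # Every processor cache on MI matches on the victim address as well
        # as the RQ target; any hit pulls the proposed MnInPC line.
        mn_in_pc = any(victim_addr in pc for pc in processor_caches)
        # The bus cache watches MnInPC, so its inPC bit for the victimized
        # quad is exact rather than a hint.
        bus_cache_inPC[victim_addr] = mn_in_pc
        return mn_in_pc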
Multiple processors per MI
It appears to be a bad idea to put more than one processor on an MI bus. The reason is that the latency seen by each processor depends quite strongly on the load on MI (since the miss rate from the processor caches is quite high). We would therefore like the MI buses to be as lightly loaded as possible, which of course means one processor per MI.
2.5 Consistency Invariants
M bus
> If a quad is present in more than one bus cache all of its copies are identical.
> If a quad is present in more than one bus cache all copies have shared set.
> At most one bus cache copy of a quad may have master set.
MI bus
> If a quad is present in more than one processor cache all of its copies are identical.
> If a quad is present in more than one processor cache all copies have shared set.
> At most one processor cache copy of a quad may have master set.
M bus/MI bus
> Every quad in a processor cache is also present in the corresponding bus cache.
> If a quad is in a processor cache then its bus cache copy must have inPC set.
> If a quad in a bus cache has shared set then any processor cache copies must have shared set.
> If a quad in a bus cache has shared set then any processor cache copies must be identical.
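These invariants can be stated as a check over a toy model in which each cache is a dictionary from quad address to a record of data and control bits; the model and field names are illustrative, not descriptions of chip structures.

    def check_invariants(bus_caches, processor_caches_of):
        # bus_caches: list of dicts addr -> {"data", "shared", "master", "inPC"}
        # processor_caches_of[i]: list of dicts addr -> {"data", "shared", "master"}
        # for the processor caches on bus cache i's MI bus.

        # M bus invariants.
        for addr in {a for bc in bus_caches for a in bc}:
            copies = [bc[addr] for bc in bus_caches if addr in bc]
            if len(copies) > 1:
                assert len({c["data"] for c in copies}) == 1   # identical copies
                assert all(c["shared"] for c in copies)        # all marked shared
            assert sum(c["master"] for c in copies) <= 1       # at most one master

        for i, bc in enumerate(bus_caches):
            pcs = processor_caches_of[i]

            # MI bus invariants.
            for addr in {a for pc in pcs for a in pc}:
                copies = [pc[addr] for pc in pcs if addr in pc]
                if len(copies) > 1:
                    assert len({c["data"] for c in copies}) == 1
                    assert all(c["shared"] for c in copies)
                assert sum(c["master"] for c in copies) <= 1

            # M bus / MI bus invariants.
            for pc in pcs:
                for addr, line in pc.items():
                    assert addr in bc                            # quad also in the bus cache
                    assert bc[addr]["inPC"]                      # with inPC set
                    if bc[addr]["shared"]:
                        assert line["shared"]                    # shared propagates down
                        assert line["data"] == bc[addr]["data"]  # and copies agree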
2.6 Control Bit Transition Rules
Bus cache control bits
The transition rules for the control bits in a bus cache are as follows:
master:  set on WS from MI
  clear on WS from M
inPC:  set on RQ from MI
  clear on {WS, RQ} to MI for which the shared line on MI is deasserted
  (assuming the victim address is issued on the RQ: clear on RQ according to the state of the MnInPC line on MI)
shared: set on {WS, RQ} on M for which the shared line on M is asserted
  clear on {WS, RQ} on M for which the shared line on M is deasserted
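The same rules can be written as a handful of update functions; the Quad record and function names below are illustrative.

    class Quad:
        # Bus cache per-quad control bits.
        def __init__(self):
            self.master = False
            self.shared = False
            self.inPC = False

    def ws_from_mi(quad):
        quad.master = True             # set on WS from MI

    def ws_from_m(quad):
        quad.master = False            # clear on WS from M

    def rq_from_mi(quad, victim=None, mn_in_pc=False):
        quad.inPC = True               # the requesting processor cache takes a copy
        if victim is not None:         # victim-address variant (see section 2.4)
            victim.inPC = mn_in_pc

    def ws_or_rq_to_mi(quad, mi_shared_line):
        if not mi_shared_line:         # no processor cache answered on MI
            quad.inPC = False

    def ws_or_rq_on_m(quad, m_shared_line):
        quad.shared = m_shared_line    # follow the shared line on M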
Processor cache control bits
The transition rules for the control bits in a processor cache are as follows:
master: set on write from processor
  clear on WS from MI
shared: set on {WS, RQ} on MI for which the shared line on MI is asserted
  clear on {WS, RQ} on MI for which the shared line on MI is deasserted
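The processor cache rules in the same style (again illustrative):

    class Line:
        # Processor cache per-quad control bits.
        def __init__(self):
            self.master = False
            self.shared = False

    def processor_write(line):
        line.master = True             # set on a write from the processor

    def ws_from_mi(line):
        line.master = False            # clear on WS from MI

    def ws_or_rq_on_mi(line, mi_shared_line):
        line.shared = mi_shared_line   # follow the shared line on MI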
2.7 Statistics Needed
We need to get the following two statistics:
(a) How often do processor writes result in processor cache misses? This is needed to verify that the three cycles of latency added when a write misses in a processor cache are not a performance problem.
(b) How often does a processor write into a clean quad in a processor cache? This is needed to verify that the spurious WS generated by a processor cache in communicating to the bus cache that a quad has been written into is not a performance problem.
2.8 Things To Think About
line size
number of associators
flushing and laundering, filter on display addresses?, programmed filter? command programmed? 32 bit filter, base and bounds, IOWrite base, IOWrite bounds, IOWrite start
i/o commands
atomicity
dual vs. single port RAM
foreign processor support, save pins to indicate bus type so that 68020 and 80x86 compatible bus interfaces can be put into the cache.
local arbitration
multiple cache P buses with and without multiple M buses, how many ways of interleaving are enough?
How is the processor number initialized during reset so that minimal systems can be booted without the baseboard?
Should there be programmable page size? Over what range? Could implement it by running enable and enable' lines down through the low order bits of page so that the cells can drive the page or block match line low.
Why not a standard component implementation?
when the M controller sees its own command it must return a bus fault to the processor and release the bus.
how can we have two virtual addresses for a single real address at the same time in one cache?
breakpoint registers.
what other actions cause quads to be removed from caches besides victimization? A flush causes dirty quads to be written back first but the clean quads merely have their valid bits reset. But what order does this occur in? While the flush is in progress is it possible for a quad to be in a processor cache but not in a bus cache? What prevents this?
there is a specific space id which must be sent during ReadMapSetRef or ReadMapSetRefSetDirty. Write it down here.
have reschedule be a line which comes out of the processor cache and is invoked by an I/O operation on the M bus.
3. Pin Description
Most of the cache pins are devoted to the M, P and D buses. The corresponding specification should be referenced for such pin descriptions. The M bus specification can be found in [Indigo]<Workstation>MbusSpec>MbusSpec.press and MbusFig.press. The P bus specification is in [Indigo]<Dragon>Documentation>PBus>PBusSpecs.press. The D bus specification is in [Indigo]<Dragon>Documentation>DBus>DBusSpecs.press.
Cache Specific Pins
BusParitySelect
This pin controls a multiplexor that selects PParity as the source for store parity when the pin is high and selects internal parity computation logic when the pin is low.
WriteEnable[0..4)
These pins control the byte write enables within the cache. They allow the P bus to be extended to byte level addressing.
AddressMask[0..4)
These pins initialize the low order page address mask bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.
AddressMatch[0..4)
These pins initialize the low order page address match bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.
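As an illustration of how these pins might partition the P bus address space among several caches, consider the sketch below; the exact matching rule is an assumption of the sketch, not something stated in this pin description.

    def cache_responds(page_address, address_mask, address_match):
        # address_mask and address_match are the 4-bit values latched from
        # the AddressMask[0..4) and AddressMatch[0..4) pins at reset.  A
        # cache responds only when its masked low-order page bits equal its
        # match bits.
        low_bits = page_address & 0xF
        return (low_bits & address_mask) == (address_match & address_mask)

    # Example: a two-way interleave on the lowest page address bit.
    # cache A: address_mask = 0b0001, address_match = 0b0000
    # cache B: address_mask = 0b0001, address_match = 0b0001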
Package Pin Assignment
This section describes the reasoning behind the cache pin assignment shown in Figure 2. The initial assumption is that the control logic is all at the top of the array and that the data buses run across the top of the chip and then narrow down the two sides, the M bus on the right and the P bus on the left, meeting in the middle of the bottom of the chip. The initial pin count constraints are as follows:
Power   12
Clocks   4
P    42
M    43
D    6
Reset   1
Byte Enables  4
Parity Select  1
Address Division 8
Daisy chain  2
This requires a total of 123 pins out of the 144 available. The power pins are the obvious place to use some of the remaining 21 pins. One would like to minimize the internal wiring length from the closest power pads to any point on the chip. The added power pins should also be chosen from those with the lowest inductance in the package. The initial power pin count was chosen simply to use up all the low inductance pins in the package. They were then arbitrarily divided up as follows:
PadVdd   4
PadGnd   4
LogicVdd   2
LogicGnd   2
The pad power pins were assigned to the lowest inductance pins. The logic power pins were assigned to the next lowest inductance pins. The arrays in the logic are likely to use quite a bit of current during their precharge and discharge cycles. Let's add 2 pins to each of the logic supplies, leaving 17 pins. The M bus has between 5 and 20 times as much capacitance as the P bus but it runs at half the rate, so it should get between 2 and 10 times more pins. Arbitrarily choosing 2 as the multiplier and 8 pins for the M bus means 4 additional pins for the P bus, leaving 5 pins uncommitted. They will be left uncommitted so that if any extra functionality is required in the controller there are spare pins to implement it; e.g. a mechanism for determining when a cache flush or launder process is finished may require a pin or so. The M bus has about 39 pins that all swing at the same time, so 39/8 means a power pin about every 5 signal pins, i.e. a PadVdd/PadGnd pair about every 10 pins. They should also be constrained to be on the inner row of pins in the package. For the M bus this means pins 60, 66, 69, 77, 81, 85, 96 and 102 are the obvious choices. The P bus has 37 pins, so every 9 pins should have a power or ground. The possible pins in the range of 20 through 53 are 24, 30, 41, 45, and 49. Eliminating 49 leaves 4 pins.
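The pin budget above can be tallied directly; all of the numbers are copied from the text.

    package_pins = 144
    base = {"Power": 12, "Clocks": 4, "P": 42, "M": 43, "D": 6, "Reset": 1,
            "Byte Enables": 4, "Parity Select": 1, "Address Division": 8,
            "Daisy chain": 2}
    used = sum(base.values())          # 123
    spare = package_pins - used        # 21

    extra_logic_power = 4              # 2 more pins each for LogicVdd and LogicGnd
    extra_m_power     = 8              # PadVdd/PadGnd pairs along the M bus
    extra_p_power     = 4              # PadVdd/PadGnd along the P bus
    uncommitted = spare - extra_logic_power - extra_m_power - extra_p_power

    print(used, spare, uncommitted)    # 123 21 5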
To Do
A little more engineering needs to be applied to the pin assignment after the capacitance and path delay of the M bus and P bus are determined more tightly and after the peak current requirements of the internal logic of the cache are better known.
4. Functional Description
This section describes the internal state of the cache and describes how the buses interact with each other and the state.
[Figure: CacheBlockDiagram.press]
[Figure: CachePinOut.press]
[Figure: CacheTiming.press]