[_CD4_]<dragon7.0>SmallCache>OldStuff>OldCacheSpecs.tioga!1

CacheSpecs.tioga

Jean Vuillemin September 20, 1985 5:17:18 pm PDT

Pradeep Sindhu February 12, 1986 11:48:28 pm PST

DRAGON CACHE SPECIFICATIONS : Barth, Sindhu, Vuillemin.

DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY

The Dragon Cache

Description and Specifications

Release as [Indigo]<Dragon>Documentation>CacheSpecs.press

Abstract: This memo describes Dragon's Hierarchical Cache Memory. It is intended to be used both as a convenient source for information about the caches and as a reference document for both caches specifications.

Contents:

1. Shared Memory Multiprocessors

1.1 Abstract Model for Serial Shared Memory.

1.2 Dragon's Hierarchical Cached Memory.

1.3 Consistency Algorithms.

2. Architectural Considerations

3. Pin Description

4. Functional Description

XEROX Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304

For Internal Xerox Use Only

1. Shared Memory Multiprocessors

1.1 Abstract Model for Serial Shared Memory.

A shared memory multiprocessor has ke1 processors, and a common memory system. Each processor is connected to memory through a private line, used to communicate commands issued by the processor to memory. Commands are Fetch(A) and Store(A,D) where A and D stand for address and data. Memory responds to each command by providing a word of data. For Fetch(A), memory returns the data value associated to address A. For Store(A,D), it proves convenient to assume that memory returns the data value associated to A prior to actually performing the Store, i.e. the same as Fetch(A). In the case of a unique processor (k=1), the semantics of Fetch and Store is straightforward: the value associated to address A is the data argument D of the last Store(A,D) into A; it is an error when there is no such last Store. We define the history of a line connecting processor and memory to be the sequence of commands which appear on that line, together with their associated values.

When memory is shared between many (k>1) processors, we have to resolve access conflicts. What if two processors try to write into the same location at the same time? What if one tries to read a location at the same time the other is wrting into it? A simple way to avoid such problems is to postulate that all processors lines are connected to a global arbiter, which in turn is connected to memory trough a unique shared line (see Figure 1). The purpose of the arbiter is to choose a command on one of the processors line, pass it on to memory through the shared line, and dispatch back memory response to the appropriate processor's line. If we look at the history of all the communication lines, we see that the arbiter is merging the processors histories into a global sequential memory history. All we have to assume about the arbiter is that the sequential order of requests issued by each processor is preserved. The order in which commands from different processors get merged is arbitrary; we may (or may not) require in addition the arbiter to be fair, i.e. each command will eventualy get serviced. Any implementation of shared memory which behaves as if it was controlled by a serial arbiter is called consistent. This is kown in the field of data bases as the serialisability condition. Dragon's hierarchical cached memory will be proved to be consistent.

Although each processor is given sequential access to memory and the history of its own commands is well defined, it is not immediately obvious how to use such a system. Processor1 with history Store1(A,D1);Fetch1(A) may not get D1 as the value of Fetch1(A): Processor2's history could be Store2(A,D2);Fetch2(A) and the arbiter may have merged these into Store1(A,D1);Store2(A,D2);Fetch2(A);Fetch1(A). It is nevertheless possible for processors 1 and 2 to synchronize their activity, because both of their memory systems go through the very same sequence of state transitions. One way to achieve synchronization in this model, due to Dekker, goes as follows, assuming memory locations A and B are initialy set to 0:

Code for processorA:
ReturnA : Store(A,1);
if Fetch(B)=0 then <CriticalSectionForA>;
Store(A,0); goto ReturnA.

Code for processorB:
ReturnB : Store(B,1);
if Fetch(A)=0 then <CriticalSectionForB>;
Store(B,0); goto ReturnB.

1.2 Dragon's Hierarchical Cached Memory.

A consistent shared memory cannot be directly implemented from the abstract model for efficiency reasons: large memories are slow, and a common access bus is saturated by a single processor. Dragon solves the problem by associating private caches to each processor. Each cache contains relevant copies of addresses and data from main memory. When the processor issues a Fetch(A), the cache responds with the value D, if it has that data (hit case); otherwise, (miss) the cache fetches D from main memory, and keeps it for further references. Since the cache is smaller than main memory, it has to victimise (delete) an entry on every miss. The global effect is to speed up most Fetch (assuming the cache to respond faster than memory), and to reduce traffic on the shared line to memory. Simulation of various schemes has shown that, for the cache sizes which available technologies allow us to build, a one level cache system will not support much more than 2 processors before the main bus becomes saturated. We therefore have to use a two (or more) level cache system.

Dragon's memory is implemented as a hierarchical cached memory HCM. The root of HCM is main memory MM which is connected (below) to a global communication M-bus. Each Dragon processor is connected (above) to a private processor cache, called SC (for small cache). Each SC is connected (above) to a M-bus, which can be either the main M-bus, or a local internal M-bus, called MI-bus. Each MI-bus is connected above to another type of cache, called BC (for Big Cache). A given BC is connected above to a M-bus (which can be the main M-bus in a 2 levels systems, or a MI-bus in deeper hierarchies). Each M-bus is thus connected above to one memory system, which is MM for the main M-bus, and BC for a local MI-bus; it is convenient to think of the M-bus arbiter to be part of the memory system above it. Each M-bus is connected below to a number of caches, either of the SC or the BC type. An HCM is thus a tree of memory modules, the root being MM, internal nodes BCs, and leaf nodes SCs. Figure 2 presents an example of a possible Dragon configuration.

In such a HCM, the root MM of hierarchy contains all data in the address space of the system. Some data is copied in caches below in the tree, and any address present in a given cache is also present in the memory system above (MM or BC). To insure consistency, 3 bits are associated with each word in all types of memories (MM, SC, BC). They are called Master, Shared and ExistsBelow, and their semantic goes as follows:

Master: this bit is set to one on all copies of address A located on the unique path from MM to the cache of the last processor which performed a Store into A. If Master(A)=1 in a cache, the same must hold in the copy of A located above, and at most one copy of A below can have Master(A)=1. Master defaults to 1 in MM for all addresses.

Shared: this bit is set to one for all cached copies as soon as two copies of the address exist in caches which are not on a path between the root of HCM and a leaf (they are brothers in the tree). The converse does not necesserally hold: a cached copy of A may have Shared(A)=1 although no copy of A exist in any of it's brother caches. Shared defaults to 0 in MM.

ExistsBelow: this bit is set to one in a memory module iff a cached copy of that address exits in a subtree of that module. ExistsBelow is not explicitly stored in MM; it defaults to 0 in SC.

The implementation of Fetch and Store described latter will guarantee that any HCM behaves as a consistent serial shared memory. The simplest scheme for achieving that would be to make Stores into any cache write through all the way to MM, keeping all copies consistent at all times. Dragon uses a more elaborate lazy scheme for handling Stores. Store(A,D) into a word which has Master(A)=1 and Shared(A)=0 is only performed locally in that cache, leaving an out of date copy into memory above. Consequently, the cache which has Master(A)=1 associated with address A is responsible for providing the answer to a Fetch(A) command on the shared memory line. A Store(A,D) which does not have both Master(A)=1 and Shared(A)=0 is handled by writing through. The global effect of that choice is to keep consistent copies of the lastest value of address A in the lowest common subtree containing A. Equivalently, the value of address A in a given memory module may be out of date iff Master(A)=1, Shared(A)=0 and ExistsBelow(A)=1.

Dragon's storage unit is a quad, i.e. 4 words of 32 bits. A cache entry is thus made of 32 bits of address, 4x32 bits of data, and (at most) 3 bits for Shared, Master and ExistsBelow. The following assertions summarize the semantic intent of these bits in a HCM:

(1) Every address in C is also present in the father of C.

(2) The father copy of any address has ExistsBelow set.

(3) The father of a master address is also a master address.

At most one of two sister addresses can be master.

(4) Two sister addresses both have shared set.

The son of a shared address is also a shared address.

Two copies of an address which are marked shared have identical data values.

2. Functional Description

Bus commands, issued by a processor, are {pFetch,pStore} on the pBus below a SC. mBus Fetch is RQ (ReadQuad), and we have two types of Store: the first WQ (WriteQuad) indicates that a cached entry is to be saved in the memory module above, because it is victimised in the cache; the second WS (WriteSingle) indicates a write through, and only shared copies in brother trees need to be updated. Each cache is connected to a bus above and a bus below; it has to issue and monitor commands on both buses. Actions to be performed are described here.

2.1 Generalities.

When the address of a command from below misses, the cache issues a RQ on the bus above. It then restarts the command, as in the hit case. When the address of a command from above misses, the cache ignores that command. From now on, we talk about a command on the assumption that it hits.

A cache monitors address cycles of commands issued by other caches on both buses. It pulls the shared line above on address match, and below when the address is marked shared.

The value of Master is set to 1 on Store from below; it is set to 0 on Store from above.

Victim selection in SC is done on a hex basis, where a hex is 4 quads. This happens on a line miss in SC, which provoques a RQ. The victim address is indicated during the 3rd cycle of the RQ (the second dead cycle), together withe the hex size, so as to make this mechanism general, and also usable during victimisation in BC. It follow that ExistsBelow is set in BC on RQ from below, and it is cleared according to the victim address in the 3rd cycle of a RQ from below, provided that no other cache has pulled the shared line after that cycle. Victim selection in BC is interesting since every quad in the caches below must also be in BC: a good victim for BC is one for which none of the ExistsBelow bits is set.

Before issuing a transaction for which the main MBus is required (WS, IOR, IOW, IORF, IOWF, MR, MRD), a cache first acquires the main mBus. Each cache has two request wires running to the arbiter above: Rq and GRq (for global request). For global transactions it asserts both Rq and GRq. A GRq request from below flows through intermediate caches and comes out on the M-Bus as Rq in one phase.

Unlike RQ and WQ, these other requests do not need to be restartable. A pBus command which asserts hold is handled in the same way, by first acquiring the main mBus.

RQ restart. For a processor read we can let the processor go ahead as soon as the requested word is fetched by SC. For a processor write we must hold up the processor till the RQ is complete. The cost of this should be small since the number of writes that cause RQ's should be small. A data miss happens around 15% of the time, and a write ref happens 1 out of 21 cycles. Three cycles are added for each missing write, so the added cost is 3*0.15/21, or around 2%.

If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit
WQ restart. Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider: WQ's caused by flush or launder; here the processor is not involved at all, so the assertion is true. For WQ's caused by a miss resulting from a processor reference the processor must be forced to wait anyway since the WQ is part of a WQ, RQ sequence. Inside BC don't clear master until after commit

2.2 ReadQuad.

2.21 Issue RQ above.

mBus below: Miss(A)
?A x x
mBus above
←Rqst Gnt ←RQA x x D0 D1 D2 D3

Upon reading a fresh quad, a cache sets Shared to 1, Master and ExistsBelow to 0.

2.22 Monitor RQ above.

When Master, provide the data, using the mBus nMAbort convention. The correct value has to be obtained through a RQ below when ~Shared and ExistsBelow; otherwise the cache uses its internal storage.

mBus above: Master(A) and {Shared(A) or ~ExistsBelow(A)}
RQA x ←nMAb � � � �

mBus above: Master(A) and ~Shared(A) and ExistsBelow(A)
RQA x ←nMAb ←nMdv ←nMdv ←nMdv � � � �
mBus below:
←RQA x x D0 D1 D2 D3

2.23 Monitor RQ below.

The cache just provides the data, assuming the nMAbort line has not been pulled during cycle 3:

RQA x x � � � �

2.3 WriteSingle.

2.31 Issue WS above.

To ensure atomicity of Stores, a cache needs to acquire the main mBus before issuing a WS (see Hold operations).

←*Rqst *Gnt ←WSA 𡤍

2.32 Monitor WS below.

2.33 Monitor WS above.

Modify internal value. If ExistsBelow, pass the WS on the bus below :

mBus above: ExistsBelow(A) {note: it must be the case that Shared(A)}
WSA D
mBus below:
←WSA 𡤍

2.32 Monitor WS below.

To ensure atomicity of Stores, a cache needs to acquire the main mBus before issuing a WS.

2.33 Issue WS above.

The value of a SharedBit gets updated to that of the SharedLine on every WS issued above by that cache

2.23 Issue WS below.

2.4 WriteQuad.

A cache should not monitor WQ above. If it happen to own a copy of that quad, it has already been maintained consistent through WS.

2.41 Monitor WQ below.

2.42 Issue WQ above.

2.6 Hold commands.

2.7 IO Commands.

3. Architectural Considerations

3.1 Small Cache

3.11 Internal Structure

3.12 Pins

Most of the cache pins are devoted to the M, P and D buses. The corresponding specification should be referenced for such pin descriptions.

Cache Specific Pins

MDaisyIn

This pin ???

MDaisyOut

This pin ???

BusParitySelect

This pin controls a multiplexor that selects PParity as the source for store parity when the pin is high and selects internal parity computation logic when the pin is low.

WriteEnable[0..4)

These pins control the byte write enables within the cache. They allow the P bus to be extended to byte level addressing.

AddressMask[0..4)

These pins initialize the low order page address mask bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.

AddressMatch[0..4)

These pins initialize the low order page address match bits during reset. See section 4.??? for a complete description of multiple cache per P bus systems.

Package Pin Assignment

This section describes the reasoning behind the cache pin assignment shown in Figure 2. The initial assumption is that the control logic is all at the top of the array and that the data buses run across the top of the chip and then narrow down the two sides, the M bus on the right and the P bus on the left, meeting in the middle of the bottom of the chip. The initial pin count constraints are as follows:

Power 12

Clocks 4

P 42

M 43

D 6

Reset 1

Byte Enables 4

Parity Select 1

Address Division 8

Daisy chain 2

This requires a total of 123 pins out of the 144 available. The power pins are the obvious place to use some of the remaining 21 pins. One would like to minimize the internal wiring length from the closest power pads to any point on the chip. The added power pins should also be chosen from those with the lowest inductance in the package. The initial power pin count was chosen simply to use up all the low inductance pins in the package. They were then arbitrarily divided up as follows:

PadVdd 4

PadGnd 4

LogicVdd 2

LogicGnd 2

The pad power pins were assigned to the lowest inductance pins. The logic power pins were assigned to the next lowest inductance pins. The arrays in the logic are likely to use quite a bit of current during their precharge and discharge cycles. Lets add 2 pins to each of the logic supplies leaving 17 pins. The M bus has about between 5 and 20 times as much capacitance as the P bus but it runs at half the rate so it should get between 2 and 10 times more pins. Arbitrarily chosing 2 as the multiplier and 8 pins for the M bus means 4 additional pins for the P bus leaving 5 pins uncommited. They will be left uncommited so that if any extra functionality is required in the controller there are spare pins to implement them, e.g. a mechanism for determining when a cache flush or launder process is finished may require a pin or so. The M bus has about 39 pins that all swing at the same time so 39/8 means about every 10 pins there should be a PadVdd/PadGnd pair. They should also be constrained to be on the inner row of pins in the package. For the M bus this means pins 60, 66, 69, 77, 81, 85, 96 and 102 are the obvious choices. The P bus has 37 pins so every 9 pins should have a power or ground. The possible pins in the range of 20 through 53 are 24, 30, 41, 45, and 49. Eliminating 49 leaves 4 pins.

To Do

A little more engineering needs to be applied to the pin assignment after the capacitance and path delay of the M bus and P bus is determined more tightly and after the peak current requirements of the internal logic of the cache are better known.

3.1 Big Cache

The two-level cache consists of three components: the first-level cache SC, the second level cache BC, and 8-16K words of memory Mem; SC and BC are both custom chips, while Mem consists of four or eight 8Kx8 off-the-shelf memory chips.

To keep BC as simple as possible, make it fully associative. Back of the envelope calculations show that we can get over 600 associaters. With 16K words in Mem, this means that each associater has control over 4 or 8 quads. Note that its a good idea to put power of two associaters otherwise space in Mem will be wasted.

3.2 Dragon PC Board

3.21 Board Layout

3.22 Performance Estimates

3.3 Statistics Needed

We need to get the following two statistics:

(a) How often do processor writes result in SC misses? This is needed to verify that the three cycles of latency added when a write misses in SC are not a performance problem.

(b) How often does a processor write into a clean quad in SC? This is needed to verify that the spurious WS generated by SC in communicating to BC that a quad has been written into is not a performance problem.

3.4 Things To Think About

line size

number of associators

flushing and laundering, filter on display addresses?, programmed filter? command programmed? 32 bit filter, base and bounds, IOWrite base, IOWrite bounds, IOWrite start

i/o commands

atomicity

dual vs. single port RAM

foreign processor support, save pins to indicate bus type so that 68020 and 80x86 compatible bus interfaces can be put into the cache.

local arbitration

multiple cache P buses with and without multiple M buses, how many ways of interleaving are enough?

How is the processor number initialized during reset so that minimal systems can be booted without the baseboard?

Should there be programmable page size? Over what range? Could implement it by running enable and enable' lines down through the low order bits of page so that the cells can drive the page or block match line low.

Why not a standard component implementation?

when the m controller sees its own command it must return bus fault to the processor and release the bus.

how can we have two virtual addresses for a single real address at the same time in one cache?