MSDesignDecisions.tioga
Written by: Pradeep Sindhu, November 21, 1985 2:06:45 pm PST
Last Edited by:
Pradeep Sindhu, November 22, 1985 4:52:49 pm PST
DRAGON MEMORY SYSTEM DESIGN DECISIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
Dragon Memory System
Design Decisions
Release as [Indigo]<Dragon>MemorySystem>Documentation>MSDesignDecisions.tioga, .press

© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: This document contains a record of design decisions for the entire Dragon Memory System; where possible it also tries to provide the rationale for each decision. The decisions that appear here are ones that affect at least two of the components comprising the Memory System—those that affect single components are delegated to the documents for those components. However, some information here may be duplicated in the per-component documents.
The document is currently being used by designers to preserve their sanity, and is expected to be used in the future in writing other documentation and/or papers about the Dragon Memory System.
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
Introduction
Buzzword Definition
Transaction Classes
MI Arbitration
Bus Priority
MI Abort Mechanism
RQ Restart
WQ Restart
M to MI Delay
BC Control Bits
Notification of First Write
Victim Selection in BC
Response to Done Command
A Note on WS's
Main M-Bus Load
Multiple Processors per MI
References
Introduction
This document is one in a series that attempts to record design decisions for the Dragon Memory System as they are made, and especially to outline the reasons for each decision. Given the overall complexity of the Memory System, the designers found it necessary to keep such a running record to avoid revisiting old decisions and having to unearth the reasons for them each time an issue became contentious.
The structure of this series is as follows. This document forms the top level, confining itself to decisions that affect at least two of the components comprising the Memory System: the Small Cache, the Big Cache, the Map Cache, and Main Memory. The other documents mostly limit themselves to internal details of these components. Main Memory is a given for Dragon, so there isn't a separate document, but separate documents do exist for the other components.
The section immediately below defines shorthand terms used throughout this document. Each of the following sections covers one design decision.
Buzzword Definition
There is a fair amount of shorthand in this document; this section defines the terms before they are used.
 BC Big Cache
 SC Small Cache
 WS WriteSingle
 RQ ReadQuad
 WQ WriteQuad
 RQS ReadQuadStream
 WQS WriteQuadStream
 IOR IORead
 IOW IOWrite
 IORF IOReadFlow
 IOWF IOWriteFlow
 RMSR ReadMapSetRef
 RMSRSD ReadMapSetRefSetDirty
 MRD Map Read
 MI-Bus Internal M-Bus
 Entry Smallest cache unit that can match independently
 Line Geometric, horizontally repeating unit in cache array
   (may contain 1 or more entries)
Transaction Classes
Transactions on the MI-Bus are divided into two classes: those for which SC knows it will need to get the M-Bus (WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD) and those for which it cannot be sure that the M-Bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions SC always gets the M-Bus before getting the MI-Bus. For plocal transactions it asks for the MI-Bus alone.
The motivation for this distinction is that only plocal transactions need to be made restartable in order to avoid deadlock over bus allocation. Global transactions never need to be restarted, since the M-Bus is always acquired first.
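To summarize the split, here is a minimal C sketch (illustrative only; the transaction names come from the buzzword list, the routine name is made up):

typedef enum {WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD, RQ, WQ} Transaction;

/* TRUE exactly for the transactions the text classifies as global. */
static int IsGlobal(Transaction t)
{
  switch (t) {
  case WS: case IOR: case IOW: case IORF: case IOWF:
  case RMSR: case RMSRSD:
    return 1;   /* global: get the M-Bus first, then the MI-Bus            */
  default:
    return 0;   /* plocal (RQ, WQ): ask for the MI-Bus alone; restartable  */
  }
}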
MI Arbitration
The arbiter for MI lives inside its BC. Each SC on the MI has two request wires running to this arbiter: MnRq and MnGRq (for global request). For plocal transactions a processor cache asserts MnRq alone; for global transactions it asserts both MnRq and MnGRq. A global request from an SC flows through the BC and comes out on the M-Bus as MnRqst in one phase; similarly a grant on the M-Bus flows through BC and comes out on MI in one phase.
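A minimal C sketch of the request wires, assuming the scheme above (MnRq and MnGRq are the wire names from the text; the struct and routine names are illustrative):

typedef struct {
  int MnRq;    /* asserted for every request to the MI arbiter     */
  int MnGRq;   /* asserted in addition when the request is global  */
} MIRequest;

/* What an SC drives toward the arbiter in BC; a global request is
   forwarded by BC onto the M-Bus as MnRqst. */
static MIRequest RequestMI(int global)
{
  MIRequest r;
  r.MnRq  = 1;
  r.MnGRq = global ? 1 : 0;
  return r;
}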
Bus Priority
Within BC, requests from the M-Bus are given priority over requests from MI. That is, if an MI-Bus transaction has started and an M-Bus transaction arrives within a bus cache, then the MI-Bus transaction is aborted if necessary and forced to restart. Within SC, the M-Bus side has priority over the P-Bus side; however, in case of conflict the P-Bus transaction is not aborted but rather the processor is forced to wait.
MI Abort Mechanism
Aborts on MI are signalled by "grant yanking"; i.e., the arbiter in BC deasserts Gnt prematurely to signal that the current transaction must be aborted and then restarted. The semantics of restart for the plocal transactions are given below.
RQ Restart
For a processor read we can let the processor go ahead as soon as the requested word is fetched by SC. For a processor write we must hold up the processor until the RQ is complete. The cost of this should be small, since the number of writes that cause RQ's should be small: a data reference misses around 15% of the time, and a write reference happens about once every 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
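A quick check of the arithmetic, using the numbers quoted above:

#include <stdio.h>

int main(void)
{
  double missRate      = 0.15;   /* fraction of data refs that miss      */
  double writeInterval = 21.0;   /* one write reference every 21 cycles  */
  double stallCycles   = 3.0;    /* extra cycles per write that misses   */

  printf("added cost = %.1f%%\n",
         100.0 * stallCycles * missRate / writeInterval);   /* about 2% */
  return 0;
}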
If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit.
WQ Restart
Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by a flush or launder, the processor is not involved at all, so the assertion holds. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway, since the WQ is part of a WQ, RQ sequence.
An SC must not clear master until after commit.
M to MI Delay
There will need to be a one-cycle delay between M and MI in the bus cache to provide enough time to acquire MI when a request comes in from M. This can be seen by considering a WS coming in from M: WS on M; match; yank grant; WS on MI.
The extra cycle is needed since MI becomes free only in the cycle following the yank. To keep things simple we should always insert the extra cycle regardless of whether MI is busy or not.
BC Control Bits
For each quad, BC keeps the following control bits:
Master: when TRUE, BC or one of its SC's has the latest value for the quad.
Shared: when TRUE, there may be a copy of this quad in other BC's; when FALSE, no other bus cache contains a copy.
ExistsBelow: when TRUE there is a copy of this quad in one of the SC's connected to this BC; when FALSE none of the processor caches on this MI-Bus contains this quad. This bit is needed to prevent RQ's, WQ's and WS's on the M-Bus from being broadcast on the MI-Bus when not absolutely necessary. Note that this bit is not a hint, but contains the truth.
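A minimal C sketch of this per-quad state (the bit-field packing is illustrative; only the three bits themselves come from the text):

typedef struct {
  unsigned master      : 1;   /* BC or one of its SC's has the latest value */
  unsigned shared      : 1;   /* a copy may exist in another BC             */
  unsigned existsBelow : 1;   /* a copy exists in some SC on this MI-Bus;
                                 exact, not a hint, so M-Bus traffic is
                                 broadcast on MI only when necessary        */
} BCQuadState;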
Notification of First Write
When a quad is written into for the first time by an SC, the BC connected to that SC must be notified. If this is not done, BC has no way of knowing whether or not it must respond to an RQ on the M-Bus.
The way this will work is as follows: whenever SC gets a quad in response to an RQ it initiated, it initializes the quad bits to Master=0, Shared=1. Now the first time SC's processor writes into this quad, SC will put out a WS onto MI (since Shared is set) and will also set Master to 1; this acts as the notification that Master has changed from 0 to 1 in SC. In the meantime BC checks whether its copy of the Shared bit for the quad is set, and if it is, BC pulls the shared line on MI. If BC's copy of Shared is not set, and if none of SC's brother caches pull shared, then SC will clear its Shared bit. In this latter case future writes will stay entirely within SC. The WS initiated by SC will of course go out to the main M-Bus (like all good WS's). This isn't necessary if Shared wasn't set in BC, but it doesn't hurt (logically speaking). To see that performance is not a problem, take a look at the section titled Main M-Bus Load—WS Contribution.
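A minimal C sketch of the SC side of this protocol, assuming the sequence above (PutWSOnMI and SharedLinePulled are made-up names standing for the real bus actions):

typedef struct { int master, shared; } SCQuadBits;

extern void PutWSOnMI(void);          /* made-up name: drive a WS onto MI     */
extern int  SharedLinePulled(void);   /* made-up name: did BC or a brother SC
                                         pull the shared line?                */

/* A quad arrives in SC in response to an RQ that SC initiated. */
void SCReceiveQuad(SCQuadBits *q)
{
  q->master = 0;
  q->shared = 1;                      /* assume shared until shown otherwise  */
}

/* SC's processor writes into the quad for the first time. */
void SCFirstWrite(SCQuadBits *q)
{
  if (q->shared) {
    PutWSOnMI();                      /* the WS doubles as the Master 0->1 notice */
    q->master = 1;
    if (!SharedLinePulled())
      q->shared = 0;                  /* later writes stay entirely within SC */
  }
}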
Victim Selection in BC
Victim selection in a bus cache will be interesting, given that every quad in a processor cache must also be in a bus cache. Consider the case where all of the quads in a bus cache have inPC set. This could happen when a large amount of data is being read sequentially by the processor. Clearly, before the bus cache can kick out a quad, it has to check whether or not it is in a processor cache. It would be much better if the bus cache victimized only those quads that don't exist in any processor cache. One way is for the bus cache to do an RQ on MI every time it tries to victimize a quad with inPC set. If someone pulls shared, the bus cache knows not to kick this quad out. Note that the RQ on MI can be aborted as soon as the bus cache has its answer, so it needn't take up too much bandwidth on MI.
It looks like having most inPC bits set is the common case! Consider read-only data and code, for example. Every time such a quad gets into a processor cache, its inPC bit (in the bus cache) gets set; this bit won't get cleared when the host processor cache victimizes the quad, since the quad will be clean.
Now, this suggests a solution to our dilemma. Suppose that every time a processor cache does an RQ, it puts the address of the victim quad in one of the two idle cycles of the RQ. All of the processor caches on MI would match on the address of the target of the RQ and then on the address of the victim. If any of them matches on the victim address, it pulls down a new dedicated line called "MnInPC". The bus cache would watch this line and set the inPC bit for the quad being victimized by the RQ according to the state of MnInPC. Note that with this addition the inPC bit is no longer a "hint". One might think that it would still be a hint because of other ways to eliminate quads from caches, e.g. flush, but one would be wrong: Flush removes the quad from all bus caches as well as all processor caches.
The second idle cycle in an RQ is used to transfer the address of the victim into which the read quad is going. This is needed so BC can maintain the ExistsBelow bit as an absolute rather than as a hint. As far as SC is concerned the M-Bus has two shared lines, one called MnShared and the other MnVictimShared. SC must watch the first bus cycle of an RQ and assert MnShared in the usual way, but it must also watch the third cycle (containing the victim address) and assert MnVictimShared if there was a match. This is needed so BC can do its victimization without violating the invariants that BC and its SC's must maintain (see the design decisions document for BC).
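A minimal C sketch of what each watching SC does during an RQ under this scheme (the routine names are made up; MnShared and MnVictimShared are the lines from the text):

extern int  AddressMatches(unsigned long addr);   /* made-up name */
extern void AssertMnShared(void);                 /* made-up name */
extern void AssertMnVictimShared(void);           /* made-up name */

/* Each watching SC: cycle 1 of the RQ carries the target address,
   the second idle cycle carries the requester's victim address. */
void SCWatchRQ(unsigned long targetAddr, unsigned long victimAddr)
{
  if (AddressMatches(targetAddr))
    AssertMnShared();           /* the usual sharing response              */
  if (AddressMatches(victimAddr))
    AssertMnVictimShared();     /* lets BC keep its inPC/ExistsBelow exact */
}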
Response to Done Command
SC's must respond to all Dones, including ones corresponding to WriteMap and WriteMapFlags, which SC's don't issue. This was agreed upon at the May 29th meeting with ED.
A Note on WS's
Note that with our consistency algorithm every write to a shared datum is forced to propagate up to the main M-Bus, even when this is not necessary. However, WS's propagate downward only when necessary. A pleasant consequence of making all WS's equally expensive is that there is no asymmetry amongst the processors! I.e., we don't have to do something silly like assign closely cooperating processes to processors that live on the same MI.
Main M-Bus Load
Given the above consistency maintenance algorithm in a two-level system, this section totals up all the significant contributions to load on the main M-Bus. Let the number of processors be N.
RQ Contribution
The number of main M-Bus cycles per 100 processor cycles for RQ's is given by:
CRQ = 7 * [IFURefs*IFUSCMissRate + EURefs*EUSCMissRate] * BCMissRate * N,
where the 7 comes from 7 Cycles/ReadQuad. Assuming our standard numbers of IFURefs=56, EURefs=17, IFUSCMissRate=0.05, EUSCMissRate=0.05, and BCMissRate=0.04 (for a 128Kbyte cache for two processors), we get
CRQ = 1.0N cycles per 100 cycles.
WS Contribution
There are two contributions to WS's. One is authentic WS's, and the other is "fake" WS's issued by SC to inform BC that it is writing into a quad for the first time. Let's call these two contributions CWS and CFWS. We have
CWS = 2 (Cycles/WS) * EURefs * SharedDataWriteRate * N, and
CFWS = 2 (Cycles/WS) * EURefs * FakeSharedDataWriteRate * N.
Assuming 5% of all EU references are authentic writes to shared data (a high number), we have SharedDataWrites=17*0.05=0.85 per 100 cycles.
The number of FakeSharedDataWrites is equal to the number of writes to clean quads. From simulation data, this number is around 1.8% of all EU refs, or 17*0.018=0.31 per 100 cycles.
Therefore,
CWS = 1.7N
CFWS = 0.6N
Map Miss Contribution
The Map Miss contribution per 100 cycles is
CMM = 4*[IFURefs*IFUSCMapMissRate + EURefs*EUSCMapMissRate] * N
From simulation data, we have IFUSCMapMisses = 56*0.005=0.28; and EUSCMapMisses = 17*0.01=0.17, so
CMM = 4*[0.28+0.17] * N = 1.8N
Total
Contributions other than the ones listed above are in the noise, so the total is
CTotal = CRQ+CWS+CFWS+CMM = 1.0N+1.7N+0.6N+1.8N = 5.1N
If we want to run the M-Bus at a load of 0<L<1, the number of processors we can have is
N = 100L/5.1
For 70% loading, N ≈ 14; for 80% loading, N ≈ 16.
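A quick C check of the budget, recomputed from the raw per-100-cycle figures above (it gives 5.1N and processor counts of 13.6 and 15.6, which round to the 14 and 16 quoted):

#include <stdio.h>

int main(void)
{
  double IFURefs = 56.0, EURefs = 17.0;   /* refs per 100 processor cycles */

  double CRQ  = 7.0 * (IFURefs*0.05 + EURefs*0.05) * 0.04;  /* ReadQuads  */
  double CWS  = 2.0 * EURefs * 0.05;                        /* real WS's  */
  double CFWS = 2.0 * EURefs * 0.018;                       /* fake WS's  */
  double CMM  = 4.0 * (IFURefs*0.005 + EURefs*0.01);        /* map misses */
  double CTot = CRQ + CWS + CFWS + CMM;

  printf("CTotal = %.2fN cycles per 100 cycles\n", CTot);
  printf("N at 70%% load: %.1f   N at 80%% load: %.1f\n",
         70.0/CTot, 80.0/CTot);
  return 0;
}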
A Caveat
The average queueing theory fan reading through the above would probably have torn out all his hair by now, so it's appropriate to say something about these calculations. Strictly speaking the numbers IFURefs and EURefs, which we assumed to be 56 and 17 respectively, are dependent on the loads on the internal and main M-Buses. The dependence is especially strong when either of the buses is heavily loaded, and is more sensitive to internal M-Bus load than to main M-Bus load (both numbers decrease with bus load). The right way to compute bus load is to assume the values for IFURefs and EURefs for an SC that never misses, compute the load, find out the actual values of IFURefs and EURefs for this load, and then iterate till the load converges. What we've done is to deflate the values for IFURefs and EURefs from those for an ideal (infinite) cache to those for a single processor system connected up in a one-level configuration with the standard SC (100 entries, 1 quad/entry). These deflated values are virtually identical to the real ones if we assume only one processor per MI, and are within four percent or so if we assume two processors per MI.
Essentially then, the computations work because IFURefs and EURefs are relatively insensitive to main M-Bus load for loads less than around 0.80, and to MI-Bus load for loads less than around 0.60, and we are within these bounds on both counts.
See the paper by Ashcroft [1] or the book by Kleinrock [2] for the math.
Multiple Processors per MI
The SC's for one processor put a load of 33 cycles on MI. If we put two processors per MI, the effective contribution of each processor will drop from 0.75 ideal processors to 0.72 ideal processors [1]. This is less than 4% degradation, so it's probably acceptable to put up to two processors on a single MI bus. With 3 processors the contribution drops to 0.68 (10%), and for 4 processors to 0.64 (15%).
The designs of SC and BC are independent of how many processors will be put on each MI, so this decision can be postponed until system fabrication time.
References
[1] Ashcroft, H., "The Productivity of Several Machines Under the Care of One Operator", Journal of the Royal Statistical Society, Series B, 1950.
[2] Kleinrock, L., "Queueing Systems, Volume I: Theory", Wiley-Interscience, 1975.