BCDesignDecisions.tioga
Written by: Sindhu, September 22, 1985 2:18:11 pm PST
Last Edited by:
Pradeep Sindhu March 13, 1986 12:33:06 pm PST
DRAGON BIG CACHE DESIGN DECISIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
Dragon Big Cache
Design Decisions
Release as [Indigo]<Dragon>BigCache>Documentation>BCDesignDecisions.tioga, .press

© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: This document contains a record of design decisions for the Dragon Big Cache; where possible it also tries to provide the rationale for each decision. It is currently being used by the cache designers, and is expected to be useful in the future in writing documentation and/or papers. Some information in here may be duplicated elsewhere (e.g. in the design decisions documents for the Small Cache, the Memory System, and the Map Cache).
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
Introduction
Buzzword Definition
Deadlock Avoidance: Transaction Classes and Aborts
Request/Grant
Transition of Owner from 0 to 1
Signaling Victimization
Introduction
This document is one in a series that attempts to record design decisions for the Dragon Memory System as they are made, and especially to outline the reasons for each decision. Given the overall complexity of the Memory System, the designers found it necessary to keep such a running record to avoid revisiting old decisions and having to unearth the reasons for them each time a contentious issue arose.
The structure of this series is as follows. The Memory System document forms the top level, confining itself to decisions that affect at least two of the components comprising the Memory System: the Small Cache, the Big Cache, the Map Cache, and Main Memory. This document confines itself to decisions that affect only the Big Cache.
The section immediately below defines terms that are used frequently throughout this document. Each of the following sections covers one design decision.
Buzzword Definition
This document uses a fair amount of shorthand. This section defines these terms before they are used.
 BC Big Cache
 SC Small Cache
 WS WriteSingle
 RQ ReadQuad
 WQ WriteQuad
 RQS ReadQuadStream
 WQS WriteQuadStream
 IOR IORead
 IOW IOWrite
 IORF IOReadFlow
 IOWF IOWriteFlow
 RMSR ReadMapSetRef
 RMSRSD ReadMapSetRefSetDirty
 MRD Map Read
 MI-Bus Internal M-Bus
 Entry Smallest cache unit that can match independently
 Line Geometric, horizontally repeating unit in cache array
   (may contain 1 or more entries)
Deadlock Avoidance: Transaction Classes and Aborts
Transactions on the MI-Bus are divided into two classes: those for which SC knows it will need to get the M-Bus (WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD) and those for which it cannot be sure that the M-Bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions SC always gets the M-Bus before getting the MI-Bus. For plocal transactions it asks for the MI-Bus alone. This strategy prevents deadlock over bus allocation in a two-level cache system.
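As a concrete restatement (not the actual SC hardware, which makes this decision combinationally), here is a minimal C sketch of the classification and the acquisition order; the type Transaction and the functions isGlobal, issue, acquireMBus, and acquireMIBus are invented for illustration.

    #include <stdbool.h>

    typedef enum {                              /* MI-Bus transaction types, per the buzzword list */
      WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD,   /* global: M-Bus known to be needed */
      RQ, WQ                                    /* plocal: M-Bus only possibly needed */
    } Transaction;

    extern void acquireMBus(void);              /* hypothetical arbiter interface */
    extern void acquireMIBus(void);

    /* Global transactions are those for which SC knows in advance that the
       M-Bus will be needed; RQ and WQ may be satisfiable locally. */
    static bool isGlobal(Transaction t) {
      return t != RQ && t != WQ;
    }

    /* The ordering rule that avoids deadlock: for global transactions SC gets
       the M-Bus before the MI-Bus; for plocal ones it asks for the MI-Bus alone. */
    void issue(Transaction t) {
      if (isGlobal(t)) {
        acquireMBus();
        acquireMIBus();
      } else {
        acquireMIBus();
      }
      /* ... carry out the transaction ... */
    }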
Restarts on MI are signalled by "grant yanking"; that is, the arbiter in BC deasserts Gnt prematurely to signal that the current transaction must be aborted and then restarted. The semantics of restart for the plocal transactions are given below.
RQ, RQS restart. For a processor read we can let the processor go ahead as soon as the requested word is fetched by SC. For a processor write we must hold up the processor till the RQ is complete. The cost of this should be small since the number of writes that cause RQ's should be small. A data miss happens around 15% of the time, and a write reference happens in 1 out of 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
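For reference, the arithmetic behind the 2% figure, using the miss rate and write frequency quoted above:

    added cost  =  (extra cycles per write miss) * (miss rate) * (write refs per cycle)
                =  3 * 0.15 * (1/21)
                =  0.021, i.e. about 2% of all cycles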
If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit.
WQ, WQS restart. Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by flush or launder, the processor is not involved at all, so the assertion is true. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway since the WQ is part of a WQ, RQ sequence.
Inside SC don't clear owner until after commit.
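A minimal C sketch of the commit-time bookkeeping described for the two restart cases; the bit names quadValid, vpValid, rpValid, and owner come from the text, the struct and function names are invented, and the reading of vpValid/rpValid as virtual- and real-page valid bits is an assumption.

    #include <stdbool.h>

    /* Per-entry state bits in SC, as named in the text. */
    typedef struct {
      bool quadValid;
      bool vpValid;     /* presumably virtual-page valid (victim case only) */
      bool rpValid;     /* presumably real-page valid    (victim case only) */
      bool owner;
    } EntryState;

    /* RQ/RQS: make the fetched quad visible only after the transaction commits,
       so that a grant-yank restart leaves no partially valid state behind. */
    void commitReadQuad(EntryState *e, bool intoVictim) {
      if (intoVictim) {     /* quad displaced a victim: the whole entry is new */
        e->vpValid = true;
        e->rpValid = true;
      }
      e->quadValid = true;  /* in either case, set only after commit */
    }

    /* WQ/WQS: SC gives up ownership only once the write-back has committed. */
    void commitWriteQuad(EntryState *e) {
      e->owner = false;     /* don't clear owner until after commit */
    }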
Note that with this scheme every write to a shared datum is forced to propagate up to the main M-Bus even when this is not necessary. However, WS's propagate downward only when strictly necessary. To get a feel for what the M-Bus loading would be due to shared writes, let's do a little arithmetic. Assume 16 processors, each processor doing 1 EU reference every 4 cycles. Further assume that 5% of these references are writes to shared data. Given that a WS takes 2 M-Bus cycles, the M-Bus loading comes out to 40%, which isn't too bad.
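Spelling the loading arithmetic out with the numbers assumed above:

    M-Bus loading  =  (processors) * (refs per cycle per processor) * (shared-write fraction) * (M-Bus cycles per WS)
                   =  16 * (1/4) * 0.05 * 2
                   =  0.40, i.e. 40% of M-Bus cycles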
A pleasant consequence of making all WS's equally expensive is that there is no asymmetry amongst the processors! That is, we don't have to do something crazy like assign really closely cooperating processes to processors that live on the same MI-Bus.
Request/Grant
Each SC has two request wires running to the arbiter inside BC: Rq and GRq (for Global Request). For plocal transactions SC asserts Rq alone; for global transactions it asserts both Rq and GRq. A GRq request from SC flows through BC and comes out on the M-Bus as Rq in one phase; conversely, a Grant from the M-Bus arbiter is passed by BC and comes out on the MI-Bus in one phase.
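A minimal C sketch of the request/grant plumbing described above; the wire names Rq, GRq, and Gnt are from the text, while the data structure and function names are invented. It models only the global pass-through path, not BC's local arbitration of plocal requests.

    #include <stdbool.h>

    /* Wires as seen by BC's arbiter during one phase. */
    typedef struct {
      bool miRq, miGRq;   /* Rq and GRq from an SC on the MI-Bus */
      bool mGnt;          /* Grant from the M-Bus arbiter        */
      bool mRq;           /* Rq that BC drives onto the M-Bus    */
      bool miGnt;         /* Gnt that BC drives onto the MI-Bus  */
    } ArbWires;

    /* SC side: every transaction asserts Rq; global ones also assert GRq. */
    void scRequest(ArbWires *w, bool global) {
      w->miRq  = true;
      w->miGRq = global;
    }

    /* BC side, one-phase pass-through: a GRq comes out on the M-Bus as Rq one
       phase later, and an M-Bus Grant comes out on the MI-Bus one phase later. */
    void bcArbiterPhase(ArbWires *w) {
      w->mRq   = w->miGRq;
      w->miGnt = w->mGnt;
    }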
Transition of Owner from 0 to 1
The transition of a quad in an SC from Owner=0 to Owner=1 is tricky. The problem is that in a two-level system SC must somehow notify its BC that the latest copy of the data no longer resides within BC. This is to get BC to do a ReadQuad on MI when a ReadQuad appears on M.
The way it will work is as follows: whenever SC gets a quad in response to an RQ it initiated, it initializes the quad bits to Owner=0, Shared=1. The first time SC's processor writes into this quad, SC will put out a WS onto MI (since Shared is set) and will also set Owner to 1. In the meantime BC checks whether its copy of the Shared bit for the quad is set, and if it is BC pulls the shared line on MI. If BC's copy of Shared is not set, and if none of SC's brother caches pulls shared, then SC will clear its Shared bit. In this latter case future writes will stay entirely within SC. The WS initiated by SC will of course go out to the main M-Bus (like all good WS's). This isn't necessary if Shared wasn't set in BC, but it doesn't hurt (logically speaking). To see that performance is not a problem we have to do some analysis.
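A minimal C sketch of the first-write sequence just described; the Owner and Shared bits are from the text, while the helpers modeling the MI-Bus (sendWSOnMI, bcPulledShared, brotherPulledShared) are invented for illustration.

    #include <stdbool.h>

    typedef struct { bool owner, shared; } QuadBits;

    extern void sendWSOnMI(void);           /* hypothetical: put a WS out on the MI-Bus */
    extern bool bcPulledShared(void);       /* did BC pull the shared line on MI?       */
    extern bool brotherPulledShared(void);  /* did a brother SC pull the shared line?   */

    /* When SC receives a quad for an RQ it initiated, it starts conservatively. */
    void initQuad(QuadBits *q) {
      q->owner  = false;
      q->shared = true;
    }

    /* First processor write into the quad. */
    void firstWrite(QuadBits *q) {
      sendWSOnMI();            /* Shared is set, so the write goes out (and on to M)  */
      q->owner = true;         /* SC now holds the latest copy                        */
      if (!bcPulledShared() && !brotherPulledShared())
        q->shared = false;     /* nobody else has it: later writes stay inside SC     */
    }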
The added cost of this scheme is that it forces the first write to each quad fetched by an SC to appear on the MI-Bus (and hence the main M-Bus) even when the quad is not actually shared.
Signaling Victimization
The second idle cycle in an RQ is used to transfer the address of the victim into which the read quad is going. This is needed so BC can maintain the ExistsBelow bit as an absolute rather than as a hint. As far as SC is concerned the M-Bus has two shared lines, one called MnShared and the other MnVictimShared. SC must watch the first bus cycle of an RQ and assert MnShared in the usual way, but it must also watch the third cycle (containing the victim address) and assert MnVictimShared if there was a match. This allows BC to do its victimization without violating the invariants that BC and its SC's must maintain (see the design decisions document for BC).
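A minimal C sketch of SC's snooping duties during an RQ, as described above; the line names MnShared and MnVictimShared are from the text, while the cycle numbering and the helper functions are invented for illustration.

    #include <stdbool.h>

    extern bool matchesQuad(unsigned long addr);   /* does SC hold the quad at this address? */
    extern void assertLine(const char *name);      /* hypothetical: pull a bus line          */

    /* Called once per RQ bus cycle with the address currently on the bus. */
    void watchRQCycle(int cycle, unsigned long addrOnBus) {
      if (cycle == 1 && matchesQuad(addrOnBus))
        assertLine("MnShared");          /* usual sharing response to the read address        */
      if (cycle == 3 && matchesQuad(addrOnBus))
        assertLine("MnVictimShared");    /* victim address: lets BC keep ExistsBelow absolute */
    }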