BCDesignDecisions.tioga
Written by: Sindhu, September 22, 1985 2:18:11 pm PST
Last Edited by:
Pradeep Sindhu March 13, 1986 12:33:06 pm PST
DRAGON BIG CACHE DESIGN DECISIONS
DRAGON PROJECT — FOR INTERNAL XEROX USE ONLY
Dragon Big Cache
Design Decisions
Release as [Indigo]<Dragon>BigCache>Documentation>BCDesignDecisions.tioga, .press

© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: This document contains a record of design decisions for the Dragon Big Cache; where possible it also tries to provide the rationale for each decision. It is currently being used by the cache designers, and is expected to be useful in the future in writing documentation and/or papers. Some information in here may be duplicated elsewhere (e.g. in the design decisions documents for the Small Cache, the Memory System, and the Map Cache).
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304



For Internal Xerox Use Only
Contents
Introduction
Buzzword Definition
Deadlock Avoidance: Transaction Classes and Aborts
Request/Grant
Transition of Owner from 0 to 1
Signaling Victimization
Introduction
This document is one in a series that attempts to record design decisions for the Dragon Memory System as they are made, and especially to outline the reasons for each decision. Given the overall complexity of the Memory System, the designers found it necessary to keep such a running record to avoid revisiting old decisions and having to unearth the reasons for them each time a contentious issue arose.
The structure of this series is as follows. The Memory System document forms the top level, confining itself to decisions that affect at least two of the components comprising the Memory System: the Small Cache, the Big Cache, the Map Cache, and Main Memory. This document confines itself to decisions that affect only the Big Cache.
The section immediately below defines terms that are used frequently throughout this document. Each of the following sections covers one design decision.
Buzzword Definition
This document uses a fair amount of shorthand. This section defines these terms before they are used.
 BC Big Cache
 SC Small Cache
 WS WriteSingle
 RQ ReadQuad
 WQ WriteQuad
 RQS ReadQuadStream
 WQS WriteQuadStream
 IOR IORead
 IOW IOWrite
 IORF IOReadFlow
 IOWF IOWriteFlow
 RMSR ReadMapSetRef
 RMSRSD ReadMapSetRefSetDirty
 MRD Map Read
 MI-Bus Internal M-Bus
 Entry Smallest cache unit that can match independently
 Line Geometric, horizontally repeating unit in cache array
   (may contain 1 or more entries)
Deadlock Avoidance: Transaction Classes and Aborts
Transactions on the MI-Bus are divided into two classes: those for which SC knows it will need to get the M-Bus (WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD) and those for which it cannot be sure that the M-Bus will be needed (RQ, WQ). Call the first type global and the second plocal (for probably local). For global transactions SC always gets the M-Bus before getting the MI-Bus. For plocal transactions it asks for the MI-Bus alone. This strategy prevents deadlock over bus allocation in a two-level cache system.
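As a concrete restatement (not the actual SC hardware, which makes this decision combinationally), here is a minimal C sketch of the classification and the acquisition order; the type Transaction and the functions isGlobal, issue, acquireMBus, and acquireMIBus are invented for illustration.

    #include <stdbool.h>

    typedef enum {                              /* MI-Bus transaction types, per the buzzword list */
      WS, IOR, IOW, IORF, IOWF, RMSR, RMSRSD,   /* global: M-Bus known to be needed */
      RQ, WQ                                    /* plocal: M-Bus only possibly needed */
    } Transaction;

    extern void acquireMBus(void);              /* hypothetical arbiter interface */
    extern void acquireMIBus(void);

    /* Global transactions are those for which SC knows in advance that the
       M-Bus will be needed; RQ and WQ may be satisfiable locally. */
    static bool isGlobal(Transaction t) {
      return t != RQ && t != WQ;
    }

    /* The ordering rule that avoids deadlock: for global transactions SC gets
       the M-Bus before the MI-Bus; for plocal ones it asks for the MI-Bus alone. */
    void issue(Transaction t) {
      if (isGlobal(t)) {
        acquireMBus();
        acquireMIBus();
      } else {
        acquireMIBus();
      }
      /* ... carry out the transaction ... */
    }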
Restarts on MI are signalled by "grant yanking"; that is, the arbiter in BC deasserts Gnt prematurely to signal that the current transaction must be aborted and then restarted. The semantics of restart for the plocal transactions are given below.
RQ, RQS restart. For a processor read we can let the processor go ahead as soon as the requested word is fetched by SC. For a processor write we must hold up the processor till the RQ is complete. The cost of this should be small since the number of writes that cause RQ's should be small. A data miss happens around 15% of the time, and a write reference happens in 1 out of 21 cycles. Three cycles are added for each write that misses, so the added cost is 3*0.15/21, or around 2%.
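For reference, the arithmetic behind the 2% figure, using the miss rate and write frequency quoted above:

    added cost  =  (extra cycles per write miss) * (miss rate) * (write refs per cycle)
                =  3 * 0.15 * (1/21)
                =  0.021, i.e. about 2% of all cycles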
If the fetched quad goes into an existing line then don't set quadValid until after commit. If it goes into a victim then don't set vpValid, rpValid, and quadValid until after commit.
WQ, WQS restart. Note that unlike RQ we don't have to make the processor wait any longer than necessary. To see this, there are two cases to consider. For WQ's caused by flush or launder, the processor is not involved at all, so the assertion is true. For WQ's caused by a miss resulting from a processor reference, the processor must be forced to wait anyway since the WQ is part of a WQ, RQ sequence.
Inside SC don't clear owner until after commit.
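A minimal C sketch of the commit-time bookkeeping described for the two restart cases; the bit names quadValid, vpValid, rpValid, and owner come from the text, the struct and function names are invented, and the reading of vpValid/rpValid as virtual- and real-page valid bits is an assumption.

    #include <stdbool.h>

    /* Per-entry state bits in SC, as named in the text. */
    typedef struct {
      bool quadValid;
      bool vpValid;     /* presumably virtual-page valid (victim case only) */
      bool rpValid;     /* presumably real-page valid    (victim case only) */
      bool owner;
    } EntryState;

    /* RQ/RQS: make the fetched quad visible only after the transaction commits,
       so that a grant-yank restart leaves no partially valid state behind. */
    void commitReadQuad(EntryState *e, bool intoVictim) {
      if (intoVictim) {     /* quad displaced a victim: the whole entry is new */
        e->vpValid = true;
        e->rpValid = true;
      }
      e->quadValid = true;  /* in either case, set only after commit */
    }

    /* WQ/WQS: SC gives up ownership only once the write-back has committed. */
    void commitWriteQuad(EntryState *e) {
      e->owner = false;     /* don't clear owner until after commit */
    }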
Note that with this scheme every write to a shared datum is forced to propagate up to the main M-Bus even when this is not necessary. However, WS's propagate downward only when strictly necessary. To get a feel for what the M-Bus loading would be due to shared writes, let's do a little arithmetic. Assume 16 processors, each processor doing 1 EU reference every 4 cycles. Further assume that 5% of these references are writes to shared data. Given that a WS takes 2 M-Bus cycles, the M-Bus loading comes out to 40%, which isn't too bad.
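Spelling the loading arithmetic out with the numbers assumed above:

    M-Bus loading  =  (processors) * (refs per cycle per processor) * (shared-write fraction) * (M-Bus cycles per WS)
                   =  16 * (1/4) * 0.05 * 2
                   =  0.40, i.e. 40% of M-Bus cycles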
A pleasant consequence of making all WS's equally expensive is that there is no asymmetry amongst the processors! That is, we don't have to do something crazy like assign really closely cooperating processes to processors that live on the same MI-Bus.
Request/Grant
Each SC has two request wires running to the arbiter inside BC: Rq and GRq (for Global Request). For plocal transactions SC asserts Rq alone; for global transactions it asserts both Rq and GRq. A GRq request from SC flows through BC and comes out on the M-Bus as Rq in one phase; conversely, a Grant from the M-Bus arbiter is passed by BC and comes out on the MI-Bus in one phase.
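A minimal C sketch of the request/grant plumbing described above; the wire names Rq, GRq, and Gnt are from the text, while the data structure and function names are invented. It models only the global pass-through path, not BC's local arbitration of plocal requests.

    #include <stdbool.h>

    /* Wires as seen by BC's arbiter during one phase. */
    typedef struct {
      bool miRq, miGRq;   /* Rq and GRq from an SC on the MI-Bus */
      bool mGnt;          /* Grant from the M-Bus arbiter        */
      bool mRq;           /* Rq that BC drives onto the M-Bus    */
      bool miGnt;         /* Gnt that BC drives onto the MI-Bus  */
    } ArbWires;

    /* SC side: every transaction asserts Rq; global ones also assert GRq. */
    void scRequest(ArbWires *w, bool global) {
      w->miRq  = true;
      w->miGRq = global;
    }

    /* BC side, one-phase pass-through: a GRq comes out on the M-Bus as Rq one
       phase later, and an M-Bus Grant comes out on the MI-Bus one phase later. */
    void bcArbiterPhase(ArbWires *w) {
      w->mRq   = w->miGRq;
      w->miGnt = w->mGnt;
    }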
Transition of Owner from 0 to 1
The transition of a quad in an SC from Owner=0 to Owner=1 is tricky. The problem is that in a two-level system SC must somehow notify its BC that the latest copy of the data no longer resides within BC. This is to get BC to do a ReadQuad on MI when a ReadQuad appears on M.
The way it will work is as follows: whenever SC gets a quad in response to an RQ it initiated, it initializes the quad bits to Owner=0, Shared=1. The first time SC's processor writes into this quad, SC will put out a WS onto MI (since Shared is set) and will also set Owner to 1. In the meantime BC checks whether its copy of the Shared bit for the quad is set, and if it is BC pulls the shared line on MI. If BC's copy of Shared is not set, and if none of SC's brother caches pulls shared, then SC will clear its Shared bit. In this latter case future writes will stay entirely within SC. The WS initiated by SC will of course go out to the main M-Bus (like all good WS's). This isn't necessary if Shared wasn't set in BC, but it doesn't hurt (logically speaking). To see that performance is not a problem we have to do some analysis.
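A minimal C sketch of the first-write sequence just described; the Owner and Shared bits are from the text, while the helpers modeling the MI-Bus (sendWSOnMI, bcPulledShared, brotherPulledShared) are invented for illustration.

    #include <stdbool.h>

    typedef struct { bool owner, shared; } QuadBits;

    extern void sendWSOnMI(void);           /* hypothetical: put a WS out on the MI-Bus */
    extern bool bcPulledShared(void);       /* did BC pull the shared line on MI?       */
    extern bool brotherPulledShared(void);  /* did a brother SC pull the shared line?   */

    /* When SC receives a quad for an RQ it initiated, it starts conservatively. */
    void initQuad(QuadBits *q) {
      q->owner  = false;
      q->shared = true;
    }

    /* First processor write into the quad. */
    void firstWrite(QuadBits *q) {
      sendWSOnMI();            /* Shared is set, so the write goes out (and on to M)  */
      q->owner = true;         /* SC now holds the latest copy                        */
      if (!bcPulledShared() && !brotherPulledShared())
        q->shared = false;     /* nobody else has it: later writes stay inside SC     */
    }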
The added cost of this scheme is that it forces the first write to each quad fetched by an SC to appear on the MI-Bus (and hence the main M-Bus) even when the quad is not actually shared.
Signaling Victimization
The second idle cycle in an RQ is used to transfer the address of the victim into which the read quad is going. This is needed so BC can maintain the ExistsBelow bit as an absolute rather than as a hint. As far as SC is concerned the M-Bus has two shared lines, one called MnShared and the other MnVictimShared. SC must watch the first bus cycle of an RQ and assert MnShared in the usual way, but it must also watch the third cycle (containing the victim address) and assert MnVictimShared if there was a match. This allows BC to do its victimization without violating the invariants that BC and its SC's must maintain (see the design decisions document for BC).
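A minimal C sketch of SC's snooping duties during an RQ, as described above; the line names MnShared and MnVictimShared are from the text, while the cycle numbering and the helper functions are invented for illustration.

    #include <stdbool.h>

    extern bool matchesQuad(unsigned long addr);   /* does SC hold the quad at this address? */
    extern void assertLine(const char *name);      /* hypothetical: pull a bus line          */

    /* Called once per RQ bus cycle with the address currently on the bus. */
    void watchRQCycle(int cycle, unsigned long addrOnBus) {
      if (cycle == 1 && matchesQuad(addrOnBus))
        assertLine("MnShared");          /* usual sharing response to the read address        */
      if (cycle == 3 && matchesQuad(addrOnBus))
        assertLine("MnVictimShared");    /* victim address: lets BC keep ExistsBelow absolute */
    }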