The Memory System of a High-Performance Personal Computer

by Douglas W. Clark¹, Butler W. Lampson, and Kenneth A. Pier

January 1981

ABSTRACT

The memory system of the Dorado, a compact high-performance personal computer, has very high I/O bandwidth, a large paged virtual memory, a cache, and heavily pipelined control; this paper discusses all of these in detail. Relatively low-speed I/O devices transfer single words to or from the cache; fast devices, such as a color video display, transfer directly to or from main storage while the processor uses the cache. Virtual addresses are used in the cache and for all I/O transfers. The memory is controlled by a seven-stage pipeline, which can deliver a peak main-storage bandwidth of 530 million bits per second to service fast I/O devices and cache misses. Interesting problems of synchronization and scheduling in this pipeline are discussed. The paper concludes with some performance measurements that show, among other things, that the cache hit rate is over 99 percent.

A revised version of this paper will appear in IEEE Transactions on Computers.

CR CATEGORIES

6.34, 6.21

KEY WORDS AND PHRASES

bandwidth, cache, latency, memory, pipeline, scheduling, storage, synchronization, virtual memory

1. Present address: Digital Equipment Corporation, Tewksbury, Mass. 01876.

© Copyright 1981 by Xerox Corporation.

XEROX
PALO ALTO RESEARCH CENTER
3333 Coyote Hill Road / Palo Alto / California 94304

1. Introduction

This paper describes the memory system of the Dorado, a high-performance compact personal computer. This section explains the design goals for the Dorado, sketches its overall architecture, and describes the organization of the memory system. Later sections discuss in detail the cache (§2), the main storage (§3), interactions between the two (§4), and synchronization of the various parallel activities in the system (§5). The paper concludes with a description of the physical implementation (§6), and some performance measurements (§7).

1.1 Goals

A high-performance successor to the Alto computer [18], the Dorado is intended to provide the hardware base for the next generation of computer system research at the Xerox Palo Alto Research Center. The Dorado is a powerful but personal computing system supporting a single user within a programming system that extends from the microinstruction level to an integrated programming environment for a high-level language. It is physically small and quiet enough to occupy space near its users in an office or laboratory setting, and inexpensive enough to be acquired in considerable numbers. These constraints on size, noise, and cost have had a major effect on the design.

The Dorado is designed to rapidly execute programs compiled into a stream of byte codes [16]; the microcode that does this is called an emulator. Byte code compilers and emulators exist for Mesa [6, 13], Interlisp [4, 17], and Smalltalk [7]. An instruction fetch unit (IFU) in the Dorado fetches bytes from such a stream, decodes them as instructions and operands, and provides the necessary control and data information to the emulator microcode in the processor; it is described in another paper [9].
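To make the emulator's job concrete, the following is a minimal sketch of a byte-code interpreter loop in Python; the opcodes and their semantics are invented for illustration and bear no relation to the actual Mesa, Interlisp, or Smalltalk instruction sets. The IFU performs in hardware what the first two statements of the loop body do here.

    # Minimal sketch of a byte-code emulator loop (hypothetical opcodes,
    # not any real Dorado instruction set).
    def emulate(code: bytes, memory: list):
        pc, stack = 0, []
        while pc < len(code):
            op = code[pc]; pc += 1              # fetch and decode: the IFU's job
            if op == 0x01:                      # PUSH_LITERAL <byte operand>
                stack.append(code[pc]); pc += 1
            elif op == 0x02:                    # LOAD <address operand>
                stack.append(memory[code[pc]]); pc += 1
            elif op == 0x03:                    # ADD
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            else:
                raise ValueError(f"unknown byte code {op:#x}")
        return stack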
Further support for fast execution comes from a very fast microcycle, and a microinstruction set powerful enough to allow interpretation of a simple byte code in a single microcycle; these are described in a paper on the Dorado processor [10]. There is also a cache [2, 11] which has a latency of two cycles, and which can deliver a 16-bit word every cycle.

Another major goal is to support high-bandwidth input/output. In particular, color monitors, raster-scanned printers, and high-speed communications are all part of the computer research activities; these devices typically have bandwidths of 20 to 400 million bits per second. Fast devices must not excessively degrade program execution, even though the two functions compete for many of the same resources. Relatively slow devices, such as a keyboard or an Ethernet interface [12], must also be supported cheaply, without tying up the high-bandwidth I/O system. These considerations clearly suggest that I/O activity and program execution should proceed in parallel as much as possible. The memory system therefore allows parallel execution of cache accesses and main storage references. Its pipeline is fully segmented: a cache reference can start in every microinstruction cycle, and a main storage reference can start in every main storage cycle.

1.2 Gross structure of the Dorado

Figure 1 is a simplified block diagram of the Dorado. Aside from I/O, the machine consists of the processor, the IFU, and the memory system, which in turn contains a cache, a hardware virtual-to-real address map, and main storage. Both the processor and the IFU can make memory references and transfer data to and from the memory through the cache. Slow, or low-bandwidth, I/O devices communicate with the processor, which in turn transfers their data to and from the cache. Fast, or high-bandwidth, devices communicate directly with storage, bypassing the cache most of the time.

For the most part, data is handled sixteen bits at a time. The relatively narrow busses, registers, data paths, and memories which result from this choice help to keep the machine compact. This is especially important for the memory, which has a large number of busses. Packaging, however, is not the only consideration. Speed dictates a heavily pipelined structure in any case, and this parallelism in the time domain tends to compensate for the lack of parallelism in the space domain. Keeping the machine physically small also improves the speed, since physical distance (i.e., wire length) accounts for a considerable fraction of the basic cycle time. Finally, performance is often limited by the cache hit rate, which cannot be improved, and may be reduced, by wider data paths (if the number of bits in the cache is fixed).

[Figure 1: A simplified block diagram of the Dorado.]

Memory references specify a 16 or 28 bit displacement, and one of 32 base registers of 28 bits; the virtual address is the sum of the displacement and the base. Virtual address translation, or mapping, is implemented by table lookup in a dedicated memory.
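As a concrete illustration of this address arithmetic, here is a sketch in Python; the register contents, the 28-bit masking, and the names are our assumptions, not the hardware's.

    # Virtual address formation: a 28-bit base register plus a 16- or 28-bit
    # displacement from the microinstruction (values here are invented).
    BASE_REGISTERS = [0] * 32            # 32 base registers of 28 bits each
    BASE_REGISTERS[5] = 0x0123450        # hypothetical contents of base register 5

    def virtual_address(base_index: int, displacement: int) -> int:
        assert 0 <= base_index < 32
        return (BASE_REGISTERS[base_index] + displacement) & (2**28 - 1)

    print(hex(virtual_address(5, 0x1000)))   # -> 0x124450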
Main storage is the permanent home of data stored by the memory system. The storage is necessarily slow (i.e., it has long latency, which means that it takes a long time to respond to a request), because of its implementation in cheap but slow dynamic MOS RAMs (random access memories). To make up for being slow, storage is big, and it also has high bandwidth, which is more important than latency for sequential references. In addition, there is a cache which services non-sequential references with high speed (low latency), but is inferior to main storage in its other parameters. The relative values of these parameters are shown in Table 1.

                  Cache    Storage
    Latency         1        15
    Bandwidth       1         2
    Capacity        1       250

    Table 1: Parameters of the cache relative to storage

With one exception (the IFU), all memory references are initiated by the processor, which thus acts as a multiplexor controlling access to the memory (see §1.2 and [10]), and is the sole source of addresses. Once started, however, a reference proceeds independently of the processor. Each one carries with it the number of its originating task, which serves to identify the source or sink of any data transfer associated with the reference. The actual transfer may take place much later, and each source or sink must be continually ready to deliver or accept data on demand. It is possible for a task to have several references outstanding, but order is preserved within each type of reference, so that the task number plus some careful hardware bookkeeping is sufficient to match up data with references.

Table 2 lists the types of memory references executable by microcode. Figure 2, a picture of the memory system's main data paths, should clarify the sources and destinations of data transferred by these references (parts of Figure 2 will be explained in more detail later). All references, including fast I/O references, specify virtual, not real addresses. Although a microinstruction actually specifies a displacement and a base register which together form the virtual address, for convenience we will suppress this fact and write, for example, Fetch(a) to mean a fetch from virtual address a.

A Fetch from the cache delivers data to a register called FetchReg, from which it can be retrieved at any later time; since FetchReg is task-specific, separate tasks can make their cache references independently. An I/ORead reference delivers a 16-word block of data from storage to the FastOutBus (by way of the error corrector, as shown in Figure 2), tagged with the identity of the requesting task; the associated output device is expected to monitor this bus and grab the data when it appears. Similarly, the processor can Store one word of data into the cache, or do an I/OWrite reference which demands a block of data from an input device and sends it to storage (by way of the check-bit generator). There is also a Prefetch reference, which brings a block into the cache. Fetch, Store and Prefetch are called cache references. There are special references to flush data from the cache and to allow map entries to be read and written; these will be discussed later.

The instruction fetch unit is the only device that can make a reference independently of the processor. It uses a single base register, and is treated almost exactly like a processor cache fetch, except that the IFU has its own set of registers for receiving memory data (see [9] for details). In general we ignore IFU references from now on, since they add little complexity to the memory system.
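The task-tagging discipline might be modeled as follows; this is a behavioral sketch in Python, with all class and field names invented, showing how a task number travels with each reference and selects a task-specific FetchReg when data is finally delivered.

    # Sketch of task-tagged references. Each reference carries its originating
    # task number; each task has its own FetchReg, so completions can be
    # matched up with requests long after they were issued.
    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Reference:
        kind: str        # 'Fetch', 'Store', 'IORead', 'IOWrite', 'Prefetch'
        task: int        # originating task number
        vaddr: int       # virtual address

    class MemorySystem:
        def __init__(self, num_tasks=16):
            self.fetch_reg = [None] * num_tasks    # task-specific FetchReg
            self.pending = deque()                 # order preserved per type

        def start(self, ref: Reference):
            self.pending.append(ref)               # reference proceeds on its own

        def complete_one(self, data: int):
            ref = self.pending.popleft()
            if ref.kind == 'Fetch':
                self.fetch_reg[ref.task] = data    # deliver to that task's FetchReg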
[Figure 2: The main data paths of the memory system.]

2. The cache

[...]

Unfortunately this sleight-of-hand does not always work. The sequence

    (1) Fetch(address)
    (2) computation involving FetchReg

actually needs the data during cycle (2), which will therefore have to wait for one cycle (see §5.1). Data retrieved in cycle (1) would be the old value of FetchReg; this allows a sequence of fetches

    (1) Fetch(address1)
    (2) register1 ← FetchReg, Fetch(address2)
    (3) register2 ← FetchReg, Fetch(address3)
    (4) register3 ← FetchReg, Fetch(address4)
    . . .

to proceed at full speed.

3. The storage pipeline

Cache misses and fast I/O references use the storage portion of the pipeline, shown in Figure 3. In this section we first describe the operation of the individual pipeline stages, then explain how fast I/O references use them, and finally discuss how memory faults are handled. Using I/O references to expose the workings of the pipeline allows us to postpone until §4 a close examination of the more complicated references involving both cache and storage.

3.1 Pipeline stages

Each of the pipeline stages is implemented by a simple finite-state automaton that can change state on every microinstruction cycle. Resources used by a stage are controlled by signals that its automaton produces. Each stage owns some resources, and some stages share resources with others. Control is passed from one stage to the next when the first produces a start signal for the second; this signal forces the second automaton into its initial state. Necessary information about the reference type is also passed along when one stage starts another.

3.1.1 The ADDRESS stage

As we saw in §2, the ADDRESS stage computes a reference's virtual address and looks it up in CacheA. If it hits, and is not I/ORead or I/OWrite, control is passed to HITDATA. Otherwise, control is passed to MAP, starting a storage cycle. In the simplest case a reference spends just one microinstruction cycle in ADDRESS, but it can be delayed for various reasons discussed in §5.

3.1.2 The MAP stage

The MAP stage translates a virtual address into a real address by looking it up in a hardware table called the MapRAM, and then starts the STORAGE stage. Figure 5 illustrates the straightforward conversion of a virtual page number into a real page number. The low-order bits are not mapped; they point to a single word on the page.

Three flag bits are stored in MapRAM for each virtual page:

    ref, set automatically by any reference to the page;
    dirty, set automatically by any write into the page;
    writeProtect, set by memory-management software (using the MapWrite reference).

A virtual page not in use is marked as vacant by setting both writeProtect and dirty, an otherwise nonsensical combination. A reference is aborted by the hardware if it touches a vacant page, attempts to write a write-protected page, or causes a parity error in the MapRAM. All three kinds of map fault are passed down the pipeline to READTR2 for reporting; see §3.1.5.
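A minimal sketch of the MapRAM flag encoding and fault checks just described; the field names follow the text, while the data structure and method names are ours.

    # Sketch of a MapRAM entry and the map-fault checks.
    from dataclasses import dataclass

    @dataclass
    class MapEntry:
        real_page: int
        ref: bool = False            # set by any reference to the page
        dirty: bool = False          # set by any write into the page
        write_protect: bool = False  # set by memory-management software

        def vacant(self) -> bool:
            # writeProtect together with dirty is otherwise nonsensical,
            # so that combination marks a virtual page as not in use
            return self.write_protect and self.dirty

        def check(self, is_write: bool) -> str:
            if self.vacant():
                return 'page fault'           # reference aborted
            if is_write and self.write_protect:
                return 'write-protect fault'  # reference aborted
            return 'ok'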
[Figure 5: Conversion of a virtual page number into a real page number.]

[...]

[...] and completes the StorageRAM cycle (§3.1.3). READTR1 and READTR2 transport the data, control the error corrector, and deliver the data to FastOutBus (§3.1.5). Fault reporting, if necessary, is done by READTR2 as soon as the condition of the last quadword in the block is known (§3.3).

It is clear from Figure 7 that an I/ORead can be started every eight machine cycles, since this is the longest period of activity of any stage. This would result in 530 million bits per second of bandwidth, the maximum supportable by the memory system. The inner loop of a fast I/O task can be written in two microinstructions, so if a new I/ORead is launched every eight cycles, one-fourth of the processor capacity will be used. Because ADDRESS is used for only one cycle per I/ORead, other tasks (notably the emulator) may continue to hit in the cache when the I/O task is not running.

I/OWrite(x) writes into virtual location x a block of data delivered by a fast input device, together with appropriate Hamming code check bits. The data always goes to storage, never to the cache, but if address x happens to hit in the cache, the entry is invalidated by setting a flag (§4). Figure 8 shows that an I/OWrite proceeds through the pipeline very much like an I/ORead. The difference, of course, is that the WRITETR stage runs, and the READTR1 and READTR2 stages, although they run, do not transport data. Note that the write transport, from FastInBus to WriteBus, proceeds in parallel with mapping. Once the block has been loaded into WriteReg, STORAGE issues a write signal to StorageRAM. All that remains is to run READTR1 and READTR2, as explained above. If a map fault occurs during address translation, the write signal is blocked and the fault is passed along to be reported by READTR2.

[Figure 8: An I/OWrite in the pipeline.]

[...]

[...] other tasks. If more faults have occurred in the meantime, FaultCount will have been incremented again and the fault task will be reawakened.

The fault task does different things in response to the different types of fault. Single-bit errors, which are corrected, are not reported at all unless a special control bit in the hardware is set. With this bit set, the fault task can collect statistics on failing storage chips; if too many failures are occurring, the bit can be cleared and the machine can continue to run. Double-bit errors may be dealt with by re-trying the reference; a recurrence of the error must be reported to the operating system, which may stop using the failing memory, and may be able to reread the data from the disk if the page is not dirty, or determine which computation must be aborted. Page faults are the most likely reason to awaken the fault task, and together with write-protect faults are dealt with by yielding to memory-management software. MapRAM parity errors may disappear if the reference is re-tried; if they do not, the operating system can probably recover the necessary information.
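The fault task's policy can be summarized as a dispatch on fault type; the following sketch restates the decision logic above and is not Dorado microcode (the function and its arguments are invented).

    # Sketch of the fault task's responses to each fault type.
    def handle_fault(kind: str, retried: bool, report_single_bit: bool) -> str:
        if kind == 'single-bit error':
            # corrected by hardware; reported only when statistics are wanted
            return 'record chip statistics' if report_single_bit else 'ignore'
        if kind == 'double-bit error':
            return 'retry reference' if not retried else 'report to OS'
        if kind in ('page fault', 'write-protect fault'):
            return 'yield to memory-management software'
        if kind == 'MapRAM parity error':
            return 'retry reference' if not retried else 'OS recovers mapping'
        raise ValueError(kind)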
Microinstructions that read the various parts of History are provided, but only the emulator and the fault task may use them. These instructions use an alternate addressing path to History which does not interfere with the SRN addressing used by references in the pipeline. Reading base registers, the MapRAM, and CacheA can be done only by using these microinstructions.

This brings us to a serious difficulty with treating History as a pure ring buffer. To read a MapRAM entry, for example, the emulator must first issue a reference to that entry (normally a MapRead), and then read the appropriate part of History when the reference completes; similarly, a DummyRef (see Table 3) is used to read a base register. But because other tasks may run and issue their own references between the start of the emulator's reference and its reading of History, the emulator cannot be sure that its History entry will remain valid. Sixteen references by I/O tasks, for example, will destroy it.

To solve this problem, we designate History[0] as the emulator's "private" entry: MapRead, MapWrite, and DummyRef references use it, and it is excluded from the ring buffer. Because the fault task may want to make references of its own without disturbing History, another private entry is reserved for it. The ring buffer proper, then, is a 14-element memory used by all references except MapRead, MapWrite, and DummyRef in the emulator and fault task. For historical reasons, Fetch, Store and Flush references in the emulator and fault task also use the private entries; the tag mechanism (§4.1) ensures that the entries will not be reused too soon.
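The History arrangement might be modeled like this; the 16 entries, the emulator's History[0], the second private entry, and the 14-element ring come from the text, while the slot number for the fault task's private entry and all names are our assumptions.

    # Sketch of History: a 16-entry memory with two private slots and a
    # 14-element ring buffer for all other references.
    EMULATOR_PRIVATE, FAULT_PRIVATE = 0, 1   # History[0] per the text; slot 1 assumed
    RING = list(range(2, 16))                # the 14-element ring buffer proper

    class History:
        def __init__(self):
            self.entries = [None] * 16       # entry contents omitted in this sketch
            self.next = 0                    # position in the ring

        def slot_for(self, task: str, kind: str) -> int:
            private_kinds = {'MapRead', 'MapWrite', 'DummyRef'}
            if task == 'emulator' and kind in private_kinds:
                return EMULATOR_PRIVATE
            if task == 'fault' and kind in private_kinds:
                return FAULT_PRIVATE
            slot = RING[self.next]           # ordinary references use the ring,
            self.next = (self.next + 1) % 14 # which wraps after 14 references
            return slot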
It causes a store into the block to miss and set vacant; theresulting storage reference reports a write-protect fault ( 3.3). The beingLoaded flag is set for about 15 cycles while the entry is in the course of being loaded from storage; whenever the ADDRESSstage attempts to examine an entry, it waits until the entry is not beingLoaded, to ensure that theentry and its contents are not used while in this ambiguous state.When a cache reference misses, the block being referenced must be brought into the cache. Inorder to make room for it, some other block in the row must be displaced; this unfortunate iscalled the victim. CacheA implements an approximate least-recently-used rule for selecting thevictim. With each row, the current candidate for victim and the next candidate, called next victim,are kept. The victim and next victim are the top two elements of an LRU stack for that row;keeping only these two is what makes the replacement rule only approximately LRU. On a miss, thenext victim is promoted to be the new victim and a pseudo-random choice between the remainingtwo columns is promoted to be the new next victim. On each hit, the victim and next victim areupdated in the obvious way, depending on whether they themselves were hit.The flow of data in cache-storage interactions is shown in Figure 2. For example, a Fetch thatmisses will read an entire block from storage via the ReadBus, load the error-corrected block intoCacheD, and then make a one-word reference as if it had hit.What follows is a discussion of the four kinds of cache-storage interaction listed above.4.1 Clean missWhen the processor or IFU references a word w that is not in the cache, and the location chosen asvictim is vacant or holds data that is unchanged since it was read from storage (i.e., its dirty flag isnot set), a clean miss has occurred. The victim need not be written back, but a storage read mustbe done to load into the cache the block containing w. At the end of the read, w can be fetchedfrom the cache. A clean miss is much like an I/ORead, which was discussed in the previous section.The chief difference is that the block from storage is sent not over the FastOutBus to an outputdevice, but to the CacheD memory. Figure 9 illustrates a clean miss.All cache loads require a special cycle, controlled by READTR1, in which they get the correct cacheaddress from History and write the cache flags for the entry being loaded; the data paths ofCacheA are used to read this address and write the flags. This RThasA cycle takes priority over allother uses of CacheA and History, and can occur at any time with respect to ADDRESS, which alsoneeds access to these resources. Thus all control signals sent from ADDRESS are inhibited byRThasA, and ADDRESS is forced to idle during this cycle. Figure 9 shows that the RThasA cycleoccurs just before the first word of the new block is written into CacheD. (For simplicity andclarity we will not show RThasA cycles in the figures that follow.) During RThasA, the beingLoadedflag is cleared (it was set when the reference was in ADDRESS) and the writeProtected flag is copiedfrom the writeProtected bit in MapRAM. As soon as the transport into CacheD is finished, the wordreference that started the miss can be made, much as though it had hit in the first place. 
The flow of data in cache-storage interactions is shown in Figure 2. For example, a Fetch that misses will read an entire block from storage via the ReadBus, load the error-corrected block into CacheD, and then make a one-word reference as if it had hit.

What follows is a discussion of the four kinds of cache-storage interaction listed above.

4.1 Clean miss

When the processor or IFU references a word w that is not in the cache, and the location chosen as victim is vacant or holds data that is unchanged since it was read from storage (i.e., its dirty flag is not set), a clean miss has occurred. The victim need not be written back, but a storage read must be done to load into the cache the block containing w. At the end of the read, w can be fetched from the cache. A clean miss is much like an I/ORead, which was discussed in the previous section. The chief difference is that the block from storage is sent not over the FastOutBus to an output device, but to the CacheD memory. Figure 9 illustrates a clean miss.

All cache loads require a special cycle, controlled by READTR1, in which they get the correct cache address from History and write the cache flags for the entry being loaded; the data paths of CacheA are used to read this address and write the flags. This RThasA cycle takes priority over all other uses of CacheA and History, and can occur at any time with respect to ADDRESS, which also needs access to these resources. Thus all control signals sent from ADDRESS are inhibited by RThasA, and ADDRESS is forced to idle during this cycle. Figure 9 shows that the RThasA cycle occurs just before the first word of the new block is written into CacheD. (For simplicity and clarity we will not show RThasA cycles in the figures that follow.) During RThasA, the beingLoaded flag is cleared (it was set when the reference was in ADDRESS) and the writeProtected flag is copied from the writeProtected bit in MapRAM. As soon as the transport into CacheD is finished, the word reference that started the miss can be made, much as though it had hit in the first place. If the reference was a Fetch, the appropriate word is sent to FetchReg in the processor (and loaded into FetchRegRAM); if a Store, the contents of StoreReg are stored into the new block in the cache.

If the processor tries to use data it has fetched, it is prevented from proceeding, or held, until the word reference has occurred (see §5.1). Each fetch is assigned a sequence number called its tag, which is logically part of the reference; actually it is written into History, and read when needed by READTR1. Tags increase monotonically. The tag of the last Fetch started by each task is kept in StartedTag (it is written there when the reference is made), and the tag of the last Fetch completed by the memory is kept in DoneTag (it is written there as the Fetch is completed); these are task-specific registers. Since tags are assigned monotonically, and fetches always complete in order within a task, both registers increase monotonically. If StartedTag=DoneTag, all the fetches that the task has started have completed, and the processor need not be held.
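In code, the tag bookkeeping for one task might look like this (a sketch; register widths and wraparound are ignored, and the names follow the text).

    # Sketch of StartedTag/DoneTag bookkeeping for one task.
    class FetchTags:
        def __init__(self):
            self.started_tag = 0   # tag of the last Fetch started (task-specific)
            self.done_tag = 0      # tag of the last Fetch completed (task-specific)

        def start_fetch(self) -> int:
            self.started_tag += 1  # tags increase monotonically
            return self.started_tag

        def complete_fetch(self, tag: int):
            self.done_tag = tag    # fetches complete in order within a task

        def must_hold(self) -> bool:
            # hold the processor if it uses fetched data before all of its
            # started fetches are done
            return self.started_tag != self.done_tag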
[Figure 9: A clean miss.]

[...]

A Flush explicitly removes the block containing the addressed location from the cache, rewriting it in storage if it is dirty. Flush is used to remove a virtual page's blocks from the cache so that its MapRAM entry can be changed safely. If a Flush misses, nothing happens. If it hits, the hit location must be marked vacant, and if it is dirty, the block must be written to storage. To simplify the hardware implementation, this write operation is made to look like a victim write. A dirty Flush is converted into a FlushFetch reference, which is treated almost exactly like a Prefetch. Thus, when a Flush in ADDRESS hits, three things happen:

    the victim for the selected row of CacheA is changed to point to the hit column;
    the vacant flag is set;
    if the dirty flag for that column is set, the Flush is converted into a FlushFetch.

Proceeding like a Prefetch, this does a useless read (which is harmless because the vacant flag has been set), and then a write of the dirty victim. Figure 11 shows a dirty Flush. The FlushFetch spends two cycles in ADDRESS, instead of the usual one, because of an uninteresting implementation problem.

[Figure 11: A dirty Flush.]

[...]

[...] too expensive to do this, and we observed that stores are rare compared to fetches in any case.

History busy. As discussed in §3.3, a reference uses various parts of the History memory at various times as it makes its way through the pipeline. Microinstructions for reading History are provided, and they must be held if they will conflict with any other use.

The memory system must generate Hold for precisely the above reasons. It turns out, however, that there are several situations in which hardware or time can be saved if Hold is generated when it is not strictly needed. This was done only in cases that we expect to occur rarely, so the performance penalty should be small. An extra Hold has no logical effect, since it only converts the current microinstruction into a jump-to-self. One example of this situation is that a reference in the cycle after a miss is always held, even though it must be held only if the miss's victim is dirty or the map is busy; the reason is that the miss itself is detected barely in time to generate Hold, and there is no time for additional logic. Another example: uses of FetchReg are held while ADDRESS is busy, although they need not be, since they do not use it.

5.2 Waiting in ADDRESS

A reference in ADDRESS normally proceeds either to HITDATA (in the case of a hit) or to MAP (for a miss, a victim write or an I/O reference) after one cycle. If HITDATA or MAP is busy, it will wait in ADDRESS, causing subsequent references to be held because ADDRESS is busy, as discussed above.

HITDATA uses CacheD, and therefore cannot be started when CacheD is busy. A reference that hits must therefore wait in ADDRESS while CacheD is busy, i.e., during transports to and from storage, and during single-word transfers resulting from previous fetches and stores. Some additional hardware would have enabled a reference to be passed to HITDATA and wait there, instead of in ADDRESS, for CacheD to become free; ADDRESS would then be free to accept another reference. This performance improvement was judged not worth the requisite hardware.

When MAP is busy with an earlier reference, a reference in ADDRESS will wait if it needs MAP. An example of this is shown in Figure 10, where the victim write waits while MAP handles the read. However, even if MAP is free, a write must wait in ADDRESS until it can start WRITETR; since WRITETR always takes longer than MAP, there is no point in starting MAP first, and the implementation is simplified by the rule that starting MAP always frees ADDRESS. Figure 13 shows two back-to-back I/OWrites, the second of which waits one extra cycle in ADDRESS before starting both WRITETR and MAP.

The last reason for waiting in ADDRESS has to do with the beingLoaded flag in the cache. If ADDRESS finds that beingLoaded is set anywhere in the row it touches, it waits until the flag is cleared (this is done by READTR1 during the RThasA cycle). A better implementation would wait only if the flag is set in the column in which it hits, but this was too slow and would also require special logic to ensure that an entry being loaded is not chosen as a victim. Of course it would be much better to Hold a reference to a row being loaded before it ever gets into ADDRESS, but unfortunately the reference must be in ADDRESS to read the flags in the first place.

5.3 Waiting in MAP

The traffic control techniques discussed thus far, namely Hold and waiting in ADDRESS, are not sufficient to prevent all the conflicts shown in Table 4. In particular, neither deals with conflicts downstream in the pipeline. Such conflicts could be resolved by delaying a reference in ADDRESS until it was certain that no further conflicts with earlier references could occur. This is not a good idea, because references that hit, which is to say most references, must be held when ADDRESS is busy. If conflicts are resolved in MAP or later, hits can proceed unimpeded, since they do not use later sections of the pipeline.
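The waiting rules of §5.2 can be collected into a single predicate (a sketch; the signal names are ours, and in the hardware these conditions are evaluated by the stage automata rather than by one function).

    # Sketch of the conditions under which a reference waits in ADDRESS.
    def address_must_wait(ref_kind: str, hit: bool,
                          hitdata_busy: bool, cache_d_busy: bool,
                          map_busy: bool, writetr_busy: bool,
                          row_being_loaded: bool) -> bool:
        if row_being_loaded:                     # beingLoaded set anywhere in the row
            return True
        if hit:
            return hitdata_busy or cache_d_busy  # HITDATA needs CacheD
        if ref_kind == 'write':                  # victim write or I/OWrite
            return writetr_busy                  # must be able to start WRITETR
        return map_busy                          # misses and I/O reads need MAP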
[...]

7. Performance

[...] large: all over 99 percent. This is one reason that the number of held cycles is small: a miss can cause the processor to be held for about thirty cycles while a reference completes. In fact, the table shows that Hold and hit are inversely related over the programs measured. Beads has the lowest hit rate and the highest Hold rate; Placer has the highest hit rate and the lowest Hold rate.

The percentage of Store references is interesting because stores eventually give rise to dirty victim write operations, which consume storage bandwidth and cause extra occurrences of Hold by tying up the ADDRESS section of the pipeline. Furthermore, one of the reasons that the StoreReg register was not made task-specific was the assumption that stores would be relatively rare (see the discussion of StoreReg in §5.1). Table 5 shows that stores accounted for between 10 and 19 percent of all references to the cache.

Comparing the number of hits to the number of stores shows that the write-back discipline used in the cache was a good choice. Even if every miss had a dirty victim, the number of victim writes would still be much less than under the write-through discipline, under which every Store would cause a write. In fact, not all misses have dirty victims, as shown in the last column of the table. The percentage of misses with dirty victims varies widely from program to program. Placer, which had the highest frequency of stores and the lowest frequency of misses, naturally has the highest frequency of dirty victims. Beads, with the most misses but the fewest stores, has the lowest. The last three columns of the table show that write operations would increase about a hundredfold if write-through were used instead of write-back.
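The force of this argument is easy to see with a little arithmetic; the sketch below uses made-up placeholder rates (the measured values are in Table 5, which does not survive in this copy).

    # Sketch of the write-back vs. write-through comparison. With write-back,
    # storage writes happen only for dirty victims; with write-through, every
    # Store writes storage. All rates below are invented placeholders.
    references = 1_000_000
    store_fraction = 0.15          # Table 5 reports 10 to 19 percent
    miss_rate = 0.005              # hit rates were over 99 percent
    dirty_victim_fraction = 0.5    # varies widely from program to program

    write_through_writes = references * store_fraction
    write_back_writes = references * miss_rate * dirty_victim_fraction

    print(write_through_writes / write_back_writes)   # -> 60.0 with these values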
Acknowledgements

The concept and structure of the Dorado memory system are due to Butler Lampson and Chuck Thacker. Much of the design was brought to the register-transfer level by Lampson and Brian Rosen. Final design, implementation, and debugging were done by the authors and Ed McCreight, who was responsible for the main storage boards. Debugging software and microcode were written by Ed Fiala, Willie-Sue Haugeland, and Gene McDaniel. Haugeland and McDaniel were also of great help in collecting the statistics reported in §7. Useful comments on earlier versions of this paper were contributed by Forest Baskett, Gene McDaniel, Jim Morris, Tim Rentsch, and Chuck Thacker.

References

1. Bell, J. et al. An investigation of alternative cache organizations. IEEE Trans. Computers C-23, 4, April 1974, 346-351.
2. Bloom, L. et al. Considerations in the design of a computer with high logic-to-memory speed ratio. Proc. Gigacycle Computing Systems, AIEE Special Pub. S-136, Jan. 1962, 53-63.
3. Conti, C.J. Concepts for buffer storage. IEEE Computer Group News 2, March 1969, 9-13.
4. Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, Pacific Grove, Nov. 1979.
5. Forgie, J.W. The Lincoln TX-2 input-output system. Proc. Western Joint Computer Conference, Los Angeles, Feb. 1957, 156-160.
6. Geschke, C.M. et al. Early experience with Mesa. Comm. ACM 20, 8, Aug. 1977, 540-552.
7. Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles of Programming Languages, Tucson, Jan. 1978, 9-16.
8. Knuth, D.E. TEX and METAFONT: New Directions in Typesetting. American Math. Soc. and Digital Press, Bedford, Mass., 1979.
9. Lampson, B.W. et al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.
10. Lampson, B.W., and Pier, K.A. A processor for a high-performance personal computer. Proc. 7th Int. Symp. Computer Architecture, SigArch/IEEE, La Baule, May 1980, 146-160. Also in Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981.
11. Liptay, J.S. Structural aspects of the System/360 Model 85. II. The cache. IBM Systems Journal 7, 1, 1968, 15-21.
12. Metcalfe, R.M., and Boggs, D.R. Ethernet: distributed packet switching for local computer networks. Comm. ACM 19, 7, July 1976, 395-404.