The Memory System of a High-Performance Personal Computer

by Douglas W. Clark¹, Butler W. Lampson, and Kenneth A. Pier

January 1981

ABSTRACT

The memory system of the Dorado, a compact high-performance personal computer, has very high I/O bandwidth, a large paged virtual memory, a cache, and heavily pipelined control; this paper discusses all of these in detail. Relatively low-speed I/O devices transfer single words to or from the cache; fast devices, such as a color video display, transfer directly to or from main storage while the processor uses the cache. Virtual addresses are used in the cache and for all I/O transfers. The memory is controlled by a seven-stage pipeline, which can deliver a peak main-storage bandwidth of 530 million bits per second to service fast I/O devices and cache misses. Interesting problems of synchronization and scheduling in this pipeline are discussed. The paper concludes with some performance measurements that show, among other things, that the cache hit rate is over 99 percent.

A revised version of this paper will appear in IEEE Transactions on Computers.

CR CATEGORIES

6.34, 6.21.

KEY WORDS AND PHRASES

bandwidth, cache, latency, memory, pipeline, scheduling, storage, synchronization, virtual memory.

1. Present address: Digital Equipment Corporation, Tewksbury, Mass. 01876.

© Copyright 1981 by Xerox Corporation.

XEROX
PALO ALTO RESEARCH CENTER
3333 Coyote Hill Road / Palo Alto / California 94304
Memory references specify a 16 or 28 bit displacement, and one of 32 base registers of 28 bits; the virtual address is the sum of the displacement and the base. Virtual address translation, or mapping, is implemented by table lookup in a dedicated memory. Main storage is the permanent home of data stored by the memory system. The storage is necessarily slow (i.e., it has long latency, which means that it takes a long time to respond to a request), because of its implementation in cheap but slow dynamic MOS RAMs (random access memories). To make up for being slow, storage is big, and it also has high bandwidth, which is more important than latency for sequential references. In addition, there is a cache which services non-sequential references with high speed (low latency), but is inferior to main storage in its other parameters. The relative values of these parameters are shown in Table 1.

                 Cache    Storage
    Latency⁻¹       15          1
    Bandwidth        1          2
    Capacity         1        250

    Table 1: Parameters of the cache relative to storage

With one exception (the IFU), all memory references are initiated by the processor, which thus acts as a multiplexor controlling access to the memory (see §1.2 and [10]), and is the sole source of addresses. Once started, however, a reference proceeds independently of the processor. Each one carries with it the number of its originating task, which serves to identify the source or sink of any data transfer associated with the reference. The actual transfer may take place much later, and each source or sink must be continually ready to deliver or accept data on demand.
It is possible for a task to have several references outstanding, but order is preserved within each type of reference, so that the task number plus some careful hardware bookkeeping is sufficient to match up data with references.

Table 2 lists the types of memory references executable by microcode. Figure 2, a picture of the memory system's main data paths, should clarify the sources and destinations of data transferred by these references (parts of Figure 2 will be explained in more detail later). All references, including fast I/O references, specify virtual, not real addresses. Although a microinstruction actually specifies a displacement and a base register which together form the virtual address, for convenience we will suppress this fact and write, for example, Fetch(a) to mean a fetch from virtual address a.

A Fetch from the cache delivers data to a register called FetchReg, from which it can be retrieved at any later time; since FetchReg is task-specific, separate tasks can make their cache references independently. An I/ORead reference delivers a 16-word block of data from storage to the FastOutBus (by way of the error corrector, as shown in Figure 2), tagged with the identity of the requesting task; the associated output device is expected to monitor this bus and grab the data when it appears. Similarly, the processor can Store one word of data into the cache, or do an I/OWrite reference which demands a block of data from an input device and sends it to storage (by way of the check-bit generator). There is also a Prefetch reference, which brings a block into the cache. Fetch, Store and Prefetch are called cache references. There are special references to flush data from the cache and to allow map entries to be read and written; these will be discussed later.

The instruction fetch unit is the only device that can make a reference independently of the processor.
It uses a single base register, and is treated almost exactly like a processor cache fetch, except that the IFU has its own set of registers for receiving memory data (see [9] for details). In general we ignore IFU references from now on, since they add little complexity to the memory system.

All busses are 16 bits wide; blocks of data are transferred to and from storage at the rate of 16 bits every half cycle (30 ns). This means that 256 bits can be transferred in 8 cycles or 480 ns, which is somewhat more than the 375 ns cycle time of the RAM chips that implement main storage. Thus a block size of 256 bits provides a fairly good match between bus and chip bandwidths; it is also a comfortable unit to store in the cache. The narrow busses increase the latency of a storage transfer somewhat, but they have little effect on the bandwidth. A few hundred nanoseconds of latency is of little importance either for sequential I/O transfers or for delivery of data to a properly functioning cache.

Various measures are taken to maximize the performance of the cache.
Data stored there is not written back to main storage until the cache space is needed for some other purpose (the write-back rather than the more common write-through discipline [1, 14]); this makes it possible to use memory locations much like registers in an interpreted instruction set, without incurring the penalty of main storage accesses. Virtual rather than real addresses are stored in the cache, so that the speed of memory mapping does not affect the speed of cache references. (Translation buffers [15, 20] are another way to accomplish this.) This would create problems if there were multiple address spaces. Although these problems can be solved, in a single-user environment with a single address space they do not even need to be considered.

Another important technique for speeding up data manipulation in general, and cache references in particular, is called bypassing. Bypassing is one of the speed-up techniques used in the Common Data Bus of the IBM 360/91 [19]. Sequences of instructions having the form

    (1)  register ← computation1
    (2)  computation2 involving the register

are very common. Usually the execution of the first instruction takes more than one cycle and is pipelined. As a result, however, the register is not loaded at the end of the first cycle, and therefore is not ready at the beginning of the second instruction. The idea of bypassing is to avoid waiting for the register to be loaded, by routing the results of the first computation directly to the inputs of the second one.
The effective latency of the cache is thus reduced from two cycles to one in many cases (see §2.3).

The implementation of the Dorado memory reflects a balance among competing demands:

    for simplicity, so that it can be made to work initially, and maintained when components fail;

    for speed, so that the performance will be well-matched to the rest of the machine;

    for space, since cost and packaging considerations limit the number of components and edgepins that can be used.

None of these demands is absolute, but all have thresholds that are costly to cross. In the Dorado we set a somewhat arbitrary speed requirement for the whole machine, and generally tried to save space by adding complexity, pushing ever closer to the simplicity threshold. Although many of the complications in the memory system are unavoidable consequences of the speed requirements, some of them could have been eliminated by adding hardware.

2. The cache

The memory system is organized into two kinds of building blocks: pipeline stages, which provide the control (their names are in SMALL CAPITALS), and resources, which provide the data paths and memories. Figure 3 shows the various stages and their arrangement into two pipelines. One, consisting of the ADDRESS and HITDATA stages, handles cache references and is the subject of this section; the other, containing MAP, WRITETR, STORAGE, READTR1 and READTR2, takes care of storage references and is dealt with in §3 and §4. References start out either in PROC, the processor, or in the IFU.
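The bypassing technique introduced in §1 can be sketched in software (a hypothetical Python model, not Dorado microcode; the class and method names are illustrative): a result still in flight in the pipeline is forwarded to a consumer before the register file itself has been written.

```python
# Hypothetical sketch of bypassing (operand forwarding).
# A result produced in one cycle is architecturally written back a cycle
# later, but a bypass path makes it visible to the very next instruction.

class BypassedRegisterFile:
    def __init__(self):
        self.regs = {}          # committed register values
        self.in_flight = None   # (name, value) not yet written back

    def issue_write(self, name, value):
        """Start a pipelined write; the register file itself is only
        updated one cycle later, in end_cycle()."""
        self.in_flight = (name, value)

    def read(self, name):
        """Read an operand, taking it from the bypass path when the
        pending write targets the same register."""
        if self.in_flight and self.in_flight[0] == name:
            return self.in_flight[1]   # forwarded result, no stall
        return self.regs[name]         # committed value

    def end_cycle(self):
        """Retire the pending write into the register file."""
        if self.in_flight:
            name, value = self.in_flight
            self.regs[name] = value
            self.in_flight = None

rf = BypassedRegisterFile()
rf.regs["r1"] = 0
rf.issue_write("r1", 42)       # (1) register <- computation1
assert rf.read("r1") == 42     # (2) sees the result via the bypass
rf.end_cycle()
assert rf.regs["r1"] == 42     # write-back completed a cycle later
```

The same structure reappears in §2.3, where the cache's FetchReg is bypassed to the processor.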
miss: the address is not present in the cache. During normal operation, it is not possible for more than one column to match. The entire matching process can be seen in Figure 4, between 60 and 90 ns after the start of the reference. The cache address latched at 90 ns contains the row, word and column; these 14 bits address a single word in CacheD. Of course, only the top 16 key bits of the address need be matched, since the row bits are used to select the row, and all the words of a block are present or absent together.

Four flag bits are stored with each cache entry to keep track of its status. We defer discussion of these flags until §4.

2.2 Cache data

The CacheD resource stores the data for the blocks whose addresses appear in CacheA; closely associated with it are the StoreReg and task-specific FetchReg registers which allow the processor to deliver and retrieve its data independently of the memory system's detailed timing. CacheD is quite simple, and would consist of nothing but a 16K by 16 bit memory were it not for the bandwidth of the storage. To keep up with storage the cache must be able to accept a word every half cycle (30 ns). Since its memory chips cannot cycle this fast, CacheD is organized in two banks which run a half-cycle out of phase when transferring data to or from the storage. On a hit, however, both banks are cycled together and CacheD behaves like an 8K by 32 bit memory. A multiplexor selects the proper half to deliver into FetchReg. All this is shown in Figure 4.

Figure 4 does not, however, show how FetchReg is made task-specific. In fact, there is a 16-word memory FetchRegRAM in addition to the register shown.
The register holds the data value for the currently executing task. When a Fetch reference completes, the word from CacheD is always loaded into the RAM entry for the task that made the reference; it is also loaded into FetchReg if that task is the one currently running. Whenever the processor switches tasks, the FetchRegRAM entry for the new task is read out and loaded into FetchReg. Matters are further complicated by the bypassing scheme described in the next subsection.

StoreReg is not task-specific. The reason for this choice and the problem it causes are explained in §5.1.

2.3 Cache pipelining

From the beginning of a cache reference, it takes two and a half cycles before the data is ready in FetchReg, even if it hits and there are no delays. However, because of the latches in the pipeline (some of which are omitted from Figure 4), a new reference can be started every cycle, and if there are no misses the pipeline will never clog up, but will continue to deliver a word every 60 ns. This works because nothing in later stages of the pipeline affects anything that happens in an earlier stage.

The exception to this principle is delivery of data to the processor itself. When the processor uses data that has been fetched, it depends on the later stages of the pipeline. In general this dependency is unavoidable, but in the case of the cache the bypassing technique described in §1.4 is used to reduce the latency. A cache reference logically delivers its data to the FetchReg register at the end of the cycle following the reference cycle (actually halfway through the second cycle, at 150 ns in Figure 4). Often the data is then sent to a register in the processor, with a (microcode) sequence such as

    (1)  Fetch(address)
    (2)  register ← FetchReg
    (3)  computation involving register

The register is not actually loaded until cycle (3); hence the data, which is ready in the middle of cycle (3), arrives in time, and instruction (2) does not have to wait.
The data is supplied to the computation in cycle (3) by bypassing. The effective latency of the cache is thus only one cycle in this situation.

Unfortunately this sleight-of-hand does not always work. The sequence

    (1)  Fetch(address)
    (2)  computation involving FetchReg

actually needs the data during cycle (2), which will therefore have to wait for one cycle (see §5.1). Data retrieved in cycle (1) would be the old value of FetchReg; this allows a sequence of fetches

    (1)  Fetch(address1)
    (2)  register1 ← FetchReg, Fetch(address2)
    (3)  register2 ← FetchReg, Fetch(address3)
    (4)  register3 ← FetchReg, Fetch(address4)
    . . .

to proceed at full speed.

3. The storage pipeline

Cache misses and fast I/O references use the storage portion of the pipeline, shown in Figure 3. In this section we first describe the operation of the individual pipeline stages, then explain how fast I/O references use them, and finally discuss how memory faults are handled. Using I/O references to expose the workings of the pipeline allows us to postpone until §4 a close examination of the more complicated references involving both cache and storage.

3.1 Pipeline stages

Each of the pipeline stages is implemented by a simple finite-state automaton that can change state on every microinstruction cycle. Resources used by a stage are controlled by signals that its automaton produces. Each stage owns some resources, and some stages share resources with others. Control is passed from one stage to the next when the first produces a start signal for the second; this signal forces the second automaton into its initial state. Necessary information about the reference type is also passed along when one stage starts another.

3.1.1 The ADDRESS stage

As we saw in §2, the ADDRESS stage computes a reference's virtual address and looks it up in CacheA.
If it hits, and is not I/ORead or I/OWrite, control is passed to HITDATA. Otherwise, control is passed to MAP, starting a storage cycle. In the simplest case a reference spends just one microinstruction cycle in ADDRESS, but it can be delayed for various reasons discussed in §5.

3.1.2 The MAP stage

The MAP stage translates a virtual address into a real address by looking it up in a hardware table called the MapRAM, and then starts the STORAGE stage. Figure 5 illustrates the straightforward conversion of a virtual page number into a real page number. The low-order bits are not mapped; they point to a single word on the page.

Three flag bits are stored in MapRAM for each virtual page:

    ref, set automatically by any reference to the page;

    dirty, set automatically by any write into the page;

    writeProtect, set by memory-management software (using the MapWrite reference).

A virtual page not in use is marked as vacant by setting both writeProtect and dirty, an otherwise nonsensical combination. A reference is aborted by the hardware if it touches a vacant page, attempts to write a write-protected page, or causes a parity error in the MapRAM. All three kinds of map fault are passed down the pipeline to READTR2 for reporting; see §3.1.5.
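A minimal software model of this lookup may help (a hypothetical Python sketch; the page-offset width, dictionary representation, and fault strings are illustrative, not the Dorado's actual encoding):

```python
# Hypothetical sketch of the MAP stage's table lookup and flag handling.
# PAGE_BITS and the fault names are illustrative only.

PAGE_BITS = 8  # low-order bits are not mapped; they select a word on the page

class MapEntry:
    def __init__(self, real_page, ref=False, dirty=False, write_protect=False):
        self.real_page = real_page
        self.ref = ref
        self.dirty = dirty
        self.write_protect = write_protect

    def vacant(self):
        # The otherwise nonsensical flag combination marks a vacant page.
        return self.write_protect and self.dirty

def map_translate(map_ram, vaddr, is_write):
    """Translate a virtual address; return (real_address, fault)."""
    vpage, offset = divmod(vaddr, 1 << PAGE_BITS)
    entry = map_ram[vpage]
    if entry.vacant():
        return None, "page fault"            # reference to a vacant page
    if is_write and entry.write_protect:
        return None, "write-protect fault"   # write to a protected page
    entry.ref = True                         # set by any reference
    if is_write:
        entry.dirty = True                   # set by any write
    return (entry.real_page << PAGE_BITS) | offset, None

map_ram = {0: MapEntry(real_page=5),
           1: MapEntry(0, dirty=True, write_protect=True)}  # vacant page
addr, fault = map_translate(map_ram, 3, is_write=False)
assert addr == (5 << PAGE_BITS) | 3 and fault is None
assert map_translate(map_ram, 1 << PAGE_BITS, False)[1] == "page fault"
```

In the hardware these faults are not raised at MAP; as the text explains, they travel down the pipeline to READTR2 for reporting.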
and completes the StorageRAM cycle (§3.1.3). READTR1 and READTR2 transport the data, control the error corrector, and deliver the data to FastOutBus (§3.1.5). Fault reporting, if necessary, is done by READTR2 as soon as the condition of the last quadword in the block is known (§3.3).

It is clear from Figure 7 that an I/ORead can be started every eight machine cycles, since this is the longest period of activity of any stage. This would result in 530 million bits per second of bandwidth, the maximum supportable by the memory system. The inner loop of a fast I/O task can be written in two microinstructions, so if a new I/ORead is launched every eight cycles, one-fourth of the processor capacity will be used. Because ADDRESS is used for only one cycle per I/ORead, other tasks, notably the emulator, may continue to hit in the cache when the I/O task is not running.

I/OWrite(x) writes into virtual location x a block of data delivered by a fast input device, together with appropriate Hamming code check bits. The data always goes to storage, never to the cache, but if address x happens to hit in the cache, the entry is invalidated by setting a flag (§4). Figure 8 shows that an I/OWrite proceeds through the pipeline very much like an I/ORead. The difference, of course, is that the WRITETR stage runs, and the READTR1 and READTR2 stages, although they run, do not transport data. Note that the write transport, from FastInBus to WriteBus, proceeds in parallel with mapping. Once the block has been loaded into WriteReg, STORAGE issues a write signal to StorageRAM. All that remains is to run READTR1 and READTR2, as explained above.
If a map fault occurs during address translation, the write signal is blocked and the fault is passed along to be reported by READTR2.

Figure 8 shows a delay in the MAP stage's handling of I/OWrite. MAP remains in state 3 for two extra cycles, which are labelled with asterisks, rather than state numbers, in Figure 8. This delay allows the write transport to finish before the write signal is issued to StorageRAM. This synchronization and others are detailed in §5.

Because WRITETR takes eleven cycles to run, I/OWrites can only run at the rate of one every eleven cycles, yielding a maximum bandwidth for fast input devices of 390 million bits per second. At that rate, two of every eleven cycles would go to the I/O task's inner loop, consuming 18 percent of the processor capacity. But again, other tasks could hit in the cache in the remaining nine cycles.

3.3 History and fault reporting

There are two kinds of memory system faults: map and storage. A map fault is a MapRAM parity error, a reference to a page marked vacant, or a write operation to a write-protected page. A storage fault is either a single or a double error (within a quadword) detected during a read. In what follows we do not always distinguish between the two types.

Consider how a page fault might be handled. MAP has read the MapRAM entry for a reference and found the virtual page marked vacant. At this point there may be another reference in ADDRESS waiting for MAP, and one more in the processor waiting for ADDRESS.
An earlier reference may be in READTR1, perhaps about to cause a storage fault. The processor is probably several instructions beyond the one that issued the faulting reference, perhaps in another task. What to do? It would be quite cumbersome at this point to halt the memory system, deal with the fault, and restart the memory system in such a way that the fault was transparent to the interrupted tasks. Instead, the Dorado allows the reference to complete, while blunting any destructive consequences it might have. A page fault, for example, forces the cache's vacant flag to be set when the read transport is done. At the very end of the pipeline READTR2 wakes up the Dorado's highest-priority microtask, the fault task, which must deal appropriately with the fault, perhaps with the help of memory-management software.

Because the fault may be reported well after it happened, a record of the reference must be kept which is complete enough that the fault task can sort out what has happened. Furthermore, because later references in the pipeline may cause additional faults, this record must be able to encompass several faulting references. The necessary information associated with each reference, about 80 bits, is recorded in a 16-element memory called History. Table 3 gives the contents of History and shows which stage is responsible for writing each part. History is managed as a ring buffer and is addressed by a 4-bit Storage Reference Number or SRN, which is passed along with the reference through the various pipeline stages. When a reference is passed to the MAP stage, a counter containing the next available SRN is incremented.
A hit writes the address portion of History (useful for diagnostic purposes; see below), without incrementing the SRN counter.

    Entry                                                        Written by
    Virtual address, reference type, task number, cache column   ADDRESS
    Real page number, MapRAM flags, map fault                    MAP
    Storage fault, bit corrected (for single errors)             READTR2

    Table 3: Contents of the History memory

Two hardware registers accessible to the processor help the fault task interpret History: FaultCount is incremented every time a fault occurs; FirstFault holds the SRN of the first faulting reference. The fault task is awakened whenever FaultCount is non-zero; it can read both registers and clear FaultCount in a single atomic operation. It then handles FaultCount faults, reading successive elements of History starting with History[FirstFault], and then yields control of the processor to the other tasks. If more faults have occurred in the meantime, FaultCount will have been incremented again and the fault task will be reawakened.

The fault task does different things in response to the different types of fault. Single bit errors, which are corrected, are not reported at all unless a special control bit in the hardware is set. With this bit set, the fault task can collect statistics on failing storage chips; if too many failures are occurring, the bit can be cleared and the machine can continue to run. Double bit errors may be dealt with by re-trying the reference; a recurrence of the error must be reported to the operating system, which may stop using the failing memory, and may be able to reread the data from the disk if the page is not dirty, or determine which computation must be aborted.
Page faults are the most likely reason to awaken the fault task, and together with write-protect faults are dealt with by yielding to memory-management software. MapRAM parity errors may disappear if the reference is re-tried; if they do not, the operating system can probably recover the necessary information.

Microinstructions that read the various parts of History are provided, but only the emulator and the fault task may use them. These instructions use an alternate addressing path to History which does not interfere with the SRN addressing used by references in the pipeline. Reading base registers, the MapRAM, and CacheA can be done only by using these microinstructions.

This brings us to a serious difficulty with treating History as a pure ring buffer. To read a MapRAM entry, for example, the emulator must first issue a reference to that entry (normally a MapRead), and then read the appropriate part of History when the reference completes; similarly, a DummyRef (see Table 3) is used to read a base register. But because other tasks may run and issue their own references between the start of the emulator's reference and its reading of History, the emulator cannot be sure that its History entry will remain valid. Sixteen references by I/O tasks, for example, will destroy it.

To solve this problem, we designate History[0] as the emulator's "private" entry: MapRead, MapWrite, and DummyRef references use it, and it is excluded from the ring buffer. Because the fault task may want to make references of its own without disturbing History, another private entry is reserved for it. The ring buffer proper, then, is a 14-element memory used by all references except MapRead, MapWrite, and DummyRef in the emulator and fault task. For historical reasons, Fetch, Store and Flush references in the emulator and fault task also use the private entries; the tag mechanism (§4.1) ensures that the entries will not be reused too soon.

In one case History is read, rather than written, by a pipeline stage.
This happens during a read transport, when READTR1 gets from History the cache address (row and column) it needs for writing the new data and the cache flags. This is done instead of piping this address along from ADDRESS to READTR1.

4. Cache-storage interactions

The preceding sections describe the normal case in which the cache and main storage function independently. Here we consider the relatively rare interactions between them. These can happen for a variety of reasons:

    Processor references that miss in the cache must fetch their data from storage.

    A dirty block in the cache must be re-written in storage when its entry is needed.

    Prefetch and flush operations explicitly transfer data between cache and storage.

    I/O references that hit in the cache must be handled correctly.

Cache-storage interactions are aided by the four flag bits that are stored with each cache entry to keep track of its status (see Figure 4). The vacant flag indicates that an entry should never match; it is set by software during system initialization, and by hardware when the normal procedure for loading the cache fails, e.g., because of a page fault. The dirty flag is set when the data in the entry is different from the data in storage because the processor did a store; this means that the entry must be written back to storage before it is used for another block. The writeProtected flag is a copy of the corresponding bit in the map. It causes a store into the block to miss and set vacant; the resulting storage reference reports a write-protect fault (§3.3).
The beingLoaded flag is set for about 15 cycles while the entry is in the course of being loaded from storage; whenever the ADDRESS stage attempts to examine an entry, it waits until the entry is not beingLoaded, to ensure that the entry and its contents are not used while in this ambiguous state.

When a cache reference misses, the block being referenced must be brought into the cache. In order to make room for it, some other block in the row must be displaced; this unfortunate is called the victim. CacheA implements an approximate least-recently-used rule for selecting the victim. With each row, the current candidate for victim and the next candidate, called next victim, are kept. The victim and next victim are the top two elements of an LRU stack for that row; keeping only these two is what makes the replacement rule only approximately LRU. On a miss, the next victim is promoted to be the new victim and a pseudo-random choice between the remaining two columns is promoted to be the new next victim. On each hit, the victim and next victim are updated in the obvious way, depending on whether they themselves were hit.

The flow of data in cache-storage interactions is shown in Figure 2. For example, a Fetch that misses will read an entire block from storage via the ReadBus, load the error-corrected block into CacheD, and then make a one-word reference as if it had hit.

What follows is a discussion of the four kinds of cache-storage interaction listed above.

4.1 Clean miss

When the processor or IFU references a word w that is not in the cache, and the location chosen as victim is vacant or holds data that is unchanged since it was read from storage (i.e., its dirty flag is not set), a clean miss has occurred. The victim need not be written back, but a storage read must be done to load into the cache the block containing w. At the end of the read, w can be fetched from the cache.
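The victim/next-victim bookkeeping described above can be sketched as follows (a hypothetical Python model of one four-column cache row; the update details on a hit and the source of pseudo-randomness are illustrative assumptions, not the exact hardware rule):

```python
import random

# Hypothetical model of one cache row's approximate-LRU bookkeeping.
# Only the top two elements of the row's LRU stack are kept.

class CacheRow:
    COLUMNS = 4

    def __init__(self, seed=0):
        self.rand = random.Random(seed)  # stands in for hardware pseudo-randomness
        self.victim, self.next_victim = 0, 1

    def _others(self):
        # The two columns that are currently neither victim nor next victim.
        return [c for c in range(self.COLUMNS)
                if c not in (self.victim, self.next_victim)]

    def miss(self):
        """Displace the victim; promote the next victim, and pick a new
        next victim pseudo-randomly from the remaining two columns."""
        displaced, others = self.victim, self._others()
        self.victim = self.next_victim
        self.next_victim = self.rand.choice(others)
        return displaced

    def hit(self, column):
        """Update the candidates when a reference hits (illustrative)."""
        if column == self.victim:
            others = self._others()
            self.victim = self.next_victim
            self.next_victim = self.rand.choice(others)
        elif column == self.next_victim:
            self.next_victim = self.rand.choice(self._others())
        # A hit in any other column leaves both candidates unchanged.

row = CacheRow()
displaced = row.miss()                # column 0 is displaced and reloaded
assert displaced == 0 and row.victim == 1
assert row.next_victim in (2, 3)      # pseudo-random among the other two
assert row.victim != row.next_victim
```

Keeping only two stack elements per row is what makes the rule approximate: the ordering of the other two columns is never tracked, so the new next victim must be guessed.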
A clean miss is much like an I/ORead, which was discussed in the previous section. The chief difference is that the block from storage is sent not over the FastOutBus to an output device, but to the CacheD memory. Figure 9 illustrates a clean miss.

All cache loads require a special cycle, controlled by READTR1, in which they get the correct cache address from History and write the cache flags for the entry being loaded; the data paths of CacheA are used to read this address and write the flags. This RThasA cycle takes priority over all other uses of CacheA and History, and can occur at any time with respect to ADDRESS, which also needs access to these resources. Thus all control signals sent from ADDRESS are inhibited by RThasA, and ADDRESS is forced to idle during this cycle. Figure 9 shows that the RThasA cycle occurs just before the first word of the new block is written into CacheD. (For simplicity and clarity we will not show RThasA cycles in the figures that follow.) During RThasA, the beingLoaded flag is cleared (it was set when the reference was in ADDRESS) and the writeProtected flag is copied from the writeProtected bit in MapRAM. As soon as the transport into CacheD is finished, the word reference that started the miss can be made, much as though it had hit in the first place. If the reference was a Fetch, the appropriate word is sent to FetchReg in the processor (and loaded into FetchRegRAM); if a Store, the contents of StoreReg are stored into the new block in the cache.

If the processor tries to use data it has fetched, it is prevented from proceeding, or held until the word reference has occurred (see §5.1). Each fetch is assigned a sequence number called its tag, which is logically part of the reference; actually it is written into History, and read when needed by READTR1. Tags increase monotonically.
The tag of the last Fetch started by each task is kept in StartedTag (it is written there when the reference is made), and the tag of the last Fetch completed by the memory is kept in DoneTag (it is written there as the Fetch is completed); these are task-specific registers. Since tags are assigned monotonically, and fetches always complete in order within a task, both registers increase monotonically. If StartedTag=DoneTag, all the fetches that the task has started have been completed.

A Flush explicitly removes the block containing the addressed location from the cache, rewriting it in storage if it is dirty. Flush is used to remove a virtual page's blocks from the cache so that its MapRAM entry can be changed safely. If a Flush misses, nothing happens. If it hits, the hit location must be marked vacant, and if it is dirty, the block must be written to storage. To simplify the hardware implementation, this write operation is made to look like a victim write. A dirty Flush is converted into a FlushFetch reference, which is treated almost exactly like a Prefetch.
Thus, when a Flush in ADDRESS hits, three things happen:

- the victim for the selected row of CacheA is changed to point to the hit column;
- the vacant flag is set;
- if the dirty flag for that column is set, the Flush is converted into a FlushFetch.

Proceeding like a Prefetch, this does a useless read (which is harmless because the vacant flag has been set), and then a write of the dirty victim. Figure 11 shows a dirty Flush. The FlushFetch spends two cycles in ADDRESS, instead of the usual one, because of an uninteresting implementation problem.

SEC. 5 TRAFFIC CONTROL

At the other extreme, the rule could be that a stage waits only if it cannot acquire the resources it will need in the very next cycle. This would be quite feasible for our system, and the proper choice of priorities for the various stages can clearly prevent deadlock.
However, each stage that may be forced to wait requires logic for detecting this situation, and the cost of this logic is significant. Furthermore, in a long pipeline, gathering all the information and calculating which stages can proceed can take a long time, especially since in general each stage's decision depends on the decision made by the next one in the pipe.

For these reasons we adopted a different strategy in the Dorado. There is one point, early in the pipeline but after ADDRESS, at which all remaining conflicts are resolved. A reference is not allowed to proceed beyond that point without a guarantee that no conflicts with earlier references will occur; thus no later stage ever needs to wait. The point used for this purpose is state 3 of the MAP stage, written as MAP.3. No shared resources are used in states 0-3, and STORAGE is not started until state 4. Because there is just one wait state in the pipeline, the exact timing of resource demands by later stages is known and can be used to decide whether conflicts are possible. We now discuss the details.

5.3.1 STORAGE and WRITETR

In a write operation, WRITETR runs in parallel but not in lockstep with MAP; see, for example, Figure 10. Synchronization of the data transport with the storage reference itself is accomplished by two things.

- MAP.3 waits for WRITETR to signal that the transport is far enough along that the data will arrive at the StorageRAM chips no later than the write signal generated by STORAGE. This condition must be met for correct functioning of the chips. Figure 13 shows MAP waiting during an I/OWrite.

- WRITETR will wait in its next-to-last state for STORAGE to signal that the data hold time of the chips with respect to the write signal has elapsed; again, the chips will not work if the data in WriteReg is changed before this point. Figure 10 shows WRITETR waiting during a victim write.
The wait shown in the figure is actually more conservative than it needs to be, since WRITETR does not change WriteReg immediately when it is started.

5.3.2 CacheD: consecutive cache loads

Loading a block into CacheD takes 9 cycles, as explained in 4.1, and a word reference takes one more. Therefore, although the pipeline stages proper are 8 cycles long, cache loads must be spaced either 9 or 10 cycles apart to avoid conflict in CacheD. After a Fetch or Store, the next cache load must wait for 10 cycles, since these references tie up CacheD for 10 cycles. After a Prefetch, FlushFetch, or dirty I/ORead, the next cache load must wait for 9 cycles. STORAGE sends MAP a signal that causes MAP.3 to wait for one or two extra cycles, as appropriate.
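The spacing rule amounts to a small calculation. The sketch below uses our own names, not the Dorado's; the 8-, 9-, and 10-cycle figures are those given above.

```python
STAGE_CYCLES = 8  # pipeline stages proper are 8 cycles long

def cached_occupancy(ref):
    """Cycles a reference ties up CacheD: 9 for the block load, plus one
    more for the word reference that follows a Fetch or Store."""
    # Prefetch, FlushFetch, and dirty I/ORead load the block but make
    # no word reference, so they occupy CacheD for only 9 cycles.
    return 10 if ref in ("Fetch", "Store") else 9

def extra_map3_wait(previous_ref):
    """Extra cycles MAP.3 must insert before the next cache load so that
    consecutive loads are spaced 9 or 10 cycles apart."""
    return cached_occupancy(previous_ref) - STAGE_CYCLES
```

So a load following a Fetch or Store waits two extra cycles in MAP.3, and one following a Prefetch, FlushFetch, or dirty I/ORead waits one.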
Figure 14 shows a Fetch followed by a Prefetch, followed by a Store, and illustrates how CacheD conflict is avoided by extra cycles spent in MAP.3. Note that the Prefetch waits two extra cycles, while the Store only waits one extra.

Figure 15 shows a Store with clean victim followed by a Fetch with dirty victim and illustrates this interlock. ADDRESS waits until cycle 26 to start WRITETR. Also, the fetch waits in MAP.3 until the same cycle, thus spending 13 extra cycles there, which forces the fetch victim to spend 13 extra cycles in ADDRESS. The two-cycle gap in the use of CacheD shows that the fetch could have left MAP.3 in cycle 24.

Main storage boards are the same size as logic boards but are designed to hold an array of MOS RAMs instead of random ECL logic.
A pair of storage boards makes up a module, which holds 512K bytes (plus error correction) when populated with 16K RAMs, 2M bytes with 64K RAMs, or 8M bytes with (hypothetical) 256K RAMs. There is room for four modules, and space not used for storage modules can hold I/O boards. Within a module, one board stores all the words with even addresses, the other those with odd addresses. The boards are identical, and are differentiated by sideplane wiring.

A standard Dorado contains, in addition to its storage boards, eleven logic boards, including disk, display, and network controllers. Extra board positions can hold additional I/O controllers. Three boards implement the memory system (in about 800 chips); they are called ADDRESS, PIPE, and DATA, names which reflect the functional partition of the system. ADDRESS contains the processor interface, base registers and virtual address computation, CacheA (implemented in 256 by 4 RAMs) and its comparators, and the LRU computation. It also generates Hold, addresses DATA on hits, and sends storage references to PIPE.

DATA houses CacheD, which is implemented with 1K by 1 or 4K by 1 ECL RAMs, and holds 8K or 32K bytes respectively. DATA is also the source for FastOutBus and WriteBus, and the sink for FastInBus and ReadBus, and it holds the Hamming code generator-checker-corrector. PIPE implements MapRAM, all of the pipeline stage automata (except ADDRESS and HITDATA) and their interlocks, and the fault reporting, destination bookkeeping, and refresh control for the MapRAM and StorageRAM chips. The History memory is distributed across the boards: addresses on ADDRESS, control information on PIPE, and data errors on DATA.

Although our several prototype Dorados can run at a 50 nanosecond microcycle, most of the machines run instead at 60 nanoseconds. This is due mainly to a change in board technology from a relatively expensive point-to-point wire-routing method to a cheaper Manhattan routing method.

7. Performance

The memory system's performance is best characterized by two key quantities: the cache hit rate and the percentage of cycles lost due to Hold (see 5.1). In fact, Hold by itself measures the cache hit rate indirectly, since misses usually cause many cycles of Hold. Also interesting are the frequencies of stores and of dirty victim writes, which affect performance by increasing the frequency of Hold and by consuming storage bandwidth. We measured these quantities with hardware event-counters, together with a small amount of microcode that runs very rarely and makes no memory references itself. The measurement process, therefore, perturbs the measured programs only trivially.

We measured three Mesa programs: two VLSI design-automation programs, called Beads and Placer; and an implementation of Knuth's TEX [8]. All three were run for several minutes (several billion Dorado cycles). The cache size was 4K 16-bit words.

              Percent of cycles      Percent of references    Percent of misses
              References    Hold     Hits        Stores       Dirty victims
    Beads        36.4       8.14     99.27       10.5         16.3
    Placer       42.9       4.89     99.82       18.7         65.5
    TEX          38.4       6.33     99.55       15.2         34.9

    Table 5: Memory system performance

Table 5 shows the results. The first column shows the percentage of cycles that contained cache references (by either the processor or the IFU), and the second, how many cycles were lost because they were held. Hold, happily, is fairly rare. The hit rates shown in column three are gratifyingly large, all over 99 percent. This is one reason that the number of held cycles is small: a miss can cause the processor to be held for about thirty cycles while a reference completes. In fact, the table shows that Hold and hit are inversely related over the programs measured.
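As a rough consistency check (our own back-of-the-envelope arithmetic, not a measurement from the paper), the Hold column can be estimated from the others: if a miss holds the processor for about thirty cycles, the held fraction is roughly the fraction of cycles with references, times the miss rate, times thirty. For Beads this lands very close to the measured 8.14 percent; for the other programs the estimate falls short, since Hold has causes besides misses (store and victim-write traffic, for example).

```python
def estimated_hold_percent(ref_cycles_pct, hit_pct, cycles_per_miss=30):
    """Rough estimate of the percentage of held cycles:
    references per cycle x miss rate x miss penalty."""
    refs_per_cycle = ref_cycles_pct / 100.0
    miss_rate = (100.0 - hit_pct) / 100.0
    return refs_per_cycle * miss_rate * cycles_per_miss * 100.0

# Beads: 36.4% reference cycles, 99.27% hit rate, measured Hold 8.14%
print(round(estimated_hold_percent(36.4, 99.27), 1))  # prints 8.0
```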
Beads has the lowest hit rate and the highest Hold rate; Placer has the highest hit rate and the lowest Hold rate.

The percentage of Store references is interesting because stores eventually give rise to dirty victim write operations, which consume storage bandwidth and cause extra occurrences of Hold by tying up the ADDRESS section of the pipeline. Furthermore, one of the reasons that the StoreReg register was not made task-specific was the assumption that stores would be relatively rare (see the discussion of StoreReg in 5.1). Table 5 shows that stores accounted for between 10 and 19 percent of all references to the cache.

Comparing the number of hits to the number of stores shows that the write-back discipline used in the cache was a good choice. Even if every miss had a dirty victim, the number of victim writes would still be much less than under the write-through discipline, when every Store would cause a write. In fact, not all misses have dirty victims, as shown in the last column of the table. The percentage of misses with dirty victims varies widely from program to program. Placer, which had the highest frequency of stores and the lowest frequency of misses, naturally has the highest frequency of dirty victims. Beads, with the most misses but the fewest stores, has the lowest. The last three columns of the table show that write operations would increase about a hundredfold if write-through were used instead of write-back.

Acknowledgements

The concept and structure of the Dorado memory system are due to Butler Lampson and Chuck Thacker. Much of the design was brought to the register-transfer level by Lampson and Brian Rosen. Final design, implementation, and debugging were done by the authors and Ed McCreight, who was responsible for the main storage boards. Debugging software and microcode were written by Ed Fiala, Willie-Sue Haugeland, and Gene McDaniel. Haugeland and McDaniel were also of great help in collecting the statistics reported in Section 7.
Useful comments on earlier versions of this paper were contributed by Forest Baskett, Gene McDaniel, Jim Morris, Tim Rentsch, and Chuck Thacker.

References

1. Bell, J., et al. An investigation of alternative cache organizations. IEEE Trans. Computers C-23, 4, April 1974, 346-351.
2. Bloom, L., et al. Considerations in the design of a computer with high logic-to-memory speed ratio. Proc. Gigacycle Computing Systems, AIEE Special Pub. S-136, Jan. 1962, 53-63.
3. Conti, C.J. Concepts for buffer storage. IEEE Computer Group News 2, March 1969, 9-13.
4. Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, Pacific Grove, Nov. 1979.
5. Forgie, J.W. The Lincoln TX-2 input-output system. Proc. Western Joint Computer Conference, Los Angeles, Feb. 1957, 156-160.
6. Geschke, C.M., et al. Early experience with Mesa. Comm. ACM 20, 8, Aug. 1977, 540-552.
7. Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles of Programming Languages, Tucson, Jan. 1978, 9-16.
8. Knuth, D.E. TEX and METAFONT: New Directions in Typesetting. American Math. Soc. and Digital Press, Bedford, Mass., 1979.
9. Lampson, B.W., et al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.
10. Lampson, B.W., and Pier, K.A. A processor for a high performance personal computer. Proc. 7th Int. Symp. Computer Architecture, SigArch/IEEE, La Baule, May 1980, 146-160. Also in Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981.
11. Liptay, J.S. Structural aspects of the System/360 model 85. II. The cache. IBM Systems Journal 7, 1, 1968, 15-21.
12. Metcalfe, R.M., and Boggs, D.R. Ethernet: distributed packet switching for local computer networks. Comm.
ACM 19, 7, July 1976, 395-404.
13. Mitchell, J.G., et al. Mesa Language Manual. Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.
14. Pohm, A., et al. The cost and performance tradeoffs of buffered memories. Proc. IEEE 63, 8, Aug. 1975, 1129-1135.
15. Schroeder, M.D. Performance of the GE-645 associative memory while Multics is in operation. Proc. ACM SigOps Workshop on System Performance Evaluation, Harvard University, April 1971, 227-245.
16. Tanenbaum, A.S. Implications of structured programming for machine architecture. Comm. ACM 21, 3, March 1978, 237-246.
17. Teitelman, W. Interlisp Reference Manual. Xerox Palo Alto Research Center, Oct. 1978.
18. Thacker, C.P., et al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Siewiorek, Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August 1979.
19. Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. R&D 11, 1, Jan. 1967, 25-33.
20. Wilkes, M.V. Slave memories and segmentation. IEEE Trans.
Computers C-20, 6, June 1971, 674-675.