юЋ$оп>юЋ$о!3п%Wю$‡оБп Вю$–о$п!Юю$Зо'sп/ю$о-ђп(‰ю$«о'sп(fю@$о'sп(fю$Оо!zп*ВюЋ$о ]п,ыю«$о(rп+-ро-ђп+Яю$о/Ґп+Яю$Іоп0tюЋkоdп-‰юkUоп-fюІkоп-‰юk2оп1ыюk2оdп1ыюk2оdп4жю$kоAп1Шю$kоп$ҐюGUо€п$‚юGUоAп'ґюkGоdп$‚юkGоп"ЧюІ$о«п3¦юл$оПп.нюd$о«п3_ю$оп.ню$•оИп/4ю$Nо3п3_юd$о дп3¦ю9$о дп3_ю9$оЭп.ню$•оЭп.ню@$оHп1mюд$о$п1&ю $о$п/4ющ$о$п1mю$о«п& юЂ$о$п$ю r$оЭп#Рю №$о дп& ющ$оHп& юХ$оЭп#фю$9о$п$;ю$то«п'&ю щ$оЋп {ю №$оп#ю$dоп ћю$¤о п&—ю$По!zп%Wю$GоЭп(Bю$Іо,п.нюІ$оЏп2Bю‡$о дп<Сю $$о%п(ю$dо!zп*{юЋ$о ]п,ґю«$о!zп*жю$Цо ]п'ю$Цо ]п-ю$Gоп<Сю k$оЋп2Bю$о9п4{ю k$о/мп+»ю$Іо-ђп+—юЂ$о$Aп&-ю y$о$п1Шю ќ$о$п1mю‡$о/мп1mюО$о/мп5-ю$ ]о$Aп5 юП$о$Aп5Qюd$о/Ґп5Qю$ Ѓо'sп.ню@Gо'sп(fю@Gо'sп(‰юG«о-lп(‰юG«о2ђп3;ю•kо2ђп$ю•kо2ђп$^юkHо6»п$;юkHо"–xп+про"–п*Dро"–п(™ро"–п&оро"–п Dро"–поро"–пCро ]п&Юю$оп(Bюk$оп.нюd$оБп/4ю«$о!zп0ют$о!3п0Яю9$о Ѓп2Bюл$оЏп2‰ю Э$о%п3нюG$о%п44юG$оп#‰юd$оп#Bюd$о!zп%4ют$оЏп$ню Э$о%п(‰юG$оzп+`роzп)'роVпроzпТроПп*DроПп,}роПпороПп'ро%мп$µро%мп"|ро1 п)nро1 пГроrп|рощп!»юІ$оПп4Троzп4Тро yп9џюN$оVп9џю$оЋп8ѓюл$о$Aп!»ю Ґ$о"–п™роЃrп2фроп»юЋ$оп‚ю$]оп‚юІ$оЏпҐю$9о!zпIю$¤о!zп%юk$о"–xпоро€пею$оПпћю$оіпzю@$оЏпВю$о$Aп;юЂ$о'п^юЭ$о$eпЙю Ѓ$оЏп?Qю$9оп?-юІ$оп?-ю$]опAfюЋ$ощпHџюІ$ощпFgю$]ощпFgюІ$о€пFЉю$9оЦпTKюІ$оЦпRю$]оЦпRюІ$о/ЙпBю$ЂоkпBю*Ѓ$опЧю+W$о$п%ћюП$опъю$:Wоzп=oропFюЋ$оЏпBою$UопBКюІ$опBКю$yоБ|п?ЁропEµю$о«пG_ю$оИпE‘ю$топEШю$«оИпEnюd$оіпDuю$о*Ип/4ю$dоПxп>Dроrп?aро ќпG§ю«$о ќпG_ю«$оkпFЉю9$оЋпFCю$о*:|п+проkпLѓю9$оkпLКю9$о yпKCюО$о yпKЉюО$оrxпCEроПпD…ро«пKCю%м$ощпJJюЋ$оПvпK ро3Brп+-роЏп<СюN$о%п4Xю$ќоп5Qю$UоAvпRГро9|пPѕро9пO~ро9пN>ро№пO~ро yпOJю9$о–пKљро9пJЎро9пL“роЋпPCюЗ$оЦпN.юІ$оЦпN.ю$]оЦпPgюІ$о]vпNаро€пO'ю/:Gо%:rпOЩр о2пYµюGо7пTnюGЋо2пT'юGGо2пT'юGЦо)eпY’ю$kо.eпUхюkдо)пUхюkдо.¬пUхю$о)пYЩюЋ$о)eпUхю$kо/ҐпU®ю‡$о.¬пXuюЂ$о.¬пX.юЂ$о2lпV§ро€пRю$]о€пN.ю$]о0WxпP0ро0WпMiро/мпUgю@$о=п$ю№$о9Йп)юќ$о9Йп*Вюќ$о9Йп.юќ$о2іп7юGо6»п5»юG9о2іп5tюNGо2іп5tюGЂо2ыvп6mро4Йп3‚ю$о4;|п0=ро7%п,mю$о9Йп*{юќ$о=фп#Рюr$о=фп"%ю$«о=п"%ю$о$Aп"Iюk$о=фп"Iют$о9Йп*Вю$dо9Йп)ю$dо9‚п(Рюд$о9‚п(фю$2о9Йп,&юy$о7%п,&юЂ$о>нtп+ ро>Йп)^роAђxпТроAђп ТроAђп"|роAђп$'роAђп%ТроAђп'}роAђп)'ро9Йп.;ю$о9‚п-фю$о9‚п-Рюд$оЏп=ю :$о%п<Сю]$о:Вп/жю$до:{п/џю$Бо:{п/{юЗ$о:Вп/ВюЂ$о€пG§ю%^$о3пG_ю#l$о;tп1&ю$Aо;tп1&ют$о;»п1mю«$о;»п1‘ю$о€пKЉю&W$о<ђп2фю«$опLѓюЋ$ощпJnю$9о€пJJю$]о<ђп3ю$Vо€пS.ю(%$о=п'Iю@$о<ыvп'ђро>;п%еро4мп4ВюO$о7пWYю д$о7пWю д$о9п=ю$лоп4жю$оЋп%4ю$о9п'ђю$ о9п'mю k$о«п'&ю$-Iо«пSјю$ло«п[„юCЙ$о9пR ю$ќо9п[юC;$оп3ю$Vоп2‰ю¤$оЋп$ню$оЋп%Wю$ оЋпБю$OоЭп1Iю$9о дп3_ющ$о дп1ыю$Uо Vп1Шю$yо Vп$‚ю$yо дп$Ґю$Uо Vп'ЧюЋ$о Vп'ЧюЋ$о дп& ющ$оЭп#фю$9о Vп$‚юІ$оC;п,ђю $оC;п,ю r$оLпБю$dоM;п3ю$ЃоЋпћюLB$опюM_$о%:rп2Bро%:п&»ро%:п р о%:п"ђроFpп!PроFrп,ЧроD4xпSроDпOроDпGроDпароDп`ро=п%ћю$о=п6JюЋ$оB|п2™роBп0проBп/DроBп#™роBп!оропS.юy$оЋпLою$@оkпF®ю$що9пFѕро9пDМроFpпZDроFпV<роFпNuроFпOµроC;п#‰юGо$п.юG Џо$пeюG@о VпEьюGUо VпIаюGUо VпMГюGо VпQ`юGАо rп@Jро п?-ро пропероп¬роFpп[`р оVпJnю$$оVпNQю$оVпCXю«$оVп@Jю«$оБ|п<љро –vпTµро –пH|юд$о –пHГро rп:»юд$о Vп8юGUо+ћпS.ю$Зо+|пQ“ро yпS.ю]$оpпро rvп7ро Oп*{роVпВю$6Pод|поропHю«$оЗпю$Uоп уюО$оп ую$yодpпро9п ЦюЋ$оЗпOю$«о9п,юІ$о9п,ю$Оодп,роzп єюЋ$опЃю$]оzпЃюІ$оzпЃю$]одпБюG9опроп,роUп юд$оsп ,ю+$оsпю+$оsп Hю•$оsп,ю•$оUпћюл$о,sпћюЋ$о,sпюЋ$о,sп ¬юЋ$о,sп eюЋ$о,sпєюЋ$о3п,ропћю$оvпероyпQою$оVtпђр оVпђр о€п!Ююд$оПп!—юќ$о"–xп™ро€п Вю9$о&Бп%ю$9оЏп@Jю$о¬п5Яю$ Џо¬п5»юr$о#|п2роrп2фроп%ћро0еpпdро3пур оrп%ћроЏп9џропNјроЏпS.роdtп#‰ро щп"lро щп/Вроdп0Яро4Йп7ю$‡о1,п94юА$о1,vп9{роFpпWYр о!еtп2Сро!еп'&ро!еп1&ро!еп%{ро!3п4{ро!еп/{ро!ћп(Рро!еп#Рро%:п-Bро%:п+ роdrп\роИtпCџроИп@‘роИперо%:п7ро0WпЧю$ло%:rп5»ро=п6mю$до?жtп._ро>¦п6‘ро>;п5 ро?{п,ґро^п$^ро>;п1ґро?4п3;ро*rпX.ро*:пVЛроC;п,mю $оLСпћю$dоMп3ю$ЃоC;п,&ю №$оLСп+»ю$ЋоЋпzюLB$оп3юM_$о$п3ю$Vоkпzю$–оkп%4ю$ 2оп'mю$ 2оОп'&ю$-Iо$пъю$:WоЋп Вю$%Ґоkп Вю$%ҐоЋпBю$]оkпBю$]оkпeю*Ѓ$опъю+W$оЋпF®ю$щоkпLою$@о9п4жю$оп=ю$лоОпSјю$лопR ю$ќо«п[`юCЙ$о9пZхюC;$о/мпBю$Ђо03пЧю$ло03п%Вю‡$о$Aп& ю y$о$Aп"%юk$о$Aп!Юю Ґ$о$eпҐю Ѓ$о$eп^юЂ$о!3пю$Uо!zпмют$о'п;юЭ$о>п"%ю‡$о,sпzюЋ$о,sп3юЋ$оFpп^роFпроFп»роFп3роAЧxп`роFpп"ђроAЧxпаро$п1ґю ќ$о$п1Iю‡$о –пTnюд$о Oп*4юy$о!Vtп {роAп8;р оVп/XроVп"ро+п0tро kп0tро+п#ро kп#роуп €р оупИр о4;|п4‹ро3pп ро'пЮю$¤о!3пЮющ$о¤vп)^роп*Wро%:tп¬р о%:пђроVп¬роVп¬роEtп¬р оEtпђро«пер оЏпИро«пђр о«п¬р о Vп5-юЋ$о Vп1ШюІ$о!3п 4ю9$о0Wп%ћю@$о0п1Iю‡$оЋrпSuро$п/Xю$о Otп,mро]п+tю$оOпeро€п {ю¤$о п&—юU$о$п%Вют$о&Бп2Bр о&ћп3_ро$п7ю«$оПп7фр о'spпdсо3пр(оdtпG<ро<ґп0 ря4њёеO ]RжA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER146. ImplementationIn this section we describe, at the block diagram level, the actual implementation of the Doradoprocessor. There is only space to cover the most interesting points and to illustrate the key ideasfrom ¶ 5.6.1 ClocksThe Dorado has a fully synchronous clock system, with a clock tick every 30 nanoseconds. A cycleconsists of two successive clock ticks; it begins on an even tick, which is followed by an odd tick,and completes coincident with the beginning of a new cycle on the next even tick. Even ticks maybe labeled with names like t-2, t0, t2, t4 to denote events within a microinstruction execution or apipeline, relative to some convenient origin. Odd ticks are similarly labeled t-1, t1, t3.6.2 The control sectionThe processor can be divided into two distinct sections, called control and data. The control sectionfetches and broadcasts the microinstructions to the data section (and the remainder of the Dorado),handles task switching, maintains a subroutine link, and regulates the clock system. It also has aninterface to a console and monitoring microcomputer which is used for initialization and debuggingof the Dorado. Figure 5 is a block diagram of the control section.6.2.1 Task pipelineThe task pipeline consists of an assortment of registers and a priority encoder. All the registers areloaded on even clocks. Wakeup requests are latched at t0 in WAKEUP, one bit per task; READY hascorresponding bits for preempted and explicitly readied tasks. The requests in WAKEUP and READYcompete. A task can be explicitly made ready by a microcode function. The priority encoderproduces the number of the highest priority task, which is loaded into BESTNEXTTASK and also usedto read the TPC of this task into BESTNEXTPC; these registers are the interface between the twostages in this pipeline. The NEXT bus normally gets the larger of BESTNEXTTASK and THISTASK.THISTASK is loaded from NEXT, and LASTTASK is loaded from THISTASK, as the pipeline progresses.This method of priority scheduling means that once a task is initiated, it must explicitly relinquishthe processor before a lower priority task can run. A bit in the microword, Block, is used toindicate that NEXT should get BESTNEXTTASK unconditionally (unless the instruction is held).Note that it takes a minimum of two cycles from the time a wakeup changes to the time thischange can affect the running task (one for the priority encoding, one to fetch the microinstruction).This implies that a task must execute at least two microinstructions after its wakeup is removedbefore it blocks; otherwise it will continue to run, since the effects of its wakeup will not have beencleared from the pipe. The device cannot remove the wakeup until it knows that the task will run(by seeing its number on NEXT). Hence the earliest the wakeup can be removed is t0 of the firstinstruction (NEXT has the task number in the previous cycle, and the wakeup is latched at t0); thusthe grain of processor allocation is two cycles for a task waking up after a Block.Some trouble was taken to keep the grain small, for the following reason. Since the memory isheavily pipelined and contains a cache which does not interact with high bandwidth I/O, the I/Omicrocode often needs to execute only two instructions, in which a memory reference is started anda count is decremented. The processor can then be returned to another task. The maximum rate atwhich storage references can be made is one every eight cycles (this is the cycle time of the mainstorage RAMs). A two cycle grain thus allows the full memory bandwidth of 530 megabits/secondto be delivered to I/O devices using only 25% of the processor.яодп[ zwфXхzwzwzwzwz wzwzоwопT¶{опQ‹wф®рPфЇ опP фџф рSопNѓфЂопJ„|фXопGYwф‹р;|w|опEХwфЎр/фў|w|wопDQф‹р7фЊр'опBНф©фЄ~пB@пBНw~пB@пBНw~пB@пBНw~пB@пBНwр:оп@ЯфЂрF~п@Rп@Яw~п@Rп@Яw~п@Rп@Яwоп<а|фXоп9µwфѓ ф„р3|w|wоп81фЌр1фЋр+оп6ф™рVфљоп5)ф‹фЊрVоп3ҐфЂрAоп/¦|фX оп,{wфЊр`фЌоп*чф‹фЊр.~п*jп*чwzwzwоп) ф‡рBф€zwzоп'…wфЅфѕрHоп&фѓр)ф„zw оп$}ф» zwz wр3оп"щфzwф® zwzwоп!uzwфЂzwzwzwопJфрVф™ опЖфЗрIфИ~wопBфЂzwzwр2опфБфВрFоп“фЃрCф‚опф¶р@ф·оп‹ф…р.ф†р3опфђр-ф‘р-опѓфџzwр4~пцпѓwоп• фђzwф‘рA~пп•wоп§фЂ|wрD~wоп |фр$ф®р6опшфЁр.ф©zwzwzwzопtwф€ф‰р=опрфЂр%фЃр;опlфњр?фќопиф™zwфљрOопdфЂzwzwр)яћ·%Hџ`ЯSEC. 6IMPLEMENTATION15A simpler design would require the microcode to explicitly notify its device when the wakeupshould be removed; it would then be unnecessary to broadcast NEXT to the devices. Since thisnotification could not be done earlier than the first instruction, however, the grain would be threecycles rather than two, and 37.5% of the processor would be needed to provide the full memorybandwidth. Other simplifications in the implementation would result from making the pipelinelonger; in particular, squeezing the priority encoding and reading of TPC into one cycle is quitedifficult. Again, however, this would increase the grain.6.2.2 Fetching microinstructionsRefer to the right hand side of Figure 5. At t0 of every instruction, the microinstruction registerMIR is loaded from the outputs of IM, the microinstruction memory, and the THISPC register isloaded with IMADDRESS. The NEXTPC is quickly calculated based on the NextControl field in MIR,which encodes both the instruction type and some bits of NEXTPC; see Figure 7 for details. Thiscalculation produces THISTASKNEXTPC, so called because if a task switch occurs it is not used as thenext IMADDRESS. Instead, the BESTNEXTPC computed in the task pipeline is used as IMADDRESS.<==Кп0Pю$Ћо<‘п0Pю$Ћо:Xп0Pю$Ћо8п0Pю$Ћо5жп0Pю$Ћо#tп/ро%;п/ро'sп/ро)¬п/ро+еп/ро.п/ро0Wп/ро2ђп/ро4;п/ро6tп/ро8п/ро:жп/ро=п/ро?Xп/ро?Xп*ро=п*ро:жп*ро8п*ро6tп*ро4;п*ро2ђп*ро0Wп*ро.п*ро+еп*ро)¬п*ро'sп*ро%;п*ро#п*ро5жп+Pю$Ћо8п+Pю$Ћо:Xп+Pю$Ћо<‘п+Pю$Ћо>Кп+Pю$ЋоAп+Pю$Ћо1tп+Pю$Ћо/;п+Pю$Ћо-п+Pю$Ћо*Йп+Pю$Ћо(ђп+Pю$Ћо$п+Pю$Ћо!еп+,ю$]о!еп+,юA$оAп+Pю$9о!еп-eю$о!еп(eю$оAп&Pю$9о!еп&,юA$о!еп&,ю$]о$п&Pю$Ћо(ђп&Pю$Ћо*Йп&Pю$Ћо-п&Pю$Ћо/;п&Pю$Ћо1tп&Pю$ЋоAп&Pю$Ћо>Кп&Pю$Ћо<‘п&Pю$Ћо:Xп&Pю$Ћо5жп&Pю$Ћо3п&Pю$Ћо#п%ро%;п%ро'sп%ро)¬п%ро+еп%ро.п%ро0Wп%ро2ђп%ро4;п%ро6tп%ро8п%ро:жп%ро=п%ро?Xп%ро?Xп ћро=п ћро:жп ћро8п ћро6tп ћро4;п ћро2ђп ћро0Wп ћро.п ћро+еп ћро)¬п ћро'sп ћро%;п ћро#п ћро5жп!Юю$Ћо<‘п!Юю$ЋоAп!Юю$Ћо1tп!Юю$Ћо/;п!Юю$Ћо-п!Юю$Ћо*Йп!Юю$Ћо(ђп!Юю$Ћо$п!Юю$Ћо!еп!єю$]о!еп!єюA$оAп!Юю$9о!еп#ую$о!епую$оAпЮю$9о!епєюA$о!епєю$]о$пЮю$Ћо&WпЮю$Ћо(ђпЮю$Ћо*ЙпЮю$Ћо-пЮю$Ћо/;пЮю$Ћо1tпЮю$ЋоAпЮю$Ћо>КпЮю$Ћо<‘пЮю$Ћо:XпЮю$Ћо8пЮю$Ћо5жпЮю$Ћо3пЮю$Ћо#пќро%;пќро'sпќро)¬пќро+епќро.пќро0Wпќро2ђпќро4;пќро6tпќро8пќро:жпќро=пќро?Xпќро?Xп,ро=п,ро:жп,ро8п,ро6tп,ро4;п,ро2ђп,ро0Wп,ро.п,ро+еп,ро)¬п,ро'sп,ро%;п,ро#п,ро3пlю$Ћо5жпlю$Ћо8пlю$Ћо:Xпlю$Ћо>Кпlю$ЋоAпlю$Ћо1tпlю$Ћо/;пlю$Ћо-пlю$Ћо*Йпlю$Ћо(ђпlю$Ћо$пlю$Ћо!епHю$]о!епHюA$оAпlю$9о!епЃю$о'sп+Pю$9о'sп&Pю$9о'sпlю$9о.фvпЏр о&Wп0Pю$Ћо3п0Pю$9о3п+Pю$9о4‚tп,ро6»п,ро8фп,ро;-п,ро=fп,ро?џп,ро8п&Pю$9о-ђvп'ро&Wп!Юю$Ћо3п!Юю$9о8п!єю9$о>Кп!Юю$9о8п!Юю$9о:Xп!Юю$9о?џп"Џро8фп"Џро<‘пlю$9о?XtпЂро=пЂро:жпЂро8пЂро6tпЂро4;пЂро2ђпЂро0WпЂро.пЂро+епЂро)¬пЂро'sпЂро%;пЂро#пЂро5жпАю$Ћо8пАю$Ћо:XпАю$Ћо<‘пАю$Ћо>КпАю$ЋоAпАю$Ћо1tпАю$Ћо/;пАю$Ћо-пАю$Ћо*ЙпАю$Ћо(ђпАю$Ћо&WпАю$Ћо$пАю$Ћо!епќю$]о!епќюA$оAпАю$9о!еп Цю$о3пАю$9о3пќю$о!еpп тр,одuп еро tп«р о пЗро-ђrп6Iро)vп1ро8п1ро",п,ро+Wп,ро",п'ро",про:п'ро)¬п"Џро3фп"Џро:џп"ЏроBuп#роBп"р о+еvпро<ШпроЏuпро(ђvпrро6»пrро!еп4Wро4;п4Wро:uп1ро:п,роWpпр оsrп6Iро Йпkю$ :опHю И$оИпkю$ :опkю$ :оп3‚ю И$оп6Iроп3‚юG$опHюG$оGпkю$ :оBtп1ђр о uпро!ћtпрSо п ро!ћптрJо$¬pпdсяюиЊGС7‰xA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER16TPC is written with the previous value of THISTASKNEXTPC every cycle (at t3), and read for the taskin BESTNEXTTASK every cycle as well. Thus, TPC is constantly recording the program counter valuefor the current task, and also constantly preparing the value for the next task in case there is a taskswitch.6.2.3 Miscellaneous featuresThere is a task specific subroutine linkage register, LINK, shown in Figure 5, which is loaded withthe value in THISPC+1 on every microcode call or return. Thus each task can have its ownmicrocoded coroutines. LINK can also be loaded from a data bus, so that control can be sent to anarbitrary computed address; this allows a microprogram to implement a stack of subroutine links,for example. In addition to conditional branches, which select one of two NEXTPC values, there arealso eight-way and 256-way dispatches, which use a value on the B bus to select one of eight, or oneof 256 NEXTPC values. Since the Dorado's microstore is writeable, there are data paths for reading and writing it. Relatedpaths allow reading and writing TPC. These paths (through the register TPIMOUT) are folded intoalready existing data paths in the control section and are somewhat tortuous, but they are usedinfrequently and hence have been optimized for space. In addition, another computer (either aseparate microcomputer or an Alto) serves as the console processor for the Dorado; it is interfacedvia the CPREG and a very small number of control signals.6.3 The data sectionFigure 6 is a block diagram of the data section, which is organized around an arithmetic/logic unit(ALU). It implements most of the registers accessible to the programmer and the microcodefunctions for selecting operands, doing operations in the ALU and shifter, and storing results. It alsocalculates branch conditions, decodes MIR fields and broadcasts decoded signals to the rest of theDorado, supplies and accepts memory addresses and data, and supplies I/O data and addresses.6.3.1 The microinstruction registerMIR (which actually belongs to the control section) is 34 bits wide and is partitioned into thefollowing fields:RAddress4Addresses the register bank RM.ALUOp4Selects the ALU operation or controls the shifter.BSelect3Selects the source for the B bus, including constants.LoadControl3Controls loading of results into RM and T.ASelect3Selects the source for the A bus, and starts memory references.Block1Blocks an I/O task, selects a stack operation for task 0.FF 8Catchall for specifying functions.NextControl8Specifies how to compute NEXTPC.6.3.2 BussesThe major busses are A, B (ALU sources), RESULT, EXTERNALB, MEMADDRESS, IOADDRESS, IODATA,IFUDATA, and MEMDATA .The ALU accepts two inputs (A and B) and produces one output (RESULT). The input busses have avariety of sources, as shown in the block diagram. RESULT usually gets the ALU output, but it isalso sourced from many other places, including a one bit shift in either direction of the ALU output.A copy of A is used for MEMADDRESS; two copies of B are used for EXTERNALB and IODATA.MEMADDRESS provides a sixteen bit displacement, which is added to a 28 bit base register in thememory system to form a virtual addresses. EXTERNALB is a copy of B which goes to the control,memory, and IFU sections, and IODATA is another copy which goes to the I/O system; the sources ofяодп^zwфXхzwzwzwzwz wzwzоwопX(zwфЋр'z w~пW›пX(wфЏопV:ф‹zwzwр2опT¶фЉф‹рRопS2опO3|фXопLwф›р1zwр$фњопJ„фРфСzwрFопI ф‹ zwрFопG|ф¤фҐр@опEшф€ф‰р?zwопDtф„| wzwр#опBрфЂzwоп?ЕфЌр5фЋр+оп>AфЈzwр%zwф¤оп<Ѕфєр#ф»р5оп;9ф¶ф·р>оп9µф“р(ф”р3оп81фЂzwр,оп42|фXоп1wфђрDф‘оп/ѓzwфарFфб оп-яфЂр,фЃzwр+оп,{ ф¦zwр&ф§оп*чфЂр>zwzwоп&ш|фXоп#НzwфМрXфНоп"IфЂоцпс~о›wо]zwоцпmz~о›wо]zwр#оцпй~о›wо]zwоцпe~ о›wо]р!zwzwоцпб~о›wо]zwр#оцп]~о›wо] zwzwр,оцпЩ~wо›о]р"оцпU~ о›wо]zwопV|фXоп+wф“ф”zwzwzw zwzwz wzwzwоп§zwфЂzwоп |ф…zwzwzwzwф†опшфўфЈzwzwопtфѓф„рDzwопрфВzwz wzwфГ zwzwопlz wф¦рNф§опифр&zw zwф™опdфЃzwzwф‚р#zwzw L·іHџcuнSEC. 6IMPLEMENTATION17B can thus be sent to the entire processor. Both are bidirectional and can serve as a source for B aswell. IOADDRESS is driven from a task specific register; it specifies the particular device and registerwhich should source or receive IODATA.IFUDATA and MEMDATA allow the processor to receive data from the IFU and memory in parallelwith other data transfers. MEMDATA has the value of the memory word most recently fetched bythe current task; if the fetch is not complete, the processor is held when it tries to use MEMDATA.IFUDATA has an operand of the current macroinstruction; as each operand is used, the IFU presentsthe next one on IFUDATA.6.3.3 RegistersHere is a list and brief description of registers seen by the microprogrammer. All are one word (16bits) wide.RM:a bank of 256 general purpose registers; a register can be read onto A, B, or theshifter, and loaded from RESULT under the control of LoadControl. Normally, thesame register is both read and loaded in a given microinstruction, but loading of adifferent register can be specified by FF.STACK:a memory addressed by the STACKPTR register. A word can be read or written,and STACKPTR adjusted up or down, in one microinstruction. If STACK is used in amicroinstruction, it replaces any use of RM, and the RAddress field in the microwordtells how much to increment or decrement STACKPTR. The 256 word memory isdivided into four 64 word stacks, with independent underflow and overflowchecking.T:a task specific register used for working storage; like RM, it can be read onto A, B,or the shifter, and loaded from RESULT under the control of LoadControl..COUNT:a counter; it can be decremented and tested for zero in one microinstruction, usingonly the NextControl or FF field. It is loaded from B or with small constants fromFF.SHIFTCTL:a register which controls the direction and amount of shifting and the width of leftand right masks; it is loaded from B or with values useful for field extraction fromFF.Q:a hardware aid for multiply and divide instructions; it can be read onto A or B, andloaded from B, and is automatically shifted in useful ways during multiply anddivide step microinstructions.The next group of registers vary in width. They are used as control or address registers, changeddynamically but infrequently by microcode.RBASE:RM addressing requires eight bits. Four come from the RAddress field in themicroword, and the other four are supplied from RBASE. It is loaded from B or FF,and can be read onto RESULT.STACKPTR:an eight bit register used as a stack pointer. Two bits of STACKPTR select a stack,and the least significant six bits a word in the stack. The latter bits areincremented or decremented under control of the RAddress field whenever a stackoperation is specified.MEMBASE:a five bit register which selects one of 32 base registers in the memory to be usedfor virtual address calculation. It is loaded from FF field or from B, and can beloaded from the IFU at the start of a macroinstruction.ALUFM:a 16 word memory which maps the four-bit ALUOp field into the six bits requiredto control the ALU.IOADDRESS:a task specific register which drives the IOADDRESS bus, and is loaded by I/Omicrocode to specify a device address for subsequent Input and Output operations.It may be loaded from B or FF.яоп\ъzфGхо+I о;€wопVЈzwфЂрTфЃ zwопUzwф‚рYопS›фЂzwопPpzwфўzwр$фЈ zwопNмфџzwр/ф опMhф™р,фљр,zwопKдzwфЊрMфЌzwопJ`фЂzwопFa|фX опC6wф†ф‡рLопAІфЂоп?Ўzwо мф¬рDzwzwо мп>ф¤zw~ wфҐ о мп<™ф’р&ф“р)о мп;фЂ~wоп9zwо мф±zwфІо мп7Ђф‚zwфѓр'zwо мп5ьф„ф…zw ~wо мп4xфќр$zwфћо мп2фффр5о мп1pоп/_zwо мф“ф”zwzwzwо мп-ЫфЂzw~ wоп+Кzwо мф‹р*фЊр(о мп*Fф—~ w~w фzwо мп(В~wоп&±zwо мфЌрEфЋ о мп%-фђzwф‘р(о мп#©~wоп!zwо мф„рHzwф…zwо мп фЗzwр(фИо мпђфЂопeфќр(фћр7опб фЂопРzwо мzwфЭр5~wфЮо мпL фѓф„zwzw~wо мпИфЂzwоп·zwо мф—фzwо мп3ффр3о мпЇ фҐф¦~wо мп+фЂ опzwо мфр/ф™р#о мп –ф©р1~wzwфЄ о мп фЂ zwр$опzwо мфф™z~wр!о мп}фЂzwопlzwо мфФр)zwфХzwzо мпиwф¦р,~wф§~wо мпdфЂzw~w · 8HџaрA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER186.3.4 The shifterThe Dorado has a 32 bit barrel shifter for handling bit-aligned data. It takes 32 bits of input fromRM and T, performs a left cycle of any number of bit positions, and places the result on A. TheALU output may be masked during a shift instruction, either with zeroes or with data fromMEMDATA.The shifter is controlled by the SHIFTCTL register. To perform a shift operation, SHIFTCTL is loaded(in one of a variety of ways) with control information, and then one of a group of "shift and mask"microoperations is executed.6.4 Physical organizationOnce the goal of a physically small but powerful machine was established, engineering design andmaterial lead times forced us to develop the Dorado package before the implementation was morethan partially completed, and the implementation then had to fit the package. The data section ispartitioned onto two boards, eight bits on each; the boards are about 70% identical. The controlsection divides naturally into one board consisting of all the IM chips (high speed 1K x 1 bit ECLRAMs) and their associated address drivers, and a second board with the task switch pipeline,NEXTPC logic, and LINK register.The sidepanel pins are distributed in clusters around the board edges to form the major busses.The remaining edge pins are used for point to point connections between two specific boards. TheI/O busses go uniformly to all the I/O slots, but all the other boards occupy fixed slots specificallywired for their needs. Half the pins available on the sideplanes are grounded, but wire lengths arenot controlled except in the clock distribution system, and no twisted pair is used in the machineexcept for distribution of one copy of the master clock to each board.We were very concerned throughout the design of the Dorado to balance the pipelines so that noone pipe stage is significantly longer than the others. Furthermore, we worked hard to make thelongest stage (which limits the speed of this fully synchronous machine) as short as possible. Thelongest stage in the processor, as one might have predicted, is the IMADDRESS calculation andmicroinstruction fetch in the control slice. There is about a 50 nanosecond limit for reliableoperation in a stitchwelded machine, and 60 ns in a multiwired machine. There are pipe stages ofabout the same length in the memory and IFU.We also worked hard to get the most out of the available real estate, by hand tailoring theintegrated circuit layout and component usage, and by incrementally adding function until nearlythe entire board was in use. We also found that performance could be significantly improved bycareful layout of critical paths for minimum loading and wiring delay. Although this was a verylabor intensive operation, we believe it pays off.7. PerformanceFour emulators have been implemented for the Dorado, interpreting the BCPL, Lisp, Mesa andSmalltalk instruction sets. A typical microinstruction sequence for a load or store instruction takesonly one or two microinstructions in Mesa (or BCPL), and five in Lisp. The Mesa opcode can senda 16 bit word to or from memory in one microinstruction; Lisp deals with 32 bit items and keepsits stack in memory, so two loads and two stores are done in a basic data transfer operation. Morecomplex operations (such as read/write field or array element) take five to ten microinstructions inMesa and ten to twenty in Lisp. Note that Lisp does runtime checking of parameters, while inMesa most checking is done at compile time. Function calls take about 50 microinstructions forMesa and 200 for Lisp.The Dorado supports raster scan displays which are refreshed from a full bitmap in main memory;this bitmap has one bit for each picture element (dot) on the screen, for a total of .5ќ1 megabitsодп]сzwфXхzwzwzwzwz wzwzоwопWљ|опTowфЌрFфЋопRлzwфЎzwфўр@zwопQgzwфер1фжр%опOгzwопLёф‚ фѓzwр*zw опK4ф…рLф†опI°фЂопE±|фXопB†wф›р4фњр(опAфр3ф™р#оп?~ф—р@фоп=ъ ф¤фҐр:опоп џф’ф“рLопф°рDф±оп—ф«р7ф¬р$опфЂопиф“рEф”|wопdфљр5ф›р)яь·AHџbзўSEC. 7PERFORMANCE19(more for gray-scale or color pictures). A special operation called BitBlt (bit boundary blocktransfer) makes it easier to create and update bitmaps; for more information about BitBlt consult [9],where it is called RasterOp. BitBlt makes extensive use of the shifting/masking capabilities of theprocessor, and attempts to prefetch data so that it will always be in the cache when needed. TheDorado's BitBlt can move display objects around in memory at 34 megabits/sec for simple cases likeerasing or scrolling a screen. More complex operations, where the result is a function of the sourceobject, the destination object and a filter, run at 24 megabits/sec.I/O devices with transfer rates up to 10 megabits/sec are handled by the processor via the IODATAand IOADDRESS busses. The microcode for the disk takes three cycles to transfer two words in thisway; thus the 10 megabit/sec disk consumes 5% of the processor. Higher bandwidth devices usethe fast I/O system, which does not interact with the cache. The fast I/O microcode for the displaytakes only two instructions to transfer a 16 word block of data from memory to the device. Thiscan consume the available memory bandwidth for I/O (530 megabits/sec) using only one quarter ofthe available microcycles (that is, two I/O instructions every eight cycles).Recall that the NEXTPC scheme (¶ 5.5 and ¶ 6.2.2) imposes a rather complicated structure on themicrostore, because of the pages, the odd/even branch addresses, and the special subroutine calllocations We were concerned about the amount of microstore which might be wasted by automaticplacement of instructions under all these constraints. In fact, however, the automatic placer can use99.9% of the available memory when called upon to place an essentially full microstore.AcknowledgementsThe early design of the Dorado processor was done by Chuck Thacker and Don Charnley. Thedata section was redesigned and debugged by Roger Bates and Ed Fiala. Peter Deutsch wrote themicrocode assembler and instruction placer, and Ed Fiala wrote the Dorado assembler macros, themicroprogram debugger, and the hardware manual. Willie-Sue Haugeland, Nori Suzuki, BruceHorn, Peter Deutsch, Ed Taft and Gene McDaniel are responsible for production and diagnosticmicrocode.References1.Clark, D.W. et. al. The memory system of a high-performance personal computer. Technical Report CSL-81-1, Xerox PaloAlto Research Center, January 1981. Revised version to appear in IEEE Transactions on Computers.2.Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, PacificGrove, Nov. 1979.3.Geschke, C.M. et. al. Early experience with Mesa. Comm ACM 20, 8, Aug 1977, 540-5524.Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles ofProgramming Languages, Tucson, Jan 1978, 9-16.5. Lampson, B.W. et. al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1,Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.6.Mitchell, J.G. et. al. Mesa Language Manual, Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.7. Teitelman, W. Interlisp Reference Manual, Xerox Palo Alto Research Center, Oct. 1978.8. Thacker, C.P. et. al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek, Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August1979.9. Newman, W.M. and Sproull, R.F. Principles of Interactive Computer Graphics, 2nd ed. McGraw-Hill, 1979.оп=KzфGхо- о;€wоп6ффЩфЪр1~wоп5pфЃр.ф‚~wоп3мфЁ ~w~wф©р,оп2h фњр!фќр6оп0дфѓ~wр=ф„оп/`ф€р^оп-ЬфЂр=оп*±zwzwф—рSфzоп)-wфЌzwр0фЋр%оп'©фҐр,ф¦р-оп&%ф‹zwzwр)фЊzwzwоп$Ўфџр,ф р/оп#фЃр,zwzwф‚оп!™фЂр%zwzwр"опnф§ zwр,фЁопк фірPфґопfф†ф‡рTопвф…р-ф†р0оп^фЂрRоп‹{оп`wф©фЄрPопЬф•рZоп Xф ф™рLопФфЙр$фКр)оп Pфф®рEопМ опщ{ опdywо{zфGя ·(зHџBA оzп)KpфGхqрOrqо{п( рBsqpqоп&ftuо{qрDpр)qо{п%(оп#Ѓtuо{q pqpsqtqоп!Ъtuо{qрOpspо{п њqопхtuфXо{qфGpqрYrqо{п·рGопtuо{qpqpqrqр3опituфXо{q фGpqр-опВtuфXо{qфGpqpр*qо{п„рDrqр/о{пFопеtqо{pр+qя 6·HџџA PROCESSOR FOR A HIGH-PERFORMANCE PERSONAL COMPUTER20одпdquфXхququququq uquqоuя8·eОHџZ$An Instruction Fetch Unit for a High-Performance Personal Computerby Butler W. Lampson, Gene A. McDaniel and Severo M. OrnsteinJanuary 1981ABSTRACTThe instruction fetch unit (IFU) of the Dorado personal computer speeds up the emulation ofinstructions by pre-fetching, decoding, and preparing later instructions in parallel with theexecution of earlier ones. It dispatches the machine's microcoded processor to the properstarting address for each instruction, and passes the instruction's fields to the processor ondemand. A writeable decoding memory allows the IFU to be specialized to a particularinstruction set, as long as the instructions are an integral number of bytes long. There areimplementations of specialized instruction sets for the Mesa, Lisp, and Smalltalk languages.The IFU is implemented with a six-stage pipeline, and can decode an instruction every 60 ns.Under favorable conditions the Dorado can execute instructions at this peak rate (16 mips).This paper has been submitted for publication.CR CATEGORIES6.34, 6.21KEY WORDS AND PHRASEScache, emulation, instruction fetch, microcode, pipeline.c Copyright 1981 by Xerox Corporation. XEROXPALO ALTO RESEARCH CENTER3333 Coyote Hill Road / Palo Alto / California 94304яопMmpфЂопJр"опC¤qфXр;оп=*оп4Вrоп1—sфЏtsфђр4оп0ф·фёрNоп.Џфњфќр?оп-ф’р)ф“р-оп+‡фЕр)фЖtsр"оп* фф™рBоп(фќр4фћоп&ыф‰tsр9фЉоп%wфЂрVоп"Luр.оп“rопhs опЬrоп±sр9оп.vwtр$оп«wфXоЎпlxоЄпиqоЄпdrфGр0Ш· Hџ\ЎAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER221. IntroductionThis paper describes the instruction fetch unit (IFU) for the Dorado, a powerful personal computerdesigned to meet the needs of computing researchers at the Xerox Palo Alto Research Center.These people work in many areas of computer science: programming environments, automatedoffice systems, electronic filing and communication, page composition and computer graphics, VLSIdesign aids, distributed computing, etc. There is heavy emphasis on building working prototypes.The Dorado preserves the important properties of an earlier personal computer, the Alto [13], whileremoving the space and speed bottlenecks imposed by that machine's 1973 design. The history,design goals, and general characteristics of the Dorado are discussed in a companion paper [8],which also describes its microprogrammed processor. A second paper [1] describes the memorysystem. The Dorado is built out of ECL 10K circuits. It has 16-bit data paths, 28 bit virtual addresses, 4K-16K words of high-speed cache memory, writeable microcode, and an I/O bandwidth of 530Mbits/sec. Figure 1 shows a block diagram of the machine. The microcoded processor can executea microinstruction every 60 ns. An instruction of some high level language is performed byexecuting a suitable succession of these microinstructions; this process is called emulation.<==Пф–рQф—оп=Kф‘ф’р)|wоп;ЗфЂр/|wоп7И{фXр!оп4ќwф™р>фљоп3фЂр4оцп0Бр%оцп.iрDоцп,оцп)№оп&ЋфЦр1фЧр%оп% фЉр9ф‹оп#†фЎфўрTоп"{wфЏрBфђоп ~фЈф¤рDопъ фЇрAф°опvфХфЦр:оптф„рaопnфСрVфТопкф‹р;фЊр#опfф…рGф†опвфЂоп·ф†ф‡рBоп3ф±р!фІ{wопЇф‚фѓрGоп+ фЃр4ф‚р&оп§фЌфЋр<оп#фЂрUопшф†р=ф‡ywр"опtфЎрHфўопрфЃрZф‚опlф«рCф¬опиywфµр]опdфЂ{ wywр=ъ·%HџdІAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER26The EU demands instructions from the IFU at an irregular rate, depending on how fast it is able toabsorb the previous ones. A simple machine must completely process an instruction beforedemanding the next one. In a machine with multiple functional units, on the other hand, the firststage in the EU waits until the basic resources required by the instruction (adders, result registers,etc.) are available, and then hands it off to a functional unit for execution. Beyond this point theoperation cannot be described by a single pipeline, and complete execution of the instruction maybe long delayed, but even in this complicated situation the IFU still sees the EU as a single consumerof instructions, and is unaware of the concurrency which lies beyond.Under this umbrella definition for an IFU, a lot can be sheltered. To illustrate the way an IFU canaccommodate specific language features, we draw an example from Smalltalk [5]. In this language,the basic executable operation is applying a function f (called a method) to an object o: f(o, . . .).The address of the code for the function is not determined solely by the static program, butdepends on a property of the object called its class. There are many implementation techniques forfinding the class and then the function from the object. One possibility is to represent a class as ahash table which maps function names (previously converted by a compiler into numbers) into codeaddresses, and to store the address of this table in the first word of the object. The rather complexoperation of obtaining the hash table address and searching the table for the code addressassociated with f, is in the proper domain of an IFU, and removes a significant amount ofcomputation from the processor. No such specialization is present in the Dorado's IFU, however.2.3 Pipelining instruction fetchesFor the sake of definiteness, we will assume henceforth that the smallest addressable unit in the code is a byte;the memory delivers data in units called words, which are larger than bytes;an instruction (and its addresses, immediate operands, and other fields) may occupy one ormore bytes, and the first byte determines its essential properties (length, number of fields,etc.).Matters are somewhat simplified if the addresssable unit is the unit delivered by the memory or ifinstructions are all the same length, and somewhat complicated if instructions may be any numberof bits long. However, these variations are inessential and distracting.The operation of instruction fetching divides naturally into four stages:Generating addresses of instruction words in the code, typically by sequentially advancing aprogram counter, one memory word at a time.Fetching data from the code at these addresses. This requires interactions with themachine's memory in general, although recently used code may be cached within the IFU.Such a cache looks much like main memory to the rest of the IFU.Decoding instructions to determine their length and internal structure, and perhaps whetherthey are branches which the IFU should execute. Decoding changes the representation ofthe instruction, from one which is compact and convenient for the compiler, to one whichis convenient for the EU and IFU.Formatting the fields of each instruction (addresses, immediate operands, register numbers,mode control fields, or whatever) for the convenience of the EU; e.g., extracting fields ontothe EU's data busses.Buffering may be introduced between any pair of these stages, either the minimum of one itemrequired to separate the stages, or a larger amount to increase the elasticity. Note that an item mustbe a word early in the pipe (at the interface to the memory), must be an instruction late in the pipe(at the interface to the EU), and may need to be a byte in the middle.There are three sources of irregularity (see ¶ 2.1) in the pipeline, even when no wrong branches aretaken: яодп\ЬywфXхy wywywywywywy wywyоwопV…ф‘ywywф’опUфлр@фмопS}ф’рIф“опQщф—фywрXопPuр-ф™р8опNсфљрDф›опMmфЂр:ywywфЃопKйфЂрCопHѕфЏр!ywр(фђywопG: ф‘р4ф’р"опE¶фџр3{w {w{w{w{wопD2фТфУрJопB®ф…р({wф†опA*фЏ фђрUоп?¦фѓр;ф„р!оп>" ф‰рMфЉоп<ћфхрQоп; фл{w фмywр%оп9– фЂрHyw оп5—{фXоп3?wфЂр:оцп0зр4оцп.Џр){wоцп,7фЏрFфђоцп*іфњрNфќ оцп)/оп&Чф–р8ф—р#оп%SфрHоп#ПфЂрGоп ¤рIоцпL{ ф’ wр<ф“оцпИфЂр$оцпp{фсwр>фтоцпмфћр6фџywоцпhфЂр8ywоцп{фЏwфђр(оцпЊфЈywф¤оцпфрHф™оцп„фЂywywоцп,{ wфќ{wрFоцпЁфЏр9ywфђоцп$фЂywопМфір,фґр'опHфЂр+фЃр4опДф€{wр@{ wф‰оп@фЂyw{wопиф‡р.ф€р1опdфЂ· VHџaТ·SEC. 2THE PROBLEM27The instruction length is irregular, as noted in the previous paragraph; hence a uniformflow of instructions to the EU implies an irregular flow of bytes into the decoder, and viceversa.The memory takes an irregular amount of time to fetch data; if it contains a cache, theamount of time may vary by more than an order of magnitude. The EU demands instructions at an irregular rate. These considerations imply that considerable elasticity is needed in order to meet the EU's demandswithout introducing delays. 2.4 Hand-off to the EUFrom the IFU's viewpoint, handing-off an instruction to the EU is a simple producer-consumerrelationship. The EU demands a new instruction. If one is ready, the IFU delivers it as a pile ofsuitably formatted bits, and forgets about the instruction. Otherwise the IFU notifies the EU that itis not ready; in this case the EU will presumably repeat the request until it is satisfied. Thus at thislevel of abstraction, hand-off is a synchronized transfer of one data item (a decoded instruction)from one process (the IFU) to another (the EU).Usually the data in the decoded instruction can be divided into two parts: information about whatto do, and parameters. If the EU is a microprogrammed processor, for example, what to do canconveniently be encoded as the address of a microinstruction to which control should go (a dispatchaddress), and indeed this is done in the Dorado. Since microinstructions can contain immediateconstants, and in general can do arbitrary computations, it is possible in principle to encode all theinformation in the instruction into a microinstruction address; thus the instructions PushConstant(3)and PushConstant(4356) could send control to different microinstructions. In fact, however, micro-instructions are expensive, and it is impractical to have more than a few hundred, or at most a fewthousand of them. Hence we want to use the same microcode for as many instructions as possible,representing the differences in parameters which are treated as data by the microcode. Theseparameters are presented to the EU on some set of data busses; ¶ 4 has several examples.Half of the IFU-EU synchronization can also be encoded in the dispatch address: when the IFU isnot ready, it can dispatch the EU to a special NotReady location. Here the microcode can do anybackground processing it might have, and then repeat the demand for another instruction. Thesame method can be used to communicate other exceptional conditions to the EU, such as a pagefault encountered in fetching an instruction, or an interrupt signal from an I/O device. TheDorado's IFU uses this method (see ¶ 3.4). Measurements of typical programs [7, 11] reveal that most of the instructions executed are simple,and hence can be handled quickly by the EU. As a result, it is important to keep the cost of hand-off low, since otherwise it can easily dominate the execution time for such instructions. As the EUgets faster, this point gets more important; there are many instructions which the Dorado, forinstance, can execute in one cycle, so that one cycle of hand-off overhead would be 50%. Thispoint is discussed further in ¶ 3 and 4.2.5 AutonomyPerhaps the most important parameter in the design of an IFU is the extent to which it functionsindependently of the execution unit, which is the master in their relationship. At one extreme wecan have an IFU which is entirely independent of the EU after it is initialized with a code address (itmight also receive information about the outcome of branches); this initialization would only occuron a process switch, complex procedure call, or indexed or indirect jump. At the other extreme isan IFU which simply buffers one word of code and delivers successive bytes to the EU; when thebuffer is empty, the IFU dispatches the EU to a piece of microcode which fetches another memoryword's worth of code into the buffer. The first IFU must decode instruction lengths, follow jumps,and provide the program counter for each instruction to the EU (e.g., so that it can be saved as aяоп]¬yфGхо-Т о;€wоцпWUф·рEфёоцпUСф•ywр,ф–оцпTMоцпQхфІр2фір"оцпPqфЂр7оцпNywр,опKБфЉрHф‹ yw опJ=фЂопF>{фXопCwфЛywфМр#ywопAЏфџywф р2ywоп@ф’рCyw ywоп>‡ф†ф‡ ywрHоп=фір]оп;фЂywywоп8Tф•р+ф–р/оп6Рф«ywр-ф¬оп5Lф†рAф‡ {оп3ИwфЇр'ф°р/оп2D фЏфђрXоп0А фќфћр8|wоп/<фў|wфЈр:оп-ёфЋрLфЏ оп,4фЊрNфЌ оп*°фПфР{ wр3оп), фЂywр6оп&ф¤ywywфҐр.ywоп$}фЈyw |wф¤оп"щ ф¶р#ф·р0оп!uфџф р@ywопсфЪр+фЫywywопmфЂywопBфќ фћрKопѕф‰р%ywр+фЉ оп:ф’ф“рZyоп¶wфНр(фОр2оп2фіфґрFоп®фЂр#опЇ{фXоп„wфЈр2ywр$опфрFф™оп |фѓywр&ywф„опшф‘рXф’опtф”р`опрфЈywрLyw опlф™ywywр.фљопиф‘ф’р%ywр/опdфћр9ywфџ ъ·†HџbўДAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER28return link). The second leaves all these functions to the EU, except perhaps for keeping track ofwhich byte of the word it is delivering. One might think that the second IFU cannot helpperformance much, but in fact when working with a microcoded EU it can probably provide halfthe performance improvement of the first one, at one-tenth the cost in hardware. The reason canbe seen by examining the interpreter fragment at the beginning of ¶ 2; half a dozen micro-instructions are typically consumed in the clumsy GetInstruction operation, and things get worsewhen instructions do not coincide with memory words.When deciding what trade-offs to make, one important parameter is the speed of the EU. It ispointless to be able to execute most instructions in one or two cycles, if several cycles are consumedin GetInstruction. Hence a fast EU must have an autonomous IFU. An important special case is thespeed of the memory relative to the microinstruction time. If several microinstructions can beexecuted in the time required to fetch the next instruction from memory, the processor can use thistime to hold the IFU's hand, or to perform the GetInstruction itself. On the Dorado, the cacheensures that memory data arrives almost immediately, so there is no free time for handholding.An autonomous IFU must do more than simply transforming instructions into a convenient form forthe EU. There are two natural ways in which its internal operation may be affected by the instruc-tion stream: decoding instruction lengths, and following branches. Any IFU which handles morethan one instruction without processor intervention must calculate instruction lengths. Followingbranches is desirable because it avoids the cost of a start-up latency at every branch instruction(typically every fifth instruction is a branch). However, it does introduce potential complicationsbecause a conditional branch must be processed without accurate information (perhaps without anyinformation) about the actual value of the condition; indeed, often this value is not determineduntil the processor has executed the preceding instruction. A straightforward design decideswhether to branch based on the opcode alone, and the processor restarts the IFU at the correctaddress if the decision turns out to be wrong. The branch decision may be based on other historical information. The S-1 [17], for instance, keepsin its instruction cache one bit for each instruction, which records whether the instruction branchedlast time it was executed. This small amount of partial history reduces the fraction of incorrect branch decisions to 5% [Forest Baskett, personal communication]. The MU5 [4] remembers theaddresses of the last eight instructions which branched; such a small history leaves 35% of thebranches predicted wrongly, but the scheme allows the prediction to be made before the instructionis fetched. More elaborate designs [16] follow both branch paths, discarding the wrong one whenthe processor makes the branch decision. Each path may of course encounter further branches,which in turn may be followed both ways until the capacity of the IFU is exhausted. If each path istruly followed in parallel, then following n paths will in general require n times as much hardwareand n times as much memory bandwidth as following one path. Alternatively, part or all of theIFU's resources may be multiplexed between paths to reduce this cost at the expense of bandwidth. 2.6 BufferingAs we saw in ¶ 2.2, a pipeline with any irregularities must have buffering to provide elasticity, or itsperformance at each instant will approximate the performance of the slowest stage at that instant;this maximizing of the worst performance is highly undesirable. From the enumeration in ¶ 2.3 ofirregularities in the IFU, we can see that to serve the EU smoothly, there should be a buffer betweenthe EU and any sources of irregularity, as shown in Figure 2. Similarly, to receive words from theirregular memory, there should be a buffer between the memory and any sources of irregularity.Because of the irregularity caused by variable length instructions, a single buffer cannot serve bothfunctions. Note that additional regular stages (some are shown in the figure) have no effect oneway or the other.одпUЉywфXхy wywywywywywy wywyоwопO3фњфќр.ywр%опMЇфбр9фвywопL+ фҐр&ф¦ywопJ§фљр;ф›р"опI#фЩрAфЪопGџфМр&| w фНопFфЂр0опBрф¶рOywопAlфѓ ф„рRоп?ифђ| wywф‘ywр#оп>dфЕрWфЖоп<аф€ф‰рTоп;\фАфБyw| wр"оп9ШфЂрWоп6ф„ ф…ywрNоп5)фЌywфЋрQоп3ҐфЇр?ф°ywоп2!ф±р^оп0ќф№рAфєоп/ ф¬рBфоп-•фЋ фЏрKоп,ф№фєр=оп*ЌфлрIфмоп) фєр-ф»ywоп'…фЂр(оп$ZфЃф‚рZоп"Цф‹р4фЊр/оп!Rфір<фґр"фЂопОфЕр*фЖywопJ{wфИфЙрDопЖфЊр8фЌ{wопBфќр2фћр,опѕф¶ф·р?оп:фѓр5ф„ywоп¶ф”р&{w{wоп2ф§{wфЁрYоп®ywфЂр^опЇ{фX оп„wф‚фѓрcоп ф р%фЎр2оп |фЏрVфђопш ф„ywywр+опtф“ywф”рUопрфрGф® опlф”р>ф•опи фЁрIф© опdфЂ я·ЁHџZЂБSEC. 2THE PROBLEM29<==опф‚фѓр*ywоп— фЂопlф”р0ywф•р!опифѓф„рEопdфЗр6|я8·:®Hџ0zЛо(п:…pп;qфЗх фИrп:…pоп9$qф…ф†р>rп8—pп9$qоп76ф р%rп6©pп76qфЎр5оп5HфВрJфГоп3Дфёр#ф№р0rп37pп3Дqrп37pоп1Цqф‘рVф’оп0RфҐр/ф¦р'оп.ОsqфЈр&ф¤р9оп-Jф±рGфІоп+ЖфНфОрSоп*Bфр6ф™р%оп(ѕфЂsqоп$їtфXоп!”qф¦ф§sqр5оп ф„рaопЊфЂр3опaфбsqфвр7опЭsqфЂр5фЃsqопYфЂфЃрAопХфЌр%фЋр;опQфДрQфЕsqопНфѓsqф„р9опIфЇр<ф°р&опЕфЗрXопAф–ф— sqр2оп ЅфЈр'ф¤р1оп9фЂ8т·Hџ AN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER32<==щюr$о rпуро rп!,роп!,ропуро«п%jюr$о«п)Ьюr$о rп#Аюд$оVп"8ю$«о rп"ю $о rп"ю$Оо rп†ю$Оо rп†ю $оVпЄю$«о rп1юд$о дп9юG«о дп5«юG«о дп/юG9о дп*юG«о дп#гюG«о дпqюGЗо дпUюG«о дпqюG9о дпЖюG9о дпqюG«оpп<2р оп8Nроп6¤ро«п+‡ро дпѕсо:п<2ро¬п3Nр о¬п1¤ро:п'Јро:пЈр оИпшро¬пр опjр о Џп<2ро Цп2Аро«п'ро Цп†ро Hп†ро Џп†роЏп-2ро rп.Ьюд$оVп-Uю$«о rп-2ю $о rп-2ю$Оо rп+‡ю$Оо rп+‡ю $оVп+«ю$«о rп-2юд$оЏп"Јро rпўюд$оVп ю$«о rпшю $о rпшю$Оо¬tп8р о¬п,лр о¬п!@р оЏп\роЏп Нроpпiр+`6G;#H?ЧSEC. 3ARCHITECTURE OF THE DORADO IFU33There are two words of buffering after MEMORY, but there is no other buffering except for theminimum single item between stages, contrary to the arguments of ¶ 2.6. This design was adoptedpartly to save space, and partly because we did not fully understand the issues in maintaining peakbandwidth. Fortunately the peak bandwidth of the IFU is substantially greater than what theprocessor is likely to demand for more than a very short interval (see ¶ 6), so that not much usefulthroughput is lost because of the inadequate buffering.3.4 ExceptionsException conditions are handled by extending the space of values stored in an item and handedoff from one stage to the next, rather than by establishing separate communication paths. Thus, forexample, a page fault from the memory is indicated by a status bit returned along with the dataword; the resulting "page fault value" is propagated through the pipe and decoded into a page faultdispatch address which is handed to the processor like any ordinary instruction. Each exception hasits own dispatch address. Interrupts cause a slight complication. The IFU accepts a signal calledReschedule which means "cause an interrupt;" this signal is actually generated by I/O microcode inthe processor, but it could come from separate hardware. The next item leaving DECODE ismodified to have a reschedule dispatch address. The microcode at this address examines registers tofind out what interrupt condition has occurred. Since the reschedule item replaces one of theinstructions in the code, it has a PC value, which is the address of the next instruction to beexecuted. After the interrupt has been dealt with, the IFU will be restarted at that point.The exceptions may be divided into three classes:1)the IFU has not (yet) finished decoding the next instruction, and hence is not ready torespond to a processor demand;2)it is necessary to do something different (to handle an interrupt or a page fault);3)there has been a hardware problemќit is not wise to proceed.Since more than one exception condition may obtain at a time, they are arranged in a fixed priorityorder. Exceptions are communicated only by a dispatch; hence, all exceptions having to do with aparticular opcode must be detected before it is handed off. Thus all the bytes of an instructionmust have been fetched from memory and be available within the IFU before it is handed off.3.5 Contention and dependenciesThere is no contention for resources within the IFU, and the only contention with the rest of theDorado is for access to the memory. The IFU shares with the processor a single address bus to theDorado's cache, but has its own bus for retrieving data. The processor has highest priority for theaddress bus, which can handle one request per cycle. Thus under worst-case conditions the IFU canbe locked out completely; eventually, of course, the processor will demand an instruction which isnot ready and stop using the bus. Actual address bus conflicts are not a major factor (see ¶ 6.3).Although ideally the MEMORY stage is regular, in fact collisions with the processor can happen;these irregularities are partially compensated by the two words of buffering after MEMORY. Inaddition cache misses, though very rare, cost about 30 cycles when they do occur.There is only one dependency on the rest of the execution pipeline: starting the IFU at a new PC.Since no attempt is made to detect modifications of code being executed, or to execute brancheswhich depend on the values of variables, the only IFU-processor communication is hand-offsynchronization and resetting of the PC, and these are also the only communication between the IFUstages. The IFU is completely reset when it gets a new PC; no attempt is made to follow more thanone branch path, or to cache information about the code within the IFU. The shortage of bufferingmakes the implementation of synchronization rather tricky; see ¶ 5 .The IFU takes complete responsibility for keeping track of the PC. Every item in the pipe carries itsPC value with it, so that when an instruction is delivered to the processor, the PC is delivered at theоп^ЈsфGхо џо;€qопXLф¶р"sq ф·р%опVИфђр@ф‘опUDфЋр"фЏр;опSА фТр#фУsqр'опR<фЊрFфЌопPё фЂр-опL№tфX опIЋqф¤р.фҐр'опH ф„р+ф…р6опF†ф©рCфЄопEф‚фѓрZопC~ф‚фѓрKопAъф§р@фЁsqоп@vr qфњфќрBsqsqоп>тфЫр"фЬр+sqоп=nфЂр>фЃоп;кфГфДрVоп:fфМsqфНр3оп8вфЂр/sqр!оп5·р1оlп3_оцфБsqр$фВр,оцп1ЫфЂоlп/ѓоцрSоlп-+оцр<оп*Уф…р"ф†р<оп)Oфђф‘р>оп'Л фрKф®оп&GфЂр;sqоп"HtфXопqфҐф¦sqр.оп™фЌр#sqфЋопф“р*ф”р2оп‘ф„ф…р:sqоп фљрAф›оп‰фЂр`оп^фєsqр%ф»опЪфВр4фГsqопVфЂрIоп+фњр>фќ sq sqоп§ф©р+фЄр/оп#фхр,фцsqр$оп џф„sqф…sопqф€sqр'ф‰sqр(оп—ф€р@sqопфЂр?опиф„sqр8sqф…опdsqф‰рJфЉsqV·ЏHџc™ІAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER34same time. The processor actually has access to all the information needed to maintain its own PC,but the time required to do this in microcode would be prohibitive (at least one cycle perinstruction). The IFU can also follow branches, provided they are PC-relative, have displacements specifiedentirely in the instruction, and are encoded in certain limited ways. These restrictions ensure thatonly information from the code (plus the current PC value) is needed to compute the branchaddress, so that no external dependencies are introduced. It would be possible to handle absoluteas well as PC-relative branches, but this did not seem useful, since none of the target instruction setsuse absolute branches. The decoding table specifies for each opcode whether it branches and howto obtain the displacement. On a branch, DECODE resets the earlier stages of the pipe and passesthe branch PC back to ADDRESS. The branch instruction is also passed on to the processor. If it isactually a conditional branch which should not have been taken, the processor will reset the IFU tocontinue with the next instruction; the work done in following the branch is wasted. If the branchis likely not to be taken, then the decoding table should be set up so that it is treated as anordinary instruction by the IFU, and if the branch is taken after all, the processor will reset the IFUto continue with the branch path; in this case the work done in following the sequential path iswasted. Even unconditional jumps are pased on to the processor, partly to avoid another case inthe IFU, and partly to prevent infinite loops in the IFU without any processor intervention. 4. IFU-processor hand-offWith a microcoded execution unit like the Dorado's processor, efficient emulation depends onsmooth interaction between the IFU and the processor, and on the right kind of concurrency in theprocessor itself. These considerations are less critical in a low-performance machine, where manymicrocycles are used to execute each instruction, and the loss of a few is not disastrous. A high-performance machine, however, executes many instructions in one or two microcycles. Adding oneor two more cycles because of a poorly chosen interface with the IFU, or because a very commonpair of operations cannot be expressed in a single microinstruction, slows the emulator down by 50-200%. The common operations are not very complex, and require only a modest amount ofhardware for an efficient implementation. The examples in this section illustrate these points.Good performance depends on two things:An adequate set of data busses, so that it is physically possible to perform the frequentcombinations of independent data transfers in a single cycle. We shall be mainly concernedwith the busses which connect the IFU and the processor, rather than with the internaldetails of the latter. These are summarized in Figure 4.A microinstruction encoding which makes it possible to specify these transfers in a singlemicroinstruction. A horizontal encoding does this automatically; a vertical one requiresgreater care to ensure that all the important combinations can still be specified.We shall use the term folding for the combination of several independent operations in a singlemicroinstruction. Usually folding is done by the microprogrammer, who surveys the operations tobe done and the resources of the processor, and arranges the operations in the fewest possiblenumber of microinstructions.одпL—sqфXхs qsqsqsqsqsqs qsqsоqопF@фЊр,фЌр0sqопDјфирSфйопC8фЂоп@ фЯsqр-sqфа оп>‰ф›фњрIоп=фСр,фТsqр'оп;Ѓф™р!фљр9оп9эфѓsqр?ф„оп8yф”р5ф•р(оп6хфќр(sqфћоп5qф‹sqsqр+фЊоп3нфЏфђр:sqоп2iфЌр[оп0ефЕфЖрLоп/aфЋsqфЏр2sоп-Эqф¬рVфоп,YфЈр*ф¤р/оп*ХфЂsqр.sqр$оп&uфXvuоп"ЧqфСр%фТр3оп!SфЏsqфђр,опПф¤р3фҐр&опK ф р&фЎр2опЗ фЌрTопCфћр:фџsqопїф‡р.ф€р1оп;фХфЦр2оп·фЂрXопЊр'оцп4ф¶р?ф·оцп°фЃф‚рLоцп,фВsqфГр)оцпЁфЂр2оцп Pф§рRфЁоцпМфГфДр1оцпHфЂрKопрфґ фµ tqрBопlф™фљрFопифГфДрOопdфЂ а·›HџQЌySEC. 4IFU-PROCESSOR HAND-OFF35<==>MEMORYoutputbuffer.........оп^¦sфGхо&Tо;€qоtпd1я2·ЊHџcњоИxп0`роИп2™роrп2™роrп0`роп6Чюr$оп;Iюr$оп1Iюr$оп,Чюr$оrп&`роrп(™роИп(™роИп&`роИпFроИпHDроrпHDроrпFропLѓюr$опPхюr$оИп;}роИп=¶роrп=¶роrп;}ропAфюr$опFfюr$о:пJьюG«о:пFЉюG«о:п@mюG«о:п;mюG9о:п4ВюG9о:п1mюG«о#ЏpпdсодпN.ропCџроќп8‚родп.‚роИп4ћюд$о«п3ю$«оИп2фю $оИп2фю$Оодп юskо!VпіюkИодпHюЭkодпHюk2оrп6Iю•$оrп zю$топ5Яю$о9п5»юл$о9п zю$dо Ћп6Чю]$о Ћп zю$Ѓо:п zю$Ђоr|пFро«пFродпFроrп;}ро«п;}родп;}роЃqп"роrп#eро№п"ро xп ро дп ро«п ро«п роИп6ыю$‡оп6Iю$Холpп&,ро$п&,ролп&,ро@п&,ро–uп#Aро–п"%ро]п#Aро]п"%родпSuюskо!Vп)Ґюk*:одп);юЭkодп);юk*ҐоVпJью$«одпKЉю$оVпFfю$Оо«пUхюЏ$о9пUgюr$оqп"ро(п—ю$!о-ђп ю–kоA&піюkИо-ђпHюkо-ђпHюk2о!Vпею@$о"єпю Ц$о"єпsю Ц$одпKfю«$о(Џп;mю$оVпJШю«$о(п=¦ю$ Vо#ЏпИро"sп:ро rпЧр о.¬пЧро«п]ю$J»о«п:ю6 $о7іп:ю$yо7%пИю$ло9плю$о(Џп—ю$!о(Џпsю$о!VпsюО$о=Bqп#‰ро?{п"ро@Pп$ЙроBРxпроDpп&єро«пєю$о«п+ю$о«пЃю$оVп+р оVптро«пр9о rпQѓроИпћюU$опъю$ЗоИпЦюy$оИпЦю$лоп,юЋ$о¬п¤ю$«о9пЃю–$о дпЃюд$оИпOю$Uо дп,ю$о дп,ю$yоrtпроrпроVпlю$rоVпЭю$rоrпsроrпЏро дпќю$yо дпќю$оИпБю$Uо дпуюд$оИпіюЋ$оИп%юЋ$о9п%ю«$о9п–ю«$о:пБю$ rо:пћю$о:п,ю$опюy$опЃюy$опуюy$опeюy$оVп^ю•$оVпю•$оЏпію$оЦпію$оVпПю]$оЦпПют$опИюy$оп¬юy$опWюy$оп:юy$оИпъюЋ$оИпlюЋ$опsюХ$отп:ю$отп]ю$2отпlют$о]пЭю‡$о9пЭю$Gо9пЃю$9оkплю$¤оъплю$¤о9пИюV$оъпИю,$о9пБю$9ЙоuпИроVvпsроuпроЏпOюЗ$оЦпюЂ$оЦпИю$dоЏпИю$«оVпҐю]$оЦпҐют$о,xп ро!ќп роdпLoро п ро дп ро«п ро«п роИп?=ро; пHю9$о=Bпъю$rо; пЦю]$о; пЦю$•о<%пъю$rо?{пъю$rо=Bп,ю9$о=Bпую9$о?{пюд$оC_пю$-%оVпHю,,$о>^п ро>^пСро9мtпsро:ћпVро:ћпро8Рп3ю]$о8Рпмю]$о7|п_ро5еп_ро4Йп_роќпКроќпхроќпроќп‘роќпfроќп<ро мп–юХ$о щпAют$о щпdю$ Џо ҐпЭю$о@п€юd$о@п€юЗ$о@п€юЗ$о@пAюЗ$о@п¬ю$о@п¬ю$№о Ґп€ю$yо мпAю$yо щпПюл$оуп:ю$]опT‘ю$$о:пTµю$dо:пQю$]о«пQю$Ђо«пTµю$Х2>њЊE V№AN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER364.1 How the processor sees the IFUThe processor has four main operations for dealing with the IFU. Two are extremely frequent:IFUJump: The address of the next microinstruction is taken from the IFU; a ten bit bus passes thedispatch address to the processor's control section. In addition, parts of the processor state areinitialized from the IFU, and other parts are initialized to standard values (see ¶ 4.2). IFUJumpcauses the IFU to hand off an instruction to the processor if it has one ready. Otherwise the IFUdispatches the processor to the NotReady location. The microcode may issue another IFUJump at thatpoint, in which case the processor will loop at NotReady until the IFU has prepared the next instruc-tion. An IFUJump is coded in the branch control field of the microinstruction, and hence can bedone concurrently with any data manipulation operation.IFUData: The IFU delivers the next field datum on the IFUData bus, which is nine bits wide (eightdata bits plus a sign). Successive IFUData's during emulation of an instruction produce a fixedsequence of values determined by the decoding table entry for the opcode, and chosen from: a small constant N in the decoding table entry;the alpha byte, possibly sign extended;either half of the alpha byte;the beta byte;the instruction length.IFUData is usually delivered to the A bus, one of the processor's two main input busses, from whichit can be sent through the ALU, or used as a displacement in a memory reference. In this case it isencoded in the microinstruction field which controls the contents of this bus, and hence can bedone concurrently with all the other operations of the processor. IFUData can also be delivered to B,the other main input bus, from which it can be shifted, stored, sent to the other ALU input, oroutput. This operation is encoded in the special function field, where it excludes a large number ofrelatively infrequent operations as well as immediate constants and long jumps, all of which also usethis field. For the details of the processor and its microinstructions, see [8].The other two IFU-related operations are less frequent, and are also coded in the special functionfield of the microinstruction:PC: The IFU delivers the PC for the currently executing instruction to the B bus.PC_: resets the IFU and supplies a new PC value from the B bus. The IFU immediately startsfetching instructions from the location addressed by the new PC.In addition there are a number of operations that support initialization and testing of the hardware. Strictly speaking, the IFUData and PC operations do not interact with the IFU. All the informationthe IFU has about the instruction is handed off at the IFUJump, including the field data and the PC(about 40 bits). However, these bits are physically stored with the IFU, and sent to the processorbusses incrementally, in order to reduce the width of the busses needed (to 9 bits, plus a 16 bit busmultiplexed with many other functions). From the microprogrammer's viewpoint, therefore, thedescription we have given is natural. We illustrate the use of these operations with some examples. First, here is the actual microcodefor the PushConstant instruction introduced in ¶ 2.PushConstantByte:Push[IFUData], IFUJump;-- Reduced from 9 microinstructions to 1!To push a 16 bit constant, we need a three byte instruction; alpha contains the left eight bits of theconstant and beta the right eight bits.яодпX¶sфGхр@оqопR_tфXwопO4qфЂр9sqопL srqфћр,фџsqопJ…фЙфКрKопI фї sqрAфАsrопG}qф sqфЎр3sопEщq фЃrqф‚srqопDuф€ф‰ rq sqопBсф¤фҐsrqрOопAmфЂр3оп>Bsrqфљsqф›srqр$оп<ѕфКфЛsrqр5оп;:фЂрSоцп9¶rqоцп82р'оцп6®оцп5* оцп3¦оп0{srqф€sqф‰р.оп.чфЉsqр'ф‹оп-sф¶р=ф·оп+пф†р>srqф‡sqоп*kфµрOsqф¶ оп(зф„р@ф…оп'c фЃр!ф‚р:оп%ЯфЂрMоп"ґф¤ sqрQоп!0фЂопШsqsq sqр0sqопЂsqфА фБsqsqsq sqопьфЂр5sqоп¤рfопyф— фsrqsqр%sqопхф“sqр0srqф”sопqqфЎр5фў sqопнф‡р&ф€р9опi ф»фјр6оп е фЂоп єф фЎрQоп 6фЂrqопQsоaпxsфGxsор)опиqф‡р(ф€р<опdфЂя ј· |Hџ]¬ќSEC. 4IFU-PROCESSOR HAND-OFF37PushConstantWord:temp _ LeftShift[IFUData, 8];-- put alpha into the left half of tempPush[temp or IFUData], IFuJump;-- or in beta, push the result on the stack, and dispatch to the next instructionNotice that the first microinstruction uses the IFU to acquire data from the code stream. Then thesecond microinstruction simultaneously retrieves the second data byte and dispatches to the nextinstruction. These examples illustrate several points. Any number of microinstructions can be executed to emulate an instruction, i.e., betweenIFUJumps. Within an instruction, any number of IFUData requests are possible; see Table 3 for asummary of the data delivered to successive requests. IFUJump and IFUData may be done concurrently. The IFUData will reference the currentinstruction's data, and then the IFUJump will dispatch the processor to the first microinstruc-tion of the next instruction (or to NotReady).Suppose analysis of programs indicates that the most common PushConstant instruction pushes theconstant 0. Suppose further that 1 is the next most common constant, and 2 the next beyond that,and that all other constants occur much less frequently. A lot of code space can probably be savedby dedicating three one-byte opcodes to the most frequent PushConstant instructions, and using atwo-byte instruction for the less frequent cases, as in the PushConstantByte example above, where theopcode byte designates a PushConstantByte opcode and alpha specifies the constant. A third opcode,PushConstantWord, provides for 16-bit constants, and still others are possible.Pursuing this idea, we define five instructions to push constants onto the stack: PushC0, PushC1,PushC2, PushCB, PushCW. Any five distinct values can be assigned for the opcode bytes of theseinstructions, since the meaning of an opcode is completely defined by its decoding table entry. Theentries for these instructions are as follows: (N is a constant encoded in the opcode, Length is theinstruction length in bytes, and Dispatch is the microcode dispatch address; for details, see ¶ 5.4).OpcodePartial decoding table contents-- RemarksPushC0Dispatch_PushC, N_0, Length_1-- push 0 onto the stackPushC1Dispatch_PushC, N_1, Length_1-- push 1 onto the stackPushC2Dispatch_PushC, N_2, Length_1-- push 2 onto the stackPushCBDispatch_PushC, Length_2-- push alpha onto the stackPushCWDispatch_PushCWord, Length_3-- push the concatenation of alpha and beta onto the stackHere is the microcode to implement these instructions; we have seen it before:PushC:-- PushC0/1/2, (ifuData=N), PushCB, (ifuData=alpha)Push[IFUData], IFUJump; PushCWord:-- PushCW, temp _ Lshift[IFUData, 8];-- (IFUData=alpha here)Push[temp or IFUData], IFUJump;-- (IFUData=beta here)Observe that the same, single line of microcode (at the label PushC) implements four differentopcodes, for both one and two byte instructions. Only PushConstantWord requires two separatemicroinstructions.4.2 Initializing stateA standard method for reducing the size and increasing the usefulness of an instruction is toparameterize it. For example, we may consider an instruction with a base register field to beparameterized by that register: the "meaning" of the instruction depends on the contents of theregister. Thus the same instruction can perform different functions, and also perhaps can get bywith a smaller address field. This idea is also applicable to microcode, and is used in the Dorado.For example, there are 32 memory base registers. A microinstruction referencing memory does notspecify one of these explicitly; instead, there is a MemBase register, loadable by the microcode,which tells which base register to use. Provided the choice of register changes infrequently, this isan economical scheme. яоп^„sфGхо&Tо;€qопXtsоaпW6xsор'оaпUш vsxsxsоvsрLопRНqф’р*sqр0опQIфµр5ф¶р%опOЕфЂр,оцпMmфҐрMф¦оцпKйsrqфЂоцпI‘фКsrqр'фЛоцпH фЂр0оцпEµsrqфЇsrqф°srqоцпD1фѓsrq ф„р)оцпBфЂrqоп?‚фҐр5rqоп=юф–рXопоп3?ф·фёрErqrqоп1»rqфЁrqrqрIоп07ф…рWоп.іфўр'фЈyrqр&rqоп-/ фЂrqр<оп+sо4фGо оп) о4ооп'По4ооп&‘о4ооп%Sо4ооп$о4ор:оп кqфЂрJопЩsофGр2оaп›xsxsопЉ о оaпL xsоxsоaп vsxsxsоxsоп¶qфЗр0фИrqоп2фЗфИrqоп®опЇtфXоп„qфТрJфУопtqфЙфКрNоп |ф¶р'ф·р+опшф©рQфЄопtф”рXф•опрф‹р.фЊр/опlфЕр(фЖrqр%опиф•рaопdфЂяИ·®HџczВAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER38For emulation it presents some problems, however. Consider the microcode to push a localvariable; the address of the variable is given by the alpha byte plus the contents of the base registerlocalData, whose number is localDataRegNo:PushLocalVar:MemBase _ localDataRegNo;-- Make memory references relative to the local data.Fetch[IFUData];-- Use contents of PC+1 as offset.Push[memoryData], IFUJump;-- Push variable onto stack, begin next instructionThis takes three cycles, one of which does nothing but initialize MemBase. The point is clear: suchparametric state should be set from the IFU at the start of an instruction, using information in thedecoding table. This is in fact done on the Dorado. The decoding table entry for PushLocalVarspecifies localData as the initial value for MemBase, and the microcode becomes:PushVar:Fetch[IFUData];-- IFU initializes MemBase to the local dataPush[memoryData], IFUJump;-- Push variable onto stack, begin next instructionOne microinstruction is saved. Furthermore, the same microcode can be used for a PushGlobalVarinstruction, with a decoder entry which specifies the same dispatch address, but globalData as theinitial value of MemBase. Thus there are two ways in which parameterization saves space overspecifying everything in the microinstruction: each microinstruction can be shorter, and fewer areneeded. The need for initialization, however, makes the idea somewhat less attractive, since itcomplicates both the IFU and the EU, and increases the size of the decoding table. A major reduction in the size of the decoding table can be had by using the opcode itself as the dispatch address. This has a substantial cost in microcode, since typically the number of distinctdispatch addresses is about one-third of the 256 opcodes. If this price is paid and parameterizationeliminated, however, the IFU can be considerably simplified, since not only the decoding table spaceis saved, but also the buffers and busses needed to hand off the parameters to the processor, andthe parameterization mechanism in the processor itself. On the Dorado, the advantages ofparameterization were judged to be worth the price, but the decision is a fairly close one. Thecurrent memory base register and the current group of processor registers are parameters of themicroinstruction which are initialized from the IFU. The IFU also supplies the dispatch address atthe same time. The remainder of the information in the decoding table describes the data fieldsand instruction length; it is buffered in EXECUTE and passed to the processor on demand.4.3 ForwardingEarlier we mentioned folding of independent operations into the same microinstruction as animportant technique for speeding up a microprogram. Often, however, we would like to fold theemulation of two successive instructions, deferring some of the work required to finish emulation ofone instruction into the execution of its successor, where we hope for unused resources. Thiscannot be done in the usual way, since we have no a priori information about what instructioncomes next. However, there is a simple trick (due to Ed Fiala) which makes it possible in manycommon cases.We define for an entire instruction set a small number n of cleanup actions which may be forwardedto the next instruction for completion; on the Dorado up to four are possible, but one must usuallybe the null action. For each dispatch address we had before, we now define n separate ones, onefor each cleanup action. Thus if there were D addresses to which an IFUJump might dispatch, thereare now nD. At each one, there must be microcode to do the proper cleanup action in addition tothe work required to emulate the current instruction. The choice of cleanup action is specified bythe microcode for the previous instruction; to make this convenient, the Dorado actually has fourkinds of IFUJump operations (written IFUJump[i] for i=0, 1, 2, 3), instead of the one describedabove. The two bits thus supplied are ORed with the dispatch address supplied by the IFU todetermine the microinstruction to which control should go. To avoid any assumptions about whichpairs of successive instructions can occur, all instructions in the same instruction set must use thesame cleanup actions and must be prepared to handle all the cleanup actions. In spite of thislimitation, measurements show that forwarding saves about 8% of the execution time in straight-linecode (see ¶ 6.4); since the cost is very small, this is a bargain.яодп_2sфGхр@оqопXЫфЮр8фЯопWWф…рLф†опUУrqфЂrопSВsоaпR„фGор5оaпQFxsоxsоaпPxsор3опM°qфђр>rqф‘опL, ф™sqфљр,опJЁф«ф¬р.rопI$qфЂrqrqопGsоaпEХxsофGxsр&оaпD—xsор3опAlqфџр.tqф rоп?иqфр.ф®r qоп>dфВ rtqфГрCоп<а фЁрTф©оп;\фЖфЗрXоп9Ш фЂ sqsqр1оп6фўр9фЈр&фЂоп5)фЁ ф©рQоп3ҐфЊрLфЌоп2! фѓ sqр&ф„р"оп0ќфџф р?оп/фчфшрEоп-•фіфґрDоп,ф·рFфёоп*Ќфќфћ sqsqр&оп) ф¦ ф§рRоп'…фЂр'sqр'оп#†tфX оп [qфЯр8фаопЧф р>фЎопSф†рNф‡опПфИрVфЙопKфёф№tqр#опЗфўрZопCфЂопф‚р5tqфѓtqtq tоп”qф‡р9ф€р(опф™tqр/tqопЊф€tqр%tqф‰srqопфЊtqрJфЌоп„ф•р>ф–р"опфЎр5фўр)оп |фИsrqsrqtqфЙtqр*опшфїр!sqфАр+sqопtф‹рKфЊопрф¤рCфҐопlфЅрIфѕопи ф‡ф€рAопdфЂр>яж·Hџd(фSEC. 4IFU-PROCESSOR HAND-OFF39We illustrate this feature by modifying the implementation of PushLocalVar given above, to showhow one instruction's memory fetch operation can be finished by its successor, reducing the cost ofa PushLocalVar from two microinstructions to one. We use two cleanup actions. One is null (action0), but the other (action 2) finds the top of the stack not on the hardware stack but in thememoryData register. Thus, any instruction can leave the top of stack in memoryData and do anIFUJump[2]. Now the microcode looks like this:PushLocalVar[0]:Fetch[IFUData], IFUJump[2];-- this entry point assumes normal stack, and leaves top of stack inmemoryData.PushLocalVar[2]:Push[memoryData], Fetch[IFUData], IFUJump[2]; -- this entry point assumes top of stack is in memoryData and leaves it there.In both cases, the microcode executes IFUJump[2], since the top of stack is left in the memoryDataregister, rather than on the stack as it should be. In the case of PushLocalVar[2], the previous instruc-tion has done the same thing. Thus, the microcode at this entry point must move that data into thestack at the same time it makes the memory reference for the next stack value. The reader can seethat successive Push instructions will do the right thing. Of course there is a payoff only becausethe first microinstruction of PushLocalVar[0] is not using all the resources of the processor.It is instructive to look at the code for Add with this forwarding convention:Add[0]:temp _ Pop[];-- this entry point assumes and leaves normal stackStackTop _ StackTop+temp, IFUJump[0];Add[2]:StackTop _ StackTop+memoryData, IFUJump[0]; -- this entry point assumes top of stack is in memoryData, leaves normalstack.This example shows that the folding enabled by forwarding can actually eliminate data transferswhich are necessary in the unfolded code. At Add[2] the second operand of the Add is not put onthe stack and then taken off again, but is sent directly to the adder. The common data bus of the360/91 [15] obtains similar, but more sweeping, effects at considerably greater cost. It is alsopossible to do a cleanup after a NotReady dispatch; this allows some useful work to be done in anotherwise wasted cycle.4.4 Conditional branchesWe conclude our discussion of IFU-processor interactions, and give another example of forwarding,with the example of a conditional branch instruction. Suppose that there is a BranchNotZeroinstruction that takes the branch if the current top of the stack is not zero. Assume that itsdecoding table entry tells the IFU to follow the branch, and specifies the instruction length as thefirst IFUData value. Straightforward microcode for the instruction is:BranchNotZero:-- IFU jumps come here. IFU assumed result#0.if stack=0 then goto InsFromIFUData, Pop;-- Test result in this microinstruction.IFUJump;-- Result was non-zero, IFU did right thing.InsFromIFUData:-- Result was zero. Do the instruction at PC+IFUData.temp _PC+IFUData;-- PC should be PC+Instruction length.PC _ temp;-- Redirect the IFUIFUJump;-- This will be dispatched to NotReady, where the code will loop until the IFUrefills starting at the new location.The most likely case (the top of the stack non-zero) simply makes the test specified by theinstruction and does an IFUJump (two cycles). If the value is zero (the IFU took the wrong path),the microcode computes the correct value for the new PC and redirects the IFU accordingly (fourcycles, plus the IFU's latency of five cycles; guessing wrong is painful). If we think thatBranchNotZero will usually fail to take the branch, we can program the decoding table to treat it asоп^sфGхо&Tо;€qопWЖфр8ф®rqопVBфЌфЋрAопTѕф„rqр+ф…р*опS:фФр#фХр6опQ¶r qфєр%ф»r q опP2srqфЂр$опMMsоaпLxsфGxsорDоипJС опGмоaпF®xsxsрWопCѓqфЈр$stq ф¤r опAяqфЂр;rqоп@{ф‚фѓрRоп>чфЉр3ф‹р*оп=sф rqфЎр1оп;пфЂrqр4оп8Др*rqр!оп5Яsоaп4ЎфGор4оaп3cxsоп0~оaп/@xsрQоип.оп*Чqф¶р'ф·р4оп)Sф–ф—rqrq оп'Пф‘р<ф’р#оп&KфНфОрEоп$Зф™rqфљр"оп#CфЂ опDtфXопqф•sqр1ф–оп•фЬфЭрGrопq фОрDфПопЌфЎsq фўр8оп фЂsrqр:оп$s офGxsxsоaпжvsvsxs ор(оaпЁxsоxsоп—xsор*xsxsоaп Yxsxsоxs xsоaпxsоxоaп Эsоpwsр$xоип џsр%опtqфбр!фвр7опр фџsrqр*sqф опlфЁр2sqф©sqопифь sqр,фэопdrqф“ф”рBL·HџcјAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER40an ordinary instruction and deliver the branch displacement as IFUData, and reverse the sense of thetest.A slight modification of the forwarding trick allows further improvement. We introduce a cleanupaction (say action 1) to do the job of InsFromIFUData above (it must be action 1 or 3, since asuccessful test in the Dorado ors a 1 into the next microinstruction address). Now we write themicrocode (including for completeness the action 2 of ¶ 4.3):BranchNotZero[0]:-- IFU jumps come here. Expect result#0.Test[stack=0], Pop, IFUJump[0];-- Test result in this microinstruction; if the test succeeds, we do IFUJump[1].BranchNotZero[2]:Test[memoryData=0], IFUJump[0];EveryInstruction[1]:-- Branch was wrong. Do the instruction at PC+IFUData.temp _PC+IFUData;PC _ temp;-- Redirect the IFUIFUJump[0];-- This will be dispatched to NotReady, where the code will loop until the IFUrefills starting at the new location.Now a branch which was predicted correctly takes only one microinstruction. For this to work, theprocessor must keep the IFU from advancing to the next instruction if there is a successful test inthe IFUJump cycle. Otherwise, the PC and IFUData of the branch instruction would be lost, and thecleanup action could not do its job. Note that the first line at EveryInstruction[1] must be repeatedfor each distinct dispatch address; all these can jump to a common second line, however.5. ImplementationIn this section we describe the implementation of the Dorado IFU in some detail. The primaryfocus of attention is the pipeline structure, discussed within the framework established in ¶ 2 and ¶3.3, but in addition we give (in ¶ 5.4) the format of the decoding table, which defines how the IFUcan be specialized to the needs of a particular instruction set. Figure 3 gives the big picture of thepipeline. Table 1 summarizes the characteristics of each stage; succeeding subsections discuss eachrow of the table in turn. The first row gives the properties of an ideal stage, and the rest of thetable describes departures from this ideal. This information is expanded in the remainder of thissection; the reader may wish to use the table to compare the behavior of the different stages.The entire pipe is synchronous, running on a two-phase clock which defines a 60 ns cycle; someparts of the pipe use both phases and hence are clocked every 30 ns. An "ideal" stage is describedby the first line of the table. There is a buffer following each stage which can hold one item(b=1), and may be empty (represented by an empty flag); this is also the input buffer for the nextstage. The stage takes an item from its input buffer every cycle (tinput=1) and delivers an item toits output buffer every cycle (toutput=1); the item taken is the one delivered (l=1). The buffer isloaded on the clock edge which defines the end of one cycle and the start of the next. The stagehandles an item if and only if there is space in the output buffer for the output at the end of thecycle; hence if the entire pipe is full and an item is taken by the processor, every stage will processan item in that cycle. This means that information about available buffer space must propagate allthe way through the pipe in one cycle. Furthermore, this propagation cannot start until it is knownthat the processor is accepting the item, and it must take account of the various irregularities whichallow a stage to accept an item without delivering one or vice versa. Thus, the pipe has globalcontrol. Note that a stage delivers an output item whether or not its input buffer is empty; if it is,the special empty item is delivered. Thus the space bookkeeping is done entirely by counting emptyitems.Implementing global control within the available time turned out to be hard. It was consideredcrucial because of the minimal buffering between stages. The alternative, much easier approach islocal control: deliver an item to the buffer only if there is space for it there at the start of the cycle.This decouples the different stages completely within a cycle, but it means that if the pipe is full(best case) and the processor suddenly starts to demand one instruction per cycle (worst case), thepipe can only deliver at half this rate, even though each stage is capable of running at the full rate;одп_XsфGхр@оqопYф‚р=srqфѓопW}опTRф‘ф’рLопRОфАфБrsrqр)опQJ ф·uqр?опOЖфЂр4опLбsофGxsр#оaпKЈxsорExsопJeоaпI'xsопF«ор+xsxsоaпEmxsxsоaпD/xsоxоaпBсsоpwsр$xоипAіsр%оп>€qф€ф‰рPоп=фџsqр*ф оп;ЂфЋsrqsqsrqр1оп9ьфр2ф™rqоп8xфЂрUоп3ҐuфXоп0zqфіфґр.sqоп.цфЊр=фЌр#оп-rфЋфЏрDsоп+оqфЌр]фЋоп*jф•р+ф–р0оп(жф р=фЎр$оп'bфЈр8ф¤р%оп%ЮфЂрVоп"іфЁрIф©оп!/ф‰фЉрRоп«ф»фјр?оп'rqфћrqrqр)фџопЈф™р=rпpпЈqфљопµфЎrп(pпµqр(фўrqопЗфрPф™ опCф—р-фр/опїфЉр6ф‹р+оп;фђtqр$ф‘р*оп·ф„ф…рLоп3ф‚фѓрRопЇфІр"фір3tоп+qф‹фЊрLоп§фЏrqрLфђrоп#qопшф°р:ф±опtф—р1фр*опрtqфѓрEф„ tq опlфЈрLф¤опиф›фњрFопdфЂрc·ЪHџdNТSEC. 5IMPLEMENTATION41StageSizeInputOutputResetRemarks"ideal"t=1; takes onet=l=1; delivers one Clears bufferAll state is in the bufferitem if output item if buffer willto empty on after the stage.is possiblebe empty; b=1PC_ . . .ADDRESSwordNo inputNot if paused, MARand jump;Pass PC by incrementing;contention, or memalso acceptsa source, hence has busy; OK if space innew PCvaluestate (PC).any later buffer.MEMORYwordInternall>2; output is and jump;Must enforce FIFO;complicationsunconditional; b=2 discards out-not really part of IFU;put of fetcheshas state of 0-2in progressfetches in progressBYTESbytet=.5t=l=.5and jumpBreak byte feature.DECODEinstrt>.5; rate de-onlyRecycling to vary rate;pends on ins-splits beta byte; encodestruction lengthexceptions; does jumps.DISPATCHinstrOn IFUJumponlyNotReady is default delay; IFUHold is panic delay.EXECUTEbyteOn IFUDataNo output bufferReset unnecessaryTable 1: Summary of the pipeline stages<==‹sодqоrqо¬rqrqо#Фо-zоп=ќюдFодп=ќю ИFо¬п=ќю(Fо#Фп=ќю ¦Fо-zп=ќюoFоп;ѓsодqоrzqо#Фо-zоп9яо-zоп8{о-zоп7ЌюдFодп7Ќю ИFо¬п7Ќю(Fо#Фп7Ќю ¦Fо-zп7ЌюoFоп5ssодqоsrо#Фqо-zrqо-zп3пsqоп3юдFодп3ю ИFо¬п3ю(Fо#Фп3ю ¦Fо-zп3юoFоп0зsодqоsrо¬qо#Фоп06юдп/cюдЌодп06ю Ип/cю ИЌо¬п06ю(п/cю(Ќо#Фп06ю ¦п/cю ¦Ќо-zп06юoп/cюoЌо©п-р'оtпd1яА·ЊHџcњ4оЏпdсоИп#ую9$опъю$оИпЦю]$оИпЦю$@оИп!єю9$оИпЃю9$оИпHю9$оИпю9$оп"Hроп р опЦропќр опeр о Vп"Hро Vп ро VпЦро Vпќро:пю9$о:пHю9$о:пЃю9$о:п!єю9$о:пЦю$@о:пЦю]$оsпъю$о:п#ую9$о¬п#ую9$оепъю$о¬пЦю]$о¬пЦю$@о¬п!єю9$о¬пЃю9$о¬пHю9$о¬пю9$опю9$опHю9$опЃю9$оп!єю9$опЦю$@опЦю]$оVпъю$оп#ую9$оЏп#ую9$о Ипъю$оЏпЦю]$оЏпЦю$@оЏп!єю9$оЏпЃю9$оЏпHю9$оЏпю9$о#пю9$о#пHю9$о#пЃю9$о#п!єю9$о#пЦю$@о#пЦю]$о%:пъю$о#п#ую9$о'sп#ую9$о)¬пъю$о'sпЦю]$о'sпЦю$@о'sп!єю9$о'sпЃю9$о'sпHю9$о'sпю9$о+епю9$о+епHю9$о+епЃю9$о+еп!єю9$о+епЦю$@о+епЦю]$о.пъю$о+еп#ую9$опЂр оп№р оп троп +р опdроИп+юU$оИп dюU$оИпќюU$оИпЦюU$оИптю$@оИптюy$опю$оИпюU$о:пюU$оЏпю$о:пЦюU$о:пќюU$о:п dюU$о:п+юU$о Vпdро Vп +ро Vп тро Vп№ро:пю$о:птюy$о¬птюy$о¬пю$о¬п+юU$о¬п dюU$о¬пќюU$о¬пЦюU$опю$о¬пюU$опюU$оsпю$опЦюU$опќюU$оп dюU$оп+юU$опю$оптюy$оЏптюy$оЏпю$оЏп+юU$оЏп dюU$оЏпќюU$оЏпЦюU$о!епю$оЏпюU$о#пюU$о&Wпю$о#пЦюU$о#пќюU$о#п dюU$о#п+юU$о#пю$о#птюy$о'sптюy$о'sпю$о'sп+юU$о'sп dюU$о'sпќюU$о'sпЦюU$о*Йпю$о'sпюU$о+епюU$о/;пю$о+епЦюU$о+епќюU$о+еп dюU$о+еп+юU$о+епю$о+ептюy$оИп"HроИп роИпЦроИпeро:п"Hро:п ро:пќро¬п"Hро¬пЦро¬пeроп ропќро#ђп"Hро#ђпЦро#ђпeро(п ро(пќро,sп"Hро,sпЦро,sпeро:пeю$о:пую$о¬п,ю$о¬п ћю$опeю$опую$оп"Чю$о(п"Чю$о(пую$о(пeю$о#ђп,ю$о#ђп ћю$о,sп ћю$о,sп,ю$оИп,ю$оИп троИп +роИпdро:пюU$о:пЦюU$о:пќюU$о¬пќюU$о¬пЦюU$о¬пюU$о:пdро:п +ро¬пdропюU$опЦюU$опќюU$оИп№роVпЂро:п тро:п№роИпЂро¬п +ро¬п тро¬п№ро:пЂропdроп +роп троп№ро¬пЂро#ђп№ро#ђп тро#ђп +ро#пЦюU$о#пќюU$о#п dюU$о#ђпdро(п +ро'sп dюU$о'sпќюU$о(п тро(п№ро'sп dюU$о'sпќюU$о'sпЦюU$о(пdро,sп +ро+епќюU$о+еп dюU$о+еп+юU$о,sп№ро+еп dюU$о+еп+юU$о,sп тро+епЦюU$о+епќюU$о+еп dюU$о+еп+юU$о-пЂро(ђпЂро$пЂро,sпdро Vпую$о дпю$одпЃр<опрEяАЉњЊ/^$йAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER42Figure 5a illustrates this cogging. Figure 5b shows that with two items of buffering after each stage,local control does not cause cogging. The Dorado has small buffers and global control partlybecause buffers are fairly costly in components (see below), and partly because this issue was notfully understood during the design. Note that it is easy to implement global control over a groupof consecutive stages which have no irregularities, since every stage can safely advance if there isroom in the buffer of the last stage. In this IFU, alas, there are no two consecutive regular stages.Unfortunately, the cost of buffering is not linear in the number of items. A two item buffer costsmore than three times as much as a one item buffer; this is because the latter is simply a register,while the former requires two registers plus a multiplexor to bypass the second register when thebuffer is empty, as shown in Figure 6. Without the bypass a larger buffer increases the latency ofthe pipe, which is highly undesirable since it slows down every jump which the IFU doesn't predictsuccessfully. Once the cost of bypassing is paid, however, a multi-item buffer costs only a littlemore, since a RAM can be used in place of the second register. Although there are no such buffersin the Dorado, it is interesting to see how they are made.<==;пЄю$о9Йп‡ю•$о9Йп‡ю$@о2ђxп,ро0Wп,ро/;п1юЋ$о.¬пЋю$о.¬пjю9$о0епUю$9о5WпЋю$о3пjю]$о3пUю$9о+WпюО$о)¬tпОро,єпОро3eпОр о:ћпОро:uпGро:п*р о2п8ю$о!еп ю$9о«qп їроп їр оЏп роп р оЏп їр о:п ро7xпуро4Йпуро5Wп ю$9о3¬пшюЋ$о7ђпшю$]о:Wtп ?ро3пjю$о9;пяю$Ћо3пЫю@$о3пЫю$Іо:Wп xро3п їю$Іо3п їю@$о9;п гю$Ћо3пMю$о6tпЌю$Ћо6tпяю$Ћо7ђпjю$]о5WпЌю$9о4Йxпdро7пdро:Wtп?ро:Wп#роrп ю$9оsп ю$9оsпю$«оVпЈю$оVпяю$ rоЏпяю$ЗоЏпЈю$оsпЗю$«опNю$ОопNю $оепqю$«опшюд$опЬюд$оИxп Цроп Цро Vpпjр7одxп Hродпeро&Wпуро&Wпћро+Wпуро,suпр)о,sпкря~& c$Ћ@-rKSEC. 5IMPLEMENTATION43adder would be needed to handle the variable instruction lengths, and it would cost about fourtimes as much.Every item also carries a status field, which is used to represent various values that do notcorrespond to ordinary instructions: empty, page fault, memory error. These are converted intounique dispatch addresses when the item is passed to the processor, as discussed in ¶ 3.4.5.1. ADDRESS stageThis stage generates the addresses of memory words which contain the successive bytes of code.Unlike the other stages, it has no ordinary input, but instead contains a PC which it increments bytwo (there are two bytes per memory word) for each successive reference. The PC can also take ona pause value which prevents any further memory references until the processor resupplies ADDRESSwith an ordinary PC value. This pause state plays the same role for ADDRESS that an empty inputbuffer plays for the other stages; hence it is entered whenever this stage is reset. That happenseither because of a processor Reset operation (which resets the entire IFU pipe, and is not doneduring normal execution), or because of a Pause signal from DECODE. Correspondingly, a new PCcan be supplied either by a processor PC_ operation, or by a Jump signal from DECODE when it seesa jump instruction. Any of these operations resets the pipe between ADDRESS and DECODE; theprocessor operations reset the later stages also. ADDRESS makes a memory reference if the memory is willing to accept the reference; thiscorresponds to finding space in the buffer between ADDRESS and MEMORY, although theimplementation is quite different because the memory is not physically part of the IFU. In addition,ADDRESS contends with the processor for the memory address bus; since the IFU has lowest priority,it waits until this bus is not being used by the processor. Finally, it is necessary to worry aboutspace for the resulting memory word: the memory, unlike ordinary IFU stages, delivers its resultunconditionally, and hence must not be started unless there is a place to put the result. ADDRESSsurveys the buffering in the rest of the pipe, and waits until there are at least two free bytesguaranteed; it isn't necessary for these bytes to be in the MEMORY output buffer, since data in thatbuffer will advance into later buffers before the memory delivers the data. It is, however, necessaryto make the most pessimistic assumptions about instruction length and processor demands. On thisbasis, there are seven bytes of buffering altogether: four after MEMORY, two after BYTES, and oneafter DECODE.5.2 MEMORY stageThis stage has several peculiarities. Some arise from the fact that most of it is not logically orphysically a part of the IFU, but instead is shared with the processor and I/O system. As we saw inthe previous section, the memory delivers results unconditionally, rather than waiting for bufferspace to be available; ADDRESS allows for this in starting MEMORY. Furthermore, the memory hasconsiderable internal state and cannot be reset, so additional logic is required to discard items whichare inside the memory when the stage is reset.Other problems arise from the fact that the memory's latency is more than one cycle; in fact, itranges from two to about 30 cycles (the latter when there is a cache miss). To maintain fullbandwidth, the IFU must therefore have more than one item in the MEMORY stage at a time; sincel=2 when the cache hits, and this is the normal case, there is provision for up to two items inMEMORY. A basic principle of pipeline stages is that items emerge in the order they are supplied.A stage with fixed latency, or one which holds only one item, does this automatically, but MEMORYhas neither of these properties. Furthermore, its basic function is random access, with no sequentialrelationship between successive references. Hence if one reference misses and the next one hits, thememory is happy to deliver the second result first. To prevent this from happening, the IFUnotifies the memory that it has a reference outstanding when it makes the second one, and thememory rejects the second reference unless the first one is about to complete.оп[0sфGхо+I о;€qопTЩф·р3фёр&опSUфЂопP*фдфеtqр=опN¦ ф·р!фёр4опM"фЂрTопI#tфXwtопEшqфіфґрTопDtф•рDsqф– опBрф‹рKsqфЊопAlфЉrqр=ф‹sоп?иqф›sqфњrqsqоп>dфЇр9ф°р#оп<афёф№ rqр$sqоп;\фњфќ rqsqsоп9Шqфѓр#sqrqsqф„оп8TфрDsqф®sqоп6РфЂр)оп3Ґsqфцр2фчоп2! ффsqsq оп0ќ фЃр8ф‚sqоп/sqфЃр+ф‚sqоп-•ф¦ф§рFоп,фіропBЄфјt qрPопA& ф†ф‡рBоп?ўфРр+фСр/оп> фЃрFф‚оп<љфЊфЌрPоп;фЂр"оп7tфXwtоп3мqфђф‘рLоп2hфЂрPоп/=ф р#фЎsqр/оп-№фЌфЋрCsqоп,5фЂфЃрKоп*±фјр/sqфЅоп)-фЂsqр'оп&ZюдЌп&юдодп&Zю ИЌп&ю Ио¬п&Zю(Ќп&ю(о#Фп&Zю ¦Ќп&ю ¦о-zп&ZюoЌп&юoоп$~tо‘о Иоп#ђюдFодп#ђю ИFо¬п#ђю(Fо#Фп#ђю ¦Fо-zп#ђюoFоп ўrо ¦qо Ир2опrо Vqо Ир+опљrо Vqо Ир-опBr о Vqо Ир7опѕrо Vqо Иоп:rо Vqо Ир9опвrо Vqо Ир8оп^rо Vqо Иф”sqsqsq ф•rqо ИпЪфЂ rqопVrо Vqо ИsqопҐюдпТюдЌодпҐю ИпТю ИЌо¬пҐю(пТю(Ќо#ФпҐю ¦пТю ¦Ќо-zпҐюoпТюoЌоmп§фXр%оп |ф¤р^фҐопшф¤sqфҐопtsqф‹рQфЊопрфЂфЃрFопlф¶р0sqр%опиsqф€р$tqф‰опdфЙрGtq фКО·@HџbиSEC. 5IMPLEMENTATION45problem can be attacked by introducing a sub-stage within DECODE; unfortunately, this delays thereading of the decode table by half a cycle, so that its output is not available together with thealpha byte. To solve the problem it is necessary to provide a second output buffer for BYTES, andto feed back its contents into the main buffer if the instruction turns out to be only one byte long,as in Figure 7c. Some care must be taken to keep the PCs straight. This ugly backwarddependency seems to be an unavoidable consequence of the variable-width items. In fact, a three-byte instruction is not handled exactly as shown in Figure 7. Since the bandwidthof BYTES prevents it from being done in one cycle anyway, space is saved by breaking it into twosub-instructions, each two bytes long; for this purpose a dummy opcode byte is supplied betweenalpha and beta. Each sub-instruction is treated as an instruction item. The second one containsbeta and is slightly special: DECODE ignores its dummy opcode byte and treats it as a two-byteinstruction, and DISPATCH passes it on to EXECUTE after the alpha byte has been delivered.<==роИп+Яю$9оИп7фюЋ$оVп'ыю$о9п&-роп&-ро¬п&-ро«tп*жро¬п*жро«п'ЧюV$оп%Вю$9о«п%Вю$9оИп;Jю Џ$оИп'ђроИп&tроИп%Wро:пю$9о3пю$9о:п,ю&е$о/ЙpпЃро(пЃроИпЃроИпћроп ,роИпЧюО$о дпAюдGоИпюGrо дп Пю +Gо дп ПюG№о¬пЃроtпро<‘п+»ю$]оBп'ыю$о3п7фю$о.pп4џро3п1ыю$9о3п6mю$Зо.п6Jю$о9;п44ю$9о.п4ю@$о.п4ю$]о8п+»юд$оAп*Вю$о8п*џю $о8п*џю$@о$¬п*џю$@о$¬п*џю $о-ђп*Вю$о$¬п+»юд$о+еtп1ро'spп.‚ро$¬п-СюG№о$¬п-Сю +Gо-ђп.юGrо$¬п2BюдGо)п'ыю$Зо<‘п'ыю$Зо-ђп1ШюІ$о%Йп0-ро3п;Jю$]о-п>ро)п+Яю$9о3п7фю r$о#ђп&-ро:жп&-роAп&-ро(tп*жро;tп*жро#п'Чю!V$оDXп%Вю$9о#п%Вю$9о#п6Шю$о%:п6Шю$о'sп6Шю$о)¬п6Шю$о+еп6Шю$о.п6Шю$о:Xп6Шю$о8п6Шю$о5жп6Шю$о3п6Шю$о1tп6Шю$о/;п6Шю$о#п,Шю$о%:п,Шю$о'sп,Шю$о)¬п,Шю$о+еп,Шю$о.п,Шю$о:Xп,Шю$о8п,Шю$о5жп,Шю$о3п,Шю$о1tп,Шю$о/;п,Шю$о#п4Гю$о#п2Љю$о#п0Qю$о#п.ю$о;tп.ю$о;tп0Qю$о;tп2Љю$о;tп4Гю$о0жqп.‚р о Vpп9џро.п9џро.п;Jю Џ$о8п94ю$9о.п9ю І$о.п9ю$]оеtп8Кроеп9жроVп94ю$9оеп;ро9;п;ро8п94ю$9о9;п9жро9;п8Кро8п94ю$9ођпђропБю$9ођпsрођпWроpп,роsпЧю Џ$опБю$9оsпћю І$оsпћю$]о3tперо3пИро3п¬роVпOю$Зоsпъю$Зо$¬пћю$]о$¬пћю І$о/;пБю$9о$¬пЧю Џ$о/;пБю$9о&Wpп,роsпҐю$ЋоЏпю$]о¬пюЋ$оVп3ю$9оИxп роп роsпъю$«оVпю$$о)¬пOю$ rо)¬пъю$оVпуюЏ$о0жпOю$Ио#ђпЃю@$о#ђпЃю$лоЏпIю $$о$pп!»р,оепр-о:xп$'ро:п$'ро:п Dро:п DроИп Dро(ђп Dро<п DроA‘п Dро<п$'ро(ђп$'ро:п3¶ро2ђп3¶роепCроИпро)пCро)про0Wпро2ђп.¶ро-ђп)¶ро:п)¶роИпµро:пую'$о#п%ћю!z$о«п%ћюz$оpп!»р*я”ъFЊD{?QAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER46DECODE replaces the dispatch address from the table with an exception address if necessary. Inorder to obey the rule that exceptions must all be captured in the dispatch address, the exceptionvalues of all the instruction bytes are merged into its computation. For three-byte instructions, thisrequires looking back into BYTES for the state of the beta byte. If any of the bytes is empty,DECODE keeps the partial instruction item when it delivers an empty item with a NotReady dispatchinto its output buffer. If a Reschedule is pending, it is treated like any other exception, byconverting the dispatch address of the next instruction item into Reschedule. Thus there is always ameaningful PC associated with the exception.If the Jump field is set, DECODE computes a new program counter by adding an offset to the PC ofthe instruction. This offset comes from the alpha byte if there is one, otherwise from N andSplitAlpha; it is sign-extended if Sign is true. The new PC is sent back to ADDRESS, as described in ¶5.1, where Pause is also explained. Jump instructions in which the displacement is not encoded inthis way cannot be executed by the IFU, but must be handled by the processor.5.5 DISPATCH stageThe interesting work of this stage is done by the processor, which takes the dispatch address,together with the state initialization discussed in ¶ 4.2, from the DECODE output buffer when itexecutes an IFUJump. Because empty is encoded into a NotReady dispatch, the processor takes noaccount of whether the buffer is empty. There are some ugly cases, however, in which DECODE isunable to encode an exception quickly enough. In these cases DISPATCH asserts a signal called Holdwhich causes the processor to skip an instruction cycle; this mechanism is rather expensive toimplement, and is present only because it was essential for synchronization between the processorand the memory [1]. Once implemented, however, it is quite cheap for the IFU to use. TheNotReady dispatch is still preferable, because it gives the microcode an opportunity to do someuseful work while waiting.5.6. EXECUTE stageThis stage implements the IFUData function; as we have already seen, it is logically part of theprocessor. The sequence of data items delivered in response to IFUData is controlled by Jump,Length, N, and SplitAlpha according to Table 3; in addition, alpha is sign-extended if Sign is true.EXECUTE also provides the processor with the value of the PC in response to a different function.JumpLengthNSplitAlphaIFUDataYesќќќLength, . . .No1NoќLength, . . .No1YesќN, Length, . . .No2NoNoalpha, Length, . . .No2NoYesalphaHigh, alphaLow, Length, . . .No2YesNoN, alpha, Length, . . .No2YesYesN, alphaHigh, alphaLow, Length, . . .No3NoNoalpha, beta, Length, . . .No3NoYesalphaHigh, alphaLow, beta, Length, . . .No3YesNoN, alpha, beta, Length, . . .No3YesYesN, alphaHigh, alphaLow, beta, Length, . . .Table 3: Data items provided to IFUDataяодпYфsфGхр@оqопSќsqф°ф±рCопRфћфџрUопP•ф‹фЊрDопOфЖsqфЗр(rqопMЌsqф ф™р.rqrqопL фгr qфдр7опJ… фђр2ф‘r qопI фЂsqопEЦфђrqф‘sqр;sqопDRфЙр%rqфКр,rqопBОr qф‚rqsqфѓsqопAJф—rqр)фр)оп?ЖфЂsqр'оп;ЗtфXwtоп8њqфОфПр<оп7фєрHџ^к]SEC. 6PERFORMANCE476. PerformanceThe value of an instruction fetch unit depends on the fraction of total emulation time that it saves(over doing instruction fetching entirely in microcode). This in turn clearly depends on the amountof time spent in executng each instruction. For a language like Smalltalk-76 [5], a typicalinstruction requires 30-40 cycles for emulation, so that the half-dozen cycles saved by the IFU arenot very significant. At the other extreme, an implementation language like Mesa [9, 11] iscompiled into instructions which can often be executed in a single cycle; except for function callsand block transfers, no Mesa instruction requires more than half a dozen cycles. For this reason,we give performance data only for the Mesa emulator.The measurements reported were made on the execution of the Mesa compiler, translating aprogram of moderate size; data from a variety of other programs is very similar. All the operatingsystem functions provided in this single-user system are included. Disk wait time is excluded, sinceit would tend to bias the statistics. Some adjustments to the raw data have been made to removeartifacts caused by compatibility with an old Mesa instruction set. Time spent in the procedure calland return instructions (about 15%) has been excluded; these instructions take about 10 times aslong to execute as ordinary instructions, and hence put very little demand on the IFU.The Dorado has a pair of counters which can record events at any rate up to one per machinecycle. Together with supporting microcode, these counters provide sufficient precision that overflowrequires days of execution. It is possible to count a variety of interesting events; some arepermanently connected, and others can be accessed through a set of multiplexors which provideaccess to several thousand signals in the machine, independently of normal microprogram execution.6.1 Performance limitsThe maximum performance that the IFU can deliver is limited by certain aspects of itsimplementation; these limitations are intrinsic, and do not depend on the microcode of theemulator or on the program being executed. The consequences of a particular limitation, of course,depend on how frequently it is encountered in actual execution.Latency: after the microcode supplies the IFU with a new PC value, an IFUJump will go to NotReadyuntil the fifth following cycle (in a few cases, until the sixth cycle). Thus there are at least fivecycles of latency before the first microinstruction of the new instruction can be executed. Ofcourse, it may be possible to do useful work in these cycles. This latency is quite important, sinceevery instruction for which the IFU cannot compute the next PC will pay it; these are wronglyguessed conditional branches, indexed branches, subroutine calls and returns, and a few others ofnegligible importance.A branch correctly executed by the IFU causes a three-cycle gap in the pipeline. Hence if theprocessor spends one cycle executing it and each of its two predecessors, it will see three NotReadycycles on the next IFUJump. Additional time spent in any of these three instructions, however, willreduce this latency, so it is much less important than the other.Bandwidth: In addition to these minimum latencies, the IFU is also limited in its maximumthroughput by memory bandwidth and its limited buffering. A stream of one-byte instructions canbe handled at one per cycle, even with some processor references to memory. A stream of two-byteinstructions, however (which would consume all the memory bandwidth if handled at full speed),results in 33% NotReady even if the processor makes no memory references. The reason is that theIFU cannot make a reference in every cycle, because its buffering is insufficient to absorbirregularity in the processor's demand for instructions. As we shall see, these limitations are ofsmall practical importance.опWwsфGхо- о;€qопQ uфXопMхqф“р#ф”р>опLqф…р$ф†р;опJнфтрKфуопIi фЈрCф¤ sqопGефарYопFaфџф рVопDЭфџр3ф р,опCYфЂр2оп@.фбрUоп>ЄфђрIф‘оп=&ф€ф‰рLоп;ўф™р$фљр:оп:ф† ф‡рRоп8љф®р$фЇр9оп7фЂрNsqоп3лфґрLфµоп2gфЂфЃрPоп0гфбфврNоп/_ ф·рRоп-ЫфЂр\оп)ЬtфXоп&±qффsqр1оп%-фср,фтоп#©ф†рSф‡оп"%фЂр9опъtqфЏфђ sqsq srqrопvqф«ф¬р]оптфМфНрVопnф’р%ф“р9опкфВsqsqфГопfф¦рLф§ опв фЂоп·фЅр"sqр+фѕоп3ф—р@фrопЇqф‘srqр(ф’р"оп+фЂр;опtqфлфмsqоп | фЌр/фЋр'опшфЃр3ф‚р,опtфҐрIф¦опрф‘rqр2ф’опlsqфэ фюрJопиф№фєрDопdфЂяV·»Hџ\mЌAN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER486.2 NotReady dispatchesOur measurements show that the average instruction takes 3.1 cycles to execute (including all IFUdelays). Jumps are 26% of all instructions, and incorrectly predicted jumps (40% of all conditionaljumps) are 10%. The average non-jump instruction takes 2.5 cycles.The performance of the IFU must be judged primarily on the frequency with which it fails to satisfythe processor's demand for an instruction, i.e., the frequency of NotReady dispatches. It isinstructive to separate these by their causes: latency, cache misses by the IFU, dearth of memory bandwidth, insufficient buffering in the IFU. The first dominates with 16% of all cycles, which is not surprising in view of the large number ofincorrectly predicted jumps. Note that since these NotReady cycles are predictable, unlike all theothers, they can be used to do any background tasks which may be around.Although the IFU's hit rate is 99.7%, the 25 cycle cost of a miss means that 2.5% of all cycles areNotReady dispatches from this cause. This is computed as follows: one cycle in three is a dispatch,and .3% of these must wait for a miss to complete. The average wait is near the maximum,unfortunately, since most misses are caused by resetting the IFU's PC, This yields 33% of .3%, or .1%,times 25, or 2.5%.The other causes of NotReady account for only 1%. This is also predictable, since more than halfthe instructions are one byte, and the average instruction makes only one memory reference in threecycles. Thus the average memory bandwidth available to the IFU is two words, or threeinstructions, per instruction processed, or about three times what is needed. Furthermore, straight-line instructions are demanded at less than half the peak rate on the average, and jumps are sofrequent that when the first instruction after a jump is dispatched, the pipe usually contains half theinstructions that will be executed before the next jump. 6.3 Memory bandwidthAs we have seen, there is no shortage of memory bandwidth, in spite of the narrow data pathbetween the processor and the IFU. Measurements show that the processor obtains a word from thememory in 16% of the cycles, and the IFU obtains a word in 32% of the cycles. Thus data issupplied by the memory in about half the cycles. The processor actually shuts out the IFU bymaking its own reference about 20% of the time, since some of its references are rejected by thememory and must be retried. The IFU makes a reference for each word transferred, and makesunsuccessful references during its misses, for a total of 35%. There is no memory reference about45% of the time.6.4 ForwardingThe forwarding trick saves a cycle in about 25% of the straight-line instructions, and hence speedsup straight-line execution by 8%. Jumps take longer and benefit less, so the speed-up within aprocedure is 5%. Like the IFU itself, forwarding pays off only when instructions are executed veryquickly, since it can save at most one cycle per instruction.одпTЫsфGхр@оqопN„tфXопKYqфЈр[sопIХqф•р4ф–р(опHQфЂр=опE&фѓsqф„р0опCўфчр?rqопB фЂр$оцп?Жоцп=nsqоцп;оцп8ѕsqоп6fфљр.ф›р1оп4в ф ф®rqр'оп3^фЂрAоп03ф sqрDфЎоп.Їrqф’ф“рLоп-+фЛфМр?оп+§ фЂр/sqsqфЃоп*#фЂоп&шфўrqфЈр9оп%tфЂрWфЃоп#рфр5sqфоп"lф‘ф’рHоп ифІр0фір+опdфѓрVф„опафЂр-опбtфXоп¶qфЅр&фѕр3оп2ф‚фѓsqр?оп®фЅsqр&фѕоп*ф»р@фјsqоп¦ф§фЁр;оп"ф»sqр-фј оп ћфљр<ф›опфЂопtфX опрqфљр,ф›р4опlф¶р&ф·р7опиф“ф” sqрEопdфЂр5яь·WHџYСQSEC. 6PERFORMANCE496.5 SizeA Dorado board can hold 288 standard 16-pin chips. The IFU occupies about 85% of a board;these 240 chips are devoted to the various stages as shown in Table 4. FunctionChips%ADDRESS-BYTES4017DECODE8635DISPATCH2410EXECUTE188Processor interface2711Clocks188Testing2711Table 4: Size of various parts of the IFUIn addition, about 25 chips on another board are part of MEMORY and BYTES. The early stages aremostly devoted to handling several PC values. DECODE is large because of the decoding table (27RAM chips) and its address drivers and data registers, as well as the branch address calculation. Table 5 shows the amount of microcode in the various emulators, and in some functions commonto all of them. In addition, each emulator uses one quarter of the decode table. Of course they arenot all resident at once.SystemWordsCommentsMesa1300Smalltalk1150Lisp1500Alto BCPL700I/O1000Disk, keyboard, regular and color display, Ethernet Floating point 300IEEE standard; there is no special hardware supportBit block transfer270Table 5: Size of various emulatorsAcknowledgementsThe preliminary design of the Dorado IFU was done by Tom Chang, Butler Lampson and ChuckThacker. Final design and checkout were done by Will Crowther and the authors. Ed Fialareviewed the design, did the microassembler and debugger software, and wrote the manual. Theemulators mentioned were written by Peter Deutsch, Willie-Sue Haugeland, Nori Suzuki and EdTaft.опSVsфGхо- о;€qопLяtфXопIФqфіфґр!sqопHPфЂрCопE}юдЌпE%юдодпE}ю ИЌпE%ю Ио¬пE}ю(ЌпE%ю(о#ФпE}ю ¦ЌпE%ю ¦о-zпE}юoЌпE%юoоцпBНtоДоаопAЯюдFодпAЯю ИFо¬пAЯю(Fо#ФпAЯю ¦Fо-zпAЯюoFоцп>сsqsоnqоіоцп=msоnqоіоцп;йsоnqоіоцп:esоnqо cоцп8боnоіоцп7]оnо cоцп5Щоnоіоп5(юдп4UюдЌодп5(ю Ип4Uю ИЌо¬п5(ю(п4Uю(Ќо#Фп5(ю ¦п4Uю ¦Ќо-zп5(юoп4UюoЌожп1*фXр!sоп-яqфЉр6ф‹sqsqоп,{фџsq sqф р+оп*чsqфЂр`оп'Мф›рFфњоп&Hф‚рHфѓоп$ДфЂоп!сюдЌп!™юдодп!сю ИЌп!™ю Ио¬п!сю(Ќп!™ю(о#Фп!сю ¦Ќп!™ю ¦о-zп!сюoЌп!™юoоцпAtо‚оопSюдFодпSю ИFо¬пSю(Fо#ФпSю ¦Fо-zпSюoFоцп9qо‹оцпµо‹оцп1о‹оцпsо;qоцп)sqsо‹qор4оцпҐо;оsqр/оцп!о;опpюдпќюдЌодпpю Ипќю ИЌо¬пpю(пќю(Ќо#Фпpю ¦пќю ¦Ќо-zпpюoпќюoЌопrфXоп џuопtqфЎфўsqр0опрфЙр(фКр)опlфЁрUопиф°ф±рCопdяА·ЬHџXL AN INSTRUCTION FETCH UNIT FOR A HIGH-PERFORMANCE PERSONAL COMPUTER50References1.Clark, D.W. et. al. The memory system of a high performance personal computer. Technical Report CSL-81-1, Xerox PaloAlto Research Center, January 1981. Revised version to appear in IEEE Transactions on Computers.2.Connors, W.D. et. al. The IBM 3033: An inside look. Datamation, May 1979, 198-218.3.Deutsch, L.P. A Lisp machine with very compact programs. Proc 3rd Int. Joint Conf. Artificial Intelligence, Stanford, 1973,687-703.4.Ibbett, R.N. and Capon, P.C. The development of the MU5 computer system. Comm. ACM 21, 1, Jan. 1978, 13-24.5.Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles ofProgramming Languages, Tucson, Jan. 1978, 9-16.6.Intel Corp. MCS-86 User's Manual, Feb. 1979.7.Knuth, D.E. An empirical study of Fortran programs. SoftwareќPractice and Experience 1, 1971, 105-133.8.Lampson, B.W. and Pier, K.A. A processor for a high performance personal computer. Proc 7th Int. Symp. ComputerArchitecture, SigArch/IEEE, La Baule, May 1980, 146-160. Also in Technical Report CSL-81-1, Xerox Palo Alto ResearchCenter, Jan. 1981.9.Mitchell, J.G. et. al. Mesa Language Manual. Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.10.Russell, R.M. The CRAY-1 computer system. Comm. ACM 21, 1, Jan. 1978, 63-72.11.Tanenbaum, A.S. Implications of structured programming for machine architecture. Comm. ACM 21, 3, March 1978, 237-246.12.Teitelman, W. Interlisp Reference Manual. Xerox Palo Alto Research Center, Oct. 1978.13.Thacker, C.P, et. al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek,Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August1979.14.Thornton, J.E. The Control Data 6600, Scott, Foresman & Co., New York, 1970.15.Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units, IBM J. R&D 11, 1, Jan. 1967, 25-33.16.Anderson, D.W. et. al. The System/360 Model 91: Machine philosophy and instruction handling. IBM J. R&D 11, 8, Jan.1967, 8-24.17.Widdoes, L. C. The S-1 project: Developing high performance digital computers. Proc. IEEE Compcon, San Francisco, Feb.1980, 282-291.яодп,†sфGхр@оqоп&/u оп"аsо{wsрOxsо{п!ўрB{swsоп Aо{ wsxsw sопао{р9wр1sо{пўопAо{р4xsws{svsопао{рNw{wо{пўsопAо{{wsопао{р4wsvsопо{рSwо{пAs xsр9xsо{попўо{wswsxsр3опAо{xsws{svsопао{рQws{svsо{п ўопAо{ wsр-оп ао{ wsр"wр*sо{п ўрDxsр/о{пdопо{wsр(опўо{рP{wsvsопAо{ wsрF{swsvsо{п опўо{рPws{swsо{пd яф·9¬Hџ1|The Memory System of a High-Performance Personal Computerby Douglas W. Clark1, Butler W. Lampson, and Kenneth A. PierJanuary 1981ABSTRACTThe memory system of the Dorado, a compact high-performance personal computer, hasvery high I/O bandwidth, a large paged virtual memory, a cache, and heavily pipelinedcontrol; this paper discusses all of these in detail. Relatively low-speed I/O devices transfersingle words to or from the cache; fast devices, such as a color video display, transferdirectly to or from main storage while the processor uses the cache. Virtual addresses areused in the cache and for all I/O transfers. The memory is controlled by a seven-stagepipeline, which can deliver a peak main-storage bandwidth of 530 million bits per second toservice fast I/O devices and cache misses. Interesting problems of synchronization andscheduling in this pipeline are discussed. The paper concludes with some performancemeasurements that show, among other things, that the cache hit rate is over 99 percent.A revised version of this paper will appear in IEEE Transactions on Computers.CR CATEGORIES6.34, 6.21.KEY WORDS AND PHRASESbandwidth, cache, latency, memory, pipeline, scheduling, storage, synchronization, virtualmemory.1. Present address: Digital Equipment Corporation, Tewksbury, Mass. 01876.c Copyright 1981 by Xerox Corporation.XEROXPALO ALTO RESEARCH CENTER3333 Coyote Hill Road / Palo Alto / California 94304опMmpфЂопJр"опC¤qфXпD1rпC¤qр(оп=*оп4Вsоп1—tфВфГрHоп0фЬutut фЭр=оп.Џф“р.ф”ututоп-фМр2фНоп+‡ф™фљрCоп*фАфБututр6оп(ф’ ф“рHоп&ыфМututфНр;оп%w фКр!фЛр*оп#уфЂрKоп Ир/vwtоп‡фЂрSоп;\ф‘ф’р,} z оп9Шф›фњ }zр3оп8TфЄрH{zоп6Рф¦рCф§оп5Lфђр\оп3ИффрMоп2DфНфОр1оп0А ф–рXоп/<фЂрTоп,ф…ф†рWоп*Ќф©р,фЄр.оп) ф‡р[ф€оп'… ф¶р+ф·р&оп&ф†р;ф‡р&оп$}фШр8{z{zоп"щфТ {z{zрEоп!uфЂр:фЃопс фЅ}zфѕр.опmфЂрDопn}фXопCzфЌр#фЋ{z{zопї фљф›{zрOоп;ф”р)ф•{zоп·фЋрBфЏ{z{zоп3 фЊр2фЌр&опЇ фЂрQоп„ф¦р/ф§р2опфЉр&ф‹р7оп | фљрJф›опшфРфСр=опt ф‡рCф€опрфµф¶рNопlфЎр?фўопифЌрYопdфЂр+яє·SHџdХґSEC. 1INTRODUCTION53<==‡1оп;\фђр,{z{zр%ф‘оп9Шф€р=ф‰{z{zоп8Tф фЎрMоп6Рф‡р3ф€р,оп5Lф¬фр=оп3И фБр){z{zр$оп2Dф€р7ф‰р#оп0Афј{z{zфЅрLоп/<фЂрMоп,фФрRфХоп*Ќ}zф›фњр7{z{zр$оп) ф¦ф§рZоп'…ф…ф†рGоп&ф‚рbоп$}фўфЈр?оп"щф‰фЉр/}zоп!uф°рCф±опсфЂрRопЖф–р,ф—р1опBф№рWфєопѕ ф°рJф± оп:фі{zр#фґоп¶фЁф©р?оп2фЊр.фЌр2оп®ф’р*ф“р4оп*ф‡р`оп¦фЅр#фѕ{zр/оп" фЂр0оп#}фXопшzфЦрGфЧопt фЕфЖрOопрф№р6фєопlфќфћр@опифќр-фћр/опdфЂрTяО·SHџdХEоИп¬юд$о$¬п%ю$«оИпю $оИпю$ОоИпею$•оИпею $о$¬п ю$rоИпWюд$о1sп¬юV$о9Йп%ю$«о1sпюy$о1sпю$Оо=пю$Оо=пю y$оKпAю$«о=пИю V$оVrп¬роVпWр о3пsроAђп Џро&WпWю$лопею=‰Іо;tпюk rоЋпроЋп¬ро:пИротпИро NпИроЋпИро«п sю9$одп^ю$9о«п:ю]$о«п:ю$]оп:ю$]оп:ю•$оЏп^ю$9оп sюr$одп sю$оп^ю$9одп:ю@$одп:ю$]одп —юGдоVп —юGдоп —юGдо«пюЋ2о«п —юGдоЋпю:kоп sюЗ$оЗп^ю$9оп:юл$оп:ю$]о9Йп3ю«GоVпр о$¬пю«$о$¬пюИkо3pп:ро3¬пЧро :пРюk9о$¬tпdсоpпђр опер о&епђр о&епер о&ћп¬р о'sп ¬р о',п ро'sп:р о«пИро?WпИряОDvK&{f THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER54Memory references specify a 16 or 28 bit displacement, and one of 32 base registers of 28 bits; thevirtual address is the sum of the displacement and the base. Virtual address translation, or mapping,is implemented by table lookup in a dedicated memory. Main storage is the permanent home ofdata stored by the memory system. The storage is necessarily slow (i.e., it has long latency, whichmeans that it takes a long time to respond to a request), because of its implementation in cheap butslow dynamic MOS RAMs (random access memories). To make up for being slow, storage is big,and it also has high bandwidth, which is more important than latency for sequential references. Inaddition, there is a cache which services non-sequential references with high speed (low latency),but is inferior to main storage in its other parameters. The relative values of these parameters areshown in Table 1.CacheStorageLatency-1151Bandwidth12Capacity1250Table 1: Parameters of the cache relative to storageWith one exception (the IFU), all memory references are initiated by the processor, which thus actsas a multiplexor controlling access to the memory (see ¶ 1.2 and [10]), and is the sole source ofaddresses. Once started, however, a reference proceeds independently of the processor. Each onecarries with it the number of its originating task, which serves to identify the source or sink of anydata transfer associated with the reference. The actual transfer may take place much later, andeach source or sink must be continually ready to deliver or accept data on demand. It is possiblefor a task to have several references outstanding, but order is preserved within each type ofreference, so that the task number plus some careful hardware bookkeeping is sufficient to matchup data with references. Table 2 lists the types of memory references executable by microcode. Figure 2, a picture of thememory system's main data paths, should clarify the sources and destinations of data transferred bythese references (parts of Figure 2 will be explained in more detail later). All references, includingfast I/O references, specify virtual, not real addresses. Although a microinstruction actually specifiesa displacement and a base register which together form the virtual address, for convenience we willsuppress this fact and write, for example, Fetch(a) to mean a fetch from virtual address a.A Fetch from the cache delivers data to a register called FetchReg, from which it can be retrieved atany later time; since FetchReg is task-specific, separate tasks can make their cache referencesindependently. An I/ORead reference delivers a 16-word block of data from storage to theFastOutBus (by way of the error corrector, as shown in Figure 2), tagged with the identity of therequesting task; the associated output device is expected to monitor this bus and grab the datawhen it appears. Similarly, the processor can Store one word of data into the cache, or do anI/OWrite reference which demands a block of data from an input device and sends it to storage (byway of the check-bit generator). There is also a Prefetch reference, which brings a block into thecache. Fetch, Store and Prefetch are called cache references. There are special references to flushdata from the cache and to allow a map entries to be read and written; these will be discussed later.The instruction fetch unit is the only device that can make a reference independently of theprocessor. It uses a single base register, and is treated almost exactly like a processor cache fetch,except that the IFU has its own set of registers for receiving memory data (see [9] for details). Ingeneral we ignore IFU references from now on, since they add little complexity to the memorysystem.яодпY‰zфXх{z{z{z{z{z{z{ z{z{оzопS2ф•р#}zф– } zопQ®фЂрVфЃ}zопP*ф р+фЎ }zопN¦ф–р9ф—р'опM"ф„рRф…опKћфЄ{z{zф«р7опJфЌфЋрXопH–фЇ}zф°р*опGф–р5ф—р-опEЋфЂопB»юдЌпBcюдодпB»ю ИЌпBcю Ио¬пB»ю(ЌпBcю(о#ФпB»ю ¦ЌпBcю ¦о-zпB»юoЌпBcюoоњп@}ооп?юдFодп?ю ИFо¬п?ю(Fо#Фп?ю ¦Fо-zп?юoFоцп ~z~zоgоmр$~zоп;И~z~zоgоmфВр4оmп:DфЂ~оп7мz~zоgоmр1~оп5”z~zоgоm~zр1~оп3<z~zоgоmф‡р7оmп1ёфЂр)оп1юдп04юдЌодп1ю Ип04ю ИЌо¬п1ю(п04ю(Ќо#Фп1ю ¦п04ю ¦Ќо-zп1юoп04юoЌочп- фXр8оп)Юф‹}zр2фЊ}z оп(Zф” }z}zр9оп&Цф†{z{zф‡р!оп%RфЪфЫ}z}zр8оп#О фЇр9ф°оп"Jф”р({z{zф•оп ЖфЂ {z{zопЗ}фXр$опњzфЃр=ф‚опфЂ оцпА}zфЪфЫр0оцп<ф—р6{z{zфоцпёфџр1ф р"оцп4фЂр#{z{zр2оцпЬ}zф‡ф€р4оцп XфѓрSоцпФфЂрDоп |фХр%фЦр2опш фљф›рIопtф‹р[фЊ опрф•р4ф–р*опlф‚рYфѓопиф®р)фЇр*опdфЂрYЪ· ›Hџ]Ќ) THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER56<==WроИп>Wро п,ђро п:»ро HпB№ю$Ђо ЦпB№ю$Ђоп'ђроpп,кро xп'ђро Џп/ю$Ђоdpп,кро xпE‘роЦpпP3роЦxпЦротпшюr$отпю+$отп†ю+$оЦп «роЦпЗротпўю+$отп1ю+$отпкюr$о:rп?@роЦп=Oро+прю$«одxпdро«пяю$опяю$оЦпrротпrро9пяю$оИпяю$о Hpп 0ро Hп[ро(п5юVGо-п †р о)eп/Іро%:п.ро$eп1\юЏ$о3фп-Uю$+о$eп-2юІ$о$eп-2ю$Nо$п;лю$Nо$п;люІ$о3п<ю$+о$п@юЏ$о&Wп=–ро$eп#ю$Nо$eп#юІ$о3фпFю$+о$eпMюЏ$о)фпЈро&ћпшро$пюЏ$о3пяю$+о$пЫюІ$о$пЫю$Nо%Йп†ро+еп2Uю$о+ћxп*Wро,tп2Uю$о,tп(ю$Gо+еп(ю$Gо+ћп рщъяшяо-pп*#рщъшо+ћxпHро+епFю$Gо,tпFю$Gо-pп\ро,tпяю$rо+ћxпро+епяю$rо(пqюVGо.пЌю$№о-Чп Џро)п.;ропЖю$$уопЈю G$о)п Џроп9$юЦ$о)фп69ю$оп9Gю$Ђо Цп8№ю$Uо Џп4Хю$то«п8–ю $о¬пTю$$eо¬п1ю G$о Цп8–юH$о¬п8–ю №$о.п1ю $о8п1ю$3Ро пGЭю+Б$о.пЈю Џ$о9;пЈю$4моЃпHkю,Ю$о.пTю$то.пЌю$9о ЏпMЦю$«оЂпSeюdGоепQєюGтоЂпQєюGто+rпTро!пQєюGто6tпQєюGто!пSeюdGо$упTро Hpп?ро п”ро Hпро пjроЃпFдю$«о)фпЌю$Зо)eпЌю$9о)eп69ю$Ђо пFдю$о№прю$dоєпро)фrп$Nроуpпiсоdп ›юGPn8‡9^UVъSEC. 1INTRODUCTION57All busses are 16 bits wide; blocks of data are transferred to and from storage at the rate of 16 bitsevery half cycle (30 ns). This means that 256 bits can be transferred in 8 cycles or 480 ns., which issomewhat more than the 375 ns cycle time of the RAM chips that implement main storage. Thus ablock size of 256 bits provides a fairly good match between bus and chip bandwidths; it is also acomfortable unit to store in the cache. The narrow busses increase the latency of a storage transfersomewhat, but they have little effect on the bandwidth. A few hundred nanoseconds of latency isof little importance either for sequential I/O transfers or for delivery of data to a properlyfunctioning cache.Various measures are taken to maximize the performance of the cache. Data stored there is notwritten back to main storage until the cache space is needed for some other purpose (the write-backrather than the more common write-through discipline [1, 14]); this make it possible to use memorylocations much like registers in an interpreted instruction set, without incurring the penalty of mainstorage accesses. Virtual rather than real addresses are stored in the cache, so that the speed ofmemory mapping does not affect the speed of cache references. (Translation buffers [15, 20] areanother way to accomplish this.) This would create problems if there were multiple address spaces.Although these problems can be solved, in a single-user environment with a single address spacethey do not even need to be considered.Another important technique for speeding up data manipulation in general, and cache references inparticular, is called bypassing. Bypassing is one of the speed-up techniques used in the CommonData Bus of the IBM 360/91 [19]. Sequences of instructions having the form(1) register _ computation1(2) computation2 involving the registerare very common. Usually the execution of the first instruction takes more than one cycle and ispipelined. As a result, however, the register is not loaded at the end of the first cycle, andtherefore is not ready at the beginning of the second instruction. The idea of bypassing is to avoidwaiting for the register to be loaded, by routing the results of the first computation directly to theinputs of the second one. The effective latency of the cache is thus reduced from two cycles to onein many cases (see ¶ 2.3).The implementation of the Dorado memory reflects a balance among competing demands:for simplicity, so that it can be made to work initially, and maintained when componentsfail; for speed, so that the performance will be well-matched to the rest of the machine;for space, since cost and packaging considerations limit the number of components andedgepins that can be used.None of these demands is absolute, but all have thresholds that are costly to cross. In the Doradowe set a somewhat arbitrary speed requirement for the whole machine, and generally tried to savespace by adding complexity, pushing ever closer to the simplicity threshold. Although many of thecomplications in the memory system are unavoidable consequences of the speed requirements, someof them could have been eliminated by adding hardware.2. The cacheThe memory system is organized into two kinds of building blocks: pipeline stages, which providethe control (their names are in SMALL CAPITALS), and resources, which provide the data paths andmemories. Figure 3 shows the various stages and their arrangement into two pipelines. One,consisting of the ADDRESS and HITDATA stages, handles cache references and is the subject of thissection; the other, containing MAP, WRITETR, STORAGE, READTR1 and READTR2, takes care ofstorage references and is dealt with in ¶ 3 and 4. References start out either in PROC, the processor,or in the IFU.оп\){фGхо,Ло;€zопUТфЉрOф‹опTNф„рPф…опRКфЌр({zфЋр%опQFфћрRфџ опOВ ф‹рQфЊопN>ф–ф—р;опLєфпр){z{zр'фропK6 фЂопHф«рCф¬опF‡ф†р>ф‡} опEzф‹}zфЊр$опCф‹рUфЊопAыфЇр<ф°оп@wфЁрMф©оп>уф‰|фЉzрBоп=oф«рBф¬оп;лфЂр#оп8АфЉрWф‹оп7< фЈф¤}zрAоп5ёфЂ{zр8оцп3`оцп1Ьр'оп/„фќрWфћоп. фУр'фФр.оп,|фЉф‹р>оп*шф—фрHоп)tф„ ф…рPоп'рфЂоп$ЕрSоцп"mф¤р6фҐоцп йфЂоцп‘рSоцп9фБфВр2оцпµфЂопЉф‘рTф’ опфр3ф™р+оп‚ф‹фЊр)|zопюф„р%ф…р-опzфЂр4оп§|фX оп |zф™рH}zопш{ z}zфљопtфИр'фЙр,опр ф›фњ{z{zр<опlфафб{z{z{z{z{zопифЃф‚р4{zопdфЂ{zяж· Hџa“ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER58<==оп3ф{zр"ф®р3опЇфЂрQоп„фќфћрVопф‹ фЊрSоп |фЗ}z }z фИ}zр-опшф†р;ф‡р#опt ф¤фҐ }zопрф„рOф…опlфћрCфџопиф’р,ф“ }zопdфЂр& в·љHџdЋо9п\ю$оUxпероперо9п!Ою$оUпропропsроUпsро9п#ю$оп:роUп:ро9п±ю$оЗпю$о дпЏро ЋпЏро дпИро ЋпИроrп Hро Ип Hро«пшю$оrпро Ипро«п †ю$одпю$опЏро«пЏродпyю$опИро«пИро*:п †ю$о+Vпро(про*:пшю$о+Vп Hро(п Hро2Џп Hро5еп Hро4Ипшю$о2Џпро5епро4Ип †ю$о>Йп †ю$о?епро<ђпро>Йпшю$о?еп Hро<ђп Hроrпќро Ипќро«пMю$оrпdро Ипdро«пЫю$оtпNроЄпЈропшро9пшропxро(Џпxро3пxро=пxропНро#пїю$о-Џпїю$о8пїюr$оVпїю$оЋп•ю9$оЗпcю$UоЗп@юЗ$оЋпкю9$оЗпкю$yогxпроп¬роИпќро&Vпќро0епќро:епќроVп ю$оИптро Urп%Іро rп%юVGо rп#гюGdо+VпNроB¬пЖюGодпњю%ИGодпюGdоИп$*юGоVпХю$Ћоп±ю9$оVпХю$оVп 8ю$ ќоЗtпроп?ропр оп [р опUю$лоп!юr$о ЋxпБро«tп"Јро«п!‡роЗпyю$оЋпЗю$yоп#р оЏpп±р%опОюr$о!дпсоtп xр я вG9ўBф&тьSEC. 2THE CACHE59<==п$•ю$•о Ѓп±ю$ло ¤п%шю@$о"sп±ю$ло пЌюдGотп сюGдо п ©ю +Gо п ©юG +о&ќп(гюsGо?п$ёюGrо&ќп$qюєGо&ќп$qюG№оЦп?юЗGо&ќп:тюGrоЦп:«юGоЦп:«юG№оќп#ю$Ћоќп"кюм$о&Бп&@юO$о&ќп%jюs$о2Itп)Nро7Чп)Nро,єп)Nро',п)Nро<п%Ћю$ІомпHро:Вп€ро>мп€ро;»п'8юU$о;»п$ЬюU$о#Цп«ро?п«ро#ЦпVро?пVро)упю$о*Ѓп°ю$о'єпFюGrоBeпFюХrо(Џtпро6 про,,пФю$Uо0ћпФю$Uо5WпФю$Uо-Hпро1єпро:Wп Нро?про>Ґп bро9‚пFю$rо:пFю$rо=фпю$Іо=фпFю$rо^п%Ћро мrп=№ро«п;п)Nроєxп#eроGпDdюrGотп69юќGо+п)ю«Gодп3+ю$оЂп©ю$ЂоЗtпїро ќпїрщъяшяоЦpп"р"щъшо&ќпCkют$о-ЏпAдю$«о&ќпAАю$о&ќпAАю$Оо'єп=ЭюЗ$о',tпBOро",xп*еро&ќп4Gю$9оЦп4Gю$9о"єп3+ю$щъяшяоЋrп4ІроЋп)роХпропропрщъшо#Цpпwсо0еtп bр о5еп bро*Ѓп=Эю$оЦп,9роєxп$^рцUiH‚FќЦ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER60miss: the address is not present in the cache. During normal operation, it is not possible for morethan one column to match. The entire matching process can be seen in Figure 4, between 60 and90 ns after the start of the reference. The cache address latched at 90 contains the row, word andcolumn; these 14 bits address a single word in CacheD. Of course, only the top 16 key bits of theaddress need be matched, since the row bits are used to select the row, and all the words of a blockare present or absent together.Four flag bits are stored with each cache entry to keep track of its status. We defer discussion ofthese flags until ¶ 4.2.2 Cache dataThe CacheD resource stores the data for the blocks whose addresses appear in CacheA; closelyassociated with it are the StoreReg and task-specific FetchReg registers which allow the processor todeliver and retrieve its data independently of the memory system's detailed timing. CacheD is quitesimple, and would consist of nothing but a 16K by 16 bit memory were it not for the bandwidth ofthe storage. To keep up with storage the cache must be able to accept a word every half cycle (30ns.). Since its memory chips cannot cycle this fast, CacheD is organized in two banks which run ahalf-cycle out of phase when transferring data to or from the storage. On a hit, however, bothbanks are cycled together and CacheD behaves like an 8K by 32 bit memory. A multiplexor selectsthe proper half to deliver into FetchReg. All this is shown in Figure 4.Figure 4 does not, however, show how FetchReg is made task-specific. In fact, there is a 16-wordmemory FetchRegRAM in addition to the register shown. The register holds the data value for thecurrently executing task. When a Fetch reference completes, the word from CacheD is alwaysloaded into the RAM entry for the task that made the reference; it is also loaded into FetchReg ifthat task is the one currently running. Whenever the processor switches tasks, the FetchRegRAMentry for the new task is read out and loaded into FetchReg. Matters are further complicated bythe bypassing scheme described in the next subsection.StoreReg is not task-specific. The reason for this choice and the problem it causes are explained in¶ 5.1.2.3 Cache pipeliningFrom the beginning of a cache reference, it takes two and a half cycles before the data is ready inFetchReg, even if it hits and there are no delays. However, because of the latches in the pipeline(some of which are omitted from Figure 4), a new reference can be started every cycle, and if thereare no misses the pipeline will never clog up, but will continue to deliver a word every 60 ns. Thisworks because nothing in later stages of the pipeline affects anything that happens in an earlierstage.The exception to this principle is delivery of data to the processor itself. When the processor usesdata that has been fetched, it depends on the later stages of the pipeline. In general thisdependency is unavoidable, but in the case of the cache the bypassing technique described in ¶ 1.4is used to reduce the latency. A cache reference logically delivers its data to the FetchReg registerat the end of the cycle following the reference cycle (actually halfway through the second cycle, at150 in Figure 4). Often the data is then sent to a register in the processor, with a (microcode)sequence such as(1)Fetch(address)(2)register _ FetchReg(3)computation involving register. The register is not actually loaded until cycle (3); hence the data, which is ready in the middle ofcycle (3), arrives in time, and instruction (2) does not have to wait. The data is supplied to thecomputation in cycle (3) by bypassing. The effective latency of the cache is thus only one cycle inthis situation.одп_RzфXх{z{z{z{z{z{z{ z{z{оzопXыфђф‘рLопWwф”рKф•опUуф”р%ф•р<опToфЌ фЋрMопRлфѓф„рBопQgфЂопN<фрHф™опLёфЂопH№}фX опEЋzфјфЅр9опD ф„р4ф…р'опB†фЂрWфЃопAф…ф†рDоп?~фЋр9фЏр&оп=ърLфђоп{оп.ЇzфќфћрBоп-+фЂр3оп*ф€р2ф‰р+оп(|фЂоп$}}фXоп!Rzф“р\ф”опОф– ф—рOопJф‡рUф€опЖф‡р@ф€р"опBф¶р8ф·р$опѕоп“ф”рOф•опфкфлрAоп‹ фЋр-фЏр+опфђф‘р_опѓф–рQф—опяф±фІрOоп {фЂоуп#оц~zоуп џоцоупоцр!опрфљрAф›опlфЄф«рDопи фЏр9фђопdфЂ м·аHџdH’SEC. 2THE CACHE61Unfortunately this sleight-of-hand does not always work. The sequence(1)Fetch(address)(2)computation involving FetchReg actually needs the data during cycle (2), which will therefore have to wait for one cycle (see ¶ 5.1).Data retrieved in cycle (1) would be the old value of FetchReg; this allows a sequence of fetches(1)Fetch(address1)(2)register1 _ FetchReg, Fetch(address2)(3)register2 _ FetchReg, Fetch(address3)(4)register3 _ FetchReg, Fetch(address4). . .to proceed at full speed.3. The storage pipelineCache misses and fast I/O references use the storage portion of the pipeline, shown in Figure 3. Inthis section we first describe the operation of the individual pipeline stages, then explain how fastI/O references use them, and finally discuss how memory faults are handled. Using I/O referencesto expose the workings of the pipeline allows us to postpone until ¶ 4 a close examination of themore complicated references involving both cache and storage.3.1 Pipeline stagesEach of the pipeline stages is implemented by a simple finite-state automaton that can change stateon every microinstruction cycle. Resources used by a stage are controlled by signals that itsautomaton produces. Each stage owns some resources, and some stages share resources with others.Control is passed from one stage to the next when the first produces a start signal for the second;this signal forces the second automaton into its initial state. Necessary information about thereference type is also passed along when one stage starts another.3.1.1 The ADDRESS stageAs we saw in ¶ 2, the ADDRESS stage computes a reference's virtual address and looks it up inCacheA. If it hits, and is not I/ORead or I/OWrite, control is passed to HITDATA. Otherwise, controlis passed to MAP, starting a storage cycle. In the simplest case a reference spends just onemicroinstruction cycle in ADDRESS, but it can be delayed for various reasons discussed in ¶ 5.3.1.2 The MAP stageThe MAP stage translates a virtual address into a real address by looking it up in a hardware tablecalled the MapRAM, and then starts the STORAGE stage. Figure 5 illustrates the straightforwardconversion of a virtual page number into a real page number. The low-order bits are not mapped;they point to a single word on the page.Three flag bits are stored in MapRAM for each virtual page:ref, set automatically by any reference to the page;dirty, set automatically by any write into the page;writeProtect, set by memory-management software (using the MapWrite reference). A virtual page not in use is marked as vacant by setting both writeProtect and dirty, an otherwisenonsensical combination. A reference is aborted by the hardware if it touches a vacant page,attempts to write a write-protected page, or causes a parity error in the MapRAM. All three kindsof map fault are passed down the pipeline to READTR2 for reporting; see ¶ 3.1.5.оп^‚{фGхо/^о;€zопX+фЂр9оупUУоц~zоупTOоцопQчфЏрVфђопPsфЂр]оупNоц~z оупL—оц~z оупKоц~z оупIЏоц~z оцпHопEіоп@а|фXоп=µzф€{z{zф‰р8оп<1фћр\фџоп:{z{zф“р;ф”|z{z{z оп9)фџр1ф р.оп7ҐфЂр9оп3¦}фXоп0{zфЌфЋрHоп.чфШр9фЩр#оп-sфЉ}zр!}zф‹оп+пф–ф—р$}zоп*kфСр<фТоп(зфЂр9оп$и}фX}оп!Ѕzф¶{zф·р#оп 9фЃ{z{~zф‚{z{~z{}zопµфЯ {zр,фар!оп1фЂ {zр=оп2}фX}опzф{zрHф™опѓфє}{z{zф»опя фЊр*фЌр,оп{фЂр$опPр!{zоцпш~zр1оцп ~zр/оцпH~zр/~z опрфр"ф®}z~z~z опl фЛ|zфМрBопиф™рE{zфљопdфЂ}zр!{zяl·°Hџcx~ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER62<==Кпlю$9о:Xпlю$9о<‘пlю$9о8пlю$9оC;пlю$9оAпlю$9оEtпlю$9о5жп 3ю$9о1tп 3ю$9о3п 3ю$9о(ђп 3ю$9о-п 3ю$9о*Йп 3ю$9о/;п 3ю$9о&Wп ю$]о&Wп юл$о8п 3ю$9о&Wп"HюИ$оИп%юИ$о(ђп"ъю$9оИп"Чюл$оИп"Чю$]о¬п"ъю$9о:п"ъю$9оsп"ъю$9оп"ъю$9о$п"ъю$9о!еп"ъю$9о&Wп"ъю$9о:п%Бю$9оИп%Бю$9оп%Бю$9о дп%Бю$9оVп%Бю$9оп%Бю$9оЏп%Бю$9о«п%ћю$]о«п%ћюл$оsп%Бю$9о«п'ЧюИ$о sп(eю$]о sп(eю]$о«п(‰ю$9о sп*ћю9$о:п-eю9$о sп+Pю$9о:п+,ю]$о:п+,ю$]оЏп+»роЏп&,роЏп#eропмюGGо xп _роп _ро¬пЦюy$оЏп&роИпЦюЗ$о«п _роп _ро#ђп _ропЦюІ$оеп&роепЦю$оИп _ро$¬п _ро'sпро8про(ђпHюО$о/;пґро0WпHюЗ$оЏpпќро?жxп _ро8пЦю]$о7ђп&ро0WпЦю9$о/;п _роsпќюЋ$опю$«оsптюІ$оsптю$Ооsптю$Ооsптюл$о :пю$«оsпќюЗ$о#пќюr$о'sпю$«о#птю•$о#птю$Оо.п тюЋ$о4;пАю$Uо.пќюІ$о.пќю$yо=пќю«$оDXпю$«о=птюО$о=птю$Оодптю$Оодптю]$опю$«одпќю9$оп HюЗ$оп Hюr$о :п HюЗ$оИп&роИп&ро!Wп&ро-п&ро<п&ро spп Цро sп+ро]п&,ро–п&,роПп&,роп&,роAп&,роzп&,роіп&,ромп&,ро' п#eро$Рп#eро"—п#eро ^п#eро%п#eромп#eроіп#eроzп#eро' п ћро)Bп ћро+{п ћро-ґп ћро/нп ћро2&п ћро4_п ћро6п ћроF&пЧроCнпЧроAґпЧро?{пЧро=BпЧро; пЧро8РпЧро6пЧроЏп(уроЏп ћроЏпЧро Vпроsп€роЏпр оИп€ро#про!еп€ро*;пуро0Wпуро1tпро5Wп€р о5Wпќр олп+»ро $п(уроЏп ЦроeпроЏп ЦроЏп+ро#ђп dро#ђп№ро(пќро/;п№ро>Кп dро>Кп№ро>Кпро=xп ро9;pпуро%;пOю$до=пБю$Зо¬пOю$до ЏпOю$Зопроп€ро:п€роЏпроsп€ро'sп Hю9$о4;п Hюд$о$¬пdсоИпрДDЊGС-‰ЊSEC. 3THE STORAGE PIPELINE65and completes the StorageRAM cycle (¶ 3.1.3). READTR1 and READTR2 transport the data, controlthe error corrector, and deliver the data to FastOutBus (¶ 3.1.5). Fault reporting, if necessary, isdone by READTR2 as soon as the condition of the last quadword in the block is known (¶ 3.3).It is clear from Figure 7 that an I/ORead can be started every eight machine cycles, since this is thelongest period of activity of any stage. This would result in 530 million bits per second ofbandwidth, the maximum supportable by the memory system. The inner loop of a fast I/O task canbe written in two microinstructions, so if a new I/ORead is launched every eight cycles, one-fourthof the processor capacity will be used. Because ADDRESS is used for only one cycle per I/ORead,other tasksќnotably the emulatorќmay continue to hit in the cache when the I/O task is notrunning.I/OWrite(x) writes into virtual location x a block of data delivered by a fast input device, togetherwith appropriate Hamming code check bits. The data always goes to storage, never to the cache,but if address x happens to hit in the cache, the entry is invalidated by setting a flag (¶ 4). Figure8 shows that an I/OWrite proceeds through the pipeline very much like an I/ORead. The difference,of course, is that the WRITETR stage runs, and the READTR1 and READTR2 stages, although they run,do not transport data. Note that the write transport, from FastInBus to WriteBus, proceeds inparallel with mapping. Once the block has been loaded into WriteReg, STORAGE issues a writesignal to StorageRAM. All that remains is to run READTR1 and READTR2, as explained above. If amap fault occurs during address translation, the write signal is blocked and the fault is passed alongto be reported by READTR2.<==_п%ро@п%роBСп%роE п%роGBп%роI{п%ро9нп'Чро7ґп'Чро5{п'Чро3Bп'Чро1 п'Чро.Рп'Чро,—п'Чро*^п'ЧроПп*ћроп*ћроAп*ћро!zп*ћро#іп*ћро%мп*ћро(%п*ћро*^п*ћроAп-eроп-eроПп-eро–п-eролп-eроІп-eро zп-eроAп-eроп*ћроп-eроп5»роп5-ю$]оVп5Pю$9оп7fю9$оVп4џю9$о Џп2‰ю$9оVп2fю]$оVп2fю$]о Џп/ю:$о Ип,ыю$9о Џп,Чю]$о Џп,Чю$]оеп,ыю$9оп,ыю$9о:п,ыю$9оИп,ыю$9оVп,ыю$9оп,ыю$9оЏп,ыю$9о)¬п*4ю$9о%:п*4ю$9о'sп*4ю$9оVп*4ю$9о Ип*4ю$9оЏп*4ю$9о#п*4ю$9оп*ю$]оп*юл$о+еп*4ю$9оп,IюИ$о)¬п)‚юИ$о;tп'lю$9о)¬п'Iюл$о)¬п'Iю$]о2ђп'lю$9о.п'lю$9о0Wп'lю$9о+еп'lю$9о7п'lю$9о4Йп'lю$9о9;п'lю$9оHКп$Ґю$9оDXп$Ґю$9оF‘п$Ґю$9о;tп$Ґю$9о?жп$Ґю$9о=п$Ґю$9оBп$Ґю$9о9;п$‚ю$]о9;п$‚юл$оKп$Ґю$9о9;п&»юИ$о«п5Яродп3роп0-роЏп/Вю$9оп/Вю$9оVп/Вю$9оИп/Вю$9о:п/Вю$9оп/Вю$9оеп/Вю$9о Џп/ћю$]о Џп/ћю–$о#п/Вю$9о Џп1Чюs$оAп0-ро zп0-роІп0-ролп0-роп0-ро:п0-роsп0-ро¬п0-роsп,ыю$9о¬п,ыю$9оHп-роЃп-ро Ип/Вю$9о¬п/Вю$9оsп/Вю$9оеп0-роп0-ро мп0-ро@txпCро=pп!,роAп€ю$9опуроепHю$@оепHю•$о!Wпlю$оепeюr$оsпќроrпќроrп Hро Ипzроуп HроHпер оeпHродпєю Ц$о!Wпќю«$о#п Oю$rоп5-ю]$опро%:пdся<сK‘7‰з THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER66Figure 8 shows a delay in the MAP stage's handling of I/OWrite. MAP remains in state 3 for twoextra cycles, which are labelled with asterisks, rather than state numbers, in Figure 8. This delayallows the write transport to finish before the write signal is issued to StorageRAM. Thissynchronization and others are detailed in ¶ 5.Because WRITETR takes eleven cycles to run, I/OWrites can only run at the rate of one every elevencycles, yielding a maximum bandwidth for fast input devices of 390 million bits per second. At thatrate, two of every eleven cycles would go to the I/O task's inner loop, consuming 18 percent of theprocessor capacity. But again, other tasks could hit in the cache in the remaining nine cycles.3.3 History and fault reportingThere are two kinds of memory system faults: map and storage. A map fault is a MapRAM parityerror, a reference to a page marked vacant, or a write operation to a write-protected page. Astorage fault is either a single or a double error (within a quadword) detected during a read. Inwhat follows we do not always distinguish between the two types.Consider how a page fault might be handled. MAP has read the MapRAM entry for a reference andfound the virtual page marked vacant. At this point there may be another reference in ADDRESSwaiting for MAP, and one more in the processor waiting for ADDRESS. An earlier reference may bein READTR1, perhaps about to cause a storage fault. The processor is probably several instructionsbeyond the one that issued the faulting reference, perhaps in another task. What to do? It wouldbe quite cumbersome at this point to halt the memory system, deal with the fault, and restart thememory system in such a way that the fault was transparent to the interrupted tasks. Instead, theDorado allows the reference to complete, while blunting any destructive consequences it mighthave. A page fault, for example, forces the cache's vacant flag to be set when the read transport isdone. At the very end of the pipeline READTR2 wakes up the Dorado's highest-priority microtask,the fault task, which must deal appropriately with the fault, perhaps with the help of memory-management software.Because the fault may be reported well after it happened, a record of the reference must be keptwhich is complete enough that the fault task can sort out what has happened. Furthermore,because later references in the pipeline may cause additional faults, this record must be able toencompass several faulting references. The necessary information associated with each reference,about 80 bits, is recorded in a 16-element memory called History. Table 3 gives the contents ofHistory and shows which stage is responsible for writing each part. History is managed as a ringbuffer and is addressed by a 4-bit Storage Reference Number or SRN, which is passed along withthe reference through the various pipeline stages. When a reference is passed to the MAP stage, acounter containing the next available SRN is incremented. A hit writes the address portion ofHistory (useful for diagnostic purposes; see below), without incrementing the SRN counter.EntryWritten byVirtual address, reference type, task number, cache columnADDRESSReal page number, MapRAM flags, map faultMAPStorage fault, bit corrected (for single errors)READTR2Table 3: Contents of the History memoryTwo hardware registers accessible to the processor help the fault task interpret History: FaultCountis incremented every time a fault occurs; FirstFault holds the SRN of the first faulting reference.The fault task is awakened whenever FaultCount is non-zero; it can read both registers and clearFaultCount in a single atomic operation. It then handles FaultCount faults, reading successiveelements of History starting with History[FirstFault], and then yields control of the processor to theодп]«zфXх{z{z{z{z{z{z{ z{z{оzопWTф¦{z{z{~z{z ф§опUРфЈр_опTLфэрK{zфюопRИфЂопOќфЌ{zфЋ{z{~zр.опNфЂр7фЃр&опL•фЋр,{z{zр+фЏопKфЂрWопG}фXопCзzф–р(}z}zф—{zопBcфВр*фГр.оп@Яф¦рHф§оп?[фЂр<оп<0ф†ф‡{z{zоп:¬фўр<фЈ{оп9(zфЊ{zфЌ{z}zоп7¤ф–{zр"ф—р7оп6 ф’рFф“оп4њфќр#}zр(фћоп3ф ф™рRоп1”фКфЛрPоп0фЊфЌрMоп.Њф—р"{zфоп-фј} zр7фЅоп+„ фЂ оп(Yф фЎрUоп&ХфФрGфХ оп%Qф№рCфєоп#Нф§}zфЁр<оп"Iф©фЄ}zоп Еф р6фЎр$опAфџр9{zф опЅфљр/ф›р${zоп9фИ{zр*фЙ опµфЂрG{zопвюдЌпЉюдодпвю ИЌпЉю Ио¬пвю(ЌпЉю(о#Фпвю ¦ЌпЉю ¦о-zпвюoЌпЉюoоцп}о.Ъ опюдFодпю ИFо¬пю(Fо#Фпю ¦Fо-zпюoFоцпюzр:о.Ъ{оцп¦z{zо.Ъ{оцпNzр0о.Ъ{оп ќюдпКюдЌодп ќю ИпКю ИЌо¬п ќю(пКю(Ќо#Фп ќю ¦пКю ¦Ќо-zп ќюoпКюoЌоUп џzфXр"опtф’р>ф“} опрzфЇр(} z {zф°опlф¤рIфҐопи фЕфЖр7опdфЊрMфЌD·‡HџbЎ8SEC. 3THE STORAGE PIPELINE67other tasks. If more faults have occurred in the meantime, FaultCount will have been incrementedagain and the fault task will be reawakened.The fault task does different things in response to the different types of fault. Single bit errors,which are corrected, are not reported at all unless a special control bit in the hardware is set. Withthis bit set, the fault task can collect statistics on failing storage chips; if too many failures areoccurring, the bit can be cleared and the machine can continue to run. Double bit errors may bedealt with by re-trying the reference; a recurrence of the error must be reported to the operatingsystem, which may stop using the failing memory, and may be able to reread the data from the diskif the page is not dirty, or determine which computation must be aborted. Page faults are the mostlikely reason to awaken the fault task, and together with write-protect faults are dealt with byyielding to memory-management software. MapRAM parity errors may disappear if the reference isre-tried; if they do not, the operating system can probably recover the necessary information.Microinstructions that read the various parts of History are provided, but only the emulator and thefault task may use them. These instructions use an alternate addressing path to History which doesnot interfere with the SRN addressing used by references in the pipeline. Reading base registers,the MapRAM, and CacheA can be done only by using these microinstructions.This brings us to a serious difficulty with treating History as a pure ring buffer. To read aMapRAM entry, for example, the emulator must first issue a reference to that entry (normally aMapRead), and then read the appropriate part of History when the reference completes; similarly, aDummyRef (see Table 3) is used to read a base register. But because other tasks may run and issuetheir own references between the start of the emulator's reference and its reading of History, theemulator cannot be sure that its History entry will remain valid. Sixteen references by I/O tasks, forexample, will destroy it.To solve this problem, we designate History[0] as the emulator's "private" entry: MapRead,MapWrite, and DummyRef references use it, and it is excluded from the ring buffer. Because thefault task may want to make references of its own without disturbing History, another private entryis reserved for it. The ring buffer proper, then, is a 14-element memory used by all referencesexcept MapRead, MapWrite, and DummyRef in the emulator and fault task. For historical reasons,Fetch, Store and Flush references in the emulator and fault task also use the private entries; the tagmechanism (¶ 4.1) ensures that the entries will not be reused too soon.In one case History is read, rather than written, by a pipeline stage. This happens during a readtransport, when READTR1 gets from History the cache address (row and column) it needs for writingthe new data and the cache flags. This is done instead of piping this address along from ADDRESSto READTR1.4. Cache-storage interactionsThe preceding sections describe the normal case in which the cache and main storage functionindependently. Here we consider the relatively rare interactions between them. These can happenfor a variety of reasons:Processor references that miss in the cache must fetch their data from storage.A dirty block in the cache must be re-written in storage when its entry is needed.Prefetch and flush operations explicitly transfer data between cache and storage.I/O references that hit in the cache must be handled correctly.Cache-storage interactions are aided by the four flag bits that are stored with each cache entry tokeep track of its status (see Figure 4). The vacant flag indicates that an entry should never match; itis set by software during system initialization, and by hardware when the normal procedure forloading the cache fails, e.g., because of a page fault. The dirty flag is set when the data in the entryis different from the data in storage because the processor did a store; this means that the entryяоп^Ј{фGхо(eо;€zопXLфЊр(фЌр4опVИфЂр'опSќф«р0ф¬р2опRф€р@ф‰р"опP•фјфЅрYопO ф—фр;опMЌфЈр5ф¤р(опL ф‚рZопJ…ф‡рHф€опIфЙр.фКр,опG}фЊр${zфЌр0опEщфЂрUопBОф„ф…рNопAJф‰фЉрQоп?ЖфЁ{zр=ф© оп>BфЂ{zр?оп;фУрXфФоп9“{zфёрXоп8~zф‘р7ф’р"оп6‹~zфЋфЏр:оп5ф©рMфЄоп3ѓфЂр:фЃ{z{z оп1яфЂоп.ФффрJ~zоп-P~zфЇ~zр=ф°оп+Мф‹фЊрHоп*Hфґр5фµр)оп(ДфЇ~z~z~zр!ф°оп'@~zф—~z~zрLфоп%јфЂр>оп"‘фЎрEфўоп! ф‚фѓ{zрJоп‰фђр>ф‘{опzфЂ{zоп2|фXопzфГ}zр/фДопѓ ф‘рSопяфЂоцп§рOоцпOрRоцпчрQоцп џ{z{zр<опtфЎфўрIопрфЃф‚~zр4опlф»фјр>опиф…р"ф†~zр'опdфЄр#ф«р=я$·ЏHџc™њ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER68must be written back to storage before it is used for another block. The writeProtected flag is a copyof the corresponding bit in the map. It causes a store into the block to miss and set vacant; theresulting storage reference reports a write-protect fault (¶ 3.3). The beingLoaded flag is set for about 15 cycles while the entry is in the course of being loaded from storage; whenever the ADDRESSstage attempts to examine an entry, it waits until the entry is not beingLoaded, to ensure that theentry and its contents are not used while in this ambiguous state.When a cache reference misses, the block being referenced must be brought into the cache. Inorder to make room for it, some other block in the row must be displaced; this unfortunate iscalled the victim. CacheA implements an approximate least-recently-used rule for selecting thevictim. With each row, the current candidate for victim and the next candidate, called next victim,are kept. The victim and next victim are the top two elements of an LRU stack for that row;keeping only these two is what makes the replacement rule only approximately LRU. On a miss, thenext victim is promoted to be the new victim and a pseudo-random choice between the remainingtwo columns is promoted to be the new next victim. On each hit, the victim and next victim areupdated in the obvious way, depending on whether they themselves were hit.The flow of data in cache-storage interactions is shown in Figure 2. For example, a Fetch thatmisses will read an entire block from storage via the ReadBus, load the error-corrected block intoCacheD, and then make a one-word reference as if it had hit.What follows is a discussion of the four kinds of cache-storage interaction listed above.4.1 Clean missWhen the processor or IFU references a word w that is not in the cache, and the location chosen asvictim is vacant or holds data that is unchanged since it was read from storage (i.e., its dirty flag isnot set), a clean miss has occurred. The victim need not be written back, but a storage read mustbe done to load into the cache the block containing w. At the end of the read, w can be fetchedfrom the cache. A clean miss is much like an I/ORead, which was discussed in the previous section.The chief difference is that the block from storage is sent not over the FastOutBus to an outputdevice, but to the CacheD memory. Figure 9 illustrates a clean miss.All cache loads require a special cycle, controlled by READTR1, in which they get the correct cacheaddress from History and write the cache flags for the entry being loaded; the data paths ofCacheA are used to read this address and write the flags. This RThasA cycle takes priority over allother uses of CacheA and History, and can occur at any time with respect to ADDRESS, which alsoneeds access to these resources. Thus all control signals sent from ADDRESS are inhibited byRThasA, and ADDRESS is forced to idle during this cycle. Figure 9 shows that the RThasA cycleoccurs just before the first word of the new block is written into CacheD. (For simplicity andclarity we will not show RThasA cycles in the figures that follow.) During RThasA, the beingLoadedflag is cleared (it was set when the reference was in ADDRESS) and the writeProtected flag is copiedfrom the writeProtected bit in MapRAM. As soon as the transport into CacheD is finished, the wordreference that started the miss can be made, much as though it had hit in the first place. If thereference was a Fetch, the appropriate word is sent to FetchReg in the processor (and loaded intoFetchRegRAM); if a Store, the contents of StoreReg are stored into the new block in the cache.If the processor tries to use data it has fetched, it is prevented from proceeding, or held until theword reference has occurred (see ¶ 5.1). Each fetch is assigned a sequence number called its tag,which is logically part of the reference; actually it is written into History, and read when needed byREADTR1. Tags increase monotonically. The tag of the last Fetch started by each task is kept inStartedTag (it is written there when the reference is made), and the tag of the last Fetch completedby the memory is kept in DoneTag (it is written there as the Fetch is completed); these are task-specific registers. Since tags are assigned monotonically, and fetches always complete in orderwithin a task, both registers increase monotonically. If StartedTag=DoneTag, all the fetches thatяодпZ\zфXх{z{z{z{z{z{z{ z{z{оzопTф…р"ф†р$~ zопRЃф§рPфЁ~zопPэф‰р*фЉ~ zфЂопOyфµф¶р={опMхzф¬фр+~ zопLqфЂр=опIFфІр:фіопGВфєф»рPопF>фИ}zр@фЙ опDєфЏфђр=} zопC6фЅфѕр<{zопAІфЂрF{zфЃоп@.ф—рKф оп>Єф›рHфњоп=&фЂрCоп9ыфёф№рB~zоп8wфћр'фџр5оп6уфЂр5оп3ИрYоп/Й}фX оп,ћzфЊфЌ {z~zр5оп+фЏфђрBоп)–ф–} zф—р2оп(фњр2~zфќ~zоп&Ћф…ф†|{z{~zр.оп% ф©фЄрLоп#†фЂр>оп [ф’р4{zф“опЧфТфУрPопSфЏфђр1~zопПф“рCф”{zопKфСр:фТ{zопЗ~zф°{zф±р?~zопCфёф№рIопїф–~zф—~z~ оп;zфљф›{z ~ zоп·фЋ~ z {z фЏр3оп3ф¤р(фҐр1опЇфџ~zф р.оп+{zфЂ~zрFопф р3фЎр"}z оп |фќ фћрP}zопшф†ф‡рTопt{zфҐф¦р3~zопр} zфђф‘р-~z опlф¦}z~zф§ опифМ фНрMопdфЂр_Є·ЦHџ_RкSEC. 4CACHE-STORAGE INTERACTIONS69<==опFЛфЂ{zоtп!2опmфЅр=фѕопйфЂр7фЃр'опeфЂфЃрEопбф—{zр4оп] фњфќр3опЩфИр;фЙопUф фЎрFопСфЂр5оп¦фф™рDоп"ф“р&ф”р:оп ћфЌрBфЋопфЂоп}фXопр~zф“ф” ~zрOопlф‚фѓрFопиф’~zр#{z{zф“~zопdфЂр\ \· ЕHџac оtп2ропkроп¤ропЭропропOроп€роpпАро«tплро«п$роъпЭроЭпЭро БпЭро¤пЭро€пЭро kпЭро OпЭро2пЭроyп+ю$оБпю$оіптюХ$о€пдю$]оЏпю$оІпИюG$оІпъю$Оолп+юІ$о2пюІ$о¤пдю$ОоИпю$Холпю$щоЂпИю$Хо¤пПю$щотпќюЋ$о9пЏюk$оп€ю$Ћо2пAю$GоИпdю$kолп+ю$9ощпИюІ$о«пlю$$о€пю$Оо€пЏю$$оПпЦю$$о«пъю$$оПпБю$«отпію$$оdпію$$оІпЦюІ$опЏюІ$о zптющ$о Vпю$о Vп+юG$о zптю$]оуп№ю$о:п«ю$о€пдюG$оЮпrю¤$о$¬пЃю$о$¬пrю$о%ҐпЃю$о+птю$опЃюХ$оЮпrю$]оеп«ю$опЃющ$оеп«ю$оеп№юG$опЃю$]оAп№юХ$оп«ю$]опдю$о%п2ро#ъп2ро"Юп2ро!Бп2ро Ґп2ро€п2роlп2роOп2роOпkро3пkропkроъпkроЭпkроБпkроҐпkро€пkроAп№ющ$опдю$оптюG$оAп№ю$]о€п¤роlп¤роOп¤ро3п¤роп¤роъп¤роЭп¤ро Бп¤роєпЃюЂ$олпЏющ$олпБю$толпќющ$оіпdюХ$о€пVю$]оЏпЏю$о пdющ$одпЏю$одпќюG$о пdю$]оПп+ющ$о«пVю$о«пdюG$оПп+ю$]оЭпЭроБпЭроҐпЭро€пЭроlпЭроOпЭро3пЭропЭроyпЏю]$оVпќюk$опроъпроЭпро Бпро¤про€про kпро Oпро3проOпропроЏпроъпроЏпро€пdю$о]п+ю$$оlпю$ЂоЏпdю$о2пOро«п$роsп9ю$rо rп€ю $о rпю $о rп Oю $оЦп€ротпOро –пdю9$оЃпVю +$о rп9ю$rо"sп $ро#ђп ю$9о#п9ю$9о¬п$роп$роеп rю$9о$п9ю$rоп Oю$опю$•опю@$оп€ю$оЭп!sропмю$dо&ћпЃюЂ$оlп¤роOп¤ро3п¤роп¤роъп¤роЭп¤роБп¤роҐп¤ро%п№ющ$опдю$оптюG$о%п№ю$]о&3пkро%пkро#ъпkро"Юпkро!Бпkро Ґпkро€пkроlпkро-ып2ро,Юп2ро+Вп2ро*Ґп2ро)‰п2ро(lп2ро'Pп2ро&3п2ро%п№юХ$оъп«ю$]опдю$о%мпЃющ$о%Йп«ю$о%Йп№юG$о%мпЃю$]о%мпЃюХ$о&Бпrю$]о%Йп«ю$оптю$о-ђпЃю$о-ђпrю$о.‰пЃю$о&Бпrю¤$оlпдюG$оп«ю$оЧп№ю$оп+юk$о]птю$]о]птюХ$оЭпПроБпПроҐпПро€пПро Oпро kпро€про¤проЦпроЭпро Бпро]п]роyпію$‡о Џплроплр о¬плроєплро!епдро2%пю$що,»п9ро2%п9ро3пю$що1 пю$що/мпю$що.Рпю$що-іпю$що+{пю$що,—пю$що' п9ро!zп9ромп9ро–пю$щоіпю$щоПпю$щомпю$щопю$що%пю$що$Рпю$що#іпю$що"—пю$що!zпю$що ^пю$щоAпю$що%мпю$що' пю$що(%пю$що)Bпю$що*^пю$що]п9роПп9ро«п9роп9роzпю$що]пю$щоAпю$що$пю$щопю$що]пю$що zпю$що–пю$щоіпю$щоПпю$щолпю$щоAпю$що $пю$що пю$щолпю$щоПпю$щоІпю$щоЦп‡ю-l$оVpпроЏпdс \ њ*щ3‰"l„SEC. 4CACHE-STORAGE INTERACTIONS71A Flush explicitly removes the block containing the addressed location from the cache, rewriting itin storage if it is dirty. Flush is used to remove a virtual page's blocks from the cache so that itsMapRAM entry can be changed safely. If a Flush misses, nothing happens. If it hits, the hitlocation must be marked vacant, and if it is dirty, the block must be written to storage. To simplifythe hardware implementation, this write operation is made to look like a victim write. A dirty Flushis converted into a FlushFetch reference, which is treated almost exactly like a Prefetch. Thus, whena Flush in ADDRESS hits, three things happen:the victim for the selected row of CacheA is changed to point to the hit column;the vacant flag is set;if the dirty flag for that column is set, the Flush is converted into a FlushFetch. Proceeding like a Prefetch, this does a useless read (which is harmless because the vacant flag hasbeen set), and then a write of the dirty victim. Figure 11 shows a dirty Flush. The FlushFetchspends two cycles in ADDRESS, instead of the usual one, because of an uninteresting implementationproblem.<==опAфёрF~zф№~ оп?~zфѓ{zф„р=оп=ъоtп®2опЇ}фX}}оп„zфћ{z{~zфџ~zр.опф™р?{z{~zфљоп | фЏфђрUопшфДр0фЕр'опtф фЎ{z{~zр1опрфр)ф®р2опl ф°р!ф±р4опифлр&{z{~zфмопd фЂР· Hџaяо:п€ю $о:пю $$о:пю$•о:п Oю $о%:п9ю$rоп rю$9оVtп$ро!Vп$ро«п9ю$rо№пzю +$оПп€ю9$о«п Oю $о«пю $о«п€ю $о«п9ю$rо дп$ро kпsроИп€ю$о¤п+ю$Ђо–пOю$$оБп€ю$оИп:ро3п:роИп:ро:п:ро€п:роkп:ро‡п:ро¤п:ро Ап:роЭп:рощп:роп:ро2п:роOп:ро ЏпБюk$о Џпію]$оOпроkпро€про¤проБпроЭпроъпропропOю$]одп€юG$одпzю$опOющ$о@п€ю$]опБюG$опію$о@п€ющ$оИпію$оБпzю$]олп€юХ$о $пБющ$о $пдю$то $піющ$оуп¤юЂ$ощпИропИро2пИроOпИроkпИро€пИро¤пИроБпИроzпЭю$]оVпюG$оVпю$оzпЭющ$оБпЏроЭпЏроъпЏропЏро3пЏроOпЏроlпЏро€пЏро€пVро ҐпVро!БпVро"ЮпVро#ъпVро%пVро&3пVро'OпVроVпю$оOпПю$]оzпЭюХ$оAп¤ю$]опЭюG$опПю$оAп¤ющ$опПю$о п–ю$]оAп¤юХ$оdпю$о'Юп¤ю$о&еп–ю$о&еп¤ю$о п–ю¤$оБпюG$оsпПю$о+пЭю$оІпю$]оЏпOюG$оЏпAю$оІпющ$опіюІ$оІпъюІ$оdпЦю$$отпЦю$$оОпдю$«о«пю$$оОпъю$$о‡пію$$о‡пAю$Оо«пЏю$$ощпмюІ$о $пOю$9о п€ю$kо kпdю$Gо Gп«ю$Ћо rпіюk$о +пБюЋ$о Эпую$що №плю$Хо $п:ю$що п3ю$Хо Эпю$Оо kпAюІ$о $пOюІ$оІпю$тоІпмюk$оИпAю$оБпю$]олпюХ$о щпAю$о ІпOю$о kпро‡про¤про АпроЭпрощпропро2продп$родплроpпАроtп¬ропsроп:ропропИропЏропVро¤пдю$Оолпдю$ОоОпБющ$олпБюd$олпію«$о:uп Іро:п–роtп#¬ро п,ро‡п,ро¤п,ро«п"р о +п zро rп:роyпЦю$о Ап,рощп,роЭп,роп,роп,ро(Чп¤юЂ$оЭпИроъпИропИро3пИроOпИроlпИро€пИро ҐпИро ]пЭю$]о :пюG$о :пю$о ]пЭющ$о ҐпЏро!БпЏро"ЮпЏро#ъпЏро%пЏро&3пЏро'OпЏро(lпЏро(lпVро)€пVро*ҐпVро+БпVро,ЮпVро-ъпVро/пVро03пVро :пю$о!3пПю$]о ]пЭюХ$о(%п¤ю$]о(пЭюG$о(пПю$о(%п¤ющ$о(пПю$о(ъп–ю$]о(%п¤юХ$оHпю$о0Вп¤ю$о/Йп–ю$о/Йп¤ю$о(ъп–ю¤$о¤пюG$о!VпПю$о!пЭю$оOпOюk$о–пю$]о–пю$оБпуроЭпуроъпуропуро€плр о плро¤плр о!zплроХпОю-l$оІпdю$щоОпdю$щолпdю$що пdю$що $пdю$що@пdю$щолпdю$щоПпdю$щоІпdю$що–пdю$що yпdю$що]пdю$щопdю$що$пdю$щоAпdю$що]пdю$щоzпdю$щопЂро«пЂроПпЂро]пЂро*^пdю$що)Aпdю$що(%пdю$що'пdю$що%мпdю$щоAпdю$що ]пdю$що!zпdю$що"–пdю$що#іпdю$що$Ппdю$що$пdю$щопdю$щомпdю$щоПпdю$щоіпdю$що–пdю$щомпЂро!zпЂро'пЂро,—пdю$що+zпdю$що-іпdю$що.Рпdю$що/мпdю$що1 пdю$що3пdю$що2%пЂро,єпЂро2%пdю$щоп ю$yо2пsроХп¬ротпsро Џпію$dотpпроЦпdсРDUT3‰$Ґ« THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER72<==фЂр,оп:}фXоп6уzф„ {zф…{z{zоп5oф†р:{z{zф‡оп3л{zфЂр2{zоп0А{zфѓр9ф„оп/<фљ{zр5ф› оп-ёфбр,фвр.оп,4ф«р"ф¬ {zоп*°{zф®фЇ{zр0оп),фЂрEоп&фЌ{zфЋр3{z{zоп$}ф¦ ф§р8{zоп"щфЕ{z{zфЖ{zоп!u{zфф{z{zопс ф“р(ф”{z {zопmф¤ {z{~zр.фҐ{zопйфЂ{z{zопѕфПфР{z~ zоп:{zфГ~ zфДрBоп¶фЎ фў{z~zр,оп2ф–рXф—оп®фЋфЏрMоп*фЛ ~zр6фМ{zоп¦фЂ{zр&оп§}фXоп |zфёр8~zф№{zопш ф¤р#фҐр9опt ф™рAфљ {опрzфђр.ф‘р4опlф©фЄр4{zопиф•{z ф–р2опdфЂя¤·ЏHџ_™аSEC. 5TRAFFIC CONTROL75At the other extreme, the rule could be that a stage waits only if it cannot acquire the resources itwill need in the very next cycle. This would be quite feasible for our system, and the proper choiceof priorities for the various stages can clearly prevent deadlock. However, each stage that may beforced to wait requires logic for detecting this situation, and the cost of this logic is significant.Furthermore, in a long pipeline gathering all the information and calculating which stages canproceed can take a long time, especially since in general each stage's decision depends on thedecision made by the next one in the pipe.For these reasons we adopted a different strategy in the Dorado. There is one point, early in thepipeline but after ADDRESS, at which all remaining conflicts are resolved. A reference is notallowed to proceed beyond that point without a guarantee that no conflicts with earlier referenceswill occur; thus no later stage ever needs to wait. The point used for this purpose is state 3 of theMAP stage, written as MAP.3. No shared resources are used in states 0-3, and STORAGE is not starteduntil state 4. Because there is just one wait state in the pipeline, the exact timing of resourcedemands by later stages is known and can be used to decide whether conflicts are possible. Wenow discuss the details.5.3.1 STORAGE and WRITETRIn a write operation, WRITETR runs in parallel but not in lockstep with MAP; see, for example,Figure 10. Synchronization of the data transport with the storage reference itself is accomplishedby two things.MAP.3 waits for WRITETR to signal that the transport is far enough along that the data willarrive at the StorageRAM chips no later than the write signal generated by STORAGE. Thiscondition must be met for correct functioning of the chips. Figure 13 shows MAP waitingduring an I/OWrite.WRITETR will wait in its next-to-last state for STORAGE to signal that the data hold time ofthe chips with respect to the write signal has elapsed; again, the chips will not work if thedata in WriteReg is changed before this point. Figure 10 shows WRITETR waiting during avictim write. The wait shown in the figure is actually more conservative than it needs tobe, since WRITETR does not change WriteReg immediately when it is started.<==€{zфЂ{zфЃр1{zоп=фѕр$фїр9оп;ЂфЁф©р>оп9ьфЂоп5э}фX}оп2ТzфГ{z}z{zфДоп1NфЈр<ф¤р!оп/КфЂоцп-r{zф• {zф–р$оцп+оф•{zр%ф– {zоцп*jфљф›р8{zоцп(жфЂ{z{~zоцп&Ћ{zф”р){zф•оцп% р*ф–р3оцп#†фЋфЏр%{zоцп"фЎфўрMоцп ~фЂ{zр9оtпd2я &·Hџ\(оtп ЏропИропроп:ропsроп¬роперо+пю$оrп ю$оАпAюG$оупЮю2$о п ПюЗ$о'пЮю$о'п Пю$о(пЮю$оdп Pю$оAпЮюХ$о п Пю$]оп ю$оAпЮющ$оп ю$опюG$оAпЮю$]оyпюХ$оOп ю$]оVпAю$о'Oп Џро&3п Џро%п Џро#ъп Џро"Эп Џро!Бп Џро ¤п Џро€п Џро€пИроkпИроOпИро2пИропИрощпИроЭпИроБпИроyпющ$оVпAю$оVп PюG$оyпю$]оБпро¤про€проkпроOпро2пропрощпроyп€ю9$оІп%юG$оkпWю$тоЗп%юЋ$о¤пWю$то2п%юІ$одпЙю$$оАпzю$ОоАпмю$$о п3ю$$одпWю$$о пю$«о +пю$$оњпю$$олп3юІ$о NпмюІ$ощп%юІ$о«пЙю$$о‡пzю$Оо‡пмю$$оОп3ю$$о«пWю$$оОпю$«отпю$$оdпю$$оІп3юІ$опмюІ$опмюІ$оЗпђю$$о¤пBю$Оо¤пію$$олпъю$$оЗпю$$олпею$«опЧю$$оЂпЧю$$оОпъюІ$о2піюІ$оOпмюІ$опђю$$оЭпBю$ОоЭпію$$о$пъю$$опю$$о$пею$«оHпЧю$$о№пЧю$$опъюІ$оkпіюІ$олп¬ю$]олп€юІ$ощпsроЭпsро Апsро¤пsро‡пsро kпsро Nпsро2пsро2п:ро Nп:ро kп:ро‡п:роЭп:рощп:роп:ро2п:ропsро2пsроІп Pю$]оЋп€юG$оІп Pющ$оАпAю$]олп PюХ$оАпћю$$о пъю $о Эпмю №$одпsроOпsроАпБю$оњпію@$оЭпБю$о2п—роп—рощп—роЭп—роБп—ро¤п—ро€п—роkп—роkп^роOп^ро2п^роп^роБп^ро¤п^ро€п^роkп^ро$п¬ю$9о$п€ю¤$о€п—роп—роkп—роOп—ропБю$опію$о пБю$оъпAю$¤оЦпію k$оЋп€ю.¬$оЋпю$що«пю$щоЗпю$щодпю$що пю$щопю$щоЗпю$що«пю$щоЋпю$щоrпю$що Vпю$що9пю$щодпю$щопю$щопю$що9пю$щоVпю$щощпро‡про«про9про*:пю$що)пю$що(пю$що&епю$що%Ипю$щопю$що :пю$що!Vпю$що"sпю$що#Џпю$що$¬пю$щопю$щодпю$щоИпю$що«пю$щоЏпю$щоrпю$щоИпро!Vпро&епро,sпю$що+Vпю$що-Џпю$що.¬пю$що/Ипю$що0епю$що2пю$що4пю$що2про,—про,,пЮю9$о$dпю$о$¬п ю$оъпAюG$о,Oп Пю¤$о3пЮю$о3п Пю$о4пЮю$оќп Pю$о+zпЮюХ$о,Oп Пю$]о+Vп ю$о+zпЮющ$о+Vп ю$о+VпюG$о+zпЮю$]о#іпюХ$о$€п ю$]о#ЏпAю$о3‰п Џро2lп Џро1Pп Џро03п Џро/п Џро-ъп Џро,Юп Џро+Бп Џро+БпИро*ҐпИро)€пИро(lпИро'OпИро&3пИро%пИро#ъпИро#іпющ$о#ЏпAю$о#Џп PюG$о#іпю$]о#ъпро"Эпро!Бпро ¤про€проkпроOпро2проИп€юG$олп Pю$]олп PюХ$оЭпрощпро¤пуро Апуро Апeро¤пeро kпeро‡пeрощпeроЭпeропeро2пeро+пeроІперолпероОп¬ро Nп¬рощпIю$ќо2п%ю$оUп роп^ро3пю$що2pпро€пdсоkп%юk$оkп3юЋ$о¤п3юЋ$о«tп^роп р &ТЈ4^ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER765.3.2 CacheD: consecutive cache loadsLoading a block into CacheD takes 9 cycles, as explained in ¶ 4.1, and a word reference takes onemore. Therefore, although the pipeline stages proper are 8 cycles long, cache loads must be spacedeither 9 or 10 cycles apart to avoid conflict in CacheD. After a Fetch or Store, the next cache loadmust wait for 10 cycles, since these references tie up CacheD for 10 cycles. After a Prefetch,FlushFetch or dirty I/ORead, the next cache load must wait for 9 cycles. STORAGE sends MAP asignal that causes MAP.3 to wait for one or two extra cycles, as appropriate. Figure 14 shows a Fetchfollowed by a Prefetch, followed by a Store, and illustrates how CacheD conflict is avoided by extracycles spent in MAP.3 Note that the Prefetch waits two extra cycles, while the Store only waits oneextra.<==zф’~z~zф“опHєф {zф™~zр#~zопG6оtпк2опл}фXопАzф‘ф’рBоп<фєф»рLопёфВфГрNоп4фЌр'фЋ}zр&оп°ф¬р5фр(оп,ф–ф—рDопЁфЂоцп P{zфЮ{zр*фЯ оцпМфЂоцпt{zфЧфШр1оцпр{zфѕрBфїоцпlф¦рQоцпифЧр$фШ{zр!оцпd яф· ZHџaОоп!—юІ$оІп!ЮюІ$оdп!єю$$отп!єю$$оОпИю$«о«п"ю$$оОп!Юю$$о‡п!—ю$$о‡п"%ю$Оо«п!sю$$ощп#РюІ$олп3ю$9оЗпlю$kо2пHю$GопЏю$Ћо9п!—юk$отпҐюЋ$о¤пЦю$щоЂпПю$Холпю$щоЗпю$Хо¤пмю$Оо2п%юІ$олп3юІ$о¤п!—юІ$о@п!ЮюІ$отп!єю$$оЂп!єю$$о]пИю$«о9п"ю$$о]п!Юю$$оп!—ю$$оп"%ю$Оо9п!sю$$о‡п#РюІ$о9п3ю$9опlю$kоЃпHю$Gо]пЏю$Ћо€п!—юk$о@пҐюЋ$отпЦю$щоПпПю$Хо9пю$щопю$Хотпмю$ОоЃп%юІ$о9п3юІ$оЭп!—ю$оЭпҐю$о Ап!—ю$о АпҐю$о¤п!—ю$о¤пҐю$о@п"ю$тоHпҐюХ$оdп#Рюk$оІп"ю$тоІп#РюG$оЃп#Рюk$оdпҐюХ$о]п"ю$тоVп3ю$9о3пlю$kоќпHю$GоzпЏю$Ћо¤п!—юk$о]пҐюЋ$опЦю$щомпПю$ХоVпю$що3пю$Хопмю$Ооќп%юІ$оVп3юІ$оъп!—ю$оъпҐю$оЭп!—ю$оЭпҐю$оБп!—ю$оБпҐю$оБп!—юІ$о]п!ЮюІ$оп!єю$$оќп!єю$$оzпИю$«оVп"ю$$оzп!Юю$$о2п!—ю$$о2п"%ю$ОоVп!sю$$о¤п#РюІ$оЏп%ю$о€пмю$]оІпъюХ$оАп%ю$оyп3ю$о2tперо Nперо kперо‡перо¤перо АпероЭперощперо«п%ю$о¤пмю$]оПпъюХ$о]пероVпероOпероkпероБпероЭпероъпероперо–пъю$]оsп3юG$оsп%ю$о–пъющ$оп"Џроп Wропропероп¬ропsроп:роп+ю :$о9;п Эю$rоп №ю ]$оп №ю$•оптю :$о$п Эю$rо.п Эю$rодпю$9о(Џпю$9о2пю$9опИро%:пИро/ЃпИро :пИро*ИпИро4ЙпИро:пИро:пЏро]пҐю@$оzпҐюХ$оrpп«роХtп"Џротп Wроdп"ЏроЂп WроЃп"Џроќп Wро¤п%ю –$оБп%ю$о$¬п%ю$о%Ґпмю$]о$ПпъюХ$оєп€юЂ$о Ап¬роЭп¬рощп¬роп¬ро2п¬роOп¬роkп¬ро€п¬роAпБю$]опъюG$опмю$оAпБющ$о€пsро¤пsроБпsроЭпsроъпsропsро3пsроOпsроOп:роlп:ро€п:ро Ґп:ро!Бп:ро"Юп:ро#ъп:ро%п:ропмю$опію$]оAпБюХ$оп€ю$]одпБюG$одпію$оп€ющ$одпію$оЭпzю$]оп€юХ$о+пъю$о%Ґп€ю$о$¬пzю$о$¬п€ю$оЭпzю¤$о€пмюG$о:пію$отпБю$о yпъю$]о Vп3юG$о Vп%ю$о yпъющ$о(Чп€юЂ$оЭп¬роъп¬роп¬ро3п¬роOп¬роlп¬ро€п¬ро Ґп¬ро ]пБю$]о :пъюG$о :пмю$о ]пБющ$о Ґпsро!Бпsро"Юпsро#ъпsро%пsро&3пsро'Oпsро(lпsро(lп:ро)€п:ро*Ґп:ро+Бп:ро,Юп:ро-ъп:ро/п:ро03п:ро :пмю$о!3пію$]о ]пБюХ$о(%п€ю$]о(пБюG$о(пію$о(%п€ющ$о(пію$о(ъпzю$]о(%п€юХ$оHпъю$о0Вп€ю$о/Йпzю$о/Йп€ю$о(ъпzю¤$о¤пмюG$о!Vпію$о!пБю$о–пъю$]оsп3юG$оsп%ю$о–пъющ$о%перо#ъперо"Юперо!Бперо€пероlпероsпероzперо2Чп€юЂ$о"Юп¬ро#ъп¬ро%п¬ро&3п¬ро'Oп¬ро(lп¬ро)€п¬ро*Ґп¬ро*^пБю$]о*:пъюG$о*:пмю$о*^пБющ$о*Ґпsро+Бпsро,Юпsро-ъпsро/пsро03пsро1Pпsро2lпsро2lп:ро3‰п:ро4Ґп:ро5Вп:ро6Юп:ро7ып:ро9п:ро:4п:ро*:пмю$о+3пію$]о*^пБюХ$о2%п€ю$]о2пБюG$о2пію$о2%п€ющ$о2пію$о2ъпzю$]о2%п€юХ$о#Hпъю$о:Вп€ю$о9Йпzю$о9Йп€ю$о2ъпzю¤$о%ҐпмюG$о+Wпію$о+пБю$о"–пъю$]о"sп3юG$о"sп%ю$о"–пъющ$оVп3ю@$о#п Эю$9о8п Эю$9о#ЏпБю$9о"sпИро8¬пБю$9о7ђпИроЭп!—ю‡$оіп!—ю‡$о Ап роЭп роЦп ро¤пќро€пќроЭп роъп роуп ро ҐпќроєпЏроъпЏро%:пЏро*:пЏро/ЙпЏро3BпЏроп%ћро Эп%ћро€п%ћроп#ую$‡о«п#ую$‡оИп#ую$‡о,єп2ро2%п2ро7ґп2ро4^пю$що3Bпю$що5{пю$що6—пю$що7ґпю$що8Рпю$що; пю$що2%пю$що1 пю$що/мпю$що.Рпю$що-іпю$що+zпю$що,—пю$що'п2ро!zп2ромп2ро–пю$щоіпю$щоПпю$щомпю$щопю$що$пю$що$Ппю$що#іпю$що"–пю$що!zпю$що ]пю$щоAпю$що%мпю$що'пю$що(%пю$що)Aпю$що*^пю$що]п2роПп2ро«п2роп2роzпю$що]пю$щоAпю$що$пю$щопю$що]пю$що yпю$що–пю$щоІпю$щоПпю$щолпю$що@пю$що $пю$що пю$щолпю$щоОпю$щоІпю$щоІпЂю5{$о9мпю$що!еп €ро7п €роdpпр)оп!—юІ$оЦпҐюЋ$оуп!—ющ$оПпҐюХ$о]п3ю9$оЦпdсяфFU%а;-&—+SEC. 5TRAFFIC CONTROL77Figure 15 shows a Store with clean victim followed by a Fetch with dirty victim and illustrates thisinterlock. ADDRESS waits until cycle 26 to start WRITETR. Also, the fetch waits in MAP.3 until thesame cycle, thus spending 13 extra cycles there, which forces the fetch victim to spend 13 extracycles in ADDRESS. The two-cycle gap in the use of CacheD shows that the fetch could have leftMAP.3 in cycle 24.<==пИю¤$оDепЦю$оDепИю$оEЮпЦю$о.eпHю$о=BпЦюХ$о>пИю$]о=пю$о=BпЦющ$о=пю$о=пюG$о=BпЦю$]о5zпюХ$о6Pпю$]о5Wп:ю$оEPп€роD4п€роCп€роAып€ро@Юп€ро?Вп€ро>Ґп€ро=‰п€ро=‰пБро^пПю$що?zпПю$що@—пПю$щоAіпПю$щоF%пПю$щоBРпПю$щоCмпПю$щоBуплро=Bплро7іплро2%плро,єплроE пПю$що"+п ъро9:п ъроќpпр0о"єпdс$€q/FI& g THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER78Main storage boards are the same size as logic boards but are designed to hold an array of MOSRAMs instead of random ECL logic. A pair of storage boards make up a module, which holds 512Kbytes (plus error correction) when populated with 16K RAMs, 2M bytes with 64K RAMs, or 8Mbytes with (hypothetical) 256K RAMs. There is room for four modules, and space not used forstorage modules can hold I/O boards. Within a module, one board stores all the words with evenaddresses, the other those with odd addresses. The boards are identical, and are differentiated bysideplane wiring.A standard Dorado contains, in addition to its storage boards, eleven logic boards, including disk,display, and network controllers. Extra board positions can hold additional I/O controllers. Threeboards implement the memory system (in about 800 chips); they are called ADDRESS, PIPE, andDATA, names which reflect the functional partition of the system. ADDRESS contains theprocessor interface, base registers and virtual address computation, CacheA (implemented in 256 by4 RAMs) and its comparators, and the LRU computation. It also generates Hold, addresses DATA onhits, and sends storage references to PIPE.DATA houses CacheD, which is implemented with 1K by 1 or 4K by 1 ECL RAMs, and holds 8K or32K bytes respectively. DATA is also the source for FastOutBus and WriteBus, and the sink forFastInBus and ReadBus, and it holds the Hamming code generator-checker-corrector. PIPEimplements MapRAM, all of the pipeline stage automata (except ADDRESS and HITDATA) and theirinterlocks, and the fault reporting, destination bookkeeping, and refresh control for the MapRAMand StorageRAM chips. The History memory is distributed across the boards: addresses onADDRESS, control information on PIPE, and data errors on DATA.Although our several prototype Dorados can run at a 50 nanosecond microcycle, most of themachines run instead at 60 nanoseconds. This is due mainly to a change in board technology froma relatively expensive point-to-point wire-routing method to a cheaper Manhattan routing method. 7. PerformanceThe memory system's performance is best characterized by two key quantities: the cache hit rateand the percentage of cycles lost due to Hold (¶ 5.1). In fact, Hold by itself measures the cache hitrate indirectly, since misses usually cause many cycles of Hold. Also interesting are the frequenciesof stores and of dirty victim writes, which affect performance by increasing the frequency of Holdand by consuming storage bandwidth. We measured these quantities with hardware event-counters,together with a small amount of microcode that runs very rarely and makes no memory referencesitself. The measurement process, therefore, perturbs the measured programs only trivially.We measured three Mesa programs: two VLSI design-automation programs, called Beads and Placer;and an implementation of Knuth's TEX [8]. All three were run for several minutes (several billionDorado cycles). The cache size was 4K 16-bit words.Percent of cycles:Percent of references:Percent of misses:ReferencesHoldHitsStoresDirty victims Beads36.48.1499.2710.516.3Placer42.94.8999.8218.765.5TEX38.46.3399.5515.234.9Table 5: Memory system performanceTable 5 shows the results. The first column shows the percentage of cycles that contained cachereferences (by either the processor or the IFU), and the second, how many cycles were lost becausethey were held. Hold, happily, is fairly rare. The hit rates shown in column three are gratifyinglyяодп^ўqфXхrqrqrqrqrqrqr qrqrоqопXKф©р1фЄр&rопVЗqф‡rqрDопUCф·р1rqфёrqопSїф¶rqф·р8опR;фЎrqrqр8фў опP· фЁ ф©рNопO3фЂопLф фЎрYопJ„ф–р0ф—rqrqопIф•рCsqф–sqопG|sqфкр-флsqопEшфЊрYопDtф…rqф†rqр!tqsqопBрфЂр!sqоп?ЕsqфЃр4ф‚rqrqоп>AфҐsqф¦р.оп<ЅфсфтрFsоп;9q фњrqр-rqrqфќоп9µ ф°р2ф±rоп81qфrqрIоп6sqфЂsqsqоп3‚фЦр+фЧр&оп1юфЉф‹рTоп0zфЂр`оп+§sфXоп(|qф«ф¬р@оп&шфЏр&tqtqфђоп%tф‘р7tqф’оп#рфџр0ф р,tоп"lqф‡ф€рGоп иф”ф•рOопdфЂрTоп9ф„ф…rqр5опµф–rqф—р>оп1фЂр.оп^юдЌпюдодп^ю ИЌпю Ио¬п^ю(Ќпю(о#Фп^ю ¦Ќпю ¦о-zп^юoЌпюoодп®uфX ођо0;одп* о)ођо#Фо0;оп<юдFодп<ю ИFо¬п<ю(Fо#Фп<ю ¦Fо-zп<юoFопNqодо)ођо#Фо0;опКодо)ођо#Фо0;опFrодqо)ођо#Фо0;оп •юдп ВюдЌодп •ю Ип Вю ИЌо¬п •ю(п Вю(Ќо#Фп •ю ¦п Вю ¦Ќо-zп •юoп ВюoЌо>п—р"опlфЁрKф©опи ф™фљvqр4опdф–tqф—р1я·ђHџcLSEC. 7PERFORMANCE79largeќall over 99 percent. This is one reason that the number of held cycles is small: a miss cancause the processor to be held for about thirty cycles while a reference completes. In fact, the tableshows that Hold and hit are inversely related over the programs measured. Beads has the lowest hitrate and the highest Hold rate; Placer has the highest hit rate and the lowest Hold rate.The percentage of Store references is interesting because stores eventually give rise to dirty victimwrite operations, which consume storage bandwidth and cause extra occurrences of Hold by tying upthe ADDRESS section of the pipeline. Furthermore, one of the reasons that the StoreReg registerwas not made task-specific was the assumption that stores would be relatively rare (see thediscussion of StoreReg in ¶ 5.1). Table 5 shows that stores accounted for between 10 and 19percent of all references to the cache.Comparing the number of hits to the number of stores shows that the write-back discipline used inthe cache was a good choice. Even if every miss had a dirty victim, the number of victim writeswould still be much less than under the write-through discipline, when every Store would cause awrite. In fact, not all misses have dirty victims, as shown in the last column of the table. Thepercentage of misses with dirty victims varies widely from program to program. Placer, which hadthe highest frequency of stores and the lowest frequency of misses, naturally has the highestfrequency of dirty victims. Beads, with the most misses but the fewest stores, has the lowest. Thelast three columns of the table show that write operations would increase about a hundredfold ifwrite-through were used instead of write-back.Acknowledgements The concept and structure of the Dorado memory system are due to Butler Lampson and ChuckThacker. Much of the design was brought to the register-transfer level by Lampson and BrianRosen. Final design, implementation, and debugging were done by the authors and Ed McCreight,who was responsible for the main storage boards. Debugging software and microcode were writtenby Ed Fiala, Willie-Sue Haugeland, and Gene McDaniel. Haugeland and McDaniel were also ofgreat help in collecting the statistics reported in ¶ 7. Useful coments on earlier versions of thispaper were contributed by Forest Baskett, Gene McDaniel, Jim Morris, Tim Rentsch, and ChuckThacker. References1.Bell, J. et. al. An investigation of alternative cache organizations. IEEE Trans. Computers C-23, 4, April 1974, 346-351.2.Bloom, L., et. al. Considerations in the design of a computer with high logic-to-memory speed ratio. Proc. GigacycleComputing Systems, AIEE Special Pub. S-136, Jan. 1962, 53-63.3.Conti, C.J. Concepts for buffer storage. IEEE Computer Group News 2, March 1969, 9-13.4.Deutsch, L.P. Experience with a microprogrammed Interlisp system. Proc. 11th Ann. Microprogramming Workshop, PacificGrove, Nov. 1979.5.Forgie, J.W. The Lincoln TX-2 input-output system. Proc. Western Joint Computer Conference, Los Angeles, Feb. 1957,156-160.6.Geschke, C.M. et. al. Early experience with Mesa. Comm. ACM 20, 8, Aug. 1977, 540-552.7.Ingalls, D.H. The Smalltalk-76 programming system: Design and implementation. 5th ACM Symp. Principles ofProgramming Languages, Tucson, Jan. 1978, 9-16.8.Knuth, D.E. TEX and METAFONT: New Directions in Typesetting. American Math. Soc. and Digital Press, Bedford, Mass.,1979.9.Lampson, B.W. et. al. An instruction fetch unit for a high-performance personal computer. Technical Report CSL-81-1,Xerox Palo Alto Research Center, Jan. 1981. Submitted for publication.10.Lampson, B.W., and Pier, K.A. A processor for a high performance personal computer. Proc. 7th Int. Symp. ComputerArchitecture, SigArch/IEEE, La Baule, May 1980, 146-160. Also in Technical Report CSL-81-1, Xerox Palo Alto ResearchCenter, Jan. 1981.11.Liptay, J.S. Structural aspects of the System/360 model 85. II. The cache. IBM Systems Journal 7, 1, 1968, 15-21.12.Metcalfe, R.M., and Boggs, D.R. Ethernet: distributed packet switching for local computer networks. Comm. ACM 19, 7,July 1976, 395-404.оп]®rфGхо- о;€qопWWфр.ф™р+опUУф„р9ф…р)опTOф‡tqрIф€ опRЛфЂtqр6tqопO ф¤tqр)фҐр%опNфЃр0ф‚tqопLфЁrq ф©рJопKфофпрOопIђ фДфЕрFопHфЂопDбфЊр6фЌр"опC]фЎ фўuqр5опAЩфЈрHtq оп@Uф®р=фЇоп>С ф’ф“р:оп=MфЮрJфЯоп;Йф‘р>ф’оп:EфЁф©рVоп8БфЂр!оп3оsфXоп0Гqф¦рVоп/?фјфЅр4оп-»ф‰рMфЉ оп,7фЌр8фЋр$оп*іфўр)фЈр/оп)/фІр_оп'«фҐр?ф¦оп&'фЂоп!Ts опїrqо{rфGwrр9xwyrопqо{r wrрVwо{пЪrwvrр&оп3qо{rр+xwyrопЊqо{rрDwр)rо{пNоп§qо{rр5wр'rо{пiопВqо{r wrwxryrопqо{rрPwxwо{пЭrоп 6qо{rxwxwrр8о{пшоп Qqо{r wrрYvrо{п рGопlqо{rрVwо{п.r vrр9vrо{пропIqо{rрMxwyrопўqо{rрfwxwyrо{пdяD·„Hџb¤µ THE MEMORY SYSTEM OF A HIGH-PERFORMANCE PERSONAL COMPUTER8013.Mitchell, J.G. et. al. Mesa Language Manual. Technical Report CSL-79-3, Xerox Palo Alto Research Center, April 1979.14.Pohm, A. et. al. The cost and performance tradeoffs of buffered memories. Proc. IEEE 63, 8, Aug. 1975, 1129-1135.15.Schroeder, M.D. Performance of the GE-645 associative memory while Multics is in operation. Proc. ACM SigOpsWorkshop on System Performance Evaluation, Harvard University, April 1971, 227-245. 16.Tanenbaum, A.S. Implications of structured programming for machine architecture. Comm. ACM 21, 3, March 1978, 237-246.17.Teitelman, W. Interlisp Reference Manual. Xerox Palo Alto Research Center, Oct. 1978.18.Thacker, C.P. et. al. Alto: A personal computer. In Computer Structures: Readings and Examples, 2nd edition, Sieworek,Bell and Newell, eds., McGraw-Hill, 1981. Also in Technical Report CSL-79-11, Xerox Palo Alto Research Center, August1979. 19.Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. R&D 11, 1, Jan. 1967, 25-33.20.Wilkes, M.V. Slave memories and segmentation. IEEE Trans. Computers C-20, 6, June 1971, 674-675.одпDqфXхrqrqrqrqrqrqr qrqrоqопнrqо{rфGwrwrvrр3опFqо{rwrр=wxwyrоп џqо{rр^wxwо{пaр)rр+оп єqо{rрSwxwyrо{п |опХqо{r wrр.оп.qо{r wrwр*rо{пррDvrр/о{пІопqо{rрRxwyrопdqо{rр0xwyr·NоHџ:› HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICAMATH TIMESROMAN LOGO HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA MATH TIMESROMAN LOGO TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICATEMPLATE@GATES HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA MATH TIMESROMAN LOGO TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN MATH TIMESROMAN HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA HELVETICA MATH LOGO TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMANLіэ fn" W,4e? RI ЂSN^ –g2r 1{Љ !”n © яі ^Ѕ WЗЬЙ_Н Ц Iа jйNф Iю <?”(#,.ы3N> ‡GrM cV >`k Du.Ђ =Љ 6”.џ ш© ± Јє sДЇМцТШ>ЬLз fсЛь Чь m n а& n0‚8;@ ЕJ zS х]Иhds~ "‡>“Лџ Ь¬ Ўµ їOЛ0Щ5зґтKэepresseditmFig9.press [h\'g6.press IfuFig7.press g7.press MemFig8[%h\'''.press IfuFig4.press IfuFig5.press IfuFi[?h\'''ress IfuFig1.press IfuFig2.press IfuFig3[Yh\'''pressedit/m ifuFinal.press _ Doradoifu.p[sh\'g6.press IfuFig7.press g10.press MemFig[Ќh\'''.press IfuFig4.press IfuFig5.press IfuFi[§h\j/ Sяя–‹ЎЇяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяяDoradoB-W.presspiera13-Jan-81 16:34:55 PST