TheDragonComputerSystemAnEarlyOverviewEdwardM.McCreightDraftofDecember7,1984Releaseas[Indigo]Documentation>DragonOverview.tioga,.presscCopyright1984XeroxCorporation.Allrightsreserved.Abstract:Dragonisa32-bitcomputersystembeingdesignedatXeroxPARC.Thisdocumentisanintroductiontoitsarchitecture.Aversionofthisdocumentisbeingpresentedasahalf-daylectureattheNATOAdvancedStudyInstituteonMicroarchitectureofVLSIComputersinUrbino,Italy,July9-20,1984,andasachapterofthebookthatcustomarilyissuesfromtheseInstitutes.XEROXXeroxCorporationPaloAltoResearchCenter3333CoyoteHillRoadPaloAlto,California94304pW,qS^/rN XIFdsD!etD!3:>u?+s?+9 #%(v;t ;;G!(|-V1S798=b:5) $>%' 0168;/8nrl$X)J+w-88.tt81`8N ;h8 4 TVm$q3IntroductionDragonisanewcomputersystembeingdesignedattheXeroxPaloAltoResearchCenter.InmanywaysDragonisthenewestdescendantinthelineofresearchpersonalcomputersthatbeganin1973withtheAltoandendstodayintheDorado.Likethosemachines,Dragonwillbeapersonalcomputer.Tous,thismeansahigh-performancedisplaycontrollerandavirtualmemorywithonlyrudimentaryprotectionhardware.DragonandDoradobothexecuteprogramscompiledfromtheCedarlanguage.(CedarisanextensionoftheXeroxMesalanguage,whichisaderivativeofPascal.)WeplantobuildmodestnumbersofDragonsforourselvestoallowustoexploreusefulwaysofexploitingsuchperformanceinapersonalsettingbeforesuchmachinesareeconomicalinwidespreaduse.Dragondiffersinsignificantwaysfromitspredecessors.Mostimportantly,itisatightly-coupledmultiprocessor.Weseemultiprocessingasanimportantwaveofthefuture,andweneedtounderstanditsimplicationsforsoftwaredesign.Dragonisalsoa32-bitmachine,bothinitsarithmeticpathandinitsvirtualaddresses.Ithasanewinstructionset,considerablyreducedinsemanticcomplexityfrompreviousmachines.Dragoncanhandle16-bitdata,butatanoticeablepenaltyincodesizeandexecutionspeedcomparedwith32-bitdata.Thesefactswillencourageoursystemsprogrammerstoeliminatevestigesof16-bit-nessthatsurvivefromAltodays.Dragonisalsoourfirstmachinetoincludecustomintegratedcircuits;previousmachineswerebuiltofcommercialintegratedcircuits.Dragonisnotevenclosetobeingfinished.Ourlogicsimulationsareabouttwo-thirdscomplete,andfloorplanningandlayouthavebegunonsomechips,butmodestarchitecturalrefinementsarestillaweeklyevent.Wecanfairlyclaimtohavemadeasolidbeginning,butamountainofhardworkremains.WehopebypublishingthispapertostimulatediscussionsthatcouldmakeDragonabettermachine.KeyIdeasofDragonOurpresentworkhorsemachine,theDorado,isapersonalcomputerinthesamesensethataRolls-Royceisapersonalautomobile.Itisexpensive,power-hungry,andnoisy.Thosefactsalonewouldbeenoughtomakeuswantabettermachine.Butwealsohadsomenewideasthatwewantedtoinvestigatethroughimplementation.MultiprocessingThefirstideaistightly-coupledMIMD(multipleinstructionstreamsoperatingonmultipledatastreams)multiprocessing.Asweconsideredthedecliningfractionoftotalsystemcostrepresentedbyasingleprocessor,theslowadvanceinprocessorcircuitspeed,andthekneeineveryprocessorspeedvs.costcurve,itbecameapparenttousthatthemoststraightforwardwaytoobtainanorder-of-magnitudeincreaseinprocessingpowerwasnottoattemptanorder-of-magnitudeincreaseincircuitspeedorarchitecturalcomplexity,butrathersimplytousemultipleprocessors.Thisapproachhastwomajordrawbacks.Thefirstisthatasthenumberoftg/Ey^ |[P#(,2473;?Y $))*-@1 90;=@eXW< [#'#(,/1i4y7':Y>?V`""'*,-3:=2?|U  % ,Q/k05;?Sa  %x*-336< QL&+t-/>57:>tP!P( "$x*3,015:@eNkx$&<(U-2,58 >L OW"')*802O 9_: KH #&*n,\ 5L8 @Fc3%r(*z4b6%8*>ESy r 'y)1 0288=CW-`aG!#D%( +.13\5@9 @B;H 1 & +4,2Z 9D<@]"&)}, -. 5|:g< ?r>"%).268?= #(*' 038P;>;g$h&R+m0h 7<9<< $ +7DQO!#'-04i ;><5 uw!'*T.168;?3  #r%().36.8<@o2N43 !^#%+_-#0j3:<@"0 % ,\/n3}7u<>&.y(a?|% &)>.0p17=?#kz3 $&(9.L 7y9;{ ! /#' */167;=@ !$(z+/=2749;Z uz&|2w%&|)/ 6;/el)+- 47=% "M$A%c)F /25:d<   $ *-/2`6T7 E%C'$3>8q:><   "j&x*,/  o d$| ,U/O2H368;@e TVm$4processorsincrease,sodoesthedifficultyofroutinginformationamongthemingeneralways.Iftheprocessorsareallattachedtoacommonsharedmemorybus,thenaccesstothatbusbecomestheresourcethatlimitssystemperformance.Apartialsolutiontothisproblemistointerposeacachememorybetweeneveryprocessorandthesharedmemorybus.Untilrecentlythissolutionhassufferedfromthe"cacheconsistencyproblem."Whenaprocessorstoresnewdataintoitscache,shouldthatstorealwaysappearonthesharedmemorybus?Ifso,seriouscontentionresults,becausetypicallyone-thirdofmemoryoperationsarestores.Ifnot,thenwhatpreventstwocachesfromhavingdifferentopinionsaboutthecontentsofthesharedmemory?Fortunatelyanewarchitecturalelement,thetwo-portcache,reduceseachprocessor'saverageneedtostoretothesharedmemorybus,therebyallowingmodestnumbersofprocessorstocoexistwithoutcripplinginterference.Theseconddrawbackismoreserious,especiallyinapersonalcomputer.TotakeadvantageofDragon-stylemultiprocessing,theremustusuallybeseveralprocessesreadytorun,andtheinterprocesscommunicationandsynchronizationamongprocessesmustbemodest.Whilethisisthenormalcaseforaheavily-loadedtimesharingsystem,maintainingmanyreadyprocessesinapersonalcomputerrequiresbreakingupsuchcompute-boundprogramsascompilers(andmanyothers)intoprocesses.Ourexistingcompilers,forexample,werewrittenassingleprocessesbecauseweknewthattheywouldbeexecutedonsingle-processorsystems.Tohavewrittenthemotherwisewouldhavecomplicatedtheprogrammingtaskandmadethecompilerrunmoreslowly.Dragonintroducesustotwointerestingsoftwareproblems.First,howshouldweadaptexistingsoftwareforthenewmultiprocessinghardware?Second,howshouldthecomingwaveofmultiprocessorhardwarechangehowwewriteprograms,orhowwethinkaboutprogramming?WeatXeroxsupposethatitishardertowriteprogramsasmultipleprocessesthanassingleprocesses,butthatsuppositionisbasedmoreonsecond-handreportsthanonfirst-handexperience.WeintendtouseDragonasatoolforinvestigatinghowDragon-stylemultiprocessingcanbeusedtobestadvantage.FasterexecutionSpeedisn'teverything,butit'skilometersaheadofwhateverisinsecondplace.Processorspeedcanbeputtousedirectlyinthosefewplacesinaprogramwhereit'sneeded,andinotherplacestradedawayforclarityofexpression,compactnessofrepresentation,anextralevelofinterpretation,delayedbinding,garbagecollection,orahostofotherworthwhilethings.WehopedtoimprovethespeedofDragon,comparedwithourearliermachines,primarilyintwoareas.Thefirstisserialexecutionofbasicblocks,wherepipeliningandrichpipelinebypasspathsreducetheaveragenumberofcyclesperinstruction.Thegoal,likethatoftheIBM801,wasasimpleinstructionsetandastoragehierarchythatwouldpermitthemachinetoexecuteaninstructionnearlyeverymachinecycle.Thesecondareaofimprovementistheprocedurecall,wheresimplifiedparameterpassing,acachedcontrolstack,andeliminationofseveralindirectionsincommoncaseswouldprovideadramaticspeedimprovementoverourearlierimplementations.CodedensityOneofthemostnotableachievementsoftheMesamachineinstructionset,implementedinmicrocodeonsuchmachinesastheDoradoandtheXeroxStar,wastg/E|_/ pb #N )d+K0M 8<@p] "%',.P/5C9?/[4mGA#&T+.27 w@[|Z9w#%U'I-.38>jX#p&*>/_179u>VD $ (')S/h3K6;9C<=UCGC"#&8*/3Y46;N S$&,% 25h:g< >QI"W&,2/68=?PM "$'C /z5[8 =N: <#N&(+-0#4:=Le. '/(-d2p8B J?!$) 013"8?HB @ #.^2N6;6=G4! # +j 5S8@EM0~T$-(+*,.H2578 C  #',&246;B>  (.0s69=@ j@ = &(.168C< >ln#%q+ -6<>=Hh" %v -H/ 8;>^;A&9& f? # *q/6:=7:7r "%(92>9?5Nk (.3:6.8E;406 '*,J058:[;@o2rY%).*. 5U7: 0M- !&o)+ 1 9-t z*D%|& "D$p +/0689>2%M  !$M)M+.1\5788l>#'Z$"'"*-1A3 :/ !ys!#~,S1G6; W\ !&)0-v/047;< %&,.169<> 4 R# *%,/5M9=aZr!$' ,/2j5.89w<_aa<|a?0  "#(.15:9< P$)1/547<@ek "M%[) 036<7=}2 "' .q058=  "%*'+zk| O! *,/28 ?  "Z!N$m*c+.M3E588[A #%b':)w+[[,|[.4 ;?Z9 K"%(.;347v8<X5$)+15:=wV [ 9$@'*025c8x:6?UC1Q  K#%;(,2Xw9+UC|UC:u<S& c &L((*-L4@5;l=2@Q2DI"%q',05]6;>RPM kMpm"6$)g+@/B13z6\9`:s>3L*"&0( 0}3K59R ?fJzyCM|@N0$;&.',>1x7E:W?>jwP>|>ah # ).3!8/;W>=Fw}=F|=FWw "Jw$=F|=F%(*+0w3=F|=F479:0 ;p"%* ,J2;68@8P# ;w!48P|8P"E% -0+ :K<>'6 w A"/$*0I264w9C6|6:T=V5^jw5|5 P# &+3-1397=43ZwY3Z|3Zws3Z|3Z0w0|0 &U(.074+68; /7p"q'-0Lz1/7/72|6/7/78I;- n #E(*{,X--,|/--04M8:-R&'D "$+.26z8;o %':%k(,#'z##{#|!(' :'3 {%::& |K'{3KK}( KK){K+Y }5+KK6;{K8y|]'{3]]}( ]]){]+Y }5+]]6;{]8y|o'{3oo}( oo){o+Y }5+oo6;{o8y|"!&)-'.57<@-' w"#|%')g+115~ @p2'v"%?'* 14-;<?'"c$( /926E8<_?'k ,6R9e@ '{5"$'m-.3Y8;? ' xTVm$60:WriteQuadaddress1:data[address]2:data[address+1]3:data[address+2]4:WriteQuadReadydata[address+3]iii.WriteSinglebroadcastsasinglewordinthememoryaddressspace.TheMastercachedoesthisbecauseitwantstochangeaword,andtheitsSharedflag(tobedescribedshortly)forthatwordisset,indicatingthatacopyofthatwordmayexistinothercaches.Thiswordisnotwrittentomemory,andthereisnoflowcontrol.0:WriteSingleaddress1:data[address]iv.IOReadreadsasinglewordintheI/Oaddressspace.0:IOReadaddress{anythingbutIOReadDone}*w+1:IOReadDonedatav.IOWritewritesasinglewordintheI/Oaddressspace.IftheaddressedSlaverespondswithIOWriteDoneincycle1asshown,thenthenexttransactioncancommenceinthenextcycle.OtherwisethenexttransactionwaitsuntilthecycleaftertheSlaveresponds.0:IOWriteaddress1:IOWriteDonedataInterprocessoratomicityisimplementedbyallowingtheM-busmastertolockthearbiter.Thereisnoautomatictimeout,soprocessors(orcaches)mustbedesignedsotheycannotlocktoolong.CachesDragoncaches,likeothercaches,havetworelatedpurposes.Acacheprovidesitsprocessorwithfasteraccesstomostfrequentlyaccesseddatathanthememorysystemcould.ThisisdoublyeffectiveinDragon'scase,becauseevenwhenaDragoncachedoesnotcontainthedatarequestedbyitsprocessor,itcanoftenimplementbyitselfthevirtual-to-realmappingthatisrequiredinordertofindthatdata.ItcandothisbecauseoursolutiontothecacheconsistencyproblemrequiresthataDragoncacheentrycontainnotonlyvirtualaddressanddata,butrealpagenumberaswell,whichisnearlyalltheinformationinthemappingtableentry.Byaddingtherestofthemappinginformationtoeachcacheentry,weenabledourcachetomapanyvirtualaddressforwhichitcontainsanentryinthesamepage.ThesecondpurposeofthecachesistoreducetheM-busbandwidthneededbyasingleprocessor.ThisistheonlywaythatmodestnumbersofprocessorscanshareacommonM-buswithoutcripplingmutualinterference.CacheArchitectureADragoncacheintermediatesbetweenaprocessor'sP-bus(thecache's"front"door)andthesystem'scommonM-bus(thecache's"back"door).Acachecontains64associativeentries,eachofwhichcomprisesthevirtualpagenumber,therealpagenumber,theaddressoftheblockwithinthepage,theblockof1632-bitwordsofdata,andvariouspropertybitsforthepageandforeachgroupoffourwordstg/E|_,'z_,_,5{_,|]>'{]>]> |[P'{[P[P |Yb'{YbYb |Wt' {Wt#|U  #_$(,-085:?FS'"%f*+/1s67H;*=@CR8'!#%)r.0379&; P'- G#&),.2K7:>7z?PP@e|N'#M&)*-0L'zLL5 {L t|K '{K K  |I+"&g(*t-A2+G2'zG2G2Y{G2|ED' CV'3 {CV#|Ah*=w#s' (+;.379W;?' # ,.P13 49u<?>' u!S(*-205<\? y%'n-+0S5V8}<='9zg")3+<-# 347;9%c%d(F)/j1-469<@#`?#%a'+ 3 8> @"C"\%j).1D47 9< p" )+M-3x6;h=R? %'f*.249<@oM $-%*,0j24x7u="I&').w0o|14;_@"+5I  b"&-).,%168 ?wZ|l|"(Y- {5 |-* $o)+ 158=J?D"&).S27w8<   !&1,/B373<?g  s"&z*-0268M:&> I.!$V&)(,/\159:>ZTVm$ 7withintheblock.Acache'sP-busconsistsofa32-bitphase-multiplexedvirtualaddress/datafield,aparitybit,acommandfield,areadybit,andanerrorfield.ThereisonlyoneprocessoronaP-bus,anditisalwaysmaster.Oneormorecaches,andthefloating-pointacceleratorchip(s),areslavesontheP-bus.AP-busslavedecodesitsinstructionsfromacombinationoftheP-buscommandandvirtualaddressfields.Commandstocachesinclude{Normal|Locked|IO}{Read|Write}.Aslaverespondsbysettingthereadybitandtheerrorfield.Whenacachecontainsdatabeingreadorwritten,itassertsreadyinthephasefollowingthereadorwriteoperation,sothatinthebest(and,wehope,mostlikely)case,acachereadorwritetakesonemachinecycle.CacheerrorsincludePageFaultandWriteProtectFault.Atthemoment,thecachereplacementalgorithmreplacesentriesinstrictlysequentialorderwithoutregardtowhethertheirdataisdirty,althoughthismaychange.Acache'scommanddecodingcanberestrictedtoasubsetofvirtualaddressesviaabit-serialinitialsetuppath.ThismechanismallowsmultiplecachechipstobeattachedtoaP-bus,therebyimplementingalargerbutset-associativecache,andimprovingthehitratio.CacheConsistencyAsystemofcachesisconsistent,orcoherent,ifallcopiesofthevalueineachmemorylocationareidentical.AnobviouswaytomaintainconsistencyistobroadcastallstorestotheM-bus.Sincetypicallyone-thirdofdatareferencesarestores,atbestthissolutionreducestheM-busbandwidthrequirementsofaprocessorbyafactorofthree,independentofcachesize.ThisfactorisnotlargeenoughforDragon.AbettersolutionistobroadcastontheM-busonlythosestoresthataffectdataexistinginmorethanonecache.DashiellandThackeratXeroxwereamongthefirsttodiscoverawaytoimplementthissolution,andGoodmanattheUniversityofWisconsindiscoveredasimilaroneataboutthesametime.Bothsolutionsdependonhavingasecondassociatorineachcache,calledtheback-doorassociator,orthe"snoopy"associator.ThissecondassociatormonitorsM-busreads,andvolunteersdatathatitknows.Ifthedatacomesfromanothercache,insteadoffromthemainmemory,thenallcachesinvolvedmarktheircopiesasshared.AnychangetocachedatathatismarkedsharedisbroadcasttotheM-bus.TheDragonimplementationusesaSharedDatalineintheM-bus,andaSharedflagineachcacheentry.DuringtheReadQuadM-bustransaction,iftherealaddressofsomeentryinanon-mastercachematchestherealaddressontheM-bus,thatcachesetsthatentry'sSharedbit,assertsSharedDataontheM-Bus,andprovidesthatentry'sdataduringdatatransport.DuringWriteSingle(causedbyaprocessorstoringtoacacheentrywhoseSharedflagisset),iftherealaddressofsomeentryinaslavecachematchestherealaddressontheM-bus,thatcacheassertsSharedData,andoverwritesthatentry'sdataduringdatatransport.ThemastercacheentryalwaysloadsitsSharedflagfromthevalueoftheSharedDataM-Busline.Inthismanneracacheisconstantlyawareofshareddata,andeventuallydiscoverswhensharingstops.CachePerformancetg/E|_/'\v `"#&2M6 >[ ':"/#t'Z),.2V6:<=?rYeB  "7#(.137<?W x U#,%w)Y+W-2Y37;@CV X #$'b+Z149>To:Z!I"'(-w.ToTo.|To0f2\6 6:=>R1!u#&c(,N0t459?Q BC#'),<0[69G<>Oy ] $>&_*#-x146 9<>M` # '+2Z5KVuv# +17 ;=I k#$*f-126kRDfww+D|D#[(| 1\269SC` {? w:* #&+/ 1 7- >@o9#(3.wr9#|9# $(v.4:69. ?7|W#w&D7||7|'V*P1 9S:< 5  #%D),/357:?4-1%JO%o'k).1*48;X?0 ha$0),r137;%?.c( %(g. 07_9;p , $')V-j/38&/K^;#y(,/35j:=v$w\a/ %&l,.Ew0$w|$w1!n_ #0&'7 .1S2w5E!|!6V9w:|;?Pt!$Y( 0d2w5|69<^-8 '~,M 38:< x+Y*!&)]*-/&14F95:>pzm"q';)"w+a|,s/2i6:D h  `#(N+ 25:>O} ;#&)+- 5u:=?|> !4%?' +.1 8n>ir{ z  TVm$8Theultimategoalisfortheentiresystemtoexecuteasetofprogramsasfastaspossible.Systemperformanceisdeterminedbytheinteractionofprocessors,caches,andM-bus,sowecannotdiscusscacheperformanceinisolation.Althoughwehaven'ttalkedmuchaboutaDragonprocessoryet,weconcentrateourattentiononhowourcachesbehavewhenattachedtoaDragonprocessor,sincewebelievethatthatbehaviorwillhavethegreatestinfluenceontotalsystemperformance.Othercontributionswillcomefromotherprocessors:themapprocessor,thedisplaycontroller,andtheI/Oprocessor.ADragonprocessorattachestotwocaches.Itfetchesitsinstructionsfromone,andfetchesandstoresitsdataintheother.Weexpectthesetwocachestodiffersignificantlyinsuchperformancestatisticsashitratioandaveragemachinecyclesbetweenreferences,becausetheprocessorusesthetwocachesquitedifferently.AtthemomenteachDragoncacheisplannedtocontain64entries,eachentryincludingavirtualaddress(divisiblebyfourwords),acomparator,andfourdatawords.Thatisafairlysmallcache,andweareconsideringenlargingit,butfordefinitenesswewillassumethatsizethroughouttherestofthissection.Ourowndataaboutcacheperformancearefromtwosources.Oneisacycle-by-cycleDragonsimulation.FromthissimulationweknowquitealotabouttheDragon'sperformanceonafewsimplebenchmarkprograms,notablyBaskett'swidely-circulatedPuzzle.SomestatisticsforPuzzlearelikelytoapplyaswellformore"typical"programs(likecompilersortextformatters),whileothersprobablywillnot.WeexpectthatPuzzle'saverageof7machinecyclesperdatacachereferenceand1.8machinecyclesperinstructioncachereferenceisfairlyrobust.Ontheotherhand,weexpectthatcachehitratioswillvaryconsiderablyfrombenchmarktobenchmark.Asanextremeexample,thePuzzleinstructioncachehitratiowasveryhighbecausenearlytheentireprogramfitwithintheinstructioncache.Onewouldnotexpectsuchbehaviorfromamorecomplexprogram.Ideally,cycle-by-cyclesimulationwouldtelluseveryperformancestatisticweneed.Practically,itsuffersfromtwolimitations.OurDragonCedarcompilerisnotready,sonotmuchDragoncodeexists,andwhatdoesexistishand-coded.Furthermore,evenifwehadplentyofinterestingprogramsexpressedasDragoncode,oursimulator'sspeedlimitsthenumberofinstructionswecansimulate.OurotherperformancedataaremeasurementstakenbyN.Suzukitwoyearsago.HemodifiedourstandardDoradomicrocodetocollectstatisticsonDragoncacheperformance.HepretendedthattheDragonwasexecutingtheDoradoMesainstructionset,andhecollectedstatisticsonlyfordatareferences.Thetwokindsofperformancedatamakequitedifferentassumptions,sothefactthattheyagreeverycloselyonPuzzle'saveragedatacachehitratio(89.7%vs.89.8%)shouldberegardedasaccidental.Suzuki'sdataalsoindicatesthattheMesacompiler(a"typical"program)hasadatacachehitratiothatis7%worsethanPuzzle.Thissuggeststhata"typical"programmightachieveanaverage83%datacachehitratio.Ifweassumethatacacheoperationhappensonceeverysevencycles,andthattheaveragecachemisscostssixcycles,thenDragonshouldrun14%slowerona"typical"programwithourdatacachethanwithaninfinitedatacachethatnevermisses.Performancewilllikelybeworsefortheinstructioncachethanforthedatacache.Unfortunately,atthemomentallourinstructioncacheperformancedatafor"typical"programsissecond-hand.J.SmithandJ.Goodmanmeasuredaverageinstructioncachehitratiosoverthreeprograms,NROFF,CACHE,andCOMPACT,alltg/E|_/~ #8'3+-z2~357=?]A/ $d& -{/28 9D;. [G#v(+ 35 ;Z9!"'-0b2 9SM m # )Q+-M03q8>'L*O ( H"(+.058 @#J'zu"&#w(*>/05z8< HK(   #() 1r47: >G4Z9" $h +136"8k ?EV$ "%X'),SC/# ,/.25;?6@Ak  $L($* 137o:<(><?P: L!a"%H)07j" (j*//15o75;<?R9&v"%b ,-/572:?7MR~!*$&*h,/ 7:5 "U(;*/ 69<"?r40:m!%*,03c :O?2"&'H*0_0   &b*-$/2 :?.fS j i#& .16:@,z#'-+k.T15E8:E + r!%' .4;L=)p #&+4, 4L6s8&f #&" /83w54&|&67;>%N#N(v/A1 5z :=# y$8&)X.P079q>t! 2" '+-Q0V g< %9(],1/5 =?sB\"$*4/>2N6%8M;@ 5Oq! (b-039K<>t"%&*.0c368l:>,P &,(0)57S$#&)+0/3H8:<?02;f:!$'u*{.I1q46;K>RI{  "&)C+ 26:"<?&u < %'*[ 1+4 <?  w% | &j'q+w. | / 06=  "&aw, -|0 w1 2|5 6w9 :V|?C @. |TVm$K9compiledfromCandrunningonaVAX-11underUNIX.Theirresultsindicatea90%hitratioforaDragoncache.Basedontheirmeasurements,theyarguethatcachesizeisthemostimportantfactorindetermininghitratio,andthatforcachesthesizeofours,randomreplacementisbetterthanLRUorFIFO,anddirectmappingisnearlyasgoodasfullassociativelookup.D.Patterson,etal.,measuredthehitratioofasimulatedRISCIIexecutingthePortableCcompilerwithacachesimilarto(butwithanarrowerentrythan)ours.Theirresultsindicatedahitratioof87%.Thissuggeststhatourinstructioncachehitratiowillbesomewhatbetterthanourdatacachehitratio.Ontheotherhand,wewilllikelyspendmanymorecycleswaitingforinstructioncachemissesthanfordatacachemisses,becauseoursimulationindicatesthatDragonaccessestheinstructioncachealmostfourtimesasoftenasthedatacache.Ifourcachehitratiosprovetobeunacceptablylow,wehavetheboardspaceandthearchitecturalflexibilitytoattachtwomorecachechipstoeachprocessor.Atthemoment,weexpecttoachieveinstruction/datacachebalance(suchthatgivinganewcacheentrytoeithercachesavesequalnumbersofcycles)withtwiceasmanyentriesforinstructionsasfordata.AnimportantissueishowmanyDragonprocessorsanM-buscansupport.ThenumbersabovesuggestthatasingleDragonprocessor,withonecachechipforinstructionsandonefordata,insteadystateonatypicalprogram,islikelytomissthecacheaboutonceevery13machinecycles.Sinceaquadwordreadtakesaminimum(andtheaverageisprobablyquiteneartheminimum)offivemachinecycles,thelimitofsystemthroughputis2.6processors'worth,independentofhowmanyprocessorsareavailable.AccordingtoSmithandGoodman'sdata,thislimitwouldincreaseto7.2ifasecondcachechipforinstructionswereaddedtoeachprocessor.ThisisthemainargumentforincreasingthesizeofDragon'scaches.Virtual-to-RealMappingAsmentionedabove,addressesontheM-busarereal.Cachescaninterceptandimplementmanyofthevirtual-to-realmappingoperationsrequiredbytheirprocessors.Butwhathappenswhentheycan't?Whenitneedstoknowmapinformation,acachedoesanIOReadfromanaddressthatincludesthevirtualpagenumbertobemapped.(Thisisonereasonwhyweneeda32-bitI/Oaddressspace.)ThemapprocessorrespondstotheIOReadbyprovidingthemapinformation.Whileformulatingitsresponse,themapprocessormaydowiththeM-busasitpleases,aslongasIOReadDoneisneverasserteduntiltheresponseisready.Thismeansthattherealmemoryitselfcanholdthemaptableentriesforthosepagesofvirtualmemorythatarepresentinrealmemory.Oneideaistoorganizetheresidentmapasahashtableoccupyingaconstantsmallfractionofrealmemory.Therewouldbeanentryinthemapcorrespondingtoeachvirtualpagepresentinrealmemory.Thehashfunctionwouldprobablybethelow-orderbitsofvirtualpagenumber.Collisionswouldberesolvedbyrootingabalancedtreeofmapentries,keyedbyvirtualpagenumber,atthesiteofacollisioninthehashtable.Othernodesofthebalancedtreewouldbestoredinotherwiseemptylocationsinthehashtable.Thisyieldsanexpectednumberofprobesarbitrarilycloseto1,dependingonhowbigandemptythehashtableis.Italsoyieldslog(virtualpagecount/hashtablesize)probesintheworstcase.Otherideasarealsoundertg/E|_/w_/|_/"&$Mw%_/_/&b|( _/_/(*w._/_//|1_/_/3W77;@]($ A%$)D+h. 89;g?P[_/"&( 0L2l68;=Z9/o %&*w-Z9Z9.|Z90w2kZ9Z93$|5$Z9Z958S hF #')-/18V<]?Q <!%c)]+.x2[6:p>'PM Q$ ' )+/4 9;e N"$ +/|47%:<@pLAFJ/#u%-'( /z2|47:g>iH6 : "$O(|+G.268c; G4 u%A//27;K=E#R&*{016&91<>SCR  Ak/w!%t*` 0279?F?!"h&+ 2c58<?> F!X# 'J*,-2089=Q?9=C@:T 8!'+T.179<9&8* &e'* 15e =Z?7 M "E(*.1n8<2>54lR$U(E+y- 5]8=>40 W 4&^( /1q45;z/9R|+F%y'*.149( UT!%x(&!epG!0$t ,-15#7<<@-$y=#!&|+-/59;="qe %+0.@17=?!+_!d )- 5 6<?Eq 7$&(-8/ 2K4 <>S`e a  &)q+/1r47~]!7#(w*1,3T6d9i:</T\9") *"/3 89^2! #(!*$.Z06V:@p&_!#&(.I04 ;A>@ $&x)-"/p037:>  G"$'p+0 4a8>:> }TVm$d10consideration,andtheprototypemachinewillprobablyhaveamapprocessorthatimplementstheidentityfunction.Itissometimesnecessarytonotifyallcachesthatthemaphaschanged,andChangeFlagsisaspecialM-bustransactionforthispurpose.ChangeFlagsbehaveslikeanIOWritewithnodatathatrequiresnoacknowledgment.ThemostcommonuseofChangeFlagsistobreakthevirtual/realassociationofarealpageinallcaches.Thisisdonejustbeforewritingapagetodisk,forexample,topreventprocessorsfromchangingdataasitisbeingwritten.Suzuki'smeasurementsindicatethatabout3%ofthedatareferencesofatypicalprogramrequireamaptransactionontheM-bus.Ifthemapprocessorisdesignedtocacheanysignificantfractionofrecentlyactivemapentries,itwillrespondtomostinquiriesintwocycles.Fromthis,andtheaverageof7machinecyclespermemoryoperation,weexpecttheaveragecostofmappingtobelessthan1%inexecutionspeed.DragonProcessorsInprevioussectionswehavedealtwiththeM-busthatbindsDragontogether,andwiththecachesthatattachtotheM-bus.Inthissection,wedealwiththeprocessors,seeninfigure1,thatattachtothecaches.OneoftheseistheI/Oprocessor,acommercialmicroprocessorthatinterfacestheM-bustoaVMEbus(anindustry-standard32-bitmicrocomputerbus)fordealingwithpurchasedI/Ocontrollers.Oneisthemappingprocessor,andoneisthedisplaycontroller.TheremainingonesareidenticalDragonprocessors,andtheyarethetopicofthissection.DragonInstructionSetTheDragonprocessorhasanewinstructionset.Thisinstructionsethadtobeconsiderablyreducedinsemanticcomplexityfrompreviousmachines,sothatwecouldimplementitinafastpipelinedtwo-chipengine,andhandletheresultinginter-instructionpipelinehazards.Atthesametime,wewantednottolosethecodedensitythatMesahadachieved.ThevariablestateofaDragonprocessorcanbeviewedasconsistingofthefollowingregisters:PC,a32-bitbyteprogramcounter.R,a128-wordx32-bitoperandstackcache.L,a7-bitregisterthatindexesR.ListhebaseofawindowofsixteenregistersinRthatareeasytoaddress.S,a7-bitregisterthatindexesR.SisthetopoftheexpressionevaluationstackinR.Manyinstructionsimplicitly(orexplicitly)addressR[S-2]..R[S+1],andincrementSby[-2..1].SLimit,a7-bitregisterthatiscomparedagainstStoknowwhentheoperandcacheisgettingclosetofull.A,a16-wordx32-bitgroupofeasily-addressedglobalvariables.IStack,acontrolstackcacheupto16elementslongcontainingsavedPCandLvalues.afewotherspecial-purposeregistersthatcontainfielddescriptors,quotienttg/E|_/ #C(+1s469-?P] L[ ;!#').B11369?gYe 6" )+.4 <Wn #(*6I9J@.To]2$ (*8-/|25O;H=R z!#p$&2)PL $'+~-/K14 :<=Nr;f $~&w(N|N)-/f14:@pHyAj |>r" I#&*,p037c3t59 :=*z0 |-xs!#& ,/2 9;>@-+ mC$ +H.4;=?*Bg?!R'-k2]5>9&n$x[; %+.057` =?" #p'[!S&*)W!')+t-/2468;=_'7"$')qK:!&(*+-0x2A4 ;{ 'D % , 36U =4#'!#*H+-5Lc!b$%u+0M13 6:n<'^ $W&u $&]0r4 O $u&{()/2 9`=?g ' J$*,15? < TVm$11orproduct,faultedmemoryaddresses,whethertheRescheduleinterruptisenabled,etc.Dragoninstructionsconsistofaone-byteopcodeandzero,one,two,orfourbytesofadditionaloperands.Theopcodesnowdefinedfallintoseveralgeneralcategories:Unconditionaljumpsorcallsto{I|[S]},returns.Conditionalrelativejumps,decidedbycomparing[S]against{I|[S]|[L]|[A]}.Memoryreads{[S]|[L]|[A]}:=[{[S]|[L]|[A]}+I]^Memorywrites[{[S]|[L]|[A]}+I]^:={[S]|[L]|[A]}Densely-encodedoperationsthatproduceanewtop-of-stack[S]:={I|C|[L]|[S]opI|[S]op[S-1]}Densely-encodedoperationsthatcopytop-of-stacktoregisterwindow[L]:=[S].Generalthree-operandinstructionswithlessdensity{[S]|[L]|[A]}:={[S]|[L]|[A]}op{[S]|[L]|[A]}Inthistable,Istandsforimmediatedata(ofseverallengths)withintheinstruction,andCisagroupof12registerscontainingpopularconstantslike0and-1.Thenotation{a|b}means"eitheralternativeaoralternativeb",[a]meansthecontentsofregistera,anda^meansthevalueinmemorylocationa.TheremainingopcodesarecalledXOP'sandtreatedasprocedurecallstoavectorofruntimesupportroutines.Sucharound-tripprocedurecallcostsfourmachinecycles,soitisreasonabletoimplementmanycomplexoperationsbyXOP'sinsteadofbyspecialcontrolwithintheprocessor,exceptforthepurgativeeffectsoftheirimplementationsontheinstructioncache.Dragon'sstackcacheRissimilartothestackcacheintheCmachineproposedbyDitzel,etal.,andreminiscentof(butmoreflexible,efficient,andcomplicatedthan)thatintheRISC.Themostinteresting(asestimatedbythecompiler)parameters,localvariables,temporariesoftheseveralmost-recently-calledandstill-activeproceduresoftheexecutingprocessarekeptwithinR.Onoverflow,theleastrecentlyusedregistersinRareflushedtomemory,andonunderflow,theyarereloadedfrommemory.Thisoverflowandunderflowdetectionisautomatic(usingSLimitandIStack),andflushingandreloadingisimplementedbycodeintraphandlers.Acallingprocedurecansendparameterstoacalledproceduresimplybypushingthemonthestack.ThefirstinstructionofthecalledprocedureloadsLwithS-(numberofparameters-1),andtheparametersbecomeitsfirstfewlocalvariables.Thecalledprocedurereturnsresultstoitscallingprocedurebystoringtheminitsfirstfewlocalvariables.TheReturninstructionthenloadsSwithL+(numberofresults-1),placingtheresultsontopofthecaller'sstack.IStackisacacheforcontrolasRisfordata.Itisalsoflushedandrefilledfrommemoryviatraphandlers,anditspurposeistoaccelerateCallandReturninstructions.Conditionaljumpscomeintwoforms,onepredictedtojump,andonepredictedtg/E|_/'$* 17: ]'Z#[  ="$%+03}6:=8?Ye  "4%4*-258=KW U  5!%&w'U|U()w*U|U*+-,S $&)\+l2Ow2S|S3R4M8w:GS|S;J<:w9?)w?S|S@FAAR:'wR:|R:]EPLN]'[wN]|N]^E"wN]|N]>%weN]|N]8 !T#$%w&2N]|N]&'(w(N]|N])*+uw+N]|N],-w/N]|N]1LoJ'CwJ|JE, wlJ|J% wLJ|J !w#J|J$&"()w*J|J*+,|w,J|J-./\w/J|J01zH $n'<,-0 F'wF|F)owF|FwnF|F w F|F!"#zw#F|F$|%dw'jF|F(X)5w)F|F*7+-%w-F|F.(09D $n'<* 238B'wB|BC*wB|B@Q !2 (+.B>'[w>|>^E"w>|>>%we>|>8 !T#$w%K>|>%&'w(>|>()*w*>|>+,-/1w1>|>233w4E>|>456w7%>|>782?g9"h?+^#(; .01 8:;<=7{AiH!#[')-/V494c #iw'44(h|)44*D+.j3:5;?@3X6&L)+J 2)8;?1bn' #%,}0D5 e11?9|@11A0 </ K$& -@136<@e.b) '+w (!&'*-m1025+6< *?8c #%(,u1 7:q ( #1& -0i696;& S (*-2?g%I _ a n&+b-057c9?#%T!L"%)+1@35 <?!3!L&)0H6:7= S$H'6-~/ 79=V?:0 K #% *+,0g6;<" )+.29w=}|?w|T S#%e ,i1|3T6*8< :f!*%'I)4-4=6I:>@C !&U -E0w4(|5aw8|9J@e -U "% &)9-p#%'m(+W/x125:=al^e#p&K(I-/>1 7:=k!   "6$(+{1s369z<  ^TVm$12nottojump.Theprocessorinitiallybehavesasiftheconditionaljumpwillhappeninthepredicteddirection,andusesthepipelinebackuphardware(thatmustbepresenttohandlepagefaults)torecoverfrommis-predictions.Correctly-predictedcontroltransfersofallkindstakeonemachinecycle,plusinstructionbufferrefilltime.Mis-predictedconditionaljumpscostfourcyclesplusrefilltime.Thisputsarealpremiumoncorrectprediction,aninterestingchallengeforcompilerwriters.TocheckwhetherdensecodewasstillpossibleinDragon'sinstructionset,wetranslatedtheCedarversionofPuzzlebyhand,tryingtosimulateaveragecompilertechnology.Hereistheresult,comparedwithsimilarresultsforothermachines(32-bitaddressesunlessnoted):CodebytesMachineLanguageRemarks1214AltoMesa(early)16-bitaddress,nochecks1369LilithModula216-bitaddress,nochecks1534DoradoCedarMixed16/32-bitaddress,range&NILchecks1938DragonCedarRange&NILchecks2080VAX11/780CPointersinsteadofsubscripts,nochecks250068000CNochecks2736RISCCNochecks3592MIPSPascalGlobaloptimization,rangechecks4744IBM370PascalAtthismomentonlyabouthalfofthe256possibleDragonopcodesaredefined;therestareimplementedasXOP's.AswefurthertuneDragon'sinstructionsetweexpectcodedensitytoimprove.StructureAstandardDragonprocessorclusterconsistsoffiveorsixlargechips.TwoofthesechipsarecachesthatmediatebetweentheprocessorandtheM-bus,oneforinstructionsandonefordata.Oneortwoarefloating-pointacceleratorchips.TheremainingtwochipsformtheheartoftheDragonfixed-pointexecutionengine.Thetwochipsthatconstitutethefixed-pointexecutionengineformasinglelogicalentity,shownschematicallyinFigure2.Instructionsanddataflowdownwardinthisdiagram.Dragonusesatwo-phaseclock;theheavyhorizontallinesmarkedAindicatelatchesthatareopenduringtheAphase,whilethosemarkedBshowlatchesopenduringB.OneofthechipsistheInstructionFetchUnit(IFU),whichreadsmachineinstructions,decodesthem,expandthemintomicroinstructions,andcontrolsthepipeliningofthesemicroinstructions.Theotherchipistheexecutionunit(EU)thatcontainsa160-register3-portdatamemory,apipelinedALU/Masker/Shifterwithhardwaretoassistmultiplyanddivide,theaddressanddatapathwaytothedatacache(butnotitscontrol),andseveralmultiplexerstoselectoperandsandimplementpipelineshort-circuits.Theheavyjaggeddiagonallinedenotestheinterfacebetweenthetwochips.Thelineisshownasdiagonalbecauseasamicroinstructionprogressesthroughthepipeline,ateachstagethereislessworkbeingdoneintheIFUandmorebeingdoneintheEUforthatinstruction.Aswescanfigure2fromtoptobottom,thefirstmajorblockweencounterisIFetch.Thisblockcontainsanasynchronously-filled6-wordbufferofinstructions.tg/E|_/Qj< J%C*u,-x/ 6:=3]J "%(p-29#<@-[oG!#9(+6 Z9\2>!$$&,0u3 :>Xe 6 "c&)`,j0g3f6:=@V$ % & -36;To $O').06 =x?R , e"&t(m,B0B17eSF_wzF_F_b|F_ DD8D#D-23w3DD4|D6ZCwC8|C.wC#|C-3+79x @Ai??8w?#|?-0F>w>8>#|>-0FS:9#w9#8|9#9##6\"]%&(+M0l5I:t<5GH w S55!'|"55#$' )7-16 =?3Y2y7z0 |,7#(g-m/-1359F=9@e+vs$*b,25w8+|+9(=*?)l =!$&j)+a 3 :?F'c !j$&) . 5;R%Iy6- $' .Z49P<>3# >q #%P)+} 25}8g;e!0a`!X")-/n3d 9=w S| SabX#l(w* S| S,0i4D8 w=1 S| S>mwY|z "?$(;), 36:4w:;|<=/> %(-1b4A?g]( (".J11479z;w|) a"` */3p: < w|T %.&*0S3-7:Y?gg!R#J&q)+038k @o"'1!48<r<$'T*%.146H:<q! ',/J469=)@_tc"%>(,j/1Y3686: NC!$,%+ -o0N4J8:?@ 3U "/4p8:g  TVm$13Whenawholeinstructionisready,itisbyte-alignedbyIFetcherandmadeavailabletoLevel0.PipelineLevel0generatesfromeachinstructionasequence(usuallyoflengthone)offairlywidemicroinstructions.TheregistersholdingLandSareatthislevel,andmidwaythroughthislevelthetwofetchaddressesfortheEURAMareformed.Unconditionaljumps,callstoliteraladdresses,andreturnsarecompletelyimplementedatLevel0.Theseoperationstakeonemachinecycleatthatlevel,plusthetimeneededtorefilltheinstructionbufferfromthetargetaddress,whichisonecycleinthebestcase.InLevel1thetwoEURAMfetchaddressespasstotheEU,andsomewhatlaterliteraldatafromtheinstruction,andinformationaboutthesourceoftheALUoperands,passtotheEU.InLevel2theALUoperationandinformationaboutwhatBooleanconditionisofinterestpassestotheEU,andsomewhatlatertheEUdoestheALUoperationandcomputestheBooleancondition.InLevel3adatacacheoperationisinitiatedifappropriate.TheaddresscameoutoftheALUinthepreviouslevel.ThewidedatapathfromEUtoIFUattheendofLevel3cancarrynewcontentsforanIFUregister,suchastheprogramcounter.WechosetoputdatacacheoperationsinapipelinestagefollowingtheALU,ratherthaninthesamepipelinestage,eventhoughthepipelineislongerthatway.Wedidthisbecauseaverylargefractionofmemoryinstructionsuseanaddressformedas"baseregisterplusoffset",sothecacheoperationisprecededbyaddressarithmetic.IfALUanddatacacheoperationswereininthesamepipelinestage,suchmemoryinstructionswouldrequireanextracycle.Amicroinstructioncanfailduetoapagefault,anarithmeticoverflow,ortherealizationthataconditionaljumphasbeenmis-predicted.BetweenLevel3andLevel4,acrucialdecisionismade.Onlybythispointisitcertainwhetheramicroinstructionwillcompletesuccessfully.Ifitcannot,thennotonlythatmicroinstruction,butallpreviousmicroinstructionsfortheinstructionthatgeneratedit,aswellasallmicroinstructionsforfollowinginstructions,mustbeaborted.Theprocessormustattainastateasifallmicroinstructionsforpreviousinstructionshadsuccessfullyexitedthepipeline(whichtheyhave),andallmicroinstructionsfortheinstructioncurrentlyatLevel3-4hadneverenteredthepipeline(acondition,alas,quitecontrarytofact).InLevel4,successfulinstructionsstorenewdatabackintheEURAM(seeLevel1),andsuccessfulcallsorreturnsarerecordedintheIStack.PipeliningDragonderivesmuchofitsspeedfromitsIFUandEUpipelining,andthefactthataverageinstructions(inPuzzle,forexample)operatewithapipelineheadwayoflessthantwocycles,eventhoughthepipelineisfourcycleslong.Unfortunatelythisspeedisnotfree.Wemustdetectandresolveinter-stagedependencies(hazards).Inaddition,todealwithfaultsindatacachereferences,arithmeticfaultsofvariouskinds,andmis-predictedconditionaljumps,weneedawaytoabortinstructions.Thisinvolvesrecoveringmuchoftheprocessorstatetoitsvaluethreepipelinestepsago.Hazardstg/E|_/  &$8%~& .06 8W[!#&{)/2 4\6:<UC  #G%O) 0Y3i8v;" S %i["N (+.s37M8;?$Q8S T '%+J.04: >?rPMUcFMwE $(O.13a58;L+F %(} 0366:<2>J")HY $' /?3/6;FaH!$*.0_268g;D#{ B?o5e T&y'-@. 69>@]1"&)-&003\69X; > ?>6"I'*>,S/488: <=Iu;"& -/b06 9?;~!C&*B-}2$4x9;?P9 {V "%](.M0?5 =@-8SJ"%*,~.28:;@"6 ! /"%) 0W35P69W<5K "&+-12= &"%b'(U+/J1H 7=?0  !$'b* 4:3>?g/:  C!&*,/3~5"6;n@-# ,=.2/5B8;?P+h#+-02R 9;*DOE#& ,$ 37T9N?F(Y-U "0#%|0A2|8 ?f& Tt"/&)-02=z?%N P!$#&N*"/!168 ?#l!,/  ')*-04F68;D?Qj  "'G)/V13cz6 |< "&T)+|.~1)3 :t=?q@ J %K'-257  T{  TVm$14Ourdesigndealswithinter-microinstructionpipelinehazardsinthreeways.Thesimplestistoavoidthem.MIPSavoidshazardsbyasetofprogrammingrulesthatarefollowedbythecompiler.WiththissolutioncomesthechoiceofaverysmartcompileroralotofextraneousNo-Op's.Wedidn'twanttointroducesuchrules(andtheirattendantincreaseincodesize)atthemachineinstructionlevel.Wedoavoidsomehazardsbyimplementingafewinfrequently-executedinstructionswithawfulsideeffects(StoreIFURegister,forexample)byacarefullycodedsequenceofmicroinstructionsthatclearsmicroinstructionsforprecedinginstructionsoutofsensitiveplacesinthepipelinebeforethekillermicroinstructiondoesitsdirtydeed,andthenpreventsmicroinstructionsforsucceedinginstructionsfromenteringsensitiveplacesinthepipelinewhilethedeedisstillinprogress.Anothercaseofhazardavoidanceisdelay-matchinginparallelpaths.Dragonhasasimpleloop-freepipelineinwhichnearlyeverycomputationalresource(gate,bus,etc.)isusedatonlyonepipelinestage.Thisgreatlysimplifiespipelinescheduling,butitalsomeansthatsomepathsthroughthepipelinearelongerthantheywouldotherwiseneedtobe.Forexample,considerthestore-backpathtotheEURAM.Forordinaryarithmetic,thiscouldbelocatedatLevel3.ButbecauseweputtheALUoperationsanddatacachereferencesintandem,ratherthaninparallel,fordatacachereferencestheEURAMstore-backpathhastobeatLevel4.AhazardisavoidedbyinsertinganextrapipelinestagebetweenALUoutputandEURAMstore-backpath,sothatitisalwaysusedatLevel4.AnotherwayDragondealswithhazardsistodetectthem,andtopreventsomeupperendangeredsegmentofthepipelinefromadvancing,whileallowingthelowerdanger-freesegmenttoadvancenormally.AmicrocodeNo-Op,whichchangesnostateandparticipatesinnohazards,isinsertedinthepipelinegapthuscreated.Theboundarybetweenthestationaryinitialpipelinesegmentandtheadvancingfinalpipelinesegmentischosenashighinthepipelineaspossible,butsothatatleastoneofthetwoparticipantsinanyhazardisonthestationarysideoftheboundary.InDragon,therearethreesuchboundaries.OneisatLevel0,anddealswithnextinstructionnotready,oraCall-Returnhazard.OneisatLevel1,anddealswiththosecasesofEURAMWrite-Readhazardthatcannotbeshort-circuited.AndoneisatStage3,anddealswithdatacachenotready.ThefinalwayDragondealswithhazardsistodetectthem,andtoachievethedesiredeffectinanotherway,withoutintroducingdelay.Thedatapathround-trip,fromtheEURAM,throughtheALU,throughabest-casedatacacheoperation(ortheparallelcollision-avoidingpipelinestage),andbackintotheEURAM,takesthreepipelinestages.DuringthattimeapotentialWrite-ReadhazardexistsontheEUregisterthatisspecifiedasthedestinationofamicroinstruction,becauseafollowingmicroinstructioncanreadthewrongvalue.ManyofthesehazardscanbehandledwithoutwastedcyclesbymeansofEUpipelineshort-circuits,whichareadditionaldatapathwaysintheEUthatcarrydatabackwardoneortwopipelinestages.Forexample,thereisashort-circuitfromtheoutputoftheALUbacktoeitherinput,sothataninstructionmayusethearithmeticresultofthepreviousinstruction.Theuseoftheseshort-circuitpathsisgovernedbytheIFU.Itmatchesthesourceregisteraddressesofaninstructionabouttoenterpipelinestage1againstthedestinationregisteraddressesofinstructionsfurtherdownthepipeline.Short-circuitpathsareexpensive,bothinareaandcomplexity,sowehaveincludedthemonlywheretheywillspeedupcommoninstructionsequences.InstructionAborttg/E|_/F O.38:>j]l"(&*W/Q1G2k465 >[2x%>(+05+7;=?Z9Sn &`,/368>X+I|~#%H(|+-/_4 ;?V (*J-: UCBSq"d%+.<4t68%>Sa!#'2M4y: QHfn  ]%),*/9=>PMPw(+ 1 9S<Nf~,!%'+J,/W1L* T&(< 138=J["-#',"/ 8>^H+O K#P(-1n6N <G4 !)$'+P028O:>EB!H$&,14Y :>?CuN %}('+-24]80:R<B>2c #'* 13v8=!@p@*nzM #&<(, 3_69:<>T>#\$ 8%'+04z:==H$ "&#&()z-126~: "G%Z*R+-\1c5<79>9% 2!2#(, 26<>T7~ =^")n+16:@5 tg$&+(,/4H69?F4/?[ #'-U258D>2E?!$&).0F58:f=J>0FQ !j$ ()+.W 47y92;/9`" *w-x.0v4=58|;?- ]n 'X,/1t3"78;p?+ov %+),P02!*i!J$'9(a. 5:=?!#o& -U/N0;@!!$'+2/35:8=@-4V6"T$w(*-r2;? q k"%(l,/5l80: < (!" *.)05&69j=@o>% % %*(X*- 4-8":< |FX (r,-k3Z5H7;g<6zM V"# *.0238X@Ze&])],/i48;C Vu"%. ,-/ 6:Oye k &5)602& :L*.v!$(m*!/34f78<?J& %'%( 27U<HP* "#$*-05s7;=?G4o*"$'+~04:?EU0$&C)*~/15f69>ChAji |!!$),1 59<?8. =$b&(,>0X28=>>l  #%W(,-0G5 :=%@/u0366:>R7~* #,'s*6,-R 59=_@-5 z2  |/8:&E &-* 1 8;-? o #(#*.P47|=v+z )$L +.0^4`6;8V ?@*B*" (, 4V7: ( UCc#)L~&w&~&w`&~&w:&&~&wD&~&w#&~&#w$&&%\~&(hw(&~&)j,,.w.&~&/w2&~&3:w5&~&68|# `!#(, 35o;>@!v["[ )-/58> U2K !u&'+-226R| #M),/3?4;8=? )'/"i%+%,138 >_DX" (-g1269< bU #&,F/047:@Ut^ %M(*/P16;@oi\ r"k% +w04?6=J oo! %+-x 47:>i #(*B 1*57n=sAJ!)#&)Z../2; 9+=von#(+*.12 9t;u= Q. "Z',13 48: Ay $'.135s;< TVm$(16theidleprocessthanifitisnot.Thismeansthatifthereareanyidleprocessors,oneofthembecomethescheduler,andtheotherswillonlyswitchprocessesifforcedbytheschedulercode.CedarThedesignersofDragonhadtheCedarlanguageuppermostinmindasthelanguagefromwhichDragonprogramswouldbecompiled.MuchoftheorganizationofDragonisintendedtooptimizetheexecutionofcommonCedarconstructs.Ofcourse,similarconstructscanbefoundinotherlanguagesaswell.HillarguesthattheBellLabsCmachine,withwhichDragonsharesmanyfeatures,isespeciallywellsuitedtoexecutingawidevarietyoflanguages.ProceduresandlocalframesCedarisasuccessorofPascal.Itdependsheavilyonthenotionthatabstractionisimplementedbyprocedurecall.AprocesscanbeviewedasaLIFOstackofsuspendedactivationrecordsandonecurrentactivationrecord.Eachactivationrecord(or"frame")typicallycontainsaprogramcounterandsomevariables.Tocallaprocedure,thecurrently-activeproceduregeneratesanewactivationrecordforthecalledprocedure,computessomearguments,storestheminthenewactivationrecord,suspendsitsownactivationrecord,andmakesthenewactivationrecordcurrent.Whenacurrentactivationfinishes,itsendsanumberofresultstothe(suspended)caller'sactivationrecord,destroysitsownactivationrecord,andmakesthecaller'sactivationrecordcurrent.ItisthiscomputationalparadigmforwhichDragonisheavilyoptimized.WithintheIFUchipisaLIFOcacheofprogramcountersandregisterindexeswithintheEUchipwherethevariablesofsuspendedactivationrecordsbegin.MostactivationrecordvariablescanbeimplementedasEUregisters,andparameterandresultpassingisdonebyindexadjustmentratherthandatacopying.SuchotherlanguagesasLispandSmalltalkalsofitthiscomputationalparadigmmoreorless,andDragonshouldbeaneffectiveengineforthemaswell.Butthereareothercomputationalparadigms,suchasProlog,forwhichanequivalentamountofhardware,organizeddifferently,wouldprobablydosignificantlybetterthanDragon.Strongtypingand"Safety"Cedarisastrongly-typedlanguagelikePascal.Thismeansthatthecompilercanknowagreatdealaboutwhatmighthappenduringexecution,andwhatcannot.Cedaralsohasagarbagecollector,likeLisp.Thismeansthatitmustbepossible,atrun-time,tofollowallpointers,andtoknowthesyntactictypesoftheirtargets.In"safe"Cedar(anautomatically-checkedsubsetofthelanguage)theuseofthegeneral"Address-Of"operatorisprohibited,asareoverlaidvariantrecords,andotherdevicesthatcanconfusethecompileraboutthesemanticsofexecution.Theresultisthatthecompilerknowsmanysituationswhenavariableinanactiveprocessdoesnotneedamemoryaddress.Suchvariablescanbeimplementedinmachineregisterswithoutfearthataprogrammightreferencethembymemoryaddress"alias"andgetaninconsistentcopy.tg/E|_/Q0 #'+n.K/3T58h;. ]rS & (+/26:@['"yUC|Q!$')+g1S8S:'=?PL~ s&\*`,&268:G N)$\&-.48 @ L !#'(,d24a8/:?PKV0B~!%*s.2?79 ?0ID#% zF` |C$v%+6/14 8f;" Ajv !w&#Aj|Aj',/S1l6<8w9dAjAj:|Aj<@e? J!T$B'$,% 284; >K!+&'-259- ?S+!Z %m)z,/( b#h& , .03 ;&m*#,%&,{0368A;>%M/  $').B0y46j =#C %*+0Q2 :>!zg)|`S !'*T/27 9@o[W "$I)n+l082 :l@Z9"W$*, /17=?:XG !$ .35$ =@V- "Tn  |!%#u(9 /2F3`7<RZ "a%'&*.+-/1Y47^;Q pvd &)/ 58:q Ox bL!%), .033 8:M %"J%e zH&|Een" &])-T0s5w < C  &)%/ 6 7:B>tB7 !!$&18k: w@ |;@@# ,/5O9 >1"&+20 7 @"=A ~ 6&2(*-p/28L>(;! %(W*.c 47:?r9"s (G)*/13h5;E 8Kw-#M),p18?<>6i0!{'*-~149 4 ww#44$U |4-/6 =3Uw3U3U |#93U3U#$({0!%',\w.600. |7008;e=/1r !& )+}1279h -wP-|-w-- g |-)+-1628;>+wF++ |+) '(*n-1&6;@e*;_!#$(,(275v:=(* !#)F.y024 <&&S "$]w&&&'_|&),.d4<,w?B&&@|%E B%o'(K-./39 ?#!& *!, 4 9 ;f=!%!(*01j2 9<  Ok2!$*,-03:t=yz|\ D!''-*/b16 7:?ac !%`w)*|.w0.0|3w45,|7Zw89|;w<=|AAw|&R #& )-u0 748>f-%v),13:$  N7#/ TVm$18ConclusionsWehavepresentedthecurrentstateofthedesignofahigh-performancegeneral-purposepersonalcomputersystemforuseinprogrammingresearch.Weareoptimisticthatitsperformancewillbequitegoodinasingle-processorconfiguration,anddependingonthetasksathandandthecachesizesused,modestnumbersofprocessorsinteractinginthesamevirtualaddressspacemaynotseriouslyimpedeeachother.Furtherconclusionsawaitfurtherexperiencewiththemachine.AcknowledgmentsDragonwasconceivedin1981byC.Thacker,B.Lampson,P.Petit,N.Wilhelm,andF.Baskett.Today'sarchitectsanddesignersincludeR.Atkinson,R.Barth,D.Curry,J.Gasbarro,L.Monier,M.Overton,andmyself.Whenthefinalstoryistold,therewillbemanyothers.GoodICdesigntoolsfromPARC'sComputerScienceLaboratory,andacooperativespiritandfunctionalsiliconfromPARC'sIntegratedCircuitLaboratory,havebeenandwillcontinuetobeessentialinwhateversuccesswemayachieve.Figure2wasprovidedbyD.Curry.BibliographyCensier,L.M.,andFeautrier,P.ANewSolutiontoCoherenceProblemsinMulticacheSystemsIEEETransactionsonComputers,C-27(12),pages1112-1118,December,1978.Ditzel,D,andH.R.McLellan.RegisterAllocationforFree:TheCMachineStackCacheACMSIGPLANNotices,XVII(4):48-56,1982.Goodman,JamesR.UsingCacheMemorytoReduceProcessor-MemoryTrafficComputerArchitectureSymposiumProceedings,IEEE,pages124-131,1983.Hennessy,John,withNormanJouppi,StevenPrzybylski,Christopher.Rowen,andThomasGross.DesignofaHigh-PerformanceVLSIProcessorThirdCaltechConferenceonVeryLargeScaleIntegration,pages33-54.EditedbyRandalBryant,ComputerSciencePress,1983.DescribestheStanfordMIPSmachine.Hill,DwightD.AnAnalysisofCMachineSupportforOtherBlock-StructuredLanguagesComputerArchitectureNews,publishedbyACMSpecialInterestGrouponComputerArchitecture,XI(4):6-16,1982.Kogge,PeterM.TheArchitectureofPipelinedMachinesMcGraw-Hill,1981.tg/Ey^ |[!h&* ,.3r56Y$)+.0^ 9?XX 5 &),V0`4R68*V W!#%o(*W-0269=UU x  #H$'*.37+:)= "<$-)+T1158 ;<:H!#y5 |2N\` %0_%+- 4{.  t|&..'p-R0 7>+up(# * g  !G$')U/2{(b|"(b(b"(,%$?5 "|'}3*{"I (3 |/'""047=] m %W*.u 5J =ue&){&| "j$='c+:. |5P&&6 9=%**.wo!|\*Ww|#(+/ 9{q !K|$e%i+w. .|1g6M;t@f !%)c W{ D {%|  TVm$19Patterson,DavidA.,andCarloH.Sequin.AVLSIRISCComputer,IEEE,XV(9):8-21,September,1982.Patterson,DavidA.,withPhilGarrison,MarkHill,DimitrisLioupis,ChrisNyberg,TimSippel,andKorbinVanDyke.ArchitectureofaVLSIInstructionCacheforaRISCComputerArchitectureSymposiumProceedings,IEEE,pages108-116,1983.Radin,George.The801MinicomputerIBMJournalofResearchandDevelopment,XXVII(3):237-246,1983.Sites,RichardHowtoUse1000RegistersCaltechConferenceonVLSI,pages527-532.EditedbyCharlesSeitz,1983.Smith,AlanJayCacheMemoriesACMComputingSurveys,XIV(3):473-530,1982.Smith,JamesE.,andJamesR.Goodman.AStudyofInstructionCacheOrganizationsandReplacementPoliciesComputerArchitectureSymposiumProceedings,IEEE,pages132-137,1983.tg/E|_/ e"|&F(B]E{[|[["`% -YF rwYF|YFv H#p&`,0W39>iWUN"'w*eU wUU|U ',.Ew/oUU0C{TAI (3 |/'TATA047=]QPr {NXkN"%] |NX-4z9KJa#{Ho   |$fHoHo%B) .249=EjD63{B|"MBB#)(U-?j U$m&R>MB !>%m .!0 9{<I (3 |/'<<047=]TVm$M20IFUEUDataCacheCacheInstr.P-busP-busaddr/datacontroladdr/datacontrolDragonProcessorM-busCacheBuscouplerVME BusI/OprocessorDiskinterfaceinterfaceEthernetetc.Mappingprocessorwith mapcacheDragon Processor ClusterDragon Processor Clusteretc.Memory ModuleModuleMemory etc.Display Controlleretc.D r a g o n S y s t e mF i g u r e 1Boot ROMFlt. Pt.tg/E4 U5H9$0]$ )$ )]$)2$r -9$2*-9$)2$r)]$)$:-$U:&$9))$))]$0)2$r)-9$*+**+*1)5H9$00$r)0]$)0$*34+ $1 $'s&$y's*+9$s.$9(-$(,$:&$90+G$ 6VG %NG %VG!%NGV. $V-$/O$/+ $$-$W-$U%:+G%:2%TVm$ p#0#4O#%#.:&:$& TVm$  4 30$r0$r/ *0X$)V$)V]$0V$r)[9$+WXI!W[9$(V$r!WV]$!WV$#Ye"sW(X$sX$s<"s=:]$Y$r:Ye $:Ye$s\-VZ[U$TU$:R$:R $R$r:W,$sUVSM:Q$L$r:L $:L$NU$OeHU$:F$:F $F$r:J$B:D$@$r:@e $:@e$BU$$O $0G3$$G $$G$ %:NI&WL'sJ(IH)G0K$ r7)G3!kG r!$)G r!$G03$2+k$"d\G3G@G@Gd \G +3$3$ GxGdxG3GG&c5WS$9;XIV$KO$9;Oey$9;Oe$ =T?R?G=I9;DH$ 9;DHy$KDl$9;M,V$5WH$5W=$9;BV$K9O$9;9+y$9;9+$ ?=9;$$ 9;$y$K$$9;-V$5W)$<)?9;$ 9;y$K$9;!V$5Wc$#%:VH*:.$:-$FTVm$j21LSBSLLSAALBBCABBIFetchSLLSSLLSSLResult1Result2Result3ABBALU / FieldABAAluAAluBStore3AddrCacheBRamcDataEUIFUABqq>>qrqABBAABBAABMacro Instruction AMicro Instruction BrABAAAABBAPLS-BLevel 2Level 3MuxTopBA>>AluOpMuxBotCacheCmndABAB>Wt to Cache>>ABBAACr3<>IStackLevel 4Level 0Level 1F i g u r e 2D r a g o n P r o c e s s o r P i p e l i n eX BusPCPCPCPCPCPCPCPCPCPC160 x 32 RAMtg/ETVm$p:::L3$9 L$] L $L3$9ML:L:TTS$9 S$]S$9UV$U $ysL3$9L$ TVm$p@tLL=$9 = $ =$]=$9GL3$9>L $>L$]EtLTVm$>pHN%G <]9G <] G #G G :l$9:l$9:l$9 :H $ :l$9 :H9$#:H$]#4$] 49$ 4$9 4 $4$94$9 4$9 6 G 69GTVm$p5k:5k:(( )y9G )y G '$9'$9'$9 'd $ '$9 'd9$#'d$]#,$] ,9$ -$9 , $-$9-$9 -$9 / G /9G-:-:!k!k "9G " G $9 $9 $9 $ $9 9$# $]  $2$92$9 2$9 # G: ?$ <$4/G=,$M<)$U=)$yF"GB#G TVm$ C;'G!GB$TVm$pM"M)2IX:TVm$A+G > TVm$pM.=/GM6M2F/GJTVm$"H<-?X-5!G4)yG4"GNTVm$"p?!k\TVm$> +Et%NG9A#G9A%GI#G98)GU8#GrI/OGA/OG?/Gd9;0$G8/G80G8kGrIGUF@GFkGAGA@UGDXkG`TVm$ypAfTVm$DXhTVm$pClTVm$N5NX8rTVm$p<"<)27/+$71$71d$C;/O$9B/O$H/O$K/O$9Ou%N$ 9I%+$Et $P$qN, POu%NX!{=6G=4G9;$V$K#$Ou ^4,$p<.D.Dx TVm$5'5-7/O$9 Gs G \rG Z$] Z$Z$9F-$9M<-$9DX-$9=-$9;t-$94-$9M<'$9='$9;t'$94'$9='d$4'd$76rG74$974r$;t4$];t:H$]7:Hr$7:l$97<]rGTVm$p<<&W,$]$,9$$-$9$/9G$<]9G$:l$9$:H9$&W:H$]&W4$]$49$$4$9$69G=)yG$)y9G$'$9$'d9$&W'd$]9;#$* $]( 9$( $9("9G;t $94 $94 $F $9F $M< $9="GDX$r=$rD"4$4$U;t$U4#G7TVm$5TVm$p<=\G?B2$9B$H2$9EtjG+W"G$yC;r$C;$UC;#rGH<=4$9M<4$9MS$S$9UGL$]:5k-(!k TVm$*;G)P:XZ$9Z$Z$]\G-\G GR< S $HR$OuO$UHOe$HOe$y'sGWy GW G<7G8:85H22] G<2]GTVm$xp932;<;6269G24$9249$44$]4:H$]2:H9$2:l$92<]9G's%9GVV4$9V?$ 9C;Nl$ Et7$UKPIQ3R$:XO$UOe$Oe$y HP F I$:F$U F$ F$y IG9?Gr \G9',<',6565< 6 < .  ""&)2&.2:Gq4%4c$5j$4G@$4G$@5G9;0kGr2$9 $9-$9'$94$9:l$9pV<$VG$@G@$:j$c$q%dG $$9 c$] c$ Gr G9$9p::j$ j$Vc$@Vc@$s$V$qVA$$V$9$r:$ %c$ j$ G@$ G$@: $ V G$@ V G@$s j$ Vc$ V % G G 9Gsc$@sc@$$s$sAGGpF$yF$:XF$UI$qKAKT BK'LEe$ $KK'LR$ V4Gr ?Gp :. :)2 :6 :<<6)2."L$]:MN%GL$:XL3$9VEe9$VXI9$:XS$9@TVm$&WTe&L&TVm$qB9_><]G(TVm$Et:>:H$]>:H $G:l$9@t:*TVm$$pG<A7$U:UU U M  sDl, TVm$.+GD%rG D%UG rD$ C$ rC@$ rC$@8TVm$5q r@pVE^KGPCOG9;CG9;AG97@r$;t>$97>$7>$]9;<G9OuPG3.2$q3*_p2=31t61t<.69G.4$9.49$04$]0:H$].:H9$.:l$9.<]9G0,$].,9$.-$9./9G/*$9/*$q2' 2+G.++G.%NG+!$+!$2%p.>3.$*/9G*-$9*,9$-,$]-'d$]*'d9$*'$9*)y9G*69G*4$9*49$-4$]-:H$]*:H9$*:l$9*<]9G*;=3$>O$=3-.-)2-<-6$"9G$ $9$ 9$&W $]%:$9%:V$q;tp* %:#$r%:&Gr$)#$UqAzs{ +r$#9G2$99$:$]p<6)2." 9$ $9"9G/9G-$9,9$:,$]:'d$]'d9$'$9)y9G69G4$949$:4$]::H$]:H9$:l$9<]9GF=TVm$!:TVm$p!W=3!$U>N$9>+U$$s$qsA$ r$$B k$$$]s9$s2$9s#9G!c$: $9k$rTVm$psP8& C;<$F1]I`TVm$+p T L : 5k - ( !k   VTVm$0A5H  TVm$22tg/ETVm$GATES  HELVETICA HELVETICA LAUREL TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMAN TIMESROMANYLOGO TIMESROMAN HELVETICAMATH HELVETICA HELVETICA HELVETICAY HELVETICA HELVETICA    # , 7 vC O cX c pm -w b cUj/[]<>dragonoverview.tioga$Friday, December 7, 1984 5:30 pm PST