THE ARCHITECTURE OF THE DRAGON

Louis Monier, Pradeep Sindhu
Computer Science Laboratory, Xerox PARC
3333 Coyote Hill Road, Palo Alto, CA 94304

Abstract

The Dragon is a 32-bit research workstation currently being designed at the Xerox Palo Alto Research Center. It is a closely-coupled multiprocessor with a novel architecture and is built using custom VLSI. We expect its performance to be in the 5-50 MIPS range depending on the configuration. Its primary use will be in investigating the applications of multiprocessing to the office environment.

Introduction

Dragon is a new computer system being designed at Xerox PARC. It belongs to a line of personal computers (the D-machines) that started in 1973 with the Alto and includes the Dolphin, Dandelion and Dorado [7]. All these machines are characterized by a high-performance display, a sixteen-bit processor operating in a single address space, and only rudimentary protection hardware.

Dragon is a significant departure from its predecessors. It is a tightly-coupled multiprocessor with a novel cache architecture; it is a real 32-bit machine, both in its data and address paths; and it uses a new instruction set which is a considerably simplified version of the Mesa bytecodes [3]. Dragon is also much more powerful than its predecessors: the performance of a single Dragon processor should exceed twice that of a Dorado, and we expect to produce machines with up to 10 processors each. Finally, Dragon is our first machine that includes custom integrated circuits. We expect this move to yield a significant reduction in size, cost, and electrical power.

We plan to build a modest number of Dragons for ourselves to explore useful ways of exploiting this level of performance in a personal setting before such machines are economical in widespread use. We see multiprocessing as a key element in future systems and need to understand its implications for software design.

At this writing we have almost completed the logic simulation of the processors and have a fully functional arbiter. This is encouraging, but much work remains to be done. We must first cast details of the architecture in concrete before doing further work on chip floorplanning and layout.

Key Ideas behind Dragon

Everybody wants a faster machine; we want a fast personal machine. Given that increases in performance purely through circuit speed are becoming increasingly harder to come by, we believe the best way to build such machines is to use tightly-coupled multiprocessing in which the individual processors themselves are high-speed devices built out of custom IC's. There were a number of ideas that made this approach promising and we wanted to try these out through implementation.

Multiprocessing

Dragon will have a modest number (1-10) of identical processors attached to a common shared memory bus, the M-bus. A key problem in such structures is that access to the bus is a bottleneck that limits system performance; the obvious use of caches [9] to resolve this problem has until recently been limited by the cache consistency problem [1,2]. Our solution is to use a special two-port snoopy cache (described below) that maintains consistency and reduces bus traffic.

The software will use this parallelism by mapping processes to processors dynamically. Given that Dragon is a personal machine, we are forced to break up large compute-bound programs, such as compilers and VLSI design aids, into multiple processes. This decomposition is a difficult problem, but we have made a modest beginning. The primitives for multiprocessing have been part of the Cedar language [10] for a while, and software written recently has been structured with a view towards exploiting the parallelism in Dragon.

Faster Processors

There are two areas in which we expect single Dragon processors to be faster than their predecessors. The first is in instruction execution time. Like the IBM 801 [8], a key goal is to execute an instruction nearly every machine cycle. This is achieved by use of a simple instruction set, pipelining with rich pipeline bypass paths [4], the use of caching, and instruction prefetch. These architectural devices are feasible in the context of a personal computer only through the use of custom VLSI. Our technology is an in-house 2 micron gate CMOS with polysilicide and two levels of metal. The machine's target clock rate is 10 MHz.

The second area of improvement is the procedure call, where simplified parameter passing, a cached control stack, elimination of parameter copy, and of several indirections in
common cases is expected to provide a large speed improvement over our earlier implementations.

Code Density

Dragon will retain the compactness of compiled programs exhibited by the Mesa instruction set, for two reasons: it is obviously cheaper to store smaller programs, but even more important is the reduction in code traffic from disk to memory, from memory to cache, and from cache to instruction decoder. In particular, any decrease in M-bus utilization by a single processor will contribute to good performance in a system with shared memory. An undesirable consequence of a compact instruction set is that it tends to make instruction decoding hardware more complex. For the Dragon at least, we expect the gains to outweigh this disadvantage.

General Structure of Dragon

As shown in Figure 1, the Dragon is organized around a central memory bus, the M-bus. Up to 10 identical processors share memory via this bus. Also on the M-bus are the bus arbiter, the mapping processor, the display controller, a few high-speed I/O devices (Ethernet and disk controllers), and a VME bus coupler. This access to a standard bus will let us use a large selection of commercial devices.

[Figure 1: Structure of the Dragon. Labels recoverable from the figure: Processor #1, #2, ... #n (up to 10); Memory (4-16 Mbytes); Arbiter; Map Processor; Display Processor; I/O Processor; Keyboard; Mouse; Bus Coupler; VME bus; Commercial boards; M-bus.]

The Memory System

The M-bus

The M-bus is a synchronous time-multiplexed bus that carries data among the processor caches, main memory, the map processor, and the display controller. M-bus transactions are time-sequenced on a 32-bit address/data bus as dictated by the operation specified by the current bus master. Addresses on the bus are real, not virtual, and there are separate 32-bit address spaces for I/O and memory. Interprocessor atomicity is implemented by allowing the M-bus master to lock the arbiter. There is no automatic timeout, so processors (or caches) must be designed so they cannot lock for too long.

Main Memory

Physical memory will have from 4 to 16 Mbytes with error correction. Memory page frames will consist of 256 32-bit words. Transfers to and from memory will occur in 4-word chunks, with the requested word delivered first for reads to decrease latency.

The Arbiter

M-bus mastership is granted by a single-chip synchronous arbiter with 32 request/grant ports. A first version of this chip is working.

The Cache

A Dragon cache mediates between a processor's P-bus and the system's common M-bus. Its main purpose is to decrease the amount of M-bus bandwidth consumed by each processor, thus allowing multiple processors to run efficiently. The cache is a single chip. It is fully associative, and holds (at least) 64 entries with each entry containing a virtual block address, a block of 4 32-bit words of data, a real page number, and various property bits. Several caches can be connected in parallel to increase size by having each one respond to a subset of virtual addresses.

The cache also translates virtual addresses to real by maintaining mapping information on a per-page basis. If the requested word is present, the reference is satisfied in only one cycle. If the word is missing, the cache issues a reject signal that freezes the processor. Two cases are now possible: (i) the virtual to physical translation for the target page is present because another word from this page is in the cache; and (ii) this information is not in the cache, in which case it is supplied by the map processor.

To maintain inter-cache data consistency, the cache monitors physical addresses on the M-bus to determine sharing, writes through if an entry is shared, and writes back if an entry is not shared. On a write-through other caches update their copies of the value, if any. The monitoring is done by a second associator in each cache called the "snoopy" associator. This idea was independently discovered by C. Thacker of PARC, Steve Dashiell from Xerox Electronic Division, and J. Goodman [2]. A related scheme was published even earlier [1].

The P-bus

The P-bus connects one or more caches to a processor. The bus consists of a 32-bit multiplexed virtual address/data field, a
parity bit, a command field, a ready bit, and an error field. The P-bus is twice as fast as the M-bus, permitting a cache access within one cycle for a hit. There is only one processor on a P-bus, and it is always master. One or more caches and the floating-point accelerator chips are slaves. A P-bus slave decodes its instructions from a combination of the command and virtual address fields.

Map Processor

The map processor is a single M-bus device that implements the mapping from virtual addresses to real. It also cooperates with the processor caches to implement per-page memory protection and permits multiple virtual address spaces (that may share pages) to coexist simultaneously.

Mapping information for all spaces is maintained in a single table kept in main memory. This table keeps information only about pages currently in main memory in order that it may occupy a small, fixed fraction of real memory. It is organized as a hash table with digital search to resolve hash conflicts. Multiple address spaces are implemented by storing the address space id along with the virtual address in each table entry, and by maintaining a small table within the map processor that gives the dynamic correspondence between processors and address spaces.

The map processor also caches frequently used mapping entries. This allows it to respond to most cache map misses in two cycles. If there is a miss within the map cache also, the map processor reads the main memory map. Caching combined with the above hashing scheme permits very good average response and a worst case response that is bounded by a reasonable value. The map processor will be implemented by two custom chips: a map cache containing around 2000 mapping entries and a map cache controller.

[Figure 2: A Dragon Processor. Labels recoverable from the figure: Instruction Fetch Unit (IFU); Execution Unit (EU); Instruction Cache; Data Cache; Floating-point chip set; P-bus (twice); M-bus.]

The Processor

A Dragon processor consists of 4 custom chips (the Instruction Fetch Unit, the Execution Unit, and their respective caches) and a commercial floating-point chip set. Early benchmarks predict a minimum of 5 MIPS per processor.

The Instruction Fetch Unit (IFU)

The IFU is the most complicated chip. It fetches a byte-stream through its cache, forms variable length instructions in an asynchronously-filled 4-word buffer, and then decodes them using several large PLAs. Contrary to the architecture of Dragon's predecessors there is no microcode in RAM or ROM. The IFU is thus specialized for one particular instruction set. The product of this decoding provides the EU with three addresses for its register file (two source and one destination), a 32-bit literal, various opcodes, and condition select.

The IFU also detects impending hazards in the EU pipeline and issues pipeline bypass commands. This avoids wasted cycles while preventing registers from being updated inconsistently.

The IFU also has a stack for PC and procedure linkage information. It processes jumps, traps, calls and returns, relying on the EU for conditional tests. It also maintains the local frame model which consists of a stack pointer and a frame pointer. Finally, the IFU controls the floating-point unit and both the instruction and data caches.

Most of the complexity of the IFU, aside from the sheer size of the instruction set, stems from the treatment of exceptions in the pipeline, which include conditional jumps, instruction aborts, rejects from the cache, and faults.

The Execution Unit (EU)

The EU is a single-chip slave of the IFU. Its main features are a three-ported register file holding 160 32-bit words for stack variables, runtime variables, and constants; a fast 32-bit ALU; a field unit able to extract/insert/shift in one cycle; special multiply hardware (2 bits/cycle) and divide hardware (1 bit/cycle).

Logically, the EU and IFU form a single entity. Both are pipelined, with the two pipelines interacting at virtually every stage.

Floating-point Unit

A survey of potential users showed that fast floating-point is mandatory for graphics and simulation applications. Every Dragon processor will use a set of two high-speed commercial chips which operate in a non-pipelined mode under direct control of the IFU. Communication with the EU occurs over the P-bus, with floating-point operations looking to the EU very much like cache accesses. The floating-point chip set in conjunction with low level software will implement the full IEEE floating-point format with an average throughput of about 1.5 MFlops.

Display System

For us a personal machine requires a high-performance display. Dragon will accommodate both color and very high-resolution black and white displays. The color display will support 1024 x 768 8-bit pixels, with extension to 24 bits per pixel. Two Mbytes of display memory will be dual-ported, thus permitting both high-bandwidth refresh and access over the M-bus. The display controller will mediate access to the display memory over the M-bus, control display refresh, and provide
Compcon.tioga, Wednesday, November 14, 1984 12:48 pm P
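The consistency rule stated in the section on The Cache (monitor the M-bus to detect sharing, write through when an entry is shared, write back when it is not, and let other caches update their copies on a write-through) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Dragon hardware design: the names Bus, Cache, Entry, snoop_read, snoop_write and flush are hypothetical, addresses are real word addresses, a holding cache supplies the block on a read, a write-through also updates main memory, and there is no capacity limit or replacement policy.

```python
from dataclasses import dataclass

BLOCK_WORDS = 4  # the text gives a block size of 4 32-bit words


@dataclass
class Entry:
    data: list            # BLOCK_WORDS words of cached data
    shared: bool = False  # another cache is believed to hold this block
    dirty: bool = False   # written locally while not shared


class Bus:
    """Hypothetical stand-in for the M-bus together with main memory."""

    def __init__(self):
        self.memory = {}       # real block address -> list of words
        self.caches = []
        self.transactions = 0  # count of bus operations, to show traffic

    def read_block(self, requester, block):
        self.transactions += 1
        data, shared = None, False
        for cache in self.caches:
            if cache is not requester:
                entry = cache.snoop_read(block)
                if entry is not None:
                    shared = True
                    data = list(entry.data)  # assumption: holder supplies data
        if data is None:
            data = list(self.memory.get(block, [0] * BLOCK_WORDS))
        return data, shared

    def write_word(self, requester, block, offset, value):
        self.transactions += 1
        # Assumption: a write-through also updates main memory.
        self.memory.setdefault(block, [0] * BLOCK_WORDS)[offset] = value
        still_shared = False
        for cache in self.caches:
            if cache is not requester and cache.snoop_write(block, offset, value):
                still_shared = True
        return still_shared


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.entries = {}  # real block address -> Entry
        bus.caches.append(self)

    def _fill(self, block):
        if block not in self.entries:
            data, shared = self.bus.read_block(self, block)
            self.entries[block] = Entry(data, shared)
        return self.entries[block]

    def read(self, addr):
        block, offset = divmod(addr, BLOCK_WORDS)
        return self._fill(block).data[offset]

    def write(self, addr, value):
        block, offset = divmod(addr, BLOCK_WORDS)
        entry = self._fill(block)
        entry.data[offset] = value
        if entry.shared:
            # Shared entry: write through so other caches can update copies.
            entry.shared = self.bus.write_word(self, block, offset, value)
        else:
            # Not shared: keep the write local and write back later.
            entry.dirty = True

    def snoop_read(self, block):
        """Snoopy side: another cache is reading this block."""
        entry = self.entries.get(block)
        if entry is not None:
            entry.shared = True
        return entry

    def snoop_write(self, block, offset, value):
        """Snoopy side: update our copy on another cache's write-through."""
        entry = self.entries.get(block)
        if entry is None:
            return False
        entry.data[offset] = value
        return True

    def flush(self, block):
        """Evict a block, writing it back to memory if it is dirty."""
        entry = self.entries.pop(block, None)
        if entry is not None and entry.dirty:
            self.bus.transactions += 1
            self.bus.memory[block] = list(entry.data)
```

In this sketch, two writes by one cache to an unshared block cost a single bus transaction (the initial fill). Once a second cache reads the block, both entries are marked shared and later writes go over the bus, where the second cache picks up the new value without refetching the block.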