Dragon Document
Edward R. Fiala
16 August 1983

1. Introduction

I was asked to document the Dragon project by Chuck Thacker, who printed for me a collection of messages between Forest Baskett, Butler Lampson, Neil Wilhelm, Russ Atkinson, Dave Gifford, Jim Morris, and himself over the period between 23 November 1981 and July, 1982; these messages were about the architecture of the proposed Dragon computer, an LSI machine which would be our next computing engine. Since then, I have discussed various issues with many people, including Phil Petit, Rick Barth, Ed Satterthwaite, Dan Greene, Ed Taft, Roy Levin, Andrew Birrell, Hal Murray, Dick Sweet, Mark Brown, Larry Masinter, and the people named above.

The first purpose of this document is to describe the current state of the Dragon hardware design. Its second purpose is to describe how our software systems can run efficiently on Dragon. Our primary interest is in Cedar, but some attention is also being given to Pilot, Lisp, and Smalltalk. Its third purpose is to discuss possible deficiencies and areas for future development.

A major goal of the Dragon project has been developing LSI design aids. The principal systems used in the design have been CHIPMONK (Petit and others) for layout, circuit extraction, and design rule checking; THYME (Wilhelm) for circuit simulation; and MOSSIM (who?) for logic simulation and test generation. Although developing these design aids may be the most important act of the Dragon project, they will not be discussed in this document.

Dragon LSI designers have been Chuck Thacker (Execution Unit and Cache), Phil Petit (Instruction Fetch Unit), Rick Barth (Cache), and Louis Monier (EU). Russ Atkinson is building a Dragon simulator, designing low-level data structures for Cedar, and producing a Cedar compiler for Dragon.

Since it is based upon rapidly changing, verbally conveyed information in most cases, this document's description of Dragon is incomplete and inaccurate. In addition, I have incorporated my own thoughts in extrapolating how various machine features might be used. I welcome all comments regarding any aspect of this document.

2. Notation

In a machine operation encoded as a sequence of instruction bytes, the first byte of the sequence is called the "opcode"; a, b, and g denote, respectively, the first, second, and third operand bytes of an opcode.

"[S]" denotes top-of-stack; "[S-1]," the word underneath that (at S-1); "[S+1]," the word above top-of-stack, etc. Similarly, "[RL+n]" or "LRn" refers to local register n, and "[RA+n]" or "ARn" to auxiliary register n.

"b" denotes "octal" and "d" denotes decimal after numbers. So 512d = 1000b. The "b" or "d" may be omitted where the interpretation seems obvious.

Sequences of bits go left-to-right within words; the most significant bit of an integer is bit 0, the least significant is bit 31d (with a 32d-bit word size). A sequence of bits within a word is denoted by [first..last]. For example, "[S][2..4]" refers to the three bits in the word at top-of-stack beginning with bit 2 and ending with bit 4.

The effect of memory references is denoted with "^" and of assignments with "_".
So "[S+1] _([S]+a)^," for example, means that the a operand byte is added to the word at the top-of-stack todetermine a memory address; that word is fetched from the cache and written into the word above thetop-of-stack.A "push" is an operation which advances the stack pointer S and writes into the new top-of-stack; forexample, the DUP (DUPlicate) operation could be written as "[S+1] _ [S], S _ S+1," but is moreconcisely stated as "push [S]." Similarly, a "pop" is an operation which reads the top-of-stack and thendecrements the stack pointer; for example, the EXDIS (EXchange DIScard) operation coulde be writtenas "[S-1] _ [S], S _ S-1," but is more concisely stated as "pop into [S-1]." In describing memory sizes, I have tried to use "bytes" rather than words in most cases, to avoidconfusion in discussions where both 16d-bit and 32d-bit word size machines are compared. I use theterm "k" to denote 210 = 1024d = 1k, and "m" to denote 220 = 1,048,576d = 1m.32-bit values are frequently written as rather than as a single 11-character value.For example, 20000000000b = 100000b,,0 fp!q4] Gfp ar ^eq;0 \ sqsqsqD Y)P W^)? U R"X PW/ L^ KD IPa G D:# BIsq sq9 @~9* > ;AK 9w&8 790 5c 4M 07* .<' --t-q"-t-q )'D '&z %BBIDragon DocumentEdward R. Fiala16 August 198383. Project Goals3.1 Fundamental AssumptionsThe decision to use 32d-bit word size on Dragon is a firm one. 32-bit word size will offer substantiallybetter performance for Cedar and other systems which make extensive use of 32-bit pointers. Other data,such as floating point numbers, can also be implemented more efficiently with large words. Theseefficiencies are believed to dominate the disadvantages of larger word size. The pros and cons of 16-bitvs. 32-bit word size won't be discussed here.An execution environment for Mesa procedures is proposed substantially different from and withperformance advantages over that used in 16-bit Mesa. The use of this environment in Dragon is not atissue; only low-level implementation details are of interest now.The Dragon microstore will be much smaller than on our 16-bit machines. A principal consequence isthat hardware-implemented byte codes cannot include operations such as BitBlt, floating point, processopcodes, or any other complicated operations used in 16-bit Mesa.3.2 Ease of TransitionA consequence of the above design decisions is that the compiler will be producing different byte codesfor the Dragon--at a minimum, even if identical Mesa sources can be used on both Dragon and our 16-bit machines, these sources will have to be separately compiled.I don't think it is sufficient to carry out a one-way conversion of sources from the 16-bit world toDragon. There will be at least several years during which we will be running our major software systemson both 16-bit machines (Dorado, Dandelion, Dolphin, Dicentra, and Daffodil) and Dragon; we don'twant to have to edit twice for every change. It is important for sources, once converted, to be compilableon either a 16-bit machine or Dragon. However, some source changes may be required to convert anexisting program into machine-independent form initially, and some sources will be too machine-dependent to convert. Many people have expressed concern that this transition should be as easy aspossible.Butler Lampson insists that it be possible to compile "vanilla user programs" without any source changesfor the new 32-bit machine. However, extensive modification of "system programs" is acceptable toButler. 
I think by "user programs" he includes Star and other Mesa software outside CSL control aswell as vanilla parts of Cedar; by "system programs" I think he means the compiler, binder, debugger,input/output drivers, etc. Even user programs may require some conversion where "system" datastructures are referenced, machine-dependent inlines are used, or 16-bitness is embedded in some otherway. Consequently, conversion will not be zero effort, but Butler wants it to be very easy.Roy Levin is concerned about a smaller body of software, consisting of Cedar and Pilot software used byCedar. Although Roy expresses no concern about the larger problem of converting Star (etc.), he goesfurther than Butler with regard to conversion of the Pilot software--much of this code would presumablybe called "system software" by Butler, but Roy wants its conversion to Dragon to be easy. Roy is lessconcerned about converting Cedar software because CSL/ISL people are familiar with it and because hethinks it has been more transparently written with respect to a variety of conversion problems which willbe discussed in this memo. At any rate, he does not believe that the conversion labor for the Cedar fp!q4] Gfp ar \p YLq>+ WL U<% SF# R"- N8& L(> KA G*9 EH DA >p ;qE" 9[ 7@ 4@$ 2Q 0P /!Z -VG +=" )` ' $*> "[ J %e Z.0 A% \ SK G L U )U ^d A# LA\.Dragon DocumentEdward R. Fiala16 August 19839software is as significant as that for Pilot.Others I have talked with are concerned with making the transition from where-we-are to where-we-want-to-be automatic, easy, and smooth. However, they also want to correct problems in the existingsystem or to enhance it in various ways--improvements and changes frequently contradict the wish for asmooth transition.With regard to ease of conversion, the best we can hope for is something like this: The new compilershould be able to compile unchanged old "vanilla user programs", possibly after running some kind ofautomatic translation program over them; only if the program is doing something unusual shouldcompilation fail. If compilation succeeds, the program should run; if the program compiles and runs onthe Dragon, it should compile and run on the old 16-bit machines as well unless it is a "systemprogram". In other words, there should be no conversion difficulties except those flagged by thecompiler or translator, and there should not be very many of these.3.3 Problems to CorrectThe following problems are ones we should fix; these changes are desirable independent of the Dragon:1) Several Mesa size limits are discussed later; we want to alleviate the limit on the number of GlobalFrame Table entries (1024d of which 700d are used); this is of immediate concern. Medium termconcerns are the limit on total frame storage (128k bytes of which over half is used; this is the MDS orMain Data Space of Mesa) and the size of virtual memory (225 bytes on Dorado and 223 bytes onDolphin of which 222 bytes are used); VM size limits are unfixable on Dolphin, Dandelion, and Dicentra,but Dorado's VM can be made larger in several ways discussed later. 
Other limits are of less interest.Roy Levin does not expect that we will ever have to fix the frame storage size limit on the 16-bitmachines.2) "Long" data items in storage appear in reversed order with the high-order word at a larger addressthan the low-order word; we want to fix this problem both in our 32-bit representation (so that we don'thave to do a left-cycle 16 on every LONG POINTER or INT fetched from storage) and in any N x 32-bit representations that we introduce for Dragon. Fixing this on the 16-bit machines should requirechanging only the compiler and microcode. Although microcode changes are substantial, little softwareknows about the representation of the objects. If we don't fix it, then there are problems whenever 16and 32-bit machines communicate numbers with each other or with a data base.3) The floating point FSTICKY register should be included in the process state; at least 22 bits areneeded in FSTICKY. This register specifies one of four rounding modes, projective or affine treatmentof infinity, three treatments of underflow, and other options; in addition, it records the occurrence ofunderflow and rounding, which software might wish to examine. Until this is somehow accomplished, itis impractical for a computation to carry out floating point arithmetic with other than globally settableoptions shared by all processes.4) Arithmetic and its opcode support are inadequate in 16-bit Mesa. Only single-precision integers aresupported in the source language, and the implementation does not detect overflow; also, intermediateresults are truncated to single precision after each partial evaluation.Overflow should always be detected. It should be possible to specify both N-precision (N=3 or 4) andindefinite precision in the source language in both integer and fixed point representations. The opcode fp!q4] Gfp bq- ^a \T [d YL U'pq2 Ta RE<" PzU NH LN KC Fp Bq@% ?X =SY ; \ 9::Jt9q :Jt9q 78t7qI 6(U 4^T 2 /!Q -V#E +7) )); '*< &,I $aL H %G Z \ b #F  Z 8- H .7 35 pB\HDragon DocumentEdward R. Fiala16 August 198310support should be reasonable for these.5) Opcodes need to be defined for 64-bit floating point arithmetic. Neil Wilhelm and Ed McCreightboth feel that both 32-bit and 64-bit floating point should be provided unless the 64-bit floating point isinsignificantly slower than the 32-bit floating point. In other words, they want 32-bit floating pointbecause it is faster--they don't care about the difference in storage requirements.6) The current system is inadequate for multi-processor operation, and Dragon can be a multiprocessor.We want to enable multiple processors to operate concurrently using the same virtual memory to serviceany combination of Mesa processes and/or other special purpose processes. This will be a majorproblem, as discussed later.7) The limit of 64k elements in an array can be fixed on Dragon by using a positive 32-bit integer todescribe the number of elements rather than a 16-bit cardinal; this change will be more general andslightly more efficient. 32-bit length specifiers should be used over the network (i.e., in RPC) as well.3.4. Changes For a More Uniform or Enhanced MachineCedar is already being changed in some of the directions suggested here.1) Eventually, we want to get rid of the MDS (Main Data Space, a special 64k region of VM whichopcodes can reference with 16d-bit displacements). For a variety of reasons discussed later, this is anunnecessary, non-uniform, and ineffective concept. If supported for old programs, MDS should bedeemphasized. 
It has resulted in moderate performance improvement for 16-bit Mesa programs at theexpense of source language purity; but for 32-bit Mesa, it will be detrimental to performance. Amultiple-MDS scheme would prevent running out of MDS for awhile but introduces still more non-uniformity to the system.It might be possible to retain MDS on 16-bit machines but to ignore it on 32-bit machines, by treatingall POINTERs as LONG POINTERs and MDS objects as though they were not confined to an MDS.Perhaps this is a good idea.2) 16-bit Mesa fields are allowed to be 1 to 15 bits or N*16 bits in size. Although the Dragon compilerwill probably have to handle fields that are (2N+1)*16 bits in size, the code produced will besubstantially slower and more verbose than for 32*N bit fields. Programmers should avoid these slowfields.A decision to totally eliminate (2N+1)*16-bit fields would be painful because PUP software (i.e., Xeroxnetwork software) makes extensive use of 48-bit fields. There are undoubtedly other applications as well.It might be possible, however, to stamp out 80-bit, 112-bit, and other (2N+1)*16-bit fields, if thecompiler couldn't handle them reasonably; I suspect that they don't occur very often.We should consider permitting 17 to 31-bit fields and smaller fields which cross from an even half-wordto the odd half-word; doing this requires revising field descriptor format as discussed later. Such fieldscan't be referenced efficiently on 16-bit machines, so their use would be limited during the lifetime ofthe 16-bit machines. Allowing fields to cross word boundaries is also considered later, but notrecommended.3) Change procedure descriptors from 16 bits to 32 bits. This is necessary if local frames are removedfrom the MDS, and it enables both the global frame table and code segment entry vectors to be fp!q4] G?fp bq' ^D \Y [a YLS UB$ TT RE6) Pz MZ K>W IsQ DZp4 @qH =vF ;b 92. 86, 6K?" 48& 2 /DH -z M + (=<, &sH $P " kR d _  U U ^ X : V o 8/ 3I 2 B^*Dragon DocumentEdward R. Fiala16 August 198311eliminated, which makes procedure calls substantially faster. Details and variations are discussed later.4) Enable dynamic recompilation and loading of procedures and modules by means of some kind ofinterface function call that reintroduces indirection in procedure calls. Current procedure descriptors andxfer data structures do not allow easy enumeration of all places where a descriptor for some procedure isstored; eliminating entry vectors and the global frame table as proposed in (3) makes this even harder.An indirect procedure call allows the places where global frame addresses and PC's are stored to beisolated, so that they can be enumerated easily. Details and variations are discussed later.5) If we now believe that future machines will use 32-bit words, then programs should evolve towardconstructs efficient on 32-bit machines and away from constructs extraneous on 32-bit machines. Also,standardizing a few basic types seems highly desirable, as shown by the following example: Twofunctionally identical procedures are needed because one returns a LONG CARDINAL result while theother returns a LONG INTEGER result. Only by eliminating LONG CARDINALs from the programsource can this unfortunate replication be eliminated.With respect to opcode set implications, avoiding 16-bit forms is largely a compiler decision in unpackedstructures, as discussed in (7) below; local and global frames are normally unpacked. 
Of the 16-bitTrinity Pilot opcodes, many are the same except that one moves a single 16-bit word while the othermoves two 16-bit words between the stack and a frame; in addition, there are other opcodes identicalexcept where one does 16-bit logic or arithmetic while another does 32-bit logic or arithmetic. OnDragon, all these duplicates--about 41 Trinity opcodes--become redundant, if the compiler uses 32-bitrepresentations in frames. Another 16 opcodes become unimportant (because they can be replaced bytwo or three other opcodes) if users deemphasize 16-bit forms elsewhere or if most 16-bit items outsideframes are also unpacked. Of the 57 Trinity opcodes eliminated, where both 16-bit and 32-bit operationsexist, 46 are one byte while 11 are Esc/EscL opcodes. The opcodes thus freed can be used for otheroperations to improve code density or speed.The general proposal here is that Dragon opcodes will not provide much support for 16-bit constructs--about the same level of support as Trinity Pilot provides for bytes/strings would be about right. This ispractical for two reasons: First, even though the source language specifies a 16-bit construct, thecompiler can represent it with 32 bits in unpacked structures. Next, user programs will continue evolvingto use 32-bit constructs instead of 16-bit in many places.With respect to these considerations, the following proposals seem reasonable:5a) INTs and LONG INTEGERs are functionally equivalent--one should be deimplemented (non-controversial). INTEGERs and INTs are discussed below.5b) 16-bit POINTERs should be replaced by the 32-bit LONG POINTER (This follows automatically ifMDS is eliminated.).5c) LONG CARDINALs seem to be of marginal value; perhaps they can be deemphasized or eveneliminated altogether. This proposal and its variations are controversial. My proposal is to eithereliminate or diminish use of LONG CARDINALs in program sources, and to limit opcode/hardwaresupport for unsigned operations to that necessary for N-precision twos-complement arithmetic.LONG CARDINALs have several advantages over LONG INTEGERs, and these must be examinedcarefully before deciding to eliminate/diminish LONG CARDINALs. First, an arithmetic shift of aLONG CARDINAL is potentially simpler and faster than that for a LONG INTEGER; on a left-shift fp!q4] G?fp bq'C ^^ \l [!H YL?( WV U] REU Pz_ N@ LG KJ IP6 E81 D>& BIS @~8, >%> <I ;A! 9T70 7 ^ 5D 3, 09- .V , W +" ^ )W: %N "sE 7 7] l 0) 0e e!; ] )+* ^^ = LB\?Dragon DocumentEdward R. Fiala16 August 198312there is no important difference, but the LONG CARDINAL arithmetic right-shift need not determinewhether to shift in 1's or 0's, so it is faster than a LONG INTEGER right-shift, barring special hardware.However, no such operation as an arithmetic shift exists in 16-bit Mesa, and none is proposed becausedivision itself is infrequent and division by powers-of-two is still less frequent.Secondly, LONG CARDINALs allow positive numbers one bit larger than LONG INTEGERs duringarithmetic and in storage. However, the class of applications where a 32-bit positive number is largeenough but a 31-bit positive number is too small seems uninteresting to me.When LONG CARDINALs have been eliminated, the opcode set will still have to support N-precisiontwos-complement numbers--the kind of multi-precision arithmetic proposed for Dragon. In an N-precision number, the high-order word is a 32-bit integer, while low-order words are 32-bit cardinals, soarithmetic on low-order words may require some cardinal operations. 
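To make this concrete, here is a small sketch in C of N-precision two's-complement addition with the low-order words treated as cardinals, as described above. The word order (high-order word first), the type names, and the function name are assumptions made only for this illustration; they are not part of any Dragon proposal.

    #include <stdint.h>

    /* Illustrative sketch only: add two N-precision two's-complement numbers.
     * Word 0 is the high-order (signed) word; the remaining words are treated
     * as 32-bit cardinals.  Layout and names are assumed for the example.     */
    void nprec_add(int n, const uint32_t *a, const uint32_t *b, uint32_t *sum)
    {
        uint64_t carry = 0;
        for (int i = n - 1; i >= 0; i--) {        /* low-order words first      */
            uint64_t t = (uint64_t)a[i] + (uint64_t)b[i] + carry;
            sum[i] = (uint32_t)t;                 /* low 32 bits of partial sum */
            carry  = t >> 32;                     /* carry into next higher word */
        }
        /* Only word 0 is interpreted as signed; overflow of the whole value
         * would be judged from the signs of a[0], b[0], and sum[0].           */
    }

The only "cardinal" facility this needs from the hardware is an unsigned add with carry-out on the low-order words, which is the point being made above.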
The requirements for this kind ofarithmetic are examined carefully in a later section.5d) 16-bit INTEGERs and CARDINALs will have to remain Mesa source language primitives as long as16-bit machines are important targets, but these need not be supported very much within Dragon'sopcode set. In unpacked structures, these will be represented in 32 bits and be indistinguishable fromLONG INTEGERs. In packed structures, they can be handled like other sub-word fields, which is asfollows: A cardinal value is stored in a field to which an offset is specified; the compiler loads/stores thecardinal value and is clever enough to apply the offset only when it has to. For example, if the field issimply incremented and stored back, the offset is never applied. Coercion to a LONG INTEGER isaccomplished by adding the offset to the value from the field.The above approach is insufficient to handle the case of a MACHINE DEPENDENT interface (onecommunicated over a network) in which a 16-bit INTEGER appears. Such an interface would requirethe INTEGER to be transmitted in its 16-bit signed form. Machine-dependent interfaces should belimited to LONG INTEGERs and 16-bit CARDINALs so that this problem won't happen. However,even if it were necessary to transmit 16-bit integers over a network, my feeling is that no specialopcode/hardware help is needed for this case.Like 8-bit fields, 16-bit fields are likely to be popular, so they probably deserve special opcodesanalogous to Read String and Write String.6) We want to clean up the Mesa language so that necessary "loopholes" become less frequent, and wewant to cleanup Mesa sources so that they are less sensitive to implementation decisions in theunderlying machine. There are a variety of ways in which sources presume representations used in 16-bitMesa; some of these representations are inconvenient on Dragon, so we would like to have the freedomto be different. These changes will allow a larger body of software to be machine-independent.One mildly troublesome representation is that pointers address storage on 16-bit boundaries. It would benice if Dragon pointed at 32-bit boundaries while 16-bit machines pointed at 16-bit boundaries, yetcommon sources could be separately compiled into working code in both environments. We want toaddress 32-bit boundaries on Dragon, for example, because indexing into arrays is faster--the index doesnot have to be shifted left 1 before adding it to the base pointer. Also, there is then no need to worryabout 32-bit items crossing from the odd half of one word into the even half of the next.However, existing Mesa programs "know" that pointers address 16-bit quantities. For example, the sizeof a long pointer is sometimes represented by the raw numeral "2", implying that 2 addresses must beskipped to pass over a 32-bit long pointer; similarly, string indices are sometimes computed by left- fp!q4] G?fp bqB `S%E ^B# \S YLC W X UK RE)6 PzT NJ LC# K5 G5+ E` D [ BIa @~E) >L <@ ;> 7:! 5M 4] 2LZ 0\ .- +EQ ){* & Y $>2- "sY 7- ;$ l\ I ="  N BX wY P ;d pQ| )B\GDragon DocumentEdward R. Fiala16 August 198313shifting a pointer by 1. It is believed that all of these places can be "purified" by source changes whichmake code insensitive to the internal representation of a pointer--e.g., using "size(long pointer)" ratherthan "2". 
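As a purely illustrative picture of this kind of purification, here is the same idea transliterated into C; the Mesa source would use SIZE[LONG POINTER], and the type and function names below are invented for the sketch.

    #include <stdint.h>

    typedef uint32_t word;          /* one unit of addressable storage          */
    typedef word    *LongPointer;

    /* Impure: the raw numeral 2 encodes the 16-bit-Mesa fact that a long
     * pointer occupies two addressing units; this breaks if Dragon pointers
     * address 32-bit words, where one unit suffices.                           */
    word *skip_pointer_impure(word *p) { return p + 2; }

    /* Purified: derive the stride from the pointer type itself, so the same
     * source is correct under either addressing convention.                    */
    word *skip_pointer_pure(word *p)
    {
        return p + sizeof(LongPointer) / sizeof(word);   /* "size(long pointer)" */
    }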
The subject of debate has not been that these places cannot be made insensitive but that toomuch editing is required.Satterthwaite has suggested a compiler kludge to produce an error wherever pointer arithmetic is carriedout explicitly, so that these places could be identified and fixed. This kludge would also indicate howlarge the source editing problem would be, if pointers were to 32-bit boundaries.If pointers do address 32-bit boundaries, then field descriptor format must also change. However, littlecode depends upon the format of field descriptors, and, as proposed above, we might want to allow 17 to31-bit fields anyway.I think what I am saying here is that Mesa programs can and should be made insensitive to pointerformat by appropriate editing. Dragon can then address storage at 32-bit boundaries, if we choose.7) Packed vs. unpacked constructs: The representation of pointers is only one example of a more generalproblem. Mesa has constructs, such as INTEGER and CARDINAL, represented as 16 bits on our 16-bitmachines that, in some contexts, could be more efficiently stored in 32-bits on Dragon. There are threegeneral intents which the programmer might have:a) A data record might be transmitted over a network or referenced by a hardware device, and its formmust meet externally specified requirements. It must either be in the required form all the time or beconverted before transmission and after reception, though not necessarily stored in the required forminternally.b) He is worried about storage and has packed fields into the record tightly. Although it is notabsolutely necessary to arrange fields as indicated by the programmer, that is his intent.c) He wants the record to be efficiently accessed and doesn't care much about its storage requirement.In this case, it would be better to promote a 16-bit INTEGER to 32 bits on Dragon but leave it 16 bitson a 16-bit Mesa machine.There is clearly a continuum of intents between 'b' and 'c'. I think that 'a' would normally be pinneddown by a MACHINE DEPENDENT declaration; 'b' we will refer to below as a "packed" structure,and 'c' as an "unpacked" structure. [MACHINE DEPENDENT is unfortunate terminology--where therecord is transmitted over a network, for example, it is really "machine-independent"; a notation such asDEFINITE might be more appropriate.]Unpacking has the desirable property that computations involving INTEGERs/CARDINALs can beefficient on any target machine. However, sources which rely upon mod 64k truncation ofINTEGERs/CARDINALs or do manual or loopholed pointer arithmetic (e.g., somehow "knowing" thatSize[Integer] equals 1) will not necessarily work. But which structures are packed?7a) One possibility is to pack everything the same in all Mesa implementations. This has the desirableproperty that what works on 16-bit machine will work on a 32-bit machine and vice-versa, and theprogrammer will have complete control.7b) Another possibility offered by Satterthwaite is to default INTEGERs, CARDINALs, and POINTERsto "unpacked" in frames, "packed" in records, and either way in arrays, allowing the programmer tooverrule the default through declarations. fp!q4] G?fp bq-> `S73 ^g \ YL08 WI UQ REX Pz70 N K>B IsE F'A D7Z Bl^ @0 >O <8S :nI 8 6N 4:Z 1(> /42 . *J (\ 'Y %5i #j$ ? .'!(7 c"; T 'Q \N &  6* U11 * t CA[/Dragon DocumentEdward R. Fiala16 August 198314With 7a, POINTERs must address 16-bit halfwords rather than 32-bit words; otherwise, exactlycompatible sizes accomplish little. Almost no editing of 16-bit programs would be required. 
Pointing at16-bit boundaries would require more shifting and masking in microcode and the handling of word-boundary crossing when a POINTER contains an odd address. However, the compiler and evolution ofuser programs could arrange for odd pointers and word-boundary crossing to become infrequent, so thiswould be inexpensive in execution time. Cedar is already encouraging conversion of INTEGERs andCARDINALs in source programs to INTs with a specified subrange, allowing the compiler to pick 16 or32-bit representations according to its wish. REFs must begin on an even 16-bit word. These changeswould make the disadvantages of 7a less onerous.With 7b, it is possible though unnecessary to have POINTERs address objects at 16-bit or otherboundaries rather than 32-bit boundaries, but pointing at 32-bit boundaries produces more efficient code.The whole question of "how big" things are seems to be an inconsistent one in Mesa. The compileralready chooses larger sizes for fields smaller than 16 bits in packed arrays, and it can choose 16-bitrepresentations of 32-bit INTs in some cases, but it does not have the freedom to enlarge 16-bit items.Also, it is legal to make a pointer to the beginning of a record and to some items within a record but notto others. The combination of 16-bit boundary addressing and choosing 7a avoids dealing with some ofthese inconsistencies. However, a superior long-term solution is to choose 7b and to clean up sourceswherever pointer arithmetic or object sizes are mentioned non-generally.Butler Lampson and Roy Levin have indicated that they want more-or-less a 7a approach, at leastinitially. Ed Satterthwaite, Ed Taft, Chuck Thacker, Phil Petit, and I favor a 7b approach in the long runand oppose any extra work that might be required for a 7a approach initially. The decision of what todo initially has not been firmly decided.Thacker and Petit are now proceeding with hardware pointers addressing 32-bit boundaries in "native"mode, with a corollary change to field descriptors allowing 17 to 31-bit fields. However, Dragon will,initially, also define four opcodes that accept pointers to 16-bit boundaries and do 16-bit or 32-bitloads/stores. By means of these opcodes, a 7a approach is possible initially at reduced speed andincreased code size, if we want to do that.8) Opcodes implemented by trap procedures in Trinity involve considerable execution time overhead.Dragon trap opcodes will compact common sequences for which there is insufficient microstore, so trapopcode execution time must be small. Fast trap opcode execution is also desirable on 16-bit machines,though not crucial. This memo details later a proposal to make a Dragon trap opcode semanticallyequivalent to a direct function call on the procedure which implements it. The same method cannotquite be made to work on 16-bit Mesa, but improvements are possible there as well.9) Other types of traps are also a problem. Here is a summary of Trinity traps and proposed dispositionof them:BoundsTrap from the BoundsCheck opcode,BreakTrap from the Break opcode,DivCheckTrap and DivZeroTrap from the Long Unsigned Divide opcode,PointerTrap from the Nil Check Long opcode, andProcessTrap from the Monitor Renter opcodeare examples of conditional traps that occur on a single opcode. These kinds of traps can be handledlike the unconditional traps discussed later. If the condition occurs in the trap procedure for anunimplemented opcode, then the event would be handled by a procedure call. 
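The "direct function call" treatment of trap opcodes proposed in (8) can be pictured with the following C sketch. The handler base address, the slot size, and all of the names are assumptions invented for the illustration; nothing here is a settled Dragon mechanism.

    #include <stdint.h>
    #include <stdio.h>

    static void transfer_to(uintptr_t pc)        /* stand-in for a control transfer */
    {
        printf("call handler at %lx\n", (unsigned long)pc);
    }

    #define TRAP_CODE_BASE  ((uintptr_t)0x10000) /* assumed location of handlers    */
    #define TRAP_SLOT_SHIFT 6                    /* assumed fixed-size handler slots */

    /* A trapping opcode behaves like a direct call on the procedure that
     * implements it: the handler address comes straight from the opcode,
     * with no indirection through a procedure descriptor or entry vector.      */
    static void take_trap(unsigned opcode)
    {
        transfer_to(TRAP_CODE_BASE + ((uintptr_t)opcode << TRAP_SLOT_SHIFT));
    }

    int main(void) { take_trap(0123); return 0; }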
Some of these traps will require software to back up the PC after the trap, but execution efficiency of low-frequency events is unimportant.

Some traps associated with events that possibly won't occur on Dragon are as follows:

    InterruptError from the Disable and Enable Interrupts opcodes,
    RescheduleError from the scheduler,
    StackError from an opcode that changes the evaluation stack pointer,
    UnboundTrap from Xfer,
    XferTrap from Xfer,
    CodeTrap from Xfer, and
    ControlTrap from Xfer.

I think that InterruptError and RescheduleError won't occur on Dragon because these conditions only arise when interrupts are disabled, but disabling interrupts won't create an atomic code sequence on a multiprocessor because other processors may be running concurrently. The equivalent event on Dragon will be some kind of monitor lockup, but it need not be handled by a trap. I am not sure whether these conditions should persist in 16-bit Mesa uniprocessors after we fix them for the Dragon or not. I haven't thought this through carefully yet. TBC.

The other PrincOps traps listed above are more complex and will have to be detailed later.

10) Data objects created on one machine, transported over a network, and potentially referenced later on a different machine should be specified to avoid problems with different word size machines. We should encourage programming with each data object beginning at its natural storage boundary--this means that a 32-bit item begins at a 32-bit storage boundary, a 64-bit data item at a 64-bit boundary, etc., and odd object sizes such as the 48-bit network i.d.'s are undesirable.

11) The following name changes seem appropriate for an environment where both 16-bit and 32-bit machines exist. The name "word" becomes doubtful in this environment because of its ambiguity, so we prefer names suggesting "two bytes":

    16-bit Name          General Name    Comments
    POINTER              MDS POINTER     Eventually deimplement?
    LONG POINTER         POINTER
    INT                  INT
    LONG INTEGER         INT             This change is already happening
    LONG CARDINAL        CARD            Deimplement?
    MACHINE DEPENDENT    DEFINITE        So record declarations become DEFINITE, PACKED,
                                         and UNPACKED according to the user's intent.
    LAW                  LADB            Local Address Double Byte opcode
    GAW                  GADB            Global Address Double Byte opcode
    JW                   JDB             Jump Double Byte opcode
    JIW                  JIDB            Jump Indexed Double Byte opcode

The idea on the name changes is that old sources will be automatically translated by pretty-printing old sources while simultaneously checking for various classes of errors and doing name substitutions. 16-bit Mesa compilers should accept both old names and new names during a transition period, eventually discarding obsolete names. Automatic source translation and checking can take place as soon as we know what we want to do, without waiting for Dragon.

3.5. What Can Change and What Must Not?

Below I have given a list of the low-level data structures which might be involved in various proposals and an assessment of how much headache changes and incompatibilities will cause. In other words, making the change has some cost in labor to find and edit all affected sources and to debug the result.

16-bit vs.
32-bit addressing: There will be a lot of editing for a switch to 32-bit addressing. Levinconsiders this editing to be an unreasonable demand during the early Dragon bootstrap period.Page Size: The current page size is 29 = 512 bytes. Mark Brown argues that the present implementationof Cedar allows the page size within the virtual memory (VM) to be any multiple of the page size on thedisk. Present disk formats should be retained so that Dragon disk packs can be interchanged with 16-bitmachines; also, the disk page size should not be changed because it is embedded in user programs and isthe unit of atomic write for Alpine. The VM page size is discussed later, and a 210 byte size is proposed;perhaps 211 bytes would be better.PSB Format, Local Frame Overhead: No one I have talked with is concerned about changes to PSB(Process State Block) or local frame overhead format or about winding up with a Dragon formatincompatible with 16-bit machines.Revamping Xfer data structures: Ed Satterthwaite has not objected to any of the proposed changes to xferdata structures or to the proposed incompatibilities between Dragon and 16-bit data structures. Myopinion after discussing issues with various people is that the code generation stages of the Dragoncompiler, the binder, etc. will all be different from those used on the 16-bit machines. There is noparticular objection to making the Dragon data structures whatever we want. However, people want toavoid elaborate changes to the 16-bit Mesa machine.Eliminating MDS: I am not sure about this one.Representing 16-bit INTEGERs and CARDINALs in 32 bits: I am not sure about this one.Changing Field Descriptor Format: This should not require much editing; there are few places which"know" field descriptor format.Process Data Structures: These can change without many side effects; few places "know" the form of theprocess data structures.3.6. Proposed Conversion StrategyAs discussed earlier, although we can require identical user sources on both 16 and 32-bit machines(possibly with a small number of conditional compilations or with some different declaration files foreach machine), a different compiler and opcode set must be used for Dragon. The debugger, code lister,and similar programs would also require modifications.Differences between 16 and 32-bit compilers could be confined to code generator parts of the compiler.We see no way to avoid compiling programs twice, once for Dragon and once for 16-bit machines.With this plan, we need not change the existing 16-bit compiler as part of the bringing-up-Dragon effort,although fixing some problems mentioned above will require changes. Software will require someediting, depending upon how much internal representations are "known" by programs in one way or fp!q4] G?fp b( ^q@' \I [(? Wpq > U] Rhp qRtRhq: PR Nh MC$ K>6KtK>q IsJtIsq Fp!q" D74) Bl" >pq%$ =/C ;e] 9N 7.6 63 2pq /!p6q +p!q) ) &spq," $ p" qE R X #D 6 KO 0. Z D-2 yQ l 2B]4Dragon DocumentEdward R. Fiala16 August 198317another.Hence, the following approach is suggested for the conversion:1) Any name changes that we decide upon can be made by modifying the existing 16-bit compiler toaccept new names and by automatically translating sources. 
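A sketch of the kind of table such an automatic translator might be driven by follows, in C. Only the name pairs come from the list in section 3.4 (11); the surrounding code is an invented illustration, and a real translator would of course substitute whole Mesa tokens rather than raw text.

    #include <stddef.h>
    #include <string.h>

    /* Old-name to new-name pairs from the renaming list proposed earlier. */
    static const struct { const char *old_name, *new_name; } renames[] = {
        { "POINTER",           "MDS POINTER" },
        { "LONG POINTER",      "POINTER"     },
        { "LONG INTEGER",      "INT"         },
        { "LONG CARDINAL",     "CARD"        },
        { "MACHINE DEPENDENT", "DEFINITE"    },
        { "LAW", "LADB" }, { "GAW", "GADB" },
        { "JW",  "JDB"  }, { "JIW", "JIDB" },
    };

    /* Return the translated spelling of a token, or the token unchanged. */
    static const char *translate_name(const char *token)
    {
        for (size_t i = 0; i < sizeof renames / sizeof renames[0]; i++)
            if (strcmp(token, renames[i].old_name) == 0)
                return renames[i].new_name;
        return token;
    }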
The program which does this translation, orsome other program, can check for certain kinds of errors more appropriately found by a translationprogram than by the compiler and report constructs which will not run on both 16 and 32-bit machines.It can also report places where dangerous pointer arithmetic is used--these may need attention. Thesemust be fixed by hand, if they cannot be fixed automatically.2) The Dragon processor will (at least initially) define four opcodes which use pointers to 16-bit half-words rather than 32-bit words; these opcodes each take a 32-bit pointer to a 16-bit boundary on thestack and fetch/store a 16/32-bit quantity. Petit presently proposes to implement these opcodes asXOPS, which means that they will always trap and be executed by a sequence of primary hardwareopcodes followed by a RET. All other fetch/store class opcodes will use pointers to 32-bit boundariesexclusively.During the early Dragon bootstrap period it will be possible to compile code which uses datarepresentations exactly compatible with those used on 16-bit Mesa machines; this code will be verboseand slow compared to Dragon's real potential, but might be on a par with the Dolphin in performance.The four opcodes mentioned above and a few others discussed later will be used heavily during thisbootstrap period.I don't think that 32-bit Dragon registers will interfere noticeably with attempts to be like 16-bitmachines. Data fetched from 16 bits of storage into a 32-bit register can have the left 16 bits of theregister either cleared or offset by 2^16 (= sign extended from the 16th bit) whenever appropriate, andthe left half can be discarded when storing.The bootstrap period would include both hardware development and software conversion anddevelopment. Later in the bootstrap, "native" pointer format, enlargement of some 16-bit objects to 32bits, MDS-elimination, and other changes can be introduced. It is not clear to me whether or not MDScan be eliminated on Dragon while retaining it on 16-bit machines. Possibly, POINTERs could beretained on 16-bit machines but compiled into LONG POINTERs on Dragon...?While the new Dragon compiler is being bootstrapped, the old 16-bit compiler will be clunking away asusual. We want minimal changes to the 16-bit compiler, microcode, and software during this earlyperiod. We might have to make some name changes, as discussed in (1) above, and we might have to fixproblems, but we don't want to combine unnecessary 16-bit changes with rapid Dragon compilerevolution.Later we can either freeze the 16-bit Mesa/Cedar environment or consider more extensive changes. TheDragon compiler will initially produce code only for Dragon, but it should be written so that codegenerators for 16-bit machines can be added later, if we choose.3) Apart from the four opcodes mentioned above and some sign-extension features, Dragon hardwarewill be designed without any consideration for backward compatibility; it will be aimed at the glorious32-bit world and the more uniform architecture we want to achieve. In other words, minimum supportand features for programs not yet upgraded will be provided. fp!q4] G?fp bq ^> [:@ YoM Wc UM TT RE= NH M X K>21 Is'7 G15 E Bl3) @50 >13 = ;' ;A 7U 6P 4:!F 2p, .+@, -3Q +i] )8' 'I $a41 "-4 >' F 7 H X 0@ (8 Q )6- ^< f BZ2Dragon DocumentEdward R. 
Fiala16 August 1983184) The ideal conversion would take place as follows: First, recompile the system for the 32-bit machine.During recompilation, no errors should be flagged because all of these should have been detected duringstep (1). The system will run without debugging. Next, as convenient, replace constructs that won't runwell on both 16 and 32-bit machines by ones that will. A compiler switch can cause obsolete usages tobe flagged as errors rather than compiled. When a module compiles and loads without error, it will run.5) To cover cases where old programs may compile correctly but not run, appropriate bounds checkingby the compiler (enabled by a switch) or microcode checks such as trap-on-pointer-odd should beavailable at runtime.The game being played during this transformation is that it will be possible for the compiler to produce a(possibly long and slow) sequence of opcodes to access 16-bit cardinals or 16-bit integers or to access datain the/an MDS, or to handle word-boundary crossings. However, the hardware and opcode set will notstruggle to make such sequences be fast or small. Instead, orientation will be toward programs in whichunsuitable 16-bit concepts have been deemphasized. Moderate performance degradation for unconvertedprograms is acceptable--in the range, lets say, of 3 times bigger and 3 times slower. fp!q4] G?fp bqpq Q `Sg ^27 \S Z ^ WS UP S PzDpq N ^ L U K ] IP<( GUf G>B Dragon DocumentEdward R. Fiala25 August 1983194. Hardware OverviewA Dragon system will have one or more Mesa processors and several special purpose processors on acommon memory bus (M). Most processors will interface to the M bus via an LSI cache component.The cache component accepts references over the processor bus (P). The basic cycle time of the machineis nominally predicted to be ~100 ns. Several possible configurations are shown in Figure 1.A minimum Mesa processor has four custom LSI packages:Instruction Fetch Unit (IFU);Execution Unit (EU);Cache (for the IFU)Cache (for the EU).Each of these devices is large, with the EU ~84 pins, IFU ~119 pins, and cache ~92 pins. The IFUprovides all control signals for a processor and includes a small, read-only microstore organized as severalprogrammed logic arrays (PLAs).Caches (at l = 2 microns) will hold ~64d blocks x 4 32-bit words/block; several caches can be gangedup on a P bus to increase effective cache size in ways discussed later. A cache is "fully associative,"which means that an address presented on the P bus is simultaneously compared against every address inthe selected cache. On misses, a (modified) cyclic algorithm is used to select the block for replacement.In response to P bus memory references, the cache may issue one or more M bus commands to mapvirtual pages into real pages, to obtain accurate data from storage, or to maintain consistent copies ofdata in all caches. A special purpose LSI device, the Arbiter, schedules the M bus using a round-robinalgorithm to resolve competition.When mapping information is not already in a cache, it sends an M bus command to the Map processor,another LSI device, which translates virtual page (VP) to real page (RP) using a map in storage--there isno separate map memory. The storage map also contains ReadOnly, WriteProtect, and Referenced bits.Other M bus commands read and write quadwords from storage. Storage is error corrected, but theerror-correcting codes are checked/generated at exit/entry to storage, and only parity is shipped over theM bus. 
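Because only parity travels on the M bus, a bus-interface checker needs very little logic; the C sketch below shows the idea. Whether Dragon uses odd or even parity, and whether it is computed per word or per byte, is not stated here, so even parity over a 32-bit word is assumed purely for illustration.

    #include <stdint.h>

    /* Fold a 32-bit word down to its parity (1 if the number of 1-bits is odd). */
    static unsigned parity32(uint32_t w)
    {
        w ^= w >> 16;
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return w & 1;
    }

    /* Even parity: the transmitted parity bit makes the total count of 1s even,
     * so the data and the parity bit must fold to zero together.                */
    static int parity_ok(uint32_t data, unsigned parity_bit)
    {
        return (parity32(data) ^ (parity_bit & 1)) == 0;
    }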
The extent to which parity checking is used to detect errors on the M bus, in the cache, on theP bus, and elsewhere is not yet decided.A commercial microprocessor controls system configuration, monitors and tests storage, and perhaps doeslow-speed input/output. One thought mentioned to me by Petit is to interface the M bus to a Multibususing a single LSI part we design; then any standard Multibus peripheral could be operated fromDragon. A typical configuration might include display, keyboard, ethernet, clock, and disk peripherals,all of which could be driven via the Multibus except the display, which requires too much bandwidth.Other special purpose processors have not been definitely decided upon. Coprocessors, such as specialBitBlt or floating point processors, have been discussed; in Petit's proposal, a coprocessor would share theEU's P bus and, sometimes, the IFU's P bus. A coprocessor opcode traps to software, if the requiredcoprocessor is not attached, allowing a single IFU to be used in all coprocessor configurations. Fan outconsiderations limit P bus connections to about 10d, so the number of caches plus coprocessors is limitedto about 8. fp!q4] G?fp ar ^eq5, \,3 Z9. Y] U6xRxQ+xO`xM JH I-93 Gb C sq#5 B%17 @[ [ >V ;G 9T62 7` 5! 2Lc 0T .T +E;% ){P 'U %( "sL C" "= h I X  Y  P B&> wa U   BYp#Dragon DocumentEdward R. Fiala25 August 1983204.1. Timing and ClockingTBC4.2. Memory ReferencesMy terminology is "references" on the P bus and "commands" on the M bus. Processors makereferences to the caches; caches both send and receive M bus commands to maintain consistency, asdiscussed later, but an M bus command only happens when some reference requires it.During the wait to acquire the M bus "grant" from the Arbiter, and during the execution of thecommand, the requestor is suspended. In other words, the microinstruction initiating a reference doesnot complete until all M bus commands for that reference are complete (Data transport and store intoshared page exceptions to this rule are discussed later.). A cache may respond to an M bus commandissued by some other cache, but this response doesn't slow a reference on its own P bus unless thatreference tries to acquire the M bus grant and has to wait.Caches can accept the following P bus references; timing comments assume that the reference hits dataalready in the cache:No operationStart no reference in this cycle.FetchRead one word. Another reference or "Move" class opcodes (DUP,EXDIS, RMOV, SRn, and LRn discussed later) can be executed in thecycle after a fetch; also, a store of data just fetched can begin in the nextcycle. However, "Op" class opcode (ADD, AND, etc.) cannot start untilthe second cycle after the fetch.StoreStore one word. Another reference cannot start in the cycle after a storebecause data is transported to the cache at that time. A store that hits aword shared with another cache is 2 cycles slower.Fetch-and-holdFetch and acquire/retain the M bus. This is used to create atomicsequences of memory references on a multiprocessor. If obtaining M buscontrol is not delayed by competition, then the first fetch-and-hold in asequence is 1 cycle slower than a fetch; subsequent "-and-hold" referencesare not slowed.Store-and-holdStore and acquire/retain the M bus. This is used to create atomicsequences of memory references on a multiprocessor. If obtaining M buscontrol is not delayed by competition, then the first store-and-hold in asequence is 1 cycle slower than a store; subsequent "-and-hold" referencesare not slowed. 
No operations use this.MapOpMapOp is intended for communicating map information between aprogram and the map processor; it might also be useful forcommunicating with other non-cache devices on the M bus. The"address" and "data" for a MapOp are transmitted to the cache as on astore; the cache acquires the M bus and sends a DoMapOp command withthe "address" during the command cycle and "data" in the next cycle.The map processor executes the command, which may require executionof other M bus commands. Finally, the map processor sends a fp!q4] G?fp b ^q Yp V!q: TV%< RS O$: MO51 KL IC GI F$; B[ @x>J x;09:8H6K14!x1J0 ?.M2x+ &)4(5&O =$x! 1 4Q5Jpxq3 Su/v =0#!)1 ^< (&) LA\?Dragon DocumentEdward R. Fiala25 August 198321MapOpDone command to indicate completion. The cache passes 32 bitsof arbitrary MapOpDone result information to the processor as on a fetch.If the map processor responds in 1 cycle, a MapOp takes 7 cycles (pluswaiting time for the M bus grant).A fetch or store which does not get a "data hit" in the cache gets a "map hit", if some cache block mapsthe same VP as the reference. In this event, the mapping information from the other cache block is usedrather than asking the Map processor for it. A fetch or store which misses and gets a map hit takes 8cycles (longer, if it must wait to acquire the M bus grant).A fetch or store which gets neither a "data hit" nor a "map hit" takes ~17 cycles; it is suspended, first,while real page and protection information is obtained from the Map processor via M bus commands,then for 4 more cycles obtaining the data from storage.Special cases and MapOps are discussed in the "Memory System" chapter.4.3. Instruction Fetch Unit (IFU)?? The IFU, shown in Figure 2, has two different P bus connections, its own for fetching opcodes, andone shared with the EU for other storage references and communication with the EU. A 16-bitbidirectional data bus and 19 other control signals collectively called the I bus also connect the IFU tothe EU (and to any coprocessors). All of these data paths are clocked on half-cycles.The IFU's job is to fetch the stream of opcodes and decode it appropriately into control signals for itselfand the EU. It can handle 1, 2, or 3-byte opcodes that don't jump or up to 4-byte opcodes that jump.The IFU does all the "work" for unconditional jumps, procedure calls, and returns without involving theEU. On other opcodes, it decodes the opcode, a, and b bytes into control signals for the EU andregister addresses for the EU's RAM; it handles conditional jumps by means of branch conditions fromthe EU.?? The IFU pipeline normally prepares opcodes at a faster rate than the EU can execute them. After ajump or context change, the first stage of the IFU pipeline fetches up to one word from storage per cycleand splits it into opcodes; the second pipeline stage decodes two opcodes per cycle until the pipeline isfull. Since the EU uses at least one cycle on each opcode it must execute, the IFU tends to refill its pipeat ~twice the rate at which the EU empties it. The IFU itself completes absolute or PC-relative jumps,procedure calls, and returns without consuming any EU cycles. In addition, opcodes which change thestack pointer (StkP) don't consume EU cycles; these opcodes are REC, DIS, and AS.?? Jump latency is the fetch time plus 1 cycle per word. 
Since a fetch that hits takes 2 cycles, this meansthat, if the last code byte of an unconditional jump is fetched 3 cycles before the following opcode mustbe executed, then execution time for the jump is 0; this assumes that the first fetch hits in the IFU cacheand that the word fetched contains all bytes of the next opcode. If the opcode at the jump target crossesa word boundary, then the jump is 1 cycle slower.Procedure calls and returns will be discussed extensively in the "Context Switching" chapter. The IFUkeeps a stack of PC's and pointers into the EU's RAM; the pointers delimit procedure frames. Thisstack is modified by the procedure call and return opcodes.When it encounters a conditional jump, the IFU pipeline follows only the predicted path, and it only fp!q4] G?fpbqC`S= ^+\" YL%C W!G UP S< PzX NS L7 IsF DZp" @qQ ?U =S#F ;V 8K 6KZ 4E" 2.sqsq$ 0J /! +e )C& (54 &OY $.9 "T Q }W A( ] j S1 <* T L; D B\x.Dragon DocumentEdward R. Fiala25 August 198322follows that path as far as the next conditional jump, procedure call, or return opcode. At that point, theIFU waits for the conditional jump to be resolved. An opcode bit predicts "jump" or "not jump," so thecompiler controls which way the jump is predicted. In other words, the conditional jump must executean EU microinstruction to obtain the branch condition which decides which path to follow.?? With a correctly predicted conditional jump closing a tight loop, and with two 32-bit words holding allthe opcodes in the loop, the minimum execution time is 5 cycles per iteration. Another importantsituation is a sequence of conditional jumps each correctly predicted to not jump. Minimum timebetween two conditional jumps in this situation is ???The "microstore" is totally on-chip in several PLAs, which imposes a tight space constraint. Theimplementation limits each opcode to at most three microinstructions. More complicated operations canstill be defined as "opcodes", but they will trap after at most three microinstructions and be emulated bya sequence of simpler opcodes. A trapping opcode X behaves exactly as though its three initialmicroinstructions were followed by a direct function call (discussed later) to a procedure in storage at Xlshift 6 (?). The idea is to determine the storage address without indirecting through a proceduredescriptor.4.4. Execution Unit (EU) and Opcode ClassesThe Execution Unit (EU) shown in Figure 3 has a three-ported 240b-word x 32d bits/word internal RAM.It can read 2 words from this RAM and write 1 word into it in each microinstruction. RAM addresses forthe three ports are provided by the IFU in the ways discussed in the next section.The two read ports for the EU's RAM are referred to in Figure 3 as "Ra" and "Rb," respectively, and thewrite port as "Rc." Ra and Rb each pass through a bypass circuit which compares the Ra (or Rb) RAMaddress to the Rc address and substitutes the Rc data for the Ra (or Rb) RAM output when the addressesare equal. Since it avoids waiting for a RAM write and read, the bypass circuit saves 1 cycle whenever aRAM word is read in the cycle after writing it.In addition to the bypass multiplexing, there is also multiplexing to allow a literal constant "k" to be usedinstead of Rb. 
The values of "k" may be any of the following:kOpcodesaLIB, ADDB, RB, RSB, WB, WSB, PSB, LRIn, SRIn37777777400b+aLINB, ADDNBbRRI, WRI, RAI, WAI, JEBB, JNEBB, RDIabLIDB, ADDDBab,,0LILDB100000b,,0LIMINF-1LIM10LI01LI12LI2Rb or k can also be optionally inverted.In Figure 3, the data paths into the ALU (Arithmetic and Logic Unit) and Field Unit after all of these substitutionsare called Rt (deriving from Ra), and Rm (deriving from Rb or k with optional inversion).The EU can apply a boolean, arithmetic, or field unit operation to the data from the correctly bypass Ra fp!q4] G?fp bqh `S Z ^S \Y YLX Wa U1/ S6 Pz#> NL Lg KJ IPj G;( E @p, =/qH ;e=* 9R 6( [ 4^ Y 2-9 036 ./ +-@ )>y'tQy$sQu,y# sQu y!YsQu$ysQu y suQy Qy&QyQydQyQ q(yuky)Y qe N Ct\x.Dragon DocumentEdward R. Fiala25 August 198323or Rb/k/Rb'/k' data. In addition, there is special logic for computing field descriptors when accessingpacked array elements, for doing multiplication at 4 bits/cycle, and for doing division at 1 bit/cycle.The following is one way of dividing opcodes into groups:MOVERc _ Ra.OPOperation (xor, add, and, or, shift) Rc _ Op(Ra,Rb) -or- Rc _ Op(Ra,k);FETCHRc _ (Ra + k)^ -or- Rc _ (Ra + Rb)^;STORE(Ra + k)^ _ Rb -or- (Rc _ Ra +/- 1)^ _ Rb;IFUOPIFU operation.JUMPEU not involved;CJUMPConditional Jump;ANDHOLD Conditional Store opcode;MAPMap processor operation;Othera decoded for unusual operations.The MOVE class is distinct from the OP class because registers are not needed until late on a MOVE,but are needed early on an OP, so that there is time to do the add, xor, or whatever. For this reason, aMOVE can be done after a fetch without being held 1 cycle, but an OP cannot. The MOVE opcodes areLRn, SRn, RMOV, DUP, and EXDIS; these opcodes provide the most useful options involving LRs andthe stack.The OP class uses one of the function units in the EU to perform a XOR, OR, AND, ADD, IF, EF,RSH, LSH, RCY, or LCY operation; each unit accepts two input words and produces one word ofresult. The two inputs to these operations are called Ra and Rb (or k). Because Rb can be ones-complemented before the operation, the boolean units can implement Ra & Rb', Ra U Rb', and Ra xorRb'. In addition, 0 and -1 are available as constants, so all 16d boolean operations are possible exceptRa' & Rb' and Ra' % Rb', which would require both operands to be inverted. However, not all of the14d possible operations are available from opcodes: Ra xnor Rb, Ra % Rb', Ra & Rb' are missing.Also, either the OR or AND units, for example, can provide MOVE by making Ra the same register asRb. This allows an OP to supply the missing MOVE-class operations, although they are 1 cycle slowerafter a fetch.Because Rb can be inverted, the ADD unit also does subtraction (i.e., Ra + Rb' = Ra - Rb - 1). Theonly micro-level limitation is that -Ra-Rb is impossible. Associated with the ADD unit is logic to savethe carry-out (addition), the carry-out' (subtraction), or 0 in a flipflop called Carry, and to supply thecarry-in from either 0 or Carry (addition), or from 1 or Carry' (subtraction). In other words, OPs can doneither Ra-Rb-1 nor Ra+Rb+1. 
Integer out-of-range can be computed for either 31d-bit or 32d-bit integers, as discussed later in the "Arithmetic" chapter.

A zero-test on the XOR result together with carry-out and sign bit of the ADD unit are the only conditions shipped back to the IFU for use in conditional jump opcodes.

The Shift-and-Mask Unit left-cycles a 64-bit input (Ra..Rb) and writes a 32-bit masked result into Rc; the left-most 32 bits of the cycled output are the result, so a left-cycle count of 0 delivers Ra. The controls may be taken from either ab or the Q register in several different ways: Shift, Extract Field, Insert Field, and Cycle.

On a Shift, the 64-bit input is left-cycled by the shift count SC (a[2..7] or Q[26..31]); positive SC left-cycles by SC and masks off SC bits on the right-hand end of the result; negative SC shifts left by 32-|SC| and masks off |SC| bits at the left-hand side of the result.

On an Extract Field or Insert Field, the left-cycle count and mask are specified separately from two 5-bit fields taken from Q or from ab. P is the number of bits to the left of the field; W is the width of the field (with W=0 interpreted as 40b). The shift count for an extract field is P+W, and the mask is (2^W - 1)'. Insert Field is done in two steps: First, the destination is left-cycled by P+W and masked by Mask[W] so that the field being replaced is right-justified and zeroed; simultaneously the new value is ANDed with Mask[W]'; these two results are ORed. This yields the desired result but cycled. Secondly, the result of the first step is cycled by 32-P-W without masking. (A code sketch of this sequence appears below.)

The EU has an extra register called Q which is used as a shift register by multiply and divide, and an ICAND register is also used by multiply, but it has properties unknown to me. Controls for the shifter-masker can optionally come from Q also.

For FETCH and STORE class opcodes, Ra (the base register) plus Rb/k (the displacement) is the virtual address of the reference. For a FETCH, Rc is where the memory data winds up. For a STORE, Rb is the data written. If the reference hits in the cache, a fetch can be followed by any other reference in the next cycle, but the next reference must wait until the second cycle following a store.

IFUOP (IFU Operation) opcodes do not involve the EU. For example, stack pointer modifiers such as DIS, REC, and AS do not involve the EU. In principle these opcodes can, like JUMPs, be completed in 0 EU cycles.

JUMP opcodes may specify jump displacement in either opcode bits, a, or ab; CJUMP opcodes only specify jump displacement in a. Pure jumps do not require any EU cycles, so execution time is zero if the IFU is "caught up". Proposed conditional jump opcodes specify whether the IFU should favor the jump or the non-jump path, on the theory that the compiler will usually be able to guess right. If the compiler is not sufficiently successful at branch prediction, other algorithms have been discussed. Currently proposed branch conditions are functions of the ALU sign, the XOR zero, and the ALU carry-out. There is also a special conditional jump associated with the Conditional Store opcode.

These simple opcodes sometimes have an enormous performance advantage over the current 16-bit Mesa opcodes. Many programming examples are given later in this document.
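The sketch promised above follows; it is a minimal Python model of my reading of the Field Unit's Insert Field (and Extract Field) sequence described earlier in this section, reduced to a single 32-bit word with bit 0 as the most significant bit and with W=0 treated as a full 32-bit field. The helper names are mine, and the 64-bit Ra..Rb datapath and the Q register are not modeled.

    MASK32 = 0xFFFFFFFF

    def lcy32(x, n):
        # 32-bit left cycle (rotate) of x by n places; n may be negative.
        n %= 32
        if n == 0:
            return x & MASK32
        return ((x << n) | (x >> (32 - n))) & MASK32

    def insert_field(dest, new, p, w):
        # Insert the low-order W bits of 'new' into bits [P..P+W-1] of 'dest'.
        w = 32 if w == 0 else w
        field = (1 << w) - 1
        # Step 1: cycle the destination by P+W so the field is right-justified,
        # clear it, and OR in the right-justified new field; the result is cycled.
        step1 = (lcy32(dest, p + w) & ~field & MASK32) | (new & field)
        # Step 2: cycle by 32-P-W without masking to undo the rotation.
        return lcy32(step1, 32 - p - w)

    def extract_field(src, p, w):
        # The corresponding extract: cycle by P+W and keep the low-order W bits.
        w = 32 if w == 0 else w
        return lcy32(src, p + w) & ((1 << w) - 1)

For example, insert_field(0, 5, 2, 3) places the three-bit field 101 into bits [2..4], in the [first..last] notation defined earlier.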
An example which shows Dragon most favorably is the sequence for adding two locals and storing the result in a third. With Trinity 16-bit Mesa, the sequence of opcodes for this is: LL0 LL1 ADD SL2. Dorado timing for this is about 6 cycles; Dolphin timing is about 27 cycles. But on Dragon, this sequence is replaced by a single RADD opcode taking 1 cycle (because the locals are all in registers)!

4.5. EU RAM Addressing

This section discusses the ways in which RAM addresses for the three ports of the EU's RAM are derived.

Addresses 0 to 177b in the EU's RAM are managed as a ring buffer for both local registers (LRs) and evaluation stack. The StkLim register in the IFU can be loaded with the number of registers reserved for page and frame fault handling. Then, 200b-[StkLim] of the ring buffer registers are available to the user process, while [StkLim] are reserved for page and frame fault handling. Details about ring buffer management are discussed in the "Context Switching" chapter.

In addition to the 200b ring-buffer registers, an IFU register called RA can be loaded with a pointer to any RAM address, and the 20b registers from RA+0 to RA+17b can then be accessed independently; these 20b auxiliary registers (ARs) are referred to as AR0 to AR17. For Mesa it is likely that RA will be initialized and remain at address 220b, though there are other possibilities.

For example, a procedure which wanted more than 20b LRs could load RA with RL+20b and then treat the 20b ARs as additional local registers. The opcodes discussed later do not support ARs as fully as LRs, but this would be a feasible way to get more local registers. However, the additional effort to load and reload RA is unattractive.

Finally, register addressing modes discussed below permit 14b "constants" (addresses 200b to 213b) to be conveniently referenced on the read ports of the EU's RAM, though none of these can be written. However, since any RAM register can be read or written by first loading RA appropriately, the constants can be written in this way; they are called "constants" only because the normal addressing mode does not permit writing. In addition, it is possible that the first four constants will be hard-wired to the values -1, 0, +1, and +2, leaving the other values undeclared; 20000000000b is another definite constant.

The possible base registers from which registers in the EU RAM are referenced are as follows:

    RL    Points at local register 0 = LR0; LRm = [RL+m] can be referenced, where m = 0 to 15d.
    S     Points at the top of the evaluation stack; [S+1] means "contents of the word above the top-of-stack"; [S-1] means "contents of word immediately underneath the top-of-stack". [S] and [S-1] are readable, and [S-1], [S], or [S+1] are writeable. S can also be modified, as discussed below.
    RA    Points at auxiliary register 0 = AR0; ARm = [AR+m] can be referenced, where m = 0 to 15d.
    CS    Points at constant 0 = CS0; CSm = [CS+m] are readable, where m = 0 to 11d on the read ports; unavailable on the write ports.

The first addressing mode is stack-relative; opcodes which use only this addressing mode (called "SO" operations) can be compactly encoded in one byte, or a and b can be used for purposes unrelated to register addressing.
A stack operation can read and write the registers at S+1, S, and S-1, and at the end of a microinstruction, S can be changed to S+1, S-1, or S-2.

The most flexible addressing mode, called the "register-to-register" or "RR" addressing mode, allows any of the above base registers to be used with the ranges indicated. For the RR mode, ab encodes three register addresses in the following way:

    Ra = b[1..1]a[4..7]
    Rb = b[2..2]b[4..7]
    Rc = b[0..0]a[0..3]
    AuxSel = b[3..3]

The first bit in each group is called "Opt" and the last four bits are called "Reg" here. The following cases are possible:

    AuxSel  Opt  Reg          Meaning
    0       0    0 to 17b     [RL+Reg]
    1       0    0 to 17b     [RA+Reg]

    Read ports only (Ra and Rb):
    --      1    0 to 13b     [CS+Reg]
    --      1    14b          [S]
    --      1    15b          [S-1]
    --      1    16b          [S] and S _ S-1
    --      1    17b          [S-1] and S _ S-1

    Write port only (Rc):
    --      1    0 or 10b     [S+1]
    --      1    1 or 11b     [S+1]
    --      1    2 or 12b     [S]
    --      1    3 or 13b     [S-1]
    --      1    4 or 14b     [S+1] and S _ S+1
    --      1    5 or 15b     [S+1] and S _ S+1
    --      1    6 or 16b     [S] and S _ S+1
    --      1    7 or 17b     [S-1] and S _ S+1

Why not allow 8 of the constants to be written on the write port using the unused decodes 10b to 17b above? This would convert these 8 registers to read/write form which is better than restricting them to constants.

Thus the RR mode allows almost all interesting combinations of registers except that references to both LRs and ARs cannot occur in the same microinstruction. Note that the read and write addresses relative to S in the table above refer to the original value of S. Also, S changes are cumulative; so if Ra, Rb, and Rc all select the stack addressing mode, then it is possible to accumulate a change to S of -2, -1, 0, or +1; at the end of the microinstruction, S is modified by the accumulated changes. Note that the RR mode uses both a and b to specify the register addresses, so the opcode must be 3 bytes long, and no extra bytes are available for other uses.

A third addressing mode allows the Ra and Rc ports to use Opcode[4..7] to select a LR (LRIn, SRIn, SRn, and LRn); this is called the "O" mode.

A fourth addressing mode uses the two nibbles in a to select two LRs, or one LR and one AR; this is called the "A2" mode. a[0..3] can specify the register for the Rc or Rb ports, and a[4..7] can specify the register for the Ra port.

A fifth addressing mode uses b[2..7] to specify any LR, AR, constant, or stack option for the Rb port, exactly as in the RR addressing mode, while b[0..0] selects [S] or [S-1], and b[1..1] selects no change to S or S _ S-1. This is called the "B2" addressing mode.

These are the only ways in which EU RAM addresses can be derived.

4.6. EU RAM Refresh

The EU RAM must be refreshed with a period of ~2 msec at a temperature of 75°C; the primary leakage phenomenon is exponential in temperature [~exp(-44*(300/T))], so the period can be doubled for each 5°C decrease in maximum temperature.

Since there are 240b = 160d words in the RAM, one word must be refreshed every ~125 cycles. The IFU has a counter which cycles through the addresses in the EU's RAM. If the Rb RAM port is not used in any cycle, the IFU will use the Rb port to read the next word to be refreshed. Reading the word refreshes it.
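The refresh bookkeeping is simple enough to state as code. The Python sketch below models this policy plus the timeout counter mentioned in the paragraphs that follow: a counter walks the 160d RAM addresses, a free Rb-port cycle refreshes the next word for free, and an overdue refresh steals a cycle. The class and names are mine; only the 240b word count and the ~125-cycle budget come from the text.

    RAM_WORDS = 0o240         # 160d words in the EU's RAM
    REFRESH_BUDGET = 125      # roughly one refresh read is needed every 125 cycles

    class EURamRefresher:
        def __init__(self):
            self.next_word = 0        # RAM address the IFU will refresh next
            self.cycles_since = 0     # cycles since the last refresh read

        def tick(self, rb_port_free):
            """Model one machine cycle; return True if a cycle had to be stolen."""
            self.cycles_since += 1
            stolen = False
            if not rb_port_free and self.cycles_since >= REFRESH_BUDGET:
                rb_port_free, stolen = True, True     # timeout: steal the cycle
            if rb_port_free:
                # Reading the word through the Rb port is what refreshes it.
                self.next_word = (self.next_word + 1) % RAM_WORDS
                self.cycles_since = 0
            return stolen

With the Rb port free as often as the following paragraphs argue, the stolen-cycle path in this model should almost never be taken.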
The Rb port is not available on stores, on OP class opcodes, on RJxxB conditional jumps, or on RFX opcodes; these opcodes read data from both the Ra and Rb ports.

The Rb port is available for refresh whenever a cycle is suspended for memory wait or because the IFU isn't ready with the next opcode; it is available on all fetches except RFX, on all the immediate opcodes, and on all the Move class opcodes.

Since the Rb port is so frequently available, it should be rare for a program sequence to prevent refresh for 125 cycles. And, if this ever happened, there would still be no problem unless refresh couldn't catch up within the ~2 msec interval. However, the IFU also has a counter which will time out if a refresh cycle has not been inserted during the required interval. If this counter times out, it steals a cycle to insert a refresh read.

4.7. IFU Performance

On Dorado, about 16% of all cycles were spent in IFU "not-ready" wait during one of McDaniel's measurements; about 26% of all opcodes executed were jumps, conditional jumps, or context switching opcodes. Fortunately, the Dragon IFU should have comparatively little not-ready wait for the following reasons:

1) The Dragon IFU gets code bytes from the cache at four bytes per cycle, twice the rate per cycle of the Dorado IFU, and it does not compete with the processor for cache cycles because there are separate caches; this means that in a long stream of non-jump 3-byte opcodes with no IFU cache misses, Dragon will ALWAYS keep up, but Dorado can't even keep up with a stream of 2-byte opcodes. This means that the only not-ready wait on Dragon will be due to jumps and cache misses.

2) Jump latency is 3 cycles on Dragon vs. 5 cycles on Dorado. On Dragon, in the worst case (assuming no cache misses), 5 non-jump opcodes between two jumps ensures that the second jump will complete in 0 time. This is a very bad Dragon worst case with all 5 non-jump opcodes at the worst byte alignment in words, 3 bytes long, and completing in 1 cycle.

3) Dragon should get better true/false prediction on conditional jumps. Dorado always predicts "false" and is correct about 60% of the time. If the Dragon compiler can improve this to 80%, Dragon will average .6 cycles not-ready wait per conditional jump compared to 2 cycles per conditional jump on Dorado.

4) Delay on a Dragon miss is ~10 cycles vs. 25 cycles on Dorado. [8 cycles/miss is the common case on Dragon, which happens when another munch on the same page is already in the cache; a Map processor access adds ~8 cycles. For the IFU it seems reasonable to expect 75% of misses to be serviceable without a Map processor access, which gives an average miss time of ~10 cycles.]

5) If the hit rate in the IFU cache is unacceptably low, several caches can be ganged to improve hit rate.

This means that the "width" of an opcode, i.e., the number of bytes required to encode it, will have little effect on its execution time. This fact is important to Dragon performance because compiled code will be less compact than on existing 16-bit Mesa machines.

4.8. Limited Microstore

The original Dragon opcode set will be implemented by less than 200d microinstructions. Opcode complexity is likely to remain well below that of the opcode sets used on 16-bit D machines for a number of years.
In addition, there is no conditional jump microinstruction on Dragon, so opcodes with microcoded loops or tests are impossible. Finally, EU registers can be flexibly addressed by the collection of one-microinstruction opcodes presently proposed, but multi-microinstruction opcodes lose most of this flexibility. In other words, Dragon has not been designed for microcoding complicated opcodes.

Although hardware-supported Dragon opcodes must be simple for these reasons, it is still possible to define a complicated operation like MUL or BLT as an XOP trap opcode for code compactness, and then implement the desired operation in a trap subroutine. This would not work well on the 16-bit D machines because traps are slow. But on Dragon, the trap itself and the return from the trap use 0 EU cycles (if the IFU is caught up) and the IFU can fetch opcodes in the trap subroutine without interfering much with the work the EU is doing.

A microstore consisting of one or several PLAs (Programmed Logic Arrays) is more LSI area efficient than a ROM. An unoptimized PLA has chip area proportional to (2*Inputs + Outputs) x Min-terms, where inputs are bits required to address the microstore, and outputs are bits in the microinstruction; 2*Inputs are required to provide both the input and its complement to the AND plane of the PLA; the number of Min-terms is the number of terms in the AND plane (.le. 2^Inputs required for a ROM).

A major effect of using a PLA rather than a ROM will be to reduce the number of initial microinstructions for a family of opcodes. To illustrate, suppose that Load Local n (n = 0 to 17b) are assigned opcodes 120b+n and that location X in the microstore is executed as the first microinstruction for opcode X; then only one row of the AND-plane will be needed for all 20b of these opcodes because it is possible to pass through the low-order four bits of the opcode byte itself as an "immediate" argument to a single microinstruction. This type of change saves over 100d of the 256d starting microinstructions for currently proposed opcodes. In other words, in a "sparse" table of dispatch microinstructions, the PLA organization coalesces many identical microinstructions into only a few Min-terms.

This trick saves 60d microinstructions on the LLn, SLn, LRIn, and SRIn opcodes alone, and 6d more on DFC and DJUMP. Similar tricks save many more microinstructions by selecting the AND, OR, XOR, ADD, or SHIFTER units and the Rb' option with bits from the opcode.

Further size reduction is possible by decomposing AND and OR terms for individual PLA outputs, although this is harder to do with automatic design tools. Dragon opcodes are carefully chosen to facilitate space bumming in the PLA, which will be done by hand, if necessary.

This size reduction is a large departure from previous Mesa machines. In an earlier proposal, 32-bit microinstructions from the cache would have been executed after exhausting the ones on-chip. Because the cache can deliver 32 bits/cycle with a latency of two cycles, the fetch for the first off-chip microinstruction would be started during the second PLA microinstruction, so it would be ready for execution after the third PLA microinstruction. Assuming no cache misses and no unpredictable microcode jumps [And Dragon microcode has no conditional jumps.], this microcode would execute just as fast as microcode from the PLA.
Hence, the main problems with this proposal are that it constrains microinstructions to 32 bits and complicates the IFU hardware.

Thacker and Petit believe there would be little performance advantage to executing additional microinstructions rather than sequences of non-trap opcodes. So there is little reason to complicate the IFU with some method of obtaining microcode from storage.

4.9. The Opcode Set Cannot Be Changed.

IFU PLAs allow much more on-chip microcode than a RAM. But if a PLA is used, how will changes be made in the opcode set? The ability to modify opcodes is highly desirable during development.

Phil Petit indicated that there will be no method of executing microcode from off the IFU chip or of changing the action of any opcode, short of replacing the IFU chip with one programmed differently. In addition, he believes that the control signals which constitute the "microprogram" are sufficiently baroque that only a hardware person familiar with the machine will be able to make changes.

This means that a single-processor system could be programmed either for Cedar, Lisp, or Smalltalk, but not more than one of these. Also, the Lisp and Smalltalk groups would have to hire an LSI designer, or be loaned one, to provide their own opcode set on Dragon. However, the opcodes proposed are sufficiently basic that sharing a common opcode set among all of these software systems is possible, and this is what we should aim for. Although Dragon is primarily oriented toward the Cedar system, some provisions are being made for Lisp and Smalltalk, as discussed later.

The basic 256-opcode set includes many trap opcodes, the interpretation of which can be modified by changing trap procedures. During development, this is one way to make changes. In addition, even though old opcodes cannot be modified, they could be abandoned by the compiler, while a modified function is provided through a trap opcode. Thus a limited form of modification is possible, with somewhat degraded performance until changes are incorporated in hardware.

The hardware design makes unimplemented regular opcodes as fast as possible, so that the time penalty for these is small. However, Esc/EscL opcodes, which dispatch on a, will be slow on Dragon because the dispatch is carried out by software instead of hardware.

4.10. Coprocessors

A coprocessor is a separate processor subordinate to a Mesa processor which performs special functions. Current thinking on the use of coprocessors is in Phil Petit's "Dragon Co-Processors" memo ([Ivy]dragcop.br).

Petit's memo identifies three important coprocessor examples:

1) A Lisp, Smalltalk, or whatever coprocessor which would run autonomously by shutting down the IFU. Such a coprocessor would be pin-for-pin wire-or'ed with the IFU; it would control the EU and both the IFU and EU caches.

2) A floating point coprocessor which would obtain arguments from the EU's stack, perform a function on the arguments, store a result on the EU's stack, and then return control to its caller. Such a coprocessor would be wire-or'ed to the I bus and the EU's P bus.

Note that this kind of coprocessor can be shared by a Lisp or Smalltalk coprocessor, because control is returned to the caller at completion, which is not necessarily the IFU.
3) A BitBlt coprocessor which would obtain a pointer from the EU's stack or from a fixed storage location, acknowledge completion of the command, and then run in parallel with the IFU using a separate cache to move data.

I think this is not a good example because BitBlt is probably best provided by a special purpose autonomous processor that accepts commands in storage from any Mesa processor. How would page faults be handled in a parallel arrangement?

The general idea is that coprocessors will be wire-or'ed onto the EU's I bus and P bus; some coprocessors may have another P bus and cache; others may be wire-or'ed to the IFU's P bus as well. A coprocessor is started by a two-byte COPR or three-byte COPRL opcode in which a[0..2] identifies the particular coprocessor and a[3..7] the command to be executed, while b (COPRL only) is passed as an immediate argument. The IFU presents a, b, RL, and StkP on the I bus during the COPR opcode, and raises the StartCopr signal.

If some coprocessor is prepared to execute the requested function, it latches the I bus information it needs and acknowledges; otherwise, the IFU pushes a (COPR) or ab (COPRL) and traps the opcode to an address based upon the opcode number (i.e., a different address for COPR and COPRL). Meanwhile, the IFU has treated the opcode like a NOP and fetched code bytes along the non-trap direction, so when no coprocessor acknowledges, the trap takes ~6 cycles.

If some coprocessor acknowledges, then the IFU pauses until the coprocessor indicates completion. While the coprocessor is in control, it can use the EU like the IFU would use it. Note that the IFU does not drive any of its pins until the coprocessor signals completion, so that the coprocessor can share the IFU and EU P buses.

However, the coprocessor does not know at what point stack overflow will occur, suggesting that all coprocessor operations must be defined so that the number of words on the stack as arguments is .ge. the number of result words. Or StkLim can be set up to reserve extra words for coprocessors.

When finished, the coprocessor returns control to the processor which called it along with the following information:

    the new value of StkP;
    advance the PC to the opcode after COPR or don't advance the PC;
    reschedule or don't reschedule;
    non-zero trap address or zero meaning don't trap;
    3-bit ID of processor being returned to.

On a trap, more bits determine the location to which control is transferred. a[0..2] are placed on the I bus to identify the coprocessor being called, where the IFU is processor number 0; these same three bits on the I bus identify the processor being returned to by a coprocessor exit.

4.11. Scheduling and Interrupts

Each IFU and coprocessor has a Reschedule input pin.
If the IFU is in control when Reschedule ispulsed, it will either insert a reschedule trap into the control stream, or, if another interrupt or a page,frame overflow, or frame underflow fault is in progress, then it will set a WakeupWaiting flipflop.WakeupWaiting regenerates the Reschedule condition when control departs from the fault or interruptsubroutine.If a coprocessor is in control when Reschedule is pulsed, it is responsible for either servicing theinterrupt or passing the fact that an interrupt has occurred back to the IFU appropriately.The current proposal is to have all interrupts pulse the Reschedule pins on all processors simultaneously.The processors will all try to lock the process scheduler, and only one will succeed. Russ Atkinson iswriting a memo on this. TBC.4.12. Booting and InitializationEach IFU, EU?, Cache, Map processor, Arbiter, etc. will have a Reset input pin which can be asserted bythe maintenance microprocessor during booting. While Reset=true, the caches will neither initiate norrespond to M bus commands or P bus references, and the IFUs will not initiate any references on the Pbus. During Reset=true, all of these devices automatically initialize themselves to a good state.It is likely that one IFUReset signal will be bussed to all the IFUs in a multiprocessor configuration,while a separate CacheReset signal is bussed to all the caches. These signals will remain true at leastlong enough for the devices to initialize themselves.After a suitable interval, CacheReset will become false first, while IFUReset continues to be true. Amaintenance processor will complete cache and storage initialization as described in the "MemorySystem" chapter. Enough of the storage map and IFU initialization program are initialized so that theMesa processors will be able to execute the next stage of the bootstrap. Finally, the maintenanceprocessor sets IFUReset=false.At this time, each IFU will start executing opcodes at a particular location as though it were servicing apage fault. In this condition, interrupts are deferred and faults are ignored. The sequence of opcodesexecuted will initialize the EU RAM, the IFU frame ring, StkP, etc., and finally turn off the faults-disabled condition and allow a Reschedule trap to occur.TBC. fp!q4] G?fp b ^qa \f [.5 YL(; W T13 RE[ NV M>) K> F$p! BqX @E! ?;* =S] 9 [ 8_ 6K5 215 1K /Db -z=% + (=h &sT $M "8 k r $BHDragon DocumentEdward R. Fiala25 August 1983325. Size Limits16-bit Mesa has the following size limits (possibly others):Virtual memory (VM) size225 bytes (Dorado*)Real memory (RM) size223 bytes (Dorado*)Page size29 bytes (Dorado*)MDS size128k bytesFrame size indices256Global Frame Table entries/MDS1024Number of processes1024Code segment size64k bytesEntries/code segment128Various object sizes216*Dorado VM sizes of 227 and 229 bytes are available with 211 or 213-byte pages, respectively; in thiscase RM size is also enlarged by a factor of 4 or 16. Alternatively, VM limits are multiplied by 4when 256k RAMs are used in the map. However, the cache independently limits VM size to 226bytes (or to 227 bytes if parity is discarded); this limit can also be multiplied by 4 if 4k ECL RAMsare used for cache data. VM sizes above 229 bytes cannot be obtained.There are also some limits associated with data structures used by the compiler, etc., not intrinsic to theunderlying Mesa machine. Only intrinsic limits are discussed here. 
Where necessary, this memo proposes changes such that each "hard" or unavoidable size limit, such as the VM size, be made at least 30 times larger than current usage.

Only the number of GFT entries is of immediate concern to people I have asked; 700 of the 1024 GFT entries are presently in use by Cedar. Of medium term concern is the fact that more than half of the MDS is in use; other limits are longer term. The "Context Switching" chapter discusses how MDS, GFT, and code segment entry vectors are eliminated for Dragon; it is not clear whether these will vanish on the 16-bit machines. Roy Levin is of the opinion that we will probably not have to fix the MDS size limit on the 16-bit machines and that we can eliminate the MDS on Dragon without doing so on the 16-bit machines.

The FSI limit and code segment size are "soft" limits because they don't really limit what programs can do.

This leaves only the "hard" limits of VM and RM size and number of processes to worry about. The largest systems presently use 2^21 to 2^22 bytes of VM and about 2^22 bytes of RM. About 150 processes have been used.

If current proposals are adopted, Dragon size limits will be as follows:

    VM size                       2^20 (min) to 2^33 (max) bytes for data
                                  2^28 bytes for code
    RM size per storage module    2^20 (min) to 2^28 (max) bytes
    Page size                     2^10 bytes
    Number of processes           8192
    Code segment size             32k bytes

The Dragon code limit of 2^28 bytes is the result of using four four-byte DFC and four four-byte DJUMP opcodes, allowing 2^26 words or 2^28 code bytes to be addressed. If we expected code to exceed the 2^28-byte limit, then DFC and DJUMP would be enlarged. However, these are the only four-byte opcodes in the machine, so making them 5 bytes wide would be costly. Or another opcode bit used as jump displacement would consume 8 more opcodes, also undesirable.

Personally, I don't think that programmers produce code at the much-mentioned 1/2 bit per year of VM exponential rate. Although there have been some improvements in programming technology, we seem to produce code at a more-or-less linear rate, and if we can produce 2^21 bytes of new code per year and continue using all our old code, it will take 128 years to use up the 2^28-byte code limit advocated for Dragon. Consequently, I am unconcerned about this limit. One worry: intelligent programs which produce code rather than representing knowledge in some interpreted structure might run out.

However, I can imagine applications that will fill unreasonably large VM's with data of various sorts, so we should be conservative with respect to the VM size limit for data. Data could be allowed 2^34 bytes, the limit with a 32-bit pointer. However, cardinal arithmetic, which Dragon doesn't support well, would be needed for 2^34-byte pointers; Dragon lacks unsigned conditional jumps which would be needed for pointer comparison; also, the integer out-of-range checks for ADD, SUB, ADDB, etc. make these opcodes inappropriate for pointers larger than 31d bits--integer out-of-range would occur for adds crossing the 2^31 word boundary.
Secondly, the SFC opcode is limited to 2^33 bytes because bit 0 distinguishes "direct" from "indirect" pointers, as discussed later; its implementation would have to zero the sign bit in hardware before making the reference, if a 2^34-byte VM were allowed, and indirect references would be limited to the first half, 2^33 bytes. Thirdly, a one-bit larger VM requires one more VP bit in each cache block, and, possibly, one more signal pin each on the P bus and M bus; it requires one more bit in each map entry. Fourthly, Interlisp (and I think Smalltalk) will use half of pointer values for small integers, so they cannot use more than 2^33 bytes. These arguments suggest that the upper bound on the VM size be held at 2^33 bytes.

The hardware lower limit on VM size is affected only by applications which might need to use high-order bits in words containing pointers. If the high-order bits can be left equal to 0, then any other limit on VM size, such as one imposed to limit the size of mapping tables in storage, can be separately checked by page fault software; it need not affect the hardware limit. For this reason, determination of the lower limit should be governed only by possible uses of high-order pointer bits.

First, it may be convenient to represent byte pointers with Pointer[0..1] (Mesa) or Pointer[1..2] (Interlisp) identifying the byte and Pointer[3..31] identifying the word. This consideration brings the lower bound on VM size down to 2^31 bytes, still over 100d times larger than the VM in use on our 16-bit machines, but only about twice the capacity of our latest 315 megabyte disk drives. Secondly, if a generalized field descriptor is desired in one word, then 10d bits are needed, which reduces the VM size to 2^22 bytes.

These thoughts suggest that we can't decide definitely on the VM size lower limit, so we should be conservative. Two provisions have been made for reducing the VM size. First, a 30d-bit mode has been added to the EU; when SmallVAMode=true, VA[0..1] is zeroed in all VAs presented to the cache. Secondly, the caches also allow any VA bits on the P bus to be zeroed, so VM size can be arbitrarily controlled at the cache interface. The plan is to allow VM size to be controlled over a range up to a 2^33-byte maximum.

Because munches are 128 data bits wide, minimum RM module size is 2^20 bytes (=256k words) with 64k x 4 RAMs or with 256k x 1 RAMs used in nibble mode. The lower bound on module size could be reduced with a larger configuration multiplexor in the map processor, but components shorter than 64k RAMs would have to be used to implement a smaller size. I am ignoring this possibility at present; UMC for a 2^20-byte storage module should be ~$500 when we begin building Dragons in quantity, declining to ~$300 a few years after that.

Maximum module size is limited separately by cache and map; the map proposal later allows a much larger limit than the value proposed here. The cache limit is governed by the number of bits given to the real page in the implementation, currently 18 (=> 2^28 bytes = 64m words on one M bus with a 2^10-byte page size).
This limit is multiplied by the number of M buses in the configuration, giving a 256m-word limit with 4 M buses.

Code segment size is the only limit more severe than in 16-bit Mesa; it results from the fact that both local function calls (LFCs) and jumps allow a 16-bit signed displacement from the opcode. At present, jumps are always interior to a single procedure, and procedures are always contiguous blocks of code, so limiting PC-relative jumps to +/- 32k bytes is unimportant. However, code segments are limited to ~32k bytes because it is possible to have a LFC near the end of the code segment to a procedure near the beginning and vice versa, so the beginning and end of the code segment cannot be much further than 32k bytes apart. According to Roy Levin, the LFC-imposed code segment size limit is unimportant for single modules, normally well under this size; only when the packager puts a number of modules together is this limit a factor, and Roy is not concerned about repackaging efforts that he believes will be modest.

The code segment size limit could be quadrupled by left-shifting the 16-bit ab displacement for LFC by 2; this would work because procedures always start on the zeroth byte of a word. However, Phil Petit wants to use the same IFU hardware for jumps, which must be byte-relative, as for LFC, so this isn't a convenient hardware change. Alternatively, two bits from the opcode itself could extend the range of LFC, but this would waste three opcodes. One of these alternatives could be implemented if the code segment size limit is objectionable.

5.1. Page Size

Two different page sizes are of interest on Dragon, one for data written on the disk, another for data in VM. Mark Brown feels that it is desirable to have Dragon disk formats compatible with other D machines, but that the VM page size can be any multiple of the disk page size. The run-length encoded data structure which makes the two page sizes relatively independent will be discussed in the "Memory System" chapter. This suggests that the 2^9-byte page size used on current D machines should be retained on the disk, but we are free to pick a larger page size for the VM.

Pursuing this further, if the page size were changed, what should it become? One factor is disk utilization; wasted disk capacity is some amount per sector for formatting plus ~0.5 sectors/revolution due to truncating actual capacity back to the nearest full sector; in addition, ~0.5 sectors/file are wasted in rounding files up to the next full sector. For the moment, assume an average file size of 2^14 bytes. Trident format requires about 96 bytes of overhead per sector, and achieves 29 sectors/revolution. With 2^9-byte pages, wasted capacity is about (96/608) + (0.5/29) + (0.5 x 2^9/2^14) = 19%. With 2^10-byte pages this would be about (96/1120) + (0.5/29) + (0.5 x 2^10/2^14) = 13%. With 2^11-byte pages, it would be about (96/2144) + (0.5/29) + (0.5 x 2^11/2^14) = 12%. However, the previous computation assumes an average file size of 2^14 = 16k bytes; if average file size is only 4k bytes, then wasted capacity becomes 24%, 23%, and 31%, respectively.

The conclusion from this is that maximum disk utilization is achieved with page sizes in the range of 2^9
A 28-byte page wastes 27% of capacity in record format alone; pages larger than 211 byteswaste more than 211-byte pages even if average file size is nearly infinite. 210 or 211 bytes may be betterthan the current 29 byte choice, but not much. Fewer sectors/revolution, more efficient record formatson the disk, or smaller average file sizes all favor smaller pages. This lends support to the thought thatthe current 29-byte page size should be retained on the disk.The VM page size affects M bus traffic, cache geometry, and map geometry. Recall from the discussionof memory references in the "Hardware Overview" chapter that a cache miss only results in a mapreference when no other cache quadword is on the same virtual page. Consequently, a larger page sizereduces M bus traffic by increasing the chance of finding another quadword on the same page; doublingthe page size also shrinks the width of the cache addressing section slightly by replacing one VP and oneRP bit by a single QA bit, as discussed in the "Memory System" chapter. Finally, there are datastructures totaling 3 bits/VM page needed for memory management; the size of these tables is halved bydoubling the page size.The conclusion of this reasoning is that the VM page size should be increased to at least 210 bytes; 211bytes may be better, but the rest of this document will assume 210 bytes.5.2. How Important Is Code and Data Compactness?Dragon's designers have been concerned about the number of cycles required to execute a program, butnot especially concerned about code and data compactness. Since the cost of storage has been decliningfor many years, both in absolute terms and relative to other system costs, this viewpoint is correct, atleast to some degree.However, I don't entirely share the view. The arguments are as follows:First, Dragon ought not do badly on data compactness. Although many items which are 16 bits inpresent Mesa systems may grow to 32 bits on Dragon, I think that most data storage will consist ofstrings, bitmaps, and things which are naturally N x 32 bits; the fraction of data storage given to itemswhich are unnecessarily large will be small. Assuming this is true, then only code size needs to beconsidered.Secondly, when storage is small, code size is important. This is the case on the current Xerox Starsystem, for example, where John Shoch recently said that Dandelion with its compact Mesa opcodes wasgetting performance in 0.5m bytes that the Apple Lisa needed 1m bytes to achieve; the code compactnessrepresents a significant cost savings to Xerox. As storage is made larger, it is eventually the case thatenough code and data both fit in storage, so that access to the backing store is infrequent. Star reachesthis point at about 1m bytes, I have been told. This means that code and data compactness will beimportant in low-end Dragon configurations, which have 1m bytes.Clearly, those Dragons used at PARC will not be minimum configurations; we are certainly willing toexpend N x ($300 to $500 per 1m bytes) for enough storage to work efficiently. However, Dragon maybe used in products where such costs are significant.Thirdly, size of code stored on file servers is probably not very important; the same reasons whichapplied to accepting a VM size limit of 228 bytes for code, while insisting upon 233 bytes for data, apply.Fourthly, code size on work station backup devices may be important. In a configuration which demand- fp!q4] G?fp aqbAsaq bAsaq? 
Storage responds to the ReadQuad and WriteQuad M bus commands, which will be described later. TBC.

6.2. Memory Mapping

Unlike Dorado and Dolphin, Dragon will not have a separate memory to map from VP to RP. The map will instead be kept in storage, and a special purpose LSI device on the M bus called the map processor will perform mapping and carry out other transactions with the storage map.

Because VM size/RM size can be so large (2^33/2^20 = 2^13), a resident table lookup map such as is used in Dolphin and Dorado is impractical. Although the VM and RM size limits are, to some extent, choosable, VM size/RM size must be limited to about 2^6 before a table map becomes practical, and I think we are unwilling to modify our size assumptions to this extent. The mapping proposal below uses a small fixed fraction of real storage for a hash table map; the hash table delivers the RP and flags for any VP presently in storage. An independent run-coded structure is used to obtain the SP and flags for pages not presently in RM.

The operating system must manage some data structure which gives for each VP both its flags and its real page (RP) in primary storage (if any) or its backup address in secondary storage (SP). In addition, some data structure must give the SP, VP, and any memory management information required for each RP. In the discussion below, we will denote the number of bits in a VP number by V, in a real page number by R, and in a secondary storage page by S; as a ballpark approximation S=V+2.

It is important to determine the RP and flags given the VP as quickly as possible. Under the assumption that 20% of misses result in map access, each storage reference during map access increases M bus traffic or miss wait ~15%.

If a VP is not in storage, the method by which its SP and flags are determined is not as time critical because a slow disk read will be required anyway. The approach suggested for Dragon, similar to that used in Lisp, Pilot, or Cedar, is as follows:

The sorted data structure which produces the SP given a VP contains entries representing runs of pages; a binary search finds the desired entry. Each entry contains a starting SP, starting VP, and run length. With a 16d-bit run length, this structure has S+V+16 = 64 bits/entry. The data structure is not required for pages in RM, so it can be swapped. Also, the size of the data structure is not very dependent on page size.

Various possibilities exist for defining a "run". Cedar simply preallocates a contiguous block of backup addresses to the VM, so that the entire VM is described by one run per bad spot on the backing store. Originally done for Alto Lisp, this produces a pleasingly small data structure. In addition, this organization makes the VM page size roughly independent of the disk page size.

If preassigning backup storage to the VM is unacceptable, then the run-coded structure could become very large. Dynamic allocation saves considerable backup storage when, for example, very large VM blocks are commonly allocated to allow hypothetical data structure growth. In this situation, the part of VM touched is small compared to the part of VM allocated. If this were common on Dragon, the run-coded structure would quickly become large--too large unless the average is upward of 500 pages/run. We will assume this is not a problem.
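To make the run-coded structure concrete, here is a small illustrative Python sketch, not a description of the actual Pilot/Cedar tables: entries hold a starting VP, starting SP, and a run length, are kept sorted by starting VP, and a binary search maps a VP to its SP (or reports that no run covers it). The types and field names are mine.

    from bisect import bisect_right
    from typing import NamedTuple, Optional

    class Run(NamedTuple):
        start_vp: int    # first virtual page of the run
        start_sp: int    # first secondary-storage page of the run
        length: int      # pages in the run (a 16d-bit field in the text)

    def vp_to_sp(runs, starts, vp) -> Optional[int]:
        # runs is sorted by start_vp; starts is the parallel list of start_vp values.
        i = bisect_right(starts, vp) - 1
        if i >= 0 and vp < runs[i].start_vp + runs[i].length:
            return runs[i].start_sp + (vp - runs[i].start_vp)
        return None      # no run covers this VP

    # Example: VP 105b falls 5 pages into the second run, so its SP is 1005b.
    runs = [Run(0o0, 0o400, 0o100), Run(0o100, 0o1000, 0o40)]
    starts = [r.start_vp for r in runs]
    assert vp_to_sp(runs, starts, 0o105) == 0o1005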
The ReadOnly and DiskDirty flags can also be kept in a (separate) run-length encoded structure, as suggested by Mark Brown. This run-length structure should be distinct from the VM map because the VM map will be valid until the disk is reformatted, but the ReadOnly/DiskDirty information will be transient. ReadOnly and DiskDirty flags are accessed on the first store into a clean page. DiskDirty is frequently of no interest; its function is to allow files "borrowed" from a remote place to be written back after local modification. For this reason, Mark Brown suggested that DiskDirty ordinarily go unrecorded; it should be recorded only when it is explicitly enabled for a range of pages in the VM.

In other words, there are four classes of runs in this structure: read-only runs; read-write runs for which DiskDirty is not recorded; read-write runs for which DiskDirty is recorded; and read-only runs for which DiskDirty is recorded. The last of these is unusual; it is included just for completeness, so that the protection can be changed after modification has occurred.

Runs for which DiskDirty is recorded should have a separately-allocated dense bit table with one bit/page, and the run-length data structure should point at this separate bit table when DiskDirty is being recorded. It seems reasonable that little of the VM should require DiskDirty to be recorded, but even if all of VM must be so handled, the required bit table is only 2^20 (swappable) bytes.

Two variations: First, ReadOnly and DiskDirty could be stored in a dense array with two bits for each page of VM; this array consists of 2^21 (swappable) bytes. This is the structure for the Cedar 5.0 system. Since this is simpler than the run-length structure proposed by Mark Brown, it may be preferable, if its observed paging behavior is acceptable.

Secondly, DiskDirty and ReadOnly flags could be stored with the data record on the backup device, from which they could be copied into the RP hash table discussed below. This might be the preferable data structure if backup device formats were controllable. However, this would rule out exchanging disk packs with other D machines, which do not have this format.

Finally, Cedar presently has one Allocated bit per page of VM, used when allocating VM to various functions. The proposal for Dragon is to store these in a dense bit table, which requires 2^20 (swappable) bytes for a 2^33-byte VM with 2^10-byte page size.

RP and flags are determined from VP by accessing a chained hash table. Chuck Thacker, in a CSL Notebook entry entitled "The Dragon Map" (7 December 1980), proposed a hash table map in which one 64-bit node/storage page holds one map entry and two pointers. The method described here is a variation which uses one 128-bit node/storage page or 1/64th of storage altogether; it has an average search time of ~1.1 map probes/lookup, assuming a random hashing function. Also important: search time degrades slowly if the hashing function isn't random (though, in fact, I think that actual use of VM may be better than random).

1/64th of storage is required if the module is exactly a power-of-two words long; otherwise, the requirement increases up to a maximum of 1/32nd of storage.

In this proposal, each VP is hashed and then divided into an initial probe position VPprobe (R bits) and
The number of bits in VPprobe is enough to hold an RP in the module, andthis is also the number of nodes in the hash table. The hashing algorithm uses the high-order VP bits asVPkey, while xor'ing the low-order VP bits with the high-order VP bits to get VPprobe.Low-order VPkey and VPprobe bits are interleaved when the low-order VP bits are used in M busselection, as discussed in the "Mapping With Multiple M Buses" section below.Each 128-bit root node can be arranged into 3 map entries, a 0 son pointer, and a 1 son pointer asfollows: fp!q4] G?fp bq11 `S)9 ^N \54 ZM Y)6' W^!C SQ R"Y PWi N: K` IPS G34 EFFHsEqxBsfxARAAR*)x?jx> x;66x:%Nx90Ux7 4qA 2D3Cs2q 0 1ys0q1ys0q -zG + P )N (Q &O_ $Y "ysE'y/ IqJ ~Q 81 V wB M ;K pH )A\eDragon DocumentEdward R. Fiala25 August 198340Map entries:3 bitsReadOnly, Dirty, and Referenced flagsV-R bitsVPkey in map entriesR bitsModule-relative RPPointers:1 bit1=root pointer, 0 = son pointerR bitsson pointer or root pointerOther fields:1 bit1=root node1 bit1=free list node10 bitsunusedThe combination of ReadOnly=1, Dirty=1, and Referenced=1 denotes an unused map entry. Nodeson the free list use the 0 son as a backward link and the 1 son as a forward link in a doubly-linked list.Unused sons point at the root.The pointers create a digitally-searched tree with a worst-case search depth no greater than V-R=13dprobes; average search depth is ~1.1 probes if the hash is random. On the other hand, because eachnode can hold 3 map entries before it must be split, a little non-randomness in the hash should not makeaverage search time a lot worse.One node/page is necessary, even though each can hold 3 map entries, because 1 node/page is consumedwhen all nodes are root nodes containing one map entry each. One node/page is sufficient because thealgorithms for inserting and deleting map entries given below guarantee that any tree of N nodes willhold at least N map entries.New map entries are inserted as follows:If the first probe is to a non-root node, allocate a free list node, and copy the non-root node into it; follow anypath to the root; then do a digital search to locate the pointer to the node which has been moved; fixup thatpointer. If the first probe is to a free list node, then remove it from the free list (i.e., allocate it). In either ofthese cases, a new root node is made containing the map entry being inserted with pointers to the root in both sons.If the search hits a root node, then digitally search the tree until a node with an empty slot is found, or the searchterminates at either a leaf node or a non-leaf node with a single son going the wrong way. If there is an emptyslot, insert the new map entry there. Otherwise, allocate a free list node, and split the node at which the searchfailed; put the new map entry and two root pointers into the new node; point the old node's son at the new node.The algorithm for collapsing the tree structure following removal of a map entry from node N is asfollows:If N is a root and has no sons, then if it has no map entries remaining, deallocate N; done.Otherwise, if N has at least one map entry remaining, then done.Otherwise, N has no map entries remaining and is not the root; the father of N has been remembered during thesearch. If N is a leaf (i.e., both of N's sons point at the root), deallocate N and replace its father by a pointer tothe root. If N is not a leaf, then follow any path from N to a leaf L. 
Collapse L into N; deallocate L; fixup L's father to point at the root.

*Note: leaves can be collapsed into any father, grandfather, etc., but non-leaves cannot ordinarily be collapsed because the son pointers would then fork the digital search in the wrong place. Also, the above algorithm could be changed to attempt collapsing L into N even when N has 1 or 2 remaining map entries. However, this improvement may not be worth the extra multiplexing, etc. that it would require.

Since both the insertion and removal algorithms guarantee at least one map entry in every node, one node/page is sufficient.

In addition to the hash table, which gives the RP and flags for any VP, a table with R 32-bit entries is required to determine the VP for any RP. This table is used in maintaining usage histories as discussed later; word RP in this table contains the VP and usage bits associated with that RP.

Rejected Scheme

Dan Greene suggested that the hash table map and map processor be replaced by a number of simpler fully associative LSI maps in which the VP and flags were stored in an array in which the position within the array gave the RP. This would reduce map access time from ~10 cycles to ~2 cycles, substantially reducing memory wait. Also, the LSI mapping parts are fairly trivial compared to the map processor.

A preliminary calculation suggested that at lambda = 2 microns each associative map chip could hold ~512 map entries. Then one of these LSI mapping parts would be needed for every 18.25 256k RAMs. However, Chuck Thacker suggests as ballpark numbers that our cost for 256k RAMs should be estimated as ~$5 and our cost for any custom LSI part in moderate volume should be ~$50; so the map parts are too expensive.

Also, with 2^28 bytes/module and 2^10 bytes/page, there would have to be 512 of these LSI mapping parts in the maximum size module, clearly unacceptable for fanout reasons.

If we start to modify our assumptions, then the associative map could be considered. For example, if the maximum module size were reduced to 2^26 bytes, the page size increased to 2^11 bytes, and lambda reduced to 1.4 microns (which increases the LSI map part's capacity to 1024 entries), then 1 LSI map would be needed per 73 256k RAMs, and the maximum module would need 32 LSI mapping parts.

6.3. Cache and Storage Configurations

Each P bus will have one or more caches connected to it, and each cache has an "acceptance" register controlling the VA range to which it responds. This register has 2 bits each for VP and BL, interpreted as "respond to 0" and "respond to 1"; both bits 0 disables the cache from responding to any references; both bits 1 means that both values of that VA bit are present in the cache. In addition, there is a 14d?-bit register used to adjust the VM size from 2^20 to 2^33 bytes; ones in this register cause the VA received by the cache to have its leading bits forced to zero.

It is assumed here that during system initialization, acceptance registers on the caches on each P bus are set up so that precisely one cache part responds to any VA. In other words, there will be no "holes" in the VM at the P bus interface, and at most one cache part will respond. Also, real memory (RM) is paired one-for-one with VM.
For example, suppose that there are two storage modules, each with its own M bus; then some caches on a particular P bus will connect to one M bus, some to the other; so the part of VM accepted by one cache can be loaded only into the corresponding part of RM.

Several ideas are mentioned below in which a P bus is not connected to all M buses. In such a situation, the caches on that P bus are, nevertheless, initialized to cover the entire VM; if a reference occurs on a P bus which has no access to the proper part of RM, then it will miss, and the map processor (on the wrong M bus) will return a page fault. The page fault software must distinguish this situation from normal page faults; such a reference is illegal.

Caches can be ganged on a single M bus, enlarging the cache on any device that needs it. Or caches can be split among several M buses, not only increasing cache size but also dividing M bus traffic. A multi-M-bus arrangement is powerful but has some problems; special purpose processors must normally have multiple caches, needed or not; and each M bus must have a map processor and storage module.

Traffic usually divides more evenly when low-order BL bits are used for cache selection. However, BL bits cannot be used as selectors in a multi-M-bus arrangement because mapping tables would then have to be redundantly stored in more than one storage module, which is impractical. Consequently, multi-M-bus configurations must select on VP bits. And the map itself must be arranged so that each storage module holds the map entries for its own pages, as discussed below.

In multi-cache configurations, the method of dividing data among caches will significantly affect the percentage of misses which require map references. Recall that a map access happens only when the cache servicing a miss does not hold another quadword with the same VP. With 64d quadwords/page, the assumption that ~10% of misses cause mapping seems plausible for a single-cache configuration. If in a two-cache configuration, even quadwords are put in one cache, odd in the other, then about half of the quadwords in each page will be in each cache. One would expect such a configuration to average a higher percentage of map accesses per miss than in a single-cache configuration. Conversely, even pages in one cache, odd in the other, concentrates all quadwords from any given page in one cache, producing a smaller percentage of map accesses. This reduction in M bus traffic may more than compensate for a less even distribution of misses among the caches.

Although low-order VP or BL bits generally divide references more uniformly, one or two high-order VP bits may be used with some software accommodation. For example, with the exception of indirect SFCs, the IFU uses its own cache only to fetch code from the low 2^28 bytes of VM. If two M buses were selected by the high VP bit, and if the SFC exception were somehow eliminated, then caches on the IFU P bus would only have to connect to one M bus. Similarly, if all data were put in the high half of the VM, and if all constants and pointers in the code were relocated to the high half of the VM somehow, then the EU would not need a cache covering the low half of the VM.
So this division would put all IFU trafficon one M bus, all EU traffic on another, without constraining the cache configurations.We should strive to make this division of M bus traffic possible by arranging the Mesa compiler to makeno data references to the code segment, and by fixing the SFC opcode. In addition, some opcode whichcan store data into a code segment may be needed, or at least one processor in a multi-M-busconfiguration must have its EU fully connected.McCreight suggested a scheme to make several ganged caches look like one larger cache, rather than dividing theVM as proposed above. This scheme passes a miss-goes-here token onto the next cache after servicing a miss, sothat only one cache will load data if none hit; this distributes misses evenly among the caches. Control signals mustwire-or or wire-and to correct values when one cache holds data and the others don't.Such a scheme would utilize caches evenly and allow 1 to 6 cache parts to be used on a P bus, while a binarydivision of the VM distributes data unevenly when the number of caches is not 1, 2, 4, or 8. Also, it would obviatethe acceptance register, except for those bits used in multi-M-bus configurations. However, map hits would occurless frequently, as discussed above, and two extra signal pins would be required. This idea should be pursued, butit is not clear that its result is, on average, better than selection by low-order VP bits.Enlarging a cache through ganged or multi-M-bus arrangements is only one way to improve memorysystem performance. Thacker has proposed that some ROM storage in parallel with the IFU cache couldpreempt the cache for some range of addresses. For example, such a ROM could hold code bytes for"trap opcodes", as discussed later, reducing contention in the IFU's cache. EPROMs available today areat most 8-bits wide, so at least 4 EPROMs would be needed to make a 32-bit width, but McCreight isdesigning a wider EPROM for Daffodil which could perhaps be adapted. fp!q4] G?fp bqpqM `SC \L [T YL$= W06 U70 S X R"A' PWf N+8 L2 IPA$ G] EK <W 9T@' 7X 5\ 3/x12sb x/N!x.q/Gx-Ux*rQx) ix'3>x&O\x$[ !q> T  C @K uK E dAO<Dragon DocumentEdward R. Fiala25 August 1983436.4. Multi-processorsThere will be many software problems to solve in bringing Cedar to the point where it can run morethan one processor concurrently. After those problems have been solved, it is interesting to ask howmany processors the hardware organization can support.For the minimum Dragon configuration, an effective limit will be reached when the M bus approachessaturation. We can estimate this as follows:1) Suppose that on a uniprocessor the average code byte is executed in 0.75 cycles (equivalent toabout one opcode every 1.5 cycles, if the average opcode is two code bytes), and that all jumpsare to the zeroth byte of a word. From this we can conclude that each IFU will reference itscache every 3 cycles.2) Suppose that a 97% IFU cache hit rate is achieved. Then each IFU cache will referencestorage every 100 cycles.3) Assume that the average IFU miss uses 9 M bus cycles: 7 for reading the quadword, plus 2for mapping [20% non-map-hits x (2 + (7/probe x 1.1 probes))].4) The average EU miss takes about 7+2+2.5 = 11.5 M bus cycles because allowance must bemade for writing back dirty victims (5 cycles x 50% dirty victims). 
Suppose that the EU uses the same M bus bandwidth as the IFU.

5) Suppose that a 38.5 Hz x 1024 bits/scanline x 808 scanline monitor (LF monitors on current D machines) is refreshed over the M bus; then 25% of M bus bandwidth will be used to refresh the display.

6) Assume that M bus bandwidth (2 cycles) for WriteSingle commands is negligible. This happens on a store into a cache block which may be shared with another cache.

7) All fetch-and-hold opcodes (currently only CST) hold the M bus 4 cycles if all references hit. Assuming that 1 in 333d opcodes is one of these, 0.8% of all M bus cycles are consumed by these.

From this we can conclude that each IFU will use 9% of the M bus bandwidth and each EU 9% + 0.8% = 9.8%, so maximum power is (1-.25)/.188 = ~4 processors.

If the model is changed to show a 99% hit rate, and if display refresh is accomplished without using M bus cycles, then the result improves to about 6.8% of M bus bandwidth/processor = ~15 processors.

Hence, the maximum power obtainable by ganging up minimum Dragon processors on a single M bus is plausibly ~4 to ~15 standalone processors, depending upon cache hit rate. At lambda = 2 microns, Dragon's cache seems headed for ~256d data words, so the minimum Dragon processor with two of these caches would have 512d 32-bit words; for comparison, Dorado has a 4096d 16-bit word cache and averages over 99% hit rate (also, Dorado's block size is twice as large as Dragon's). Four caches each for an IFU and EU would hold the same amount of data as the Dorado's cache, in blocks which are half as large.

Cache expansion, multi-M-bus arrangements, or ROM storage can increase the peak multiprocessor size by improving the hit rate and reducing M bus bandwidth. Interleaving the entire storage system 4 ways would multiply cache size by 4 and simultaneously divide traffic on each M bus by 4, plausibly permitting ~60 processors.

6.5. Mapping With Multiple M Buses

A promising method of making large multiprocessors is to gang up LSI caches using multiple M buses, each with a storage module. This increases "hit" rate through larger caches while also dividing M bus traffic. However, for this to work, each M bus must have its own map and storage system, so that it need not access resources on another M bus when a cache miss occurs. Consequently, VP bits must be used to select the M bus (and hence the cache), because use of lower-order VA bits would require replication of mapping information in different storage systems or communication from one storage system to another somehow, which is impractical.

Although VP bits probably do not, during the time constant of interest, divide references as evenly among the caches as would the lower-order VA bits, they do, as discussed earlier, concentrate all quadwords from a particular page in a single cache, reducing map references.

The following provisions allow multiple M buses. First, assume at most 4 M buses, and pick particular VP bits that will be allowed in M bus selection; here I assume that the two high-order (VP[1..2]) and two low-order (VP[22..23]) VP bits might be used this way. For good hash table probe distribution, low-order VP bits should be used as VPprobe; e.g., VPkey = VP[1..m] and VPprobe = VP[m+1..23] xor VP[1..m] is reasonable in a single-M-bus configuration.
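As an illustration of the single-M-bus hashing convention just given, here is a hedged C sketch. MapHash and hash_vp are names invented for the example, and the exact alignment of the xor within VPprobe is an assumption of the sketch (it folds the key into the high-order end of the probe when it fits there), not something the design pins down at this point.

    /* vp holds the 23-bit VP right-justified in an unsigned word; m is the number
       of high-order VP bits used as VPkey, which depends on the module size. */
    typedef struct { unsigned key; unsigned probe; } MapHash;

    static MapHash hash_vp(unsigned vp, int m)
    {
        int      probe_width = 23 - m;
        unsigned probe_mask  = (1u << probe_width) - 1;
        unsigned key         = vp >> probe_width;          /* VPkey   = VP[1..m]   */
        unsigned low         = vp & probe_mask;            /* VP[m+1..23]          */
        /* Place the key at the high-order end of the probe when it fits there;
           otherwise fold it in directly.  This placement is a guess, not a spec. */
        unsigned folded      = (m <= probe_width) ? (key << (probe_width - m)) : key;
        MapHash  h;
        h.key   = key;
        h.probe = (low ^ folded) & probe_mask;              /* VPprobe              */
        return h;
    }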
But if VP[23] is used for M bus selection, it should be interleaved with VP[m], and if VP[22] is used for M bus selection, it should be interleaved with VP[m-1]. Unfortunately, "m" can range from 13 to 5 as the module size ranges from 2^20 to 2^28 bytes. A possible solution to this problem is to allow any reasonable interleaving combination; i.e., allow VP[23] to be interleaved with any of VP[5..13] and VP[22] with any of VP[4..12]. Perhaps this can be done with configuration registers or shift registers of some sort.

A less good but simpler implementation is to control interleaving by jumpering signal pins; i.e., allow VP[23] to be interleaved only with VP[5] and VP[22] only with VP[4], using two signal pins which can be grounded or pulled high to effect interleaving.

The hash table map for any storage module of size R pages will reside in words 0 to 4R-1 of the module, and the pointer to the free node list for the hash table can be put at location 4R. Because these are used only by the map processor, and are referenced at module-relative real addresses by the map processor, these data structures are insensitive to the M bus configuration after system initialization. However, the maintenance processor must initialize these data structures, and to do this it must know the M bus configuration.

Also, there must be a list of free RPs for each module (which can have its header at real location 4R+1 within the module, for example). Software must arrange to use the proper free page list for each VP. Because each VP is mapped on one and only one M bus, it is constrained to use RPs from one and only one storage module. In other words, each VP/RP pair unfortunately will be entirely within a single storage system; in a 4-M-bus configuration, VM and RM will be managed as 4 separate pools rather than as a single pool.

6.6. Cache Organization

Cache signal pins under the current proposals are used as follows:

    32  PData[0..31]   P bus data
     1  PParity        P bus parity
     1  PNError        P bus parity error'
     3  PCmd           P bus command
     1  PReject        Reference reject (cache not ready)
     1  PFault         Page or write protect fault on reference
    32  MData[0..31]   M bus data
     1  MParity        M bus parity
     1  MNError        M bus parity error'
     1  MNewRq         New M bus request to the Arbiter
     1  MRq            Continuing M bus request to the Arbiter
     1  MGrant         M bus grant from the Arbiter
     4  MCommand       M bus command
     1  MNShared       M bus shared'
     1  MNReady        M bus data being transported
     2  Power
     2  Ground
     1  Reset
     2  Clock pins
     3  Serial io pins for maintenance processor (data, read clock, write clock)
    ------------
    92  pins total

Inside a cache, one half cycle is used for M bus commands which must access cache data, the other for activity initiated on the P bus. The P bus is also operated on half cycles. Even half cycles are used to move data from processor to cache and odd half cycles to move data from cache to processor.

Note that the maximum number of connections to a P bus is ~8 (4 caches, IFU, EU, 2 coprocessors), while the maximum number of M bus connections is much larger. To allow for larger bus fanout, the M bus is operated on full rather than half cycles; it is precharged during the first half-cycle, driven during the second half-cycle. If all caches contribute to precharging, then the precharge time would not be significantly affected by fanout.

With the exception of VP, all cache fields below are dual-ported: one port for P bus references, the other for M bus commands.
With two accesses per cycle, M bus commands don't interfere with P bus references unless a reference must access the M bus and has to wait.

In addition to the access methods used during normal operation, the cache has a serial interface which allows all of the internal storage to be read and written by the maintenance processor. The use of this interface is outlined in the initialization section later.

LSI cache blocks each contain the following fields:

VP            23-bit associative Virtual Page number. VP is VA[1..23d] of the 31-bit virtual word address VA[1..31d] supplied by the processor on references.

BL            6-bit associative Block Address within the page. BL is VA[24..29] of the VA supplied by the processor on references.

RP            18-bit associative Real Page number for data in a block. It is loaded on a miss from a MapDone M bus command or on a map hit from another block in the same cache.

VPValid       1-bit associative Virtual Page Valid flag. VPValid can be set false in all blocks in all caches with matching RP by the ChangeFlags M bus command, which is issued by the Map processor when the association between a VP and a RP is broken by a map change; VPValid is set true when a block is loaded from storage.

RPDirty       1-bit associative Real Page Dirty flag. Whenever RP is loaded by a MapDone, RPDirty is also loaded; in addition, RPDirty will be rewritten in all other blocks with matching RP; RPDirty is also loaded by the ChangeFlags M bus command. RPDirty=false on a store forces VP to compare not-equal to VA[1..23d].

RPValid       1-bit associative Real Page Valid flag. RPValid is reset to false in all blocks in all caches during system initialization (by the Reset signal). It is set true whenever a block is loaded on a miss.

Shared        1-bit associative flag indicating that the block may be present in some other cache. Shared is set true in all blocks holding a particular RP..BL whenever data for a miss is supplied from another cache rather than from storage. It is reset false when a block is loaded from storage.

Broken        1-bit associative flag indicating that the cache block is malfunctioning. This flag is normally set false during initialization but may be set true in response to diagnostic errors. It forces "no match" on both P bus and M bus operations, and causes the cyclic replacement algorithm to skip the block for which it is true. Up to 2? consecutive blocks may be broken before the entire cache part becomes non-functional.

CEDirty       1-bit Cache Entry Dirty flag. CEDirty is set false in a particular RP..BL when a WriteQuad M bus command is executed for that block; it is set true in a block hit by a store.

Transporting  1-bit flag indicating that transport is in progress into this block. Transporting is set true when the first word from storage has been received, false when the last word has been received. A reference hitting a block with Transporting=true is suspended.

NextVic       1-bit flag indicating that this block is the next victim for a miss. This bit is treated as a token that cycles through all the cache blocks, skipping the ones with Broken=true; up to two consecutive Broken blocks can be skipped. NextVic cycles to the next Broken=false block when either it is hit by a reference or used as the victim for a miss.

QW            132-bit QuadWord of data, stored in a RAM with 4 32-bit+parity words/block.
The individualwords from QW are selected by VA[30..31d] on references. fp!q4] G?fpxbAs4(`T _ F^x\TOZ^YxW4+VgN U:%xS<`QQPz.+O<M)xK-0J1xH  YGb[F>xD7,8BPAu;@Fx>J;<8 <;+EDragon DocumentEdward R. Fiala25 August 1983476.7. M Bus CommandsA cache which wants to use the M bus first obtains the M bus "grant" from the Arbiter, which uses around-robin scheduling algorithm. Once acquired, the M bus grant is retained until the reference hascompleted all necessary M bus commands; but an "-and-hold" reference retains the grant after thereference completes. Also, a cache automatically relinquishes the M bus grant on page and write-protectfaults to avoid deadlock. A grant acquired by an "-and-hold" reference is released after completing anynon-"-and-hold" reference.A cache which sends an M bus command to the map processor holds onto the grant while the mapprocessor is active; the map processor then uses the M bus freely without any interaction with theArbiter.The grant is obtained through three signals called MNewRq, MRq, and MGrant. Because the grant isgiven to a single cache on a single M bus rather than to all caches on a particular P bus, the semantics offetch-and-hold and store-and-hold are limited. The reference following a fetch-and-hold or store-and-hold must use the same cache as the previous reference to avoid deadlock. This means that the referencemust either be to the same word as the fetch-and-hold or that cache selection has some constraintsknown to the program. For example, if cache selection were limited to VP bits, then any location in thesame page as the first fetch-and-hold would be safe.No current opcodes use store-and-hold, and only the CST opcode (Conditional STore) uses fetch-and-hold. Since the reference following CST's fetch-and-hold is to the same location as the fetch-and-hold, ithas no problem. However, the restriction on -and-hold sequences rules out some interesting futurepossibilities.Even if an M bus grant could be given to all caches on a particular P bus that were connected to that M bus, therewould still be a limitation in multiple M bus configurations. Solutions suggested to this problem were rejected astoo complicated. We have decided to live with the limitation.The Allocate opcode, now discarded, had this deadlock problem. It removed an item from a free list by first fetch-and-holding the free list header; then fetch-and-holding the word pointed at by the header; and finally, storing thesecond value back into the first word. To avoid deadlock, Allocate was restricted to situations when its firstreference was either on one particular M bus, or on the same M bus as the second reference.Since CST is a sufficient primitive to implement monitor locks and condition variables on a multi-processor Dragon,the lack of a more general fetch-and-hold capability in the memory system will not be a serious problem unlesssome applications turn up for which monitor locks provide inadequate performance. Allocation and deallocation oflocal storage is one area of concern discussed later.The following are the commands so far established for M bus communication. A maximum of 16dcommands is allowed by the 4-bit MCommand bus.NoOpNo command.DoMapOpDoMapOp (2 M bus cycles) is issued on behalf of a MapOp reference. 
The processor issuesthis command like a store; the cache acquires the M bus and issues DoMapOp with the VAon M[1..31d] in the first cycle and the 32-bit data argument on M[0..31d] in the second cycle.By convention M[24..26d] in the first M bus cycle specify the processor (0 for the mapprocessor, 1 to 7 for other processors); M[27..31d] specifies the desired action.MapOpDoneMapOpDone (1 M bus cycle) is issued by the map processor after completing a DoMapOp.The cache which issued DoMapOp simply passes M[0..31d] of MapOpDone back to theprocessor like a fetch.ReadMapReadMap (1 M bus cycle) occurs when there is neither a data hit nor a map hit on a fetch;the VA of the reference is on M[1..31d]. This command is ignored by other caches on theM bus and received by the map processor which answers with a MapDone command.ReadMapSetDirtyReadMapSetDirty (1 M bus cycle) occurs when there is neither a data hit nor a map hit ona store; the VA of the reference is on M[1..31d]. This command is ignored by other caches fp!q4] G?fp b ^qc \14 ['9 YL17 WG! U RE+1 Pz:( N K>:' Is%F GI ER D"@ BIh @~4 = &< ;A*A 9wO 7 x4rkx3 ex2)>x/,Gx.*S!x,G(x+i[x(&Mx'iI%x& Tx$5 !Yq\ .xr/ x/,,/uJ /I/++/SQx/B/(K/x/)0/)// ;8x p/C/ 7#& A^tDragon DocumentEdward R. Fiala25 August 198348on the M bus and received by the map processor which answers with a MapDone command.SetDirtyA SetDirty command (1 M bus cycle) occurs when there is a data hit on a store butRPDirty=false in that block; the VA of the reference is on M[1..31d]. This command isignored by other caches on the M bus and received by the map processor which answerswith a MapDone command.MapDone?? MapDone (1 M bus cycle) is issued by the map processor in response to a ReadMap,ReadMapSetDirty, or SetDirty. The response has RP on M[6..23d], RPDirty on M[1d], andVPValid on M[2d] as well as PF (Page Fault), WPF (Write Protect Fault), MDE (Map DataError = storage failure during map access), and MapOK encoded somehow. IfMapOK=true, then all cache blocks matching RP in all caches will update RPDirty to thevalue specified in the command. If VPValid=false, then all cache blocks matching RP in allcaches will set VPValid=false. The cache which has the M bus grant knows that MapDoneis the response to its ReadMap, ReadMapSetDirty, or SetDirty command.ChangeFlagsChangeFlags (1 M bus cycle) may be issued by the map processor during a MapOp or whenthe first store into a clean page occurs; RP, RPDirty, VPValid, and MapOK are encoded asin the MapDone command. ChangeFlags causes all cache blocks matching RP in all cachesto reload RPDirty from the value specified in the command; VPValid is also reloaded iffVPValid=false in the command. In other words, it is possible to turn off VPValid usingChangeFlags, but not to turn it on. The CEDirty, Shared, Transporting, NextVic, andBroken flags cannot be modified by this command.WriteSingleWriteSingle (2 M bus cycles) is issued by a cache when its processor stores into a blockwhich has Shared=true; M[6..31d] has RP..BL..WA, and the 32-bit value being written is onthe M bus in the following cycle. All other caches on the M bus respond by matchingRP..BL; on a match (there can be at most one match in a particular cache), the data RAMholding RP..BL..WA is updated.ReadQuadReadQuad (7 M bus cycles) is issued to obtain a quadword from storage. During C0,M[6..31d] holds (RP..BL..WA); C1 is always dead; then the storage module can impose anindefinite hold beginning with C2, if it wants to, by asserting MNReady. I think the normalcase will be a one-cycle hold. 
C3 to C6 return the data in cyclic order beginning with thereferenced word WA. Other caches also match on RP..BL and will respond if they hold thedata (1 cycle faster in this case because no cycle is spent in storage wait), in which caseShared is set true in all caches matching the data. Otherwise, storage supplies the data withShared=false. When ReadQuad is issued by a cache, it will allow the reference to proceedas soon as the first data word has been received, and Transporting is set true during the next3 cycles, while the other words are transported.WriteQuadWriteQuad (5 M bus cycles) is issued when the victim for a miss has CEDirty=true. DuringC0, RP..BL..0 is on the M bus. Storage must accept the address during C0, but if it isunprepared to accept the data, it will raise MNReady, causing the WriteQuad to suspendtransport. Otherwise or after MNReady has fallen, data is transported in cyclic order duringC1 to C4. The quadword is written into storage and CEDirty is reset to false.Other caches which hold a block matching RP..BL respond by clearing CEDirty and rewritingtheir quadword from the M bus data. This ensures that when an io device streams data intostorage with WriteQuad operations, any cache holding a matching address is safely updated.6.8. Cache Operating DetailsM bus commands use one port to associatively operate on all blocks matching either RP or RP..BL,depending upon the command; Broken=false is also required for a match. P bus references use theother port to associatively match VA[1..29d] against VP..BL (a "data hit") or, barring that, VA[1..23d]against VP (a "map hit"); a block matches only when VPValid=true and Broken=false.On a data hit for a fetch, another reference can start in the next cycle, and data can be referenced in thesecond cycle; there are no complications. After a store, however, the next reference cannot start for twocycles, and there are two complications. First, Shared=true suspends a store reference until the M busgrant has been obtained and a WriteSingle M bus command executed (2 cycles). WriteSingle causes thesame RP..BL..WA in any other cache to be reloaded from the value on the M bus; note that at most oneblock can hold RP..BL..WA in any single cache because of the way "orphans", discussed below, are fp!q4] G?fp/bAr.&x`v/:/_F/]4/\TxZ/E /Y)9/WP/Vg"(/UF/S7$/REN/PExO /),/M)//LX++/JF/ID/H6T/F0xE /H/CP/BIH /@J /?x=/N/<\5!/:?/9[/89E/6;/5x Q/4&3/2 T/1U0x// K/.*,+/,>/+i7&/*N/(=/*/&H/%|D p qB IG ~V R BI" w ] ]  V R M R A]Dragon DocumentEdward R. Fiala25 August 198349treated.The second complication is that a store which hits must access the map unless RPDirty=true. Thishappens on the first store into a clean page or when writes are not permitted, both of which are rare. Inthis situation, the SetDirty M bus command will cause the map processor to set the Dirty bit in thestorage map and return with RPDirty=true in MapDone. If writes are not permitted, then a write-protect fault occurs.A store that hits an unshared block takes 1 cycle, if it is not followed by another reference, else 2 cycles. However,a store that hits a shared block takes 3 cycles, regardless of whether another reference appears in the followingmicroinstruction. 
In other words, data transport for the store overlaps M bus acquisition in all the complicated cases, so it doesn't add to the timing.

Similarly, store-and-hold and fetch-and-hold have the same timing as store and fetch, respectively, when they don't get a data hit.

On a map hit, RP and RPDirty are supplied from another block with the same VP. If there is no map hit, a ReadMap or ReadMapSetDirty M bus command is executed to obtain mapping information from the map processor. The MapDone response to these may force a page-not-in-storage or write-protect trap, if access is disallowed, or it may return new values for RP and RPDirty. If the Map processor has to set Dirty=true in storage because this reference is the first store into a clean page (a rare event), then the map processor will do a WriteQuad command to update the map in storage. Altogether, the ReadMap or ReadMapSetDirty command, the first map probe by the map processor, and the MapDone command take ~9 cycles; at 1.1 map probes/lookup, this will average ~10 cycles.

Note that RPDirty is reloaded in all cache blocks which match RP by the MapDone command.

After MapDone, the sequence of events resembles an original map hit. If RPDirty=false was the only reason for going to the map (in which case the SetDirty command would have been sent), then a data hit will follow because the proper VA..BL is already present in the cache.

Or if RP..BL is already present in the cache, then the "orphan" special case will occur. Some cache block could have RP..BL matching the reference, but not hit the data because VP doesn't match. Such an orphan occurs after the association between a VP and RP has been broken by a MapOp. When the M bus grant has been acquired for a miss, the orphan case is recognized, and VP for the reference overwrites VP in the orphan cache block using an M bus cycle. In this case, the ReadQuad M bus command continues, but its results are ignored by the cache.

The orphan case must be recognized by the cache to prevent the occurrence of more than one block holding the same quadword, possibly with differing data.

Mesa currently restricts virtual to real mapping to be one-to-one, so at most one VP can be assigned a particular RP. However, orphan treatment makes the cache work correctly even when this restriction is not true. If two VP's map the same RP, and if a particular quadword for one of the VP's is in the cache, then a reference to the same quadword via the other VP will replace VP in the cache block due to the orphan treatment.

If no orphan is found, a cache block must be chosen to reload on the miss; this block is referred to as the "victim". Rick Barth has proposed a modified cyclic algorithm for choosing it: the NextVic token cycles to the next cache block both on a miss, when the cell pointed at is replaced by data for the miss, and when the cell it points at is the target of a data hit. Also, if Broken=true in the next block, the token cycles to the block after that (see the sketch below).

Skipping blocks with Broken=true is carried several levels, after which the entire cache would be considered non-functional. It is believed that the ability to set Broken=true will allow many manufacturing defects to be ignored, improving yields.

If the selected victim has CEDirty=true, then it is stored with a WriteQuad M bus operation. Then a ReadQuad M bus command is executed to obtain new data.
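Before continuing with the miss sequence, here is a hedged C sketch of the NextVic cycling rule just described. CacheBlock, NCACHEBLOCKS, and next_victim are names invented for the example; the real cache implements this as a token passed among blocks rather than as a loop, and the cache size shown is only illustrative.

    #define NCACHEBLOCKS 256                  /* illustrative size only */

    struct CacheBlock { int broken; /* other fields omitted */ };

    /* Advance the victim token from block 'cur' to the next block with
       Broken=false; the text allows up to two consecutive Broken blocks to
       be skipped.  The token advances either when 'cur' is used as the
       victim for a miss or when 'cur' is the target of a data hit. */
    static int next_victim(const struct CacheBlock blocks[], int cur)
    {
        int next = (cur + 1) % NCACHEBLOCKS;
        int skipped = 0;
        while (blocks[next].broken && skipped < 2) {
            next = (next + 1) % NCACHEBLOCKS;
            skipped++;
        }
        return next;
    }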
On the ReadQuad, any other cache matching RP..BL will supply QW, and Shared will be set true in all matching blocks simultaneously; if fetch timing for a hit is 2 cycles, then a miss answered by another cache (Shared=true) takes 7 cycles, and a miss answered by storage takes 8 cycles (Shared=false).

QW is always transported with the requested word first, followed by the other three words in cyclic order, so that the processor can continue without waiting for the rest of transport to finish. However, if the reference that missed is followed shortly by another reference to the same quadword, then the second reference will be suspended until all data transport has finished.

This is an exception to the rule that a reference is suspended until M bus activity it initiates has completed.

Suppose that two fetches in sequence are issued in consecutive cycles to a single cache block and that the first fetch misses. Then the first fetch will be suspended in the pipeline until the first word of transport for the miss has been received by the cache. Then the first fetch will finish and the second fetch will be suspended for 3 cycles because it hits a cache block for which Transporting=true.

Or suppose that the sequence is fetch, touch data, fetch in three consecutive cycles. Again the first fetch is suspended until the first word has been transported. Then the instruction which touches the first fetch's data is suspended for 1 cycle. Finally, the second fetch is suspended for 1 cycle to wait for transport to finish.

Here is timing for a fetch that gets a map hit and is assigned a clean victim:

    C0:        EU registers read, addition to compute VA started in EU.
    C1 A:      VA sent to cache on P bus.
    C1 B:      Map hit but no data hit is detected.
    C2 A:      M bus grant is requested from Arbiter.
    C3 A:      M bus grant is obtained.
    C3 B:      ReadQuad command issued.
    C4:        Dead.
    C5:        Storage wait (possibly no wait if no error correction).
    C6 B:      Transport 1st word (at referenced address) on M bus.
    C7 B:      Transport 2nd word; store 1st word in cache.
    C8 B:      Transport 3rd word; send 1st word on P bus; "move" class opcodes proceed.
    C9 A:      M bus request dropped.
    C9 B:      Transport 4th word; "op" class opcodes proceed; done.
    C10 A:     Arbiter gives grant to somebody else.

Here is timing for a fetch that misses and is assigned a dirty victim:

    C0:        EU registers read, addition to compute VA started in EU.
    C1 A:      VA sent to cache on P bus.
    C1 B:      Miss is detected (= not map hit or data hit).
    C2 A:      M bus grant is requested from Arbiter.
    C3 A:      M bus grant is obtained.
    C3 B:      ReadMap command issued.
    C4:        Map processor issues ReadQuad to get map entry from hash table.
    C5:        Dead.
    C6:        Storage wait (possibly no wait if no error correction).
    C7 to 10:  Transport quadword to map processor; map processor does matches as each word arrives, so it is (usually) ready with its response to the cache before the transport has finished.
    C11:       MapDone sent to cache.
    C12 A:     Recognize not-orphan special case & dirty victim.
    C12 to 16: WriteQuad command for dirty victim.
    C17 B:     ReadQuad command issued.
    C18 to 24: Like C4 to 10 in previous example.

Here is timing for a store that misses and winds up with Shared=true:
    C0 to 12:  As above.
    C12 B:     ReadQuad command issued to obtain data.
    C13:       Dead.
    C14:       Transport referenced word from another cache with Shared=true.
    C15 to 17: More transport.
    C18:       WriteSingle command issued and processor suspension lifted.
    C19:       Rest of WriteSingle command issued.

6.9. MapOps

As its name suggests, the MapOp or map operation is intended for communication with the map processor. However, it can send information through a cache to the M bus where any non-cache device, not necessarily the map processor, can receive it. This means that a MapOp might communicate with a BitBlt processor, Multibus communication processor, or whatever.

A MapOp transmits an "address" in the first M bus cycle, "data" in the second. These bits are not entirely subject to convention; in a multi-M-bus configuration, some bits will select the M bus; the convention suggested earlier, for example, allows the high-order and low-order VP bits to select the M bus. All other "address" bits are ignored by the cache and are, therefore, subject to convention. The 8 sub-page VA bits (i.e., BL..WA, which select the word within a page) should be used to identify both the special device which is to receive the MapOp and the operation to be executed. Then VA[1..23d] are available as the VP for a MapOp aimed at the map processor; VA[24..26d] select the processor to receive the MapOp (0 for the map processor, 1 to 7 for other processors); and VA[27..31d] select the operation to be performed. The 32d data bits transmitted in the second M bus cycle of the command are an argument to the operation.

Note that MapOps are automatically atomic: the cache which transmits the MapOp holds onto the M bus grant until it receives the MapOpDone response from the map processor; the map processor executes a sequence of M bus commands to complete the requested action, knowing that it has the grant.

Since Dragon is a multiprocessor, it seems reasonable that modifications of the storage map should be conditional stores. Then the way a map modification would be carried out is as follows: First, software reads the map entry for some VP with an RMap MapOp, determining OldRP and OldF. Next, based upon OldF, a NewF is computed and a CWFlags MapOp is issued which will only change the flags to NewF if the flags are still equal to OldF. The MapOpDone response includes 32d bits which indicate current map contents and whether or not CWFlags was successful. This methodology covers the possibility that Dirty or Referenced will have been set true between the RMap and the CWFlags.

The map processor will do fairly complicated things to maintain cache consistency across flag changes. For example, if CWFlags is turning off Dirty in the map entry for some VP, it will issue a ChangeFlags M bus command to set RPDirty=false at the RP corresponding to that VP in all the caches. Or if Referenced is being turned off, it will reset VPValid=false in all the caches at that RP.

With these preliminaries, here are proposed map operations:

RMap     RMap is used to read a map entry. During C0, the VP is on M[1..23], M[24..31]=0 selecting the RMap function. M data during C1 is undefined. The map processor interprets the hash table structure and returns the following data with MapOpDone: M[0]=1 means that the VP isn't in storage and M[1..31] are undefined; M[0]=0 means M is interpreted as follows: M[1]=Dirty; M[2]=Referenced; M[3]=ReadOnly; M[4..5]=0, M[6..23]=RP, M[24..31]=0.

WMap     WMap is used to add a new entry to the map. During C0, the VP is on M[1..23], M[24..31]=1 selecting the WMap function.
During C1, M[1..3] are the new flags; M[4..5]=undefined, M[6..23]=RP, M[24..31]=undefined. The map processor inserts the new entry in the hash table and returns the following with MapOpDone: M[0]=1 means that the operation failed; the new RP (or the existing RP if M[27]=1) is returned on M[6..23]; M[27]=1 means the VP being inserted was already in the map; M[28]=1 means a node could not be obtained from the hash table's free list.

CWMap    CWMap is used to remove a map entry, iff the existing flags match. During C0, the VP is on M[1..23], M[24..31]=2 selecting the CWMap function. During C1, M[24..26]=old flags. Assuming a match is found, the map processor sets VPValid=false and RPDirty=false in all cache blocks matching RP via a ChangeFlags M bus command. Then it removes the map entry; if a node is freed by the removal, it puts the node on the module's free node list. MapOpDone data is interpreted as follows: M[0]=1 means that the operation failed; M[1..3]=old flags; M[6..23]=RP that was in the map entry; M[29]=1 means the VP wasn't in the map; M[30]=1 means the flags didn't match.

CWFlags  CWFlags is used to change the flags of an existing map entry iff the old flags haven't changed value. During C0, the VP is on M[1..23], M[24..31]=3 selecting the CWFlags function. During C1, M[1..3] are the new flags, which will be written into the map entry iff M[24..26] match the old flags.

If Dirty or Referenced is being turned off, the map processor will issue a ChangeFlags M bus command. When Dirty is being turned off, the ChangeFlags will clear RPDirty in all caches holding quadwords on that RP. When Referenced is being turned off, the ChangeFlags will clear VPValid in all caches holding quadwords on that RP.

MapOpDone data is interpreted as follows: M[0]=1 means that the operation failed; M[6..23]=RP; M[29]=1 means the VP wasn't in the map; M[30]=1 means the flags didn't match.

Note that there is no problem with caches holding map data becoming inconsistent. Because any cache matching the RP..BL of a map processor ReadQuad command will respond and set Shared=true, and because WriteQuad by the map processor updates any cache holding the RP..BL, any cache holding map data will remain consistent in the presence of map processor activity. This may become important, though I know of no reason why software should want to reference the map data structures.

Also, if a cache should modify its map data after the map processor's ReadQuad and before the map processor's WriteQuad, it would find Shared=true and have to wait for the M bus grant.

Here are (some of) the basic primitive actions which must be performed using MapOps:

First, it must be possible to remove a page VPa/RPa from its current VP mapping, thereby acquiring RPa for some other use. To do this, if the page is Dirty use CWFlags to set Dirty=0; then write the page on the disk; then use CWFlags to set Referenced=0. Then, use CWMap to remove the page from storage if it is still clean and unreferenced.

Next, it must be possible to assign VPa to a free page RPb and read data into it from the disk (or whatever). In this case, the storage manager first uses a WMap to bind a reserved VPb to RPb during the disk read. VPa cannot be used during the disk read, because another process might reference the page and malfunction.
At the end of the disk read, the storage manager breaks the VPb/RPb associationand clears RPDirty with a CWMap. Finally, it uses another WMap to bind VPa to RPb and setup theflags.Thirdly, it must be possible to enumerate the VPs in the map efficiently, so that candidates forreplacement can be identified and referenced histories maintained. This enumeration should be good,but not necessarily perfect. Since the map is a hash table, a simple sweep through it might occasionallyfind a particular VP moving from one node to another in such a way that the page is enumerated either0 or more than 1 times during a single sweep. If this doesn't happen too frequently, the error which it fp!q4] G?fpbArW`G_H^C\)xZ BX)+V1'U*4S>Q@P4#5N.xL5OJ5(Is<"H ECD}=CXA9?@>&S< 9wqU 7/. 5F 4b 2LYy/r94y.*J *qT 'iX %=' #V " / R R U 8G mK  1)7 fU 'B e I A]LdDragon DocumentEdward R. Fiala25 August 198353causes is probably tolerable.However, enumerating hash table nodes would be hard without complicating the map processor. For thisreason, software should maintain an array containing the VP at each RP. With 210-byte pages, this arrayrequires 1/256th of storage. Then to enumerate the VPs in the map, the storage manager enumerates theVPs in this array.Accurately updating the Referenced bit in the storage map is the low-level part of this problem. To dothis, use CWFlags to clear the Referenced bit in the storage map and sample its old value. Subsequentreferences to that page to quadwords already in a cache then find VPValid=false, access the map and setReferenced=true, and finally take 1 M bus cycle with the orphan special case.6.10. Parity and FaultsRick Barth's current proposal for the caches is to check/generate parity on both the P and M buses aftera delay of 32 cycles using a shift-and-xor scheme suggested by McCreight, and to store parity with datain the cache; no checking will be used within the addressing section of the cache.The M bus parity error will cause a system wide crash, and the parity error flipflops in the cache(s)which detected the failure will remain set until system reset. If such an error be detected by more thanone cache, it will not be localizable to any particular source.P bus parity errors will be handled similarly, but be localizable to the P bus. The processors on the Pbus (e.g., IFU and EU) will generate parity on both addresses and data and check parity on data.Cache data storage parity errors will be localizable to the cache whose data storage failed.Using this approach, there are M bus and P bus parity bits, and a cache has both P and M bus parityerror signal pins. Presumably the IFU has 2 and the EU 1 parity error signal pin as well. Theprocessing of these signals has not been specified; in uniprocessor configurations, it might be convenientto simply wire-or all of the parity error signals together; in a large multi-processor, we could discretely-orthe parity error signals while loading the values of the individual flipflops into LEDs or something formaintenance purposes.6.11. RefreshSince the processor's cache port can access all parts of the cache, it is used to refresh the various cacheelements as follows:There is a counter which determines maximum time between refresh operations; whenever there is no Pbus reference during a cycle, the cache executes a refresh operation and resets the counter. Since therefresh operation takes only one cycle, there will be no disruption to the processor in this case. 
However, if the counter expires without any idle P bus cycles, then the PReject signal is raised for one cycle while a refresh operation is carried out anyway.

During each refresh operation, one associative block and two of its four data words are refreshed. The target refresh period is about 3 msec, so a 256 block x 4 words/block cache requires 512 refresh operations per 3 msec. At 100 ns/cycle, this is one refresh operation every 60 cycles, a negligible interference rate.

6.12. System Initialization and Testing

Each IFU and cache will have a Reset signal pin. An IFU resets both itself and its EU in response to the Reset signal being asserted; then it remains passive until its Reset signal is dropped. As soon as Reset is dropped, the IFU starts executing the program at some wired-in location in the VM.

The caches are set up for either a quick-and-dirty initialization or a more careful initialization involving a maintenance processor. The quick-and-dirty initialization takes place automatically when the Reset signal is raised. This clears all of the VPValid, Broken, and RPValid bits in the cache and initializes all of the NextVic bits to 0 except for one, which is initialized to 1. Also, the VM range is set up to the maximum of 31d bits and the acceptance register to accept all VAs.

So long as Reset is asserted, the cache will issue PReject to P bus references and will ignore the M bus. As soon as Reset is dropped, the cache will respond to references and commands.

The quick-and-dirty initialization suffices for any configuration in which a single cache is on each P bus, but it won't handle multiple caches.

The more careful initialization requires a maintenance processor with special hardware or a special LSI device. The special hardware must receive commands from the maintenance processor using the maintenance processor's clock, and then shift out these commands in sync with the Dragon's clock using the cache's two-wire interface; then it must read out the cache's response to the command one bit at a time. Two different wires are needed for each cache in the system.

The format of these commands is not yet decided, but it will allow any cache block to be read or written; each read or write operation affects all associative bits in a block and one of the four data words. In addition, the acceptance register and VM size registers in the cache can be read and written.

The careful initialization then works as follows. First, one Reset signal is bussed to all of the IFUs and another to all of the caches. Both reset signals are raised initially; then the cache reset signal is dropped, but the IFU reset signal is left on during cache testing and initialization. Each cache is tested using the two-wire interface. The acceptance registers, Broken bits, and VM range bits for each cache are then initialized appropriately.

The map processor and Arbiter each can receive the same Reset signal as the IFUs.

Before the IFU Reset signal is dropped, storage must be tested and initialized by the maintenance processor. The IFUs cannot do this because they always do a ReadQuad (which would get bad parity) before a WriteQuad.
In addition, the map hash table, free node list, and free page list have to beinitialized in each module, and the program which the IFU's will execute has to be loaded into storage.There must be no storage bad spots in the first 4R+1 words, because this area will be referencedabsolutely by the map processor. VP=0 should be initialized to not-in-storage; the rest of theinitialization can be done in some fairly simple way.TBC. fp!q4] G?fp b( ^qV \B% [[ Wm UY TQ REg Pz: MY K>O Gd F$ BV @T >H =/52 ;eC 7` 6(Q 4^ T 01: /!\ -V0< +V ) &OQ "4- !L H?$ }#D  ` A!> v5  B BUNDragon DocumentEdward R. Fiala25 August 1983556.13. Conclusions1) The current general method of using run-length encoded entries for the VM map, as is done in Lisp,Pilot, and Cedar, should be retained. This representation is small, can be swapped, and is not verydependent on page size.2) One of the two methods discussed for keeping track of ReadOnly and DiskDirty should be adopted.The simplest scheme puts these two bits in an array with 2 bits per VP. If this has poor paging behavior,the more complicated run-length encoded scheme suggested by Mark Brown can be used.3) The 128 bits/bucket hash table map with digitally-searched tree of collision lists should be adopted.4) For good probe distribution, the first hash table probe by the map processor should be made at anaddress VPprobe determined from low-order VP bits xor'ed with high-order VP bits; VPkey should bethe high-order VP bits. This puts all available randomness into VPprobe; since the low-order VP bits arelikely to be most random, the xor should be at the high-order part of VPprobe. The exact point atwhich VP is divided into VPprobe and VPkey is controlled by the module size. To enable multi-M-busconfigurations, it must be possible to interleave low-order VPprobe and VPkey bits, so that when low-order VP bits are used to select the M bus, they no longer are used in VPprobe.5) Low-order VP bits might be the best cache selectors when several caches are ganged on a single Mbus. Although low-order BL bits probably divide references more evenly, VP bits reduce the percentageof map accesses; this advantage will compensate for somewhat more misses.6) The two high-order and two low-order VP bits seem like the best selectors in multi-M-busconfigurations. The high-order VP bits may be useful in separating M bus activity in the code segment,for display refresh, and for other uses. The low-order VP bits seem best for generally distributingreferences among M buses.7) The implementation of the SFC opcode should, if possible, be changed so that the cache neverreferences addresses greater than 228 through its own cache. Also, the compiler and loader should, ifpractical, be organized so that data references (i.e., references which use the EU's P bus) never occur inthe code segment. If these objectives are achieved, it will be possible to run IFUs and EUs onindependent M buses selected by the high-order bit of VA.Also, free lists of unused hash table entries and of available storage pages should be kept, as discussed, inmodule-specific real locations manipulated by the map processor. This makes both the map processorand software less sensitive to multiple M buses. fp!q4] G?fp b ^q14 \X [ WV U` TS P*> M,E Kaa I [ Gb FP D7&? BlO > W =/H ;eI 7F 6(S 4^H 2 /!C -V#-r-VqA +a )<# '9 $N "Q 0 f BGcDragon DocumentEdward R. Fiala25 August 1983567. Basic OpcodesThe next few sections deal with the basic opcodes that do loads and stores, conditional jumps, immediateloads, and other operations. 
More complicated operations for context switching, process machinery,allocation and deallocation, field manipulation, and arithmetic are discussed later.7.1. General Comments About OpcodesFor simplicity of the IFU, unconditional jumps and context-switching opcodes are limited to 4 codebytes, other opcodes to 3 code bytes. Opcodes must complete or trap in no more than 3microinstructions.There is no proposed distinction among the 16d local registers (LRs); in other words, any opcode whichcan address an LR has a 4-bit field which allows any LR to be selected. LRs are used to hold aprocedure's arguments, for pointers to its local and global storage, and for its local variables.Similarly, there is no proposed distinction among the 16d auxiliary registers (ARs). ARs are useful forthings such as the pointer to the FSI table (used in local storage allocation), the procedure return Hook,and the pointer to the current process's PSB (Process State Block). In these applications, the AR is abase register for memory references. However, an AR may also be used for STICKY, which ismanipulated by RAND and ROR opcodes and has bits tested by conditional jumps.The stack is used to pass procedure arguments, to receive procedure results, and for other generalpurposes when a LR or AR seems inappropriate. Also, the larger immediate constants get put on thestack first, then into the LR or AR which is the real destination. Also, conditional jumps require thatone of the two values being compared come from the stack.Note that when |StkP-RL| is less than 20b, [S] is interchangable with a LR; this interchangability can beused, for example, to initialize an LR by pushing a value onto the stack at procedure entry.There is no CODE base register like the one used in 16-bit Mesa; the analog on Dragon is PC-relativeaddressing. Since code is clean, there are no PC-relative writes into storage. Also, PC-relative reads arelimited to the PRL opcode (PC Relative Load--pushes the 32-bit word at PC + signed ab) and FPC(Fetch PC--pushes the PC onto the stack). Exact implementation of PRL and FPC is still uncertain.The primary reason for PRL is to push a pointer to the global frame which is stored in the codesegment. Petit says the implementation is messy, and he may omit PRL, if we expect to load Global inanother way.Furthermore, there are no special opcodes for referencing Local or Global storage. Instead, a pointer toLocal or Global storage is put into one of a procedure's LRs, just like a pointer to any other object.Consequently, there are general opcodes for referencing objects pointed at by LRs, but nothing specialfor Local or Global storage. And the MDS is eliminated.Register addressing at the microinstruction level was discussed in the "Hardware Overview" chapter. Animportant point is that only the stack can be arbitrarily addressed by microinstructions; LRs, ARs, andconstants must be selected from bits in the opcode, a, or b. In the opcode set, an attempt has beenmade to limit the number of addressing modes, so that the hardware can be kept simple. Similarly, thenumber of different ways jump displacements are computed, etc. has been limited. fp!q4] G?fp ar ^eq` \"A ZT Up$ REq] Pz%1 N K>N Is)6 Ga D7+= Bl46 @M >%5 = M 9 V 7S 6U 4:9 0*? .\ +D )O 'Ksq &,?# $a T "\  Z90 Q +; 8 ?( P 4sqsq) )&@ ^Q 2 BZ*Dragon DocumentEdward R. Fiala25 August 198357In considering possible changes to the opcode set, it is important to note that both the IFU and EU willbe very cramped at l = 2 microns. 
Changes which increase complexity are undesirable.

7.2. Immediate Opcodes

The general immediate operation is forced to be a stack operation because there are insufficient bits in a 3-byte opcode to specify other registers. A limited range of immediate values could be loaded, added, etc. directly into LRs; several such opcodes are listed as possible future additions to Dragon, but none are included in the present opcode set. Consequently, if the compiler wants to load, add, etc. an immediate value to a LR, it first pushes the value onto the stack, then does the operation.

At procedure entry, it is sometimes possible to initialize an LRm with an immediate load. This can be done when S+1 = LRm.

Even if the IFU were prepared to handle a 5-byte non-jump opcode (which it's not), a 32-bit load-immediate operation would take two cycles because the I bus data path between the IFU and EU is only 16 bits wide. For this reason, the general load immediate is accomplished by first loading the left half and then adding the right half:

LILDB   Load Immediate Left Double Byte (Push ab into 0:15, 0 into 16:31).
ADDDB   Add Double Byte. Add ab to [S]; trap on integer overflow; Carry unchanged.

Unlike some add and subtract opcodes, ADDDB does not smash "Carry", so the above sequence can be used in multi-precision arithmetic without fear. And there is no possibility of an integer-overflow trap in this case.

The PRL opcode discussed earlier is an alternative to the above 6-byte, 2-cycle sequence. Although primarily intended for pushing the global frame pointer from the code segment at procedure entry, PRL can also push constants. PRL itself is a 3-byte opcode, and the constant it loads is another 4 bytes, so this is not more compact unless there is more than one reference to the constant. However, PRL executes in only 1 cycle (assuming no cache miss), if the following opcode doesn't operate on the 32-bit value.

Other immediate opcodes are as follows:

LIn     Load Immediate n. 1-byte opcodes for pushing common immediate operands onto the evaluation stack (n = 0, +1, +2, -1, 20000000000b).
LIB     Load Immediate Byte. 2-byte opcode pushing a.
LINB    Load Immediate Negative Byte. 2-byte opcode pushing 37777777400b + a.
LIDB    Load Immediate Double Byte. 3-byte opcode pushing ab.
ADDB    Add Byte. Add a to [S]; trap on integer out-of-range; Carry untouched.
ADDNB   Add Negative Byte. Add a with high-order 1's to [S]; trap on integer out-of-range; Carry untouched.

Although the one-byte LIn opcodes push values identical to planned constants, the implementation does not involve reading constants from the EU's RAM. The coincidental equality of values limits the utility of the one-byte load immediates; any operation performed on the value in the next cycle could instead have been done directly with an RR opcode (RAND, ROR, RSUB, RADD, RXOR, etc.); although a sequence such as LIn, ADD is one byte smaller than RADD, it is also one cycle slower. At best the LIn opcodes save 2 code bytes over using RAND or ROR to push the same value. For this reason, the LIn opcodes may not be especially useful.
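To make the LILDB/ADDDB split concrete, here is a hedged C sketch of how a compiler might emit the general 32-bit load-immediate sequence. emit_load_imm32 and the EmitFn callback are hypothetical names invented for the example, and the sketch assumes ADDDB treats its ab operand as an unsigned halfword, as the text above implies.

    #include <stdint.h>

    typedef void (*EmitFn)(const char *opcode, uint16_t ab);

    static void emit_load_imm32(uint32_t value, EmitFn emit)
    {
        emit("LILDB", (uint16_t)(value >> 16)); /* push ab into bits 0:15 (the high
                                                   half, since bit 0 is the msb),
                                                   zero into bits 16:31 */
        if ((uint16_t)value != 0)
            emit("ADDDB", (uint16_t)value);     /* add ab to [S], filling bits 16:31;
                                                   no carry out of the low half, so
                                                   no overflow trap can occur here */
    }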
So if the nextopcode is entirely in the first target word, the jump takes 3 cycles, otherwise 4.The proposed unconditional jumps are as follows:JnJump n = 2,3,4,5,6,7.Common jumps encoded in one-byte opcode. J2 and J3 are treated as 2 and3-byte noops for faster execution.JBJump Byte.Jump to . + sign-extended a.JDBJump Double Byte.Jump to . + sign-extended ab.Here, "." is the location of the first byte of the jump opcode.7.4. Conditional JumpsThe EU returns the following branch conditions to the IFU:ADDER<0XOR=0ADDER carry-outThe conditional jump opcodes use these signals in combination; for example, a jump on less-than-or-equal would combine the XOR=0 and ADDER<0 conditions.Each conditional jump is predicted to jump or not-jump by a bit in the opcode interpreted by the IFUpipeline. The IFU fetches and decodes opcodes only along the predicted path, and only as far as thenext context switch or conditional jump. Then the branch condition is computed in a single EU cycleand sent back to the IFU over the I bus. If the prediction was wrong, then the results in the pipelineafter the conditional jump will be thrown away, and the IFU will refetch opcodes along the other path.This results in the following execution times for a 1 cycle conditional jump based upon the output of theADDER:PredictionResultCyclesnot jumpnot jump1not jumpjump5+(1 if 1st opcode in jump path crosses word boundary)jumpnot jump5+(1 if 1st opcode in non-jump path crosses word boundary)jumpjump1+(3-A+1 if 1st opcode in jump path crosses word boundary)The "3-A" is an attempt to describe the 3 or 4 cycle gap introduced into the IFU pipeline by a jump. Ittakes the IFU 2 cycles (assuming a cache hit) to fetch the first word on the new path and 1 cycle todecode the first opcode, if the first opcode on the new path is contained within the first word; otherwise,it takes 1 cycle longer. This time is 0 if the IFU is caught up but can be as long as 3 or 4 cycles if theIFU is not caught up.There have been several proposals for predicting jump or not-jump. The Dorado predicts all conditionaljumps to "not jump", and this prediction is correct 60% of the time (highly compiler dependent).Dragon provides one opcode bit for predicting jump/not-jump, so the compiler can choose. fp!q4] G?fp bq1 ]p Yq#B WC% UR R0xOt%41%5Ni"xLX %4stxJ#%4st Fq? Ap >&q:x;etx:x8 5UqB! 35 0%? .MR ,F *F! (f '#18 %Xx"-u #$xt#$x#$6x7#$6x#$: q'A [ ,? (38 ] 07 L VX  B]+Dragon DocumentEdward R. Fiala25 August 198359Conditional jumps use sign-extended a as the jump displacement relative to ".", the location of the firstbyte of the conditional jump opcode, giving a displacement range of .-128d to .+127d.We don't know how often a jump target will be "in range". Jim Sandman told me that a displacementof .-64 to .+63 would be "in range" about 90% as often as .-128 to .+127 using the Trinity 16-bit Mesacompiler and opcode set on the collection of programs which SDD has examined. When a jump targetis out-of-range, the conditional jump must send control to a substitute in-range target that holds a JDBopcode to the real target location.Here are the conditional jumps (B2 register addressing):RJGEBRegister Jump Greater Equal Byte.RJLBRegister Jump Less Byte.RJEBRegister Jump Equal Byte.RJNEBRegister Jump Not Equal Byte.RJLEBRegister Jump Less Equal Byte.RJGBRegister Jump Greater Byte.JEBBJump Equal Byte Byte. Jump if [S] .eq. b; S _ S-1.JNEBBJump Not Equal Byte Byte. Jump if [S] .ne. b; S _ S-1.JCSTJump on CST. 
Conditional jumps use sign-extended a as the jump displacement relative to ".", the location of the first byte of the conditional jump opcode, giving a displacement range of .-128d to .+127d.

We don't know how often a jump target will be "in range". Jim Sandman told me that a displacement of .-64 to .+63 would be "in range" about 90% as often as .-128 to .+127 using the Trinity 16-bit Mesa compiler and opcode set on the collection of programs which SDD has examined. When a jump target is out-of-range, the conditional jump must send control to a substitute in-range target that holds a JDB opcode to the real target location.

Here are the conditional jumps (B2 register addressing):

RJGEB    Register Jump Greater Equal Byte.
RJLB     Register Jump Less Byte.
RJEB     Register Jump Equal Byte.
RJNEB    Register Jump Not Equal Byte.
RJLEB    Register Jump Less Equal Byte.
RJGB     Register Jump Greater Byte.
JEBB     Jump Equal Byte Byte. Jump if [S] .eq. b; S _ S-1.
JNEBB    Jump Not Equal Byte Byte. Jump if [S] .ne. b; S _ S-1.
JCST     Jump on CST. Jump if the last CST executed did not store; the prediction for this opcode is always no jump (i.e., that the last CST was able to store).

Each of the RJxxB opcodes above, called "register jumps", is 3 bytes. These opcodes compare any register against [S] or [S-1]. "Any register" includes the 16d LRs, 16d ARs, [S] and [S-1] with the stack optionally popped, and 12d constants. This means that usually only one of the two values being compared needs to be pushed onto the stack first. Also, the fact that the stack pointer can be either popped or not-popped frequently saves an instruction. No one thinks that the bit which selects between [S] and [S-1] is useful, but Petit says it simplifies the IFU.

Each of the above conditional jumps except JCST is represented by two opcodes: one predicts "jump," the other "no jump."

Conspicuously, there are no unsigned compare RJxxB opcodes, which would be used for cardinal arithmetic and for low-order terms in multi-precision integer arithmetic. It is believed that use of 32-bit cardinals will be greatly diminished on Dragon. In the absence of an unsigned conditional jump, RXOR 20000000000b,[S],[S] is applied to each term of the comparison, and a signed conditional jump is used instead; this is about 6 code bytes and 2 cycles worse than having the required unsigned conditional jump opcode.

Also absent is any RJxxB based upon the AND of the two registers equal to 0. Because there are constants equal to 1 and 2, such an opcode would give low-order bit tests in one instruction. However, most other uses of the opcode would test a single field for 0 or non-0, and that can be done using EF, RJEB.

A matching WFX opcode is not provided because it would require two cycles (since all three registers must be read from the EU's RAM, and only two can be read in one cycle).

RSI and RSD are used to store data from any register while simultaneously incrementing or decrementing the pointer for the store. Because the destination for the updated pointer can be a garbage register, these opcodes can be used to store data at -1 or +1 displacements relative to any pointer without smashing the pointer; the -1 displacement is a convenient method of writing one of the overhead words of a block. However, there is no similarly convenient way of writing word -2 of a block.

The fetch equivalents of RSI and RSD (which would be called RFI and RFD) are not provided because they would have to specify two destination registers, one for the updated base register and one for the data being fetched, and this would take an extra cycle.

RB, RSB, WB, WSB, and PSB are like the ones in 16-bit Mesa except for RSB, which pushes the destination onto the stack rather than overwriting the base register. My programming examples indicate that RSB is useful.

For the PRL opcode, ab is sign-extended. PRL is intended primarily for pushing a global frame pointer onto the stack at procedure entry; and it is sometimes an economical way to push a 32-bit constant onto the evaluation stack. PRL is suboptimal; ab as a signed word displacement relative to the word-part of ".", quadrupling the displacement range, would be better.
However, unless LFC (Local Function Call, discussed later) can also be made to work this way, code segments are limited to 2^15 bytes, so the larger range isn't worth much.

Because SRIn does S _ S-1, it must be preceded by a DUP to avoid losing the data.

    The alternative sequence of SRIn, REC, doesn't work on Dragon because, unlike 16-bit Mesa, the method proposed by Russ Atkinson for saving and restoring process state doesn't save any values above [S]. In other words, REC cannot be used to recover data above [S].

The following table shows how various kinds of memory references can be accomplished on Dragon using proposed opcodes. No opcodes are explicitly intended for pushing an AR onto the stack or popping the stack into an AR, but this can be done in either direction using the multi-register RAND or ROR opcodes, for example. In the examples below, LAn will be used to denote a RAND that pushes ARn onto the stack, and SAn will denote a RAND that pops the stack into ARn.

Base Register   Displacement            Source or Destination   Method

AnyR            -1                      AnyR        Read by RFX. Write by RSD.
AnyR            +1                      AnyR        Read by RFX. Write by RSI.
AnyR            AnyR                    AnyR        Read by RFX. Write by pushing the address with RADD, then WB.
Stack           0 to 377b               Stack       Read by RB or RSB. Write or put by WB, WSB, and PSB. The variations allow the base register to be retained on stack or not.
Stack           0 to 377b               LRm         Read by RB or RSB followed by SRm. Write by LRm followed by WSB or PSB.
Stack           0 to 377b               ARm         Read by RB or RSB followed by SAm. Write by LAm followed by WSB or PSB.
LRn             0 to 377b               Stack       Read by LRIn. Write by SRIn. Put by DUP, SRIn.
LRn             0 to 377b               LRm         Read by RRI. Write by WRI.
LRn             0 to 377b               ARm         Read by LRIn, SAm. Write by LAm, SRIn.
ARn             0 to 377b               Stack       Read by LAn, RB. Write by LAn, WB. Put by DUP, LAn, WB.
ARn             0 to 377b               LRm         Read by RAI. Write by WAI.
ARn             0 to 377b               ARm         Read by LAn, RB, SAm. Write by LAn, LAm, WB.
Stack           .ls. -1 or .gr. 377b    Stack       Read by pushing the constant displacement onto the stack and using RFX to replace [S] and [S-1] by the word at ([S]+[S-1])^. If the base register is at [S] and the data being stored at [S-1], write by using ADDDB or ADDNB followed by WB; if the base register is at [S-1] and data at [S], precede this with an EXCH.
Stack           .ls. -1 or .gr. 377b    LRn         Read as above; then SLn. Write by first using ADDDB or ADDNB to compute the effective address; then do LRn, WSB.
Stack           .ls. -1 or .gr. 377b    ARn         Read as above; then SAn. Write by first using ADDDB or ADDNB to compute the effective address; then do LAn, WSB.
LRn             .ls. -1 or .gr. 377b    Stack       Read by pushing the constant displacement onto the stack and using RFX to do [S] _ (LRn+[S])^. Write by computing the effective address with LRn followed by ADDDB or ADDNB; then do WB.
LRn             .ls. -1 or .gr. 377b    LRm         Read by pushing the constant displacement onto the stack and using RFX to do LRm _ (LRn+[S])^, S_S-1. Write by first computing the effective address with LRn followed by ADDDB or ADDNB; then do LRm, WSB.
LRn             .ls. -1 or .gr. 377b    ARm         Read by pushing the displacement and using RFX to do [S] _ (LRn+[S])^; then do SAm. Write by first computing the effective address with LRn followed by ADDDB or ADDNB; then do LAm, WSB.
ARn             .ls. -1 or .gr. 377b    LRm         First push the constant displacement onto the stack; then add it to ARn, leaving the result in any convenient AR. Finally, read by RAI or write with WAI.
ARn             .ls. -1 or .gr. 377b    Stack       Read by pushing the constant displacement onto the stack and using RFX to do [S] _ (ARn+[S])^. Write by first computing the effective address with LAn followed by ADDDB or ADDNB; then do the store with WB.
ARn             .ls. -1 or .gr. 377b    ARm         Read by pushing the constant displacement onto the stack and using RFX to do ARm _ (ARn+[S])^. Write by first computing the effective address with LAn followed by ADDDB or ADDNB; then do the store with LAm, WSB.
PC              .-100000b to .+77777b   Stack       Read with PRL (Read-only).
PC              .-100000b to .+77777b   AnyR        Fetch to the stack first, then move to the desired register (Read-only).

In the above table, it is seen that the common constant displacements from -1 to +377b have an efficient implementation for most combinations of base register and source or destination register. The exceptions are ...

7.6. Stack Operations

The first group of opcodes are ones which modify the stack pointer:

REC      Recover. S_S+1.
DIS      Discard. S_S-1.
AS       Adjust Stack. S_S+signed a.

These opcodes do not use any EU cycles, so they are free if the IFU is caught up.

Then there are the following operations for manipulating operands on the evaluation stack:

EXCH     Exchange. Q _ [S-1] and [S-1] _ [S] in the first cycle; [S] _ Q in the second cycle.
DUP      Duplicate. S_S+1; [S]_[S-1].
EXDIS    Exchange Discard. [S-1] _ [S]; S_S-1.

Except for EXCH, these are "move" class opcodes, which means that they don't get held one cycle following a fetch. EXCH is an "op" class opcode.

Many proposed opcodes accept all operands on the evaluation stack and leave their results on the evaluation stack. These opcodes are compact because no extra bytes are required to specify register addresses:

AND      [S-1] _ [S-1] & [S]; S_S-1.
OR       [S-1] _ [S-1] U [S]; S_S-1.
XOR      [S-1] _ [S-1] xor [S]; S_S-1.
ADD      [S-1] _ [S-1]+[S]; S_S-1; trap on integer out-of-range; Carry unchanged.
SUB      [S-1] _ [S-1]-[S]; S_S-1; trap on integer out-of-range; Carry unchanged.
BNDCK    trap if [S-1]-[S]-1 has carry-out = 1; S_S-1.
NILCK    trap if [S] is 0; S_S-1.
etc.

Arithmetic and shifter stack operations are discussed in later chapters.

7.7. Register-to-Register Operations

Many opcodes are in the register-to-register format in which a 3-byte opcode specifies three operands in ab:

RAND     Register AND.
ROR      Register inclusive OR.
RXOR     Register XOR.

Shifter and arithmetic operations are discussed in a later chapter.

7.8. Move Operations

The RAND or ROR 3-byte opcodes can move data between two registers in 1 cycle by specifying that both operands are the same register (or other operations can be used with Zero or MinusOne hard-wired constants). However, the following opcodes are more compact and are 1 cycle faster when the value moved was fetched from storage by the preceding opcode (i.e., they are "move" rather than "op" operations, as discussed in the "Hardware Overview" chapter):

LRn      Load Register n. Push LRn onto the stack (1-byte opcode).
SRn      Store Register n. Pop the stack into LRn (1-byte opcode).
RMOV     Register MOVe. LRm _ LRn; m=a[0..3], n=a[4..7] (2-byte opcode).

DUP and EXDIS mentioned earlier are also "move" operations.

DUP, SRn is probably the best way to store the stack into LRn without popping (because SRn, REC doesn't work for the reasons discussed earlier). The RAND or ROR alternative has to be used for anything involving an AR.
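Before going on to conditional stores, the stack-pointer conventions of sections 7.6 through 7.8 can be made concrete with a toy C model of the evaluation stack; the struct and the depth chosen are illustrative and say nothing about the real EU data paths.

    #include <stdint.h>

    typedef struct {
        uint32_t reg[64];   /* evaluation stack registers; depth is arbitrary here */
        int      S;         /* stack pointer; reg[S] is the top of stack */
    } EvalStack;

    void Rec(EvalStack *e)         { e->S += 1; }     /* Recover */
    void Dis(EvalStack *e)         { e->S -= 1; }     /* Discard */
    void As(EvalStack *e, int a)   { e->S += a; }     /* Adjust Stack by signed a */

    void Dup(EvalStack *e)         /* S_S+1; [S]_[S-1] */
    {
        e->S += 1;
        e->reg[e->S] = e->reg[e->S - 1];
    }

    void Exch(EvalStack *e)        /* swap [S] and [S-1]; two cycles on Dragon */
    {
        uint32_t q = e->reg[e->S - 1];
        e->reg[e->S - 1] = e->reg[e->S];
        e->reg[e->S] = q;
    }

    void Exdis(EvalStack *e)       /* [S-1]_[S]; S_S-1 */
    {
        e->reg[e->S - 1] = e->reg[e->S];
        e->S -= 1;
    }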
7.9. Conditional Store

The lowest level operation which can be used to achieve atomic operation on a Dragon multi-processor is the memory system's fetch-and-hold reference, which holds onto the M bus after the reference completes. At present the only opcode proposed which uses fetch-and-hold is the "conditional store" or CST opcode proposed by Russ Atkinson.

The idea behind CST is that a procedure which wants to access a shared data structure atomically computes a result based upon a "no interference" assumption; then it uses CST to store the result only if some memory word still has its original value. In so far as the sameness of that word guarantees that no interference has taken place, CST will change memory only when there has been no conflicting access; otherwise, the process must try again. A special conditional jump JCST opcode tests whether or not CST succeeded.

CST      Conditional STore. This opcode has register-to-register format. [S] points at the word to be changed, Ra is the old data, and Rb is the new data. In the first cycle, S_S+1, [S] _ ([S-1])^ is performed using fetch-and-hold; in the second cycle, [S] is compared against Ra, and the stack is popped; in the third cycle, Rb is written into [S]^ iff the values were equal. The result (whether or not the store took place) is saved in an IFU flipflop for testing by the JCST opcode. Ra and Rb are limited to LRs, ARs, and constants (stack operands illegal). The stack has been popped a total of one time at the end of the opcode.
JCST     Jump on CST. Jump if the last CST executed did not store. This jump is always predicted to not jump (i.e., the prediction is that CST succeeded).

The CST flipflop must be saved and restored across process switches.

7.11. Other Opcodes

It may sometimes be necessary to substitute "no operation" opcodes into code sequences. The opcode used for this purpose should be as fast as possible. In addition, Petit wants a 1 cycle NOP to let pipelines unscramble during debugging:

NOP0     0 cycle No Operation for padding.
NOP1     1 cycle No Operation for debugging.

Petit has told me that a different breakpoint opcode will be needed for each length of opcode. I don't know how this will work.

BRK      Breakpoint (3 BRK opcodes, one for each non-jump opcode length).

We need a limited ability to read and write the IFU and EU registers. Some potential applications are frame underflow and overflow, process switching, and free variable searching; for these, code size is unimportant, so we want to get good performance with a small number of opcodes. Something like the 16-bit Mesa VERSION opcode, which pushes the microcode rev level, possibly coprocessor information, etc. onto the stack, must also be provided. EU registers include ICAND and Q. The four opcodes for this are as follows:

LIFUR    Load IFU Register. S_S+1; [S] _ an IFU register selected by a.
SIFUR    Store IFU Register. An IFU register selected by a _ [S]; S_S-1.
REUR     Read EU Register. S_S+1; [S] _ data from an internal EU register selected by a.
EUSF     EU Special Function. Sends a to the EU as a special function. This in general causes data from the stack to be popped into some internal register (i.e., Q, ICAND, or MODE), or some flipflop to be cleared, etc.

The meaning of the various values of a will be given later.
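Returning to the conditional store of section 7.9, the retry idiom that CST and JCST support looks as follows when modeled in C. CstWouldStore captures only the compare-and-conditional-store outcome; the atomicity that makes the real opcode safe comes from the fetch-and-hold reference and is not modeled here, and both function names are illustrative.

    #include <stdint.h>

    /* Returns 1 if the store took place, 0 if the guarded word had changed
       (the case JCST would jump on). */
    static int CstWouldStore(uint32_t *addr, uint32_t oldData, uint32_t newData)
    {
        if (*addr != oldData)
            return 0;            /* conflicting access; caller must retry */
        *addr = newData;         /* values matched, so the store happens */
        return 1;
    }

    /* Example use: atomically add one to a shared counter. */
    void AtomicIncrement(uint32_t *counter)
    {
        uint32_t oldData, newData;
        do {
            oldData = *counter;          /* compute under the no-interference assumption */
            newData = oldData + 1;
        } while (!CstWouldStore(counter, oldData, newData));
    }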
8. Arithmetic

Dragon will fully support twos-complement integer arithmetic, whether it be single, double, N, or arbitrary precision. Dragon does not intentionally support cardinal arithmetic, which would require the addition of RCADD and RCSUB opcodes that trapped on cardinal result out-of-range. Conditional jumps based upon unsigned comparisons are also missing.

The reasoning here is that Dragon must support some kind of signed arithmetic, and twos-complement is proposed; ones-complement, cardinal, and sign-magnitude arithmetic can also be supported, but at the expense of chip area and opcodes that can be better employed for something else. I have assumed twos-complement because we have used it on earlier machines.

However, IEEE floating point uses a sign-magnitude representation, and we may regret the incompatibility between our floating point and integer representations later. A future floating point coprocessor might, for example, have slightly more trouble providing integer multiplication and division as well as its floating point operations.

    According to Prof. Kahan of UC Berkeley, one reason why sign-magnitude is preferred for floating point is that some applications which take the reciprocal of infinity need to know the correct sign when the reciprocal of zero is again taken later, so 1/+0 = +infinity while 1/-0 = -infinity.

Dragon may have to carry out arithmetic on the following:

    16-bit integers     16-bit cardinals
    31-bit integers     pointers
    32-bit integers     32-bit cardinals

Any size of integer or cardinal less than 32 bits may be treated as a 32-bit integer once it reaches a register, so the only problems are sign-extension following the fetch of an integer and bounds-checking prior to storing back a result. These problems, separable from the arithmetic operations themselves, are dealt with below.

Similarly, Dragon pointers are 31 or fewer bits long, so integer arithmetic can be used for pointer comparisons or other pointer arithmetic.

8.1. Overflow, Underflow, and Carry

16-bit Mesa does not have the best primitives for N-precision arithmetic due to deficiencies in carry propagation; its double-precision opcodes are sufficient but not ideal. Also, it does not detect out-of-range arithmetic results. Although final results of an expression evaluation may be bounds-checked, any out-of-range value during intermediate calculations of an expression goes undetected. This inconsistency seems worth fixing for Dragon.

Detecting out-of-range values cannot be done without incompatibility. In evaluating expressions with more than two terms, an intermediate overflow, for example, might be corrected by a subsequent underflow such that the final result is accurate in spite of the intermediate error. A program relying on such a phenomenon will trap on Dragon. However, since Dragon has 32-bit words, and many existing programs use 16-bit arithmetic where possible, my guess is that there will not be much software that
Fiala26 August 198367suffers from this problem.Suppose we consider an expression where a clever programmer has deliberately chosen a sequence ofadditions that experiences intermediate out-of-range results but winds up with the correct answer at theend. This kind of expression is where Dragon will detect an unnecessary out-of-range and do the "wrongthing". With 32-bit integers, 3 terms suffice to cause this event. However, if the terms are restricted to31-bit integers, at least 5 terms are needed; with 30-bit integers, 9 terms; 29-bit, 17 terms; etc. It seemsthat Dragon's detection of underflow and overflow effectively reduces arithmetic precision by several bitsto ensure against unintentional overflows.I think this is a good trade. The proposals are then as follows: Two flipflops called Carry and IntOvfEnable will appear in the process state. Certain opcodes will useCarry as the carry-in for addition or subtraction; some will change the value of Carry.Similarly, some opcodes will detect out-of-range. Integer out-of-range occurs when the carries into andout of bit 0 are different, cardinal out-of-range when the carry-out of bit 0 is 1 on addition or 0 onsubtraction. Out-of-range detection must not require extra execution cycles when the result is in-range.Cardinal out-of-range will always trap; integer out-of-range will trap if enabled by IntOvfEnable=1. Aprocess can disable integer out-of-range traps for compatibility with old Mesa, or for some kind ofarithmetic where overflows are legitimate.Cardinal out-of-range could also be conditional, but the BNDCK (Bounds Check) opcode, which traps on cardinaloverflow, would have to ignore the enable and unconditionally trap. BNDCK is the only cardinal operation atpresent; it may be easier to make any other cardinal operations which are added later unconditionally trap on out-of-range also.The out-of-range traps occur as follows: Any integer add or subtract opcode that detects out-of-rangewill cause the IntOutOfRange trap if carry-out of bit 0 is different from the carry-out of bit 1. Anycardinal add will cause the CardOutOfRange trap if carry-out=1, and any cardinal subtract, if carry-out=0. Each trap procedure is entered as though the opcode had never been executed, so the out-of-range result has not been stored; no change to StkP has occurred; and the PC has not been advanced.The integer out-of-range trap is interpreted according to the OutOfRange and OutOfRangeEnable bits inSTICKY (a word in the process state). Then a process can:(1) Totally disable out-of-range detection by forcing IntOvfEnable=0 so that integer overflowsor underflows go unnoticed.(2) Set IntOvfEnable=1 but leave OutOfRangeEnable=0; then the trap software will setOutOfRange=1, set IntOvfEnable=0, advance the PC, and continue (notices the firstoccurrence of the out-of-range event).(3) Set both IntOvfEnable=1 and OutOfRangeEnable=1 to trap all out-of-range events.The trap can also identify the opcode responsible for the trap because the PC points at it.The RVADD and RVSUB opcodes do no out-of-range checking, and they have the most general form ofregister addressing, so the out-of-range trap has a slow but straight-forward way to complete theoperation which trapped, if that is desired. One possibility is that the trap subroutine may want todistinguish underflows from overflows. For integer addition and subtraction, the out-of-range conditionis an overflow when the out-of-range result is negative and an underflow when positive. 
No extra fp!q4] G?fp bq ^)8 \@( [H YL75 W94 Ue S* PzB Mg K>W G17 FG D7 \ Bl'@ @H >*y<sF'y:)Cy9T gy7 4qQ 2Z 1S /D?$ -z!B *] (=:y%5 Sy#jy bMy+K,y&yK S[ \ 3. LE <, X pB\5Dragon DocumentEdward R. Fiala26 August 198368hardware is needed to distinguish these, but the integer and cardinal out-of-range traps must be distinctto allow this.Multiplication or division can produce a product or quotient exceeding the allowed precision. For theseopcodes, however, out-of-range detection is not assisted by hardware.8.2. Sub-word NumbersIn a strongly-typed language such as Mesa, numbers smaller than one word are conveniently stored ascardinals subject to a full-word integer offset; the compiler applies the offset when the value in the fieldmust be converted to a 32-bit integer for combination with other values.To illustrate, converting a 16-bit cardinal with a -100000b offset to a 32-bit integer takes 1 cycle onDragon (RSUB x_x-100000b), if either 100000b or -100000b is available in a LR, AR, or constant;otherwise, it takes 2 cycles (LIDB 100000b, SUB); converting a 32-bit integer back to an offset cardinalalso takes 1 cycle (ADDDB 100000b). Furthermore, in some cases where the value is fetched, modified,and stored back, the compiler can cancel the offset applied at the fetch against the offset applied at thestore, avoiding both operations. Or if several different numbers, each with its own offset, are combined,the compiler can apply a single offset to the whole expression.However, using sub-word integers instead of cardinals-with-offset is slower. The best Dragon sequencefor sign extension of 16-bit integers is probably:EXTS:RXOR [S]_100000b xor [S][S] _ [S] xor 100000bRSUB [S]_[S]-100000b[S] _ [S] - 100000bThis sequence is 2 cycles, if 100000b is in an LR, AR, or constant, else 3 cycles; in other words, it is 1cycle slower and 1 to 3 code bytes larger than the the sequence required for cardinal-with-offset.Although sub-word integers are inconvenient on Dragon, it is unfortunately the case that 16-bit integersare a defined data type for network communication (RPC, Currier, etc.), so Dragon systems will, at aminimum, have to convert between 16-bit integers and an internal format when going to and from thenetwork.Although 32-bit integer overflows are detected during expression evaluation, an additional bounds-checkmay be required before storing a value into a field. Suppose that a field is defined to hold an integer Nsuch that A .le. N .ls. B; this is equivalent to 0 .le. N-A .ls. B-A, and N-A is the cardinal value stored inthe field subject to the offset A. Note that the single check of (N-A)-(B-A) producing cardinal overflowdetects both underflows and overflows. Without such an opcode, bounds-checking takes two conditionaljumps (or an unsigned conditional jump, but there aren't any in the opcode set).Some common range descrimination operations are as follows:; Jump if value is a 16-bit cardinalDUPSHIFT -20bRight-shift 16d positionsRJEB [S]&-1,Zero,CardJump if value is cardinal & pop; Give BndChk trap if value is not a 16-bit cardinalLI 200000bBNDCK; Jump if value is a 16-bit integerRVADD Push [S]+100000bAssumes 100000b is a LR, AR, or constant fp!q4] G?fp bqR `S \h [E Up Rq?$ P=/ NH Kd IH G>* F$#B DZc BV @? =S X ;2x8s)W7f)W 4q+? 3 K /J -_ ,^ *N &L %X #GT !}Q 41 P u;xs$S )W  )Wx4 2x # 3)W! A]/Dragon DocumentEdward R. 
Fiala26 August 198369SHIFT -20bRight-shift 15d positionsRJEB [S]&-1,Zero,IntJump if value is 16-bit integer & pop; Give BndChk trap if value is not a 16-bit integerRVADD Push [S]+100000bLI 200000bBNDCKDIS8.3. Small Integers for Lisp and SmalltalkUnlike Mesa, Lisp and Smalltalk must encode integers in the same 32-bit space as pointers. In otherwords, some values illegal as pointers encode integers. In this way, pointers can be distinguished fromnumbers in untyped storage. It is desirable to have an efficient representation for such "small integers"on Dragon.One attractive implementation is to represent small integers in the range -230 to +230-1 with an offset of+3 x 230. Then pointers always have bit 0=0; small integers use up all values with bit 0=1.Packing and unpacking small integers can take place as follows:Unpack:RJGEB [S],Zero,NotSIJump if positive, which means not-a-small-intRVSUB [S]_[S]-30000000000bSubtract offsetPack:RVADD [S]_[S]+300000000000bOffset the numberRJGEB [S],Zero,BigResJump if offset value is positive, indicating out-of-rangeThe RVADD/RVSUB operations do not check for either integer or cardinal out-of-range. The30000000000b constant/AR saves 1 cycle and 3 bytes in both packing and unpacking; 100000000000bcould be used instead. Without one of these constants, an LILDB must be added to each of thesequences.Dragon has one more hardware feature intended for use with 31-bit integers. The determination ofinteger out-of-range can be changed from the normal 32-bit method (out-of-range => carry out of bit 0unequal to carry out of bit 1) to a 31-bit method (out-of-range => carries out of bits 0, 1, and 2 are notall the same). If Lisp or Smalltalk enables this feature, then it would trap any 31-bit out-of-range.When the 31-bit mode is used for single-precision, it affects all arithmetic, so the high-order word of an N-precisionnumber must also have 1 less bit of significance. Also, integer operations on 31-bit pointers are no longersatisfactory; ADD, SUB, RADD, and RSUB opcodes could get out-of-range when pointer arithmetic caused a carryacross the 230-word boundary. So probably the 31-bit arithmetic method should be accompanied by a reduction inthe size of VM to 232 bytes.8.4. Multi-word Numbers16-bit Mesa has the unfortunate non-feature of representing 32-bit numbers in storage with the high-order word at location 1 and the low order word at location 0 of an array. This is a non-feature because,when the 32-bit quantity is shipped over a network and received by a 32-bit machine, the value isreceived with the left and right halves of a 32-bit word exchanged. Dragon must avoid this problem byrepresenting multi-word quantities in storage with the high-order word first and the low-order word last.In other words, an N-precision number in storage must have its high-order word at location 0 and itslow-order word at location N-1. fp!q4] G?fpbAs )W  `)W!x^B3\[ Z X Sp+ PWq^ NK L` J G<HGHG EF$E+) B%?x?ds)W)>)Wx;e)W :)W5 6qT 4F 3 U 1U --4 ,,9 *N#G (?'y%svy$a2:y#.>y!Y !!YT y ? p JqE S X f  "G U\  l CB[FDragon DocumentEdward R. Fiala26 August 198370Dragon can put multi-precision arguments onto the evaluation stack in either the forward or reverseorder, according to its preference. However, it is important to choose one consistent ordering convention,so that the results of one N-precision evaluation can be used by another. The examples below put thehigh-order terms nearer the top of the stack.8.5. 
Arithmetic Accuracy and Precision16-bit Mesa implements single-precision and double-precision signed or unsigned integer arithmetic. Aproblem is pointed out by the following example:IntR _ Int0 + (Int1 * Int2 * Int3 / (Int4 * Int5)) + Int6If this calculation were done single-precision in 16-bit Mesa, each partial evaluation would be truncatedto a single-precision number, and overflows would not be detected. Each multiply, for example, wouldproduce a double-precision result, but the high-order part would be discarded; and each add might causean undetected overflow.Even though the final result might always be single-precision, the intermediate evaluation could becometriple-precision, so double-precision does not necessarily suffice for the calculation.Triple or higher precision could be used to avoid overflow, or arbitrary-precision arithmetic could beused, in which the result vector for each partial evaluation expanded or shrank to accommodate thecurrent size requirement. I think that the "Big Nums" used in MIT's Math Lab and the integers used bySmalltalk are arbitrary-precision integers(?), and this kind of arithmetic has its uses, though it is slow.Some kind of arbitrary-precision integer or fixed-point arithmetic could be considered as an addition toCedar.There are some important differences in how one approaches double, N, and arbitrary-precisionarithmetic. For double or triple-precision, it may be preferable to pass arguments on the evaluationstack; at entry to the DADD, DSUB, DMUL, DDIV, or whatever procedure, the arguments appear inLRs where they can be quickly accessed; this is the approach taken by 16-bit Mesa. At quintuple orlarger precision, however, this becomes unworkable; even for smaller precision, it may be advantageousto pass pointers to the number-vectors rather than values. Below are some examples with argumentsspread on the evaluation stack and some with arguments referenced through pointers.8.6. Basic Arithmetic OperationsFor addition and subtraction, three versions of each opcode are used on N-precision integer algorithms:The low-order operation sets carry-in to 0 (addition) or 1 (subtraction) and loads Carry with carry-out;middle operations use Carry as carry-in and load Carry from the carry-out; the high-order operation usesCarry as carry-in and detects integer overflow.Single-precision operations set carry-in to 0 (addition) or 1 (subtraction) and detect integer overflow atthe same time.However, it is possible to combine the single-precision add with the high-order N-precision add bysetting Carry=0 whenever an operation which uses Carry also checks for out-of-range; this approach alsocombines the low-order and middle-order N-precision add operations. Then Carry can be non-zero onlyin the middle of an N-precision addition or subtraction. To also combine the single-precision subtract fp!q4] G?fp bq-6 `S0; ^R \- Wp' T3q+; Rh0xN9 KV IH G*= F$ BX @W =v>( ;^ 9'? 8X 6Kh 4 1'6 /DK -z] +F )33 (;' &OS !6p! q/8 L /Y d/ (B (  W T _ VA& B] Dragon DocumentEdward R. Fiala26 August 198371with the high-order N-precision subtract, Carry must be inverted on subtraction--a subtract which savescarry-out must save carry-out' instead, and the carry-in for a subtract must be Carry'.For multiplication hardware provides only four steps of the full operation, and for division, only onestep, so even single-precision calculations will trap. The primitives should be unsigned multiply-step anddivide-step because unsigned arithmetic requires one more bit of precision than signed arithmetic. 
[There is an easy conversion between signed and unsigned products, so this is not strictly necessary for multiply; but for divide I think it is necessary.] The full multiplication or division trap opcode then computes the sign(s) of the result(s), makes all arguments positive, carries out the unsigned operation, and negates result(s) as needed. Also, the sequence of primitives for full-word multiplication and division should not require many conditional jumps because these are slow on Dragon.

The following basic arithmetic opcodes are proposed:

ADD      Add. [S-1]_[S-1]+[S]; S_S-1; trap on integer out-of-range; Carry unchanged.
SUB      Subtract. [S-1]_[S-1]-[S]; S_S-1; trap on integer out-of-range; Carry unchanged.
ADDB     Add Byte. [S]_[S]+a; trap on integer out-of-range; Carry unchanged.
ADDNB    Add Negative Byte. [S]_[S]+(-1,,-400b+a); trap on integer out-of-range; Carry unchanged.
ADDDB    Add Double Byte. [S]_[S]+ab; trap on integer out-of-range; Carry unchanged.
RADD     Register Add. Rc _ Ra+Rb+Carry; Carry _ 0; trap on integer out-of-range.
RUADD    Register Unsigned Add. Rc _ Ra+Rb+Carry; Carry _ carry-out.
RVADD    Register Vanilla Add. Rc _ Ra+Rb; Carry not affected; no trap.
RSUB     Register Subtract. Rc _ Ra-Rb-1+Carry'; Carry _ 0; trap on integer out-of-range.
RUSUB    Register Unsigned Subtract. Rc _ Ra-Rb-1+Carry'; Carry _ carry-out.
RVSUB    Register Vanilla Subtract. Rc _ Ra-Rb; Carry not affected; no trap.
RUMSTEP  Register Unsigned Multiply Step. 8d repetitions of RUMSTEP completes an unsigned multiply with two 32-bit arguments and a 64-bit result. Details TBC. Perhaps the opcode should do 2 steps in 2 cycles rather than 1 step in 1 cycle?
RUDSTEP  Register Unsigned Divide Step. 32d repetitions of RUDSTEP completes an unsigned divide with a 64-bit dividend and 32-bit divisor. Details TBC.
JEBB, JNEBB    Conditional jumps based upon comparing [S] and b.
RJEB, RJNEB, RJLB, RJGEB, RJLEB, RJGB    Conditional jumps based upon integer comparisons of two registers (see opcode summary later).
BNDCK    Bounds Check. Trap if [S]-[S-1] has carry-out=0; S_S-1. In other words, [S] is a "bound"; if [S-1] is unsigned greater than this bound, then trap. The fact that Carry is factored into the comparison allows BNDCK to be used as the high-order part of an N-precision comparison in which RUSUB is used for low-order terms.

The low-order and middle-order N-precision add and subtract opcodes are the same for cardinals and integers--RUADD and RUSUB are used. However, different high-order operations are needed to check for cardinal overflow rather than integer overflow; the missing operations for 32-bit cardinals, which would be called RCADD and RCSUB, are not included in Dragon's opcode set.

Nor does Dragon have any unsigned conditional jumps; the most useful of these would be RJUGEB (Register Jump Unsigned Greater than or Equal to Byte) and RJULB (Register Jump Unsigned Less than Byte). To do these comparisons without special opcodes, the integer conditional jumps can be used after complementing the sign bit of each number being compared (with RXOR Ra_20000000000b xor Ra; RJGEB replacing RJUGEB). This adds 2 cycles and 6 bytes to each such conditional jump. Unsigned conditional jumps would also be used for block comparison operations, as shown in an example later.
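The sign-bit workaround just described can be written in C as follows; UnsignedGreaterEqual simply names what an RJUGEB opcode would test directly.

    #include <stdint.h>

    #define SIGN_BIT 0x80000000u    /* 20000000000b, the constant used with RXOR */

    int UnsignedGreaterEqual(uint32_t a, uint32_t b)
    {
        /* Complementing the sign bit of both operands maps unsigned order
           onto signed order, so an ordinary signed compare (RJGEB) suffices. */
        int32_t sa = (int32_t)(a ^ SIGN_BIT);
        int32_t sb = (int32_t)(b ^ SIGN_BIT);
        return sa >= sb;
    }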
This adds 2 cycles and 6 bytes to each such conditional jump.Unsigned conditional jumps would also be used for block comparison operations, as shown in anexample later.Magnitude and Negate operations have problems with overflow; here are the cases:integer in, integer outNegating 100000b,,0 (i.e., negating the most negative value) must cause overflow. RSUB[S]_Zero - [S] can be used for negation, preceded by RJGEB for magnitude.integer in, cardinal outMust avoid overflow when the argument being negated is zero. RJEB Zero,Ra,x (fornegate) or RJGEB Zero,Ra,x (for magnitude) followed by RSUB Zero,Ra,Ra works.cardinal in, integer outSign=1 should overflow except 20000000000b shouldn't. This is a problem for negate;magnitude of a cardinal doesn't make sense. First bounds-check against 20000000001b; ifthat succeeds, then use RVSUB [S] _ Zero - [S].The following operations are implemented with sequences of primitive opcodes; some of these may beordinary procedures; others are likely to be trap opcodes. The kinds of multi-precision arithmetic usedwill determine the actual opcodes--these are just examples:MULSMultiply Short. [S-1] _ Lowpart([S]*[S-1]); S _ S-1; detect integer out-of-range if Highpart([S]*[S-1]) issignificant; [S] and [S-1] are integers. This is the primitive for single-precision integer multiply.DIVSDivide Short. Q _ remainder([S]/[S-1]); [S-1] _ quotient([S]/[S-1]); S _ S-1; [S] and [S-1] are integers.This is the primitive for single-precision integer divide.REMSRemainder Short. [S-1] _ remainder([S]/[S-1]); S _ S-1; [S] and [S-1] are integers.UMULUnsigned Multiply. [S-1] _ Highpart([S]*[S-1]); [S] _ Lowpart([S]*[S-1]); [S] and [S-1] are cardinals.Variations: results reversed or one part left in Q, stack pointer decremented by 1.UMULAUnsigned Multiply and Add. Q _ Highpart(([S]*[S-1])+2S), [S-1] _ Lowpart(([S]*[S-1])+2S); S _ S-1.Since the maximum value of ([S]*[S-1])+2S is (2^n-1)*(2^n-1)+(2^n-1) = 2^2n - 2^n, the result cannever overflow. This is used by the N-precision multiplication example below, and it may be useful inthe N-precision division.UDIVUnsigned Divide. Q _ Quotient([S]..[S-1]/2S); 2S _ Remainder([S]..[S-1]/2S); S _ S-2; [S]..[S-1] is a 64-bit cardinal dividend and 2S is a 32-bit cardinal divisor. Possibly reverse the location of the results.BLCBlock Compare, a generalization of the Trinity BLE (Block Equal) opcode. The word count N and twoN-precision numbers are passed on the stack. An N-precision compare of the two numbers is made,returning +1, 0, or -1 for [S] greater, equal, or less than. Altogether collapses three stack argumentsinto one stack result. Variations: argument arrangements; arbitrary precision with N specified in thefirst list or array element.8.7. Multi-precision Addition and SubtractionHere is 3-precision subtraction TSUB(C,A,B) returns C and performs C _ A-B; the numbers are allreferenced through pointers:LRc/Pointer to CLRa/Pointer to ALRb/Pointer to BTSUB:3; Entry byte = 3 argumentsLRIa 2, LRIb 2RUSUB [S-1] _ [S-1] - [S] & pop; Low-order carry-in = 1 because Carry=0SRIc 2LRIa 1, LRIb 1RUSUB [S-1] _ [S-1] - [S] & pop; Middle carry-in = Carry = low carry-outSRIc 1LRIa 0, LRIb 0RSUB [S-1] _ [S-1] - [S] & pop; High carry-in = Carry = middle carry-out; detect out-of-rangeSRIc 0RET 1; Returns C as the result fp!q4] G?fp bqD `S%8 ^ [PxXUs4VIxTV'&R6xPWFN,,M/ JGqU H|h F;xCs[BRxA.Z?:x>mTx= N;Sx:K\8)87?'6(x4M3gMx2b0I/DN-f, 'p. $q[ "Px%s x xc x[)W )W'9 w)W( U)W1   )W J pB\tDragon DocumentEdward R. 
Fiala26 August 198373This example uses two of the three register-to-register subtract opcodes. Addition is similar. Ifarguments are passed and returned spread on the stack, then the computation can be carried out directlyas follows:LR2,1,0/A0..A1..A2LR5,4,3/B0..B1..B2TSUB:6; Entry byte = 6 argumentsRUSUB LR0 _ LR0 - LR3RUSUB LR1 _ LR1 - LR4RSUB LR2 _ LR2 - LR5RET 3; Returns 3 resultsFor N-precision, with the terms referenced through pointers, the procedure would be as follows:LRc/Pointer to CLRa/Pointer to ALRb/Pointer to BLRn/Precision N (N .ge. 1)Carry/initially 0NSUB:4; Entry byte = 4 argumentsJMP NSBEGNSLP:RFX push (a+n)^RFX push (b+n)^RUSUB [S-1] _ [S-1] - [S] & pop; Low-order and middle-order subtractsRADD push_c+nWB 0; ([S]+0)^ _ [S-1]NSBEG:ADDNB 377b; N _ N-1 (last argument N at top-of-stack)RJGB [S],One,NSLP; Jump if not final term (don't pop stack)RFX push (a+n)^RFX push (b+n)^RSUB [S-1] _ [S-1] - [S] & pop; Final subtract checks for out-of-rangeRADD push_c+nWB 0; ([S]+0)^ _ [S-1]RET 1; Return pointer to C as the resultThis N-precision example is primarily interesting for its loop control and memory indexing operations. Asingle opcode to do the work of the RADD, WB 0 sequence above (analogous to RFX) would be useful;however, this would take 2 cycles on Dragon. Also, the ADDNB 377b, RJGB sequence might beimprovable.8.8. MultiplicationTBC. fp!q4] G?fp bqK `S!F ^ x[]s xY xV)WUT3RQq)W N#qPxJs xI xH6 xFxEt xBl)WA x>J<;)W%:' 8)Wx7f )W*6)W)43C1)W'0 /!)W-)W# *rqJ (H &7# % p qx @AKDragon DocumentEdward R. Fiala26 August 1983748.9. DivisionTBC.8.10. Multi-precision MultiplicationThe product of a multiplication in general has a precision equal to the sum of the precision of itsarguments. In arbitrary precision, this can be dealt with by enlarging the representation of the product;in N-precision, it can be dealt with by generating overflow if the 2N-precision product has significance inany of the high-order N words.The basic primitive is an unsigned multiply that produces a double-precision product. In this algorithm,the primitive multiply operation can be either signed or unsigned. This can be shown as follows: First,if the sign bits of both arguments are 0, the answers are the same either way. Alternatively, if oneargument has its sign bit 1, then either (231+A)*B for an unsigned multiply or (-231+A)*B for a signedmultiply will be computed. The difference between these two results is 232*B, so a signed product canbe derived from an unsigned product by subtracting B from the high-order result when the otherargument is negative, and vice versa. Note that only the high-order result is affected. In N-precision,this manifests as a modification to the high-order N words of the 2N-precision product.If one were willing to simply truncate the top N words of the 2N-precision product, the above resultwould allow the multiply algorithm to ignore the signs of the arguments, execute an unsigned multiplyalgorithm, and discard any partial products that do not contribute to the low N words of the product.This would be reasonable in a context where the compiler guaranteed no overflows.If it is necessary to check for overflow or to produce a full-precision product, then the multiplier andmultiplicand must be made positive, and the unsigned product must be conditionally negated. Theexamples below work this way. The brute force algorithm requires N2 single-precision multiplies and(2N2-2N) single-precision additions, which can be ordered in a variety of ways. 
Triple-precision multiplyof A x B produces the following terms:A2B2 highA2B2 lowA1B2 highA1B2 lowA0B2 highA0B2 lowA2B1 highA2B1 lowA1B1 highA1B1 lowA0B1 highA0B1 lowA2B0 highA2B0 lowA1B0 highA1B0 lowA0B0 highA0B0 lowThe algorithm below carries out N steps each consisting of an Nx1 unsigned multiply and an N+1-precision addition; the product precision grows by 1 after each step. High-order terms are kept abovelow-order terms on the stack. It is easy to show that the N+1-precision additions never carry into thenext-higher term because the high-order product word of an Nx1 multiply cannot exceed 177776b, andthis is always added to 0+Carry.One primitive used is UMUL, which computes [S]*[S-1] unsigned, leaving the high-order product at 0Sand the low-order product at [S-1]. Another is UMULA which computes ([S]*[S-1])+[S-2] unsigned,leaving the high-order result at [S-1] and the low-order result at [S-2], then it decrements the stackpointer by 1. It is easy to show that UMULA cannot cause carry-out. fp!q4] G?fp b ^q Yp% V!q.5 TV(B RL P MOS Ki IN GH|sGq%H|sGq F$IFsF$q DZH BN @W =S] ;M 9 W 7Q 480 2'9 0"!1ys0q /!/s/!qf -V&4^*+s<,(4^#'i,,& 4^#$,#G##!, #K% qZ  [ A16 v:(  :*9 o,4 *< D B\xgDragon DocumentEdward R. Fiala26 August 198375At call:LR2,1,0/3-precision multiplier A0..A1..A2LR5,4,3/3-precision multiplicand B0..B1..B2At return:LR5,4,3,2,1,0/6-precision productTMUL:6; Entry byte = 6 argumentsRXOR push _ LR2 xor LR5; Push negative value if must negate productLR2; RJGEB [S],Zero,APos; Jump if A0 positiveRUSUB LR0 _ Zero - LR0RUSUB LR1 _ Zero - LR1RSUB LR2 _ Zero - LR2APos:LR5; RJGEB [S],Zero,BPosRUSUB LR3 _ Zero - LR3RUSUB LR4 _ Zero - LR4RSUB LR5 _ Zero - LR5BPos:LR0; LR3; UMUL; [S]/ high product, [S-1]/ low productLR0; LR4; UMULALR0; LR5; UMULA; Now have 4-precision low-order partial product on stack in LR10..9..8..7.RMOV LR0 _ LR7; Copy complete low-order product into LR0LR1; LR3; UMULLR1; LR4; UMULALR1; LR5; UMULA; Now have another 4-precision partial product on stack in LR14..13..12..11.; Add 4-precision partial product to the high-order 3 words of the; 4-precision partial product underneath it.RUADD LR1 _ LR8 + LR11RUADD LR7 _ LR9 + LR12RUADD LR8 _ LR10 + LR13RADD LR9 _ Zero + LR14AS -5; Point StkP at LR9LR2; LR3; UMULLR2; LR4; UMULALR2; LR5; UMULA; Now have another 4-precision partial product on stack in LR13..12..11..10.RUADD LR2 _ LR7 + LR10RUADD LR3 _ LR8 + LR11RUADD LR4 _ LR9 + LR12RADD LR5 _ Zero + LR13AS - 7; Point StkP at LR6; Now have 6-precision value in LR0 to LR5RJGEB [S],Zero,PPos & pop; Test LR6 filled by RXOR at beginningRUSUB LR0 _ Zero - LR0RUSUB LR1 _ Zero - LR1RUSUB LR2 _ Zero - LR2RUSUB LR3 _ Zero - LR3RUSUB LR4 _ Zero - LR4RSUB LR5 _ Zero - LR5PPos:RET 6As given, the code takes about 34 cycles + 6*UMULA + 3*UMUL +(15 if one argument negative or12 if both arguments negative). If the UMUL and UMULA time is 13 cycles, then the overall executiontime is about 151 cycles + negation time. Performance could be improved at least 18 cycles by open-coding the individual multiplies and operating on the argument locals directly, rather than putting thearguments onto the stack first. 
The total register requirement of the procedure is 17b registers; hence,the method used cannot be extended to quadruple or larger precision, but a variant in which the N+1- fp!q4] G?fpxbAsx`!x_#x^ x\ )WxY)WXxqs )W+V)WUpqsqsSqsqsR"qsqsxPOqsqsMrqsqsKqsqsxJG )W&HGxF qs qs0DZqs)WqsB Au@x>Lx=/qs%qsx;qs(9qsqs89qsqs6qsqs4qsqs3Cqs)W1 0_.x-L,qsqs*rqsqs(qsqs'#qsqs%|qs)Wx# qs"P)W% qsqs%qsqs}qsqsqsqs/qsqsqsqsx q> #A  K UO Q [P yA]Dragon DocumentEdward R. Fiala26 August 198376precision addition is done by a procedure can be extended up to about 7-precision; beyond that, theterms must be referenced through pointers.BigNums would use something like the above program in a loop on two variables; CONSes wouldproduce storage for extra terms in the product.In the Mesa context, a TMULS procedure (Triple Multiply Short) which confines the product to thesame precision as its arguments and generates Overflow in exceptional cases is probably more useful. Inthis case, it is substantially faster to avoid five or six of the partial products altogether by verifying thatthey produce no significant terms:At call:LR2,1,0/3-precision multiplier A0..A1..A2LR5,4,3/3-precision multiplicand B0..B1..B2At return:LR2,1,0/3-precision productTMULS:6; Entry byte = 6 argumentsRXOR Push LR2 xor LR5; Push negative value if must negate productLR2; RJGEB [S],Zero,APos; Jump if A0 positiveRUSUB LR0 _ Zero - LR0RUSUB LR1 _ Zero - LR1RSUB LR2 _ Zero - LR2APos:LR5; RJGEB [S],Zero,BPos; Jump if B0 positiveRUSUB LR3 _ Zero - LR3RUSUB LR4 _ Zero - LR4RSUB LR5 _ Zero - LR5BPos:LR0; LR3; UMUL; A2 x B2LR2; RJNEB [S],Zero,ABig; Test A0 for significanceLR5; RJNEB [S],Zero,BBig; Test B0 for significance; Both high-order terms are smallLR0; LR4; UMULALR1; LR3; UMULLR1; LR4; UMULARMOV LR0 _ LR7; Save low-order termRUADD LR1 _ LR8 + LR10RUADD LR2 _ LR9 + LR11RADD LR12 _ Zero + LR12LI1; BNDCK; Trap if high term non-0AS - 6; Point StkP at LR6JMP TMF1; A0 is significant, so B0 and B1 must both be insignificant or overflow.ABig:ROR Push LR4 or LR5; Push B0 or B1.LI1; BNDCK; DIS; Trap if B0 or B1 is significantLR3; LR1; UMULALR3; LR2; UMULA; (A0..A1..A2) x B2TMFIN:LI1; BNDCK; DIS; Trap if high-order product term .ne. 0SL2; SL1; SL0TMF1:RJGEB [S],Zero,PPos & pop; Test LR6 filled by RXOR at beginning; Have to avoid out-of-range trap when result is exactly 100000b,,0.RADD Push (100000b,,0)+1; BNDCKRUSUB LR0 _ Zero - LR0RUSUB LR1 _ Zero - LR1RUSUB LR2 _ Zero - LR2RSUB garb _ Zero+Zero; Leave Carry=0RET 3PPos:LILDB 100000b; BNDCK fp!q4] G?fp bq?$ `S* \< [/ WL US TA. RE"xOsxM!xLX#xJ xIxF)WEt)W+D)WBqsqs@qsqs?Aqsqsx=)W<8qsqs:qsqs8qsqsx7f )W6)W4)Wx3 qs10; .-Vqs)Wqs +qsqs*qsqs(`qsqs& )W%Xqs)W#x!Ix)WQ)W)Wx/)W' xm)W%x D(qsqsqsqsqsqsU)W  x 3$ A]Dragon DocumentEdward R. Fiala26 August 198377RET 3; B0 is significant, so A1 must be insignificant or overflow.BBig:LR1; LI1; BNDCK; DIS; Trap if A1 significantLR0; LR4; UMULALR0; LR5; UMULA; (B0..B1..B2) X A2JMP TMFIN8.11. Multi-precision DivisionDivision of an N+M-precision dividend by an N-precision divisor requires a complicated algorithmdiscussed in Knuth's Seminumerical Algorithms, section 4.3 Multi-precision Arithmetic.The primitive operation required is an unsigned divide in which the dividend is double-precision, whilethe divisor, quotient, and remainder are single-precision. 
For this operation, a No-divide trap occurs ifthe divisor is 0, and Overflow occurs if the quotient exceeds single-precision (i.e., if the low-order wordof the dividend is greater than the divisor); it may be convenient to treat the Overflow as a No-dividetrap also.The algorithm begins by making both dividend and divisor positive. After the unsigned quotient andremainder are determined, the quotient will be negated if the signs of the dividend and divisor weredifferent; the remainder, if the dividend was negative (I.e., the remainder always has the same sign as thedividend.).Then, normalize the divisor. Determine how far the divisor must be left-shifted such that its high-orderbit becomes 1; left-shift both the divisor and dividend by this amount, inserting a leading 0 word ontothe dividend so that it won't overflow.After normalizing the divisor, a trial division of the two leading words of dividend by the leading wordof divisor is carried out producing a q* that is accurate, 1 too small, or 2 too small. The error isnarrowed to 1 too small or accurate by a trial multiply-and-subtract with q* conditionally decrementedby 1. Then multiply q* by the entire divisor and subtract from the dividend. The rest of the algorithm Ileave to Knuth.The important sub-operations of the algorithm are as follows:a) N-precision negation, which was shown in the multiply example;b) N-precision left-shift;c) 2-precision by 1-precision unsigned divide;d) N-precision x 1-precision multiplication, which was shown in the multiply example;8.12. Block CompareThe N-precision block comparison operation can best be done starting with the high-order term andcontinuing comparisons down through low-order terms until a difference is encountered. This causes theopcode to finish as soon as it can, which might be substantially faster for large precision.At call:LR2..1..0/A0..A1..A2LR5..4..3/B0..B1..B2At return: fp!q4] G?fpbAsx_=x^)W\[])WY Up Qq U Opq) LX9. J ^ HU F#D E- AV ?+9 >&-> <\ 8P 7b 5U' 126 005 .ML ,h * 'F= #A "  ?. tU [p q] W T\x)sx  x g  x   A]( Dragon DocumentEdward R. Fiala26 August 198378LR0/0 if A .eq. B1 if A .gr. B-1 if A .ls. BTBLC:6Entry byte = 6 argumentsLR2; RJNEB [S]&pop,LR5,THNELR1; RJNEB [S]&pop,LR4,MHNELR0; RJNEB [S]&pop,LR3,LHNERADD LR0 _ Zero + Zero; LR0 _ 0RET 1THNE:LR2; RJGEB 0S&pop,LR5,NBGRNBLS:RADD LR0 _ Zero + MinusOne; LR0 _ -1RET 1MHNE:RXOR Push LR1 xor (100000b,,0)RXOR Push LR4 xor (100000b,,0)RJLB [S]&pop,[S-1]&pop,NBLS; Pop stack twiceNBGR:RADD LR0 _ Zero + One; LR0 _ 1RET 1LHNE:RXOR Push LR0 xor (100000b,,0)RXOR Push LR3 xor (100000b,,0)RJLB [S]&pop,[S-1]&pop,NBLS; Pop stack twiceRADD LR0 _ Zero + One; LR0 _ 1RET 1Block comparisons will not be confined to arithmetic applications. A string lookup procedure for asorted table might use a string-comparison operation like the one above, but some applications wouldrequire this comparison to be made as an N-precision cardinal, rather than an integer. Cardinalcomparison is like the example above with two RXOR's added to the THNZ arm.If block comparison, string comparison, etc. are deemed high-frequency applications, then the RJUGEBand RJULB unsigned conditional jumps should be added to the opcode set. With these additions, theRXORs in the above example disappear. fp!q4] G?fpxbAs ` _ x\w)W[YXUV)WUxRxQq)WPxMOKJ)WxI-)WGxE CBI)W@)W? <8q): :nL 8F 6K 3g7- 1;' /% /A8Dragon DocumentEdward R. Fiala26 August 1983799. 
Sub-word OperationsSince we want Dragon to compile and run existing "user program" sources without (non-automatic)source-level changes, it is necessary to look at the kinds of data descriptions acceptable to the 16-bit Mesamachines and see what needs to be done.9.1. Compatibility on the NetworkIn 16-bit Mesa, fields in a record are allowed to be 1 to 15 bits or N 16-bit words in size. The basicfield opcodes, RF and WF, accept the pointer to a 16-bit word-aligned record on [S] and a 16-bitdescriptor in the a and b operand bytes; a is an 8-bit word displacement into the record, b a 4-bit bitdisplacement and 4-bit size. Although nothing in the descriptor so constrains it, sub-word fields are notallowed to cross word boundaries.Fields that are either (2N+1)*16 bits in size, or which are N*32 bits in size and start at an odd halfword(with N greater than 0) will be referenced inefficiently on a 32-bit machine. On data records assembledfor its own use, the Dragon compiler can avoid the inconvenient fields by promoting each field to thenext word boundary or size; however, a record transmitted over a network must meet externally specifiedrequirements. In this case, either Dragon representations must conform or Dragon must convert betweenits own form and the external form when transmitting or receiving. Since the 10 mHz Ethernet packetformat mandates both 48-bit fields in packet headers and 32-bit fields that cross 32-bit word boundaries,the Dragon compiler will have to deal with these inconvenient fields in machine-dependent records.In CSL's RPC (Remote Procedure Call) protocol, a sender's procedure stub translates internal argumentrepresentations into a network format, building a packet (or sequence of packets); the receiver'sprocedure stub unpacks the packet into internal form and completes the remote procedure call. In otherwords, a layer of translation lies between internal form and network form, so the Dragon compiler canuse its own internal representations ordinarily, just so long as code can be produced to convert betweenits own and network representations in the procedure stubs.RPC standards and practice have not consciously provided for 32-bit or larger word sizes. Perhaps noprovision is needed, if translation poses a relatively small execution overhead. According to Birrell andTaft, the best transmission rate of useful data achieved between two Dorados using any existing Ethernetprotocol has been about 2 mHz (16 microseconds/32-bit word); packet transmission time alone on the 10mHz network is about 50 microseconds/packet + 3.2 microseconds/32-bit data word. Because RPC-styleapplications tend to use small packets, major improvements to this rather slow rate seem unlikely.Consequently, if Dragon's time to pack data on one end plus unpack it at the other is small compared to16 microseconds/word, and if Dragon's performance is comparable with Dorado's, then choosing networkformats efficient for Dragon is unimportant.At the sender's end, minimum translation for each argument is a BLT (block transfer) into a packetbuffer. At the receiver's end, translation could conceivably be null, if the packet representation isidentical to the internal representation. For comparison purposes, assume that translation to/from aninconvenient format would require a BYTBLT at each end. 
Consequently, the execution time differencebetween a convenient and an inconvenient network representation is no worse than the differencebetween a BYTBLT and a BLT at the sender's end, and between nothing and BYTBLT at the receiver'send.I think that in a large BYTBLT or BITBLT, data will be transferred at ~7 cycles/32-bit word, and a BLTat ~5 cycles/32-bit word; these times allow 2 cycles/word for M bus activity. The examples later show fp!q4] G?fp ar ^eqH \O Z' Vp" Sq16 QNL Osqsqsq0sq Mj K! H|T FM DU CM AR X ?T =Z ;J 8 Y 6&; 4!F 3 *; 1U` /; ,Y *NQ (H &b $O #$D !YB% S , RS =) O 8, (E ]*6  H Vf B]LDragon DocumentEdward R. Fiala26 August 198380how the inner loops for these work. Then (7+2) cycles/word is the difference in work betweenconvenient and inconvenient formats. This represents 28% of packet transmission time, and only 5.6% ofthe best rate achieved with Dorados so far.Consequently, network representations don't seem to be crucial, but we should consider what non-binding guidelines result in representations efficient on both 16-bit and 32-bit machines. Rigid guidelinesare impractical, if for no other reason, because 10 mHz Ethernet packet format violates the guidelines wewould pick. The following guidelines seem desirable:1) Discourage fields larger than 16 bits except for those which are 2n * 16 bits. This allows 16, 32, 64,128, 256-bit, etc. fields which can be easily accessed on a machine with a 2n-bit word size because eachof these sizes is either less than one word or M words.2) Assume that a network packet is always received with its first data field beginning at bit 0 of a word.Do not allow fields less than or equal to 16 bits to cross 16-bit memory boundaries; 2n-bit fields largerthan 16 bits must start on (at least) 16-bit boundaries and should ordinarily start on a 2n-bit boundary.This means that a 16-bit or smaller field will never cross a word boundary; a 2n-bit field will not cross aword boundary on a machine with 2n-bit or larger words.3) Packed array headers (giving array size in elements) should be changed from 16-bit cardinals to 32-bit(always positive) integers. The first data byte in a packed array must then begin on a 32-bit boundary.The 32-bit header is more general (because it allows arrays with more than 64k elements) and faster toaccess on a 32-bit machine; the 16-bit cardinal header is more efficient on a 16-bit machine and is thecurrent convention. We should switch to the 32-bit integer header because 64k elements is too small alimit. The reasons for preferring 32-bit (positive) integers to 32-bit cardinals were discussed in theArithmetic chapter.The compiler needs a new construct to support these guidelines; in addition to "packed", "unpacked",and "machine dependent", a "network packed" designation should be added. A "network packed"record must obey established guidelines for network records; it is essentially a "machine dependent"record verified by the compiler to contain no guideline violations. Any record sent over a networkshould be either "network packed" or "machine dependent"; "network packed" should be preferredexcept where backward compatibility or some other unfortunate constraint is involved, such as the 10mHz Ethernet packet format.In "machine dependent" records and arrays, the Dragon compiler must allow anything acceptable to 16-bit Mesa, as well as any new constructs introduced for Dragon. In particular, all N*16-bit fields andboundaries must be supported.9.2. 
Dragon FieldsA controversy now resolved was whether to have pointers address 16-bit or 32-bit boundaries. With 16-bit memory addressing, Dragon need not support fields that cross from one half-word into another, anduse of such fields would be discouraged because our 16-bit machines do not support them (and cannotbe modified to do so efficiently). However, with 32-bit boundary addressing, Dragon must support fieldsin either half of a 32-bit word, and it makes sense to also allow fields which cross from the left-half tothe right-half and 17 to 31 bit fields. Dragon should provide at least this much generality so that the fullpower of the 32-bit machine can be made available.With 32-bit boundary addressing, the sub-word displacement in a field descriptor must be enlarged to 5 fp!q4] G?fp bq#: `S%B ^+ [V YLI# WV U5 REERtREq" PzLQtPzq N7 K>K Is,*JtIsq GSH5tGq EOFktEq D!DtDq @b >^ = M ;A&A 9w33 7 \ 5 2pP 0B .d -c +E^ ){13 ' $>I "sJ  cp q)= '"C \'< Y J h 22 dP yB]YDragon DocumentEdward R. Fiala26 August 198381bits, and with 17 to 31-bit sizes the size field must also be enlarged to 5 bits. Changing field descriptorformat will require editing wherever field descriptors are manually constructed by left-shifting--the size ofthe shift must be made 1 larger; I doubt that these places can be found automatically, but there are notmany of them. Ed Satterthwaite told me that he knew of precisely one module which would becompiled for Dragon that depended upon the format of the existing 16-bit descriptor.The most general field would cross word boundaries (e.g., a two-bit field could consist of bit 31 in oneword and bit 0 in the next). Nothing extra is needed in the field descriptor to support such wordcrossings, but if such crossings are allowed, field opcodes must check for the crossing and do more workwhen it occurs. For Dragon, we propose to allow only fields confined to a single 32-bit word (plusN*16-bit fields that cross word boundaries for backward compatibility, but the compiler will producespecial, more verbose code for this case). If a field descriptor anomalously identifies a field that crosses aword boundary, Dragon will do something undefined--it will not detect this error.After coalescing long, double, and MDS opcodes in the Trinity release of 16-bit Mesa, the single-field,cycle, and shift opcodes given below remain; these do not include the multi-field, computed field, andpacked array opcodes which are discussed in the next section.RF, WF, PSFRead Field, Write Field, Put Swapped FieldR0F, W0F, PS0F, WS0FRead, Write, Put Swapped, and Write Swapped 0 FieldRLIPFRead Local Indirect Pair FieldSHIFT, SHSBShift, Shift Signed ByteROTATE, ROTSBRotate, Rotate Signed ByteDSHIFT, DSHSBDouble Shift, Double Shift Signed ByteRF and WF accept the pointer to a record in [S]; WF has the right-justified new value for the field in[S-1]; a is a word displacement into the record and b is a sub-word field descriptor. WSF (WriteSwapped Field) would be a variation in which the stack arguments for WF are reversed; this opcode didnot survive opcode set optimizations. PSF is a variation in which stack arguments for WF are reversedand the pointer is left on the stack at completion.R0F, W0F, PS0F, and WS0F are optimizations of RF, WF, PSF, and WSF, respectively, where the worddisplacement is 0, saving one operand byte. 
However, Dragon needs 10 bits of sub-word field descriptor,so R0F, W0F, PS0F, and WS0F opcodes cannot be provided unless each consumes four primaryopcodes, which would be too costly.On Dragon, the read/write operations R0, W0, PS0, and WS0 make more sense. These one-byte special cases ofRB, WB, etc. are in the list of possible future additions to Dragon's opcode set.RLIPF is a variation on RF in which a[0..3] selects a Local and a[4..7] is a word displacement into therecord pointed at by the local--RLIPF saves one code byte over LLn followed by RF on the 16-bitmachines. The analog of RLIPF on Dragon would use 4 bits to specify the local register, 2 bits to selectthe word in the record pointed at by the register, and 10 bits for sub-word field description; this 3-byteopcode would save 2 bytes over LRIn followed by EF, so it could be considered as a Dragon opcode butdoes not fit well with other proposed Dragon opcodes.Dragon's opcode set replaces all of the field opcodes in the above list by the following; the format of thefield descriptors and other details are given in Figure 5:DSHDouble SHift. 3-byte, 1 cycle shifter opcode in which the 64-bit shifter input is [S]..[S-1]. [S-1] is replaced bythe result of the operation and the stack is popped once. ab determines the shift function, which may be aleft or right shift, cycle, or field extraction; the mask and shift specification can come from either ab or the fp!q4] G?fp bq=/ `SC* ^35 \< ZT WW U8* SW R"X PW\ NQ LQ IPW G51 E=xBI R&x@~R.x>Rx< Rx; Rx9T R 5G 4sq,sq, 2LD! 0*< .3 +EC ){<, ' L %#y#$t$Gy!Q tq$sqsq& >! i !I JY 5  N C:x th  ;st. V9.st A]fDragon DocumentEdward R. Fiala26 August 198382Q register.SHSHift. 3-byte, 1 cycle shifter opcode in which the 64-bit input to the shifter is [S]..[S]; ab are formattedexactly like DSH. This opcode optimizes DUP, DSH.QSHQ SHift. 2-byte 1 cycle shifter opcode in which the 64-bit shifter input is [S]..[S]. a supplies the shiftoperation. This opcode is an optimization of SH that is one code byte smaller when the shifter control is fromQ.IFInsert Field. 3-byte, 2 cycle shifter opcode in which the right-justified quantity in [S-1] is inserted into theword at [S]; the result is left in [S-1] and the stack is then popped once. The mask and shift specification cancome from either ab or from Q, in the same format as SH.To duplicate the RF function on Dragon, the compiler would produce a RB opcode (2 bytes) followedby a SH opcode (3 bytes), so the code sequence is 5 bytes on Dragon compared with 3 bytes on 16-bitMesa machines. The Dragon sequence is maximally fast and is more flexible because RB could bereplaced by any other opcode which pushes a word on [S]; hence, the only weakness is the greater lengthof Dragon's code sequence. SH and IF may be too verbose for field extraction and insertion, which are very common. However,since 10d bits of sub-word field descriptor are needed to describe a 32-bit field, field extraction andinsertion must either be 3 bytes long or supplement a with two opcode bits.9.3. Packed ArraysThe 16-bit Mesa compiler precomputes all field descriptors, which appear as constants in the code, exceptwhen it must access an array. In accessing a packed array, the code begins with a pointer P to the baseof the array and an index N for the desired element. 
It then computes a field descriptor for the Nth field; the general computation multiplies by the number of bits/field and divides by the number of useful bits/word; the quotient of the divide is the word displacement, and the remainder is the bit displacement.

The Mesa compiler can choose to use a larger field size than necessary, and it takes advantage of this freedom to promote field sizes to the nearest larger power of two, so sizes of 1, 2, 4, 8, and 16 bits are used, but not intermediate sizes. Only on machine-dependent arrays is the user-specified size taken literally, and the compiler then disallows sizes other than 1, 2, 4, 8, or 16 bits. The virtue of 2^n-bit sizes is that both the number of bits/field and the number of useful bits/word become powers of two, so the multiply becomes a left-shift and the quotient and remainder of the divide are each computed by a shift.

If the general packed array were supported by the compiler, it could save as much as 40% of the storage used by the next larger power-of-two size; for example, if 5-bit fields were needed, the general packed array would have 6.4 fields/word compared to 4/word for an 8-bit field, the next larger power of two. However, multiply is slow, extracting a field crossing a word boundary is slow, and inserting a field across a word boundary is very slow and not well supported by Dragon's hardware.

The problem is also bad when word-boundary crossings are disallowed, because the descriptor computation then requires both a multiply and a divide by small cardinals. Avoiding division seems just as important as avoiding word-boundary crossings.

These problems seem enough to avoid non-power-of-two sizes ordinarily. Consequently, packed arrays with non-2^n-bit sizes should continue to be avoided. In addition, there seems no urgent reason to allow other element sizes in machine-dependent arrays because no other machine is presently producing such arrays.

; Code to read packed-array element
ReadPA:
        DUP                     ; Duplicate N
        SH                      ; Left-shift N by 5 to position the field for the store into Q, plus m to make the mask control value
        ADDB 2^m                ; Add the field size into [S], making the field descriptor
        EUSF const              ; Pop the stack into the Q register
        SH                      ; Right-shift N to get the word displacement
        ADD                     ; Add the word displacement to P
        RB 0                    ; Read the word containing the field
        EF                      ; Extract the field
; Total code = 17 bytes; total time = ~9 cycles.

; Code to write packed-array element by the value in LRx. The code through ADD is identical to ReadPA.
WritePA:
        DUP
        SH
        ADDB 2^m
        EUSF const
        SH
        ADD
        RSB 0                   ; Push the word containing the field
        LRx                     ; Push the value to write into the field
        IF                      ; Insert the new value
        WSB 0                   ; Rewrite the word containing the field
; Total code = 20 bytes; total time = 12 cycles.

The SPA (Setup Packed Array) opcode executes the code common to the above subroutines, beginning with DUP and ending with SH, in 1 cycle. The a operand of SPA holds m; it loads Q with the shifter control for f(a) while simultaneously replacing [S] by [S] right-shifted 5-m. Using this opcode as a primitive, the packed-array reference sequences become the following:

; [S]/N        Nth element of packed array
; [S-1]/P      Pointer to origin of packed array
; 2^m = field size

; Code to read packed-array element. Replace [S] & [S-1] by the extracted field
ReadPA:
        SPA m
        RFX [S-1]_([S-1]+[S])^ & S_S-1  ; Read the word containing the field
        SH                      ; Extract the field
; Total code = 8 bytes; total time = 4 cycles.

; Code to write packed-array element by the value in LRx
WritePA:
        SPA m                   ; Setup
        ADD
        LRx                     ; Push the value to write into the field
        RFX Push ([S-1]+0)^     ; Push the word containing the field to be changed
        IF                      ; Insert the new value
        WSB 0                   ; Rewrite the word containing the field
; Total code = 12 bytes; total time = 7 cycles.
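For reference, the same index arithmetic can be written out in ordinary C. This is only a sketch of the shift-and-mask computation described above, not Dragon code: the helper names are invented, elements are assumed not to cross word boundaries (m <= 5), and it uses conventional right-to-left bit numbering within a word rather than the left-to-right numbering used elsewhere in this document.

        #include <stdint.h>

        /* Read the Nth 2^m-bit element of a packed array of 32-bit words. */
        static uint32_t ReadPackedC(const uint32_t *base, uint32_t n, unsigned m)
        {
            unsigned bits = 1u << m;                 /* element size in bits */
            uint32_t word = n >> (5 - m);            /* word displacement    */
            unsigned bit  = (n << m) & 31u;          /* bit displacement     */
            uint32_t mask = (bits == 32) ? ~0u : ((1u << bits) - 1u);
            return (base[word] >> bit) & mask;       /* extract the field    */
        }

        /* Write the Nth 2^m-bit element of a packed array of 32-bit words. */
        static void WritePackedC(uint32_t *base, uint32_t n, unsigned m, uint32_t value)
        {
            unsigned bits = 1u << m;
            uint32_t word = n >> (5 - m);
            unsigned bit  = (n << m) & 31u;
            uint32_t mask = ((bits == 32) ? ~0u : ((1u << bits) - 1u)) << bit;
            base[word] = (base[word] & ~mask) | ((value << bit) & mask);   /* insert the field */
        }

Because the element size is a power of two, the multiply and divide of the general computation reduce to the shifts and mask shown here, which is exactly the property the Dragon sequences above exploit.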
9.4. Multi-field Operations

In addition to single-field and packed-array opcodes, 16-bit Mesa has a variety of multi-field transfer and compare opcodes. After coalescing long, double, and MDS opcodes presently in Trinity Mesa, the following multi-field opcodes remain:

        BITBLT                  Bit Block Transfer              *8-bit specialized opcodes
        BYTBLT, BYTBLTR         Byte Block Transfer, Byte Block Transfer Reversed              *16-bit specialized opcodes
        BLT, BLTC, BLTR         Block Transfer
        BLE, BLEC               Block Equal
        BLZ, LOCALBLZ           Block Zero
        CKSUM                   Pup checksum

Analogous operations on Dragon will be either trap opcodes or ordinary procedures. In either case, execution time of the procedure implementing the operation is important but not code size. Also, arguments will appear automatically in LR0 to LRn at the beginning of the procedure, so register-to-register operations will be preferable to stack operations in referencing arguments, and an abundance of LRs will be available as temporaries.

For Dragon, full-word block transfer and compare opcodes will be needed in addition to 16-bit forms, which become sub-word operations. A generalization of the full-word Block Equal function into a Block Compare function was discussed in the Arithmetic chapter.

The first example below shows how a forward-direction, full-word block transfer function might be implemented on Dragon. Dragon should use integer counts for operations such as this because its opcodes favor integers. This example quadruplicates the inner loop to achieve the hardware limit of 3.0 cycles/word.

; LRs/Pointer to first source word
; LRd/Pointer to first destination word
; LRc/Negative word count; LRc is [S] at entry.
; Timing (for 4 or more words) = (14 to 20) + 12*(C/4) assuming all cache hits.
BLT:    3                       ; Entry byte = 3 args
        RSUB LRd_LRd-1          ; Initialization
        RSUB LRs_LRs-LRd
        ADDB 3
        RJGEB LRc,0,LessThan4

; Inner loop:
; LRs/Pointer to first source word - pointer to first dest word + 1
; LRd/Pointer to first destination word - 1
; LRc/Negative word count + 3; LRc is [S] here.
Loop4:  RFX Push (LRs+LRd)^     ; Push 1st source word
        RSI Pop into (LRd_LRd+1)^       ; Pop into the destination & increment LRd
        ADDB 4                  ; Increment the negative word count by 4
        RFX Push (LRs+LRd)^     ; Push 2nd source word
        RSI Pop into (LRd_LRd+1)^
        RJGB [S],1,TwoOrThree   ; Predicted false conditional jump
        RFX Push (LRs+LRd)^     ; Push 3rd source word
        RSI Pop into (LRd_LRd+1)^
        RFX Push (LRs+LRd)^     ; Push 4th source word
        RSI Pop into (LRd_LRd+1)^
        RJLBJ [S],0,Loop4       ; Loop if word count is still .ge. 4
        ADDB 2                  ; 2 = 3 words left (11 cycles); 3 = 2 words left (13 cycles)
TwoOrThree:
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
        RJNEB [S],2,Exit
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
Exit:   RET 0

; 3 = 0 words left; 2 = 1 word left; 1 = 2 words left; 0 = 3 words left
LessThan4:
        RJGB [S],1,LessThan2
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
        ADDB 2
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
LessThan2:
        RJNEB [S],2,Exit1
        RFX Push (LRs+LRd)^
        RSI Pop into (LRd_LRd+1)^
Exit1:  RET 0

A four-word inner loop is necessary to reduce the IFU time to 2.75 cycles/word (3 cycles for the correctly predicted conditional jump plus 1 cycle for each of the 8 words required to hold the 32d-byte loop, all divided by 4). A double-word inner loop would still achieve 3.0 EU cycles/word, but the IFU time would be 4.0 cycles/word, and code cannot run faster than the maximum of the IFU and EU times.

The substantial fixed cost of 14 to 20 cycles in the BLT procedure is due to conditional jumps. The cost remains substantial even after replicating the loop to eliminate this overhead as much as possible.
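The same unrolling idea can be written in ordinary C for illustration (invented names; no attempt is made to model Dragon's cache or IFU behavior): the count update and conditional jump are paid once per four words, with a short tail for the leftover words.

        #include <stddef.h>
        #include <stdint.h>

        /* Forward-direction, full-word block transfer with a four-way
           unrolled inner loop; the remaining 0 to 3 words are copied after. */
        static void BltForwardC(uint32_t *dst, const uint32_t *src, size_t nwords)
        {
            size_t i = 0;
            for (; i + 4 <= nwords; i += 4) {     /* unrolled inner loop */
                dst[i]     = src[i];
                dst[i + 1] = src[i + 1];
                dst[i + 2] = src[i + 2];
                dst[i + 3] = src[i + 3];
            }
            for (; i < nwords; i++)               /* last 0 to 3 words   */
                dst[i] = src[i];
        }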
BLTC (Block Transfer Code) and BLEC (Block Equal Code) do not have plausible analogs on Dragon. Each of these opcodes must be preceded by a load of a 16-bit value onto the evaluation stack which is then treated as a CODE-relative pointer. But on Dragon, we have 32-bit registers, which can point anywhere, and PC-relative load and load-immediate opcodes that allow efficient construction of pointers into the code segment. For these reasons, ordinary BLT and BLE opcodes suffice for transfers from the code segment.

Nor does LOCALBLZ (Local Block Zero) have a plausible analog on Dragon, because a block of words in local storage is not distinguished from a block of words in any other place. A BLZ function on Dragon should probably be generalized to a SETBLK function, allowing a block to be initialized from any one-word constant. This function would again require a four-word inner loop for optimum performance. Procedures can also be provided to efficiently replicate larger blocks of constants, if that is desired.

BLT and BLTR (16-bit special case), BYTBLT and BYTBLTR (8-bit special case), and the single-scanline inner loop of BitBlt have similar implementation requirements. Each of these has code for the left edge of the block, an inner loop, and code for the right edge of the block; it is also possible for the left and right edges to occur within the same word, a special case.

Code for the left and right edges of a scanline is perhaps most critical for overall performance because the average scanline is not large enough for the inner loop to dominate. For example, the LF monitors are only 32 words wide, which establishes a maximum BITBLT width; and the vastly more common BITBLT is a font character less than one word wide. Nevertheless, the example below shows only the inner loop. A forward-direction transfer inner loop can be implemented by the following Dragon sequence; as in BLT, the inner loop must be replicated 4 times to amortize the conditional jump and the word count modification:

; LRsa/Pointer to source - pointer to destination
; LRsb/Pointer to source - pointer to destination + 1
; LRsc/Pointer to source - pointer to destination + 2
; LRsd/Pointer to source - pointer to destination + 3
; LRse/Pointer to source - pointer to destination + 4
; LRd/Pointer to destination - 1
; LRc/Word count - 3 (LRc = [S] at entry)
; Q/precomputed cycle count for shifter control
Loop:   RFX Push (LRsd+LRd)^    ; Push source word 3.
        RFX Push (LRse+LRd)^    ; Push sw 4.
        RFX Push (LRsc+LRd)^    ; Push sw 2.
        RFX Push (LRsd+LRd)^    ; Push sw 3.
        RFX Push (LRsb+LRd)^    ; Push sw 1.
        RFX Push (LRsc+LRd)^    ; Push sw 2.
        RFX Push (LRsa+LRd)^    ; Push sw 0.
        RFX Push (LRsb+LRd)^    ; Push sw 1.
        RSUB LRc_LRc-4          ; Word count _ word count - 4
        DSH (left-cycle Q)      ; [S-1] _ Left-cycle[[S],[S-1]]; S _ S-1.
        RSI Pop into (LRd_LRd+1)        ; Pop [S] into (LRd+1)^ while incrementing d
        DSH (left-cycle Q)      ; [S-1] _ Left-cycle[[S],[S-1]]; S _ S-1.
        RSI Pop into (LRd_LRd+1)        ; Pop [S] into (LRd+1)^ while incrementing d
        DSH (left-cycle Q)      ; [S-1] _ Left-cycle[[S],[S-1]]; S _ S-1.
        RSI Pop into (LRd_LRd+1)        ; Pop [S] into (LRd+1)^ while incrementing d
        DSH (left-cycle Q)      ; [S-1] _ Left-cycle[[S],[S-1]]; S _ S-1.
        RSI Pop into (LRd_LRd+1)        ; Pop [S] into (LRd+1)^ while incrementing d
Begin:  RJGB [S],0,Loop         ; Jump if not done yet
; Code for last 0 to 3 words

The EU time for the above loop is 4.5 cycles/word; the IFU time is 4.25 cycles/word (3 cycles for the correctly predicted conditional jump plus 14 cycles for the 5 words containing the 54 code bytes). Each word requires 2 fetches, 1 store, and 1 cycle, and there are 2 cycles for overall loop control. Additional replications would make this loop approach 4 cycles/word.

An improvement would be a register-to-register opcode to replace DSH. Such an opcode would allow the loop to use only 1 fetch per word. If this RCYQ opcode were available, 4 of the RFX's in the above example would disappear, and the time would be 3.5 cycles/word.

9.5. Pup Checksums

The source and final destination of a 10 mHz Ethernet transmission generate and check, respectively, a 16-bit checksum computed by initializing a CSUM register to 0 and doing a ones-complement add followed by a left-cycle 1 for each 16-bit argument in the packet.
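As a cross-check on the rule just described, here is a plain C sketch of the checksum computation (the function name is invented, and nothing else about the Pup protocol is implied):

        #include <stddef.h>
        #include <stdint.h>

        /* For each 16-bit word: ones-complement add into the checksum,
           then left-cycle the checksum by 1. */
        static uint16_t PupChecksumC(const uint16_t *words, size_t count)
        {
            uint32_t csum = 0;
            for (size_t i = 0; i < count; i++) {
                csum += words[i];
                csum = (csum & 0xFFFF) + (csum >> 16);          /* end-around carry */
                csum = ((csum << 1) | (csum >> 15)) & 0xFFFF;   /* left-cycle by 1  */
            }
            return (uint16_t)csum;
        }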
If no special hardware provisions are made for checksums, the following code could be used for this operation:

; LRp/Pointer to block P
; LRc/Positive integer count of 32-bit words to checksum
; [S]/Checksum, initially 0
; Carry/initially 0
Loop:   SH (left-cycle 1)       ; Left-cycle current checksum by 1
        LRIp 0                  ; Fetch word containing next two items
        RADD p _ p + 1          ; Increment P
        RUADD [S-1]_[S]+[S-1]+Carry     ; Add to checksum at [S-1] saving Carry, S unchanged
        SH (extract left half-word)     ; Extract the left half of the new word
        RUADD [S-1]_[S]+[S-1]+Carry & pop       ; Add to the checksum with carry-in _ Carry, saving the new Carry, decrementing S by 1.
        RUADD [S]_[S]+0+Carry   ; Wrap around the carry-out; Carry _ 0 for next step.
Begin:  RSUB c _ c - 1
        LRc
        RJGEB [S]&pop,Zero,Loop ; Jump if negative

The EU time for this code is ~10 cycles/32-bit word, the IFU time ~10 cycles/32-bit word (3 cycles for the correctly predicted conditional jump plus 1 cycle each for the 7 code words required to hold 27d code bytes). The above loop takes advantage of the interchangeability of ones-complement addition and cycling and of the fact that a+a = a left-cycle 1. It also takes advantage of the fact that the left half and right half of a 32-bit word are equivalent with respect to computing the checksum.

9.6. Integers vs. Cardinals in Fields

Sub-word fields which hold numbers are always cardinals in Mesa, subject to an integer offset. Avoiding integers in sub-word fields is computationally faster because it is easier to add the constant offset than to test the sign of the field and conditionally extend it.

When the operation is something like incrementing the field, according to Satterthwaite, the compiler can cleverly avoid applying the constant offset altogether. We would like to have this same convention for sub-word fields on Dragon as well. However, 16-bit integers may continue to be a problem; if 16-bit integers are used widely on Dragon, special hardware or opcode support may be needed for sign extension and overflow testing.

We will assume that, except in machine-dependent or network records, Dragon will avoid sub-word fields that contain integer values. However, network types will probably include the 16-bit integer, so Dragon will have to deal with the 16-bit integer in machine-dependent and network records.
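The cost difference can be seen in a small C sketch (invented helpers): the biased-cardinal representation needs only an add after the field has been extracted, while a true n-bit integer field needs a sign-propagation step as well.

        #include <stdint.h>

        /* Field stored as a cardinal plus a constant offset. */
        static int32_t DecodeBiased(uint32_t field, int32_t offset)
        {
            return (int32_t)field + offset;            /* one add after extraction */
        }

        /* Field stored as an n-bit two's-complement integer (1 <= n <= 31). */
        static int32_t DecodeSigned(uint32_t field, unsigned n)
        {
            uint32_t sign = 1u << (n - 1);
            return (int32_t)((field ^ sign) - sign);   /* propagate the sign bit */
        }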
10. Context Switching

The original "Fast Procedure Call Implementation" memo by Chuck Thacker and Butler Lampson's "Fast Procedure Calls" paper in the 1982 ASPLOS discuss early ideas; a successor memo "Dragon Xfer" by Phil Petit (dated 2 September 1982) supersedes Thacker's and proposes the way in which control transfers will take place and the environment in which Mesa opcodes will be executed. Lampson's CSL notebook entry [Ivy]DragonXfer.bx (dated 9 December 1982) discusses some additional proposals. I think you should read this section first, then read Petit's "Dragon Xfer".

The general plan is that, unlike 16-bit Mesa, context switching will ordinarily be a trivial operation. The basic opcodes are Direct Function Call (DFC) and Return (RET). Their execution time is zero, if the IFU is far enough ahead of the EU. Later, execution of these opcodes is denoted by "4-A", meaning that the execution time is 0 if the IFU is at least 4 cycles ahead of the EU and doesn't miss in the cache.

The first byte of a procedure, called the "entry byte" (EB), is always byte 0 of a word; bytes 1, 2, and 3 are then the first code bytes executed. The EB is interpreted as follows:

        1 bit   ArgType         0: Regular
                                1: All (New RL = old RL)
        1 bit   unused
        1 bit   RetTrap         0: No action
                                1: Give the RetTrap trap when this procedure returns
        5 bits  ArgCount        No. arguments expected - 1.

Restricting the EB to byte 0 of a word saves 2 bits in both the DFC and DJUMP opcodes discussed below, while simultaneously maximizing the number of useful bytes in the IFU's first reference in the new context. Effectively it converts DFC from a 5-byte to a 4-byte opcode and makes procedure calls slightly faster at the expense of wasting an average of 2 bytes per procedure due to truncation at the end.

The a operand of the RET opcode is interpreted similarly, except that the ResCount field (analogous to ArgCount) is 7 bits wide, and there is no RetTrap bit:

        1 bit   ResType         0: Regular
                                1: All (New S = old S)
        7 bits  ResCount        No. results returned - 1.

In the normal case, a caller pushes N arguments onto the evaluation stack and then executes a DFC. The EB has ArgType=Regular, and the IFU creates a new frame in which RL = S-(N-1) (S remains unchanged, pointing at the last argument). The destination finds its N arguments in LR0 to LR(N-1). The IFU remembers the information needed to return to the previous context.

The new context can then reference any of the 16d LRs. However, it must advance S above any LRs which it uses before referencing them, so that stack overflow can be detected properly. In addition to the LRs, [S-1], [S], and [S+1], 16d ARs, about 12d constants, and the Q and ICAND registers can also be referenced.

After completing its work, the destination stores its M results in LR0 to LR(M-1) and executes a RET M, which reloads S with RL+M-1 and returns to the caller. Finally, the caller resumes execution finding its stack intact with the N arguments replaced by the M results, and with S pointing at the Mth result.

In other words, Dragon locals will normally be registers rather than words in storage; the evaluation stack and any other registers needed, such as pointers to the global frame or additional local storage, will also be in LRs. Paired with a frame's EU registers is its IFU state, consisting of the return PC and the register number of local 0.

Note the 16-bit Mesa concepts that are absent from the simple Dragon model:

1) There is no CODE base; PC-relative addressing is used instead. A few context-changing opcodes use absolute addressing.

2) There are no distinctions among LOCAL or GLOBAL frame pointers and other pointers in the basic model. There are no special opcodes associated with the local or global frames; instead, general opcodes for referencing blocks of storage using an LR or AR as the base pointer are used.

3) There is no global frame table and no entry vector in the basic model.

4) There is no mention of the word "xfer". This concept has not seemed useful in explaining how context changing occurs, so I have avoided it except in comparisons to 16-bit Mesa.

Most context switches can be handled with this simple model.
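To make the normal-case bookkeeping concrete, here is a toy C model of the RL and S arithmetic just described (names are invented, and the IFU frame ring is reduced to a single saved value):

        #include <stdint.h>

        enum { RING = 128 };                /* EU stack registers form a ring  */

        typedef struct {
            uint32_t reg[RING];
            unsigned RL;                    /* register number of LR0          */
            unsigned S;                     /* register number of top of stack */
        } Machine;

        /* Caller has pushed nargs arguments; DFC with ArgType=Regular.
           Returns the old RL, which the IFU would keep in its frame ring. */
        static unsigned CallRebase(Machine *m, unsigned nargs)
        {
            unsigned oldRL = m->RL;
            m->RL = (m->S + 1u - nargs) & (RING - 1);   /* RL = S - (N-1); S unchanged */
            return oldRL;
        }

        /* Destination left nresults results in LR0..LR(nresults-1); RET. */
        static void ReturnRebase(Machine *m, unsigned nresults, unsigned savedRL)
        {
            m->S  = (m->RL + nresults - 1u) & (RING - 1);  /* S = RL + ResCount        */
            m->RL = savedRL;                               /* caller's frame reappears */
        }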
However, the following exceptions must all be dealt with successfully, if this scheme is to work:

        Global frame references.
        Too few or too many arguments.
        Pointers to local variables.
        More than 15d arguments.
        Procedure variables.
        More than 16d local variables.
        Interface function calls.
        Inline procedures.
        Nested procedures.
        Return trap.
        Non-hierarchical context switching.
        Mesa signals.
        Lisp free variable searches.
        Frame overflow trap.
        Stack overflow trap.
        Frame underflow trap.
        SFC trap.
        Reschedule trap.
        Page and write protect faults.
        Coprocessor opcodes.
        Extended opcode traps.
        Traps for PC sampling, etc. (16-bit Mesa XferTrap).
        Traps for program errors (16-bit Mesa CodeTrap, UnboundTrap, and ControlTrap).
        Finding the procedure associated with a frame.
        Retained frames.
        Local storage tracing for the garbage collector.

Here is a summary of opcodes proposed for context changing:

DFC     Direct Function Call. A four-byte opcode in which three operand bytes and two bits from the opcode itself specify a 26d-bit word address. The FrameOverflow trap occurs if the IFU frame ring has fewer than three free slots; the FrameOverflow trap occurs in the caller's context and returns to restart the DFC opcode. Otherwise, the EB is interpreted by the IFU. If ArgType=All, then new RL = old RL; otherwise, RL is loaded with S-ArgCount (where ArgCount can legally be in the range -1 to 15d). If no trap occurs, the caller's return PC and RL are pushed onto the IFU's frame ring, and control is transferred to byte 1 in the target word.

LFC     Local Function Call. A three-byte opcode in which ab is a 16-bit signed byte displacement from the first byte of the LFC opcode to the procedure's EB. LFC is one byte shorter than DFC, and it requires no relocation during loading because its relative displacement doesn't change when a code segment is relocated in VM. Apart from the different way the location of the EB is determined, LFC is identical to DFC.

Since a procedure must start on byte 0 of a word, a superior encoding would define ab as the word displacement from the LFC to the entry word, but Petit and Thacker propose a byte rather than a word displacement to share logic with jumps, which must be able to reach any byte. As a result, code segments are limited to about 32k bytes.

RET     Return. Two-byte opcode which returns to the previous context; a has the format given earlier. First, give the FrameUnderflow trap if no previous frame is in the registers; this trap returns to repeat the RET. Then, if ResType=All, do not modify S; otherwise, S _ (RL+ResCount) mod 128d (pointing S at the last result). Next, if the RetTrap is turned on, jump to the RetTrap location and turn off the trap; otherwise, pop RL, PC, and RetTrap from the preceding IFU ring slot and resume in the calling context.

If the RetTrap occurs, it returns with ResType=All, which gives control to the caller of the procedure which experienced the RetTrap. Since S was already fixed up at the onset of the RetTrap, this makes the RetTrap look like a procedure call sandwiched in between the RET and the place returned to.

DJ      Direct Jump. A four-byte opcode in which three operand bytes and two bits from the opcode itself specify the 32-bit word address of a destination PC inside the first 2^28 bytes of VM. The 0th byte of the target word is skipped, and control is transferred to byte 1; a new context is NOT built. This opcode is intended for long-range jumps to the byte after an EB.

SFC     Stack Function Call. One-byte opcode which does a function call to the destination specified by a control link at [S]; [S] bit 0 indicates the kind of control link as follows:

        0                       Indirect ([S][1..31d] points at a control link)
        1, [S][1..3] .eq. 0     Direct. [S][4..31d] points at a procedure EB (i.e., it is a byte address, not a word address; [S][30..31d] should = 0)
        1, [S][1..3] .ne. 0     Trap (SFCTrap).

A direct link is popped off of the stack, and a function call is made to the EB. An indirect link is left on the stack, and the word it points at is fetched; that word must be a direct link--i.e., only one level of indirection is permitted. The StackLink trap returns to repeat the SFC.

Note that an original indirect link results in an extra argument being left on the stack. SFC requires 1 EU cycle and, after the EU cycle, 4 IFU cycles, so it takes 5 cycles. An indirect link adds 2 more cycles for the extra fetch.

SJ      Stack Jump. SJ is a one-byte opcode which treats [S] as a control link. It is direct, indirect, or trap, just like SFC, but it does not change contexts. SJ takes 1 EU cycle, then 4 IFU cycles plus 1 more IFU cycle if the first opcode after the jump crosses a word boundary (= 5 or 6 cycles overall).

DFC, LFC, RET, and DJ are type "jump" opcodes--they use no EU cycles and have zero execution time if the IFU is "caught up". However, they introduce a gap of at least 3 cycles in the IFU pipeline.
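Referring back to the SFC/SJ control-link encoding above, the decode might be sketched in C as follows (illustrative only; Dragon numbers bits from the most significant end, so "bit 0" below is the top bit of the 32-bit word, and the names are invented):

        #include <stdint.h>

        typedef enum { LinkIndirect, LinkDirect, LinkTrap } LinkKind;

        static LinkKind ClassifyLink(uint32_t link)
        {
            if ((link & 0x80000000u) == 0)     /* bit 0 = 0: indirect link          */
                return LinkIndirect;           /* bits [1..31] point at a link      */
            if ((link & 0x70000000u) == 0)     /* bits [1..3] = 0: direct link      */
                return LinkDirect;             /* bits [4..31] = byte address of EB */
            return LinkTrap;                   /* bits [1..3] != 0: SFCTrap         */
        }

        static uint32_t DirectLinkTarget(uint32_t link)
        {
            return link & 0x0FFFFFFFu;         /* EB byte address; low 2 bits = 0   */
        }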
10.1. Global Frame References

Unlike 16-bit Mesa, Dragon does not include any specification of a global context in its "procedure descriptor"; only the PC is specified. However, procedures must still be able to reference the global frame. The following cases must be distinguished:

1) There is at most one global context associated with the module;
        a) Procedure does not reference G;
        b) Procedure references G.
2) There may be more than one global context;
        a) Procedure does not reference G and no locally-called context references G;
        b) Procedure makes local calls;
        c) Procedure references G.

In 1a and 2a, no load of G need be made by the procedure. In 1b, there is a single entry point for the procedure, and G is loaded at that entry point. In 2b and 2c, there must be a different entry point for the procedure in each global context, so that the correct value of G can be loaded for use by the procedure itself and/or for passing G to other procedures within the same module. Examples of these cases are given below.

The important point is that the value of G is implicit in the destination PC.

The best guess from Ed Satterthwaite and Ed Taft is that 10% to 25% of 16-bit Mesa procedures reference the global frame; monitor locks are commonly stored in the global frame, for example. This is approximately the percentage of frames which touch global variables because frame-linked procedure calls are infrequently used in 16-bit Mesa.

Dragon interface function calls (IFCs) are likely to be substantially more common than global-frame-linked calls in 16-bit Mesa. In addition, in a module that allows multiple global contexts, G must be propagated to locally-called procedures (unless the compiler can determine somehow that no possible nested path uses G).
This means that G will have to be loaded more often than the 10% to 25% of the time predictable from 16-bit Mesa.

At any rate, Dragon frequently won't have to load G. When necessary, immediately following the EB is a three-byte PRL opcode which pushes G onto the stack. Then the global frame is referenced indirectly through that LR. Here is the simple case where a code segment appears in only one global context:

GFPloc: pointer to global frame ; At beginning of code segment

Proc1:  1                       ; Entry Byte
        PRL .-GFPloc            ; Push global frame pointer
        ...

Proc2:  2                       ; Entry Byte
        PRL .-GFPloc
        ...

This simple approach works fine so long as only one global frame can be associated with a module. However, modules such as the Pup package may be instantiated in more than one global context. In this situation, G must be loaded at module entry and propagated to locally-called procedures. To handle this case, Butler Lampson has proposed a small entry segment for each global context of such a code segment; the entry segment contains an entry sequence for each externally called procedure. The entry sequence first loads G and then does a DJUMP to the first code byte of the procedure. Each LFC within the module is preceded by a push of G, and the procedure is entered as though the number of arguments were one larger than for external entries. This awkward mechanism is discussed in the Thacker and Petit procedure call memos and is shown below:

G1loc:  Pointer to global frame number 1

P1Ent:  N                       ; Entry Byte = N arguments (external entry)
        PRL .-G1loc             ; Push global frame pointer
        DJ Proc1+1

Proc1:  N+1                     ; Entry Byte = N+1 arguments (local entry)
        ...

        Push arguments
        LLn                     ; Push G
        LFC Proc1-.             ; Local call of Proc1

        Push arguments
        DFC P1Ent               ; External call of Proc1 in context G1loc

Note that DJ, similar to DFC, uses up four opcodes and is limited to targets that are byte 1 of a word, so it probably cannot be used in more general situations.

The unfortunate consequence of handling G this way is that multiply-bound and singly-bound modules are neither the same length nor same form. It is desirable to know at compile time whether or not a
Also, manyprocedures will have loaded G solely for the purposes of an IFC, and this load becomes wasteful if theIFC isn't used.Using the above pattern, the LR used for G is always the one immediately after the procedurearguments, but nothing in the hardware or instruction set makes this constraint.There is no method of distinguishing, by means of information in the IFU or in an overflow frame, one registerfrom another. In other words, the argument locals, pointer to extra local storage, pointer to the global frame, otherlocals, and evaluation stack are all indistinguishable. Any attempt to differentiate these must be based uponcompiler conventions, the values found in the words, or information obtained from the PC saved in the frame.10.2. Too Few or Too Many ArgumentsMesa does not allow either surplus or deficient arguments in its basic procedure call mechanism. Surplusarguments or omitted arguments with no default value cause a compile-time error; if a caller omitsarguments required by an interface, the compiler will produce code to load default values in the callingcontext before making the call. This method of defaulting arguments is fast and allows any argument tobe defaulted, not just ones at the end of the formal parameter list. However, the default values cannotbe functions of anything in the global environment of the destination but are limited to compile-timeconstants. Also, when a procedure is called many times, code for defaulting unsupplied arguments willbe repeated many times. Defaulting unsupplied arguments in the destination's procedure body wouldavoid both problems at some expense in execution time.Interlisp does not have anything like the Mesa procedural interface, so the source of a control transfermust be compiled without any knowledge of the destination. This means that argument/result passingconventions may be based only upon the number and type of arguments/results. Currently, Interlispthrows away any surplus arguments or defaults missing arguments to NIL at procedure entry, and itreturns precisely one result for all procedures.A plausible method of handling deficient and surplus arguments for Interlisp is to require any procedureprepared to default arguments to have one entry point for each argument count. Each caller is thenlinked to the entry point for the number of arguments it is passing. Here is an example of a procedurewith four arguments that it is prepared to default to the numbers 0, 1, 2, and 3, respectively; the call isby DFC:Ent0:0, JUMP E0Ent1:1, JUMP E1Ent2:2, JUMP E2Ent3:3, JUMP E3; 1 too few argsEnt4:4, JUMP E4; Normal entryEntX1:5, DIS, JUMP E4; 1 too many argsEntX2:6, AS -2, JUMP E4 fp!q4] G?fp bqT `SY ^"B \;# Z8. Y)I UJ S/7 R"9 PW8. N K\ IPPyFsNyE-"TyCD*yBll =vp$ :qU 89<& 6oB& 4g 2D$ 150 /DN -z.4 +6 (=9/ &s-6 $"@ "B !0 ^ F  N AQ vxs\ xT\ x\ x\ &x2\ & x \&x p\ MA\^Dragon DocumentEdward R. Fiala26 August 198394E0:LI0E1:LI1E2:LI2E3:LIB 3E4:code for procedureThis example handles deficient or surplus arguments at a cost of one extra jump in the entry sequence.Interlisp compiled code need define only those entries actually used by other compiled functions. If anew call cannot be bound to an entry point with the correct number of arguments, then the definitioncan be relocated, adding the new entry point; the old definition can then be modified to trap anyprocedure calls through it, and the trap can relocate the calls. When all old callers have been relocated(determinable from a reference count), then the old definition can be deallocated. 
Also, the next trace-and-sweep collection following redefinition can relocate any remaining pointers to the old definition.10.3. Pointers To Local VariablesIn 16-bit Mesa, pointers into a caller's local frame are passed only when the "@" operation is used. Inthis case on Dragon, local storage must be allocated even though all locals fit within the 16d register limitbecause pointers to LRs or overflow frames are illegal. This storage is "owned" and must be explicitlyfreed by the caller, not by any destination.Pointers into a caller's frame don't work when the caller resumes execution before all uses of the pointerby destinations have terminated. This can happen after forking or when the pointer is written into non-local storage. This is one reason why Mesa must avoid passing pointers into the calling context unlessspecifically directed to do so by the program.10.4. Returning More Than 1 ResultThe 16-bit Mesa evaluation stack of a caller and destination are identical, so the returner simply pushesresults onto the stack and returns. Then the destination stores the results into its own frame somewhere.This works up to the 14d word size of the stack.However, on Dragon extra steps may be involved because the results are returned in LR0 to LRn ratherthan in a shared evaluation stack. One result can simply be stored into LR0 prior to returning.However, with two results, there is a problem, if the second result depends upon the former contents ofLR0. In this case, the second result must be computed before the first result is stored in LR0. Thesituation gets progressively more complicated as the dependencies of results upon values in overwrittenLRs increases.With up to 16d results, the general idea is simple. Whenever a result will overwrite a LR that is neededin computing a subsequent result, that result is instead pushed on the evaluation stack. When all resultsthat depend upon LR's which will be overwritten have been computed, they are popped from theevaluation stack into the proper LRs using SLn opcodes (The SLn's are the extra work that is notrequired in the 16-bit Mesa machines.). Other results can be computed directly in the LR where thevalue will be returned. The compiler can reduce the number of operations required by computing theresults in the best order.Any specific number of results can be returned using the 7-bit ResCount field. However, a practicallimit on this method is about 100d words; this allows the returner to push results 17d to 100d onto thestack followed by result words 1 to 16d, which are then popped into LR0 to LR15d. Finally, the calling fp!q4] G?fpxbAs\x`\x_\x^\x\\ Yoq Z WK UB" TD RE#G Pz>+ NF Ip" F$qD$ DZg BT @, =S)A ;17 9L 7. 2p# /hq<- -P +0 (`Y &.2 $C$ #A$ !643 k #F /Q d9# A _ K : N S 3.9 B^?Dragon DocumentEdward R. Fiala26 August 198395context is reentered with up to 16d LR's and 12d stack values underneath the results returned; at thispoint the EU ring is full.However, passing many results on the stack will cause extra frame overflows and underflows. Whenmore than one or two results are returned, the caller will usually store them into its own local storage. Ifthe words are instead returned in storage allocated by the returner, sometimes the caller can use theresults where they sit; other times it must move them somewhere else. 
So passing many results on thestack is essentially a trade off the extra work of frame overflows and underflows against refetching andrestoring result words.Also, the frame overflow allocator discussed later can take advantage of a small frame size limit; it canthen use a single block size for overflow frames, reducing the allocation time and utilizing the storagebetter.Finally, there seems no reason to insist upon returning many results on the stack. The compiler will notbe simpler because it must still be able to handle more than 100d result words, so it will have to be ableto allocate storage for results that won't fit in registers. And the statistical frequency of more than a fewresult words should be so small that performance is not an issue.The conclusion of this argument is that Dragon hardware can efficiently pass up to 100d result words onthe stack, but the limit can be set much smaller without harm. Beyond whatever limit is agreed upon byconvention, extra local storage must be allocated, the excess result words put there, and a pointer to thatstorage returned on the stack.The requirements of allocating extra storage are discussed later.There are also situations when returning a specific number of results is not what's wanted. ResType=Allallows the entire frame to be concatenated onto the caller's evaluation stack. For example, ResType=Allis used in the XOP dispatcher example later; in this case, the procedure to which control was sent hasput some unknown unumber of results onto the stack and returned to the XOP dispatcher. Thedispatcher returns with RetType=All.10.5. More Than 16d Argument WordsPassing large numbers of arguments is similar to passing large numbers of results, but with somedifferences.One difference is that a trick is needed to pass more than 15d arguments on the stack; this trick is slow.The ArgCount field in an EB can specify only 0 to 16d arguments. However, if the EB specifiesArgType=All, the destination can begin with the caller's frame, then manually advance RL (i.e., read RLand S, then rewrite RL as S-NArgs with rescheduling disabled); altogether, manually advancing RL takes~14 cycles, compared to only 1 cycle for the AS trick used when returning more than 16d results. As inthe return case, the trick allows up to about a 100d word argument limit, so the compiler can choose areasonable convention.Since source and destination are independently compiled, the source is limited to calling conventionsbased solely on the interface. I propose that the first N-1 argument words always be passed on the stack,while those beyond the (N-1)th be passed in storage pointed at by the Nth word. With this convention,there is never any doubt about the presence or absence of a long argument or result record. A plausible fp!q4] G?fp bq06 `S \/2 [*C YLP WN UB& S PzH! NI L Is [ GU E1= DA @.9 > ] = Y ;A 7A 4^:. 2N 0 \ .F -3$ (p# $q>" " k0: Q _  J Ad vY  : Y o<pq P V  B\x,Dragon DocumentEdward R. Fiala26 August 198396value for N is 14d.When storage is allocated for more than N argument words, it may be desirable to allocate an excess forother uses by the destination. The reserve should be large enough that the destination normally avoidsanother allocation. In other words, the caller's excess argument storage should become the extra localstorage of the destination most of the time.Even though excess arguments are referenced through a pointer, semantically they are still being passedby value, so the destination must not misuse this pointer. 
One convention that preserves pass-by-value isto make the destination "owner" of the excess storage and responsible for its deallocation; the sourcemust not make further use of the storage. This is the convention followed by 16-bit Mesa.An alternative is to put extra arguments in the caller's local storage and pass a pointer to them.However, to preserve pass-by-value, the destination must be limited to read-only access and cannot allowthe pointer to be used by a forked process or be stored in a non-local variable where another processmight use it (because the caller might continue execution and deallocate its frame).This is different than on a return with excess results. On a return, the destination already has any localstorage it needs, so there is no reason to reserve extra words; also, the returner is being destroyed, sothere is no question but that the destination becomes "owner" of the excess result storage.10.6. More Than 16d Local VariablesThe limit on LRs is 16d, including possible global frame and extra local storage pointers. For more than16d LRs, extra storage must be allocated by the procedure and freed before it returns.The LR pointing at extra storage is no different from any other LR; opcodes which reference the extrastorage allow indirection through any LR. The allocation requirements of the storage are the same asthose for excess argument or result words, a little different from frame overflow storage, as discussedlater. All of these are "local storage" from the viewpoint of the Collector.10.7. Traps In GeneralEach trap is a procedure call that sends control to a particular location. Opcodes which unconditionallytrap (XOP's or Extended Operations) are essentially abbreviated DFC's because the IFU predicts thecontrol transfer as it does for a DFC. These opcodes can do some work before trapping; for example, atwo-byte XOP must generally push a onto the stack before trapping so that the trap procedure can easilyaccess it.Other traps are conditional. Since these traps are not predicted by the IFU, they have timingcharacteristics similar to those of a wrongly-predicted conditional jump. Namely, 3 cycles plus thenumber of words occupied by the first target opcode will pass before the first trap opcode is executed.Conditional traps generally transfer control to a location determined from the condition detected, so theopcode which caused the trap is not necessarily known. Also, whether or not the microinstruction iscompleted, and whether or not the PC is advanced, are functions of the trap.Also note that when a trap occurs, S is restored to its value at the onset of the opcode causing the trap. fp!q4] G?fp bq ^M \G [$C YL, U+< T?+ RE"D PzZ M>$ K>C% Is"C GT D7a Bl81 @[ ;p$ 8q<- 6KV 205 1e /D*= -zM (`p $q54 #$b !Y!E !tq+  R#; 7- =* KN E L DU B\Dragon DocumentEdward R. Fiala26 August 19839710.8. Interrupts and FaultsThe most worrisome kinds of traps on Dragon are associated with events which leave the process unableto proceed. The following events are of particular interest:Reschedule trapFrame overflow trapStack overflow trapFrame underflow trapPage faultWrite protect faultThe approach taken for these events was discussed briefly in the "Hardware Overview" chapter; it is toreserve [StkLim] of the ring buffer registers and 3 IFU frame slots for all of the unusual events whichmay begin with one of the above traps. 
In addition, some ARs may be reserved.

To prevent all possible nestings of the above traps from happening, the following software and hardware conventions are followed:

1) The reschedule trap subroutine can experience any of the other traps and faults.

2) The frame overflow, stack overflow, and frame underflow traps are mutually exclusive and non-recursive. Once one of these has started, the IFU disables the others until the trap in progress has exited with a RET. The IFU also disables the reschedule trap when one of these is in progress.

3) Page faults and write protect faults might occur during execution of the other four traps, but other traps cannot occur during page or write protect faults. The IFU enforces this restriction.

4) Any pulse on the Reschedule signal pin of the IFU is remembered in the RescheduleWaiting flipflop until it can be serviced (except when a coprocessor has control). The RescheduleWaiting flipflop is cleared at entry to the reschedule trap subroutine.

5) Trap subroutines must not use more than the [StkLim] registers and 3 IFU frame slots known to be available at the onset of the trap. This must include the requirements of any possible nested page or write protect fault that may occur during execution of the other traps.

Because of these conventions, nothing preempts page and write protect faults; and only these faults can preempt the other traps. This is enforced by the IFU hardware.

Note that it is important to reserve several ring buffer registers for use by these traps. Although there might be enough ARs to deal with the events, some opcodes require stack registers. For example, conditional jumps and field extractions can only be carried out on stack arguments. So without reserved stack registers, some stack words would have to be moved to make room for the needs of the trap subroutine; this would hurt performance.

When a page or write protect fault has occurred, the fault subroutine will usually block the process which faulted and wake up a special process to deal with the fault. These special processes must not themselves fault, or the system would deadlock. This gives rise to the following requirements:

At the highest priority are any real-time processes with severe timing requirements. These processes must neither page fault nor exhaust local storage; local storage for overflow frames and other purposes must be locked down.

At the second highest priority are the page and write protect fault processes, which must neither fault nor exhaust local storage; the disk server process must service page faults, so it inherits the same requirements. Also at the second highest priority is any local storage fault process, which must not exhaust local storage but could be allowed to page fault.

Real-time and non-real-time processes which can withstand both page faults and exhaustion of local storage can use local storage which need not be locked down.

10.9. Reading and Writing Low-Level Machine State

This section discusses operations which examine, save, and restore low-level machine state. These operations are used when servicing frame overflow, frame underflow, page fault, reschedule trap, and similar events, as discussed in the next sections.

The following four opcodes are used:
LIFUR   Load IFU Register. Push data from an internal IFU register selected by a onto the stack. Register assignments not yet specified.

SIFUR   Store IFU Register. Pop the stack into an internal IFU register selected by a.

REUR    Read EU Register. Push data from an internal EU register selected by a onto the stack. Register assignments not yet specified.

EUSF    EU Special Function. Sends a to the EU as the SPC FUN field. This in general causes the stack to be popped into some internal register (i.e., Q, ICAND, or MODE), or some flipflop to be cleared, etc. Register assignments not yet specified.

The IFU has three pointers into the ring buffer in which RLs and PCs are saved: FirstF addresses the entry for the earliest frame; SecondF, the next-to-earliest frame; and LastF, the latest frame. The RL and PC associated with FirstF, for example, are called RL[FirstF] and PC[FirstF], respectively. The hardware may also have some way of directly reading the number of ring registers in FirstF [= (RL[SecondF] - RL[FirstF]) mod 128]. RL[SecondF] is S when there is only one frame in the ring. There may also be some way to reload some IFU registers directly from storage.

The EU registers which must be read or manipulated specially include Q, ICAND, Carry, integer-out-of-range enable, integer-out-of-range mode (31-bit or 32-bit), ?TBC.

10.10. Frame Overflow and Stack Overflow

The IFU RL and PC registers for each of 16d frames and the 128d available EU registers are arranged in ring buffers. On either a procedure call with only 3 IFU slots left (frame overflow) or an S advance with fewer than [StkLim] registers remaining (stack overflow), storage is allocated, and the earliest frame in the IFU ring buffer is written out and discarded; then the fault procedure returns to reexecute the opcode which caused the fault. These two faults are handled almost identically.

Stack overflow occurs on any attempt to make S "greater than or equal to" RL[FirstF] - [StkLim]; this might happen on a stack push or on an AS +n opcode.
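One reading of this ring arithmetic, sketched in C (the names and the particular StkLim value are invented, and the overflow test is only my interpretation of the rule quoted above):

        enum { RING = 128, STKLIM = 8 };       /* STKLIM value is illustrative */

        /* NRegs[FirstF] = (RL[SecondF] - RL[FirstF]) mod 128. */
        static unsigned FrameSize(unsigned rlFirst, unsigned rlSecond)
        {
            return (rlSecond - rlFirst) & (RING - 1);
        }

        /* Stack overflow: advancing S would leave fewer than StkLim free
           registers between S and RL[FirstF], measured around the ring. */
        static int StackOverflow(unsigned newS, unsigned rlFirst)
        {
            unsigned used = (newS - rlFirst) & (RING - 1);
            return used >= RING - STKLIM;
        }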
A coding discipline could be followed such that stack overflow wouldn't happen on AS +n; AS +n would only be used to restore S to a position previously reached using push operations; we do not assume that this coding discipline is followed.

Although dumping one frame always satisfies a frame overflow, a stack overflow fault might repeat. The worst case is 14d frames each using no registers underneath 1 frame using 127d registers; then 14d null frames must be dumped before freeing a register for a push. A null frame might be a caller of a nested procedure or a procedure with no locals and no stack. However, this complication is invisible to the trap, which simply dumps one frame and restarts the opcode that faulted.

Barring some break in the control sequence, the number of frame overflows and underflows will be nearly equal (i.e., overflows minus underflows = frame depth). One dynamic study of Pascal procedures reported to me by Lampson indicated that about 70% of procedures return without calling another procedure, about 80% before going two levels down, and about 90% before going three levels down. On Dragon, we expect to have between 4 and 13d levels within the registers, so fewer than 10% of procedure calls will experience frame overflow.

Frame overflow and underflow are potentially a significant drain on performance. To show this, assume the following:

        All opcodes take 1 cycle.
        2% of all opcodes are procedure calls.
        5% of all procedure calls experience frame overflow.
        Servicing either frame overflow or underflow takes 50 cycles.

Then it follows that 2% of all opcodes are returns and 5% of all returns experience frame underflow; then 100 cycles out of every 1100, or 9% of all cycles, are spent in frame overflow or underflow. Since this is substantial, we must try to make these traps go fast.

As discussed earlier, the reschedule, frame overflow, and frame underflow traps are disabled during the frame overflow or stack overflow trap. Not until a RET terminates trap service can another of these events occur. The only interruptions permitted by hardware are page and write protect faults; all schemes considered have made write protect faults illegal. There were (at least) 3 free slots in the IFU ring at the onset of fault service; 1 of the 3 is consumed by frame overflow; so at least 2 slots remain for subroutines and a potential page fault. And there are at least [StkLim] ring buffer registers available for use as LRs and stack.

The following actions occur on frame overflow:

1) Determine the size of the oldest frame (i.e., the FirstF frame) in the IFU's ring; this is NRegs[FirstF] = (RL[SecondF] - RL[FirstF]) mod 128d.

2) Allocate enough local storage to save that frame.

3) Save the following:
        Hook (an AR)
        PC[FirstF]
        NRegs[FirstF]
        Registers from RL[FirstF] to RL[SecondF]-1

4) Increment FirstF (i.e., discard the oldest frame).

5) Point Hook at the block just written.

6) Return to repeat the opcode which trapped.

The differences among the various schemes considered revolve around the "Allocate" part of the trap. Some schemes use one or two fixed-size blocks, some variable sizes; some allow page faults during allocation, some do not. Also, there are various ideas about what to do when/if the allocate fails.

Because each store takes at least two cycles, step (3) above takes at least 2*(NRegs+3) cycles; the other stuff takes 15 to 50 cycles depending upon how easy it is to allocate storage, compute NRegs, and discard the oldest frame. This suggests that the time required to overflow a 16d-register frame is 53 to 88 cycles.

10.11. Frame Underflow

Although a stack push can cause frame overflow, stack underflow can only happen due to a code generation error and is not detected by the hardware. So the frame underflow trap only happens on a RET with no previous context in the IFU.

There is a potential problem with shared frames, which occur when ArgType=All is used. Some examples were given earlier and more are given in the "Nested Procedures" section later. Fortunately, registers are saved with the deepest frame (= the last frame overflowed and the first frame underflowed), so the registers needed by a procedure are always in the EU after frame underflow. Consequently, shared frames are not a problem.

The frame underflow trap procedure interprets the Hook, which is kept in an AR. I think the following kinds of Hook must be considered:

1) Direct hook. A direct hook, denoted by Hook[0..0] = 1, points at an existing context, which is reloaded in the way described below.

2) Indirect hook. An indirect hook, denoted by Hook[0..0] = 0, points at a word which is recursively interpreted as a Hook.

3) No previous context. When the Hook contains 0, a control trap happens.
This possibility is not checked for explicitly by the frame underflow trap procedure; instead, a page fault on page 0 is allowed to occur, and the page fault code identifies this case.

The direct hook is not necessarily used to return to a previous context. The process machinery may fabricate direct hooks to a new context, and then RET through the fabricated hook; this approach handles all special case control transfers I have considered. The frame underflow trap is also used as the final part of a process switch. The previous process's frames are unloaded; then a pointer to the current context of a new process is loaded into the Hook, and a RET 0 sends control to that context via the frame underflow trap.

With these preliminaries, we can now consider only the case of a direct hook. Note first that reschedule, frame overflow, and frame underflow traps are disallowed by hardware; so the only events which might interrupt the frame underflow trap subroutine are page and write protect faults. All schemes considered make write protect faults illegal.

Note secondly that no pointers to overflow frames are allowed except for the Hook. A frame exists either in storage or in registers, never in both places at once; the Hook exists only when the frame is in storage, never when it is in registers; an overflow frame is deallocated immediately after it has been reloaded into registers.

Lisp free variable searches and Mesa signals are examples of situations which may cause difficulties with overflow frames. The methods for these have not yet been worked out.

Usually, Dragon could simply reload RL and restore EU data into precisely those registers from which it was dumped; this would locate the frame snugly underneath the results being returned. In other words, a sequence of hierarchically nested calls and returns always comes back into precisely the same registers it occupied when it overflowed.

However, the Hook might point at a context which was not hierarchically nested, so, for generality, the frame underflow trap must recompute RL to locate the context being restored snugly underneath the results. At the beginning of the frame underflow trap, FirstF=LastF identifies the context which contains the RET opcode, and SecondF identifies the imaginary context which has RL=S. Here is an outline of the frame underflow trap subroutine:

1) Follow indirect hooks until a direct hook is found.

2) Decrement FirstF, after which LastF=SecondF points at the context containing the RET which trapped, and FirstF is the context to be reloaded.

3) Save S; reload both S and RL[FirstF] with (RL[SecondF] - NRegs) mod 128, where NRegs is obtained from the overflow frame; reload PC[FirstF] with the PC saved in the overflow frame.

4) Push NRegs words from the overflow frame onto the stack.

5) Reload Hook from the saved hook in the overflow frame.

6) Deallocate the overflow frame.

7) Restore S; RET to exit from the trap.

Note that even though reexecuting the RET which caused the trap will fix up S, S has to be saved and restored in the frame underflow trap subroutine because a reschedule trap could happen after the RET which exits from the frame underflow trap and before the RET which caused the trap originally is reexecuted.
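For concreteness, here is a rough C sketch of steps 1-7 for the direct-hook case. It is only an illustration of the bookkeeping: the OverflowFrame layout, the helper routines, and the mod-16/mod-128 ring arithmetic are my assumptions, not the actual trap code.

    typedef unsigned int Word;

    typedef struct OverflowFrame {     /* layout assumed from step 3 of frame overflow */
        Word hook;                     /* Hook saved with the frame                    */
        Word pc;                       /* PC[FirstF] at the time of overflow           */
        Word nRegs;                    /* NRegs[FirstF]                                */
        Word regs[16];                 /* at most 16 local registers per frame         */
    } OverflowFrame;

    extern Word Hook, S, RL[16], PC[16];
    extern int  FirstF, SecondF;
    extern int  IsDirectHook(Word hook);             /* Dragon bit 0 (the msb) = 1    */
    extern OverflowFrame *HookedFrame(Word hook);    /* strip the tag bit             */
    extern Word FetchWord(Word addr);                /* follow an indirect hook       */
    extern void Push(Word w);                        /* S _ S+1 mod 128; write [S]    */
    extern void DeallocateFrame(OverflowFrame *f);

    void FrameUnderflowTrap(void)
    {
        while (!IsDirectHook(Hook))          /* 1) chase indirect hooks; a Hook of 0   */
            Hook = FetchWord(Hook);          /*    would page-fault on page 0 here     */
        OverflowFrame *f = HookedFrame(Hook);

        FirstF = (FirstF - 1) & 15;          /* 2) FirstF is now the context to reload */

        Word savedS = S;                     /* 3) recompute RL, PC, and S             */
        RL[FirstF] = (RL[SecondF] - f->nRegs) & 127;
        S = RL[FirstF];
        PC[FirstF] = f->pc;

        for (Word i = 0; i < f->nRegs; i++)  /* 4) reload the saved registers          */
            Push(f->regs[i]);

        Hook = f->hook;                      /* 5) restore the previous Hook           */
        DeallocateFrame(f);                  /* 6) free the overflow frame             */
        S = savedS;                          /* 7) restore S; RET exits the trap       */
    }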
10.12. Reschedule Trap

A pulse on the Reschedule signal pin requests a reschedule trap. If a coprocessor has control at the time the pulse is received, the IFU ignores the reschedule trap, and the coprocessor deals with the event as discussed in Petit's "Dragon Co-processors" memo and in the Hardware Overview chapter. In this case, the coprocessor will either service the trap itself, pass the trap back to the IFU while arranging to restart the coprocessor operation, or defer the trap until the coprocessor operation has completed and then pass it back to the IFU.

When a coprocessor does not have control, the IFU sets the RescheduleWaiting flipflop, which remains set until the onset of the reschedule trap. The reschedule trap may have to wait for other traps to finish before it can start. When it is legal to start the reschedule trap, the IFU inserts it into the opcode stream somehow; I don't understand exactly when. This trap looks like a DFC to the reschedule trap subroutine.

There is also a SIFUR operation to enable and disable the reschedule trap and an LIFUR operation to read the trap enable and the RescheduleWaiting flipflop.

10.13. Page and Write Protect Faults

Page and write protect faults are non-interruptible. My thoughts on how to deal with these, which differ from the complete proposal which Russ Atkinson is preparing, are as follows:

1) Before it gives control to a process, the scheduler obtains a maxi-state-vector large enough to contain the machine state of the process in the worst case; this state vector is locked in storage to prevent page faults. It will be used if the process is preempted during a reschedule trap or by a page or write protect fault; or it will be used if the frame overflow allocator "fails" and some more elaborate process must be awakened to make more frames.

The maxi-state-vectors must be large enough to hold 128 ring registers, 16d contexts x 3 words/context, the Q register, ICAND, the Hook and several other ARs, the Carry and overflow mode flipflops, etc. Altogether there are ~185d words.

The page and write protect fault processes never need state vectors. Some real-time processes may require permanently assigned state vectors; other processes can wait when no state vector is available because one will be freed eventually.

2) Frame overflow and frame underflow allocators need not use frames locked in storage; if a page fault occurs during one of these traps, then the maxi-state-vector reserved in step (1) is used.

3) Processes which block on monitors, condition variables, or similar methods will be treated like the "minimal stack" opcodes of 16-bit Mesa. These processes manually overflow all register frames to storage, just as though frame overflows had occurred. The residual state of these processes can then be saved in a mini-state-vector which does not include ring buffer registers, the Q register, Carry bit, etc.

4) Because few processes are likely to be in page fault wait at one time, and because the number of processes interrupted by the reschedule trap is generally small compared to the number which block on monitors and condition variables, the number of maxi-state-vectors required will be small.
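As an illustration of where the ~185d-word figure comes from, here is a C picture of what a maxi-state-vector might contain, based on the inventory above. The field names and the exact count of "several other ARs" are assumptions.

    typedef unsigned int Word;

    typedef struct MaxiStateVector {
        Word ringRegs[128];    /* the 128 EU ring-buffer registers               */
        Word context[16][3];   /* 16 IFU frames x 3 words (RL, PC, and one more) */
        Word q, icand;         /* EU working registers                           */
        Word hook;             /* the Hook AR                                    */
        Word otherARs[6];      /* "several other ARs" -- count assumed           */
        Word flags;            /* Carry, overflow mode, etc., packed together    */
    } MaxiStateVector;         /* 128 + 48 + 2 + 1 + 6 + 1 = 186 words,          */
                               /* in the neighborhood of the ~185d cited above   */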
10.14. Local Storage Allocation

As discussed earlier, one set of requirements applies to the allocator for extra arguments, results, and local storage. A less severe set of requirements applies to overflow frame allocation because overflow frames are serially rather than randomly accessed, and the maximum size is small. All of this is "local storage" from the viewpoint of the Collector, which must enumerate it efficiently during on-stack marking.

The less severe requirements of the overflow frame allocator can take advantage of a clever allocation idea suggested by Lampson, which is as follows (a C sketch of this scheme appears below):

1) Make all blocks the same size.

2) Each processor owns a table to which it keeps a pointer in an AR called FROVP. This table is locked in storage and is N pages long; it contains pointers to free blocks, and FROVP points at the first valid pointer in the table.

3) Allocation consists of fetching the word pointed at by FROVP and then incrementing FROVP. The page immediately after the table is non-existent, so any attempt to overflow, i.e., to allocate a block when the table is empty, will cause a page fault. This event is checked for by the page fault subroutine. Fetching and then incrementing the pointer cannot be done atomically, so the reschedule trap must be disabled during the two instructions which do the allocation.

4) Deallocation consists of decrementing FROVP and storing a pointer into the table. The page immediately preceding the table is non-existent, so any attempt to underflow, i.e., to free a block when the table is full, will cause a page fault. This event is checked for by the page fault subroutine. Deallocation has no atomicity problem because the pointer can be decremented and the store done in a single opcode.

The algorithm is also good without using the page fault trick to save instructions. In that case, there is one extra conditional jump for both allocate and free. The fact that the largest storage requirement for an overflow frame is small makes this single-block-size allocator practical. Russ Atkinson is developing an algorithm in which either one or two blocks are allocated on frame overflow using a variant of Lampson's idea.

Unfortunately, processes that allocate blocks on one processor can wind up freeing them on another, so there can be a migration of storage from one processor's list to another. Some kind of rebalancing program must exist to fix up this situation when it happens.

For arguments, results, and local storage, a more flexible allocation approach is needed because the size range of these blocks is much greater than for overflow frames. The compiler could use several different allocators in this situation, if that's desirable. For example, it could use the frame overflow allocator, when the constant block size is appropriate, and fall back on some other allocator for larger blocks.

Allocation and deallocation of local storage are probably common enough operations to warrant using XOPs rather than DFCs to save code space. A two-byte ALS opcode (Allocate Local Storage) in which a specifies a frame size index (FSI) is discussed below. Somehow specifying the frame size in a is reasonable because storage requirements are known at compile time. Similarly, a one-byte FLS opcode (Free Local Storage) in which a pointer to the free list header is obtained from word -1 of the block can be used for deallocation.
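Before turning to the ALS/FLS free lists, here is a minimal C sketch of Lampson's single-block-size scheme described above. This is the variant without the page-fault trick, so the table-bounds checks appear as the one extra conditional jump per allocate and per free mentioned above; the names (frovp, frovpTable, TABLE_ENTRIES) and the error handling are assumptions, not part of the proposal.

    typedef unsigned int Word;

    #define TABLE_ENTRIES 1024              /* "N pages long" -- size assumed here */

    static Word *frovpTable[TABLE_ENTRIES]; /* pointers to free fixed-size blocks  */
    static Word **frovp = frovpTable;       /* plays the role of the FROVP AR      */

    /* Allocate one fixed-size overflow-frame block.  On Dragon the bounds test
     * is replaced by letting the fetch run off the end of the table onto a
     * non-existent page; reschedule must be disabled around these two steps. */
    Word *AllocOverflowFrame(void)
    {
        if (frovp == frovpTable + TABLE_ENTRIES)   /* table empty                  */
            return 0;                              /* Dragon would page-fault here */
        return *frovp++;                           /* fetch, then increment        */
    }

    /* Free one block: decrement, then store (a single opcode on Dragon). */
    void FreeOverflowFrame(Word *block)
    {
        if (frovp == frovpTable)                   /* table full                   */
            return;                                /* Dragon would page-fault here */
        *--frovp = block;
    }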
The ALS/FLS data structure for free blocks consists of a table of free lists indexed by FSI, similar to the 16-bit Mesa frame allocator. Each free list has (positive) pointers threading through word 0 in each free block, terminated by a distinguished pointer called "empty" which points at a word containing itself. Word -1 of each block contains a pointer to the free list header for the block, and word -2 contains the size of the block.

The little programming example below shows the common case of allocation from an FSI list; the complications which happen when the desired list is empty aren't shown. Implementation would be simple if other processes could not intervene. Here is the code in this simple case:

    ALS:1                       ; Entry byte = 1 argument (FSI = a)
    RADD Push_[S]+FSITab        ; Push FSI table origin (in an AR) + a
                                ; LR1 [S]/ ptr to head-of-list for original FSI
                                ; LR0 [S-1]/ original FSI
    RSB 0                       ; Push head-of-list (=ptr to 1st free block)
    RJLEB [S],Empty,ALEmpty     ; Jump if head-of-list .eq. Empty (In an AR).
    RSB 0                       ; Push word 0 of block (ptr to next entry).
                                ; LR3 [S]/ ptr to next free list entry
                                ; LR2 [S-1]/ ptr to new block (=head-of-list)
                                ; LR1 [S-2]/ ptr to head-of-list
    SRI1 0                      ; Pop ptr to next entry into head-of-list
    SL0                         ; Pop ptr to new block into LR0 for RET
    RET 1

    ALEmpty:
    AS -2                       ; Discard two stack entries (S_S-2)
                                ; LR0 [S]/ original FSI
    ...

The above implementation would work if the heap were inaccessible to other processes, but the storage wasted by 150 or more heaps of this kind would be objectionable. The FSI table approach is geared toward fairly large heaps, where it makes sense to prestructure many frames of each size.

Preventing preemption is hard because frame overflow could happen on either of the RSB's, which advance S. Also, a page fault can happen on the reference to the "next" pointer in the first free block; locking all frames in storage to avoid this seems like a poor idea. Consequently, some kind of monitor lock or spin lock would be required to share this allocator, but that might be too slow. I don't think this is a practical allocator for Dragon local storage, but I am leaving it in as a programming example.

Here is the code for deallocation; since its write is a function of only one read value, the CST (Conditional STore) opcode can ensure atomicity:

    FLS:1                       ; Entry byte = 1 arg (pointer to block)
    REC                         ; Reserve 1 stack word
    RFX Push (LR0+-1)^          ; Push block header (contains pointer to free
                                ; list header for blocks of this size)
                                ; LR2 [S]/ Pointer to head-of-list
                                ; LR1 [S-1]/ reserved
                                ; LR0 [S-2]/ Pointer to block being freed
    FLLP:
    RSB 0                       ; Push head-of-list
    WRI (LR0+0)^ _ LR2          ; Store old head-of-list into word 0 of block
    SL1                         ; Save the old value of the head-of-list
                                ; LR2 [S]/ Pointer to head-of-list
                                ; LR1 [S-1]/ Old head-of-list
                                ; LR0 [S-2]/ Pointer to block being freed
    CST 1,0                     ; [S]^ _ LR0 if [S]^ .eq. LR1
    JCST FLLP                   ; Jump if the conditional store failed.
    RET 0

For the ordinary cases above, ALS takes 8 EU cycles and 11 IFU cycles (3+3 for the ALS opcode and the RET, plus 1 cycle each for the 4 words of the subroutine and the word containing the ALS opcode). FLS takes 12 EU cycles and 12 IFU cycles (3+3 for the FLS opcode and the RET plus 1 cycle each for the 5 words of the FLS subroutine and the word containing the FLS opcode).
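As an aside, the CST/JCST retry loop in FLS is the same pattern as a compare-and-swap push onto a singly linked free list. A rough C11 rendering, purely illustrative and with invented names, is:

    #include <stdatomic.h>

    typedef struct FreeBlock {      /* word 0 of each free block threads the list */
        struct FreeBlock *next;
    } FreeBlock;

    /* Push a freed block onto its free list, retrying if another processor
     * changed the head in between -- the role CST and JCST play in FLS above. */
    void FreeListPush(_Atomic(FreeBlock *) *head, FreeBlock *block)
    {
        FreeBlock *old = atomic_load(head);
        do {
            block->next = old;                    /* WRI: old head into word 0 */
        } while (!atomic_compare_exchange_weak(head, &old, block));  /* CST/JCST */
    }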
10.15. Extended Operations

As mentioned earlier, Dragon's PLA provides only the first three microinstructions of an opcode. Opcodes which don't finish in three microinstructions "trap" to a procedure which completes implementation of the opcode.

Because direct procedure calls will be fast, equating a trap opcode to a call on the procedure which implements it would suffice. However, 16-bit Mesa has four important differences between a trap opcode and a procedure call which must be resolved before doing this:

a) PC points at the trap opcode itself rather than its successor, so PC must be advanced before returning from the trap procedure.

b) The evaluation stack may contain other data underneath opcode arguments. 16-bit Mesa dumps the stack at entry to the trap procedure and restores the stack before returning to ensure against stack overflow.

c) Trap parameters (currently only one) can be passed in addition to arguments on the evaluation stack. Because normal code for storing arguments would smash the trap parameter, Trinity Mesa trap procedures cannot be defined with any arguments. Hence, the trap opcode's arguments must be painfully extracted from and the results painfully stored back into the state vector dumped in (b). The stack pointer in the state vector also must be manually adjusted.

d) An opcode could be defined to preserve the evaluation stack above its arguments (e.g., PUSH and STORE kinds of opcodes are defined this way).

For both Dragon and 16-bit machines, we propose to eliminate trap parameters on trap opcodes and to leave PC pointing at the successor to the trap opcode rather than backing it up. In addition, we propose to rule out having any trap opcodes that preserve the evaluation stack above opcode arguments (if it were necessary to preserve the evaluation stack above arguments, this could be done pretty easily by adding a new source language construct to Mesa).

The only remaining problem is data underneath opcode arguments on the evaluation stack, which is permitted on Dragon, as discussed earlier. Hence, on Dragon an ordinary procedure can implement a trap opcode. However, to deal with this on 16-bit machines, either the opcode must be defined "minimal stack", so that the compiler ensures there is nothing underneath its arguments, or the procedure must DumpStack at entry and LoadStack (LoadStack is a new opcode we add) before pushing the results at exit. This represents a substantial improvement over the 16-bit Mesa situation because neither are arguments extracted from, nor results stored into, nor the stack pointer modified, nor PC manually advanced in the state vector. In addition, the LoadStack opcode saves one totally useless Xfer.

On Dragon, an unimplemented or incompletely implemented opcode becomes semantically equivalent to a procedure call. To make this as fast as possible, we also propose to locate the procedures at fixed locations (e.g., at VA = 32*opcode) to avoid fetching a procedure descriptor. For 16-bit Mesa, locating trap opcode procedures at fixed locations could be considered, but since traps are infrequent, it is easier to stay with the current opcode trap table.

It would be nice if an extended operation analogous to 16-bit Mesa's Esc opcode could automatically dispatch on a without loss of time. This would allow a two-byte DFC for 256 common procedures. Unfortunately, this results in some complications, so Petit proposes to require a manual dispatch on a. In other words, an XOP that dispatches on a will be more compact than a DFC (2 bytes compared to 4) but substantially slower.
Sample code for dispatching on a is as follows:

    ; Space trap procedures 32d words apart.  The XOP dispatch does another call so
    ; that the procedure implementing the operation can have its own local frame.
    XOP:All                     ; Microcode pushes a and traps; the XOP dispatch
                                ; code executes within the caller's frame
    SH [S] _ [S] lshift 5       ; Multiply a by 32d
    RADD [S] _ [S]+XopBase      ; XopBase is an AR containing the dispatch base address
                                ; with 1 in bit 0 to make it a "direct" control link.
    SFC
    RET All

This example takes ~8 EU cycles and ~15 IFU cycles compared to 0 EU cycles and 4 IFU cycles for a DFC. The extra RET can be avoided if all procedures for the XOP accept the same number of arguments, in which case the dispatch would be like the following:

    XOP:2                       ; Microcode pushes a and traps; the 2 arguments are
                                ; a and the argument for the procedure.
    SHSB 5                      ; Multiply a by 32d
    RADD [S] _ [S] + XopBase    ; Make a direct procedure descriptor
    SJ

Even with these improvements, it takes ~8 EU cycles and ~10 IFU cycles.

These examples suggest that use of XOPs which dispatch on a will be limited to those functions which are statically frequent (so that compaction is worthwhile) and dynamically infrequent (so that 8 cycles is not too costly). An IFU feature to directly dispatch on a would be nice.

10.16. Coprocessor Operations

Coprocessor opcodes (COPRs and COPRLs) first attempt to transfer control to a coprocessor; only if that fails do they trap. Consequently, the IFU does not predict a control transfer to the trap procedure, so when the trap occurs, there is a 4-cycle hiccup in the IFU pipeline.

This means that a COPR or COPRL which traps is 5 EU cycles and 2 IFU cycles slower than an XOP executing the same code.

10.17. Procedure Variables

Global frame information is moved from procedure descriptors into the target procedures on Dragon. Since there is usually only one global frame for a procedure, this change usually causes no problems. However, procedures appearing in more than one global context require a fairly complicated replication of entry sequences.

What remains on Dragon is a 32-bit direct PC which does not involve the code segments or entry vectors which clutter up 16-bit Mesa implementations. In addition, the shortcuts taken for the DFC opcode limit the starting address of a procedure to the 0th byte of a word and the first 2^28 bytes of VM, so there are 5 unused bits in the 32-bit direct PC. These bits can be used somehow to encode other possible control transfers of interest for the SFC opcode.

TBC

10.18. Inline Procedures

An "inline procedure" is called and returns in the usual way but executes within the environment of its caller--in other words, rather than building a new frame, it uses the local and global environment of its caller. To make this possible, one of the ArgTypes is "All" or "Don't change RL". A procedure call with this ArgType makes the callee's frame congruent with the caller's.
In returning from an inline procedure, ResType .eq. All, meaning "don't change S", can optionally be used.

The anticipated use of this feature is indicated by the following example from Satterthwaite: Suppose a construct such as "DO for I = 1, 3, 7, 13, and 23" were provided in Mesa. The compiler could easily implement this sequence by generating an inline procedure for and then calling this procedure once for each argument value. The fact that the local environment doesn't change allows the inline procedure to access all locals in the enclosing context.

In addition, if a nested procedure (next section) is not recursive and if descriptors for it are not passed out, then it can be implemented as an inline procedure.

10.19. Nested Procedures

A Mesa "nested procedure" is one that is explicitly named, potentially recursive, and can access the variables of the enclosing context. If recursive, then it cannot execute within the environment of the caller, but it must still access the variables of the caller. How can this be done?

It must clearly be done by somehow passing a pointer to/into the enclosing context, and then referencing non-local variables through that pointer. Unfortunately, it is illegal to make a pointer to the enclosing context in any ordinary way. The first possibility that comes to mind is to flush the enclosing frame (i.e., all register frames) to storage, because there is no way to build a pointer to a context that is now in the registers but might at any instant be flushed to storage by a process switch or whatever. However, a pointer to the nested procedure might be passed to another process, and the enclosing context might then continue execution, invalidating the pointer.

Consequently, all local storage accessed by the nested procedure must be placed in separate storage outside the register frame; this storage would be explicitly allocated and freed (i.e., "owned") by the enclosing context, and the pointer would remain valid during the enclosing context's lifetime.

However, passing a pointer to the enclosing context's variables is not straightforward. The descriptor for the nested procedure may be passed as an argument to other procedures; when called from remote contexts, it must still be able to access the variables of its own enclosing context. This means that the procedure descriptor for the nested procedure must itself embody a pointer to the enclosing context somehow. The trick used for this, like that in 16-bit Mesa, is as follows:

1) Fabricate a direct control link for the nested procedure and place it in a block of local storage together with all other local variables referenced by the nested procedure. A direct control link is a byte pointer with bit 0 .eq. 1.

2) Fabricate a pointer to the control link of (1); this pointer is an "indirect procedure descriptor" in the parlance of the SFC opcode.

3) Always call the nested procedure through the indirect control link.

4) Since the call to the nested procedure was by an SFC, the original indirect pointer is on the stack, and the enclosing context's variables can be referenced as displacements from the indirect pointer.

Because Dragon opcodes do not support references at negative displacements to base registers very well, it is desirable for the direct procedure descriptor to be the first word in the block of local storage.
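To make the layout in steps 1-4 concrete, here is a C picture of the block owned by the enclosing context; the struct, the field names, and the NestedBody stand-in are invented for illustration, not part of the proposal.

    typedef unsigned int Word;

    typedef struct NestedEnv {
        Word directLink;   /* word 0: direct control link for the nested procedure, */
                           /* i.e. its byte PC with Dragon bit 0 (the msb) set      */
        Word sharedA;      /* locals of the enclosing context that the nested       */
        Word sharedB;      /* procedure needs to reference                          */
    } NestedEnv;

    /* The enclosing context allocates the block, fills in directLink and the
     * shared locals, and hands out a pointer to word 0 as the "indirect
     * procedure descriptor".  An SFC through that pointer leaves it on the
     * stack, so the nested procedure can address the shared locals at small
     * positive displacements from it, as in this stand-in for its body: */
    Word NestedBody(NestedEnv *envLink)    /* envLink = the indirect descriptor */
    {
        return envLink->sharedA + envLink->sharedB;
    }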
10.20. Interface Function Calls

The Mesa compiler naturally produces interface records containing procedure descriptors for procedures in an interface. Before binding, procedure calls are represented by a pointer to one of these interface records and an index to one of the procedure descriptors in the interface. After binding, 16-bit Mesa procedure calls are Xfers which use an involuted, space-saving, time-wasteful representation discussed in the "Mesa Processor Principles of Operation", while Dragon DFCs point directly at the target procedure.

The Global Interface Function Call (IFC) proposed for Dragon works as follows: The caller's global frame holds a pointer to an interface record which contains procedure descriptors. IFC references the jth procedure descriptor in the interface record whose base is at (Global+i)^.

The first interesting property of an interface record is that all callers of a procedure will use the same entry in the same interface record, so both initial binding and subsequent rebinding of that procedure to a new implementation are possible by modifying that one place.

The second interesting property suggested by Taft is that the interface records are ordinary Cedar objects, allocatable and collectable in the usual way.

Dragon IFC will not be an opcode. The proposed code for an interface function call is as follows, where n is the number of the local register pointing at the global frame, i the global index of the pointer to the interface record, and j the interface record index of the 32-bit procedure descriptor:

    RDI         ; Push (LRn + i)^ in the first two cycles.
                ; Then replace [S] by ([S] + j)^ in the second two cycles
    SFC         ; Stack Function call

The above 4-byte, 9-cycle sequence is the same size as a DFC, which eases interchange of IFC and DFC at binding time.

Lampson has pointed out that when a procedure call appears inside a loop, the compiler could save the result of the RDI above in some local, and then repeat the call with:

    LRn
    SFC

inside the loop.

Ideally, compilation could use IFC's exclusively, and conversion to faster DFC's could be deferred until binding time or later; preserving flexibility in this way is desirable. However, IFC requires a global frame pointer to exist in the local environment, and this may be the only reason for a global pointer in that procedure, so a register and a load of G may sometimes be wasted after converting to DFC. Consequently, postponing the decision to use DFC until binding time or later has this cost associated with it.

Indirect xfers in 16-bit Mesa (with the MDS restriction removed) could achieve almost the same effect as IFC's proposed for Dragon. An indirect procedure descriptor (either in the global frame or at some CODE-relative location) could point into an interface record at a direct procedure descriptor for the callee. Although superficially like the Dragon proposal, interface records could not be ordinary Cedar objects because pointers are not all to the base of an interface record.
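Here is an illustrative C rendering of the IFC path: the global frame holds a pointer to an interface record, which is simply a vector of 32-bit procedure descriptors. The struct and names are invented for the sketch.

    typedef unsigned int ProcDesc;          /* 32-bit direct procedure descriptor */

    typedef struct InterfaceRecord {
        ProcDesc procs[1];                  /* one descriptor per interface item; */
                                            /* the record really extends past 1   */
    } InterfaceRecord;

    /* What RDI followed by SFC computes: fetch the interface-record pointer at
     * global-frame offset i, fetch descriptor j from that record, and call
     * through it.  The global frame is modeled as an array of pointers here. */
    ProcDesc InterfaceCall(void *globalFrame[], int i, int j)
    {
        InterfaceRecord *rec = (InterfaceRecord *)globalFrame[i];   /* (Global+i)^ */
        return rec->procs[j];                                       /* ([S]+j)^    */
        /* SFC then transfers control through the returned descriptor; rebinding
         * the interface means overwriting rec->procs[j], the one place that all
         * callers of that interface item share. */
    }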
10.21. Finding the Procedure Associated With a Frame

As discussed in the "More Than 16d Local Variables" section, the Collector doesn't have to pair extra local storage with the overflow frame that owns it--it must be able to enumerate all the overflow frames and extra local storage, but not in any particular order.

However, the debugger must pair each overflow frame with its extra local storage and must find variable names for each procedure. To do this on Dragon, the debugger may use only the PC in the overflow frame. However, this need not be done efficiently, and data structures left behind by the compiler can be used. Presumably, some sort of data structure which gives the PC range for each procedure can be searched.

Lisp must be able to pair variable names with frame positions efficiently during free variable searches. One possible implementation is to put a list of pointers to variable names in the code segment somewhere and push a pointer to the list of names as, for example, the first argument to each procedure. Because it adds three bytes to each procedure call, this is an unattractive implementation. Doing this at entry to the procedure is more compact, if there are several callers, but slower (because the pointer has to be exchanged with LR0). Free variable searches are discussed later.

Finally, Mesa catch phrases and Lisp Errorsets (or whatever) must be efficiently located. Here the return PC saved in the overflow frame is a suitable handle, as discussed in the next section.

Lampson's memo proposed that the "first PC" of a procedure be saved by hardware just as the return PC and RL are saved. This would ease finding information associated with a procedure. In addition, if two procedures shared the same code tail, then they could be distinguished. However, this feature requires substantial extra hardware and was rejected by Petit as unimplementable. In addition, it slows down frame underflow and frame overflow service and increases the storage requirement for each overflow frame.

10.22. Mesa Signals

TBC

10.23. Lisp Free Variable Searches

TBC

10.24. Why Procedure Calls Are Fast and Exceptions

In summary, hierarchical Dragon control transfers will be much faster than on 16-bit machines for the following reasons:

1) The direct function call avoids indirections. 16-bit Mesa indirects through the Global Frame Table, loads the code base, and indirects through the code segment entry vector--none of these are done on Dragon direct calls.
Even Dragon interface function calls with two indirections aresimpler than the complicated unraveling required for 16-bit Mesa calls.2) The Dragon IFU will automatically track the new PC on a DFC, but this cannot be donewith 16-bit Mesa Xfers.3) Global and Local frame pointers need never be restored on a procedure return because theseare ring-buffer registers that appear automatically in the restored context.4) A caller's evaluation stack is preserved across procedure calls, and the stack depth limit couldbe made larger than on present machines.5) Procedure arguments are usually not moved or stored at procedure entry; they wind up asthe first local registers in the new context without executing additional opcodes, and thecompiler will usually have no reason to move them elsewhere.6) A pointer to a local frame extension is frequently not required, avoiding a storage allocateand deallocate.7) The global frame will frequently go unreferenced, and loading the global frame pointer cansometimes be avoided altogether in this case.Unfortunately, non-hierarchical control transfers (e.g., process switches) will be inefficient because everyframe in the ring buffer will have to be flushed to storage before such a transfer and the new contextreloaded from storage. For this reason, such transfers should remain infrequent.Also, two classes of pathological control sequences can cause thrashing of the register frames to-and-fromstorage. In the simplest case, the first happens as follows:1) One procedure calls another over-and-over in a tight loop;2) The caller has a lot of registers;3) The sum of registers required by the caller and the callee exceeds the number in the ringbuffer;4) The execution time of the callee is small.In this case, the caller's registers are repeatedly written into storage on the call and read back on thereturn. This cannot happen on Dragon because the hardware is limited to at most 16 local registers perframe and the compiler by convention will limit the stack depth to a moderate value, say 16; so a calldepth of at least 4 is guaranteed before the original register frame is forced out. Even without ahardware limit of 16 local registers, the compiler could limit a frame to 1/2, 1/3, or less of registers inthe ring buffer, so that a moderate call depth would be required to cause the pathological event.The other pathological sequence is a very deep sequence of nested calls followed by consecutive returns.For example, suppose that 1000 nested calls are made followed by 1000 returns; after the first few calls, fp!q4] Ffp b# ^q Yp3 V!qM TVxR"ExPWXxN&:xLGxJOxHxF9$xDLxB!Bx@(x>6$x<$6x:<x8Hx6x4&7x2- /J" -;+ +Q (W &=x$=x"P%x  OxQx- ;. g S9- (; h C h @) pB\Dragon DocumentEdward R. Fiala26 August 1983111subsequent calls regularly cause frame overflow; at the bottom, the first few returns find the caller still inregisters, but then the other returns all cause frame underflow. We are confident that this type ofpathological sequence is statistically infrequent.10.25. Non-Hierarchical Control Transfers & Process SwitchingMesa source constructs include the Fork and Join operations, port transfers, fault notification, and theStart and Stop operations. In addition, Howard Sturgis and others have implemented other controltransfer mechanisms using 16-bit Mesa machine operations directly--these are not included in the Mesasource language. 
Each of these operations uses a non-hierarchical control transfer and is much lessfrequent than a call or return.On Dragon, non-hierarchical transfers will all be reduced to a basic context switch operation with otherdetails being taken care of by software. The general idea is that all frames presently in the registers willbe overflowed to storage, just as though frame overflow had occurred repeatedly. The final overflowframe includes the arguments, parameters, or results being passed with the context switch. Software thencreates or restores a different context and, if arguments/parameters/results are involved, moves themfrom the previous context's final overflow frame into the new context. There are a number of ways tomove the arguments (etc.) which should be thought about carefully to find the most efficient and generalmethod.Satterthwaite has informed me that the Start and Stop operations are obsolete, so we can deimplementthem.Process notification is basically a one-bit "wakeup" message directed to some process, and it causes theprocess to resume execution in its current context--no parameters are passed with such a message.Fault notification is used on page, write protect, and frame faults in 16-bit Mesa. What happens forthese is as follows: First, the state of the faulted process is saved in a state vector; this state includesfault parameters (e.g., the VP of a page fault) in addition to other state. Secondly, the process isrequeued onto a fault service queue. Thirdly, the process responsible for servicing the kind of fault isnotified.Dragon may want to use fault notification as well because it has good properties on a multiprocessor. Amultiprocessor must be scheduled; if fault notification is used for abnormal events, then only the processscheduling is subject to multiprocessor worries because a single process can be responsible for the diskqueue (or whatever).An alternative is to treat a fault like a trap; in other words, insert a fault-handling procedure into theflow of control of the process which faulted. However, to do this, the handler of the page fault wouldhave to ensure that no page faults or local storage faults occurred, which implies using resident overflowframes and local storage during the critical code. In addition, if traps were used for abnormal events,then the disk driver (or whatever) could potentially be entered by more than one processor at-a-time, somultiprocessor safety would become an issue in a larger body of code.Fork[procedure descriptor, args] is a way of forking a new process; this operation returns a "processhandle" as its result. The process executing the Fork becomes "master", the newly forked process,"slave". The slave is started as for an ordinary procedure call; however, a top level return from the slavegoes to the scheduler, which will capture its results and deactivate it. The master can at any time either fp!q4] Ffp bq^ `SQ ^2 Yop> Uq08 T3G Rhe P/5 N Ka] IZ GF FA( D7'> BlF @I > ;eN 9 6(44 4^6+ 06/ /!Y -VN +i ) &OY $b "'A  }H" I +? 35 SL E Z LL f O pA\3Dragon DocumentEdward R. Fiala26 August 1983112"Detach" or "Join" the slave by specifying its process handle. Whether executed before or after theslave has done a top level return, the results of Detach or Join are the same. If a Join happens beforethe slave returns, then the master will be suspended until the slave finishes; if the join happens after theslave has returned, then the master will continue on. 
In either case, Join returns the results left by theslave's top level return to the master and destroys the slave. Similarly, Detach tells the scheduler todestroy the slave after a top level return, and the Master continues execution.The basic underlying mechanism needed to implement the above control transfers is the process switch.Fork, Join, and Fault Notification must also pass arguments, results, and fault parameters, respectively,but the mechanisms can be slow because they are infrequent....TBC.Port[args] returns [results]a. Register ports with collector -or- transfer via process descriptor.b. Start up mechanism looks like CoFork[Port,Proc,args] returns [Port]?c. Shutdown mechanism...TBC10.26. Code Traps16-bit Mesa causes a CodeTrap when the Code base is odd, and this occurs legitimately for a modulestart-trap. Since there is no code base on Dragon, any other reasons for the trap are meaningless. It istentatively proposed to eliminate start traps on Dragon. Instead, the system would explicitly initializesome modules (as early system modules are initialized in 16-bit Cedar); for other modules, a startprocedure could be defined; it would be called by the loader whenever another global frame instantiationoccurred for the module.10.27. Unbound TrapsThis is an illegal condition that occurs on Dragon when the PC of a context being returned to is 0. Weplan to detect this event when a page fault on page 0 is serviced; no special hardware support is needed.The context which caused the trap will be irretrievably lost when it occurs, but we accept this.10.28. Control TrapsThis illegal condition occurs when the Hook of a frame transfer is zero; it could happen on a SFC orRET. Like UnboundTrap, it will be detected when a page fault on page 0 occurs. fp!q4] Ffp bqE `Sa ^b \1: Z&B Y)O U6/ S ^ R"B N LF KG IP E @p =Sq8* ;+? 9M 7G 6(h 4^ /Dp +qc *54 (=)7 #$p qH O BJkDragon DocumentEdward R. Fiala26 August 198311310.29. XferTrapThis feature will be deimplemented on Dragon. The "Trap on return" feature may substitute for it.10.30. Retained FramesTBC10.31. Local Storage Tracing for the Garbage CollectorSo far as the Collector is concerned, storage is divided into three groups: counted, uncounted, and local.Counted storage is characterized by the property that all object pointers are to the base of an object, andthe type of an object can be readily determined from a pointer to its base. This means that countedstorage can be traced and marked perfectly by the Collector because all pointers within counted storagecan be enumerated.Uncounted storage is not traced or marked by the Collector.Local storage may contain pointers to counted storage (i.e., it is traced but not marked), but the Collectorcannot determine which words are such pointers. For this reason, it uses a "conservative" markingalgorithm which will be discussed later. The essential requirement for this algorithm is that the Collectormust be able to enumerate all local storage (both frames and extra storage) exactly, even though it doesnot know whether particular words are pointers or not.It is easy to enumerate all the overflow frames. The potential problem with this requirement is that theCollector must find each local storage pointer in each overflow frame to enumerate the extra localstorage. Because we have proposed no mechanism for identifying extra storage pointers, this wouldn't beeasy to do. 
However, the conservative scan discussed above does not require that each block of localstorage be paired with the overflow frame that points at it--only that the blocks all be enumeratedsomehow. Enumeration can be made possible by allocating local storage for Cedar processes from aseparate zone(s) using an allocator which allows all allocated blocks to be enumerated efficiently.The point here is that it is unnecessary to pair each overflow frame with its extra local storage for theCollector. The debugger must make this pairing, but its performance requirements are not critical, so itcan match the PC against complicated data structures left by the binder.A further reduction in OnStack marking time is possible by excluding from local storage those processeswhich cannot under any circumstances have Refs in local storage. To achieve this reduction, differentlocal storage "zones" have to be used for Cedar and non-Cedar processes. For example, the page faultprocesses might fulfill this requirement.However, if proposals are advanced to type or partially type local storage, it might become necessary topair an overflow frame with the block of extra local storage it points at...... fp!q4] Ffp b ^qZ Yp V!q Qp7 MqK J#I" HYQ Fg D AR; =l <b :K] 8h 66 3C54 1y6, /h -D! ,#@ *N*7 (H %M #G"G !}H  ^ @7/ u X ) 9e nO 'ATDragon DocumentEdward R. Fiala26 August 198311411. Process Machinery11.1. Process StateIn 16-bit Mesa, most non-running processes have a "state" limited to the 64-bit PSB and a separateTIMEOUT word logically in the PSB but in a separate table for implementation reasons. The four 16-bit items in the PSB are called the LINK, FLAGS, CONTEXT, and MDS. PSBs are in a 128k-byteregion of VM called the PDA, and they are resident in storage. However, processes terminated by afault or interrupt, as opposed to a minimal-stack process opcode, also save an evaluation stack. Ofcourse, the total state of a process is much larger than this, but only the PSB and saved evaluation stackare managed by the process machinery itself.In other words, the "state" of a process in 16-bit Mesa consists, first, of the LINK, FLAG, CONTEXT,and TIMEOUT words needed by the process machinery itself. Secondly, it includes the high part of theMDS register. Finally, it includes the evaluation stack on involuntary process switches only. TheSTICKY register is erroneously not part of the process state, and is being added in the Klamath releasefor 16-bit Mesa.For Dragon, the CONTEXT field must grow from 16 to 32 bits in size, so that local frames can beanywhere, but the MDS register is eliminated from the process state. STICKY is added; the Carryflipflop, integer overflow Mode flipflop, and the CST flipflop should be absorbed into STICKY orelsewhere. ICAND is, so far, used only by multiply trap opcodes; if page and stack overflow faults areeliminated during the trap procedure, and if other sources of process switching can be disallowed, thenICAND need not appear in the process state. Q is used more widely than ICAND (both arithmetic andfield opcodes), so it will probably have to appear in the process state.In other words, a Dragon process's state consists of information needed by the process scheduler itselfplus the values of the Q and STICKY registers. 
The information needed by the process machinery should appear in the PSB for the process because that will be resident; the values of Q and STICKY are logically an adjunct to the final overflow frame, unless it is easier to add them to the PSB.

Because Dragon overflows all frames to storage before a process switch, and because the current procedure's evaluation stack is absorbed in the overflow frame, Dragon has no separate evaluation stack as part of the process state. We can choose to make the basic process operations be "minimal register", which would mean that Q and ICAND are not preserved by those opcodes. Q would then be saved on involuntary process switches, but not on voluntary process switches.

After all of this, we wind up with the following fields in the PSB:

STICKY
  Bit(s)  Value  Meaning
  0       0      'OR' 1 into InexactResult on every inexact result.
          1      Trap on any inexact result.
  1..2    0      Trap on any denormalized result (user may be interested in loss of precision).
          1      Substitute 0 on underflow (non-IEEE).
          2      Gradually denormalize on underflow.
          3      --
  3       0      Projective infinity (only one unsigned infinity; compare of anything with infinity traps; not sure what other operations are supposed to do).
          1      Affine infinity (+ infinity and - infinity both defined).
  4..5    0      Round to nearest (unbiased; round even if halfway).
          1      Round toward 0 (truncate).
          2      Round toward plus infinity.
          3      Round toward minus infinity.
  6       0      Trap if denormalized args are supplied.
          1      Normalize the arguments and then use them.
  7       0      Trap on invalid operations (compare of projective infinity, not-a-number as an argument).
          1      Result is the infinity or not-a-number.
  8       0      Trap on overflow of Fix or Round operation.
          1      Return low-order 32 bits of the result.
  9       0      Trap on divide by zero.
          1      Stuff in not-a-number on divide-by-zero and continue.
  10      0      Trap on arithmetic overflow or underflow.
          1      Truncate result to low-order 32 bits and continue on arithmetic overflow or underflow.
  11..15  --     undefined
  16      1      One or more floating point inexact results have occurred (i.e., rounding has taken place).
  17      1      Denormalized arguments were supplied.
  18      1      Invalid operations occurred.
  19      1      Fix/Round overflow occurred.
  20      1      Divide by zero occurred.
  21      1      Denormalized result(s) occurred.
  22      1      Arithmetic overflow occurred.
  23      1      Arithmetic underflow occurred.
  24..29  --     undefined
  30      1      Integer overflow mode (0=regular, 1=31-bit mode).
  31      1      Carry.

STICKY will be copied between the PSB and an AR during process switching. Unused bits in STICKY can be manipulated by the user using ordinary AR operations.

Although the 16-bit Mesa limit of 1024 processes does not seem to be of near term concern, it is only seven times larger than the 150d processes presently in use. If at some point we are able to build 50-processor Dragon systems, there may be an explosion of demand for processes, so we propose to increase this limit to 8,096 processes.

TBC

11.2. Process Opcodes, Multi-Processor Provisions, Interrupts

The current Mesa system is not arranged to run multiple processors, and a lot of work will be needed to make this possible. Ultimately, we would like to be able to utilize multiple processors through Mesa fork and join primitives without incurring overhead that is substantially larger than an Xfer.

Here are some of the problems which we have identified:

1) Wherever the DI (Disable Interrupts) and EI (Enable Interrupts) opcodes appear, the software will have to be changed.
In these cases, disabling interrupts for the current processor will not prevent another concurrently running processor from doing whatever bad action was feared. A few places where DI and EI are used are the following: map operations, ?

2) Frame allocation by Xfer will have to be made safe.

3) The process data structures will have to be made safe.

4) Any place which currently wakes up a higher priority process and assumes its completion before continuing will have to do explicit monitor waits to make sure that the higher priority process is really finished (??)

TBC

13. Dragon Opcode Summary

13.1. Auxiliary Register Usage

The following are possible ways the ARs might be used. Any unused ARs could be used for additional constants, though these would only be accessible to the RR opcodes.

HOOK      Pointer to previous overflow frame for oldest ring frame.
STICKY    Floating point state, trap enables, flags.
FSITAB    Pointer to FSI table (or whatever) for the ALLOC and FREE XOP opcodes.
PSBBASE   Base register for PSB data structure.
PSBX      PSB index of currently running process.
STVEC     Pointer to state vector reserved for process switch in the event of page fault, write protect fault, or interrupt.
?         temps for frame overflow, page faults, and interrupts.

13.2. Selected Constants

The 12d constants can be loaded by an inconvenient path during initialization. Currently proposed values for these and other values being considered are given below.

100000b,,0   Used in JUxx if no unsigned conditional jumps; used to bounds-check cardinals before conversion to integers.
-1           Used in LI-1; permits NOT with RXOR.
0            Used in NEG, INC, DEC, LI0; permits negation with RSUB; used by many RADD/RSUB sequences.
1            Used for incrementing registers; for LI1.
2            Used for LI2; common increment.
100000b?     Used when extending sign of 16-bit integers or in bounds-checking a result that must be a 16-bit integer.
200000b?     Used in NC16, NI16 (Narrow 32-bit integer to 16-bit cardinal or integer).
40000b,,0?   Used in small integer packing and unpacking for Lisp (140000b,,0 is an alternative value).
-2           Used for reading second overhead word with RFX.
3?           Not used in any examples.

13.3. Shift Descriptors

The format of ab in SH, DSH, and IF is as follows:

a[0..1]   Op (0 extract/cycle; 1 shift; 2 insert; 3 right-cycle).
a[2..7]   Signed shift; positive is left, negative is right.
b[0..1]   unused
b[2..2]   0 selects ab control; 1 selects Q-register control.
b[3..7]   Mask specifier if b[2..2] .eq. 0.

Mask .eq. 0 selects all bits. When Q-register control is selected the signed shift count is taken from Q[18..23] and the mask specifier is in Q[27..31].

13.4. Current Opcodes

No.  No. EU  Name  Bytes  Opc  Time  Description

NOP0110(X)   No OPeration 0 cycle. For padding
NOP1111(X)   No OPeration 1 cycle. To provide delay for debugging and getting out of certain timing problems.
?BRK13?(X)   BReaKpoint. 3 BRK opcodes are provided, one for each non-jump opcode length.
How will this work?Jump displacements are always specified relative to the first opcode byte, denoted by "."; the a or abdisplacement for a jump is always a signed byte displacement.Jn160(X)Jump .+n (n=2, 3, 4, 5, 6, 7). J2 and J3 are implemented as two-byteand three-byte, 0-cycle NOPs of the appropriate length for speed.JB210(X)Jump .+aJDB310(X)Jump .+ab (ab is sign-extended)DJ440(X)Direct Jump. Jumps to byte 1 in the word specified by abg and 2opcode bits. This opcode is intended only for jumping to the byte after aprocedure's EB, as discussed in the "Context Switching" chapter; becausebytes 0, 2, and 3 in a word cannot be reached, the opcode is not moregenerally useful.SJ115+(S)Stack Jump. Jump to the byte address specified in [S][4..31] and then S _S-1. The opcode traps prior to execution if [S][1..3] .ne. 0, as discussed inthe "Context Switching" chapter.Conditional jumps use one opcode bit to predict jump or no-jump. The jump range in a is .-128 to .+127.JEBB311(S)Jump Equal Byte Byte. Jump if [S] .eq. b; S _ S-1; predict no jump.JEBBJ311(S)Jump Equal Byte Byte Jump. Jump if [S] .eq. b; S _ S-1; predict jump.JNEBB311(S)Jump Not Equal Byte Byte. Jump if [S] .ne. b; S _ S-1; predict nojump.JNEBBJ311(S)Jump Not Equal Byte Byte Jump. Jump if [S] .ne. b; S _ S-1; predictjump.The register jumps use an opcode bit to predict the jump direction. b[1..1] .eq. 0 selects [S] for the left sideof the comparison; b[1..1] .eq. 1 selects [S-1] for the left side and pops the stack once. If b[0..0] .eq. 1, thestack is popped once more. b[2..7] select any LR, AR, constant, or stack option for the right side of thecomparison. In RJGEB, for example, the meaning is "jump if left-side .ge. right-side."RJGEB311(B2)Register Jump Greater Equal Byte; predict no jump.RJGEBJ311(B2)Register Jump Greater Equal Byte Jump; predict jump.RJLB311(B2)Register Jump Less Byte; predict no jump.RJLBJ311(B2)Register Jump Less Byte Jump; predict jump.RJEB311(B2)Register Jump Equal Byte; predict no jump.RJEBJ311(B2)Register Jump Equal Byte Jump; predict jump.RJNEB311(B2)Register Jump Not Equal Byte; predict no jump.RJNEBJ311(B2)Register Jump Not Equal Byte Jump; predict jump.RJLEB311(B2)Register Jump Less Equal Byte; predict no jump.RJLEBJ311(B2)Register Jump Less Equal Byte Jump; predict jump.RJGB311(B2)Register Jump Greater Byte; predict no jump.RJGBJ311(B2)Register Jump Greater Byte Jump; predict jump.These opcodes are discussed in the "Context Switching" chapter.LFC310(X)Local Function Call.DFC440(X)Direct Function Call. fp!q4] Ffp `Sqn/Q% ^n/Q8Q\ Zn/Q$QY)# U'8tqt Sq= R"n/Q =QPWA Nn/Q t Lqn/Q tqtq Jn/Q+tqQI-= QGb@QE*QC Bn/Q2Q@7KQ>m :4tq 90n/Q+tq 7fn/Q# tq 5n/QtqQ3 2n/Q4tq Q0; ,Etq& *tq6tq )4 tqM 'iW %n/Q6 #n/Q8 " n/Q- ?n/Q/ tn/Q. n/Q0 n/Q2 n/Q4 Jn/Q3 n/Q5 n/Q0 n/Q2 x? n/Q n/Q  Bt]oDragon DocumentEdward R. Fiala26 August 1983133RET210(X)Return.SFC115+(S)Stack Function Call.The Load Immediate opcodes push a constant specified in the opcode onto the stack.LIn151(S)Load Immediate n (n = 20000000000b, -1, 0, 1, 2)LIB211(S)Load Immediate Byte. Pushes a.LINB211(S)Load Immediate Negative Byte. Pushes 37777777400b+a.LIDB311(S)Load Immediate Double Byte. Pushes ab.LILDB311(S)Load Immediate Left Double Byte. Pushes ab,,0.ADDB211(S)ADD Byte. [S] _ [S]+a; trap on integer out-of-range; Carry unchanged.ADDNB211(S)ADD Negative Byte. [S] _ [S]+(37777777400b+a); trap on integer out-of-range; Carry unchanged.ADDDB311(S)ADD Double Byte. [S] _ [S]+ab; trap on integer out-of-range; Carryunchanged.RB211(S)Read Byte. 
[S] _ ([S]+a)^.RSB211(S)Read Save Byte. S _ S+1; [S] _ ([S-1]+a)^.WB211(S)Write Byte. ([S]+a)^ _ [S-1]; S _ S-2.WSB211(S)Write Swapped Byte. ([S-1]+a)^ _ [S]; S _ S-2.PSB211(S)Put Swapped Byte. ([S-1]+a)^ _ [S]; S _ S-1.RFX311+(RR)Register Fetch indeXed. [Rc] _ ([Ra]+[Rb])^.RSI311+(RR)Register Store IndeXed. ([Rc] _ [Ra]+1)^ _ [Rb].RSD311+(RR)Register Store Decrement. ([Rc] _ [Ra]-1)^ _ [Rb].RRI311+(A2)Read Register Indirect. [RL+n] _ ([RL+m]+b)^.WRI311+(A2)Write Register Indirect. ([RL+m]+b)^ _ [RL+n].RAI311+(A2)Read Auxiliary Indirect. [RL+n] _ ([RA+m]+b)^.WAI311+(A2)Write Auxiliary Indirect. ([RA+m]+b)^ _ [RL+n].RDI313+(A2)Read Double Indirect. In the first microinstruction (which takes at least2 cycles), push ([RL+a[0..3]]+a[4..7])^; in the second microinstruction, [S]_ ([S]+b)^.LRIn2161+(O)Load Register Indirect n. Push (LRn+a)^.SRIn2161+(O)Store Register Indirect n. Pop into (LRn+a)^.?PRL311+?(S)Pc Relative Load. Push the 32-bit quantity [(. + ab) rsh 2]^ (ab is sign-extended); intended for loading the global frame pointer.?FPC111(S)Fetch PC. Push the value of "." for the first byte of the FPC opcode.REC110(X)RECover. S _ S+1.DIS110(X)DIScard. S _ S-1.AS210(X)Adjust Stack. S _ S + signed a.LRn1161(O)Load Register n. Push LRn.SRn1161(O)Store Register n. Pop into LRn.RMOV221(A2)Register Move. LRn _ LRm. Saves 1 cycle if the source register wasjust loaded by a fetch. fp!q4] Ffp bqn/Q `Sn/Q \R [n/Q3 YLn/Qtq Wn/Q6tq Un/Q'tq Sn/Q,tq Pzn/Qtq/ Nn/Q/tqQL Kn/Qtq%QIP En/Qtq Dn/Q*tq BIn/Qtq @~n/Qtq >n/Qtq ;An/Q1 9wn/Q5 7n/Q7 4:n/Q.tq 2pn/Q&tq 0n/Q/tq .n/Q'tq -n/Q''Q+Etqtq !Q){tq & n/Q(tq $>n/Q-tq n/Qtq tqQ9 7n/Q= n/Q n/Q 0n/Q!tq n/Q n/Q# )n/Q5Q ^f BtZDragon DocumentEdward R. Fiala26 August 1983134DUP111(S)DUPlicate. S _ S+1; [S] _ [S-1].EXDIS111(S)EXchange DIScard. [S-1] _ [S]; S _ S-1.EXCH112(S)EXCHange. Q _ [S-1] and [S-1] _ [S] in the first cycle; [S] _ Q in thesecond cycle.AND111(S)logical AND. [S-1] _ [S-1] & [S]; S _ S-1.OR111(S)logical inclusive OR. [S-1] _ [S-1] U [S]; S _ S-1.XOR111(S)logical XOR. [S-1] _ [S-1] xor [S]; S _ S-1.ADD111(S)integer ADD. [S-1] _ [S-1] + [S]; trap on integer out-of-range; Carryunchanged; S _ S-1.SUB111(S)integer SUBtract. [S-1] _ [S-1] - [S]; trap on integer out-of-range; Carryunchanged; S _ S-1.BNDCK111(S)BouNDs ChecK. Trap if [S-1] >= [S] using an unsigned compare; S _S-1.NILCK111(S)NIL ChecK. Trap if [S] = 0; S unchanged.DSH311(S)Double SHift. Ra is [S]; Rb is [S-1]; ab is a general shift descriptor; [S-1] _ Shifter output; S _ S-1.SH311(S)SHift. Ra and Rb are both [S]; ab is a general shift descriptor; [S] _Shifter output; S unchanged. This opcode is equivalent to DUP, DSH.QSH211(S)Q SHift. Ra and Rb are both [S]; a is the left-half of a general shiftdescriptor in which only the Op field is significant; the control data comesfrom Q; [S] _ shifter output; S unchanged.IF312(S)Insert Field (2 cycles). The right-justified value in [S-1] is inserted intothe selected field of [S]; the result is left at [S-1] and S _ S-1. ab is ageneral shift descriptor. During C0, [S] is right-cycled to right-justify theselected field, and the right-justified value being inserted in [S-1] replacesthose bits; the result is left in [S-1], and S _ S-1. During C1, [S] is cycledback into its original position.SPA211?(S)Setup Packed Array. [S] is the array index N and a is a function of theelement size 2m; a should contain 0, 1, 2, 3, 4 corresponding to elementsizes of 1, 2, 4, 8, 16d bits, respectively. 
RAND  3 1 1  (RR)   Register AND. Rc _ Ra & Rb.
ROR  3 1 1  (RR)   Register inclusive OR. Rc _ Ra % Rb.
RXOR  3 1 1  (RR)   Register XOR. Rc _ Ra xor Rb.
RADD  3 1 1  (RR)   Register ADD. Rc _ Ra+Rb+Carry; Carry _ 0; trap on integer out-of-range.
RUADD  3 1 1  (RR)   Register Unsigned ADD. Rc _ Ra+Rb+Carry; Carry _ carry-out; no trap.
RVADD  3 1 1  (RR)   Register Vanilla ADD. Rc _ Ra+Rb; Carry unchanged; no trap.
RSUB  3 1 1  (RR)   Register SUBtract. Rc _ Ra-Rb-1+Carry'; Carry _ 0; trap on integer out-of-range. Other ways of writing the subtract operation are Ra-Rb-1+Carry' = Ra-Rb-Carry = Ra+Rb'+Carry'.
RUSUB  3 1 1  (RR)   Register Unsigned SUBtract. Rc _ Ra-Rb-1+Carry'; Carry _ carry-out'; no trap.
RVSUB  3 1 1  (RR)   Register Vanilla SUBtract. Rc _ Ra-Rb; Carry unchanged; no trap.
RMSTEP  3 1 1  (RR)   Register unsigned Multiply STEP. ??Presently unspecified.
RDSTEP  3 1 1  (RR)   Register unsigned Divide STEP. ??Presently unspecified.
CST  3 1 5+  (RR)   [S] is a pointer to the word to be changed; Ra has the old data and Rb is the new data. During the first microinstruction (which takes at least 3 cycles): push ([S-1])^ using fetch-and-hold. During the second microinstruction (1 cycle), compute Ra XOR [S]; S _ S-1. During the third cycle, iff the result of the XOR in the second microinstruction is 0: ([S])^ _ Rb; S _ S-1. The result of the operation (whether or not the store took place) is remembered in a flipflop testable by the JCST conditional jump. Ra and Rb must not include any stack options.
JCST  2 1 1   Jump on CST. Jump if the last CST executed did not store (i.e., jump if the CST failed). The prediction is always no-jump.
LIFUR  2 1 1  (B2)   Load IFU Register. S _ S+1; [S] _ data from an internal IFU register. a specifies the internal register, but the codes are not yet specified.
SIFUR  2 1 1  (B2)   Store IFU Register. Stores data in [S] into an internal IFU register; S _ S-1. a specifies the internal register, but the codes are not yet specified.
REUR  2 1 1   Read EU Register. S _ S+1; [S] _ EU register a. a specifies the internal register, but the codes are not yet specified.
EUSF  2 1 1   EU Special Function. Sends a to the EU as the SPC FUN field. This in general causes data from the stack to be stored in some internal register (i.e., Q, ICAND, or MODE), or some flipflop to be cleared, etc. S _ S-1. Codes for a are not yet specified.
MAP  1 1 7+   MAP OPeration. [S][1..23] holds the VP; [S][24..26] selects the M bus destination (0 for the map processor, 1 to 7 for other devices); [S][27..31] selects the operation to be performed. [S-1] holds a data argument for the operation. During C0 and C1, [S] and [S-1] are sent to the cache as the address and data, respectively, for a MapOp; the cache requests the M bus grant during C1 and receives the M bus grant no sooner than C2; then [S] and [S-1] are sent on the M bus as a DoMapOp command during C2 and C3. The earliest MapOpDone is C5, which would be passed to the processor and written into [S-1] in C6; then S _ S-1. Hence, minimum time for a MapOp is 7 cycles. The map processor issues at least one ReadQuad command for every DoMapOp, so it has a minimum time of 13 cycles.
?COPR  2 1 ?  (S)   COPRocessor operation. a[0..2] selects the coprocessor; a[3..7] selects the command to be executed. If no coprocessor acknowledges, a is pushed and the COPR trap occurs with the PC advanced, so that return from the trap goes to the next opcode. More details are in the "Coprocessors" section of the "Hardware Overview" chapter. Timing on the trap is 6 cycles?
?COPRL  3 1 ?  (S)   COPRocessor operation long. a[0..2] selects the coprocessor and a[3..7], the command to be executed; b is an argument to the command. If no coprocessor acknowledges, ab is pushed and the COPRL trap occurs with the PC advanced so that return from the trap goes to the next opcode. Timing on the trap is 6 cycles?
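The CST/JCST pair above amounts to a compare-and-swap followed by a conditional jump on its success flipflop. The following C sketch gives the equivalent sequential semantics; the real opcode depends on the cache's fetch-and-hold for indivisibility, and the names are mine, not Dragon's.

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the CST/JCST semantics above, written sequentially; the real
       opcode relies on the cache's fetch-and-hold to make the sequence
       indivisible.  Names are mine, not Dragon's. */
    static bool cst_failed;   /* the flipflop JCST tests: true means the store did not happen */

    static void cst(uint32_t *word, uint32_t old_data, uint32_t new_data) {
        uint32_t fetched = *word;            /* push ([S-1])^ with fetch-and-hold */
        if ((fetched ^ old_data) == 0) {     /* second microinstruction: Ra XOR [S] */
            *word = new_data;                /* third cycle: store Rb iff the XOR was 0 */
            cst_failed = false;
        } else {
            cst_failed = true;               /* JCST: "jump if the CST failed" */
        }
    }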
84 opcodes remain to be divided among these XOPs.

XOP1  1 ? 0  (S)   eXtended OPeration 1 byte. Traps to a location determined from the opcode with the PC advanced. Equivalent to a DFC to a trap location.
XOP2  2 ? 0  (S)   eXtended OPeration 2 bytes. Pushes a then traps.
?XOP3  3 ? 0  (S)   eXtended OPeration 3 bytes. Pushes ab and then traps.

TRAP AND COPROCESSOR OPERATIONS

No.  No. EU  Name  Bytes  Opc  Time  Description

ALS  2 1 T   Allocate Local Storage. a is the FSI. Same algorithms as 16-bit Mesa. Possible implementation is to keep pointer to FSI table in aux. reg.; push FSI+a; trap.
FLS  1 1 T   Free Local Storage. Uses the FSI in [[S]-1]^. Traps immediately.
WCB  2 1 T   Write Counted Byte. See "Cedar Opcodes". Pushes [S]+a and then traps.
PSCB  2 1 T   Put Swapped Counted Byte. See "Cedar Opcodes". Pushes [S-1]+a and then traps.
MULSC  1 1 9?   Multiply Short. Overflow if Highpart[[S]*[S-1]] is significant; [S-1] _ Lowpart[[S]*[S-1]], S _ S-1.
DIVSC  1 T   Divide Short. [S-1] _ quotient([S-1]/[S]); S _ S-1. No Divide if divisor is 0. Remainder left at -[S-1]? in Q?
REMSC  1 T   Remainder Short. No Divide trap if divisor is 0. [S-1] _ remainder([S-1]/[S]); S _ S-1.
(MULSBSCL  1 T   Multiply Signed Byte Short; Overflow if Highpart[[S]*a] is significant; [S] _ Lowpart[[S]*a]. (Saves 1 byte over LIB or LINB followed by MULS; this opcode is desirable only if it can be implemented faster than MULS.)
DADDX  1 ?   Double Add.
DSUBX  1 ?   Double Subtract.
DMULSC  1 ?   Double Multiply Short.
DDIVSC  1 ?   Double Divide Short.
DREMSC  1 ?   Double Remainder Short.
DMULC  1 ?   Double Multiply (Long).
DDIVC  1 ?   Double Divide (Long).
NADDX  1 ?   N-precision Add.
NSUBX  1 ?   N-precision Subtract.
NMULC  1 ?   N-precision Multiply.
NDIVC  1 ?   N-precision Divide. Gives both quotient and remainder.
NQUOC  1 ?   N-precision Quotient.
NREMC  1 ?   N-precision Remainder.
FADDC  1 ?   Floating Add.
FSUBC  1 ?   Floating Subtract.
FMULC  1 ?   Floating Multiply.
FDIVC  1 ?   Floating Divide.
FCOMPC  1 ?   Floating Compare.
FIXC  1 ?   Fix to Integer.
FLOATC  1 ?   Float.
ROUNDC  1 ?   Round to Integer.
FSQRTC  1 ?   Square Root.
FSCC  1 ?   Floating Scale. Order of arguments???
FREMC  1 ?   Floating Remainder.
DFADDC  1 ?   Double Floating Add.
DFSUBC  1 ?   Double Floating Subtract.
DFMULC  1 ?   Double Floating Multiply.
DFDIVC  1 ?   Double Floating Divide.
DFCOMPC  1 ?   Double Floating Compare.
DFIXC  1 ?   Double Fix. Fix to 64-bit Integer.
DFLOATC  1 ?   Double Float.
DROUNDC  1 ?   Double Round. Round to 64-bit Integer.
DFSQRTC  1 ?   Double Square Root.
DFSCC  1 ?   Double Floating Scale. Order of arguments???
DFREMC  1 ?   Double Floating Remainder.
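The overflow rule for MULS above ("Overflow if Highpart[[S]*[S-1]] is significant") is equivalent to asking whether the 64-bit product is anything other than the sign extension of its 32-bit low part. A rough C sketch of that test, in my formulation rather than the hardware's:

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the MULS overflow rule above: overflow means the high part of
       the 64-bit product is "significant", i.e. not merely the sign extension
       of the 32-bit low part.  The formulation is mine, not the hardware's. */
    static bool muls_overflows(int32_t x, int32_t y, int32_t *low) {
        int64_t product = (int64_t)x * (int64_t)y;
        *low = (int32_t)product;            /* Lowpart[x*y], what MULS leaves on the stack */
        return product != (int64_t)*low;    /* Highpart significant? then overflow */
    }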
These are compatibility opcodes for Levin and Lampson.

RWUX  1 X   Read Word Unaligned. [S] is a pointer to a 16-bit boundary; read the 32-bit quantity (which might cross a word boundary) beginning at the pointer.
WWUX  1 X   Write Word Unaligned. [S] is a pointer to a 16-bit boundary; write [S-1] into the 32 bits at the pointer; S _ S-2.
PSWUX  1 X   Put Swapped Word Unaligned. [S-1] is a pointer to a 16-bit boundary; write [S] into the 32 bits at the pointer; S _ S-1.
WSWUX  1 X   Write Swapped Word Unaligned. [S-1] is a pointer to a 16-bit boundary; write [S] into the 32 bits at the pointer; S _ S-2.
RDBUX  1 X   Read Double Byte Unaligned. [S] is a pointer to a 16-bit boundary; S _ S+1; [S] _ the 16-bit quantity.
WDBUX  1 X   Write Double Byte Unaligned. [S] is a pointer to a 16-bit boundary; write [S-1] into the 16 bits at the pointer; S _ S-2.
PSDBUX  1 X   Put Swapped Double Byte Unaligned. [S-1] is a pointer to a 16-bit boundary; write [S] into the 16 bits at the pointer; S _ S-1.
WSDBUX  1 X   Write Swapped Double Byte Unaligned. [S-1] is a pointer to a 16-bit boundary; write [S] into the 16 bits at the pointer; S _ S-2.
BTX  1 X   Block Transfer. [S] is integer word count, [S-1] is destination pointer, [S-2] is source pointer; S _ S-1. (E.S. prefers source pointer at bottom of stack.)
BTRX  1 X   Block Transfer Reversed. [S] is integer word count; S _ S-1.
BZX  1 X   Block Zero. [S] is integer word count, [S-1] is pointer; S _ S-2.
BCX  1 X   Block Compare the two N*32-bit twos-complement integers pointed at by [S-1] and [S-2], where [S] is the integer word count N; [S-2] _ -1 on less than, 0 on equal, or +1 on greater; S _ S-2.
DBBTX  1 ?   Double Byte Block Transfer.
DBBTRX  1 ?   Double Byte Block Transfer Reversed.
DBBEX  1 ?   Double Byte Block Equal.
BBTX  1 ?   Byte Block Transfer.
BBTRX  1 ?   Byte Block Transfer Reversed.
BBEX  1 ?   Byte Block Equal.
SEX  1 ?   String Equal ??
SCX  1 ?   String Compare.
MEX  1 ?   Monitor Enter.
MXX  1 ?   Monitor Exit.
MWX  1 ?   Monitor Wait.
MRX  1 ?   Monitor Reenter.
NCX  1 ?   Notify Condition.
BCX  1 ?   Broadcast Condition.
REQX  1 ?   Requeue.
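The unaligned opcodes above read or write a 32-bit (or 16-bit) quantity starting at an arbitrary 16-bit boundary, so a 32-bit access may straddle two words. A rough C sketch of how RWU could be assembled from aligned reads follows; the memory model, halfword ordering, and helper names are my assumptions, not Dragon's.

    #include <stdint.h>

    /* Sketch of the Read Word Unaligned idea above: a 32-bit read starting at
       a 16-bit boundary may straddle two 32-bit words, so it can be assembled
       from two aligned halfword reads.  The memory model, halfword ordering
       (bit 0 = most significant, so halfword 0 is the left half), and helper
       names are my assumptions, not Dragon's. */
    static uint16_t read_halfword(const uint32_t *memory, uint32_t halfword_addr) {
        uint32_t word = memory[halfword_addr >> 1];
        return (halfword_addr & 1) ? (uint16_t)word : (uint16_t)(word >> 16);
    }

    static uint32_t read_word_unaligned(const uint32_t *memory, uint32_t halfword_addr) {
        uint32_t high = read_halfword(memory, halfword_addr);      /* left 16 bits */
        uint32_t low  = read_halfword(memory, halfword_addr + 1);  /* right 16 bits */
        return (high << 16) | low;
    }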