CSL Notebook EntryTo:DoradoUsers, CedarImplementorsDate:January 19, 1981From:Ed TaftLocation:PARC/CSLSubject:Mesa opcode execution timesFile:[Ivy]80CSLN-YYYYon the DoradoXEROX Attributes:technical, Cedar, Dorado, MesaAbstract:This memo tabulates the execution time of all common Mesa opcodes on the Dorado.Possible future changes likely to affect these timings are also summarized.The execution time on the Dorado for all common Mesa opcodes is presented in the table below.This information is intended to be useful to compiler writers in deciding about code sequences to begenerated by the Mesa compiler for Cedar. It may also be of interest to others, e.g., in assessing therelative costs of operations on short versus long data values.The information here is for the PrincOps (Pilot, as opposed to Alto) implementation of the Mesainstruction set. At present, only Mesa 6.0 opcodes are described; this memo will be revised astiming information becomes available on new opcodes implemented for Cedar.Following the opcode table are some general comments on the timings, along with brief descriptionsof possible future changes that are likely to have important effects on performance.Opcode timingsThe timings are expressed in terms of microinstruction execution cycles; with current Dorado clocksettings, 1 cycle = 64 nanoseconds. (So, to convert to microseconds, divide the timings by 16.)The timings include all unavoidable delays due to memory timing and IFU restart latency, but donot include variable delays due to cache misses, IFU pipeline starvation, and the like; measurementshave shown that these degrade performance only a small amount [Clark, et al., 1981; Lampson, etal., 1981]. Nor do the timings account for I/O overhead; but this is typically less than 10 percent(including Alto-compatible display), and in any event will affect all opcodes proportionally.In the opcode names, *'' stands for a digit denoting an encoded constant; the name thereby refersto a family of related opcodes. The comments generally specify the conditions under which thetimings apply.Opcode(s)TimingCommentsADD2ALLOC10Ordinary case (no trap, AV[fsi] non-empty).AND2ASSOC103Minimum (more if many dirty cache entries)BCAST71Typical minimum (assuming 1 process awakened, no process switch).BITBLT86+height*(35+5*(width/16)) (Typical case)BLT, BLTC, BLTL27+3*countApproximate.]gpi c8q]rX-q5Ur ]q]r-q5`r Yq]r-q5`q]Wr Ssr Pq (r Mq(rP LXK FI E d Cg B> ?V ='8 <J 90O 7T 3 tX 0;r@" .#= +u r, *NQ (1uru 'FrX %] "T !Y?  u"` [q@ @"`+  @ e"`* @"`A @+ o@ "`  ZL?/YMesa opcode execution times on the Dorado2BNDCK4Ordinary case (in bounds).CATCH1CKSUM9+3.5*countDADD4DBL1DCOMP7Approximate.DESCB, DESCBS4DST15Minimum.DUCOMP6Approximate.DIV42Approximate; ordinary case (no overflow).DSUB4DUP1DWDC4Ordinary case (no interrupts pending).EFC*, EFCB54Ordinary case (no trap, external link is procedure)EXCH3FADD56Typical for normal'' operands.FCOMP27Typical for normal'' operands.FDIV156Typical for normal'' operands.FIX26Typical for normal'' operands.FIXC20Typical for normal'' operands.FIXI24Typical for normal'' operands.FLOAT41Typical for normal'' operands.FMUL104Typical for normal'' operands.FREE8FSUB57Typical for normal'' operands.GADRB3GETF101Minimum (more if many dirty cache entries)INC1IWDC3J*, JB1JEQ*, JEQB9if jump occurs;4if jump does not occur. See general comments on conditional jumps.JGB, JGEB10if jump occurs;5if jump does not occur.JIB17if jump index in range;5if jump index out of range.JIW16if jump index in range;5if jump index out of range.JLB, JLEB10if jump occurs;5if jump does not occur.JNE*, JNEB9if jump occurs;4if jump does not occur.JUGB, JUGEB9if jump occurs;4if jump does not occur.JULB, JULEB9if jump occurs;4if jump does not occur.JW10JZEQB, JZNEB8if jump occurs;3if jump does not occur.KFCB49Ordinary case (no trap, SD entry is procedure).LADRB3LDIV42Approximate; ordinary case (no overflow). fr)G bAq@"` `@ ^@ ]K@ [@ Y@"` XU @ V@"` U@"` S_@"`) Q@ P@ Ni@"`& L @"`+ K@ Is@"` G@"` F$"` D}@"` B@"` A.@"` ?@"` ="` <8@ :@"` 8@ 7B"`* 5@ 3@ 2L@ 0 @"`@."`, -V@"` @+"` *@"`@(`"` &@"`@%"` #j@"` @!"`  @"` @t"`  @"` @&"` ~ @"` @"` 0@  @"` @"` :@"`/ @ @"`) =/ZCMesa opcode execution times on the Dorado3LFC*, LFCB35Ordinary case (no trap).LG*, LGB2LGDB3LI*, LIB1LIN1, LINB, LINI2LINKB4LIW3LL*, LLB2LLDB3LLKB7LP3LST37Ordinary case (no trap, control link is frame).LSTF45Ordinary case (no trap, control link is frame).ME12Ordinary case (monitor not already locked).MISCSee description of individual MISC opcodes (ASSOC, CKSUM, FADD,FCOMP, FDIV, FIX, FIXC, FIXI, FLOAT, FMUL, FSUB, GETF,ROUND, ROUNDC, ROUNDI, SETF).MRE32Ordinary case (monitor not already locked).MUL24MXD15Ordinary case (no process waiting on monitor).MXW145Minimum (includes one process switch).NEG1NILCK3Ordinary case (not NIL).NILCKL4Ordinary case (not NIL).NOOP2Reduced to 1 if post-XFER interrupt restriction is eliminated.NOTIFY71Minimum (assuming no process switch).OR2PL*2POP1PORTI6PORTO34Ordinary case (no trap, control link is frame).PUSH1R*, RB2RBL4RD*, RDB4RDBL6REQUEUE63Minimum (assuming no process switch).RET34Ordinary case (no trap, return link is frame).RF3RFC4RFL5RFS6RFSL7RIGP5RIGPL9RIL*4RILP5RILPL9ROUND35Typical for normal'' operands.ROUNDC28Typical for normal'' operands.ROUNDI33Typical for normal'' operands.RR6Typical. fr)G bAq @"` `@ ^@ ]K@ [@ Y@ XU@ V@ U@ S_@ Q@ P@"`/ Ni@"`/ L@"`+ K"`""`I6"`H F@"`+ EQ@ C@"`. B"`& @[@ >@"` = @"` ;e@"`> 9@"`% 8@ 6o@ 4@ 3 @ 1y@"`/ /@ .*@ ,@ *@ )4@ '@"`% %@"`. $>@ "@ @ H@ @ @ R@ @ @ \@ @"`  @"` f@"` @"` >/YoiMesa opcode execution times on the Dorado4RSTR5RSTRL6RXGPL9RXLP6RXLPL9SHIFT6Ordinary case (ABS[shiftCount] < 16)SETF122Minimum (more if many dirty cache entries)SFC49Ordinary case (no trap, control link is procedure).SG*, SGB2SGDB4SL*, SLB2SLDB4SUB2W*, WB3WBL4WD*, WDB4WDBL5WF5WFL7WFS7WFSL9WIGPL9WILP6WILPL9WR5Typical.WS*, WSB4WSDB5WSF6WSTR7WSTRL8WXGPL9WXLP6WXLPL9XOR2General comments on timingsSimple instructionsFor the most part, except for conditional jumps the simple opcodes (those with timings less thanabout 10) are presently implemented as efficiently as is possible. The microcode for these opcodeshas remained relatively stable for some time.The only potential improvement will come from use of the Dorado IFU's instruction forwardingfeature. At the expense of control store (somewhat over 200 microinstructions), a number ofopcodes can be made one cycle faster by deferring their final operations into the next opcode, wherethose operations can sometimes be overlapped with other work. In particular, all instructions whosefinal operation is to fetch from memory (LL*, R*, etc.) or store into memory (SL*, W*, etc.) can bespeeded up by one cycle, though it is not always the case that the deferred operation can actually beoverlapped with other work by the next instruction.Until recently, the Alto-Mesa emulator actually did use the instruction forwarding feature, resultingin measured improvements of up to 8 percent. Unfortunately, instruction forwarding cannot be usedin the Pilot microcode and still remain faithful to the PrincOps. This is because the PrincOpsrequires faults and traps to abort the instruction that causes them; with instruction forwarding, such fr)G bAq@ `@ ^@ ]K@ [@ Y@"`$ XU"`* V@"`3 U@ S_@ Q@ P@ Ni@ L@ K@ Is@ G@ F$@ D}@ B@ A.@ ?@ =@ <8@ :@"` 8@ 7B@ 5@ 3@ 2L@ 0@ .@ -V@ +@ 'FtX #u rI % V - -/ 8< $@ 0>& )qrqrqrqr (%@ 3 G ;Z A 3M >/^RMesa opcode execution times on the Dorado5exceptions are frequently not detected until emulation of the next instruction has begun. (There aresome more subtle technical problems as well.)Most likely we will stretch the PrincOps to permit instructions to be suspended in mid-executionand later resumed. We believe it is possible to save and restore the necessary machine-dependentstate; of course, Pilot modifications will be required to remember this state when suspending aprocess that has faulted.Conditional jumpsFor the conditional jumps, the jump case is typically at least twice as costly as the non-jump case;the costlier case is the one in which the IFU has to be restarted. The two cases can trivially beinterchanged (on a per-opcode basis) to favor the non-jump case, if dynamic execution frequencyseems to warrant this. Alternatively, two opcodes can be provided, one favoring each case; there areundoubtedly situations (e.g., the jump that closes a FOR loop) in which the compiler could easilydecide which case is likely to be dynamically more frequent.Additionally, the favored case of each conditional jump can be made one cycle faster, through use ofthe instruction forwarding and conditional exit features of the Dorado IFU. The cost of this isadditional control store (somewhat over 100 microinstructions) and, in some cases, a restriction onwhether the jump or non-jump case is to be favored.Process/monitor opcodesThe ME and MXD opcodes are invoked upon every entry to and exit from a monitor. Approximatelyhalf their execution time is spent deciding whether their pointer arguments are short or long, byinspecting the stack depth. This would be eliminated if separate short and long versions of theseopcodes were introduced (as has long been specified in the PrincOps but has never beenimplemented).Control transfersSubstantial improvements are possible in the control transfer (XFER) primitives, which are presentlyrather expensive. These improvements will derive from various forms of caching of executioncontexts, for which there is some Dorado hardware support that is not presently used.To take advantage of this will require rewriting nearly all the microcode for the control transferopcodes. Additionally, we will have to identify those places in Pilot that reference the cachedinformation (e.g., frame overhead words) and add code to dump or invalidate the cache atappropriate times.To gain an idea of what improvements are possible, it is worth examining the workings of thecontrol transfer primitives in some detail. The two most important cases are transfers to proceduresas typified by EFC*, and transfers to frames as typified by RET.OperationTimingEFC* (procedure transfer)P1.Save PC in local frame, and obtain local frame pointer.5P2.Fetch external control link from code segment.5P3.Dispatch on control link tag.3P4.Decode procedure descriptor to obtain global frame and entry index.8P5.Load global frame and code base registers.5P6.Obtain PC for entry point and restart IFU.5 fr)G b&? `- ]M \101 ZB Y) UpuX Rr+9 QK O@ MC" L{+ qr) J< H7- F` E '< C3 ?uX  !YJ qr)qru<8vuXvq7?.?(?C?*?2*? >/YHMesa opcode execution times on the Dorado6P7.Allocate new local frame.11P8.Set up overhead words in new local frame.5P9.Dispatch on exit actions, and load local frame base register.3P10.Push source and destination control links onto stack.3RET (frame transfer)F1.Fetch return link from local frame.2F2.Dispatch on control link tag.3F3.Load global frame and code base registers.5F4.Obtain PC from destination frame and restart IFU.4F5.Dispatch on exit actions, and load local frame base register.3F6.Push source and destination control links onto stack.5F7.Free old local frame.8The caching strategy comes in two parts. First, the base registers and local frame overhead wordsare cached in the hardware, eliminating most of P1, P8, and F1F4, for a total of 24 cycles perEFC*/RET pair. Second, up to four local frames are kept in the cache rather than being allocatedand freed, eliminating most of P7 and F7, for a total of 19 cycles. Of course, there are somecompensating costs arising from the need to check for cache over/underflow and other conditionsthat require the cache to be invalidated.It is worth mentioning some architectural changes that would also speed things up. First, thecompact encoding of procedure descriptors has a substantial cost (P4); a simpler encoding (probablyrequiring 32 bits) would speed up procedure transfers. Second, the control links pushed onto thestack (P10 and F6) are saved for the benefit of coroutines and nested local procedures, which areused relatively infrequently; some way of obtaining this information only when it is needed wouldbe better.Other observationsThe operations performed by NILCK and BNDCK could be provided nearly free if performed as partof the dereferencing and array access opcodes rather than as distinct opcodes.References[Clark, et al., 1981]Douglas W. Clark, Butler W. Lampson, and Kenneth A. Pier, The Memory System of a High-Performance Personal Computer,'' IEEE Transactions on Computers, to appear; CSL report 81-1.[Lampson, et al., 1981]Butler W. Lampson, Gene A. McDaniel, and Severo M. Ornstein, An Instruction Fetch Unitfor a High-Performance Personal Computer,'' submitted for publication; CSL report 81-1. fr)GbAq?`)?^=?]K5?ZvuXWq#?V!?Ty*?R1?Q+=?O5?M? Kr)9 I7( HqrD F^ E J C) @1- ?-6 =] <a :&; 9  5UuX 2prqrqr! 0N )t '#rurp%X=p#!urp! % urpZApA I>/L TIMESROMAN  TIMESROMAN TIMESROMAN LOGO TIMESROMAN  TIMESROMAN  TIMESROMANp &j/)'GOpcodeTimings.bravoTaftJanuary 21, 1981 6:23 PM