Heading:z18697qjk40(635)\g Mesa opcode execution times on the Doradoz18697y756qjk40\g Page Numbers: Yes X: 527 Y: 10.5"z18697qjk40\g CSL Notebook Entryz18697l4445y762\f5bg To: DoradoUsers, CedarImplementors Date: January 19, 1981z18697l4445d2998e21(0,65535)(1,4445)(5,11684)(6,13653)\f1g3f0t2 1t0 30t6 1f1t0 5f0t7 1t0 z18697l4445\g From: Ed Taft Location: PARC/CSLz18697l4445d2998y716e25(6,13664)\f1g5f0t2 1t0 7t6 1f1t0 9f0t7 1t0 Subject: Mesa opcode execution times File: [Ivy]80CSLN-YYYYz19579l4445d2998e25\f1g8f0t2 1t0 27t6 1f1t0 5f0t7 1f1t0 30f0 on the Doradoz18697l4445(6,14146)\g XEROX z18697l508y644e14(2116)\f2g5f0 Attributes: technical, Cedar, Dorado, Mesaz18697l4896d2999e10(0,4904)(1,65535)(5,65535)(6,65535)\f1g11f0 Abstract: This memo tabulates the execution time of all common Mesa opcodes on the Dorado. Possible future changes likely to affect these timings are also summarized.z18697l4896d2999e10\f1g9f0 z18697l3033e10j(1270)\g The execution time on the Dorado for all common Mesa opcodes is presented in the table below. This information is intended to be useful to compiler writers in deciding about code sequences to be generated by the Mesa compiler for Cedar. It may also be of interest to others, e.g., in assessing the relative costs of operations on short versus long data values.z18697l3033e10jk40\g The information here is for the PrincOps (Pilot, as opposed to Alto) implementation of the Mesa instruction set. At present, only Mesa 6.0 opcodes are described; this memo will be revised as timing information becomes available on new opcodes implemented for Cedar.z18697l3033e10jk40\g Following the opcode table are some general comments on the timings, along with brief descriptions of possible future changes that are likely to have important effects on performance.z18697l3033e10jk40\g Opcode timingsz18697l3033e22k80\bg14B The timings are expressed in terms of microinstruction execution cycles; with current Dorado clock settings, 1 cycle = 64 nanoseconds. (So, to convert to microseconds, divide the timings by 16.)z18697l3033e10jk40\g The timings include all unavoidable delays due to memory timing and IFU restart latency, but do not include variable delays due to cache misses, IFU pipeline starvation, and the like; measurements have shown that these degrade performance only a small amount [Clark, et al., 1981; Lampson, et al., 1981]. Nor do the timings account for I/O overhead; but this is typically less than 10 percent (including Alto-compatible display), and in any event will affect all opcodes proportionally.z18697l3033e10jk40\g24i11I232i7I16i7I In the opcode names, *'' stands for a digit denoting an encoded constant; the name thereby refers to a family of related opcodes. The comments generally specify the conditions under which the timings apply.z18697l3033e10jk40\g Opcode(s) Timing Commentsz18697l3033e16k80(0,6016)(1,8800)\ig25I ADD 2z18697l8800d3033e10jk40(0,6464)\f1g ALLOC 10 Ordinary case (no trap, AV[fsi] non-empty).z18697l8800d3033e3jk40\f1g AND 2z18697l8800d3033e3jk40\f1g ASSOC 103 Minimum (more if many dirty cache entries)z18697l8800d3033e3jk40(0,6315)\f1g BCAST 71 Typical minimum (assuming 1 process awakened, no process switch).z18697l8800d3033e3jk40(0,6464)\f1g BITBLT 86+height*(35+5*(width/16)) (Typical case)z18697l8800d3033e3jk40\f1g BLT, BLTC, BLTL 27+3*count Approximate.z18697l8800d3033e3jk40\f1g BNDCK 4 Ordinary case (in bounds).z18697l8800d3033e3jk40\f1g CATCH 1z18697l8800d3033e3jk40\f1g CKSUM 9+3.5*countz18697l8800d3033e3jk40\f1g DADD 4z18697l8800d3033e3jk40\f1g DBL 1z18697l8800d3033e3jk40\f1g DCOMP 7 Approximate.z18697l8800d3033e3jk40\f1g DESCB, DESCBS 4z18697l8800d3033e3jk40\f1g DST 15 Minimum.z18697l8800d3033e3jk40\f1g DUCOMP 6 Approximate.z18697l8800d3033e3jk40\f1g DIV 42 Approximate; ordinary case (no overflow).z18697l8800d3033e3jk40\f1g DSUB 4z18697l8800d3033e3jk40\f1g DUP 1z18697l8800d3033e3jk40\f1g DWDC 4 Ordinary case (no interrupts pending).z18697l8800d3033e3jk40\f1g EFC*, EFCB 54 Ordinary case (no trap, external link is procedure)z18697l8800d3033e3jk40\f1g EXCH 3z18697l8800d3033e3jk40\f1g FADD 56 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g FCOMP 27 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g FDIV 156 Typical for normal'' operands.z18697l8800d3033e3jk40(0,6315)\f1g FIX 26 Typical for normal'' operands.z18697l8800d3033e3jk40(0,6464)\f1g FIXC 20 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g FIXI 24 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g FLOAT 41 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g FMUL 104 Typical for normal'' operands.z18697l8800d3033e3jk40(0,6315)\f1g FREE 8z18697l8800d3033e3jk40(0,6464)\f1g FSUB 57 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g GADRB 3z18697l8800d3033e3jk40\f1g GETF 101 Minimum (more if many dirty cache entries)z18697l8800d3033e3jk40(0,6315)\f1g INC 1z18697l8800d3033e3jk40(0,6464)\f1g IWDC 3z18697l8800d3033e3jk40\f1g J*, JB 1z18697l8800d3033e3jk40\f1g JEQ*, JEQB 9 if jump occurs;z18697l8800d3033e3jk40\f1g 4 if jump does not occur. See general comments on conditional jumps.z18697l8800d3033e3jk40\f1g JGB, JGEB 10 if jump occurs;z18697l8800d3033e3jk40\f1g 5 if jump does not occur.z18697l8800d3033e3jk40\f1g JIB 17 if jump index in range;z18697l8800d3033e3jk40\f1g 5 if jump index out of range.z18697l8800d3033e3jk40\f1g JIW 16 if jump index in range;z18697l8800d3033e3jk40\f1g 5 if jump index out of range.z18697l8800d3033e3jk40\f1g JLB, JLEB 10 if jump occurs;z18697l8800d3033e3jk40\f1g 5 if jump does not occur.z18697l8800d3033e3jk40\f1g JNE*, JNEB 9 if jump occurs;z18697l8800d3033e3jk40\f1g 4 if jump does not occur.z18697l8800d3033e3jk40\f1g JUGB, JUGEB 9 if jump occurs;z18697l8800d3033e3jk40\f1g 4 if jump does not occur.z18697l8800d3033e3jk40\f1g JULB, JULEB 9 if jump occurs;z18697l8800d3033e3jk40\f1g 4 if jump does not occur.z18697l8800d3033e3jk40\f1g JW 10z18697l8800d3033e3jk40\f1g JZEQB, JZNEB 8 if jump occurs;z18697l8800d3033e3jk40\f1g 3 if jump does not occur.z18697l8800d3033e3jk40\f1g KFCB 49 Ordinary case (no trap, SD entry is procedure).z18697l8800d3033e3jk40\f1g LADRB 3z18697l8800d3033e3jk40\f1g LDIV 42 Approximate; ordinary case (no overflow).z18697l8800d3033e3jk40\f1g LFC*, LFCB 35 Ordinary case (no trap).z18697l8800d3033e3jk40\f1g LG*, LGB 2z18697l8800d3033e3jk40\f1g LGDB 3z18697l8800d3033e3jk40\f1g LI*, LIB 1z18697l8800d3033e3jk40\f1g LIN1, LINB, LINI 2z18697l8800d3033e3jk40\f1g LINKB 4z18697l8800d3033e3jk40\f1g LIW 3z18697l8800d3033e3jk40\f1g LL*, LLB 2z18697l8800d3033e3jk40\f1g LLDB 3z18697l8800d3033e3jk40\f1g LLKB 7z18697l8800d3033e3jk40\f1g LP 3z18697l8800d3033e3jk40\f1g LST 37 Ordinary case (no trap, control link is frame).z18697l8800d3033e3jk40\f1g LSTF 45 Ordinary case (no trap, control link is frame).z18697l8800d3033e3jk40\f1g ME 12 Ordinary case (monitor not already locked).z18697l8800d3033e3jk40\f1g MISC See description of individual MISC opcodes (ASSOC, CKSUM, FADD, FCOMP, FDIV, FIX, FIXC, FIXI, FLOAT, FMUL, FSUB, GETF, ROUND, ROUNDC, ROUNDI, SETF).z18697l8800d3033e3jk40\f1g MRE 32 Ordinary case (monitor not already locked).z18697l8800d3033e3jk40\f1g MUL 24z18697l8800d3033e3jk40\f1g MXD 15 Ordinary case (no process waiting on monitor).z18697l8800d3033e3jk40\f1g MXW 145 Minimum (includes one process switch).z18697l8800d3033e3jk40(0,6315)\f1g NEG 1z18697l8800d3033e3jk40(0,6464)\f1g NILCK 3 Ordinary case (not NIL).z18697l8800d3033e3jk40\f1g NILCKL 4 Ordinary case (not NIL).z18697l8800d3033e3jk40\f1g NOOP 2 Reduced to 1 if post-XFER interrupt restriction is eliminated.z18697l8800d3033e3jk40\f1g NOTIFY 71 Minimum (assuming no process switch).z18697l8800d3033e3jk40\f1g OR 2z18697l8800d3033e3jk40\f1g PL* 2z18697l8800d3033e3jk40\f1g POP 1z18697l8800d3033e3jk40\f1g PORTI 6z18697l8800d3033e3jk40\f1g PORTO 34 Ordinary case (no trap, control link is frame).z18697l8800d3033e3jk40\f1g PUSH 1z18697l8800d3033e3jk40\f1g R*, RB 2z18697l8800d3033e3jk40\f1g RBL 4z18697l8800d3033e3jk40\f1g RD*, RDB 4z18697l8800d3033e3jk40\f1g RDBL 6z18697l8800d3033e3jk40\f1g REQUEUE 63 Minimum (assuming no process switch).z18697l8800d3033e3jk40\f1g RET 34 Ordinary case (no trap, return link is frame).z18697l8800d3033e3jk40\f1g RF 3z18697l8800d3033e3jk40\f1g RFC 4z18697l8800d3033e3jk40\f1g RFL 5z18697l8800d3033e3jk40\f1g RFS 6z18697l8800d3033e3jk40\f1g RFSL 7z18697l8800d3033e3jk40\f1g RIGP 5z18697l8800d3033e3jk40\f1g RIGPL 9z18697l8800d3033e3jk40\f1g RIL* 4z18697l8800d3033e3jk40\f1g RILP 5z18697l8800d3033e3jk40\f1g RILPL 9z18697l8800d3033e3jk40\f1g ROUND 35 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g ROUNDC 28 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g ROUNDI 33 Typical for normal'' operands.z18697l8800d3033e3jk40\f1g RR 6 Typical.z18697l8800d3033e3jk40\f1g RSTR 5z18697l8800d3033e3jk40\f1g RSTRL 6z18697l8800d3033e3jk40\f1g RXGPL 9z18697l8800d3033e3jk40\f1g RXLP 6z18697l8800d3033e3jk40\f1g RXLPL 9z18697l8800d3033e3jk40\f1g SHIFT 6 Ordinary case (ABS[shiftCount] < 16)z18697l8800d3033e3jk40\f1g SETF 122 Minimum (more if many dirty cache entries)z18697l8800d3033e3jk40(0,6315)\f1g SFC 49 Ordinary case (no trap, control link is procedure).z18697l8800d3033e3jk40(0,6464)\f1g SG*, SGB 2z18697l8800d3033e3jk40\f1g SGDB 4z18697l8800d3033e3jk40\f1g SL*, SLB 2z18697l8800d3033e3jk40\f1g SLDB 4z18697l8800d3033e3jk40\f1g SUB 2z18697l8800d3033e3jk40\f1g W*, WB 3z18697l8800d3033e3jk40\f1g WBL 4z18697l8800d3033e3jk40\f1g WD*, WDB 4z18697l8800d3033e3jk40\f1g WDBL 5z18697l8800d3033e3jk40\f1g WF 5z18697l8800d3033e3jk40\f1g WFL 7z18697l8800d3033e3jk40\f1g WFS 7z18697l8800d3033e3jk40\f1g WFSL 9z18697l8800d3033e3jk40\f1g WIGPL 9z18697l8800d3033e3jk40\f1g WILP 6z18697l8800d3033e3jk40\f1g WILPL 9z18697l8800d3033e3jk40\f1g WR 5 Typical.z18697l8800d3033e3jk40\f1g WS*, WSB 4z18697l8800d3033e3jk40\f1g WSDB 5z18697l8800d3033e3jk40\f1g WSF 6z18697l8800d3033e3jk40\f1g WSTR 7z18697l8800d3033e3jk40\f1g WSTRL 8z18697l8800d3033e3jk40\f1g WXGPL 9z18697l8800d3033e3jk40\f1g WXLP 6z18697l8800d3033e3jk40\f1g WXLPL 9z18697l8800d3033e3jk40\f1g XOR 2z18697l8800d3033e3jk40\f1g General comments on timingsz18697l3033e22k80(1270)\bg27B Simple instructionsz18697l3033e16k80\ig19I For the most part, except for conditional jumps the simple opcodes (those with timings less than about 10) are presently implemented as efficiently as is possible. The microcode for these opcodes has remained relatively stable for some time.z18697l3033e10jk40\g The only potential improvement will come from use of the Dorado IFU's instruction forwarding feature. At the expense of control store (somewhat over 200 microinstructions), a number of opcodes can be made one cycle faster by deferring their final operations into the next opcode, where those operations can sometimes be overlapped with other work. In particular, all instructions whose final operation is to fetch from memory (LL*, R*, etc.) or store into memory (SL*, W*, etc.) can be speeded up by one cycle, though it is not always the case that the deferred operation can actually be overlapped with other work by the next instruction.z18697l3033e10jk40\g429f1 3f0 2f1 2f0 30f1 3f0 2f1 2f0 Until recently, the Alto-Mesa emulator actually did use the instruction forwarding feature, resulting in measured improvements of up to 8 percent. Unfortunately, instruction forwarding cannot be used in the Pilot microcode and still remain faithful to the PrincOps. This is because the PrincOps requires faults and traps to abort the instruction that causes them; with instruction forwarding, such exceptions are frequently not detected until emulation of the next instruction has begun. (There are some more subtle technical problems as well.)z18697l3033e10jk40\g Most likely we will stretch the PrincOps to permit instructions to be suspended in mid-execution and later resumed. We believe it is possible to save and restore the necessary machine-dependent state; of course, Pilot modifications will be required to remember this state when suspending a process that has faulted.z18697l3033e10jk40\g Conditional jumpsz18697l3033e16k80\ig17I For the conditional jumps, the jump case is typically at least twice as costly as the non-jump case; the costlier case is the one in which the IFU has to be restarted. The two cases can trivially be interchanged (on a per-opcode basis) to favor the non-jump case, if dynamic execution frequency seems to warrant this. Alternatively, two opcodes can be provided, one favoring each case; there are undoubtedly situations (e.g., the jump that closes a FOR loop) in which the compiler could easily decide which case is likely to be dynamically more frequent.z18697l3033e10jk40\g451f1 3f0 Additionally, the favored case of each conditional jump can be made one cycle faster, through use of the instruction forwarding and conditional exit features of the Dorado IFU. The cost of this is additional control store (somewhat over 100 microinstructions) and, in some cases, a restriction on whether the jump or non-jump case is to be favored.z18697l3033e10jk40\g Process/monitor opcodesz18697l3033e16k80\ig23I The ME and MXD opcodes are invoked upon every entry to and exit from a monitor. Approximately half their execution time is spent deciding whether their pointer arguments are short or long, by inspecting the stack depth. This would be eliminated if separate short and long versions of these opcodes were introduced (as has long been specified in the PrincOps but has never been implemented).z18697l3033e10jk40\g4f1 2f0 5f1 3f0 Control transfersz18697l3033e16k80\ig17I Substantial improvements are possible in the control transfer (XFER) primitives, which are presently rather expensive. These improvements will derive from various forms of caching of execution contexts, for which there is some Dorado hardware support that is not presently used.z18697l3033e10jk40\g63f1 4f0 To take advantage of this will require rewriting nearly all the microcode for the control transfer opcodes. Additionally, we will have to identify those places in Pilot that reference the cached information (e.g., frame overhead words) and add code to dump or invalidate the cache at appropriate times.z18697l3033e10jk40\g To gain an idea of what improvements are possible, it is worth examining the workings of the control transfer primitives in some detail. The two most important cases are transfers to procedures as typified by EFC*, and transfers to frames as typified by RET.z18697l3033e10jk40\g210f1 4f0 41f1 3f0 Operation Timingz18697l4303e16k80(0,15552)(1,65535)\ig16I EFC* (procedure transfer)z18697l4303e10k80(1270)\f1ig4f0 21I P1. Save PC in local frame, and obtain local frame pointer. 5z18697l4303e10jk40(0,5280)(1,16288)\f1g P2. Fetch external control link from code segment. 5z18697l4303e3jk40\f1g P3. Dispatch on control link tag. 3z18697l4303e3jk40\f1g P4. Decode procedure descriptor to obtain global frame and entry index. 8z18697l4303e3jk40\f1g P5. Load global frame and code base registers. 5z18697l4303e3jk40\f1g P6. Obtain PC for entry point and restart IFU. 5z18697l4303e3jk40\f1g P7. Allocate new local frame. 11z18697l4303e3jk40\f1g P8. Set up overhead words in new local frame. 5z18697l4303e3jk40\f1g P9. Dispatch on exit actions, and load local frame base register. 3z18697l4303e3jk40\f1g P10. Push source and destination control links onto stack. 3z18697l4303e3jk40\f1g RET (frame transfer)z18697l4303e10k80(1270)\f1ig3f0 17I F1. Fetch return link from local frame. 2z18697l4303e10jk40(0,5280)\f1g F2. Dispatch on control link tag. 3z18697l4303e3jk40\f1g F3. Load global frame and code base registers. 5z18697l4303e3jk40\f1g F4. Obtain PC from destination frame and restart IFU. 4z18697l4303e3jk40\f1g F5. Dispatch on exit actions, and load local frame base register. 3z18697l4303e3jk40\f1g F6. Push source and destination control links onto stack. 5z18697l4303e3jk40\f1g F7. Free old local frame. 8z18697l4303e3jk40\f1g The caching strategy comes in two parts. First, the base registers and local frame overhead words are cached in the hardware, eliminating most of P1, P8, and F1F4, for a total of 24 cycles per EFC*/RET pair. Second, up to four local frames are kept in the cache rather than being allocated and freed, eliminating most of P7 and F7, for a total of 19 cycles. Of course, there are some compensating costs arising from the need to check for cache over/underflow and other conditions that require the cache to be invalidated.z18697l3033e10jk40(1270)\g195f1 8f0 It is worth mentioning some architectural changes that would also speed things up. First, the compact encoding of procedure descriptors has a substantial cost (P4); a simpler encoding (probably requiring 32 bits) would speed up procedure transfers. Second, the control links pushed onto the stack (P10 and F6) are saved for the benefit of coroutines and nested local procedures, which are used relatively infrequently; some way of obtaining this information only when it is needed would be better.z18697l3033e10jk40\g Other observationsz18697l3033e16k80\ig18I The operations performed by NILCK and BNDCK could be provided nearly free if performed as part of the dereferencing and array access opcodes rather than as distinct opcodes.z18697l3033e10jk40\g28f1 5f0 5f1 5f0 Referencesx6e36k120(635)\bg [Clark, et al., 1981] Douglas W. Clark, Butler W. Lampson, and Kenneth A. Pier, The Memory System of a High-Performance Personal Computer,'' IEEE Transactions on Computers, to appear; CSL report 81-1.l3696d2992x3e6jk40(0,3712)(1,65535)\g8i6I129i30I [Lampson, et al., 1981] Butler W. Lampson, Gene A. McDaniel, and Severo M. Ornstein, An Instruction Fetch Unit for a High-Performance Personal Computer,'' submitted for publication; CSL report 81-1.l3696d2992x3e6jk40\g10i6I