*start* 01728 00024 US Date: 11 Aug. 1981 8:57 pm PDT (Tuesday) From: Fiala.PA Subject: Floating point opcodes To: LStewart, Satterthwaite cc: Wilhelm, Rovner, Taft, Fiala I am proposing to implement floating point opcodes for square root (FSqRt = MISC 36b) and floating scale (FSc = MISC 37b) in both Alto and Cedar Dolphin microcode in the near future (I have them both coded and am about to start debugging). The following comments are relevant to these.
1) The Cedar unimplemented opcode trap implementation must be changed so that the procedure reached when an opcode (such as FSqRt) is not defined is entered after one Xfer rather than two. I was shocked to learn from Willie-Sue today that this is apparently not the case. When this becomes true, it will be practical to produce the MISC 36b opcode for FSqRt (etc.) when that is implemented on the Dolphin but not on the Dorado (or vice versa).
2) FSqRt takes one real on TOS,,2OS and returns a positive real on TOS,,2OS. The sign of its argument is ignored--LS, if this is wrong, please tell me. My thought is that the user can check for <0 before execution if he cares. Timing for FSqRt is about 57 microseconds.
3) FSc takes a real on TOS,,2OS and an integer N on 3OS where -202b <= N <= 200b. It returns the real on TOS,,2OS with the exponent scaled by N. My thought on the use of FSc is as follows: (a) Define "shifts" of reals to use the FSc opcode--make this legal in Mesa source programs; (b) Have the compiler detect real multiplication and division by constants that are powers of 2 and use FSc instead of FDiv or FMul in this case. Note that the timing of FSc is about 8 microseconds compared to 15 for an FAdd, 38 for FMul, or 42 for FDiv.
*start* 01839 00024 US Date: 14 Aug. 1981 8:45 am PDT (Friday) From: Satterthwaite.PA Subject: Floating point issues In-reply-to: Fiala's message of 11 Aug. 1981 8:57 pm PDT (Tuesday) To: Stewart, Warnock, Wilhelm cc: Fiala, Satterthwaite, Rovner, Taft Never in my wildest dreams did I think that any important Mesa programs would spend much time crunching floating-point numbers. Since I was wrong, let me (re)raise the following issues now that we have more experience with performance (and lack thereof):
- How closely do we want to try to adhere to the IEEE standard? There are two problems for the compiler: (1) Because of all the modes and exceptions that are possible, the current compiler does very few floating-point operations at compile-time even when all operands are constant -- I believe that it attempts only fixed-to-float, unary negation and abs. (2) Because of exceptions, cheap "trick" implementations of some operations aren't strictly legitimate, and apparently redundant computations cannot strictly be eliminated. For example, recent compilers have implemented ABS by masking off the sign bit -- probably not really correct for NaNs, etc.; on the other hand, the current compiler does not discard addition of (constant) zero, multiplication by one, etc.
- What is the current state of support for LONG REAL? Is it worth adding to the language sooner rather than later? Note that, even with the larger stack, code for evaluating deeply nested LONG REAL expressions will not be wonderful (only two operands can be pushed onto the stack at a time). I suppose it's too heretical at this point to suggest 48-bit arithmetic for LONG REAL (or for REAL, with 64-bit LONG REAL). The numerical analysts at Stanford who used the B5500 many years ago seemed to think that 48 bits was a wonderful compromise.
Ed
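[For scale, a rough comparison, assuming IEEE-style formats -- the field split for a hypothetical 48-bit Mesa format is an assumption, not anything proposed in this correspondence: a 32-bit REAL carries a 24-bit significand, good for about 24 x log10(2) ~ 7.2 decimal digits; the B5500's 48-bit word carried a 39-bit mantissa, about 11.7 digits; a 64-bit LONG REAL with a 53-bit significand gives about 15.9 digits.]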
*start* 01003 00024 US Date: 14 Aug. 1981 8:54 am PDT (Friday) From: Satterthwaite.PA Subject: Re: Floating point opcodes In-reply-to: Fiala's message of 11 Aug. 1981 8:57 pm PDT (Tuesday) To: Fiala cc: Stewart, Satterthwaite, Wilhelm, Rovner, Taft If we still plan to do strict IEEE standard arithmetic (see my previous message), and if you want the compiler to generate FSc, please make sure that it behaves identically to multiplication/division by powers of 2 within some specified range. ([-202b .. 200b] seems good enough; the compiler would only generate this instruction for multiplication/division by constants.) Would it be hard to change the conventions for FSc so that N comes from TOS and the real comes from 2OS,,3OS? This would follow the convention for fixed-point shift and is somewhat less awkward for the compiler. I don't expect the compiler to get involved in FSqRt at all; unless constant-folding of square roots is a big deal, a MACHINE CODE inline seems entirely adequate.
*start* 00903 00024 US Date: 14 Aug. 1981 9:46 am PDT (Friday) From: Wilhelm.PA Subject: Re: Floating point issues In-reply-to: Satterthwaite's message of 14 Aug. 1981 8:45 am PDT (Friday) To: Satterthwaite cc: Stewart, Warnock, Wilhelm, Fiala, Rovner, Taft I couldn't care less about the IEEE "standard"; I think it's a crock. It would really be nice if the compiler could easily produce efficient code and do as much as possible at compile time. There is no question that we will (do) need LONG REAL, and I suspect if it's to be added to Mesa, it might as well be now as later. I wouldn't worry about lengthy expressions; they don't happen very often anyway. As for the size of the numbers, they will have to be 64 bits in length. Forty-eight bit numbers would be somewhat short on precision in some cases, and dealing with three-word quantities on future machines would be a real pain. Neil
*start* 03095 00024 US Date: 14 Aug. 1981 11:28 am PDT (Friday) From: Fiala.PA Subject: Re: Floating point issues In-reply-to: Satterthwaite's message of 14 Aug. 1981 8:45 am PDT (Friday) To: Satterthwaite cc: Stewart, Warnock, Wilhelm, Fiala, Rovner, Taft I don't think optimizing constant expressions at compile time is important because of infrequency and because the user can do this by hand if he really cares. However, using FSc instead of FMul or FDiv would be useful when it can be used--FSc is in fact identical to FMul or FDiv by a power of two. Although my implementation accommodates -202b to +200b, it might be simpler to confine the compiler's use of FSc to be -200b through +177b (i.e., 8-bit integers). It is possible to reverse arguments for FSc. This will slow it from about 8.9 microseconds to about 9.3 microseconds and increase code size by 2 microinstructions. This is not a big deal and I am willing to change the implementation if you insist. Do you insist? I think it is desirable to obey the IEEE floating point standard, though I have been willing to add modifications such as Wilhelm's substitute-0-on-underflow as options. At the moment, however, it is impossible to allow users anything other than system-wide options because parameters controlling rounding modes, infinity handling, etc. are global rather than part of the process state. For this reason, Pilot should allocate larger blocks for process state dumps, so that I can save and reload the extra registers as required.
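[A modern analogue of the per-process floating point state Fiala wants, as a minimal C sketch using the C99 <fenv.h> interface -- not the Mesa/Pilot mechanism under discussion, just an illustration of what a process switch would have to save and restore:

  #include <fenv.h>

  /* Hypothetical per-process record holding the rounding mode and
     sticky exception flags -- the "extra registers" Fiala wants
     dumped and reloaded with the rest of the process state. */
  typedef struct {
      fenv_t fpState;
  } ProcessFPState;

  void saveFPState(ProcessFPState *p)          { fegetenv(&p->fpState); }
  void restoreFPState(const ProcessFPState *p) { fesetenv(&p->fpState); }

Without such a save/restore at process-switch time, a call like fesetround behaves exactly as Fiala describes the current microcode options: as a single global, system-wide setting.]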
The immediate requirement is for two extra words in the process state; however, it might be useful to increase the size by 4 words to allow for future growth. ----- I don't think we presently have microcode space for 48 or 64-bit reals IN ADDITION TO 32-bit reals. Even if we had space, there might not be enough registers to support 64-bit reals on the Dolphin, though there are enough registers for 48-bit reals. However, I agree with Wilhelm that, if we change to longer reals, we should go to 64-bit reals. If 32-bit reals were REPLACED by 64-bit reals, microcode required would grow substantially from 541b microinstructions now to perhaps 720b microinstructions with 64-bit reals. Execution time might average 200% longer. My feeling is that 64-bit reals would be best, but changing to these would require so much new programming that I don't want to write microcode for this now--a software implementation is up to Larry Stewart. The graphics people should indicate their feelings since this appears to be an EITHER-OR choice for microcode implementation. If both 32 and 64-bit reals are wanted, a possible approach is to do something like the PDP-10 implementation which has, in addition to the usual stuff, unnormalized floating add and the long-mode opcodes (FMPL, FADL, FDVL, etc.) which produce a double-precision result from single precision arguments. Typically four or five of these opcodes are executed to complete a double-precision operation. If 64-bit reals are wanted, we might want to use the floating point board for these on the Dolphin. *start* 03515 00024 US Date: 14 Aug. 1981 2:15 pm PDT (Friday) From: Taft.PA Subject: Re: Floating point issues In-reply-to: Satterthwaite's message of 14 Aug. 1981 8:45 am PDT (Friday) To: Satterthwaite cc: Stewart, Warnock, Wilhelm, Fiala, Rovner, Taft I have to take exception to the assertion that the IEEE floating point standard is a "crock". In my opinion, it's the cleanest and easiest to understand floating point convention I have seen. There are minor changes that I might desire to simplify software/microcode implementations; but overall I believe they "got it right" in the sense that IEEE floating point is free of most of the pathological problems common to other floating point implementations. Leaving that aside, I believe we can't afford to diverge from the IEEE standard, for the simple reason that it's the standard that will be cast in VLSI by the chip manufacturers whether we like it or not. Now, to address the specific issues that have been raised: To make it possible to do constant folding at compile time, it will be necessary for the various modes (particularly rounding) to be determined at compile time, presumably by static declarations that obey block structure. As for eliminating redundant computation such as addition of zero, this seems fairly hopeless as you say; but I think it's far less important than simple constant folding. Dynamic floating point modes are a real problem. Originally I believed we could store the modes as part of the process state simply by enlarging the PrincOps StateVector -- something that is allowed by the PrincOps and is now actually possible as of the Rubicon release of Pilot. Unfortunately (also as of Rubicon), a process is allocated a StateVector only when it is preempted. The state of a process that gives up control voluntarily is contained entirely in the PSB, which is completely full. 
I suppose increasing the size of the PSB from 4 to 8 words is practical, but it would not be a trivial change and would be incompatible with current Pilot. As for LONG REALs, I concur with most of Fiala's remarks. But I would go further and say that the Mesa stack-machine architecture is probably not the right basis for implementing LONG REALs. The overhead of pushing and popping LONG REAL values may entirely dominate the cost of the floating point operations themselves. This is particularly true in the case of the Dolphin using the floating point hardware, since communication with that hardware is via the Dolphin's I/O system. I think we should seriously consider an alternative architecture, such as one involving "floating point accumulators" or something. In any event, there's no point in introducing LONG REALs into the language until we have a better understanding of what the underlying architecture is going to be and how we are going to implement it on the machines we care about. Which brings me to a final point. On the basis of hallway discussions, I know Larry Stewart has already given most of these issues a great deal of thought; and I imagine other people have also. If we're really serious about a comprehensive Mesa floating point architecture, I think we should have a series of meetings to do a serious design rather than continuing this hit-or-miss correspondence via messages. Participants in such a design should include Mesa language experts, implementors of microcode and hardware for all machines of interest (Dolphin, Dorado, Dragon), and of course a representative sample of users. Ed
*start* 00325 00024 US Date: 1 Sept. 1981 3:07 pm PDT (Tuesday) From: Wilhelm.PA Subject: Square Roots on Dorados To: McDaniel, Taft cc: Stewart, Warnock, Fiala, Rovner, Satterthwaite, Wilhelm On a Dorado, Thyme spends 19.3% of its time taking square roots (for a Dolphin, the corresponding number was about 15%). Neil
*start* 01978 00024 US Date: 5 Feb. 1982 6:00 pm PST (Friday) From: Fiala.PA Subject: Re: New Dolphin Cedar Microcode In-reply-to: Your message of 5 Feb. 1982 5:13 pm PST (Friday) To: Willie-Sue, Taft cc: Fiala Specs are as follows for new opcodes (all tentative):
Misc 100b is LocalBlkZ which takes a cardinal count N on TOS and zeroes Local 0 to Local N-1; it pops the stack once.
Misc 102b is LongBlkZ which takes a cardinal count N on TOS and a long pointer on 2OS,,3OS. It zeroes the N words beginning at the long pointer and pops the stack once. Note that the long pointer is still on the stack after the opcode has finished.
I am not being very careful about the alpha numbers assigned because I don't expect the current instruction set to survive much longer; there will be a huge renumbering if Cedar is converted to some variant of the new Pilot's instruction set. For the new Pilot I have started to assign bits for a VERSION opcode; I think I am going to make it MISC 104b in the old instruction set, also.
It is tentatively spec'ed as follows: Two words are pushed onto the stack:
TOS   bits 0:3    Engineering number (0-1 AltoI, 2-3 AltoII, 4 Dolphin, 5 Dorado, 6 Dandelion), compatible with the Alto VERS opcode
      bits 4:7    unassigned
      bits 8:11   machine-dependent; on Dolphin:
                  Bit 8 = have Cedar microcode
                  Bit 9 = have TextBlt microcode
                  Bit 10 = have floating point microcode
      Bits 11-12  unused
      Bits 13-15  machine speed code; indicated period * 64 is the rate at which the double-precision clock counts:
                  0    40 MHz (100 ns)
                  1    44.5 MHz (~90 ns)
                  2    50 MHz (80 ns)
                  3-7  unused
2OS   Day of release encoded as number of days since 1 January 1968.
The motivation behind 2OS is that Pilot time is a long cardinal equal to the number of seconds since midnight 1 January 1968. To convert the above number to Pilot time, you just lengthen the result and multiply by 86,400. I have Micro macros to generate the release day conveniently.
*start* 02805 00024 US Date: 7 Feb. 1982 12:38 pm PST (Sunday) From: Taft.PA Subject: New memory-zeroing opcodes To: Rovner, Satterthwaite cc: Fiala, Willie-Sue, Taft Ed Fiala has given me his spec for the new memory-zeroing opcodes, copied below:
-------------------
Misc 100b is LocalBlkZ which takes a cardinal count N on TOS and zeroes Local 0 to Local N-1; it pops the stack once.
Misc 102b is LongBlkZ which takes a cardinal count N on TOS and a long pointer on 2OS,,3OS. It zeroes the N words beginning at the long pointer and pops the stack once. Note that the long pointer is still on the stack after the opcode has finished.
-------------------
There are several different ways to implement this on the Dorado, depending on how these opcodes are expected to be used. The most general way is to call the existing microcode subroutine for BLT, which properly handles all aspects of long-running opcodes: it checks for interrupts periodically, restarts after page faults, and does explicit cache management (PreFetches) to maximize memory performance. Once it gets started, it can store a constant value into memory at the rate of about 10 words per microsecond; however, it incurs about 2 microseconds' overhead getting started. A much simpler and cheaper implementation is possible if it can be GUARANTEED that these opcodes will be called upon to zero no more than about 50 words and if it is usually the case that the memory being zeroed is already actively in use (hence likely to be present in the cache). But remember that the price paid for the more general implementation is more than repaid if it prevents as few as two cache misses from occurring. Of course, if it is important that zeroing small blocks be cheap, but no upper bound on block size can be guaranteed, then I can implement the opcodes both ways and select the proper one at execution time. I should also mention that the LongBlkZ opcode, as presently specified, is somewhat difficult to implement if it must be interruptible. Interruption is ordinarily handled by storing intermediate state back on the stack, so that when the instruction is re-executed it will resume at the right point. But this is difficult for LongBlkZ because it is required to leave its long pointer operand undisturbed. Now, I can fix this by zeroing the block in reverse order (assuming this is acceptable to you); however, this will require a fairly extensive rework of the BLT microcode, which currently works in the forward direction only. How important is it that the long pointer operand be left on the stack?
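[To make the stack discipline concrete, a minimal C simulation of LongBlkZ as specified -- my own illustration, not PrincOps text; the evalStack representation, the simulated memory, and the word order of the long pointer are all assumptions:

  #include <string.h>   /* memset */

  typedef unsigned short Word;

  Word evalStack[8];          /* evalStack[sp-1] is TOS */
  int  sp;
  Word memory[1L << 20];      /* simulated 20-bit word address space */

  /* LongBlkZ: count N on TOS, long pointer on 2OS,,3OS.  Zeroes N
     words at the pointer and pops only the count, so the pointer
     words are still on the stack afterward.  Bounds checks omitted. */
  void LongBlkZ(void) {
      Word n = evalStack[--sp];                          /* pop the count */
      unsigned long lp = ((unsigned long)evalStack[sp-1] << 16)  /* which half is */
                       | evalStack[sp-2];                        /* high is a guess */
      memset(&memory[lp], 0, (size_t)n * sizeof(Word));
  }

An interruptible implementation would want to record its progress in stack words across page faults, which is exactly the difficulty: the spec pins the pointer operand, leaving only the count word (as Fiala confirms below) available as scratch.]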
Also, it must be explicitly prohibited for a program to execute a PUSH to recover the word count operand; I don't believe this case is covered by any of the existing PrincOps restrictions on the use of PUSH. Ed
*start* 01278 00024 US Date: 8 Feb. 1982 2:47 pm PST (Monday) From: Fiala.PA Subject: Re: New memory-zeroing opcodes In-reply-to: Taft's message of 7 Feb. 1982 12:38 pm PST (Sunday) To: Taft cc: Rovner, Satterthwaite, Fiala, Willie-Sue The new Pilot instruction set has both forward and reverse direction BLTs, so if you do the new work for making *BlkZ go in the reverse direction, that work will not be wasted if/when Cedar converts to the new instruction set. I think it would be a poor idea to restrict this opcode to short blocks because the places where BLT is used to zero blocks of storage may eventually want to use the new *BlkZ opcodes instead, and a length restriction would rule this out. I think you should implement the opcodes in the easiest way now, and let Rovner and Satterthwaite figure out exactly what they want. Also, Dolphin Cedar/Pilot microcode already uses the count word on the stack as a state variable for essentially all opcodes that repeat, so there is apparently no dependence on that stack word retaining its original value in current programs--I think you should go ahead and use that stack word as a variable, and restrict the compiler not to restore the stack pointer over that word--I thought such a restriction was already in force.
*start* 01412 00024 US Date: 24 Feb. 1982 7:29 pm PST (Wednesday) From: Fiala.PA Subject: Esc Opcode Changes for Trinity To: Sandman cc: Taft, Fiala I would like to make the following changes in the Esc definitions (i.e., in your ESCAlpha.mesa file and any other relevant places):
1) Esc 144b, formerly undefined, is now aRECLAIMCOUNT;
2) Esc 211b (Dolphin) is aSTARTIO;
3) Esc 212b (Dolphin) is aDESOPR;
4) Esc 213b (Dolphin) is aREADR;
5) Esc 214b (Dolphin and Dorado) is aLONGBLKZ;
6) Esc 215b (Dolphin and Dorado) is aLOCALBLKZ;
In these, the LONGBLKZ and LOCALBLKZ opcodes will be implemented on any machine which runs Cedar, so perhaps they should not be put with the other processor-dependent alpha codes (200b-237b); but I have put them there tentatively. I would like to propose that the VERSION opcode be defined to push two words onto the stack as follows:
TOS/  Day of release encoded as number of days since 1 January 1968.
2OS/  Bits 0:3    Engineering number (0 or 1 = Alto I, 2 = Alto II without extended memory, 3 = Alto II with extended memory, 4 = Dolphin, 5 = Dorado, 6 = Dandelion).
      Bits 4:7d   machine-dependent flags
      Bits 8:13d  undefined
      Bit 14d     Floating point microcode loaded
      Bit 15d     Cedar microcode loaded
The day of release is easily converted to Pilot time by lengthening the result and multiplying by 86,400. The interpretation of 2OS bits 0:3 is historical from the Alto.
*start* 01866 00024 US Date: 8 March 1982 10:02 am PST (Monday) From: Taft.PA Subject: PrincOps bug To: Sandman, Wick cc: Levin, Fiala, Taft If Reschedule is called with interrupts disabled, and there are no runnable processes, Reschedule calls RescheduleError which calls TrapZero. Unfortunately, by this time, the PSB which was running has been removed from the ready queue, so the context within which TrapZero is called is not very well defined. This actually occurred recently on the Dorado due to a bug in Cedar. The machine ended up in a totally wedged state that was extremely difficult to figure out.
I'm not completely sure exactly what went wrong, but it certainly did not land in the debugger as one might desire. One reasonable context in which to cause the trap is the process that called Reschedule. So one possible thing to do is to take the PSB off whatever queue it is on and put it back on the ready queue before initiating the trap. (Fortunately it's possible to do this without knowing what queue the PSB is on now.) Alternatively, this error could be handled as a fault rather than a trap. That is, awaken some process that is waiting for this error to occur and that will then call the debugger. Also, I might note that the trap may occur in a process different from the one that turned off interrupts, and can happen a long time later. This makes debugging quite difficult. Would it be possible to catch this error earlier? For example, we could define it to be an error to call Reschedule with interrupts disabled, regardless of whether or not there are other runnable processes. The consequence of this would be to make it illegal to cause a fault or to wait on a monitor lock or condition variable with interrupts disabled. This doesn't seem like an unreasonable restriction; in fact, you might claim it's a feature. Ed
*start* 03601 00024 US Date: 3 April 1982 5:49 pm PST (Saturday) From: Taft.PA Subject: Xfer traps and related issues To: Sandman, Sweet cc: Wick, Levin, Fiala, Maxwell, Taft The Xfer Trap mechanism is not described in the PrincOps, but it has been around for some time and there are various tools that use it. Recently, John Maxwell tried to use Xfer traps on the Dorado and found some problems. One problem was due to an out-and-out bug in the implementation, which I have fixed. However, we now have some difficulty deciding exactly what Xfer traps are supposed to do. My understanding is that the trapped Xfer first runs to completion, and then a trap occurs in the destination context (i.e., the Xfer trap handler is invoked in such a way as to make it appear that the first instruction of the destination context caused the trap). The parameter passed to the Xfer trap handler is the destination control link of the original Xfer. This is straightforward except in one case: when the original Xfer is a local function call, there IS no destination control link. (Currently, zero is stored as the trap parameter in this case.) Now, it's not difficult for the microcode to detect this situation and fabricate a procedure descriptor from the LFC's EVI. (This change is required in both Dorado and Dolphin microcode; I don't know about Dandelion.) This seems like a reasonable thing to do, since it will enable the Xfer trap handler to determine in a uniform way what kind of Xfer took place. However, before I do this, I'd like to ask whether Xfer traps are intended to remain a permanent feature of the instruction set; and if so, whether the mechanism will continue in its present form. There are some additional difficulties with the current implementation, such as the fact that process switches can cause Xfer traps that are hard to distinguish from procedure returns and fixed-frame transfers; it might be worthwhile cleaning this up. What do you think? Do Xfer traps become a PrincOps feature? While I was re-examining this part of Xfer, I happened to notice one inconsistency in the PrincOps.
In section 9.4.1, LFC is described as "logically equivalent" to Call[MakeProcDesc[GF, evi]]; but the actual "optimized" LFC differs from Call in one important way: it does not push the original source and destination control links onto the stack. Is this correct? The present Dorado and Dolphin implementations do push the links on the stack; eliminating this will speed up LFC by a noticeable amount (about 12% on the Dorado). The "standard" uses of the control links left on the stack -- PORTI (coroutines) and LKB (statically enclosing contexts) -- cannot occur following LFC. If these are the ONLY uses sanctioned by the PrincOps, then LFC does not need to push the links onto the stack. In fact, I believe it's true that PORTI and LKB are used only after Xfers through indirect control links. If so, pushing the links onto the stack need not be done at all except in the case of Xfers through indirect control links; this will benefit the normal cases of EFC and RET as well. It's always bothered me that the cost of pushing these links is incurred on every Xfer in order to support mechanisms that are used only a tiny fraction of the time. To summarize my questions:
1. Are Xfer traps a PrincOps feature?
2. If so, what is the Xfer trap parameter in the case of an LFC?
3. Does LFC need to push the source and destination control links onto the stack?
4. Do ANY Xfers need to push the links in any case besides transferring through an indirect control link?
Ed
*start* 00400 00024 US Date: 12 April 1982 8:51 am PST (Monday) From: Levin.PA Subject: Re: Xfer traps and related issues In-reply-to: Taft's message of 3 April 1982 5:49 pm PST (Saturday) To: Taft cc: Sandman, Sweet, Wick, Levin, Fiala, Maxwell My take on your 4 questions:
1) I hope so. In my opinion, yes.
2) The cleanup you suggest sounds good to me.
3) No.
4) What about traps?
Roy
*start* 01054 00024 US Date: 28 April 1982 11:10 am PDT (Wednesday) From: Taft.PA Subject: Dorado Trinity conversion To: Johnsson, Lauer, Wick cc: Levin, Taft I'm once again thinking about bringing up Trinity on the Dorado. I'll have to do this eventually; the question is when. The current Cedar schedule calls for converting to Trinity after the Cedar 4.0 release, which we now think will be mid-summer at the earliest. So there is no pressure from the Cedar project to bring up Trinity on the Dorado any earlier. On the other hand, if it would be of benefit to you to be able to run Trinity on your Dorados, that might be a good justification for my doing the conversion sooner. In particular, if not being able to run Trinity makes your Dorados useless to you, I want to hear about it! Please let me know your feelings about this. Also, at this point I don't want to start until after Trinity is officially released; the recent repeated releases of version "8.0" have me confused about whether or not the release has actually happened. Ed
*start* 01128 00024 US Date: 3 May 1982 3:27 pm PDT (Monday) From: Fiala.PA Subject: Re: Xfer traps and related issues In-reply-to: Taft's message of 3 April 1982 5:49 pm PST (Saturday) To: Taft cc: Sandman, Sweet, Wick, Levin, Fiala, Maxwell After looking at the Xfer trap change which Taft proposed, I find all local function calls on the Dolphin would be slowed to prepare a fabricated original destination link as the trap parameter. In other words, on the Dolphin, all local function calls would have to run slower to save the original evi needed by a possible but rarely occurring Xfer trap. Consequently, I am opposed to this change.
What we have at present is that the source link for the Xfer trap points at the new frame, which allows the Xfer trap to easily return to the new context. It is not clear to me what added value is obtained from having the trap parameter be the original fabricated destination link rather than 0; there are only a few more instructions required to follow the linkage back to the caller in this case, and having a trap parameter of 0 unambiguously identifies a local function call.
*start* 01190 00024 US Date: 4 May 1982 10:16 am PDT (Tuesday) From: Taft.PA Subject: Re: Xfer traps and related issues In-reply-to: Fiala's message of 3 May 1982 3:27 pm PDT (Monday) To: Fiala cc: Taft, Sandman, Sweet, Wick, Levin, Maxwell I'm happy to have LFCs that trap pass a trap parameter of zero. As you point out, in the case of LFC (though not in general), it's not difficult for the trap handler to reconstruct the destination control link if it wants one. (In Maxwell's application, all he wants to know is that it was an LFC, which can be deduced from the trap parameter being zero.) Note that in order to benefit the Dolphin implementation, this convention must apply to Unbound traps as well as Xfer traps. Currently the PrincOps specifies (section 9.4.1, page 104):
  LFC: PROCEDURE [evi: EVIndex] = ...
    IF nPC=0 THEN UnboundTrap[MakeProcDesc[GF, evi]];
    ...
Thus trapping with zero instead of a fabricated procedure descriptor requires a PrincOps change. I'll wait for an "official" word from the SDD people before changing the Dorado implementation. (By the way, I'm still waiting for some response to my original message, now more than a month old.) Ed
*start* 00526 00024 US Date: 7 May 1982 3:16 pm PDT (Friday) From: Satterthwaite.PA Subject: Floating Point Opcodes To: Fiala, Taft, Stewart cc: Satterthwaite I was trying to organize some old mail files and came across a series of messages about extended floating-point opcodes. I am currently spending a fair amount of time and effort on compiler changes intended to make Cedar code smaller and faster. Whatever happened to FSc? Was it ever implemented for the Dorado? Should the compiler consider generating it? Ed
*start* 00958 00024 US Date: 7 May 1982 4:23 pm PDT (Friday) From: Taft.PA Subject: Re: Floating Point Opcodes In-reply-to: Satterthwaite's message of 7 May 1982 3:16 pm PDT (Friday) To: Satterthwaite cc: Fiala, Taft, Stewart There were several extensions which Fiala proposed and then implemented in the Dolphin microcode; the ones I can recall are FSc, square root, and a number of new option and sticky flags. I have not implemented any of these on the Dorado, mostly because I had no clear idea whether anyone was actually planning to use them. Assuming the Real package provides complete trap support for the extended facilities, you can go ahead and change the compiler to make use of them, independent of whether or not they have been implemented in the Dorado microcode. (Larry will have to provide the definitive answer on what trap support exists.) I will implement the Dorado microcode whenever there seems to be popular demand for it. Ed
*start* 00799 00024 US Date: 7 May 1982 6:17 pm PDT (Friday) From: Fiala.PA Subject: Re: Floating Point Opcodes To: Satterthwaite cc: Taft, Fiala, Stewart You asked for a reversal of the arguments to FSC at some point, and I don't remember whether or not I did that. The options for the floating point processor (rounding modes, underflow options, etc.)
will not be fully valuable until the FSticky bits (about 22 of them) can be saved and restored as part of the process state. The Thyme program is the only one that to my knowledge uses FSQRT. If anything is done at present, I think that Larry Stewart should take the lead by implementing the trap subroutines for FSC and FSQRT and the handling of the other modes and flags. Not until he does this should anything be done to the compiler.
*start* 00852 00024 US Date: 10 May 1982 9:20 am PDT (Monday) From: Satterthwaite.PA Subject: Re: Floating Point Opcodes In-reply-to: Fiala's message of 7 May 1982 6:17 pm PDT (Friday) To: Fiala cc: Satterthwaite, Taft, Stewart I will wait for someone else to take the initiative on this. The Cedar 3.1 compiler will emphasize performance improvements; until it is released, I will be in a fairly good position to change the compiler so that floating-point multiplications and divisions by constant powers of two generate FSc. I don't know how important this is, but I see (x+y)/2 a lot in graphics algorithms. I'd also be happy to consider any new peephole optimizations that occur to people as they read code listings. Aside from a reworking of inline expansion, changes in the compiler's global strategy are not likely in 3.x for any x. Ed
*start* 00451 00024 US Date: 10 May 1982 9:29 am PDT (Monday) From: Taft.PA Subject: Re: Floating Point Opcodes In-reply-to: Satterthwaite's message of 10 May 1982 9:20 am PDT (Monday) To: Satterthwaite cc: Fiala, Taft, Stewart I'll be happy to implement Dorado microcode for FSC, if someone will tell me what the current spec is. It sounds like a worthwhile addition and shouldn't be much work. Larry, is there already trap support for FSC? Ed
*start* 00784 00024 US Date: 10 May 1982 10:09 am PDT (Monday) From: Fiala.PA Subject: Re: Floating Point Opcodes In-reply-to: Satterthwaite's message of 10 May 1982 9:20 am PDT (Monday) To: Satterthwaite cc: Fiala, Taft, Stewart If we convert to Trinity this year, spending effort peephole-optimizing the old instruction set may be largely wasted. FSC is easy. I suggest that Larry implement the trap opcode and that Ed Taft provide Dorado microcode so that using the FSC opcode is really an improvement. At the moment, I have the floating point number to be scaled on the top of the stack and the integer scaling factor underneath that. I think you wanted the arguments reversed, and I can do that by expending three or four more microinstructions, but please confirm that.
*start* 00555 00024 US Date: 10 May 1982 12:01 pm PDT (Monday) From: Satterthwaite.PA Subject: Re: Floating Point Opcodes In-reply-to: Fiala's message of 10 May 1982 10:09 am PDT (Monday) To: Fiala cc: Satterthwaite, Taft, Stewart The scheme currently used for code generation tends to produce better code when constant and/or shorter operands are pushed last. I have a moderate preference for an implementation of FSC that reverses the current positions of the arguments, but it's not essential (things like f[x]/2 would come out somewhat poorly).
*start* 01105 00024 US Date: 11 May 1982 2:29 pm PDT (Tuesday) From: Fiala.PA Subject: Re: Floating Point Opcodes In-reply-to: Taft's message of 10 May 1982 9:29 am PDT (Monday) To: Taft cc: Satterthwaite, Fiala, Stewart Current implementation of FSC on the Dolphin is as follows:
At entry:
  0S/1S  a floating point number R
  2S     an integer scaling factor E (-202b <= E <= 200b on Dolphin)
At exit: stack is popped once
  0S/1S  a floating point number (R with E added to the exponent).
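[For reference, the arithmetic FSc is meant to match, as a minimal C sketch -- an illustration of the semantics only, not the Dolphin microcode; ldexpf is the standard C function for scaling by a power of two:

  #include <math.h>

  /* FSc semantics: result = R * 2^E.  Scaling by a power of two
     introduces no rounding of its own; only overflow or underflow
     can occur, which is why Satterthwaite asks that FSc behave
     identically to FMul/FDiv by constant powers of two. */
  float FSc(float r, int e) {
      return ldexpf(r, e);
  }

Under this reading, Satterthwaite's (x+y)/2 compiles to an FAdd followed by FSc with a scaling factor of -1, saving the cost of a full FDiv.]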
My implementation is only 4 microinstructions with the only tricky part being the range limit on the scaling factor. Values outside the range above result in reversal of overflow and underflow reporting. For uniformity of representation, I suggest that the opcode be defined to accept scaling factors -200b <= E <= 177b. If Ed Satterthwaite wants the arguments reversed, then the code will be longer because the standard unpacking subroutine will have to be preceded by saving the integer scaling factor. We should also discuss other ways in which the Dolphin and Dorado floating point opcodes are incompatible. *start* 01846 00024 US Date: 11-May-82 14:44:10 PDT (Tuesday) From: Sandman.PA Subject: Re: Xfer traps and related issues In-reply-to: Taft's message of 3 April 1982 5:49 pm PST (Saturday) To: Taft cc: Sandman, Sweet, Wick, Levin, Fiala, Maxwell Xfer Traps The following describes how xfer traps are currently implemented for Trinity on the Dandelion. Xfer traps will become a PrincOps feature, and they do indeed need cleaning up. I will do that as I update the PrincOps. The following code should be at the end of XFER (pg 103) and the end of LFC (pg 104). The change over Rubicon is that a bit in the overhead word is checked to see if xfer trapping is allowed. Xfer traps occur if the XTS register is odd. The XTS register is shifted right by one position each xfer unless xfer traps were enabled but the module did not allow them. IF XTS MOD 2 # 0 THEN BEGIN word: GlobalWord = FetchMds[@GlobalBase[GF].word]^; IF ~word.trapxfers THEN -- note the field should be dontTrapXfers. BEGIN XTS _ XTS / 2; TrapOne[sXferTrap, dst]; END; END ELSE XTS _ XTS / 2; For the "real" xfer trapping mechanism the trap parameters should include at least the destination, the type of xfer (call, return, process switch, etc.) and possible the source. Pushing Source and Dest PORTI and LKB are the only instructions that use the links pushed above the stack. Therefore, local function calls don't need to push them, but indirect xfers clearly do. The normal cases of EFC don't have to push links, but as ports are currently defined, returns do. See Figure 9.4 on page 110. Dest of LFC Zero should be used as the dest of an LFC in those cases that it is needed. I believe xfer traps and unbound traps are the only places it's needed since we've determined that source and dest needn't be pushed. *start* 00308 00024 US Date: 11 May 1982 2:55 pm PDT (Tuesday) From: Fiala.PA Subject: Floating Point To: Taft, Satterthwaite, Stewart cc: Fiala I am going ahead and reversing the arguments for FSC so that the integer scaling factor is on the top of the stack and the floating point number underneath it. *start* 01427 00024 US Date: 11 May 1982 3:44 pm PDT (Tuesday) From: Fiala.PA Subject: Microcode Version To: Levin cc: Morris, Satterthwaite, Taft, Fiala The VERSION opcode (Misc 104b) which puts the release date on top of stack and a flag word underneath that. I am not sure whether or not this opcode is implemented on the Dorado. It is implemented but has never been checked on the Dolphin. The release date is the number of days since 1 January 1968. I believe that this can be converted into Pilot standard time format by multiplying by 60*60*24 (=86,400). The flag word underneath that at the moment contains only two flags: bit 15d is "Cedar" (i.e., 1=Cedar microcode, 0=other pilot microcode); bit 14d is "floating point" (i.e., 1=floating point microcode is loaded, 0=no floating point microcode). 
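[The conversion just mentioned, as a one-line C sketch for concreteness; the point of "lengthening" first is that the product does not fit in 16 bits:

  /* days: the 16-bit day count returned by VERSION */
  unsigned long PilotTimeFromDays(unsigned short days) {
      return (unsigned long)days * 86400UL;   /* seconds since the same epoch */
  }
]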
My suggestion would be as follows: Upon booting, Cedar should execute the VERSION opcode and give a distinctive MP code crash if the "Cedar" bit = 0 and another distinctive crash if the microcode date is less than some particular value (I suggest that 1000 and 1001 be the MP codes shown for these)--this should be done early in the initialization, before any Cedar-specific operations are executed. In addition, the microcode date should be stored somewhere that is retrievable on crashes, so that I know what microcode was in use. You may also want to display the date somewhere immediately after booting.
*start* 00450 00024 US Date: 11 May 1982 4:05 pm PDT (Tuesday) From: Levin.PA Subject: Re: Microcode Version In-reply-to: Fiala's message of 11 May 1982 3:44 pm PDT (Tuesday) To: Fiala cc: Levin, Morris, Satterthwaite, Taft This sounds good. Perhaps this opcode could be extended to include the machine type (Dorado, Dolphin, Dandelion, Dicentra, ...)? If this were done, I would gladly put the appropriate tests into Cedar initialization. Roy
*start* 01145 00024 US Date: 11 May 1982 6:27 pm PDT (Tuesday) From: Taft.PA Subject: Re: Microcode Version In-reply-to: Fiala's message of 11 May 1982 3:44 pm PDT (Tuesday) To: Fiala cc: Levin, Morris, Satterthwaite, Taft I'll implement VERSION and FSC as you have specified, and I'll let you know when this is completed. Minor notes:
1. Your definition of the date is fine, but it differs from the Pilot definition by a constant offset.
2. Perhaps there should be provision for a version number as well as a release date. I can imagine that the need might arise to maintain microcode for two incompatible instruction sets simultaneously, and the release date would not be a good indication of which version was loaded. I would recommend using version numbers rather than release dates as the basis for an automatic check that the microcode is "new" enough. Making the microcode release date available to the human user is still a good idea, however.
3. Maintenance panel codes have to be less than 1000 to be displayed properly on the Dorado. There is barely room for 3 digits in the cursor, and 4 would be completely hopeless.
Ed
*start* 02340 00024 US Date: 11 May 1982 8:46 pm PDT (Tuesday) From: Fiala.PA Subject: Re: Microcode Version In-reply-to: Taft's message of 11 May 1982 6:27 pm PDT (Tuesday) To: Taft, Levin cc: Fiala, Morris, Satterthwaite I neglected to mention the part of the flags word for VERSION that reports the machine. That is also implemented and is compatible with the Alto VERS opcode. Bits 0:3 of the word containing the Cedar and floating point bits contain the engineering number (0 or 1 = Alto I, 2 = Alto II without extended memory, 3 = Alto II with extended memory, 4 = Dolphin, 5 = Dorado, 6 = Dandelion). I had intended the date returned by VERSION to require only multiplication to get to the Pilot time format. If Pilot time is not the number of seconds since 0:00 1 January 1968, what is it? I wanted to keep the transformation simple and will change my representation to anything reasonable. I have Micro macros for this if you want them. With regard to the maintenance panel codes, I wanted only to avoid confusion with the Germ and microcode MP codes. If 1000+ are unavailable, then 800 to 899 seem a good choice. How about 800 for "Not Cedar", 801 for "Ancient microcode", and 802 for "Incompatible version number"--we should reserve these, perhaps, even if not all are implemented.
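[A minimal C sketch of the boot-time check being discussed; VERSION and Crash here are hypothetical bindings of the opcode and the MP crash mechanism, and the MP codes are the 800 series Fiala proposes above:

  typedef unsigned short Word;

  extern void VERSION(Word *flags, Word *releaseDay);  /* assumed binding */
  extern void Crash(int mpCode);           /* display MP code and halt */

  enum { mpNotCedar = 800, mpAncientMicrocode = 801 };

  void CheckMicrocode(Word oldestAcceptableDay) {
      Word flags, day;
      VERSION(&flags, &day);
      if ((flags & 1) == 0)                /* bit 15d = Cedar microcode */
          Crash(mpNotCedar);
      if (day < oldestAcceptableDay)
          Crash(mpAncientMicrocode);
      /* also squirrel `day` away where a crash dump can find it */
  }

In Mesa bit numbering, bit 0 is the most significant, so bit 15d is the low-order bit of the flag word, hence the mask of 1.]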
I suggest that bits 4:7 of the flags word be the version number, right now equal to 0. I think that both the version and the date should be surveyed by software during booting, although not necessarily handled the same way. I want to change the version number only for incompatible changes in the microcode. When these are wrong, the software should definitely crash with an MP code. The version number for a particular Cedar system would be the same on both Dolphin and Dorado (etc.) and wouldn't change for most Cedar releases. However, the date would change whenever maintenance releases occurred and might be different on the Dolphin and Dorado. Depending upon how bad the bugs are, the software might choose to print "obsolete microcode" on the display and plunge ahead, or it might choose to crash. One method for automatic date checking is to retrieve the create date of the appropriate microcode file from the Ethernet and compare it to that installed as Pilot microcode. This check may be too expensive to make, however.
*start* 01227 00024 US Date: 12 May 1982 9:48 am PDT (Wednesday) From: Taft.PA Subject: Re: Microcode Version In-reply-to: Fiala's message of 11 May 1982 8:46 pm PDT (Tuesday) To: Fiala cc: Taft, Levin, Morris, Satterthwaite
1. The Pilot format time has the same representation as Alto format, namely seconds since midnight, January 1, 1901. (The difference between Pilot and Alto time interpretation is that the beginning of the current "epoch" in Pilot is January 1, 1968; bit patterns that would be interpreted as times between 1901 and 1968 in the Alto world refer to times in the future in Pilot.)
2. 4 bits does not seem nearly enough for a microcode version number. I would advocate devoting an entire word to it.
3. I agree with your proposed semantics, which I will try to restate in a different way. The version number identifies the version of the instruction set that is implemented, and is machine-independent. Changing the version number requires agreement among purveyors of the microcode that implements the instruction set and software that uses it. The pair [machine type, release date] identifies the specific release of the microcode; this identification is independent of the version number. Ed
*start* 01084 00024 US Date: 13 May 1982 11:05 am PDT (Thursday) From: Fiala.PA Subject: Re: Microcode Version In-reply-to: Taft's message of 12 May 1982 9:48 am PDT (Wednesday) To: Taft cc: Fiala, Levin, Morris, Satterthwaite I want to keep the value returned by Version small since it costs two microinstructions per word returned, so I recommend that we stay with an opcode that returns two words. If we do this, then enlarging the version number is not feasible. Also, my model of how comparisons will be made to the version number is always "equal", never "greater than", so counting version numbers mod 16d is perfectly adequate. I can't believe that microcode 16 versions old will still be around to cause confusion. As to the date, starting at 1901 means that current dates will be about 30000d. If the type of the value returned is CARDINAL, this still leaves plenty of years before overflow, but if the type is INTEGER, then we will overflow relatively soon. I don't think this is a problem, so I will convert my code to produce days since midnight 1 January 1901.
*start* 01910 00024 US Date: 13 May 1982 11:41 am PDT (Thursday) From: Taft.PA Subject: Re: Microcode Version In-reply-to: Fiala's message of 13 May 1982 11:05 am PDT (Thursday) To: Fiala cc: Taft, Levin, Morris, Satterthwaite
1. I still disagree about the size of version numbers. There are situations in which we add new opcodes in an upward-compatible way so that old software continues to work; but a new version number needs to be assigned so that new software can tell whether or not the new opcodes are implemented. That is, the old software will work with either version n or version n+1 of the microcode, and should test for version ge n; but the new software will work only with version n+1 so should test for version ge n+1. Actually, this sort of arrangement is not ideal because there is no way to capture the distinction between upward-compatible and incompatible changes. That is why "major" and "minor" version numbers are often used. A change in the minor version number represents an upward-compatible change which does not affect old software and which only new software will care about. A change in the major version number represents an incompatible change (and causes the minor version number to be reset to zero). If this strategy is adopted, the correct version number test for software to perform is "equal" on the major version number and "greater than or equal" on the minor version number. If adding another word to the results returned by the Version opcode is too costly in microcode space, then I advocate widening the version number field from 4 bits to 8, and subdividing it into major and minor versions (say 3 bits for major version and 5 for minor). Thus the first word returned by Version contains 4 bits of machine type, 8 bits of version number, and 4 bits of flags (of which 2 are currently defined).
2. Defining the date word as a CARDINAL sounds fine.
Ed
*start* 01507 00024 US Date: 14 May 1982 9:42 am PDT (Friday) From: Fiala.PA Subject: Re: Microcode Version In-reply-to: Taft's message of 13 May 1982 11:41 am PDT (Thursday) To: Taft cc: Fiala, Levin, Morris, Satterthwaite The combination of the version number for what you are calling the "major version number" and the date for what you are calling the "minor version number" seems perfectly adequate with respect to all the points you raise in your message. Namely, incompatible changes (which are probably infrequent) bump the 4-bit version number (mod 16d); compatible changes bump the date. For any version number the dates are monotonic upward. However, I don't want to iterate on this any more. As you may recall, there were about 20 messages exchanged over a year ago of the same general flavor as the ones we are now exchanging, and no consensus was reached, so no VERSION opcode was implemented. I don't feel strongly about any of the points raised, so I will be happy to implement a VERSION opcode that is exactly what you think is best. 2 microinstructions is only 2 microinstructions, so we can make the value returned be 3 words, if you'd prefer. Please send me a message specifying how you'd like the bits arranged. I would like a few "Dolphin-specific" (i.e., machine-dependent) bits left over for uses such as the "has floating point microcode" and "Cedar mode" bits. Other than that, I can adapt to any date, version number, and argument ordering conventions that you like.
*start* 01336 00024 US Date: 14 May 1982 10:09 am PDT (Friday) From: Taft.PA Subject: Re: Microcode Version In-reply-to: Fiala's message of 14 May 1982 9:42 am PDT (Friday) To: Fiala cc: Taft, Levin, Morris, Satterthwaite Very well; you have convinced me.
Here is the spec for the result of the VERSION opcode as I understand it:
  MachineType: TYPE = MACHINE DEPENDENT {
    altoI (1), altoII (2), altoIIXM (3), dolphin (4), dorado (5),
    dandelion (6), dicentra (7), (17B)};
  VersionResult: TYPE = MACHINE DEPENDENT RECORD [
    machineType (0: 0..3): MachineType,
    majorVersion (0: 4..7): [0..17B],  -- incremented by incompatible changes
    unused (0: 8..13): [0..77B],
    floatingPoint (0: 14..14): BOOLEAN,
    cedar (0: 15..15): BOOLEAN,
    releaseDate (1): CARDINAL];  -- days since January 1, 1901
  VERSION: PROCEDURE RETURNS [VersionResult] = MACHINE CODE {
    Mopcodes.zMISC, 104B};
In my opinion, the flags should all be machine-independent; but "floatingPoint" and "cedar" certainly meet this requirement. This is just to say that the flag bits should mean the same things on all machines. If you require any truly machine-dependent flags (or other things), they should be returned by a separate, machine-dependent opcode. Perhaps the above definitions should be put in some interface, or turned into a new interface. Ed
*start* 02923 00024 US Date: 14 May 1982 11:16 am PDT (Friday) From: Taft.PA Subject: Re: Xfer traps and related issues In-reply-to: Sandman's message of 11-May-82 14:44:10 PDT (Tuesday) To: Sandman cc: Taft, Sweet, Wick, Levin, Fiala, Maxwell Xfer traps: thanks for the updated information. Pushing source and dest: I think you slightly misunderstood my proposal. I'm suggesting that the decision to push source and dest should be based not on the opcode that called Xfer but rather on whether or not any indirect links are encountered during the Xfer. That is, the "push" argument of Xfer should be eliminated; and the body of Xfer should have something like the following added:
  push: BOOLEAN _ FALSE;
  ...
  WHILE controlLinkType[nDst]=indirect DO
    link: IndirectLink _ nDst;
    nDst _ FetchMDS[link]^;
    push _ TRUE;
  ENDLOOP;
I'm pretty sure this will work for all uses of PORTI and LKB that I'm aware of. The main thing I was concerned about was whether PORTI and LKB are the ONLY sanctioned uses of the links pushed by Xfer. There is nothing in the PrincOps at present that would prohibit a program from attempting to do two PUSHes to recover the source and destination links left over from an arbitrary Xfer. Your statement "PORTI and LKB are the only instructions that use the links pushed above the stack" seems to suggest that other uses are prohibited. Is this correct? Dest of LFC: Before I acquiesce on the issue of passing zero instead of a fabricated procedure descriptor when LFC causes an unbound or Xfer trap, I would like to point out that doing so makes the job of the Xfer trap handler somewhat hairy. If the Xfer trap handler needs to fabricate a real procedure descriptor (as is true in Maxwell's application), it does not suffice to go back to the source context of the original Xfer and look at the opcode that was executed, because the PC has been advanced past that opcode and you don't know whether it was a 1-byte (LFCn) or 2-byte (LFCB) opcode. Roy suggests another strategy, which is to take the PC of the destination context (which is the caller of the Xfer trap handler) and search for it in the entry vector in order to determine the entry vector index. This is workable but messy. Nevertheless, the fact that saving the EVI to enable a real procedure descriptor to be fabricated would slow down all LFCs on the Dolphin seems like a fairly compelling argument in favor of passing zero instead. So I am willing to go along with this, if everyone else is still happy about it.
But note that this represents a change in the PrincOps. In the definition of LFC, the statement
  IF nPC=0 THEN UnboundTrap[MakeProcDesc[GF, evi]];
should be replaced by
  IF nPC=0 THEN UnboundTrap[LOOPHOLE[0]];
Also the assertion in section 9.4.1 that LFC[evi] is equivalent to Call[MakeProcDesc[GF, evi]] should be removed because it is not true with respect to traps. Ed
*start* 00890 00024 US Date: 14-May-82 12:16:42 PDT (Friday) From: Sandman.PA Subject: Re: Xfer traps and related issues In-reply-to: Taft's message of 14 May 1982 11:16 am PDT (Friday) To: Taft cc: Sandman, Sweet, Wick, Levin, Fiala, Maxwell Pushing source and dest: I agree with your proposal to eliminate the push argument of Xfer and push links only if an indirect link is encountered. PORTI and LKB are the only instructions that may reference the links above the stack. Software must ensure that these instructions are only executed immediately after an indirect xfer. Dest of LFC: While in general it is not possible to read code backwards, in this case it is possible to determine if the call in the source was an LFCn or an LFCB. If the LFCn opcode numbers are > 128 (and they are), then they cannot be mistaken for the alpha byte of the LFCB which must be less than 128.
*start* 01507 00024 US Date: 31 May 1982 4:57 pm PDT (Monday) From: Taft.PA Subject: Floating Point To: Satterthwaite, Fiala, Stewart cc: Taft I've implemented the Floating Scale (FSc) opcode in the Dorado microcode. It's been tested only superficially; a more thorough test awaits completion of changes to the Real package and test programs, which Larry is working on. I estimate that FSc is a factor of 5 faster than FMul for multiplications by powers of two (1.2 microseconds rather than 6.2). The exact specification is:
  FSc: PROCEDURE [number: REAL, scale: INTEGER] RETURNS [result: REAL] =
    MACHINE CODE { Mopcodes.zMISC, 37B };
The result returned is number*(2**scale). Traps can occur for over/underflow, but the inexact result condition cannot occur. I believe there is a restriction in the Dolphin implementation that the scale factor must be IN [-130 .. 128], else over/underflow may not be detected properly. The Dorado implementation does not have this restriction. This means that either (1) the compiler must insert an explicit bounds check for non-constant scale factor, or (2) the Dolphin implementation must be changed to deal properly with the full range of scale factors. In conjunction with this, I've debugged and released a new floating point implementation which I began last year but never finished. It uses an improved internal representation for unpacked numbers (suggested by Ed Fiala), and consequently the primary opcodes are 5 to 10% faster than before. Ed
*start* 00518 00024 US Date: 1 June 1982 8:47 am PDT (Tuesday) From: Taft.PA Subject: Re: Floating Point In-reply-to: Taft's message of 31 May 1982 4:57 pm PDT (Monday) To: Taft cc: Satterthwaite, Fiala, Stewart On second thought, I can't offhand think of any language construct that would compile into scaling by a non-constant factor. (Something like r*(2**i) would do that, but Mesa doesn't have an exponentiation operator.) So perhaps the Dolphin restriction on the range of the scale factor is harmless.
Ed *start* 02283 00024 US Date: 1 June 1982 9:35 am PDT (Tuesday) From: Fiala.PA Subject: Re: Floating Point In-reply-to: Taft's messages of 31 May 1982 and 1 June 1982 To: Taft cc: Satterthwaite, Fiala, Stewart The range restriction is not entirely harmless in the Dolphin implementation; it has one of two drawbacks: (1) If the compiler restricts scaling factors to be in the range -200b to +177b (actual Dolphin restriction is -202b to +200b), then scaling factors in the range -377b to -201b and +200b to +377b cannot be used, where these ranges might be useful if something is known about the range of the number being scaled. However, note that while these very large and very small scaling factors are potentially useful, they do not correspond to any representable real number, so they won't occur as the result of simply substituting FSc for FMul. -or- (2) If the range restriction is ignored, then any scaling factor can be used, so (1) is not a problem, but if an overflow or underflow occurs, then the microcode reverses the indication, believing that an overflow is, in fact, an underflow and vice versa. When both underflow and overflow trap to software, this is never a problem, but since we added the substitute-0-on-underflow option for Wilhelm's Thyme program, it is possible that a result of 0 will be substituted for some overflows as a result of not range checking. I have been under the impression that the FSc opcode will be produced by the compiler only when one of the arguments to an FMul is (a constant) 2^n or 2^-n, in which case all bounds checking can be carried out at compile time, when the decision to substitute FSc for FMul is made. Also, bounds checking is useless since the Real number exponent cannot be outside -200b to +177b anyway. Even if the compiler could somehow decide to use larger scaling factors, I don't want to expend more microcode range-checking FSc, which also slows it down. I think very large or very small scaling factors will be very rare, and the compiler can restrict itself to scaling factors in -200b to +177b without loss of generality, using two FSc's for larger scaling. The fact that FSc will be faster for the narrower range will more than make up for the fact that it can't be used for the larger range. *start* 01871 00024 US Date: 14 June 1982 5:01 pm PDT (Monday) From: Taft.PA Subject: Dolphin microcode boot To: Frandeen, Neely, Luniewski cc: Murray, Taft I'm hoping one of you can remember far enough back to help me with the following: I'm in the midst of implementing boot-loading of microcode from the disk on the Dorado. (Until now, microcode has always been booted from the Ethernet.) I'm trying to understand exactly how the Initial microcode file is stored on the reserved area of the disk, and how the Boot microcode reads it. My current understanding of this, determined by reading the code implementing Othello's "Initial Microcode Fetch" command, is as follows: The microcode file (in .eb format, including the initial overhead page) is written starting at a fixed disk address. The label of each page is entirely zero, with the exception that the bootChainLink points to the next page, and is filled in for every page rather than only for pages that end runs. The filePageLo word is zero in every page instead of incrementing for successive pages. The file terminates at a page whose bootChainLink is zero. 
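[As it turns out (see Frandeen's reply below), the Boot program checks each link for plausibility before following it. A loop-safe reader in that spirit might look like the following C sketch; the DiskAddress layout and ReadPage are invented for illustration, and the head/sector bounds are the ones Frandeen cites:

  typedef struct { unsigned short cylinder, head, sector; } DiskAddress;

  /* Assumed helper: read and load the page at da, returning the
     bootChainLink from its label. */
  extern DiskAddress ReadPage(DiskAddress da);

  /* Follow bootChainLinks blindly, but sanity-check each link and
     bound the walk so a corrupt chain cannot loop forever. */
  int ReadInitialMicrocode(DiskAddress da) {
      enum { maxPages = 1000 };                 /* arbitrary safety bound */
      int pages;
      for (pages = 0; pages < maxPages; pages++) {
          DiskAddress next = ReadPage(da);
          if (next.cylinder == 0 && next.head == 0 && next.sector == 0)
              return pages;                     /* zero link: end of file */
          if (next.head > 7 || next.sector > 27)
              return -1;                        /* implausible link: garbage */
          da = next;
      }
      return -1;                                /* chain too long: assume a loop */
  }
]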
If all that is correct, then it would appear that the Boot program must read the Initial microcode file without checking labels but just blindly following the bootChainLinks. If this is true, how does the Boot program distinguish a valid Initial microcode file from random garbage (or detect that a chain trails off into garbage or goes into a loop)? I assume it must do so somehow, since it is able to switch to booting microcode from the Ethernet if no microcode is installed on the disk. There is not any particular need for me to adopt the same convention on the Dorado, since this is a part of the architecture that is machine-dependent. But I think I should understand the present implementation before I start changing it. Ed *start* 00856 00024 US Date: 15-Jun-82 9:11:17 PDT (Tuesday) From: Luniewski.PA Subject: Re: Dolphin microcode boot In-reply-to: Taft's message of 14 June 1982 5:01 pm PDT (Monday) To: Taft cc: Frandeen, Neely, Luniewski, Murray I believe that you have correctly described the current situation for both Dolphins and Dandelions. I can add the following: 1. On both the Dolphin and Dandelion I believe that the initial microcode is constrained to not cross a track boundary. 2. I also believe that there is a constraint that the first page of the initial microcode area not be bad. 3. Perhaps the Boot program detects a non-existent initial microcode file by a label mismatch when reading the first page of the initial microcode. This is possible since, when the disk is formatted, that part of the disk ends up with rather "unusual" labels. /Allen *start* 00825 00024 US Date: 15 June 1982 1:08 pm PDT (Tuesday) From: Frandeen.PA Subject: Re: Dolphin microcode boot In-reply-to: Taft's message of 14 June 1982 5:01 pm PDT (Monday) To: Taft cc: Frandeen, Neely, Luniewski, Murray The Boot program (SA4000Boot) checks each boot chain link to see if it appears to be valid. If the head number is greater than 7, or if the sector is greater than 27, it decides the link is invalid; if it falls within valid bounds, the program blindly follows the link. The program also checks the validity of the data which is being read. The data being read is control store data, so it checks to be sure it is not loading into reserved control store areas, and it also verifies the checksum. I believe the end of the boot file is signalled by -1 in the last word of the label field. Jim *start* 01753 00024 US Date: 16 June 1982 5:49 pm PDT (Wednesday) From: Taft.PA Subject: Side-effect of Xfer change To: Sandman cc: Levin, Fiala, Sweet, Wick, Taft Roy just tracked down an obscure side-effect of the Xfer change we made recently (eliminating the push argument of Xfer and instead pushing links only if an indirect link is encountered during the Xfer). If an ordinary procedure call (i.e., not a trap) results in a type 0 Xfer to a fixed-frame handler, the destination frame's return link is not set. PilotNub.WorryCallDebugger is such a fixed-frame handler, and it is entered by a KFCB. The way it figures out who called it is by recovering the source link that was supposedly left above top-of-stack by the KFCB. With the new Xfer this no longer works. Note that this is not a problem with fixed-frame trap handlers, since Xfer explicitly stores the return link during traps. It is a problem only during non-trap Xfers directly to fixed-frame handlers (i.e., not via ports). This is a fairly bizarre use of Xfer, and one that probably doesn't deserve any special microcode support. 
The solution Roy and I have tentatively adopted is to put an indirect link into the SD slot for WorryCallDebugger, pointing to a cell that actually contains the pointer to the fixed frame. We propose that, as a general policy, a fixed-frame handler called other than by traps which wants to figure out who called it must interpose an indirect link in the path leading to it so that the source and destination links will be pushed. This is just another way of saying that fixed-frame handlers should be called via ports rather than directly, which is more consistent with Mesa language semantics anyway. Do you concur with this? Thanks. Ed *start* 00393 00024 US Date: 17-Jun-82 14:10:03 PDT (Thursday) From: Sandman.PA Subject: Re: Side-effect of Xfer change In-reply-to: Taft's message of 16 June 1982 5:49 pm PDT (Wednesday) To: Taft cc: Sandman, Levin, Fiala, Sweet, Wick Calling fixed frames through indirect links is clearly the right thing to do. For Trinity, WorryCallDebugger was made to call a port, not do a KFCB. Jim *start* 01529 00024 US Date: 7 July 1982 11:06 pm PDT (Wednesday) From: Fiala.PA Subject: Re: Microcode sizes In-reply-to: Levin's message of 7 July 1982 11:58 am PDT (Wednesday) To: Levin cc: Taft, Fiala Results below are for Trinity microcode, where size is somewhat larger than Rubicon because of more Esc/EscL opcodes. Initialization microcode, subsequently overwritten, is not included. IO drivers for EOM, EIM, CDC9730, MIOC, DUIB, Jasmine, JasmineHalftone, Audio, etc., not used by Cedar, are not included. Essentially, all available microstore can be used by turning on various optional opcodes or io drivers; at the moment a number of infrequent Esc/EscL opcodes trap to software, so the emulator could be larger. Known differences from Dorado: 1) Dolphin does not include Cedar allocator opcodes that are included on Dorado; I think these would add about 300b on Dolphin. 2) Dorado initialization is not overwritten. 3) Dolphin FSqRt not included in Dorado. 4) 10 mb Ethernet driver not included in Dorado.
Emulator             5277b = 2126b+3151b enumerated below
  Process             600b
  Xfer                302b (not including PortI)
  BitBlt              344b
  TextBlt               0b (230b but not included in Cedar)
  Floating Point      542b (includes FSqRt)
  Cedar               605b
  everything else     134b
special io opcodes
  Disk                332b
  3 mb Ethernet       204b
  10 mb Ethernet      333b
  Display             365b, 254b, or 377b (CSL display is biggest)
  Other I/O           104b (54b color display, 30b Timers)
everything else       265b fault handler
unused CRAM space     420b at the moment
*start* 01878 00024 US Date: 4 Oct. 1982 12:01 pm PDT (Monday) From: Fiala.PA Subject: Re: Dolphin clock rate In-reply-to: Taft's message of 1-Oct-82 17:00:38 PDT To: Taft cc: Fiala, Levin You put the value 323b onto the top of stack and use the READR opcode (Misc 106b) to read the vCrystal register from the hardware. It returns a result interpreted as in the following table:
            640    1280    2560    cycles/tick
40 MHz      320     160      80
44.5 MHz    356     178      89
50 MHz      400     200     100
At the moment, you will get "80" back for most of the Dolphins running Cedar. This indicates that the RCLK opcode reads a counter denominated in units of 2560 cycles/tick and that cycles are 100 ns (40 MHz crystal). As you can see from the table, the number read from vCrystal is directly proportional to the real rate of the clock. ------- While you are making the variable clock rate changes, you might consider doing something about the problems which Cedar/Pilot have when initializing for a different number of useful pages than the number at which you last booted.
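A sketch of the inline this implies, in the style of the FSc declaration earlier in this file (only the MISC alpha byte 106b is given; the procedure name and parameter type are assumptions):
READR: PROCEDURE [register: CARDINAL] RETURNS [value: CARDINAL] =
  MACHINE CODE { Mopcodes.zMISC, 106B };
-- e.g., READR[323B] returns 80 on a standard 40 MHz Dolphin (2560 cycles/tick)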
You can also use READR to get the storage information passed through from Initial as follows:
TOS
344b   xPageCount       Count of 'good' pages from Initial
345b   xStorageFaults   Sum of 2^(bad board no.) from Initial
                        (e.g., 1 = storage board nearest proc. had problem,
                         2 = 2nd board from proc. had problem, 4 = 3rd board ...)
346b   xHardBadPages    Count of hard bad pages from Initial
347b   xSoftBadPages    Count of soft bad pages from Initial
In the absence of making software work with a different number of useful pages, which would be best but might be hard, I would suggest that you issue a distinctive MP code for this problem, or store storage-testing results in some place where higher levels of software can display results if they choose. I think I sent a more comprehensive message on this subject some time ago. *start* 08273 00024 US Date: 6 Oct. 1982 2:44 am PDT (Wednesday) From: Thacker.PA Subject: SixteenBitness To: Fiala, Petit, Taft cc: Thacker I would appreciate your preissue comments on the following, which I believe represents our discussions of the past few days (years?). Chuck ------- This memo discusses the support to be provided by the hardware and the instruction set of Dragon for compatibility with 16-bit addressing and 16-bit data items as used in present D-machines. Although this is a simple problem, it has caused more debate than any other issue in the design, and I would like to resolve the debate and move on to more productive issues. The Dragon and its instruction set will be quite different from present D-machines. Because it is a multiprocessor, the process mechanism will change. The opportunity to use a large number of registers in the processor will change the data structures of Xfer and the way in which local frames and the stack are addressed. The implementation of the map will provide a much larger virtual address space than is presently available. Finally, the use of 32-bit data paths provides an opportunity to increase significantly the efficiency with which 32-bit pointers and 32-bit data are handled. Although a number of programs will require source changes to run on Dragon, it should be possible to recompile a large fraction of existing code and have it run efficiently. In many cases, the (new) compiler will be able to use the most efficient 32-bit constructs for things now declared SHORT, but it is believed that in many cases, 16-bit addresses and data are built into even "vanilla" programs in ways that the compiler cannot translate properly. For these situations, we must provide a (possibly inefficient) escape hatch. [Note: Those who believe this are a small but vocal and influential minority of those interested in Dragon. BWL is the ringleader, and I myself have flopped back and forth from time to time.] The escape hatch is the subject of this memo. Presently, the Mesa instruction set provides two types of pointer: LONG POINTERs address 16- or 32-bit quantities in an address space of up to 2^32 16-bit words, and POINTERs address similar quantities in a distinguished 2^16 word region of VM - the Main Data Space. There are a variety of instructions for loading and storing both types of data using both types of pointer (four cases). Cedar extends the pointer types to include REFs, which are identical to LONG POINTERs when used, but which support reference counted assignment. Cedar also extends the LONG data types, although these extensions are invisible at the hardware or instruction set level.
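By way of illustration, the four load/store cases look like this at the source level (a sketch; the declarations themselves are hypothetical):
p: POINTER TO CARDINAL;              -- 16-bit pointer, MDS-relative, to a 16-bit item
lp: LONG POINTER TO CARDINAL;        -- 32-bit pointer, full VM, to a 16-bit item
pd: POINTER TO LONG CARDINAL;        -- 16-bit pointer to a 32-bit item
lpd: LONG POINTER TO LONG CARDINAL;  -- 32-bit pointer to a 32-bit item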
A pointer to a 32-bit doubleword may be either even or odd (no alignment restrictions), and doublewords are stored with their least significant bits in the word with the smaller address. [Note: This convention is almost certainly a mistake, since it is inconsistent with the storage representation of strings and the naming of the bits. I did it and I am now suffering for it.] Unfortunately, the most straightforward implementation for the Dragon memory is inconsistent with ALL of the formats above. Were compatibility not an issue, a pointer would be a 32-bit quantity that addressed a 32-bit word, yielding an address space twice as large as the present one. Because we expect that 32-bit data and pointers will be the most common type in the future, we do not want to burden the memory with 16-bit compatibility features, and want to add as little compatibility hardware to the processor as possible. Within the processor (and therefore on the stack and in frames), all quantities will be held in 32-bit registers. When a 16-bit quantity is loaded onto the stack, the most significant 16 bits of the register will be cleared by the load. If the quantity is an INTEGER, it will be explicitly sign extended; an instruction will be provided for this purpose. Such quantities can then be manipulated by the 32-bit ALU, so no special instructions are required for 16-bit arithmetic. Before 16-bit results are stored, the most significant 16 bits of the register may be explicitly tested for all zeros or all ones to ensure that overflow has not occurred. Addresses with 16-bit resolution will be generated as 32-bit quantities, and a limited number of opcodes will be provided to access 16- and 32-bit data. The exact details of the opcodes are not final, but at least four are required: two to read and write 16-bit items, and two to read and write 32-bit items. These instructions may have to do several memory references, depending on the length and alignment of the data, but they will not have any alignment restrictions. All these opcodes will take 32-bit addresses (LONG POINTERs). If short (MDS-relative) addressing is intended, an opcode to replace the most significant 16 bits of the address with MDS (from where?) can be explicitly inserted by the compiler. Note that 16-bit compatible addresses can only span half the available VM. In order to provide the maximum range for indices and the simplest address calculation procedure, it is important that SIZE[MostFrequentObject] = 1. In the present D-machines, MostFrequentObject is assumed to be the 16-bit word. In Cedar, the MostFrequentObject is 32 bits in length. It is likely that in the (perhaps distant) future, there will be no 16-bit quantities or 16-bit resolution addresses, and we will go to full "native mode" addressing on the Dragon (and omit the right-shift necessary for compatibility mode). At the moment, this would presumably be difficult, since REFs and LONG POINTERs must have the same representation in storage (although the eventual conversion is made simpler by the fact that REFs are currently all even). We propose to provide a rich set of opcodes to support native mode addressing, plus the escape hatch instructions for compatibility mode. This will mean having to shift a native mode REF to produce a LONG POINTER during the period when both worlds exist, but this should not be a problem.
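Concretely, the conversion is a one-bit shift: a native mode REF holds a 32-bit-word address, and doubling it yields the equivalent 16-bit-word address. A sketch (the procedure is hypothetical):
RefToLongPointer: PROCEDURE [r: REF ANY] RETURNS [LONG POINTER] = INLINE
  BEGIN RETURN[LOOPHOLE[2*LOOPHOLE[r, LONG CARDINAL], LONG POINTER]]; END;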
It is true that we could keep compatibility addressing forever with no performance penalty, but it would restrict the address space to 2^31 32-bit words. The final addressing question that requires resolution is the representation of 32-bit quantities in storage. As mentioned above, these quantities are stored with the least significant word in the lower address, which is inconsistent with the numbering of the bits and the storage convention for strings. On present machines, this has been confusing, but not terribly so. On a 32-bit machine, it is much worse, particularly when we introduce 64-bit REALs. As Danny Cohen [1] points out, it is not important whether you are a Big-Endian or a Little-Endian, but it IS important that all quantities are represented in a consistent manner. Currently, we are Big-Endian in everything except the storage format for 32-bit items. I think we should go to a consistent representation. There are two possibilities: p1 - Big-Endian: Reverse the storage order of doubleword quantities. This should cause few conversion problems, since we have very few jars of pickled values sitting on the shelf, and most programs that access parts of 32-bit quantities do so through the INLINE interface or the MACHINE DEPENDENT RECORD that defines REALs in IEEE format. p2 - Little-Endian: Reverse the numbering of the bits, reverse the order of the bytes in a string, and leave 32-bit data the way it is. This is slightly more elegant than the Big-Endian order, since increasing addresses imply increasing significance, but it is a terrible wrench for those of us who like our MSB numbered zero. It is, of course, possible to leave things the way they are. I can even draw you a picture of storage that looks right, with the most significant bits of a 32-bit word on the left and the low order bits on the right. Unfortunately, in this picture, the first byte of a string is in the third byte of the 32-bit word. Ugh! Does anyone have any objections to going to Big-Endian order? If not, let it be so. --------- [1] Danny Cohen, "On Holy Wars and a Plea for Peace", IEEE Computer, October 1981 *start* 01113 00024 US Date: 13 Oct. 1982 12:13 pm PDT (Wednesday) From: Taft.PA Subject: PrincOps question To: Sandman cc: Taft I'm starting to work on the Trinity microcode for the Dorado. I thought I would try programming it from the PrincOps rather than from some existing implementation (Dolphin or Dandelion), both as an opportunity to test the accuracy of the PrincOps and as a way of casting off unneeded vestiges of old implementations. So far, this seems to have been fairly successful. I have found the PrincOps to conform extremely closely with my existing understanding of the instruction set; and I have been able to make some reorganizations that have resulted in significant improvements in the code. While studying Xfer, I have noticed one thing which I don't understand the reason for. In the frame case of Xfer, a test is made for GF=0 and nPC=0, and an UnboundTrap occurs if either is true. I don't understand how either GF or nPC could become zero in an existing local frame; and I suspect the GF=0 and nPC=0 tests are appropriate only in the procedure case. What do you think? Ed *start* 01007 00024 US Date: 13-Oct-82 12:36:45 PDT (Wednesday) From: Sandman.pa Subject: Re: PrincOps question In-reply-to: Taft's message of 13 Oct. 1982 12:13 pm PDT (Wednesday) To: Taft cc: Sandman I believe the tests are more consistency tests than anything else.
It would probably be better if it said ERROR instead of UnboundTrap, making programs which change an existing frame in that manner illegal. A word of caution. I assume you are using the version of the PrincOps that I gave to Roy. In answer to your previous message, I have not made a more recent version. There are three places where it differs from Trinity. Xfer traps in Trinity do not have the type parameter. Code traps in Trinity take the dest link as the parameter instead of the global frame. State vectors in Trinity cannot be permanently attached to a psb. I hope to get back to the PrincOps soon. I intend to make version 4 describe exactly what Trinity does. I appreciate your debugging the PrincOps in this way. Jim *start* 03271 00024 US Date: 19 Oct. 1982 9:36 am PDT (Tuesday) From: Taft.PA Subject: Dragon Xfer To: Petit cc: Thacker, Fiala, Taft I read your memo with great interest. Here are my comments. 1. My first reaction was astonishment over the lack of any mention of Interface Function Call (IFC). The DFC is fast and is wonderful for tightly-bound systems such as boot files, but is no good for dynamically-loaded modules. I mentioned this to Chuck, and he told me that IFC can be composed of (approximately) the following sequence of operations:
Load Global n
Read n
Stack Function Call
Furthermore, Chuck says this sequence executes just as fast as if the hardware provided a combined IFC instruction -- so the absence of IFC is costly in space but not in time. If this is really true, then I am happy. In any event, your next edition of the Dragon Xfer memo should contain some discussion of IFC. 2. Another point worth considering is that it is desirable to have the DFC instruction and the IFC instruction sequence occupy the same number of bytes. That way, it need not be known at compile time how a module will be bound. Instead, the compiler can simply always generate the IFC sequence; and the program that does the tight binding (e.g., the boot file maker) can replace IFC sequences with DFCs. (Arranging the default to work this way is advantageous because code containing DFCs must be copied before binding whereas code containing IFCs need not be, and making a boot file involves copying code anyway.) DFC takes 4 bytes whereas a simpleminded encoding of the IFC sequence takes 5. Of course, constants encoded in the opcode byte could shrink this. If there were 16 flavors of LGn (plus LGB) and 16 flavors of Rn (plus RB), then a module could import 16 interfaces with up to 256 procedures in each, plus up to 256 interfaces with no more than 16 procedures in each, and still stay within the addressing limitations of a 4-byte IFC. This is perhaps not an unreasonable constraint. 3. Unless I am missing something fundamental, the DFC encoding implies that all code must reside within a single 2^26-word segment of the 2^32-word address space. This hardly seems restrictive now, since Cedar is currently limited to a 2^22-word address space for everything. However, I'm concerned that in the long run, setting aside only 1/64th of the address space for code may prove constraining. 4. There needs to be some provision made for transfers to existing frames instead of to procedures. Frame transfers are used for various purposes including coroutines and fixed-frame trap handlers. There is no need for these to be particularly efficient, since they are relatively infrequent; it would probably suffice for the registers to be dumped whenever such a transfer occurred (just the same as for a process switch).
But there is considerable advantage to having the frame link be a variant of a control link, along with the procedure descriptor and the indirect link. Perhaps one of the three tag values which now identify an indirect link should be taken over to designate a frame link. 5. Enter needs some escape mechanism for more than 15 arguments. Procedures taking more than 15 arguments are uncommon but must be provided for. Ed *start* 02231 00024 US Date: 20 Oct. 1982 9:52 pm EDT (Wednesday) From: Lampson.PA Subject: Re: FYI: Dragon Xfer In-reply-to: Taft's message of 20 Oct. 1982 6:06 pm PDT (Wednesday) To: Taft cc: Petit, Thacker, Fiala, Atkinson, Lampson, Levin, Rovner, Satterthwaite Some thoughts about your comments on Xfer: 2. A very good point, even though we hope that with the modeler, recompiling with different compiler switches will be less tiresome. We considered various compact encodings of IFC, but for this purpose perhaps it would be best to have Read Global Byte Indirect Byte, which is just like Read Global Indirect Pair except that it has a byte instead of a nibble for each displacement. This takes three bytes, plus one for the SFC, which totals four as required. This instruction might be useful for other things also. It remains an open question whether compilation with a tighter encoding of the IFC should be permitted: the Read Global Indirect Pair instruction, if applicable, yields 3 bytes for the call. 3. You are not missing anything fundamental. My position is that 2^28 bytes of code (256 megabytes) is enough for the next few years, since we currently have less than 4 megabytes. At 1/2 bit per year, this gives us 12 years. Before 256 megabytes runs out we will surely want another machine, and have many other ideas for improvements in the instruction encoding. 4. I have mostly worked out a design for this, but it still has a few problems. Fear not, frame links and ports will be taken care of. As you say, they will be much slower than the regular calls and returns (probably comparable to the current XFER, in fact), since a considerable amount of shuffling between registers and memory is required. 5. We intended to keep the present scheme for long argument records: allocate a record and pass a pointer to it explicitly. ENTER then specifies one argument. Or, of course, a hybrid is possible in which some of the arguments are passed on the stack along with the pointer, but this is probably too much trouble to specify intelligently. Of course, it is the typechecking and not the ENTER which is responsible for agreement between caller and procedure about the arguments, just as in the present scheme. *start* 04541 00024 US Date: 20-Oct-82 20:24:59 PDT From: Atkinson.pa Subject: Re: FYI: Dragon Xfer In-reply-to: Taft's message of 20 Oct. 1982 6:06 pm PDT (Wednesday) To: Taft.pa cc: Atkinson, Lampson, Levin, Rovner, Satterthwaite, Thacker, Fiala Some thoughts on Ed Taft's thoughts, plus a few of my own... Taft.1 & Taft.2: [... lack of any mention of Interface Function Call (IFC) & ... have the DFC instruction and the IFC instruction sequence occupy the same number of bytes]. I certainly agree that IFC should be mentioned. The 3-instruction sequence for IFC is quite OK even if it is larger than the space for DFC, because we can pad with no-ops when replacing it with DFC, with little or no execution penalty. If there are small encodings for IFC, then replacement by DFC requires that IFC be at least 4 bytes. Taft.3: [...
the DFC encoding implies that all code must reside within a single 2^26-word segment of the 2^32-word address space]. I believe that if we get 4 bits from the opcode byte and 24 bits from the next 3 bytes, then we have 28 bits to use as address. If we insist that procedures start on word boundaries (it may simplify the IFU anyway), then we have 2^28 words of code address space. I don't see how we could need more than 2^28 words of code before we could change the IFU to allow 5-byte DFCs. Taft.4: [There needs to be some provision made for transfers to existing frames instead of to procedures]. I agree that there should be some way of calling a frame, or something like a frame. Suppose that we allow a procedure descriptor to be a code address, a frame, or even a REF (see below). If we assume that only a code address has to be made fast, then we can afford to trap on anything that is not code, and do the rest without microcode. In most cases of calling a frame, we only need to be careful to inform the IFU to push new return info, and return to the frame from the trap handler. Taft.5: [Enter needs some escape mechanism for more than 15 arguments]. Presumably we can pull the same stunt that Mesa now uses for argument records larger than the eval stack, or we may want to adopt the convention that, if there are more than 14 arguments, the 15th argument has the extension. I prefer the latter convention, since it makes it easier for the conservative scan to find the extension and trace it. -------- Here are some random thoughts of my own about Dragon XFER and related topics. Atkinson.1: To allow replacement of IFC by DFC we may want to pass along information about call sites (DFC or IFC instructions, where they are, and how many bytes) in BCD information of some kind. The alternative is to parse byte codes, which we may want to avoid (to allow intermixing of data and code within a procedure). This could also help the debugger. Atkinson.2: I would like the ability to bind data to a procedure descriptor in a dynamic fashion (the creation of collectible closures on the fly). This might prove useful in polymorphic procedures (the additional parameter might be the type-dependent information), retained frames (actually retained contexts, since some of the information in a retained frame is useless), and other hacks derived from lambda calculus. This approach would require the reference-counting machinery to ignore procedure descriptors to code or uncounted data space (if that is where frames reside), but that seems quite achievable. This would also bring procedure descriptors under the REF ANY umbrella. Atkinson.3: We also have to allow for ultra-large local frames. This is no problem if we institute the convention that there is a place for a frame extension in any small frame. To make the job of the conservative scan easier we should have some way to quickly determine where this extension is (as with long argument records). The hack of using the largest small frame size to indicate the presence of a frame extension (the extension could be at any fixed offset) seems reasonable, although there may be significantly better ways. Atkinson.4: This may be unreasonable, but if we assume that collectible storage is at a low level in the system, and that everything needed to support that level does not use very large local frames or large argument records, then frame and argument extensions can come from collectible storage.
This may have no real effect on the lower levels (they tend to use small frames and few arguments anyway), and it could simplify the conservative scan (there would be no special cases for extensions). *start* 03086 00024 US Date: 21 Oct. 1982 3:51 pm PDT (Thursday) From: Fiala.PA Subject: Re: Dragon Xfer In-reply-to: Taft's message of 19 Oct. 1982 9:36 am PDT (Tuesday) To: Taft cc: Lampson, Atkinson, Petit, Thacker, Fiala There are some other issues that were not raised in the answers by Lampson and Atkinson to your original message. First, I believe (not positive) that a single opcode IFC (Interface Function Call) is unreasonable on Dragon, even if we were inclined to implement it. Dragon can implement in hardware only opcodes that finish in two or fewer microinstructions; longer implementations have to trap, and the trap involves creating a new frame; IFC cannot be implemented in two microinstructions, so it would have to trap. At the time the opcode traps and the new register context is created, it is not known how many arguments the new procedure takes, so proper adjustment of the register pointers for the new context at the onset of the trap is impossible. One way to deal with this problem would be to create an extra dummy frame for the IFC itself; unfortunately, although the new frame consumes no extra EU ring buffer registers, it does consume an IFU PC slot. The proposed IFU implementation allows some modest maximum number of IFU PC slots--say 16 of these. So an IFC trap opcode consuming a PC slot would greatly reduce the frame depth that could be retained before overflow occurred. This is sufficiently undesirable that it should be avoided. Your second suggestion was that it could be ambiguous at compile time whether DFC or IFC were used. Unfortunately, this is also unreasonable. Because IFC involves a reference to the global frame, it must be preceded by reserving one of the 16 local registers as a pointer to the global frame and by an opcode that loads this register with a pointer to global. However, Thacker, Petit, Lampson, etc. want to avoid loading global altogether in procedures that don't need it. If it were ambiguous at compile time whether IFC or DFC were used, then the compiler would have to put out a load-global opcode and waste one of the precious local registers, which is unacceptable. With regard to the 2^28-byte VM limit on code imposed by DFC, this is really 1/16th of the maximum rather than 1/64th because we want a return PC to fit in one word, so at most 2^32 bytes or 2^30 words would be available for code in the absence of the DFC limit. In addition, I don't know of any particular reason why SFC (Stack Function Call) couldn't cover 2^32 bytes. The exact code sequence for this is probably two 3-byte immediate opcodes to set up a 32-bit constant on the evaluation stack followed by an SFC to do the call; alternatively, it would be a 3-byte PC-relative load of a 32-bit procedure descriptor followed by an SFC. These sequences are not impossibly bad. The original idea was that SFC would cover all the non-call/return xfers somehow, so its one-word argument would encode procedure calls, frame transfers, indirect transfers, etc. I don't think anyone has worked out details of this, however.
(Note: in general, I have read only the Mesa code and not the commentary.) 7.5.2: WFSL should be WLFS. 8.1. BLTLR: Store[dest+count]^ _ Fetch[source+count]^; should be: Store[dest+count-1]^ _ Fetch[source+count-1]^; 8.2. BLECL: IF Fetch[ptr]^ # ReadCode[offset]^ THEN... the second ^ is extraneous. 8.4.2. BITBLT: the method described for capturing the intermediate state works for interrupts but not for faults. Even if you assume that PushState is executed once per item rather than only when an interrupt is detected, it still doesn't work because a fault restores the stack pointer to its state at the start of the instruction, thereby losing any intermediate state which has been pushed (and also losing the interruption indication itself, namely the stack depth). One way to deal with this is to use the low-order bit of the BitBltArg pointer (which starts out even) as the interrupt indication, and to push up to two words of intermediate state above top-of-stack once per item. This is what the Dorado implementation does. I'm not sure what to do if you have more than two words of intermediate state. I believe the Dolphin uses the entire stack for intermediate state, and has some hack in the fault handler which detects page faults from BITBLT and does not restore the stack pointer in that case. This would seem somewhat difficult to describe in a clean way. Note also that in certain cases, BITBLT has to do explicit touching (and writing, if relevant) of all pages on which an item lies before beginning to transfer the item. The PrincOps should describe this. 9.1.3.3. Descriptor instructions: the description implies, but does not state explicitly, that alpha must be even. There should be a programming note which states this. 9.3: In the frame case of Xfer, the two conditions shown as giving rise to UnboundTrap[dst], namely GF=0 and nPC=0, should instead result in ERROR. (We discussed this before, and you agreed.) 9.5.3. DSTK: ESCAlpha calls this DSK. Who is right? 10. None of the process/monitor instructions are sufficiently cautious about page faults or write-protect faults on monitor locks and condition variables. In most cases it is necessary to dirty the monitor lock or condition variable explicitly before proceeding with the real work of the instruction. This is particularly true of instructions which call Requeue (which is to say, nearly all of them), since Requeue can make irrevocable changes to the source queue before touching the destination queue. 10.2.5. BC: the following declaration needs to be added at the beginning: requeue: BOOLEAN _ FALSE; 10.2.7. SPP: the statement: link: PsbLink _ Fetch[@PDA.block[PSB].link]^; should be changed to: link _ Fetch[@PDA.block[PSB].link]^; (by the way, SPP is missing from ESCAlpha.) 10.4.1: Reschedule is specified to remove the new process from the ready queue before running it. In actuality, the process remains on the ready queue. (I suspect the PrincOps description is a vestige of an unsuccessful attempt made some time ago to make the process primitives work on multiprocessors.) 10.4.2.1. SaveProcess: I can't find any definition of PsbContext. 10.4.5. CheckForTimeouts: the statement: PTC _ PTC+1; should be: PTC _ MAX[1, PTC+1]; should it not? A timer value of zero is reserved to mean no timeout, so PTC had better not assume that value since there is no way to specify a timeout ending at time zero. I also have a couple of substantive comments on the instruction set itself. 1. 
BYTBLT and BYTBLTR (and BLT, for that matter) are advertised to replicate data if an overlap exists and the destination block is displaced forward from the source block in the direction of data transfer. This is unlike BITBLT, where it is explicitly stated that the result is undefined if any bit is used as a destination and later as a source. I'm sure you are aware that the replication semantics are quite hard to implement when there is any sort of pipelining going on or when the unit of transfer is greater than the granularity of the data being transferred. The result is either that the transfer is done a lot more slowly than it would be otherwise, or that a lot of microcode has to be expended distinguishing the replication and non-replication cases and coding each one separately. On the Dorado, this actually isn't too bad for BLT; but for BYTBLT it is horrendous, so I have given up implementing BYTBLT for now. My guess is that the replication semantics for BLT are useful only for zeroing blocks of memory, and for BYTBLT aren't of any use at all. I suggest that the replication semantics be abolished, and that separate opcodes be provided for zeroing blocks of memory. I suspect this would end up taking less microcode and would execute faster, since no checking for replication would be required. (In fact, we have already implemented block zeroing opcodes in the Cedar microcode, because Cedar needs to zero blocks of memory somewhat more often than standard Mesa, and the code for zeroing memory with BLT is quite clumsy.) 2. ME and MX should be regular opcodes, not ESC opcodes. Their static frequency may not justify this; but ME and MX are dynamically quite frequent, and ESC opcodes are sufficiently slower than regular opcodes (at least on the Dorado and Dolphin) to warrant making them regular opcodes. For example, ME takes 14 cycles on the Dorado when implemented as an ESC opcode; it would take 8 cycles if it were a main opcode. Ed *start* 00491 00024 US Date: 10-Nov-82 15:04:57 PST (Wednesday) From: Sandman.pa Subject: Emulator Tester To: Taft, Murray cc: Sandman.pa [Igor]EmulatorTest> has the emulator test program. It works by having the initial ucode load the .germ files. EmulatorTester.germ spins when it finds an error. EmulatorTester2.germ executes opcode 376B when it encounters an error. You can look at the code listing to see what it actually does. It tests most of the common instructions. *start* 00413 00024 US Date: 12 Nov. 1982 1:07 pm PST (Friday) From: Fiala.PA Subject: Trinity LoadStack opcode To: Taft cc: Fiala I think alpha = 215b is good for me. If you don't like 215b or 217b, which is also good, then 220b to 377b are available. Although 64b to 77b are presently unused, it is inconvenient for me to use them because all of the opcodes in that group are intended to trap to software. *start* 00310 00024 US Date: 12 Nov. 1982 1:15 pm PST (Friday) From: Taft.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Your message of 12 Nov. 1982 1:07 pm PST (Friday) To: Fiala cc: Taft 200B-237B are reserved for processor-dependent opcodes. What about 177B? Alternatively, 240B or higher. Ed *start* 00217 00024 US Date: 15 Nov. 1982 1:45 pm PST (Monday) From: Fiala.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Your message of 12 Nov. 1982 1:07 pm PST (Friday) To: Fiala cc: Taft Esc 177b is ok. *start* 01884 00024 US Date: 15 Nov. 
1982 6:36 pm PST (Monday) From: Taft.PA Subject: Trinity LoadStack opcode To: Satterthwaite, Sturgis cc: Sandman, Sweet, Fiala, Levin, Rovner, Taft As I have discussed previously with the other two Eds, here are the details of a proposed LoadStack operation, to be implemented in the Trinity Cedar compiler and microcode, and proposed for inclusion in future versions of the PrincOps. A new opcode is defined, LSTK (opcode=177B), which is exactly symmetrical with the present DSTK. Its semantics are:
LSTK: PROCEDURE =
BEGIN
alpha: BYTE = GetCodeByte[];
state: POINTER TO StateVector = LOOPHOLE[LF+alpha];
LoadStack[LengthenPointer[state]];
END;
Note especially that LSTK does not give rise to an XFER as LSTF and LSTE do. A new compiler construct is defined to invoke this opcode. I suggest: STATE _ state; where state is a PrincOps.StateVector; this is exactly symmetrical to the present "state _ STATE" which invokes DSTK. I will provide trap support for this opcode in all ProcessorHeads so it will be possible to run code using it on top of non-Cedar microcode. The purpose of this change is to significantly simplify the coding of opcode trap handlers, making them both faster and smaller. A trap handler for an opcode which both takes arguments and produces results (and which is not passed a trap parameter by the microcode) is now coded as follows:
Trap: PROCEDURE [arguments] RETURNS [results] =
BEGIN
-- arguments are popped into local variables by compiler-generated code
state: PrincOps.StateVector _ STATE;  -- dumps stuff that was underneath args --
results _ F[arguments];  -- the main work of the trap procedure --
TrapSupport.BumpPC[2];
STATE _ state;  -- reloads stuff that was underneath args --
-- results are pushed by compiler-generated code
END;  -- normal RETURN
Comments? Ed *start* 00592 00024 US Date: 16 Nov. 1982 8:21 am PST (Tuesday) From: Satterthwaite.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Taft's message of 15 Nov. 1982 6:36 pm PST (Monday) To: Taft cc: Satterthwaite, Sturgis, Sandman, Sweet, Fiala, Levin, Rovner I will be happy to add this to the Trinity Cedar compiler if everyone agrees that it's the right thing to do. Will DSTK and LSTK be fast enough to use in compiler-generated code to dump and reload (nearly full) stacks during expression evaluation (the alternative is a sequence of store/load local double instructions)? Ed *start* 00397 00024 US Date: 16-Nov-82 8:44:04 PST (Tuesday) From: Sandman.pa Subject: Re: Trinity LoadStack opcode In-reply-to: Taft's message of 15 Nov. 1982 6:36 pm PST (Monday) To: Taft cc: Satterthwaite, Sturgis, Sandman, Sweet, Fiala, Levin, Rovner I was planning to include LSTK in Kalamath as defined by you. I also intend to change LSTF and LSTE to not load the stack but only XFER. *start* 00945 00024 US Date: 16 Nov. 1982 9:03 am PST (Tuesday) From: Taft.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Satterthwaite's message of 16 Nov. 1982 8:21 am PST (Tuesday) To: Satterthwaite cc: Taft, Sturgis, Sandman, Sweet, Fiala, Levin, Rovner On the Dorado, DSTK and LSTK are substantially slower than the corresponding operations performed with sequences of SLDB and LLDB, even though the latter require more code. There are two reasons for this: (1) DSTK and LSTK are ESC opcodes, which take longer to dispatch than regular opcodes; (2) DSTK and LSTK include a substantial amount of overhead, including saving and restoring the Break byte and (for DSTK on the Dorado) checking for stack overflow.
Trinity DSTK for the Dorado takes 16+n cycles, where n is the number of words dumped from the stack (including 2 words above top-of-stack). The corresponding sequence of n/2 SLDB instructions takes 1.5*n cycles. Ed *start* 00704 00024 US Date: 16-Nov-82 9:23:38 PST (Tuesday) From: Sweet.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Taft's message of 15 Nov. 1982 6:36 pm PST (Monday) To: Taft cc: Satterthwaite, Sturgis, Sandman, Sweet, Fiala, Levin, Rovner I'm willing to have such a construct only on the condition that it is not very widely advertised and is used only by wizards. The problem with its use is that the compiler won't know the depth of the stack when it is pushing the results onto the stack after STATE _ state. Random programmers that have complicated expressions in a RETURN after STATE _ state could cause the stack to overflow without the compiler being able to prevent it. Dick *start* 00547 00024 US Date: 17 Nov. 1982 10:51 am PST (Wednesday) From: Fiala.PA Subject: Re: Trinity LoadStack opcode In-reply-to: Satterthwaite's message of 16 Nov. 1982 8:21 am PST (Tuesday) To: Satterthwaite cc: Taft, Sturgis, Sandman, Sweet, Fiala, Levin, Rovner DSTK/LSTK on Dolphin is slightly faster than a sequence of stores and loads and slightly slower than a sequence of doubleword stores and loads. Although it's possible to make DSTK/LSTK substantially faster by expending microcode, I doubt that this would be worth the effort. *start* 00284 00024 US Date: 19-Nov-82 16:19:02 PST (Friday) From: Sandman.pa Subject: Re: NOOP In-reply-to: Taft's message of 19 Nov. 1982 3:24 pm PST (Friday) To: Taft cc: Sandman Opcodes 0 and 377B are reserved in the PrincOps for implementations. They are not legal opcodes. *start* 00701 00024 US Date: 23 Nov. 1982 5:55 pm PST (Tuesday) From: Taft.PA Subject: Dorado Trinity To: Sandman cc: Taft Today I ran your EmulatorTester program on the Dorado for the first time. It turned up four bugs, three of mine and one of yours. Your bug is that the PrincOps descriptions of ADC and ACD are reversed. That is, ADC is described as taking the LONG CARDINAL argument on top-of-stack and the CARDINAL underneath, but actually takes the CARDINAL on top-of-stack and the LONG CARDINAL underneath. Now that I look at the EmulatorTester, it seems not very comprehensive; but at least it's a good start... By the way, the ExternalTests procedure appears never to be called. Ed *start* 03208 00024 US Date: 25 Nov. 1982 4:53 pm PST (Thursday) From: Taft.PA Subject: Trinity Cedar opcodes To: Fiala, Satterthwaite, Rovner, Willie-Sue cc: Levin, Sandman, Sweet, Taft We need to standardize the names and values of the Cedar opcodes for Trinity. There have been various ad-hoc assignments made already, not all of which are entirely satisfactory. Let's try to clean this up as part of converting Cedar to the new instruction set. Here is my understanding of the situation as it stands now, and my proposed changes: 1. Main reference-counting opcodes Two main opcodes are defined for reference-counted assignment:
WCLB = 76B  Write Counted Long Byte
ICLB = 77B  Initialize Counted Long Byte
These seem fine as they stand. Someone should see that these make it into the standard Mopcodes.mesa at some point. In the meantime, I volunteer to produce an initial version of Trinity Mopcodes and ESCAlpha for Cedar. 2.
ESC opcodes for allocator/GC A bunch of opcodes are already defined in the Trinity version of ESCAlpha.mesa, as follows:
RECLAIMREF = 140B
ALTERCOUNT = 141B
RESETSTKBITS = 142B
GCSETUP = 143B
ENUMERATERECLAIMABLE = 145B
CREATEREF = 147B
REFTYPE = 151B
CANONICALREFTYPE = 152B
ALLOCQUANTIZED = 153B
ALLOCHEAP = 154B
FREEOBJECT = 155B
FREEQUANTIZED = 156B
FREEPREFIXED = 157B
Additionally, the Dolphin microcode appears to contain the following additions:
RECLAIMCOUNT = 144B
LONGBLKZ = 146B
LOCALBLKZ = 150B
These opcode names are not consistent with the style used for naming the standard Mesa opcodes. But I have no problem with this, so long as (a) the people who implement and/or use the Cedar runtime system are happy with these names, and (b) these names get used uniformly everywhere, including in ESCAlpha.mesa, in the Compiler, in the Lister, and in the microcode. Currently there is a lot of variation in the names different people use for the same opcodes; this has the potential for causing a lot of confusion. Would Paul and Ed S. please get together and publish a list of "approved" names for these opcodes, to which everyone else will happily conform. 3. Machine-dependent opcodes ESC opcodes in the range 200B-237B are (according to ESCAlpha.mesa) reserved as machine-dependent opcodes. This seems like an excellent idea. However, some of these opcodes have nevertheless crept into ESCAlpha. In particular, INPUT, OUTPUT, and LOADRAMJ seem to be used on both the Dandelion and the Dolphin, and perhaps are not so machine-dependent after all. Can someone (presumably from SDD) clarify this? In any event, it seems to me that machine-dependent opcodes should NOT appear in ESCAlpha, but only in machine-dependent defs files associated with the Heads for each machine. 4. Possible addition The introduction of ESCL means that we can have ESC opcodes which take an operand byte. This might be of some advantage in certain instructions. In particular, is the compiler interested in flavors of LONGBLKZ and/or LOCALBLKZ which get their count argument from the operand byte instead of from the stack? (I suspect the length of the block to be zeroed is nearly always known at compile time and is less than 400B.) Ed *start* 01997 00024 US Date: 26 Nov. 1982 2:32 pm PST (Friday) From: Taft.PA Subject: Trinity booting To: Levin cc: Taft The Trinity microcode is sufficiently larger than the Rubicon microcode that I will be forced to overlay the initialization code. This means that the present form of the 1- and 2-push boots, which do not reload microcode but simply restart it, will no longer work. The distinction between a 1- or 2-push boot and a 3-push boot is presently interesting only in the Alto world: the former types of boot do not reset your disk partition and do not zero memory whereas the latter type does. There is no such distinction in the Pilot world: all types of boot reinitialize your Pilot world, including reloading the Germ. I can perpetuate this simply by making 1- and 2-push boots invoke a 3-push boot. However, it seems to me that it would be more useful if a 1-push boot could be used to restart a wedged Pilot world without reloading the Germ. Among other things, this would mean that the user credentials would not be lost by a 1-push boot as they are at present. From brief study of the Trinity Germ, I believe it is not restartable in the obvious way (XFER[@SD[sBoot]]).
However, it appears that a physical volume boot could be invoked by forcing a call to SnapshotImpl.InLoadFromBootLocation (I haven't figured out where this is exported, though it is declared PUBLIC in the implementation). This could be done either by putting this procedure in SD (on the Pilot side, not the Germ side) or by having the microcode request an interrupt and the interrupt routine call this procedure. Alternatively, it might be desirable to invoke this entirely on the Germ's side; i.e., have a procedure in the Germ's SD which sets up a "bootPhysicalVolume" request and somehow forces ProcessRequests to go around the loop again. However, I am sufficiently fuzzy about how the cross-MDS linkage works that I don't really know how to go about doing this. What do you think? Ed *start* 02821 00024 US Date: 26 Nov. 1982 3:25 pm PST (Friday) From: Taft.PA Subject: Trinity traps To: Satterthwaite, Sandman, Sweet, Levin, Rovner cc: Fiala, Sturgis, Taft Given that we implement a LoadStack operation (invoked by "STATE _ state"), as we have discussed previously, the only remaining difficulty with writing trap procedures for Trinity the way we have been doing it in Rubicon is in obtaining the trap parameter. This is mainly of interest to the Cedar opcodes, since most other opcode trap procedures do not use trap parameters. The problem is to ensure that the trap parameter in local 0 will not be clobbered before being picked up and saved somewhere else. We have discussed ways to do this by means such as MACHINE DEPENDENT local frames or TRAP PROCs; but I understand that both of these represent fairly big changes to the compiler and are not likely to be done any time soon. I have now figured out a way to accomplish this without any language changes. A trap procedure which takes arguments, produces results, and also requires a trap parameter may be coded as follows:
Trap: PROCEDURE =
BEGIN
trapParam: WORD _ Trap.Parameter[];
InnerTrap: PROCEDURE [arguments] RETURNS [results] = INLINE
BEGIN
-- arguments are popped into local variables by compiler-generated code
state: PrincOps.StateVector _ STATE;  -- dumps stuff underneath args --
results _ F[arguments, trapParam];  -- the main work of the trap procedure --
TrapSupport.BumpPC[2];
STATE _ state;  -- reloads stuff that was underneath args --
-- results are pushed by compiler-generated code
END;
MaterializeArgs: PROCEDURE RETURNS [arguments] = MACHINE CODE {};
AbsorbResults: PROCEDURE [results] RETURNS [] = MACHINE CODE {};
APPLY[AbsorbResults, APPLY[InnerTrap, MaterializeArgs[]]];
END;  -- normal RETURN
The statement which saves the trap parameter is the first one executed, and is performed while the arguments are still on the stack (this should be OK because the compiler leaves two words above top-of-stack). The rigamarole with MaterializeArgs and the inner APPLY does not generate any code. Rather, it tricks the compiler into expanding InnerTrap in-line without first pushing any arguments for it; since the arguments are in fact already on the stack, this has precisely the desired effect. AbsorbResults and the outer APPLY have a symmetrical effect on the results. This is a lot of boilerplate for obtaining a seemingly simple result; but it does generate exactly the right code, so I think it will be adequate for our needs in Cedar. By the way, my attempts to call InnerTrap in a more straightforward way, e.g., LOOPHOLE[InnerTrap, PROCEDURE][]; were unsuccessful; the compiler invariably complained "inline procedure used improperly" or some such.
Ed *start* 00581 00024 US Date: 29 Nov. 1982 9:08 am PST (Monday) From: Levin.PA Subject: Re: Trinity Cedar opcodes In-reply-to: Taft's message of 25 Nov. 1982 4:53 pm PST (Thursday) To: Taft cc: Fiala, Satterthwaite, Rovner, Willie-Sue, Levin, Sandman, Sweet Thanks for your clear and appropriate message. I agree with everything you say. As "purveyor of Pilot to Cedar", I will be happy to work with you to get Mopcodes/ESCAlpha sorted out. I don't have much other stuff to do between now and 4.0, so perhaps I should start working on a Trinity Pilot Kernel for Cedar. Roy *start* 01482 00024 US Date: 29 Nov. 1982 9:50 am PST (Monday) From: Fiala.PA Subject: Re: Trinity Cedar opcodes In-reply-to: Taft's message of 25 Nov. 1982 4:53 pm PST (Thursday) To: Taft cc: Fiala, Satterthwaite, Rovner, Willie-Sue, Levin, Sandman, Sweet I am indifferent to any names or name changes anyone wishes to make. Your proposals that the machine-dependent opcodes not appear in ESCAlpha and that Paul and Ed S. get together and publish "approved" names seem reasonable to me. However, I suggest that the approved names not be as long as ENUMERATERECLAIMABLE and CANONICALREFTYPE. The ICLB opcode is now no longer used. Satterthwaite has suggested the addition of another version of ICLB with its stack arguments reversed, and this opcode should be added with the same opcode assignment as ICLB used to have. It is getting hard to add Dolphin microcode, and many already implemented opcodes cannot be included for some hardware configurations, so I am mildly opposed to adding opcodes along the lines of LONGBLKZ and LOCALBLKZ, which don't have much impact. If such opcodes were added on the Dolphin, the microstore used would prevent implementation of other opcodes such as BYTBLT and the Cedar allocator opcodes, for example. However, Cedar is unlikely to be run on Dolphins very often, so I suppose that trapping such opcodes won't matter much. Also, the variants of LONGBLKZ and LOCALBLKZ won't use much extra space, so go ahead and add them if you wish. *start* 07764 00024 US Date: 31 Dec. 1982 6:29 pm PST (Friday) From: Stewart.PA Subject: Mesa floating point design To: CedarDiscussion^.pa, @[Indigo]Real>Users.dl Cc: Satterthwaite, Sweet, Malasky, Lampson, Johnsson, Wick, Thacker, Petit Reply-To: Stewart.PA It is time to clean up the Mesa floating point implementation and to plan for double precision. Please let me know if you are not interested in any of this, or if there are others who are interested. I will act as moderator of the discussion unless someone else wants to.
General
--------
I propose that Mesa floating point conform to the IEEE floating point standard (as indeed, the single precision implementation does).
Instructions
-----------
The present complement of single precision instructions is:
16 opcodes
FADD -- REAL _ REAL + REAL
FSUB -- REAL _ REAL - REAL
FMUL -- REAL _ REAL * REAL
FDIV -- REAL _ REAL / REAL
FCOMP -- INTEGER _ REAL ? REAL
FIX -- LONG INTEGER _ REAL
FLOAT -- REAL _ LONG INTEGER
FIXI -- INTEGER _ REAL
FIXC -- CARDINAL _ REAL
FSTICKY -- exchange mode words
FREM -- REAL _ REAL % REAL
ROUND -- LONG INTEGER _ REAL (rounded)
ROUNDI -- INTEGER _ REAL
ROUNDC -- CARDINAL _ REAL (rounded)
FSQRT -- REAL _ SQRT(REAL)
FSC -- REAL _ REAL * 2^INTEGER
I recommend that ROUNDI and FIXI be deimplemented. ROUNDC and FIXC are heavily used by the graphics folk to compute screen coordinates and (perhaps) should be retained. FREM has never been implemented, either in microcode or in Mesa.
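For concreteness, clients reach the fixed-mode set through MACHINE CODE inlines, in the same style as the FSc declaration earlier in this file; a sketch (alphaFAdd is a placeholder for the assigned MISC alpha byte, not a real assignment):
FAdd: PROCEDURE [a, b: REAL] RETURNS [REAL] =
  MACHINE CODE { Mopcodes.zMISC, alphaFAdd };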
FSTICKY is used only by the Mesa part of the floating point package to keep track of the microcode's copy of the various mode and exception enable flags; see below for discussion of floating point state. I propose that the following additional instructions be added for double precision:
11 opcodes
DFADD -- LONG REAL _ LONG REAL + LONG REAL
DFSUB -- LONG REAL _ LONG REAL - LONG REAL
DFMUL -- LONG REAL _ LONG REAL * LONG REAL
DFDIV -- LONG REAL _ LONG REAL / LONG REAL
DFCOMP -- INTEGER _ LONG REAL ? LONG REAL
DFLOAT -- LONG REAL _ Quadword INTEGER
DFIX -- Quadword INTEGER _ LONG REAL
DROUND -- Quadword INTEGER _ LONG REAL (rounded)
DFSQRT -- LONG REAL _ SQRT(LONG REAL)
DFSC -- LONG REAL _ LONG REAL * 2^INTEGER
DFREM -- LONG REAL _ LONG REAL % LONG REAL
There are some mixed precision operations also:
3 opcodes
FMULD -- LONG REAL _ REAL * REAL
FLONG -- LONG REAL _ REAL
FSHORT -- REAL _ LONG REAL
We would need language support for LONG REAL and perhaps also for 64 bit integers. LONG REAL _ REAL * REAL seems to be the most useful mixed arithmetic operation. One would expect to use it inside an inner product loop.
Floating point modes
--------------------
IEEE floating point defines a number of floating point modes; among these are rounding mode (round up, round down, round towards zero, and unbiased round), infinity mode (handling of overflow), and normalization mode (handling of underflow). At present, all microcode operations operate with fixed modes; the interface RealOps supplies clients with access to alternative modes. The standard also defines a number of exceptions: invalid operation, overflow, underflow, divide by zero, and inexact result. The single precision package also defines overflow for conversion to fixed point. If a given exception is not enabled, a "sticky bit" records the fact that at least one such exception has occurred. The present single precision floating point package has a global copy of these modes. If more than one process attempts to use floating point arithmetic in any complex way, the results will be indeterminate. I propose that the Mesa process state be augmented by the floating point exception enable flags and sticky bits. Attaching the exception enables and sticky flags is clearly necessary for correct operation of multi-process floating point code. I am uncertain whether the mode bits are needed as well. The IEEE standard defines floating point modes to be dynamically scoped. The implicit assumption is that interval arithmetic can be achieved by calling a subprogram once with round-down and once with round-up. Lexically scoped modes would be more in line with Mesa style. The trouble with dynamically scoped modes is that subroutines which change the mode must exercise extreme care to restore the caller's modes on exit -- including a catch phrase for UNWIND, etc. Otherwise, a function returning a REAL might alter the caller's modes in the middle of the evaluation of a complicated floating point expression. On the other hand, lexically scoped modes present considerable implementation difficulty: a language construct must be defined and attached to a block: BEGIN ROUND-UP ... END; In 1980, the conclusion reached was to provide two sets of floating point opcodes, one set providing a default set of fixed modes and the other set accepting the mode as an explicit argument to the instruction. Only the first set of opcodes has ever been implemented in microcode.
(For the curious, the fixed-mode instructions are those declared in the interface Real, while the explicit-mode instructions are those declared in the interface RealOps.) Under this model, programmers wishing to use the non-default modes would be required to write RealOps.FAdd[a, b, specialMode] rather than a + b. However, RealOps.FAdd would be a MACHINE CODE procedure and execute very quickly. An alternative would be to provide two sets of instructions, one of which uses fixed modes and the other of which uses the modes contained within the process state. I am inclined to this course.

Process switching
------------------
In any event, 16 bits is sufficient to store the 4 mode bits, 6 exception enable bits, and 6 sticky flags. The process switching machinery will have to save and restore the floating point modes. This can be done at no cost to processes which are not using floating point by, for example, having the floating point microcode maintain a cached copy of the floating point modes and check (on each operation) whether the process has changed. A special cache flushing instruction would be executed by FORK/JOIN to handle the re-use of an old process id.

Special loads and stores?
------------------------
At present, floating point arithmetic is carried out by microcode using the standard Mesa evaluation stack. It seems likely that future Mesa processors will have special hardware for floating point operations. It may be convenient to define a separate collection of loads and stores for manipulation of REALs rather than to continue using the standard loads and stores. Use of special instructions would be appropriate for special hardware, since the floating point operands could be transferred directly to the special hardware rather than to the evaluation stack. On machines without special hardware, the special instructions could use the evaluation stack. Use of special instructions could be faster and smaller, since it takes a number of existing Mesa instructions to manipulate 32 and 64 bit quantities. However, special instructions would complicate the compiler, since either or both of the floating point registers and the evaluation stack might overflow. An interesting alternative would be to require any special floating point hardware to maintain a shadow copy of the evaluation stack at all times. On arrival of a floating point operation, the operands would be ready. On conclusion of a floating point operation, the results would need to be transferred to the real evaluation stack for later transfer to main memory using standard store instructions. The results usually involve fewer bits than the operands. There are lots of implications for machines with local frame caches. *start* 02353 00024 US Date: 3 Jan. 1983 3:16 pm PST (Monday) From: Fiala.PA Subject: Re: Mesa floating point design In-reply-to: Your message of 31 Dec. 1982 6:29 pm PST (Friday) To: Stewart, CedarDiscussion^.pa, @[Indigo]Real>Users.dl, Satterthwaite, Sweet, Malasky, Lampson, Johnsson, Wick, Thacker, Petit cc: Fiala Here are my comments in response to yours: 1) I support eliminating ROUNDI and FIXI. 2) We should continue implementing the IEEE floating point standard with our own addition of the substitute-zero-on-underflow option, which you didn't mention. 3) Should we deimplement FREM, since nobody uses it? 4) I think we should stay with one opcode per operation and control variable aspects (rounding, overflow, underflow, etc.) with the FSTICKY flag word, as we are presently doing. 
On Dolphin, extra microcode for checking FSTICKY costs about 3 microinstructions per opcode (0.6% to 2% in performance). This cost is small, and I think it is impractical to try to avoid it. Explicitly calling a procedure for non-standard options forecloses the possibility of eventually implementing some options in microcode or hardware on some machines. 5) Integrating FSTICKY into the process state should be done at some point, but this is not easy. It involves enlarging the PSB from 64 bits to 128 bits and absorbing the TIMEOUT word (in Trinity, TIMEOUT is in a separate table) back into the PSB. Also, the arrangement of the bits in each PSB entry has to change. I think we should wait until some future release before absorbing FSTICKY into the process state. This will definitely be done for Dragon, where the incorporation of FSTICKY into the process state is planned. Also, it may be desirable to simultaneously provide some overflow options for ordinary integer/cardinal arithmetic as well as floating point arithmetic; this will enlarge FSTICKY to more than 16d bits. We should defer this unwanted perturbation on the 16-bit machines for a while. 6) I don't think that special loads and stores for floating point coprocessors on future machines are worth considering at this time. Such operations would complicate life on existing machines (because they would have to be implemented) without providing help at present, and the operations chosen might be inappropriate for whatever we eventually build or buy. *start* 00679 00024 US Date: 3 Jan. 1983 4:46 pm PST (Monday) From: Sturgis.PA Subject: Re: Mesa floating point design In-reply-to: Fiala's message of 3 Jan. 1983 3:16 pm PST (Monday) To: Fiala cc: Stewart, CedarDiscussion^, @[Indigo]Real>Users.dl, Satterthwaite, Sweet, Malasky, Lampson, Johnsson, Wick, Thacker, Petit With regard to point 5): It is my understanding that Stewart's proposed design does not involve actually storing information in the process state, but rather maintaining the sticky information in a data structure parallel to the process data structure. Thus, implementing it does not require modifications to the process data structure. Howard *start* 02229 00024 US Date: 4 Jan. 1983 12:09 pm PST (Tuesday) From: Fiala.PA Subject: Re: Mesa floating point design In-reply-to: Sturgis' message of 3 Jan. 1983 4:46 pm PST (Monday) To: Sturgis cc: Fiala, Stewart, CedarDiscussion^, @[Indigo]Real>Users.dl, Satterthwaite, Sweet, Malasky, Lampson, Johnsson, Wick, Thacker, Petit A parallel structure is exactly what was done in going from Rubicon to Trinity for the TimeOut word. This was once in the PSB but got moved to a separate table when the multiple MDS field replaced it in the 4x16-bit PSB. As a result of the removal, about 25 microinstructions got added that would have been unnecessary if the TimeOut word were in the same block with the other PSB words. Also, the timeout scan in the process microcode uses about 1% of all cycles on the Dolphin, with about 30% of this time wasted because the TimeOut word is located in a separate table. We could conceivably repeat this kind of kludge for the FSTICKY word. However, my feeling is that the FSTICKY word will grow beyond 16 bits to accommodate ordinary arithmetic overflow and perhaps other stuff. This means that the "logical" size of the PSB would be 7 x 16 bits. 
My preference would be to put all this information back into an enlarged PSB, which could become 8 x 16 bits aligned on appropriate storage boundaries; although this wastes one word at present, it cleans up the implementation, and we will probably find some use for the unused word eventually. For example, Roy Levin mentioned to me once that he had a scheme to eliminate cleanup links that required another 16 bits in the PSB--I don't know the details of this, however. As part of the Dragon project, we will be making changes to the PSB anyway. We have to eliminate the MDS field and enlarge the CONTEXT field from 16 to 32 bits, for example. My thought is that, at the same time we make these other changes, we should insert a 32-bit FSTICKY field into the PSB. The timing for this would probably be about one to two years from now. It would be wasteful to make an inferior interim change to get FSTICKY into the process state now (on three machines and two software systems) and then repeat the work in one or two years. *start* 06105 00024 US Date: 30 April 1983 4:07 am PDT (Saturday) From: Fiala.PA Subject: Re: PrincOps changes/bugs/improvements In-reply-to: DKnutsen's message of 26 Apr 83 14:35:58 PDT (Tuesday) To: DKnutsen, Wick, Johnsson cc: Sandman, Luniewski, Fiala, Fay, Neely, Taft Here is a revision to the message which Knutsen sent out earlier. 1) The Princ Ops says that bits 0..3 of both condition variables and monitor locks are reserved and equal to 0; I have observed bit 0 of monitor locks being set to 1 by software, so the Princ Ops must be wrong. 2) The Princ Ops description of Wait is inconsistent with the Dolphin microcode (and I suspect other microcode); it should not abort unless both condition.abortable and PsbFlags.abortpending are true. 3) For critical tables manipulated by microcode, not only should the origin be at a 4-word or 16d-word boundary but also the table length should be 4*n words. Two particular places where I want this to be true are the state vectors (where the Dolphin formerly used only 22b words) and the timeout table. Both of these begin at 4-word or 16-word boundaries, as defined by the Princ Ops, which does not specify anything about the lengths. Knutsen informed me that the state vectors were, in fact, 24b words long; I have now taken advantage of this in the microcode. The Mesa sources which define the state vector size for the Dolphin should be revised with a comment or whatever indicating that all 24b words are used. For the timeout scan, the microcode I want to use makes one extra quadword memory reference beyond the end of the timeout table; this mustn't page or write protect fault, but I don't care what the data is in that quadword; I am only referencing the extra quadword because I want to do the end tests during memory wait. Knutsen informed me that the present software has a state vector immediately after the timeout table, and, since that is resident, it can't page fault; I would like to codify this restriction in either the Princ Ops or the appropriate software module, so it won't be changed later. Secondly, I would like to make only one end test per 4 timeout table entries, but unused words in the final quadword of the timeout table must be zero filled to do this. I am not presently relying on these words being zero filled, but I would like to. 4) Is it deliberate in the Princ Ops that the Monitor Reenter opcode doesn't clear PsbFlags.AbortPending? 
5) The Monitor Exit and Monitor Exit and Wait (MX and MW) opcodes should trap if the monitor was already unlocked. 6) The Reschedule Error trap should be strengthened as follows: instead of generating this error only in the "None Ready" loop when wakeups are disabled, it should be generated whenever rescheduling occurs with interrupts disabled. The only reasons I have thought of for not wanting to make this change are the possibility of page, write protect, or frame faults. I think it is a bug for any of these faults to happen with interrupts disabled. In addition, interrupts should be disabled at entry to the Requeue subroutine and reenabled at exit, so that the Reschedule Error trap will occur on page faults during process requeuing, when they are a disaster. Such a fault indicates either a screw-up in the process data structures or a fault from an io task (which has been happening on the Dolphin). The exact way in which interrupts are disabled/enabled to check for illegal page and write protect faults is open for discussion or perhaps optional for a particular implementation. The idea is to cause the Reschedule Error trap to occur whenever illegal page faults happen. 7) The algorithm for timeout scans can be improved (a C sketch follows this message). Change the code from

CheckForTimeouts: ...
PTC _ PTC + 1;
RETURN[TimeoutScan[]]; ...
TimeOutScan: ...
timeout: Ticks _ FetchPda[@vector[psb]]^;
IF timeout # 0 AND timeout = PTC THEN ...

to

CheckForTimeouts: ...
PTC _ PTC + 1; IF PTC = 0 THEN PTC _ PTC + 1;
RETURN[TimeoutScan[]]; ...
TimeOutScan: ...
timeout: Ticks _ FetchPda[@vector[psb]]^;
IF timeout = PTC THEN ...

In addition, the code in the Monitor Wait opcode which sets timeouts should be changed to add 1 whenever the sum of the timeout value and Ticks crosses 0. In the Dolphin microcode this is done as follows:

LU _ (prWaitCount) + T;
prWaitCount _ (prWaitCount) + T, UseCOutAsCIn;

These two changes have two benefits: TimeOutScan runs faster, since a check is eliminated from its inner loop; and processes will not see a spurious extra tick if the PTC happens to overflow to zero during their wait time. 8) More thoughts about the timeout code: although the above change removes one error from the precision of the timeout scan, three other errors remain: a) The clock stops when WDC # 0 (because interrupts aren't serviced, so ticks aren't counted). b) The Princ Ops allows an unreasonably large variation in the size of a "tick" (15 to 60 msec). c) A wait count of 1 can actually result in any wait between 0 (the next tick happens right away) and 60 msec (the next tick happens maximally far away); a wait count of 2 can actually result in any wait between 15 msec (small tick size happening immediately) and 120 msec (large tick size happening later); etc. For (a), I think that the clock should not stop when interrupts are disabled. Instead, interrupts and the timeout scan should occur as usual, but RESCHEDULING should be inhibited until interrupts are reenabled. Implementation details need to be worked out. (Perhaps WDC<0 should inhibit interrupts as well as rescheduling, while WDC>0 inhibits only rescheduling.) For (b), the MW opcode can round and right-shift the program-supplied wait count in, say, 10 msec ticks to match the desired count against the coarser grain of the display field interrupt or whatever. For (c), the Dolphin implementation, at least, has a 3-state sub-tick counter, so that a "tick" happens every 3 field interrupts; this would allow the MW opcode to carry out rounding based upon the state of the sub-tick counter. 
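Here is the revised scan of point 7 transcribed into C, as a hedged sketch only: the table size, the names, and the 16-bit counter width are assumptions for illustration, not PrincOps definitions.

    #include <stdint.h>

    #define PSB_COUNT 64                       /* hypothetical table size */
    static uint16_t timeoutVector[PSB_COUNT];  /* 0 = no timeout pending */
    static uint16_t PTC;                       /* process timeout counter */

    /* The revised scan: because PTC is bumped past 0 (0 is reserved to
       mean "no timeout"), the inner loop needs only one compare. */
    void CheckForTimeouts(void) {
        PTC = PTC + 1;
        if (PTC == 0) PTC = PTC + 1;           /* skip the reserved value on wraparound */
        for (int psb = 0; psb < PSB_COUNT; psb++) {
            if (timeoutVector[psb] == PTC) {
                /* ... make process psb ready ... */
            }
        }
    }

    /* Correspondingly, Monitor Wait must never store 0 as a wakeup time:
       add 1 when the sum wraps past 0, as the Dolphin microcode does
       with its UseCOutAsCIn carry trick. */
    uint16_t WakeupTime(uint16_t ticks) {
        uint16_t t = PTC + ticks;
        if (t < PTC) t = t + 1;                /* crossed 0: bump past the reserved value */
        return t;
    }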
*start* 01141 00024 US Date: 2 May 83 11:27:02 PDT (Monday) From: DKnutsen.PA Subject: Re: PrincOps changes/bugs/improvements In-reply-to: Fiala's message of 30 April 1983 4:07 am PDT (Saturday) To: Fiala cc: DKnutsen, Wick, Johnsson, Sandman, Luniewski, Fay, Neely, Taft "the clock should not stop when interrupts are disabled. Instead, interrupts and the timeout scan should occur as usual, but RESCHEDULING should be inhibited until interrupts are reenabled" Making this change would mean that the software would have no way of taking a snapshot of the timeout vector and psb's, or of modifying the timeout vector or psb allocations, while the machine was running. It is necessary (somehow) for the software to be able to have the microcode stop looking at the Process Data Area -- this is done when swapping to the debugger. It could be accomplished by other means, but I don't see the motivation. Your suggestions about improving the accuracy of timeouts are fine for any microcode implementation that is willing to do them. I believe the PrincOps is written the way it is so that good accuracy is not required, just nice. Dale *start* 02415 00024 US Date: 6 May 83 14:50:23 PDT (Friday) From: Haynes.PA Subject: Re: Hitting the same break multiple times To: Taft.PA cc: Haynes I think I know what the solution is, but please tell me how you did it so I can cross-check my conclusions. The following is an earlier message on the same subject, a little more technical. I'd welcome any comments you had. -------------------- Date: 6 May 83 11:46:35 PDT (Friday) From: Haynes.PA Subject: Hitting the same break multiple times To: Haynes.pa Cc: Sandman, DKnutsen, Daniels This is mostly notes to myself on what the problem is. Jim, Dale, or Andy, please feel free to correct any mistakes here. Andy, how hard would it be to implement the proposed microcode change? Workaround: Every time you enter on a break, hash the contents of the local frame, plus all the frames on the call stack. Then if you hit the same break twice, compare this hash with the previous one, and if it is the same, indicate that it is possible that you didn't make any progress and should proceed again. This is the bug where, when you take a breakpoint and proceed, you hit the same breakpoint again (possibly multiple times). The way breakpoints work is that, when you want to proceed from a break (and keep the break in place), the old instruction is put in a special location (the break-byte) for execution. When you hit a BRK instruction, if the break-byte is not empty, you execute it instead of the BRK. Unfortunately, the break-byte is cleared early in the break-instruction code, before you execute the break-byte. This is so that when you finish the break-byte you just go on your merry way fetching the next instruction. This causes problems when you fault during the execution of the instruction in the break-byte. When the fault routine finishes, it backs up the PC and you re-execute the BRK instruction, causing a breakpoint. This can theoretically occur up to 11(!) times, but practically speaking it rarely happens more than 4 times (hopefully it doesn't happen at all, of course). The cure for this is to overload enabling interrupts (?) (I don't understand how this works... microcode magic) so that you can get called AFTER the instruction in the break-byte is executed and can clear the break-byte then. This problem will most often manifest itself in a system that is thrashing, for example machines with small memory. 
-- Charles *start* 01414 00024 US Date: Fri, 6 May 83 15:08 PDT From: Taft.PA Subject: Re: Hitting the same break multiple times In-reply-to: "Your message of 6 May 83 14:50:23 PDT (Friday)" To: Haynes cc: Taft Sounds like you have figured it out for yourself. But here it is just in case: when the BRK instruction is executed with a nonzero Break byte, it dispatches on the Break byte but does not clear it. It also initiates an interrupt late enough so that it does not take effect on THIS dispatch but rather on the next one. The interrupt handler unconditionally clears the Break byte. There are some details that complicate things: 1. LoadState (LSTF and LSTE) must not permit an interrupt during the dispatch to the next instruction. This is so that if it loads a Break byte and transfers to a BRK instruction, the Break byte doesn't get cleared before the BRK has had a chance to execute. (Loading a Break byte and transferring to an instruction other than a BRK is probably a programming error; I haven't worried about it.) 2. Instructions such as BLT and BITBLT that test for interrupts must be clever enough not to terminate if the hardware interrupt pending condition is true but WP is still zero. Also, if WP does become nonzero, suspending the instruction and initiating the interrupt should NOT clear the Break byte. This enables correct operation if the broken instruction is a BLT or BITBLT. Ed *start* 03110 00024 US Date: Tue, 17 May 83 00:11 PDT From: Fiala.PA Subject: Princ Ops questions & comments To: DKnutsen cc: Wick, Johnsson, Taft, Fiala These all refer to Mesa Processor Principles of Operation 3.0c, 9 October 1980: 1) On page 140, what, if anything, are the "SIGNAL Abort" statements in FaultOne and FaultTwo intended to mean? There is nothing for these in the Dolphin microcode. 2) On page 142, in the middle of the page, is the paragraph: "If the process switch time to save the state of a running process and load the state of a ready process is N time units, the check for interrupts must occur at least every 2N time units. This rule guarantees an interrupt latency of no more than 3N time units." This comment is wrong and should be replaced by something like the following: "Worst case response to the highest priority interrupt will happen when the interrupt request is raised in conjunction with both the timeout scan and an opcode with a long execution time without interrupt checks. To avoid making response time worse than it must be, opcodes should check for interrupts at intervals small compared to the timeout scan." 3) The definition of CleanupCondition on page 135 includes the statement "cond.wakeup _ FALSE;". I believe that this statement is useless, since there is no way that wakeup can be true with CV pointing at a non-NIL queue. (If the queue were non-NIL, wakeup wouldn't have been set true; if wakeup is true, it would be zeroed before any MW would put a process on the queue.) 4) Also on page 135, following the programming note, I think there should be some comment like the following: "Also, any fault that occurs between a Monitor Wait and the subsequent Monitor Reentry will result in the process being requeued, first, to a fault service queue and, later, to the Ready Queue again. For example, a page fault on the code page is possible. The first of these requeues is carried out without calling CleanUpCondition, and the second must also avoid CleanUpCondition, or the cleanup link will be smashed. 
For this reason, the process must be moved from the fault service queue to the Ready Queue by means of the REQ opcode--Notify Condition and Broadcast Condition must not be used." Question: is it the case that all fault servers restart faulted processes by means of REQ rather than NC or BC? 5) The Reschedule Error revision I suggested in an earlier message didn't work. A revised improvement, which seems to work, is as follows: on any requeue which does not remove the current process from the Ready Queue (notably NC and BC), an attempt to reschedule is legal but a no-op; in this case, the reschedule is simply suppressed. All other entries to the scheduler with interrupts disabled result in the Reschedule Error. Note that it is necessary to zero WDC before starting the Reschedule Error. An example of a legitimate program which entered the scheduler on an NC with interrupts disabled is a page fault monitor (turned up by Hal Murray) in which interrupts are disabled, a naked notify is simulated, and interrupts are then turned on again. *start* 00576 00024 US Date: 17 May 83 10:49:05 PDT (Tuesday) From: DKnutsen.PA Subject: Re: Princ Ops questions & comments In-reply-to: Fiala's message of Tue, 17 May 83 00:11 PDT To: Fiala cc: DKnutsen, Wick, Johnsson, Taft Yes, all fault servers restart faulted processes by means of REQ rather than NC or BC. WDC should NOT be zeroed before starting Reschedule Error. It is not necessary and could lead to loss of state about the error. This requires that the Reschedule Error trap handler be resident. Johnsson and/or Wick will respond to your other points. Dale *start* 00744 00024 US Date: 16 Sept. 1981 4:16 pm PDT (Wednesday) From: DDavies.PA Subject: Xfer Cache To: Taft cc: DDavies Ed, I discovered I don't understand one aspect of the cache scheme. Can the processor take a page fault in attempting to flush out the oldest entry? I didn't see any restrictions that kept either of the necessary local frames in memory. If a page fault is taken, we would like to flush the cache, yet obviously we can't do this directly. Is there some other mechanism for holding cache elements until they can be saved properly? If this is a real bug, I assume the simplest fix is to have the calling microcode always store the PC and links. The cache could only speed up the returns. What did I miss? Dan *start* 00808 00024 US Date: 16 Sept. 1981 6:20 pm PDT (Wednesday) From: Taft.PA Subject: Re: Xfer Cache In-reply-to: Your message of 16 Sept. 1981 4:16 pm PDT (Wednesday) To: DDavies cc: Taft I'm assuming that the cache is flushed on every process switch, so every local frame that is in the cache was created (and therefore touched) by the currently-running process. So long as processes don't unmap frames in their own call stacks, I don't think there are any problems with page faults during cache flushes. The PrincOps already prohibits software from unmapping the currently-running local frame. One could either change the PrincOps to extend this restriction to all frames of the currently-running process, or else the opcodes that change the map could unconditionally flush the cache first. 
Ed *start* 11160 00024 US Date: 5 Aug 83 05:55:11 PDT (Friday) Subject: BITBLT on D-machines - long message To: BLee.ES, CharlieLevy.es, JLarson.PA, RAMCHANDANI.ES, Haynes.PA, JMaloney.pa, Dillon.PA, Hamilton.ES, Hoffman.es, Bonikowski.Henr, hamerly.wbst, Crocker.henr, Deutsch.PA, Boggs.pa, SChen.PA, charnley.pa, Fiala.PA, Kellman.ES, Lynn.es, Denber.WBST, JONL.PA cc: Wedekind, MesaFolklore^.pa From: Wedekind.es Thanks a lot to everyone who helped out with my BITBLT question. Here are the question and replies, followed by a summary of what I got out of it. Warning - if you're not a user of the BITBLT operation, you'll get bored real quick! Date: 2 Aug 83 00:00:00 PDT (Tuesday) Subject: BITBLT on D-machines To: D0Users^.pa, MesaUsers^.pa cc: Wedekind Reply-To: Wedekind.ES From: Wedekind.es Can anyone tell me, within a factor of five or so, how the (hardware plus software) setup time for the BITBLT operation on a Dolphin or Dlion compares to the word transfer rate once the operation is set up? Put another way, if x is the number of seconds to initiate a BITBLT operation, how many words can an already-running BITBLT transfer in x seconds? The answer will help determine some tradeoffs in polygon-filling graphics for CDS. Please respond to me with answers or if you want to see the replies. thanks a lot, ~ Jerry Date: 2-Aug-83 17:37:44 PDT (Tuesday) From: CharlieLevy.es Subject: Re: BITBLT on D-machines In-reply-to: Your message of 2 Aug 83 00:00:00 PDT (Tuesday) To: Wedekind cc: CharlieLevy.es A roundabout answer: BitBlting 16 bits by 1 bit takes but a small fraction of the time to set up the BitBlt (this time includes the Mesa time plus the ucode time). Alternatively, BitBlting 16 bits x 200 bits is only a few times longer than 16 bits x 20 bits. All this seems to say that setting up is the larger element. You can check any or all of these parts by using the hack ButtonsAndLightsDefs/Pack, on the hacks 8 and 10 directory. This will measure ANY piece of work to within the nearest 28 microseconds, on the screen or in a debugger variable. Your code would look like

OPEN ButtonsAndLightsDefs;
StartTiming[1]; SetUpBitBltTable[]; [] _ EndTiming[1];
StartTiming[2]; BitBlt.BITBLT[ptBitBltTable]; [] _ EndTiming[2];

Bar charts 1 and 2 would have the results to the nearest 28 usec. Do this for a large and then a small operation, and you have all your answers. Do it in a loop, because the first timings might be invalid due to swapping, etc. Charlie Date: 3 Aug 83 14:02:02 PDT (Wednesday) From: charnley.pa Subject: DLion BitBlt To: Wedekind.es cc: charnley.pa BitBlt time on a Dandelion is approximately the following:

SetUp: 68 clicks
Transfer: 4 clicks per destination word
Refill: 32 clicks {between each horizontal line}

Factors which modify these numbers: gray inner loops are only 2 clicks {if modifying dest} or 1 {if replacing dest}. Refill time is longer if either the source or dest bitmap is not an integral number of words {16 bits} wide. Memory references to the display bank take an average of 2 {rather than 1} clicks; this means that moving off-screen data to the screen takes 6 clicks, and moving screen to screen takes 7 clicks. Assuming you are BitBlting from off screen to on, with integral-word bitmaps, and not gray, then the answer to your question is about 11 words moved in the initial setup time. If the alternative is continuing an already-running BitBlt, then this reduces to (68 - 32) / 6 = 6 words. Next time include me in your distribution list. 
{I don't know about Dolphin times} don Date: Wed, 3 Aug 83 11:06 PDT From: Fiala.PA Subject: Re: BITBLT on D-machines In-reply-to: "Wedekind.es's message of 2 Aug 83 00:00:00 PDT (Tuesday)" To: Wedekind.es cc: Fiala The following are the approximate times for the Alto BITBLT; Pilot BITBLT should be almost the same, but the initial setup is less than for Alto BitBlt. Also, these times are emulator cycles @ 100 ns on a standard Dolphin, but they have to be divided by [1 - percent IO], where "percent IO" is the percentage of all cycles given to tasks other than the emulator. "Percent IO" is about 20% with the standard Pilot emulator running the LF monitor. Also, you have to understand a little about the way the BITBLT inner loop works. If either the source and destination are word-aligned or there is no source (i.e., if the destination is receiving a gray-mode pattern only), then each destination word is filled with a single move; otherwise, two moves are needed. In the word-aligned case, each quadword is completed by one source-dest refill and three moves; in the non-aligned case, each quadword is completed by one source refill, one dest refill, and 6 moves. Also note in the times below that quadword refills happen after a quadword boundary is crossed; the scanline time includes the initial quadword fetch for the source and the final quadword store into the destination.

Initialization: ~275 cycles source-to-dest, or 210 cycles if no source
+ ~65 cycles/scanline
+ ~34 cycles/quadword for source-dest refill when word-aligned
+ ~26 cycles/quadword for dest refill when no source and word-aligned
+ ~22 cycles/source quadword for source refill if not word-aligned
+ ~26 cycles/destination quadword for destination refill if not word-aligned
+ 4 or 6 cycles/byte moved

(A rough C transcription of these numbers appears below, after JONL's message.) Date: 5 AUG 83 01:53 PDT From: JONL.PA Subject: BITBLT timings on D-series machines To: Wedekind.ES cc: LispSupport In reply to your quest of 2 Aug 83 00:00:00 PDT: Several months ago I did some comparative timings of the BITBLT function from Interlisp-D; I ran tests on the Dorado, Dolphin, and DandeLion, using two different kinds of calls to BITBLT (one, which I call BltShade, just stuffs a pattern into memory, and the other, which I call BltInvert, inverts the bits in memory). The reason for the two kinds of tests is to discern whether the "shade" case is optimized for memory access. The timings reflect the then-current Lisp microcode for BitBlt, as well as the BitBlt setup time (which is done in Lisp "macro code"). These numbers would likely vary for other implementations. Incidentally, I believe we were using the Pilot definition of BitBlt by the time I ran these tests, rather than the Alto one.

DORADO   BltShade   261us Setup    16ns/pixel
         BltInvert  277us Setup    25ns/pixel
DOLPHIN  BltShade   2.25ms Setup   85.4ns/pixel
         BltInvert  2.31ms Setup   87.4ns/pixel
DLION    BltShade   2.17ms Setup   70.8ns/pixel
         BltInvert  2.16ms Setup   137ns/pixel

Although it would appear that the Dolphin doesn't "optimize" the BltShade case, that may not be very important. The more important factor, which seems to be the bottleneck, is memory bandwidth; in the case of the Dorado, the bandwidth is much greater because it can fetch 16 words at a time and also "pre-warm" the cache before BitBlting. In the Dolphin's case, there is a quadruple-word fetch. The DLion's performance is exactly what you would expect if the microcode bashed one 16-bit word roughly every microsecond. 
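Here is the promised C transcription of Ed Fiala's Dolphin cycle accounting, for the word-aligned, source-to-dest, non-gray case only. The function name and the quadword rounding are assumptions made for illustration; real timings also depend on alignment and on the 4-vs-6 cycles/byte distinction glossed over here.

    /* Rough Dolphin BITBLT estimate from the cycle counts above
       (word-aligned, source-to-dest, non-gray case).  Cycles are 100 ns
       emulator cycles; the result is divided by (1 - ioFraction) because
       io tasks take that fraction of all cycles. */
    double dolphinBitbltMicroseconds(int scanlines, int bytesPerLine, double ioFraction)
    {
        int quadwordsPerLine = (bytesPerLine + 7) / 8;      /* 8 bytes per quadword */
        double cycles = 275.0                               /* initialization */
            + scanlines * (65.0                             /* per-scanline overhead */
                           + 34.0 * quadwordsPerLine        /* source-dest refills */
                           + 4.0 * bytesPerLine);           /* 4 cycles/byte moved */
        return cycles * 0.1 / (1.0 - ioFraction);           /* 100 ns per cycle */
    }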
Why I asked
Sometimes extra calls on BITBLT can reduce the total number of words which must be transferred. One example is that of drawing a diagonal line L bits long with prestored bit patterns - you can do this with one BITBLT of L^2 bits, L BITBLTs of one bit, or anything in between (in between is usually better!). Another is in polygon filling, where sometimes, by using a scratch area, you can get by with fewer BITBLT calls at the cost of having to move every word two or three times instead of once. So it's nice to know how the overhead of a BITBLT compares with the transfer rate. Also, with extra computation sometimes you can avoid a BITBLT altogether, so it's helpful to have a feel for the absolute cost of the operation as well.

My (possibly butchered) interpretation
I'll word-align my sources and destinations where possible. For tradeoffs like those above, I'll just use very rough lumped timing values. Based on Ed's figures and timings using Charlie's hack, the very rough values I'm going to use for a Pilot D0 are:

25 uSec microcode setup time
2-3 uSec/word transfer rate within a scan line
+6.5 uSec/scan line

There also appears to be a 15 uSec procedure call overhead, and of course it takes time to recalculate the parameters in the bbptr record: the absolute minimum for this is 30-40 uSec (a couple of additions, updating only the "dst" field of the record), but it is often at least 250 uSec (a DIVMOD, a LongMult, and assignment to a couple of record fields). Thus the parameter calculation can easily dominate the microcode setup time. So an additional BITBLT is faster if it will save the transfer of roughly this number of words (this is just a tabulation of the function (15 + 25 + P)/(2.5 + 6.5/W), sketched in C below; I used one significant figure only because I couldn't figure out how to use zero significant figures):

                                   P (= param calc & extra logic overhead):
                                   minimal (40 uSec)    moderate (250 uSec)
W (= words per BITBLT scanline):
  1                                      10                   30
  4                                      20                   60
  10                                     25                   80
  infinity                               30                  100

These numbers are smaller for word-aligned BITBLTs, somewhat larger otherwise. Don's corresponding figures for the Dlion come out to:

28 uSec microcode setup time
13 uSec per scan line

and, if I understand him, anywhere from 1 to 7 clicks (.4 to 2.8 uSec) per word transferred. (So you could make a table for the Dlion like the one above, depending on the type of BITBLTs you were doing and using somewhat smaller values for the proc call and param calculation overheads.) JONL's figures seem to show that the Lisp microcoding for BITBLT is maybe 30% more efficient than Tajo's - that's interesting. To see the effect of the word alignment Ed referred to, I tried drawing a constant-length diagonal line on a 1 pixel/bit display using differently sized BITBLTs of a prestored pattern. This is what resulted (I hope you have a fixed-pitch font!):

[ASCII graph: time (ms, from 6 to 13) versus BITBLT size (bits, from 2 to 80), for a line length of 320 bits; "*" points show ordinary BITBLT sizes and "x" points show even-word sizes. The prestored pattern and the destination both started at word boundaries.]

In case this didn't come out on your screen, the graph is steeply decreasing until about 20, then stays fairly constant till 40, then begins a gradual rise. Note that if the BITBLT size is a multiple of 16 and if the first one is word-aligned, they all will be. The small dips in the "*" graph at multiples of 16 bits are due to this faster transfer of word-aligned quantities. 
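As an aside, the break-even tabulation above is just this one-line function. The name is invented; the constants are Jerry's rough lumped values, so treat the result as order-of-magnitude only.

    /* Break-even word count for issuing one extra BITBLT on a Pilot D0:
       (15 uSec call + 25 uSec setup + P uSec parameter calculation)
       divided by the per-word cost (2.5 uSec/word plus 6.5 uSec spread
       over a W-word scanline). */
    double breakEvenWords(double paramCalcUsec, double wordsPerScanline)
    {
        return (15.0 + 25.0 + paramCalcUsec) / (2.5 + 6.5 / wordsPerScanline);
    }

    /* breakEvenWords(40.0, 4.0) is about 19.4, matching the "20" entry
       in the table above. */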
The "x"'s show an additional advantage of using even-word BITBLT sizes: the code can be rewritten to make the parameter calculation simpler in these cases. So for this specific operation it looks like a "chunk size" of 32 is a winner on b/w screens.  On color screens (more bits/pixel), I think 16 would be best. ~ Jerry *start* 03306 00024 US Date: 20 Sep 83 17:08:36 PDT (Tuesday) Subject: Whoops! Floating Point Benchmark To: MesaFolklore^.pa, Sweet, Poskanzer, Karlton cc: PNeumeister, Malasky, Nunneley, Newman.es, Gunther From: Paul Due to a misinterpretation of REAL representations, and using LOG10 instead of natural Log, I messed up the results. The results are now actually closer to the expected result, and faster. The LReal package still wins out even over our uCode assisted FP package. No, as far as I know, nobody has figured out how to hang an 8087 off of a DLion. An 8087 is a 16-bit numeric function processor chip put out by Intel, used with an 8086, and an 8232 is an 8-bit numeric processor chip used with 8080's and 8085s. (I find Phil's use of the 'NATURAL' type interesting... I've never heard of it before. How does it differ from CARDINAL when you use it that way in the FOR loop?) And I wonder if the results would be the same if you used Root[2, Power[x,2]] instead of SqRt[Multiply[x,x]]?! The message should be as follows: ---------------------------------------------------------------- I picked up a Dr. Dobb's Journal (#83, September, 1983, p. 120) the other day and found this interesting benchmark/quality test, originally written in BASIC and Pascal. They were rated on both speed and accuracy as follows for popular implementations of high level languages on popular micros (8085 @5MHz & 8086 @5MHz, they didn't mention if they were packaged in a system...): The Algorithm: A=1 ILOOP=2500 FOR I=1 TO ILOOP-1 A= TAN(ATN(EXP(LOG(SQR(A*A))))) + 1 NEXT I The arctangent function is used to accentuate discrepencies in the language's floating-point algorithms and/or rounding methods. The answer ~should~ obviously be 2500.000. This is my Mesa translation of the above: OPEN RealFns; iLoop: CARDINAL = 2500; a: REAL _ 1; one: REAL = 1; ten: REAL = 10; FOR i: CARDINAL IN [1..iLoop) DO a _ Tan[ArcTan[x: one, y: Exp[Ln[SqRt[(a*a)]]]]] + one; ENDLOOP; Phil's LReal implementation is: OPEN LReal; out: Format.StringProc = Exec.FeedbackProc[h]; start, stop: System.GreenwichMeanTime; value: LONG STRING = [30]; iLoop: NATURAL = 2500; a: Number _ StringToNumber["1"L]; one: Number = a; Format.Text[out, "And away we go\n"L]; start _ System.GetGreenwichMeanTime[]; FOR i: CARDINAL IN [1..iLoop) DO a _ Add[Tan[ArcTan[Exp[Ln[SqRt[Multiply[a, a]]]]]], one]; ENDLOOP; stop _ System.GetGreenwichMeanTime[]; With hardware floating point libraries from MicroFloat: Languages Version Result Time (sec.) PL/I-86 w/8087 1.01 2477.244 3.7 8080 Asm w/8232 RMAC 2499.955 10.2 PL/I-80 w/8232 1.40 2499.995 10.4 BASIC-80 w/8232 5.20 2500.000 10.7 Fortran-80w/8232 3.40 2499.995 12.5 With standard floating point libraries: BASIC-86 Interpreter 5.20 2179.850 92.2 Fortran-80 3.40 2304.863 140.8 BASIC-80 Compiler 5.20 2304.860 140.8 BASIC-80 Interpreter 5.20 2304.860 174.9 PL/I-86 1.01 1641.758 179.6 PL/I-80 1.30 1641.758 254.4 Here is how Mesa on a DLion fared: Type Result Time (sec.) 
LReal Package               2500.002   339
Versatec uCode Assist       2503.195   377
Standard DLion FP package   2503.195   474

Paul *start* 01038 00024 US Date: Tue, 20 Sep 83 17:17 PDT From: Taft.PA Subject: Re: Floating Point Benchmark In-reply-to: "Your message of 20 Sep 83 16:40:27 PDT (Tuesday)" To: Karlton cc: Paul, Malasky, Nunneley, MesaFolklore^.pa, Taft The same program run in Cedar ([Ivy]FloatBenchmark.mesa) takes 2.88 seconds on a Dorado and 15.56 seconds on a Dandelion, and gives a result of 2503.195 on both machines. This is with complete microcode implementations of the basic operations (+ - * /), but not of SqRt or any of the other functions. Observations: 1. The inaccuracy in the answer is probably due to deficiencies in one or more of the transcendental functions rather than in the basic IEEE operations, which we believe to be correct. I believe the Cedar versions of the transcendental functions are unchanged from Larry Stewart's original Alto/Mesa implementation. 2. Having floating point microcode sure makes a big difference in performance, but it's hard to do as well as floating point hardware. Ed Taft *start* 01639 00024 US Date: 27 Sep 83 19:00:15 PDT (Tuesday) From: JLarson.PA Subject: Re: Floating Point Benchmark To: DesignAnalysis^.pa cc: CIG^.pa, MesaFolklore^.pa Reply-To: JLarson.PA I added D0 (Cedar, Mesa 10.0) numbers to fill out the chart. (Those of you doing number crunching, and considering moving from D0's to DLion's, note the factor of 18 between the best DLion/MDE numbers and the D0, and the improved performance with DLion/Cedar.) John ------------------------------------------------------------ The Algorithm:

A=1
ILOOP=2500
FOR I=1 TO ILOOP-1
A = TAN(ATN(EXP(LOG(SQR(A*A))))) + 1
NEXT I

The arctangent function is used to accentuate discrepancies in the language's floating-point algorithms and/or rounding methods. The answer ~should~ obviously be 2500.000. With hardware floating point libraries from MicroFloat:

Languages               Version   Result     Time (sec.)
PL/I-86 w/8087          1.01      2477.244     3.7
8080 Asm w/8232         RMAC      2499.955    10.2
PL/I-80 w/8232          1.40      2499.995    10.4
BASIC-80 w/8232         5.20      2500.000    10.7
Fortran-80 w/8232       3.40      2499.995    12.5

With standard floating point libraries:

BASIC-86 Interpreter    5.20      2179.850    92.2
Fortran-80              3.40      2304.863   140.8
BASIC-80 Compiler       5.20      2304.860   140.8
BASIC-80 Interpreter    5.20      2304.860   174.9
PL/I-86                 1.01      1641.758   179.6
PL/I-80                 1.30      1641.758   254.4

Type                          Result     Time (sec.)
Mesa Development Environment:
  D0/Standard FP package      2503.195    18
  DLion/LReal Package         2500.002   339
  DLion/Versatec uCode Assist 2503.195   377
  DLion/Standard FP package   2503.195   474
Cedar:
  Dorado                      2503.195     2.9
  Dlion                       2503.195    15.8
  D0                          2503.195    20

*start* 01038 00024 US Date: 26 Sep 83 19:15:17 PDT (Monday) From: Murray.PA Subject: Version OpCode To: Taft, Johnsson cc: Murray I feel trapped in the middle. Anybody got any good suggestions? I think it would be possible to write some code that figured out whether there were 2 or 4 words returned, but that seems a bit ugly. ---------------------------------------------------------------- Date: 16 Sep 83 08:38:34 PDT (Friday) From: Johnsson.PA Subject: Re: Version OpCode In-reply-to: Your message of 15 Sep 83 20:40:09 PDT (Thursday) To: Murray cc: Johnsson I am the keeper of the PrincOps. We defined ESCAlpha.aVERSION = 57B some time ago but never reached any agreement on what it did. I proposed something like having it return four words, the first byte of which was registered (i.e., approximately = machine type) and determined the meaning of the remainder.  
As far as I know it is not implemented in either microcode or mesacode by any of our stuff. ---------------------------------------------------------------- *start* 01685 00024 US Date: Tue, 27 Sep 83 09:14 PDT From: Taft.PA Subject: Re: Version OpCode In-reply-to: "Murray's message of 26 Sep 83 19:15:17 PDT (Monday)" To: Murray cc: Taft, Johnsson My understanding was that Sandman left it for Fiala to define. Since we're the first and only people to have implemented this opcode so far, I suggest that you adopt our definition unless there is a good reason to do otherwise. I assert that the Version opcode should return only machine-independent information. Different processors will have different amounts of machine-dependent information they want to return (surely more than two words in some cases); and I claim that information should be the responsibility of the ProcessorHead to provide by whatever means is appropriate (machine-dependent opcodes, perhaps). Ed p.s. The Cedar definition of Version is copied below (this is the Rubicon version).

-- MicrocodeVersion.mesa definitions for Alto/Mesa and Rubicon Pilot
-- modified 18-May-82 17:56:45 by Taft
DIRECTORY Mopcodes USING [zMISC];
MicrocodeVersion: DEFINITIONS = BEGIN
MachineType: TYPE = MACHINE DEPENDENT {
  altoI (1), altoII (2), altoIIXM (3), dolphin (4), dorado (5),
  dandelion (6), dicentra (7), (17B)};
VersionResult: TYPE = MACHINE DEPENDENT RECORD [
  machineType (0: 0..3): MachineType,
  majorVersion (0: 4..7): [0..17B],  -- incremented by incompatible changes
  unused (0: 8..13): [0..77B],
  floatingPoint (0: 14..14): BOOLEAN,
  cedar (0: 15..15): BOOLEAN,
  releaseDate (1): CARDINAL];  -- days since January 1, 1901
VERSION: PROCEDURE RETURNS [VersionResult] = MACHINE CODE {
  Mopcodes.zMISC, 104B};
END.

*start* 00871 00024 US Date: 10-Oct-83 22:59:46 PDT From: Rovner.pa Subject: Back on the track: the ZCT stuff for Cedar 5.0 To: Taft Reply-To: Rovner cc: Atkinson, Fiala, Rovner Ed, I made and tested the changes that Ed Fiala requested. They include ...

1. Both the zct and the fostable are aligned on page boundaries
2. The components of ZCT.ZCTObject have been re-arranged (see ZCT.mesa)
3. WriteUCodeRegisters has gone away
4. EnableMicrocode and DisableMicrocode have been redefined (see RCMicrocodeImpl.mesa)
5. The opcodes have been re-assigned (see CedarMicrocode.mesa)
6. rcBottom = 0 (see RCMicrocodeImpl.mesa)
7. OnZ stuffs thru wp^ first, then checks for ZCTFull (see RCMicrocodeImpl.mesa)

The files are to be found via /Indigo/PreCedar/Top/NewSafeStorage.df. I'm ready anytime to try out new Dorado microcode. Cheers, Paul *start* 00797 00024 US Date: Tue, 11 Oct 83 15:29 PDT From: Taft.PA Subject: Re: Back on the track: the ZCT stuff for Cedar 5.0 In-reply-to: "Your message of 10-Oct-83 22:59:46 PDT" To: Rovner cc: Atkinson, Fiala, Taft I expect to have the microcode changes done by early tomorrow. In RCMicrocodeImpl, one of the comments is slightly misleading (either that or I misunderstand the code). In OnZ, there is a comment: "wp will not point to the link word; wp^ may be left with garbage if ZCTFull is raised". What is really true is that if ZCTFull is raised, the last cell (before the link word) will simply not have been used; that is, ZCTFull will be raised when there appears to be space left for one more item. (zct.wp MOD zctBlockWords will be equal to zctBlockWords-4 in this case.) 
Ed *start* 05002 00024 US Date: Mon, 17 Oct 83 17:19 PDT From: Fiala.PA Subject: Klamath changes To: Bridge.ES cc: Fiala, Singhania.ES, Lynn.ES, Taft Ed Taft, I am copying you on this message because you may be interested in this eventually. Laura, please let me know when you have some test software I can try. Here is the message I got from Andy Daniels regarding the Klamath changes. I have been faithfully following the documentation mentioned in Andy's message, with the errata and additional comments noted here. The Sweet document referred to is "32-bit Control Links for Klamath" on [McKinley]Memos>Klamath 32-bit Control Links 5 May 1983. This is a Star document. Here are errata for Dick Sweet's document: 1) On page 2, the short paragraph toward the lower middle of the page should read: "Local function calls become three byte instructions with alpha/beta being the initial CODE-relative pc (i.e., the PC of the FSI byte that begins the procedure)." 2) In the paragraph immediately after (1), the first sentence should read "... the D0, DB, and DBS instructions are not needed ..." 3) On page 6, the paragraph in the middle should begin with "For the XE and XF operations the microcode would prefer that the control link at local beta be at an even address." Here are errata for the MopcodeChanges.txt document: 1) Under "Deleted", aMX should be "1" instead of "2"; "DBS" should be added to the deleted opcodes. 2) The opcodes under "Added" aren't defined anywhere. J5 and J7 are just unconditional jumps like J2, J3, etc. ME and MX retain their old definitions but are changed from ESC to regular opcodes. LFC is a three-byte opcode in which alpha/beta are the CODE-relative PC of the FSI byte of a procedure in the same code segment. CAW is "Code Address Word"; I think it pushes a long pointer equal to the code base + alpha/beta onto the stack. GA1 is "Global Address 1"; it is like GA0, pushing the short pointer GLOBAL+1 onto the stack. JDEB and JDNEB are "Jump Double Equal Byte" and "Jump Double Not Equal Byte"; they jump if the two double-words on the stack are equal or not equal and pop both of the double-words off of the stack. If there is any confusion, ports are 4-word objects that have word 0 always at a location that has the two low-order bits equal to 2 (so that a pointer to the port is an indirect control link). The four words in the port are the 2-word source link followed by the 2-word destination link. As for the changes to Sandman's diagnostic germ that would be useful, the first step should be to regenerate the existing program, removing obsolete opcodes and adding new ones for Klamath. The program will be useful with just that much done to it. However, since most of the Klamath changes are in xfer and the process machinery, more comprehensive tests are really needed. I don't think Sandman's program tests these more complicated areas at all. If you are willing to tackle diagnostics for the process machinery and/or xfer, that would be useful. A simple and incomplete xfer test (i.e., one that doesn't generate any of the exceptional error conditions and faults) should be possible without much trouble. One that generates the exceptional cases would take much longer. Same for the process machinery. Let me know how much you are willing to attempt. --------------------------- Date: 10 Aug 83 15:33:58 PDT (Wednesday) From: Daniels.PA Subject: Changes for Klamath To: Fiala cc: Daniels I've collected what I think are the relevant files on [Igor]11b>. 
Dick Sweet has a memo that summarizes the architecture changes made for Klamath. It is a Star document, so I've sent you a copy through ID mail. Most of the changes are described there. The only changes not covered are the changes to trap parameters: CodeTrap takes a GlobalFrameHandle; ControlTrap takes a (16-bit) ShortControlLink that is the source of the XFER that trapped; UnboundTrap takes a (32-bit) ControlLink that was the destination of the XFER (it can trap because either the global frame or the pc in the link was 0); XferTrap takes 3 words: a ControlLink (the destination) and a machine-dependent enumeration that describes what type of Xfer it is. The enumeration, which can be found in XferTrap.mesa, is:

XferType: TYPE = MACHINE DEPENDENT {
  ret(0), call, lfc, port, xfer, trap, pswitch, (65535)};

call => SFC, KFCB
port => PO, POR
xfer => XE, XF
ret, lfc, trap, pswitch => the obvious things.

BLTLR must be in ucode for Klamath to run. This change is unrelated to the architecture changes for 32-bit control links, but is necessary nonetheless. I've put the following files onto [Igor]11b>:

ESCAlpha.mesa
MopcodeChanges.txt
Mopcodes.list
PrincOps.mesa
PSB.mesa
SDDefs.mesa
Trap.mesa
Traps.mesa
XferTrap.mesa

I think this should be enough to get you started. If you have any questions about the details, let me know. -- Andy. -- ------------------------------------------------------------ *start* 00439 00024 US Date: 19 Oct 83 09:17:08 PDT (Wednesday) From: Bridge.ES Subject: Re: Klamath changes In-reply-to: Fiala.PA's message of 17 Oct 83 17:19 PDT (Monday) To: Fiala.PA cc: Bridge, Singhania.ES, Lynn.ES, Taft.PA Ed, The test germ, sources, code listings, and bcd's for Klamath u-code are stored on [OLY]11.0>Pilot>. I am looking into extending the test code and will message you when there is more. ~~Laura *start* 01244 00024 US Date: Fri, 9 Sep 83 15:10 PDT From: Fiala.PA Subject: Cedar Microcode changes To: Taft, Rovner cc: Fiala Here are more Cedar microcode changes I would like: 1) I want to store a new ZCT entry before checking for the ZCT block being full; if the block has become full as a result of the store, the microcode will then either advance to the next block or trap if the current block is the last block. In the event of a trap, the store will be ignored. I think the total effect of this change on the software is that the ZCTFull trap will happen with 1 empty slot at the end of the last block. This change saves about 8 cycles on the Dolphin by allowing the ZCT write pointer to be advanced during the 15 dead cycles following the store of the ZCT entry. 2) I want to generate the RCUnderflow trap when the RefCnt is being counted down from 1 to 0 after previously overflowing. This allows me to do a 4-way dispatch (in three places) on the two low-order bits of an NHP rather than an 8-way dispatch. 3) The ZCTObject and FOSTable should start at EVEN storage addresses. For possible future microcode additions, it would be even better if the ZCTObject started at a 64-bit boundary (i.e., at a quadword address). *start* 00580 00024 US Date: Sat, 10 Sep 83 13:28 PDT From: Taft.PA Subject: Re: Cedar Microcode changes In-reply-to: "Fiala's message of Fri, 9 Sep 83 15:10 PDT" To: Fiala cc: Taft, Rovner I have no objections to (1) and (3) if they make life easier for you. I fail to understand your proposal (2) because I believe the case you describe cannot happen. 
If the RC has previously overflowed (nhp.rcOverflowed=TRUE), then an underflow trap occurs if an attempt is made to decrement nhp.refCount below rcBottom, so it is not possible for it to get all the way down to zero. Ed *start* 00660 00024 US Date: Mon, 12 Sep 83 11:54 PDT From: Fiala.PA Subject: Re: Cedar Microcode changes In-reply-to: "Taft's message of Sat, 10 Sep 83 13:28 PDT" To: Taft cc: Fiala, Rovner Change in the last line of my previous message: the proposal is to make RCBottom be 0. In other words, the events "underflowing after a previous overflow" and "counting to 0" are detected at the same time. Since the "previously overflowed" bit is just to the right of the RefCnt field, this allows my microcode to dispatch on the low bit of the RefCnt and the "previously overflowed" bit using a 4-way dispatch (after first checking that the top 5 bits of RefCnt were 0). *start* 00615 00024 US Date: Mon, 12 Sep 83 12:12 PDT From: Taft.PA Subject: Re: Cedar Microcode changes In-reply-to: "Fiala's message of Mon, 12 Sep 83 11:54 PDT" To: Fiala cc: Taft, Rovner I guess that's ok. But allowing nhp.refCount to be zero while nhp.rcOverflowed is TRUE means that all the code (in microcode and software) that presently takes action when nhp.refCount=0 (e.g., put the object on the ZCT) must be changed to do so only if nhp.refCount=0 AND nhp.rcOverflowed=FALSE. This is an easy change in my microcode. I'll leave it to Paul to decide whether this change is reasonable in the software. Ed *start* 01259 00024 US Date: Tue, 24 Jan 84 14:30 PST From: Levin.PA Subject: A microcode hack To: Taft cc: You seem to be the relevant person for this one. Sproull and I worked this out over lunch one day. Bob asserted that it would be useful to support "dynamic loading and binding" of opcodes with special purpose packages (presumably graphical grinders). We didn't see any problem with it, although you may have a better scheme. Also, the demand is approximately zero. Nevertheless, you may wish to file it with your microcode list. Roy --------------------------- Date: 13 April 1983 2:56 pm PST (Wednesday) From: Sproull.PA Subject: reminder To: levin Roy, This is a message to remind you of the scheme we cooked up so that special-purpose microcode that implements emulator extensions can be accommodated gracefully in Cedar. The idea is to augment the interpretation of the "unimplemented op code" trap table to provide either the in-core address of Mesa instructions that implement the op code or an address in the microcode memory. This has the nice feature that resetting the world (i.e., disabling special microcode) can be done by simply sweeping the table. Bob ------------------------------------------------------------
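A minimal sketch of Sproull's scheme, in C, under stated assumptions: the table size, the entry representation, and every name below are invented for illustration; the actual trap-table format was never specified in this exchange.

    #include <stdint.h>
    #include <stddef.h>

    /* Each unimplemented-opcode entry either points at Mesa code in
       core or names a microstore address (for dynamically loaded
       special-purpose microcode). */
    typedef enum { UNBOUND, MESA_CODE, MICROCODE } EntryKind;

    typedef struct {
        EntryKind kind;
        uint32_t  address;   /* core address of Mesa code, or microstore address */
    } OpcodeEntry;

    static OpcodeEntry trapTable[256];

    /* "Resetting the world" -- disabling all special microcode -- is
       just a sweep over the table, as the message notes. */
    void ResetOpcodeBindings(void) {
        for (size_t i = 0; i < 256; i++)
            trapTable[i] = (OpcodeEntry){ UNBOUND, 0 };
    }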