:TITLE[MesaOP2];*Opcodes 200 to 277b + refill trap instructions

%Ed Fiala 21 February 1984:
Bummed 2 cycles each off JW, EXCH, ACD, ADDSB, DCMP<0, and UDCMP<0
timing; 4 cycles off DEXCH; 1 to 7 cycles off J5-J8. Add @J5, @J7,
@JDEB, @JDNEB for Klamath; delete @OP276; move @LP to MesaOP3; change
TrapParm to xfTrapParm0; absorb refill trap mi from elsewhere.
%

%The PCB,,PCBhi base register points at the current instruction quadword.
PCB[14:15] are 0, and the low 3 bits of the PC (which point at a byte
within the quadword) are kept in PCF. Since code segments cannot cross 64K
boundaries and are limited to 32K words in length, the two bytes of PCBhi
are forced to be equal, rather than having the least significant byte
differ from the msb by 1. This facilitates negative jumps.

Refill occurs when, at the onset of a NextInst or NextData, PCF contains a
value greater than 7b. In this case the mi is aborted and the trap mi at
location 0, ’LoadPageExternal[0], GoToExternal[377]’, is executed, sending
control to location 377b on the page that caused refill. Identical mi
exist on all pages from which refill might occur.

Refill timing is as follows: The aborted mi, trap mi at 0, and PFetch4 use
6 cycles; memory wait uses 12 more, totalling 18 cycles. Assuming that the
next byte in the instruction stream crosses a quadword boundary 1/8 of the
time, a charge of 18/8 = 2.25 cycles/byte is appropriate.

Ideally, a jump opcode should be charged execution time + 2.25 cycles/byte
corrected by the improvement or worsening of the quadword position in the
instruction quadword as a result of the jump x 2.25 cycles. However, if
all quadword positions are equally likely jump targets, the position will
be improved by 1 byte on average (because the original value of PCF is
1..10b; after the jump it is 0..7b), so average execution time
+ 2.25 cycles/byte in the opcode - 2.25 cycles is charged to a jump.
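
As a rough check of the arithmetic above, the per-byte refill charge and the
resulting jump charge can be sketched in Python. The function and constant
names are illustrative assumptions, not names used by this microcode; only
the 6-, 12-, and 18-cycle figures and the 1/8 crossing rate come from the
text.

```python
# Sketch only: names are hypothetical; cycle counts are those quoted above.
REFILL_CYCLES = 6 + 12        # aborted mi + trap mi + PFetch4, plus memory wait
BYTES_PER_QUADWORD = 8        # a refill is assumed on 1 byte in 8


def refill_charge_per_byte():
    """Average refill cost charged to each instruction byte."""
    return REFILL_CYCLES / BYTES_PER_QUADWORD


def jump_charge(execution_cycles, opcode_bytes):
    """Execution time plus the per-byte charge, less one byte of improvement."""
    per_byte = refill_charge_per_byte()
    return execution_cycles + per_byte * opcode_bytes - per_byte


print(refill_charge_per_byte())   # 2.25
```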

However, counting is more difficult when memory references are in progress
at the tail of the opcode. Consider the following sequence, for example:
PFetch1[LOCAL,Stack];
LU ← NextInst[IBuf];
NIRet;
The 1st mi of the next opcode will be aborted once (two cycles) to ensure
that PCX and SStkP will remain valid for the previous opcode on a fault.
Any fault (sans memory transport) aborts the 4th mi after the PFetch1. If
the NextInst causes refill, timing for this sequence will be 25 cycles;
otherwise, timing will be 9 cycles if the next opcode does not reference
[S] for a long time or 15 cycles if [S] is referenced in the 1st mi of the
next opcode. Assuming the NextInst causes refill 1/8 of the time, this
sequence will average 11.0 cycles if [S] is not referenced for a long time
or 16.25 cycles if [S] is referenced in the 1st mi of the next opcode. All
of these times assume the quadword is not retransmitted from MC2 as a
result of a correctable storage failure.

Similarly, for the following:
PStore1[LOCAL,Stack];
LU ← NextInst[IBuf];
NIRet;
If the NextInst causes refill, the timing for this sequence is 35 cycles;
otherwise, timing is ~9 cycles if [S] is not written and no reference
occurs for a long time, 17 cycles if [S] is written or a new reference is
issued in the 1st mi of the next opcode. Assuming NextInst causes refill
1/8 of the time, this sequence averages 12.25 cycles if the next opcode
doesn’t write [S] or start a reference and 19.25 cycles if it does one of
these things in its 1st mi.
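
The averages quoted for the PStore1 sequence appear to be plain weighted
means of the refill and no-refill timings. A minimal Python sketch, in which
the function name and the default probability argument are assumptions made
for illustration:

```python
def average_cycles(no_refill_cycles, refill_cycles, refill_probability=1 / 8):
    """Weighted mean of the two cases; refill is assumed on 1 byte in 8."""
    return ((1 - refill_probability) * no_refill_cycles
            + refill_probability * refill_cycles)


# PStore1 sequence: 35 cycles with refill, 9 or 17 cycles without.
print(average_cycles(9, 35))    # 12.25
print(average_cycles(17, 35))   # 19.25
```

The same function applies to the PFetch1 sequence when given that
sequence's own refill and no-refill timings.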
%

MesaRefill:
PCF ← RZero, At[MesaRefillLoc];
*Hold page fault on page 0 while assuring that the PCB←PCB+4 doesn’t happen
*until AFTER the page fault. Even if the PCB+4 were in the mi immediately
*after the PFetch4, it would be uncertain whether or not that mi had
*completed, so the fault handler could not be certain what to do.
Nop;
Nop;
PCB ← (PCB) + (4C), Return;

*Mandatory refill trap mi for NextInst and NextData operations.
:IF[WithDShift]; ***************************************
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[dsPage0,10],377]; *3b
:ENDIF; ************************************************
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[opPage0,10],377];
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[opPage1,10],377];
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[opPage2,10],377];
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[opPage3,10],377];
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[moPage,10],377]; *10b
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[bbP1,10],377]; *11b
:IF[WithFloatingPoint]; ********************************
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[fpPage0,10],377]; *13b
:ENDIF; ************************************************
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[xfPage1,10],377]; *15b
PFetch4[PCB,IBuf,4], GoToP[MesaRefill], At[LShift[prPage,10],377]; *16b

%NOTE: after any PFetch4 on page 6 faults, the fault handler resumes after
filling IBuf with 377b bytes; this means that it is inadvisable to use the
bypass kludge after the PFetch4 and control must remain on page 6 (i.e.,
no tasking is allowed in the 3 mi after the PFetch4 which refills IBuf).
If the bypass kludge were used, transport for a preceding PFetch4, such as
the one in RDC.Mc, could experience error correction and advance page fault
time so that the mi containing the bypass kludge was aborted, and
this would then execute incorrectly.

NOTE: The final mi of a jump opcode writes the PCB register, so it is
illegal to begin any opcode with a PCB-relative fetch with non-zero
displacement.

PCF points to the byte after the last one fetched (i.e., PCF = 1..10b).
PCX is loaded from PCF at T2 of the first mi executed. NextInst is illegal
in the mi after PCF←.

@J2 or @JB cannot be used instead of J2 or JB because, after page faults,
PCX is wrong for continuation since it is smashed at an opcode starting
instruction.
%
@CATCH:
SkipData, CallX[P6Tail], Opcode[200];
@J2:
SkipData, CallX[P6Tail], Opcode[201];*Timing: 6+(18*(2/8)) = 10.5
@J3:
SkipData, CallX[J2], Opcode[202];*Timing: 8+(18*(3/8)) = 14.75
@J4:
SkipData, CallX[J3], Opcode[203];*Timing: 10+(18*(4/8)) = 19.00

ShortJumpFix:
T ← (PCFReg) - T;
RTemp ← T, GoTo[P6PCFTail,ALU<0];
PFetch4[PCB,IBuf,4];
PCB ← (PCB) + (4C);
P6PCFTail:
PCF ← RTemp;
Nop;*NextInst illegal in mi after PCF←; no tasking here
P6Tail:
LU ← NextInst[IBuf];
P6Tailx:
PCB ← (PCB) and not (3C), NIRet;

%In these opcodes, PCF is in the range 1..10b, so J5, for example, must
add 5-1 to PCF to get the target; this result is in the range 5..14b.
For @J5 to @J7, the target can wind up in either the quadword already in
IBuf or the quadword after that. Since (X-10b) mod 10b = X mod 10b, the
ShortJumpFix subroutine computes PCF+disp-1-10b = PCF-(11b-disp), which is
.ls. 0 if the target is in the current quadword. Timing for these is 15
cycles if in the same quadword or 24 cycles if in the next quadword.
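
The quadword test and the average timings quoted for @J5 to @J7 can be
reproduced with a short Python model. This is a sketch under the assumption
that every PCF value 1..10b is equally likely; the function names are
hypothetical and the 15- and 24-cycle figures are the ones given above.

```python
def short_jump(pcf, disp):
    """Model of ShortJumpFix: pcf is 1..8 (1..10b), disp is 5, 6, or 7."""
    t = 9 - disp                  # 11b - disp, the constant loaded into T
    r = pcf - t                   # PCF + disp - 1 - 10b
    same_quadword = r < 0         # negative: target is already in IBuf
    new_pcf = r & 7               # low three bits: byte within the quadword
    return same_quadword, new_pcf


def average_timing(disp, same_cycles=15, next_cycles=24):
    """Mean over all eight starting values of PCF (assumed equally likely)."""
    total = 0
    for pcf in range(1, 9):
        same, _ = short_jump(pcf, disp)
        total += same_cycles if same else next_cycles
    return total / 8


print(average_timing(5), average_timing(6), average_timing(7))
# 20.625 21.75 22.875
```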
%
*Avg. timing = 20.625 cycles
@J5:
T ← 4C, GoTo[ShortJumpFix], Opcode[204];
*Avg. timing = 21.75 cycles
@J6:
T ← 3C, GoTo[ShortJumpFix], Opcode[205];
*Avg. timing = 22.875 cycles
@J7:
T ← 2C, GoTo[ShortJumpFix], Opcode[206];

%PFetch4[PCB,...] should be ok in 1st mi of opcode because, even if PCB
was loaded in the previous mi at P6Tailx, the value of PCB being loaded
and the old value are equally good for the reference. But this caused
problems somehow??
Timing = 20 cycles
%
@J8:
PCB ← (PCB) + (4C), Opcode[207];*Timing = 18 cycles
PFetch4[PCB,IBuf,0];
T ← (PCFReg) - 1;
RTemp ← T, GoTo[P6PCFTail];

J2:
SkipData, CallX[P6Tail];*Odd
Ejmp:
T ← NextData[IBuf], CallX[JBr];*Even

JB:
T ← NextData[IBuf], CallX[JBr];*Odd
Enojmp:
SkipData, CallX[P6Tail];*Even

%Jump Byte: alpha is a signed displacement from the opcode.
AllOnes is used as a temporary, restored when done. Note that
RH[PCBhi] = LH[PCBhi] since code can’t cross 64k boundary.
Timing: 28.25 (+3 if negative displacement) cycles.
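
The address arithmetic behind a byte jump can be modelled in Python as
follows. This is only a sketch of the intended result, not of the register
sequence in JBr and JBy; the function name is hypothetical, and it assumes
PCB is a quadword-aligned word address and that PCF (1..10b) points at the
byte after alpha.

```python
def jb_target(pcb_words, pcf, alpha):
    """Return (new PCB, new PCF) for a jump with raw 8-bit operand alpha."""
    disp = alpha - 256 if alpha & 0x80 else alpha   # sign-extend alpha
    opcode_byte = 2 * pcb_words + pcf - 2           # opcode is 2 bytes back
    target = opcode_byte + disp                     # byte address of target
    new_pcb = (target >> 1) & ~3                    # quadword-aligned words
    new_pcf = target & 7                            # byte within the quadword
    return new_pcb, new_pcf


# Forward by 20b bytes, then backward by 2 bytes, from the same opcode.
print(jb_target(0o1000, 3, 0o20))    # (520, 1)
print(jb_target(0o1000, 3, 0o376))   # (508, 7)
```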
%
@JB:
T ← NextData[IBuf], Opcode[210];
*JBr timing: 24 cycles (+3 if displacement negative).
JBr:
T ← (PCFReg) + T + 1, Skip[H2Bit8’];
PCB ← (PCB) - (200C);*Offset 400b bytes if neg. displacement
*Jump here on @JW, @JIB, and @JIW.
*AllOnes ← alpha+PCF-2 for @JB or (alpha..beta)+PCF-3 for @JW, where -2
*or -3 displaces alpha to the 1st opcode byte.
JBy:
AllOnes ← (Form-3[AllOnes]) + T;
T ← RSh[AllOnes,1], Skip[R>=0];
T ← (LSh[R400,7]) or T;
PFetch4[PCB,IBuf];
PCF ← AllOnes;
PCB ← (PCB) + T;
AllOnes ← (Zero) - 1, GoTo[P6Tail];

%Jump Word. alpha..beta is a 2’s complement displacement relative to
the first opcode byte. Because PCF is read before the final NextData,
the code must allow for the possibility that PCF was 10b when read, but
will be 1b after the final NextData (and PCB will have been advanced).
To do this, the high bit of PCF is cleared on read-out.
Timing: 30.5 cycles (+3 if displacement negative).
%
@JW:
LU ← CycleControl ← NextData[IBuf], Opcode[211]; *get alpha
T ← (Cycle&PCXF) and not (370C);
T ← (NextData[IBuf]) + T + 1, CallX[JBy];

PairComp:
T ← Stack&-1, UseCTask;
LU ← (LdF[Cycle&PCXF,0,4]) xor T, Return;

*Jump Equal Pair.
*Jump if the 1st nibble in alpha .eq. TOS with displacement equal to the 2nd
*nibble in alpha + 4 = -2 + (4+PCF+pair right); pop stack once.
*Timing: 17.5 cycles (no jump), 12.25 cycles + JBr (jumps).
@JEP:
LU ← CycleControl ← CNextData[IBuf], Call[PairComp], Opcode[212];
T ← 4C, Skip[ALU=0];
LU ← NextInst[IBuf], CallX[P6Tailx];
T ← (LdF[Cycle&PCXF,4,4]) + T, GoTo[JBr];

stkdif:
LU ← (Stack&-1) - T, Return;

*Jump Equal Byte.
*Jump with displacement alpha if TOS .eq. 2OS; pop stack twice.
*Timing: 10.25 + JBr (jumps), 17.5 cycles (no jump)
@JEB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[213];
JEQBx:
DblGoTo[Ejmp,J2,ALU=0];

*Jump Equal Byte Byte.
*Jump with displacement beta if TOS .eq. alpha; pop stack once.
*Timing: 12.5 cycles + JBr (jumps), 19.75 cycles (no jump).
@JEBB:
T ← NextData[IBuf], Opcode[214];
LU ← (Stack&-1) xor T;
Skip[ALU=0];
SkipData, CallX[P6Tail];
T ← (NextData[IBuf]) - 1, CallX[JBr];

*Jump Not Equal Pair.
*Jump if the 1st nibble in alpha .ne. TOS with displacement equal to the 2nd
*nibble in alpha + 4 = -2 + (4+PCF+pair right); pop stack once.
*Timing: 16.5 cycles (no jump), 12.25 cycles + JBr (jumps).
@JNEP:
LU ← CycleControl ← CNextData[IBuf], Call[PairComp], Opcode[215];
T ← 4C, Skip[ALU#0];
LU ← NextInst[IBuf], CallX[P6Tailx];
T ← (LdF[Cycle&PCXF,4,4]) + T, GoTo[JBr];

*Jump Not Equal Byte.
*Jump with displacement alpha if TOS .ne. 2OS; pop stack twice.
*Timing: 16.5 cycles (no jump), 11.25 cycles + JBr (jumps).
@JNEB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[216];
JNEBx:
DblGoTo[JB,Enojmp,ALU#0];

*Jump Not Equal Byte Byte.
*Jump with displacement beta if alpha .ne. TOS; pop stack once.
*Timing: 18.75 cycles (no jump), 13.5 cycles + JBr (jumps).
@JNEBB:
T ← NextData[IBuf], Opcode[217];
LU ← (Stack&-1) xor T;
Skip[ALU#0];
SkipData, CallX[P6Tail];
T ← (NextData[IBuf]) - 1, CallX[JBr];

JLBpos:
DblGoTo[J2,Ejmp,Ovf’];*Even
JLBneg:
DblGoTo[JB,Enojmp,Ovf’];*Odd

*Jump Less Byte.
*Jump with displacement alpha if integer 2OS .ls. TOS; pop stack twice.
*Timing: 18.5 or 20.5 cycles (no jump), 13.25 cycles + JBr (jumps).
@JLB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[220];
JLBx:
FreezeResult, DblGoTo[JLBpos,JLBneg,ALU>=0];

JGEBpos:
DblGoTo[JB,Enojmp,Ovf’];*Even
JGEBneg:
DblGoTo[J2,Ejmp,Ovf’];*Odd

*Jump Greater Equal Byte.
*Jump with displacement alpha if integer 2OS .ge. TOS; pop stack twice.
*Timing: 19.5 cycles (no jump), 12.25 or 13.25 cycles + JBr (jumps).
@JGEB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[221];
JGEBx:
FreezeResult, DblGoTo[JGEBpos,JGEBneg,ALU>=0];

stksw:
T ← Stack&+1, Return;

*Jump Greater Byte.
*Jump with displacement alpha if integer 2OS .gr. TOS; pop stack twice.
*Timing: 20.5 or 22.5 cycles (no jump), 15.25 cycles + JBr (jumps).
@JGB:
Stack&-1, UseCTask, Call[stksw], Opcode[222];
LU ← (Stack&-2) - T, GoTo[JLBx];

*Jump Less Equal Byte.
*Jump with displacement alpha if integer 2OS .le. TOS; pop stack twice.
*Timing: 21.5 cycles (no jump), 14.25 or 15.25 cycles + JBr (jumps).
@JLEB:
Stack&-1, UseCTask, Call[stksw], Opcode[223];
LU ← (Stack&-2) - T, GoTo[JGEBx];

*Jump Unsigned Less Byte.
*Jump with displacement alpha if cardinal 2OS .ls. TOS; pop stack twice.
*Timing: 10.25 cycles + JBr (jumps), 17.5 cycles (no jump).
@JULB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[224];
JULBx:
DblGoTo[J2,Ejmp,Carry];

*Jump Unsigned Greater Equal Byte.
*Jump with displacement alpha if cardinal 2OS .ge. TOS; pop stack twice.
*Timing: 11.25 cycles + JBr (jumps), 16.5 cycles (no jump).
@JUGEB:
T ← Stack&-1, UseCTask, Call[stkdif], Opcode[225];
JUGEBx:
DblGoTo[JB,Enojmp,Carry];

*Jump Unsigned Greater Byte.
*Jump with displacement alpha if cardinal 2OS .gr. TOS; pop stack twice.
*Timing: 12.25 cycles + JBr (jumps), 19.5 cycles (no jump).
@JUGB:
Stack&-1, UseCTask, Call[stksw], Opcode[226];
LU ← (Stack&-2) - T, GoTo[JULBx];

*Jump Unsigned Less Equal Byte.
*Jump with displacement alpha if cardinal 2OS .le. TOS; pop stack twice.
*Timing: 13.25 cycles + JBr (jumps), 18.5 cycles (no jump).
@JULEB:
Stack&-1, UseCTask, Call[stksw], Opcode[227];
LU ← (Stack&-2) - T, GoTo[JUGEBx];

*Jump Zero 3.
*Jump with displacement 3 if TOS .eq. 0; pop stack once.
*Timing: 16.5 cycles (jumps), 11.25 cycles (no jump).
@JZ3:
LU ← Stack&-1, Opcode[230];
Skip[ALU=0];
LU ← NextInst[IBuf], CallX[P6Tailx];
J3:
SkipData, CallX[J2];

*Jump Zero 4.
*Jump with displacement 4 if TOS .eq. 0; pop stack once.
*Timing: 20.75 cycles (jumps), 11.25 cycles (no jump).
@JZ4:
LU ← Stack&-1, Opcode[231];
Skip[ALU=0];
LU ← NextInst[IBuf], CallX[P6Tailx];
SkipData, CallX[J3];

*Jump Zero Byte.
*Jump with displacement alpha if TOS .eq. 0; pop stack once.
*Timing: 8.25 + JBr (jumps), 15.5 cycles (no jump)
@JZB:
LU ← Stack&-1, GoTo[JEQBx], Opcode[232];

*Jump Not Zero 3.
*Jump with displacement 3 if TOS .ne. 0; pop stack once.
*Timing: 17.5 cycles (jumps), 10.25 cycles (no jump).
@JNZ3:
LU ← Stack&-1, Opcode[233];
Skip[ALU#0];
LU ← NextInst[IBuf], CallX[P6Tailx];
SkipData, CallX[J2];

*Jump Not Zero 4.
*Jump with displacement 4 if TOS .ne. 0; pop stack once.
*Timing: 21.75 cycles (jumps), 10.25 cycles (no jump).
@JNZ4:
LU ← Stack&-1, Opcode[234];
Skip[ALU#0];
LU ← NextInst[IBuf], CallX[P6Tailx];
SkipData, CallX[J3];

*Jump Not Zero Byte.
*Jump with displacement alpha if TOS .ne. 0; pop stack once.
*Timing: 14.5 cycles (no jump), 9.25 cycles + JBr (jumps).
@JNZB:
LU ← Stack&-1, GoTo[JNEBx], Opcode[235];


StkDDif:
LU ← (Stack&+1) xor T, Return;

*Jump Double Equal Byte. Jump if the doublewords at TOS,,2OS and
*3OS,,4OS are equal.
*Timing = 4.5+13 cycles (no jump); x
@JDEB:
T ← Stack&-2, UseCTask, Call[StkDDif], Opcode[236];
T ← Stack&-2, Skip[ALU=0];
SkipData, Stack&-1, CallX[P6Tail];
LU ← (Stack&-1) xor T, GoTo[JEQBx];

*Jump Double Not Equal Byte. Jump if the doublewords at TOS,,2OS and
*3OS,,4OS are not equal.
@JDNEB:
T ← Stack&-2, UseCTask, Call[StkDDif], Opcode[237];
T ← Stack&-2, Skip[ALU#0];
SkipData, Stack&-1, CallX[P6Tail];
LU ← (Stack&-1) xor T, GoTo[JNEBx];

CODEToRTemp:
PFetch1[CODE,RTemp];
T ← PCFReg, Return;

P6PopComp:
T ← Stack&-1, UseCTask;
LU ← (Stack) - T, Return;

*Jump Indexed Byte.
*Alpha,,beta is a CODE-relative pointer to an array of bytes; TOS is a byte
*index to the array (even bytes in bits 0:7 of a word, odd bytes in bits
*8:15); jump with unsigned displacement fetched from the byte array.
*Timing: 21.75 cycles (no jump), 32.5 cycles + JBy (jumps).
@JIB:
LU ← CycleControl ← CNextData[IBuf], Call[P6PopComp], Opcode[240];
T ← RSh[Stack,1], Skip[Carry’];
SkipData, CallX[P6Pop];*Exit to next opcode
T ← (NextData[IBuf]) + T;*Add beta
T ← (LHMask[Cycle&PCXF]) + T, Call[CODEToRTemp];
Stack&-1, Skip[R Odd];
T ← (LdF[RTemp,0,10]) + T, GoTo[JBy];
T ← (RHMask[RTemp]) + T, GoTo[JBy];

*Jump Indexed Word.
*Alpha,,beta is a CODE-relative pointer to an array of words indexed by TOS;
*carry out a PC-relative jump using the signed displacement from the array.
*Timing: 21.75 cycles (no jump), 32.5 cycles + JBy (jumps).
@JIW:
LU ← CycleControl ← CNextData[IBuf], Call[P6PopComp], Opcode[241];
T ← Stack&-1, Skip[Carry’];
SkipData, CallX[P6Tail];*Flush beta and exit
T ← (NextData[IBuf]) + T;*add beta
T ← (LHMask[Cycle&PCXF]) + T, Call[CODEToRTemp];
T ← (RTemp) + T, GoTo[JBy];

*Recover. Timing: 6.25 cycles.
@REC:
LU ← NextInst[IBuf], Opcode[242];
P6PushTailx:
Stack&+1, NIRet;

*Recover Two. Timing: 6.25 cycles.
@REC2:
LU ← NextInst[IBuf], Opcode[243];
Stack&+2, NIRet;

*Discard. Timing: 6.25 cycles.
@DIS:
LU ← NextInst[IBuf], Opcode[244];
P6PopTailx:
Stack&-1, NIRet;

*Discard Two. Timing: 6.25 cycles.
@DIS2:
LU ← NextInst[IBuf], Opcode[245];
Stack&-2, NIRet;

*Exchange. Timing: 12.25 cycles.
@EXCH:
T ← Stack&-1, Opcode[246];
*The preceding Stack&-1 has interlocked any PFetch to the stack.
*Stack&+1 ← here interlocks a PStore1 at Stack properly.
Stack&+1 ← Stack&+1, LoadPage[moPage];
Stack&-1 ← T, GoToP[moPsh];

*Double Exchange. Timing: 20.25 cycles.
*See interlocking comments for EXCH.
@DEXCH:
T ← Stack&-2, Opcode[247];*Exch(StkP-2,StkP)
Stack&+2 ← Stack&+2;
Stack&-2 ← T;
Stack&-1;
T ← Stack&+2;*Exch(StkP-3,StkP-1)
Stack&-2 ← Stack&-2, LoadPage[moPage];
Stack&+2 ← T, GoToP[moPsh];

*Duplicate. Timing: 8.25 cycles.
*Stack&+1 ← Stack&+1 won’t interlock odd PFetch2.
@DUP:
T ← Stack&-1, Opcode[250];
SkipPushT:
*Here from @LP in MesaOP3 and MesaESC.
LU ← NextInst[IBuf];
SkipPushTx:
Stack&+2 ← T, NIRet;

*Double Duplicate. Timing: 10.25 cycles.
@DDUP:
T ← Stack&-1, Opcode[251];
Stack&+2 ← Stack&+2, GoTo[P6PushT];

%Exchange Discard. Timing: 8.25 cycles.
NextInst[IBuf];
Stack&-1 ← Stack&-1, NIRet;
might work but Stack&-1 ← Stack&-1 does not interlock an
unaligned PStore2 at StkP-3/StkP-2. A dangerous opcode sequence
would be some unaligned PStore2 opcode (followed by NextInst, NIRet, and
1 aborted mi = 6 cycles + 2 for transport), followed by REC2 (4 cycles),
followed by EXDIS. In this case the Stack&-1 ← would take place 1 cycle
before the PStore2 finished. Since the final cycle of the PStore2
will not have transport (because to be unaligned the referenced words
must be 1 and 2 within the quadword), the faster code for EXDIS given above
probably works. I am unsure what error-correction would do to this sequence.
%
@EXDIS:
T ← Stack&-2, GoTo[P6PushT], Opcode[252];

ACDx:
Stack&-1 ← Stack&-1, NIRet;

*Negate. Timing: 10.25 cycles.
*... Stack&+1 ← (Zero) - T, NIRet; gets stack underflow.
@NEG:
T ← Stack&-1, Opcode[253];
T ← (Zero) - T, GoTo[P6PushT];

*Increment. Timing: 8.25 cycles.
@INC:
T ← (Stack&-1) + 1, GoTo[P6PushT], Opcode[254];

*Decrement. Timing: 8.25 cycles.
@DEC:
T ← (Stack&-1) - 1, GoTo[P6PushT], Opcode[255];

P6PS2Safety:
*This mi wasted for PStore2 safety (?ugh?)
Stack&+1, LU ← T, Return;

*Double Increment. Timing: 14.25 cycles.
@DINC:
T ← Stack&-2, Call[P6PS2Safety], Opcode[256];
Stack ← (Stack) + 1;
T ← (RZero) + T, UseCOutAsCIn, GoTo[P6PushT];

*Double. Timing: 8.25 cycles.
@DBL:
T ← LSh[Stack&-1,1], Opcode[257];
P6PushT:
LU ← NextInst[IBuf];
Stack&+1 ← T, NIRet;

*Double Double. Timing: 14.25 cycles.
@DDBL:
T ← LSh[Stack&-2,1], Call[P6PS2Safety], Opcode[260];
T ← (RSh[Stack,17]) + T;
Stack ← LSh[Stack,1], GoTo[P6PushT];

*Triple. Timing: 10.25 cycles.
@TRPL:
T ← LSh[Stack&-1,1], Opcode[261];
Stack&+1, GoTo[Addx];

*And. Timing: 8.25 cycles.
@AND:
T ← Stack&-1, Opcode[262];
LU ← NextInst[IBuf];
Stack ← (Stack) and T, NIRet;

*Ior. Timing: 8.25 cycles.
@IOR:
T ← Stack&-1, Opcode[263];
LU ← NextInst[IBuf];
Stack ← (Stack) or T, NIRet;

*Add Signed Byte. Timing: 12.5 cycles pos., 15.5 cycles neg.
@ADDSB:
T ← NextData[IBuf], Opcode[264];
*Here an immediately preceding unaligned PFetch2 to the stack has had
*at least the following sequence: PFetch2[...,Stack]; NextInst; NIRet;
*1 aborted mi; so when it gets to Addx, enough time will have elapsed
*for the PFetch2 to have completed, so interlocking it is not a problem.
LU ← T, GoTo[Addx,H2Bit8’];
T ← (LHMask[AllOnes]) + T, GoTo[Addx];

*Add. Timing: 8.25 cycles.
@ADD:
T ← Stack&-1, Opcode[265];
Addx:
LU ← NextInst[IBuf];
Stack ← (Stack) + T, NIRet;

*Subtract. Timing: 8.25 cycles.
@SUB:
T ← Stack&-1, Opcode[266];
Subx:
LU ← NextInst[IBuf];
Stack ← (Stack) - T, NIRet;

GetTDecStk2:
T ← Stack&-2, Return; *grab it, point to lsb of second doubleword

*Double Add. Timing: 16.25 or 17.25 cycles.
@DADD:
MNBR ← Stack&-1, Call[GetTDecStk2], Opcode[267]; *point to lsb of top doubleword
Stack ← (Stack) + T;*add low bits
Stack&+1, Skip[Carry];
T ← MNBR, GoTo[Addx];*pick up high bits of top doubleword
T ← (MNBR) + 1, GoTo[Addx];*pick up high bits of top doubleword

*Double Subtract. Timing: 16.25 or 17.25 cycles.
@DSUB:
MNBR ← Stack&-1, Call[GetTDecStk2], Opcode[270]; *point to lsb of top doubleword
Stack ← (Stack) - T; * subtract low bits
Stack&+1, Skip[Carry’]; *point to msb of second doubleword
T ← MNBR, GoTo[Subx]; *remember msb of top doubleword (TOS)
T ← (MNBR) + 1, GoTo[Subx];

*Add Double to Cardinal. Timing: 16.25 cycles.
*Cardinal at TOS, Double at 2OS,,3OS.
@ADC:
T ← Stack&-3, Call[P6PS2Safety], Opcode[271];
Stack ← (Stack) + T;
Stack&+1, FreezeResult;
Stack ← (Stack) + 1, UseCOutAsCIn, GoTo[P6Tail];

*Add Cardinal to Double. Timing: 14.25 or 15.25 cycles.
*Double at TOS,,2OS, Cardinal at 3OS.
@ACD:
LU ← Stack&-2, Opcode[272];
T ← Stack&+1;
Stack&-1 ← (Stack&-1) + T;
Stack&+2, Skip[Carry];
LU ← NextInst[IBuf], CallX[ACDx];
LU ← NextInst[IBuf];
Stack&-1 ← (Stack&-1) + 1, NIRet;

*Add Local 0 to Immediate Byte.
:IF[CacheLocals]; ************************************
*Timing: 12.5 cycles.
@AL0IB:
T ← NextData[IBuf], Opcode[273];
T ← (LocalCache0) + T, GoTo[P6PushT];
:ELSE; ***********************************************
*Timing: 20.5 cycles.
@AL0IB:
PFetch1[LOCAL,Stack,0], Opcode[273];
T ← NextData[IBuf], CallX[Addx];
:ENDIF; **********************************************

%Multiply--high half of 32-bit product is left above the top of the Stack
multiplier in RTemp (from the argument at TOS)
multiplicand in T (from the argument at TOS-1)
product low in Stack, hi in RTemp1

The first loop flushes leading 0’s in the multiplier with timing 2 cycles/0;
The second loop processes 0’s in 6 cycles and 1’s in 10 or 11 cycles.
Note how a low-order 1 in the multiplier serves as an end flag.
Timing: 18.25 cycles if multiplier is 0, else
58.25 cycles if multiplier is 1, else
(16.25 to 19.25) + 2*LZ + (16-LZ)*6 + (4 or 5)*(NOnes) cycles =
(48.25 to 51.25) + (16-LZ)*4 + (4 or 5)*NOnes cycles.

NOTE: For random numbers, this algorithm averages about 23 cycles faster than
the one on the next page. However, when the multiplier has many leading +
trailing zeroes, it is worse than the other. For products less than 16d bits,
this algorithm is 28 cycles slower for a multiplier with a single 1 bit but
gains 2 cycles for each additional 1 bit in the multiplier. For products
greater than 16d bits, it is 23 cycles slower for a multiplier with a single
1 bit and gains 5 cycles for each additional 1 bit in the multiplier after the
product has exceeded 16 bits. Since 87 percent of all multiplies are
preceded by small constant pushes, the other algorithm probably averages
faster than this one, but this one is 6b mi smaller, so we use it.
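
The two forms of the timing expression above can be checked against each
other in Python. The function names are hypothetical, and pairing the lower
base figure with 4 cycles per 1 bit (and the upper with 5) is an assumption
made only to obtain bounds.

```python
def mul_timing_bounds(leading_zeros, n_ones):
    """(low, high) cycles from the first form of the expression."""
    lz = leading_zeros
    low = 16.25 + 2 * lz + (16 - lz) * 6 + 4 * n_ones
    high = 19.25 + 2 * lz + (16 - lz) * 6 + 5 * n_ones
    return low, high


def mul_timing_bounds_simplified(leading_zeros, n_ones):
    """(low, high) cycles from the second, simplified form."""
    lz = leading_zeros
    low = 48.25 + (16 - lz) * 4 + 4 * n_ones
    high = 51.25 + (16 - lz) * 4 + 5 * n_ones
    return low, high


# The two forms agree because 2*LZ + (16-LZ)*6 = 32 + (16-LZ)*4.
print(mul_timing_bounds(8, 3) == mul_timing_bounds_simplified(8, 3))  # True
```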
%
@MUL:
RTemp1 ← T ← 30C, Call[MulSU], Opcode[274];
*2nd loop shifts the product RTemp1/Stack left 1 and conditionally adds the
*multiplicand T based upon sign of the multiplier RTemp, which is left-shifted
*until the right-most 1 bit is seen.
RTemp ← (RTemp) SALUFOP T, GoTo[Mul1,R<0];
Mul0:
Stack ← (Stack) SALUFOP T;
RTemp1 ← (RTemp1) SALUFOP T, UseCOutAsCIn, Return;
Mul1:
RTemp1 ← (RTemp1) SALUFOP T, UseCOutAsCIn, GoTo[MulLast,ALU=0];
Stack ← (LSh[Stack,1]) + T, Skip[R<0];
RTemp1 ← (RTemp1) - 1, UseCOutAsCIn, Return;
RTemp1 ← (RTemp1) + 1, UseCOutAsCIn, Return;

*Force the low bit of multiplier RTemp to 1 for the end test.
*Initialize the high product word (RTemp1) to 0; low product word (TOS-1)
*already contains the multiplicand, so we don’t zero it and add the
*multiplicand on the 1st multiplier 1 (but we have to test multiplier for 0).
MulSU:
RTemp1 ← (RTemp1) - (SALUF ← T);*SALUF = 30b is LU ← 2A
T ← (Stack&-1) SALUFOP T, Skip[R>=0];*Multiplier*2 from TOS
RTemp ← (Zero) + T + 1, GoTo[MulSUX];*Multiplier .ls. 0
RTemp ← (Zero) + T + 1, Skip[ALU#0];*Multiplier .ge. 0
T ← Stack ← 0C, GoTo[mdPush];*Multiplier .eq. 0
*One mi loop shifts off leading 0’s in multiplier.
RTemp ← (RTemp) SALUFOP T, Skip[R<0];
RTemp ← (RTemp) SALUFOP T, GoTo[.,R>=0];
MulSUX:
T ← Stack, Return;*Multiplicand from TOS-1

MulLast:
T ← RSh[RTemp1,1], Skip[Carry’];
T ← (LSh[AllOnes,17]) or T;
mdPush:
Stack&+1 ← T;*Even
P6Pop:
LU ← NextInst[IBuf], CallX[P6PopTailx];

%Multiply--high half of 32-bit product is left above the top of the Stack
product low in Stack, hi in RTemp1
multiplicand low in xBuf, hi in xBuf1
multiplier in RTemp
The first loop runs until the multiplicand being left-shifted 1 each step
overflows into the high word. It has timing of 4 cycles on 0’s, 10 on 1’s;
this loop doesn’t task on 0’s (worst case without tasking ~ 70 cycles on a
multiplier of 100000b and multiplicand of 1). Note that the end test need be
made only when processing a multiplier 1. The second loop runs until the
last multiplier 1 is processed with timing of 6 cycles on 0’s, 14 on 1’s.

Timing: 20.25 cycles if the multiplier is 0, else
30.25 cycles if the multiplier is 1, else
20.25 + (1 if product .ge. 2↑16) +
+ (4/multiplier 0) + (10/1) cycles while product .ls. 2↑16
+ (6/multiplier 0) + (14/1) cycles while product .ge. 2↑16
where the zeroes are those between the leftmost and rightmost ones in the
multiplier.

PopToT: T ← Stack&-1, FreezeResult, Return;

@MUL:
RTemp1 ← T ← 30C, Opcode[274];*SALUF = 30b is LU ← 2A
RTemp1 ← (RTemp1) - (SALUF ← T), Call[PopToT];*RTemp1 ← 0
RTemp ← T, UseCTask, Call[PopToT];
Stack&+1 ← 0C, Skip[ALU#0];*tests RTemp ← T
T ← Stack&+1 ← 0C, GoTo[P6Pop];
xBuf ← T, Call[.+1];
*1st loop
RTemp ← RSh[RTemp,1], GoTo[MulZ,R Even];
MulO:
Stack ← (Stack) + T, Skip[ALU#0];
T ← (RTemp1) + 1, UseCOutAsCIn, GoTo[mdPush];
T ← xBuf ← (xBuf) SALUFOP T, FreezeResult, Skip[R<0];
RTemp1 ← (RTemp1) + 1, UseCOutAsCIn, Return;
RTemp1 ← (RTemp1) + 1, UseCOutAsCIn, GoTo[MulL];

MulZ:
T ← xBuf ← (xBuf) SALUFOP T, Skip[R<0];
*Must replicate the mi at MulO-1 because the opcode dispatch locations are
*only four apart on this page.
RTemp ← RSh[RTemp,1], DblGoTo[MulO,MulZ,R Odd];
MulL:
xBuf1 ← 1C, Call[.+1];
*2nd loop
RTemp ← RSh[RTemp,1], GoTo[MulLZ,R Even];
MulLO:
Stack ← (Stack) + T, GoTo[.+3,ALU#0];
T ← xBuf1, FreezeResult;
T ← RTemp1 ← (RTemp1) + T + 1, UseCOutAsCIn, GoTo[mdPush];
T ← xBuf1, FreezeResult;
RTemp1 ← (RTemp1) + T + 1, UseCOutAsCIn;
MulLZ:
T ← xBuf ← (xBuf) SALUFOP T;*Double the multiplicand
xBuf1 ← (xBuf1) SALUFOP T, UseCOutAsCIn, Return;
%

*Double Compare (signed).
*If 3OS,,4OS < TOS,,2OS, push -1; timing 17.25 cycles.
*If 3OS,,4OS = TOS,,2OS, push 0; timing 16.25 cycles.
*If 3OS,,4OS > TOS,,2OS, push 1; timing 18.25 cycles.
*Add 4 cycles if high order words are equal and low order words unequal.
@DCMP:
T ← (Stack&-2) + (100000C), Opcode[275];
Stack ← (Stack) + (100000C), GoTo[DCMPy];
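
%@DCMP adds 100000b to both high-order words and then joins the unsigned
comparison at DCMPy. The Python sketch below illustrates why that bias lets
an unsigned comparison give the signed ordering of 16-bit words; the
function names are hypothetical and the model covers one word, not the full
doubleword compare.

```python
BIAS = 0o100000                  # 100000b, the constant added in @DCMP


def as_signed(word):
    """Interpret a 16-bit word as a two's complement integer."""
    return word - 0x10000 if word & 0x8000 else word


def biased(word):
    """Add the bias modulo 2^16, which flips the sign bit."""
    return (word + BIAS) & 0xFFFF


def signed_less_via_bias(a, b):
    """Unsigned comparison of the biased words."""
    return biased(a) < biased(b)


print(signed_less_via_bias(0o177777, 0))   # True: -1 is less than 0
```
%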

*Unsigned Double Compare
@UDCMP:
T ← Stack&-2, Opcode[276];
*Compare msb’s, point at lsb of high doubleword
*grab lsb of top doubleword, point at lsb of second doubleword
DCMPy:
LU ← (Stack&+1) - T;
T ← Stack&-2, FreezeResult, Skip[ALU=0];
Stack ← (Zero) + 1, DblGoTo[DUCompL,DUCompG,Carry’];
Stack ← (Stack) - T;*compare low words
FreezeResult, Skip[ALU#0];
LU ← NextInst[IBuf], Call[P6Tailx];
Stack ← (Zero) + 1, DblGoTo[DUCompL,DUCompG,Carry’];

DUCompL:
LU ← NextInst[IBuf];
Stack ← (Stack) or not (0C), NIRet;
DUCompG:
LU ← NextInst[IBuf], Call[P6Tailx];

P6Undef:
LoadPage[opPage0];
RTemp ← sOpcodeTrap, GoToP[SDTrap];

@OP277:
xfTrapParm0 ← 277C, GoTo[P6Undef], Opcode[277];

:END[MesaOP2];