:IF[WithFloatingPoint]; ********************************
TITLE[MesaFP];*Mesa single-precision floating point

%Ed Fiala 17 November 1983: Change fpPage1, fpPage2, fpPage5, and fpPage7
from 17b to 10b.
Ed Fiala 8 June 1982: Fix FSc bug and speed it up by 6 cycles; move 1 mi
from fpPage7 to fpPage0 in FSqRt to fit.
Ed Fiala 11 May 1982: Reverse args for FSc.
Ed Fiala 16 April 1982: change page 1 to page 3.
Ed Fiala 25 September 1981: Prefix names with "fp"; move fpSticky register
for Cedar; implement square root and floating scale; insert several more
page breaks.
Ed Fiala 15 June 1981: Improve speed about 60 percent; bum 7 registers;
expended 23b mi net. Produce 0 on underflow for Wilhelm (non-spec) as
option.
Ed Fiala 20 March 1981: Pilot conditionals; bum 2 registers; save 11b mi.
Original version by Jim Boyce.

Floating point numbers are represented in two different formats. Arguments
on the stack are IEEE single precision reals with the following format:
bit 0sign (1 means negative)
1:8exponent biased by 127d
(exponent of 177b with a fraction of 0 is the integer 1;
exponent of 0 with a fraction of 0 is true 0;
exponent of 0 with non-0 fraction is denormalized [i.e., the
fraction does not have an implicit "1." in front of it];
exponent of 377b with a fraction of 0 is infinity;
exponent of 377b with non-0 fraction is Not-A-Number)
9:31fraction
(implicitly preceded by ’1.’ if non-0 and normalized.)

These numbers appear on the stack with the word holding bits 0-15 nearer to
the top of stack. Most operations unpack the operands into a format that is
easier to work with. This format uses three words interpreted as follows:
ExpSignbits 1:8 are the exponent (biased by 127d), bit 15d the sign.
NOTE: FAdd, FSub, FMp, FDiv, etc. must and do compensate for
the addition of 1 to the exponent which occurs in the Renorm
subroutine.
FracThe fraction. When unpacked, the leading one appears in
Frac[1]; when about to be packed again, the leading bit is in
Frac[0], Frac[31d] is a StickyBit (the inclusive or of all
remaining bits); Fix and Round handle rounding slightly
differently.
FracH16 high order bits of Frac.
FracL16 low order bits of Frac.

Three registers for each argument hold an unpacked floating point number.
These registers have names with 1’s or 2’s in them. Numbers are often
referred to by index (SB or DB). Arg1 has index 2 and Arg2 has index 3.

The IEEE specification calls for four different rounding modes, two methods of
handling underflow/denormalized numbers (we have added a third), two kinds
of infinity. We are in the process of converting to a Real package that
records all mode settings in the fpSticky register accessible to microcode.
When this conversion is complete, microcode will optionally implement as much
of the package as it wants to, but trap when an opcode cannot be completed in
microcode. We fall short in that the fpSticky register is not saved in
the Mesa process state. In all other situations, microcode either completes
the operation or traps to software, as indicated below.


POSSIBLE IMPROVEMENTS:

1) Get the 1st level Misc dispatches all on one page to save 4 cycles here.
2) Put call to RoundFrac in FDiv exit saving 6 cycles.
3) Faster Renorm called from Float and subtraction.
4) Eliminate 3 cycles in zero-arg testing for all two-arg opcodes.
5) Use 31b max shift rather than 40b for Unnorm entry from FAdd/FSub; possibly
do the Unnorm entry differently for negative fpTemp.
6) Implement variant rounding modes, infinities, gradual denormalization,
infinity on overflow, not-a-number on divide by zero, mod on fix/round
overflow.
7) Get the fpSticky register and ? software bits in the process state.
8) Speed FSc by directly adding scaling factor to exponent field.

FPSTICKY REGISTER

0
0OR 1 into fpSticky[15d] on every inexact result
1Trap on any inexact result
1-2
0Trap on any denormalized result (user may be interested in
the loss of precision)
1Substitute 0 on underflow (non-IEEE; requested by Wilhelm)
2Gradually denormalize on underflow (not in microcode)
3--
3
0Projective infinity (only one unsigned infinity; compare of
anything with infinity traps; not sure what other operations
are supposed to do).
1Affine infinity (+ infinity and - infinity both defined;
arithmetic and comparisons work as expected).
4-5
0Round to nearest (unbiased; round to even if halfway)
1Round toward 0 (truncate)--not in microcode
2Round toward plus infinity--not in microcode
3Round toward minus infinity--not in microcode
6
0Trap if denormalized args are supplied
1Normalize the arguments and then use them (not in microcode)
7
0Trap on invalid operations (compare of projective infinity,
Not-a-number as an argument)
1Result is the infinity or not-a-number (not in microcode)
8
0Trap on overflow of Fix or Round operation
1Return low-order 16d bits of the result (not in microcode)
9
0Trap on divide-by-zero
1Stuff in not-a-number on divide-by-zero and continue
(not in microcode)
10d
0Trap on arithmetic overflow
1Stuff in infinity on arithmetic overflow and continue
(not in microcode)
11-14d
--undefined
15d
0All results have been exact
1One or more inexact results have occurred (i.e., rounding has
taken place).

Softare is expected to execute the FSticky opcode to exchange a new value on
the stack with the value currently in Sticky; then it executes floating point
opcodes in the new mode settings. The fpSticky register should be saved with
the Mesa process state. Also, the software state of the floating point
processor includes occurrence bits for all the events which have traps:
Inexact result occurred (temporarily kept in fpSticky),
Denormalized arguments were supplied;
Invalid operations occurred;
Fix/Round overflow occurred;
Divide by zero occurred;
Arithmetic overflow occurred;
Underflow occurred.
The microcode need not keep these occurrence bits because the software can
originally cause the events to trap, remember the event in storage, and
turn off the microcode trap enable.
**What about keeping the events in the process state?

In the current microcode implementation, any argument with 377b or 0b
exponent and any result which overflows will trap to software.


ROUNDING MODES

Rounding occurs at SB[fpFracL] bit 7 for arithmetic or bit 15d for Fix and
Round. The bit following the one rounded is the rounding bit, and all lower
bits are sticky bits. If sticky bits are all 0, the result is exact, and no
rounding occurs. Otherwise, if the InexactResult trap is enabled, trap.
Otherwise, round in the direction specified by the rounding mode.

"Round to nearest" goes in the obvious direction when the sticky bits are
.ne. 0; if rounding bit is 1 and sticky bits .eq. 0, the number is exactly
halfway between two possible results, so by convention round to the nearest
even number.
%

*NOTE that fpExpSign, fpFracH, and fpFracL are not used--these are displaced
*by SB[] or DB[] to select the 1/2 registers.
:IF[AltoMode]; ********************************************
*52, 44, 46, 47, 53, 64, 70, and xfXTPReg are available.
RV[fpExpSign,70];
RV[fpFracH,54];RV[fpFracL,64];
RV[fpExpSign1,72];
RV[fpFrac1H,56];RV[fpFrac1L,66];
RV[fpExpSign2,73];
RV[fpFrac2H,57];RV[fpFrac2L,67];

RV[fpSticky,75];
*See page 1 comment--NOT A TEMPORARY
RV[fpCode,45];
*Fix, Round
RV[fpCCTemp,51];
*Unnorm
RV[fpTemp,71];
*Unnorm, FAdd, FSub
RV[fpCount,51];
*FMul
RV[fpFrac2HLSh1,50];
*FMul
RV[fpFrac2HLSh1p1,45];
*FMul
RV[fpOne,71];
*FMul
RM[fpFrac1LL,IP[fpFrac2L]];
*FMul
RV[fpTFrac,50];
*FDiv
RV[fpFrac2H’,51];
*FDiv
RV[fpFrac2H’p1,71];
*FDiv
RV[fpFrac2Hp1,45];
*FDiv
RM[fpRhi,IP[fpFrac1L]];
*FSqRt
RM[fpRlo,IP[fpFrac1H]];
*FSqRt

*545b mi including two alpha-byte dispatch locations overwritten in MesaX.
Set[fpPage0,16];
*357b mi (not pages 4 to 7)
Set[fpPage1,15];
* 36b mi
Set[fpPage2,15];
* 40b mi
Set[fpPage3,17];
* 5b mi
Set[fpPage4,17];
* 12b mi
Set[fpPage5,16];
* 22b mi
Set[fpPage6,15];
* 2b mi
Set[fpPage7,15];
* 23b mi
:ELSE; ****************************************************
*17, 46, 47, 51, 53, and 70 are available temporaries.
RV[fpExpSign,70];
RV[fpFracH,54];RV[fpFracL,64];
RV[fpExpSign1,72];
RV[fpFrac1H,56];RV[fpFrac1L,66];
RV[fpExpSign2,73];
RV[fpFrac2H,57];RV[fpFrac2L,67];

RV[fpSticky,75];
*See page 1 comment--NOT A TEMPORARY
RV[fpCode,44];
*Fix, Round
RV[fpCCTemp,71];
*Unnorm
RV[fpTemp,45];
*Unnorm, FAdd, FSub
RV[fpCount,71];
*FMul
RV[fpFrac2HLSh1,50];
*FMul
RV[fpFrac2HLSh1p1,45];
*FMul
RV[fpOne,44];
*FMul
RM[fpFrac1LL,IP[fpFrac2L]];
*FMul
RV[fpTFrac,50];
*FDiv
RV[fpFrac2H’,71];
*FDiv
RV[fpFrac2H’p1,45];
*FDiv
RV[fpFrac2Hp1,44];
*FDiv
RM[fpRhi,IP[fpFrac1L]];
*FSqRt
RM[fpRlo,IP[fpFrac1H]];
*FSqRt

*545b mi placed primarily on page 3 with the rest on page 10.
Set[fpPage0,3];
*357b mi (not pages 4 to 7)
Set[fpPage1,10];
* 36b mi
Set[fpPage2,10];
* 40b mi
Set[fpPage3,3];
* 5b mi
Set[fpPage4,3];
* 12b mi
Set[fpPage5,10];
* 22b mi
Set[fpPage6,3];
* 2b mi
Set[fpPage7,10];
* 23b mi
:ENDIF; ***************************************************

Loca[UnnormDisp,fpPage0,40];
Loca[UnpackRet,fpPage0,60];
Loca[RoundDisp,fpPage0,100];
Loca[FixDisp,fpPage0,120];

%Coding conventions.

Misc bytecodes with alpha in [20b to 37b] are floating point operations.
The 1st mi below is the Misc dispatch table entry for alpha in this range.
For Alto Mesa, alpha is saved in xfOTPReg, where it is the trap parameter if
the FP operation traps; for Cedar RTemp holds alpha and 400b+alpha determines
the trap dispatch. T is loaded with 30b, stored in SALUF as the A + A + Cy0
ALU operation.

RTemp holds the Misc alpha byte at entry here; two arg opcodes dispatch
on the low four bits of RTemp not only initially but also out of Unpack2.
The various Fix/Round opcodes use fpCode to control dispatches as follows:
10:11FixDone dispatch (Long 0, Integer 1, Cardinal 2)
12:15Unpack dispatch (FAdd 0, FSub 1, FMul 2, FDiv 3, FComp 4,
Fix 6, Round 10, FSqRt 13); 2-arg opcodes dispatch
on RTemp[12:15d], 1-arg opcodes on fpCode[12:15d].

Timing from the beginning of the opcode through the dispatch table below is
16 cycles + buffer refill, where buffer refill is 4.5 cycles for Pilot mode,
6.5 cycles for Alto mode.
%

:IF[AltoMode]; ********************************************
*Overwrite the two MISC dispatch mi to setup the branch to this code.
Set[FlPntLoc,Add[MiscDisp0,17]];
xfOTPReg ← T, LoadPage[fpPage0], At[MiscDisp0,1];
Dispatch[RTemp,14,4], At[FlPntLoc];
:ELSE; ****************************************************
LoadPage[fpPage0], At[MiscDisp0,1];
Dispatch[RTemp,14,4];
:ENDIF; ***************************************************
OnPage[fpPage0];
*Approx. time in cycles is given in "()"
fpFrac2H ← T ← 30C, Disp[.+1];
fpFrac2H ← (fpFrac2H) + (SALUF ← T), GoTo[Unpack2], DispTable[20];
*FAdd (~156)
fpFrac2H ← (fpFrac2H) + (SALUF ← T), GoTo[Unpack2];*FSub (~158)
fpFrac2H ← (fpFrac2H) + (SALUF ← T), GoTo[Unpack2];*FMul (~389)
fpFrac2H ← (fpFrac2H) + (SALUF ← T), GoTo[Unpack2];*FDiv (~422)
fpFrac2H ← (fpFrac2H) + (SALUF ← T), GoTo[Unpack2];*FComp (~94)
fpCode ← 6C, GoTo[Unpack1];*Fix (~72)
T ← Stack&-1, SALUF ← T, GoTo[Float];*Float (~ 47+Renorm)
fpCode ← 26C, GoTo[Unpack1];*FixI (~76)
fpCode ← 46C, GoTo[Unpack1];*FixC (~70)
T ← Stack&-1, GoTo[FSticky];*FSticky (~28)
:IF[AltoMode]; ********************************************
LoadPageExternal[opPage3], GoTo[fpTrapExit];*FRem not implemented
:ELSE; ****************************************************
TrapParm ← 0C, GoTo[fpTrapExit];*FRem not implemented
:ENDIF; ***************************************************
fpCode ← 10C, GoTo[Unpack1];*Round (~85)
fpCode ← 30C, GoTo[Unpack1];*RoundI (~89)
fpCode ← 50C, GoTo[Unpack1];*RoundC (~83)
fpCode ← 13C, GoTo[Unpack1];*FSqRt (~482)
fpFrac2H ← (fpFrac2H) + (SALUF ← T);*FSc (~87.5)
T ← (Stack&-1) - 1, Call[fpSetSB];
fpFrac1H ← T, BBFBX, Call[Unpack];
T ← fpFrac2H ← (LSh[fpFrac2H,7]) or T, GoTo[FSc1];

%Unpack2 unpacks the first Real on the stack into fpExpSign2, fpFrac2H, and
fpFrac2L and the 2nd into fpExpSign1, fpFrac1H, and fpFrac1L and returns
SB pointing at arg1 and fpFrac1H in T for FAdd, FSub, FMul, and FDiv.

Unpack1 unpacks one real on the stack into fpExpSign2, fpFrac2H, and
fpFrac2L and returns SB and DB pointing at arg2 for Fix and Round.

Unpack also filters out numbers which the microcode can’t handle, traping on
unnormalized numbers, Infinities, and NotANumbers.

Timing = 29 cycles for Unpack1, 46 for 2 args (-3 per arg that is 0 and
+1 per arg that is negative).
%
OnPage[fpPage0];

*fpFrac2H contains 60b here, where the two leading ones (i.e., bits 10:11d)
*are loaded into the word-addressing part of SB and select word 3 of an
*addressed quadword.
Unpack2:
fpFrac2H ← (SB ← fpFrac2H) - 1, Task;
T ← LdF[Stack,11,7];
*BBFBX causes SBX←SB; the SB← here takes effect when BBFBX in the last mi of
*Unpack is executed.
SB ← fpFrac2H, fpFrac2H ← T, BBFBX, NoRegILockOK, Call[Unpack2A];
fpFrac2H ← (LSh[fpFrac2H,7]) or T, Call[Unpack];
Dispatch[RTemp,14,4];
T ← fpFrac1H ← (LSh[fpFrac1H,7]) or T, Disp[@FAdd];

fpSetSB:
SB ← fpFrac2H, Return;

Unpack1:
fpFrac2H ← (fpFrac2H) + (SALUF ← T), Call[fpSetSB];
DB ← fpFrac2H, BBFBX, Call[Unpack];*BBFBX will cause SBX←SB
Dispatch[fpCode,14,4];
T ← fpFrac2H ← (LSh[fpFrac2H,7]) or T, Disp[@Fix];

*Position exponent in bits 1:8 and sign in bit 15d because:
*(a) XOR’ing signs and adding or subtracting exponents in FMul and FDiv can
*be accomplished in one operation;
*(b) Convenient overflow/underflow test in sign bit of word; and
*(c) Convenient for packing result.
Unpack:
T ← LdF[Stack,11,7];
SB[fpFracH] ← T;
Unpack2A:
T ← (LdF[Stack&-1,1,17]) and not T, GoTo[UnpNeg,R<0];
SB[fpExpSign] ← T, GoTo[UnpNZ,ALU#0];
*Zero exponent may represent true 0 or denormalized number.
UnpZ:
T ← Stack&-1;
LU ← SB[fpFracH] or T;
SB[fpFracL] ← T, BBFBX, GoTo[DenormTrap,ALU#0];
CInexactOK:
Return;

UnpNeg:
SB[fpExpSign] ← (Zero) + T + 1, GoTo[UnpZ,ALU=0];
UnpNZ:
LU ← SB[fpExpSign] + (200C);
SB[fpFracH] ← SB[fpFracH] or (200C), GoTo[NaNTrap,ALU<0];
T ← LSh[Stack,7];
SB[fpFracL] ← T, BBFBX;
T ← RSh[Stack&-1,11], Return;

%Entries are PackA and Renorm; NormGo is entered by FAdd/FSub and FSc; Float
contains a copy of the mi at Renorm.

Renorm left-shifts the fraction SB[fpFracH/L] until SB[fpFracH][0] is 1 and
subtracts the number of positions shifted from the exponent. Then it rounds
SB[fpFracH],,SB[fpFracL] at SB[fpFracL][7] using SB[fpFracL][8] for rounding and
SB[fpFracL][9:15d] as sticky bits. Finally, it performs range checks, pushes
the number onto the stack, and exits.

NOTE: If an overflow occurs, the possible range of exponents is
377 to 376+376-200 or 377b to 574b (For FMul with 0 NormSteps).
For an underflow, the possible range is 0 or 777b down to 0-376b+177b-1; i.e.,
0 or 777b down to 600b (For FDiv).

Timing from Renorm to exit: 8 x nsteps + 30 + (if sticky bits .eq. 0 after
normalizing: 1 if odd or 3 if even) + (3 if fpFracL carries rounding up
(+5 if fpFracH also carries)).
The number of normalizing steps will be:
0 for FDiv or FSqRt
1 for FSc (starting at NormGo);
0 or 1 for FMul;
0 or 1 for FAdd/FSub if signs are the same;
1 to 24d for FAdd/FSub if signs differ (starting at NormGo);
0 to 15d steps for Float.
%

NormStep:
*Only Renorm and copy of Renorm in Float jump to NormStep.
SB[fpExpSign] ← SB[fpExpSign] - (200C);
LongNormGo:
*Odd placement; paired with mi at SubNonNeg+1.
NormGo:
SB[fpFracL] ← SB[fpFracL] SALUFOP T, Task;
SB[fpFracH] ← SB[fpFracH] SALUFOP T, UseCOutAsCIn;
*Even placement; paired with mi at MulDone+2.
Renorm:
T ← LSh[SB[fpFracH],10], GoTo[NormStep,R>=0];
Pack:
Dispatch[fpSticky,4,2];*Dispatch on rounding mode
LU ← SB[fpFracL] and (177C), Disp[.+1];
*Round to nearest: Test sticky bits; begin assuming sticky bits .ne. 0
SB[fpFracL] ← SB[fpFracL] + (200C), DblGoTo[RndEx,RndNr,ALU=0], DispTable[4];
T ← SStkP, GoTo[FloatTrp1];*Truncate (i.e., round toward 0)
T ← SStkP, GoTo[FloatTrp1];*Round toward plus infinity
T ← SStkP, GoTo[FloatTrp1];*Round toward minus infinity

*Assumption correct--continue rounding.
RndNr:
SB[fpFracH] ← SB[fpFracH] + 1, UseCOutAsCIn, DblGoTo[RndUp,Inexact,Carry];
*Assumption wrong--all sticky bits are 0--fixup result.
RndEx:
Dispatch[SB[fpFracL],7,3], Skip[Carry’];
*If carry occurred, then the number was exactly halfway between an odd
*number and the next higher even number, and it is now correctly rounded.
SB[fpFracH] ← SB[fpFracH] + 1, GoTo[RndUp];
Disp[.+1];
Inexact:
*Was halfway and odd--now fixed up.
fpSticky ← (fpSticky) or (1C), DblGoTo[InexactTrap,InexactNoTrap,R<0], At[RoundDisp,0];
*Was exact and even--adding 200b didn’t affect it.
LU ← SB[fpExpSign] + (400C), DblGoTo[OverUnderflow,Pack2,R<0], At[RoundDisp,2];
*Was halfway beteen even & next higher odd--undo +200b.
SB[fpFracL] ← SB[fpFracL] - (200C), GoTo[Inexact], At[RoundDisp,4];

RndUp:
T ← LSh[SB[fpFracH],10], GoTo[.+3,Carry’];
*Adding 400b to SB[fpExpSign] here while leaving SB[fpFracH] .eq. 0 works
*except when exponent is 376b--false overflow gets generated.
SB[fpExpSign] ← SB[fpExpSign] + (200C);
SB[fpFracH] ← 100000C;
fpSticky ← (fpSticky) or (1C), GoTo[InexactTrap,R<0];
InexactNoTrap:
*Was exact and odd.
LU ← SB[fpExpSign] + (400C), DblGoTo[OverUnderflow,Pack2,R<0], At[RoundDisp,6];

Pack2:
T ← (RSh[SB[fpFracL],10]) or T, GoTo[Overflow,ALU<0];
Stack&+1 ← T;*Push low-order result word
T ← SB[fpExpSign] and not (177C), Task;
*Simultaneously offset the exponent by 1 while OR’ing in the fraction.
T ← (LdF[SB[fpFracH],0,10]) + T;
*Push high-order result & exit
PackA:
T ← (LSh[SB[fpExpSign],17]) or T, GoTo[FlPush];

%Improvement for Float and FAdd/FSub with different signs.
Timing: 12/tristep + (5, 10, or 14 for the exit 0, 1, or 2 step cases).
This is 2 slower for 0 steps, 1 faster for 1 step, and 5 faster for 2 steps;
much faster for longer normalizations. Renorm goes 1 faster if all places
which might use more than 1 NormStep jump to LongNorm.

LongNorm:
Dispatch[SB[fpFracH],1,2], GoTo[PackL,R<0];
SB[fpExpSign] ← SB[fpExpSign] - (200C), Disp[.+1];
Norm3:
T ← RSh[SB[fpFracL],15], At[x,0];
SB[fpFracL] ← LSh[SB[fpFracL],3], Task;
SB[fpFracH] ← (LSh[SB[fpFracH],3]) or T;
SB[fpExpSign] ← SB[fpExpSign] - (400C), GoTo[LongNorm];
Norm2:
T ← RSh[SB[fpFracL],16], At[x,1];
SB[fpFracL] ← LSh[SB[fpFracL],2];
SB[fpFracH] ← (LSh[SB[fpFracH],2]) or T;
SB[fpExpSign] ← SB[fpExpSign] - (200C), GoTo[PackL];
NormGo:
SB[fpFracL] ← SB[fpFracL] SALUFOP T, Skip, At[x,2];
SB[fpFracL] ← SB[fpFracL] SALUFOP T, At[x,3];
PackL:
T ← LSh[SB[fpFracH],10], GoTo[Pack];

NormStep:
SB[fpExpSign] ← SB[fpExpSign] - (200C), GoTo[NormGo];
%


%NegFrac performs a 2’s complement negation on SB[fpFracH],,SB[fpFracL] and
returns the new SB[fpFracL] in T.

Timing: 6 cycles if SB[fpFracL] is 0, else 7 cycles.
%
NegFrac:
T ← SB[fpFracL];
NegFrac1:
SB[fpFracH] ← SB[fpFracH] xnor (0C), Skip[ALU=0];
T ← SB[fpFracL] ← (Zero) - T, Return;
SB[fpFracH] ← SB[fpFracH] + 1, Return;


%RoundCard rounds SB[fpFracH],,T using MNBR[0] as rounding bit and
MNBR[1:17b] as sticky bits; T is unchanged if no rounding occurs, or contains
the new SB[fpFracL] if rounding occurs.

Timing: 4 if exact, 12 if sticky bits .ne. 0, 13 if sticky bits .eq. 0.
%

RoundCard:
LU ← (MNBR) SALUFOP T, Skip[ALU#0];
Return;*Exact
*Round up if halfway and sticky bits non-0.
T ← (RZero) + T, UseCOutAsCIn, Skip[ALU=0];
SB[fpFracH] ← SB[fpFracH] + 1, UseCOutAsCIn, GoTo[CInexact];
*Sticky bits all 0, rounding bit 1: round up if odd, down if even, but just
*added 1, so result is correct if SB[fpFracL] now even, else 1 too big.
SB[fpFracH] ← SB[fpFracH] + 1, UseCOutAsCIn;
T ← (Form-2[AllOnes]) and T;
CInexact:
fpSticky ← (fpSticky) or (1C), DblGoTo[InexactTrapA,CInexactOK,R<0];

%Unnorm right-shifts the fraction DB[fpFracH/L] by fpTemp[1:8] positions with
underflow going into MNBR. MNBR[1:15] are sticky bits.

Unnorm is called on FAdd, FSub, Fix, FixI, FixC, Round, RoundI, and RoundC.
On FAdd and FSub, it right-shifts the fraction of the number with the smaller
exponent by an amount equal to the difference in the exponents, so that the
fractions can be added. On Fix or Round, it shifts the fraction so that its
exponent can be made 235b, which positions the decimal point to the right
of fpFracL bit 15d.

Use of MNBR to capture underflow and the 37b maximum shift is needed for
Fix/Round, which use MNBR[0] for rounding and MNBR[1:15d] as sticky bits.
FAdd/FSub worst case is when signs differ and subtraction of fractions
necessitates two normalization steps afterwards; in this case DB[fpFracL] bit
10d will become the rounding bit, so a maximum shift of 31b suffices, MNBR is
not needed, and underflowing bits can result in OR’ing into any of DB[fpFracL]
bits 11d to 15d.

Timing: 8 cycles if shift count .ge. 40b (i.e., the leading fraction bit in
DB[fpFracH] bit 1 is shifted further than bit 0 of MNBR); 14 if count is 0;
25 if count is 1 to 17b; 16 if count is 20b; or 27 if count is 21b to 37b.
%

Unnorm:
LU ← (fpTemp) - (10000C);
T ← (LdF[fpTemp,5,4]) - 1, GoTo[.+3,ALU<0];
*More than two words--any value .ls. 40000b ok here
MNBR ← R400, BBFBX;
T ← DB[fpFracH] ← 0C, Return;
fpCCTemp ← T, BBFBX, GoTo[Unnorm0,ALU<0];
CycleControl ← fpCCTemp;
*CycleControl ← RSCount - 1, so MWX ← RSCount - 1 and DBX ← 0. For WFA, this
*will left-cycle by 17b - DBX - MWX (= 20b - RSCount) and mask with 1’s in
*bits DBX to MWX (= 0..(RSCount - 1)).
fpCCTemp ← 16C;
Dispatch[fpTemp,4,1];
fpCCTemp ← (fpCCTemp) - T, LoadPage[fpPage3], Disp[.+1];

MNBR ← WFA[DB[fpFracL]], GoToP[.+1], At[UnnormDisp,0];*1 to 17b
OnPage[fpPage3];
T ← WFA[DB[fpFracH]];
*CycleControl ← 16b - (RSCount mod 20b - 1), knowing RSCount is non-zero, or
*DBX ← 0 and MWX ← 17b - (RSCount mod 20b). For RF, this left-cycles by
*MWX+DBX+1 (= 20b - RSCount) and masks with 1’s in bits 17b-MWX to 17b
*(= RSCount mod 20b to 17b).
CycleControl ← fpCCTemp;
T ← (RF[DB[fpFracL]]) or T;
DB[fpFracH] ← RF[DB[fpFracH]], Return;

*21b to 37b
LU ← DB[fpFracL] - 1, LoadPage[fpPage0], GoToP[.+1], At[UnnormDisp,1];
OnPage[fpPage3];
* +1 always .eq. or (1C) here
T ← (WFA[DB[fpFracH]]) + 1, UseCOutAsCIn, GoToP[.+1];
OnPage[fpPage0];
CycleControl ← fpCCTemp, fpCCTemp ← T, NoRegILockOK;
MNBR ← fpCCTemp;
T ← RF[DB[fpFracH]];
Unnorm1:
DB[fpFracH] ← 0C, Return;

Unnorm0:
Dispatch[fpTemp,4,1];
T ← MNBR ← DB[fpFracL], Disp[.+1];
MNBR ← RZero, Return, DispTable[2];*No shift
T ← DB[fpFracH], GoTo[Unnorm1];*Shift exactly 1 word

%For Alto Mesa, FloatTrap et al. store alpha in OTPReg, set T to
sFloatingPoint, and go to Kfcr which traps to MesaCode to try again
on this operation. For Cedar Mesa, the parameter is 400 + alpha.

Arrive at OverUnderflow for exponents 401b to 777b with the exponent field
offset by -1. Exponents of 0 or of 777b down to 577b represent underflow;
377b to 576b represent overflow. Overflow always traps.

Gradual denormalization is not handled in microcode and traps. A microcode
implementation should sensibly handle not only denormalized results but also
denormalized arguments. For an underflowing result, microcode would first
right-shift the fraction one position because the implicit leading 1 isn’t
there with a zero exponent. Then it would right-shift the fraction while
counting the exponent up to zero. Rounding would take place after shifting,
and the special case in which rounding normalizes the number would be dealt
with. Unpacking would normalize the fraction while counting the exponent
to a negative value.
%
OverUnderflow:
LU ← SB[fpExpSign] - (137400C);*Distinguish overflow and underflow
LU ← (fpSticky) and (20000C), Skip[ALU<0];
DblGoTo[Pack0,FloatTrap,ALU#0];*Underflow--check mode
T ← SStkP, GoTo[FloatTrp1];*Overflow

Overflow:
*For exponent .eq. 377b or 400b
T ← SStkP, GoTo[FloatTrp1];

InexactTrap:
T ← SStkP, GoTo[FloatTrp1];
FixTrapA:
T ← SStkP, GoTo[FloatTrp1];
FixTrapB:
T ← SStkP, GoTo[FloatTrp1];
FixTrap:
T ← SStkP, GoTo[FloatTrp1];
NaNTrap:
T ← SStkP, GoTo[FloatTrp1];
InexactTrapA:
DenormTrap:
T ← SStkP, GoTo[FloatTrp1];
FloatTrap:
T ← SStkP;*Even placement; paired with mi at Pack0.
:IF[AltoMode]; ********************************************
FloatTrp1:
fpTemp ← T, Task;
StkP ← fpTemp;
LoadPageExternal[opPage3];
fpTrapExit:
T ← sFloatingPoint, GoToExternal[kfcrLoc];
:ELSE; ****************************************************
FloatTrp1:
TrapParm ← T, Task;
TrapParm ← (StkP ← TrapParm) xor T;
fpTrapExit:
LoadPage[opPage0];
RTemp ← (RTemp) + (400C), GoToP[UndefTrap];*400 + alpha
:ENDIF; ***************************************************

%Addition and subtraction are almost the same.
1) Unpack the arguments, negate Arg2 if FSub, check for zeroes.
2) Point DB at the argument with the smaller exponent, SB at the other, and
call Unnorm to right-shift the DB argument by the difference in the exponents.
3) Add or subtract magnitudes, depending on the signs.
5) Renormalize the result, round, and push it on the stack.

Get here after T ← fpFrac1H; SB selects arg1.

Timing @FAdd to Renorm: 24 + (2 if FSub) + (5 if Exp1 .ls. Exp2)
+ (7 if subtraction produces negative fraction)
+ (1 if Unnorm produces ones in MNBR). Add Unnorm time to this.
For those cases which enter Renorm at NormGo, 4 cycles have been subtracted.
Subtraction can produce a negative result only when Exp1 .eq. Exp2;
Unnorm produces a sticky bit only for exponent differences larger than 7.

0 or 1 NormSteps will be performed after addition;
1 or 2 NormSteps will be performed after subtraction usually; but
up to 24d NormSteps might be performed if 0.5 < arg1/arg2 < 2.0.
%
@FSub:
fpExpSign2 ← (fpExpSign2) xor (1C), At[UnpackRet,1];*Complement sign
@FAdd:
LU ← (fpFrac2H) and T, At[UnpackRet,0];
*Simultaneously compute the exponent difference in bits 0:8 while XOR’ing
*the signs in bit 15d and setting up a 3 in bits 10:11d for SB← and DB←.
*The "1" in "61C" here prevents sign subtraction from propagating higher.
T ← (fpExpSign2) - (61C), GoTo[Add0,ALU=0];
T ← (fpExpSign1) - T, Call[FAddA];
*Return here after Unnorm. Or the Sticky bit into DB[fpFracL] bit 15 and test
*cycle the (Sign1 xor Sign2) bit into the ALU sign for testing.
LU ← MNBR;
LU ← RCy[fpTemp,1], Skip[ALU=0];
T ← (RSh[AllOnes,17]) or T, DblGoTo[AddFrac,SubFrac,ALU<0];
DblGoTo[AddFrac,SubFrac,ALU<0];

*fpTemp ← [(Exp1-Exp2) lsh 7] + 61b + S1 - S2
FAddA:
fpTemp ← T, Skip[ALU<0];
*Exp1 .ge. Exp2: Point SB at arg1 (it already does), DB at arg2.
DB ← fpTemp, GoTo[Unnorm];
*Exp1 .ls. Exp2: Point SB at arg2, DB at arg1; negate the count.
T ← 157C;
*fpTemp ← [(Exp1-Exp2)’ lsh 7] + 41b + S1 - S2; SB ← 3.
fpTemp ← (SB ← fpTemp) xnor T;
*fpTemp ← [(Exp2-Exp1) lsh 7] + 21b + S1 - S2; DB ← 2.
fpTemp ← (DB ← fpTemp) + T + 1, GoTo[Unnorm];

*The signs are equal; add the fractions. DB[fpFracL] is in T.
AddFrac:
SB[fpFracL] ← SB[fpFracL] + T, Task;
T ← DB[fpFracH] + 1, UseCOutAsCIn;
SB[fpFracH] ← SB[fpFracH] + T, GoTo[Renorm];

*The signs are different; subtract the fractions. DB[fpFracL] is in T.
SubFrac:
SB[fpFracL] ← SB[fpFracL] - T;
T ← DB[fpFracH], FreezeResult;
SB[fpFracH] ← T ← (SB[fpFracH]) - T, UseCOutAsCIn;
LU ← SB[fpFracL] or T, GoTo[SubNeg,ALU<0];
SubNonNeg:
SB[fpExpSign] ← SB[fpExpSign] - (200C), GoTo[LongNormGo,ALU#0];
T ← Stack&+1 ← 0C, GoTo[FlPush];

*Get here only when the exponents are equal and the DB fraction is larger
*than the SB fraction. 2 to 24d NormSteps will be performed.
SubNeg:
T ← SB[fpFracL], Call[NegFrac1];
*Simultaneously complement the sign and subtract 1 from the exponent.
SB[fpExpSign] ← SB[fpExpSign] - (177C), GoTo[LongNormGo];

%Multiply.
1) Unpack arguments leaving normalized fractions in fpFracH[1:15],,fpFracL[0:8].
2) Check for zero arguments. GoTo Mul0 if either arg denormalized; since
Unpack has trapped for any non-0 denormalized arg, the "and" at @FMul is
always non-0 for both args non-0.
3) Set sign and exponent of Frac1 to Exp1+Exp2-177b+1-1, where -177b
compensates for the excess in Exp1+Exp2, +1 for the NormStep normally
required, and -1 for the exponent offset in Renorm; altogether this is
Exp1+Exp2-177b. Adding fpExpSign1 to fpExpSign2 also XOR’s the sign bit
for the result.
4) Shift fpFrac2 so non-0 bits are in fpFrac2H[8:15],,fpFrac2L[0:15];
then copy fpFrac2L into MNBR for the inner loops; fpFrac2L becomes
the low-order product word fpFrac1LL.
5) Left-shift fpFrac1H/L 2 positions and add fpFrac2H to fpFrac1L; the
leading multiplier bit is known to be 1, so the inner loop is entered with
the multiplier lsh 1 in fpFrac1H/L/LL[0:25b] and the multiplicand in
fpFrac1H/L/LL[30:57b].
6) Perform the shift and add loop 23d times, since the 1st of 24d steps was
done before starting the loop. Initially, two 0 bits lie between the
low-order multiplier and high-order product bits. Each step shifts the
multiplier and product in fpFrac1H/L/LL and conditionally adds the
multiplicand in fpFrac2H/MNBR. Carries may make the 2nd of the two 0’s
between the multiplier and product into a 1, but the 1st 0 will remain 0.
7) Do 0 or 1 NormSteps at Renorm.

Timing @FMul to Renorm for non-0 args: (36 or 39) + 8*(10/zero or 19/one)
+ 15*(6/zero, [11 or 14]/one). Total averages 293.75 cycles.
%
@FMul:
LU ← (fpFrac2H) and T, At[UnpackRet,2];
T ← (fpExpSign2) - (37400C), GoTo[Mul0,ALU=0];
fpExpSign1 ← (fpExpSign1) + T, LoadPage[fpPage4];
T ← RSh[fpFrac1L,16];
OnPage[fpPage4];
fpFrac1H ← (LSh[fpFrac1H,2]) or T, Task;
T ← LSh[fpFrac2H,11];
fpFrac1LL ← (RSh[fpFrac2L,7]) or T;*For that 1st multiplier 1
T ← fpFrac2H ← RSh[fpFrac2H,7], Task;
fpFrac1L ← (LSh[fpFrac1L,2]) or T;
fpOne ← 1C;
LU ← (MNBR ← fpFrac2L) SALUFOP T;
T ← (fpFrac2H) SALUFOP T, UseCOutAsCIn;
fpFrac2HLSh1 ← T, LoadPage[fpPage2];
fpFrac2HLSh1p1 ← (Zero) + T + 1;
OnPage[fpPage2];
fpCount ← 6C, Call[FMNoA1];*Total of 5 + 3*fpCount iterations

%With this unrolled loop, the exterior step runs in 10 + 9/add cycles and
interior steps in 6 + (5 or 8)/add cycles; the interior step can be
replicated 0 to 4 times, trading space for speed. Each replication costs
10d mi each (+5 mi for initializing the three registers needed).
fpCount and entry point have to be adjusted. The 1st fast loop saves an
average of 2.625/cycles bit; the 2nd saves an additional .875 cycles/bit;
34d saves .4375 cycles/bit, etc. I haven’t found any way of making the
slow exterior loop look more like the fast one and still get in the
required tasking.
%
fpCount ← (fpCount) - 1, GoTo[MulDone,R<0];

fpFrac1LL ← (fpFrac1LL) SALUFOP T;
fpFrac1L ← (fpFrac1L) SALUFOP T, UseCOutAsCIn;
fpFrac1H ← (fpFrac1H) SALUFOP T, UseCOutAsCIn, GoTo[FMAd1,R<0];

FMNoA1:
fpFrac1LL ← (fpFrac1LL) SALUFOP T;
fpFrac1L ← (fpFrac1L) SALUFOP T, UseCOutAsCIn;
FMAdS2:
fpFrac1H ← (fpFrac1H) SALUFOP T, UseCOutAsCIn, GoTo[FMAd2,R<0];

FMNoA2:
fpFrac1LL ← (fpFrac1LL) SALUFOP T;
fpFrac1L ← (fpFrac1L) SALUFOP T, UseCOutAsCIn;
FMAdS3:
*Additional loop replications go here
fpFrac1H ← (fpFrac1H) SALUFOP T, UseCOutAsCIn, DblGoTo[FMAd3,FMNoA3,R<0];

FMAd1:
T ← (MNBR) SALUFOP T;
fpFrac1LL ← (LSh[fpFrac1LL,1]) + T, Skip[R<0];
T ← (fpFrac2HLSh1) + 1, UseCOutAsCIn, Skip;
T ← (fpFrac2HLSh1p1) + 1, UseCOutAsCIn;
fpFrac1L ← (LSh[fpFrac1L,1]) + T, GoTo[FMAdS2,R>=0];
**Carry is impossible for the 1st 7 iterations of the loop.
T ← (fpOne) + 1, UseCOutAsCIn;
fpFrac1H ← (LSh[fpFrac1H,1]) + T, GoTo[FMNoA2,R>=0];

FMAd2:
T ← (MNBR) SALUFOP T;
fpFrac1LL ← (LSh[fpFrac1LL,1]) + T, Skip[R<0];
T ← (fpFrac2HLSh1) + 1, UseCOutAsCIn, Skip;
T ← (fpFrac2HLSh1p1) + 1, UseCOutAsCIn;
fpFrac1L ← (LSh[fpFrac1L,1]) + T, GoTo[FMAdS3,R>=0];
T ← (fpOne) + 1, UseCOutAsCIn;
fpFrac1H ← (LSh[fpFrac1H,1]) + T, GoTo[FMNoA3,R>=0];

FMAd3:
*Additional loop replications go here
T ← MNBR;
fpFrac1LL ← (fpFrac1LL) + T;
T ← fpFrac2H, FreezeResult;
fpFrac1L ← (fpFrac1L) + T, UseCOutAsCIn;
FMNoA3:
fpFrac1H ← (fpFrac1H) + 1, UseCOutAsCIn, Return;

*If fpFrac2L ne 0, set the fpFracL1[15d] to control rounding.
MulDone:
LU ← (fpFrac1LL) - 1, LoadPage[fpPage0];
*7 cycles from last task to Renorm.
fpExpSign1 ← (fpExpSign1) - (200C), GoToP[Renorm,Carry’];
OnPage[fpPage0];
*10 cycles from last task to Renorm.
fpFrac1L ← (fpFrac1L) or (1C), GoTo[Renorm];

%Division. Arg1 is dividend, arg2 divisor; quotient in MNBR/fpTFrac.
1) Unpack arguments.
2) Trap if divisor is 0; else return 0 quotient if dividend is 0.
3) Copy fpFrac1L into fpTFrac so that fpFrac1L can accumulate the quotient.
4) Set sign and exponent of the result: Exp1-Exp2+177b-1 is the exponent
if the initial subtract succeeds, where 177b supplies the offset and -1
compensates for the +1 in Renorm; if the initial subtract fails, an
additional -1 compensates for the extra quotient bit generated. The
subtraction of the exponents also xor’s the signs, but it is necessary to
offset fpExpSign1 by 2 to ensure that if arg1 is positive and arg2 negative,
the carry from subtracting the signs won’t affect the exponent calculation.
5) Compute the variations on fpFrac2H needed for the inner loop.
6) Initialize fpFrac1L to 1. On the 1st iteration, a 1 will be produced if
dividend fraction is larger than divisor fraction; if so, 25d iterations
produce a 24d bit result and an extra bit for rounding. If the 1st iteration
produces a 0, a 26th iteration is required, and fpFrac1L is reinitialized to 1.
The 1st inner loop exit occurs when the leading 1 has migrated out of the
sign bit; the 16d bits in fpFrac1L are copied into MNBR, and fpFrac1L is
reinitialized to 200b; the inner loop is then repeated 9d more times. After
the 2nd inner loop exit, 25d quotient bits are in MNBR/fpFrac1L; the remainder
serves as a sticky bit. The 1st inner loop exit is distinguished from the
2nd by the sign of fpFrac1L.
7) The shift and add-or-subtract loop is primed with an initial subtraction;
then each step combines shifting the dividend with adding or subtracting
the divisor according to the quotient bit produced by the previous step.
8) Undo the last subtraction if the right-most result bit is a 1, so that the
remainder will correctly be 0 or non-0 for rounding.
9) Copy MNBR into fpFrac1H and OR into fpFrac1L for rounding if remainder non-0.
10) No normalization is required; round and push result.

Timing @FDiv to Renorm: (52 or 53) + 24*(10 to 12) + (12 or 13 if
dividend < divisor) + (3 if 25th bit is 0) + (3 if final remainder .ne. 0).
%
@FDiv:
LU ← (fpFrac2H) and T, At[UnpackRet,3];
*Go if either arg denormalized, since denormalized implies 0 because Unpack
*will have trapped for any non-zero denormalized argument.
T ← (fpFrac2H) xnor (0C), Skip[ALU#0];
T ← fpExpSign2, GoTo[Div0];
fpFrac2H’ ← T;
fpFrac2Hp1 ← (Zero) - T;
fpFrac2H’p1 ← (Zero) + T + 1;
*Have to open-code the first subtraction.
T ← fpFrac2L, LoadPage[fpPage1];
T ← (fpFrac1L) - T;
OnPage[fpPage1];
fpTFrac ← T, FreezeResult, Task;
T ← (fpFrac2H’) + 1, UseCOutAsCIn;
fpFrac1H ← (fpFrac1H) + T, Call[DivStart];
*Loop here 10 to 12 cycles
fpFrac1H ← (LSh[fpFrac1H,1]) + T;
*Shift in quotient bit
fpFrac1L ← (fpFrac1L) SALUFOP T, UseCOutAsCIn, GoTo[DivSub,Carry];
*Shift the remainder left 1 bit and add the divisor.
DivAdd:
T ← fpFrac2L, FreezeResult, GoTo[DivAddLast,Carry];
DivAdR:
fpTFrac ← (LSh[fpTFrac,1]) + T, Skip[R<0];
T ← (fpFrac2H) + 1, UseCOutAsCIn, Return;
T ← (fpFrac2Hp1) + 1, UseCOutAsCIn, Return;

*Shift the remainder left 1 bit and subtract the divisor.
DivSub:
T ← fpFrac2L, FreezeResult, GoTo[DivSubLast,Carry];
DivSbR:
fpTFrac ← (LSh[fpTFrac,1]) - T, Skip[R<0];
T ← (fpFrac2H’) + 1, UseCOutAsCIn, Return;
T ← (fpFrac2H’p1) + 1, UseCOutAsCIn, Return;

*Initialize fpFrac1L so the sign will become 1 on the final iteration;
*a 1 will then be shifted into the sign for a normalized result.
DivStart:
T ← fpExpSign2 ← (fpExpSign2) - (37400C), GoTo[DivExtraStep,Carry’];
fpExpSign1 ← (fpExpSign1) - T;
fpFrac1L ← 3C, GoTo[DivSub];*Initial subtract succeeded
*Initial subtract failed--extra iteration and exponent is one smaller.
DivExtraStep:
T ← (fpExpSign2) + (200C);
fpExpSign1 ← (fpExpSign1) - T;
fpFrac1L ← 1C, GoTo[DivAdd];

*Get to DivAddLast or DivSubLast once after the result bits for the
*high quotient word are accumulated and again after the final iteration.
DivAddLast:
fpExpSign1 ← (fpExpSign1) + 1, DblGoTo[FDivX,.+2,ALU>=0];
DivSubLast:
fpExpSign1 ← (fpExpSign1) + 1, GoTo[FDivX,ALU>=0];
MNBR ← fpFrac1L, Skip[R Odd];
fpFrac1L ← 200C, GoTo[DivAdR];
fpFrac1L ← 200C, GoTo[DivSbR];
*Really done; position the fraction for normalization and rounding.
FDivX:
fpFrac1L ← LSh[fpFrac1L,7], Skip[R Even];
T ← fpFrac1H, LoadPage[fpPage0], GoTo[DivMore1];
fpTFrac ← (fpTFrac) + T;
T ← (fpFrac2H) + 1, UseCOutAsCIn;
T ← (fpFrac1H) + T, LoadPage[fpPage0];
DivMore1:
LU ← (fpTFrac) or T;
OnPage[fpPage0];
T ← MNBR, Skip[ALU=0];
fpFrac1L ← (fpFrac1L) or (1C);
*20 to 27 cycles from last task to Renorm.
fpFrac1H ← T, GoTo[Renorm];

%Square root.

Conceptually, 25d iteration steps produce 24 data bits + 1 extra bit for
rounding; fpRhi/RLo and FracH/L form a single register left-shifted 2 each
step; fpRhi/RLo, called R, are initially 0. The square root Q accumulates
in fpFrac2L/H (initially 0). An iteration is as follows:
If 4*R - 4*Q - 1 >= 0, then R ← 4*R - 4*Q - 1 and Q ← 2*Q + 1,
else R ← 4*R; Q ← 2*Q.
or alternatively as follows (with Q multiplied by 2):
Q ← 2*Q; R ← 4*R
if R - Q - 1 >= 0 do [ R ← R - Q - 1; Q ← Q + 2 ]

The exact algorithm is as follows:
1) Unpack1 leaves SB pointing at Frac2 and a normalized fraction in
Frac2[1:24]. With the decimal point between bits 1 and 2, Exp2 is biased by
177b and 1 <= Frac2 < 2.
2) Return 0 if Frac2H is 0.
3) Trap if the sign is 1 (imaginary result).
4) Conditionally left-shift Frac 1 bit if Exp2 is even and then compute
the exponent as Exp2 ← (Exp2-177b)/2 + 177b = (Exp2+177b)/2; after this
1 <= Frac < 4 and 1 <= SqRt[Frac] < 2. **ACTUALLY Exp2 ← (Exp2+176b)/2-1/2
5) The first Q bit is known to be 1; this step is done by hand while
initializing registers.
6) 7 steps shift FracH only into fpRlo and build Q only in fpFrac2L; since
Q is multiplied by 2 throughout, this iteration terminates when the
leading 1 in Q is in bit 7.
7) 6 steps shift FracL only into fpRlo and build Q only in fpFrac2L; the
final step shifts in a zero; this iteration terminates at the onset of the
step in which the leading 1 in fpFrac2L is shifted into bit 0.
8) 11d steps shift in zeroes, building R in both fpRhi/Rlo and Q in both
fpFrac2H/2L; before starting this iteration both R and Q are left-shifted
so that the final step will leave the leading 1 in Q in the sign bit of
fpFrac2H.
9) 1 is OR’ed into fpFrac2L bit 15d if either fpFracLLL or fpFracLL is
non-zero (’sticky’ bit for rounding).
10) No normalization is required; round and push the result.

Timing @FSqRt to Renorm: 32 + (3 if Frac lsh 1) + 7*(13/zero or 12/one) +
6*(12/zero or 11/one) + 11*(18 to 20/zero or 19 to 21/one)
= 380 to 429 cycles.
%

*T has fpFrac2H just put through the ALU.
@FSqRt:
T ← fpExpSign2 ← (fpExpSign2) + (LShift[176,7]C), Skip[ALU#0], At[UnpackRet,13];
T ← Stack&+1 ← 0C, GoTo[PackA];
fpExpSign2 ← RCy[fpExpSign2,1], GoTo[FSq0,H2Bit8];
T ← fpFrac2L ← (fpFrac2L) SALUFOP T;
fpFrac2H ← (fpFrac2H) SALUFOP T, UseCOutAsCIn, GoTo[FSq1];
*Here Frac2L is copied into Frac1L so that the root can wind up in Frac2.
*Copying Frac2H into Frac1H is unnecessary because Frac2H is consumed before
*the square root accumulating in Frac2L overflows.
*1st iteration, done by hand while copying Frac2L to Frac1L, is:
* Frac2L ← 2, FracRlo ← RSh[Frac2H,16] - 1, Frac2H ← LSh[Frac2H,2]
FSq0:
T ← fpFrac2L;
FSq1:
fpExpSign2 ← (fpExpSign2) - (LShift[1,6]C), Skip[R>=0];
T ← SStkP, GoTo[FloatTrp1];
fpFrac1L ← T, LoadPage[fpPage5];
T ← (RSh[fpFrac2H,16]) - 1;
OnPage[fpPage5];
fpRlo ← T;
*Work with 2*Q throughout; initialize fpFrac2L with the 1st root bit lsh 1.
T ← fpFrac2L ← 4C, Call[.+2];
*7 iterations of 1st loop exhaust Frac2H.
T ← fpFrac2L ← (fpFrac2L) SALUFOP T, GoTo[FSq2,H2Bit8];
T ← (LdF[fpFrac2H,2,2]) - T - 1;*Odd placement
fpRlo ← (LSh[fpRlo,2]) + T;
FSq0X:
fpFrac2H ← LSh[fpFrac2H,2], GoTo[.+3,Carry];
FSqLpa:
T ← fpFrac2L;
fpRLo ← (fpRlo) + T + 1, Return;
FSqLpb:
T ← fpFrac2L ← (fpFrac2L) + (2C), Return;

*7 iterations of 2nd loop exhaust fpFrac2L producing a 15d-bit result.
FSq2:
T ← (LdF[fpFrac2H,2,2]) - T - 1;*Even placement
fpRlo ← (LSh[fpRlo,2]) + T, Call[FSq0X];
T ← fpFrac2L ← (fpFrac2L) SALUFOP T;
T ← (RSh[fpFrac1L,16]) - T - 1, GoTo[FSq3,ALU<0];
fpRlo ← (LSh[fpRlo,2]) + T;
fpFrac1L ← LSh[fpFrac1L,2], DblGoTo[FSqLpb,FSqLpa,Carry];

*The leading 1 in fpFrac2L was just left-shifted into bit 0 and 14d result
*bits are in fpFrac2L. Now left-shift both Frac2 and R so termination will
*occur after 11d iterations with the leading 1 in fpFrac2H bit 0. The initial
*shift also accomplishes the shift of Frac2 and R for the next step, so Frac2
*is shifted by 6 (for a total shift of 7 since we just shifted Frac2 by 1) and
*R is shifted by 8d (one more than the Q shift).
FSq3:
T ← RSh[fpRLo,10];
fpRHi ← T, LoadPage[fpPage6];
T ← RSh[fpFrac2L,12];
OnPage[fpPage6];
fpFrac2H ← T, LoadPage[fpPage7];
T ← fpFrac2L ← LSh[fpFrac2L,6];
OnPage[fpPage7];
fpRLo ← (LSh[fpRLo,10]) - T - 1, Call[FSq8];
*Rhi ← 4*Rhi + 2 bits from Rlo
T ← RSh[fpRlo,16];
fpRhi ← (LSh[fpRhi,2]) + T;
*Q ← 2*Q; Rlo ← 4*Rlo - Qlo - 100b; Rhi ← Rhi + Qhi’ + Carry
T ← fpFrac2L ← (fpFrac2L) SALUFOP T;
fpRlo ← (LSh[fpRlo,2]) - T - 1, Skip[Carry];
T ← fpFrac2H ← LSh[fpFrac2H,1], DblGoTo[FSq4,FSq5,Carry];
T ← fpFrac2H ← (LSh[fpFrac2H,1]) + 1, DblGoTo[FSq4,FSq5,Carry];
FSq4:
T ← (fpRhi) - T, DblGoTo[FSq9,FSq7,ALU<0];
FSq5:
T ← (fpRhi) - T - 1, GoTo[FSq7,ALU>=0];
FSq9:
FreezeResult, Call[FSq7];
*OR sticky bits into fpFrac2L bit 15d.
T ← (fpRhi) or T;
LU ← (AllOnes) + T, LoadPage[fpPage0];
fpFrac2L ← (fpFrac2L) + 1, UseCOutAsCIn, GoToP[Renorm];

FSq7:
fpFrac2L ← (fpFrac2L) + (200C), GoTo[FSq6,Carry];
T ← fpFrac2L ← (fpFrac2L) - (200C);
T ← fpRlo ← (fpRlo) + T + 1, Return;
FSq6:
fpRhi ← T;
T ← fpRlo ← (fpRlo) and not (77C), Return;

FSq8:
T ← fpFrac2H, DblGoTo[FSq4,FSq5,Carry];


%Floating scale. 2OS,,3OS are a floating point number whose exponent is
scaled by TOS, an integer such that -202b <= TOS <= 200b.
(Larger values reverse overflow and underflow reporting, so overflow could
substitute 0 rather than trapping in some cases.). The opcode specification
is for -200b <= TOS <= 177b.

**Could speed this up by ignoring the Unpack results and manually checking for
overflow and underflow (i.e., for 1 <= Exp <= 376b); if all is ok, then
directly add fpFrac1H to the exponent in 2OS.
%
OnPage[fpPage0];

FSc1:
T ← LSh[fpFrac1H,7], Skip[ALU#0];
T ← Stack&+1 ← 0C, GoTo[PackA];
fpExpSign2 ← (fpExpSign2) + T, GoTo[NormGo];

*ZERO ARGUMENTS.

%FAdd/FSub: Get to Add0 knowing that (fpFrac1H) and (fpFrac2H) .eq. 0; since both
args were either normalized or zero if passed by Unpack, passed by one/both
fractions must be zero. If just one is 0, return the other. If both are 0,
return 0 (negative 0 if both args are negative 0). SB points at arg1.
%

OnPage[fpPage0];

Add0:
LU ← fpFrac1H;
LU ← fpFrac2H, Skip[ALU=0];
Stack&+2, GoTo[NormReturn];*fpFrac1H .ne. 0, so fpFrac2H .eq. 0
T ← fpExpSign2, Skip[ALU#0];
*fpFrac1H and fpFrac2H both 0.
fpExpSign1 ← (fpExpSign1) and T, GoTo[Pack0];

*The following would work except that on FSub the sign must be complemented:
*
Stack&+3;
*
Stack&-2 ← Stack&-2;
*
Stack&+3;
*
Stack&-2 ← Stack&-2, GoTo[NormReturn];

fpTemp ← 60C;*fpFrac2H .ne. 0
SB ← fpTemp;
BBFBX, GoTo[Renorm];

%Divide. Trap if divisor (fpFrac2) is 0; else dividend (Frac1) is 0, so return
0 with sign the xor of the dividend and divisor signs.
Since Unpack has already trapped for all denormalized args except zeroes,
the checks for 0 look at fpFracH bit 1. T contains fpExpSign2.

Multiply. Return 0 with sign equal to the xor of the multiplier and
multiplicand signs. T contains fpExpSign2 - 40000b.
%
Div0:
LU ← (fpFrac2H) and (40000C);
Mul0:
fpExpSign1 ← (fpExpSign1) xor T, GoTo[FloatTrap,ALU=0];
Pack0:
T ← Stack&+1 ← 0C, GoTo[PackA];

%Floating Comparison returns INTEGER -1 if arg1 .ls. arg2,
0 if arg1 .eq. arg2, or 1 if arg1 .gr. arg2.

Timing @FComp to completion: 15 or 16 cycles if .eq., 27 to 30 cycles if .ne.

**Unpacking apparently serves only to eliminate not-a-numbers and denormalized
**numbers from testing.
%
@FComp:
T ← LSh[fpExpSign1,17], Task, At[UnpackRet,4];
Stack&+3;
LU ← (LSh[fpExpSign2,17]) xor T;
T ← Stack&-2, GoTo[CompDiffSign,ALU<0];
LU ← (Stack&+3) - T;*LU ← arg1Lo - arg2Lo
T ← Stack&-2, FreezeResult;
*T ← arg1Hi - arg2Hi
T ← (Stack&-1) - T, UseCOutAsCIn, GoTo[CompNon0Test,ALU#0];
Comp0Test:
Stack ← T, FreezeResult, GoTo[NormReturn,ALU=0];
fpExpSign2 ← (fpExpSign2) + 1, DblGoTo[.+3,.+2,Carry’];
CompNon0Test:
fpExpSign2 ← (fpExpSign2) + 1, Skip[Carry’];
fpExpSign1, DblGoTo[CompL,CompG,R Odd];
fpExpSign2, DblGoTo[CompL,CompG,R Odd];

CompG:
Stack ← 1C, GoTo[NormReturn];
CompL:
Stack ← (Zero) - 1, GoTo[NormReturn];

NormReturn:
LU ← NextInst[IBuf];
NIRet;

*If the signs differ, it is nevertheless possible that both fractions are 0
*(positive and negative 0). When the signs differ, we don’t care which way
*the Carry branch condition goes at Comp0Test+1 (???).
CompDiffSign:
T ← fpFrac1H;
T ← (fpFrac2H) or T, GoTo[Comp0Test];


%Float LONG INTEGER to REAL
1) Copy the argument into Frac1
2) Check sign, and negate Frac1 if negative.
3) Avoid 16d of NormSteps if 1st word is 0; special case entirely 0.
4) Set exponent and Renormalize.

Timing Float to Renorm: 21 cycles if positive, 28 if negative; +4 if
1st word insignificant.
%
Float:
fpFrac1H ← 40C, Task;
SB ← fpFrac1H, fpFrac1H ← T, NoRegILockOK;
fpExpSign1 ← T, BBFBX;*Start building sign
T ← Stack&-1, Call[FloatA];
*Have fpFrac1L in T from either FloatA or NegFrac.
FloatB:
fpExpSign1 ← (fpExpSign1) - (200C);
LU← fpFrac1H;
fpExpSign1 ← (fpExpSign1) + (43400C), Skip[ALU=0];
fpExpSign1 ← (fpExpSign1) + (4000C), GoTo[LongRenorm];
fpFrac1H ← T;
fpFrac1L ← 0C, Skip[ALU=0];
LongRenorm:
T ← LSh[SB[fpFracH],10], DblGoTo[Pack,NormStep,R<0];
T ← Stack&+1 ← 0C, GoTo[FlPush];

FloatA:
fpFrac1L ← T;
fpExpSign1 ← RSh[fpExpSign1,17], DblGoTo[NegFrac,FloatB,R<0];

%Fix or Round REAL to LONG INTEGER, INTEGER, or CARDINAL.
1) Unpack the argument.
2) Unnormalize Frac1 so that the decimal point is positioned to the right of
fpFrac1L bit 15d.
3) Round, if appropriate (controlled by fpCode)
4) Negate the fraction if sign is negative.
5) Several things depending on the result type
(controlled by fpCode)
LONG INTEGER:
Push double-precision result.
INTEGER:
Trap if positive or negative number is larger than 15d bits,
else push single-precision result.
CARDINAL:
Trap if fraction word 0 is non-zero, else push
single-precision result.

Timing from @Fix to exit (not including Unnorm):
Fix21 if 0, 23 if .gr. 0, 29 if .ls 0;
FixC19 if 0, 21 if .gr. 0, 27 if .ls 0;
FixI24 if 0, 26 if .gr. 0, 33 if .ls. 0;
Add 2 + RoundCard time for the analogous Round operations.
%

*Computation is 236b - Exp1 (the low order +1 prevents sign subtraction
*from propagating into the exponent field and inverts the branch condition).
*GoTo FixTrap if exponent .gr. 235b; in this case the number is .ge. 2↑31d
*(31d bits holds all numbers .ls. 2↑31.
FixRA:
SB[fpExpSign] ← T ← SB[fpExpSign] - (200C);
LU ← SB[fpFracH], GoTo[FixTrap,ALU>=0];
fpTemp ← (Zero) - T, GoTo[Unnorm,ALU#0];
T ← SB[fpFracL], GoTo[FrPos];*Fraction is 0


@Fix:
SB[fpExpSign] ← SB[fpExpSign] - (47000C) - 1, Call[FixRA], At[UnpackRet,6];
*SB[fpFracL] is in T here.
SB[fpExpSign], DblGoTo[FRPos,FRNeg,R Odd];

@Round:
SB[fpExpSign] ← SB[fpExpSign] - (47000C) - 1, Call[FixRA], At[UnpackRet,10];
*SB[fpFracL] is in T here.
LU ← MNBR, Call[RoundCard];
*SB[fpFracL] is still in T.
SB[fpExpSign], Skip[R Odd];
FRNeg:
LU ← T, Call[NegFrac1];
FRPos:
Dispatch[fpCode,12,2];
LU ← SB[fpFracH], Disp[.+1];

GoTo[FixTrapA,ALU#0], At[FixDisp,2];*Cardinal
FlPush:
LU ← NextInst[IBuf];
Stack&+1 ← T, NIRet;

SB[fpExpSign], FreezeResult, GoTo[.+3,R Even], At[FixDisp,1];*Integer
LU ← T, GoTo[FixTrapB,ALU#0];
DblGoTo[FixTrapA,FlPush,ALU<0];
LU ← SB[fpFracH] xnor (0C);
LU ← (LSh[AllOnes,17]) xor T, DblGoTo[FixTrapB,.-2,ALU#0];

Stack&+1 ← T, At[FixDisp,0];*Long integer
T ← SB[fpFracH], GoTo[FlPush];


%FSticky reads and sets the InexactResultTrapEnable and InexactResult Bit.
It swaps Stack and fpSticky.
%
FSticky:
MNBR ← fpSticky, fpSticky ← T, NoRegILockOK;
T ← MNBR, GoTo[FlPush];

*For buffer refill traps.
PFetch4[PCB,IBuf,4], GoToExternal[MesaRefillLoc], At[LShift[fpPage0,10],377];

END[MesaFP];

:ELSE; ************************************************
TITLE[No.floating.point.microcode];
:ENDIF; ***********************************************