Dragon Opcode Changes
Filed on: [Indigo]<Dragon>Documentation>OpcodeChanges.tioga
Edward R. Fiala
March 6, 1986
This memo discusses changes which presently are not incorporated and, furthermore, are unlikely ever to be incorporated in the Dragon opcode set. However, I have been logging some of these thoughts for possible future consideration.
1. Possible RRR Format Changes
The RRR format opcodes are poorly coded because the aux bit is only useful in 1 of the 2 states of aOpt, bOpt, and cOpt and because writing the 12 constants is of negligible value operationally (since they can be initialized with SIP), and Mesa will use at least 9 of the 12 registers only for constants; furthermore, writing a constant or one of the first 8 auxiliary registers is illegal in user mode and results in a ModeFault trap. Because of these inefficiencies, it is worth considering some tighter way of encoding these 16 bits to provide additional functionality.
This section discusses three different ways these 16 bits can be used more effectively:
1) To enlarge the number of registers accessible through RRR operations.
2) To allow more complete mixture of AuxRegs and LocalRegs in RRR operations.
3) To permit many literal constants, in addition to the 12 "constants" stored in EU registers, to be loaded directly into any register.
A fourth possibility is to encode current options in only 15 bits so that the 16th bit of the RRR format can indicate inversion of the Rb ALU input; this would allow the RSUB, RVSUB, RUSUB, and RLSUB opcodes to be eliminated because these functions can be carried out with RADD, RVADD, RUADD, and RLADD, respectively; and it would allow (Ra xor Rb'), (Ra and Rb'), and (Ra or Rb') operations to be manufactured from RXOR, RAND, and ROR. However, I haven't found a pleasing encoding for this.
First Variation On RRR Format
The variant encoding here adds 12 more registers which can be directly referenced from LocalRegs or the stack; these registers can be either additional LocalRegs or GlobalRegs, though enlarging the number of LocalRegs seems like a better choice. In addition, 4 AuxRegs are converted into constants. The 60 registers are divided into the following four register groups:
A = 12 AuxRegs + 4 stack options
L = 16 LocalRegs
X = 12 more LocalRegs (or these can be GlobalRegs) + 4 stack options
C = 16 constants
Note that the 4 stack options in A and X are identical; these options are the same as with the current RRR format. RRR decoding has 16 bits decoded into 3 register addresses as follows: The aOpt, bOpt, and cOpt fields are 4 bits each and select the particular register in a group, leaving 4 extra bits to control the register group used for each operand. There are 4 x 4 x 4 = 64 combinations of register groups for the three operands, which somehow must be thinned to 16 combinations for encoding with the 4 remaining bits. Eliminating all constant writing removes 16 possibilities; 26 more are eliminated by preventing simultaneous use of {L, X} with {A}, a restriction applicable to the current RRR format. Then 22 remaining combinations shown below must somehow be thinned to 16 combinations.
A ← A op A
A ← A op C e
A ← C op A
A ← C op C *
L ← L op L
L ← L op X e
L ← L op C e
L ← X op L
L ← X op X
L ← X op C
L ← C op L
L ← C op X
L ← C op C
X ← L op L
X ← L op X
X ← L op C e
X ← X op L
X ← X op X
X ← X op C e
X ← C op L
X ← C op X
X ← C op C
Eliminating the {A, L, X} op C combinations saves 5 more, as marked with "e" in the table; this costs something only when subtract is used because all other RRR operations can symmetrically use C op {A, L, X}, and in the case of subtract, it costs something only when -C is not also a constant. Planned constants are 0, 1, 2, 3, 4, -2, -1, 100000B, and 20000000000B, using 9 of the 12 registers; since 0 = -0 and 20000000000B = -20000000000B, the only constants whose negatives are not present are 3, 4, and 100000B. Hence, for constants so far defined, lost possibilities are {A, L, X} - {3, 4, and 100000B}; these lost combinations seem significantly less valuable than the additional registers.
Even if the negative of every constant must be provided, the 12 constants available with the old RRR format can be encoded as 6 new format constants and 6 new format GlobalRegs, leaving the new format with 6 more GlobalRegs and 10 more constants than the old format. The only restriction on this is that the GlobalRegs cannot be intermixed with AuxRegs.
One final operation has to be eliminated, so I picked A ← C op C below; so long as 0 is kept in one AuxReg, A ← C is still possible, at the cost of reducing the number of useful AuxRegs to 11.
With this proposal, the low-order 4 bit register displacements for the Ra, Rb, and Rc register addresses come directly from the opcode fields (ignored on stack options), while the high-order base register (L, X, A, or C) can be selected according to the output of a 4-input, 6-output PLA.
Since the RJB and QR formats have a 6-bit Rb field, they have no difficulty addressing the 60 registers in the above proposal.
Variation On {L, X} ← C op C
The {L, X} ← C op C combinations in the previous proposal don't produce many useful values. Essentially, this allows {L, X} ← C; not many interesting constants can be made from C op C (e.g., 8, 7, 6, 5, -4, and -3 are about the only ones). An improvement is to replace C op C by const0 op C2C3, where C2 and C3 are an 8-bit literal value [0..256) made by concatenating the two 4-bit Ra and Rb fields; this would allow {L, X} ← [-255..255] using RSUB and RADD.
Second Variation On RRR Format
Here is a variation on the previous proposal which modifies the old RRR format to allow intermixing AuxRegs and LocalRegs while converting 4 AuxRegs into constants and expanding the range of useful constant values. However, it does not increase the number of LocalRegs, a nice property of the previous scheme.
{L, X} op C is ruled out as in the previous proposal, so {L, X} — {3, 4, 5, 100000b, and 6 more unspecified constants} are impossible with this variation. Also, note how the range of small constants which can be loaded into {X, L} has been increased:
L = 16 LocalRegs
X = 12 more GlobalRegs (or LocalRegs) + 4 stack options
C = 16 constants
L ← L op L
L ← L op X
L ← X op L
L ← X op X
L ← C op L
L ← C op X
L ← 511 op C2C3
L ← const0 op C2C3
X ← L op L
X ← L op X
X ← X op L
X ← X op X
X ← C op L
X ← C op X
X ← 511 op C2C3
X ← const0 op C2C3
C2C3= [0..256); i.e., the two 4-bit Ra and Rb fields are treated literally.
C = 16 writable constants.
Then 0 op C2C3 and 511 op C2C3 can be used with RADD and RSUB to directly load [-255..+766] into any GlobalReg or LocalReg.
2. Branch Prediction
Perhaps backward branches can always be predicted "true" since a backward branch can usually be restricted to situations where the branch closes a loop. This means that the bit used to predict forward branches can be reused for jump displacement on backward branches, enlarging the jump range from [-128..+128) to [-256..+128).
3. Statistics from Puzzle and Richards Hand-Compiled Code
Richards Statistics January 6, 1986
instructions: 5748239, cycles: 9704271
bytes/inst: 2.25, cycles/inst: 1.69, cycles/reject: 5516.92
instBytesUsed: 12930949, instBytesFlushed: 7106579
euFetches: 2053083, euStores: 505694
goodPredictions: 543932, badPredictions: 218758, uncond jumps: 111046, calls: 185242
(fallThruGood: 279008, jumpGood: 264924, fallThruBad: 190492, jumpBad: 28266)
stackOver: 0, stackUnder: 0
instBufferCycles: 1133201, returnInterlockCycles: 151462, lookaheadRejectCycles: 595
IFU cache - probes: 6869620, misses: 107, mapMisses: 1, dirtyWrites: 0, rejectCycles: 755
miss rate: %0.00, reject cycles/probe: 0.00
EU cache - probes: 2558777, misses: 105, mapMisses: 4, dirtyWrites: 41, rejectCycles: 1004
miss rate: %0.00, reject cycles/probe: 0.00
746101 (12.9%): LRI0 678951 (24.7%): RB
316705 (30.3%): LC0 286298 (35.2%): LIQB
223549 (39.1%): SRI0 213676 (42.8%): JNEBBJ
185241 (46.1%): ALS 185240 (49.3%): RET
171093 (52.3%): JEBB 166833 (55.2%): RJNEB
159532 (57.9%): ROR 154861 (60.6%): LRI4
137952 (63.0%): LRI2 137434 (65.4%): LRI1
121859 (67.5%): SR0 121575 (69.7%): JNEBB
119432 (71.7%): LFC 116603 (73.8%): SHR
111044 (75.7%): JB 97532 (77.4%): LR0
88359 (78.9%): WB 72762 (80.2%): LRI3
69909 (81.4%): LR3 65804 (82.6%): LR2
65796 (83.7%): LR1 65790 (84.8%): SFCI
65790 (86.0%): SUBB 64400 (87.1%): WRI
58948 (88.1%): RJGEBJ 57303 (89.1%): SHD
52910 (90.1%): SRI4 47666 (90.9%): LC1
43068 (91.6%): RVADD 41899 (92.4%): ADDB
38918 (93.0%): RBC 38904 (93.7%): RRI
36422 (94.4%): RADD 28251 (94.8%): LIDB
25011 (95.3%): SRI3 23609 (95.7%): RXOR
23286 (96.1%): SRI2 23258 (96.5%): SR2
20322 (96.9%): LR7 20310 (97.2%): LRI5
20310 (97.6%): RJLEBJ 19595 (97.9%): LIB
18608 (98.2%): QADD 10127 (98.4%): DIS
10002 (98.6%): RVSUB 9999 (98.7%): QAND
9999 (98.9%): RJGEB 9555 (99.1%): SRI1
9310 (99.2%): SRI5 9306 (99.4%): SRI6
9300 (99.6%): LR6 9300 (99.7%): QBC
4656 (99.8%): LC3 2333 (99.9%): LR4
2331 (99.9%): LR5 2328 (99.9%): AS
340 (99.9%): SR4 253 (99.9%): SUBDB
253 (100.%): RJNEBJ 20 (100.%): BC
20 (100.%): ADD 20 (100.%): SFC
18 (100.%): LRI6 8 (100.%): SR3
6 (100.%): SR5 6 (100.%): SR7
6 (100.%): SRI7 5 (100.%): SIP
3 (100.%): LC2 3 (100.%): RJEBJ
2 (100.%): J1 2 (100.%): ASL
2 (100.%): WSB 2 (100.%): JDB
1 (100.%): LC4 1 (100.%): DUP
1 (100.%): ADDDB 1 (100.%): SHL
1 (100.%): X377
Puzzle Statistics September 16, 1985 8:05:02 pm PDT
instructions: 7363900, cycles: 13552239
bytes/inst: 2.24, cycles/inst: 1.84, cycles/reject: 8.94
instBytesUsed: 16502105, instBytesFlushed: 1892746
euFetches: 1891930, euStores: 49137
goodPredictions: 1592802, badPredictions: 116532, uncond jumps: 25523, calls: 21633
(fallThruGood: 40889, jumpGood: 1551913, fallThruBad: 13350, jumpBad: 103182)
stackOver: 8, stackUnder: 7
instBufferCycles: 2461215, returnInterlockCycles: 14, lookaheadRejectCycles: 728
IFU cache - probes: 9817842, misses: 167, mapMisses: 1, dirtyWrites: 0, rejectCycles: 1175
miss rate: %0.00, reject cycles/probe: 0.00
EU cache - probes: 1941067, misses: 197863, mapMisses: 5781, dirtyWrites: 17890, rejectCycles: 1515513
miss rate: %10.19, reject cycles/probe: 0.78
869276 (11.8%): RADD 838135 (23.1%): RX
816927 (34.2%): RJGEBJ 784997 (44.9%): JNEBBJ
779317 (55.5%): LR0 767048 (65.9%): SHL
759987 (76.2%): QADD 758398 (86.5%): LRI3
189481 (89.1%): LRI2 82554 (90.2%): QBC
80592 (91.3%): LC0 64124 (92.2%): ROR
54235 (92.9%): JEBB 52931 (93.6%): QRX
51545 (94.3%): LIDB 41887 (94.9%): ADD
41255 (95.5%): JEBBJ 34841 (95.9%): WSB
32652 (96.4%): LIB 29320 (96.8%): LRI0
25497 (97.1%): JB 21633 (97.4%): LFC
21618 (97.7%): ALS 21612 (98.0%): LIQB
21594 (98.3%): RET 19582 (98.6%): LR3
19371 (98.8%): RRX 13503 (99.0%): ADDB
12419 (99.2%): LC1 11646 (99.3%): RJLEBJ
9657 (99.5%): DUP 7631 (99.6%): SRI0
6033 (99.6%): SRI2 4025 (99.7%): SUBB
4002 (99.8%): RSB 3992 (99.8%): BC
3992 (99.9%): LR4 2005 (99.9%): DIS
2004 (99.9%): SR4 544 (99.9%): RVADD
270 (99.9%): RJNEBJ 265 (99.9%): SR0
257 (99.9%): LRI1 257 (99.9%): SRI1
253 (99.9%): SRI3 253 (99.9%): SUBDB
72 (99.9%): ADDDB 47 (100.%): SIP
46 (100.%): LIP 38 (100.%): RETN
35 (100.%): WAI 34 (100.%): WB
33 (100.%): RAI 33 (100.%): WRI
32 (100.%): RVSUB 30 (100.%): SHR
18 (100.%): PSB 15 (100.%): JS
15 (100.%): RETK 11 (100.%): JDB
7 (100.%): SUB 3 (100.%): AS
2 (100.%): ASL 2 (100.%): CST
2 (100.%): RSUB 2 (100.%): RJNEB
2 (100.%): JNEBB 1 (100.%): LC3
1 (100.%): LR2 1 (100.%): RXOR
1 (100.%): X377
Coalesced statistics
Richards Puzzle

RJNEBx 167086 272
RJGEBx 68947 816927
RJLEBx 20310 11646
RJEBx 3 0
RJLBx 0 0
RJGBx 0 0
JNEBBx 335251 784999
JEBBx 171093 95490

LRI0-6 1269438 977456
SRI0-7 352933 14174
RSB 0 4002
PSB 0 18
RB 678951 0
WB 88359 34
WSB 2 34841
WRI 64400 33
RRI 38904 0
RAI 0 33
WAI 0 35
RRX 0 19371
QRX 0 52931
RX 0 838135
LGF 0 0
IOS 0 0
IOL 0 0
ION 0 0

SR0-7 145477 2269
LR0-4 333327 802892
LC0-4 369032 93012
DUP 1 9657

LIQB 286298 21612
LIDB 28251 51545
LIB 19595 32652
SUBB 65790 4025
ADDB 41899 13503
SUBDB 253 253
ADDDB 1 72

DFC 0 0
LFC 119432 21633
SFCI 65790 0
SFC 20 0
RET 185240 21594
RETN 0 38
RETK 0 15

ALS 185241 21618
AS 2328 3
DIS 10127 2005
ASL 2 2
AL 0 0

SHR 116603 30
SHD 57303 0
SHL 1 767048
RFU 0 0
FSDB 0 0

JB 111044 25497
JDB 2 11
JQB 0 0
J5 0 0
J3 0 0
J2 0 0
J1 2 0
JS 0 15

RVADD 43068 544
RVSUB 10002 32
RUADD 0 0
RUSUB 0 0

RBC 38918 0
QBC 9300 82554
BC 20 3992

RADD 36422 869276
QADD 18608 759987
ADD 20 41887
RSUB 0 2
QSUB 0 0
SUB 0 7

RXOR 23609 1
ROR 159532 64124
QOR 0 0
OR 0 0
RAND 0 0
QAND 9999 0
AND 0 0
EXDIS 0 0

RLADD 0 0
QLADD 0 0
LADD 0 0
RLSUB 0 0
QLSUB 0 0
LSUB 0 0

SIP 5 47
LIP 0 46
CST 0 2
X377 1 1
The Puzzle statistics were taken before the QR opcode format was changed to allow pushing const0 op Rb, const1 op Rb, and [S] op Rb and before the bounds-check opcodes were changed to use cardinal comparisons. Presumably, these changes have little effect on the statistics.
The subsections below try to discover useful changes in the instruction set derivable from these statistics.
Infrequently Executed Opcodes--Candidates for Removal
Of the unused instructions and instructions with less than 100 executions in both programs (i.e., less than 0.001% of all instructions executed), LIP, SIP, CST, DFC, SFC, RETK, J1, JDB, JQB, JS, RETN, KFC, RLADD, QLADD, LADD, RLSUB, QLSUB, LSUB, RFU, FSDB, RUADD, RUSUB, RJEBx, RJLBx, RJGBx, IOS, IOF, and ION are either needed for completeness or cannot be eliminated from the statistical results because the two programs for which statistics are available did not use the features for which these instructions are needed.
Of instructions either unused or accounting for less than 100 executions in both programs, we can consider eliminating RSUB, QSUB, SUB, QOR, OR, RAND, AND, EXDIS, J2, J3, J5, ADDDB, AL, ASL, LC5-7, LR8-15, SR8-15, LRI8-15, SRI8-15, PSB, LGF, WAI, and RAI. SUBDB can be added to this list if instructions with less than 300 executions in both programs are also included; all other instructions (except Xops) have more than 2,000 executions in one or both benchmarks. In other words, each instruction not mentioned above accounts for at least 0.015% of all instructions executed.
My preference would be to ordinarily keep the RRR format opcode and discard the two-byte QR and one-byte stack format special cases in situations where it is desirable to keep only one of the three variations because all are rare; in accordance with this, it is reasonable to eliminate QSUB, SUB, QOR, OR, and AND, but RSUB and RAND should be retained.
The instruction counts also suggest that J2, J3, and J5 can be eliminated because these can all be replaced by JB at the cost of 1 code byte and 1 cycle; J1 is necessary because it is the only one-byte no-op. In normal situations, one expects Jn to be used to skip n - 1 following code bytes, but Russ Atkinson is concerned about the possibility sometime in the future of using these faster jumps to skip over unused bytes at the beginning of a loop, so that the loop starts on the first byte of a word; but in this case J2 and J3 can still be replaced, respectively, by the no-ops QOR[[S], [S], [S]] and ROR[[S], [S], [S]]. Consequently, J2 and J3 are good candidates for removal. Note, however, that the group of J1, J2, J3, and J5 takes only one line in the IFU's PLA, so it has low hardware cost; keeping only 1, 2, or 3 of these instructions would require another line in the PLA.
But caution seems advisable in considering removal of the other infrequently executed instructions. Here are some thoughts about these other operations:
LC5-7 and LR8-15 could be replaced by QOR[const0, anyreg] at the cost of one byte.
EXDIS and SR8-15 could be replaced by ROR[belowDst, topSrc, popSrc] at the cost of two bytes.
ADDDB/SUBDB could be replaced by LIDB and ADD/SUB at the cost of 1 cycle and 1 byte.
The 3 instructions for manipulating S are DIS (common), ASL (rare), and AS (common). ASL and AS are interchangeable when the distance between L and S is known, so both instructions are needed iff the distance between L and S isn't known some of the time; note that RET and RETK perform the same stack adjustment function as ASL at procedure exit. I have been told that the compiler doesn't always know stack depth, so AS is needed to trim the stack in some local computations, and ASL to cleanup. Hence, none of these instructions is a good candidate for removal.
The 2 instructions for manipulating L are AL (presently rare) and ALS (presently common); ALS is needed at procedure entry to adjust L relative to S, and ALS can be used instead of AL whenever the distance between L and S is known, so AL is needed iff the distance between L and S is sometimes unknown, but this situation occurs during frame overflow according to Atkinson, so both AL and ALS are needed. (With a procedure call change discussed later, it might be possible to retain AL and eliminate ALS.)
LRI8-15[disp] and SRI8-15[disp]. I think the hardware wants to keep 4, 8, or 16 of these instructions, and the statistics suggest that keeping at least 8 of each is desirable, but the two benchmarks used no more than 7 local registers. LRIn[disp] can be replaced by QRX[pushA0, regn] for disp in [0..+4] at no cost, and it can be replaced by LLn, RB[disp] at the cost of 1 byte and 1 cycle for disp in [5..255]. SRIn[disp] can be replaced by LLn, WB[disp] at the cost of 1 byte and 1 cycle.
PSB[disp] cannot easily be replaced assuming the current design which does not save any registers above the top-of-stack when processes are switched. However, it would be possible to keep PSB, eliminate WSB and use PSB, DIS to replace WSB; but WSB is common, so this is probably not a good idea.
RAI[disp] and WAI[disp] can only be replaced by three instruction sequences (two instructions for disp in [0..4]), so they are not good candidates for removal unless the use of auxiliary registers as base registers for memory references will for sure be rare.
LGF[alphaBeta] is used to push a GF pointer onto the stack if aux0 points at the GFT; if this implementation is used, then LGF seems useful.
In summary, the best candidates for removal from the instruction set based upon the statistics above seem to be QSUB, SUB, QOR, OR, AND, J2 and J3; I would throw in QAND, which is not rare but is used only once in the sample programs.
The hardware probably wants the four opcode groups LRn, SRn, LRIn, and SRIn to be handled symmetrically, but pruning these to only 8 opcodes each seems undesirable unless there are alternative uses for all of these opcodes and their rareness can be confirmed. In addition, it would be desirable to have a two-byte alternative to SRn (discussed below) and three-byte, no-lost-cycles alternatives to LRIn and SRIn. Since better alternative opcodes have not been proposed and since the rareness of references to reg8 to reg15 hasn't been confirmed, we should retain all of these instructions.
Candidates for New Instructions
This section considers the usage of some commonly executed 5-byte, 3-byte, and 2-byte opcodes for which it is plausible to provide, respectively, shorter 3-byte, 2-byte, or 1-byte opcodes to handle the most frequently occurring cases.
1) Statically, the JNEBBx and JEBBx opcodes were used once to compare against 3, 9 times to compare against 1, and 18 times to compare against 0 in the benchmark sources, so two-byte zero-comparison instructions would reasonably save 1 byte on about 2/3 of the executions for these instructions; this would be about 2/3 x 1,386,833 bytes, reducing the number of bytes executed by over 3%. In addition, many comparisons against 1 might be boolean TRUE, and these could be replaced by zero coparisons. Altogether, providing four two-byte zero comparison opcodes JNZBx and JZBx would reduce the number of bytes executed in the benchmarks about 4%.
2) 12 of the 20 occurrences of RJxxBx opcodes in the benchmark sources compared [S] against [S-1] and removed both values from the stack. If these statistics are reflected dynamically, the number of code bytes executed can be reduced more than 2% by providing two-byte conditional jumps for the specialization of the 12 RJxxBx opcodes in which both arguments are popped from the stack.
3) The ROR opcode is most commonly used in the forms regn ← const0, regn ← regm, and regn ← const0 or constX; in other words, it is used to move data from one local register to another or from a constant to a local register. All 223,656 instances of the ROR opcode in the two benchmark programs were of this form. 2-byte opcodes for these common cases would reduce the number of code bytes executed by 0.8% in the benchmark programs. Namely, opcodes which split Alpha into two 4-bit fields and do LR[alpha left] ← LR[alpha right] and LR[alpha left] ← LR[const(alpha right)].
Since there are only 12 constants, 4 stack options can also be provided with the second form; this would allow PLn (Put Local n = regn ← [S] without popping) in two bytes, which is useful, and it would mean that SRn can be replaced by a 2-byte opcode rather than a 3-byte opcode, if it were ever desired to deimplement SRn for some reason.
4) About half of the static occurrences of ADDB in the benchmark sources were ADDB[1] and about one-third of the occurrences of SUBB were SUBB[1]; if INC and DEC opcode were added to save one byte in these cases, and if half of the dynamic occurrences of ADDB and SUBB were ADDB[1] and SUBB[1], respectively, then the number of bytes executed would be reduced about 125,217/2 or 0.2%.
5) 2 of the 3 uses of RXOR in the benchmarks were [S] ← [S] xor minus1; a one-byte NOT opcode would save 23,609 x 2 bytes, reducing the number of bytes executed by up to 0.1%.
6) RADD, which is very common in the benchmarks, occurred in the forms: push(regm + regn), push(regm + constn), and regm ← regm + constn (Note that QADD is used instead of RADD for push([S] + regm) and [S] ← [S] + regm.). Some 2-byte instructions for these specializations of RADD are worth considering.
5. Instruction Length Decoding Change
The instruction length and PC displacement for jumps (i.e., the information needed by the IFU to operate the instruction pipeline) is determined from the top 3 bits of the instruction as follows:
Instruction length:
000 => 1 byte [000B..037B] (8 used, 24 unused)
001 => 5 bytes [040B..077B] (4 used, 28 unused)
01- => 1 byte [100B..177B] (49 used, 15 unused)
10- => 2 bytes [200B..277B] (59 used, 5 unused)
11- => 3 bytes [300B..377B] (48 unused, 9 unused, 6 undefined behavior)
PC displacement:
0------- PC + stack  JS
10------ PC + alphaS  JB
110----- PC + alphaBetaS JDB and LFC
111----- PC + betaS  RJB's and JBB's
It is of some concern that most new instructions considered above and below are 2 or 3 bytes long, and these categories have the fewest available slots. It is desirable to rearrange the method of instruction decoding so that many of the relatively useless one-byte and 5-byte Xops become 2 or 3-bytes. For example, what about the following?
Instruction length:
0000 => 5 bytes [000B..017B] (4 used, 12 unused)
0001 => 1 byte [020B..037B] (8 used, 8 unused)
0010 => 2 bytes [040B..057B] (0 used, 16 unused)
0011 => 3 bytes [060B..077B] (0 used, 16 unused)
01- => 1 byte [100B..177B] (49 used, 15 unused)
10- => 2 bytes [200B..277B] (59 used, 5 unused)
11- => 3 bytes [300B..377B] (48 unused, 9 unused, 6 undefined behavior)
This change converts 32 relatively useless 1-byte and 5-byte instructions into the length 2 and 3 categories where new instructions are more likely to be wanted.
6. Other Instruction Proposals
Here is an assortment of other instructions that have either seemed relatively useful or not been completely rejected. I have attempted to include even ones of relatively low utility.
R0  = RB 0. Saves 1 byte when displacement is 0.
RS0  = RSB 0
W0  = WB 0
PS0  = PSB 0
WS0  = WSB 0
INC  = ADDB[1]
DEC  = SUBB[1]
NOT  [S] ← [S] xor -1. Saves 2 bytes over RXOR.
NEG  [S] ← 0 - [S] - Carry; trap on overflow; Carry ← 0.
  Saves 1 byte over QSUB[0, [S]].
DBL  [S] ← [S] + [S] + Carry; trap on overflow; Carry ← 0.
  Useful for inline multiply.
PDBL  [S+1]+ ← [S] + [S] + Carry; Carry ← 0; trap on overflow.
  Useful for inline multiply.
QVADD
QVSUB  Primarily useful for regm ← regm +/- 1 without overflow checking;
  saves 1 byte over RVADD/RVSUB.
Two-byte SHL for logical shifting left or right.
Two-byte ASHL for arithmetic shifting left or right (to multiply or divide by 2n).
RXNOR  Rc ← Ra xor Rb'; saves 3 bytes and 1 cycle over RXOR, RXOR.
RNOR  Rc ← Ra or Rb'.
RNAND  Rc ← Ra & Rb'.
ADDQB  Add quad-byte; [S] ← [S] + abcd + Carry; trap on overflow; Carry ← 0;
  (saves 1 byte and 1 cycle over LIQB, ADD or LIQB, SUB).
EXCH  Interchanges [S] and [S-1]. Saves 2 cycles and 4 bytes over
  ROR[Ln ← [S-1] or [S-1]]; EXDIS; LLn.
RMAGADD Register Magnitude Add. Rc ← Ra + Magnitude(Rb). If Rb < 0,
  compute Ra — Rb else Ra + Rb; trap on integer overflow.
  (Since the "invert" signal for Rb is needed in advance of when
  the register data is available, may require 2 cycles.)
RZVSUB  Register Zero or Subtract. Rc ← if Ra > Rb then (Ra — Rb) else 0.
  Then RVADD(Rc ← Rc + Rb) gives Max(Ra, Rb), or
  RVSUB(Rc ← Ra — Rc) gives Min(Ra, Rb) in 2 instructions.
  Might require 2 cycles.
RSI  Register Store Increment. Mem(Rc ← (Ra + 1)) ← Rb.
RSD  Register Store Decrement. Mem(Rc ← (Ra — 1)) ← Rb.
RFI  Register Fetch Increment. Rc ← (Ra + 1) in cycle 1;
  then Rb ← Mem(Rc) in cycle 2.
RFD  Register Fetch Decrement. Rc ← (Ra — 1) in cycle 1;
  then Rb ← Mem(Rc) in cycle 2.
RDI  Register Double Indirect. In cycle 1 push
  Mem(RLaleft + aright); in cycle 2 push Mem([S] + b)
RLIB  Register Load Immediate Byte. LR(aleft) ← b.
JS  Jump Stack. The current SJ opcode sends control to PC+[S],
  inconvenient in some situations because the pointer at [S] isn't
  valid except from one place in the code. So it isn't possible to
  compute a place to go and store it in a variable somewhere for
  later use.
The current instruction set has single-instruction ways to read with displacements [-2..256) relative to either [S] or a local register (-2 and -1 being provided by using constM1 and constM2 with RRX or QRX); and it has single-instruction ways to write with displacements [0..256); and it has LGF to read with displacements [0..177777B] relative to aux0. However, displacements outside this range require an extra instruction.
RDB  Read Double Byte. [S] ← Mem([S] + AlphaBetaZ). Saves an instruction when
  the base pointer is already at [S].
WDB  Write Double Byte. Mem([S] + AlphaBetaZ) ← [S-1]; S ← S-1.
RJAZB, RJAZBJ
RJANB, RJANBJ These conditional jumps jump if (Rs & Rb) = 0 or (Rs & Rb) # 0;
  they save 2 bytes and 1 cycle whenever any register is AND'ed
  with 1 prior to a conditional jump. If these instructions are
  added, deimplementing QAND may be reasonable (only use of
  QAND was in this context).
Maybe SFCI[] should be replaced/supplemented by an SFCI[alpha] which Calls [S + alpha]^? Maybe it should be supplemented by one which does push[alpha], push[alphabeta], and/or push[alphabetagammadelta] as well as the call?
7.0 Calling System Procedures
One possible way to produce more compact programs is to use some XOPs to produce 1-byte calls to compiler runtime procedures, saving 4 bytes/call. 2, 3, or 5-byte XOP's can be used to load an immediate argument in the same instruction as a call, saving an additional byte and 1 cycle. For example, read-field, write-field, multiply5 to multiply32, block zero, block transfer, block compare, byte move, and byte compare are common runtime procedures which might warrant XOP calls. However, because the distance between XOP dispatch locations is only 16 bytes long, an additional jump (probably JQB) to the code for the function will often be required, so the compaction of the XOP call is achieved at the cost of about 1 cycle slower execution.
In addition, if unused opcodes in the instruction set are consumed for new instructions, XOPs may become too few to cover the most common runtime procedures. If this happens, an alternative is to use a 3-byte XOP, which pushes AlphaBeta, followed by a SJ to send control to the appropriate procedure, or perhaps a 2-byte XOP, which pushes Alpha followed by a shift and a SJ to the appropriate procedure; in either case, compaction is reduced and the 5-cycle SJ imposes a speed penalty. For this reason, using XOPs for compaction may not be useful. If the use of XOPs for compaction is judged to be desirable (Russ Atkinson doesn't think it will be significant.), then a special instruction for system dispatch is worth considering. The system dispatch instruction variations below aim at eliminating some of the dispatching that is required with XOPs or in automatically doing an ALS instruction at entry to the runtime procedure.
SYSB  2 bytes. Call the procedure at TrapBase + 16 * a.
  Like an XOP but requires only 1 opcode in the instruction set
  for up to 256 procedures.
SYSDBx  3 bytes. Call the procedure at TrapBase + ab.
  Like an XOP but avoiding an extra jump from the trap location
  to the code body. -or-
SYSDBy  3 bytes. Call the procedure at TrapBase + (ab & 177774B);
  execute ALS[-(a & 3B)] while trapping (= procedure call
  with 1, 2, 3, or 4 arguments); -or-
SYSDBz  3 bytes. Trap to TrapBase + (ab & 177770B); execute
  ALS[-(a & 7B)] while trapping (= procedure call with 1 to 8
  arguments).
KFC is even more cumbersome than XOP because it will be necessary to first push an entry point number, then execute the KFC, and at entry to the KFC, bounds-check the entry point number, shift it by a constant and do a SJ and a JDB or JQB to the appropriate procedure. For this reason, the following redefinitions of KFC are worth considering:
KFCx  Trap to TrapBase + 16 * a in kernel mode (replacement
  for KFC). Eliminates the bounds-check, shift, and SJ required to
  dispatch after the current KFC.
KFCy  Trap to TrapBase + 16 * a in kernel mode (replacement
  for KFC); also execute ALS[0] so that kernel calls with 1
  argument need not execute an ALS at entry. Kernel procedures
  requiring a different number of arguments can execute AL at
  entry.
8.0 Changes to Procedure Call and Return Instructions
Combining some instructions which are done at the time of procedure call and at the time of procedure return may also be useful. For example, at procedure entry most procedures do ALS[1 - nargs] (But some procedures use the enclosing context and don't do this.) and at procedure return, it is common to store a result into local 0 before doing a RET[nResults - 1].
All procedures in GenPuzzle and GenRichards which returned a result stored a value into reg0 immediately prior to RET; most used SRn[reg0], while a few used ROR[reg0, const0, localReg] prior to the RET. A supplementary instruction combining reg0 ← AnyReg or reg0 ← [S]- with RET would save 1 or 2 code bytes and 1 cycle in each case. AnyReg could be encoded using a 3-byte instruction; or in a 2-byte instruction, 6 bits in Alpha can encode AnyReg while 2 bits encode the change in S to 0 to 3 (i.e., returning 1 to 4 results); or ....
RET1 = RET[0] (i.e., a procedure returning 1 result) and RET0 = RET[377B] (i.e., a procedure returning 0 results) may be common enough to warrant 1-byte instructions for compaction.
It is tempting to combine the ALS[1 - nargs] which is presently done at procedure entry with the procedure call to save an instruction. However, the current procedure calls which do not do this are also needed because some short inline procedures operate on the stack without ever building a local frame and because some procedures may use variables in the enclosing context. For this reason, the procedure call variations discussed here should be supplementary to the current procedure calls.
It is likely that all procedures will be forced to begin on word, doubleword, or quadword boundaries because that restriction will reduce the number of cache misses at the beginning of the procedure. If procedures are required to begin at the first byte of a word and if the relative jumps and calls (Jn, JB, JDB, JS, and LFC) are redefined to compute displacement relative to the first byte of the word containing the PC, then it will be possible to free the two low-order bits of LFC and DFC displacement for another use; namely, these bits can specify 1 - nargs; and SFC and SFCI variations in which 1 - nargs is specified in an a byte can be added. So LFCn, DFCn, SFCn, and SFCIn can all specify 0 to 3 arguments; for a larger number of arguments, use AL at entry to change the position.
9.0 Possible Changes To Trapping Opcodes
Maybe the bounds check, Lisp NaN, and Overflow traps should be eliminated in favor of treating the trap like an Xop; in other words, opcodes which give rise to these events would be treated like conditional jumps which are always predicted false and which have the Xop trap location as the alternate address. In these cases, pushing a, ab, or abcd like an Xop, not disabling traps, and not entering Kernel mode are all improvements, albeit unimportant ones, on the current scheme. These are improvements because the software need not examine the opcode at PC to determine exactly what instruction caused the trap and because there is no reason to enter the Kernel on these events.
The efficiency of the bounds check or overflow trap handling procedure will probably be unimportant because these abnormal events will call the signaler and probably terminate the program. However, the NaN trap may occur frequently, making a generalized trap handler too slow. One alternative is to change QLADD or RLADD into a 5-byte instruction for which the last 3 or 2 bytes are ignored and passed over on a non-trapping execution. These 3 or 2 bytes encode a JDB or JB to a specialized trap handler within the same procedure; on a trap, the NaN trap handler can point the PC there and return.
Another possibility is to define a RJLADD conditional jump instruction which stores the result of a Lisp add ala QR format instructions, is always predicted not to jump, and jumps on a NaN trap.
A possible problem is that a ModeFault trap might happen on RRR format opcodes which write into aux0-7 or a constant register; I am not sure how to fit the possibility of another kind of trap into the scheme.
10.0. Register Decoding
According to McCreight, the computation for each of the three register addresses of an instruction is identical; i.e., a single circuit producing the union of all the addressing possibilities is replicated three times. That circuit allows the low four bits of each register address to be taken from any of the following:
alpha
beta (probably not presently used)
alpha[0..3]
alpha[4..7]
beta[0..3]
beta[4..7]
opcode[4..7]
and it allows the top 4 bits of register address to selected from any of the following:
0
LocalBase
ConstantBase
AuxiliaryBase
Stack