
The instruction fetch unit, or IFU, decodes a stream of bytes from memory into a sequence of 8-bit opcodes and operands using a writeable decoding memory, and presents the results to the processor for efficient interpretation. The next section contains an overview of IFU function, supplemented by details in later sections.

The IFU handles four independent instruction sets. Opcodes are 8-bit bytes, which may be followed in memory by 0, 1, or 2 operand bytes. Hence, the total length of an operation is 1, 2, or 3 bytes. The first operand byte is called a, the second b.

One method of dealing with operations longer than 3 bytes is to encode them in IFUM as 1-byte jumps to the next operation. This gives up the possibility of referencing N, a, or b with ←Id but avoids having to restart the IFU. The processor then must compute the proper place in the instruction stream and reference a, b, g, etc. without help from the IFU.

The term PC refers to the displacement of an opcode byte from the codebase, which is BR 31. PC’s are 16-bit items, where 0:14 are an unsigned word displacement relative to the codebase, and bit 15 selects the byte. In other words, codebase points at a 32k segment of virtual memory; a PC selects a byte in this segment. The PC’s are named PCF, . . ., PCM, and PCX, where the final letter in the name denotes the level in the IFU pipeline.

Since the IFU’s PC is only 16 bits, overflowing either end of the code segment causes wraparound. This programming error is not detected by the hardware.
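
As a rough illustration of the PC layout and the wraparound just described, the following Python sketch splits a PC into its word displacement and byte select and advances it modulo 2^16. The function names are invented for this example, and it assumes the manual's bit numbering, in which bit 0 is the most significant bit.

```python
PC_MASK = 0xFFFF  # the IFU's PC is 16 bits wide


def split_pc(pc):
    """Return (word_displacement, byte_select) for a 16-bit PC.

    Assumes bit 0 is the most significant bit, so PC[0:14] is the
    upper 15 bits and PC[15] is the least significant bit.
    """
    pc &= PC_MASK
    return pc >> 1, pc & 1


def advance_pc(pc, nbytes):
    """Advance a PC by nbytes, wrapping silently at either end of the segment."""
    return (pc + nbytes) & PC_MASK
```

For instance, advancing the last byte of the segment by one yields PC 0, which mirrors the undetected wraparound noted above.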

For Alto compatibility reasons, we currently have the following kludge. Instruction sets 0 and 1 treat byte 0 in the selected word as bits 0:7, 1 as bits 8:15; instruction sets 2 and 3 treat byte 0 as bits 8:15, 1 as 0:7. Eventually, this may be changed so that all instruction sets use 0 for the byte in 0:7 and 1 for 8:15.
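
The byte-order rule above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the word is modeled as a 16-bit integer whose bits 0:7 are the high-order byte.

```python
def select_byte(word, byte_select, ins_set):
    """Pick the byte addressed by PC[15] out of a 16-bit word.

    Instruction sets 0 and 1: byte 0 is bits 0:7, byte 1 is bits 8:15.
    Instruction sets 2 and 3: byte 0 is bits 8:15, byte 1 is bits 0:7.
    Bit 0 is taken to be the most significant bit of the word.
    """
    if ins_set not in (0, 1, 2, 3):
        raise ValueError("ins_set must be 0..3")
    if byte_select not in (0, 1):
        raise ValueError("byte_select must be 0 or 1")
    swapped = ins_set >= 2
    take_high = (byte_select == 0) != swapped
    return (word >> 8) & 0xFF if take_high else word & 0xFF
```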

The IFU is started by first selecting an instruction set (InsSetOrEvent←B function) and then loading the F-level PC (PCF←B function). The IFU then starts fetching the byte stream starting at the word BR[31] + PCF[0:14], byte PCF[15], from the cache and prepares opcodes for interpretation by the processor.

Bytes from the cache then march through the IFU pipeline beginning with the F and G full-word buffer registers on the MemD board; single bytes from F/G then move into J or H on the IFU board. InsSet[0:1] and the opcode byte in J address the decoding memory, IFUM, a 1024-word x 24-bit (+3 parity) RAM containing the information in the table below. Although IFUM is writeable, it will normally be loaded with the microprogram and not subsequently changed (diagnostics are, of course, an exception).

IFaddr’ (10 bits): TNIA[4:13] of the first instruction to be executed in interpreting this opcode (TNIA[14:15] come from the IFUJump in the exit of the previous opcode).

Length’, TPause’, TJump’, Sign, Packeda, and N are used by the IFU to prepare operands and to sequence correctly to the next opcode; IFaddr’ is passed to the control section; and the processor uses MemB and RBaseB’ to initialize MemBase and RBase when the microcode for the opcode commences.
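
Since IFUM holds 1024 words and is addressed by InsSet[0:1] together with the 8-bit opcode in J, its addressing can be pictured as below. This is a sketch under an explicit assumption: the text does not say which address bits InsSet occupies, so placing it in the two high-order bits is a guess made for illustration, and the function name is hypothetical.

```python
IFUM_WORDS = 1024  # 4 instruction sets x 256 opcodes


def ifum_address(ins_set, opcode):
    """Form a 10-bit IFUM address from InsSet[0:1] and the opcode byte.

    ASSUMPTION: InsSet supplies the two high-order address bits; the
    manual only states that the two values together address IFUM.
    """
    if not 0 <= ins_set <= 3:
        raise ValueError("ins_set must be 0..3")
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must be an 8-bit value")
    return (ins_set << 8) | opcode
```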

Length’ determines the number of operand bytes; a for a two or three-byte instruction will be in H, while b for a three-byte instruction will be in F/G, when the assembled instruction is ready to proceed. The assembled instruction and a then drop into the M level.

IFUJump[n] (see "Control Section") transfers control to the starting instruction for the opcode assembled in M, where TNIA[4:13]←IFaddr, TNIA[14:15]←n (n is 0 to 3) is the location of the entry instruction. A 4-long entry vector, rather than a single starting address, can be utilized for faster execution, as discussed later. IFaddr may be overruled by a trap address when appropriate.
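
The way IFaddr and n combine into the entry address can be shown with a few lines of Python. This is an illustrative sketch rather than a description of the hardware: the function name is made up, IFaddr is taken in its true (non-inverted) form, and the result is TNIA[4:15] viewed as a 12-bit integer.

```python
def entry_address(ifaddr, n):
    """Return TNIA[4:15], where TNIA[4:13] <- IFaddr and TNIA[14:15] <- n."""
    if not 0 <= ifaddr <= 0x3FF:
        raise ValueError("IFaddr is a 10-bit field")
    if not 0 <= n <= 3:
        raise ValueError("n must be 0..3")
    return (ifaddr << 2) | n
```

The four entries of one opcode therefore occupy four consecutive locations starting at a multiple of 4.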

At t0 of the starting instruction, the processor initializes RBase to RBaseB (i.e., to 0 or to 1) and MemBase to 0..MemBX[0:1]..MemB[1:2] if MemB[0] = 0, or to 34₈+MemB[1:2] if MemB[0] = 1. MemBX is interpreted as a stack pointer to a 4-entry stack with 4 base registers in each entry, and MemB[1:2] in IFUM select a particular base register from the current entry. The MemBX kludge may reduce computation on procedure call/return, as discussed later. Other information about the opcode and a are copied into the X level.
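
The MemBase initialization rule can be restated as a small Python function. This is a hedged sketch: the name and argument encoding are invented here, MemB is passed as a 3-bit integer with MemB[0] as its most significant bit, and the constant is read as octal 34.

```python
def initial_membase(memb, membx):
    """Compute the 5-bit MemBase loaded when an opcode starts.

    memb  -- 3-bit MemB field from IFUM (MemB[0] is the high-order bit)
    membx -- 2-bit MemBX register
    """
    if not 0 <= memb <= 7:
        raise ValueError("MemB is a 3-bit field")
    if not 0 <= membx <= 3:
        raise ValueError("MemBX is a 2-bit register")
    low = memb & 0b011                # MemB[1:2]
    if (memb & 0b100) == 0:           # MemB[0] = 0
        return (membx << 2) | low     # 0..MemBX[0:1]..MemB[1:2]
    return 0o34 + low                 # octal 34 + MemB[1:2]
```

With MemB[0] = 0 the result stays within the 16 base registers selected through MemBX; with MemB[0] = 1 it lands in the four registers at octal 34 to 37.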

Instructions that implement the opcode then reference operands in sequence using the A←Id, RisId, or TisId operations discussed in "Processor Section" or the IFetch← operation discussed in "Memory Section," which read operands from the X level. The operand sequence delivered by the IFU in response to ←Id is as follows:

Regular and pause opcodes have an optional 4-bit operand N that is delivered first (N isn’t supplied when N = 17₈). This is followed by a and b, if they exist; a is sign-extended when Sign = 1 or split into two 4-bit nibbles if Packeda = 1. Subsequently, ←Id delivers Length. For jumps, all of these operands are consumed in computing the jump displacement, and ←Id delivers Length.
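
A minimal Python sketch of this operand sequence for regular and pause opcodes follows. Several details are assumptions made for the example and are not stated in the text: the function and parameter names, sign extension to 16 bits, delivery of the high nibble of a before the low nibble, and giving Packeda priority if both Sign and Packeda are set.

```python
NO_N = 0o17  # N value meaning "no N operand"


def id_sequence(length, n, alpha=None, beta=None, sign=False, packed=False):
    """List the values successive <-Id references deliver (regular/pause)."""
    if length not in (1, 2, 3):
        raise ValueError("length must be 1, 2, or 3 bytes")
    out = []
    if n != NO_N:
        out.append(n & 0xF)                      # optional 4-bit N comes first
    if length >= 2:
        if packed:                               # ASSUMPTION: packed wins over sign
            out.append((alpha >> 4) & 0xF)       # a[0:3]
            out.append(alpha & 0xF)              # a[4:7]
        elif sign:
            # ASSUMPTION: sign-extend the byte to a 16-bit value
            out.append(((alpha ^ 0x80) - 0x80) & 0xFFFF)
        else:
            out.append(alpha & 0xFF)
    if length == 3:
        out.append(beta & 0xFF)
    out.append(length)                           # Length follows the operands
    return out
```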

The normal opcode references all of its N, a, and b operands; however, except on three-byte opcodes, the IFU hardware does not require that these operands be referenced—the processor could exit to the next opcode without reading all the operands, if that was desirable for some reason. However, for opcodes of length 3, the processor must consume the a byte with ←Id (both a[0:3] and a[4:7] if Packeda=1) before going to the next opcode with an IFUJump—it does not suffice to consume the last a byte with ←Id concurrent with IFUJump. An opcode must never do more than 7 ←Id’s for reasons that will be discussed later.

The types of opcodes are distinguished as follows: A pause has no successor, and the IFU must be restarted with PCF←B before the next IFUJump. A regular’s successor is the byte following its last operand; a jump’s successor is determined by adding a displacement to the current PC as follows:

The IFU pipeline follows the instruction stream and fills up when it is five or six bytes ahead of the current opcode. When a pause opcode is recognized, further memory references are not made. When a jump opcode is recognized in J, the IFU discards any bytes in F, G, and H and refills these pipe levels with bytes along the jump path.

The B←PCX’ function reads PC (inverted) for the current opcode. Note that PCF←B does not affect the value of PCX; B←PCX’ continues to read the displacement of the current opcode, which does not change until an IFUJump is done.

An opcode that conditionally jumps can be encoded in IFUM with type either jump or regular. If encoded as type jump, when the condition is false, the program must issue PCF←B to restart the IFU at the fall-through address. Similarly, if regular, PCF←B must be issued to restart at the jump address.

The Length argument delivered by ←Id after other operands have been referenced is useful in conditional jump calculations. Note that the fall-through address for a conditional jump is Length+PCX, so:
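
A minimal Python sketch of that calculation is given here; it is not the manual's microcode, and the function names are illustrative. It also checks that the form (Id)−(PCX’)−1, which appears in the entry-vector example, gives the same 16-bit result, because B←PCX’ reads the PC inverted and the one's complement of PCX equals −PCX−1.

```python
MASK16 = 0xFFFF


def fall_through_pc(length, pcx):
    """Length + PCX, wrapped to 16 bits."""
    return (length + pcx) & MASK16


def fall_through_from_inverted(length, pcx_inverted):
    """Same value computed from the inverted PC that B<-PCX' supplies.

    (Id) - (PCX') - 1 == Id + PCX  (mod 2**16), since PCX' = -PCX - 1.
    """
    return (length - pcx_inverted - 1) & MASK16
```

For example, with PCX = 0x0100 and Length = 2, both functions give 0x0102.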

Following PCF←B, the IFU flushes its pipeline. It is illegal for either the instruction containing PCF←B or the one immediately after it to do an IFUJump. Any subsequent instruction can issue an IFUJump, but the processor will spin uselessly at the IFU "NotReady" trap until the IFU is ready, which is the fifth cycle after PCF←B at the earliest and later for longer opcodes, cache misses, or Mar traffic.

IFUReset: Halt and clear the IFU pipeline, and clear errors, testing features, and BrkPending (i.e., BrkIns); the Reschedule condition and instruction set are not cleared.

IFUMLH←B: Load the high-order IFUM word from B (t1 to t3), where the Packeda and IFaddr fields are in the same form as B←IFUMLH’. There must be at least one intervening instruction after a preceding BrkIns← or InsSetOrEvent←.

IFUMRH←B: Load the low-order IFUM word from B (t1 to t3) in the format given below; there must be at least one intervening instruction after a preceding BrkIns← or InsSetOrEvent←:

BrkIns←B: Load BrkIns from B[0:7] at t3, and set BrkPending (ill-defined unless the IFU has been reset). BrkIns replaces the next opcode loaded into J; then BrkPending is cleared. BrkIns also addresses IFUM on IFUMLH/RH← and B←IFUMLH’/RH’.

InsSetOrEvent←B: If B[0]=1, then B[6:7] are loaded into the InsSet register at t3; if B[0]=0, then B[4:15] control event counters as discussed in the "Other IO and Event Counters" chapter. A following PCF←B starts the IFU interpreting using the new instruction set. Illegal except when the IFU is paused or reset or when PCF← will be done before the next IFUJump.

Reschedule: Cause a reschedule trap on the second or third "successful" IFUJump. "Successful" means that an IFUJump is not trapped for some other reason such as not-ready. The second IFUJump will be trapped if it does not occur in the instruction immediately after the first successful IFUJump; otherwise, the third successful IFUJump will be trapped. The trap instruction is executed as though it were the first instruction of the rescheduled opcode, and ←Id and IFUJump will work as though that opcode were in progress.

RescheduleNow: Guaranteed to trap the next successful IFUJump, so long as that IFUJump appears in the second cycle after RescheduleNow or later. The Reschedule branch condition is not affected.
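
The rule for which IFUJump the Reschedule function traps can be written out as a short Python sketch. The function name and the representation of time (one integer per instruction, listing only successful IFUJumps issued after Reschedule) are assumptions made for this example.

```python
def reschedule_trap_index(jump_instrs):
    """Return the 0-based index of the successful IFUJump that traps.

    jump_instrs -- ascending instruction numbers of the successful
                   IFUJumps that follow the Reschedule function.
    Returns None if too few IFUJumps have occurred for the trap to fire.
    """
    if len(jump_instrs) < 2:
        return None
    if jump_instrs[1] != jump_instrs[0] + 1:
        return 1        # second IFUJump is not back-to-back: it traps
    if len(jump_instrs) >= 3:
        return 2        # second was back-to-back: the third traps
    return None
```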

An IFUJump[n], encoded in the JCN field of the instruction, sends control to an address partly determined by the IFU and partly by the IFUJump clause. The four possible targets of an IFUJump are called an "entry vector".

An opcode leaves its results in one of several convenient forms agreed to by convention, then chooses an entry instruction in its successor with IFUJump[n], where n = 0 to 3. Every opcode in the instruction set must have an entry vector of the same length. Careful choice of forms may reduce execution time by one cycle for some opcodes without increasing execution time for successor opcodes.

A true branch condition (FF-encoded) with IFUJump prevents starting the next opcode. For example, IFUJump[2,condition] sends control to the next opcode’s entry 2, if condition is false, or entry 3, if condition is true. No other IFU activities associated with starting the new opcode take place when condition is true, so entry 3 is executed in the context of the opcode that did the IFUJump[2,condition]. The processor nevertheless initializes RBase and MemBase as though the next opcode were starting, so this part of the state is lost. Thus, at a cost of one entry instruction in every opcode of an instruction set, it may be possible to shorten the execution time of some opcodes using a conditional exit.

An opcode with common and uncommon exit cases, for example, can exit with IFUJump[2,condition], where entry 2, the common case, starts the next opcode, while entry 3 is reached for the uncommon case. Since IFUJump loads Link with .+1, entry 3 can either Return, to execute more code associated with the uncommon case, or it can do something more explicit, if an appropriate convention is followed by all opcodes.

The following example shows how an instruction set with four opcodes (Push, Add, Store, and JNZ) is implemented using a four-long entry vector. The opcodes in this example deal with the stack like Mesa opcodes do, and the first three entry conventions are, in fact, ones which might be used by the current Mesa emulator.

%Entry
0:  Stk[StkP] holds top-of-stack (if any; garbage if stack empty), T holds garbage.
1:  T and Stk[StkP-1] hold previous top of stack (garbage if stack empty),
    Stk[StkP] garbage, Md holds top-of-stack.
2:  T and Stk[StkP+1] hold top-of-stack,
    Stk[StkP] holds previous top of stack (garbage if stack empty).
3:  Results in same form as entry 2, but restart IFU at NewPC = (Id)−(PCX’)−1.
Note that Stack&+1 references must not check for underflow when the stack may legitimately be empty.
%

*Push the memory location pointed to by N.
Push:   Fetch←Id, T←StackNoUFL&+1, IFUJump[1];      *Entry 0
        Fetch←Id, T←StackNoUFL&+1←Md, IFUJump[1];   *Entry 1
        Fetch←Id, StkP+2, IFUJump[1];               *Entry 2
        T←(Id)−(PCX’)−1, StkP+1, Return;            *Entry 3

*Replace the top two stack entries by their sum.
Add:    T←Stack&−1, Branch[.+2];                    *Entry 0
        Stack←Md;                                   *Entry 1, falls into entry 2
        T←Stack&−1←T+(Stack&−1), IFUJump[2];        *Entry 2
        T←(Id)−(PCX’)−1, StkP+1, Return;            *Entry 3

*Store the top-of-stack into the memory location pointed to by N and pop the stack.
Store:  Store←Id, DBuf←Stack&−1, IFUJump[0];        *Entry 0
        Stack←Md, Branch[Storex];                   *Entry 1
        Store←Id, DBuf←T, IFUJump[0];               *Entry 2
        T←(Id)−(PCX’)−1, StkP+1, Return;            *Entry 3
Storex: Store←Id, DBuf←Stack&−2, IFUJump[2];

*Pop the stack and branch if the top-of-stack was zero, else fall through.
*This opcode is of type jump.
JNZ:    Pd←Stack&−1, Branch[ZTest];                 *Entry 0
        Pd←Md, StkP−1, Branch[ZTest];               *Entry 1
        Pd←T, Branch[ZTest];                        *Entry 2
        T←(Id)−(PCX’)−1, StkP+1, Return;            *Entry 3
ZTest:  T←Stack&−1, IFUJump[2,ALU#0];
*Return here when the jump doesn’t take.
        T←Stack&−1, PCF←T;
        IFUJump[2];

Push thus requires 1 execution cycle; Store and Add take either 1 or 2 cycles depending upon the entry point; JNZ takes 2 cycles when the jump takes or 9 cycles when the opcode falls through (because the IFU isn’t ready until the fifth cycle after PCF←B).

Although every opcode in an instruction set must have an entry vector following the same conventions, it is not necessary that the vector be four-long. In the above example, a single-entry scheme would probably use the entry 2 convention followed above. In that event, Push, Add, Store, and JNZ would require 2, 1, 2, and 3 cycles (common case), respectively, compared to 1, 1 or 2, 1 or 2, and 2 or 3 cycles for the four-entry scheme above.

Since Mesa requires about 120 IFU entries for its 256 opcodes, the cost of the second entry in the vector is between 0 and 120 locations, and 120 locations each for the third and fourth entries. Since Mesa is implemented by about 1044 instructions using entry vectors of length 1, a scheme with vectors of length 2 would require ~1100 instructions, length 3 ~1220, and length 4 ~1340. The implementor of an instruction set should decide when the additional locations expended for larger entry vectors are no longer worth the additional speed.
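
The arithmetic behind these estimates can be reproduced with a short Python sketch. It uses only the figures quoted above (1044 instructions, 120 IFU entries); the function name and the choice to return a (low, high) range are assumptions made for illustration.

```python
BASE_INSTRUCTIONS = 1044   # Mesa with entry vectors of length 1
IFU_ENTRIES = 120          # distinct IFU entries for the 256 opcodes


def microstore_estimate(vector_length):
    """Return (low, high) bounds on microstore size for a vector length.

    The second entry costs between 0 and 120 locations; the third and
    fourth cost 120 locations each.
    """
    if not 1 <= vector_length <= 4:
        raise ValueError("vector_length must be 1..4")
    if vector_length == 1:
        return (BASE_INSTRUCTIONS, BASE_INSTRUCTIONS)
    extra = (vector_length - 2) * IFU_ENTRIES
    low = BASE_INSTRUCTIONS + extra
    return (low, low + IFU_ENTRIES)
```

The rounded figures quoted above (~1100, ~1220, ~1340) each fall inside the corresponding range.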

Although we originally hoped for as much as 8% faster inner loops and 4% overall speed improvement, Gene McDaniel measured only 2% faster execution for Mesa (excluding disk wait) using a length 3 entry vector; microstore increased about 120 locations. Investigation revealed that increased traffic on Mar (by overlapped Fetch← and ←Md) was causing IFU NotReady to occur more often, offsetting the fact that fewer processor cycles were needed. Forwarding saved about 0.2 cycles/opcode.

Assuming no misses and no delays because the processor uses Mar, IFUJump will successfully dispatch to the entry instruction of the next opcode on the fifth cycle after PCF←B if the new opcode either is one byte long or is two bytes long and starts at an even byte; otherwise it will succeed on the sixth cycle.
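
This restart timing rule is simple enough to state as a Python function. The sketch below is illustrative only: the function name is hypothetical, and "starts at an even byte" is taken to mean that the byte-select bit PCF[15] is 0.

```python
def first_ready_cycle(length, pcf):
    """Earliest cycle after PCF<-B on which IFUJump can succeed.

    Assumes no cache misses and no Mar use by the processor.
    length -- opcode length in bytes (1, 2, or 3)
    pcf    -- the PC loaded by PCF<-B (bit 15, the low bit, selects the byte)
    """
    if length not in (1, 2, 3):
        raise ValueError("length must be 1, 2, or 3 bytes")
    starts_even = (pcf & 1) == 0
    if length == 1 or (length == 2 and starts_even):
        return 5
    return 6
```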

A jump opcode causes a 3 cycle gap in the IFU pipe. The effect of the gap would be a 3 cycle delay if each opcode were executed in exactly one cycle. However, the gap can overlap with extra cycles taken on the jump opcode itself or either of the two preceding opcodes. As usual in timing considerations, a 3-byte opcode counts as two normal opcodes.

If a long stream of regular one-byte opcodes is being executed by the processor at the fastest possible rate (one instruction/opcode), and if the IFU neither misses nor faults nor waits for the processor’s use of Mar or the cache, then it will always have the next opcode ready for IFUJump. If the IFU waits one cycle for the processor to use Mar, it will shortly fill its pipe again, so scattered Mar references by the processor will not result in IFU NotReady.

If a long stream of regular two-byte opcodes, each of which has an a but no N (this is the worst case), is being executed by the processor at the fastest possible rate (one instruction/opcode), and if the opcodes in the stream start at the even bytes in words, and if the IFU neither misses nor faults, and if the processor never uses Mar, then the processor will get NotReady 25% of the time. Each cycle in which the processor uses Mar adds one cycle of delay. If the opcodes in the stream start at the odd bytes in words, then the processor will get NotReady 40% of the time.

Three-byte opcodes are not as bad as two-byte opcodes because, in the worst case, the processor cannot reference both a and b in less than 2 instructions. Hence, a stream of three-byte opcodes has timing approximately the same as a stream in which each three-byte opcode is replaced by a one-byte opcode followed by a two-byte opcode.

Mar traffic may be an important timing factor if many opcodes finish in one or two cycles. Whenever the processor is making a reference, the IFU cannot use Mar, and the IFU must make one reference for every two bytes in the instruction stream. Note that if a processor reference is held, the IFU will also be prevented from making references (but the IFU is not prevented from making references when ←Md is held).

The present Mesa implementation requires 34 cycles for a local XFER and 54 cycles for an external XFER, excluding memory wait, and measurements made on the Mesa compiler showed that 38% of all cycles were spent in XFER. For this reason, speed improvements in XFER are an important objective.

Since about 70% of all calls return before calling any other procedure, if a caller’s base registers and stack were left untouched, then this information would neither have to be saved during the call nor restored during the return in most cases.

The hardware that supports this idea consists of the MemBX register, pointing at one of four blocks of 4 base registers each, and StkP, pointing at one of four stacks of 64 registers each. During a procedure call, StkP and MemBX may be advanced by 1 region, leaving the caller’s state intact; if the callee makes nested calls, then eventually the MemBX and Stk regions would be exhausted and some would have to be saved and (eventually) restored. However, if the callee returns without too many nested calls, then its caller’s state would still be intact.

The IFU may trap for not ready, reschedule request, map faults, cache data errors, and IFUM parity errors. When a trap condition occurs, the IFU substitutes a trap address for IFaddr on the next IFUJump. Hence, the next IFUJump sends control to one of the entries in the trap vector.

Each trap vector is dispatched into by IFUJump exactly as though it were an opcode. B←PCX’ reads the PC of the opcode that would have been executed if the trap had not occurred and RBase, MemBase, and ←Id stuff are set according to that opcode (in every case except NotReady—all are undefined at a NotReady trap).

The NotReady trap occurs whenever the IFU does not have both an opcode and its associated operands (a, b) ready for the processor. Since PCX, MemBase, and RBase are invalid, the trap microcode must wait for the IFU to become ready. The following code sequence will work for all instruction sets that do not use a conditional exit:

NotReady:
        IFUJump[0];                             *Can’t convert to IFUJump[2] because stack may be empty
        T←Stack&−1←Md, IFUJump[2];              *Convert case 1 to case 2
        IFUJump[2];
        T←(Id)−(PCX’)−1, StkP←StkP+1, Return;   *Resume the opcode which didn’t really exit

If the IFU detects bad parity on any read of IFUM, the IFUJump to the opcode affected by this parity error will trap to the IFUM parity error trap location.

The IFU will trap at the cache data parity error location, if it detected invalid parity on any byte sent by the memory system. PCX will always correctly point at the opcode that would have been executed next had the trap not occurred; however, the opcode and operands pointed at by PCX are not necessarily the ones that suffered the parity error. This occurs because the pipe has continued ahead of PCX. The most confusing case occurs when the opcode following PCX was a jump; in this case the opcode fetched by the jump may have caused the parity error, in which case PCX+/− jump displacement is limited to the range PCX−400₈ to PCX+377₈.

The Reschedule function is used by io tasks to request service by the emulator. The IFU will honor this trap request on the second IFUJump after it is executed, as discussed in a later section. The RescheduleNow function is like the Reschedule function, but the IFU honors it on the first IFUJump after it is executed, rather than the second (RescheduleNow was intended for use when continuing an opcode which previously experienced a fault).

An IFU fetch may experience a map fault. The memory system does not report IFU map faults to the fault task. Instead, it signals the IFU that a map fault has occurred, and the IFU passes this indication through its pipeline. Eventually, the IFUJump that would have sent control to the opcode affected by the map fault will instead transfer to the map fault trap vector.

Although IFU map faults are not reported to the fault task, the fault task must be careful to pass over any pipe entries that were created by IFU map faults when it is woken for some other reason.

Erroneous bytes fetched after a pause or jump opcode might cause map faults, but the IFU discards these before they reach the end of the pipeline, so the processor is never informed. Consequently, erroneous references interfere with processor memory activity and delay the IFU’s efforts to refill its pipe on a jump, but don’t have any disastrous effect.

An IFU fetch may experience single or double storage failures. Unlike map faults, these are reported to the fault task just as on processor fetches. The memory system pipeline will finish loading the cache munch just as though the data were ok, and the cache entries will have valid byte parity. The IFU will continue running just as though no error had occurred.

However, the fault task will be woken soon enough that it will run before the IFU’s F register is loaded with a byte from the bad munch. Hence, the fault task will run before the emulator can possibly execute an IFUJump to the byte that suffered the error.

For a recoverable error, the fault task can simply carry out some logging action and block; no harm will occur because the IFU will actually have gotten valid data, and the cache will contain valid data. For an irrecoverable error, the fault task must clear the bad cache munch and use the RescheduleNow function to trap the next IFUJump to code for dealing with the irrecoverable error.

Erroneous bytes fetched after a pause or jump opcode might suffer irrecoverable errors. The fault task has no reasonable way to distinguish these from bytes really in the instruction stream, so it will cause a Reschedule trap anyway.

Although independent trap vectors for each instruction set are probably inessential, performance should be better when the NotReady trap, which occurs frequently, is distinct for each instruction set. This allows the various IFUJump exits to be transformed into the form most likely to be convenient for the next opcode.

The other traps could have been implemented to use a common trap for all locations. This would be more economical for IFUM and FG parity error traps, if these simply result in an uncontinuable crash when running system microcode. However, different trap vectors for each instruction set are probably more convenient for Reschedule and Map fault traps, which have to save the state of the emulator currently running.

In any case, reserving locations for these traps costs at most 5 traps * 4 instruction sets * 4 entries/trap = 80 locations (120₈), and realistically is much less than this because many instruction sets will not need 4 entries and there will probably be fewer than 4 instruction sets concurrently active.