CSL Notebook Entry XEROX To From Dragon Designers E. McCreight PARC CSL Subject Date Dragon Instruction Execution September 25, 1984 Release as /Indigo/Dragon/Documentation/InstrExec.tioga, .press Came from ([Indigo]Forms>CSLNotebook.form) Last edited by McCreight, September 25, 1984 2:57:23 pm PDT Abstract I have confused myself once again about exactly what happens when during Dragon instruction execution. By writing it all down carefully, maybe I can put my confusion finally to rest. Attributes technical, Computer organization, Dragon, VLSI Introduction and Conventions The Dragon processor is made up (mostly) of two chips: IFU and EU. The IFU has a control and data connection to an instruction cache chip, and a control connection to a data cache chip. The IFU and EU share a bi-directional K bus, assorted control lines from IFU to EU, and a single condition flag from EU to IFU. Dragon runs synchronously on a two-phase non-overlapping clock. The phases are named A and B. A cycle is an adjacent pair of phases. An instruction is said to be "begun" in the phase following a phase A when the entire instruction was available from the instruction buffer and the IFU sequencer was willing to take a new instruction. By convention, the phase in which an instruction begins is that instruction's phase 0B. Phase numbering is of the form "iJ.k", where iJ refers to a pipeline stage in the processor (0A, 2B, etc), and k is the number of times this instruction has visited this stage before. If k=0, the suffix is usually elided. Instruction traces One way of looking at instruction execution is to follow the various kinds of instructions through the pipeline stages of the processor, and see what happens to them along the way. The simplest instructions The simplest instructions take one cycle to execute (except if they are held up by an unrelated hazard), cannot cause a break in PC flow, and do not participate in any hazards. Such instructions are pretty uninteresting, but we have a few of them. J1 is the example we will use: a one-byte NoOp. 0A: Five bytes, beginning with the J1 instruction, are read out of the instruction buffer. The instruction buffer controller indicates that at least the first byte is good. The opcode is examined, and InstReady is asserted, because the whole one-byte instruction is available from the instruction buffer. The IFU sequencer decides that in the next phase a new instruction can be begun if one is available. The main PLA precharges. 0B: The main PLA decodes signals from the sequencer, InstReady, and the Op, Alpha, and Beta bytes of the new instruction. This PLA is self-timed, so it does two levels (AND-OR) of precharged logic within this phase. In response to the set of inputs for J1 it generates GetNextInst to the instruction buffer controller. This signal causes the instruction buffer controller to advance the instruction buffer read pointer by the length of the new instruction (one byte for J1). Innocuous values are generated in the other microinstruction fields. The read data lines from the instruction buffer precharge. Simple arithmetic instructions (hazard-free case) These instructions are typically three-register arithmetic operations of the form rC _ rA op rB. They execute in one cycle (except if delayed by a hazard), and they cannot cause a break in PC flow. 0B: The main PLA generates GetNextInst to the instruction buffer controller. It also generates a two-part field describing how to form each of the A, B, and C register addresses above, and a two-part field describing how to modify each of the L and S base registers before the next instruction begins. The A and B address generators start forming the A and B addresses. 1B: The A and B register addresses are sent to the EU. 2A: The ALU operation is sent to the EU. The EU registers are read at addresses A and B. 2B: The ALU result is calculated. 3B: The C register address is sent to the EU, and an indication that the Result3B data is arriving from the ALU (rather than from the data cache). 4A: The result is finally stored in the EU registers at address C. Simple arithmetic instructions (general case) Simple arithmetic instructions participate in stale data hazards, which are recognized in phase 1B. 1B: Each of the A and B register addresses is compared against the C register addresses formed in the two previous cycles. These C addresses name registers that may contain stale data when A and B are fetched. A and B are treated the same, so let's just consider A. If A matches the C generated two cycles earlier, then a bypass path can be activated that simply bypasses the register array's store-fetch delay, and carries the EU's Result3B directly into the ALU's left input. If A matches the C generated one cycle earlier (whether or not it matches that of two cycles earlier), then there are two bypass paths available for resolving the hazard, but perhaps neither one will succeed. The first goes from the ALU output back to the ALU's left input. It can be used if C was not to have been generated by a data cache fetch. The second goes from the fetched memory data back to the ALU output. It can be used if the the current ALU operation is "Or-with-zero". If neither of these bypass paths can be used (that is, the instruction one ahead is a memory fetch, and this instruction does arithmetic on its result), then a "bubble" instruction must be inserted in the pipeline before this one, and a signal is sent to the main PLA to inhibit accepting the next instruction. 1B: Control for the LeftOperand2A, RightOperand2A, and Store2A input multiplexers is sent to the EU. The ALU input multiplexers select the ALU inputs. 2B: Control for the Result3A and Store3A input multiplexers is sent to the EU. Memory fetch instructions These instructions are typically of the form rC _ [rA+literal]^. They execute in one cycle, except if delayed by a hazard. They participate in stale data hazards, which are recognized in phases 1B and 3B. They can cause map fault traps, which are recognized in phase 3B. 0B: The main PLA generates GetNextInst to the instruction buffer controller. It also generates a two-part field describing how to form each of the A and C register addresses above, and a two-part field describing how to modify each of the L and S base registers before the instruction begins. The A address generators forms the A address. 1A: The C address generator forms the C address. 1B: The A address is sent to the EU. 2A: The ALU operation "+", and bypass control for the level 2A multiplexers, is sent to the EU. The literal to be added to rA is sent to the EU on the K bus. The A and (default) B EU registers are read, and the level 2A multiplexers select the ALU inputs. 2B: The sum (rA+literal) is calculated. 3A: The address is transmitted on the PData bus from the EU to the data cache, and a Fetch operation is transmitted from the IFU to the data cache, and a control bit to cause the EU to listen to (rather than drive) the PData bus during the following phase. 3B: If the data cache signals Reject, then to avoid a potential stale data hazard this instruction and all following ones in the pipeline must repeat their current cycles. If not, then the result data appears on the PData bus and this instruction advances. The C register address is sent to the EU, and an indication that the Result3B data is arriving from the data cache. 4A: The result is finally stored in register C. Memory store instructions These instructions are typically of the form [rA+literal]^ _ rB. They execute in one cycle, except if delayed by a hazard. They participate in stale data hazards, which are recognized in phase 1B and 3B. They can cause map fault traps, which are recognized in phase 3B. Only those actions are described that are different from memory fetch. 0B: The A and B address generators form the A and B addresses. 1B: The A and B addresses are sent to the EU. 2A: The ALU operation "+", and bypass control for the level 2A multiplexers, is sent to the EU. The literal to be added to rA is sent to the EU on the K bus. The A and B EU registers are read, and the level 2A multiplexers select the ALU inputs. 2B: The sum (rA+literal) is calculated. Bypass control for the level 3A store data multiplexer is sent to the EU. 3A: The address is transmitted on the PData lines from the EU to the data cache, and a Store operation is transmitted on the PCommand lines from the IFU to the data cache. The store data is preselected from the store data pipeline or a bypass path. A control bit is transmitted to cause the EU to drive (rather than listen to) the PData bus during the following phase. 3B: The store data is transmitted on the PData bus from the EU to the data cache. If the data cache signals Reject, then this instruction and all following ones in the pipeline must repeat their current cycles. Unconditional jump instructions These instructions are typically of the form PC _ f(PC, literal). They execute in one cycle (except if delayed by a hazard), and are followed by at least one dead cycle while the instruction buffer controller refills the instruction buffer. They participate in no hazards. 0B: The main PLA generates Jump to the instruction buffer controller. It also generates a two-part field describing how to form the new PC. The new PC is formed, and sent to the instruction buffer controller. Next instruction's 0A.i: If the instruction cache PBus is not busy (did not assert Reject on the previous B), then the IFU's instruction buffer controller transmits the new PC word address on the PData lines, and a Fetch on the PCommand lines. Next instruction's 0B.i: If the instruction cache does not assert Reject, then the data on the PData lines is entered in the IFU's instruction buffer. Next instruction's 0A.n: Data within the instruction buffer is read out, correctly aligned, and a test is made whether a whole instruction is present. Conditional jump instructions, correctly predicted straight ahead These are of the form "PC _ PC+(IF ALUCondition[ra, rb] -- assumed FALSE -- THEN literal ELSE 3)". They execute in one cycle, except if delayed by a hazard, and they participate in stale data hazards. 0B: The main PLA generates GetNextInst to the instruction buffer controller. It also generates a two-part field describing how to form each of A and B, and one describing what literal to add to the PC to get the jump target address. The target address is computed. 1A: The PC is updated by adding 3. The target address computed in the last phase is entered in the pipeline (pcPipe[1][a]), to be used later if the prediction was in error. 2A: ALUOp and CondSel are sent to the EU. 2B: The EU computes the condition FALSE, indicating a correct prediction. 3A: The EU returns the condition FALSE to the IFU. Conditional jump instructions, incorrectly predicted straight ahead These are the same as above, up through phase 2A. 2B: The EU computes the condition TRUE, indicating an incorrect prediction. 3A: The EU returns the condition TRUE to the IFU. AbortPipe3AB is generated, and the IFU sequencer considers how to follow the new control flow in response. 3B: The main PLA generates control signals that carry the pipelined target address (pcPipe[3][a]) to the instruction buffer controller, and the signal Jump. Conditional jump instructions, correctly predicted to jump These are of the form "PC _ PC+(IF ALUCondition[ra, rb] -- assumed TRUE -- THEN literal ELSE 3)". 0B: The main PLA generates Jump to the instruction buffer controller, and a field describing what literal to add to the PC to get the jump target address. The target address is computed and sent to the instruction buffer controller. 1A: The quantity (oldPC+3) is entered in the pipeline (pcPipe[1][a]), to be used later if the prediction was in error. 2B: The EU computes the condition FALSE, indicating a correct prediction. Conditional jump instructions, incorrectly predicted to jump These are the same as above, up to phase 2B. Thereafter, recovery is the same as for the earlier mis-predicted conditional jump. Call instructions Because of the complex interactions of instruction abort with a call or return, the IFU stack is located at the bottom of the IFU pipeline, so that only surviving call or return instructions push or pop the IFU stack. 0B: The main PLA generates Jump to the instruction buffer controller, and a field describing how to compute the call target address. The target address is computed and sent to the instruction buffer controller. 1A: The return PC (oldPC+length of call instruction) is entered in the pipeline (pcPipe[1][a]), to be pushed on the IFU stack later. 4A: The pipelined return PC is written in the IFU stack. When is the stack pushed? Return instructions Because of the unfortunate location of the IFU stack at the bottom of the IFU pipeline, return instructions participate in a stale data hazard. A new return instruction cannot be accepted if there are any call or return instructions in the pipeline. 0A: The pipeline is examined and a decision is made whether it is safe to accept a return instruction if one presents itself for execution. 0B: The main PLA generates Jump to the instruction buffer controller, and the return address is transported from the IFU stack to the new PC in the instruction buffer controller. 1A: The return L is transported from the IFU stack to the L register. 4A: When is the stack popped? Faulting memory reference instructions These faults first become apparent in phase 3B, when the cache sends a fault indication. Fortunately, the cache also sends reject, so the resulting pipeline hold gives us a bit more time to collect our wits. 1A: Old PC, old S, and old L are entered in the pipeline (pipe[1][a]), to be used later if a fault occurs. 3B.i: The data cache signals Reject and Fault. 3A.i+1: The EU holds all ALU state because of Reject in the previous phase. AbortPipe3AB is generated. The IFU sequencer considers the problem. Innocuous C address and ALU operation are delivered to the EU for use in the next phase. 3B.i+1: The data cache signals NOT Reject and NOT Fault. 4A: The old PC and old L are written in the IFU stack. Old S is restored into S. When is the stack pushed? Faulting arithmetic instructions These faults first become apparent in phase 2B, when the EU sends an unexpected ALU condition. 1A: Old PC, old S, and old L are entered in the pipeline (pipe[1][a]), to be used later if a fault occurs. 2B: The EU computes the unexpected condition, and holds its carry state. 3A: The EU transmits the unexpected condition to the IFU. AbortPipe3AB is generated. The IFU sequencer considers the problem. 3B: The EU remembers that it computed an unexpected condition, and continues to hold its carry state. AbortPipe3AB caused the C address to turn into corn meal mush, which is now transported to the EU. 4A: The arithmetic result is stored in a non-register (the mushy C address). The old PC and old L are written in the IFU stack. Old S is restored into S. When is the stack pushed? Pipeline stages Another way of looking at instruction execution is to stand at a fixed pipeline stage in the processor and watch instructions flow past. In general, the decision whether an instruction can advance is made in A phases. 0A: The goal of this phase is to generate a set of inputs for the main decoding PLA. The main decoding PLA generates signals to control the 0B and 1A phases. If there was a Reject in the previous phase, or if this phase detects an impending pipeline interlock, or the ALU condition flag is TRUE, then prepare the main PLA to decode unusually. Otherwise set up to decode the next cycle of the next instruction. 0B: The main PLA acts on the instruction. 1A: If there was no Reject in the last phase, and no impending pipeline interlock detected in this phase, load this stage from 0B. Otherwise reload it from 1B. 1B: If AbortPipe3AB, load this stage as a NoOp would. Otherwise, load this stage from 1A. Transmit A and B register addresses to EU on the K bus. 2A: If an impending pipeline interlock is detected, load this stage as a NoOp would. Otherwise, if Reject was detected in the last phase, load from stage 2B. Otherwise load from 1B. Transmit the IFU arithmetic data via the X and K bus to the EU, as well as the the ALU operation, CondSel, and LeftOperand2A, RightOperand2A, and Store2A input multiplexer controls. 2B: If AbortPipe3AB, load this stage as a NoOp would. Otherwise, load this stage from 2A. Perform the ALU operation. Transmit the Result3A and Store3A input multiplexer controls. 3A: If Reject was detected in the last phase, load from stage 3B, with the cache operation set to NoOp. Otherwise load from stage 2B. Transmit the ALU condition flag back to the IFU. If it is TRUE, or was true one cycle ago, or Reject, recycle carry from 2A; otherwise set carry from last arithmetic operation in 2B as appropriate. Transmit the cache operation to the data cache. EU loads Result3A pipeline stage either from Result2B or Result3B, depending on Result3A input multiplex control. Load Store3A pipeline stage either from Store2B or Result3B, depending on Store3A input multiplex control. Compute AbortPipe3AB as (IF Reject THEN PFault#None ELSE condition flag), as of the end of the previous phase. 3B: If AbortPipe3AB, load this stage as a NoOp would. Otherwise, load this stage from 3A. Transmit the Result3B input multiplexer control and the C register address to the EU on the K bus. Load the Result3B stage either from Result3A or from the data cache's PData bus, depending on Result3B input multiplex control. 4A: If there was no Reject in the last phase, load this stage from 3B. Otherwise load it as a NoOp would. Load the C register from Result3B. Transmit Result3 to the K bus if the C address is among the registers in the IFU. If there is a push or pop for the IFU stack, do it. �� e�StyleDef�BeginStyle (Cedar) AttachStyle (firstHeadersAfterPage) {0} .cvx .def (root) "format for root nodes" { docStandard 36 pt topMargin 36 pt headerMargin .5 in footerMargin .5 in bottomMargin 1.75 in leftMargin 1.5 in rightMargin 5.25 in lineLength 24 pt topIndent 24 pt topLeading 0 leftIndent 10 pt rightIndent } StyleRule (positionInternalMemoLogo) "Xerox logo: screen" { docStandard 1 pt leading 1 pt topLeading 1 pt bottomLeading } ScreenRule (positionInternalMemoLogo) "for Xerox logo" { docStandard 1 pt leading 1 pt topLeading -28 pt bottomLeading -1.5 in leftIndent } PrintRule (internalMemoLogo) "Xerox logo: screen" { "Logo" family 18 bp size 20 pt topLeading 20 pt bottomLeading } ScreenRule (internalMemoLogo) "for Xerox logo" { "Logo" family 18 pt size 12 pt leading -28 pt topLeading 36 pt bottomLeading -1.5 in leftIndent } PrintRule (memoHead) "for the To, From, Subject nodes at front of memos" { docStandard AlternateFontFamily 240 pt tabStops } StyleRule EndStyle�Iblock�MarkinsideHeader��box�Ipositioninternalmemologo�Iinternalmemologo��memohead��s��N��t��N�� N�N�N�N�N��N��/�/N�N��Iabstract�� 5� ��ms��,��O��O�� /�9head�Ibody��Q��i�,�b��Q�� K��K�dK�K��K�:��1 ��K��k��K��C�CK��4�6K��W�YK��!K��K��@�B��- �cK��K��K��K��b�dK�1K��L�N�� K��K��-�-K��.�0K��"�$K��]�_K�=K��a�aK��%�'K��K��K��s�sK��-�/�� K��<�>K��+�-K��]�_K�=K��W�WK��p�rK��K�� K��K��K��K��A ��<� ��K��K��K��'�)K��G�IK��0�2��C ��1�1K��I�KK��K��: ��<��bK��K��t�vK��G�I��