
The control section interfaces the mainframe to the baseboard microcomputer or Alto which controls it as detailed in the "Dorado Debugging Interface" document. In addition, the control section stores instructions in 4k x 34-bit (+2 parity) IM ("Instruction Memory") and contains logic for sequencing through instructions and switching among tasks.

The current instruction is clocked into the MIR register at t0 and exported to the processor, memory, and IFU sections for decoding. The control section itself decodes the JCN field, the BLOCK bit, and its own FF decodes (Wakeup, B←Link, B←RWCPReg, Link←B, TaskingOn, TaskingOff, BDispatch←B, BigBDispatch←B, Multiply, MidasStrobe←B, UseDMD, and branch conditions).

Figure 5 shows the overall organization of the control section. Figure 6 shows how branch control is encoded in JCN. Figure 7 shows the timing for regular instructions and for the multi-cycle TPC and IM read/write instructions.

Dorado provides sixteen independent priority-scheduled tasks at the microcode level. Task 15 is highest priority, task 0 lowest. Task 15 (the "fault task") is woken by StkError and by memory map and data error faults. Tasks 1-14 provide processing functions for io controllers implemented partially in hardware, partially in firmware; the present assignment of these tasks to device controllers is given in the "Slow IO" chapter. Task 0 (the "emulator") implements instruction sets (Mesa, Alto, etc.). In the absence of io activity, task 0 (always awake) controls the processor.

Essentially, io devices are paired to tasks when built, and a device controller can assert a wakeup request for the task with which it is paired. A program cannot modify the assignment of controllers to tasks (although the hardware change for this is easy). Additional flexibility in this area is not thought to be worth additional hardware cost.

Each task has its own program counter and subroutine return link, stored in the (task-specific) TPC and TLINK registers when the task is inactive. TPC may also be treated as a memory, so program counters for tasks other than the current task can be read and written by a program. This is discussed later in this chapter.

When device hardware requires service from a task, it activates its wakeup request line at t0. Wakeup requests are priority-encoded, and the highest priority request (BNT or "Best Next Task") is clocked at t2 and competes with the current task (CTASK) for control of the machine. If BNT is higher priority than CTASK, or if the current (non-emulator) instruction has BLOCK = 1, a task switch will take place; in this case, CTASK will be loaded from BNT at t4. This implies that the shortest delay from a wakeup request to the first instruction of the associated task is two cycles.
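The arbitration rule above can be summarized in a few lines of Python. This is a minimal behavioral sketch, not a description of the hardware. The function names and the 16-bit `requests` mask (bit i set when task i has a wakeup request or its Ready flipflop set) are hypothetical, and the t0/t2/t4 timing, TaskingOff, and Hold are ignored.

```python
EMULATOR = 0  # task 0 is always awake and its BLOCK bit is not honored


def best_next_task(requests: int) -> int:
    """Priority-encode a hypothetical 16-bit request mask (bit i = task i runnable).

    Higher task numbers have higher priority. Task 0 is ORed in because the
    emulator is always awake, so some task is always returned.
    """
    mask = (requests | (1 << EMULATOR)) & 0xFFFF
    return mask.bit_length() - 1


def next_task(requests: int, ctask: int, block: bool) -> int:
    """Hypothetical helper: decide which task runs next (BNT versus CTASK)."""
    bnt = best_next_task(requests)
    blocking = block and ctask != EMULATOR  # BLOCK only matters for non-emulator tasks
    if bnt > ctask or blocking:
        return bnt  # task switch: CTASK would be loaded from BNT
    return ctask    # no switch: the current task keeps the machine
```

For example, a request from task 5 preempts the emulator but not task 9, unless task 9 blocks.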

The 16 Wakeup[task] FF decodes allow any task to be woken, just as though a hardware device had activated its wakeup line. A minimum of two cycles elapses after the instruction containing Wakeup before the task executes its first instruction. The task responding to a Wakeup must not block sooner than the second instruction, or it will get reawakened.

When a task has been woken by Wakeup[task], or has executed one or more instructions and then deferred to a higher priority task, the fact that it is runnable is remembered in a Ready flipflop. The Ready flipflop is cleared only when the associated task blocks. In other words, once a task’s Ready flipflop has been set, the only way to deactivate the task is to force it to execute an instruction that blocks. The Wakeup[task] function must be executed with tasking off if the specified task might also be waking up for some other reason (e.g., due to a wakeup request from an external device, or due to a wakeup issued by yet another task). Otherwise, the control section may get horribly confused, and the machine will hang in the same task forever.

Task switching may occur after every instruction unless explicitly disabled by the TaskingOff function. The TaskingOn function reverses the effect of TaskingOff. TaskingOff is "atomic": an instruction containing TaskingOff will be held if a task switch is pending, and the next instruction will be executed in sequence without any intervening task switch. TaskingOn is not immediately effective; at least two more instructions will be executed by the same task before task switching can occur.

It is illegal for a task to block in an instruction that might be held, if the wakeup line for the task might be dropped at t0 of the instruction. If this occurred, the instruction might inadvertently be repeated before the block occurred.

Multiple tasks seem better than a more conventional priority interrupt system because interference by input/output tasks is substantially reduced. As to the exact implementation, variations are possible. The current scheme requires more hardware than one in which the program explicitly indicates when a task switch is legal (as on Alto and D0). However, because Hold may last for about 30 cycles, a reliance upon explicit tasking would result in inadequate service for high priority tasks.

This section gives a low-level view of jump control. Because the microassembler and loader handle details of instruction placement automatically, programmers need not struggle with the encodings directly. For this reason, programmers may wish to skim this section while concentrating on high-level jump concepts described in "Dorado Microassembler".

For the most part, instruction memory (IM) addressing paths are 16 bits wide, although only 12 bits are presently used. The extra width allows for future expansion to 13 or 14 bits when sufficiently fast 4kx1 ECL RAMs become economically available. There are no plans to use the remaining 2 bits, but since nearly all hardware components in the control data paths are packaged 4 bits per can, the extra two bits are almost free. Also, the 16-bit wide Link register can be used to hold full-word data items.

The various registers and data paths that contain IM addresses are numbered 0:15, where bits 4:15 are significant for the 4k-word microstore, while the quadrant bits 2:3 are ignored. This numbering conveniently word-aligns the bits while also allowing for future expansion. The discussion below assumes a 4k-word microstore.
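As a sketch of this bit numbering (bit 0 is the most significant bit of a 16-bit word), the hypothetical helpers below extract the quadrant bits 2:3 and the 12 significant bits 4:15 from an IM address. The page/word split anticipates the 64-page-by-64-word organization of each quadrant described below.

```python
def bit_field(word: int, first: int, last: int) -> int:
    """Hypothetical helper: word[first:last] in Dorado numbering (bit 0 = MSB of 16)."""
    width = last - first + 1
    shift = 15 - last
    return (word >> shift) & ((1 << width) - 1)


def decode_im_address(addr: int) -> dict:
    """Hypothetical helper: split a 16-bit IM address into its fields."""
    return {
        "quadrant": bit_field(addr, 2, 3),  # ignored with a 4k-word microstore
        "address": bit_field(addr, 4, 15),  # 12 significant bits, 0..4095
        "page": bit_field(addr, 4, 9),      # 64 pages per quadrant
        "word": bit_field(addr, 10, 15),    # 64 words per page
    }
```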

Dorado does not have an incrementing instruction-address counter. Instead, the address of the next instruction is determined by modifying the current instruction address (CIA) in various ways. The Tentative Next Instruction Address (TNIA) is determined from JCN[0:7] in the instruction according to rules in Figure 6. TNIA addresses IM for the fetch of the next instruction unless a task switch occurs. If a task switch occurs, the program counter for the highest priority competing task (BNPC or "Best Next PC") addresses IM.

A 16k-word microstore is viewed as consisting of four 4k-word quadrants; each IM quadrant is viewed as containing 64 pages of 64 instructions. Values in JCN are provided for the following kinds of branches:

Conditional branches to any of 14 even locations in the current page, if the selected condition is false, or to the adjacent odd location, if the condition is true (7 branch conditions are available);

IFU jumps to a starting address supplied by the IFU; JCN selects any one of up to 4 entries in the starting address vector (this is motivated by the entry-vector scheme discussed in "Instruction Fetch Unit");

Branch conditions may also be specified in FF, as discussed below. Several dispatches may also be specified in FF; these OR bits into the branch address computed by the following instruction.

If IM is expanded to 16k words, branching from one quadrant to another will only be possible by loading the Link register with a 14-bit address and then returning; jumps, calls, and IFUJumps will be confined to the current 4k-word IM quadrant.

JCN cleverly encodes in 8 bits almost as much programming flexibility as would be possible with an arbitrarily large and general field. The main disadvantage is that MicroD is needed to postprocess assemblies and place instructions.

The earliest prototype of Dorado used a 7-bit JCN encoding that had fewer global and conditional branch targets, so programming was harder and additional instructions had to be inserted in a few places. This was slightly worse than the 8-bit encoding, but it would have been feasible to stay with the 7-bit encoding and employ the bit thus saved for some other use in the instruction.

Local, global, and long branches are analogous, respectively, to local, page-zero, and indirect branches used on many minicomputers. However, Dorado scatters its global locations over the microstore rather than concentrating them in page-zero; this is better than the minicomputer scheme for the following reason. During instruction placement, when a cluster of instructions is too large to fit on one page, a global allows it to be divided between two pages; but if all globals were in page zero, then page zero itself would quickly fill up. In other words, dispersing the globals is theoretically more powerful than concentrating them in page zero; because MicroD does all the tedious work of placing instructions, this theoretical advantage is made practical; minicomputers have not employed any program like MicroD, so they have used the less powerful but simpler page-zero scheme.

Local branches on Dorado are within a 64-word page, where minicomputers usually branch relative to the current PC. Relative branching is probably more powerful, but it cannot be used on Dorado because of insufficient time for addition.

Long branches on Dorado use 4 bits of JCN in conjunction with the 8-bit FF field to specify any location in the 4k-word quadrant. Since BSEL never selects a constant in this case, a better scheme would have used 3 bits of JCN in conjunction with BSEL.0 and the 8-bit FF field; this would have freed 8 values of JCN to encode some other kind of branch. In addition, 5 of the 256 values of JCN are unused and 1 is a duplicate (see Figure 6 for the 5 unused decodes; the duplicated decode is the Global call on the Local page). We have variant JCN decodings that correct these problems, but they were not ready when the design was frozen.

IM is organized in two banks, with odd addresses in one bank, even in the other. The address is needed shortly after t0, but the bank-select signal not until 15 ns after the address. For this reason conditional branches select between an even-odd pair of instructions (i.e., between the two banks) according to branch conditions that need not be stable until a little after t1.

Alternatively, a conditional branch may be encoded in FF in conjunction with any addressing mode except a long branch in JCN. When this is done, the result of the branch test is ORed with TNIA[15].

Hence, it is possible to conditionally branch using only JCN, while using FF for an unrelated function, or to encode a branch condition in FF while using any addressing mode in JCN. If branch conditions are encoded in both FF and JCN, the branch test results are OR’ed, providing further flexibility.
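The effect of branch conditions on the next address can be sketched as follows, assuming a 12-bit IM address and the 64-word page described above. The function and its `even_offset` parameter are hypothetical; which even locations JCN can actually name is given by Figure 6 and is not modeled. The JCN and FF test results are ORed into TNIA[15], the low-order (bank-select) bit, so a true condition selects the odd member of the even-odd pair.

```python
PAGE_MASK = 0o7700  # page-number bits of a 12-bit IM address


def conditional_next_address(cia: int, even_offset: int,
                             jcn_condition: bool, ff_condition: bool = False) -> int:
    """Hypothetical helper: next IM address for a conditional branch in the current page."""
    if even_offset % 2 != 0 or not 0 <= even_offset < 64:
        raise ValueError("target must be an even word within the page")
    tnia = (cia & PAGE_MASK) | even_offset             # even member of the pair
    return tnia | int(jcn_condition or ff_condition)   # test results ORed into TNIA[15]
```

When the condition comes only from FF, the same OR applies to whatever TNIA the JCN addressing mode produced; this sketch covers only the in-page conditional case.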

Branch conditions (JCN condition number; FF decode, octal):

JCN    FF    Condition
0      60    ALU=0
1      61    ALU<0
2      62    ALUcarry’
3      63    Cnt=0&-1 (decrements count after testing)
4      64    R<0 (RM or STK, whichever is selected, not overruled by RIsId)
5      65    R Odd (RM or STK, whichever is selected, not overruled by RIsId)
6      66    IOAtten’ (non-emulator) or ReSchedule (emulator)
none   67    Overflow

ALU=0 and ALU<0 are the results of the last ALU operation executed by the current task. ALUcarry’ (the saved carry-out of the ALU) and Overflow are the results of the last arithmetic ALU operation executed by the current task. (ALU←A may be stored in ALUFM as either an arithmetic or a logical operation, so programmers should be wary of smashing these branch conditions when ALU←A is used.) These conditions are saved in a RAM and may be frozen for one cycle by the FreezeBC function. In other words, the branch conditions are ordinarily loaded into the RAM at t3, but if FreezeBC is present, the RAM is not loaded and the values from the previous instruction for the same task apply.
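A rough behavioral model of the per-task branch-condition RAM may help. It is an illustrative sketch with hypothetical names, assuming that a logical ALU operation updates only ALU=0 and ALU<0 while an arithmetic one also updates the saved carry and Overflow. The polarity of the ALUcarry’ test is deliberately not modeled.

```python
class BranchConditionRAM:
    """Hypothetical model: per-task copy of ALU=0, ALU<0, saved carry, and Overflow."""

    def __init__(self, ntasks: int = 16) -> None:
        self.saved = [
            {"alu_zero": False, "alu_neg": False, "carry": False, "overflow": False}
            for _ in range(ntasks)
        ]

    def load_at_t3(self, task: int, *, alu_zero: bool, alu_neg: bool,
                   carry: bool, overflow: bool, arithmetic: bool,
                   freeze_bc: bool = False) -> None:
        if freeze_bc:
            return  # RAM not written: the previous instruction's values still apply
        conds = self.saved[task]
        conds["alu_zero"] = alu_zero
        conds["alu_neg"] = alu_neg
        if arithmetic:  # carry and Overflow track only arithmetic ALU operations
            conds["carry"] = carry
            conds["overflow"] = overflow
```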

The bank-select toggling trick, which allows branch conditions to be developed very late, is valuable. Without this trick, it would be necessary to choose between slowing the instruction cycle or restricting branch conditions to signals stable at t0. Neither of these alternatives is palatable.

A more traditional implementation of conditional branches would go to the branch address, if a condition were true, or fall through to the instruction at .+1, if it were false. This traditional scheme is never faster but is sometimes more space-efficient than the target-pair scheme because the target-pair requires a duplicated instruction for every instance of a conditional branch to a single target, which is fairly common. The traditional scheme does not allow DblGoto and DblCall constructs discussed in "Dorado Microassembler," but these are infrequent.

Dorado provides single-level subroutines by means of the (task-specific) Link register. A Call occurs on any instruction whose destination address is 0 mod 16 before any modification of TNIA due to branch conditions or dispatches. On a Call, Return, or IFUJump, Link is loaded with CIA+1.

Because Return loads Link with CIA+1, CoReturn constructs are possible. Because IFUJump also loads Link with CIA+1, the conditional exit feature discussed in the "Instruction Fetch Unit" chapter is possible.

CIA+1 is used rather loosely in this discussion; the actual value loaded into Link by a call or return is (CIA & 177700₈) + ((CIA+1) & 77₈). In other words, a call or return in location 77₈ of any page loads Link with location 0 of that page.
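The Link formula and the call-versus-jump rule translate directly into code. In this sketch addresses are plain integers and the function names are hypothetical.

```python
def link_value(cia: int) -> int:
    """Hypothetical helper: Link after a Call, Return, or IFUJump ("CIA+1" within the page)."""
    return (cia & 0o177700) + ((cia + 1) & 0o77)


def is_call(unmodified_target: int) -> bool:
    """Hypothetical helper: a Call iff the target before branch/dispatch ORing is 0 mod 16."""
    return unmodified_target % 16 == 0
```

For instance, `link_value(0o177)` gives 0o100, the first word of the same page rather than the first word of the next one.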

The functions Link←B and B←RWCPReg and the B dispatch functions discussed below, all of which load Link from B, overrule a call. In other words, if there are conflicting reasons for loading Link, Link←B wins over Link←CIA+1.

The B←RWCPReg function (= Link←B, B←CPReg’) is provided primarily for initialization from the baseboard computer and for use by the Midas debugging program. Since the CPReg register clock is asynchronous to the Dorado clock system, a Dorado microprogram that reads CPReg (e.g., to receive information from the baseboard) must use some synchronization method to ensure that CPReg is stable during the cycle in which it is read.

Note: it is illegal to use an ALU branch condition in the instruction after Pd←RWCPReg, if CPReg might have been loaded during the cycle in which it is read—this might result in an unstable IM address being presented to the control store.

Deciding between call and jump based on target address saves one bit in the instruction and costs little for the following reasons. Instructions can be divided into four groups: those always jumped to, those always called, those for which Link can be smashed (i.e., "don’t care" about call or jump), and those both jumped to and called.

A realistic guess is that over half of all instructions will be "don’t care"; namely, these will be executed at the top level, not inside a subroutine, and the Link register will not contain anything of importance. Assembly language declarations make this information available to MicroD.

The hardware makes 1/16 of the locations in each page "call locations". It is estimated that this is somewhat more than real programs will need, on the average (although we vacillated about whether 1/8 or 1/16 of the targets should be calls).

In each page, MicroD first places instructions that must be called or must be jumped to. Because there are so many "don’t care" instructions, it is unlikely that either the call or the jump slots in a page will be exhausted, so it will nearly always be possible to complete allocation of the call and jump targets without overflowing due to the call/jump restriction. After this, "don’t care" instructions fill the remaining slots.

The remaining situation, with which Dorado cannot cope, is an instruction both called and jumped to. This would arise in a subroutine whose entry instruction closed a loop (uncommon). On Dorado, this situation requires duplicating the entry instruction, so it costs one location but no extra time.

Multiply ORs Q[14] into TNIA[14]. (The value of Q[14] is captured in a flipflop at t2 of the instruction containing the Multiply function and is ORed into TNIA[14] during the next instruction for the same task.)

The two B dispatches load the Link register from B, then OR appropriate bits of Link into TNIA during the next instruction for the same task. Since Link is task-specific, this works correctly across task switches. The Q bit is loaded only during a multiply, and tasks other than the emulator are not allowed to use the Multiply function.

The decision between call and jump in the instruction after a dispatch is unaffected by dispatch bits—it depends only upon JCN. In other words, the instruction following a dispatch is a Call if its unmodified target address is 0 mod 16, else a jump.

It is possible to neutralize any bits in a dispatch by placing the target instructions at locations with 1’s in the neutralized bits. For example, a dispatch on B[8:10] could be accomplished by locating the 8 target instructions at IM locations whose low five address bits are 1, e.g., at 37₈, 77₈, 137₈, 177₈, 237₈, 277₈, 337₈, and 377₈, and by branching to 37₈ in the instruction after the BigBDispatch←B.
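The neutralization trick can be checked mechanically. This sketch assumes, as the example above implies but this section does not state explicitly, that BigBDispatch←B ORs B[8:15] into TNIA[8:15]; the function name is hypothetical. Every possible B value then lands on one of the eight listed targets, selected by B[8:10] alone.

```python
def big_b_dispatch(tnia: int, b: int) -> int:
    """Hypothetical model: OR B[8:15] (low 8 bits, bit 0 = MSB) into TNIA[8:15]."""
    return tnia | (b & 0o377)


# Branching to 37 octal sets the low five bits, so B[11:15] cannot change the
# outcome; only B[8:10] (weights 200, 100, 40 octal) select among the targets.
reached = {big_b_dispatch(0o37, b) for b in range(1 << 16)}
assert reached == {0o37, 0o77, 0o137, 0o177, 0o237, 0o277, 0o337, 0o377}
```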

Note: Methods discussed later for resuming a program interrupted by a page fault do not permit continuation when a fault occurs between a dispatch and the following instruction; for this reason, programmers should ensure that no fault can possibly occur by holding for memory faults with ←Md prior to or concurrent with the dispatch; also, stack operations that might overflow/underflow may not be used in the instruction after a dispatch.

Note: When the PC for another task is loaded using the LdTPC← operation discussed later, any pending dispatch conditions for that task are cleared. The debugging program Midas does not clear pending dispatches, however, so it should be ok to put a breakpoint on the instruction after a dispatch or to single-step through a dispatch.

The IFU supplies ten bits of opcode starting address to the processor. During the last instruction of every opcode, exit to the next opcode is accomplished by IFUJump[n] (n = 0 to 3) which selects among four entry locations for the next opcode. The starting address supplied by the IFU is used for TNIA[4:13] and TNIA[14:15] are set to n. If the IFU is unprepared, it supplies a trap address instead of a starting address, and control goes to the nth location in a trap vector.

If an FF-encoded branch condition is true in the same instruction as an IFUJump, IFU advance to the next opcode is disabled. This kludge allows an opcode with common and uncommon exit conditions to finish, for example, with IFUJump[2,condition]. If the condition is false (common case), then the IFU advances normally to the next opcode, starting at location 2 of the entry vector. Otherwise (uncommon case), control continues at location 3 of the entry vector, but the IFU does not advance, so emulation of the current opcode can continue.
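The address formed by IFUJump[n], and the effect of an FF branch condition on it, can be sketched as below. The function name is hypothetical, the trap case (IFU unprepared) is not modeled, and `start` stands for the ten-bit starting address supplied by the IFU.

```python
def ifu_jump(start: int, n: int, ff_condition: bool = False) -> tuple[int, bool]:
    """Hypothetical model: return (TNIA, ifu_advances) for IFUJump[n]."""
    if not 0 <= start < 1024 or not 0 <= n <= 3:
        raise ValueError("start is 10 bits and n is 0..3")
    tnia = (start << 2) | n   # TNIA[4:13] = starting address, TNIA[14:15] = n
    if ff_condition:
        tnia |= 1             # FF branch result ORed into TNIA[15]
    return tnia, not ff_condition  # a true condition suppresses IFU advance
```

With n = 2, a false condition yields entry 2 and an IFU advance, while a true one yields entry 3 with no advance, matching the IFUJump[2,condition] example.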

IFU map fault (trap locations *0-3): The IFU buffers the fact of a map fault and completes all opcodes in the pipe ahead of the one experiencing the fault. This trap occurs upon dispatch to the first instruction of the opcode affected by the fault.

*IFU traps OR the 1’s complement of the instruction set into bits 8:9 of the trap address, so the actual trap locations for Reschedule, for example, are 14-17, 114-117, 214-217, and 314-317 (octal). The trap vector is 1 to 4 instructions long according to the IFUJump programming convention, as discussed in the "Instruction Fetch Unit" chapter.
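The trap-address arithmetic in the footnote can be sketched as follows. Bits 8:9 of a 16-bit address (bit 0 most significant) have weights 200₈ and 100₈. The sketch assumes the instruction set is a 2-bit number 0-3; which number corresponds to Mesa, Alto, etc. is not given here, and the function name is hypothetical.

```python
def ifu_trap_address(base: int, instruction_set: int) -> int:
    """Hypothetical helper: OR the 1's complement of the instruction set into bits 8:9."""
    if not 0 <= instruction_set <= 3:
        raise ValueError("instruction set is assumed to be a 2-bit number")
    complement = ~instruction_set & 0b11
    return base | (complement << 6)  # bits 8:9 have weights 200 and 100 octal
```

Applied to the Reschedule base of 14₈, the four instruction sets give 314₈, 214₈, 114₈, and 14₈.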

IM is read and written by programs using a special decode of JCN in conjunction with the RSTK field of the instruction; TPC is also read and written using a special JCN decode. TaskingOff must be in force, and anything that might cause hold is illegal in the same instruction; hold is also illegal in the instruction after an IM or TPC read, when the data is accessed using B←Link.

After the read or write instruction, control passes to the next sequential instruction, i.e., to CIA+1 (with wrap-around at 64-word page boundaries). CIA+1 also winds up in Link.

Note: The hardware does not actually load Link with the IM or TPC data; instead B←Link in the next cycle routes inverted data onto B using an alternate path. The Link register itself is smashed with CIA+1 as discussed above, and this value would be read (assuming it wasn’t overwritten) in later instructions.

This implies that continuation from a breakpoint or program-interrupt halt on the instruction following an IM or TPC read (i.e., on the B←Link instruction) won’t work correctly.

A 34 (+2 parity)-bit IM word is read as four 9-bit quantities. The read address is taken from Link. Data must be read from Link[7:15] in the instruction immediately after the IM read; this data is inverted; Link[0:6] contain 1’s, so that when the entire word is 1’s complemented the desired data will have leading 0’s. The byte select is RSTK[2:3].
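The inversion convention for IM reads is easy to get backwards, so here is a small sketch of what B←Link delivers and how a program could reassemble the 36-bit word. The byte ordering (byte select 0 returning the most significant 9 bits) and the function names are assumptions for illustration only; parity bits are treated as ordinary data.

```python
def link_after_im_read(word36: int, byte_select: int) -> int:
    """Hypothetical model of B←Link in the instruction after an IM read.

    Link[7:15] holds the selected 9-bit byte, inverted; Link[0:6] are 1's.
    Assumption: byte_select 0 picks the most significant 9 bits.
    """
    shift = 9 * (3 - byte_select)
    byte = (word36 >> shift) & 0x1FF
    return 0xFE00 | (~byte & 0x1FF)


def assemble_im_word(read_link) -> int:
    """Hypothetical helper: rebuild the 36-bit word from four reads."""
    word = 0
    for byte_select in range(4):
        data = ~read_link(byte_select) & 0xFFFF  # complement: leading 0's, 9 data bits
        word = (word << 9) | data
    return word
```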

Any task can read or write TPC for an arbitrary task other than itself (an attempt to set TPC of the running task is unpredictable). The task number is B[12:15], and data is taken from or written into Link. The assembly language notations for these are RdTPC←B and LdTPC←B. After RdTPC←B, the 16 bits of data in Link are 1’s complemented.

Note: The dispatch-pending conditions for a task whose TPC is loaded by LdTPC← are cleared, so LdTPC← works even when that task has just executed a BDispatch←B or BigBDispatch←B.

Many events in the memory system, StkError and the hold simulator in the processor, and several IFU error conditions generate Hold. (The IFU error conditions cause a one-cycle hold iff an IFUJump occurs on the first cycle of the error.) The control section itself forces Hold when a task switch is pending concurrently with TaskingOff. Hold, clocked at t1, occurs when the current instruction cannot be completed. Its effect on the hardware is to suspend the current instruction while completing the parts of the previous instruction that have been pipelined into the current cycle. Approximately, it converts the current instruction into a Goto[.] while preserving branch conditions and some other stuff.

The fact that the address of the next instruction is needed at t0, while Hold is not generated until t1 means that concurrence of Hold and BLOCK with a switch to a lower priority task produces an anomalous situation called "Next Lies". The hardware disables clocks to CIA, TPC, and MIR when this occurs, so that the current instruction is repeated. This results in some hardware complications discussed in the "Slow IO" chapter, but programmers need not worry about it.

Dorado contains a large number of multiplexors called mufflers which allow a selected signal from a set of up to 2048 signals to be observed on a one-wire bus called the DMux. This provides a passive method by which the Baseboard section or the external Midas debugger can examine internal control signals and registers not otherwise observable.

The particular DMux signal is selected by shifting in an 11-bit address one bit at a time. Each board with mufflers contains a 12-bit address register that receives the shifted address bits; the highest bit is ignored when selecting the signal to be read. "Dorado Debugging Interface" discusses a clever generator algorithm that allows all 2048 signals to be read into a table in 2048+11 shift-read cycles.
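The generator algorithm itself is described in "Dorado Debugging Interface" and is not reproduced here. One standard construction with the needed property is a binary de Bruijn sequence of order 11: a cyclic bit string in which each of the 2048 eleven-bit addresses appears exactly once as a window of consecutive bits, so every shift after the register is first filled selects a new address. The sketch below uses the well-known FKM (Lyndon-word) construction; it illustrates the idea and is not necessarily the algorithm the Debugging Interface uses. The shift direction (new bit entering at the low end) and the function names are assumptions.

```python
def de_bruijn(k: int, n: int) -> list[int]:
    """Cyclic de Bruijn sequence B(k, n) via the FKM (Lyndon word) algorithm."""
    a = [0] * (k * n)
    sequence: list[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def dmux_shift_addresses(width: int = 11) -> list[int]:
    """Hypothetical helper: addresses seen by a width-bit shift register."""
    cycle = de_bruijn(2, width)
    stream = cycle + cycle[:width - 1]   # unroll the cyclic sequence
    mask = (1 << width) - 1
    reg, seen = 0, []
    for i, bit in enumerate(stream):
        reg = ((reg << 1) | bit) & mask  # new bit enters at the low end (assumed)
        if i >= width - 1:               # register now holds a full address
            seen.append(reg)
    return seen
```

With `width=3` the stream is 0001011100 and the register visits 0, 1, 2, 5, 3, 7, 6, 4; with the default width every one of the 2048 addresses appears exactly once.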

In addition, the DMux address can also be executed as a control function. In this case the full 12-bit address determines what function is executed. This "manifold" mechanism is used to control power supplies, set clock rate, enable/disable error halt conditions, and test IM without involving other hardware.

The DMux facility can also be controlled directly by Dorado programs by means of the MidasStrobe←B and UseDMD functions. Essentially, the DMux address mechanism is controlled externally by the Baseboard or by Midas operating through the Baseboard when Dorado isn’t running, and by Dorado when Dorado is running.

The MidasStrobe←B function causes B[4] to be shifted out as an address bit. This takes three cycles, so the program must execute three more instructions before doing another MidasStrobe←B function. The DMux signal selected by the last 11 address bits shifted out is read on B[0] when the Pd←ALUFMEM function is executed.
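A program that selects a DMux signal therefore issues one MidasStrobe←B per address bit, with at least three other instructions between strobes. The sketch below computes the B values and issue slots. Bit numbering follows the manual (B[4] has weight 2^11 and B[0] is the most significant bit), while the most-significant-first shift order and the helper names are assumptions.

```python
STROBE_SPACING = 4  # a strobe, then at least three more instructions before the next


def midas_strobe_words(address: int, width: int = 11) -> list[int]:
    """Hypothetical helper: B values for successive MidasStrobe←B functions.

    Each address bit travels on B[4] (weight 2**11); MSB-first order is assumed.
    """
    return [((address >> i) & 1) << 11 for i in reversed(range(width))]


def strobe_schedule(address: int) -> list[tuple[int, int]]:
    """Hypothetical helper: (instruction index, B value) pairs with minimum spacing."""
    return [(STROBE_SPACING * k, b) for k, b in enumerate(midas_strobe_words(address))]


def dmux_bit(b_word: int) -> int:
    """Hypothetical helper: the selected DMux signal, read on B[0] after Pd←ALUFMEM."""
    return (b_word >> 15) & 1
```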