EULayout.tioga
Copyright © 1984, 1985 by Xerox Corporation. All rights reserved.
Last Edited by: Monier, June 7, 1985 9:34:43 am PDT
Data Path
Bit 0 is the high-order bit and is on the left. Whenever possible, this order is respected. The control is on the left. Every slice of the datapath obeys the following rules:
Vdd, Gnd, clocks and control signals run horizontally in metal2.
Clocks and control signals might be doubled in poly.
Buses and wires connecting different stages run vertically in metal1.
Seven tracks are reserved on the left side of every cell: rBus, cBus, kBus, aBus/pBus, bBus/sBus, opLBus, opRBus.
Control descriptors
The control descriptor at the level of the RAM (52 wires) is:
PhA   BIT
PhB   BIT
~hold2BA  BIT
~rejectBA  BIT
aAdr   INT[8]
bAdr   INT[8]
cAdr   INT[8]
The control descriptor at the level of the pipeline registers (57 wires) and Field Unit is:
PhA   BIT
PhB   BIT
~hold2BA  BIT
~rejectBA  BIT
EUAluLeftSrc1BA  INT[2] *
EUAluRightSrc1BA  INT[2] *
EUStore2ASrc1BA  INT[2] *
EUSt3AisCBus2BA  BOOL *
EURes3AisCBus2BA BOOL *
FDInsert   BOOL
FDMask   INT[6]
FDShift   INT[6]
EUAluOp2AB  INT[5]
EULoadField3BA  BIT
The control descriptor at the level of the ALU (25 wires) is:
PhA   BIT
PhB   BIT
faultBA   BIT
~hold2BA  BIT
~rejectBA  BIT
EUAluOp2AB  INT[5]
EUCondSel2AB  INT[4]
EURes3BisPBus3AB INT[1]
The control descriptor at the level of the P-Port (11 wires) is:
PhA   BIT
PhB   BIT
faultBA   BIT
~hold2BA  BIT
~rejectBA  BIT
EURes3BisPBus3AB INT[1]
EUCheckParity3AB  BOOL ?
EUWriteToPBus3AB BOOL ?
(* means: input thru the kBus)
Horizontal pitch in the datapath
Fixed by the RAM. The h-pitch is now 180 lambda. It is an even number, so that we can fit two copies of the same cell in one slice.
Vertical pitch
A nand decoder fits on a 12 lambda vertical pitch, if we use mirroring.
RAM
Structure and behavior
Every line is made of 128 bits, and the RAM has 43 lines, i.e. 172 32-bit registers. The registers are organised as follows:
0..127: The stack, 32 very normal rows.
128..131: euJunk (just a convention), 129 (spare), euMAR, euField
132..143: FP registers.
144..155: constants registers.
160..175: auxiliary registers.
All decoders use a precharged NAND followed by a driver. Precharge clock is nPhB.
The control descriptor of the RAM (28 wires) is:
PhA, ~hold2BA, ~rejectBA, aAdr[0..7], bAdr[0..7], cAdr[0..7]
The output of the row decoder:
selectLine ← PhA AND (adr[0..5] matches)
The column decoder output:
selectLine ← PhA AND ~hold2BA AND ~rejectBA AND (adr[6..7] matches)
Behavior
The select lines are always maintained low during PhB.
In the absence of rejectBA or hold2BA, exactly one row select line and one column select line go high on PhA.
hold2BA OR rejectBA => absolutely no access to the RAM.
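The decoder equations above can be sketched as a small behavioral model. This is only an illustration, assuming that a register number splits into adr[0..5] (the six high-order bits, selecting the row) and adr[6..7] (the two low-order bits, selecting one of the four columns); the function name and argument names are hypothetical.

```python
def ram_select(adr, ph_a, hold2ba=False, reject_ba=False, rows=43):
    """Return (row_select, column_select) lists of 0/1 for one RAM port.

    adr is an 8-bit register number. Bit 0 is the high-order bit, so
    adr[0..5] is adr >> 2 and adr[6..7] is adr & 3 (assumed mapping).
    """
    row_index = adr >> 2       # adr[0..5]
    col_index = adr & 0b11     # adr[6..7]
    # Row decoder: selectLine <- PhA AND (adr[0..5] matches)
    row = [int(ph_a and r == row_index) for r in range(rows)]
    # Column decoder: also gated by ~hold2BA and ~rejectBA
    enable = ph_a and not hold2ba and not reject_ba
    col = [int(enable and c == col_index) for c in range(4)]
    return row, col
```

For register 131 (euField in the list above) on PhA, this model raises row 32 and column 3; with hold2BA or rejectBA asserted, no column line goes high, so no register is accessed.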
Array
Block: 180 x 55 lambda
One block is made of four ram cells, plus one column of contacts and the kBus. The ram cell is static, three-ported. It has four bit lines (c, a, ~c, ~b). Power and ground run horizontally in diffusion, with one contact to metal2 on every block. The three select lines run horizontally on poly, doubled in metal2 with one contact per block.
Access transistors are 4/2, pull-downs are 12/2, and pull-ups are 3/2 (could be longer if needed).
Bit lines: 2365 lambda of metal1, 1pF
In one cell, a bit line is made of 55 lambda of metal1, one n-diff contact and the drain of a 4/2 n-transistor. Total capa for the 43 rows: 0.43+43*(0.008+0.0054) = 1pF.
Size of precharge transistor: Every bit line is precharged on nPhB by a 4/2 p-transistor. Capa=1pF. Through the 4/2 p-transistor, the precharge takes 18ns, which is OK.
Size of Vdd bus for precharge: Only half of the bit lines actually need charge, so total capa=128*2*1 = 256pF, charged every 100ns at 5V: average current is thus 256pF*5V/100ns = 13mA. The bus is currently a 23µm by 5800µm wire of metal2: resistance is about 11Ω. The equivalent resistance of the precharge transistors is 44Ω. The peak current is thus 5V/(11Ω+44Ω) = 91mA, which induces a dV = 1V across the bus. If this is too much, it is easy to widen the bus to 100µm. This is in fact likely since this bus will carry Vdd for the whole chip anyway.
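The bit-line and precharge-bus estimates above can be re-run as a short calculation. This is a sketch using only the figures quoted in the text; the function and constant names are hypothetical, and the two per-cell terms are lumped together because the text does not say which is the contact and which is the drain.

```python
ROWS = 43
C_WIRE_PF = 0.43                  # 2365 lambda of metal1, from the text
C_PER_CELL_PF = 0.008 + 0.0054    # contact plus drain terms, per cell

# Total bit-line capacitance, in pF (about 1pF)
bit_line_pf = C_WIRE_PF + ROWS * C_PER_CELL_PF

def precharge_bus(lines_charged=128 * 2, c_line_pf=1.0, vdd=5.0,
                  cycle_ns=100.0, r_bus_ohm=11.0, r_transistors_ohm=44.0):
    """Return (total pF, average mA, peak mA, voltage drop on the bus in V)."""
    total_pf = lines_charged * c_line_pf
    avg_ma = total_pf * vdd / cycle_ns          # pF * V / ns = mA
    peak_ma = 1000.0 * vdd / (r_bus_ohm + r_transistors_ohm)
    drop_v = (peak_ma / 1000.0) * r_bus_ohm     # drop across the bus only
    return total_pf, avg_ma, peak_ma, drop_v
```

Changing r_bus_ohm shows how widening the bus trades area against the drop: the bus resistance scales inversely with its width.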
Multiplexers
The bit lines go through a 4 to 1 mux (column decoder). The transistors are all 14/2. Discharging a bit line thus takes 36*1/14=2.6ns. From top to bottom, the lines at the output of the mux are c, a, ~c, ~b.
Read and write
Read
After going through the multiplexers, the a and ~b bit-lines are sensed by an inverter. The ratio is modified in order to raise the threshold.
Then follows another inverter (in the case of ~b) and a hefty driver for the aBus (resp. bBus). A typical bus in the datapath is made of no more than 4mm of m1, and a few contacts that we will neglect. Say 0.8pF for a bus. A 16/2 device drives that in 2ns, which should suffice.
Write
Both c and ~c are precharged. The section located "under" the mux is precharged separately by a 4/2 p-tr. One of the bit lines is then pulled down by two 32/2 n-tr in series, equivalent to a 16/2. Combined with the 14/2 mux, it takes about 4ns to write. If this is not enough, we have to increase the size of both the pull-downs and the mux.
Address registers
During PhB, a 32-bit register samples the kBus. Its 24 left-most bits are aAdr, bAdr, cAdr. The 8 bits on the right are EUAluLeftSrc1BA, EUAluRightSrc1BA, EUStore2ASrc1BA, EURes3AisCBus2BA, and EUSt3AisCBus2BA. These 32 bits are shipped to the control part, where an inverter provides the complementary value.
Size of the inverter: the worst case (in the RAM) is less than 3mm of poly and 50 4/2 gates = 1.2pF. An inverter of size n=8/2, p=16/2 will do the job in 5ns. The delay due to the resistance of the poly is about 3ns, so someday a double in m2 will be a good idea.
ALU
Structure and Behavior
Inputs: opLBus, opRBus, EUAluOp2AB, EUCondSel2AB
Outputs: sBus, EUCondition2BA
States: carryBA, carryAB
All inputs are stable by the end of PhA, and the results are valid before the end of PhB.
The major blocks:
FB: encoding p and k, computing the result
CP: carry propagator (probably a flat tree, since it is fast enough and smaller)
CS: carry selection
CC: condition codes computation and selection
FB encodes only five functions: add, sub, or, and, xor; CP always performs the same function.
Function block and Carry propagator
Takes two 32-bit inputs and a carryIn, and produces a 32-bit result and a carryOut.
Select lines:
add ← PhB AND EUAluOp2AB IN {VAdd2(2), SAdd(4), LAdd(6), VAdd(12), UAdd(14)}
sub ← PhB AND EUAluOp2AB IN {BndChk(3), SSub(5), LSub(7), VSub(13), USub(15)}
or ← PhB AND EUAluOp2AB=Or(0)
and ← PhB AND EUAluOp2AB=And(1)
xor ← PhB AND EUAluOp2AB=Xor(8)
The p and k functions are encoded as a function of the operands a and b as follows (+ is OR and . is AND)
add:   ~p ← a.b+~a.~b    ~k ← a+b
sub:   ~p ← ~a.b+a.~b    ~k ← a+~b
xor:   ~p ← a.b+~a.~b    ~k ← 0
or:    ~p ← ~a.~b        ~k ← 0
and:   ~p ← ~a+~b        ~k ← 0
In all cases s ← carry XOR p
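The p/k encoding above can be checked with a small bit-level model. This is a sketch, not the actual circuit: it replaces the carry propagator tree with a plain ripple loop, and it assumes that k takes priority over p when both are set (which happens only for the logical functions, where the carry is killed everywhere). The names pk and alu are hypothetical. Bit indices inside the loop follow the Python convention (bit 0 is the low-order bit), not the datapath numbering.

```python
MASK32 = 0xFFFFFFFF

def pk(op, a, b):
    """Return the 32-bit (p, k) words for op, from the ~p / ~k table."""
    na, nb = ~a & MASK32, ~b & MASK32
    if op == "add":
        not_p, not_k = (a & b) | (na & nb), a | b
    elif op == "sub":
        not_p, not_k = (na & b) | (a & nb), a | nb
    elif op == "xor":
        not_p, not_k = (a & b) | (na & nb), 0
    elif op == "or":
        not_p, not_k = na & nb, 0
    elif op == "and":
        not_p, not_k = na | nb, 0
    else:
        raise ValueError(op)
    return ~not_p & MASK32, ~not_k & MASK32

def alu(op, a, b, carry_in=0):
    """Return (result, carry_out) using s <- carry XOR p at every bit."""
    p, k = pk(op, a, b)
    carry, s = carry_in, 0
    for bit in range(32):              # low-order bit first (ripple, assumed)
        p_bit = (p >> bit) & 1
        k_bit = (k >> bit) & 1
        s |= (carry ^ p_bit) << bit
        if k_bit:
            carry = 0                  # kill
        elif not p_bit:
            carry = 1                  # neither p nor k: generate
        # otherwise propagate: carry unchanged
    return s, carry
```

For example, alu("sub", 7, 5, carry_in=1) models 7 + ~5 + 1, and the logical functions are expected to be called with carry_in=0.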
Carry selection
carryAB
carryAB is usually just a copy of the carryBA. Exceptions: as long as rejectBA is issued, the carry is not recycled, in order to preserve the state (same idea as for all PhA registers). Similarly, if EUCondition2BA is TRUE, the IFU is going to execute a jump or trap, and the present instruction should not modify any state, since it is irrelevant.
IF NOT (rejectBA OR EUCondition2BA) THEN carryAB ← carryBA
The carryIn used by FB is produced from carryAB.
SAdd, SSub, UAdd, USub   => carryIn ← carryAB
VAdd, VAdd2, LAdd, FOP, FOPK, And, Or, Xor => carryIn ← FALSE
VSub, LSub, BndChk   => carryIn ← TRUE
carryBA
carryBA is a function of the carryOut.
SAdd, SSub, LAdd, LSub    => carryBA ← FALSE
UAdd      => carryBA ← carryOut
USub      => carryBA ← ~carryOut
VAdd, VAdd2, VSub, FOP, FOPK, And, Or, Xor, BndChk => carryBA ← carryOut
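The carry rules above read naturally as three small lookup functions. The following sketch transcribes the tables exactly as listed; the function names and the use of opcode names as strings are illustrative assumptions.

```python
CARRY_IN_FROM_STATE = {"SAdd", "SSub", "UAdd", "USub"}
CARRY_IN_FALSE = {"VAdd", "VAdd2", "LAdd", "FOP", "FOPK", "And", "Or", "Xor"}
CARRY_IN_TRUE = {"VSub", "LSub", "BndChk"}

def carry_in(op, carry_ab):
    """carryIn presented to FB, derived from carryAB."""
    if op in CARRY_IN_FROM_STATE:
        return carry_ab
    if op in CARRY_IN_TRUE:
        return 1
    if op in CARRY_IN_FALSE:
        return 0
    raise ValueError(op)

def next_carry_ba(op, carry_out):
    """carryBA as a function of the carryOut, per the table above."""
    if op in {"SAdd", "SSub", "LAdd", "LSub"}:
        return 0
    if op == "USub":
        return 1 - carry_out
    if op in {"UAdd", "VAdd", "VAdd2", "VSub", "FOP", "FOPK",
              "And", "Or", "Xor", "BndChk"}:
        return carry_out
    raise ValueError(op)

def next_carry_ab(carry_ab, carry_ba, reject_ba, condition):
    """IF NOT (rejectBA OR EUCondition2BA) THEN carryAB <- carryBA."""
    return carry_ab if (reject_ba or condition) else carry_ba
```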
Condition codes
That's where the fun starts! The ALU computes 16 different condition codes, and selects the right one. Here we go with the list, with complementary codes grouped together:
False(0), True(4): pretty easy to implement.
EZ(1), NE(5): EZ ← result=0; a tree of gates (NOR?) checks the output of the FB.
LZ(2), GE(6): LZ ← result[0]; the high-order bit of the result.
LE(3), GZ(7): GZ ← NOR[LZ, EZ].
OvFl(8), NotOvFl(12): rats! A XOR of two high-order carries.
BC(9), NotBC(13): BC ← opLBus[0]=1 OR GE
IL(10), NotIL(14): IL ← (opLBus[0]#opLBus[1]) OR (opRBus[0]#opRBus[1]) OR (result[0]#result[1])
Kernal(15): Kernal ← result[0..7]=0
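A behavioral sketch of the list above, with bit 0 taken as the high-order bit as everywhere in this document. The text only says that OvFl is a XOR of two high-order carries; this model assumes they are the carry into and the carry out of bit 0, and takes them as arguments. Function and argument names are hypothetical.

```python
def bit(word, i):
    """Bit i of a 32-bit word, bit 0 being the high-order bit."""
    return (word >> (31 - i)) & 1

def condition_codes(op_l, op_r, result, carry_into_msb=0, carry_out=0):
    """Return a dict of all condition codes for one ALU result."""
    ez = int(result == 0)
    lz = bit(result, 0)
    gz = int(not (lz or ez))                    # NOR[LZ, EZ]
    ovfl = carry_into_msb ^ carry_out           # assumed pair of carries
    bc = int(bit(op_l, 0) == 1 or not lz)       # opLBus[0]=1 OR GE
    il = int(bit(op_l, 0) != bit(op_l, 1)
             or bit(op_r, 0) != bit(op_r, 1)
             or bit(result, 0) != bit(result, 1))
    kernal = int((result >> 24) == 0)           # result[0..7]=0
    return {
        "False": 0, "True": 1,
        "EZ": ez, "NE": 1 - ez,
        "LZ": lz, "GE": 1 - lz,
        "LE": 1 - gz, "GZ": gz,
        "OvFl": ovfl, "NotOvFl": 1 - ovfl,
        "BC": bc, "NotBC": 1 - bc,
        "IL": il, "NotIL": 1 - il,
        "Kernal": kernal,
    }
```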
This part is really irregular (so I hate it). Major blocks:
A tree of XXX gates, checking whether the result is zero. An intermediate value is used by Kernal, and the final value by EZ.
A bunch of XOR, NOT, ...
Field Unit
Field Descriptor and behavior
The opcode is described by a 13-bit quantity called the field descriptor, which can come from two registers: field, aliased with RAM[euField], and KBusAB, loaded on every PhA from kBus.
Field descriptor: insert(BOOL), mask([0..32]) and shift([0..32]).
The double word (opLBus, opRBus) is left-shifted by shift to produce shiftout. So if shift=0, we have shiftout=opLBus, and if shift=32, shiftout=opRBus.
Two masks are produced: mask1 has mask ones on the right, and mask2 has shift ones on the right.
If insert is FALSE, mask2 is discarded, and the result is AND[shiftout, mask1], which keeps only the mask rightmost bits of shiftout.
If insert is TRUE (ah!ah!), both masks are XORed to form maskHole. Now, the result is formed bitwise as follows: if maskHole=1, then the result is shiftout, and if maskHole=0 the result is opRBus.
The result is finally written onto the sBus if the opcode is FOP or FOPK.
This translates into:
maskHole ← MaskGen[mask] XOR (MaskGen[shift] AND insert)
shiftout ← ShiftLeft[opLBus, opRBus, shift]
result ← (maskHole AND shiftout) OR (~maskHole AND opRBus AND insert)
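The three equations above translate directly into a functional model on 32-bit integers. This is a minimal sketch; MaskGen and ShiftLeft are modelled arithmetically rather than as the gate arrays described later, and the Python names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def mask_gen(k):
    """32-bit word with k ones on the right, for 0 <= k <= 32."""
    return (1 << k) - 1

def shift_left(op_l, op_r, shift):
    """High 32 bits of the double word (op_l, op_r) left-shifted by shift."""
    double = (op_l << 32) | op_r
    return (double >> (32 - shift)) & MASK32

def field_op(op_l, op_r, insert, mask, shift):
    """Result of a field operation, following the three equations above."""
    ins = MASK32 if insert else 0
    mask_hole = mask_gen(mask) ^ (mask_gen(shift) & ins)
    shiftout = shift_left(op_l, op_r, shift)
    return (mask_hole & shiftout) | (~mask_hole & MASK32 & op_r & ins)
```

With insert FALSE, mask=8 and shift=8, the model extracts the byte that straddles the boundary position selected by shift; with insert TRUE, mask=16 and shift=8, it drops the low byte of opLBus into bits 8..15 (counted from the right) of opRBus.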
Shifter
At first, I used a barrel shifter by even amounts (0, 2, 4, . . . , 30) followed by a shift by one, which saves control lines; however, the saving is pretty small, and the gain in regularity without it substantial, so I kissed it goodbye and I now use a plain barrel shifter. The inputs are connected directly to opLBus (on top) and opRBus (on the bottom).
Special case for shift = 32, so 33 select lines.
sh0 ← PhB AND (shift=0)
sh1 ← PhB AND (shift=1), and so on
sh31 ← PhB AND (shift=31)
sh32 ← PhB AND (shift =32) -- so don't use shift>32
Mask generators
Both are identical, though their controls differ. A mask generator receives a 6-bit input (let's say k), and produces a 32-bit word with k ones on the right (of course the rest is filled with zeros).
***Theory of mask generation . . .***
A touch of theory: let k and i be two n+1-bit numbers, and F[n, k, i]=k>i. By recurring on the high-order bit, we find that
F[n, k, i]=k[n].~i[n] + (k[n]=i[n]). F[n-1, k, i]
F[0, k, i]=k[0].~i[0]
In every slice, i is a constant and k an input. Let's decompose the former equations according to the value of i[n]:
i[n]=0:
F[n, k, i]=k[n] + F[n-1, k, i]=Nand [~k[n], ~F[n-1, k, i]]
F[0, k, i]=k[0]
i[n]=1:
F[n, k, i]=k[n] . F[n-1, k, i]=Nor [~k[n], ~F[n-1, k, i]]
F[0, k, i]=0
Let's define Op[i, n] =if i[n] then Nor else Nand (risky notation, I know!). Then we find that
F[n, k, i]=Op[i, n][~k[n], ~F[n-1, k, i]]=Op[i, n][~k[n], ~Op[i, n-1][~k[n-1], ~F[n-2, k, i]]].
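The recurrence above can be checked exhaustively with a few lines of code. This sketch implements F with one NAND or NOR per bit, chosen by the constant i, and ignores the alternating inversions used in the real array; in this section k[0] is the low-order bit of k, which is what the code assumes. The function names are hypothetical.

```python
def nand(x, y):
    return 1 - (x & y)

def nor(x, y):
    return 1 - (x | y)

def greater_than(k, i, n=5):
    """F[n, k, i] = (k > i), from the NAND/NOR recurrence on bits n..0."""
    # Base case: F[0] = k[0] if i[0] = 0, else 0
    f = (k & 1) if (i & 1) == 0 else 0
    for level in range(1, n + 1):
        k_bit = (k >> level) & 1
        op = nor if (i >> level) & 1 else nand   # Op[i, level]
        f = op(1 - k_bit, 1 - f)                 # Op[~k[level], ~F[level-1]]
    return f

def mask_word(k):
    """Bit i (counted from the right) is 1 iff k > i: k ones on the right."""
    return sum(greater_than(k, i) << i for i in range(32))
```

Each slice of the mask generator corresponds to one fixed value of i, so mask_word simply evaluates the 32 slices for a given k.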
Now back to the EU. k is represented on 6 bits, k[5]..k[0]. With some care, we find that the mask can be implemented with an array of NOR and NAND as follows:
(1) if i[n]=0 use a Nand, otherwise use a Nor;
(2) if n is odd (5, 3, 1) then invert rule (1);
(3) the select lines carrying k are inverted for n=5, 3, 1, and 0;
(4) the first gate collapses into ~k[0] if i is even, and 1 otherwise, which explains (3);
Every slice contains 5 gates, and receives 6 select lines. The output is the mask, not its complement.
Control lines:
mask1Sel[i] ← PhB AND mask[i], for i=2, 4
mask1Sel[i] ← PhB AND ~mask[i], for i=0, 1, 3, 5
mask2Sel[i] ← PhB AND shift[i], for i=2, 4
mask2Sel[i] ← PhB AND ~shift[i], for i=0, 1, 3, 5
Merge box
It merges both masks, using a xor gate, implemented in cascode style. No control line is needed.
Then it produces the result. The only control needed is insert, no timing or holding really necessary.
The equation is:
~insert: mask ← mask1; sBus ← mask AND shiftout
insert: mask ← mask1 XOR mask2; sBus ← (mask AND shiftout) OR (~mask AND opRBus)
This can be summarized as:
mask ← mask1 XOR (mask2 AND insert)
sBus ← (mask AND shiftout) OR (~mask AND opRBus AND insert)
Finally the merge box writes on the sBus.
sBusWrEnable ← PhB AND ~ACERTAINHOLD AND (EUAluOp2AB=FOP OR EUAluOp2AB=FOPK)
PPort
The PPort has three inputs: an address from rBus, a dataOut from pBus, and a dataIn from EPData. It has two outputs: EPData itself, and an output direct to result3BA.
Pipeline registers
Structure
The registers are static, with a weak feedback inverter. Optimum ratios as computed by Ed McCreight. Inputs through a simple mux: just one n-transistor, no pass-gate. The proper input ratio and feedback loop take care of the threshold loss. The output driver can be a simple inverter (size n=16/2 and p=32/2), or a tristate driver. It can drive a 1pF load (equivalent to the worst case for an internal bus) in less than 3ns.
PhA latches: updated on PhA, unless rejectBA or hold2BA is high
leftOp2AB
Inputs: aBus, rBus, cBus
Output: opLBus
leftSelA ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=aBus)
leftSelR ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=rBus)
leftSelC ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=cBus)
rightOp2AB
Inputs: bBus, rBus, cBus, kBus
Output: opRBus
rightSelB ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=bBus)
rightSelR ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=rBus)
rightSelC ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=cBus)
rightSelK ← PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=kBus)
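The select equations for the PhA latches all follow the same pattern: a one-hot choice of source, gated by PhA and disabled by hold2BA or rejectBA. A minimal behavioral sketch, with hypothetical function names and with the source field represented by a bus name rather than its (unspecified) binary encoding:

```python
def pha_selects(src, sources, ph_a, hold2ba=False, reject_ba=False):
    """One select line per source: PhA AND ~hold2BA AND ~rejectBA AND (src=name)."""
    enable = ph_a and not hold2ba and not reject_ba
    return {name: int(enable and src == name) for name in sources}

def latch_update(current, selects, buses):
    """Load the selected bus, or keep the current value if no line is high."""
    for name, selected in selects.items():
        if selected:
            return buses[name]
    return current
```

For rightOp2AB, sources would be ["bBus", "rBus", "cBus", "kBus"]; when hold2BA or rejectBA is high, every select line stays low and the latch keeps its previous value.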
store2AB
Inputs: bBus, rBus, cBus
Output: direct to store2BA
st2ABSelB ← PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=bBus)
st2ABSelR ← PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=rBus)
st2ABSelC ← PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=cBus)
store3AB
Inputs: direct from store2BA, cBus
Output: pBus
st3ABSelC ← PhA AND ~hold2BA AND ~rejectBA AND EUSt3AisCBus2BA
st3ABSelSt ← PhA AND ~hold2BA AND ~rejectBA AND ~EUSt3AisCBus2BA
result3AB
Inputs: rBus, cBus
Output: direct to result3BA
res3ABSelR ← PhA AND ~hold2BA AND ~rejectBA AND ~EURes3AisCBus2BA
res3ABSelC ← PhA AND ~hold2BA AND ~rejectBA AND EURes3AisCBus2BA
kBusAB
Input: kBus
Output: fd
loadkBusAB ← PhA
field
Input: cBus
Output: fd
loadField ← PhA AND EULoadField3BA
fd
Input: kBusAB, field
Output: field descriptor in the control column
fdSelField ← PhA AND (EUAluOp2AB=FOP)
fdSelkBusAB ← PhA AND (EUAluOp2AB=FOPK)
PhB latches: updated on PhB, unless hold2BA is high
store2BA
Inputs: direct from store2AB
Output: direct to store3AB
st2BASel ← PhB AND ~hold2BA
result2BA
Inputs: sBus
Output: rBus
res2BASel ← PhB AND ~hold2BA
result3BA
Inputs: direct from result3AB, pPort
Output: cBus
res3BASelP ← PhB AND ~hold2BA AND ~rejectBA AND EURes3BisPBus3AB
res3BASel ← PhB AND ~hold2BA AND (EUWriteToPBus3AB OR rejectBA OR (~rejectBA AND ~EURes3BisPBus3AB))
Sizes
µm: RAM:
µm: Bit Lines Mux and read/write
µm: drive kBus
µm: address regs
µm: pipeline regs
686 µm: Field Unit
Power buses
The power buses are mostly in metal2. Let's assume that the maximum voltage drop we can tolerate is 0.25V and that the resistance of m2 is 0.042Ω/square. Let dI be the peak current, and S the bus size in squares. We find R=0.042*S=0.25/dI, or
S (squares)=6/dI (A).
If the chip is 5mm high, the resistance of an m2 power bus of width w (in µm) is 0.042*5000/w = 210/w Ω.
Control Drivers
The decoding is done with NAND decoders, sometimes arranged as NOR of NAND. The worst case is 8 n-tr in series, and a 2pF load. The pull-down transistors are 4/2, the driver is p=30/2, n=12/2. Simulations show a spike of 1.5mA in the power supply, and a delay of about 10ns. Increasing the size of the driver makes things worse, so the delay comes from the pull-down chain. With 7/2 pull-down, the spike goes up to 2mA, but the delay is reduced to less than 7ns.
Now, let's assume that at most 100 such drivers are active at a given time. Actually, this is a bit too much since the drivers come in two sexes (PhA and PhB) and most of them don't go high. The max spike is 200mA. With the standard formula, we find a bus aspect ratio of at most 30. If the chip is 5mm high, this means a 166µm width. Escapes: well, this is only a worst case; the control part has no state, so a bounce is not so serious; I can connect more often from the pad ring; . . . I make a layout with 50µm for now, OK?
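The bus-sizing rule used here (maximum drop of 0.25V, m2 at 0.042 ohm per square) is easy to re-run for other currents or bus lengths. A small sketch under those stated figures; the function names are hypothetical.

```python
R_SQUARE_OHM = 0.042    # metal2 sheet resistance, ohm per square
MAX_DROP_V = 0.25       # tolerated voltage drop on the bus

def max_squares(peak_current_a):
    """Largest bus aspect ratio (in squares) that keeps the drop within budget."""
    return MAX_DROP_V / (R_SQUARE_OHM * peak_current_a)

def min_width_um(peak_current_a, length_um=5000.0):
    """Narrowest bus of the given length that respects the aspect ratio."""
    return length_um / max_squares(peak_current_a)

# 100 drivers at 2mA each, on a bus running the 5mm height of the chip
peak_a = 100 * 0.002
squares = max_squares(peak_a)       # a little under 30 squares
width_um = min_width_um(peak_a)     # in the 166-168 um range
```

The unrounded constant is 0.25/0.042, slightly under the 6 used in the text, which is why the computed width lands a couple of microns above the rounded figure.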
Pad Frame
A standard pad frame using the pads from CmosPadLibraryPGA144 is square; inside dimension 7400µm, outside dimension 8686µm. I hope to share it with Pradeep.
Optimizations
After a simple mux and an inverter, add a p-tr to restore the level.
Metal2 should double the control lines in the control column too.
Don't Forget
Bias voltage generator for the control.
res2BA, 3AB, 3BA
pBus driver
ALU
inverters for all control lines that need them
some inputs (control from IFU) should be delayed