Data Path

Bit 0 is the high-order bit and is on the left. Whenever possible, this order is respected. The control is on the left.

Every slice of the datapath follows these rules:
  Vdd, Gnd, clocks and control signals run horizontally in metal2. Clocks and control signals might be doubled in poly.
  Buses and wires connecting different stages run vertically in metal1. Seven tracks are reserved on the left side of every cell: rBus, cBus, kBus, aBus/pBus, bBus/sBus, opLBus, opRBus.
  Whenever possible, vertical tracks in metal1 connect horizontal power buses in metal2.

Control descriptors

The control descriptor at the level of the RAM is:
  PhA                BIT
  PhB                BIT
  ~hold2BA           BIT
  ~rejectBA          BIT
  DExecute           BIT
  DStateAddress      INT[4]
  aAdr               INT[8]
  bAdr               INT[8]
  cAdr               INT[8]
  spare4w            INT[4]
  spare1b            BIT

The control descriptor below the RAM (70 wires) is:
  PhA                BIT
  PhB                BIT
  ~hold2BA           BIT
  ~rejectBA          BIT
  DExecute           BIT
  DStateAddress      INT[4]
  EUAluLeftSrc1BA    INT[2]   *
  EUAluRightSrc1BA   INT[2]   *
  EUStore2ASrc1BA    INT[2]   *
  EUSt3AisCBus2BA    BOOL     *
  EURes3AisCBus2BA   BOOL     *
  FDInsert           BOOL
  FDMask             INT[6]
  FDShift            INT[6]
  EUAluOp2AB         INT[5]
  EULoadField3BA     BIT
  EUWriteToPBus3AB   BOOL
  EURes3BisPBus3AB   BOOL
(* means: input through the kBus)

Horizontal pitch in the datapath

Fixed by the RAM. The h-pitch is now 200 lambda. It is an even number, so that we can fit two copies of the same cell in one slice.

Vertical pitch

A NAND decoder fits on a 12 lambda vertical pitch, if we use mirroring.

RAM

Structure and behavior

Every line is made of 128 bits, and the RAM has 43 lines, i.e. 172 32-bit registers. The registers are organised as follows:
  0..127: the stack, 32 very normal rows.
  128..131: euJunk (just a convention), 129 (spare), euMAR, euField.
  132..143: FP registers.
  144..155: constants registers.
  160..175: auxiliary registers.

All decoders use a precharged NAND followed by a driver. The precharge clock is nPhB.
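The register organisation above can be written down as a small lookup, which makes the unlisted addresses visible. This is only a sketch: the names `REGIONS`, `region_of` and `line_of` are hypothetical, and it assumes that the four 32-bit registers of a 128-bit line are numbered consecutively (which is what "0..127: 32 rows" suggests).

```python
# Hypothetical lookup table built from the register map in the text.
REGIONS = [
    (0, 127, "stack"),
    (128, 131, "euJunk, spare, euMAR, euField"),
    (132, 143, "FP registers"),
    (144, 155, "constants registers"),
    (160, 175, "auxiliary registers"),
]

# A line is 128 bits wide, so it holds four 32-bit registers.
REGISTERS_PER_LINE = 128 // 32


def region_of(reg):
    """Return the name of the region holding register `reg`, or None if unlisted."""
    for first, last, name in REGIONS:
        if first <= reg <= last:
            return name
    return None


def line_of(reg):
    """RAM line holding register `reg` (assumes consecutive numbering within a line)."""
    return reg // REGISTERS_PER_LINE
```

With this table, `region_of(157)` returns `None`, since the text does not assign the range 156..159.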
The control descriptor of the RAM (28 wires) is:
  PhA, ~hold2BA, ~rejectBA, aAdr[0..7], bAdr[0..7], cAdr[0..7]

The output of the row decoder:
  selectLine _ PhA AND (adr[0..5] matches)
The column decoder output:
  selectLine _ PhA AND ~hold2BA AND ~rejectBA AND (adr[6..7] matches)

Behavior

The select lines are always maintained low during PhB.
In the absence of rejectBA or hold2BA, exactly one row select line and one column select line go high on PhA.
hold2BA OR rejectBA => absolutely no access to the RAM.

Array

Block: 200 x 55
One block is made of four RAM cells, plus one column of contacts and the kBus. The RAM cell is static and three-ported. It has four bit lines (c, a, ~c, ~b). Power and ground run horizontally in diffusion, with one contact to metal2 on every block. The three select lines run horizontally in poly, doubled in metal2 with one contact per block. Access transistors are 4/2, pull-downs are 12/2, and pull-ups are 3/2 (could be longer if needed).

Bit lines: 2365
In one cell, a bit line is made of 55 rows: 0.43+43*(0.008+0.0054) = 1pF.

Size of precharge transistor: Every bit line is precharged on nPhB by a 4/2 p-transistor. Capacitance = 1pF. Through the 4/2 p-transistor, the precharge takes 18ns, which is OK.

Size of Vdd bus for precharge: Only half of the bit lines actually need charge, so the total capacitance is 128*2*1 = 256pF, charged every 100ns at 5V: the average current is thus 256pF*5V/100ns = 13mA. The bus is currently a 23The equivalent resistance of the precharge transistors is 44too much, it is easy to widen the bus to 100

Multiplexers

The bit lines go through a 4 to 1 mux (column decoder). The transistors are all 14/2. Discharging a bit line thus takes 36*1/14=2.6ns. From top to bottom, the lines at the output of the mux are c, a, ~c, ~b.

Read and write

Read
After going through the multiplexers, the a and ~b bit lines are sensed by an inverter. The ratio is modified in order to raise the threshold.
Then follows another inverter (in the case of ~b) and a hefty driver for the aBus (resp. bBus). A typical bus in the datapath is made of no more than 4mm of m1, and a few contacts that we will neglect. Say 0.8pF for a bus. A 16/2 device drives that in 2ns, which should suffice.

Write
Both c and ~c are precharged. The section located "under" the mux is precharged separately by a 4/2 p-tr. One of the bit lines is then pulled down by two 32/2 n-tr in series, equivalent to a 16/2. Combined with the 14/2 mux, it takes about 4ns to write. If this is not enough, we have to increase the size of both the pull-downs and the mux.

Address registers

During PhB, a 32-bit register samples the kBus. Its 24 left-most bits are aAdr, bAdr, cAdr. The 8 bits on the right are EUAluLeftSrc1BA, EUAluRightSrc1BA, EUStore2ASrc1BA, EURes3AisCBus2BA, and EUSt3AisCBus2BA. These 32 bits are shipped to the control part, where an inverter provides the complementary value.

Size of the inverter: the worst case (in the RAM) is less than 3mm of poly and 50 4/2 gates = 1.2pF. An inverter of size n=8/2, p=16/2 will do the job in 5ns. The delay due to the resistance of the poly is about 3ns, so doubling it in m2 is a good idea.

ALU

Structure and Behavior

Inputs: opLBus, opRBus, EUAluOp2AB, EUCondSel2AB
Outputs: sBus, EUCondition2BA
States: carryBA, carryAB

All inputs are stable by the end of PhA, and the results are valid before the end of PhB.
The major blocks:
  FB: encoding p and k, computing the result
  CP: carry propagator (probably a flat tree, since it is fast enough and smaller)
  CS: carry selection
  CC: condition codes computation and selection

FB encodes only five functions: add, sub, or, and, xor; CP always performs the same function.

Function block and Carry propagator

Takes two 32-bit inputs and a carryIn, and produces a 32-bit result and a carryOut.

Select lines:
  add _ PhB AND EUAluOp2AB IN {VAdd2(2), SAdd(4), LAdd(6), VAdd(12), UAdd(14)}
  sub _ PhB AND EUAluOp2AB IN {BndChk(3), SSub(5), LSub(7), VSub(13), USub(15)}
  or _ PhB AND EUAluOp2AB=Or(0)
  And _ PhB AND EUAluOp2AB=And(1)
  Xor _ PhB AND EUAluOp2AB=Xor(8)

The p and k functions are encoded as a function of the operands a and b as follows (+ is OR and . is AND):
  add:  ~p _ a.b+~a.~b    ~k _ a+b
  sub:  ~p _ ~a.b+a.~b    ~k _ a+~b
  xor:  ~p _ a.b+~a.~b    ~k _ 0
  or:   ~p _ ~a.~b        ~k _ 0
  and:  ~p _ ~a+~b        ~k _ 0
In all cases s _ carry XOR p

Carry selection

carryAB
carryAB is usually just a copy of carryBA. Exceptions: as long as rejectBA is issued, the carry is not recycled, in order to preserve the state (same idea as for all PhA registers). Similarly, if EUCondition2BA is TRUE, the IFU is going to execute a jump or trap, and the present instruction should not modify any state, since it is irrelevant.
  IF NOT (rejectBA OR EUCondition2BA) THEN carryAB _ carryBA

The carryIn used by FB is produced from carryAB.
  SAdd, SSub, UAdd, USub => carryIn _ carryAB
  VAdd, VAdd2, LAdd, FOP, FOPK, And, Or, Xor => carryIn _ FALSE
  VSub, LSub, BndChk => carryIn _ TRUE

carryBA
carryBA is a function of the carryOut.
  SAdd, SSub, LAdd, LSub => carryBA _ FALSE
  UAdd => carryBA _ carryOut
  USub => carryBA _ ~carryOut
  VAdd, VAdd2, VSub, FOP, FOPK, And, Or, Xor, BndChk => carryBA _ carryOut

Condition codes

That's where the fun starts! The ALU computes 16 different condition codes, and selects the right one.
Here we go with the list, with complementary codes grouped together:
  False(0), True(4): pretty easy to implement.
  EZ(1), NE(5): EZ _ result=0; a tree of gates (NOR?) checks the output of the FB.
  LZ(2), GE(6): LZ _ result[0]; the high-order bit of the result.
  LE(3), GZ(7): GZ _ NOR[LZ, EZ].
  OvFl(8), NotOvFl(12): rats! A XOR of two high-order carries.
  BC(9), NotBC(13): BC _ opLBus[0]=1 OR GE
  IL(10), NotIL(14): IL _ (opLBus[0]#opLBus[1]) OR (opRBus[0]#opRBus[1]) OR (result[0]#result[1])
  Kernal(15): Kernal _ result[0..7]=0

This part is really irregular (so I hate it).

Major blocks:
  A tree of XXX gates, checking whether the result is zero. An intermediate value is used by Kernal, and the final value by EZ.
  A bunch of XOR, NOT, ...

Field Unit

Field Descriptor and behavior

The opcode is described by a 13-bit quantity called the field descriptor, which can come from two registers: field, aliased with RAM[euField], and kBusAB, loaded on every PhA from kBus.

Field descriptor: insert(BOOL), mask([0..32]) and shift([0..32]).

The double word (opLBus, opRBus) is left-shifted by shift to produce shiftout. So if shift=0, we have shiftout=opLBus, and if shift=32, shiftout=opRBus.
Two masks are produced: mask1 has mask ones on the right, and mask2 has shift ones on the right.
If insert is FALSE, mask2 is discarded, and the result is AND[shiftout, mask1], which keeps only the mask bits on the right.
If insert is TRUE (ah!ah!), both masks are XORed to form maskHole. Now, the result is formed bitwise as follows: if maskHole=1, then the result is shiftout, and if maskHole=0 the result is opRBus.
The result is finally written onto the sBus if the opcode is FOP or FOPK.

This translates into:
  maskHole _ MaskGen[mask] XOR (MaskGen[shift] AND insert)
  shiftout _ ShiftLeft[opLBus, opRBus, shift]
  result _ (maskHole AND shiftout) OR (~maskHole AND opRBus AND insert)

Shifter

At first, I used a barrel shifter by even amounts (0, 2, 4, . . .
, 30) followed by a shift by one, which saves control lines; however, the saving is pretty small, and the increase in regularity substantial, so I kissed it goodbye and I now use a plain barrel shifter. The inputs are connected directly to opLBus (on top) and opRBus (on the bottom). There is a special case for shift = 32, hence 33 select lines.
  sh0 _ PhB AND (shift=0)
  sh1 _ PhB AND (shift=1), and so on
  sh31 _ PhB AND (shift=31)
  sh32 _ PhB AND (shift=32) -- so don't use shift>32

Mask generators

Both are identical, though their controls differ. A mask generator receives a 6-bit input (let's say k), and produces a 32-bit word with k ones on the right (of course the rest is filled with zeros).

***Theory of mask generation . . .***

A touch of theory: let k and i be two n+1-bit numbers, and F[n, k, i]=k>i. By recursion on the high-order bit, we find that
  F[n, k, i] = k[n].~i[n] + (k[n]=i[n]).F[n-1, k, i]
  F[0, k, i] = k[0].~i[0]

In every slice, i is a constant and k an input. Let's decompose the former equations according to the value of i[n]:
  i[n]=0:  F[n, k, i] = k[n] + F[n-1, k, i] = Nand[~k[n], ~F[n-1, k, i]]
           F[0, k, i] = k[0]
  i[n]=1:  F[n, k, i] = k[n] . F[n-1, k, i] = Nor[~k[n], ~F[n-1, k, i]]
           F[0, k, i] = 0

Let's define Op[i, n] = if i[n] then Nor else Nand (risky notation, I know!). Then we find that
  F[n, k, i] = Op[i, n][~k[n], ~F[n-1, k, i]]
             = Op[i, n][~k[n], ~Op[i, n-1][~k[n-1], ~F[n-2, k, i]]].

Now back to the EU. k is represented on 6 bits, k[5]..k[0]. With some care, we find that the mask can be implemented with an array of NOR and NAND as follows:
  (1) if i[n]=0 use a Nand, otherwise use a Nor;
  (2) if n is odd (5, 3, 1) then invert rule (1);
  (3) the select lines carrying k are inverted for n=5, 3, 1, and 0;
  (4) the first gate collapses into ~k[0] if i is even, and 1 otherwise, which explains (3).

Every slice contains 5 gates, and receives 6 select lines. The output is the mask, not its complement.
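The recurrence above can be checked exhaustively with a small behavioral sketch. It models only the plain Nand/Nor form of F (before the alternate-level inversion of rules (2) to (4)), and the function names are hypothetical. Note that in the recurrence index n is the high-order bit of an n+1-bit number, so index 0 is the low-order bit here, unlike the datapath bit numbering.

```python
def nand(x, y):
    return 1 - (x & y)


def nor(x, y):
    return 1 - (x | y)


def bit(value, n):
    """Bit n of value, with n = 0 the low-order bit (recurrence convention)."""
    return (value >> n) & 1


def f_direct(n, k, i):
    """F[n, k, i] = k[n].~i[n] + (k[n]=i[n]).F[n-1, k, i]; F[0] = k[0].~i[0]."""
    kn, i_n = bit(k, n), bit(i, n)
    if n == 0:
        return kn & (1 - i_n)
    same = 1 if kn == i_n else 0
    return (kn & (1 - i_n)) | (same & f_direct(n - 1, k, i))


def f_gates(n, k, i):
    """Same function, one gate per level: Nand if i[n]=0, Nor if i[n]=1."""
    if n == 0:
        return bit(k, 0) if bit(i, 0) == 0 else 0
    op = nor if bit(i, n) else nand
    return op(1 - bit(k, n), 1 - f_gates(n - 1, k, i))


def mask_gen(k):
    """32-bit word with k ones on the right: slice i outputs (k > i)."""
    word = 0
    for i in range(32):
        word |= f_gates(5, k, i) << i
    return word
```

For every k in 0..32, `mask_gen(k)` should equal `(1 << k) - 1`, and both forms of F should agree with the comparison `k > i` over all 6-bit values.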
Control lines:
  mask1Sel[i] _ PhB AND shift[i], for i=2, 4
  mask1Sel[i] _ PhB AND ~shift[i], for i=0, 1, 3, 5
  mask2Sel[i] _ PhB AND mask[i], for i=2, 4
  mask2Sel[i] _ PhB AND ~mask[i], for i=0, 1, 3, 5

Merge box

It merges both masks, using a XOR gate implemented in cascode style. No control line is needed for that.
Then it produces the result. The only control needed is insert; no timing or holding is really necessary. The equation is:
  ~insert: mask _ mask1; sBus _ mask AND shiftout
  insert: mask _ mask1 XOR mask2; sBus _ (mask AND shiftout) OR (~mask AND opRBus)
This can be summarized as:
  mask _ mask1 XOR (mask2 AND insert)
  sBus _ (mask AND shiftout) OR (~mask AND opRBus AND insert)
Finally the merge box writes on the sBus.
  sBusWrEnable _ PhB AND ~ACERTAINHOLD AND (EUAluOp2AB=FOP OR EUAluOp2AB=FOPK)

PPort

The PPort has three inputs: an address from rBus, a dataOut from pBus, and a dataIn from EPData. It has two outputs: EPData itself, and an output direct to result3BA.

Pipeline registers

Structure

The registers are static, with a weak feedback inverter. Optimum ratios are as computed by Ed McCreight.
Inputs go through a simple mux: just one n-transistor, no pass-gate. The proper input ratio and feedback loop take care of the threshold loss.
The output driver can be a simple inverter (size n=16/2 and p=32/2), or a tristate driver. It can drive a 1pF load (equivalent to the worst case for an internal bus) in less than 3ns. The peak currents measured on a Thyme simulation are roughly identical for Vdd and Gnd, at 3.2mA per cell. This means 100mA for a row of 32 cells, or 100purpose?" asks EMM, . . .
OK, 20

PhA latches: updated on PhA, unless rejectBA or hold2BA is high

leftOp2AB
  Inputs: aBus, rBus, cBus
  Output: opLBus
  leftSelA _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=aBus)
  leftSelR _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=rBus)
  leftSelC _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluLeftSrc1BA=cBus)

rightOp2AB
  Inputs: bBus, rBus, cBus, kBus
  Output: opRBus
  rightSelB _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=bBus)
  rightSelR _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=rBus)
  rightSelC _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=cBus)
  rightSelK _ PhA AND ~hold2BA AND ~rejectBA AND (EUAluRightSrc1BA=kBus)

store2AB
  Inputs: bBus, rBus, cBus
  Output: direct to store2BA (down)
  st2ABSelB _ PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=bBus)
  st2ABSelR _ PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=rBus)
  st2ABSelC _ PhA AND ~hold2BA AND ~rejectBA AND (EUStore2ASrc1BA=cBus)

store3AB
  Inputs: direct from store2BA (up), cBus
  Output: pBus
  st3ABSelC _ PhA AND ~hold2BA AND ~rejectBA AND EUSt3AisCBus2BA
  st3ABSelSt _ PhA AND ~hold2BA AND ~rejectBA AND ~EUSt3AisCBus2BA

result3AB
  Inputs: rBus, cBus
  Output: direct to result3BA (OpLBus)
  res3ABSelR _ PhA AND ~hold2BA AND ~rejectBA AND ~EURes3AisCBus2BA
  res3ABSelC _ PhA AND ~hold2BA AND ~rejectBA AND EURes3AisCBus2BA

kBusAB
  Input: kBus
  Output: fd (OpLBus)
  loadkBusAB _ PhA AND ~hold2BA AND ~rejectBA

field
  Input: cBus
  Output: fd (OpRBus)
  loadField _ PhA AND ~hold2BA AND ~rejectBA AND EULoadField3BA

fd <<-- it does not need to be a register, but I want to restore the level, and for slow (no clocks) debugging, I need it>>
  Output: field descriptor in the control column (down)
  fdSelField _ PhA AND (EUAluOp2AB=FOP)
  fdSelkBusAB _ PhA AND (EUAluOp2AB=FOPK)

address Driver (tristate)
  Inputs: rBus
  Output: down (to cache)
  driveAddress _ PhA

drive KBus (tristate)
  Inputs: cBus
  Output: kBus (to IFU)
  driveKBus _ PhA AND (cAdr>239)

PhB latches: updated on PhB, unless hold2BA is high
address registers
  Inputs: kBus
  Output: direct to routing (down)
  st2BASel _ PhB

store2BA
  Inputs: direct from store2AB (up)
  Output: direct to store3AB (down)
  st2BASel _ PhB AND ~hold2BA

result2BA
  Inputs: sBus
  Output: rBus
  res2BASel _ PhB AND ~hold2BA

result3BA
  Inputs: direct from result3AB (OpLBus), pPort (up)
  Output: cBus
  res3BASelPBus _ PhB AND ~hold2BA AND EURes3BisPBus3AB
  res3BASelRes3AB _ PhB AND ~hold2BA AND ~EURes3BisPBus3AB

data Driver (tristate)
  Inputs: pBus
  Output: down
  driveData _ PhB AND EUWriteToPBus3AB

Power buses

The power buses are mostly in metal2. Let's assume that the maximum voltage drop we can tolerate is 0.25V and that the resistance of m2 is 0.042 S (

Control

Drivers

The decoding is done with NAND decoders, sometimes arranged as NOR of NAND. The worst case is 8 n-tr in series, and a 2pF load. The pull-down transistors are 8/2, the precharge transistors are 4/2, the driver is p=30/2, n=12/2. Simulations show a spike of 2mA in the power supply, and a delay of about 6ns. Increasing the size of the driver makes things worse, so the delay comes from the pull-down chain.

Now, let's assume that at most 100 such drivers are active at a given time. Actually, this is a bit too much, since the drivers come in two sexes (PhA and PhB) and most of them don't go high. The max spike is 200mA. With the standard formula, we find a bus aspect ratio of at most 30. If the chip is 5mm high, this means a 166has no state, so a bounce is not so serious; I can connect more often from the pad ring; . . . I make a layout with 50

Pad Frame

A standard pad frame using the pads from CmosPadLibraryPGA144 is almost square; inside dimension 7400xxx. I share it with Pradeep.

Optimizations

After a simple mux and an inverter, add a p-tr to restore the level.

Don't Forget

ALU: no cc or carry for EU1.
Check the equations of all control lines: nand and wires.
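As an aid for checking equations, the Field Unit summary given earlier (maskHole, shiftout, result) can be exercised with a short behavioral model. This is a sketch under stated assumptions, not the layout's logic: the function names are hypothetical, operands are unsigned 32-bit integers, and MaskGen is modeled arithmetically rather than gate by gate.

```python
MASK32 = 0xFFFFFFFF


def mask_gen(k):
    """32-bit word with k ones on the right, for 0 <= k <= 32."""
    return (1 << k) - 1


def shift_left(op_l, op_r, shift):
    """High word of the double word (opLBus, opRBus) shifted left by shift."""
    double = (op_l << 32) | op_r
    return ((double << shift) >> 32) & MASK32


def field_unit(op_l, op_r, insert, mask, shift):
    """result _ (maskHole AND shiftout) OR (~maskHole AND opRBus AND insert)."""
    mask_hole = mask_gen(mask) ^ (mask_gen(shift) if insert else 0)
    shiftout = shift_left(op_l, op_r, shift)
    background = op_r if insert else 0
    return (mask_hole & shiftout) | (~mask_hole & MASK32 & background)
```

For instance, with the same word on both buses the shifter acts as a rotate, so `field_unit(w, w, False, 8, 8)` extracts the top byte of `w`; with insert TRUE, the bits outside the hole come from opRBus.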