ALPS, A BOOLEAN TO VLSI LAYOUT TRANSDUCER ALPS, a boolean to VLSI layout transducer Bertrand Serlet c Copyright 1985 Xerox Corporation. All rights reversed. Abstract: Alps proposes an tree-structured way to represent arbitrary boolean expressions. This representation is compact, suitable for fast computation, and has a natural mapping into silicon. VLSI layout generated by Alps compares favorably in many cases to PLAs and standard cell designs. Keywords: Boole Algebra, Boolean expressions, Disjunctive Normal Form, VLSI, PLA, CMOS, Cascode, Automatic Layout Generation. XEROX Xerox Corporation Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, California 94304 1. Introduction The original idea behind Alps was inspirated by different works and areas of research. Recent papers [Chen84, Yoffa85] describing use of differential Cascode Voltage Switch (CVS) for generating CMOS layout initiated a series of investigations on what are the possible styles of layout in CMOS. During the same time, the author speculated with Louis Monier on how to generate regular control structures for microprocessor-like chips. In some of Dragon [Monier84] chips currently under design, there was a need for control structures like PLAs, although a static alternative would have been welcome. An other background preoccupation was to generate automaticly non-critical cells. Last but not least, the first run of Alps was intended as a test for the Silicon Assembly program, PatchWork [Monier85]. 2. Boolean Expression Manipulation 1. Representation of boolean expressions Conventions Throughout all this document, we represent the And operator (product) by ".", the Or operator (sum) by "+", and the Not operator (Complement, Negation) by "~". We will constantly use the If operator (Multiplexer) which selects one clause between two clauses. In terms of And and Or, If can be defined by the relation: If [a, b, c] = a . b + ~a . c . The exclusive Or is denoted Xor. The constants are "0" (false) and "1" (true). An expression is made of operators, sub-expressions, constants and variables. Sets of primitive operators {And, Not}, {Or, Not}, {Xor, And}, {Nand}, {Nor}, {If} are all equivalent sets of boolean operators which allow the representation of any expression, when added constants and variables. Disjunctive Normal Form Any expression can be viewed as a sum of products, each term of the sum being made of a product of either an input variable or its complement. The decomposition is unique if each product contains all variables (possibly complemented). Without this condition, the decomposition is far from unique and many papers describe various techniques for minimizing the number of product terms. This representation is usually compact for a large variety of problems, especially if Xors are avoided. It is the representation used for generating PLAs and to some extent Weinberger's Arrays. 2. Binary decomposition. The most obvious representation The simplest way to represent a function of N-variables is just to decompose it into 2 partial functions of N-1 variables, with a boolean choice operator between those partial functions. That gives: F [x1, x2, ..., xn] = x1 . F1 [x2, ..., xn] + ~x1 . F0 [x2, ..., xn] We see that this is exactly the If operator introduced previously: F [x1, x2, ..., xn] = IF x1 THEN F1 [x2, ..., xn] ELSE F0 [x2, ..., xn] F [x1, x2, ..., xn] = If [x1, F1 [x2, ..., xn], F0 [x2, ..., xn]] Notation The notation for abbreviating the writing is: F = x . F1 + ~x . F0 It assumes that an order in the variables has been chosen. Composition laws When F = If [~x, F0, F1] and G = If [x, G1, G0]: ~F = If [x, ~F1, ~F0] F + G = If [x, F1 + G1, F0 + G0] F . G = If [x, F1 . G1, F0 . G0] Xor [F, G] = If [x, Xor [F1, G1], Xor [F0, G0]] [ F = G ] <=> [ ( F1 = G1 ) And ( F0 = G0 ) ] (Unicity of the representation) Size of the representation The operations on this representation are simple. However the representation has a major inconvenient: its size. The size of any expression of N inputs is 2 ** N. And size is the most important parameter of a representation, for layout area and computation times are usually growing if the representation is getting bigger, independently of the algorithm. 2. Alps normal form. Special cases recognition The major originality of Alps is just to distinguish all special cases of the partial sub-expressions F1 and F0. These two sub-expressions can verify (conditions are to be evaluated sequentially): F1=F0=1 Or F1=F0=0 [Constant] => F=0 Or F=1 F1=~F0=1 Or F1=~F0=0 [Identity] => F=x Or F=~x F1=1 [Or] => F= x + F0 F0=0 [And] => F= x . F1 F0=1 [OrNot] => F= F1 + ~x; -- mix of Or and Not F1=0 [AndNot] => F= ~x . F0; -- mix of And and Not F1=F0 [NoDependance] => F= F1 ; -- F does not depend at all on x. F1=~F0 [XorNot] => F= x . F1 + ~x . ~F1; -- mix of Xor and Not ELSE [If] => F= x . F1 + ~x . F0; -- General case. Properties When this distinction between all different cases is made, the decomposition is still unique (given an ordering of the inputs), and the composition laws are unaffected. We have so a normal form (Alps normal form). Branching factor The size of the representation is no longer 2 ** N, because in all cases apart from If, 0 or 1 subexpression is needed. The branching factor (number of needed subexpressions) is: Identity, case01 => 0, Or, And, OrNot, AndNot, NoDependance, XorNot => 1, If => 2. With Alps normal form, the representation is very compact. For example, Nor, And, Or of N inputs all take a size N, as in the sum of product form. But, expressed in sum of product form, the a Xor of N inputs also needs a size 2 ** N, against N for Alps. The consequence is that the Alps form is always more compact than the sum of product form, and, in cases which include Xor the compaction can be significant. 3. Minimization. Auxiliary variables The introduction of auxiliary variables may bring a drastic reduction of the size of an expression, whatever the representation is. Introduction of auxiliary variables is a difficult problem (many times NP-complete, because you have a lot of different problems to solve). You must find the "best" auxiliary variable, find the "best" way to use it in factorizing other expressions, all those assuming that you can find a "best" function, i.e. an ordering of the cost of two system which include auxiliary variables. But introduction of auxiliary variables is not the only optimization possible in Alps. As we will see, permutations of inputs may dramatically increase or decrease the size. In fact permutations alone are generally sufficient for generating area-efficient layout. Permutations In Alps, the representation is dependent on the ordering of inputs. For example (see figure 2.3): F = If [x2, XorNot [x1, G], x1], where G is another expression, can be rewritten when permuting x1 and x2: F = If [x1, Or [x2, G], AndNot [x2, G]]. The second expression duplicates the G expression. If no sharing (conceptually this means the introduction of an auxiliary variable) occurs the size of F in the second case is the size of F in the first case plus the size of G. The exhaustive search of all possible permutations is a theoretical answer to that minimization problem. In practice, a local search of better permutations, done by permuting only 2 inputs at a time, seems to be sufficient. 4. Generalization. Higher level multiplexers The main idea making Alps a good representation, i.e. the separation between all special cases, can be used more than once. The If gate is a conceptual multiplexer, that may be extended to a whole vector of variables. For example with 2 inputs, we can introduce the bi-multiplexer, If2, defined by: If2 [x1, x2, F11, F10, F01, F00] = If [x1, If [x2, F11, F10], If [x2, F01, F00]] The special cases to be distinguished for compacting trees containing If2 are when at least one of the four partial functions is 1 or 0, and/or when at least two of the partial functions are equal or relative complement. In the extreme case when you multiplex all variables, the function obtained is the truth-table function. Consequently, permutations are less useful for compacting the representation. With this generalization, Alps is very near to the representation proposed by Bryant [Brant85]. 3. Mapping into silicon 1. Overview of different CMOS styles 1. Standard See figure 3.1 The output of the gate is connected to Gnd via a Pull-Down tree, and to Vdd via a Pull-Up tree. Those two trees are dual, in the sense that you can transform one to the other by transforming parallel into series and vice-versa (There should be a lot of papers in the litterature describing this transformation). 2. Precharged See figure 3.1 This style is the same as previous for pulling down to zero, but it uses a preliminary clock phase to pull the output up. Because transistors leak, you have to use some analog tricks to maintain the output indefinitely to its value when the output is high, after the end of precharge. One way to do that is to have a weak p-transistor, connected to a voltage source just above threshold so that this transistor will be slightly closed, and will maintain the high voltage, without perturbation when the output is low. 3. Cascode Cascode is an old technique from the time when people where CASCading anODEs of lamps, newly revived for CMOS, and sometimes also called CVS (Cascode Voltage Switch) or CDVS (Cascode Differential Voltage Switch). See figure 3.1 . The 2 Pull-Down trees are dual, in the sense that one and exactly one of them is connecting to Gnd. The crossed p-transistors are having positive feedback which tend to make this gate less sensitive to noise. 4. Comparison between different styles. In current technologies, p-transistors and n-transistors must be separated by a large distance, to avoid effects as LatchUp. For the same resistance, p-transistors are also bigger than n-transistors. Those 2 reasons make the 1st style (standard) poorly efficient in terms of area. The 2nd style (Precharged) is roughly cheaper by a factor ot 2 in terms of transistor usage, and is the most widely used (for example in PLAs). Its main disadvantage are two-fold: Precharged logic requires a clocking scheme, and that means some extra global control and communication other than the gate. The inputs must remain steady once the precharged phase is finished. This last constraint is in fact the predominant one, for it means that you cannot use the output of a precharged gate for the input of another precharged gate without paying extra attention to timing problems. This leads to domino style where you use inverters to insure that input transitions are the same all over your logic. This cannot be used when you have non-monotonous functions, as the Xor function. 5. How to make PLAs in CMOS. The basic assumption making PLAs an efficient way to generate layout, is that it is possible to compute a NOR of multiple input in small time and area. Unfortunately, this is less and less true with current and future technologies for 2 reasons: - As feature size decreases, the time to compute a NOR is more and more proportional to the number of inputs. This was not the case before, for the NOR time could be considered constant with respect to the time-scale of other components of the chip. - CMOS technology has no pull-ups, and the only way to have a multiple input gate that operates roughly in constant time (with the limitations as described previously) is to use precharged gates. 2. Isomorphic mapping Logical values representation The cascode style has been chosen for the first version of Alps layout generation. That implies that both input and their complement are necessary throughout the design, and that expressions are carried by two wires, one of them being pulled down to Gnd by some n-transistors. The mapping of logical values for inputs can be the trivial one: 1 means same voltage as Vdd, 0 same voltage as Gnd. In order to have little n-transistors per If gate, the representation of expression values is the following: 1 means pulled down to Gnd via a n-transistor, 0 means not pulled down to Gnd. Gates representation Figure 3.2.1 shows the schematics for the If gate. All gates can be obtained by using the Constant gate (the one generating 1 and 0) and the If gate, but some optimization have gained 1 transistor in the cases [OrNot], [And], [Or] and [AndNot], using the fact that: F = x . F1 + ~x can be rewritten F = F1 + ~x and similarly for the other cases. The schematics for the OrNot is shown on figure 3.2.1. Layout for this 2-transistor gate can be made very compact, and this is important because most of the gates used in the control part of a chip belong to the {[OrNot], [And], [Or], [AndNot]} set. Glue and amplification The crossed cascode switch is needed to finally compute the output (and not only the 2 pull down chains). It is also necessary to avoid very long pull down chains containing transistors in series, that may give a quadratic delay. To do so, cascode crossed devices are inserted at regular intervals (each 5 or 6 with our technology). These crossed devices make the delay time truly linear in terms of number of inputs. The schematic is given in figure 3.2.2. The algorithm used for choosing where to put glue is slightly better than just dropping it at regular intervals: the interval is a fixed number of used input, which still makes the delay linear, and saves some glue. 3. Layout considerations Directions and layers assigment Many strategies have been tried to find the most compact way to layout Alps expressions. This discussion is very technology-dependent, but seems reasonnable for any 2 metal-layers CMOS technology. Because n-transistors have gates perpendicular to source and drain, it seems best to have inputs run in one dimension in poly (doubled with metal2 for avoiding delays), and outputs run in the other dimension in metal (for minimizing delays) and diffusion (where transistors are needed). N-wells are inserted are regular intervals, for making the amplifier cascodes, which means some extra Vdd and Gnd running parallel to the inputs, and so forth in metal2. Gnd routing Two choices may be made for Gnd: - In n-diffusion alongside the inputs (with the appropriate doubling in metal2); - In metal alongside the outputs. The first choice seems to save some area compared to the second one (10%) but necessitates to connect from time to time the diffusion to the metal2, which is tedious. The second approach was finally taken, and that means that for each expression, you find 3 wires, the 2 Pull-Down chains and Gnd in sandwich in-between. This solution has nice features like the ability to grow the size of transistors, without changing any other geometry. It is also very local, in the sense that each gates finds all its necessary signals in the nearby. Finally it allows Grid-Power-Routing, i.e. the connection of Gnd and Vdd in grid instead of comb, to reduce the effect of spikes, when all inputs switch at the same time. 4. Implementation 1. Fast operations on Alps Expressions. Lazyness of the negation and type of the data structures For distinguishing between possible cases, the test for equality of 2 expressions and for relative complementarity must be performed efficiently. Because of the last test, and because Not operations seem to happen frequently, the best way to implement the Not operation is the Lazy evaluation, i.e. complementation is coded explicitly into the structure, and all operators use that information. In Cedar syntax, expressions have the following type: Expression: TYPE = REF RECORD [ variable: Variable, -- Variable corresponding to this level of currification norm: BOOL _ TRUE, -- Whether the record represents the information (TRUE) or its complement (FALSE) case: TypeOfExpression, -- Type Of the Expression subexpr1, subexpr2: Expression -- Not always both useful, but that's the cheapest way to have them ]; with: TypeOfExpression: TYPE = { Constant, -- F1=F0=1 Identity, -- F1=1 F0=0 OrNot, -- F0=1 And, -- F0=0 Or, -- F1=1 AndNot, -- F1=0 XorNot, -- F1=~F0 If -- General Case }; Creation and operation The only primitive for creating structure is the Or Operator (with 2 arguments). The choise of Or (and not And for example) is arbitrary. In Cedar Or header is: Or: PROC [expr1, expr2: Expression, norm1, norm2: BOOL _ TRUE] RETURNS [result: Expression]; -- norm1 (resp. norm2) are flags which indicate whether to consider expr1 (resp. expr2) or their complement ~expr1 (resp. ~expr2). This is necessary because of the laziness of the evaluation of Not. 2. Permutations General frame The aim of this minimization technique is to minimize an evaluation function, usually the number of If gates, by trying permutations. Permuting is done by a function which takes the system of boolean expressions and the permutation for arguments and returns the new system of expressions. But only a restriction of this general function is used, for only 2 variables are permuted at a time. For a dozen input set of expression, the real time used on a Dorado for permuting 2 variables is less than a second. First heuristic The first heuristic only tries permutations between neighbours. It is somewhat analogous to bubble sort. It operates in linear time relatively to the number of inputs. This heuristic is not very efficient, even for finding a local optimum, but allows a first reduction of the size. Typically when run with a 31 input truth table, if finds 2 permutations to do in the set of 30 tried, which reduce the number of If from 6000 to 5000. Second heuristic The second heuristic tries for each input slot to find the best input. It starts with slots leftmost (when outputs are supposed to emerge from the left side). It operates in quadratic time, and converges after a few iterations, for finding a good local optimum. When run on the same 31 input truth table, it reduces the number of If from 6000 to 200 after half a dozen iterations (each of this iteration tries 31 * 30 permutations). Third heuristic The third algorithm consists in the full factorial search, with two heuristics for going faster. The first one is to try to put in leftmost slots all inputs which do not generate If gates. If such an input is found, that indeed means that trying other permutations for this input will not make any reduction in the If count. The second speed up consist in trying first the permutations which seem the best in terms of number of If gates, and avoiding search for the branches of the tree search which can not give a minimum. It operates in small time for small sets of expressions, but it seems an unreasonnable algorithm for a lot of inputs. It was stopped after a dozen hours of Dorado cpu time on the same 31 input problem. 3. Layout Generation Generation by program With the layout considerations discussed in 3.3, layout for an expression is straightforward, because it is just the 2 dimension layout of a tree. Two optimizations make that layout more tricky: - Metal lines are shared; - While layout an attempt is made to find neighbors which are common subtrees and routing them together to avoid ofduplicate the subtrees. This generalizes the sharing done for XorNot. Feeding back inputs The geometry generated by Alps is especially nice for feeding back inputs, from the outputs, after some treatment. This allows an easy implementation of auxiliary variables and latched outputs, as far as the layout generation is concerned. In effect, for generating an auxiliary variable, Alps just generates an ordinary input and the corresponding output, and then adds a special cascode amplifier on the output (same cell as used for regenerating signal, schematics in figure 3.2.2), and routes the amplified output back to the corresponding input. The general schematic is to be found in figure 4.3.2. Finite state machines Latches are laid out as auxiliary variables, with the only difference that the cell is latched. Its schematic is in figure 4.3.2. It is assumed that PhiA and PhiB are two non-overlapping clocks. If pseudo static CMOS is necessary, i.e. the ability to hold indefinitely signals when clocks freeze, 2 n-transistors connecting out and ~out to Gnd and having their gate connected to a signal called AntiBias can be added. AntiBias is a weak line that does not vary and which is always slightly under n-transistor threshold. Schematic for generating AntiBias can be found in figure 4.3.4. When the clock PhiA is High, the gates behave like an ordinary cascode inverter, since the pull-down branches using PhiB are disconnected, and the AntiBias has no effect on the inverter. When the clock PhiB is High, the gates behaves like an ordinary cascode inverter self-fed back, since the pull-down branches using PhiA are disconnected, and the AntiBias has no effect on the inverter. When both PhiA and PhiB are low, one between Output and NotOutput is High, with a maintained voltage from its p-transistor, while the other one is Low, with a maintained voltage from its AntiBias n-transistor. 5. Results 1. Performance of the layout Difficult to estimate All area and speed performances are strongly technology dependent. They also depend a lot on the problem. Area The area comparison with PLA is in favor of Alps for small or medium size problems (between 10 and 20 inputs and between 10 and 20 outputs). For large problems Alps-generated layouts seem to be around 50% larger. Three different factors are competing here. First, Alps representation is smaller, which tends to make smaller layout. Second it is more general and uses different cells, while PLAs are made of the same two cells, which allows some layout optimizations. Third cascode means a loss of a factor of 2 in area due to the transport of 2 signals instead of 1. Speed For speed it is harder to make a comparison between Cmos PLAs and Alps, for the Cmos PLAs must be precharged. Our Thyme simulations (Thyme is a Spice-like circuit simulator developped at Parc), for our technology, seem to indicate a 0.6ns delay per input, when amplifying of signals is done each 4 or 5 inputs. Thyme simulations results on a 16 input And: sizeN is the size of N transiztors (in terms of minimal transistor); sizeP is the size of N transiztors (in terms of minimal transistor); Grouping 2 by 2 (2+2+2+2+2+2+2+2), 7 amplifiers: sizeN=1, sizeP=1 => 11ns sizeN=2, sizeP=1 => 7-8ns Grouped 4 by 4 (4+4+4+4), 3 amplifiers: sizeN=1, sizeP=1 => 8-9ns sizeN=1, sizeP=2 => >15ns sizeN=2, sizeP=1 => 6ns Grouped 6 by 6 (6+6+4), 2 amplifiers: sizeN=1, sizeP=1 => 10-11ns sizeN=2, sizeP=1 => 6ns Grouped 8 by 8 (8+8), 1 amplifier: sizeN=1, sizeP=1 => 13-14ns sizeN=2, sizeP=1 => 5-6ns 2. Performance of the generator How big is Alps program? The program is composed of two parts: the boolean manipulator and the layout generator. Currently half of the code is there to make statistics, try different heuristics for permuting or for introducing auxiliary variables. An interactive tool for playing with those heuristics has also been made. The current version comprises 5 interfaces and their implementations, roughly 50 pages. It can be reduced a lot ... Speed of the execution For medium size PLAs (10-20 inputs, 10-20 outputs), the elapsed time is a few seconds on a Dorado, apart for the time spend for finding a good permutation of the inputs, which is of the order of magnitude of a minute, but may be much less or much more, depending of the algorithm used, of the original permutation, and of the problem. 3. Directions for future work What to do next? All described features have been implemented. What is left, and there is a lot, is: - Generation of precharge style or NMOS layout; - Improvement of the heuristics for introducing auxiliary variables, and for permuting; - Acceptance of more general topological constraints - Layout optimizations - Optimized generation of small size cells How much effort went to Alps? About 6 months were spend from the original idea, by one person, and half of these 6 months were devoted to the development of PatchWork, and other ChipNDale [Jacobi85] tools. These 3 man-months are to compare with the amount of effort used in PLAs. 6. Summary Alps representation is compact and allows fast manipulation. It is superior to the sum of product form. Alps layout generation produces compact Cmos Cascode layout. The generated cells are static and have small propagation delays. There is a huge application field for using Alps type of layout: designs where PLAs are currently used, designs where routing and standard cells are used, and a lot of designs in which non critical combinatorial cells are drawn by hand. 7. Bibliography [Bryant85] : Symbolic Manipulation of Boolean Functions Using a Graphical Representation Randal E. Bryant, Proceedings 22nd DAC, pp. 688-694 [Chen84] : Considerations for implementing CMOS processors [Jacobi85] : ChipNDale Documentation [Monier85] : The Architecture of the Dragon [Monier85] : PatchWork Documentation [Yoffa85] : ACORN: A local customiztion approach to DCVS physical design Ellen J. Yoffa, Peter S. Hauge, Proceedings 22nd DAC, pp. 32-38 VAlpsGenerationDoc.tioga Created by Bertrand Serlet, January 17, 1985 10:19:05 am PST Last Edited by Bertrand Serlet, September 4, 1985 5:35:03 pm PDT Chih-Liang Chen & Ralph Otten, October 84, Proceedings of the IEEE International Conference on Computer Design, pp. 48-53 Christian Jacobi & Kim Rachmeler, July 85, Internal Xerox Documentation, ChipNDale: An interactive editor for VLSI designs Louis Monier & Pradeep Sindhu, Spring 85, Proceedings of the IEEE 1985 COMPCON Spring, pp. 118-121 Louis Monier & Bertrand Serlet, February 85, Internal Xerox Documentation, How To Use PatchWork Ê–i(firstHeadersAfterPage) {1} .cvx .def (firstPageNumber) {1} .cvx .def (oneSidedFormat) {.false} .cvx .def˜šœ™J™