RAM Programmable Logic
Richard Barth
Draft of January 11, 1989 4:05:42 pm PST

1.0 Background

Systems need to stay malleable for as long as possible so they can be beaten into shape. This is essential when designing highly complex systems at the edge of what the human mind can grasp. Mistakes are inevitable, so the game is to reduce the cost of mistakes to an acceptable level.

Long fabrication turnaround time

This is especially true of full custom, less true of semicustom and board level. It makes prototyping with these technologies very expensive and drags out time to market.

Complex systems require prototypes

No satisfactory means exist to specify highly complex systems. Inevitably, large systems are sliced up into smaller pieces until each piece is small enough for existing engineering methodologies to handle. The assembly of these pieces into the larger system is then carried out using ad hoc methods. The subject of this invention proposal is another tool to make our engineering methodologies more powerful so that larger pieces can be handled at once. It does this by reducing the time required to build prototypes that approximate the speed and functionality of the final system closely enough that very complete testing can be carried out.

Lesson of RISC

One lesson of RISC is to expose the underlying power of the hardware to the application. Don't cover it up with "optimizations" that may not apply to the application. There is always a tension between specializing a system for a particular application so that it is optimal and generalizing a system so that it can be applied to a wide variety of applications. The subject of the proposal is intended to be more general than traditional processors because it exposes the fundamental parallelism of hardware. It does this at the expense of space, time, and power relative to a traditional processor, so it will not compete with traditional processors for applications that are well suited to a sequential model of execution.
It allows programmers to access the parallelism of hardware without all the layers of hardware designers and producers in between. Since the chip design is standard, it can be updated to the most advanced technology much more rapidly than a large base of custom designs, thus getting the performance advantage to customers faster.

2.0 Proposal

This invention proposal addresses a chip design different from that used by Xilinx. It differs in at least four major ways.

First, in the Xilinx design the area of the chip is divided, at design time, into areas for routing, combinational logic, and memory. In the current invention the chip is uniformly tiled at design time and is divided into areas for routing, combinational logic, and memory only at programming time. This lets the application determine aspects such as how large a combinational section must be, and where and how large the routing areas must be.

Second, the Xilinx design is optimized for fairly small functional components. In contrast, the current invention is optimized to implement efficiently the blocks that consume most of the area in a custom chip design. Thus RAMs, CAMs, PLAs, and routing can all be programmed efficiently in this design.

Third, the Xilinx chip has a serial interface for loading the program into it. The current invention has a parallel interface that allows high bandwidth read and write access to both programs and data. This allows a function to be loaded into the array and applied to the data, then another function to be loaded and applied to the data, and so on. The ability to change the function quickly can open up another dimension of programmability.

Lastly, the Xilinx chip is mostly intended to be used in isolation, rather than conceived as part of a large array. Thus there is a large discontinuity at the pads, so that the pins can be used in an optimal fashion.
The current invention is intended to be part of a large array, has compatibility with high density hybrid packaging built in, and seeks to minimize the discontinuity at the pads, even at the expense of less optimal use of pins.

Basic Cell

Explain how the cell is built out of the most primitive memory and functional components possible in VLSI. Show how routing is a degenerate case of function.

Vertical, Horizontal, and Corner stitch cells

Give a brief explanation of the vertical, horizontal, and corner stitch cells.

Tiling

Define an array of basic cells as a grain.

Examples

Regular structures, e.g. PLA, RAM, CAM.

PopAndReply Finite State Machine demonstrates routing, edge-triggered flip-flops, and combinational computation.

RAM demonstrates bit-bitbar combinational generation, the use of HMatch and access computation for write control, and the use of VMatch for buffer control.

CAM adds a programmable decoder to the fixed decoder and RAM by shoving it in the middle.

Irregular structures, i.e. logic, e.g. adders, counters, shifters.

3.0 Applications

This chip design can be used to build a system that presents the programmer with the model of a large, uniform, 2D tiling. In addition, the chip can be used as a component in its own right. The applications of such a system can be broken down into three areas: emulation, convolution, and direct execution. These are fundamentally the same; the distinction is one of binding times and the programming model used.

Direct-Execution

Direct-execution describes applications in which the programmer/designer has the 2D tiling firmly in mind while the program/design is created, and thus seeks to use the resources of the hardware in an optimal fashion. These applications are generally expressed as high level algorithms, e.g. compression, encryption, code breaking, error detection/correction, pattern matching at disk transfer rates, and imaging operators.
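As a rough illustration of direct execution, and of the earlier remark that routing is a degenerate case of function, the sketch below models a row of cells computing a parity bit for error detection. Everything in it is an assumption made for illustration only: the proposal does not specify the cell, so the two-input lookup-table cell, the names Cell and run_row, and the table encodings are hypothetical.

```python
from typing import List, Sequence


class Cell:
    """Hypothetical basic cell: a 2-input function held in a 4-bit RAM table."""

    def __init__(self, table: Sequence[int]):
        if len(table) != 4 or any(bit not in (0, 1) for bit in table):
            raise ValueError("table must hold exactly four bits")
        self.table = list(table)

    def evaluate(self, west: int, north: int) -> int:
        # The two inputs form the address into the RAM table.
        return self.table[(west << 1) | north]


# Table contents, indexed by (west << 1) | north.
XOR = [0, 1, 1, 0]          # a genuine function of both inputs
ROUTE_WEST = [0, 0, 1, 1]   # output = west: routing as a degenerate function
ROUTE_NORTH = [0, 1, 0, 1]  # output = north: routing in the other direction


def run_row(row: List[Cell], north_inputs: Sequence[int], west_in: int = 0) -> int:
    """Pass a signal west to east through a row, one data bit entering per cell."""
    signal = west_in
    for cell, north in zip(row, north_inputs):
        signal = cell.evaluate(signal, north)
    return signal


data = [1, 0, 1, 1]

# Program every cell as XOR: the row computes the parity of the data bits.
parity_row = [Cell(XOR) for _ in data]
parity = run_row(parity_row, data)

# Reprogram the two middle cells as routing: they now just forward the signal,
# so the row computes the parity of the first and last bits only.
parity_row[1].table = list(ROUTE_WEST)
parity_row[2].table = list(ROUTE_WEST)
partial = run_row(parity_row, data)
```

The same row serves as logic or as wiring depending only on the bits written into each table, which is the sense in which a uniformly tiled array can be divided into routing and combinational areas at programming time rather than design time.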
Convolution

Convolution means that a nontrivial transformation is performed between the model used by the programmer and the actual hardware. This is similar to compilation as it is currently known, but the target machine is very different from the targets to which compilers are usually directed. A simple example of such a transformation is combining the source of a hardware design with a switch-level simulation algorithm to produce a program for the array that simulates the design with a switch-level model of the behaviour of the primitives. Randy Bryant's COSMOS system at CMU produces an intermediate form which has performed the convolution but has not tiled the plane with the resulting set of equations. One can imagine convolving imaging operations with the data so that rasterization can proceed quickly.

Emulation

Emulation is a flavor of convolution of particular relevance to this system. In it the designer intends to build a custom piece of hardware, rather than always executing the application on the existing system. Thus the designer will optimize the design for the custom hardware. However, because of the long fabrication times associated with building hardware, it is very useful to examine the state evolution of the design prior to construction. In this case the execution model is much more abstract than switch-level simulation. It may be enough that the I/O behaviour is the only constraint upon the transformation; the designer may promise not to examine the internal workings of the design during emulation.

Component

Low volume systems

4.0 Leverage

Most of the work required to produce a system that is easily used by applications is in the software required to map from the application description into the bits that program the device. The hardware design consists of designing one chip, one hybrid, an interface board to a host computer, such as a Sun, and, for a large system, a board to tile a set of hybrids together.
The hardware work is not trivial; it will require many man-months from several highly skilled people, but the bulk of the complexity will be in the software.

DATools

Abstract capture
We own the source
Reasonable number of examples covering the space of complexity

Hybrids

Results to date are promising
Pad pitch will match cell pitch of chip, thus reducing discontinuity at pads

Vertical Integration

Combination of talents to span silicon substrate to high level applications such as the Imager.

5.0 Competition

Field programmable logic devices are available which use a wide variety of programming mechanisms. The most malleable of these use standard RAM circuits to store the programming.

Xilinx Inc. produces a series of devices that they refer to as programmable gate arrays or logic cell arrays. These devices are typically used in controller applications such as satellite links. The design of these devices is biased towards using fairly low complexity primitives and then building up more complex structures with place and route techniques. The software sold for programming these devices is not easily used to program large arrays of them. The company claims to have a large number of patents covering many aspects of their design. It would be interesting to see these patents so that the breadth of the claims can be determined.

Quick Turn, a company based in Silicon Valley, has been formed to build a system that uses Xilinx parts to emulate electronic systems. They intend to use existing design capture systems, such as those supplied by Daisy, use software to embed the captured design into an array of Xilinx parts, and then emulate the design.

The same sort of array of Xilinx parts can be used more directly by an applications programmer. Jean Vuillemin and his coworkers at the DEC Paris Research Laboratory are working on such a machine now. They have built a very small prototype and have programmed it to do 64 bit by N bit multiplications.
They claim that the performance competes with the best custom implementations for two reasons. One is that the Xilinx part uses a more advanced technology, and the other is algorithmic sophistication. They are working on a larger board which includes a 4 by 4 array of Xilinx parts plus some RAM directly attached to the array. Both of these boards are peripherals in a Sun workstation.

DEC, through their purchase of the remnants of Trilogy, has access to the high density packaging which may be key to implementing large systems of this sort. The combination of the expertise that exists in Paris, Hudson MA, and Palo Alto makes it possible for them to put together a very interesting system.

6.0 Protection

Right To Use
Real Value
Translation
Cell may be easier to protect

References

[Barth1] R. Barth, "Architectural Considerations for RAM Programmable Logic" (tentative title), in preparation.

[Barth2] R. Barth, "2.5D Rasterization with Programmable Logic" (tentative title), in preparation.