[_CD4_]<datools7.0>SoftHdw>SoftHdw25D.tioga!7

2.5D Rasterization with Programmable Logic

Richard Barth

Draft of February 16, 1989 10:43:32 am PST

1.0 Introduction

This paper describes the rasterization of 2.5D images with RAM programmable logic. The basic notion is to program a machine for each object in the image, z sort the machines, and then flow chunks of pixels through the pipeline thereby created. This contrasts with the standard graphics pipeline in which the objects flow through a transformation pipeline to produce pixels in a frame buffer. Rather than the transformation operators remaining stable in space, and the objects flowing through them, the objects, represented as machines, remain stable in space, and the pixels flow through them.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 1. Abstract Architecture

Figure 1 shows the abstract architecture. All of the items in the left column represent machines which are coded into programmable logic. The host is responsible for loading the programmable logic with these machines. The clock generator generates clocks which are multiples of the basic pixel clock. In the simplest implementation, all of the machines besides the parallel to serial converter run at one clock rate, and the parallel to serial converter converts from that clock rate to the pixel clock rate.

The architecture scales arbitrarily for bandwidth (represented as the product of image output rate and number of pixels) and the number of objects. With the exception of clock distribution and the relatively low-speed interface to the host, all of the connections are nearest neighbor. There is no fixed complexity limit for a particular image because the host can utilize the programmable logic to rasterize objects, and combine the resulting pixel maps together to produce a single pixel map machine. The only result is a decrease in performance, assuming there is enough pixel memory somewhere to form a frame buffer.

An alternative architecture uses the programmable array as a rasterization engine whose output is fed into a standard frame buffer. The output device is then driven from this frame buffer. This architecture can be scaled down to almost nothing, perhaps a single programmable logic device. This board level architecture is essentially the same as current graphics engines. The difference is the binding time of the raster generator operators. In prior designs these operators are bound at manufacturing time, and typically consist of an abstract set of operators such as polygon, rectangle, and text. In the proposed design the parameters of these abstract operators are bound prior to synthesizing the logic which generates the raster. This enables the simple common cases to be handled in a highly optimized manner.

Of course programmable logic is not limited to this application. Most of the complexity resides in the programming of the chips. The design of the chips themselves is extremely simple once the architecture is defined, so the chips can track technology readily. In fact, they are so simple that they will take advantage of advanced technology long before standard microprocessors. Since the number of applications is large, there will be motivation to produce these chips in large quantities, consequently reducing their cost.

Programmable logic is an enabling technology because it allows the grain size of a machine to be smaller than a standard von Neumann processor and allows highly optimized machines to be constructed for the simple cases, which are likely to be the most common ones.

This technique is applicable to all types of raster scan output devices. In section 2.0 we describe the relationship of this technique to previous published descriptions of rasterizing hardware as well as commercial products. Section 3.0 considers a number of alternative board level architectures. Section 4.0 covers the programming of the machines which reside in programmable logic. Sections 5.0 and 6.0 discuss design considerations specific to displays and printers. Finally, in section 7.0, we conclude with the mandatory conclusion section, where we repeat what we just said, but hopefully you, dear reader, will have the attitude of belief, rather than skepticism, by the time you get there.

2.0 Background

Nothing has been published about display architectures in which the logic is programmed to generate a specific image. No doubt this is because the density of integrated circuits has only recently reached a sufficient level to make this practical.

Some approximations to programmable logic rasterization are machines in which the objects are parceled out to a collection of function boxes, transformed into pixels, and the result merged in a frame buffer.

More closely related to programmable logic rasterization are machines in which a fixed number of objects with a very small range of programmability are rasterized on-the-fly. The most ubiquitous examples of this technique are video games. In these games each object, or sprite, has a set of registers which store the location of the object and a small pixel map. As the raster is swept the objects are multiplexed into the video stream at the proper time.

Another example of this technique is the generation of 3D polygonal images. In this case the number of objects which can be handled at once is severely restricted. This is the approach used by extremely expensive scene simulation engines. A survey of these techniques, which is rather dated at this point, is contained in [Sutherland et al.].

[Deering et al.] describe a machine in which pixels flow through a pipeline of objects. However they assume triangles for their primitive, and the silicon is customized for that primitive. They have an interesting idea for using the same pipeline to load new objects as processes the pixels. Their description may form an interesting basis for programming the logic to do 3D scene rendering.

This description of background is nowhere near complete.

3.0 Board Level Architecture

In this section we consider a number of alternative board-level architectures.

First a few definitions. The animation rate is the number of images per second which must be produced to give the appearance of smooth motion. The response time is the maximum tolerable latency from user input to modified image. The display rate is the bandwidth required to service the output device. A CRT requires a display rate many times the animation rate because of the lack of persistence in the phosphor.

The first two architectures are designs which represent extreme simplicity. The first allows for arbitrary images, the second for limited animation. The next combines the first two to allow arbitrary images with limited animation. The last decouples the display rate from the animation rate, and allows the raster generator to be used in a multipass method, so that the ratio between processing power and memory is tunable.

3.1 Frame Buffer

The simplest practical implementation of a raster output device relies completely on the host processor to rasterize the image. Figure 2 shows such a system, in which the frame buffer serves to isolate the bandwidth requirements of the display from the capability of the host to supply pixels.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 2. Frame Buffer System

This is the standard system used in many low priced computers. It does not contain any programmable logic, but serves as a capability reference point in that it allows display of arbitrary images and has decoupled the display rate and the animation rate.

3.2 Raster Generator

The simplest possible system which utilizes programmable logic for rasterization is depicted in figure 3. This is the system which was described in the introduction. The complexity of an image produced by this system is limited by the the space available in the programmable logic of the raster generator to store object descriptions. Thus, unlike the simple frame buffer system, arbitrary images cannot be displayed. However, also unlike the frame buffer system, the objects are stored in a very abstract manner, making it quite simple to animate the image.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 3. Raster Generator System

The programming of the raster generator is designed to produce pixels at the peak rate demanded by the output device. The clock generator takes care of producing vertical and horizontal sync. During either sync the clocks to the array are stopped. During vertical sync the host can reprogram the array to render a different image. This reprogramming could just be a block transfer of an entire new machine pipeline from host memory into the array.

3.3 Raster Generator and Frame Buffer

In this system the capabilities of the two previous systems have been combined. The host can store an arbitrary image in the frame buffer and can store objects in the raster generator to improve animation capabilities. This is a typical video game architecture.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 4. Raster Generator and Frame Buffer System

The amount of animation is limited by the raster generator size.

3.4 Machine Buffer, Raster Generator and Frame Buffers

Finally we add multiple frame buffers plus a machine buffer. The machine buffer stores machines and loads them into the raster generator as needed. This enables multiple pass use of the raster generator.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 5. Machine Buffer, Raster Generator, and Frame Buffer System

This system design is motivated by the recognition that memory is cheap. Even if RAM programmable logic becomes the accepted implementation mechanism for a wide variety of systems it will never be as cheap for state storage as DRAMs. In this architecture the output display rate and the rasterization rate are completely decoupled. The display multiplexor switches at the animation rate. The pixels come out of the output device selected frame buffer at the pixel clock rate. The other frame buffers either source or sink pixels at the raster generator rate. No frame buffer both sources and sinks pixels during a single rendering of a single image, i.e. the frame buffers are single ported so that they may be built out of cheap, standard, RAM rather than video RAM. The machine buffer cycles through all of its machines at the animation rate. There is a blank time slot in the raster generator between the last pixel of the previous image rendering and the first pixel of the next image rendering so that the machine buffer has time to insert new machines into the rasterization pipeline. The size of the frame buffers are fixed by the display, as the product of its width and height in pixels, and the number of bits per pixel. The machine buffer size is fixed by the complexity of the animated objects. The raster generator size is fixed by the machine-complexity, animation-rate product.

In this design one frame buffer holds the background pixel map. It consists of objects which are not moving and do not have any interaction with those that are moving. A pair of frame buffers is used as a ping-pong buffer while the raster generator is making multiple passes through the pixels, each time with a different set of machines loaded into it. The last frame buffer is refreshing the display.

The raster generator has control over the pixels which are transmitted from the frame buffer into the raster generator. This allows a gross clipping to be performed, so that multipass methods need not waste bandwidth passing pixels through the pipeline which are never modified by the current machines in the pipeline. If the raster generator is used in this fashion then three of the frame buffers must be combined into two dual-ported frame buffers, which ping-pong between the output device and the raster generator.

A very clever implementation could reduce the number of frame buffers to one single ported background frame buffer and another dual-ported frame buffer which is shared by the raster generator and the output device. However, this places a number of severe timing constraints on the raster generator.

All of the components in this architecture are memory-mapped to a host processor which is responsible for loading and coordinating their actions. Of course, there is a single time base which generates all of the clocks described.

4.0 The Raster Generator

In this section we cover the details of using programmable logic as a raster generator. We use a specific example to illustrate these details.

First, we describe the specific system context into which the example is cast, including the general organization of the programmable array. The organization required to fit a collection of these machines into a coherent pipeline fills the next section. Then we focus on detailed implementation of some of the objects, describing how they are synthesized from an abstract representation into machines.

4.1 System Context

The example of this secton is based on the following assumptions:

The board-level architecture of section 3.2 is used.

The output device is a binary display.

The pixel clock has a period of 5ns.

The resolution in the fast scan direction is 1152 pixels.

The resolution in the slow scan direction is 900 lines.

These assumptions characterize a Sun high resolution monochrome monitor. This display was chosen so that the example has a dose of reality, without the complications which a faster pixel clock or more bits per pixel would introduce. The architecture is easily scaled to handle much higher bandwidth and resolution.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 6. Programmable Logic Tiling

Figure 6 shows the physical organization of the programmable logic integrated circuits on the printed circuit board. The width of the array corresponds to the width of the column of machines. This assumes that some machines will require this width to efficiently implement their function. Information propagates from top to bottom, flowing through each of the machines programmed into the array.

We will use the (f, s) coordinate system, corresponding to the fast and slow scan directions. For a display this means that the origin is in the upper left corner, fast axis increasing to the right, and slow axis increasing down.

In the remainder of this paper the example descriptions consists of two levels of abstraction. The first is a logic abstraction expressed using the system described in [Barth et al.]. This logic abstraction is intended to convey function, not to represent a minimal implementation. The lower level description of the programmable logic implementation is in terms of an abstraction developed in [Barth]. We will refer to this level as assembly language. The assembly language implementations have been minimized, and so are used to estimate the space and time cost of each machine.

As long as each machine can be arbitrarily pipelined or made to handle bigger chunks, the pixel bandwidth is constrained only by the size of the programmable logic array and the number of machines. The latency grows as more pipeline stages are added, but since the nominal cycle time is about 100ns and the required response time is about 100ms this allows for 1,000,000 pipeline stages before the latency starts to become noticeable. It requires thousands of chips to build a pipeline that long in current technology.

4.2 Machine Combination

In this section we describe the assembly of machines into a coherent pipeline and implementation techniques common to many machines.

4.2.1 Parameters

A number of parameters control the synthesis of every machine type. These parameters allow the machines to plug together simply. The first is bitsPerPixel, which, obviously enough, determines how many bits are devoted to each pixel. In our example its value is 1.

The bandwidth required by the output device cannot be achieved with current day programmable logic if only one pixel is produced during each clock cycle. To overcome this limitation we group pixels together into chunks. Each chunk consists of a group of pixels in raster scan order. The number of pixels in each chunk is known as pixelsPerChunk, and its value is 8 for our example.

Each chunk has an address in the (fast, slow) coordinate system. sAddressBits is the number of bits needed to represent a pixel address in the slow scan direction. Its value is simply the smallest integer greater than or equal to the binary logarithm of the number of pixels in the slow scan direction. Log2(900) gives us 10 for the value of sAddressBits. fAddressBits is computed in the same fashion except that the number of pixels must be divided by pixelsPerChunk. Log2(1152/8) gives us 8 for the value of fAddressBits.

4.2.2 Interface Signals

In addition to the parameters which control the sizes of things, we also must have the global wires, and the wires which carry information from machine to machine. Of course, in programmable logic, the location of these wires is not assigned until the device is loaded with its program.

Figure 6.1 illustrates the general architecture of every machine.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 6.1. Machine Architecture

PipeClock0 is a global signal that is the basic clock which transfers a chunk from one pipeline stage to another. Its period is the product of pixelsPerChunk and the pixel clock period. For our example its value is 8*5ns, or 40ns. This is not an unreasonable number, but it is not certain that the examples will be able to run at this rate. More programmable logic engineering work must be completed to place reasonable confidence in this number.

Frequently a finite-state submachine must be implemented within a pipeline stage. PipeClock1 is the same as PipeClock0 except that it is phase shifted so that the leading edge occurs after the leading edge of PipeClock0 by the worst case delay of the combinatorial network leading from a PipeClock0 register to a PipeClock1 register.

SlowAddress, of size sAddressBits, carries the slow scan address of the current chunk.

FastAddress, of size fAddressBits, carries the fast scan address of the current chunk.

Chunk carries the value of the pixels. Physically, it is a double-indexed linear array of bits. The array index which varies most quickly is of size pixelsPerChunk. The array index which varies more slowly is of size bitsPerPixel. It is organized in this fashion so that during pixel shift operations, the wires which must communicate are as close as possible. Logically, it is easier to describe machines with the array indicies reversed, thus the logic diagrams show the order one way, while the assembly language description is the other. If you didn't understand anything but the first sentence of this paragraph, don't worry, it is not critical to understanding the remainder of the paper.

4.2.3 Address decoding

All of the examples we present here have rectangular bounding boxes. These simple bounding boxes allow simple address decoding. Address decoding occurs within each machine to determine if the machine should operate on the current chunk. The technique shown in figure 7 relies on efficient programmable logic implementation of comparison to a constant and edge-triggered flip flops. It also relies upon the pixels appearing in scan line order.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 7. Rectangle Address Decode Logic

If either the slow or fast decoders are only active for one cycle then the sequential decoding circuit reduces to a simple combinatorial decode.

4.2.4 Machine Interface Considerations

The previous sections have described a logical structure for the interconnection of machines. Here we consider aspects of machine assembly which depend upon the underlying technology constraints.

The most fundamental constraint is the physical constraint. How do we ensure that the physical locations of the logical networks which interconnect machines line up? The simplest answer, which is assumed in the remainder of this paper, is that we find the worst case width across all of the machines and then bloat the smaller machines to match it. More complicated and optimal solutions are possible.

For example, some machines may require more width because they need more parallelism to achieve the required bandwidth. Such machines can be matched to narrower machines through the use of reorganizing buffers, which match the data path width of one machine to that of another. This assumes that one machine is running at a cycle time which is an integer multiple of the other.

Another technology constraint is the propagation of inversions. The underlying programmable logic architecture relies upon computing a parity function of inversions due to functional requirements of the source and sinks of a network, as well as the number of corners between source and sink. The software which stitches machines together into a coherent pipeline must account for these inversions properly.

4.2.5 Storage

Machines which store sample maps as part of their internal structure may be extremely memory intensive. The programmable logic offers 8K bits per chip. This is almost three orders of magnitude less than can be achieved with commercial RAM. In this paper we ignore this optimization opportunity, merely mentioning its existence, and noting some facts which affect any architectural attempt to provide access to this dense storage.

The most obvious fact is that the bandwidth available from commercial RAM is limited to that which can be funneled through the pins of the device. Frequently, the quest for capacity must take a back seat to the necessity for bandwidth.

Another problem is the desire to provide a uniform tiling of the plane. This desire is motivated by the need to insert machines into the pipeline in z order. The z order constraint could be removed if the z depth were passed along with each pixel, but this results in a huge space explosion.

Finally, the physical placement of the commercial RAM is not obvious. Should it fit in between each programmable logic device? Float above them? Be moved to a wholly seperate location with a bus connecting the programmable logic to it?

In the end the most desirable solution is to expand the amount of RAM available uniformly within the programmable logic array.

4.3 Synthesizing Machines from Objects

A simple conceptual model to keep in mind while reading this section is that each machine corresponds to a single abstract graphical object, such as a line, or a collection of text. The translation is performed by a software library, which has a procedure for each abstract object type, that converts to the corresponding machine description. The beginning of such a software library is contained in appendix A.

4.3.1 Box

Figure 8 shows the parameterized machine for writing a box all the way down the image vertically. It simply decodes the chunks which have some subset of their pixels set to a constant value.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 8. Box Machine Logic

Assume the width of the box is 4 pixels and positioned in the middle of the second column chunk. The chunk decode circuitry is then s=xxx and f=01. The start bit is 2 and the width is 4. Figure 9 is an implementation of the machine that draws this box on this display.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 9. Box Machine Programming

SlowAddress is not shown since it is simply pipelined through the machine. It requires another 3 minor arrays. This machine requires 7 minor arrays total. This is about 3% of one programmable logic chip.

All boxes which manipulate common portions of a chunk in the same fashion share the chunk data path. Each position is simply added as an OR term in the address decode.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 10. Fully Constrained Box Machine Layout

Figure 10 shows a layout of tiles which implements a box constrained on all four edges. The width of this layout is 14 minor arrays; the height is 1 minor array. The slow address passes vertically through Slow-Decode; fast address through Fast-Decode; and Chunk through Data-Path. Start-Stop-State-Slow takes two nets from Slow-Decode which are asserted whenever the slow chunk address is equal to the first or last address contained in the box, and produces a net which indicates that the current chunk is within the bounding box of the Box. Start-Stop-State-Fast performs a similiar function for the fast address. Box-Decode-To-Data-Path transforms the nets indicating inside box into nets which indicate that the current chunk is along the left edge of the box, wholly within the box, along the right edge of the box, or not inside the box at all. Data-Path then performs the proper modifications to the pixels by either passing the old pixel value through, or replacing it with ink to draw the box.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 11. Box Programming Details

Figure 11 shows the details of the the programming for some of the minor arrays within a box machine. The polarity of the signals is not necessarily correct, i.e. the inversion tiles have not been inserted. Appendix A contains a Cedar program which synthesizes box drawing machines for any single bit per pixel device by calling the CreateBox routine. Multiple bits per pixel is a straightforward extension of the illustrated technique.

4.3.2 Sample Map

Sample Map implementation is an extension of Box implementation. However it reads the sample map out of a ROM tile instead of forcing the pixels to a fixed ink, as does the Box implementation. The ROM tile is loaded with the sample map at the time the machine is inserted into the programmable logic array. The sample map implementation contains a counter which indexes the sample map ROM. Each time a chunk is within the sample map bounding box, the machine fetches the next set of pixels from the ROM and inserts it into the chunk. The same clipping operation performed for the box must also be performed for the sample map so that the left and right edges are not constrained to be coincident to the chunk boundaries.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 12. Sample Map Machine Layout

Figure 12 shows the layout of tiles to implement a sample map. Obviously this arrangement does not allow dense storage of the sample maps. This can be fixed in a number of ways, for example, the width of the data path can be made larger so that the ratio of data path to control is larger. Another fix is to use a row for multiplexing pixel maps stored in another row below the multiplexor, so that the control and data storage are interleaved.

Further details of the implementation can be found in the procedure CreateSampleMap in Appendix A.

4.3.3 Interpress Imaging Model Machine

Figure 13 shows the space requirements for the data path of a machine that implements the Interpress imaging model at the pixel level. This machine assumes that the ink is stored as a sample map and that the masks are stored as bit maps. It contains two full barrel shifters so that ink can be arbitrarily aligned to mask, which can be arbitrarily aligned to the clipper, which finally chooses between the combined ink and mask, and the original data.

[ Nectarine figure; type 'Artwork on' to a CommandTool ]

Figure 13. Interpress Imaging Model Machine Datapath Layout

It is mostly true that n could be defined has the product of pixelsPerChunk and bitsPerPixel and the resulting space requirements would be quite close to that predicted by the equasions. Each thick horizontal bar represents a horizontal cut point at which the machine could be split across chip boundaries. All other locations require at least two signals to flow between minor arrays. The programmable logic only allows a single signal. Of course, at the price of reduced density, the implementation can be made to split across chip boundaries anywhere.

5.0 Display Specifics

The requirements which distinguish the display application from the printer application include the desire for interactive control over the viewpoint and image content. Viewpoint control implies panning and zooming.

Panning is relatively simple to accomplish within the framework described in the previous section. One need only expand the range of chunk addresses to include the entire image, rather than just the portion visible on the display, and then change the background generator to generate the chunk addresses which are currently visible on the display. Panning is then simply a matter of changing the starting chunk address in the background generator.

Zooming is more difficult. Within the architecture described above scaling has already been performed prior to loading the array. In this case zooming requires the host to regenerate the machines for the image. This is not particularly difficult for images composed of text and lines, that is just a matter of transforming a few points and reloading the font storage. Scenes with a large pixel map content are more difficult. Of course all of the computer graphics hacks applicable to frame buffer systems apply here as well.

Changing the content of the screen requires incrementally reprogramming the machines. For example, characters must be inserted into text objects. The simplest model retains a copy of the array programming in the host memory and block transfers the incrementally changed description into the array during vertical retrace. More sophisticated approaches can be imagined but optimizations upon this scheme should only be considered as required.

5.1 Zooming

Examine the added complexity due to zooming. Requires dividing rectangle width and height by z. The size of a pixel changes and the origin of an object changes in the screen coordinate system.

5.2 Performance Ratios

Create a table comparing the rate at which this machine does interesting interactive graphic functions versus existing commercial hardware. Normalize by cost.

6.0 Printer Specifics

The major requirement which is different for printers is the greater number of pixels which must be generated. For current day printers the number of bits per pixel is substantially less than for a high quality display, but even this constraint relaxation is not likely to last. Fortunately, the requirement for real time modification of viewpoint and image content is absent in this application. This eliminates the tight real-time constraint for generating, or incrementally changing, the array programming. Any reasonable amount of time can be spent transforming the image from its source representation into the representation required to print the image on a specific printer.

Another major difference is the rate at which images are rasterized. An image is required approximately twice per second, rather than the 80 images/second which high quality displays require. The effective bandwidth may not be much different since the number of pixels is so much greater but the time available to load the array is significantly relaxed.

There are several advantages of this approach over compressed raster schemes. One is the ability to encode the image in a image dependent manner. Rather than designing a compression scheme which works well over a statistical sample of images, each image can be encoded in a manner specific to its content. Another advantage is that the hardware implementation relies solely on generic components. There is no need to design a family of special purpose, highly complicated, integrated circuits.

6.1 Encoding Efficiency

It may be the case that the image, encoded in this fashion, will require significantly less storage. This would reduce the amount of money invested in every other portion of a printing system. The network connection need not be as fast, the disk can be smaller, the disk transfer rate can be less, and the host memory can be smaller.

Create a table which compares the number of bits required to load a typical page into an array versus the number of bits in the form currently used for a high end printer. Also compare the cost of the different architectures.

7.0 Conclusion

7.1 In Contrast

If a standard von Neumann architecture provides sufficient bandwidth to rasterize the images of interest then there is no need to do anything else. Hmmm...That isn't quite true. It might be possible to reduce the cost substantially by reducing the performance requirements of the v. Neumann machine, don't need fancy high speed memory systems with their attendant wide wires, many RAM chips, caches, etc.

However, if the image complexity or image size, as the product of width, height, and bits per pixel, is too large for a traditional implementation, then this scheme is of interest.

Advantages over traditional BitBLT implementations for both printers and displays:

Only have to touch pixels once instead of once per object.

Pixels scanned in serial order independent of objects, no need to cycle RAM address bits more than minimum.

Multi operand operations can use memory built into array instead of either time multiplexing existing memory paths or adding additional memory paths.

Advantages for displays:

The image is stored in a much more abstract manner so that animation becomes much easier.

7.2 Knowns and Unknowns

Implementation of the underlying programmable logic assembly abstraction is not an issue. An earlier version of the architecture defined in [Barth] has been captured at the transistor level, critical cells layed out in 2 micron double-level metal CMOS, and critical path simulations performed. It easily fit within a 1 cm square die, quite a reasonable die size. The performance was not terrific, but every attempt has been made in the current architecture to mitigate the performance problems encountered in that exercise. Many man months of work remain between the current architecture definition and its realization as a board which plugs into a workstation. However, this is, for the most part, straightforward engineering.

A much larger question is the comparison, in terms of price-performance, of this technique against traditional methods of rasterization. A relatively large body of application specific software needs to be built in order to answer this question in detail. At the handwaving level, this requires a characterization of the images which we would like to rasterize, and rough numbers to translate this characterization into space and performance numbers. This requires some time investment on the part of experts in this application domain.

7.3 Finally

By now, gentle reader, you should be convinced that utilizing programmable logic for rasterizing 2.5D images is a wonderful idea. If you are not, then I have failed to put across the idea sufficiently clearly. Or, worse, you have thought of something I failed to consider, which demolishes the whole idea. In either case I'd like to hear from you. Do write or visit.

References

[Barth]

R. Barth, ``A Programmable Logic Architecture'' tentative title, in preparation.

[Barth et al.]

R. Barth, L. Monier, B. Serlet, and P. Sindhu, ``VLSI Design Aids: Capture, Integration, and Layout Generation,'' Computer Science Laboratory Technical Report CSL-88-1, Xerox Palo Alto Research Center, July 1988.

[Deering et al.]

M. Deering, S. Winner, B. Schediwy, C. Duffy, and N. Hunt, ``The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics,'' Proceedings of SIGGRAPH '88, in Computer Graphics Vol. 22, No. 4, August 1988.

[Sutherland et al.]

I. Sutherland, R. Sproul, and R. Schumacker, ``A Characterization of Ten Hidden-Surface Algorithms,'' acm computing surveys, March 1974.

Appendix A

SoftHdw25D.mesa

Barth, February 9, 1989 3:54:03 pm PST

DIRECTORY BitOps, CD, CDBasics, CDInstances, CDOps, DABasics, ImagerSample, PW, Rope, SF;

SoftHdw25D: CEDAR PROGRAM

IMPORTS BitOps, CD, CDBasics, CDInstances, CDOps, ImagerSample, PW, SF

EXPORTS

= BEGIN

TileType: TYPE = {MinorArray, BoardTraces, Corner, Clock0, Clock1, RAMEven, RAMOdd, ROMEven, ROMOdd, MuxRDLUToRDEven, MuxRDLUToRDOdd, MuxLURDToLUEven, MuxLURDToLUOdd, Program, Inverter, FlipFlop, OLUToI, OLUToRD, ORDToI, ORDToLU, ORDToL, RToI, LUToLU, LUToI, RDToRD, RDToI, OToP, RDToP, LUToP, LToP, PToI, PToRD, StartStopStateSlow, StartStopStateFast, BoxDecodeToDataPath, SampleMapCount, SampleMapExtension, SampleMapMux, SampleMapRam};

tileNames: ARRAY TileType OF Rope.ROPE ← ["MinorArray", "BoardTraces", "Corner", "Clock0", "Clock1", "RAMEven", "RAMOdd", "ROMEven", "ROMOdd", "MuxRDLUToRDEven", "MuxRDLUToRDOdd", "MuxLURDToLUEven", "MuxLURDToLUOdd", "Program", "Inverter", "FlipFlop", "OLUToI", "OLUToRD", "ORDToI", "ORDToLU", "ORDToL", "RToI", "LUToLU", "LUToI", "RDToRD", "RDToI", "OToP", "RDToP", "LUToP", "LToP", "PToI", "PToRD", "StartStopStateSlow", "StartStopStateFast", "BoxDecodeToDataPath", "SampleMapCount", "SampleMapExtension", "SampleMapMux", "SampleMapRam"];

tileObjects: ARRAY TileType OF CD.Object ← ALL[NIL];

InterconnectTileType: TYPE = TileType[OLUToI..RDToI];

PinTileType: TYPE = TileType[OToP..PToRD];

SoftHdwPgmTileType: TYPE = TileType[MinorArray..PToRD];

SoftHdw25DTileType: TYPE = TileType[StartStopStateSlow..SampleMapRam];

MachineSizes: TYPE = RECORD [sAddressBits, fAddressBits, pixelsPerChunk, bitsPerPixel: INT];

LoadTiles: PROC = {

design: CD.Design ← PW.OpenDesign["SoftHdwPgm.dale"];

CDOps.SetMutability[design, readonly];

FOR index: TileType IN SoftHdwPgmTileType DO

tileObjects[index] ← PW.Get[design, tileNames[index]];

ENDLOOP;

design ← PW.OpenDesign["SoftHdw25D.dale"];

CDOps.SetMutability[design, readonly];

FOR index: TileType IN SoftHdw25DTileType DO

tileObjects[index] ← PW.Get[design, tileNames[index]];

ENDLOOP;

};

CreateChip: PROC [size: DABasics.Position] RETURNS [object: CD.Object] = {

minorArray: CD.Object ← tileObjects[MinorArray];

boardTraces: CD.Object ← tileObjects[BoardTraces];

boardTracesSize: DABasics.Position ← CD.InterestSize[boardTraces];

corner: CD.Object ← tileObjects[Corner];

inner: CD.Object ← PW.Array[minorArray, size.x, size.y];

side: CD.Object ← PW.ArrayY[boardTraces, size.y];

middle: CD.Object ← PW.AbutX[side, inner, side];

topBottom: CD.Object ← PW.AbutX[corner, PW.ArrayX[PW.FlipX[PW.Rot90[boardTraces]], size.x], corner];

object ← PW.AbutY[topBottom, middle, topBottom];

};

lambda: INT = 8;

minorSize: INT = 128*lambda;

minorSpace: INT = 24*lambda;

tileWidth: INT = 32*lambda;

tileHeight: INT = 24*lambda;

outputHeight: INT = 20*lambda;

inputHeight: INT = 16*lambda;

tile0: INT = tileWidth + 3*tileHeight;

tile1: INT = tileWidth + 2*tileHeight;

tile2: INT = tileWidth + tileHeight;

tile3: INT = tileWidth;

chunksPerRAM: INT = 8;

bitsPerRAMROMPair: INT = 8;

CreateSampleMap: PROC [sizes: MachineSizes, position: ImagerSample.Vec, map: ImagerSample.SampleMap] RETURNS [object: CD.Object] = {

EmitRAM: PROC [y: INT] = {

FOR index: INT IN [0..sizes.pixelsPerChunk/bitsPerRAMROMPair) DO

x: INT ← decodeWidth-tileWidth+2*index*minorSize;

il ← CONS[InstantiateTile[SampleMapRam, [x, y]], il];

ENDLOOP;

};

box: SF.Box ← SF.Displace[ImagerSample.GetBox[map], position];

widthInChunks: INT ← ((box.max.f-1)/sizes.pixelsPerChunk) - (box.min.f/sizes.pixelsPerChunk) + 1;

heightInChunks: INT ← box.max.s-box.min.s;

chunkCount: INT ← widthInChunks*heightInChunks;

decodeObject: CD.Object ← CreateBoxDecode[sizes, box];

decodeWidth: INT ← CD.InterestSize[decodeObject].x;

il: CD.InstanceList ← NIL;

IF ImagerSample.GetBitsPerSample[map]#1 THEN ERROR; -- not yet implemented

IF sizes.bitsPerPixel#ImagerSample.GetBitsPerSample[map] THEN ERROR;

IF (sizes.pixelsPerChunk MOD bitsPerRAMROMPair)#0 THEN ERROR; -- not yet implemented

il ← CONS[CreateInstance[decodeObject, [0, 0]], il];

il ← CONS[InstantiateTile[SampleMapCount, [decodeWidth - tileWidth - 7*minorSize, -minorSize]], il];

FOR index: INT IN [0..sizes.pixelsPerChunk/bitsPerRAMROMPair) DO

x: INT ← decodeWidth-tileWidth+2*index*minorSize;

il ← CONS[InstantiateTile[SampleMapMux, [x, 0]], il];

ENDLOOP;

EmitRAM[-minorSize];

FOR index: INT IN [1..(chunkCount/chunksPerRAM)-1] DO

y: INT ← -(index+1)*minorSize;

il ← CONS[InstantiateTile[SampleMapExtension, [decodeWidth - tileWidth - 4*minorSize, y]], il];

EmitRAM[y];

ENDLOOP;

object ← PW.CreateCell[il, [0, -(chunkCount/chunksPerRAM)*minorSize, decodeWidth + ((sizes.pixelsPerChunk+3)/4)*minorSize, minorSize+tileWidth]];

sample: ImagerSample.Sample ← ImagerSample.Get[map, [s, f]];

};

CreateBox: PROC [sizes: MachineSizes, box: SF.Box] RETURNS [object: CD.Object] = {

decodeObject: CD.Object ← CreateBoxDecode[sizes, box];

decodeWidth: INT ← CD.InterestSize[decodeObject].x;

il: CD.InstanceList ← NIL;

il ← CONS[CreateInstance[decodeObject, [0, 0]], il];

IF sizes.bitsPerPixel#1 THEN ERROR; -- not yet implemented

FOR dataIndex: INT IN [0..sizes.pixelsPerChunk) DO

x: INT ← decodeWidth + outputHeight + ((dataIndex/4)*minorSize) + ((dataIndex MOD 4)*minorSpace);

il ← CONS[InstantiateProgramTile[[x, tile3 + inputHeight]], il];

IF dataIndex >= (box.min.f MOD sizes.pixelsPerChunk) THEN il ← CONS[InstantiateProgramTile[[x, tile0 + inputHeight]], il];

IF dataIndex < (box.max.f MOD sizes.pixelsPerChunk) THEN il ← CONS[InstantiateProgramTile[[x, tile1 + inputHeight]], il];

ENDLOOP;

object ← PW.CreateCell[il, [0, 0, decodeWidth + ((sizes.pixelsPerChunk+3)/4)*minorSize, minorSize+tileWidth]];

};

CreateBoxDecode: PROC [sizes: MachineSizes, box: SF.Box] RETURNS [object: CD.Object] = {

slowObject: CD.Object ← CreateStartStop[sizes.sAddressBits, box.min.s, box.max.s, StartStopStateSlow];

slowWidth: INT ← CD.InterestSize[slowObject].x;

fastObject: CD.Object ← CreateStartStop[sizes.fAddressBits, box.min.f/sizes.pixelsPerChunk, box.max.f/sizes.pixelsPerChunk, StartStopStateFast];

fastWidth: INT ← CD.InterestSize[fastObject].x;

il: CD.InstanceList ← NIL;

il ← CONS[CreateInstance[slowObject, [0, 0]], il];

il ← CONS[CreateInstance[fastObject, [slowWidth-tileWidth, 0]], il];

il ← CONS[InstantiateTile[OLUToRD, [slowWidth-tileWidth, tile3]], il];

il ← CONS[CreateInstance[tileObjects[BoxDecodeToDataPath], [slowWidth + fastWidth - 2*tileWidth, 0]], il];

FOR minorIndex: INT IN [1..fastWidth/minorSize) DO

il ← CONS[InstantiateTile[RDToRD, [slowWidth - tileWidth + minorIndex*minorSize, tile3]], il];

ENDLOOP;

IF ((fastWidth/minorSize) MOD 2) = 0 THEN il ← CONS[InstantiateTile[Inverter, [slowWidth + fastWidth - 2*tileWidth, tile3]], il];

object ← PW.CreateCell[il, [0, 0, slowWidth + fastWidth + minorSize - tileWidth, minorSize+tileWidth]];

};

CreateStartStop: PROC [bits: INT, start, stop: INT, logicTile: TileType] RETURNS [object: CD.Object] = {

startObject: CD.Object ← CreateDecoder[bits, start];

IF start=stop THEN {

ERROR; -- single chunk height or width boxes not yet implemented

}

ELSE {

stopObject: CD.Object ← CreateDecoder[bits, stop];

decoder: CD.Object ← PW.AbutY[stopObject, startObject];

il: CD.InstanceList ← NIL;

il ← CONS[CreateInstance[decoder, [0, tileWidth + 2*tileHeight]], il];

il ← CONS[InstantiateTile[logicTile, [minorSize*((bits+1)/2), 0]], il];

object ← PW.CreateCell[il, [0, 0, minorSize*(((bits+1)/2)+1) + tileWidth, minorSize+tileWidth]];

};

CreateDecoder: PROC [bits: INT, constant: INT] RETURNS [object: CD.Object] = {

il: CD.InstanceList ← NIL;

FOR bitIndex: INT IN [0..bits) DO

adjust: INT ← minorSize*(bitIndex/2) + tileWidth + inputHeight + 2*minorSpace*(bitIndex MOD 2) + (IF BitOps.EBFD[constant, bitIndex, bits] THEN minorSpace ELSE 0);

il ← CONS[InstantiateProgramTile[[adjust, outputHeight]], il];

ENDLOOP;

FOR minorIndex: INT IN [1..(bits+1)/2) DO

x: INT ← minorIndex*minorSize;

il ← CONS[InstantiateTile[OLUToI, [x, 0]], il];

il ← CONS[InstantiateTile[Inverter, [x, 0]], il];

il ← CONS[InstantiateProgramTile[[x+tileWidth, outputHeight]], il];

ENDLOOP;

object ← PW.CreateCell[il, [0, 0, minorSize*((bits+1)/2) + tileWidth, tileHeight]];

};

InstantiateProgramTile: PROC [position: DABasics.Position] RETURNS [instance: CD.Instance] = {

object: CD.Object ← tileObjects[Program];

size: DABasics.Position ← CD.InterestSize[object];

size.x ← size.x/2;

size.y ← size.y/2;

instance ← CreateInstance[object, CDBasics.SubPoints[position, size]];

};

InstantiateTile: PROC [tile: TileType, position: DABasics.Position, orientation: DABasics.Orientation ← original] RETURNS [instance: CD.Instance] = {

object: CD.Object ← tileObjects[tile];

instance ← CreateInstance[object, position, orientation];

};

CreateInstance: PROC [object: CD.Object, offset: DABasics.Position, orientation: DABasics.Orientation ← original] RETURNS [instance: CD.Instance] = {

base: DABasics.Position ← CD.InterestBase[object];

tbase: DABasics.Position ← CDBasics.MapPoint[base, [[0, 0], orientation]];

toff: DABasics.Position ← CDBasics.SubPoints[offset, tbase];

instance ← CDInstances.NewInst[object, [toff, orientation]];

};

Test: PROC = {

sizes: MachineSizes ← [10, 8, 8, 1];

myBox: SF.Box ← [[1,1],[9,9]];

chip: CD.Object ← CreateChip[[16, 8]];

decoder: CD.Object ← CreateDecoder[sizes.sAddressBits, 5];

startStop: CD.Object ← CreateStartStop[sizes.sAddressBits, 5, 7, StartStopStateSlow];

boxDecode: CD.Object ← CreateBoxDecode[sizes, myBox];

box: CD.Object ← CreateBox[sizes, myBox];

map: ImagerSample.SampleMap ← ImagerSample.NewSampleMap[box: myBox, bitsPerSample: 1];

sampleMap: CD.Object ← CreateSampleMap[sizes, [1,1], map];

il: CD.InstanceList ← NIL;

il ← CONS[CreateInstance[chip, [-32*lambda,-32*lambda]], il];

il ← CONS[CreateInstance[decoder, [0,tileWidth]], il];

il ← CONS[CreateInstance[startStop, [0,minorSize]], il];

il ← CONS[CreateInstance[boxDecode, [0,2*minorSize]], il];

il ← CONS[CreateInstance[box, [0,3*minorSize]], il];

il ← CONS[CreateInstance[sampleMap, [0,4*minorSize]], il];

[] ← PW.Draw[PW.CreateCell[il]];

};

LoadTiles[];

END.