1.0 Introduction
This paper describes the rasterization of 2.5D images with RAM programmable logic. The basic notion is to program a machine for each object in the image, z sort the machines, and then flow chunks of pixels through the pipeline thereby created. This contrasts with the standard graphics pipeline in which the objects flow through a transformation pipeline to produce pixels in a frame buffer. Rather than the transformation operators remaining stable in space, and the objects flowing through them, the objects, represented as machines, remain stable in space, and the pixels flow through them.
[ Nectarine figure; type 'Artwork on' to a CommandTool ]
Figure 1. Abstract Architecture
Figure 1 shows the abstract architecture. All of the items in the left column represent machines which are coded into programmable logic. The host is responsible for loading the programmable logic with these machines. The clock generator generates clocks which are multiples of the basic pixel clock. In the simplest implementation, all of the machines besides the parallel to serial converter run at one clock rate, and the parallel to serial converter converts from that clock rate to the pixel clock rate.
The architecture scales arbitrarily for bandwidth (represented as the product of image output rate and number of pixels) and the number of objects. With the exception of clock distribution and the relatively low-speed interface to the host, all of the connections are nearest neighbor. There is no fixed complexity limit for a particular image because the host can utilize the programmable logic to rasterize objects, and combine the resulting pixel maps together to produce a single pixel map machine. The only result is a decrease in performance, assuming there is enough pixel memory somewhere to form a frame buffer.
An alternative architecture uses the programmable array as a rasterization engine whose output is fed into a standard frame buffer. The output device is then driven from this frame buffer. This architecture can be scaled down to almost nothing, perhaps a single programmable logic device. This board level architecture is essentially the same as current graphics engines. The difference is the binding time of the raster generator operators. In prior designs these operators are bound at manufacturing time, and typically consist of an abstract set of operators such as polygon, rectangle, and text. In the proposed design the parameters of these abstract operators are bound prior to synthesizing the logic which generates the raster. This enables the simple common cases to be handled in a highly optimized manner.
Of course programmable logic is not limited to this application. Most of the complexity resides in the programming of the chips. The design of the chips themselves is extremely simple once the architecture is defined, so the chips can track technology readily. In fact, they are so simple that they will take advantage of advanced technology long before standard microprocessors. Since the number of applications is large, there will be motivation to produce these chips in large quantities, consequently reducing their cost.
Programmable logic is an enabling technology because it allows the grain size of a machine to be smaller than a standard von Neumann processor and allows highly optimized machines to be constructed for the simple cases, which are likely to be the most common ones.
This technique is applicable to all types of raster scan output devices. In section 2.0 we describe the relationship of this technique to previous published descriptions of rasterizing hardware as well as commercial products. Section 3.0 considers a number of alternative board level architectures. Section 4.0 covers the programming of the machines which reside in programmable logic. Sections 5.0 and 6.0 discuss design considerations specific to displays and printers. Finally, in section 7.0, we conclude with the mandatory conclusion section, where we repeat what we just said, but hopefully you, dear reader, will have the attitude of belief, rather than skepticism, by the time you get there.
3.0 Board Level Architecture
In this section we consider a number of alternative board-level architectures.
First a few definitions. The animation rate is the number of images per second which must be produced to give the appearance of smooth motion. The response time is the maximum tolerable latency from user input to modified image. The display rate is the bandwidth required to service the output device. A CRT requires a display rate many times the animation rate because of the lack of persistence in the phosphor.
The first two architectures are designs which represent extreme simplicity. The first allows for arbitrary images, the second for limited animation. The next combines the first two to allow arbitrary images with limited animation. The last decouples the display rate from the animation rate, and allows the raster generator to be used in a multipass method, so that the ratio between processing power and memory is tunable.
3.1 Frame Buffer
The simplest practical implementation of a raster output device relies completely on the host processor to rasterize the image. Figure 2 shows such a system, in which the frame buffer serves to isolate the bandwidth requirements of the display from the capability of the host to supply pixels.
[ Nectarine figure; type 'Artwork on' to a CommandTool ]
Figure 2. Frame Buffer System
This is the standard system used in many low priced computers. It does not contain any programmable logic, but serves as a capability reference point in that it allows display of arbitrary images and has decoupled the display rate and the animation rate.
3.2 Raster Generator
The simplest possible system which utilizes programmable logic for rasterization is depicted in figure 3. This is the system which was described in the introduction. The complexity of an image produced by this system is limited by the the space available in the programmable logic of the raster generator to store object descriptions. Thus, unlike the simple frame buffer system, arbitrary images cannot be displayed. However, also unlike the frame buffer system, the objects are stored in a very abstract manner, making it quite simple to animate the image.
[ Nectarine figure; type 'Artwork on' to a CommandTool ]
Figure 3. Raster Generator System
The programming of the raster generator is designed to produce pixels at the peak rate demanded by the output device. The clock generator takes care of producing vertical and horizontal sync. During either sync the clocks to the array are stopped. During vertical sync the host can reprogram the array to render a different image. This reprogramming could just be a block transfer of an entire new machine pipeline from host memory into the array.
3.3 Raster Generator and Frame Buffer
In this system the capabilities of the two previous systems have been combined. The host can store an arbitrary image in the frame buffer and can store objects in the raster generator to improve animation capabilities. This is a typical video game architecture.
[ Nectarine figure; type 'Artwork on' to a CommandTool ]
Figure 4. Raster Generator and Frame Buffer System
The amount of animation is limited by the raster generator size.
3.4 Machine Buffer, Raster Generator and Frame Buffers
Finally we add multiple frame buffers plus a machine buffer. The machine buffer stores machines and loads them into the raster generator as needed. This enables multiple pass use of the raster generator.
[ Nectarine figure; type 'Artwork on' to a CommandTool ]
Figure 5. Machine Buffer, Raster Generator, and Frame Buffer System
This system design is motivated by the recognition that memory is cheap. Even if RAM programmable logic becomes the accepted implementation mechanism for a wide variety of systems it will never be as cheap for state storage as DRAMs. In this architecture the output display rate and the rasterization rate are completely decoupled. The display multiplexor switches at the animation rate. The pixels come out of the output device selected frame buffer at the pixel clock rate. The other frame buffers either source or sink pixels at the raster generator rate. No frame buffer both sources and sinks pixels during a single rendering of a single image, i.e. the frame buffers are single ported so that they may be built out of cheap, standard, RAM rather than video RAM. The machine buffer cycles through all of its machines at the animation rate. There is a blank time slot in the raster generator between the last pixel of the previous image rendering and the first pixel of the next image rendering so that the machine buffer has time to insert new machines into the rasterization pipeline. The size of the frame buffers are fixed by the display, as the product of its width and height in pixels, and the number of bits per pixel. The machine buffer size is fixed by the complexity of the animated objects. The raster generator size is fixed by the machine-complexity, animation-rate product.
In this design one frame buffer holds the background pixel map. It consists of objects which are not moving and do not have any interaction with those that are moving. A pair of frame buffers is used as a ping-pong buffer while the raster generator is making multiple passes through the pixels, each time with a different set of machines loaded into it. The last frame buffer is refreshing the display.
The raster generator has control over the pixels which are transmitted from the frame buffer into the raster generator. This allows a gross clipping to be performed, so that multipass methods need not waste bandwidth passing pixels through the pipeline which are never modified by the current machines in the pipeline. If the raster generator is used in this fashion then three of the frame buffers must be combined into two dual-ported frame buffers, which ping-pong between the output device and the raster generator.
A very clever implementation could reduce the number of frame buffers to one single ported background frame buffer and another dual-ported frame buffer which is shared by the raster generator and the output device. However, this places a number of severe timing constraints on the raster generator.
All of the components in this architecture are memory-mapped to a host processor which is responsible for loading and coordinating their actions. Of course, there is a single time base which generates all of the clocks described.
Appendix A
SoftHdw25D.mesa
Copyright Ó 1988 by Xerox Corporation. All rights reserved.
Barth, February 9, 1989 3:54:03 pm PST
DIRECTORY BitOps, CD, CDBasics, CDInstances, CDOps, DABasics
, ImagerSample, PW, Rope, SF;
SoftHdw25D:
CEDAR
PROGRAM
IMPORTS BitOps, CD, CDBasics, CDInstances, CDOps, ImagerSample, PW, SF
EXPORTS
= BEGIN
TileType: TYPE = {MinorArray, BoardTraces, Corner, Clock0, Clock1, RAMEven, RAMOdd, ROMEven, ROMOdd, MuxRDLUToRDEven, MuxRDLUToRDOdd, MuxLURDToLUEven, MuxLURDToLUOdd, Program, Inverter, FlipFlop, OLUToI, OLUToRD, ORDToI, ORDToLU, ORDToL, RToI, LUToLU, LUToI, RDToRD, RDToI, OToP, RDToP, LUToP, LToP, PToI, PToRD, StartStopStateSlow, StartStopStateFast, BoxDecodeToDataPath, SampleMapCount, SampleMapExtension, SampleMapMux, SampleMapRam};
tileNames: ARRAY TileType OF Rope.ROPE ← ["MinorArray", "BoardTraces", "Corner", "Clock0", "Clock1", "RAMEven", "RAMOdd", "ROMEven", "ROMOdd", "MuxRDLUToRDEven", "MuxRDLUToRDOdd", "MuxLURDToLUEven", "MuxLURDToLUOdd", "Program", "Inverter", "FlipFlop", "OLUToI", "OLUToRD", "ORDToI", "ORDToLU", "ORDToL", "RToI", "LUToLU", "LUToI", "RDToRD", "RDToI", "OToP", "RDToP", "LUToP", "LToP", "PToI", "PToRD", "StartStopStateSlow", "StartStopStateFast", "BoxDecodeToDataPath", "SampleMapCount", "SampleMapExtension", "SampleMapMux", "SampleMapRam"];
tileObjects: ARRAY TileType OF CD.Object ← ALL[NIL];
InterconnectTileType: TYPE = TileType[OLUToI..RDToI];
PinTileType: TYPE = TileType[OToP..PToRD];
SoftHdwPgmTileType: TYPE = TileType[MinorArray..PToRD];
SoftHdw25DTileType: TYPE = TileType[StartStopStateSlow..SampleMapRam];
MachineSizes: TYPE = RECORD [sAddressBits, fAddressBits, pixelsPerChunk, bitsPerPixel: INT];
LoadTiles:
PROC = {
design: CD.Design ← PW.OpenDesign["SoftHdwPgm.dale"];
CDOps.SetMutability[design, readonly];
FOR index: TileType
IN SoftHdwPgmTileType
DO
tileObjects[index] ← PW.Get[design, tileNames[index]];
ENDLOOP;
design ← PW.OpenDesign["SoftHdw25D.dale"];
CDOps.SetMutability[design, readonly];
FOR index: TileType
IN SoftHdw25DTileType
DO
tileObjects[index] ← PW.Get[design, tileNames[index]];
ENDLOOP;
};
CreateChip:
PROC [size: DABasics.Position]
RETURNS [object:
CD.Object] = {
minorArray: CD.Object ← tileObjects[MinorArray];
boardTraces: CD.Object ← tileObjects[BoardTraces];
boardTracesSize: DABasics.Position ← CD.InterestSize[boardTraces];
corner: CD.Object ← tileObjects[Corner];
inner: CD.Object ← PW.Array[minorArray, size.x, size.y];
side: CD.Object ← PW.ArrayY[boardTraces, size.y];
middle: CD.Object ← PW.AbutX[side, inner, side];
topBottom: CD.Object ← PW.AbutX[corner, PW.ArrayX[PW.FlipX[PW.Rot90[boardTraces]], size.x], corner];
object ← PW.AbutY[topBottom, middle, topBottom];
};
lambda: INT = 8;
minorSize: INT = 128*lambda;
minorSpace: INT = 24*lambda;
tileWidth: INT = 32*lambda;
tileHeight: INT = 24*lambda;
outputHeight: INT = 20*lambda;
inputHeight: INT = 16*lambda;
tile0: INT = tileWidth + 3*tileHeight;
tile1: INT = tileWidth + 2*tileHeight;
tile2: INT = tileWidth + tileHeight;
tile3: INT = tileWidth;
chunksPerRAM: INT = 8;
bitsPerRAMROMPair: INT = 8;
CreateSampleMap:
PROC [sizes: MachineSizes, position: ImagerSample.Vec, map: ImagerSample.SampleMap]
RETURNS [object:
CD.Object] = {
EmitRAM:
PROC [y:
INT] = {
FOR index:
INT
IN [0..sizes.pixelsPerChunk/bitsPerRAMROMPair)
DO
x: INT ← decodeWidth-tileWidth+2*index*minorSize;
il ← CONS[InstantiateTile[SampleMapRam, [x, y]], il];
ENDLOOP;
};
box: SF.Box ← SF.Displace[ImagerSample.GetBox[map], position];
widthInChunks: INT ← ((box.max.f-1)/sizes.pixelsPerChunk) - (box.min.f/sizes.pixelsPerChunk) + 1;
heightInChunks: INT ← box.max.s-box.min.s;
chunkCount: INT ← widthInChunks*heightInChunks;
decodeObject: CD.Object ← CreateBoxDecode[sizes, box];
decodeWidth: INT ← CD.InterestSize[decodeObject].x;
il: CD.InstanceList ← NIL;
IF ImagerSample.GetBitsPerSample[map]#1 THEN ERROR; -- not yet implemented
IF sizes.bitsPerPixel#ImagerSample.GetBitsPerSample[map] THEN ERROR;
IF (sizes.pixelsPerChunk MOD bitsPerRAMROMPair)#0 THEN ERROR; -- not yet implemented
il ← CONS[CreateInstance[decodeObject, [0, 0]], il];
il ← CONS[InstantiateTile[SampleMapCount, [decodeWidth - tileWidth - 7*minorSize, -minorSize]], il];
FOR index:
INT
IN [0..sizes.pixelsPerChunk/bitsPerRAMROMPair)
DO
x: INT ← decodeWidth-tileWidth+2*index*minorSize;
il ← CONS[InstantiateTile[SampleMapMux, [x, 0]], il];
ENDLOOP;
EmitRAM[-minorSize];
FOR index:
INT
IN [1..(chunkCount/chunksPerRAM)-1]
DO
y: INT ← -(index+1)*minorSize;
il ← CONS[InstantiateTile[SampleMapExtension, [decodeWidth - tileWidth - 4*minorSize, y]], il];
EmitRAM[y];
ENDLOOP;
object ← PW.CreateCell[il, [0, -(chunkCount/chunksPerRAM)*minorSize, decodeWidth + ((sizes.pixelsPerChunk+3)/4)*minorSize, minorSize+tileWidth]];
sample: ImagerSample.Sample ← ImagerSample.Get[map, [s, f]];
};
CreateBox:
PROC [sizes: MachineSizes, box:
SF.Box]
RETURNS [object:
CD.Object] = {
decodeObject: CD.Object ← CreateBoxDecode[sizes, box];
decodeWidth: INT ← CD.InterestSize[decodeObject].x;
il: CD.InstanceList ← NIL;
il ← CONS[CreateInstance[decodeObject, [0, 0]], il];
IF sizes.bitsPerPixel#1 THEN ERROR; -- not yet implemented
FOR dataIndex:
INT
IN [0..sizes.pixelsPerChunk)
DO
x: INT ← decodeWidth + outputHeight + ((dataIndex/4)*minorSize) + ((dataIndex MOD 4)*minorSpace);
il ← CONS[InstantiateProgramTile[[x, tile3 + inputHeight]], il];
IF dataIndex >= (box.min.f MOD sizes.pixelsPerChunk) THEN il ← CONS[InstantiateProgramTile[[x, tile0 + inputHeight]], il];
IF dataIndex < (box.max.f MOD sizes.pixelsPerChunk) THEN il ← CONS[InstantiateProgramTile[[x, tile1 + inputHeight]], il];
ENDLOOP;
object ← PW.CreateCell[il, [0, 0, decodeWidth + ((sizes.pixelsPerChunk+3)/4)*minorSize, minorSize+tileWidth]];
};
CreateBoxDecode:
PROC [sizes: MachineSizes, box:
SF.Box]
RETURNS [object:
CD.Object] = {
slowObject: CD.Object ← CreateStartStop[sizes.sAddressBits, box.min.s, box.max.s, StartStopStateSlow];
slowWidth: INT ← CD.InterestSize[slowObject].x;
fastObject: CD.Object ← CreateStartStop[sizes.fAddressBits, box.min.f/sizes.pixelsPerChunk, box.max.f/sizes.pixelsPerChunk, StartStopStateFast];
fastWidth: INT ← CD.InterestSize[fastObject].x;
il: CD.InstanceList ← NIL;
il ← CONS[CreateInstance[slowObject, [0, 0]], il];
il ← CONS[CreateInstance[fastObject, [slowWidth-tileWidth, 0]], il];
il ← CONS[InstantiateTile[OLUToRD, [slowWidth-tileWidth, tile3]], il];
il ← CONS[CreateInstance[tileObjects[BoxDecodeToDataPath], [slowWidth + fastWidth - 2*tileWidth, 0]], il];
FOR minorIndex:
INT
IN [1..fastWidth/minorSize)
DO
il ← CONS[InstantiateTile[RDToRD, [slowWidth - tileWidth + minorIndex*minorSize, tile3]], il];
ENDLOOP;
IF ((fastWidth/minorSize) MOD 2) = 0 THEN il ← CONS[InstantiateTile[Inverter, [slowWidth + fastWidth - 2*tileWidth, tile3]], il];
object ← PW.CreateCell[il, [0, 0, slowWidth + fastWidth + minorSize - tileWidth, minorSize+tileWidth]];
};
CreateStartStop:
PROC [bits:
INT, start, stop:
INT, logicTile: TileType]
RETURNS [object:
CD.Object] = {
startObject: CD.Object ← CreateDecoder[bits, start];
IF start=stop
THEN {
ERROR; -- single chunk height or width boxes not yet implemented
}
ELSE {
stopObject: CD.Object ← CreateDecoder[bits, stop];
decoder: CD.Object ← PW.AbutY[stopObject, startObject];
il: CD.InstanceList ← NIL;
il ← CONS[CreateInstance[decoder, [0, tileWidth + 2*tileHeight]], il];
il ← CONS[InstantiateTile[logicTile, [minorSize*((bits+1)/2), 0]], il];
object ← PW.CreateCell[il, [0, 0, minorSize*(((bits+1)/2)+1) + tileWidth, minorSize+tileWidth]];
};
};
CreateDecoder:
PROC [bits:
INT, constant:
INT]
RETURNS [object:
CD.Object] = {
il: CD.InstanceList ← NIL;
FOR bitIndex:
INT
IN [0..bits)
DO
adjust: INT ← minorSize*(bitIndex/2) + tileWidth + inputHeight + 2*minorSpace*(bitIndex MOD 2) + (IF BitOps.EBFD[constant, bitIndex, bits] THEN minorSpace ELSE 0);
il ← CONS[InstantiateProgramTile[[adjust, outputHeight]], il];
ENDLOOP;
FOR minorIndex:
INT
IN [1..(bits+1)/2)
DO
x: INT ← minorIndex*minorSize;
il ← CONS[InstantiateTile[OLUToI, [x, 0]], il];
il ← CONS[InstantiateTile[Inverter, [x, 0]], il];
il ← CONS[InstantiateProgramTile[[x+tileWidth, outputHeight]], il];
ENDLOOP;
object ← PW.CreateCell[il, [0, 0, minorSize*((bits+1)/2) + tileWidth, tileHeight]];
};
InstantiateProgramTile:
PROC [position: DABasics.Position]
RETURNS [instance:
CD.Instance] = {
object: CD.Object ← tileObjects[Program];
size: DABasics.Position ← CD.InterestSize[object];
size.x ← size.x/2;
size.y ← size.y/2;
instance ← CreateInstance[object, CDBasics.SubPoints[position, size]];
};
InstantiateTile:
PROC [tile: TileType, position: DABasics.Position, orientation: DABasics.Orientation ← original]
RETURNS [instance:
CD.Instance] = {
object: CD.Object ← tileObjects[tile];
instance ← CreateInstance[object, position, orientation];
};
CreateInstance:
PROC [object:
CD.Object, offset: DABasics.Position, orientation: DABasics.Orientation ← original]
RETURNS [instance:
CD.Instance] = {
base: DABasics.Position ← CD.InterestBase[object];
tbase: DABasics.Position ← CDBasics.MapPoint[base, [[0, 0], orientation]];
toff: DABasics.Position ← CDBasics.SubPoints[offset, tbase];
instance ← CDInstances.NewInst[object, [toff, orientation]];
};
Test:
PROC = {
sizes: MachineSizes ← [10, 8, 8, 1];
myBox: SF.Box ← [[1,1],[9,9]];
chip: CD.Object ← CreateChip[[16, 8]];
decoder: CD.Object ← CreateDecoder[sizes.sAddressBits, 5];
startStop: CD.Object ← CreateStartStop[sizes.sAddressBits, 5, 7, StartStopStateSlow];
boxDecode: CD.Object ← CreateBoxDecode[sizes, myBox];
box: CD.Object ← CreateBox[sizes, myBox];
map: ImagerSample.SampleMap ← ImagerSample.NewSampleMap[box: myBox, bitsPerSample: 1];
sampleMap: CD.Object ← CreateSampleMap[sizes, [1,1], map];
il: CD.InstanceList ← NIL;
il ← CONS[CreateInstance[chip, [-32*lambda,-32*lambda]], il];
il ← CONS[CreateInstance[decoder, [0,tileWidth]], il];
il ← CONS[CreateInstance[startStop, [0,minorSize]], il];
il ← CONS[CreateInstance[boxDecode, [0,2*minorSize]], il];
il ← CONS[CreateInstance[box, [0,3*minorSize]], il];
il ← CONS[CreateInstance[sampleMap, [0,4*minorSize]], il];
[] ← PW.Draw[PW.CreateCell[il]];
};
LoadTiles[];
END.