It is possible that the existing Xilinx chips, combined with standard commercial RAMs, are adequate for this task. One problem with them is the long time it takes to reprogram them through their serial port: the Xilinx design handbook indicates that the load time varies between 12 ms and 24 ms, which is almost an entire frame time for a display output device. So you could use Xilinx chips for prototyping either by glitching the display on every change, or glitch-free by ping-ponging completely replicated hardware. Another problem with them is the limited amount of memory within them, which makes a very deeply pipelined design quite difficult. These chips also waste a tremendous amount of space on interconnect options.
Look at Spreitzer's encoding of the screen.
Think about the relationship between the width of a text object and the size of the alignment network needed to put the pixels from font storage into the chunk.
The object types should be parameterized by the number of pixels they deal with at once. This allows the bandwidth to be scaled without rewriting the object types.
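As a sketch of this parameterization, assuming a simple bitmask model of a chunk (all names here, like pixels_per_chunk, are hypothetical, not from any real design):

```python
# Sketch: object types parameterized by the number of pixels handled
# per clock, so bandwidth scales without rewriting the object logic.
# All names are hypothetical.

class LineObject:
    def __init__(self, pixels_per_chunk, origin_x, length):
        self.ppc = pixels_per_chunk   # pixels handled at once
        self.x0 = origin_x
        self.length = length

    def emit_chunk(self, chunk_x):
        """Return the ppc-bit mask this object contributes to the
        chunk whose leftmost pixel is chunk_x (bit i = pixel i)."""
        bits = 0
        for i in range(self.ppc):
            x = chunk_x + i
            if self.x0 <= x < self.x0 + self.length:
                bits |= 1 << i
        return bits

# The same object definition works for any chunk width:
narrow = LineObject(pixels_per_chunk=4, origin_x=2, length=3)
wide = LineObject(pixels_per_chunk=16, origin_x=2, length=3)
```

The point is that only the constructor argument changes when the chunk is widened; the emission logic is untouched.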
Of course the position of the object and the position of the window can, conceptually at least, be bound when the object is generated.
For now let's assume that the object and window are rectangles. They are obviously rectilinear since this machine is working at the pixel level.
We must decide where clipping to the screen will occur; this determines the range necessary for specifying coordinate positions. To within a chunk, the position at which an object is emitted is determined by the clock cycle in which the object starts emitting its pixels. In this way objects that lie off the screen will either never be generated, because vertical sync starts the object over, or the bits for the object... I think the answer is that the host can clip objects if it needs to in order not to overflow the resources in the soft hardware. If it doesn't bother, then the machines programmed into the array are smart enough to clip properly. This mostly means that they know how to start generating pixels somewhere in the interior of an object, rather than always starting at the object origin.
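A minimal sketch of that in-array clipping for one axis, i.e., computing how far into the object's interior emission should begin (function and parameter names are hypothetical):

```python
# Sketch: an object that starts generating pixels in its interior
# when its origin lies off-screen. One axis only; hypothetical names.

def first_emitted_pixel(obj_origin, obj_size, screen_size):
    """Object spans [obj_origin, obj_origin + obj_size); screen spans
    [0, screen_size). Return (screen_start, interior_offset), where
    interior_offset says how far into the object emission begins,
    or None if the object is entirely off-screen."""
    start = max(obj_origin, 0)
    end = min(obj_origin + obj_size, screen_size)
    if start >= end:
        return None
    return start, start - obj_origin

# An object hanging 30 pixels off the left edge starts emitting
# 30 pixels into its interior:
print(first_emitted_pixel(-30, 100, 1024))   # -> (0, 30)
```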
Another way to increase bandwidth is to make the chunks into bands, i.e. more than one pixel high. Then it is the job of the serializer to store the additional scan lines and play them out when the time comes.
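The serializer's job for bands can be sketched as buffering the extra scan lines and playing each out when its scan line comes around (a sketch under assumed names, not a real design):

```python
# Sketch: a serializer for bands (chunks more than one pixel high).
# It buffers the additional scan lines of each band and plays them
# out when their turn comes. Hypothetical names.

from collections import deque

class BandSerializer:
    def __init__(self, band_height):
        self.band_height = band_height
        self.pending = [deque() for _ in range(band_height)]

    def accept(self, band_chunk):
        """band_chunk: a list of band_height scan-line chunks arriving
        together from the array."""
        for row, chunk in enumerate(band_chunk):
            self.pending[row].append(chunk)

    def play_scanline(self, row_in_band):
        """Emit the buffered chunks for one scan line of the band."""
        out = list(self.pending[row_in_band])
        self.pending[row_in_band].clear()
        return out
```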
For a line object we need a single point in addition to the generic information which every object must have. For a line parallel to an axis we merely need a bit specifying which axis, plus a number giving the length of the line. This is for single-pixel lines; lines with width are really rectangles.
CreateSimpleLine: PROC [origin: Point, vertical: BOOL, length: Number];
There are several cases for a simple line.
IF vertical THEN paint single pixel
ELSE {
  IF line fills chunk THEN jam all bits on
  ELSE -- horizontal line terminates inside chunk -- paint from left border to right border
};
Of course all this case analysis is done by the host. For vertical lines the machine just jams the proper pixel on when the right chunk is going through. Horizontal lines are more complicated.
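The host's case analysis for a horizontal line might look like the following sketch, which splits the line into per-chunk actions; the chunk model and all names are assumptions, not anything from the actual design:

```python
# Sketch: host-side case analysis for a single-pixel horizontal line,
# deciding per chunk whether to jam the whole chunk on or paint only
# a partial span. Hypothetical names.

def chunk_actions(x0, length, pixels_per_chunk):
    """Yield (chunk_index, action) for a horizontal line covering
    pixels [x0, x0 + length)."""
    x1 = x0 + length
    first_chunk = x0 // pixels_per_chunk
    last_chunk = (x1 - 1) // pixels_per_chunk
    for c in range(first_chunk, last_chunk + 1):
        lo = c * pixels_per_chunk
        hi = lo + pixels_per_chunk
        if x0 <= lo and x1 >= hi:
            yield c, "jam all bits on"
        else:
            left = max(x0, lo) - lo
            right = min(x1, hi) - lo
            yield c, f"paint pixels {left}..{right - 1}"

# A line from pixel 5 through pixel 21 with 8-pixel chunks spans a
# partial chunk, a full chunk, and another partial chunk:
print(list(chunk_actions(5, 17, 8)))
```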
3D rasterization - basically the pixel description must give the z coordinate of the screen in the virtual scene. The eye is assumed to be a point at some fixed distance from the screen. The objects get a third coordinate, and their contribution to pixels depends on the pyramid constructed from the pixel and the eye point. The simple implementation of this still sends all the pixels through all the objects. The complicated method clips the objects to those which are at least partially visible in the vision pyramid. A simplified version of this that uses a square solid is the solution to the 2.5D zooming problem.
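One way to read the pyramid construction, as a sketch: put the screen at z = 0, the eye at an assumed distance d in front of it, and the scene at z > 0; an object point contributes to a pixel when its projection through the eye lands within that pixel. All names and the coordinate convention are assumptions:

```python
# Sketch: the vision pyramid for one pixel. Eye at (0, 0, -d),
# screen plane at z = 0, scene at z > 0. Hypothetical names.

def screen_hit(eye_d, ox, oy, oz):
    """Project object point (ox, oy, oz) through the eye onto the
    screen plane; similar triangles give the scale factor."""
    t = eye_d / (eye_d + oz)
    return (ox * t, oy * t)

def contributes(eye_d, px, py, ox, oy, oz, half_pixel=0.5):
    """Does the object point fall inside the pyramid of pixel
    (px, py), taken as a half_pixel-wide square on the screen?"""
    sx, sy = screen_hit(eye_d, ox, oy, oz)
    return abs(sx - px) <= half_pixel and abs(sy - py) <= half_pixel

# A point twice the eye distance behind the screen projects to half
# its lateral offset:
print(screen_hit(10, 5, 0, 10))   # -> (2.5, 0.0)
```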
Ray tracing - a model that uses the same architecture but which flows quanta of light through the objects can implement ray tracing. In its full glory this requires a 3D interconnect so that objects can be sorted in three dimensions instead of the ersatz one dimension of 2.5D rasterization. The objects hold onto their illumination so that they can respond properly when display pixels, instead of light quanta, make their way through the object. The light quanta have their direction encoded, in addition to their value, instead of the pixel position.
Think about pagination. Parallel between fixed points of forced page breaks, page number filled in after chapters are paginated.
Need statistical analysis of display and printer images to tune this thing.
Build prototype for existing display and printer.
The maximum width of a machine will finally limit the bandwidth producible. If the width is limited to one chip then the connectivity is very small. Could mount in SIMMs like memories.
If each pixel-map pixel bit has about 64 bits of storage, then about 2^14 of these is adequate for a 1Kx1K display, since 2^10*2^10 = 2^20 = 2^6*2^14. If they are packed 2^6 per chip, then 2^(14-6) = 2^8 gives 256 chips, or 32 SIMMs. Must ratio memory to processing depending on the number of pixels in the image and the rate of image production, and also on whether a frame buffer sits between the output device and the array. For a printed page: (8.5*11*600*600*8)/4M => 67.32, or about 64 chips with a little clipping around the edges.
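The sizing arithmetic above, checked as a sketch. The 8-chips-per-SIMM grouping is assumed from the 256-chips/32-SIMMs figures, and the "4M" divisor is read as 4*10^6, which is what reproduces the note's 67.32:

```python
# Checking the sizing arithmetic from the note above.

# Display: one processor per 64 pixels, 2^6 processors per chip.
display_pixels = 2**10 * 2**10        # 1K x 1K
processors = display_pixels // 2**6   # 2^14 processors
chips = processors // 2**6            # 256 chips
simms = chips // 8                    # 32 SIMMs (8 chips/SIMM assumed)

# Printer: 8.5 x 11 inch page at 600 x 600 spots per inch, times 8,
# over "4M" read as 4 * 10^6.
printer_chips = (8.5 * 11 * 600 * 600 * 8) / 4e6

print(chips, simms, round(printer_chips, 2))   # -> 256 32 67.32
```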
Shifting is cheap as long as you don't have to go too far. Put character in middle of chunk, shift left or right to align. Shift distance limited by pixelsPerChunk. May want to build shift network into array?
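A sketch of the put-it-in-the-middle trick: center the glyph bits in the chunk, so the alignment shift is bounded by half of pixelsPerChunk in either direction (all names hypothetical):

```python
# Sketch: center a character's pixels in a chunk, then shift left or
# right to align, bounding the shift distance by half the chunk
# width. Hypothetical names.

def align(char_bits, char_width, pixels_per_chunk, target_offset):
    """char_bits: one glyph row as an integer, bit 0 = leftmost
    pixel. Returns (chunk, shift): the chunk with the glyph starting
    at target_offset, and the signed shift that was applied."""
    center = (pixels_per_chunk - char_width) // 2
    shift = target_offset - center        # bounded by about +-ppc/2
    chunk = char_bits << center
    chunk = chunk << shift if shift >= 0 else chunk >> -shift
    return chunk & ((1 << pixels_per_chunk) - 1), shift

# A 4-pixel glyph centered in a 16-pixel chunk needs a shift of
# only -3 to land at offset 3:
print(align(0b1111, 4, 16, 3))
```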
Two von Neumann machines per printer. The first is a guard to see whether the raster generator can keep up with the page in one pass; it also does all the network traffic, queue management, etc. Images are spooled with their print time, or at least with an exception flag saying multiple raster passes are required. The second von Neumann machine spools images off disk, loads the array and frame buffers, and schedules the paper path. A real-time machine.
Think about serial methods. Transmit only the fast address every cycle, the slow address during horizontal retrace, with a control line every cycle that says fast or slow. Possibly transmit pixel data over the same lines as address data. Keep addresses stored in RAM, indexed by slow address? The ball to keep your eye on is minimizing the area/bandwidth ratio.
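The fast/slow address scheme can be sketched as a one-bit-tagged serial stream; the framing here is an assumption, just enough to show that the full address is recoverable:

```python
# Sketch of the serial addressing idea: one control bit per cycle
# says whether the lines carry the fast address (every cycle) or the
# slow address (sent during horizontal retrace). Hypothetical framing.

def encode_scanline(slow_addr, fast_addrs):
    """Yield (is_slow, value) cycles: the slow address once during
    retrace, then one fast address per pixel-clock cycle."""
    yield (True, slow_addr)
    for fa in fast_addrs:
        yield (False, fa)

def decode(cycles):
    """Recover full (slow, fast) address pairs from the stream."""
    slow = None
    out = []
    for is_slow, value in cycles:
        if is_slow:
            slow = value          # latch during retrace
        else:
            out.append((slow, value))
    return out

print(decode(encode_scanline(7, [0, 1, 2])))   # -> [(7, 0), (7, 1), (7, 2)]
```

The area cost is one latch for the slow address plus the one control line, which is the area/bandwidth tradeoff the note is after.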