SoftHdwNotes.tioga
Statistics
Technology mapping
SoftHdwCoreA, 1203, Testarossa DecompCtlLogic, dies due to PCNand2, PCInvCK, IOClkBuf, dLatch
SoftHdwCoreB, RSD InnerControlSC, works
SoftHdwCoreC, 180, RSD MultControlSC, is a joke
SoftHdwCoreD, Memory Controller FifoInputControl, works
SoftHdwCoreE, Memory Controller MCControlLogic, works
SoftHdwCoreF, 248, SIC Control, dies due to TLatch
SoftHdwCoreG, MapCacheSC, works
SoftHdwCoreH, ArbInFrame, works
SoftHdwCoreI, IOBInnerSC, works
SoftHdwCoreJ, Display Controller VCKBlockFloorPlan, works
SoftHdwCoreK, Display Controller DCKBlockFloorPlan, works
D, F, G, H, I, and K have multiple clocks, so A, B, C, E, and J do not. Since A does not work and C is a joke, this leaves B, E, and J as my best examples.
Numbers
This is with AddChannels[&place, 2].
             Logic                                     Physical
Name   Size   Critical   Completes   Incompletes   Size     Critical
B       208      8          109           1        21, 19      68
D      1259
E      1555     10          789           5        72, 54     173
G      2681
H      4120     21
I      3919
J      1350      9         1664           5        72, 64     212
K      2504
Communication
Examples
garbage collection
encryption (Ralph)
checksum
regular expressions on strings
Abstractions
architectural block diagrams a la i860
limits, how many registers, alu's, etc.
how fast
system performance issues
real sustainable bandwidth from Sun VME card
Soft Testarossa
space and time language for laying out datapath, no automatic binding of operations to operators, interdigitate 10 bit paths.
Random Thoughts
Write an abstract position paper that talks about the systems aspects of using a soft hardware board on a Sun. Talk about swapping virtual machines to disk, and about connection across the backplane bus to RAM via DMA. Applications such as ATMS, binary image manipulation for scanning and printing, encryption, and compression. Reduction of algorithms by a factor of n. Point out that the rates achievable now for those applications may not make sense commercially because of other weak links in the system, but will make sense as those other links evolve.
Perhaps a new RAM tile that uses the same address bits for read and write and removes the read enable. The read data is always the addressed location. Then some other method for ganging the tiles vertically must be invented. Also think about using long lines for read/write data.
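A minimal C model of the proposed tile, as a sketch only: the 16-word depth matches the current 16x4 tile, and the names and the same-cycle write-through behavior are assumptions, not part of the note above.

    #include <stdbool.h>

    #define TILE_WORDS 16   /* assumed: same depth as the current 16x4 tile */

    typedef struct { bool bits[TILE_WORDS]; } RamTile;

    /* One cycle of the proposed tile: a single address serves both read
       and write, there is no read enable, and the read output is always
       the addressed location.  Whether a write is visible on the read
       output in the same cycle is an assumption of this model. */
    bool RamTileCycle(RamTile *t, unsigned addr, bool writeEnable, bool writeData)
    {
        addr %= TILE_WORDS;
        if (writeEnable) t->bits[addr] = writeData;
        return t->bits[addr];
    }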
Consider doubling the number of LeftUp and RightDown lines. Look at programmings to determine the interconnect corner to straight ratio. If it is low then adding only straight lines helps, otherwise both corners and straight lines are needed.
People could be working on:
Circuit design of substrate, including characterization for timing analyzer
Layout of substrate
Assembly language tools (parser, simulator, timing analyzer)
Higher level language (parameterized tilings as in SoftHdw25D but graphics)
Trapezoid machines
Text machines
Coupling Imager to SoftHardware
Board level architecture
Work is divided into:
Assembly language abstraction
Implementing the abstraction
Tools for evaluating space and time of a program in terms of the abstraction
Applications written in terms of the abstraction
August 30, 1989
The flop needs to be moved to Output for several reasons:
it is difficult to extract Q without wasting a lot of resources
the typical complex flop requires logic in the feedback, not in the output
the technology mapper has to replicate the flop at the multiple inputs instead of moving it back to the primitive driving D. This makes it less likely that we can reduce the number of flops from the current flop per grain to, say, two flops per minor array (much like Xilinx). It also introduces value ambiguity, since skew could cause the multiple flops to clock in different values, especially when synchronizing inputs.
The RAM tile should probably be changed to cover two minor arrays and to be 64x1 rather than 16x4 since it takes a minor array per bit of data path to accomplish anything.
August 31, 1989
Consider designing a chip according to the following criteria:
the major limitation is interchip communication, anything that slows it down is bad
the communication needs of applications are so unpredictable that it must be possible to build up arbitrary interconnect paths with equal ease
Which can lead to:
build a chip with input/output programming along the edges and a full crossbar in the middle
build the logic elements into an array below the crossbar
But this is simply the current design plus the ability to connect the crossing long lines together in fewer gate delays.
September 1, 1989
Let us remove from our minds the unnatural dichotomy between control and data and instead think in terms of the underlying synchronous sequential networks. From this point of view traditional von Neumann machines provide but one highly restricted mapping of an abstract algorithm into the universe of space-time solutions. (How's that for a bunch of BS?)
If we retain the model of allowing semi-infinite time for the translation of the algorithm from its source form into an executable form, with the intention of minimizing the run time (that is, the aggregate time required to feed the executable form, along with the data, through the processor), what can we do with the fine-grain processor which programmable logic provides? What should be the architecture of the logic? How much should we assume about the application at the time we design the substrate, how much should we leave to the compilation of the executable form, and, finally, how much to the data? What does this continuum of binding times mean when we are using programmable logic?
Need to pick an interesting class of compute intensive problems and ask what we would like to have in the most ideal of all worlds, still considering the high cost of hardware binding times. Considering the architecture as it stands now, we would prefer applications which use booleans or small numbers to represent the necessary values, although it would be very interesting to consider a different architecture that requires 32 bit parallel adders, or even short or long floating point adders and multipliers.
September 6, 1989
Frequently we are willing to trade off time for space, slowing things down in order to deal with larger problems. In current architectures we do this by implementing specialized memory chips, which achieve much higher storage densities at the cost of bandwidth. How extreme is the current position? How minuscule is the overhead to achieve good bit density in current RAMs? If it is very small then perhaps current designs err on the side of reducing bandwidth without really increasing storage significantly. What is the smallest RAM in which the overhead of decoders and buffers is small compared to the bit storage? Replicating this smallest RAM will give maximum bandwidth while retaining small overhead.
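As a back-of-the-envelope sketch of the question, here is a C model of the overhead fraction of a 2^a x w RAM. The per-wordline and per-column area constants are invented placeholders, chosen only to show the shape of the curve, not measured numbers.

    #include <stdio.h>

    /* Overhead fraction of a 2^addressBits x width RAM: decoder area
       assumed proportional to word lines, buffer/sense area assumed
       proportional to columns.  The constants 4.0 and 16.0 are made up. */
    double OverheadFraction(int addressBits, int width)
    {
        double words   = (double)(1L << addressBits);
        double bits    = words * width;              /* storage cells */
        double decoder = 4.0 * words;                /* per word line (assumed) */
        double buffers = 16.0 * width;               /* per column (assumed) */
        return (decoder + buffers) / (bits + decoder + buffers);
    }

    int main(void)
    {
        for (int a = 4; a <= 12; a += 2)
            printf("2^%d x 8 RAM: overhead = %4.1f%%\n",
                   a, 100.0 * OverheadFraction(a, 8));
        return 0;
    }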
Really need a classification of compute intensive applications, dividing by the basic operations which the applications are based upon. Also need to pick one which has local expertise and implement it.
September 11, 1989
Histogramming SoftHdwCoreJ, it appears that there are 3446 outputs, broken down as follows:
Sinks   Wires   Cumm %
  0       184      5
  1      2577     80
  2       232     87
  3        32     88
  4        64     90
  5        14     90
  6       186     95
  7        29     96
  8        41     97
  9         8     98
 10         5     98
 11         3     98
 12         2     98
 15         8     98
 16        43     99
>16        18    100
Most connections have 1 sink (~75%). Almost all have 16 or fewer sinks (~99%).
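For the record, the Cumm % column is just a running sum over the 3446 outputs; a small C sketch of the derivation, with the counts copied from the table above:

    #include <stdio.h>

    int main(void)
    {
        const char *sinks[] = {"0","1","2","3","4","5","6","7","8",
                               "9","10","11","12","15","16",">16"};
        int wires[] = {184, 2577, 232, 32, 64, 14, 186, 29,
                       41, 8, 5, 3, 2, 8, 43, 18};
        int total = 3446, running = 0;
        for (int i = 0; i < 16; i++) {
            running += wires[i];   /* cumulative wire count */
            printf("%3s sinks: %4d wires, cumulative %3.0f%%\n",
                   sinks[i], wires[i], 100.0 * running / total);
        }
        return 0;
    }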
October 26, 1989
Figure out how to put commercial RAM chips into an array that can be combined together to appear as a standard large memory or can be segmented so that independent processors can access chunks in parallel. Think of the machine as two planes, the processing plane and the memory plane.
Many of the thoughts I heard expressed in Dealer assume that one must pick a single processor and communication architecture and then fit applications into that architecture. I think you have to build a fluid machine that, at a very fine grain, adapts itself to the problem at hand.
Build a Scheme interpreter in which, when the program forks a process, a new processor with stack space and a memory connection is programmed into the logic. Combine this with at least one application specific processor, say encryption.
October 27, 1989
Think about taking standard sequential programs and then, as part of the compilation process, synthesizing the programming for an array of Xilinx chips. Perhaps not the main control flow but function boxes which happen to be useful to the program.
Report
"Programmable Logic: Architecture and Applications"
"Hardware Without Tears"
This report discusses a model of computation which is radically different from what has become known as the von Neumann architecture. It differs in the means of communication between logical elements and in the granularity of the logical elements available to the programmer. It exploits late binding to an extraordinary degree. Philosophically, it is closer to the Connection Machine [Hillis] than to a traditional von Neumann machine, although its sequencing mechanism is quite different from the Connection Machine's. It also has roots in VLIW architectures [?]. The programmable gate arrays of [Xilinx] provided the initial guide to this marriage of technology and architecture. The Xilinx literature views this marriage as between technology and traditional logic design. In this report we expand this view to encompass descriptions closer to parallel programming languages, as well as logic design.
The report proceeds as follows. Chapter 2 discusses architectural philosophy. In it we look at history, seeking a philosophical basis for the architecture. We study the primitive elements which have been useful for hardware implementation. We study applications, and the underlying algorithms which have been used to implement them. Finally we look at the underlying technology, seeking understanding of the computational structures which it implements well.
Chapter 3 proposes a specific architecture. It describes the programming abstraction, similar to the principles of operation of a von Neumann machine. We do not forsake the von Neumann architecture completely: our architecture is embedded in an existing commercial workstation. We give programming examples which illustrate the implementation of traditional logic structures.
Chapter 4 describes the utilization of this architecture as a logic emulation machine. We lay out the design automation tools required and the interface of the architecture to existing components. A commercial realization of this architecture using the Xilinx Programmable Gate Arrays is already available [QuickTurn]. We contrast the range of applications which the two designs can handle.
Chapter 5 turns to 2.5D rasterization as an application. Here we study the rendering of planar images with overlap. We talk about system architectures which embed programmable logic as a rasterization engine. The rendering of different types of graphical objects, such as rectangles and character glyphs, is discussed. High quality raster print engines and interactive display devices are the major subareas; each is given a section.
Finally, chapter 6 takes a critical look at the architecture, refining the boundary separating applications which match this architecture from those that do not. We also wave our hands at a number of applications which seem particularly well suited to the architecture.
Programming Notes
4.6 Bit Serial Processing
4.7 Multiport RAM
4.8 CAM
Rasterization Notes
4.3.4 Text
Text can be thought of in several ways. One is simply as a sequence of (position, character index) pairs. At the end of a pipeline segment in which there are no z dependencies, the character indices are transformed into bitmaps which carry transparency information. The next stage shifts this information into proper alignment with the chunk. The next pipeline stage converts the current chunk, the transparency information, and the pixel value of the ink writing the characters into the next chunk. This model can be substantially simplified if bitsPerPixel is one.
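A minimal C sketch of that last stage, assuming a 32-pixel chunk at bitsPerPixel of one; the Chunk type, the mask polarity (a 1 bit meaning transparent), and the replicated ink bit are all assumptions of this sketch.

    #include <stdint.h>

    typedef uint32_t Chunk;   /* 32 pixels, 1 bit per pixel (assumption) */

    /* Combine the incoming chunk, the character's transparency mask,
       and the ink value into the outgoing chunk.  Where the mask is
       transparent, keep the background; elsewhere take the ink. */
    Chunk TextStage(Chunk current, Chunk transparency, int inkBit)
    {
        Chunk ink = inkBit ? ~(Chunk)0 : 0;   /* replicate ink across the chunk */
        return (current & transparency) | (ink & ~transparency);
    }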
Another way to think of text is as a linear sequence of character indices plus a formatting algorithm. For a fixed pitch font with fixed size spaces this is particularly simple.
4.3.5 Serializer
talk about different clocks, and the implications for this block of other chunk form factors, e.g. if text is dominant use character-high chunks to reduce memory cycles
4.4 Other Object Types
Pure speculation. Video, splines, Interpress operator set
It is possible that the existing Xilinx chips, combined with standard commercial RAMs, are adequate for this task. One problem with them is the long time it takes to reprogram them through their serial port. The Xilinx design handbook indicates that the load time varies between 12 ms and 24 ms, almost an entire frame time for a display output device. So you could use Xilinx chips for prototyping by glitching the display on every change, or do it glitch-free by ping-ponging completely replicated hardware. Another problem is the limited amount of memory within them, which makes a very deeply pipelined design quite difficult. These chips also waste a tremendous amount of space on interconnect options.
Look at Spreitzer's encoding of the screen.
Think about the relationship between the width of a text object and the size of the alignment network to put the pixels from font storage into the chunk.
The object types should be parameterized by the number of pixels they deal with at once. This allows the bandwidth to be scaled without rewriting the object types.
Of course the position of the object and the position of the window can, conceptually at least, be bound when the object is generated.
For now let's assume that the object and window are rectangles. They are obviously rectilinear since this machine is working at the pixel level.
We must decide where clipping things to the screen will occur. This determines the range necessary for specifying coordinate positions. To within a chunk the position in which an object is emitted is determined by the clock cycle in which the object starts emitting its pixels. In this way objects which lie off the screen will either never be generated, because vertical sync starts the object over, or the bits for the object... I think the answer is that the host can clip objects if it needs to in order to not overflow the resources in the soft hardware. If it doesn't bother then the machines programmed into the array are smart enough to clip properly. This mostly means that they know how to start generating pixels someplace in the interior of an object, rather than always starting at the object origin.
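A sketch of that interior-start rule, with all names illustrative: the machine offsets its pixel generation by however far the window origin lies beyond the object origin.

    typedef struct { int x, y; } Point;

    /* Returns the offset into the object at which pixel generation
       starts.  If the object origin is inside the window the offset
       is zero and the object starts at its own origin. */
    Point StartOffset(Point objectOrigin, Point windowOrigin)
    {
        Point off;
        off.x = windowOrigin.x > objectOrigin.x ? windowOrigin.x - objectOrigin.x : 0;
        off.y = windowOrigin.y > objectOrigin.y ? windowOrigin.y - objectOrigin.y : 0;
        return off;
    }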
Another way to increase bandwidth is to make the chunks into bands, i.e. more than one pixel high. Then it is the job of the serializer to store the additional scan lines and play them out when the time comes.
For a line object we need a single point in addition to the generic information which every object must have. For a line parallel to an axis we merely need a bit specifying which axis and another number giving the length of the line. This is for single-pixel lines. Lines with width are really rectangles.
CreateSimpleLine: PROC [origin: Point, vertical: BOOL, length: Number];
There are several cases for a simple line:
IF vertical THEN paint a single pixel
ELSE {
  IF the line fills the chunk THEN jam all bits on
  ELSE (the horizontal line terminates inside the chunk) paint from the left border to the right border }
Of course all this case analysis is done by the host. For vertical lines the machine just jams the proper pixel on when the right chunk is going through. Horizontal lines are more complicated.
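A hedged C sketch of the horizontal case, per chunk: the 32-pixel chunk width, the bit ordering (bit 0 as the leftmost pixel), and all names are assumptions of the sketch, not the design.

    #include <stdint.h>

    typedef uint32_t Chunk;                  /* 32 one-bit pixels (assumption) */
    #define PIXELS_PER_CHUNK 32

    /* Paint the part of a single-pixel horizontal line [lineLeft..lineRight]
       that overlaps the chunk whose first pixel is at x = chunkLeft. */
    Chunk PaintHorizontal(Chunk in, int chunkLeft, int lineLeft, int lineRight)
    {
        int lo = lineLeft  - chunkLeft;      /* line extent within this chunk */
        int hi = lineRight - chunkLeft;
        if (hi < 0 || lo >= PIXELS_PER_CHUNK)
            return in;                       /* line misses this chunk */
        if (lo <= 0 && hi >= PIXELS_PER_CHUNK - 1)
            return ~(Chunk)0;                /* fill chunk: jam all bits on */
        if (lo < 0) lo = 0;                  /* line terminates inside: */
        if (hi > PIXELS_PER_CHUNK - 1) hi = PIXELS_PER_CHUNK - 1;
        int width = hi - lo + 1;             /* paint left border to right border */
        Chunk mask = (width < PIXELS_PER_CHUNK)
            ? (((Chunk)1 << width) - 1) << lo
            : ~(Chunk)0;
        return in | mask;
    }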
3D rasterization - basically the pixel description must give the z coordinate of the screen in the virtual scene. The eye is assumed to be a point at some fixed distance from the screen. The objects get a third coordinate, and their contribution to a pixel depends on the pyramid constructed from the pixel and the eye point. The simple implementation of this still sends all the pixels through all the objects. The complicated method clips the objects to those which are at least partially visible in the vision pyramid. A simplified version of this that uses a square solid is the solution to the 2.5D zooming problem.
Ray tracing - a model that uses the same architecture but which flows quanta of light through the objects can implement ray tracing. In its full glory this requires a 3D interconnect so that objects can be sorted in three dimensions instead of the ersatz one dimension of 2.5D rasterization. The objects hold onto their illumination so that they can respond properly when display pixels, instead of light quanta, make their way through the object. The light quanta have their direction encoded, in addition to their value, instead of the pixel position.
Think about pagination. Parallel between fixed points of forced page breaks, page number filled in after chapters are paginated.
Need statistical analysis of display and printer images to tune this thing.
Build prototype for existing display and printer.
The maximum width of a machine will finally limit the bandwidth producible. If the width is limited to one chip then the connectivity is very small. Could mount in SIMMs like memories.
If each pixel map pixel bit has about 64 bits of storage then about 2^14 of these is adequate for a 1Kx1K display, since 2^10*2^10 is 2^6*2^14. If they are packed 2^6 per chip then 14-6 => 8, or 256 chips, or 32 SIMMs. Must ratio memory to processing depending on the number of pixels in the image and the rate of image production, and also on whether a frame buffer sits between the output device and the array. (8.5*11*600*600*8)/4M => 67.32, or about 64 chips with a little clipping around the edges.
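The same arithmetic written out in C, with 4M taken as 4,000,000 bits (as the 67.32 figure implies) and 8 chips per SIMM assumed:

    #include <stdio.h>

    int main(void)
    {
        /* Display: 1K x 1K = 2^20 pixel bits; 2^6 pixels per element
           and 2^6 elements per chip give 2^12 bits per chip. */
        long displayBits = 1024L * 1024L;
        long chips = displayBits / (64L * 64L);          /* 256 chips */
        printf("display: %ld chips, %ld SIMMs\n", chips, chips / 8);

        /* Printer: 8.5 x 11 inch page at 600 dpi, 8 bits per pixel,
           4,000,000 bits of pixel storage per chip (assumed). */
        double pageBits = 8.5 * 11.0 * 600.0 * 600.0 * 8.0;
        printf("printer: %.2f chips\n", pageBits / 4.0e6); /* ~67.32 */
        return 0;
    }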
Shifting is cheap as long as you don't have to go too far. Put character in middle of chunk, shift left or right to align. Shift distance limited by pixelsPerChunk. May want to build shift network into array?
Two von Neumann machines per printer: the first is a guard to see if the raster generator can keep up with the page in one pass; it also does all the network and queue management, etc. Images are spooled with print time, or at least an exception flag saying multiple raster passes are required. The second von Neumann machine spools images off disk, loads the array and frame buffers, and schedules the paper path. A real-time machine.
Think about serial methods. Transmit only the fast address every cycle, slow address during horizontal retrace, control line every cycle that says fast or slow. Possibly transmit pixel data over same lines as address data. Keep addresses stored in RAM? Indexed by slow address? The ball to keep your eye on is to minimize the area/bandwidth ratio.
References to be acquired
Automata Networks in Computer Science: Theory and Applications, edited by F. Soulie, Y. Robert, and M. Tchuente. Princeton University Press, 1987.
Xilinx, ISSCC '87 or '88, somewhere around there.
Toffoli and ?, book on CAM-2
Browse ISSCC. Newman & Sproull. Byte about sprite generators. Rider about existing ESS architectures. Webster, esp. Sidney Marshall.
Look at SigGraph '80 about Pixel Planes.
Gershon Kedem@cs.duke.edu about ?