Performance Issues
This chapter discusses the following issues:
(1) How rapidly will Dorado be able to execute Mesa, Lisp, Smalltalk, etc. macroprograms?
(2) What relationship do some of the design parameters bear to performance?
Cycle Time
The first issue is cycle time. Dorado was designed for a 50 ns cycle time; the first three prototypes used stitchweld technology for interconnections and operated correctly at 55 ns cycle time; however, subsequent machines are being built using multiwire technology and will not operate faster than about 60 ns cycle time. The baseboard at present initializes the clock period to 64 ns for all machines during a boot, although there is some indication that design changes made recently and repair of a few lingering slow path problems would permit 5 to 10 ns faster operation.
With respect to achievable cycle time, the two important differences between stitchweld and multiwire technology are these: stitchweld uses point-to-point wiring and has a wire impedance of about 100 ohms (which is ideal), while multiwire uses Manhattan (square-corner) wiring with a wire impedance of about 50 ohms on the inner layer and 70 ohms on the outer layer, where most signals run; the longer wires and imperfect impedance matching result in slower operation.
Emulator Performance
Gene McDaniel’s measurements of the Alto Mesa compiler, adjusted to make them compatible with Pilot Mesa, are summarized below. Note that the compiler makes heavier use of short pointers than Pilot Mesa programs do; programs being developed now are heavily biased toward long pointers and would run more slowly than the execution rate below indicates. Average execution rate was about 5.6 cycles/opcode, excluding disk wait. XFER opcodes (i.e., subroutine call or return) account for about 6% of opcodes executed but consume about 38% of all cycles. If these are excluded, the remaining 94% average about 3.1 cycles/opcode; if jumps and conditional jumps are also excluded (about 14% of executions), the rest average 2.5 cycles/opcode. These times include all memory and IFU delays.
These excellent results indicate that there are no unusual delays due to problems with the memory or IFU and that the processor is completing most opcodes quickly. Since XFER opcodes take 34 (local) to 54 (external) cycles/opcode excluding memory delays, speeding up XFER, respecifying it, or reducing the number of XFERs executed seem to be the most promising ways of improving performance.
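As a rough consistency check, 38% of the 5.6-cycle average spread over 6% of opcodes works out to about (.38 x 5.6)/.06 ≈ 35 cycles per XFER, which squares with the 34-to-54-cycle range above.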
In the above results, instruction forwarding has saved an average of about .25 cycles/opcode or about 4% overall, in agreement with our expectations.
For the Smalltalk and Lisp instruction sets, performance is much worse than for Mesa (averaging over 30 cycles/opcode on Smalltalk-76). Careful studies should be made to understand the reasons for this fully, but one reason is that the 16-bit word size is a serious limitation: long storage pointers are used extensively, so execution would be substantially faster on a machine with, say, 32-bit data paths.
IFU Not-Ready Wait
For the Mesa compiler, 19.5% of all cycles were in IFU not-ready wait; 16% due to incorrectly predicted jumps, 2.5% to cache miss wait, and 1% to other causes. The 16% due to incorrectly predicted jumps might be improved.
The Mesa microcode presently predicts that all conditional jumps will not jump; because of the overhead of restarting the IFU an extra time, it is desirable to predict not-jump unless more than 75% of executions jump. The prediction is wrong (i.e., a jump occurs) about 40% of the time, so it seems that the microcode is doing the best it can.
However, some loops ("while J ne 0 do," for example) are compiled as a normally-false conditional jump at the beginning of the loop and an unconditional jump from the end of the loop back to the beginning; a faster sequence is a normally-true conditional jump at the end of the loop, which eliminates the unconditional jump altogether. The general objectives in changing the compiler would be as follows:
(1) Eliminate unnecessary jumps and conditional jumps;
(2) Make the jump/not-jump execution of conditional jumps as predictable as possible;
(3) Make the not-jump path the most likely one, unless this conflicts with objective (1).
Microstore Requirements
Speed is not the only issue—some reduction in microstore requirements might be possible through design changes. Space requirements for a 1981 release of the Alto/Mesa emulator system were as follows:
Table 29: Utilization of the Microstore

Mesa basic opcode set           2024₈
Cedar allocator & collector
Floating point
Alto opcode set                 1163₈
Alto BCPL Runtime
BitBlt subroutine
Fault handling
Ethernet driver
Disk driver
Display driver
Junk io driver
LoadRam
Initialization
Total                           7673₈, leaving 105₈ free locations
Since we do not require that more than two emulators be loaded in the microstore at one time, there is presently a little space left for extensions. MicroD is able to utilize well over 99% of the available microstore.
Two further performance issues are cache efficiency and miss wait, and available io bandwidth and io task cycle consumption. These are discussed in the sections below.
Cache Efficiency and Miss Wait
The value of shortening the wait for a storage read is roughly proportional to miss likelihood. Suppose that the prototypical opcode was a one-byte opcode implemented by the following microcode:
Fetch←Id, StkP+1;
Stack←Md, IFUJump[0];
For this example, execution time on a hit is 2 cycles; on a miss, 28 cycles. Delay for IFU misses must be added to this. Since the IFU is 6 bytes ahead of the current opcode, its misses delay 28 cycles less the execution time of the preceding 6 bytes; if any of those 6 bytes itself causes a miss, the IFU delay will be 0 because the IFU catches up during that miss; and the IFU never suffers two misses (in this example) because it crosses at most one munch boundary. Hence, with hit rate H, and with the preceding 6 bytes taking 12 cycles when all of them hit (probability H^6), execution time will be 2 + 26*(1−H) + (28−12)*H^6*(1−H) cycles, with the following results:
Table 30: Execution Time vs. Cache Efficiency
Hit %   Execution Cycles   IFU Cycles   % Miss Wait
100     2.00               .00           0
99      2.26               .15          17
98      2.52               .28          29
96      3.04               .50          44
94      3.56               .67          53
92      4.08               .79          59
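For reference, the table can be regenerated from the formula above; the following is a minimal sketch (in Python, not part of the original manual) that reproduces it to within about .01 cycle of rounding:

for hit_pct in (100, 99, 98, 96, 94, 92):
    H = hit_pct / 100.0
    exec_cycles = 2 + 26*(1 - H)              # the opcode itself: 2 cycles on a hit, 28 on a miss
    ifu_cycles = (28 - 12) * H**6 * (1 - H)   # residual IFU miss delay; zero if any of the 6 preceding bytes missed
    miss_wait = (exec_cycles - 2 + ifu_cycles) / (exec_cycles + ifu_cycles)
    print(hit_pct, round(exec_cycles, 2), round(ifu_cycles, 2), round(100 * miss_wait))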
This crude analysis shows the importance of cache efficiency in determining system performance. Fortunately, measurements made by Doug Clark and Gene McDaniel indicated the following surprisingly high cache hit statistics:
Overall cache hit rate on three Mesa programs was 99.2% to 99.8%.
4.9% to 8.1% of all cycles were held.
10% to 19% of references were Store←’s, the rest fetches.
16% to 66% of misses had dirty victims, which cause additional cycles to be held while the cache address section is busy.
Another measurement showed a 99.7% hit rate for IFU references.
The processor obtains a word from the cache in 16% of all cycles and the IFU in 32% of all cycles; the processor actually shuts out the IFU by making its own reference about 20% of the time.
Provision has been made to expand the Dorado cache to 16k words, when 4k x 1 MECL RAM’s are economically available, but the existing cache is so efficient that this may never be necessary.
Performance Degradation Due to IO Tasks
To a first approximation, only the display controller word task (DWT) uses enough storage bandwidth to interfere significantly with emulators. Since it uses the fast io system, DWT requires service once/munch and will require two instructions/wakeup in the ordinary case. In addition, if the next instruction (by another task) issues a memory reference, it will always be held one cycle while the DWT’s IOFetch← advances ASRN.
A quick calculation shows that at an io bandwidth of 256 x 10^6 bits/sec (10^6 munches/sec) the display controller will use 48% of storage bandwidth and 12% of processor cycles at 60 ns/cycle.
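In more detail: a 16-word munch is 256 bits, so 256 x 10^6 bits/sec is 10^6 munches/sec; two instructions per wakeup then cost 2 x 10^6 instructions/sec out of roughly 16.7 x 10^6 cycles/sec at 60 ns/cycle, or about 12% of processor cycles, while 256 x 10^6 of the 533 x 10^6 bits/sec of storage bandwidth available with 16-word munches is 48%.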
The earlier example showed that with no io interference and a 99% hit rate, the emulator spent 17% of cycles in miss wait, 83% in useful execution. With a 256 x 10^6 bit/sec display active, emulator misses are slowed about 2 cycles each, so the overall effect of the display would be that about 72% of all cycles are emulator executions, 12% display task executions, and 16% hold; the one-cycle holds for IOFetch← would make performance somewhat worse than this.
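Roughly: the display task takes about 12% of all cycles, leaving about 88% for the emulator; slowing each miss by about 2 cycles raises the emulator’s miss wait from about 17% to roughly 18% of its own cycles, which is about 16% of all cycles and leaves about 72% for useful emulator execution.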
An IOFetch← by the display task to the same cache row as an emulator miss will remain in the address section, increasing display task latency and requiring more buffering. However, this won’t degrade emulator performance.
The Alto monitor only uses 14.7 x 10^6 bits/sec (1/17 of the above) and would not interfere appreciably with emulators.
The disk controller is the fastest "slow" io device among standard peripherals. When running, its word interrupt task reads a double word from the cache every 3.2 µs in a 3 instruction/interrupt inner loop, consuming about 5.6% of all cycles at 60 ns/cycle. Its memory references consume the cache at a rate of .04 munches/µs, low enough that storage interference with the emulator isn’t significant. However, a 256-word disk transfer displaces about 1/16 of the cache entries, so the emulator may experience a lower hit rate.
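As a check: three instructions every 3.2 µs is (3 x 60 ns)/3.2 µs ≈ 5.6% of cycles, and a double word every 3.2 µs is one 16-word munch every 25.6 µs, or about .04 munches/µs.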
Cache and Storage Geometry
The current geometry was chosen without measurements or simulation of programs, but measurements made since then have indicated a surprisingly good cache performance, so not much could be gained through changes.
The following parameters are relevant:
1 word as the unit of storage inside the memory pipeline;
16-word munch;
256 munches in the cache (expandable to 1024);
4 columns in the cache.
Munch Size
A 16-word munch size was chosen primarily because 8 cycles for transport balances 10 cycles for storage access, avoiding loss of bandwidth. The use of 256x4 RAM’s to implement the cache address section allows the original 4k-word cache (implemented with 1kx1 RAM’s) to be expanded to 8k words or 16k words, when 4kx1 RAM’s are economically available—this is possible because only 64 of the 256 words in the address section are being used with the 4k-word cache. Miss wait is about 28 cycles and storage bandwidth about 533 x 10^6 bits/sec with 16-word munches.
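(A 16-word munch is 16 x 16 = 256 bits delivered in 8 transport cycles, i.e. 256 bits/480 ns ≈ 533 x 10^6 bits/sec at 60 ns/cycle.)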
8-word munches would lower the storage bandwidth to about 262 x 10^6 bits/sec, probably unacceptable. Also, 8-word munches would limit cache expansion to 8k words. However, miss wait would be reduced to about 24 cycles because transport would require only 4 cycles. 32-word munches would not allow greater storage bandwidth to fast io devices because bandwidth is already limited by transport with 16-word munches. Nor would they allow expansion to a larger cache data section, because we have no way to build a data section larger than 16k words. Also, miss wait would grow to 36 cycles, so this munch size does not seem attractive.
For a given size of the cache data section, with smaller munches the cache will tend to stabilize with a larger amount of useful information; however, when a program is changing contexts, larger munches might bring the new context into the cache more quickly. Also, fast io tasks will interfere less with the emulator with larger munches because fewer wakeups and IOFetch←’es will be required. However, the extra buffering and longer miss wait offset this advantage somewhat.
Considered together, these factors suggest that the 16-word munch we are using is substantially better than either 8 or 32-word munches.
Data Path Width
Having only 16 bit wide data paths slows misses. Doubling the paths to 32 bits would reduce EC time by 1 cycle and transport time into the cache by 4 cycles (i.e., delay on misses would be 23 cycles instead of 28). There were not enough edge pins to do this. However, if a method of doubling the path width were found, the storage system would probably be arranged as two modules of four storage boards each rather than four modules of two boards each, and 32-word munches might be better than 16-word munches.
Cache Columns
The reason for multiple columns is to approximate LRU reloading; the columns are moderately expensive because separate hit logic has to be provided for each one, and the V-NV (victim/next-victim) logic also costs a few ic’s with more than two columns. Altogether the current 64x4 cache is about 40 ic’s larger than a 128x2 cache (because of its 50-50 LRU behavior on the fourth column, our cache is somewhere between the 64x4 and 128x2 or 128x3 caches below). The table below shows the likelihood that the Nth LRU munch is no longer in the cache for various geometries:
Table 31: Cache Geometry vs. LRU Behavior
N      32x4   64x2   128x2  32x3   64x3   128x3  64x4   128x4
4      .000   .001   .000   .000   .000   .000   .000   .000
8      .000   .006   .002   .002   .000   .000   .000   .000
16     .001   .025   .007   .013   .002   .000   .000   .000
32     .017   .089   .026   .077   .014   .002   .002   .000
64     .140   .264   .090   .323   .079   .014   .018   .002
128    .570   .596   .264   .767   .323   .080   .141   .019
256    .960   .910   .595   .987   .764   .323   .568   .142
512                                       .763   .959   .567
These numbers are computed from a binomial distribution using the following formulae:
let R = rows in the cache
let C = columns in the cache
then p = (R−1)/R = probability that a more recently referenced munch of VA does not fall in the given munch’s row
and q = 1/R = probability that it does fall in that row
then the probability of a miss for the nth most recently used munch is:
C   P(miss)
1   1 − p^n
2   1 − p^n − nqp^(n−1)
3   1 − p^n − nqp^(n−1) − n(n−1)q^2p^(n−2)/2!
4   1 − p^n − nqp^(n−1) − n(n−1)q^2p^(n−2)/2! − n(n−1)(n−2)q^3p^(n−3)/3!
etc.
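The same calculation can be written out directly; the following is a minimal sketch (in Python, not part of the original manual; p_miss is an illustrative name) that reproduces, for example, the 64x4 column of Table 31:

from math import comb

def p_miss(rows, cols, n):
    # Probability that the nth most recently used munch has been displaced,
    # i.e. that at least `cols` of the n more recent munches mapped into its row.
    q = 1.0 / rows          # a more recent munch falls into the given row ...
    p = 1.0 - q             # ... or into some other row
    still_cached = sum(comb(n, k) * q**k * p**(n - k) for k in range(cols))
    return 1.0 - still_cached

for n in (4, 8, 16, 32, 64, 128, 256, 512):
    print(n, round(p_miss(64, 4, n), 3))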
Without extensive measurements on programs, it is impossible to know how much better, say, a 32x4 cache is than a 64x2 cache, or whether a 128x2 cache is better or worse than a 32x4 cache. If a particular program confines itself to a very small set of munches, then more closely approximating LRU reloading is most important. However, if the likelihood of reference flattens out after a small N, then it won’t matter much that LRU reloading isn’t very well approximated—the total size of the cache will be a more important determinant of performance.