Error Handling
In addition to single-error correction and double-error detection on data from storage, Dorado also generates, stores, and checks parity for a number of internal memories and data paths. The general concepts on handling various kinds of detected failures are as follows:
(1) Failures of the processor or control sections should generally halt Dorado because these sections must be operational before any kind of error analysis or recovery firmware can be effective.
(2) Failures arising from memory and io sections should generally result in a fault task wakeup and be handled by firmware. In some situations, such as map parity errors, it is especially important to report errors this way rather than immediately halting because firmware/software may be able to bypass the hardware affected by the failure and continue normal operation until a convenient time for repair occurs. In other situations, the firmware may be able to diagnose the failure and leave more information for the hardware maintainers before halting.
(3) IFU section failures and memory section failures detected by the IFU should generally be buffered through to the affected IFUJump, then reported via a trap; in this way, if it is possible to recover from the failure, then it will be possible to restart the IFU at the next opcode and continue.
(4) Memories and data paths involving many parts should generally be parity checked. It is not obvious that this is always a good idea because extra parts in the parity logic will be an additional source of failures, but instantly detecting and localizing a failure seems preferable to continuing computation to an erroneous and undetected result.
(5) When Dorado halts due to a failure, information available on mufflers and in the 16 bits of passively available error status (ESTAT) should localize the cause of the error as precisely as possible.
Since the MECL-10K logic family has a fast 9-input parity ladder component, the hardware uses parity on 8-bit bytes in most places; there is usually insufficient time to compute parity over larger units. The exceptions are IM and MIR, which compute parity over the 17 bits of data in each half of an instruction, and the cache address section, which computes parity over the 15 address bits and the WP bit.
Odd parity is used throughout the machine, except that the cache address section and IFUM use even parity. Odd parity means that the number of ones in the data unit, including the parity bit, is odd when the data is good.
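As a concrete illustration (a minimal Python sketch, not Dorado hardware or microcode), the odd-parity convention on an 8-bit byte works as follows:

    def odd_parity_bit(byte: int) -> int:
        """Return the parity bit that makes the total number of ones odd."""
        return (bin(byte & 0xFF).count("1") + 1) % 2

    def check_odd_parity(byte: int, parity_bit: int) -> bool:
        """The data is presumed good when data plus parity bit contain an odd number of ones."""
        return (bin(byte & 0xFF).count("1") + parity_bit) % 2 == 1

    # 0b10110100 has four ones, so the stored parity bit is 1 (five ones in all).
    assert odd_parity_bit(0b10110100) == 1
    assert check_odd_parity(0b10110100, 1)
    assert not check_odd_parity(0b10110100 ^ 0b00000001, 1)   # any single-bit flip is detected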
The control processor (Midas or the baseboard microcomputer) independently enables various kinds of error-halt conditions by executing a manifold operation discussed in the "Dorado Debugging Interface" document. It also has to initialize RM, T, the cache address and data sections, the Map, and IFUM to have valid parity before trying to run programs. Reasons for this will be apparent from the discussion below.
When Dorado halts, error indicators in ESTAT indicate the primary reason for the halt, and muffler signals available to the control processor further define the halt condition; ESTAT also shows the halt-enables. Midas will automatically prettyprint a message describing the reasons for an error halt. The exact conditions that cause error halts are detailed in the sections below; the table here shows the ESTAT and muffler information which is relevant.
Table 27: Error-Related Signals
    ESTAT       ESTAT        Task
    Error Bit   Enable Bit   Experiencing Halt       Related Muffler Signals and Meaning

    RAMPE       RAMPEen      Task2Bk                 STK, RM, or T parity failure.  RmPerr and TmPerr
                                                     mufflers on each processor board indicate which
                                                     byte of RM/STK or T had a parity failure;
                                                     StkSelSaved indicates that RmPerr applies to STK
                                                     rather than RM.

    MdPE        MdPEen       Task2Bk if immediate    Processor-detected Md parity failure.  The MdPerr
                             ←Md (←MDSaved false);   muffler on each processor board shows which byte
                             Task3Bk if deferred     of Md failed.
                             ←Md (←MDSaved true)

    IMrhPE      IMrhPEen     CTD                     Parity failure of IM[17:33].

    IMlhPE      IMlhPEen     CTD                     Parity failure of IM[0:16].

    IOBPE       IOBPEen      Task2Bk                 Pd←Input parity failure if IOBoutSaved false;
                                                     Output←B parity failure if IOBoutSaved true.
                                                     IOPerr mufflers on each processor board show
                                                     which byte failed.

    MemoryPE    MemoryPEen                           Cache address section parity failure; cache data
                                                     parity failure on write of a dirty victim or a
                                                     dirty Flush← hit; or fast input bus parity failure.
Processor Errors
The processor has parity ladders on each byte of the following:
    input to RM/STK     generate parity for write of RM/STK
    input to T          generate parity for write of T
    B                   generate parity for DBuf←B, MapBuf←B, Output←B, IM←B
    IOB                 check parity for Pd←Input and Output←B
    Md                  check parity for ←Md
    R                   check parity for ←RM/STK (unless bypassed from Pd or Md or replaced by ←Id)
    T                   check parity for ←T (unless bypassed from Pd or Md or replaced by ←Id)
Input ladders to RM/STK and T generate parity stored with data in the RAM; these ladders are not used for detecting errors.
The processor computes parity on its internal B bus (alub). The generated parity may be transmitted onto IOB when an Output←B function is executed; Store← references write B data and parity in the cache; parity for IM writes and map writes is computed from B parity. None of the other B destinations either check or store B parity. External B sources do not generate parity.
Parity on the R/T ladders is checked only when the R/T data path is sourced from the RAM, not when bypassing from Md or Pd is occurring, and not when R/T is sourced from Id. A detected failure causes the RAMPE error halt, which indicates that some byte of RM, STK, or T had bad parity. The muffler signals that further describe this error are in the PERR word: StkSelSaved is true if the source for R was STK, false if the source for R was RM; each processor board has RmPerr and TmPerr signals; RmPerr is true if the RM/STK byte on that board had bad parity, TmPerr if the T byte had bad parity. Note that if an instruction beginning at t0 suffered an error, Dorado halts immediately after t4; the muffler signals apply to the instruction starting at t0. The Task2Bk muffler signals show the task that executed the instruction at t0.
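The following sketch (Python, with invented names; it models the behavior just described rather than the actual hardware) illustrates the pattern: parity is generated on every write, checked only when the read data actually comes from the RAM, and ignored when the datum is bypassed from Pd or Md:

    def parity_of(byte: int) -> int:
        """Odd-parity bit for an 8-bit byte."""
        return (bin(byte & 0xFF).count("1") + 1) % 2

    class ParityCheckedRAM:
        def __init__(self, size: int) -> None:
            # Initialized with valid parity, as the control processor must do for RM, STK, and T.
            self.cells = [(0, parity_of(0))] * size

        def write(self, addr: int, byte: int) -> None:
            # Input ladder: generate the parity stored with the data; never used for detection.
            self.cells[addr] = (byte & 0xFF, parity_of(byte))

        def read(self, addr: int, bypass_value: int | None = None) -> int:
            if bypass_value is not None:
                return bypass_value        # bypass from Pd or Md: the output ladder is not consulted
            byte, stored = self.cells[addr]
            if (bin(byte).count("1") + stored) % 2 != 1:
                raise RuntimeError("parity failure on RAM read")   # analogue of the RAMPE halt
            return byte

    rm = ParityCheckedRAM(256)
    rm.write(3, 0o252)
    assert rm.read(3) == 0o252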
Md parity is checked whenever ←Md is done; a failure causes the MdPE error-halt when enabled. The ←MDSaved muffler signal in PERR is true when a deferred ←Md caused the error (T←Md, RM/STK←Md), false when an immediate ←Md (A←Md, B←Md, or ShMdxx) caused the error. On a deferred ←Md error, Dorado halts after t6 and Task3Bk shows the task that executed the instruction starting at t0; on an immediate ←Md, Dorado halts after t4, and Task2Bk shows the task. The MDPerr muffler signals on each processor board show which byte of Md was in error.
Io devices (optionally) compute and send odd parity with each byte of data; the processor checks parity when the Pd←Input function is executed, but not when the Pd←InputNoPE function is executed. When enabled, an IOBPE error halts the processor at t4 of the instruction that suffered the error; Task2Bk shows the task that executed the instruction. The processor also checks IOB parity on Output←B, and an error halts at t4 as for Pd←Input. The IOBoutSaved muffler signal distinguishes Pd←Input from Output←B errors; an IOPerr muffler signal on each processor board shows which byte of IOB was in error; all of these are in the PERR muffler word.
The processor generally does not pass parity at one stage through multiplexing to the next stage, so any failure in the multiplexing between one stage and the next will go undetected (exception: B parity passed through to IOB).
For example, the processor could write Md parity sent by the cache into the T RAM, when T is being written from Md. Instead, however, it checks Md parity independently, but then recomputes the parity written into T with the input ladder. Hence, a parity failure detected on a byte of T can only indicate a failure in either (1) the input parity ladder; (2) the output parity flipflop; (3) the output parity ladder; (4) one of three 16x4 T RAM’s; (5) one of two 4-bit latches clocked at t1 (Figure 3) through which the output of the T RAM passes; (6) one of two 4-bit latches clocked by preSHC’.
Parity is handled similarly for writes of RM/STK.
Parity is similarly recomputed on B.
The processor does not generate or check parity on the A, Mar, or Pd data paths. Any failures of the A, Mar, B, Pd, or shifter multiplexing or of the ALU go undetected; failures of Q, Cnt, RBase, MemBase, ALUFM, or branch conditions go undetected.
Remark
Since 256x4 and 16x4 RAM’s are used for RM, STK, and T, and since the processor is implemented with the high byte (0:7) on ProcH and the low byte (8:15) on ProcL, byte parity requires an additional 4-bit storage element on each board, of which only 1 bit is used. We could conceivably have used all 4 bits to implement a full error-correcting code for each byte of R and T data. However, there is insufficient time to correct the data. (Also, we use 256x1 RAM’s instead of 256x4 RAM’s for the RM and STK parity bits.)
Alternatively, parity could be computed over each 4-bit nibble rather than each 8-bit byte; the MC170 component allows nibble parity to be computed just as economically as byte parity. If this were done, then a parity failure would be isolated to a particular nibble. With byte parity, a detected failure could be any of 9+ components; with nibble parity, it would be isolated to one of 6+ components. Implementing nibble parity for RM/STK and T would require about 4 more ic’s per board than byte parity.
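A sketch of the nibble-parity alternative (illustrative Python only; the helper names are invented): one odd-parity bit per 4-bit nibble, so that a detected failure is localized to a nibble rather than a whole byte:

    def odd_parity(value: int, width: int) -> int:
        """Odd-parity bit over the low `width` bits of value."""
        return (bin(value & ((1 << width) - 1)).count("1") + 1) % 2

    def nibble_parities(byte: int) -> tuple[int, int]:
        """One odd-parity bit for the high nibble and one for the low nibble."""
        return odd_parity(byte >> 4, 4), odd_parity(byte & 0xF, 4)

    # A single-bit failure in the low nibble is flagged by the low-nibble check alone,
    # narrowing the search to the parts that carry that nibble.
    good, bad = 0b1011_0100, 0b1011_0101
    hi_p, lo_p = nibble_parities(good)
    assert odd_parity(bad >> 4, 4) == hi_p     # high nibble still checks
    assert odd_parity(bad & 0xF, 4) != lo_p    # the failing nibble is caught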
It is hard to say whether the additional precision of nibble parity would be worth the additional parts.
Control Section Errors
The control section stores parity with each 17-bit half of data in IM. When IM is written, the two byte-parity bits on B are xor’ed with the 17th data bit to compute the odd parity bit written into IM. It is possible to specify that bad (even) parity be written into IM, and this artifice is used to create breakpoints; bad parity in both halves of IM is assumed by Midas to be a deliberately set breakpoint.
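A sketch of this convention (illustrative Python; it computes the stored odd-parity bit directly rather than modeling the xor of the B byte parities with the 17th bit):

    def im_half_parity(data17: int) -> int:
        """Odd-parity bit stored with a 17-bit half of an IM word."""
        return (bin(data17 & 0x1FFFF).count("1") + 1) % 2

    def make_breakpoint(data17: int) -> tuple[int, int]:
        """Store the half with intentionally even (bad) parity, as Midas does for breakpoints."""
        return data17 & 0x1FFFF, im_half_parity(data17) ^ 1

    half = 0o123456                                   # an arbitrary 17-bit half-instruction
    data, bad_p = make_breakpoint(half)
    assert (bin(data).count("1") + bad_p) % 2 == 0    # even parity: read back as a breakpoint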
IM RAM output is loaded into MIR, and parity ladders on each 17-bit half give rise to error indicators that, when enabled, halt the processor after t2 of the instruction suffering an error. For testing purposes, halt-on-error can be independently enabled for each half of MIR. Both the unbuffered output of the MIR parity ladders and values buffered at t2 appear in ESTAT. The buffered values show the cause of an error halt, and the unbuffered signals allow Midas to detect parity errors in MIR before executing instructions or when displaying the contents of IM.
The special MIRDebug feature discussed in the "Dorado Debugging Interface" document prevents MIR from being loaded at t2 when MIR parity is bad. In other words, when the MIRDebug feature is being used, all of the t2 clocks in the machine will occur except the ones to MIR. This feature prevents the instruction that suffered an error from being overwritten at the expense of being unable to continue execution after the error. MIRDebug can be enabled/disabled by the control processor.
IFU Errors
The IFU never halts the processor; any errors it detects are buffered until an IFUJump transfers control to a trap location. The errors it detects, discussed in "IFU Section", are parity failures on bytes from the cache, IFUM parity failures, and map parity failures on IFU fetches.
Memory System Errors
There is no parity checking on Mar or on data in BR, so any failure in the address computation for a reference goes undetected. However, valid parity is stored with VA in the cache, and any failure detected will cause the MemoryPE error to occur, halting the system (if MemoryPE is enabled).
Parity is also stored in the Map (computed from B parity), and an error causes a fault task wakeup in most situations (exceptions: IFU references and Map← references do not wake up the fault task when a map parity error occurs).
The cache data section stores valid parity with each byte of data. When a munch is loaded from storage, the error corrector carries out single-error correction and double-error detection using the syndrome and recomputes parity on each 8-bit byte of data stored in the cache. When a word from B is Store←’d in the cache, byte parity on B is stored with the data.
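As an illustration of syndrome-based correction (a toy extended Hamming (8,4) code in Python, far smaller than the Dorado's 64+8 quadword code but working on the same principle), the syndrome locates a single flipped bit for correction while an overall parity bit distinguishes an uncorrectable double error:

    def encode(d: list[int]) -> list[int]:
        """[d1, d2, d3, d4] -> 8-bit codeword; index 0 is the overall parity bit."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p4 = d2 ^ d3 ^ d4
        word = [0, p1, p2, d1, p4, d2, d3, d4]     # positions 1..7 form a Hamming(7,4) code
        word[0] = sum(word) % 2                    # overall parity over the other seven bits
        return word

    def decode(w: list[int]) -> tuple[str, list[int]]:
        s1 = w[1] ^ w[3] ^ w[5] ^ w[7]
        s2 = w[2] ^ w[3] ^ w[6] ^ w[7]
        s4 = w[4] ^ w[5] ^ w[6] ^ w[7]
        syndrome = 4 * s4 + 2 * s2 + s1            # Hamming syndrome over positions 1..7
        overall_ok = sum(w) % 2 == 0
        if syndrome == 0 and overall_ok:
            status = "ok"
        elif not overall_ok:                       # overall parity broken: one correctable error
            w = w.copy()
            w[syndrome] ^= 1                       # syndrome 0 means the overall parity bit itself
            status = "corrected"
        else:                                      # syndrome set but overall parity checks out
            status = "double error"
        return status, [w[3], w[5], w[6], w[7]]

    cw = encode([1, 0, 1, 1])
    cw[5] ^= 1                                     # one failing bit is corrected
    assert decode(cw) == ("corrected", [1, 0, 1, 1])
    cw[6] ^= 1                                     # a second failure in the same word is detected
    assert decode(cw)[0] == "double error"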
A MemoryPE error occurs if, when storing a dirty victim back into storage, the memory system detects bad parity on data from the cache.
The IFU and processor also check parity of data from the cache, as discussed previously.
Sources of Failures
In a full 4-module storage configuration, Dorado will have 1173 MOS storage, about 700 Schottky-TTL, 3000 MECL-10K, and 60 MECL-3 DIPs, and about 1500 SIPs (7-resistor packages). This logic is connected with over 100,000 stitch-welded or multiwire connections to sockets into which the parts plug; logic boards connect to sidepanels through about 2500 edge pins. Sockets are used for all the RAM DIPs in the machine; other parts are soldered in. Given all these potential sources of failure, reliable operation has been a surprising achievement.
Initial debugging of new machines has been slow and difficult, requiring expertise not easily available in a production environment. In addition to mechanical assembly, board stuffing, and testing for shorts and opens both before and after stuffing, each machine has averaged about one man month of expert technician time to repair other malfunctions before it could be released to users.
Once released, the Dorados have been pretty reliable. During a 100-day period (6 October 1980 to 14 January 1981) the CSL technicians kept records of service calls made for approximately 15 Dorados in service at that time. The following summarizes the 43 service calls that were made.
    37 days        mean time between service calls per machine.
    45 days        mean time between failures (some service calls were for microcode or software problems).
    2.5 hours      average service time per machine per month.

    13% of failures and  5% of time    reseating logic boards in the chassis (connectors not making contact).
    11% of failures and 17% of time    open nets.
    13% of failures and 12% of time    repairing 16k MOS RAM failures (standard configuration was 2 modules).
    37% of failures and 28% of time    replacing other DIPs and SIPs.
     5% of failures and 10% of time    T80 problems.
    13% of failures and 11% of time    power supply failures.
     2% of failures and  2% of time    Terminal and display problems.
     4% of failures and 20% of time    repairing boards damaged during manufacturing or overheating.
The power supply failures were due to problems that have since been corrected, and most of the service calls for microcode or software problems would not happen in the more mature environment we have today. However, the other failures are believed to be representative. Note that none of the MOS RAM failures was the reason for a service call. These were found when testing a machine with diagnostics after a service call had been made for some other reason.
Error Correction
Reliability has been improved by error correction on storage. The Dorado error-correction unit of 64 data and 8 check bits (a quadword) guards the 1152 MOS storage RAMs against single failures, but almost no other parts on the storage boards or in the error corrector are guarded.
Our Alto experience suggests that some machines repeatedly fail under normal use due to undiagnosable failures. For this reason, error correction should be viewed as guarding not only against new failures but also against imperfect testing of parts that are either already bad or subject to noise (e.g., cosmic rays) or other kinds of intermittent failure. The latter may be more important in our environment.
The failure summary above indicates, for a small sample, that 16k MOS RAMs, accounting for 6% of all DIPs and SIPs (because the 15 Dorados had 2-module configurations, half the maximum), average about 4 times the failure rate of other parts and account for about 1.5 failures/year/Dorado; this would become 3 failures/year with a 4-module configuration. If we continue to do this well, a Dorado with error correction should run for years without uncorrectable MOS RAM failures. The manufacturer’s literature indicates that the dominant failure mode appears to be single-bit failures, with row and column addressing failures affecting many bits somewhat less frequent, but we don’t know the distribution of these.
If MOS failures do become significant, different strategies may be needed for single- and multi-address failure modes. With a multi-address failure, another failure in the same quadword causes a double error; but many single-address failures can occur in the same quadword without double errors.
The failure model used below shows that with no periodic testing and replacement of bad MOS RAMs, the fatal failure statistics of the 1152 RAMs would approximate those of a 108-RAM uncorrected store. By thoroughly testing storage and replacing bad parts at intervals of one-quarter of the mean time to total failure of a part (defined below), the likelihood of an uncorrectable RAM failure crashing the system can be made insignificant compared with other sources of failure.
Although system software could bypass all pages affected by a multi-address RAM failure, the entire module, 25% of storage, would be eliminated, so this is impractical except on an emergency basis. Continuing execution despite a multi-address RAM failure will result in a double error when any other coincident storage failure occurs in the same quadword; 1/16 of future failures will do this.
Some interesting questions are: How does MTBF vary with the EC arrangement? MTBF is pertinent if we let Dorados run until they fail. Alternatively, how likely is a failure in the next day, week, or month, if we test the memory that often and replace bad RAMs? These questions can be asked assuming perfect testing (no failures at t=0) or imperfect testing (some likelihood of failures at t=0 because diagnostics didn’t find them).
To answer them, MOS RAM failures are modelled as one of two types: those affecting a single address in the RAM (called SF’s), and those affecting all addresses (called TF’s). We assume that TF’s occur about 1/4 as often as SF’s in 4Kx1 RAM’s. RAM failures are assumed to be exponentially distributed, which is correct if the failure rate doesn’t change with time; over the time range of interest this is reasonable. Finally, perfect testing is assumed, so there are 0 failures at t = 0. These assumptions give rise to the following:
    let p = prob that an ic has a TF = 1 − e^(−at)
    let q = prob that an ic has a SF = 1 − e^(−bt)
    let n = number of MOS RAMs in the memory
Without error correction, MTBF is the integral from 0 to infinity of [(1−p)(1−q)]^n, which is 1/(n(a+b)). With b = 4a, in our 4-module system with n = 1024 (the data RAMs only; no check bits are needed without error correction), this is 1/5120a = .00018/a.
With error correction, failure occurs when, in a single EC unit, a TF coincides with either another TF or an SF. This ignores two coinciding SF’s, which is about 4000 times (16k RAMs) or 16000 times (64k RAMs) less likely.
    let n = number of RAMs in an error correction unit
    then Prob[no failure] = Prob[no TF] + Prob[1 TF and 0 SF]
    Prob[no TF] = (1−p)^n

Since the failure modes are independent,

    Prob[1 TF and 0 SF] = np[(1−p)(1−q)]^(n−1)
    Prob[no failure] = Pok = (1−p)^n + np[(1−p)(1−q)]^(n−1)
    Pok = e^(−nat) + n(1−e^(−at))e^(−(a+b)(n−1)t)
This is the probability for a single EC unit; the probability that all of MOS storage survives is Pok raised to a power equal to the number of EC units, and mean time to failure is the integral of that. In other words, the argument of the integral for a 4-module x 4-quadwords-per-module system is Pok^16 with n = 64+8; it is Pok^4 with n = 256+10 for a one-munch EC unit.
Then the expected time to failure for our 16 x (n = 64+8) memory system is about:

    (1/n) * (1/(16a) + 16a/(16a+b)^2 + 240a^2/(16a+2b)^3 + 3360a^3/(16a+3b)^4)
    = (1/(an)) * (1/16 + 1/25 + 5/288 + 105/17208)
    = (1/(16an)) * (1 + .64 + .28 + .006) = 1.93/(16an)
    = 1.93/(16*72*a) = .00168/a
In other words, mean time to failure is about 1.93 times the time to the first TF, or 9.5 times better than with no error correction; the corrected store fails about as often as 1024/9.5 = 108 uncorrected storage ic’s.
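These closed-form estimates can be checked numerically. The following sketch (not part of the manual; it assumes b = 4a and measures time in units of 1/a) integrates the survival probabilities with scipy and prints the mean time to an uncorrectable failure for the arrangements discussed in the text, for comparison with the estimates above:

    from math import exp
    from scipy.integrate import quad

    a = 1.0        # TF rate per RAM; all times come out in units of 1/a
    b = 4.0 * a    # SF rate, assumed to be 4 times the TF rate

    def mtbf_uncorrected(n_rams: int) -> float:
        # Integral of [(1-p)(1-q)]^n = e^(-n(a+b)t), which is 1/(n(a+b)).
        return quad(lambda t: exp(-n_rams * (a + b) * t), 0.0, 1.0)[0]

    def mttf_corrected(n_per_unit: int, units: int) -> float:
        # Pok = e^(-nat) + n(1 - e^(-at)) e^(-(a+b)(n-1)t), raised to the number of EC units.
        def pok(t: float) -> float:
            return (exp(-n_per_unit * a * t)
                    + n_per_unit * (1.0 - exp(-a * t)) * exp(-(a + b) * (n_per_unit - 1) * t))
        # The integrand is negligible long before t = 1/a, so a finite upper limit suffices.
        return quad(lambda t: pok(t) ** units, 0.0, 1.0)[0]

    print(mtbf_uncorrected(1024))       # 1/(5120a), about .0002/a: no error correction
    print(mttf_corrected(64 + 8, 16))   # about .0018/a: 16 quadword (64+8) EC units
    print(mttf_corrected(256 + 10, 4))  # about .0012/a: 4 one-munch (256+10) EC units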
The results don’t change much when imperfect testing is assumed. The effect is to replace the expressions for p and q by 1 − Ae^(−at), where A would be .999 if there were a 1/1000 chance of a MOS ic being bad at t = 0.
Remarks
On each storage board, data from MemD is transported into a shift register of 8 flipflops whose contents are then written into the MOS RAM’s after transport has completed. This arrangement is unfortunate: any failure in one of these components causes a multiple error, and there are about 250 of these parts in a full storage configuration.
One way to eliminate this problem while simultaneously reducing the part count on each storage board would be to make modules consist of four storage boards, rather than two, so that only four flipflops receive data on each bit path during transport; since each of these is in a different quadword, single failures would not cause multiple errors.
The Dorado EC operates on quadwords, requiring 8 check bits/64 data bits, or a 12.5% storage penalty. Alternative schemes are: 10 check bits/256 data bits (3.9%); 9 check bits/128 data bits (7.0%); 7 check bits/32 data bits (22%); and no error correction at all (0%).
The implementation of the EC pipeline is such that wider correction units significantly increase the time for a miss. The current quadword error corrector requires 7 clocks (3 clocks for setup and correction, 1 clock per word of the quadword); this would become 11 clocks with a 128-bit EC scheme or 19 clocks with a 256-bit EC scheme. Although cache hit rate seems to be above 99%, some implementation avoiding this delay would still be needed to make larger correction units attractive.
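A back-of-the-envelope sketch (assumed arithmetic only, using the clock counts quoted above) of how corrector latency grows with the width of the EC unit:

    def corrector_clocks(data_bits: int) -> int:
        # 3 clocks for setup and correction, plus 1 clock per 16-bit Dorado word in the EC unit.
        return 3 + data_bits // 16

    for bits in (64, 128, 256):
        print(bits, "data bits per EC unit:", corrector_clocks(bits), "clocks")   # 7, 11, 19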
If our quadword correction unit were replaced by a 4 x n=256+10 scheme:
    1/(4na) + 4a/(n(4a+b)^2) + 3a^2/(2n(2a+b)^3), which for b = 4a is
    (1/(4na)) * (1 + 1/4 + 1/36) = 1.28/(4na) = .0012/a
In other words, MTBF is about 1.28 times longer than the time to the first TF. So error correction has increased MTBF by a factor of 6.7 over no error correction; alternatively, a 1064-RAM corrected memory fails as frequently as a 1064/6.7 = 159 RAM uncorrected memory.
Surprisingly, the 64+8 EC scheme has only 42% longer MTBF than a 256+10 EC scheme. This improvement may not be worth the 96 additional MOS RAMs and 80 other DIPs required for address buffering; the 80 additional DIPs might cause more failures than they prevent, making them a net loss.
The other method of maintaining our systems is to test storage regularly and replace bad RAMs. The likelihood of no double error before replacement is then simply the value of the survival probability (Pok^4 or Pok^16 above) at the selected instant. This reduces to an approximation of the form Pok = [e^(−x) + x·e^(−5x)]^m, where x = nat and the factor 5 is (a+b)/a with b = 4a; m is 4 or 16, and n = 266 for m = 4 or 72 for m = 16. If this is evaluated at t = 1/(mna), 1/(2mna), 1/(4mna), etc., the following results are obtained:
Table 28: Double Error Incidence vs. Repair Rate
    m        1/mna     1/2mna    1/4mna    1/8mna

     4        .52       .81       .94       .98
    16        .79       .94       .98       .99
The interpretation of this table is as follows: measure the mean time to total failure (TF) of a MOS RAM and call this time 1/a, and assume 4 SF’s per TF. The mean time between TF’s somewhere in storage is then 1/(mna). The table shows the probability that the Dorado has not suffered a double error when storage is tested and repaired at that interval, or 2, 4, or 8 times as often.
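For reference, a small sketch (not from the manual) that evaluates this approximation; with b = 4a it reproduces Table 28 to within about .01 per entry:

    from math import exp

    def p_no_double_error(m: int, k: int) -> float:
        # Probability of no double error when storage is tested and repaired
        # every 1/(k*m*n*a); x = n*a*t, and n cancels at these test intervals.
        x = 1.0 / (k * m)
        return (exp(-x) + x * exp(-5.0 * x)) ** m   # the 5 is (a+b)/a with b = 4a

    for m in (4, 16):
        print(f"m = {m:2d}:", "  ".join(f"{p_no_double_error(m, k):.2f}" for k in (1, 2, 4, 8)))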