Error Recovery in LR Parsers
James Eve
University of Newcastle upon Tyne
Edwin Satterthwaite
Xerox Corporation
Last revision: January 18, 1980 10:13 AM
Parsing algorithms of the LR type, notably SLR(1) and LALR(1) [DeRemer, AEH], have become quite popular. They accept larger classes of grammars than most other formalized methods and do so without significant penalties in parsing speed or storage economy. This note describes an error recovery algorithm for use with such parsers. It seems to perform on average about as well as any other published method and has the following advantages:
It is compatible with LR parsers. Some otherwise attractive algorithms are designed to be used only with precedence or LL parsers.
The recovery is deduced entirely from tables that are already required by the scanner and parser.
Performance is acceptable with arbitrary grammars; thus the compiler writer can ignore error recovery during initial design and implementation phases. On the other hand, straightforward tuning for a particular application is possible, and the algorithm’s organization makes it easy to use lexical or semantic hints to guide the recovery.
Nonterminal symbols are never discarded from the parse stack. The problem of undoing or otherwise neutralizing the ‘semantic’ side effects of syntactic reductions is thereby avoided.
The basic ideas underlying our method are simple and might even appear naive. We therefore discuss the practical results that we have been able to achieve and the sensitivity of those results to variations in the development of the basic idea.
1. Preliminaries
1.1 Background
Our interest in error recovery stems from experience with the Algol W compiler. We found that many of the residual bugs in that compiler were consequences of the ad hoc techniques that it used to deal with syntactic errors. Specifically, it often discarded nonterminal symbols from the parse stack, and we were unable to find systematic ways of preserving the consistency of semantic information when that happened. Gries also discusses this problem and concludes that recovery should be designed to retain all nonterminal symbols [Gries, Section 15.3]. Our method was inspired by his recommendations and by the observation that much predictive information is conveniently encoded in LR parsing tables.
The Algol W compiler was used for our initial experiments [Wynn], but most of the subsequent development has taken place in the context of Mesa, a language for systems programming. Mesa is still evolving, and we have also been using LR parsers in translators of special purpose languages, some of which are purely experimental. Syntactic revisions and extensions are frequent in these applications, and the ability to provide tolerable error recovery at negligible cost to the implementer has also become quite important to us.
Because of the basic simplicity of our method, it is likely that similar techniques have been discovered independently, but we have not seen them published. Graham and Rhodes [GR] describe an algorithm for recovery in precedence parsers that performs well and has become a standard of comparison. The same authors include a good survey of the literature as well as concise descriptions of many of the published methods. Some of the more recent work on error recovery in LR parsers [Tai, Graham, Fischer] is closely related to ours but the emphasis and details have been different; we discuss the relationships briefly in Section ??.
Our framework can accommodate a number of techniques complementary to our own. These include spelling correction [Morgan], the use of semantic information to choose among possible recoveries [FL], and ‘panic mode’ strategies to be invoked when local recovery fails [GR]. We have not investigated any of these in detail.
1.2 Notation
All our parsers have been constructed using the methods described in Anderson, Eve and Horning (henceforth cited as AEH). We assume familiarity with that paper and use compatible notation when possible.
If a is a string, then |a| denotes the length of a. For any nonnegative integer k, k:a is the initial k-character substring of a if |a| ≥ k, and a itself otherwise.
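A minimal sketch of the prefix notation in Python (the name prefix is ours, introduced only for illustration):

def prefix(k, a):
    # k:a -- the first k symbols of a, or all of a when a is shorter
    return a[:k]

assert prefix(3, "abcde") == "abc"     # |a| >= k
assert prefix(9, "abcde") == "abcde"   # |a| < k, so k:a = a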
At any point in the parsing of a string, the internal state of the parser can be represented by a stack of states, which we call a stack configuration, usually denoted by S. The immediate parsing action is determined by the topmost state of a configuration; we call that the exposed state.
2. Description
2.1 Overview
Suppose that the parser is presented with the string ayw, where a is a correct prefix, i.e., there exists some string b such that ab is a sentence of the language. All LR parsers implemented using the methods described in AEH have the following correct prefix property: if the parser has recognized a, it will recognize ay only if ay is also a correct prefix. Otherwise, the parser will report an error.
In the latter case, we attempt a local repair of the error by seeking a deletion d and an insertion i with the following properties (where yw is repartitioned as dt):
For a fixed I (the insertion limit), |i| ≤ I.
For a fixed D (the deletion limit), |d| ≤ D.
For a fixed S (the scan limit), ai(S:t) is a correct prefix.
In general, more than one correction will satisfy these constraints, and we wish to make a change of minimum measure. If we accept a definition of measure in which insertion, deletion, and replacement of a single symbol have unit cost [Levy], this is equivalent to minimizing max(|d|, |i|).
There is no guarantee that this approach will find an acceptable recovery. In fact, finding any recovery is sometimes precluded; for example, no more than I closing brackets can be appended to a truncated sentence. We have observed, however, that programming languages contain considerable local redundancy and that the density of syntactic errors in real programs is relatively low. When these two conditions are met, our method will succeed much of the time.
The preceding remarks suggest a recovery algorithm that can be used with any parser having the correct prefix property. It works as follows: systematically generate candidates for d and i by examining the source stream and the grammar respectively; for each pair, check whether ai(S:t) is accepted by the parser. If the generation scheme is designed so that the measure never decreases between successive candidates, the search can terminate with the first such d and i. The following sections discuss the details of converting this into a practical algorithm.
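To make the search concrete, the following Python sketch enumerates candidate pairs in nondecreasing measure; is_correct_prefix stands in for the parser, and all names here are ours, not part of any implementation described in this note:

from itertools import product

def recover(a, rest, is_correct_prefix, symbols, I, D, S):
    # a: the accepted correct prefix; rest: the remaining input (yw);
    # symbols: the terminal vocabulary, in the preferred trial order.
    # Candidates are tried in nondecreasing measure max(|d|, |i|);
    # insertion-heavy repairs are tried first within each measure.
    for m in range(1, max(I, D) + 1):
        for d_len in range(0, min(m, D) + 1):
            for i_len in range(0, min(m, I) + 1):
                if max(d_len, i_len) != m:
                    continue                    # not exactly measure m
                t = list(rest[d_len:])          # rest repartitioned as dt
                for i in product(symbols, repeat=i_len):
                    if is_correct_prefix(list(a) + list(i) + t[:S]):
                        return list(i), list(rest[:d_len])
    return None                                 # no recovery within the limits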
2.2 Difficulties and Refinements
As noted above, no more than I symbols can be inserted between two symbols in the original string, and recovery is thereby precluded in some cases. This is seldom a practical problem, even with small values of I. Unsatisfactory performance is more often associated with the following phenomena: (1) In many cases, the ‘real’ error, as defined by the programmer’s intent, occurs within a, and the recovery is constrained to complete a valid but implausible prefix. (2) Because of the boundedness of the acceptance check, a ‘correction’ can cause spurious cascade errors after n further symbols have been accepted, where n > S and is possibly quite large. The behavior of two candidate corrections might be very different in this respect but, if they have the same measure, we have provided no basis for choosing between them.
The first difficulty is a problem with all methods that preserve correct prefixes. In our experience, it has caused fewer problems than anticipated. Local insertions can usually complete an aberration and get the parser back on track. Our method in fact allows us to discard or replace the last symbol in a correct prefix. If this possibility is exploited carefully, as described below, the inability to back up an arbitrary distance is seldom a problem.
The second difficulty has more serious practical consequences. The problem can be mitigated by increasing S. Unfortunately, S must also be chosen to keep the probability of another error within S:t fairly small; such an error (unless lumped with the first) precludes recovery. In addition, there is seldom any value of S that can guarantee the absence of cascade errors. In a typical language, for example, an expression of unbounded length can occur between the point of error and the symbol that would discriminate among potential recoveries. Graham and Rhodes [GR] deal with this by attempting to ‘condense’ the right context; their method reduces an unbounded initial substring of yw to a string of terminal and nonterminal symbols of bounded length in which a discriminating symbol is likely to occur. Because of the success of their techniques, several investigators have addressed the technical problem of condensing right context in LR parsers [DR, PD, MM].
Use of a better metric than the simple measure defined above can also alleviate the second difficulty. Our scheme generates candidate insertions in an easily controlled order; that order effectively assigns weightings to corrections of equal measure and can be manipulated to ‘fine tune’ the recovery process [GR, FMQ]. For example, the terminal symbols can be ordered so that common or frequently omitted symbols are tried first. A drawback is that there is little or no theory to guide the choice of weights. They must be derived empirically, and apparently minor changes can have unexpected consequences. By tuning the ordering after some initial experience, however, we have usually been able to ensure reasonable recovery from ‘characteristic’ errors of a particular language. Any special knowledge available to the recovery routines, such as similarities in spelling, can also be used to override the standard ordering.
We have found that these approaches are complementary and that both are useful. With a small value of S, acceptable performance can be achieved only with carefully chosen weights; with a large value, the quality of recovery is almost insensitive to the choice but clustered errors cause problems. We discuss some practically useful compromises below. So far, we have rejected the additional complexity of condensing the right context, for the following reasons: (1) The semantic routines in our compiler have not been organized to deal with the resulting noncanonical parses. (2) There is considerable empirical evidence that subphrases such as expressions are typically short in real programs [Sweet]. When this is the case, examination of S:t for some moderate value of S is likely to discover the discriminating symbol. (3) Although the cost of checking can in theory grow linearly with S, checks usually fail on the first few symbols of t and have a cost essentially independent of S.
3. Application to LR Parsers
Any recovery provided by the preceding algorithm can be characterized by i and d, and ai must be a correct prefix. This observation suggests that the computational problem can be divided into the following four subproblems:
(1) generating the set of possible insertions,
(2) generating the set of possible deletions,
(3) pairing insertions and deletions in order of increasing measure, and
(4) checking the right context for a successful pairing.
Organizing these subcomputations as coroutines minimizes the amount of generation and checking. Note that the set defined by (1) depends only upon a, i.e., it need not be recomputed for each choice of d. Subcomputation (2) is trivial. The properties of LR parsers provide no special help with (3), which is approximately breadth-first tree search. LR parsers do, however, considerably facilitate subcomputations (1) and (4); the tables used by those parsers allow efficient generation of candidate insertions, and the generation process itself provides information that speeds checking for successful recovery.
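Subcomputations (1) and (2) might be sketched as Python generators, the coroutine style suggested above; extend encapsulates the table-driven generation of Section 3.1, and the interfaces are illustrative assumptions. Subcomputation (3) then pairs cached insertion levels with deletions in measure order, and (4) is the parser restart described in Section 3.1.

def insertion_levels(extend, I):
    # Subproblem (1): candidate insertions, grown level by level.  The
    # levels depend only on the accepted prefix a (via extend), so one
    # traversal is shared by every candidate deletion.  extend(i)
    # yields the one-symbol extensions of insertion i that the tables allow.
    level = [()]                          # level 0: the empty insertion
    for k in range(1, I + 1):
        level = [i + (c,) for i in level for c in extend(i)]
        yield k, level

def deletions(rest, D):
    # Subproblem (2) is trivial: drop 0, 1, ..., D leading symbols.
    for d_len in range(min(D, len(rest)) + 1):
        yield rest[:d_len], rest[d_len:]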
3.1 Computing Continuations
If a is a correct prefix, we define the pair (g, S) to be a continuation of a if ag is a correct prefix and S is the stack configuration of the parser after recognizing ag. The configuration is unique up to LR(0) reductions; we assume that all such reductions have been made. We also define C(a,k) to be the set of all continuations (g, S) of a such that |g| = k, k > 0. Computing and remembering continuations is worthwhile for two reasons.
By enumerating C(a,k), we can discover all possible insertions of length k. The enumeration will in fact produce the pair (i, S) for the insertion i. For any given deletion d, we can check for an acceptable recovery by restarting the parser with configuration S and input S:t. Recovery has succeeded just in the case that input is recognized without the parser’s detecting a further error.
If (g’, S’) ∈ C(a,k+1), then g’ = gc for some (g, S) ∈ C(a,k). Furthermore, there must be a nonblank entry for c in the row of the state transition table corresponding to S’s exposed state. This observation allows efficient inductive computation of the sets of continuations. If (g, S) ∈ C(a,k), we consider each terminal symbol c with a nonblank entry in the exposed state’s transition table. Again, we restart the parser with configuration S, this time with input c. Either the parser will reach the configuration S’, in which case (gc, S’) ∈ C(a,k+1), or it will detect an error. (The latter never happens with proper LR(1) parsing tables but is possible if SLR(1) or LALR(1) tables are used.) By enumerating all elements of C(a,k) and accumulating the new continuations, we can be sure of generating all of C(a,k+1).
Both insertion strings and configurations are items of variable size. It is possible, however, to base their representations upon a list structure in which the node representing each continuation contains just one state, one terminal symbol, and a number of links. These links also allow common prefixes of insertions and configurations to be shared among continuations in a space-efficient way. Questions of implementation are discussed further in Section 5.
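A sketch of such a representation and of the inductive step from C(a,k) to C(a,k+1); trial_parse stands for replaying the parser from a configuration with a one-symbol input, and the interfaces are assumptions of this sketch rather than the actual Mesa structures:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # One stacked state; a configuration is a chain of these, so common
    # prefixes of configurations are shared among continuations.
    state: int
    below: Optional['Node']        # next state down the stack; None at bottom

def next_level(level, legal_symbols, trial_parse):
    # C(a,k+1) from C(a,k).  Each continuation is a pair (g, top):
    # the insertion string and the exposed node of its configuration.
    # For each terminal c with a nonblank entry in the exposed state's
    # row, trial_parse replays the parser (possibly reducing first) and
    # returns the new exposed node, or None on error (possible with
    # SLR(1) or LALR(1) tables).
    out = []
    for g, top in level:
        for c in legal_symbols(top.state):
            new_top = trial_parse(top, c)
            if new_top is not None:
                out.append((g + (c,), new_top))
    return out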
Two technical problems remain. One is that the cardinality of C(a,k) can grow rapidly with k; furthermore, some elements of these sets are useless in the sense that their possible continuations are identical to those of other elements; they therefore open no new possibilities for recovery. Eliminating some of the redundant elements is discussed in Section 3.5.
The second problem is establishing the basis for the inductive computation of the continuation sets defined above. Suppose that the parser detects an error in parsing ayw and stops with some configuration S. We would like to have C(a,0) = {(λ, S)}. Unfortunately, the parser will have not only recognized a but also examined y. A parser using full LR(1) parsing tables will report the error on the first examination of y, and C(a,0) can be constructed as indicated. In practice, however, more compact tables are necessary. With these smaller tables, the LR(1) standard of error detection is not necessarily maintained; inspection of y can signal several reductions of the stack before the error is noticed. S then is not the configuration after recognizing a but one derived from it. Approximating C(a,0) with {(λ, S)} in this case unduly limits the set of possible continuations. The problem and a solution are described in Sections 3.2 and 3.3.
3.2 Lost Recovery Options
Let the input string be axyw and suppose that ax has been accepted by the parser, leaving it in state i. The transition from i is determined by the symbol y. There are three possibilities.
(1) y should be accepted and a new state pushed onto the parse stack.
(2) A reduction of the stack using the production A ← s should occur.
(3) y is illegal in state i.
In all parsers considered here, transitions of type (1) are recorded directly in the parsing tables. Entries of types (2) and (3) are also recorded explicitly in full LR tables, and alternative (3) can be detected when y is first examined. In SLR and LALR tables, entries of type (2) are overestimated, i.e., the set of symbols signifying a reduction is larger than the set of symbols that will eventually be accepted. The SLR algorithm, for example, assigns type (2) entries to all y ∈ follow(A). Furthermore, several of the more powerful table compaction techniques described in AEH introduce default actions of type (2).
The reductions forced by erroneous input potentially limit the set of allowable continuations. An example using the language defined in Figure 1 of AEH will demonstrate this. Consider the string
id := id ) * ( id + id * id ) ⊥.
During the parse of this string using the (SLR) table in Figure 2 of AEH, the following configurations occur:
Configuration    Parsing Stack    Associated Symbols
(1)              0 4 6 13         id := id
(2)              0 4 6 11         id := P
(3)              0 4 6 10         id := T
(4)              0 4 6 9          id := E
Since there is a blank entry for ‘)’ in the row of the table for state 9, an error will be detected at this point. If we examine the entries for that state, however, we see that the only legal continuations begin with the symbols ‘+’, ‘else’, or ‘⊥’. It is easy to verify that simply inserting one of these symbols will not allow more of the input to be accepted. In fact, the best possible recovery starting from configuration (4) is to replace ‘) *’ with ‘+’. This is a moderately satisfactory correction but not a minimal one, and the repaired string is unlikely to have the intended semantics. We would prefer detection of the error when the spurious ‘)’ is first inspected (configuration 1, state 13). At that point, more options are available; in particular, the symbol ‘*’ is acceptable in state 13, and deletion of just the ‘)’ allows the parse to continue.
Compaction of parsing tables using the techniques suggested in Section 5.1 of AEH can make the problem of lost options substantially worse. In our example, default reductions are introduced for state 9 (specifying production 3), perhaps for state 2 (specifying production 1), and elsewhere. Then the following additional configurations are generated before the error is detected:
(5)              0 2              A
(6)              0 1              S
Recovery is now hopeless; the only choice is to delete the remainder of the input. Indeed, the parser must be careful not to conclude that a valid sentence has been accepted at this point; a termination test based solely upon entry to state 1 (the state {[0, 1]}) as suggested in AEH is no longer sufficient.
This example is obviously contrived, but it shows that difficulties can arise even with simple grammars. Realistic grammars are much more complex. Our experimental investigations suggest that options lost due to premature reductions of the stack do create serious practical problems. For example, errors in the use of separators are quite common. We often use productions in our grammars with the forms
L’ ← E | L’ : E
L ← L’
to describe a list L of elements E with separator :. Semantic routines associated with the productions on the first line count elements; those with the production on the second line process the accumulated list. If omitted or incorrect separators force premature reduction of L’ to L, satisfactory recovery is often impossible.
3.3 A Modified Parsing Algorithm
In an LR parser, the final step in recognizing the correct prefix ax is pushing (a state corresponding to) x onto the parsing stack. Loss of recovery options can be avoided by deferring all reductions until just before the next terminal symbol is to be pushed onto the stack (or a complete sentence is recognized). On the other hand, reduction of the stack is necessary to discover the required parsing actions. This conflict can be resolved by maintaining two copies of the parsing stack, one of which ‘lags’ the other by a terminal symbol.
In practice, administering another stack or set of stacks is likely to prove burdensome or inefficient. One solution, which we have found useful, is to allow a ‘hole’ in the parse stack in which its old state is preserved. Assume that the stack is implemented using an array stack and a stack pointer sp. Normally, the stack elements would be stored in locations stack[i] for i ∈ [0, sp]. We introduce two new variables, valid and top, to delimit the hole containing the preserved states. The stack can then be partitioned into three segments, with indices in the following intervals:
[0, valid]                 states shared by both stacks
(valid, top]               states preserved
(top, top+(sp-valid)]      new states pushed onto the stack.
The stack pointer sp can be manipulated as usual, but stack references must be mapped into the appropriate segment. In general, references to stack[n] are replaced by references to
stack[if n<valid then n else n + (top-valid)] .
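In Python, the mapping is a direct transcription of the formula above (a sketch; the names are ours):

def physical(n, valid, top):
    # Map a logical stack index to its location around the hole:
    # indices below `valid` are shared by both stacks; later ones
    # live above `top`, past the preserved segment.
    return n if n < valid else n + (top - valid)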
The following reformulation of the parsing algorithm makes the representation of the stack explicit and also indicates the changes required to queue reductions. A data structure for doing the queuing is assumed, but its structure is left implicit. We have adopted the style and nomenclature of AEH to allow easy comparison with the original algorithm and to separate the various cases for later reference. Note that i and j are indices identifying a state and a terminal symbol respectively.
Initialization.
Assign the first input symbol to j. Also create an empty queue.
i ← InitialState; sp ← 0; valid ← top ← sp.
Step 1. (stack and dispatch)
If i is the final state and the input stream is empty, then process the queue and exit. Otherwise, stack i on the parse stack.
stack[sp + (top-valid)] ← i.
Examine the (i,j)-th entry of the state table. If the entry is
a) blank, the terminal symbol j is invalid. Recover as described below.
b) a state number, go to step 2.
c) a scan production number, go to step 3.
d) a production number, go to step 4.
Step 2. (scan a terminal symbol)
Dequeue all deferred actions and normalize the stack as follows:
stack[valid+m] ← stack[top+m] for m ∈ [1, sp-valid].
Assign the new state number to i. Read the next input symbol and assign it to j. Increment the stack pointer
sp ← sp + 1; valid ← top ← sp
and return to Step 1.
Step 3. (scan a terminal symbol, then reduce)
Dequeue all deferred actions and normalize the stack as follows:
stack[valid+m] ← stack[top+m] for m ∈ [1, sp-valid].
Read the next input symbol and assign it to j. Increment the stack pointer
sp ← sp + 1; valid ← top ← sp
and continue with Step 4.
Step 4. (reduce using the p-th production, A ← s)
Pop one element from the parse stack for each symbol in s, thereby uncovering a state number k.
sp ← sp - |s|; valid ← min[sp, valid];
k ← stack[if sp<valid then sp else sp + (top-valid)] .
Also, enqueue p for later processing. Increment the stack pointer
sp ← sp + 1 .
Let A, the LHS, be nonterminal l. If the (k,l)-th entry of the table is
a) a state number, assign this state number to i and return to Step 1
b) a (scan) production number, assign a value of ‘undefined’ to i and repeat Step 4.
Note that the processing of the queue indicated in steps 2 and 3 begins with semantic information matching the preserved stack. If, as usual, that information is accumulated in another parallel stack, the dequeuing routine requires the value of top for initializing its internal stack pointer. That pointer must also be adjusted as each production is processed (cf. Step 4).
Consider the erroneous input axyw as above. The error is discovered upon examining y in step 1a. If x was read by a scan action (step 2), a valid state number has been assigned to stack[top], and C(ax, 0) = {(λ, stack[0..top])}. If, however, x was read as part of a scan-reduce action (step 3 followed by step 4), no value has been assigned to stack[top]; indeed, the corresponding LR(0) state has been eliminated from the parsing tables. A valid initial configuration can be obtained by performing all enqueued scan-reduce actions (possibly including those introduced in step 4b). This much dequeuing can eliminate no recovery options, since LR(0) reductions are independent of the input symbol. Alternatively, stack[0..sp-1] can be taken as the initial configuration. This configuration corresponds to the valid prefix a, and x should be reinserted in the input stream before invoking the recovery algorithm. Note that x has become a candidate for deletion or replacement; however, the initial configuration does reflect any reductions forced by x.
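The essentials of the modified algorithm can be condensed as follows. For clarity, this sketch keeps the lagging configuration as an explicit copy rather than as a hole in one array, and the table interface is an assumption of the sketch, not the AEH encoding itself:

END = object()                       # end-of-input marker

def parse(tokens, action, goto_nt, prod, start, final):
    # Table interface (assumed):
    #   action(i, j)  -> ('shift', s) | ('shiftreduce', p)
    #                  | ('reduce', p) | None            (None = error)
    #   goto_nt(k, A) -> ('state', s) | ('shiftreduce', p)
    #   prod(p)       -> (A, length of rhs)
    stack = [start]                  # current configuration
    shadow = list(stack)             # configuration at the last shifted terminal
    queue = []                       # deferred production numbers
    toks = iter(tokens)
    j = next(toks, END)

    def reduce_by(p):                # Step 4: pop rhs, consult goto, defer p
        entry = ('shiftreduce', p)
        while entry[0] == 'shiftreduce':
            A, n = prod(entry[1])
            del stack[len(stack) - n:]
            queue.append(entry[1])
            entry = goto_nt(stack[-1], A)
            if entry[0] == 'shiftreduce':
                stack.append(None)   # LR(0) state eliminated from the tables (4b)
        stack.append(entry[1])

    while True:
        act = action(stack[-1], j)
        if act is None:
            if stack[-1] == final and j is END:
                return queue         # accepted; the caller replays the queue
            return ('error', shadow, list(queue), j)  # recovery starts from shadow
        if act[0] == 'shift':        # Step 2: the terminal commits deferred actions
            queue.clear()            # (a real parser replays semantic actions here)
            stack.append(act[1])
            shadow = list(stack)
            j = next(toks, END)
        elif act[0] == 'shiftreduce':   # Step 3, then Step 4
            queue.clear()
            stack.append(None)          # the scanned symbol's state is eliminated
            shadow = list(stack)
            j = next(toks, END)
            reduce_by(act[1])
        else:                        # ('reduce', p): Step 4, deferred
            reduce_by(act[1])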
3.4 Adding Limited Backup
The ‘backing up’ over x, the last symbol in a correct prefix, as discussed above does not depend upon x’s being read by a scan-reduce action. The queuing scheme guarantees that the (possibly ill-defined) state associated with x is in stack[top] when the error is detected. Since no reduction using x has yet occurred, x can uniformly be considered for replacement. In theory, this opens the possibility of more satisfactory recovery from certain classes of errors. Consider, for example, the clause ‘IF i = 0 THEN’ and the following corruptions, ‘IG i = 0 THEN’ and ‘i = 0 THEN’. The initial identifiers, IG and i, are parts of correct prefixes, and the error cannot satisfactorily be corrected without backing up. Our observations indicate that this possibility can usefully be exploited. Of course, such a limited backup capability does not help if, e.g., a complicated variable reference precedes the ‘=’, but that is rare even in syntactically correct programs [Sweet].
We define an A-continuation to be a continuation of ax, the correct prefix after accepting x, and a B-continuation to be one of a, the correct prefix before accepting x. We call the corresponding recoveries A- and B-recoveries, etc.
If (g, S) ∈ C(ax, n), then clearly (xg, S) ∈ C(a, n+1). Thus it is possible to generate the A-continuations by taking the B-continuations and extending a subset one more level. Note, however, that all elements in that subset have special properties:
Of all the B-continuations with level I, only those in this subset are to be extended to level I+1.
Consider an A-recovery in which the string d is deleted and the string g is inserted. As a B-recovery, this requires deleting xd and inserting xg. However, the nominal measure of this correction, max(|xd|, |xg|), is one greater than the ‘correct’ measure, max(|d|, |g|); i.e., the measure must be computed specially on this subset or it will be too high.
There is a complication if the very first symbol of the input is in error. Then only one set of continuations can be generated. Rooting the B-continuations in the null set can handle this case. Alternatively (and somewhat more simply), the goal symbol can be forced to produce the form ⊥S⊥, with the initial ⊥ supplied by the parser. Then a nonnull prefix always exists.
3.5 Eliminating Useless Continuations
Let (g, S) ∈ C(a,j) and (g’, S’) ∈ C(a,k), j < k. If S = S’, then the strings that can follow ag and ag’ are identical. Since |g| < |g’|, there is no point in retaining (g’, S’) as an element of C(a,k). This observation suggests checking for and eliminating duplicate configurations when constructing the continuation sets. Note that there are two advantages. Space is saved in representing C(a,k), and computation is saved in generating C(a,k+1).
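A sketch of the check, using the Node chains of Section 3.1 (both function names are ours):

def config_key(top):
    # Flatten a configuration chain into a hashable tuple of states.
    states = []
    while top is not None:
        states.append(top.state)
        top = top.below
    return tuple(states)

def add_continuation(cont, level, seen):
    # Retain a continuation only if its configuration has not appeared
    # at this or any earlier level: equal configurations have identical
    # possible futures, so the shorter insertion subsumes the longer.
    g, top = cont
    key = config_key(top)
    if key in seen:
        return False
    seen.add(key)
    level.append(cont)
    return True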
An example will illustrate these points. Consider the grammar G defined by the following set of productions.
S ← E
E ← V | E + V | E - V
V ← x | x ( E )
The continuation sets for the string ‘x’ are the following (where continuations are denoted by just their string components):
k    C(x,k)
1    {⊥, +, -, (}
2    {+x, -x, (x}
3    {+x⊥, +x+, +x-, +x(, -x⊥, -x+, -x-, -x(, (x+, (x-, (x(}.
Eliminating useless continuations prunes these sets with the following result:
k    C(x,k)
1    {⊥, +, -, (}
2    {+x, -x, (x}
3    {+x(, -x(, (x+, (x-, (x(}.
Note that both ‘+x’ and ‘-x’ are retained in C(x,2). This is because there is no LR(0) reduction of ‘x+x’ (a parenthesis could follow). During the computation of C(x,3), ‘+x+’ and ‘-x+’ are discovered to be equivalent to one another and to ‘+’ in C(x,1). On the other hand, ‘+x(’ and ‘-x(’ are retained although one of these is clearly useless also.
The check for useless continuations is more effective when the grammar is arranged to introduce a nonterminal symbol for each set of syntactically equivalent terminal symbols. (Some scanners do this automatically, at least for arithmetic operators.) Consider the following grammar G’, which defines the same language as G.
S ← E
E ← V | E A V
V ← x | x ( E )
A ← + | -
There are now LR(0) reductions of ‘+’ and ‘-’ to A, and the useless configurations can be recognized and eliminated much sooner. The continuation sets are the following:
k    C(x,k)
1    {⊥, +, (}
2    {+x, (x}
3    {+x(, (x+, (x(}.
If both A- and B-continuations are being considered, some care is necessary in deciding which are actually useless. Let (g, S) ∈ C(a, m) and (g’, S) ∈ C(ax, n). Then the B-continuation (g, S) sees xyw as initial right context, but the A-continuation (g’, S) sees yw. Note that appending identical symbols to g and g’ produces a new pair of continuations that again have identical configuration components. Thus we can restrict our attention to recoveries in which g or g’ is the entire insertion. There are two interesting cases:
agxyw is an acceptable recovery. This succeeds immediately as a B-recovery. The corresponding A-recovery, axg’xyw, requires inserting a second x and has the following properties: the measure can be (and usually is) one greater; |g’x| can exceed I, the insertion limit (note that we are interested in cases in which |g’| = |g| as well as |g’| < |g|, since we generate A-trees first); the semantic consequences are likely to be bad. Thus the B-continuation (g, S) should not be rejected because of previous discovery of the A-continuation (g’, S), at least not if |g’| = |g|.
axg’yw is an acceptable recovery. This succeeds immediately as an A-recovery, but the corresponding B-recovery, agyw, requires the further deletion of x and has the following properties: the measure can be one greater; the number of deletions required can exceed D, the deletion limit. Thus the A-continuation (g’, S) should not be rejected because of previous discovery of the B-continuation (g, S).
In both cases, raising the apparent measure from n to n+1 is a problem if that permits a less desirable recovery of measure n (recall the importance of not defeating the ordering of symbols) or of (genuine) measure n+1.
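With the duplicate check of Section 3.5 keyed by tree as well as configuration, the rule reduces to roughly the following; the tree tag plays the role of the two Booleans mentioned in Section 5.1, and the single exception noted in Section 4.5 is omitted here:

def useless(key, tree, seen):
    # seen: set of (configuration key, tree tag) pairs already produced.
    # A previously seen configuration suppresses a new leaf only when
    # both belong to the same tree ('A' or 'B'); an A-leaf must not
    # suppress the matching B-leaf, nor vice versa.
    if (key, tree) in seen:
        return True
    seen.add((key, tree))
    return False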
Our experience suggests that distinguishing the A- and B-continuations only occasionally gives different recoveries, but some of these cases are important. An example is instructive. Consider the following Mesa text:
P1: PROCEDURE =
BEGIN
s1; ...; sn; -- each si is a statement
P2: PROCEDURE = ...
The problem is a missing ‘END’, but no error is discovered until the ‘:’ is seen. The A-continuation of measure 1 that inserts ‘END’ after ‘P2’ forces a series of reductions (P2 becomes a procedure call) terminating with an LR(0) reduction (using ‘END’) to a procedure body. The B-continuation of measure 1 that inserts ‘END’ before ‘P2’ also forces an LR(0) reduction producing the same stack configuration. If that continuation is discarded as useless, the desired B-continuation of measure 2 (‘END ;’) is never generated, and recovery essentially fails. (‘END ; id’, the A-continuation that would have a similar effect, is not tried if I = 2.)
4. Strategy Considerations
The description so far has left many choices of parameters and strategy open. We have in fact tried a large number of variations of the basic method in an attempt to find an ‘optimal’ one, and we use those experiments to justify a number of remarks that we make below. Our experiments were based upon a sample of 303 Mesa programs, the characteristics of which are described in Section 6.2. We have limited evidence, also discussed in Section 6, that our results transfer to other Algol- and Pascal-like languages and to other environments.
Most of our claims are based upon relative performances; these are much easier to judge than the absolute quality of a recovery. In our descriptions, ‘much worse’ (‘much better’) means that global phrase structure was seriously damaged in the repair chosen by one strategy but not the other. ‘Worse’ (‘better’) refers only to the effect on local structure and its semantics; global structure was maintained about as well (or as poorly) in each case.
It is worth pointing out that we observed considerable insensitivity to the details of the strategy. In the majority of our examples, most strategies that we tried either succeeded equally well (but often differently) or failed equally. Enough of the remaining examples produced dramatically different behavior (especially with respect to cascade errors) to justify, in our environment, careful tuning and more elaborate coding. Section 4.5 discusses the consequences of certain possible simplifications.
4.1 The Parameters
These have been fairly well tuned for Mesa.
InsertLimit (I). In the Mesa sample, allowing insertions of length 3 improved the recovery in only 1 case, and growing that large a tree is expensive in time and space. We suggest a value of 2.
DiscardLimit (D). The value can be set rather arbitrarily. We recommend a fairly large value (~10) so that deleting long sequences of symbols can serve as a fairly effective ‘pseudo-panic mode’.
ScanLimit (S). Increasing S improves the quality of individual recoveries but increases the likelihood that a cluster of errors will prevent any recovery’s being found. On the basis of our experiments, we recommend a minimum value of 4; 3 was noticeably worse, and 2 gave so many cascade errors as to be useless. As S increased in our tests, the overall recovery rate did not change dramatically; 4 and 8 were about equal, 6 worse, and 12 or greater better. The nature of the recoveries did vary, however. The correction algorithm failed more often, but successful corrections were nearly always quite good and seldom resulted in subsequent cascade errors. (Need examples). When it is possible to reset the scanner in the case that the entire lookahead buffer is not accepted, a good compromise is to look for large S but to accept the ‘best’ recovery over some threshold when this fails. Note that, for computing ‘best’, an A-recovery already implies accepting one more symbol than a B-recovery.
Note that the cost of checking right context does not, on average, grow very rapidly with S, since most checks fail quickly. In our sample, the average number of symbols examined to reject a candidate recovery was 1.4 with S = 12.
If the scanner cannot be reset, the number of symbols checked varies from MinScanLimit to MinScanLimit + InsertLimit - 1 for A-checks and from MinScanLimit + 1 to MinScanLimit + InsertLimit + 1 for B-checks. This can easily be made uniform if the input stream can be repositioned, but many operating systems fail to support repositioning in a reasonable way.
4.2 Insertion/Deletion Strategy
Interactions between the scanner and the lookahead buffers may constrain this. In our experience, insertions should be favored over deletions, and A-recoveries over B-recoveries. Except to deal with special problems, symbols should be ordered as follows: first separators (comma, semicolon), then infix operators (+, ←), then closing brackets, and finally opening brackets. All these heuristics tend to localize the effect of the repair on both syntax and semantics. Need examples and statistics for each.
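As a sketch, the recommended ordering might be encoded as a ranking table consulted when sorting table entries; the symbol spellings here are illustrative, not Mesa’s actual vocabulary:

INSERTION_ORDER = [
    ';', ',',                 # separators first
    '+', '←', '*', '=',      # then infix operators
    ')', ']', 'END',          # then closing brackets
    '(', '[', 'BEGIN',        # opening brackets last
]
RANK = {sym: r for r, sym in enumerate(INSERTION_ORDER)}

def by_preference(symbols):
    # Unlisted symbols sort after all listed ones, in their given order.
    return sorted(symbols, key=lambda s: RANK.get(s, len(RANK)))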
4.3 Searching for Insertions
As discussed in the preceding section, propagating the ordering of terminal symbols to the ordering of the continuations is important. The entries from the state transition table must be sorted to reflect this ordering. If there is a default transition for a state, the states that can be entered by chains of default reductions must be determined, and the nondefault entries in all these states must be merged to generate the continuations. The sorting, merging and generation of continuations can conveniently be done by a system of coroutines.
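A sketch of the merge; the chain of states is assumed to have been computed by replaying default reductions against the current configuration, since each such reduction changes the stack:

def merged_continuations(chain_states, explicit_entries, by_preference):
    # chain_states: the exposed state plus every state entered from it
    #   by chains of default reductions (computed against the stack);
    # explicit_entries(s): the terminals with nondefault, nonblank
    #   entries in state s.  The union is emitted in the tuned order.
    symbols = set()
    for s in chain_states:
        symbols.update(explicit_entries(s))
    return by_preference(symbols)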
4.4 Acceptance Testing
Lookahead that succeeds by recognizing a complete sentence must be rejected unless the end of the input file has also been reached. Otherwise, truncated sentences are accepted as repairs, and an arbitrary amount of input is ignored.
4.5 Possible Simplifications
(Assumes ‘standard’ parameters; increasing S tends to mask other changes.)
Queuing reductions in the parser.
Using only A-continuations or only B-continuations.
As discussed in Section 3.5, checks for useless continuations should distinguish A- and B-continuations. There is one special case: the A-continuation of measure 0 that is obtained from the B-continuation of measure 0 by accepting x need not (and should not) be retained as a B-continuation of measure 1. On the sample of 303 programs, the following results were obtained when the distinction was not made:
Strategy                 A subsumes B    B subsumes A
Different recoveries     13              15
Compared to standard:
  Much worse             4               0
  Worse                  2               6
  Neutral                6               8
  Better                 1               1
The sets of differences were disjoint, so each row is additive. The first case raises more serious problems because I must be kept quite small but D can be made large.
5. Implementation Considerations
The following notes make the data structures more concrete and discuss non-obvious coding details. This section needs work; it has been transcribed from coding notes and is still a guide to reading the listings.
Given the string axyw, assume that ax is a correct prefix but axy is not. Because of the queuing scheme, x has not participated in a reduction by the time the error is detected. Possible correction actions are insertion, deletion, or replacement of a symbol, each of which is taken to have unit measure. For a composite correction, the measure is the sum along a minimal path. A correction of measure n that retains ax is an nA correction (modification after x); one that retains only a, an nB correction (modification before x).
The recovery scheme builds trees of possible continuations of ax (the A-tree) and a (the B-tree) in which the leaves at level n are the continuations containing exactly n terminal symbols. The trees are built breadth first. At each level, they are checked for corrections of measure n by scanning initial symbols of yw (perhaps after some deletions). A-corrections are favored by building and checking A-leaves before B-leaves at each level. For measures greater than InsertLimit, a form of ‘pseudo-panic’ mode is used in which symbols of yw are deleted until one of the already generated leaves becomes an acceptable continuation.
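The overall search then has roughly the following shape. This is a skeleton only: grow and check stand for the tree machinery and the right-context scan described below, grow is assumed to keep A-leaves ahead of B-leaves (realizing the A-before-B preference), and the bookkeeping that credits an A-recovery with one extra accepted symbol is elided:

def find_recovery(roots, grow, check, insert_limit, discard_limit):
    # roots: the level-0 leaves of the A- and B-trees, in that order.
    # At measure m, insertions of length m are paired with deletions
    # 0..m, then shorter insertions with a deletion of exactly m.
    levels = [list(roots)]                   # levels[k] = leaves at level k
    for m in range(1, insert_limit + 1):
        levels.append(grow(levels[-1]))
        for ndel in range(min(m, discard_limit) + 1):
            for leaf in levels[m]:
                if check(leaf, ndel):
                    return leaf, ndel
        if m <= discard_limit:
            for k in range(m):
                for leaf in levels[k]:
                    if check(leaf, m):
                        return leaf, m
    # 'Pseudo-panic': keep deleting source symbols until some already
    # generated leaf becomes an acceptable continuation.
    for ndel in range(insert_limit + 1, discard_limit + 1):
        for flat in levels:
            for leaf in flat:
                if check(leaf, ndel):
                    return leaf, ndel
    return None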
5.1 Data Structures
StackNode and tree
Useful measure 2 recoveries in Mesa have occasionally been observed to require ~220 tree nodes, but the only ones that overflowed 256 were (and should have been) corrected by long deletions with short or empty insertions. Note that the Mesa grammar introduces a nonterminal for operators at each precedence level; this keeps expression trees from growing too rapidly. The last CheckSize nodes are reserved to allow RightScan some work space. We have had no trouble when CheckSize > MinScanLimit + InsertLimit + 2.
Three fields have been added to StackNode. We use hashing (which links nodes with the same exposed state) to speed the check for useless continuations but haven’t measured its effect. Two Booleans are used to distinguish leaves of the two trees without tracing back to the roots.
StackRep
This makes explicit the idea that a stack configuration can usefully be represented by a chain in ‘memory’ (leaf, an index of a tree node) plus a ‘fast register’ (extension). Then most nodes in which nonterminals are associated with the exposed state need never occupy tree space (cf. the hardware implementation of the B5000 stack).
SymbolRecord and sourceText
A SymbolRecord contains the internal code for the symbol (class), an uninterpreted value, and an index, which is the position of the token in the source file (used for generating error messages and for resetting the scanner when not all buffered symbols can be accepted).
There is a single buffer sourceText for scanned symbols, both those that have been discarded and the right context to be checked. These segments have indices [discardBase .. scanBase) and [scanBase .. scanLimit) respectively. Note that sourceText[0] contains x, the last symbol in the correct prefix. Manipulations of discardBase (and scanBase) hide this during the processing of A-trees. Because of the initial adjustment of top, however, a non-null and non-discarded sourceText[0] must be fed back to the parser after recovery (even an A-recovery).
5.2 The Procedures
ActOnStack and ParseStep
These are separated because ActOnStack is also handy for processing default reductions during tree generation. Note that, on entry, the number of positions ‘above’ the tree node (leaf) is the sum of the occupancy of extension and the number of symbols scanned prior to entry. This sum can be 0, 1, or 2. Special action is required when a reduction does not pop all of these ‘extra’ positions or when the count is 2 and a scan action is specified. In the former case, the difference between this count and the rhs length cannot be 2 (that would imply a scan-reduce with empty rhs). Exactly one extra position (extension) is filled on exit.
GrowTree
First a list of all states entered by default reductions is created; then the terminal symbols tabulated for those states are merged to grow the tree in the ‘correct’ order. This procedure is coded so that other symbol generators could be added; i.e., it does not set levelStart or levelEnd internally, and the procedure AddLeaf is split out. The latter is coded to reject unacceptable symbols or those that cause transition to a configuration already seen. Thus it is safe to try to add arbitrary symbols an arbitrary number of times.
SyntaxError
The initial setup must be done carefully (see Section 3.4). The place to add symbol generators based upon knowledge of common errors (e.g., spelling correction) has been marked. A spelling corrector should try to insert alternative spellings of sourceText[level] when p = after and of sourceText[level-1] when p = before. This is probably worthwhile only when level = 1.
6. Applications and Examples
The recovery algorithm described here has been implemented in conjunction with systems that generate SLR(1) and LALR(1) parsers as described in AEH. The resulting parsers have been used in a number of specialized or experimental translators. Most of our experience, however, has come from parsing the languages Algol W [] and Mesa []. The compilers for these two languages are organized somewhat differently, and they are used in quite different environments.
6.1 Algol W
The Algol W compiler uses three passes. The first of these is a lexical scanner (basically a finite-state machine) that also generates a skeletal, block-structured symbol table. The second pass parses the token stream (using an SLR(1) parser), completes the symbol tables, and constructs a tree-structured representation of the program.
Error recovery is especially troublesome because these two passes use quite different algorithms to compute the block structure. If the two computations get out of step because of recovery actions, many spurious errors will be reported. In addition, the grammar is written in a way that attempts to use syntactic rules to impose semantic constraints; thus many common errors that have little to do with the phrase structure (such as misuses of identifiers) require syntactic error recovery. Finally, the tree structure is interpreted. Building it has many side effects on tables, etc., and restoring consistency after discarding nonterminal symbols is difficult. All of these characteristics aggravate the recovery problem; our techniques help with the second and third but do not address the first.
The Algol W compiler is used mainly in batch mode by students and staff members with limited access to the computer. Keypunches and alphanumeric display terminals are the primary devices used to prepare input. Conjecture: in this environment, some sort of ‘panic mode’ recovery must supplement the techniques described above; also, the use of spelling similarities to order the continuations is beneficial.
6.2 Mesa
Mesa is compiled by a multipass compiler in which the scanner and parser constitute the first pass. The output of that pass is a ‘pruned’ parse tree that represents the syntactic structure of the program. The tree is entirely uninterpreted so that, e.g., subscripted variables and function applications are indistinguishable. The lack of interpretation is necessary to avoid restrictions on forward references and to accommodate uniform referents. Subsequent passes perform semantic analysis, optimization and code generation.
LALR(1) parsing tables are used. Several aspects of error recovery are simplified by building the very spartan parse tree. In addition, the grammar is specifically not written in a way that attempts to outlaw semantically invalid forms, since we find it easier to give informative messages during the semantic analysis phases.
Users of Mesa are experienced systems programmers working in a very responsive machine environment. Nearly all programs are prepared using an interactive editor. A high quality display provides very legible feedback to the user; furthermore, visual fidelity is maintained, e.g., a backspace erases a character from both the display screen and the file being edited. The goal of the recovery logic is to maximize the (useful) information obtained about syntactic and semantic errors on each run. There is no ‘panic mode’ recovery; an avalanche of error messages because of poor recovery is considered worse than abandoning the compilation.
The Mesa compiler was used to explore the effects of varying parameters and strategy in our method. To provide a realistic data base for these experiments, users of Mesa were asked to collect the error reports generated by the compiler and to submit them for analysis. Since the symbols following an error influence the recovery, no attempt was made to find occurrences of equivalent errors in different programs. Multiple occurrences of the same error in a single program were eliminated when those errors appeared to be produced by copying of incorrect text; ‘clusters’ in which the density of errors was high enough to affect recovery were retained as such; and different errors well separated by text were considered to be independent. Small programs intended to encapsulate the observed errors were then produced by hand. Claims in the preceding sections are based upon experiments with 303 of these programs. Taken in the context of their lookahead strings, nearly all these errors were different. On the other hand, a more usual classification scheme would find many instances of the same error (e.g., ‘missing semicolon’ was a relatively common error), and the sample appeared to give reasonable weights to these larger classes of errors without further adjustment.
6.3 Some Examples
The following examples cited in [GR] are reproduced here for comparison. Output associated with error detection and repair is interleaved and marked by a font change. Because of the high density of errors in the final example, the parameters of the recovery algorithm were changed so that MinScanLimit = 3, and the threshold below which pseudo-panic mode recovery is invoked was similarly lowered. Otherwise, the parameters and strategies suggested by the Mesa sample set were used without modification. Note that our method produces a recovery almost identical to Graham and Rhodes’ Figure 1. Their Figures 2 and 4 show the comparable recoveries provided by some widely used compilers.
Example 1
BEGIN
INTEGER k, m, p, q;
m ← q - 3;
i = 2*(m-p) THEN k ← 1 ELSE m ← 1;
↑ Syntax Error
Text inserted is: IF
END .
Example 2
BEGIN
INTEGER i, j, k, l, x;
x ← i j;
↑ Syntax Error
Text inserted is: +
END .
Example 3
BEGIN
INTEGER i;
WRITE (BEGIN i ← 3 END);
↑ Syntax Error
Text deleted is: BEGIN id ← num END
Text inserted is: id
END .
If the insertion limit I is increased to 3, the following recovery is produced instead.
BEGIN
INTEGER i;
WRITE (BEGIN i ← 3 END);
↑ Syntax Error
Text inserted is: id ) ;
↑ Syntax Error
Text deleted is: )
END .
This illustrates our general observation that changes in the parameters give recoveries that are quite different in particular cases but on average do not differ substantially in quality.
Figures 1-4 [GR]
BEGIN
INTEGER ARRAY a, b(1..5 1..10);
↑ Syntax Error
Text inserted is: ,
INTEGER i,j,k,l;
up: i + j > k + l * 4 THEN GO l1 ELSE k IS 2;
↑ Syntax Error
Text inserted is: IF
↑ Syntax Error
Text inserted is: TO
↑ Syntax Error
Text deleted is: id
Text inserted is: ←
a 1,2 ← b(3 * (i+4, j*/k)
↑ Syntax Error
Text inserted is: (
↑ Syntax Error
Text inserted is: )
↑ Syntax Error
Text inserted is: )
↑ Syntax Error
Text inserted is: id
IF i = 1 THEN THEN GO TO up;
↑ Syntax Error
Text inserted is: ;
↑ Syntax Error
Text deleted is: THEN
l2:
END .
References
[AEH]
Anderson, T., Eve, J., and Horning, J. J., Efficient LR(1) parsers, Acta Informatica 2 (1973), 12-39.
[GR]
Graham, S. L., and Rhodes, S. P., Practical syntactic error recovery, Communications of the ACM 18 (November 1975), 639-650.
[Wynn]
Wynn, P., Error recovery in SLR parsers (M.Sc. dissertation), Computing Laboratory, University of Newcastle upon Tyne (1973).