XEROX PARSER 2 4 1 PARSER 1 4 By: Tayloe Stansbury (Stansbury.pa@Xerox.com) PARSER implements a LALR(k) parser generator. Given any grammar G, provided G is LALR(k) for some k, PARSER will produce a deterministic bottom-up parser for L(G). PARSER supports semantic actions associated with production rules in the grammar, so that much if not all of the intended translation can take place during the parse. The form of the parsers generated can be tailored to a fine degree by generation-time parameters; this makes it possible, for example, for the same parser generation mechanism to be used to generate both lexical analyzers and conventional parsers. HOW THE PARSER GENERATOR WORKS The parser generation algorithm used is set forth in "Theory and Construction of LR(k) Parsers" (Preliminary Version), Benjamin M. Brosgol, Center for Research in Computing Technology, Harvard University, March 1973. It works by constructing a finite state machine, the LR(0)-FSM, which recognizes stack configurations during the course of a parse. This machine can be used to determine whether to read or reduce on the next parsing step. Where ambiguity still remains, lookahead is added to produce the LALR(k)-FSM. It looks ahead only as many symbols as are locally necessary to resolve ambiguity. PARSER constructs a Lisp function which implements the parser controlled by the LALR(k)-FSM. There is no parsing table. This approach of generating a source code parsing engine specialized to the language to be parsed, rather than generating transition tables to be used by a general parsing engine, was chosen because it makes it possible for the Lisp optimizing compiler to perform most of the parser optimizations. HOW TO USE THE PARSER GENERATOR A number of examples for use with PARSER are provided on the file PARSERG. To use the parser generator, call (MAKEPARSER parser-spec ordered?) [Function] where parser-spec is an instance of the PARSERSPEC datatype, and ordered? is a flag. Parser-spec specifies the form the parser is to take: the grammar of the language it is to parse, the semantic actions to be executed during the course of the parse, the stack and queue implementations, and so forth. If ordered? is non-NIL, the parsing function will be generated in such a way as to make it more humanly readable. The fields of the PARSERSPEC datatype are as follows: PARSERNAME: This should be an atom. It will become the name of the parsing function. GRAMMAR: an instance of the datatype GRAMMAR (explained below) READFN: This should be the name of the function (or macro) which reads the next token from the stream of tokens to be parsed. The read function can be anything from BIN to a hand-written lexical analyzer to another parser generated by this parser generator. The read function will be passed two arguments: a list of the class of tokens expected, and a state-container. (The state-container holds whatever useful internal state this read function and any subordinate read functions might want to retain from call to call. In the case of BIN, this might be the stream, or in the case of a lexical analyzer it might be the unused contents of the lookahead queue. This state problem could of course be handled much more naturally with closures.) The output of the read function can be in any format; it will be interpreted by the CLASSFN and INSTANCEFN. CLASSFN, INSTANCEFN: These should be the names of functions which, given a token and a list of expected token classes, will return respectively the class or instance of the token given. Class and instance frequently can differ: for example, a particular token may have class number and instance 3.14159. Note that token classes and instances must be equivalence comparable by EQ. The default class and instance macros just return the token itself. EOFFN: This should be the name of a function of one argument, a token class, which returns non-NIL if the token class is admissible as end-of-file, and NIL otherwise. All parses must end on reading a (perhaps virtual) end-of-file. Parsers used in certain configurations, lexical analyzers in particular, might want to be rather permissive about what counts as end-of-file. The default EOFFN admits only the atom EOF. The following fields of PARSERSPEC are optional (i.e., they have reasonable defaults): STACKINITFN, PUSHFN, POPFN, TOPFN: These should be the names of functions which implement a stack datatype. There are two parallel stacks used in these parsers: one holds the nascent parse tree, and the other keeps track of the left-context of the parse. The default stack implementation uses lists: CONS to push, CDR to pop, etc. Specialized stack implementations might take advantage of knowledge that the stack for a particular grammar cannot grow past a certain depth. QUEUEINITFN, ENQUEUEFN, DEQUEUEFN, QUEUENOTEMPTYFN: These should be the names of functions which together implement a queue datatype. The parser uses a single queue to hold lookahead symbols. The default queue implementation is made from a list and uses TCONC to enqueue things. Specialized queue implementations might take advantage of the degree of lookahead required by a particular grammar. LAQUEUEINITFN: the name of a function of one argument, the old parser state, which returns a lookahead queue SAVESTATEFN: This should be the name of a function of one argument, the lookahead queue, which returns the state the parser wishes to save for its next invocation (to be offered to the LAQUEUEINITFN). In lexical analyzers, lookahead queues in particular need to be saved. The fields of the GRAMMAR datatype are as follows: StartSymbol: the nonterminal which is the start symbol of the grammar PRODUCTIONS: This should specify the grammar. The format is ((nonterminal1 ((rhs111 rhs112 ...) semantic-action11) ((rhs121 rhs122 ...) semantic-action12) ...) (nonterminal2 ((rhs211 rhs212 ...) semantic-action21) ((rhs221 rhs222 ...) semantic-action22) ...) ...) So the BNF grammar S ::= S+A | a A ::= aA | a should be expressed to the parser generator as something like ((S ((S + A) (CONS LHS RHS)) ((a) (CONS LHS RHS))) (A ((a A) (CONS LHS RHS)) ((a) (CONS LHS RHS)))) Here S and A are interpreted as nonterminals because they appear on the left hand side of some rule, and a is interpreted as a terminal because it appears only on the right hand side of rules. Nonterminals and terminals in the grammar must be equivalence comparable by EQ. The semantic action associated with a rule will be performed whenever the parser makes a reduction to that rule. The parser will pop from the parse stack the symbols being reduced, and replace them with the result of the semantic action. The following Lisp atoms are defined for the use of the semantic actions: LHS will be bound the left hand side of the reduction rule; RHS will be bound to the symbols popped from the parse stack (instances of the elements of the right hand side of the reduction rule); and EXPECTED will be bound to a list of token classes expected by the parser as a whole. In the example above, the semantic action (CONS LHS RHS) will cause the parser to build a reasonably normal parse tree. As MAKEPARSER constructs the LALR(k)-FSM, it will announce any inadequate states (states where there exists a read-reduce or reduce-reduce conflict), why those states are inadequate, and the number of tokens of lookahead required to resolve the conflict. These announcements should help the grammar designer pinpoint any grammar rules which are causing trouble. The parser-generator is not able to detect grammars that are not LALR(k) for any k. In this case, it will enter an infinite loop, printing messages like 1-level lookahead from state A0033 2-level lookahead from state A0033 3-level lookahead from state A0033 4-level lookahead from state A0033 5-level lookahead from state A0033 6-level lookahead from state A0033 7-level lookahead from state A0033 ... When the grammar designer decides that the depth of lookahead is becoming ludicrous, it is up to him to interrupt the parser generator and use the conflict information printed to rectify his grammar. When MAKEPARSER completes, it returns the name of the Lisp parsing function it has generated. The function consists roughly of a large PROG with the names of the states of the LALR(k)-FSM as labels and GOs to accomplish transitions from state to state. If required, the user can optimize this function by hand somewhat before passing it to the Lisp compiler. The resulting parsing function takes two arguments: a list of expected token classes, and a state-container. The expected token classes argument is for the benefit of the semantic actions employed by this parser. The state-container must be a list whose CAR is the state necessary to this parser, as interpreted by the LAQUEUEINITFN and SAVESTATEFN, and whose CDR is the state-container for the parser's lexical analyzer (i.e., its READFN). The parsing function will return the result of the last semantic action executed. AUTHORS The parser generator is the work of three people. Beau Sheil started the project in 1977. The choice of parser generating algorithm was his, as was the idea of generating a Lisp parser rather than a parse table. Beau wrote the initial implementation. David Snyder did some additional implementation work in the summer of 1983. The current version of the project was completed by Tayloe Stansbury in the fall of 1983 and in February of 1986. His work included fixing a number of bugs, speeding up the parser generation algorithms dramatically, allowing rules with empty right hand sides, adding semantic actions, and making it possible to tailor resulting parsers to a variety of different applications. COMMENTS, BUG REPORTS, ETC. Please address comments, bug reports, and design suggestions to Stansbury.pa@Xerox.COM. (LIST ((PAGE NIL (PAPERSIZE Letter FOLIOINFO (ARABIC ) STARTINGPAGE# 1) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO (ARABIC )) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL))) (PAGE NIL (PAPERSIZE Letter FOLIOINFO (ARABIC )) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO (ARABIC )) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL))) (PAGE NIL (PAPERSIZE Letter FOLIOINFO (ARABIC )) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO (ARABIC )) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL)))))( ) T((8(8D PAGEHEADING RUNNINGHEADTERMINAL MODERN MODERN MODERNLOGOMODERN MODERN  HRULE.GETFNMODERN  HRULE.GETFNMODERN  HRULE.GETFNMODERN  HRULE.GETFNMODERN  HRULE.GETFNMODERN /F\ K"  0  g7W@YWn3G>...->]@kiX&z