NPGSDoc.tioga
Copyright Ó 1987, 1988 by Xerox Corporation. All rights reserved.
Doug Wyatt, April 2, 1984
Rick Beach, March 13, 1987 4:29:20 pm PST
Russ Atkinson (RRA) March 28, 1988 4:58:24 pm PST
JKF February 6, 1989 1:02:50 pm PST
Michael Plass, December 10, 1992 10:44 pm PST
NPGS — NEW PARSER GENERATOR SYSTEM
CEDAR 10.1 — FOR INTERNAL XEROX USE ONLY
NPGS — The New Mesa Parser Generator System
Doug Wyatt, Russ Atkinson
© Copyright 1988, 1992 Xerox Corporation. All rights reserved.
Abstract: The NEW parser generator system (NPGS) is a Mesa program that transforms a LALR(1) grammar written in BNF into a set of parsing tables. The input is a Mesa program with BNF productions given as specialized comments (see Pass1T.pgs from Mimosa for an example). The output is a Cedar/Mesa interface and Cedar/Mesa implementation that describe procedures to be called to generate tables usable by the Cedar/Mesa parser (see MimParserImpl.mesa for an example). This package is intended to replace PGS, which specialized in producing binary tables in bcd files, an outmoded technique. One strong advantage of the new technique is that the table interface and implementation are both in the safe subset, and are written to be portable.
Created by: James Eve
Maintained by: Russ Atkinson <Atkinson:PARC:Xerox>
Keywords: Cedar language, compiler, Mesa language, parsing, PGS, NPGS, BNF, LALR
XEROX Xerox Corporation
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
For Internal Xerox Use Only
The Mesa Parser Generator System
James Eve
0. Caveat
Russ Atkinson converted PGS to NPGS in March 1988 to produce Cedar/Mesa programs rather than binary tables stuffed into bcd files. A preliminary revision of the documentation has taken place, but there may still be errors in the more obscure sections.
Also, the TableCompiler is not included under NPGS.df. The TableCompiler is considered obsolete, since the tables produced are neither safe nor portable.
1. Introduction
[James Eve wrote the bulk of this document in 1979. Doug Wyatt converted it to Tioga form and revised section 6 to describe the operaton of PGS and TableCompiler in Cedar. Much of Section 6.2 was derived from a brief memo by Ed Satterthwaite.]
The parser generator system (PGS) is a Mesa program which takes a context free grammar specified in Backus-Naur form as input and tests whether it is LALR(1); if the LALR(1) condition is satisfied, compacted binary tables are output which can be used in conjunction wth the Mesa parser. Ancillary tables are also output which simplify the writing of lexical routines to recognise the terminal symbols of the grammar. As a primary task of NPGS is to aid development the Mesa system itself a preprocessor is included which aids this.
The LALR(1) parsing algorithm is fairly widely held to be the best practical algorithm currently available; it has good space and time performance properties, it handles a larger subset of the context free grammars than other methods in common use and it allows a good syntax error repair capability to be added. The one snag is that the LALR(1) condition is far from intuitively obvious; checking that the condition holds requires substantial computation and in the event that it does not, a fair knowledge of the underlying theory may be needed to modify a grammar to make it meet the condition. A really good account of the theory has yet to be written, notational problems tend to obscure the underlying ideas. Probably the easiest starting point is the tutorial article "LR Parsing" by Aho and Johnson in Computing Surveys, 6(1974)74, the most comprehensive account readily available is in "The Theory of Parsing, Translating and Compiling" Volume 1, Prentice-Hall (1972), by Aho and Ullman; the algorithms used in NPGS are from Anderson, Eve and Horning, "Efficient LR(1) Parsers", Acta Informatica 2(1973)12 so the terminology here follows the latter paper, which is referenced subsequently as AEH.
2. Using the Parser Generator
In Cedar, NPGS.df is a CedarChest package, and can be retrieved and run using the normal Cedar methodology. Usually, NPGS takes a command line with the following form:
NPGS [defs: Defs, grammar: Grammar] ← Input.pgs
Such a command produces several files:
Defs.mesa -- the definitions for the parse tables
DefsImpl.mesa -- the implementation to produce the parse tables
Grammar.grammar -- the grammar extracted from the input
Input.mesa -- a rewritten (and renumbered) version of Input.pgs
The files Defs.mesa and DefsImpl.mesa restrict themselves to portable Cedar/Mesa in the safe subset. These files define and produce the parse tables that are to be used by a parser (NPGSParser.mesa is an example of such a parser).
If the defs: Defs part is not given, the default name is the name of the type given in the TYPE: option in the Input.pgs file.
For use by non-Cedar programs, NPGS will produce
Defs.mesa and
DefsImpl.mesa without using any Cedar langauge features if the -M switch is given, as in the command:
NPGS [defs: Defs, grammar: Grammar] ← Input.pgs -m
In an effort to reduce the number of PGS tools the CuspPGS features have been incorporated into NPGS. To use the CuspPGS features use the -c switch, as in:
NPGS CuspPGSParserImpl.pgs -mc
3. Format of the Input File.
RRA: Most of this section discusses the format of the file when it is NOT in the form of a Cedar/Mesa program with comments supplying the productions. This section is retained for historical purposes, but the features described may not be fully retained by NPGS. Only the usage supported by the preprocessor (see Section 5 - The Preprocessor) is likely to be supported in the future.
The input file is treated as a sequence of tokens, where a token is defined to be a sequence of characters none of which are in the range [0C..' ]. Tokens are delimited by sequences of characters in this range. In addition to the tokens ::=, |,?,C and the integers which have special roles there are a number of tokens starting with the character pair || which control NPGS.
The input consists of five subsequences,
directives, terminals, nonterminals, aliases and
productions which must appear in this order.
1. The principal directives and their functions are,
||INPUT - causes the input to be echoed to the .echo file,
||CHAIN - causes an optimisation to be performed which speeds up parsing by eliminating all references to productions marked as chain productions,
||LISTS - causes the LALR(1) tables to be compacted and output to the .
binary and .
module files,
RRA: This may be obsolete. No warranty is expressed or implied.
||PRINTLALR - causes a readable form of the LALR(1) tables to be output to the .log file. (For a grammar of about 300 productions these tables will contain about 400,000 characters. Generating them slows NPGS perceptibly.)
2. The terminals are introduced by ||TABLE1 which is followed by tokens representing the terminal symbols of the grammar, the last of which denotes a unique sentence endmarker which will be used only to delimit sentences supplied to the resulting parser. It should not appear in any production. (The scanner invoked by the parser using the tables generated by NPGS is required to map the end of input signal onto this token.)
3. ||TABLE2 similarly introduces the nonterminal symbols of the grammar; that appearing first after ||TABLE2 is taken to be the goal symbol of the grammar. The way that the Mesa parsing algorithm terminates entails a weak constraint that the goal symbol should not appear in the right part of any production nor should any of its productions be designated chain productions.
4. The alias sequence may be empty but if it is not it is introduced by ||TABLE3. The terminal symbols of a grammar do not necessarily have the form of identifiers but lexical or error recovery routines may need to be able to reference them. The alias sequence consists of pairs of tokens the first of which is a terminal symbol (i.e. it appears in the sequence following ||TABLE1), the second is an alias in the form of an identifier by which it can be referenced; appropriate constant definitions are constructed for these identifiers and included in the .module file.
5. Finally the productions are listed in Backus-Naur notation following ||TABLE4. The tokens ::= and | play their usual role; there is no symbol terminating or separating productions, likewise a production deriving the empty string is specified the absence of any token after ::= or | and the succeeding |, end of input or token followed by ::= sequence.
Immediately preceeding the |'s or to the left of a token followed by ::= there will usually be an integer, the rule number, which may itself be preceded by the token C (upper case only). The rule number associates the production with a semantic routine (an arm of a SELECT statement) which is to be invoked by the parsing algorithm whenever a string derived from the associated production is recognised. Some productions, of the form nonterminal ::= nonterminal, have no semantic significance and serve merely to assert that members of one syntactic class are also members of some larger class. Chains of such productions appear in expression grammars (expr ::= term, term ::= factor, factor ::= primary, etc.) and can significantly reduce the speed of parsing. The appearance of the token C is an assertion that the following production is of the specified form and has no semantic processing associated thereby permitting NPGS to eliminate all references to it with an increase in speed of parsing.
The input phase of NPGS uses the Mesa parsing routine and parsing tables of its own construction so there is clearly a grammar specifying the form of the input. The description given above was preferred as an introduction to the input format since it covers only the essentials. NPGS will, on request, echo its input interspersed with other information in a formatted form. As there was a requirement that this output should also be acceptable as input, the grammar allows a rather wider class of input forms but information other than that described is simply discarded during re-input. The grammar appears in the appendix.
It should be clear that the terminal symbols of NPGS grammar cannot be used as tokens in a client grammar, a syntax error would result. This is not likely to be a problem to most clients insofar as the symbols starting with || are concerned (which is why this curious system was adopted) but there are also ::=, |, ?, C and GOAL some of which may well appear in a client grammar. Finally of course NPGS grammar contains all of them. The standard solution is used; if |, ?, or C are required as tokens in a client grammar they must be specified as '|, '? and 'C. Multi-character symbols such as ::= must be specified as "::=". The only special treatment that these quoted symbols receive at the hands of NPGS is in building the tables for use by the scanner; any two character token beginning with a single quote has the quote removed; similarly any token of three or more symbols where the first and last are double quotes has these bracketing quotes stripped.
Occasionally it is convenient to be able to flag certain productions in the input text. The PGS will ignore ? when looking for C or a rule number; the ? should precede C when a rule number is also present.
4. Output of the Parser Generator
Four output files are normally constructed, a record of the input, a log, a definitions file which contains definitions of aliased terminal symbols and some ranges defining the sizes of the various tables, and the implementations file, which contains procedures to quickly build the tables on demand. The definitions and implementations files are written in Cedar/Mesa. These files must be compiled and bound with the Mesa parser and a suitably modified version of the Mesa scanner.
4.1 The Input Record File
This file is produced if the directives in the input stream contain ||INPUT. The name of the file is obtained by appending .echo to the main part of the input file name.
The record of the input differs from the true input in that the directives may be displayed in a different order and the terminal and nonterminal symbols are displayed one to a line each preceded by an integer. (The integers are allocated sequentially starting at one for the first terminal symbol. Each production is displayed starting on a new line; each is preceded by two integers, the second being the rule number from the input. If a C was specified on the input it appears between the two integers. the first integer is simply a unique label for the production. The first production is labelled one and again NPGS simply labels the productions with ascending integers. These labels are used in some of the diagnostic messages output by NPGS. A production with the implicit label zero is constructed and output before the others, it has the form,
GOAL ::= goal eof
where goal stands for the goal symbol and eof for the end of sentence marker. A check during input ensures that these symbols occur to the right of ::= in no other production.
4.2 The Log File
This file contains error messages and various items of supplementary information output during the generation of the parsing tables. If error or warning messages are logged then, immediately prior to the end of execution, the message, "Errors or warnings logged" is displayed followed by the usual invitation to type any character to terminate processing.
4.2.1 Error Messages
Most error messages occur during input of the grammar. Those messages prefixed by the word ERROR cause the program to terminate after completing input and checking for further errors. WARNING messages allow processing to continue. Each message is accompanied by a fragment of input text with a pointer to the current input token.
The warning messages are,
1. Overlength symbol (increase TOKENSIZE?) truncated to -
2. Not a chain production -
3. Unused(U) or undefined(D) symbols (refer to TABLE1 and 2)
These messages illustrate some general points; messages ending with a dash are followed by further information. For message 1, it is the truncated form of the token, for message 2 it is the integer label appended to the offending production in the echoed input. After message 2, processing continues as though no chain indication had been specified.
Messages such as the first which indicate that an internal field size is too small also specify the compile time constant (in pgscondefs.mesa) which controls the field size. Currently tokensize constrains tokens to 14 characters.
The third message occurs at the end of input if there are symbols in the TABLE1 and 2 sequences which do not appear in any production (unused symbols) or if a symbol in the TABLE2 sequence does not appear to the left of ::= in the productions (undefined symbols). This message is followed by a list of integers which designate symbols in TABLE1 and 2 using the numeric labels appended in the echoed input. Each integer is tagged with U and/or D to indicate whether the corresponding symbol is unused or undefined or both.
The ERROR messages are,
4. Nonterminal with too many rules (increase ALTERNATELIM?) -
Only K productions are allowed for any nonterminal (K is given by NPGSTypes.alternateLim); if this is not enough it would be better to introduce a new nonterminal and split them into two groups rather than increase the limit of K.
5. Multiple definitions of symbol -
This message occurs either because the same symbol appears more than once in TABLE1 and 2 or because the same symbol occurs to the left of ::= more than once. The offending symbol is printed after the message.
6. Symbol not defined -
This message also appears in two contexts, either a symbol appears in a production which does not appear in TABLE1 or 2 or alternatively message 3 was issued with undefined symbols. in the first case the message is followed by the symbol in question, in the second by "see previous message".
7. Terminal precedes ::= -
The terminal symbol follows the message.
8. Too many rhs symbols in production (increase RHSLIM?) -
Fifteen symbols in the right part of a production is unlikely to be exceeded; if it is, change the grammar, increasing rhslim would involve consequential changes in the binary table formats.
9. Internal field will overflow - increase PSSLIM
This one is unlikely, it would involve a grammar with 1024 symbols or productions. An increase up to 2047 would be possible without changing the binary table formats.
10. Aliased symbol not a terminal symbol -
The symbol follows the message.
11. Aliases must not be terminal symbols -
The symbol follows the message.
12. Goal symbols used in rhs
The goal symbol or end of sentence marker appear in a production right part. (See the paragraphs numbered 1 and 2 in section 3).
4.2.2 Output during table construction
During the construction of the parsing tables there is a reasonably remote possibility that error message 9 will occur and terminate any further processing, though in this case it implies that more than 1023 parsing states are necessary for the grammar. Much more likely are messages indicating that the grammar is not LALR(1).
In the event of conflicting parsing actions arising for some terminal symbol in a parsing state, all data appertaining to that state is listed. The first heading line specifies,
1. an integer naming the state,
2. a symbol of the grammar (recognition of this symbol causes this state to be entered),
3. a set of p,j pairs defining the state (see AEH), each pair being followed by /.
Below the heading in a four column format is the list of symbols of the grammar which may be encountered in this state. Each symbol is preceded by an encoding of its associated parsing action,
1. unsigned integers denote scan (or shift) entries to the state named by the integer,
2. integers preceded by an asterisk signify reduce operations using the production with the integer as its label,
3. an integer preceded by a minus sign also indicates a production number but implies that the symbol associated with this action must be stacked before the reduce operation; a so-called scanreduce operation. (These marking conventions differ from those used in AEH though they use the same symbols.)
During construction of the tables scan entries are computed before reduce entries, so when conflicting actions arise they are reduce actions which either conflict with an existing scan entry or with a previously computed reduce entry. (It is an inherent property of scanreduce entries that they cannot conflict with another entry.) Conflicts are indicated by lines of the form,
Reduce with n conflicts with **********
where n is a production number. Each such line is followed by a list of items of the form, symbol action /, where action is either SCAN if a scan action for this (terminal) symbol has already been constructed or is an integer naming the production for which a reduce action has already been constructed.
If the directive ||PRINTLALR is specified, the output just described occurs for every state whether or not it contains conflicts. The heading, LALR(1) Tables, precedes such output. This output is rarely worth having, it is occasionally of value in tracking down a conflict. Its primary function was in debugging NPGS.
The PGS discards conflicting entries after printing them and it will not form either the binary tables or the module output if there are reduce-reduce conflicts. In the case of reduce-scan conflicts the decision to process scan entries first implies that the scan entry takes precedence. This is occasionally useful. For example it solves the dangling else problem in the preferred way. However, the scan-reduce conflict message is a warning and as such triggers the displayed message directing attention to the .log file.
After generating the tables a heading, LALR(1) Statistics, is output and a few counts are printed. Only the first three may be of any general interest, they indicate the number of parsing states and the total number of actions in the tables for both terminal and nonterminal symbols.
4.2.3 Output during table compaction
Table compaction and output of the binary tables only occur if the directive ||LISTS is specified in which case the output described here appears on the .log file.
The earlier stages of NPGS are written in a general way, data structures will expand to accomodate very large grammars; at the cost of recompiling the system and changing compile time constants, the limits on field sizes mentioned previously can be increased. At this stage the objective is to pack information economically into 16-bit words and it is here that the size of fields is an absolute constraint. Final checks are made which could conceivably produce one or more of the self explanatory messages
FAIL - more than 255 terminal symbols
FAIL - more than 254 nonterminal symbols
FAIL - more than 2047 states or productions
FAIL - more than 15 symbols in a production right part
These are rather unlikely. If any of these checks fail processing terminates.
Assuming no error messages, the only unsolicited output here is a heading, Parse Table Data, and one table. A hash accessed look-up table, for the terminal symbols of the grammar, is created for use by the scanner. As hash functions are notoriously unreliable, the following is printed so that a visual check can be applied to ensure no disasters. The subheading,
Scanner hashtable probe counts (terminal symbol, probe count, hashcode)
followed by a four column layout of triples which as the heading indicates show for each symbol the number of probes needed to locate it in the hash table and its hashcode. The technique used is Algorithm C, page 514, Vol 3 of Knuth's "The Art of Computer Programming", Addison-Wesley(1973). If there are
n terminal symbols, they are hashed into a table of using MIN[
m,251] buckets;
m is always an odd integer, it is either 3
*
n/2 or 3
*
n/2+1. The hash function uses the
ASCII values of the first and last character of the symbol and is
((127*first character+last character) MOD buckets) + 1.
One can expect the performance of this hash function to deteriorate as the number of terminal symbols approaches 255.
If the directive ||PRINTLALR is specified as well as ||LISTS, a record of the table compaction transformations is produced. This record is only of interest for maintenance of the system and familiarity with the compaction techniques described in AEH is assumed in its description.
First, a set of default actions for the nonterminal symbols of the grammar are determined and a table headed, Nonterminal Default Actions, is printed. Each nonterminal symbol appears preceded by its associated default action encoded in the form already described, that is, unsigned integers represent scan entries and negative integers represent scan-reduce actions. (Reduce actions never take place on nonterminal symbols.)
After removing all occurrences of these defaulted entries from the LALR(1) tables, NPGS determines those states which have identical symbol-action pairs, first, over the set of terminal symbols and then independently, over the set of nonterminal symbols. All such states reference one copy of the list of symbol-action pairs stored in the binary tables. The table, Table Overlays, has three columns headed row, ttab and ntab; if a row of this table contains (integer) entries a, b and c respectively then it implies that the terminal entries of state a are the same as those of state b, while the nonterminal entries of state a are the same as those of state c. It is exceptional for both the terminal and nonterminal entries of a state to match those of other states so usually one of the entries b or c is blank. If neither the terminal nor nonterminal entries of a state match those of another state then it does not appear in this table.
The final transformation is to renumber the states so that all of those states containing (nondefaulted) actions on nonterminal symbols are labelled by integers contiguous to eachother and to 1. This is acheived by swapping the highest numbered state with nonterminal actions with the lowest numbered state without nonterminal actions until no more swaps are possible. The table headed, Renumbered States, simply records this with entries of the form, a swapped with b.
4.3 The Definitions File
This is most readily illustrated with an example; it happens to be the definitions file generated by NPGS for its own grammar.
NPGSParseTable: CEDAR DEFINITIONS = {
Symbol: TYPE = CARDINAL[0..255];
TSymbol: TYPE = Symbol[0..19];
NTSymbol: TYPE = Symbol[0..13];
token indices for the scanner and parser
tokenID: TSymbol = 1;
tokenNUM: TSymbol = 2;
tokenQUERY: TSymbol = 3;
tokenTAB3: TSymbol = 9;
tokenTAB4: TSymbol = 10;
initialSymbol: TSymbol = 3;
defaultMarker: TSymbol = TSymbol.FIRST;
endMarker: TSymbol = TSymbol.LAST;
HashIndex: TYPE = CARDINAL[0..29];
VIndex: TYPE = CARDINAL[0..106];
VocabHashEntry:
TYPE =
MACHINE
DEPENDENT
RECORD [
symbol(0: 00..07): TSymbol, -- symbol index
link (0: 08..15): HashIndex]; -- link to next entry
State: TYPE = CARDINAL[0..26];
NTState: TYPE = State[0..6];
TIndex: TYPE = CARDINAL[0..64];
NTIndex: TYPE = CARDINAL[0..3];
Production: TYPE = CARDINAL[0..37];
Rule: TYPE = CARDINAL[0..28];
ActionTag:
TYPE =
MACHINE
DEPENDENT
RECORD [
reduce(0: 0..0): BOOL, -- TRUE iff reduce entry
pLength(0: 1..4): [0..15]]; -- number of symbols in production rhs
ActionEntry:
TYPE =
MACHINE
DEPENDENT
RECORD [
tag(0: 0..4): ActionTag, -- [FALSE,0] if a shift entry
transition(0: 5..15): [0..2047]]; -- production number / next state
ProductionInfo:
TYPE =
MACHINE
DEPENDENT
RECORD [
rule(0: 00..07): Rule, -- reduction rule
lhs (0: 08..15): NTSymbol]; -- production lhs symbol
ScanTable: TYPE = ARRAY CHAR['\040..'\177] OF TSymbol;
HashTable: TYPE = ARRAY HashIndex OF VocabHashEntry;
IndexTable: TYPE = ARRAY TSymbol OF CARDINAL;
Vocabulary: TYPE = TEXT;
ProdData: TYPE = ARRAY Production OF ProductionInfo;
NStarts: TYPE = ARRAY NTState OF NTIndex;
NLengths: TYPE = ARRAY NTState OF CARDINAL;
NSymbols: TYPE = ARRAY NTIndex OF NTSymbol;
NActions: TYPE = ARRAY NTIndex OF ActionEntry;
NTDefaults: TYPE = ARRAY NTSymbol OF ActionEntry;
TStarts: TYPE = ARRAY State OF TIndex;
TLengths: TYPE = ARRAY State OF CARDINAL;
TSymbols: TYPE = ARRAY TIndex OF TSymbol;
TActions: TYPE = ARRAY TIndex OF ActionEntry;
initialState: State = 1;
finalState: State = 0;
ScanTableRef: TYPE = REF ScanTable;
InitScanTable: PROC RETURNS [ScanTableRef];
HashTableRef: TYPE = REF HashTable;
InitHashTable: PROC RETURNS [HashTableRef];
IndexTableRef: TYPE = REF IndexTable;
InitIndexTable: PROC RETURNS [IndexTableRef];
VocabularyRef: TYPE = REF Vocabulary;
InitVocabulary: PROC RETURNS [VocabularyRef];
ProdDataRef: TYPE = REF ProdData;
InitProdData: PROC RETURNS [ProdDataRef];
NStartsRef: TYPE = REF NStarts;
InitNStarts: PROC RETURNS [NStartsRef];
NLengthsRef: TYPE = REF NLengths;
InitNLengths: PROC RETURNS [NLengthsRef];
NSymbolsRef: TYPE = REF NSymbols;
InitNSymbols: PROC RETURNS [NSymbolsRef];
NActionsRef: TYPE = REF NActions;
InitNActions: PROC RETURNS [NActionsRef];
NTDefaultsRef: TYPE = REF NTDefaults;
InitNTDefaults: PROC RETURNS [NTDefaultsRef];
TStartsRef: TYPE = REF TStarts;
InitTStarts: PROC RETURNS [TStartsRef];
TLengthsRef: TYPE = REF TLengths;
InitTLengths: PROC RETURNS [TLengthsRef];
TSymbolsRef: TYPE = REF TSymbols;
InitTSymbols: PROC RETURNS [TSymbolsRef];
TActionsRef: TYPE = REF TActions;
InitTActions: PROC RETURNS [TActionsRef];
}.
The above file defines aliases using the information given in the aliases segment of the input. For example the token ||TABLE3, which is a terminal symbol of NPGS grammar was given the alias tokenTAB3, it was the ninth token in the sequence of terminal symbols in the input file and so internally within NPGS and the binary tables produced it is encoded as 9.
The ranges, HashIndex, TSymbol, NTSymbol, State, NTState, TIndex, NTIndex, Production and VIndex prescribe the dimensions of arrays in the binary tables.
The purpose and content of the arrays in parseTable are explained in AEH; only the definitions relevant to the scanner are discussed here. Terminal symbols of the grammar represented by a single ASCII character are treated separately from those requiring a string of characters. In scanTable there is an array scanTab which can be indexed by characters not in the range [0C..' ]; any single character symbol used to extract an element of this array will extract a non zero integer only if it represents a terminal symbol and the integer is its numeric encoding.
The string vocabBody contains the character strings representing all other valid terminals stored head to tail. Element i-1 of vocabIndex indexes the first character of the string in vocabBody which represents the terminal symbol with the encoding i. Given a string which purports to be a terminal symbol its hash value can be computed using the hash function given in section 4.2.3 (identifying buckets there with LAST[HashIndex]). The hash value can be used to select an element from the array hashTab; the elements of hashTab are records, one field of which is used to select an entry of vocabIndex (to find the substring in vocabBody to compare with the given string), the other field is an index to another element in hashTab to be tried when the string comparison fails; hashTab[0] is void, a zero index terminates the search indicating that the given string does not represent a terminal.
4.5 The LR and First Files
As part of the debugging facilities built into NPGS two other output files are created on request.
The first step in building the LALR(1) tables is to construct LR(0) tables. (This is done using the SLR(1) algorithm in section 3.1 of the AEH paper by omitting the computation of the sets specified in relation (5b).) The LR(0) tables may be output to a file with the extension .lr by specifying the directive ||PRINTLR in the input; the form of the output is similar to that used in the LALR(1) tables but of course the terminal symbols triggering reduce actions are not known so a line
Reduce with p
follows the state heading if the production with label p has reduce actions in the state.
The next stage is to compute lookahead symbols for all these incompletely determined reduce actions according to the LALR(1) rules. This is done using Anderson's bewilderingly succinctly stated algorithm at the foot of page 21 and top of page 22 in AEH. It is convenient, as a preliminary to this, to compute, for each nonterminal symbol, the set of terminal symbols which can appear as the first symbol in a string derived from the nonterminal. This is a transitive closure calculation. It provides the initial data for computing Anderson's exact right contexts which is yet another transitive closure calculation.
If the directive ||FIRST appears among the input directives in addition to either of the directives ||PRINTLALR or ||LISTS, a file with extension name .first is created which contains a list of all nonterminal symbols each being followed by an (offset) list of the terminals which can start a string derived from it.
5. The Preprocessor
The preprocessor is invoked if the input file name has the extension .pgs. Its function arises from the following considerations. For maintenance purposes it is more or less essential that each arm of the SELECT statement implementing the semantic processing routines in the Mesa compiler has comments displaying those productions associated with this arm. The test preceding the => symbol in the arm is the rule number for these productions. As Mesa evolves, changes are made to the grammar, productions associated with one arm must be moved to another, new productions and new arms are introduced and periodically it is necessary to renumber the rules in a systematic fashion in order to find things (there being being over 200 arms). The bookkeeping necessary to ensure that the story told by the SELECT statement is consistent with that told to NPGS in the input file is tiresome and error prone. Most of the data needed by NPGS is present in the SELECT statement and by adding the rest only one copy of the grammar need be maintained and the reassignment of rule numbers can be mechanised.
The preprocessor expects a Mesa program module as input and it scans for the sequence
digits => --
after which it expects to find data relevent to it. On encountering CR, or Bravo's ControlZ ... CR sequence, it checks whether the next printing characters open a Mesa comment or not; if they do then more data is expected otherwise the end of the data associated with a particular select arm is presumed and a search is instituted for the next.
Supposing the input file name given was
George.pgs, then during the scan the preprocessor copies a modified copy of the file to
George.mesa. The modification is a trivial one; as the sequences,
digits => --
are encountered, the next integer in the sequence 0,1,2,... is substituted before the => symbol.
The rather crude procedure used to locate the grammatical information in the program text places constraints on the program module containing it. In particular it precludes comments in a fairly natural place in any other SELECT statements that may appear in the module which also use integer tests. On the other hand anything approaching parsing the text is out since it merely replaces one updating problem with another.
Since NPGS constructs tables using the new rule numbers just assigned, arms of the SELECT statement can be reordered to group logically related arms together without making the search for an arm with a given rule number tedious.
In the comments associated with SELECT arms (other than the first encountered) the preprocessor expects to find only productions. Here each production is specified in full, its left part token followed by either ::= or (to designate a chain production) ::=C followed by the right part tokens.
Comments preceding the first production contain the additional information needed. This information in order of occurrence is,
1. In any order, the tokens TABLE:, TYPE:, or EXPORTS: may appear, each followed by a simple identifier. Of these, only the TYPE: token is still supported in NPGS. The other are accepted for compatibility, but have no effect. The name following the TYPE: token is the root name of the definitions file. The root name of the implementation file is always the root name of the definitions file followed by Impl, and the extension is always .mesa.
2. Next must appear GOAL: followed by the token naming the nonterminal which is the goal symbol of the grammar,
3. Optionally there may appear TERMINALS: followed by the tokens representing the terminals. The end of sentence marker should not be included, eof is supplied,
4. Optionally there may appear ALIASES: followed by the alias sequence as described in section 3, paragraph 4,
5. Finally PRODUCTIONS: must appear before the first production.
Nonterminal symbols are deduced from the tokens to the left of ::= tokens. If the TERMINALS: sequence appears only these symbols are taken to be terminals. In its absence any token in a production which is not a nonterminal according to the previous definition is a terminal. Omitting this sequence means that typing errors define terminal symbols!
From a file of this form (see the example at the end of the appendix), the preprocessor constructs a scratch file in the format specified in section 3, with the directives ||INPUT, ||CHAIN and ||LISTS, which it passes to the input phase of NPGS.
The preprocessor only generates one error message, "Directives incorrect or out of sequence". No further processing occurs in this situation so the message is displayed, followed by the request to type a character to terminate execution.
When inserting new arms in the SELECT statement there is no need (so far as the preprocessor is concerned) to use an integer distinct from those on other arms but it is probably not a good idea. The preprocessor will recognise ? as an alternative to a digit sequence.
6. Operation
6.1 Historical notes
The mesa definitions files for NPGS are (* => produced by NPGS):
NPGSParseTable.mesa, the parse table definitions *
NPGS1.mesa, the scanner-parser interface
NPGSOps.mesa, a trivial interface for calling NPGS
NPGSTypes.mesa, the internal types for NPGS
NPGSConDefs.mesa, the internal procedure interface for NPGS
The mesa implementation files for NPGS are (* => produced by NPGS):
NPGSParseTableImpl.mesa, the parse table implementation *
NPGSScan.mesa, the scanner and semantic processing routines *
NPGSControl.mesa, the control module
NPGSFormat.mesa, the preprocessor
NPGSInterface.mesa, the internal types for NPGS
NPGSLALR.mesa, the LALR(1) table generator
NPGSParser.mesa, the Mesa parser
NPGSTab.mesa, the compacted table constructor
To make NPGS, follow the directions in MakeNPGS.cm.
The module NPGSConDefs interfaces NPGS to the Mesa system providing I/O and storage allocation routines for the other modules; as a consequence only PGSConDefs and PGSControl (which implements these facilities) reference Mesa system modules.
6.2 NPGS Operation
NPGS command line parsing has been revised to handle module identifiers, file names, etc., in the currently approved way and to allow easier parameterization with respect to long or short pointers. The basic idea was to make NPGS more like the compiler and binder in these areas (to avoid the current file name hassles and to make NPGS more usable by the system modeller).
Processing Modes
Input: Mesa vs. grammar. NPGS can extract the information needed to build parsing tables from comments embedded within standard Mesa source files. Although the above documents this input mode almost as an afterthought, it has become the standard one. The conventional name for the input file in this mode is Input.pgs. The grammar mode has been retained for its occasional utility in experimenting with grammars. See the Appendix.
Output: Two Cedar/Mesa files are produced: a definitions file that defines the tables, and an implementation file that provides procedures to produce the tables on request.
Mesa Programs as NPGS Source Files
The list of keywords that optionally precede the first production has been revised and expanded as follows:
TABLE: {note: obsolete, do not use}
TYPE:
EXPORTS:
GOAL:
TERMINALS:
ALIASES:
PRODUCTIONS
The first three must precede all the others but may occur in any order; the next section explains their significance. The last four must appear in the specified order; all but the last may be omitted.
Output Identification
In the source text, the old keywords dealing with file and module names are replaced by
TABLE: tableId TYPE: typeId EXPORTS: interfaceId -- or EXPORTS: SELF
This is supposed to remind you of
programId: <ProgramType> EXPORTS interfaceId
(sorry about the different treatment of colons, etc.). The names tableId, typeId and interfaceId are module identifiers (capitalization counts) and get put into BCDs, symbol tables, etc.
Appendix
An Input File
||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1
id + ( ) * IF THEN OR ELSE := EOF
||TABLE2
g s a i e t p b l
||TABLE3
+ tokenPLUS
||TABLE4
1 g ::= s
C s ::= a
C | i
2 a ::= id := e
3 i ::= IF b THEN a l
C e ::= t
4 | e + t
C t ::= p
5 | t * p
6 p ::= ( e )
7 | id
8 b ::= b OR id
9 | id
10 l ::=
11 | ELSE s
The Resulting Module File
ParseTable: DEFINITIONS =
BEGIN -- types for data structures used by the Mesa parser and scanner
Symbol: TYPE = [0..255];
-- token indices for the scanner and parser
tokenPLUS: TSymbol = 2;
HashIndex: TYPE = [0..17];
VIndex: TYPE = [0..22];
TSymbol: TYPE = Symbol [0..11];
NTSymbol: TYPE = Symbol [0..10];
State: TYPE = [0..17];
NTState: TYPE = State [0..5];
TIndex: TYPE = [0..23];
NTIndex: TYPE = [0..9];
Production: TYPE = [0..15];
END.
The Resulting Echo File
||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1
1 id
2 +
3 (
4 )
5 *
6 IF
7 THEN
8 OR
9 ELSE
10 :=
11 EOF
||TABLE2
12 g
13 s
14 a
15 i
16 e
17 t
18 p
19 b
20 l
||TABLE3
+ tokenPLUS
||TABLE4
GOAL ::= g EOF
1 1 g ::= s
2 C 2 s ::= a
3 C 3 | i
4 2 a ::= id := e
5 3 i ::= IF b THEN a l
6 C 6 e ::= t
7 4 | e + t
8 C 8 t ::= p
9 5 | t * p
10 6 p ::= ( e )
11 7 | id
12 8 b ::= b OR id
13 9 | id
14 10 l ::=
15 11 | ELSE s
The Resulting Log File
LALR(1) Tables
1 0 0/
2 id 3 IF 0 g -1 s
-1 a -1 i
2 id 4 1/
4 :=
3 IF 5 1/
-13 id 5 b
4 := 4 2/
-11 id 6 ( 7 e 8 t
8 p
5 b 5 2/ 12 1/
9 THEN 10 OR
6 ( 10 1/
-11 id 6 ( 11 e 12 t
12 p
7 e 4 3/ 7 1/
13 + *4 ELSE *4 EOF
8 e 4 3/ 7 1/ 9 1/
13 + 14 * *4 ELSE *4 EOF
9 THEN 5 3/
2 id 15 a
10 OR 12 2/
-12 id
11 e 7 1/ 10 2/
13 + -10 )
12 e 7 1/ 9 1/ 10 2/
13 + -10 ) 14 *
13 + 7 2/
-11 id 6 ( 16 t 16 p
14 * 9 2/
-11 id 6 ( -9 p
15 a 5 4/
17 ELSE *14 EOF -5 l
16 t 7 3/ 9 1/
*7 + *7 ) 14 * *7 ELSE
*7 EOF
17 ELSE 15 1/
2 id 3 IF -15 s -15 a
-15 i
LALR(1) Statistics
States = 17
Terminal entries = 37
Nonterminal entries = 19
First OR operation count = 5
Total OR operation count = 33
Maximum number of contexts = 9
Parse Table Data
Nonterminal Default Actions
0 g -1 s 15 a -1 i
7 e 8 t 8 p 5 b
-5 l
Entries removed = 0
Table Overlays
row ttab ntab
6 4
13 4
14 4
17 1
Scanner hashtable probe counts (terminal symbol, probecount, hashcode)
id 1 6 IF 1 9 THEN 1 3 OR 1 1
ELSE 1 10 := 1 16
Renumbered States
2 swapped with 17
3 swapped with 14
4 swapped with 13
5 swapped with 6
The PGS Grammar
||CHAIN ||LISTS ||INPUT
||TABLE1
1 symbol
2 num
3 '?
4 '|
5 "::="
6 'C
7 "||TABLE1"
8 "||TABLE2"
9 "||TABLE3"
10 "||TABLE4"
11 "||INPUT"
12 "||CHAIN"
13 "||LISTS"
14 "||PRINTLR"
15 "||PRINTLALR"
16 "||FIRST"
17 "||IDS"
18 "GOAL"
19 eof
||TABLE2
20 grammar
21 head
22 ruleset
23 directives
24 terminals
25 nonterminals
26 aliases
27 directive
28 discard
29 rulegroup
30 prefix
31 goalrule
||TABLE3
symbol tokenID
num tokenNUM
'? tokenQUERY
"||TABLE3" tokenTAB3
"||TABLE4" tokenTAB4
'? InitialSymbol
||TABLE4
GOAL ::= grammar eof
1 0 grammar ::= '? head ruleset
2 1 head ::= directives terminals nonterminals "||TABLE4"
3 1 | directives terminals nonterminals aliases "||TABLE4"
4 2 directives ::=
5 28 | directives directive
6 3 directive ::= "||INPUT"
7 4 | "||CHAIN"
8 5 | "||LISTS"
9 6 | "||PRINTLR"
10 7 | "||PRINTLALR"
11 8 | "||FIRST"
12 9 | "||IDS"
13 10 terminals ::= "||TABLE1"
14 11 | terminals discard symbol
15 11 nonterminals ::= nonterminals discard symbol
16 12 | "||TABLE2"
17 13 aliases ::= "||TABLE3"
18 14 | aliases symbol symbol
19 15 discard ::=
20 28 | num
21 28 | '?
22 16 rulegroup ::= symbol "::="
23 17 | prefix symbol "::="
24 18 | rulegroup symbol "::="
25 19 | rulegroup prefix symbol "::="
26 20 | rulegroup '|
27 21 | rulegroup prefix '|
28 22 | rulegroup symbol
29 23 prefix ::= num
30 24 | num num
31 24 | '? num
32 25 | discard 'C
33 26 | discard 'C num
34 27 | '?
35 C 28 ruleset ::= rulegroup
36 28 | goalrule rulegroup
37 28 goalrule ::= "GOAL" "::=" symbol symbol
An Input File For the Preprocessor
Program text has been stripped from within and around the SELECT statement implementing the semantic routines of NPGS to expose the grammatical information.
SELECT prodData[q[qj].transition].rule FROM
0 => -- MODULE: pgsptabdefsnew.mesa BINARY: pgsnew.binary
-- GOAL: grammar
-- TERMINALS: symbol num '? '| "::=" 'C "||TABLE1" "||TABLE2" "||TABLE3"
-- "||TABLE4" "||INPUT" "||CHAIN" "||LISTS" "||PRINTLR"
-- "||PRINTLALR" "||FIRST" "||IDS" "GOAL"
-- ALIASES: symbol tokenID num tokenNUM '? tokenQUERY
-- "||TABLE3" tokenTAB3 "||TABLE4" tokenTAB4 '? InitialSymbol
-- PRODUCTIONS:
-- grammar ::= '? head ruleset
BEGIN
END;
1 => -- head ::= directives terminals nonterminals "||TABLE4"
-- head ::= directives terminals nonterminals aliases "||TABLE4"
BEGIN
END;
2 => -- directives ::=
BEGIN
END;
3 => -- directive ::= "||INPUT"
BEGIN
END;
4 => -- directive ::= "||CHAIN"
flags[chain] ← TRUE;
5 => -- directive ::= "||LISTS"
flags[lists] ← TRUE;
6 => -- directive ::= "||PRINTLR"
flags[printlr] ← TRUE;
7 => -- directive ::= "||PRINTLALR"
flags[printlalr] ← TRUE;
8 => -- directive ::= "||FIRST"
flags[first] ← TRUE;
9 => -- directive ::= "||IDS"
flags[ids] ← TRUE;
10 => -- terminals ::= "||TABLE1"
BEGIN
END;
11 => -- terminals ::= terminals discard symbol
-- nonterminals ::= nonterminals discard symbol
BEGIN
END;
12 => -- nonterminals ::= "||TABLE2"
BEGIN
END;
13 => -- aliases ::= "||TABLE3"
BEGIN
END;
14 => -- aliases ::= aliases symbol symbol
BEGIN
END;
15 => -- discard ::=
l[top] ← InputLoc[]; -- keep the parser error recovery happy
16 => -- rulegroup ::= symbol "::="
BEGIN
END;
17 => -- rulegroup ::= prefix symbol "::="
lhssymbol[v[top+1]];
18 => -- rulegroup ::= rulegroup symbol "::="
BEGIN
END;
19 => -- rulegroup ::= rulegroup prefix symbol "::="
lhssymbol[v[top+2]];
20 => -- rulegroup ::= rulegroup '|
BEGIN
END;
21 => -- rulegroup ::= rulegroup prefix '|
prodheader[FALSE];
22 => -- rulegroup ::= rulegroup symbol
BEGIN
END;
23 => -- prefix ::= num
setrulechain[v[top], FALSE];
24 => -- prefix ::= num num
-- prefix ::= '? num
setrulechain[v[top+1], FALSE];
25 => -- prefix ::= discard 'C
setrulechain[prix, TRUE];
26 => -- prefix ::= discard 'C num
setrulechain[v[top+2], TRUE];
27 => -- prefix ::= '?
setrulechain[prix, FALSE];
28 => -- directives ::= directives directive
-- discard ::= num
-- discard ::= '?
-- ruleset ::=C rulegroup
-- ruleset ::= goalrule rulegroup
-- goalrule ::= "GOAL" "::=" symbol symbol
NULL;
ENDCASE => ERROR;