Inter-Office Memorandum

To: PGS Users          Date: April 2, 1984
From: Doug Wyatt       Location: Palo Alto
Subject: Using PGS     Organization: PARC/CSL

XEROX

Filed on: [Indigo]Documentation>PGS.tioga

DRAFT

The Mesa Parser Generator System
James Eve

1. Introduction

[James Eve wrote the bulk of this document in 1979. Doug Wyatt converted it to Tioga form and revised section 6 to describe the operation of PGS and TableCompiler in Cedar. Much of Section 6.2 was derived from a brief memo by Ed Satterthwaite.]

The parser generator system (PGS) is a Mesa program which takes a context-free grammar specified in Backus-Naur form as input and tests whether it is LALR(1); if the LALR(1) condition is satisfied, compacted binary tables are output which can be used in conjunction with the Mesa parser. Ancillary tables are also output which simplify the writing of lexical routines to recognise the terminal symbols of the grammar. Since a primary task of the PGS is to aid development of the Mesa system itself, a preprocessor is included for that purpose.

The LALR(1) parsing algorithm is fairly widely held to be the best practical algorithm currently available: it has good space and time performance properties, it handles a larger subset of the context-free grammars than other methods in common use, and it allows a good syntax error repair capability to be added. The one snag is that the LALR(1) condition is far from intuitively obvious; checking that the condition holds requires substantial computation, and in the event that it does not, a fair knowledge of the underlying theory may be needed to modify a grammar to make it meet the condition. A really good account of the theory has yet to be written; notational problems tend to obscure the underlying ideas.
Probably the easiest starting point is the tutorial article "LR Parsing" by Aho and Johnson in Computing Surveys, 6(1974)74. The most comprehensive account readily available is in "The Theory of Parsing, Translation, and Compiling", Volume 1, Prentice-Hall (1972), by Aho and Ullman. The algorithms used in the PGS are from Anderson, Eve and Horning, "Efficient LR(1) Parsers", Acta Informatica 2(1973)12, so the terminology here follows the latter paper, which is referenced subsequently as AEH.

2. Using the Parser Generator

Given an Alto disc with the PGS installed, issuing the command pgs will invoke it under the Mesa Executive. It prompts for an input file name. The input file name implicitly defines the names of the various output files: the main part of the input file name is extended by .echo, .log, .binary and .module to form the names of the primary output files. However, input files with extension .mesa or .Mesa are an exception; they are discussed in section 5. Input and output are discussed in sections 3 and 4; listings of an example input file and the resulting .module, .echo and .log files are given in the appendix. Installing the system and assembling it from its components are described in section 6.

3. Format of the Input File

The input file is treated as a sequence of tokens, where a token is defined to be a sequence of characters none of which are in the range [0C..' ]. Tokens are delimited by sequences of characters in this range. In addition to the tokens ::=, |, ?, C and the integers, which have special roles, there are a number of tokens starting with the character pair || which control the PGS. The input consists of five subsequences (directives, terminals, nonterminals, aliases and productions) which must appear in this order.

1.
The principal directives and their functions are:

||INPUT - causes the input to be echoed to the .echo file.
||CHAIN - causes an optimisation to be performed which speeds up parsing by eliminating all references to productions marked as chain productions.
||LISTS - causes the LALR(1) tables to be compacted and output to the .binary and .module files.
||PRINTLALR - causes a readable form of the LALR(1) tables to be output to the .log file. (For a grammar of about 300 productions these tables will contain about 400,000 characters. Generating them slows the PGS perceptibly.)

2. The terminals are introduced by ||TABLE1, which is followed by tokens representing the terminal symbols of the grammar; the last of these denotes a unique sentence endmarker which will be used only to delimit sentences supplied to the resulting parser. It should not appear in any production. (The scanner invoked by the parser using the tables generated by the PGS is required to map the end of input signal onto this token.)

3. ||TABLE2 similarly introduces the nonterminal symbols of the grammar; the one appearing first after ||TABLE2 is taken to be the goal symbol of the grammar. The way that the Mesa parsing algorithm terminates entails a weak constraint: the goal symbol should not appear in the right part of any production, nor should any of its productions be designated chain productions.

4. The alias sequence may be empty, but if it is not it is introduced by ||TABLE3. The terminal symbols of a grammar do not necessarily have the form of identifiers, but lexical or error recovery routines may need to be able to reference them. The alias sequence consists of pairs of tokens: the first is a terminal symbol (i.e. it appears in the sequence following ||TABLE1); the second is an alias in the form of an identifier by which it can be referenced. Appropriate constant definitions are constructed for these identifiers and included in the .module file.

5.
Finally, the productions are listed in Backus-Naur notation following ||TABLE4. The tokens ::= and | play their usual role; there is no symbol terminating or separating productions. Likewise, a production deriving the empty string is specified by the absence of any token between ::= or | and the next |, the end of input, or a token followed by ::=. Immediately preceding the |'s, or to the left of a token followed by ::=, there will usually be an integer, the rule number, which may itself be preceded by the token C (upper case only). The rule number associates the production with a semantic routine (an arm of a SELECT statement) which is to be invoked by the parsing algorithm whenever a string derived from the associated production is recognised.

Some productions, of the form nonterminal ::= nonterminal, have no semantic significance and serve merely to assert that members of one syntactic class are also members of some larger class. Chains of such productions appear in expression grammars (expr ::= term, term ::= factor, factor ::= primary, etc.) and can significantly reduce the speed of parsing. The appearance of the token C is an assertion that the following production is of this form and has no semantic processing associated with it, thereby permitting the PGS to eliminate all references to it, with an increase in parsing speed.

The input phase of the PGS uses the Mesa parsing routine and parsing tables of its own construction, so there is clearly a grammar specifying the form of the input. The description given above was preferred as an introduction to the input format since it covers only the essentials. The PGS will, on request, echo its input interspersed with other information in a formatted form. As there was a requirement that this output should also be acceptable as input, the grammar allows a rather wider class of input forms, but information other than that described is simply discarded during re-input. The grammar appears in the appendix.
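To make the layout concrete, a minimal input file for a toy grammar might look like the following sketch. The grammar, alias name and spacing are invented for illustration (tokens are simply whitespace-delimited, so the line breaks are arbitrary); the real example appears in the appendix.

```
||INPUT ||LISTS
||TABLE1  id  +  eof
||TABLE2  goal  expr
||TABLE3  +  tokenPLUS
||TABLE4
goal ::= expr 1
expr ::= expr + id 2
expr ::= id 3
```

Here eof is the sentence endmarker, goal is the goal symbol, and each rule number precedes the next production's left part, so rule 1 attaches to goal ::= expr, and so on.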
It should be clear that the terminal symbols of the PGS grammar cannot be used as tokens in a client grammar; a syntax error would result. This is not likely to be a problem for most clients insofar as the symbols starting with || are concerned (which is why this curious system was adopted), but there are also ::=, |, ?, C and GOAL, some of which may well appear in a client grammar. Finally, of course, the PGS grammar contains all of them. The standard solution is used: if |, ? or C are required as tokens in a client grammar they must be specified as '|, '? and 'C. Multi-character symbols such as ::= must be specified as "::=". The only special treatment that these quoted symbols receive at the hands of the PGS is in building the tables for use by the scanner: any two-character token beginning with a single quote has the quote removed; similarly, any token of three or more characters where the first and last are double quotes has these bracketing quotes stripped.

Occasionally it is convenient to be able to flag certain productions in the input text. The PGS will ignore ? when looking for C or a rule number; the ? should precede C when a rule number is also present.

4. Output of the Parser Generator

Four output files are normally constructed: a record of the input, a log, a binary output file containing the parser and lexical tables, and a module file which contains definitions of aliased terminal symbols and some ranges defining the sizes of the various arrays constituting the binary file. The module file is a Mesa module with the name ParseTable. It must be compiled and bound with the Mesa parser, a suitably modified version of the Mesa scanner, and the definitions module which describes the invariant parts of the binary parse tables. Examples of these latter modules exist in the PGS.
The files pgsptabdefs.mesa, pgsscan.mesa, pgsparse.mesa and pgs1.mesa contain respectively ParseTable, the scanner and semantic processing routines for the PGS, the Mesa parser and, finally, definitions of the invariant part of the binary tables. For operational reasons, the low-level routines interfacing with I/O, storage management, etc. have been removed from pgsparse.mesa and pgsscan.mesa to the control module of the PGS, which is in the file pgscon.mesa. Nonetheless these modules should provide a model for anyone needing one. In particular, the code for loading and unloading the binary tables and invoking the parser can be found in the main line code of the module PGScon.

4.1 The Input Record File

This file is produced if the directives in the input stream contain ||INPUT. The name of the file is obtained by appending .echo to the main part of the input file name. The record of the input differs from the true input in that the directives may be displayed in a different order and the terminal and nonterminal symbols are displayed one to a line, each preceded by an integer. (The integers are allocated sequentially, starting at one for the first terminal symbol.) Each production is displayed starting on a new line; each is preceded by two integers, the second being the rule number from the input. If a C was specified in the input it appears between the two integers. The first integer is simply a unique label for the production: the first production is labelled one, and again the PGS simply labels the productions with ascending integers. These labels are used in some of the diagnostic messages output by the PGS. A production with the implicit label zero is constructed and output before the others; it has the form

GOAL ::= goal eof

where goal stands for the goal symbol and eof for the end of sentence marker. A check during input ensures that these symbols occur to the right of ::= in no other production.
4.2 The Log File

This file contains error messages and various items of supplementary information output during the generation of the parsing tables. If error or warning messages are logged then, immediately prior to the end of execution, the message "Errors or warnings logged" is displayed, followed by the usual invitation to type any character to terminate processing.

4.2.1 Error Messages

Most error messages occur during input of the grammar. Those messages prefixed by the word ERROR cause the program to terminate after completing input and checking for further errors. WARNING messages allow processing to continue. Each message is accompanied by a fragment of input text with a pointer to the current input token. The warning messages are:

1. Overlength symbol (increase TOKENSIZE?) truncated to -
2. Not a chain production -
3. Unused(U) or undefined(D) symbols (refer to TABLE1 and 2)

These messages illustrate some general points. Messages ending with a dash are followed by further information: for message 1 it is the truncated form of the token, for message 2 it is the integer label appended to the offending production in the echoed input. After message 2, processing continues as though no chain indication had been specified. Messages such as the first, which indicate that an internal field size is too small, also specify the compile-time constant (in pgscondefs.mesa) which controls the field size; currently TOKENSIZE constrains tokens to 14 characters. The third message occurs at the end of input if there are symbols in the TABLE1 and 2 sequences which do not appear in any production (unused symbols) or if a symbol in the TABLE2 sequence does not appear to the left of ::= in the productions (undefined symbols). This message is followed by a list of integers which designate symbols in TABLE1 and 2 using the numeric labels appended in the echoed input. Each integer is tagged with U and/or D to indicate whether the corresponding symbol is unused, undefined, or both.
The ERROR messages are:

4. Nonterminal with too many rules (increase ALTERNATELIM?) -
Only 31 productions are allowed for any nonterminal; if this is not enough, it would be better to introduce a new nonterminal and split them into two groups rather than increase the limit of 31.

5. Multiple definitions of symbol -
This message occurs either because the same symbol appears more than once in TABLE1 and 2, or because the same symbol occurs to the left of ::= more than once. The offending symbol is printed after the message.

6. Symbol not defined -
This message also appears in two contexts: either a symbol appears in a production which does not appear in TABLE1 or 2, or alternatively message 3 was issued with undefined symbols. In the first case the message is followed by the symbol in question, in the second by "see previous message".

7. Terminal precedes ::= -
The terminal symbol follows the message.

8. Too many rhs symbols in production (increase RHSLIM?) -
Fifteen symbols in the right part of a production is unlikely to be exceeded; if it is, change the grammar. Increasing RHSLIM would involve consequential changes in the binary table formats.

9. Internal field will overflow - increase PSSLIM
This one is unlikely; it would involve a grammar with 1024 symbols or productions. An increase up to 2047 would be possible without changing the binary table formats.

10. Aliased symbol not a terminal symbol -
The symbol follows the message.

11. Aliases must not be terminal symbols -
The symbol follows the message.

12. Goal symbols used in rhs
The goal symbol or end of sentence marker appears in a production right part. (See the paragraphs numbered 2 and 3 in section 3.)

13. Number greater than MAXRULE
Currently the PGS allows for 255 rule numbers. A relatively minor reformatting of the binary tables would permit this to be increased to LAST[CARDINAL].
4.2.2 Output during table construction

During the construction of the parsing tables there is a reasonably remote possibility that error message 9 will occur and terminate any further processing, though in this case it implies that more than 1023 parsing states are necessary for the grammar. Much more likely are messages indicating that the grammar is not LALR(1). In the event of conflicting parsing actions arising for some terminal symbol in a parsing state, all data appertaining to that state is listed. The first heading line specifies:

1. an integer naming the state,
2. a symbol of the grammar (recognition of this symbol causes this state to be entered),
3. a set of p,j pairs defining the state (see AEH), each pair being followed by /.

Below the heading, in a four-column format, is the list of symbols of the grammar which may be encountered in this state. Each symbol is preceded by an encoding of its associated parsing action:

1. unsigned integers denote scan (or shift) entries to the state named by the integer,
2. integers preceded by an asterisk signify reduce operations using the production with the integer as its label,
3. an integer preceded by a minus sign also indicates a production number but implies that the symbol associated with this action must be stacked before the reduce operation: a so-called scanreduce operation.

(These marking conventions differ from those used in AEH though they use the same symbols.) During construction of the tables, scan entries are computed before reduce entries, so when conflicting actions arise they are reduce actions which either conflict with an existing scan entry or with a previously computed reduce entry. (It is an inherent property of scanreduce entries that they cannot conflict with another entry.) Conflicts are indicated by lines of the form

Reduce with n conflicts with **********

where n is a production number.
Each such line is followed by a list of items of the form, symbol action /, where action is either SCAN, if a scan action for this (terminal) symbol has already been constructed, or an integer naming the production for which a reduce action has already been constructed.

If the directive ||PRINTLALR is specified, the output just described occurs for every state whether or not it contains conflicts. The heading, LALR(1) Tables, precedes such output. This output is rarely worth having; it is occasionally of value in tracking down a conflict, but its primary function was in debugging the PGS.

The PGS discards conflicting entries after printing them, and it will not form either the binary tables or the module output if there are reduce-reduce conflicts. In the case of reduce-scan conflicts, the decision to process scan entries first implies that the scan entry takes precedence. This is occasionally useful; for example, it solves the dangling else problem in the preferred way. However, the scan-reduce conflict message is a warning and as such triggers the displayed message directing attention to the .log file.

After generating the tables a heading, LALR(1) Statistics, is output and a few counts are printed. Only the first three may be of any general interest; they indicate the number of parsing states and the total number of actions in the tables for both terminal and nonterminal symbols.

4.2.3 Output during table compaction

Table compaction and output of the binary tables only occur if the directive ||LISTS is specified, in which case the output described here appears on the .log file. The earlier stages of the PGS are written in a general way; data structures will expand to accommodate very large grammars, and at the cost of recompiling the system and changing compile-time constants, the limits on field sizes mentioned previously can be increased.
At this stage the objective is to pack information economically into 16-bit words, and it is here that the size of fields is an absolute constraint. Final checks are made which could conceivably produce one or more of the self-explanatory messages:

FAIL - more than 255 terminal symbols
FAIL - more than 254 nonterminal symbols
FAIL - more than 2047 states or productions
FAIL - more than 15 symbols in a production right part

These are rather unlikely; the tightest constraint is likely to be the limit of 255 on rule numbers mentioned at the end of Section 4.2.1. If any of these checks fail, processing terminates.

Assuming no error messages, the only unsolicited output here is a heading, Parse Table Data, and one table. A hash-accessed look-up table, for the terminal symbols of the grammar, is created for use by the scanner. As hash functions are notoriously unreliable, the following is printed so that a visual check can be applied to ensure no disasters: the subheading, Scanner hashtable probe counts (terminal symbol, probe count, hashcode), followed by a four-column layout of triples which, as the heading indicates, show for each symbol the number of probes needed to locate it in the hash table and its hashcode. The technique used is Algorithm C, page 514, Vol 3 of Knuth's "The Art of Computer Programming", Addison-Wesley (1973). If there are n terminal symbols, they are hashed into a table of MIN[m,251] buckets; m is always an odd integer, either 3*n/2 or 3*n/2+1. The hash function uses the ASCII values of the first and last characters of the symbol and is ((127*first character + last character) MOD buckets) + 1. One can expect the performance of this hash function to deteriorate as the number of terminal symbols approaches 255.

If the directive ||PRINTLALR is specified as well as ||LISTS, a record of the table compaction transformations is produced.
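The bucket count and hash function described in section 4.2.3 can be sketched as follows. This is a Python illustration of the arithmetic only; the function names are invented, and the real implementation is of course in Mesa.

```python
def pgs_buckets(n):
    """Number of hash buckets for n terminal symbols: the odd integer
    3*n/2 or 3*n/2+1, capped at 251 (MIN[m,251] in the text)."""
    m = (3 * n) // 2
    if m % 2 == 0:
        m += 1
    return min(m, 251)

def pgs_hash(symbol, buckets):
    """((127*first character + last character) MOD buckets) + 1."""
    first, last = ord(symbol[0]), ord(symbol[-1])
    return (127 * first + last) % buckets + 1
```

Since the function looks only at the first and last characters, distinct symbols sharing those characters always collide, which is one reason the probe-count table is printed for visual inspection.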
This record is only of interest for maintenance of the system, and familiarity with the compaction techniques described in AEH is assumed in its description. First, a set of default actions for the nonterminal symbols of the grammar is determined and a table headed, Nonterminal Default Actions, is printed. Each nonterminal symbol appears preceded by its associated default action, encoded in the form already described; that is, unsigned integers represent scan entries and negative integers represent scan-reduce actions. (Reduce actions never take place on nonterminal symbols.)

After removing all occurrences of these defaulted entries from the LALR(1) tables, the PGS determines those states which have identical symbol-action pairs, first over the set of terminal symbols and then, independently, over the set of nonterminal symbols. All such states reference one copy of the list of symbol-action pairs stored in the binary tables. The table, Table Overlays, has three columns headed row, ttab and ntab; if a row of this table contains (integer) entries a, b and c respectively, then it implies that the terminal entries of state a are the same as those of state b, while the nonterminal entries of state a are the same as those of state c. It is exceptional for both the terminal and nonterminal entries of a state to match those of other states, so usually one of the entries b or c is blank. If neither the terminal nor nonterminal entries of a state match those of another state then it does not appear in this table.

The final transformation is to renumber the states so that all of those states containing (nondefaulted) actions on nonterminal symbols are labelled by integers contiguous to each other and to 1. This is achieved by swapping the highest numbered state with nonterminal actions with the lowest numbered state without nonterminal actions until no more swaps are possible. The table headed, Renumbered States, simply records this with entries of the form, a swapped with b.
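The renumbering loop just described might be sketched as follows. This is an invented-name Python illustration of the swapping idea only, not the PGS code.

```python
def renumber_states(has_nt_actions):
    """has_nt_actions: dict mapping state number -> True iff the state has
    (nondefaulted) actions on nonterminal symbols.  Returns the list of
    (a, b) swap pairs ('a swapped with b') and the final assignment."""
    states = dict(has_nt_actions)
    swaps = []
    while True:
        with_nt = [s for s, v in states.items() if v]
        without_nt = [s for s, v in states.items() if not v]
        if not with_nt or not without_nt:
            break
        hi, lo = max(with_nt), min(without_nt)
        if hi < lo:  # states with nonterminal actions already contiguous to 1
            break
        states[hi], states[lo] = states[lo], states[hi]
        swaps.append((hi, lo))
    return swaps, states
```

After the loop, every state with nonterminal actions carries a number contiguous to 1, which is what lets the binary tables index the nonterminal arrays by the compact NTState range.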
4.3 The Module File

This is most readily illustrated with an example; it happens to be the module file generated by the PGS for its own grammar.

ParseTable: DEFINITIONS = {

  Symbol: TYPE = [0..255];
  TSymbol: TYPE = Symbol[0..19];
  NTSymbol: TYPE = Symbol[0..13];

  -- token indices for the scanner and parser
  tokenID: TSymbol = 1;
  tokenNUM: TSymbol = 2;
  tokenQUERY: TSymbol = 3;
  tokenTAB3: TSymbol = 9;
  tokenTAB4: TSymbol = 10;

  initialSymbol: TSymbol = 3;
  defaultMarker: TSymbol = TSymbol.FIRST;
  endMarker: TSymbol = TSymbol.LAST;

  HashIndex: TYPE = [0..29];
  VIndex: TYPE = [0..106];
  State: TYPE = [0..26];
  NTState: TYPE = State[0..6];
  TIndex: TYPE = [0..64];
  NTIndex: TYPE = [0..3];
  Production: TYPE = [0..37];

  };

The module defines aliases using the information given in the aliases segment of the input. For example, the token ||TABLE3, which is a terminal symbol of the PGS grammar, was given the alias tokenTAB3; it was the ninth token in the sequence of terminal symbols in the input file, and so internally within the PGS and the binary tables produced it is encoded as 9. The ranges HashIndex, TSymbol, NTSymbol, State, NTState, TIndex, NTIndex, Production and VIndex prescribe the dimensions of arrays in the binary tables.

4.4 The Binary File

The format of the binary file is captured by a set of Mesa definitions which, since they are of interest to both scanner and parser, can conveniently be specified in the definitions module which constitutes the scanner-parser interface. These definitions are reproduced below.
ActionTag: TYPE = MACHINE DEPENDENT RECORD [
  reduce(0: 0..0): BOOL,        -- TRUE iff reduce entry
  pLength(0: 1..4): [0..15]];   -- number of symbols in production rhs

ActionEntry: TYPE = MACHINE DEPENDENT RECORD [
  tag(0: 0..4): ActionTag,           -- [FALSE,0] if a shift entry
  transition(0: 5..15): [0..2047]];  -- production number / next state

ProductionInfo: TYPE = MACHINE DEPENDENT RECORD [
  rule(0: 0..7): [0..256),   -- reduction rule
  lhs(0: 8..15): NTSymbol];  -- production lhs symbol

VocabHashEntry: TYPE = MACHINE DEPENDENT RECORD [
  symbol(0: 0..7): TSymbol,    -- symbol index
  link(0: 8..15): HashIndex];  -- link to next entry

ScanTable: TYPE = ARRAY CHAR['\040..'\177] OF TSymbol;
HashTable: TYPE = ARRAY HashIndex OF VocabHashEntry;
IndexTable: TYPE = ARRAY TSymbol OF CARDINAL;

Vocabulary: TYPE = MACHINE DEPENDENT RECORD [ -- a string body
  length(0), maxlength(1): CARDINAL,
  text(2): PACKED ARRAY VIndex OF CHAR];

ProdData: TYPE = ARRAY Production OF ProductionInfo;
NStarts: TYPE = ARRAY NTState OF NTIndex;
NLengths: TYPE = ARRAY NTState OF CARDINAL;
NSymbols: TYPE = ARRAY NTIndex OF NTSymbol;
NActions: TYPE = ARRAY NTIndex OF ActionEntry;
NTDefaults: TYPE = ARRAY NTSymbol OF ActionEntry;
TStarts: TYPE = ARRAY State OF TIndex;
TLengths: TYPE = ARRAY State OF CARDINAL;
TSymbols: TYPE = ARRAY TIndex OF TSymbol;
TActions: TYPE = ARRAY TIndex OF ActionEntry;

initialState: State = 1;
finalState: State = 0;

Table: TYPE = MACHINE DEPENDENT RECORD [
  scanTable: RECORD [
    scanTab: TableRef RELATIVE POINTER TO ScanTable,
    hashTab: TableRef RELATIVE POINTER TO HashTable,
    vocabIndex: TableRef RELATIVE POINTER TO IndexTable,
    vocabBody: TableRef RELATIVE POINTER TO Vocabulary ],
  parseTable: RECORD [
    prodData: TableRef RELATIVE POINTER TO ProdData,
    nStart: TableRef RELATIVE POINTER TO NStarts,
    nLength: TableRef RELATIVE POINTER TO NLengths,
    nSymbol: TableRef RELATIVE POINTER TO NSymbols,
    nAction: TableRef RELATIVE POINTER TO NActions,
    ntDefaults: TableRef RELATIVE POINTER TO NTDefaults,
    tStart: TableRef RELATIVE POINTER TO TStarts,
    tLength: TableRef RELATIVE POINTER TO TLengths,
    tSymbol: TableRef RELATIVE POINTER TO TSymbols,
    tAction: TableRef RELATIVE POINTER TO TActions ] ];

TableRef: TYPE = LONG BASE POINTER TO Table;

The purpose and content of the arrays in parseTable are explained in AEH; only the definitions relevant to the scanner are discussed here. Terminal symbols of the grammar represented by a single ASCII character are treated separately from those requiring a string of characters. In scanTable there is an array scanTab which can be indexed by characters not in the range [0C..' ]; any single-character symbol used to extract an element of this array will extract a non-zero integer only if it represents a terminal symbol, and the integer is its numeric encoding. The string vocabBody contains the character strings representing all other valid terminals, stored head to tail. Element i-1 of vocabIndex indexes the first character of the string in vocabBody which represents the terminal symbol with the encoding i.

Given a string which purports to be a terminal symbol, its hash value can be computed using the hash function given in section 4.2.3 (identifying buckets there with LAST[HashIndex]). The hash value can be used to select an element from the array hashTab; the elements of hashTab are records, one field of which is used to select an entry of vocabIndex (to find the substring in vocabBody to compare with the given string), while the other field is an index to another element in hashTab to be tried when the string comparison fails. hashTab[0] is void; a zero index terminates the search, indicating that the given string does not represent a terminal.

4.5 The LR and First Files

As part of the debugging facilities built into the PGS, two other output files are created on request. The first step in building the LALR(1) tables is to construct LR(0) tables.
(This is done using the SLR(1) algorithm in section 3.1 of the AEH paper, omitting the computation of the sets specified in relation (5b).) The LR(0) tables may be output to a file with the extension .lr by specifying the directive ||PRINTLR in the input; the form of the output is similar to that used for the LALR(1) tables, but of course the terminal symbols triggering reduce actions are not known, so a line

Reduce with p

follows the state heading if the production with label p has reduce actions in the state.

The next stage is to compute lookahead symbols for all these incompletely determined reduce actions according to the LALR(1) rules. This is done using Anderson's bewilderingly succinctly stated algorithm at the foot of page 21 and top of page 22 in AEH. It is convenient, as a preliminary to this, to compute, for each nonterminal symbol, the set of terminal symbols which can appear as the first symbol of a string derived from the nonterminal. This is a transitive closure calculation. It provides the initial data for computing Anderson's exact right contexts, which is yet another transitive closure calculation. If the directive ||FIRST appears among the input directives in addition to either of the directives ||PRINTLALR or ||LISTS, a file with extension name .first is created which contains a list of all nonterminal symbols, each being followed by an (offset) list of the terminals which can start a string derived from it.

5. The Preprocessor

The preprocessor is invoked if the input file name has the extension .mesa. Its function arises from the following considerations. For maintenance purposes it is more or less essential that each arm of the SELECT statement implementing the semantic processing routines in the Mesa compiler has comments displaying those productions associated with the arm. The test preceding the => symbol in the arm is the rule number for these productions.
As Mesa evolves, changes are made to the grammar: productions associated with one arm must be moved to another, new productions and new arms are introduced, and periodically it is necessary to renumber the rules in a systematic fashion in order to find things (there being over 200 arms). The bookkeeping necessary to ensure that the story told by the SELECT statement is consistent with that told to the PGS in the input file is tiresome and error-prone. Most of the data needed by the PGS is present in the SELECT statement; by adding the rest, only one copy of the grammar need be maintained and the reassignment of rule numbers can be mechanised.

The preprocessor expects a Mesa program module as input, and it scans for the sequence digits => -- after which it expects to find data relevant to it. On encountering CR, or Bravo's ControlZ ... CR sequence, it checks whether the next printing characters open a Mesa comment or not; if they do then more data is expected, otherwise the end of the data associated with a particular select arm is presumed and a search is instituted for the next. Supposing the input file name given was semroutine.mesa, then during the scan the preprocessor copies the input file to semroutine.mesa$ (shades of Bravo) and makes a modified copy in a scratch file. The modification is a trivial one: as the sequences digits => -- are encountered, the next integer in the sequence 0,1,2,... is substituted before the => symbol. At the end of the input scan, the scratch file is written back to semroutine.mesa.

Clearly the rather crude procedure used to locate the grammatical information in the program text places constraints on the program module containing it. In particular, it precludes comments in a fairly natural place in any other SELECT statements that may appear in the module which also use integer tests. On the other hand, anything approaching parsing the text is out, since it would merely replace one updating problem with another.
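The substitution step can be sketched in a few lines. This Python sketch handles only the textual replacement (it also accepts ?, which section 5 later notes is recognised as an alternative to a digit sequence); the real preprocessor additionally tracks CR and comment boundaries, which this ignores, and the function name is invented.

```python
import re

def renumber_select_arms(text):
    """Replace the integer (or '?') before each '=> --' with the next
    integer in the sequence 0, 1, 2, ..."""
    counter = -1
    def repl(match):
        nonlocal counter
        counter += 1
        return f"{counter} => --"
    return re.sub(r"(\d+|\?)\s*=>\s*--", repl, text)
```

For example, arms originally numbered 5, ?, 12 would come out renumbered 0, 1, 2 in order of occurrence.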
Since the PGS constructs tables using the new rule numbers just assigned, arms of the SELECT statement can be reordered to group logically related arms together without making the search for an arm with a given rule number tedious.

In the comments associated with SELECT arms (other than the first encountered) the preprocessor expects to find only productions. Here each production is specified in full: its left part token, followed by either ::= or (to designate a chain production) ::=C, followed by the right part tokens. Comments preceding the first production contain the additional information needed. This information, in order of occurrence, is:

1. Optionally, and in either order, the tokens MODULE: or BINARY: may appear, each followed only by a token naming the file to contain the corresponding output.

2. Next must appear GOAL: followed by the token naming the nonterminal which is the goal symbol of the grammar.

3. Optionally there may appear TERMINALS: followed by the tokens representing the terminals. The end of sentence marker should not be included; eof is supplied.

4. Optionally there may appear ALIASES: followed by the alias sequence as described in section 3, paragraph 4.

5. Finally PRODUCTIONS: must appear before the first production.

When the preprocessor is selected, the output file names are formed by appending the various extensions to pgs rather than to the main part of the input file name; thus pgs.module and pgs.binary are the default names if the MODULE: and BINARY: options are not used in the input.

Nonterminal symbols are deduced from the tokens to the left of ::= tokens. If the TERMINALS: sequence appears, only these symbols are taken to be terminals. In its absence, any token in a production which is not a nonterminal according to the previous definition is a terminal. Omitting this sequence means that typing errors define terminal symbols!
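The deduction rule in the last paragraph is easy to state precisely. A hypothetical sketch (Python; classify is an invented name, not part of PGS) of how terminals fall out when the TERMINALS: sequence is omitted:

```python
def classify(productions):
    """productions: token lists, e.g. ["e", "::=", "e", "+", "t"]."""
    # Tokens to the left of ::= (or ::=C, for chain productions) are the
    # nonterminals.
    nonterminals = {p[0] for p in productions
                    if len(p) > 1 and p[1] in ("::=", "::=C")}
    # Any other token appearing in a right part is taken to be a terminal.
    terminals = set()
    for p in productions:
        for token in p[2:]:
            if token not in nonterminals:
                terminals.add(token)
    return nonterminals, terminals
```

With this rule a misspelt right-part token, say tt for t, is silently classified as a brand-new terminal, which is exactly the pitfall the text warns about.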
From a file of this form (see the example at the end of the appendix), the preprocessor constructs a scratch file in the format specified in section 3, with the directives ||INPUT, ||CHAIN and ||LISTS, which it passes to the input phase of the PGS. The preprocessor only generates one error message, "Directives incorrect or out of sequence". No further processing occurs in this situation, so the message is displayed, followed by the request to type a character to terminate execution. When inserting new arms in the SELECT statement there is no need (so far as the preprocessor is concerned) to use an integer distinct from those on other arms, but it is probably not a good idea. The preprocessor will recognise ? as an alternative to a digit sequence.

6. Operation

6.1 Historical notes

The Mesa source components of PGS include:

PGSParseTable.mesa, the binary tables definitions module,
ProtoP1.mesa, the scanner-parser interface module,
PGSConDefs.mesa, the PGS interface module,
ProtoParser.mesa, the Mesa parser,
PGSScan.mesa, the PGS scanner and semantic processing routines,
PGSLALR.mesa, the LALR(1) table generator,
PGSTab.mesa, the compacted table constructor,
PGSFormat.mesa, the preprocessor,
PGSControl.mesa, the PGS control module.

The compilation sequence for the PGS modules is simply summarised by the rules:

(a) compile PGSParseTable before PGS1,
(b) compile PGSTypes before PGSConDefs,
(c) compile PGSParser after (a),
(d) compile PGSScan or PGSControl after both (a) and (b);

the remaining modules can be compiled after (b). The module PGSConDefs interfaces the PGS to the Mesa system, providing I/O and storage allocation routines for the other modules; as a consequence only PGSConDefs and PGSControl (which implements these facilities) reference Mesa system modules.

A few precautions were taken during implementation in case a 64K memory should prove too small.

1.
Apart from a few small arrays for which precise bounds could be predetermined, all arrays are declared using types declared in PGSConDefs; these types are descriptors or pointers with the word LONG commented out. Should a 64K memory subsequently prove constraining, removal of these comment symbols and transfer to an extended memory machine should dispose of the problem.

2. The display is turned off just before invoking the procedure lalrgen in the module PGSLALR, since it is during the construction of the LALR(1) tables that the total memory demand is greatest.

3. Much of the memory space used in the construction of the LALR(1) tables is recovered for use in table compaction by outputting the LALR(1) tables in a much more convenient form to a scratch file (the .binary file is used). The table compaction routine tabgen builds its own record of the LALR(1) tables from this file; it inherits all other data needed from the data structures built during the input of the grammar.

6.2 PGS Operation

PGS command line parsing has been revised to handle module identifiers, file names, etc., in the currently approved way and to allow easier parameterization with respect to long or short pointers. The basic idea was to make PGS more like the compiler and binder in these areas (to avoid the current file name hassles and to make PGS more usable by the system modeller).

Processing Modes

Input: Mesa vs. grammar. PGS can extract the information needed to build parsing tables from comments embedded within standard Mesa source files. Although the above documents this input mode almost as an afterthought, it has become the standard one. The conventional extension for the input file in this mode is .pgs. The grammar mode has been retained for its occasional utility in experimenting with grammars. See the Appendix.

Output: BCD vs. binary. The usual output mode is BCD; this facilitates packaging the parsing tables with the code that uses them (in a BCD, boot or image file).
The binary mode has been retained primarily for situations in which the parsing tables are to be used by non-Mesa programs.

Mesa Programs as PGS Source Files

The list of keywords that optionally precede the first production has been revised and expanded as follows:

TABLE: TYPE: EXPORTS: GOAL: TERMINALS: ALIASES: PRODUCTIONS:

The first three must precede all the others but may occur in any order; the next section explains their significance. The last four must appear in the specified order; all but the last may be omitted.

Output Identification

In the source text, the old keywords dealing with file and module names are replaced by

TABLE: tableId TYPE: typeId EXPORTS: interfaceId -- or EXPORTS: SELF

This is supposed to remind you of programId: EXPORTS interfaceId (sorry about the different treatment of colons, etc.). The names tableId, typeId and interfaceId are module identifiers (capitalization counts) and get put into BCDs, symbol tables, etc.

Examples

The following examples are taken from the current system. The compiler and binder specify the same type name because they use the same parsing module, in which that interface is a (compile-time) parameter. On the other hand, they export different interfaces because loading of the corresponding tables is handled differently (see below).

TABLE: BCDParseData TYPE: ParseTable EXPORTS: SELF -- binder
TABLE: MesaTab TYPE: ParseTable EXPORTS: CBinary -- compiler
TABLE: PGSParseData TYPE: PGSParseTable EXPORTS: SELF -- PGS

Invoking PGS

File Naming

When you invoke PGS, you can specify an arbitrary association between files and module identifiers using the same command-line conventions that the compiler and binder do.
The most general form is:

[defs: defsFile, bcd: bcdFile, grammar: grammarFile] _ sourceFile[interfaceId: interfaceFile]/switches

which puts the source for the interface typeId on defsFile.mesa, the BCD for the tableId (or binary, if you say "binary:") on bcdFile.bcd (or .binary), a summary of the grammar on grammarFile.grammar (only if the input was a Mesa source file), and finds the BCD for interfaceId on interfaceFile.bcd. Capitalization doesn't count here. By default:

defsFile: typeId.mesa
bcdFile: tableId.bcd
grammarFile: tableId.grammar (inhibit with /-g)
error messages: tableId.pgslog

(There are further defaults for the cases in which the input is just a grammar file or you omit keyword items for the module identifiers in the source.) Note that the new use of keywords is not compatible with previous use.

Command line examples

The following commands build parsers for the Compiler, Binder, and PGS.

PGS [grammar: Cedar] _ Pass1T.pgs
  needs CBinary.bcd
  produces Pass1T.mesa, MesaTab.bcd, ParseTable.mesa
  exports MesaTab as a PROGRAM in the interface CBinary
  puts grammar summary in Cedar.grammar

PGS [defs: BcdParseTable, grammar: CMesa] _ BcdTreeBuild.pgs
  produces BcdTreeBuild.mesa, BcdParseData.bcd, BcdParseTable.mesa
  exports BcdParseData directly (no interface)
  puts grammar summary in CMesa.grammar

PGS [defs: PGSParseTable, grammar: PGS] _ PGSScan.pgs
  produces PGSScan.mesa, PGSParseData.bcd, PGSParseTable.mesa
  exports PGSParseData directly (no interface)
  puts grammar summary in PGS.grammar

6.3 TableCompiler Operation

TableCompiler command line parsing has been similarly revised to resemble that of the compiler and binder.

Processing Modes

TableCompiler is a program to convert assorted inputs to Cedar .bcd files that can be bound into configurations and managed as code segments.
If the source file name has extension .mesa, it extracts string literals from string array declarations and gives the skeleton of a DEFINITIONS file describing the resulting structure; otherwise, it just wraps a BCD header around a collection of bits.

Invoking TableCompiler

File Naming

When you invoke TableCompiler, you can specify an arbitrary association between files and module identifiers using the same command-line conventions that the compiler and binder do. The most general form is:

[bcd: bcdFile, format: formatFile] _ sourceFile[interface: interfaceFile]/switches

which puts the BCD output on bcdFile.bcd, the record format on formatFile.format (only if the input was a Mesa source file), and finds the interface BCD on interfaceFile.bcd. Capitalization doesn't count here. By default:

bcdFile: source.bcd
formatFile: source.format
error messages: TableCompiler.log

where source is the root of the sourceFile name. Note that the new use of keywords is not compatible with previous use.

Command line example

The following command builds components of the compiler:

TableCompiler ErrorTab[interface: CBinary] DebugTab[interface: CBinary]
  needs CBinary.bcd
  produces ErrorTab.format, ErrorTab.bcd, DebugTab.format, DebugTab.bcd
  exports ErrorTab and DebugTab as PROGRAMs in the interface CBinary

Appendix

An Input File

||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1 id + ( ) * IF THEN OR ELSE := EOF
||TABLE2 g s a i e t p b l
||TABLE3 + tokenPLUS
||TABLE4
1 g ::= s
C s ::= a
C | i
2 a ::= id := e
3 i ::= IF b THEN a l
C e ::= t
4 | e + t
C t ::= p
5 | t * p
6 p ::= ( e )
7 | id
8 b ::= b OR id
9 | id
10 l ::=
11 | ELSE s

The Resulting Module File

ParseTable: DEFINITIONS =
BEGIN
-- types for data structures used by the Mesa parser and scanner
Symbol: TYPE = [0..255];
-- token indices for the scanner and parser
tokenPLUS: TSymbol = 2;
HashIndex: TYPE = [0..17];
VIndex: TYPE = [0..22];
TSymbol: TYPE = Symbol [0..
11];
NTSymbol: TYPE = Symbol [0..10];
State: TYPE = [0..17];
NTState: TYPE = State [0..5];
TIndex: TYPE = [0..23];
NTIndex: TYPE = [0..9];
Production: TYPE = [0..15];
END.

The Resulting Echo File

||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1
1 id 2 + 3 ( 4 ) 5 * 6 IF 7 THEN 8 OR 9 ELSE 10 := 11 EOF
||TABLE2
12 g 13 s 14 a 15 i 16 e 17 t 18 p 19 b 20 l
||TABLE3
+ tokenPLUS
||TABLE4
GOAL ::= g EOF
1 1 g ::= s
2 C 2 s ::= a
3 C 3 | i
4 2 a ::= id := e
5 3 i ::= IF b THEN a l
6 C 6 e ::= t
7 4 | e + t
8 C 8 t ::= p
9 5 | t * p
10 6 p ::= ( e )
11 7 | id
12 8 b ::= b OR id
13 9 | id
14 10 l ::=
15 11 | ELSE s

The Resulting Log File

LALR(1) Tables
1 0 0/ 2 id 3 IF 0 g -1 s -1 a -1 i
2 id 4 1/ 4 :=
3 IF 5 1/ -13 id 5 b
4 := 4 2/ -11 id 6 ( 7 e 8 t 8 p
5 b 5 2/ 12 1/ 9 THEN 10 OR
6 ( 10 1/ -11 id 6 ( 11 e 12 t 12 p
7 e 4 3/ 7 1/ 13 + *4 ELSE *4 EOF
8 e 4 3/ 7 1/ 9 1/ 13 + 14 * *4 ELSE *4 EOF
9 THEN 5 3/ 2 id 15 a
10 OR 12 2/ -12 id
11 e 7 1/ 10 2/ 13 + -10 )
12 e 7 1/ 9 1/ 10 2/ 13 + -10 ) 14 *
13 + 7 2/ -11 id 6 ( 16 t 16 p
14 * 9 2/ -11 id 6 ( -9 p
15 a 5 4/ 17 ELSE *14 EOF -5 l
16 t 7 3/ 9 1/ *7 + *7 ) 14 * *7 ELSE *7 EOF
17 ELSE 15 1/ 2 id 3 IF -15 s -15 a -15 i

LALR(1) Statistics
States = 17
Terminal entries = 37
Nonterminal entries = 19
First OR operation count = 5
Total OR operation count = 33
Maximum number of contexts = 9

Parse Table Data
Nonterminal Default Actions
0 g -1 s 15 a -1 i 7 e 8 t 8 p 5 b -5 l
Entries removed = 0
Table Overlays
row ttab ntab
6 4 13 4 14 4 17 1
Scanner hashtable probe counts (terminal symbol, probecount, hashcode)
id 1 6
IF 1 9
THEN 1 3
OR 1 1
ELSE 1 10
:= 1 16
Renumbered States
2 swapped with 17
3 swapped with 14
4 swapped with 13
5 swapped with 6

The PGS Grammar

||CHAIN ||LISTS ||INPUT
||TABLE1
1 symbol 2 num 3 '?
4 '| 5 "::=" 6 'C 7 "||TABLE1" 8 "||TABLE2" 9 "||TABLE3" 10 "||TABLE4" 11 "||INPUT" 12 "||CHAIN" 13 "||LISTS" 14 "||PRINTLR" 15 "||PRINTLALR" 16 "||FIRST" 17 "||IDS" 18 "GOAL" 19 eof
||TABLE2
20 grammar 21 head 22 ruleset 23 directives 24 terminals 25 nonterminals 26 aliases 27 directive 28 discard 29 rulegroup 30 prefix 31 goalrule
||TABLE3
symbol tokenID num tokenNUM '? tokenQUERY "||TABLE3" tokenTAB3 "||TABLE4" tokenTAB4 '? InitialSymbol
||TABLE4
GOAL ::= grammar eof
1 0 grammar ::= '? head ruleset
2 1 head ::= directives terminals nonterminals "||TABLE4"
3 1 | directives terminals nonterminals aliases "||TABLE4"
4 2 directives ::=
5 28 | directives directive
6 3 directive ::= "||INPUT"
7 4 | "||CHAIN"
8 5 | "||LISTS"
9 6 | "||PRINTLR"
10 7 | "||PRINTLALR"
11 8 | "||FIRST"
12 9 | "||IDS"
13 10 terminals ::= "||TABLE1"
14 11 | terminals discard symbol
15 11 nonterminals ::= nonterminals discard symbol
16 12 | "||TABLE2"
17 13 aliases ::= "||TABLE3"
18 14 | aliases symbol symbol
19 15 discard ::=
20 28 | num
21 28 | '?
22 16 rulegroup ::= symbol "::="
23 17 | prefix symbol "::="
24 18 | rulegroup symbol "::="
25 19 | rulegroup prefix symbol "::="
26 20 | rulegroup '|
27 21 | rulegroup prefix '|
28 22 | rulegroup symbol
29 23 prefix ::= num
30 24 | num num
31 24 | '? num
32 25 | discard 'C
33 26 | discard 'C num
34 27 | '?
35 C 28 ruleset ::= rulegroup
36 28 | goalrule rulegroup
37 28 goalrule ::= "GOAL" "::=" symbol symbol

An Input File For the Preprocessor

Program text has been stripped from within and around the SELECT statement implementing the semantic routines of the PGS to expose the grammatical information.

SELECT prodData[q[qj].transition].rule FROM
0 =>
-- MODULE: pgsptabdefsnew.mesa BINARY: pgsnew.binary
-- GOAL: grammar
-- TERMINALS: symbol num '? '| "::=" 'C "||TABLE1" "||TABLE2" "||TABLE3"
-- "||TABLE4" "||INPUT" "||CHAIN" "||LISTS" "||PRINTLR"
-- "||PRINTLALR" "||FIRST" "||IDS" "GOAL"
-- ALIASES: symbol tokenID num tokenNUM '?
tokenQUERY
-- "||TABLE3" tokenTAB3 "||TABLE4" tokenTAB4 '? InitialSymbol
-- PRODUCTIONS:
-- grammar ::= '? head ruleset
BEGIN END;
1 =>
-- head ::= directives terminals nonterminals "||TABLE4"
-- head ::= directives terminals nonterminals aliases "||TABLE4"
BEGIN END;
2 => -- directives ::=
BEGIN END;
3 => -- directive ::= "||INPUT"
BEGIN END;
4 => -- directive ::= "||CHAIN"
flags[chain] _ TRUE;
5 => -- directive ::= "||LISTS"
flags[lists] _ TRUE;
6 => -- directive ::= "||PRINTLR"
flags[printlr] _ TRUE;
7 => -- directive ::= "||PRINTLALR"
flags[printlalr] _ TRUE;
8 => -- directive ::= "||FIRST"
flags[first] _ TRUE;
9 => -- directive ::= "||IDS"
flags[ids] _ TRUE;
10 => -- terminals ::= "||TABLE1"
BEGIN END;
11 =>
-- terminals ::= terminals discard symbol
-- nonterminals ::= nonterminals discard symbol
BEGIN END;
12 => -- nonterminals ::= "||TABLE2"
BEGIN END;
13 => -- aliases ::= "||TABLE3"
BEGIN END;
14 => -- aliases ::= aliases symbol symbol
BEGIN END;
15 => -- discard ::=
l[top] _ InputLoc[]; -- keep the parser error recovery happy
16 => -- rulegroup ::= symbol "::="
BEGIN END;
17 => -- rulegroup ::= prefix symbol "::="
lhssymbol[v[top+1]];
18 => -- rulegroup ::= rulegroup symbol "::="
BEGIN END;
19 => -- rulegroup ::= rulegroup prefix symbol "::="
lhssymbol[v[top+2]];
20 => -- rulegroup ::= rulegroup '|
BEGIN END;
21 => -- rulegroup ::= rulegroup prefix '|
prodheader[FALSE];
22 => -- rulegroup ::= rulegroup symbol
BEGIN END;
23 => -- prefix ::= num
setrulechain[v[top], FALSE];
24 =>
-- prefix ::= num num
-- prefix ::= '? num
setrulechain[v[top+1], FALSE];
25 => -- prefix ::= discard 'C
setrulechain[prix, TRUE];
26 => -- prefix ::= discard 'C num
setrulechain[v[top+2], TRUE];
27 => -- prefix ::= '?
setrulechain[prix, FALSE];
28 =>
-- directives ::= directives directive
-- discard ::= num
-- discard ::= '?
-- ruleset ::=C rulegroup
-- ruleset ::= goalrule rulegroup
-- goalrule ::= "GOAL" "::=" symbol symbol
NULL;
ENDCASE => ERROR;