Inter-Office Memorandum

To: PGS Users          Date: April 2, 1984
From: Doug Wyatt       Location: Palo Alto
Subject: Using PGS     Organization: PARC/CSL

XEROX

Filed on: [Indigo]Documentation>PGS.tioga

DRAFT

The Mesa Parser Generator System
James Eve

1. Introduction

[James Eve wrote the bulk of this document in 1979. Doug Wyatt converted it to Tioga form and revised section 6 to describe the operation of PGS and TableCompiler in Cedar. Much of Section 6.2 was derived from a brief memo by Ed Satterthwaite.]

The parser generator system (PGS) is a Mesa program which takes a context-free grammar specified in Backus-Naur form as input and tests whether it is LALR(1); if the LALR(1) condition is satisfied, compacted binary tables are output which can be used in conjunction with the Mesa parser. Ancillary tables are also output which simplify the writing of lexical routines to recognise the terminal symbols of the grammar. Since a primary task of the PGS is to aid development of the Mesa system itself, a preprocessor is included for that purpose.

The LALR(1) parsing algorithm is fairly widely held to be the best practical algorithm currently available: it has good space and time performance properties, it handles a larger subset of the context-free grammars than other methods in common use, and it allows a good syntax error repair capability to be added. The one snag is that the LALR(1) condition is far from intuitively obvious; checking that the condition holds requires substantial computation, and in the event that it does not, a fair knowledge of the underlying theory may be needed to modify a grammar to make it meet the condition. A really good account of the theory has yet to be written; notational problems tend to obscure the underlying ideas.
Probably the easiest starting point is the tutorial article "LR Parsing" by Aho and Johnson in Computing Surveys, 6(1974)74. The most comprehensive account readily available is in "The Theory of Parsing, Translation, and Compiling", Volume 1, Prentice-Hall (1972), by Aho and Ullman. The algorithms used in the PGS are from Anderson, Eve and Horning, "Efficient LR(1) Parsers", Acta Informatica 2(1973)12, so the terminology here follows the latter paper, which is referenced subsequently as AEH.

2. Using the Parser Generator

Given an Alto disc with the PGS installed, issuing the command pgs will invoke it under the Mesa Executive. It prompts for an input file name. The input file name implicitly defines the names of the various output files: the main part of the input file name is extended by .echo, .log, .binary and .module to form the names of the primary output files. However, input files with extension .mesa or .Mesa are an exception; they are discussed in section 5. Input and output are discussed in sections 3 and 4; listings of an example input file and the resulting .module, .echo and .log files are given in the appendix. Installing the system and assembling it from its components are described in section 6.

3. Format of the Input File

The input file is treated as a sequence of tokens, where a token is defined to be a sequence of characters none of which are in the range [0C..' ]. Tokens are delimited by sequences of characters in this range. In addition to the tokens ::=, |, ?, C and the integers, which have special roles, there are a number of tokens starting with the character pair || which control the PGS. The input consists of five subsequences (directives, terminals, nonterminals, aliases and productions) which must appear in this order.

1.
The principal directives and their functions are:

||INPUT - causes the input to be echoed to the .echo file.
||CHAIN - causes an optimisation to be performed which speeds up parsing by eliminating all references to productions marked as chain productions.
||LISTS - causes the LALR(1) tables to be compacted and output to the .binary and .module files.
||PRINTLALR - causes a readable form of the LALR(1) tables to be output to the .log file. (For a grammar of about 300 productions these tables will contain about 400,000 characters. Generating them slows the PGS perceptibly.)

2. The terminals are introduced by ||TABLE1, which is followed by tokens representing the terminal symbols of the grammar; the last of these denotes a unique sentence endmarker which will be used only to delimit sentences supplied to the resulting parser. It should not appear in any production. (The scanner invoked by the parser using the tables generated by the PGS is required to map the end of input signal onto this token.)

3. ||TABLE2 similarly introduces the nonterminal symbols of the grammar; the one appearing first after ||TABLE2 is taken to be the goal symbol of the grammar. The way that the Mesa parsing algorithm terminates entails a weak constraint: the goal symbol should not appear in the right part of any production, nor should any of its productions be designated chain productions.

4. The alias sequence may be empty, but if it is not it is introduced by ||TABLE3. The terminal symbols of a grammar do not necessarily have the form of identifiers, but lexical or error recovery routines may need to be able to reference them. The alias sequence consists of pairs of tokens: the first is a terminal symbol (i.e. it appears in the sequence following ||TABLE1); the second is an alias in the form of an identifier by which it can be referenced. Appropriate constant definitions are constructed for these identifiers and included in the .module file.

5.
Finally, the productions are listed in Backus-Naur notation following ||TABLE4. The tokens ::= and | play their usual role; there is no symbol terminating or separating productions. Likewise, a production deriving the empty string is specified by the absence of any token between ::= or | and the next |, the end of input, or a token followed by ::=. Immediately preceding the |'s, or to the left of a token followed by ::=, there will usually be an integer, the rule number, which may itself be preceded by the token C (upper case only). The rule number associates the production with a semantic routine (an arm of a SELECT statement) which is to be invoked by the parsing algorithm whenever a string derived from the associated production is recognised.

Some productions, of the form nonterminal ::= nonterminal, have no semantic significance and serve merely to assert that members of one syntactic class are also members of some larger class. Chains of such productions appear in expression grammars (expr ::= term, term ::= factor, factor ::= primary, etc.) and can significantly reduce the speed of parsing. The appearance of the token C is an assertion that the following production is of this form and has no semantic processing associated with it, thereby permitting the PGS to eliminate all references to it, with an increase in parsing speed.

The input phase of the PGS uses the Mesa parsing routine and parsing tables of its own construction, so there is clearly a grammar specifying the form of the input. The description given above was preferred as an introduction to the input format since it covers only the essentials. The PGS will, on request, echo its input interspersed with other information in a formatted form. As there was a requirement that this output should also be acceptable as input, the grammar allows a rather wider class of input forms, but information other than that described is simply discarded during re-input. The grammar appears in the appendix.
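To make the layout concrete, a minimal input file for a toy grammar might look like the following sketch. The grammar, alias name and spacing are invented for illustration (tokens are simply whitespace-delimited, so the line breaks are arbitrary); the real example appears in the appendix.

```
||INPUT ||LISTS
||TABLE1  id  +  eof
||TABLE2  goal  expr
||TABLE3  +  tokenPLUS
||TABLE4
goal ::= expr 1
expr ::= expr + id 2
expr ::= id 3
```

Here eof is the sentence endmarker, goal is the goal symbol, and each rule number precedes the next production's left part, so rule 1 attaches to goal ::= expr, and so on.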
It should be clear that the terminal symbols of the PGS grammar cannot be used as tokens in a client grammar; a syntax error would result. This is not likely to be a problem for most clients insofar as the symbols starting with || are concerned (which is why this curious system was adopted), but there are also ::=, |, ?, C and GOAL, some of which may well appear in a client grammar. Finally, of course, the PGS grammar contains all of them. The standard solution is used: if |, ? or C are required as tokens in a client grammar they must be specified as '|, '? and 'C. Multi-character symbols such as ::= must be specified as "::=". The only special treatment that these quoted symbols receive at the hands of the PGS is in building the tables for use by the scanner: any two-character token beginning with a single quote has the quote removed; similarly, any token of three or more characters where the first and last are double quotes has these bracketing quotes stripped.

Occasionally it is convenient to be able to flag certain productions in the input text. The PGS will ignore ? when looking for C or a rule number; the ? should precede C when a rule number is also present.

4. Output of the Parser Generator

Four output files are normally constructed: a record of the input, a log, a binary output file containing the parser and lexical tables, and a module file which contains definitions of aliased terminal symbols and some ranges defining the sizes of the various arrays constituting the binary file. The module file is a Mesa module with the name ParseTable. It must be compiled and bound with the Mesa parser, a suitably modified version of the Mesa scanner, and the definitions module which describes the invariant parts of the binary parse tables. Examples of these latter modules exist in the PGS.
The files pgsptabdefs.mesa, pgsscan.mesa, pgsparse.mesa and pgs1.mesa contain respectively ParseTable, the scanner and semantic processing routines for the PGS, the Mesa parser and, finally, definitions of the invariant part of the binary tables. For operational reasons, the low-level routines interfacing with I/O, storage management, etc. have been removed from pgsparse.mesa and pgsscan.mesa to the control module of the PGS, which is in the file pgscon.mesa. Nonetheless these modules should provide a model for anyone needing one. In particular, the code for loading and unloading the binary tables and invoking the parser can be found in the main line code of the module PGScon.

4.1 The Input Record File

This file is produced if the directives in the input stream contain ||INPUT. The name of the file is obtained by appending .echo to the main part of the input file name. The record of the input differs from the true input in that the directives may be displayed in a different order and the terminal and nonterminal symbols are displayed one to a line, each preceded by an integer. (The integers are allocated sequentially, starting at one for the first terminal symbol.) Each production is displayed starting on a new line; each is preceded by two integers, the second being the rule number from the input. If a C was specified in the input it appears between the two integers. The first integer is simply a unique label for the production: the first production is labelled one, and again the PGS simply labels the productions with ascending integers. These labels are used in some of the diagnostic messages output by the PGS. A production with the implicit label zero is constructed and output before the others; it has the form

GOAL ::= goal eof

where goal stands for the goal symbol and eof for the end of sentence marker. A check during input ensures that these symbols occur to the right of ::= in no other production.
4.2 The Log File

This file contains error messages and various items of supplementary information output during the generation of the parsing tables. If error or warning messages are logged then, immediately prior to the end of execution, the message "Errors or warnings logged" is displayed, followed by the usual invitation to type any character to terminate processing.

4.2.1 Error Messages

Most error messages occur during input of the grammar. Those messages prefixed by the word ERROR cause the program to terminate after completing input and checking for further errors. WARNING messages allow processing to continue. Each message is accompanied by a fragment of input text with a pointer to the current input token. The warning messages are:

1. Overlength symbol (increase TOKENSIZE?) truncated to -
2. Not a chain production -
3. Unused(U) or undefined(D) symbols (refer to TABLE1 and 2)

These messages illustrate some general points. Messages ending with a dash are followed by further information: for message 1 it is the truncated form of the token, for message 2 it is the integer label appended to the offending production in the echoed input. After message 2, processing continues as though no chain indication had been specified. Messages such as the first, which indicate that an internal field size is too small, also specify the compile-time constant (in pgscondefs.mesa) which controls the field size; currently TOKENSIZE constrains tokens to 14 characters. The third message occurs at the end of input if there are symbols in the TABLE1 and 2 sequences which do not appear in any production (unused symbols) or if a symbol in the TABLE2 sequence does not appear to the left of ::= in the productions (undefined symbols). This message is followed by a list of integers which designate symbols in TABLE1 and 2 using the numeric labels appended in the echoed input. Each integer is tagged with U and/or D to indicate whether the corresponding symbol is unused, undefined, or both.
The ERROR messages are:

4. Nonterminal with too many rules (increase ALTERNATELIM?) -
Only 31 productions are allowed for any nonterminal; if this is not enough, it would be better to introduce a new nonterminal and split them into two groups rather than increase the limit of 31.

5. Multiple definitions of symbol -
This message occurs either because the same symbol appears more than once in TABLE1 and 2, or because the same symbol occurs to the left of ::= more than once. The offending symbol is printed after the message.

6. Symbol not defined -
This message also appears in two contexts: either a symbol appears in a production which does not appear in TABLE1 or 2, or alternatively message 3 was issued with undefined symbols. In the first case the message is followed by the symbol in question, in the second by "see previous message".

7. Terminal precedes ::= -
The terminal symbol follows the message.

8. Too many rhs symbols in production (increase RHSLIM?) -
Fifteen symbols in the right part of a production is unlikely to be exceeded; if it is, change the grammar. Increasing RHSLIM would involve consequential changes in the binary table formats.

9. Internal field will overflow - increase PSSLIM
This one is unlikely; it would involve a grammar with 1024 symbols or productions. An increase up to 2047 would be possible without changing the binary table formats.

10. Aliased symbol not a terminal symbol -
The symbol follows the message.

11. Aliases must not be terminal symbols -
The symbol follows the message.

12. Goal symbols used in rhs
The goal symbol or end of sentence marker appears in a production right part. (See the paragraphs numbered 2 and 3 in section 3.)

13. Number greater than MAXRULE
Currently the PGS allows for 255 rule numbers. A relatively minor reformatting of the binary tables would permit this to be increased to LAST[CARDINAL].
4.2.2 Output during table construction

During the construction of the parsing tables there is a reasonably remote possibility that error message 9 will occur and terminate any further processing, though in this case it implies that more than 1023 parsing states are necessary for the grammar. Much more likely are messages indicating that the grammar is not LALR(1). In the event of conflicting parsing actions arising for some terminal symbol in a parsing state, all data appertaining to that state is listed. The first heading line specifies:

1. an integer naming the state,
2. a symbol of the grammar (recognition of this symbol causes this state to be entered),
3. a set of p,j pairs defining the state (see AEH), each pair being followed by /.

Below the heading, in a four-column format, is the list of symbols of the grammar which may be encountered in this state. Each symbol is preceded by an encoding of its associated parsing action:

1. unsigned integers denote scan (or shift) entries to the state named by the integer,
2. integers preceded by an asterisk signify reduce operations using the production with the integer as its label,
3. an integer preceded by a minus sign also indicates a production number but implies that the symbol associated with this action must be stacked before the reduce operation: a so-called scanreduce operation.

(These marking conventions differ from those used in AEH though they use the same symbols.) During construction of the tables, scan entries are computed before reduce entries, so when conflicting actions arise they are reduce actions which either conflict with an existing scan entry or with a previously computed reduce entry. (It is an inherent property of scanreduce entries that they cannot conflict with another entry.) Conflicts are indicated by lines of the form

Reduce with n conflicts with **********

where n is a production number.
Each such line is followed by a list of items of the form, symbol action /, where action is either SCAN, if a scan action for this (terminal) symbol has already been constructed, or an integer naming the production for which a reduce action has already been constructed.

If the directive ||PRINTLALR is specified, the output just described occurs for every state whether or not it contains conflicts. The heading, LALR(1) Tables, precedes such output. This output is rarely worth having; it is occasionally of value in tracking down a conflict, but its primary function was in debugging the PGS.

The PGS discards conflicting entries after printing them, and it will not form either the binary tables or the module output if there are reduce-reduce conflicts. In the case of reduce-scan conflicts, the decision to process scan entries first implies that the scan entry takes precedence. This is occasionally useful; for example, it solves the dangling else problem in the preferred way. However, the scan-reduce conflict message is a warning and as such triggers the displayed message directing attention to the .log file.

After generating the tables a heading, LALR(1) Statistics, is output and a few counts are printed. Only the first three may be of any general interest; they indicate the number of parsing states and the total number of actions in the tables for both terminal and nonterminal symbols.

4.2.3 Output during table compaction

Table compaction and output of the binary tables only occur if the directive ||LISTS is specified, in which case the output described here appears on the .log file. The earlier stages of the PGS are written in a general way; data structures will expand to accommodate very large grammars, and at the cost of recompiling the system and changing compile-time constants, the limits on field sizes mentioned previously can be increased.
At this stage the objective is to pack information economically into 16-bit words, and it is here that the size of fields is an absolute constraint. Final checks are made which could conceivably produce one or more of the self-explanatory messages:

FAIL - more than 255 terminal symbols
FAIL - more than 254 nonterminal symbols
FAIL - more than 2047 states or productions
FAIL - more than 15 symbols in a production right part

These are rather unlikely; the tightest constraint is likely to be the limit of 255 on rule numbers mentioned at the end of Section 4.2.1. If any of these checks fail, processing terminates.

Assuming no error messages, the only unsolicited output here is a heading, Parse Table Data, and one table. A hash-accessed look-up table, for the terminal symbols of the grammar, is created for use by the scanner. As hash functions are notoriously unreliable, the following is printed so that a visual check can be applied to ensure no disasters: the subheading, Scanner hashtable probe counts (terminal symbol, probe count, hashcode), followed by a four-column layout of triples which, as the heading indicates, show for each symbol the number of probes needed to locate it in the hash table and its hashcode. The technique used is Algorithm C, page 514, Vol 3 of Knuth's "The Art of Computer Programming", Addison-Wesley (1973). If there are n terminal symbols, they are hashed into a table of MIN[m,251] buckets; m is always an odd integer, either 3*n/2 or 3*n/2+1. The hash function uses the ASCII values of the first and last characters of the symbol and is ((127*first character + last character) MOD buckets) + 1. One can expect the performance of this hash function to deteriorate as the number of terminal symbols approaches 255.

If the directive ||PRINTLALR is specified as well as ||LISTS, a record of the table compaction transformations is produced.
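The bucket count and hash function described in section 4.2.3 can be sketched as follows. This is a Python illustration of the arithmetic only; the function names are invented, and the real implementation is of course in Mesa.

```python
def pgs_buckets(n):
    """Number of hash buckets for n terminal symbols: the odd integer
    3*n/2 or 3*n/2+1, capped at 251 (MIN[m,251] in the text)."""
    m = (3 * n) // 2
    if m % 2 == 0:
        m += 1
    return min(m, 251)

def pgs_hash(symbol, buckets):
    """((127*first character + last character) MOD buckets) + 1."""
    first, last = ord(symbol[0]), ord(symbol[-1])
    return (127 * first + last) % buckets + 1
```

Since the function looks only at the first and last characters, distinct symbols sharing those characters always collide, which is one reason the probe-count table is printed for visual inspection.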
This record is only of interest for maintenance of the system, and familiarity with the compaction techniques described in AEH is assumed in its description. First, a set of default actions for the nonterminal symbols of the grammar is determined and a table headed, Nonterminal Default Actions, is printed. Each nonterminal symbol appears preceded by its associated default action, encoded in the form already described; that is, unsigned integers represent scan entries and negative integers represent scan-reduce actions. (Reduce actions never take place on nonterminal symbols.)

After removing all occurrences of these defaulted entries from the LALR(1) tables, the PGS determines those states which have identical symbol-action pairs, first over the set of terminal symbols and then, independently, over the set of nonterminal symbols. All such states reference one copy of the list of symbol-action pairs stored in the binary tables. The table, Table Overlays, has three columns headed row, ttab and ntab; if a row of this table contains (integer) entries a, b and c respectively, then it implies that the terminal entries of state a are the same as those of state b, while the nonterminal entries of state a are the same as those of state c. It is exceptional for both the terminal and nonterminal entries of a state to match those of other states, so usually one of the entries b or c is blank. If neither the terminal nor nonterminal entries of a state match those of another state then it does not appear in this table.

The final transformation is to renumber the states so that all of those states containing (nondefaulted) actions on nonterminal symbols are labelled by integers contiguous to each other and to 1. This is achieved by swapping the highest numbered state with nonterminal actions with the lowest numbered state without nonterminal actions until no more swaps are possible. The table headed, Renumbered States, simply records this with entries of the form, a swapped with b.
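The renumbering loop just described might be sketched as follows. This is an invented-name Python illustration of the swapping idea only, not the PGS code.

```python
def renumber_states(has_nt_actions):
    """has_nt_actions: dict mapping state number -> True iff the state has
    (nondefaulted) actions on nonterminal symbols.  Returns the list of
    (a, b) swap pairs ('a swapped with b') and the final assignment."""
    states = dict(has_nt_actions)
    swaps = []
    while True:
        with_nt = [s for s, v in states.items() if v]
        without_nt = [s for s, v in states.items() if not v]
        if not with_nt or not without_nt:
            break
        hi, lo = max(with_nt), min(without_nt)
        if hi < lo:  # states with nonterminal actions already contiguous to 1
            break
        states[hi], states[lo] = states[lo], states[hi]
        swaps.append((hi, lo))
    return swaps, states
```

After the loop, every state with nonterminal actions carries a number contiguous to 1, which is what lets the binary tables index the nonterminal arrays by the compact NTState range.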
4.3 The Module File

This is most readily illustrated with an example; it happens to be the module file generated by the PGS for its own grammar.

ParseTable: DEFINITIONS = {

  Symbol: TYPE = [0..255];
  TSymbol: TYPE = Symbol[0..19];
  NTSymbol: TYPE = Symbol[0..13];

  -- token indices for the scanner and parser
  tokenID: TSymbol = 1;
  tokenNUM: TSymbol = 2;
  tokenQUERY: TSymbol = 3;
  tokenTAB3: TSymbol = 9;
  tokenTAB4: TSymbol = 10;

  initialSymbol: TSymbol = 3;
  defaultMarker: TSymbol = TSymbol.FIRST;
  endMarker: TSymbol = TSymbol.LAST;

  HashIndex: TYPE = [0..29];
  VIndex: TYPE = [0..106];
  State: TYPE = [0..26];
  NTState: TYPE = State[0..6];
  TIndex: TYPE = [0..64];
  NTIndex: TYPE = [0..3];
  Production: TYPE = [0..37];

  };

The module defines aliases using the information given in the aliases segment of the input. For example, the token ||TABLE3, which is a terminal symbol of the PGS grammar, was given the alias tokenTAB3; it was the ninth token in the sequence of terminal symbols in the input file, and so internally within the PGS and the binary tables produced it is encoded as 9. The ranges HashIndex, TSymbol, NTSymbol, State, NTState, TIndex, NTIndex, Production and VIndex prescribe the dimensions of arrays in the binary tables.

4.4 The Binary File

The format of the binary file is captured by a set of Mesa definitions which, since they are of interest to both scanner and parser, can conveniently be specified in the definitions module which constitutes the scanner-parser interface. These definitions are reproduced below.
ActionTag: TYPE = MACHINE DEPENDENT RECORD [
  reduce(0: 0..0): BOOL,        -- TRUE iff reduce entry
  pLength(0: 1..4): [0..15]];   -- number of symbols in production rhs

ActionEntry: TYPE = MACHINE DEPENDENT RECORD [
  tag(0: 0..4): ActionTag,           -- [FALSE,0] if a shift entry
  transition(0: 5..15): [0..2047]];  -- production number / next state

ProductionInfo: TYPE = MACHINE DEPENDENT RECORD [
  rule(0: 0..7): [0..256),   -- reduction rule
  lhs(0: 8..15): NTSymbol];  -- production lhs symbol

VocabHashEntry: TYPE = MACHINE DEPENDENT RECORD [
  symbol(0: 0..7): TSymbol,    -- symbol index
  link(0: 8..15): HashIndex];  -- link to next entry

ScanTable: TYPE = ARRAY CHAR['\040..'\177] OF TSymbol;
HashTable: TYPE = ARRAY HashIndex OF VocabHashEntry;
IndexTable: TYPE = ARRAY TSymbol OF CARDINAL;

Vocabulary: TYPE = MACHINE DEPENDENT RECORD [ -- a string body
  length(0), maxlength(1): CARDINAL,
  text(2): PACKED ARRAY VIndex OF CHAR];

ProdData: TYPE = ARRAY Production OF ProductionInfo;
NStarts: TYPE = ARRAY NTState OF NTIndex;
NLengths: TYPE = ARRAY NTState OF CARDINAL;
NSymbols: TYPE = ARRAY NTIndex OF NTSymbol;
NActions: TYPE = ARRAY NTIndex OF ActionEntry;
NTDefaults: TYPE = ARRAY NTSymbol OF ActionEntry;
TStarts: TYPE = ARRAY State OF TIndex;
TLengths: TYPE = ARRAY State OF CARDINAL;
TSymbols: TYPE = ARRAY TIndex OF TSymbol;
TActions: TYPE = ARRAY TIndex OF ActionEntry;

initialState: State = 1;
finalState: State = 0;

Table: TYPE = MACHINE DEPENDENT RECORD [
  scanTable: RECORD [
    scanTab: TableRef RELATIVE POINTER TO ScanTable,
    hashTab: TableRef RELATIVE POINTER TO HashTable,
    vocabIndex: TableRef RELATIVE POINTER TO IndexTable,
    vocabBody: TableRef RELATIVE POINTER TO Vocabulary ],
  parseTable: RECORD [
    prodData: TableRef RELATIVE POINTER TO ProdData,
    nStart: TableRef RELATIVE POINTER TO NStarts,
    nLength: TableRef RELATIVE POINTER TO NLengths,
    nSymbol: TableRef RELATIVE POINTER TO NSymbols,
    nAction: TableRef RELATIVE POINTER TO NActions,
    ntDefaults: TableRef RELATIVE POINTER TO NTDefaults,
    tStart: TableRef RELATIVE POINTER TO TStarts,
    tLength: TableRef RELATIVE POINTER TO TLengths,
    tSymbol: TableRef RELATIVE POINTER TO TSymbols,
    tAction: TableRef RELATIVE POINTER TO TActions ] ];

TableRef: TYPE = LONG BASE POINTER TO Table;

The purpose and content of the arrays in parseTable are explained in AEH; only the definitions relevant to the scanner are discussed here. Terminal symbols of the grammar represented by a single ASCII character are treated separately from those requiring a string of characters. In scanTable there is an array scanTab which can be indexed by characters not in the range [0C..' ]; any single-character symbol used to extract an element of this array will extract a non-zero integer only if it represents a terminal symbol, and the integer is its numeric encoding. The string vocabBody contains the character strings representing all other valid terminals, stored head to tail. Element i-1 of vocabIndex indexes the first character of the string in vocabBody which represents the terminal symbol with the encoding i.

Given a string which purports to be a terminal symbol, its hash value can be computed using the hash function given in section 4.2.3 (identifying buckets there with LAST[HashIndex]). The hash value can be used to select an element from the array hashTab; the elements of hashTab are records, one field of which is used to select an entry of vocabIndex (to find the substring in vocabBody to compare with the given string), while the other field is an index to another element in hashTab to be tried when the string comparison fails. hashTab[0] is void; a zero index terminates the search, indicating that the given string does not represent a terminal.

4.5 The LR and First Files

As part of the debugging facilities built into the PGS, two other output files are created on request. The first step in building the LALR(1) tables is to construct LR(0) tables.
(This is done using the SLR(1) algorithm in section 3.1 of the AEH paper, omitting the computation of the sets specified in relation (5b).) The LR(0) tables may be output to a file with the extension .lr by specifying the directive ||PRINTLR in the input; the form of the output is similar to that used for the LALR(1) tables, but of course the terminal symbols triggering reduce actions are not known, so a line

Reduce with p

follows the state heading if the production with label p has reduce actions in the state.

The next stage is to compute lookahead symbols for all these incompletely determined reduce actions according to the LALR(1) rules. This is done using Anderson's bewilderingly succinctly stated algorithm at the foot of page 21 and top of page 22 in AEH. It is convenient, as a preliminary to this, to compute, for each nonterminal symbol, the set of terminal symbols which can appear as the first symbol of a string derived from the nonterminal. This is a transitive closure calculation. It provides the initial data for computing Anderson's exact right contexts, which is yet another transitive closure calculation. If the directive ||FIRST appears among the input directives in addition to either of the directives ||PRINTLALR or ||LISTS, a file with extension name .first is created which contains a list of all nonterminal symbols, each being followed by an (offset) list of the terminals which can start a string derived from it.

5. The Preprocessor

The preprocessor is invoked if the input file name has the extension .mesa. Its function arises from the following considerations. For maintenance purposes it is more or less essential that each arm of the SELECT statement implementing the semantic processing routines in the Mesa compiler has comments displaying those productions associated with the arm. The test preceding the => symbol in the arm is the rule number for these productions.
As Mesa evolves, changes are made to the grammar: productions associated with one arm must be moved to another, new productions and new arms are introduced, and periodically it is necessary to renumber the rules in a systematic fashion in order to find things (there being over 200 arms). The bookkeeping necessary to ensure that the story told by the SELECT statement is consistent with that told to the PGS in the input file is tiresome and error-prone. Most of the data needed by the PGS is present in the SELECT statement; by adding the rest, only one copy of the grammar need be maintained and the reassignment of rule numbers can be mechanised.

The preprocessor expects a Mesa program module as input, and it scans for the sequence digits => -- after which it expects to find data relevant to it. On encountering CR, or Bravo's ControlZ ... CR sequence, it checks whether the next printing characters open a Mesa comment or not; if they do then more data is expected, otherwise the end of the data associated with a particular select arm is presumed and a search is instituted for the next. Supposing the input file name given was semroutine.mesa, then during the scan the preprocessor copies the input file to semroutine.mesa$ (shades of Bravo) and makes a modified copy in a scratch file. The modification is a trivial one: as the sequences digits => -- are encountered, the next integer in the sequence 0,1,2,... is substituted before the => symbol. At the end of the input scan, the scratch file is written back to semroutine.mesa.

Clearly the rather crude procedure used to locate the grammatical information in the program text places constraints on the program module containing it. In particular, it precludes comments in a fairly natural place in any other SELECT statements that may appear in the module which also use integer tests. On the other hand, anything approaching parsing the text is out, since it would merely replace one updating problem with another.
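The substitution step can be sketched in a few lines. This Python sketch handles only the textual replacement (it also accepts ?, which section 5 later notes is recognised as an alternative to a digit sequence); the real preprocessor additionally tracks CR and comment boundaries, which this ignores, and the function name is invented.

```python
import re

def renumber_select_arms(text):
    """Replace the integer (or '?') before each '=> --' with the next
    integer in the sequence 0, 1, 2, ..."""
    counter = -1
    def repl(match):
        nonlocal counter
        counter += 1
        return f"{counter} => --"
    return re.sub(r"(\d+|\?)\s*=>\s*--", repl, text)
```

For example, arms originally numbered 5, ?, 12 would come out renumbered 0, 1, 2 in order of occurrence.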
Since the PGS constructs tables using the new rule numbers just assigned, arms of the SELECT statement can be reordered to group logically related arms together without making the search for an arm with a given rule number tedious.

In the comments associated with SELECT arms (other than the first encountered) the preprocessor expects to find only productions. Here each production is specified in full: its left part token, followed by either ::= or (to designate a chain production) ::=C, followed by the right part tokens. Comments preceding the first production contain the additional information needed. This information, in order of occurrence, is:

1. Optionally, and in either order, the tokens MODULE: or BINARY: may appear, each followed only by a token naming the file to contain the corresponding output.

2. Next must appear GOAL: followed by the token naming the nonterminal which is the goal symbol of the grammar.

3. Optionally there may appear TERMINALS: followed by the tokens representing the terminals. The end of sentence marker should not be included; eof is supplied.

4. Optionally there may appear ALIASES: followed by the alias sequence as described in section 3, paragraph 4.

5. Finally PRODUCTIONS: must appear before the first production.

When the preprocessor is selected, the output file names are formed by appending the various extensions to pgs rather than to the main part of the input file name; thus pgs.module and pgs.binary are the default names if the MODULE: and BINARY: options are not used in the input.

Nonterminal symbols are deduced from the tokens to the left of ::= tokens. If the TERMINALS: sequence appears, only these symbols are taken to be terminals. In its absence, any token in a production which is not a nonterminal according to the previous definition is a terminal. Omitting this sequence means that typing errors define terminal symbols!
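The deduction rule in the last paragraph is easy to state precisely. A hypothetical sketch (Python; classify is an invented name, not part of PGS) of how terminals fall out when the TERMINALS: sequence is omitted:

```python
def classify(productions):
    """productions: token lists, e.g. ["e", "::=", "e", "+", "t"]."""
    # Tokens to the left of ::= (or ::=C, for chain productions) are the
    # nonterminals.
    nonterminals = {p[0] for p in productions
                    if len(p) > 1 and p[1] in ("::=", "::=C")}
    # Any other token appearing in a right part is taken to be a terminal.
    terminals = set()
    for p in productions:
        for token in p[2:]:
            if token not in nonterminals:
                terminals.add(token)
    return nonterminals, terminals
```

With this rule a misspelt right-part token, say tt for t, is silently classified as a brand-new terminal, which is exactly the pitfall the text warns about.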
From a file of this form (see the example at the end of the appendix), the preprocessor constructs a scratch file in the format specified in section 3, with the directives ||INPUT, ||CHAIN and ||LISTS, which it passes to the input phase of the PGS. The preprocessor only generates one error message, "Directives incorrect or out of sequence". No further processing occurs in this situation, so the message is displayed, followed by the request to type a character to terminate execution. When inserting new arms in the SELECT statement there is no need (so far as the preprocessor is concerned) to use an integer distinct from those on other arms, but it is probably not a good idea. The preprocessor will recognise ? as an alternative to a digit sequence.

6. Operation

6.1 Historical notes

The Mesa source components of PGS include:

PGSParseTable.mesa, the binary tables definitions module,
ProtoP1.mesa, the scanner-parser interface module,
PGSConDefs.mesa, the PGS interface module,
ProtoParser.mesa, the Mesa parser,
PGSScan.mesa, the PGS scanner and semantic processing routines,
PGSLALR.mesa, the LALR(1) table generator,
PGSTab.mesa, the compacted table constructor,
PGSFormat.mesa, the preprocessor,
PGSControl.mesa, the PGS control module.

The compilation sequence for the PGS modules is simply summarised by the rules:

(a) compile PGSParseTable before PGS1,
(b) compile PGSTypes before PGSConDefs,
(c) compile PGSParser after (a),
(d) compile PGSScan or PGSControl after both (a) and (b);

the remaining modules can be compiled after (b). The module PGSConDefs interfaces the PGS to the Mesa system, providing I/O and storage allocation routines for the other modules; as a consequence only PGSConDefs and PGSControl (which implements these facilities) reference Mesa system modules.

A few precautions were taken during implementation in case a 64K memory should prove too small.

1.
Apart from a few small arrays for which precise bounds could be predetermined, all arrays are declared using types declared in PGSConDefs; these types are descriptors or pointers with the word LONG commented out. Should a 64K memory subsequently prove constraining, removal of these comment symbols and transfer to an extended memory machine should dispose of the problem.

2. The display is turned off just before invoking the procedure lalrgen in the module PGSLALR, since it is during the construction of the LALR(1) tables that the total memory demand is greatest.

3. Much of the memory space used in the construction of the LALR(1) tables is recovered for use in table compaction by outputting the LALR(1) tables in a much more convenient form to a scratch file (the .binary file is used). The table compaction routine tabgen builds its own record of the LALR(1) tables from this file; it inherits all other data needed from the data structures built during the input of the grammar.

6.2 PGS Operation

PGS command line parsing has been revised to handle module identifiers, file names, etc., in the currently approved way and to allow easier parameterization with respect to long or short pointers. The basic idea was to make PGS more like the compiler and binder in these areas (to avoid the current file name hassles and to make PGS more usable by the system modeller).

Processing Modes

Input: Mesa vs. grammar. PGS can extract the information needed to build parsing tables from comments embedded within standard Mesa source files. Although the above documents this input mode almost as an afterthought, it has become the standard one. The conventional extension for the input file in this mode is .pgs. The grammar mode has been retained for its occasional utility in experimenting with grammars. See the Appendix.

Output: BCD vs. binary. The usual output mode is BCD; this facilitates packaging the parsing tables with the code that uses them (in a BCD, boot or image file).
The binary mode has been retained primarily for situations in which the parsing tables are to be used by non-Mesa programs.

Mesa Programs as PGS Source Files

The list of keywords that optionally precede the first production has been revised and expanded as follows:

TABLE: TYPE: EXPORTS: GOAL: TERMINALS: ALIASES: PRODUCTIONS:

The first three must precede all the others but may occur in any order; the next section explains their significance. The last four must appear in the specified order; all but the last may be omitted.

Output Identification

In the source text, the old keywords dealing with file and module names are replaced by

TABLE: tableId TYPE: typeId EXPORTS: interfaceId -- or EXPORTS: SELF

This is supposed to remind you of programId: EXPORTS interfaceId (sorry about the different treatment of colons, etc.). The names tableId, typeId and interfaceId are module identifiers (capitalization counts) and get put into BCDs, symbol tables, etc.

Examples

The following examples are taken from the current system. The compiler and binder specify the same type name because they use the same parsing module, in which that interface is a (compile-time) parameter. On the other hand, they export different interfaces because loading of the corresponding tables is handled differently (see below).

TABLE: BCDParseData TYPE: ParseTable EXPORTS: SELF -- binder
TABLE: MesaTab TYPE: ParseTable EXPORTS: CBinary -- compiler
TABLE: PGSParseData TYPE: PGSParseTable EXPORTS: SELF -- PGS

Invoking PGS

File Naming

When you invoke PGS, you can specify an arbitrary association between files and module identifiers using the same command-line conventions that the compiler and binder do.
The most general form is:

[defs: defsFile, bcd: bcdFile, grammar: grammarFile] _ sourceFile[interfaceId: interfaceFile]/switches

which puts the source for the interface typeId on defsFile.mesa, the BCD for the tableId (or binary, if you say "binary:") on bcdFile.bcd (or .binary), a summary of the grammar on grammarFile.grammar (only if the input was a Mesa source file), and finds the BCD for interfaceId on interfaceFile.bcd. Capitalization doesn't count here. By default:

defsFile: typeId.mesa
bcdFile: tableId.bcd
grammarFile: tableId.grammar (inhibit with /-g)
error messages: tableId.pgslog

(There are further defaults for the cases in which the input is just a grammar file or you omit keyword items for the module identifiers in the source.) Note that the new use of keywords is not compatible with previous use.

Command line examples

The following commands build parsers for the Compiler, Binder, and PGS.

PGS [grammar: Cedar] _ Pass1T.pgs
  needs CBinary.bcd
  produces Pass1T.mesa, MesaTab.bcd, ParseTable.mesa
  exports MesaTab as a PROGRAM in the interface CBinary
  puts grammar summary in Cedar.grammar

PGS [defs: BcdParseTable, grammar: CMesa] _ BcdTreeBuild.pgs
  produces BcdTreeBuild.mesa, BcdParseData.bcd, BcdParseTable.mesa
  exports BcdParseData directly (no interface)
  puts grammar summary in CMesa.grammar

PGS [defs: PGSParseTable, grammar: PGS] _ PGSScan.pgs
  produces PGSScan.mesa, PGSParseData.bcd, PGSParseTable.mesa
  exports PGSParseData directly (no interface)
  puts grammar summary in PGS.grammar

6.3 TableCompiler Operation

TableCompiler command line parsing has been similarly revised to resemble that of the compiler and binder.

Processing Modes

TableCompiler is a program to convert assorted inputs to Cedar .bcd files that can be bound into configurations and managed as code segments.
If the source file name has extension .mesa, it extracts string literals from string array declarations and gives the skeleton of a DEFINITIONS file describing the resulting structure; otherwise, it just wraps a BCD header around a collection of bits.

Invoking TableCompiler

File Naming

When you invoke TableCompiler, you can specify an arbitrary association between files and module identifiers using the same command-line conventions that the compiler and binder do. The most general form is:

[bcd: bcdFile, format: formatFile] _ sourceFile[interface: interfaceFile]/switches

which puts the BCD output on bcdFile.bcd, the record format on formatFile.format (only if the input was a Mesa source file), and finds the interface BCD on interfaceFile.bcd. Capitalization doesn't count here. By default:

bcdFile: source.bcd
formatFile: source.format
error messages: TableCompiler.log

where source is the root of the sourceFile name. Note that the new use of keywords is not compatible with previous use.

Command line example

The following command builds components of the compiler:

TableCompiler ErrorTab[interface: CBinary] DebugTab[interface: CBinary]
  needs CBinary.bcd
  produces ErrorTab.format, ErrorTab.bcd, DebugTab.format, DebugTab.bcd
  exports ErrorTab and DebugTab as PROGRAMs in the interface CBinary

Appendix

An Input File

||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1 id + ( ) * IF THEN OR ELSE := EOF
||TABLE2 g s a i e t p b l
||TABLE3 + tokenPLUS
||TABLE4
1 g ::= s
C s ::= a
C | i
2 a ::= id := e
3 i ::= IF b THEN a l
C e ::= t
4 | e + t
C t ::= p
5 | t * p
6 p ::= ( e )
7 | id
8 b ::= b OR id
9 | id
10 l ::=
11 | ELSE s

The Resulting Module File

ParseTable: DEFINITIONS =
BEGIN
-- types for data structures used by the Mesa parser and scanner
Symbol: TYPE = [0..255];
-- token indices for the scanner and parser
tokenPLUS: TSymbol = 2;
HashIndex: TYPE = [0..17];
VIndex: TYPE = [0..22];
TSymbol: TYPE = Symbol [0..
11];
NTSymbol: TYPE = Symbol [0..10];
State: TYPE = [0..17];
NTState: TYPE = State [0..5];
TIndex: TYPE = [0..23];
NTIndex: TYPE = [0..9];
Production: TYPE = [0..15];
END.

The Resulting Echo File

||INPUT ||CHAIN ||LISTS ||PRINTLALR
||TABLE1
1 id 2 + 3 ( 4 ) 5 * 6 IF 7 THEN 8 OR 9 ELSE 10 := 11 EOF
||TABLE2
12 g 13 s 14 a 15 i 16 e 17 t 18 p 19 b 20 l
||TABLE3
+ tokenPLUS
||TABLE4
GOAL ::= g EOF
1 1 g ::= s
2 C 2 s ::= a
3 C 3 | i
4 2 a ::= id := e
5 3 i ::= IF b THEN a l
6 C 6 e ::= t
7 4 | e + t
8 C 8 t ::= p
9 5 | t * p
10 6 p ::= ( e )
11 7 | id
12 8 b ::= b OR id
13 9 | id
14 10 l ::=
15 11 | ELSE s

The Resulting Log File

LALR(1) Tables
1 0 0/ 2 id 3 IF 0 g -1 s -1 a -1 i
2 id 4 1/ 4 :=
3 IF 5 1/ -13 id 5 b
4 := 4 2/ -11 id 6 ( 7 e 8 t 8 p
5 b 5 2/ 12 1/ 9 THEN 10 OR
6 ( 10 1/ -11 id 6 ( 11 e 12 t 12 p
7 e 4 3/ 7 1/ 13 + *4 ELSE *4 EOF
8 e 4 3/ 7 1/ 9 1/ 13 + 14 * *4 ELSE *4 EOF
9 THEN 5 3/ 2 id 15 a
10 OR 12 2/ -12 id
11 e 7 1/ 10 2/ 13 + -10 )
12 e 7 1/ 9 1/ 10 2/ 13 + -10 ) 14 *
13 + 7 2/ -11 id 6 ( 16 t 16 p
14 * 9 2/ -11 id 6 ( -9 p
15 a 5 4/ 17 ELSE *14 EOF -5 l
16 t 7 3/ 9 1/ *7 + *7 ) 14 * *7 ELSE *7 EOF
17 ELSE 15 1/ 2 id 3 IF -15 s -15 a -15 i

LALR(1) Statistics
States = 17
Terminal entries = 37
Nonterminal entries = 19
First OR operation count = 5
Total OR operation count = 33
Maximum number of contexts = 9

Parse Table Data
Nonterminal Default Actions
0 g -1 s 15 a -1 i 7 e 8 t 8 p 5 b -5 l
Entries removed = 0
Table Overlays
row ttab ntab
6 4 13 4 14 4 17 1
Scanner hashtable probe counts (terminal symbol, probecount, hashcode)
id 1 6
IF 1 9
THEN 1 3
OR 1 1
ELSE 1 10
:= 1 16
Renumbered States
2 swapped with 17
3 swapped with 14
4 swapped with 13
5 swapped with 6

The PGS Grammar

||CHAIN ||LISTS ||INPUT
||TABLE1
1 symbol 2 num 3 '?
4 '| 5 "::=" 6 'C 7 "||TABLE1" 8 "||TABLE2" 9 "||TABLE3" 10 "||TABLE4" 11 "||INPUT" 12 "||CHAIN" 13 "||LISTS" 14 "||PRINTLR" 15 "||PRINTLALR" 16 "||FIRST" 17 "||IDS" 18 "GOAL" 19 eof
||TABLE2
20 grammar 21 head 22 ruleset 23 directives 24 terminals 25 nonterminals 26 aliases 27 directive 28 discard 29 rulegroup 30 prefix 31 goalrule
||TABLE3
symbol tokenID num tokenNUM '? tokenQUERY "||TABLE3" tokenTAB3 "||TABLE4" tokenTAB4 '? InitialSymbol
||TABLE4
GOAL ::= grammar eof
1 0 grammar ::= '? head ruleset
2 1 head ::= directives terminals nonterminals "||TABLE4"
3 1 | directives terminals nonterminals aliases "||TABLE4"
4 2 directives ::=
5 28 | directives directive
6 3 directive ::= "||INPUT"
7 4 | "||CHAIN"
8 5 | "||LISTS"
9 6 | "||PRINTLR"
10 7 | "||PRINTLALR"
11 8 | "||FIRST"
12 9 | "||IDS"
13 10 terminals ::= "||TABLE1"
14 11 | terminals discard symbol
15 11 nonterminals ::= nonterminals discard symbol
16 12 | "||TABLE2"
17 13 aliases ::= "||TABLE3"
18 14 | aliases symbol symbol
19 15 discard ::=
20 28 | num
21 28 | '?
22 16 rulegroup ::= symbol "::="
23 17 | prefix symbol "::="
24 18 | rulegroup symbol "::="
25 19 | rulegroup prefix symbol "::="
26 20 | rulegroup '|
27 21 | rulegroup prefix '|
28 22 | rulegroup symbol
29 23 prefix ::= num
30 24 | num num
31 24 | '? num
32 25 | discard 'C
33 26 | discard 'C num
34 27 | '?
35 C 28 ruleset ::= rulegroup
36 28 | goalrule rulegroup
37 28 goalrule ::= "GOAL" "::=" symbol symbol

An Input File For the Preprocessor

Program text has been stripped from within and around the SELECT statement implementing the semantic routines of the PGS to expose the grammatical information.

SELECT prodData[q[qj].transition].rule FROM
0 =>
-- MODULE: pgsptabdefsnew.mesa BINARY: pgsnew.binary
-- GOAL: grammar
-- TERMINALS: symbol num '? '| "::=" 'C "||TABLE1" "||TABLE2" "||TABLE3"
-- "||TABLE4" "||INPUT" "||CHAIN" "||LISTS" "||PRINTLR"
-- "||PRINTLALR" "||FIRST" "||IDS" "GOAL"
-- ALIASES: symbol tokenID num tokenNUM '?
tokenQUERY
-- "||TABLE3" tokenTAB3 "||TABLE4" tokenTAB4 '? InitialSymbol
-- PRODUCTIONS:
-- grammar ::= '? head ruleset
BEGIN END;
1 =>
-- head ::= directives terminals nonterminals "||TABLE4"
-- head ::= directives terminals nonterminals aliases "||TABLE4"
BEGIN END;
2 => -- directives ::=
BEGIN END;
3 => -- directive ::= "||INPUT"
BEGIN END;
4 => -- directive ::= "||CHAIN"
flags[chain] _ TRUE;
5 => -- directive ::= "||LISTS"
flags[lists] _ TRUE;
6 => -- directive ::= "||PRINTLR"
flags[printlr] _ TRUE;
7 => -- directive ::= "||PRINTLALR"
flags[printlalr] _ TRUE;
8 => -- directive ::= "||FIRST"
flags[first] _ TRUE;
9 => -- directive ::= "||IDS"
flags[ids] _ TRUE;
10 => -- terminals ::= "||TABLE1"
BEGIN END;
11 =>
-- terminals ::= terminals discard symbol
-- nonterminals ::= nonterminals discard symbol
BEGIN END;
12 => -- nonterminals ::= "||TABLE2"
BEGIN END;
13 => -- aliases ::= "||TABLE3"
BEGIN END;
14 => -- aliases ::= aliases symbol symbol
BEGIN END;
15 => -- discard ::=
l[top] _ InputLoc[]; -- keep the parser error recovery happy
16 => -- rulegroup ::= symbol "::="
BEGIN END;
17 => -- rulegroup ::= prefix symbol "::="
lhssymbol[v[top+1]];
18 => -- rulegroup ::= rulegroup symbol "::="
BEGIN END;
19 => -- rulegroup ::= rulegroup prefix symbol "::="
lhssymbol[v[top+2]];
20 => -- rulegroup ::= rulegroup '|
BEGIN END;
21 => -- rulegroup ::= rulegroup prefix '|
prodheader[FALSE];
22 => -- rulegroup ::= rulegroup symbol
BEGIN END;
23 => -- prefix ::= num
setrulechain[v[top], FALSE];
24 =>
-- prefix ::= num num
-- prefix ::= '? num
setrulechain[v[top+1], FALSE];
25 => -- prefix ::= discard 'C
setrulechain[prix, TRUE];
26 => -- prefix ::= discard 'C num
setrulechain[v[top+2], TRUE];
27 => -- prefix ::= '?
setrulechain[prix, FALSE];
28 =>
-- directives ::= directives directive
-- discard ::= num
-- discard ::= '?
-- ruleset ::=C rulegroup
-- ruleset ::= goalrule rulegroup
-- goalrule ::= "GOAL" "::=" symbol symbol
NULL;
ENDCASE => ERROR;