Inter-Office Memorandum
To Atkinson, Kessler Date May 3, 1985
From Dick Sweet  Location PARC
Subject Symbolic information Organization CSL
This memo is a start on specifying what symbolic information should be contained in the object files that are created by the compiler resulting from the Mimosa project.
Information in current object files
Before trying to design any new structures, we should understand what is in the current Cedar (or Mesa) object files, and the uses to which this information is used.
Variables and Types
For every variable instance declared in a Cedar program, there is a corresponding entry in the compile-time symbol table called a semantic entry. This entry contains the name of the variable (by reference to a separate hash table), location and size, and its type.
Every variable is linked together with all others declared at the same place by means of lists called contexts. An example of a context is "all the fields of the record R" or "all the local variables of the procedure P." At compile time, there is an ordered list of visible contexts, a context stack, that changes whenever the scope rules of the language dictate a change. For example, after a ".", if the left hand operator is of a record type, the context of field names of that record becomes visible.
Named constants are treated in the same way as variables, except that instead of a location and size, the semantic entry contains the value.
Types can be either named or constructed from existing types. The current compiler preloads the symbol table with the names and constructed types of the predefined types. The only basic, non-constructed types are a numeric type, which is named INTEGER in the predefined context, and a character type, named CHARACTER or CHAR in the predefined context. There is also another basic type used for ANY. The symbol table data structures have means of constructing all of the constructed types defined in the language, including record types, enumerations, subranges, arrays, etc.
Copied Symbols
Very few Mesa programs are completely self contained; most reference symbols from other modules via the DIRECTORY clause of the language. The compiler tends to copy definition structures from the included modules into the symbol table of the current program. For reasons having to do largely with conserving space and time (I think), it doesn't copy the structures fully: when it copies a record, it copies only those fields into the field context which are actually referenced in the current program. Likewise for constants of an enumerated type. I'm not sure about the Cedar Abstract Machine, but the Mesa debugger contains large amounts of hair for dealing with incomplete contexts so that one can debug as fully as possible in the absense of all of the necessary symbols files. It seems to me that with the larger VM now available, not copying symbols would be a wise decision.
Statement Mapping Information
In order to set breakpoints by pointing at the program source, there is a mapping from source character position to object byte called the fine grain table. In current Cedar object files, the table contains the character position of the first source character of each statement, and the byte offset of the first byte of code corresponding to that statement. Certain program transformations, such as cross jumping, can remove entire statements from the code, and the corresponding table entries are eliminated as well.
The fine grain table is used by the debugger in both directions: to set breakpoints, and to determine source location when walking the activation stack. There are a few rough edges associated with the current implementation that I will discuss after laying a little more ground work for the discussion.
In the following example, fine grain table entries are show by the symbol õ. Note that they are at the beginning of executable bodies declarations (programs and procedures) and at the beginning of executable statements, counting declarations with executable initialization (but not manafest constants).
õSimple: PROGRAM =    --1 (22)
BEGIN
õi: INT ← 3;     --2 (22)
õProc1: PROC =     --3 (40)
BEGIN
j: INT;
õProc2: PROC [n: CARDINAL] RETURNS [CARDINAL] = --4 (106)
BEGIN
i: CARDINAL;
õi ← 1;    --5 (111)
õFOR k: CARDINAL IN [0..n) DO  --6 (113)
õi ← i * k;    --7 (121)
ENDLOOP;
õRETURN[i/2];   --8 (132)
END;
z: INT = 0;
õm: CARDINAL = i;    --9 (43)
k: INT;
õj ← Proc2[m];    --10 (51)
õk ← Proc2[m+1];    --11 (57)
õi ← i + j + k + z;    --12 (67)
END;
õi ← 2*i;     --13 (30)
õProc1[];     --14 (36)
END.
The monotonic numbers on the right do not correspond to any numbering scheme in the object file data structures, but are there for use in this discussion. The numbers in parantheses correspond to the byte offsets (octal) in the object code for the beginning of the correspond statements.
The first thing to note is that the current compiler generates code for procedures by a top-down walk of the nesting tree, generating a given procedure, say Proc1, and then all nested procedures, Proc2 in the example. This has the nice feature that the code bodies for different procedures are disjoint. As a result, the debugger can unambiguously determine for a given object offset what procedure it comes from. The algorithm is quite simple, for an object offset p, it finds the greatest lower bound of p from the set of paranthesized numbers, and assumes that you are somewhere in the corresponding statement (in truth, the data structures are so highly optimized for space as to make this simple algorithm rather ugly to implement).
Mapping in the opposite direction is more difficult, since the source text does not obey the disjoint property. For setting breakpoints, a straightforward algorithm is used. For a given source position, find the greatest lower bound of source positions in the table and set the breakpoint at the corresponding object position. Sometimes there are surprises: if you point somewhere inside the line containing
z: INT = 0;
the greatest lower bound is the entry numbered 8 above. This unfortunately is not even in the same procedure as where you pointed. This is exacerbated by the fact that it is often difficult to visually determine the difference between a manafest constant declaration (z in this case) and the immutable initialization of the variable m in the next line. The saving grace is that the debugger gives feedback by highlighting the source text where the break is placed, and Cedar programmers learn to watch where breakpoints actually go.
When the program is suspended by a breakpoint or uncaught signal, the debugger not only needs to find the source location, but also to reconstruct the context stack, those visible naming scopes associated with a given place in the source program. Below is a list of all the possible contexts of the sample program showing the variables declared therein (except the contexts of predeclared identifiers).
1 {Simple} 
the outermost context, containing the module name and the names of another modules named in the directory (empty in the example).
2 {i, Proc1}
the global declarations of the module
3 {j, Proc2, z, m, k}
the local declarations of Proc1
4 {n}
the parameters of Proc2
5 {<anon>}
the results of Proc2
6 {i}
the local declarations of Proc2
7 {k}
the variables local to the body of the FOR loop
The scope rules of Cedar are such that all seven contexts are visible (in reverse order) at fine grain entry 7, but only the first six at entries 6 or 8. Of course, the "i" in context 6 obscures the one in context 2. The current Cedar object file contains a threaded tree structure that reflects the complete tree structure of contexts in the program. This structure forms a part of the fine grain table. (actually a related table called the body table, but the two tables are so intertwined as to make the FGT meaningless without the body table).
Binding Information
At the beginning of the object file is a data structure called a BCD, which in addition to being a bad pun is an acronym for binary configuration description. It has a header that describes the full contents of the object file, which for binder output can contain numerous modules and sub-configurations. It also contains various tables that tell the binder and/or loader how the modules in this object file are to be connected to the outside world and vice versa. This interconnection is done by importing and exporting various interfaces. Interfaces are characterized by a module name and a version stamp. There is also a file name, but this can be treated as a hint. One can think of an interface as a record with a field for every importable or exportable item (procedure or variable).
The export records are positional constructors for these records with the values in the fields being the representation of the actual implementation within one of the modules of the object file. Imported interfaces are encoded in a somewhat different manner, but are still positional. There has been some interest from the Mesa group in OSD for making the interface references more symbolic, so that additions to interfaces would not necessitate massive recompilations.
Literals
xxx
Safe Storage Information
xxx