CSL Notebook Entry: Saffron Dealer Notes and Slides

XEROX

To: CSL
From: James Rauen, PARC CSL
Subject: Dealer notes and slides: Saffron, an experimental Cedar compiler
Date: August 31, 1988

FOR XEROX INTERNAL USE ONLY

Abstract: Cedar gives its programmers more than enough rope to hang themselves, but rarely allows them to kick the chair out from underneath. The incredible maze of Cedar "features" makes it difficult to produce a coherent specification of the language, even more so an implementation, not to mention a coherent implementation. I will describe Saffron, a compiler which attempts to coherently implement Cedar. I will discuss my work on Saffron this summer, describe some of Cedar's "features" which make it such a joy to implement, and show some highlights from the trip I took into the abyss of Weird Cedar.

Attributes: Dealer, Cedar, compiler, Casaba, informal

References: CSL-83-15, "A Description of the Cedar Language"

Very Brief Introduction

These are the slides I used for my Saffron dealer on August 30, 1988, along with some notes.

Slide 1 (Title)

Saffron
An Experimental Cedar Compiler
James Rauen
August 31, 1988

I had originally titled this talk "Weird Cedar", intending to talk about all the weird nooks and crannies of Cedar that I spent this summer exploring. But Howard Sturgis and Bill Jackson talked me out of it and convinced me that I should talk about Saffron instead.

Slide 2

Why another Cedar compiler?
  Demonstrate the viability of Casaba
  Principled implementation
    Easy to maintain
    Easy to verify
  Experiments

The first question you might ask is, "Why do we need another Cedar compiler?" After all, we already have one that works, and there are plenty of people in the lab who are porting Cedar to other architectures.

Well, the immediate goal (by immediate, I mean in the foreseeable future) of Saffron is to demonstrate that Casaba is a viable compiler implementation language. The best way to show this is to implement a real language, like Cedar, using Casaba.

It would also be really nice to have a principled implementation of Cedar kicking around. I haven't looked at the existing compiler much, but I've heard some spine-tingling horror stories about it. The problem with the existing compiler is that it is an aggregate of ten (?) years of work, during which time the language definition was constantly changing. So it isn't as though the compiler could simply have been designed once and then implemented. A principled Cedar implementation would make it easy (?) to compare the language semantics with the implementation itself. Also, a principled Cedar compiler would be a useful testbed for adding (and removing!) features from the language.

Slide 3

What is Cedar?
  CLRM
  Mesa manual
  The implementation
  Religion

The first problem I ran into was finding a definition of Cedar. Needless to say, this wasn't an easy task. There are at least four major sources, none of which is completely suitable.

The "definitive" source (and "definitive" here is in quotation marks) is the Cedar Language Reference Manual, Butler Lampson's work. Butler's book reads like very elegant poetry written in Serbo-Croatian: you know it's elegant, but you don't necessarily understand it right away. And if you ever do take the time to plow through it, you'll find that Butler presents a very elegant definition of a language which he calls Cedar. Butler's Cedar bears a very strong resemblance to the current version of Cedar we all know and love, but there are also a lot of differences, not the least of which is that Butler's Cedar is polymorphic.
There are also some discrepancies and inconsistencies in his definition of "current Cedar". So we can't just say that CLRM is the definitive definition of Cedar, not without changing our idea of what Cedar is. There are also some bugs in CLRM which no one has gotten around to fixing. I'll get to some of these later.

But CLRM isn't the only definition around. There's also the Mesa manual. If you consider Cedar to be a strict superset of Mesa, where Mesa is the subset of programs which don't have the keyword CEDAR in their headers, then the Mesa subset should agree with what the Mesa manual says. On the other hand, even if you don't buy the argument that Mesa is a strict subset of Cedar, I still think you'd agree that Cedar constructs should behave like their Mesa counterparts whenever possible. Either way, the Mesa manual provides at least a partial definition of Cedar.

And then there's the implementation: the current Cedar compiler.

Finally, we have the Cedar Religion. Every Cedar programmer has his/her own way of envisioning how the language "should be".

What this all boils down to is that there is no canonical definition of Cedar. So, when I say to myself, "I'm going to sit down and write a Cedar compiler," I'm at a bit of a loss to say exactly which Cedar I'm compiling. (I'm never going to say all this. Looks like only posterity will get to see it.)

Slide 4

Saffron
  Front end of a compiler
  Implements Cedar/Mesa + Religion
  Implemented in Casaba

By "implements Cedar, give or take Religion", I mean that we're sticking as close as possible to Butler and the current implementation. But there are also a few places where we add a feature, for completeness' sake; that's where the religion comes in. One thing we're not doing is eliminating disgusting language features. I might look at something like implementation modules in directory clauses, or side-effecting initialization expressions, and say, "that's disgusting." And Howard will jump up and down and might even say, "that's illegal." But somebody will then point out that, hey, some crucial code buried deep within the system depends on that language feature. In any event, we're not pruning the language at all.

Slide 5

Parser
  Concrete Grammar: 154 nonterminals, 475 productions
  Abstract Grammar: 120 nonterminals, 364 productions

A numbers slide. (Howard pointed out that I probably won't have the time to do the spiel on concrete grammars versus abstract grammars.) Faithful reader: if you're interested, find Howard or one of his Casaba papers. Casaba automatically generates a parser based on these grammars. This winds up giving us an abstract parse tree.

Slide 6

Recursive Functions
  Applied to the abstract parse tree
  Build:
    Context Tree
    Type Graph
    Program Graph
  Example:
    CompileFrameBlock: TreeRecursiveFunction
      [Tree, TypeGraphNode.transferType, ContextTree.parent, ProgramGraph.arg, CompilerState]
      Returns [ContextTree, ProgramFragment, ProgramGraph.res]
      DamagedReps [ContextTree.parent, ProgramGraph.arg];

Now that we have the parse tree, we define recursive functions to apply to the tree. In Saffron, we write functions which build three major data structures: the context tree, the type graph, and the program graph. I'll talk about these in detail in the next three slides. (A sketch of a context-tree node and one such tree-recursive function appears after the Slide 7 notes below.)

Slide 7

Context Tree
  Mimics the block structure of a program
  One node for each block/scope:
    Opened names
    Enabled signals
    Field List
    Exit List
    Dependency graph
    Parent context pointer

O patient reader: note that opened names, enabled signals, and exit lists aren't yet implemented, just thought about.
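To make Slides 6 and 7 a little more concrete, here is a minimal sketch in Python (not Casaba) of what a context-tree node and one tree-recursive function might look like. Everything here is invented for illustration; the names correspond only loosely to Saffron's actual Casaba definitions, and the "program fragment" is just a list.

  from dataclasses import dataclass, field
  from typing import Optional

  @dataclass
  class ContextNode:
      # One node per block/scope (Slide 7). Opened names, enabled
      # signals, and exit lists appear here even though Saffron has
      # only "thought about" them so far.
      parent: Optional["ContextNode"] = None          # parent context pointer
      fields: dict = field(default_factory=dict)      # field list: name -> declaration
      opened: list = field(default_factory=list)      # opened names
      enabled: list = field(default_factory=list)     # enabled signals
      exits: list = field(default_factory=list)       # exit list

  def compile_block(tree, parent_ctx):
      # In the style of CompileFrameBlock (Slide 6): consume a parse
      # tree node plus the parent context, return the new context and
      # a program fragment.
      ctx = ContextNode(parent=parent_ctx)
      fragment = []
      for decl in tree.get("decls", []):
          ctx.fields[decl["name"]] = decl
      for stmt in tree.get("stmts", []):
          fragment.append(("stmt", stmt))
      for child in tree.get("blocks", []):
          child_ctx, child_frag = compile_block(child, ctx)   # recurse on nested blocks
          fragment.extend(child_frag)
      return ctx, fragment

  ctx, code = compile_block({"decls": [{"name": "x"}], "stmts": ["x _ 3"]}, None)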
Slide 8

Type Graph
  Predefined types
  Constructed types:
    Pair: TYPE = RECORD [x: INT, y: INT];
    u: LIST OF REAL;
    DIRECTORY Foo: TYPE Bar USING [Baz];
  Runtime state

Predefined types: INT, CARD32, BOOLEAN, etc. There is a facility for any constructed type. In the first example, "Pair", we construct two type graph nodes: one for the RECORD type and one for the named "Pair" type. The second example shows that we also build type graph nodes for anonymous types (LIST OF REAL). Directory entries are type declarations, too, so we have them in the type graph. (A sketch of this construction appears after the Slide 11 notes below.)

Here's where we take off on a tangent. The type graph also includes type graph nodes describing the type structure of the current state of the computation. This includes the frame types of all the lexically visible stack frames (following the static links). We also know the type structure of the execution stack at all times, although we don't build type graph nodes for it; it can be inferred from the program graph (next slide).

Slide 9

Program Graph
  Intermediate language
  Retains all type information
  Type checking

This is our intermediate language. Unusual feature: it retains all the type information of the source code. As I mentioned in the last slide, given the program graph and the type graph, we can describe the exact type structure of a computation at every instruction. Departure point for one of Howard's radical type theories: do the type checking at this point, instead of while we walk the parse tree.

Slide 10

Environment

Environments. As we compile a file which includes others, we build up an environment structure indexed by file name. We only have to deal with each file once. The example here is RopeImpl. Here are the files named in each file's DIRECTORY clause:

  RopeImpl: Basics, PrincOpsUtils, Rope, RopePrivate
  Basics: PrincOps
  PrincOps: (nothing)
  PrincOpsUtils: Basics, PrincOps
  Rope: Basics, PrincOps
  RopePrivate: PrincOps, Rope

(The second sketch after the Slide 11 notes shows this once-per-file idea.)

Everything I've described so far was fairly well understood, and a lot of it was implemented, before I picked up the ball. About two months ago, Howard and I found ourselves at the brink of a huge pit. Here it is...

Slide 11

Into the Abyss
  Tree: TYPE = REF TreeBody;
  TreeBody: TYPE = RECORD [
    nodeData: REF ANY,
    children: LIST OF Tree ];

  a: INT = b;
  b: INT = 3;

Declare-before-use is more than slightly relaxed in Cedar. I started looking at code fragments like this, and thinking really hard. Type declarations can depend on other type declarations. (So we don't need forward declarations, as in Pascal, or other abhorrences.) The values of named constants depend on other declarations further down the line. (In the second example, if I replace b: INT = 3 with b: INT _ 3, then the value of a changes from 3 to TRASH.)

The following paragraphs are obsolete; they refer to the example a: INT = b; b: INT = a;

First, I noticed what happens if you scratch out the second "a" and replace it with something like "3". Cedar folds forward references to compile-time constants, which means that the value of "a" is 3. But if I change the second equals sign to a gets arrow, then the value of "a" suddenly becomes zero (or garbage?). So it turns out that you cannot, in general, know the value of an expression until you've looked at all the declarations which follow it.

There are lots of other things I can worry about if I feel like it. What happens if I wrap an identity function around "a"? It makes a difference whether or not the function is inline. What if one of my initializations has a side effect, like assigning to some other variable?
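Here is the sketch promised in the Slide 8 notes: a minimal Python illustration of how the "Pair" example might turn into type graph nodes. The class names are invented; predefined types are shared nodes, the RECORD constructor gets a node of its own, and the name "Pair" gets a separate node bound to it. Anonymous types (LIST OF REAL) get nodes too.

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class PredefinedType:
      name: str                 # INT, CARD32, BOOLEAN, REAL, ...

  @dataclass
  class RecordType:
      fields: list              # [(field name, type graph node), ...]

  @dataclass
  class ListType:
      element: object           # LIST OF <element>

  @dataclass
  class NamedType:
      name: str
      underlying: object        # the node this name is bound to

  INT = PredefinedType("INT")
  REAL = PredefinedType("REAL")

  # Pair: TYPE = RECORD [x: INT, y: INT];  -- two nodes: record + name
  pair_record = RecordType(fields=[("x", INT), ("y", INT)])
  pair = NamedType("Pair", pair_record)

  # u: LIST OF REAL;  -- the anonymous list type gets its own node
  u_type = ListType(element=REAL)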
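And here is the second promised sketch, for Slide 10's environment: compile each file at most once, caching results by file name, using the RopeImpl dependencies from that slide as test data. The function name and the "<compiled ...>" placeholder are invented; real compilation work would go where the comment says.

  # DIRECTORY dependencies from Slide 10.
  DIRECTORY = {
      "RopeImpl":      ["Basics", "PrincOpsUtils", "Rope", "RopePrivate"],
      "Basics":        ["PrincOps"],
      "PrincOps":      [],
      "PrincOpsUtils": ["Basics", "PrincOps"],
      "Rope":          ["Basics", "PrincOps"],
      "RopePrivate":   ["PrincOps", "Rope"],
  }

  environment = {}    # file name -> compiled interface, built once each

  def compile_file(name):
      if name not in environment:
          for dep in DIRECTORY[name]:      # included files are dealt with first
              compile_file(dep)
          environment[name] = f"<compiled {name}>"   # real work would go here
      return environment[name]

  compile_file("RopeImpl")
  print(list(environment))
  # -> PrincOps, Basics, PrincOpsUtils, Rope, RopePrivate, RopeImpl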
Well, there's a lot more flaming I can do along these lines, but I don't have time to get into any of it now.

Slide 12

Deeper and Deeper
  Foo: TYPE = INT[k..k+FIRST[Bar]];
  Bar: TYPE = INT[FIRST[Foo]..10];
  k: INT = LAST[Bar];

Here we have dependency relationships all over the place. These involve properties of types (FIRST, LAST, SIZE), which have to be constant, and the values of named constants.

Slide 13

The Problem
  Sequential processing of declarations
  EXCEPT FOR
    Named constants with compile-time values (but not runtime constants)
  AND
    Constant properties of types

This is the problem.

Slide 14

The Solution
  Dependency Graph
  Nodes for:
    Values of named constants
    Bodies of inline procedures
    Initialization expressions of types
    SIZE, FIRST, LAST properties of types
    Runtime state
  Evaluate nodes in topological order
  Reject if:
    Cyclic, or
    A supposedly constant type property depends on runtime state

The solution: we build a dependency graph. A dependency graph is a directed graph with nodes for (1) the values of named constants; (2) the bodies of inline procedures, so we can detect and reject recursive inline procedures; (3) the initialization expressions of types, since these also behave like inline procedures; (4) the properties of types which must be known at compile time: SIZE of all types, FIRST and LAST of element types; and (5) the runtime state, so we can express quantities that depend on the runtime state.

For each block in the program, we build a dependency graph for the names declared in that block. Then we walk through the graph, visiting a node only after we have visited all the nodes that it depends on. As we visit the nodes, we can compute the constant values of named constants and constant type properties. We reject the program if there is a cycle in the dependency graph, or if a property of a type which is supposed to be known at compile time (like the type's size) depends on the runtime state.

Slide 15

Example
  Foo: TYPE = INT[k..k+FIRST[Bar]];
  Bar: TYPE = INT[FIRST[Foo]..10];
  k: INT = LAST[Bar];

I'll work through the example here. Visit the nodes in this order: LAST[Bar], VALUE[k], FIRST[Foo], FIRST[Bar], LAST[Foo], SIZE[Foo], SIZE[Bar]. Sure, it looks easy now, but we sweated this stuff out for several weeks back in July. (A sketch of this evaluation follows these notes.)
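Here is a minimal Python sketch of the dependency-graph evaluation on Slides 14 and 15, worked on the Foo/Bar/k example. The node labels and the shape of the graph are written out by hand for illustration; Saffron builds these graphs per block, in Casaba, from the declarations themselves.

  # Each node names a quantity that must be known at compile time;
  # its list holds the nodes it depends on.
  deps = {
      "LAST[Bar]":  [],                            # ..10 is manifest
      "VALUE[k]":   ["LAST[Bar]"],                 # k: INT = LAST[Bar];
      "FIRST[Foo]": ["VALUE[k]"],                  # Foo: TYPE = INT[k..
      "FIRST[Bar]": ["FIRST[Foo]"],                # Bar: TYPE = INT[FIRST[Foo]..
      "LAST[Foo]":  ["VALUE[k]", "FIRST[Bar]"],    # ..k+FIRST[Bar]]
      "SIZE[Foo]":  ["FIRST[Foo]", "LAST[Foo]"],
      "SIZE[Bar]":  ["FIRST[Bar]", "LAST[Bar]"],
  }

  def visit(node, done, active, order):
      # Depth-first topological visit; a node on the active path
      # means a cycle, so the program is rejected. A supposedly
      # constant node that reached a "runtime state" node would be
      # rejected the same way.
      if node in done:
          return
      if node in active:
          raise ValueError(f"cycle through {node}: reject the program")
      active.add(node)
      for d in deps[node]:
          visit(d, done, active, order)
      active.discard(node)
      done.add(node)
      order.append(node)

  order, done, active = [], set(), set()
  for node in deps:
      visit(node, done, active, order)
  print(order)
  # With this insertion order, prints exactly the visit order from
  # Slide 15: LAST[Bar], VALUE[k], FIRST[Foo], FIRST[Bar], LAST[Foo],
  # SIZE[Foo], SIZE[Bar].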
Slide 16

Saffron Now
  Builds type graphs and context trees
  Dependency analysis
  Generates some intermediate code:
    Initialization (declarations)
    Expressions
    Assignments
    A few simple statements
    Blocks
    Transfer type values (procedures, ...)

Here's where Saffron is right now. It builds type graphs and the context tree (it already did before I got here, so that's not very interesting). Saffron performs dependency analysis on a block-by-block basis, as described in the preceding slides. And we're generating some intermediate code (program graph). This includes initialization code for variable declarations; expressions (folding constants, generating code for what has to be done at runtime); assignment statements; a few simple statements (IF-THEN-ELSE works, and NULL is really easy, but that's about it so far); blocks; and transfer type values (things like procedure bodies -- we're coping with all the lexical structure of Cedar programs). Oh, and Saffron has representations for, and can manipulate, Cedar values (compile-time or runtime).

Slide 17

Wrapping Up
  Cedar
  Casaba
  Acknowledgements:
    Howard Sturgis
    Bill Jackson
    Carl Hauser
    Russ Atkinson
    Peter Kessler

Cedar: I've been working with it off and on for over two years now, since the early days of 6.0. Working on compiling Cedar gives you, ah, an entirely new perspective on the language. Before, my attitude was, "yes, it's an incredibly powerful language which I love to use even though it's very crufty in places." Now I really know how crufty it is. But I still like it.

Casaba: Howard Sturgis, my supervisor, is the whole reason I'm working on this. Bill, with whom I've also been working, answered all the obscure Casaba questions that Howard passed on. Carl, Russ, and Peter: on the few occasions that I came up with a truly bizarre Cedar question, they always knew the answer.

Copyright © 1988 by Xerox Corporation. All rights reserved.