Release as [Indigo]2.0>interscript.tioga, .press
Draft [Indigo]Draft2.0>interscript.tioga, .press
Last edited By Mitchell on January 3, 1983 6:28 pm

LIMITED DISTRIBUTION: FOR XEROX INTERNAL USE

Towards an Interchange Standard for Editable Documents
by Jim Mitchell (Mitchell.PA) and Jim Horning (Horning.PA)
Version 2.0/January 3, 1983

The Interscript standard will define a digital representation of editable documents for exchange among different editing systems. A script is the representation of a document in the Interscript format; it can be transmitted from one editor to another over a network, or can be stored for later editing. A script is not limited to any particular editor: if a script contains editable information some of which is not understandable by a particular editor, it is still possible to edit the parts of the document understood by that editor without losing or invalidating the parts it does not understand.

This draft is a proposal for the technical content of the Interscript standard. It defines and explains the proposed standard, gives examples of its use, explains how to externalize documents from an editor's private format as scripts, and how to internalize scripts into an editor's private format. It also indicates a number of issues that must still be resolved to establish a practical standard.

The standard provides for documents with a dominant hierarchical structure (e.g., book/chapter/section/paragraph...) while also providing for documents needing more general structure than a single tree (e.g., for graphics, for certain kinds of document formatting, or for cross-references in a textual document), formatting information (e.g., margins, fonts, line widths, etc.), definitional structure (such as styles or property sheets), and intermixed kinds of editable information (e.g., text with embedded graphics). This draft deals primarily with the contents of Layers 0 and 1 (the base language) of the proposed standard.

Note: This draft is being circulated to interested parties within Xerox to report preliminary ideas. It should not be interpreted as a definitive proposal, and should not be distributed outside.

XEROX PALO ALTO RESEARCH CENTER
COMPUTER SCIENCE LABORATORY
3333 Coyote Hill Road / Palo Alto / California 94304

Contents
1. Introduction
2. The Language Basis: Syntax and Semantics
3. Higher-Level Issues
4. Pragmatics
Appendix A: Glossary

1. Introduction

Interscript provides a means of representing editable documents. This representation is independent of any particular editor and can therefore be used to interchange documents among editors. The basis of Interscript is a language for expressing editable documents as scripts.
Scripts are created by computer programs (usually an editor or associated program); scripts are "compiled" by programs to produce whatever private format a particular editor uses to represent documents.

1.1. Rationale for an interchange standard

As office systems proliferate, being able to interchange documents among different editing systems is becoming more and more important. Customers need document compatibility to avoid being trapped in evolutionary cul-de-sacs and having to pay the awful price of converting documents from one product's format to another's (even within one company's product line sometimes).

Now, an editing program typically uses a private, highly-encoded representation for documents in order to meet goals of performance and functionality. Generally, this means that different editors use different, incompatible private formats, and the user can conveniently edit a document only with the editor used to create it.

This problem can be solved by providing programs to convert between one editor's private (or file) format and another's. However, a set of different editors with N different document representations requires N(N-1) conversion routines to be able to convert directly from each format to every other.

This N(N-1) problem can be reduced to 2(N-1) by noticing that we could write N-1 conversion routines to go from F1 (format for editor1) to F2, ..., FN, and another N-1 routines to convert from F2, ..., FN to F1. Except when converting from or to F1, this scheme requires two conversions to go from Fi to Fj (i ≠ j). The choice of editor1 is a more critical issue, however, since the capabilities of that editor will determine how general a class of documents can be interchanged among the editors. This presents a truly difficult problem in the case that there is no single functionally dominant editor.
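The counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function names are ours and form no part of the proposal, and format number 1 is assumed to be the pivot.

```python
def direct_routines(n: int) -> int:
    """Routines needed to convert directly between every ordered pair of n formats."""
    return n * (n - 1)


def pivot_routines(n: int) -> int:
    """Routines needed when every conversion goes to or from one pivot format."""
    return 2 * (n - 1)


def conversions_needed(i: int, j: int, pivot: int = 1) -> int:
    """Conversion steps to get from format i to format j when routing via the pivot."""
    if i == j:
        return 0
    # One step if either end is the pivot itself, otherwise Fi -> F1 -> Fj.
    return 1 if pivot in (i, j) else 2


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, direct_routines(n), pivot_routines(n))
```

For ten editors the direct scheme needs 90 routines and the pivot scheme 18, at the price of a second conversion step whenever neither format is the pivot.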
If the pivotal editor1 doesn't incorporate all of the structures, formats, and content types used by all of the others, then it will not be possible to faithfully convert documents containing them. Even if we had a single editor that was functionally dominant, it would place an upper bound on the functionality of all future compatible editors. Since there are no actual candidates for a totally dominant editor, we have chosen instead to examine in general what information editors need and how that information can be organized to represent general documents. Since we are not proposing an editor, we do not need to design a private format for its documents; we only need an external representation that is capable of conveying the content, form, and structure of editable documents. That external representation has only one purpose: to enable the interchange of documents among different editors. It must be easy to convert between real editors' formats and this interchange encoding. Using a standard interchange encoding has the additional advantage that much of the input and output conversion algorithms will be common to all conforming editors. For example, when a new version of an existing editor is released, the only differences in the new version's conversion routines will be in the areas in which its internal document format has changed from its previous form; this represents a significant saving of programming. 1.2. Properties that any interchange standard must have An interchange encoding for editable documents must satisfy a number of constraints. Among these are the following: 1.2.1. Universal character set Scripts must be encoded using the graphic (printable) subset of the ISO 646 printing character set. As well as the obvious rationale that these characters are guaranteed not to have control significance to any devices meeting the ISO standard, it has the additional advantage that a script is humanly readable. 1.2.2. 
Encoding efficiency Since editable documents may be stored as scripts, may be transmitted over a network, and must certainly be processed to convert them to various editors' private formats, it is important that the encoding be reasonably space-efficient. Similarly, the time cost of converting between interchange encoding and private formats must be reasonably low, since it will have a significant effect on how useful the interchange standard is. (If the overheads were small enough, an editor might not even use a private file format for document storage.) 1.2.3. Open-ended representation Scripts must be capable of describing virtually all editable documents, including those containing formatted text, synthetic graphics, scanned images, etc., and mixtures of these various modes. Nor may the standard foreclose future options for documents that exploit additional media (e.g., audio) or require rich structures (e.g., VLSI circuit diagrams, database views). For the same reasons, the standard must not be tied to particular hardware or to a file format: documents will be stored and transmitted using a variety of media; it would be folly to tie the representation to any particular medium. 1.2.4. Document content and form The complete description of a document component usually requires more than an enumeration of its explicit contents; e.g., paragraphs have margins, leading between lines, default fonts, etc. Scripts must record the association between attributes (e.g., margins) and pieces of content. Both the contents and attributes of typical documents require a rich value space containing scalar numbers, strings, vectors, and record-like constructs in order to describe items as varied as distances, text, coefficients of curves, graphical constraints, digital audio, scanned images, transistors, etc. 1.2.5. 
Document structure

Many documents have hierarchical structure; e.g., a book is made of chapters containing sections, each of which is a sequence of paragraphs; a figure is embedded in a frame on a page and in turn contains a textual caption and embedded graphics; and the description of an integrated circuit has levels corresponding to modular or repeated subcircuits. The standard should exploit such structure, without imposing any particular hierarchy on all documents.

Hierarchy is not sufficient, however. Parts of documents must often be related in other ways; e.g., graphics components must often be related geometrically, which may defy hierarchical structuring, and it must be possible to indicate a reference from some part of a document to a figure, footnote, or section in a way that cuts across the dominant hierarchy of the document (section 1.6.4).

Documents often contain structure in the form of indirection. For instance, a set of paragraphs may all have a common "style," which must be referred to indirectly so that changing the style alone is sufficient to change the characteristics of all the paragraphs using it. Or a document may be incorporated "by reference" as a part of more than one document and may need to "inherit" many of its properties from the document into which it is being incorporated at a given time.

1.2.6. Transcription fidelity

It must be possible to convert any document from any editor's private format to a script and reconvert it back to the same editor's private format with no observable effect on the document's content, form, or structure. This characteristic is called transcription fidelity, and is a sine qua non for an interchange encoding; if it is not possible to accomplish this, the interchange encoding or the conversion routines (or both) must be defective.

1.2.7. Script comprehension

Even complicated documents have simple pieces.
A simple editor should be able to display parts of documents that it is capable of displaying, even in the presence of parts that it cannot. More precisely, an editor must, in the course of internalizing a script (converting it from a script to its private, editable format), be able to discover all the information necessary to recognize and to display the parts that it understands. This must work despite the fact that different editors may well use different data structures to represent the content, form, and structure of a document.

At a minimum, this requires that a script contain information by which an editor can easily determine whether or not it understands a component well enough to display or edit it, and that it be able to interpret the effect that components which it does not understand have on the ones it does. For example, if an editor does not understand figures, it should still be possible for it to display their embedded textual captions correctly, even though a figure might well dictate some of its caption's content or attributes such as margins, font, etc.

This constraint requires that an interchange encoding have a simple syntax and semantics that can be interpreted readily, even by low-capability editors. Along with the desire for open-endedness (section 1.2.3), this suggests a language with some form of "extension by definition" built around a small core.

1.2.8. Regeneration

Processing a script to internalize it correctly is only half the problem. It is equally important that an editor, in externalizing a script from its private document format, be able to regenerate the content, form, and structure carried by the script from which the document originally came. In particular, when regenerating a script from an edited document, it should be possible to retain the structure in parts of the original script that were not affected by editing operations.
For example, an editor that understands text but not figures should be able to edit the text in a document (although editing a caption may be unsafe without understanding figures) while faithfully retaining and then regenerating the figures when externalizing it. This problem is much less severe when an editor is transcribing a document that it "understands" completely, e.g., because the entire document was generated using that editor.

1.3. What the Interscript standard does not do

There are a number of issues that the Interscript standard specifically does not discuss. Each of these issues is important in its own right, but is separable from the design of an interchange representation.

1.3.1. Interscript is not a file format

The interchange encoding of a script is a sequence of ASCII/ISO 646 characters. The standard is not concerned with how that representation is held in files on various media (floppy disks, hard disks, tapes, etc.), or with how it is transmitted over communications media (Ethernet, telephone lines, etc.).

1.3.2. Interscript is not a standard for editing

A script is not intended as a directly editable representation. It is not part of its function to make editing of various constructs easier, more efficient, or more compact: those are the purview of editors and their associated private document formats. A script is intended to be internalized before being edited. This might be done by the editor, by a utility program on the editing workstation, or by a completely separate service.

1.3.3. Combining documents is not an interchange function

This exclusion is really a corollary of the statement, "A script is not intended as a directly editable representation." In general, it is no easier to "glue" two arbitrary documents together than it is to edit them.

1.3.4.
Interscript does not overlap with other standards There are a number of standards issues that are closely related to the representation of editable documents, but which are not part of the Interscript standard because they are also closely related to other standards. For example, the issues of specifying encodings for characters in documents, how fonts should be named or described, or how the printing of documents should be specified (i.e., Interpress) are not part of this work. 1.4. Concepts and Guiding Principles 1.4.1. Layers The Interscript standard is presented in layers: Layer 0 defines the syntax of scripts; parsing reveals the dominant structure of the documents they represent. Layer 1 defines the semantics of the base language, particularly the treatment of bindings and environments. Layer 2 defines the semantics of properties and attributes that are expected to have a uniform interpretation across all editors. Various Layer 3 extensions will define the semantics of properties and attributes that are expected to be shared by particular groups of editors. The present document focusses almost exclusively on Layers 0 and 1, although some of the examples illustrate properties and attributes likely to be defined in Layer 2. 1.4.2. Externalization and Internalization Transcription fidelity requires that any document prepared by any editor can be externalized as a script that will then be internalized by the editor without loss of information. Ease of internalization requires that the Interscript base language contain only relatively few (and simple) constructs. We resolve this apparent paradox by including within the base language a simple, yet powerful, mechanism for abbreviation and extension. A script may be considered to be a "program" that could be "compiled" to convert the document to the private representation of a particular editor, ready for further editing. 
The Interscript language has been designed so that internalizing scripts into typical editors' representations can be performed in a single pass over the script by maintaining a few simple data structures.

1.4.3. Content, Form, Value, and Structure

Most editors deal with both the content of a document (or piece of a document), and its form. The former is thought of as "what" is in the document, the latter as "how" it is to be viewed; e.g., "ABC" has a sequence of character codes as its contents; its format may include font and position information. Interscript maintains this distinction.

The distinction between the value and the structure of both content and form within a document is also important. When viewing a document, only the value is of concern, but the structure that leads to that value may be essential to convenient editing. An example of structure in content is the grouping of text into paragraphs; in form, associating a named "style" with a paragraph.

Content: Text and graphics are common special cases. Interscript's treatment of these has been largely modelled on that of Interpress. Other kinds of content may be represented by structures built from character strings, numbers, Booleans, and identifiers.

Form: Interscript provides for open-ended sets of properties and attributes. Properties are associated with content by means of tags. Attributes are bindings between names and values that apply over some scope (sections 1.4.4.2 and 1.4.4.3). The way the contents of a document are to be "understood" is determined by its properties; Interscript makes it straightforward to determine what these properties are without having to understand them.

Structure: Most editors structure the content of a document somehow (into words, sentences, paragraphs, sections, chapters; or lines, pages, signatures, for example). This assists in obtaining private efficiency, but, more importantly, provides a conceptual structure for the user.
Full transcription fidelity requires that the Interscript language be adequate to record any structure that is maintained by any editor for either form or content. Of course, some editors provide a number of different structures. A general structure, of which all the editors we know use special cases, is the labelled directed graph. Interscript provides this structure, without restricting the purposes for which it may be used. There are also two specializations of general graphs that occur so frequently that Interscript treats them specially: Sequences: The most important, and most frequent, relationship between values is logical adjacency (sequentiality), which is represented by simply putting them one after another in the script. Ordered trees: Most editors that structure contents have a "dominant" hierarchy that maps well into trees whose arcs are implicitly labelled by order. (Different editors use these trees to represent different hierarchies). Interscript provides a simple linear notation for such trees, delimiting node values by braces ("{" and "}"). If an editor maintains multiple hierarchies, the dominant one is the one transcribed into the tree structure and used to control the inheritance of attributes. Structure for content beyond that contained in the dominant hierarchy is represented by explicit links in the script; any node may be labelled as the source and/or the target of any number of links. A link whose target is a single node uniquely identifies that node; links with multiple targets may be used to represent sets of nodes. Typical structures recorded for form are expressions (indicating intended relations among attribute values) and sharing (representable by indirection). Interscript allows expressions to be composed of literals, identifiers, operators, and function applications, and permits the use of identifiers to represent expressions. 1.4.4. 
Features of the Base Language

1.4.4.1 Values

Expressions in a script may denote

    Literal values of primitive types
        Booleans: F, T
        Integers: ..., -3, -2, -1, 0, 1, 2, 3, ...
        Reals: 1.2E5, ...
        Strings: text enclosed in angle brackets ("<" and ">")
        Universal names: TEXT, XEROX, PARAGRAPH
    Structured values
        Nodes
        Vectors of values
        Environments
    Generic operations
        Invocations
        Applications
        Selections
    Operations specific to particular types
        Arithmetic
        Comparison
        Logical
        Subscript
        ...
    Bindings
    Labels
        Tags
        Targets
        Sources
        Link introductions
    Expressions to be evaluated at the point of invocation

1.4.4.2 Environments and Attributes

Environments bind attribute identifiers to values (or expressions denoting values), in various modes:

    "_" denotes a local binding, which may be freely superseded;
    ":=" denotes a global binding, which creates or modifies an attribute in the outermost environment.

NULL denotes the "empty" environment, containing bindings for no attributes.

The (implicit) outermost environment binds each identifier id to the corresponding universal name ID (written with all capital letters).

Each piece of content in a document has its own environment. Editors will use relevant attributes from that environment to control its form.

Attributes may also be used in scripts for two structuring purposes:

    abbreviation: an identifier may be bound to a quoted expression; within the scope of the binding, the use of the identifier is equivalent to the use of the full expression;
    indirection: reference through an identifier permits information (such as styles) to be defined in one place and shared throughout its scope; this is an example of structure (which must be preserved) in the form of a document.

1.4.4.3 Inheritance

The dominant hierarchy of a document is represented by grouping its pieces within nodes, which are the most obvious form of content structuring. They also control the scope of bindings.
The environment of a node is initially inherited from its containing node (except for the outermost node, which inherits it from the editor), and may be modified by bindings. A binding takes effect at the point where it appears, and its scope extends to the end of the innermost node containing it, with two exceptions:

    any binding except a definition may be superseded by a (textually) later binding (if the later binding is in a nested node, the outer binding's scope will resume at the end of the inner node), and
    a global binding extends over all of the document lexically to the right of the binding.

Attributes are inherited only via environments following the dominant structure. Thus the choice of a dominant structure to represent scripts from a particular editor will be strongly influenced by expectations about inheritance.

Attributes are "relevant" to a node if they are assumed by any of its tags. In general, a node's environment will also contain bindings for many "latent" attributes that are either relevant to its ancestors (and inherited by default) or are potentially relevant to its descendants.

The interior of each node is implicitly prefixed by Sub, which will generally be bound in the containing environment to a quoted expression performing some bindings, applying some labels, and/or supplying some initial content.

1.4.4.4 Expressions

Expressions involving the four infix operators (+, -, *, /) are evaluated right-to-left (a la APL); since we expect expressions to be short, we have not imposed precedence rules.

Parentheses are used to delimit vector values. Square brackets are used to delimit the argument list of an operator application and to denote environment constructors, which behave much like records.
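The right-to-left rule is easy to misread by anyone used to conventional precedence, so a small sketch may help. It is written in Python rather than in the Interscript language, it assumes the expression has already been split into a flat list of numbers and operator symbols, and the name eval_right_to_left is ours.

```python
import operator

# The four infix operators named in the text.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def eval_right_to_left(tokens):
    """Evaluate [operand, op, operand, op, ...] with no precedence, grouping from the right."""
    if not tokens or len(tokens) % 2 == 0:
        raise ValueError("expected operand (op operand)*")
    result = tokens[-1]
    # Walk leftwards; each step computes `left op (everything to the right)`.
    for k in range(len(tokens) - 2, 0, -2):
        op, left = tokens[k], tokens[k - 1]
        result = OPS[op](left, result)
    return result


if __name__ == "__main__":
    print(eval_right_to_left([2, "*", 3, "+", 4]))    # grouped as 2 * (3 + 4)
    print(eval_right_to_left([10, "-", 4, "-", 3]))   # grouped as 10 - (4 - 3)
```

Under this rule 2*3+4 denotes 14 rather than 10, and 10-4-3 denotes 9 rather than 3.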
The notation for selections (conditionals) follows Algol 68:

    (test | true part | false part)

This is consistent with our principles of using balanced brackets for compound constructions and avoiding syntactically reserved words; the true part and false part may each contain an arbitrary number of items (including none).

1.4.4.5 Tags and Links

A tag is written as a universal name followed by "$". A tag also invokes a component of the outermost environment (tags can create implicit indirections; see section 1.6.5). A node's tags are strictly local, whereas attributes have values that apply throughout a scope.

Layer 2 of the standard will be primarily concerned with the definition of a (small) set of standard properties that are expected to be shared among all conforming editors. For each standard property, it will describe

    the associated tag that denotes it,
    the assumptions it implies about the contents (values that must/may be present and their intended interpretation, invariant relations that are to be maintained, etc.), and
    the assumptions it makes about the environment (attributes that must be present and their intended interpretation).

Links enable a script to model associations that cut across its dominant structure: a link set denotes a set of directed arcs from each of its source nodes to all its target nodes. There are several ways this facility can be used:

    (S→T) A link set with a single source node and a single target node models a simple reference from one node in a document to another.
    (S*→T) For a link set with a single target node and multiple source nodes, each source node can be viewed as "pointing to" that target node.
    (S→T*) The symmetrical extreme case of a single source node and multiple target nodes corresponds closely to an entry in an index, which refers to all the places where some term is used (section 1.6 contains an example).
    (S*→T*) Finally, multiple source and target nodes in a link set can be used for all the cross references within a document of the form "see sections 1.6, 1.7, 2.3".
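The definition of a link set as arcs "from each of its source nodes to all its target nodes" can be made concrete with a short sketch. It is an illustration in Python under our own assumptions: nodes are represented by plain strings, and the names link_arcs and pattern are hypothetical, not part of the proposed standard.

```python
from itertools import product


def link_arcs(sources, targets):
    """Directed arcs denoted by a link set: one from each source to every target."""
    return list(product(sources, targets))


def pattern(sources, targets):
    """Name which of the four usage patterns a link set falls into."""
    s = "S" if len(sources) == 1 else "S*"
    t = "T" if len(targets) == 1 else "T*"
    return f"{s}->{t}"


if __name__ == "__main__":
    # An index entry: one source, several targets.
    entry, uses = ["index:script"], ["sec1.1", "sec1.4", "sec2.3"]
    print(pattern(entry, uses), link_arcs(entry, uses))
```

A link set with m sources and n targets therefore stands for m*n arcs, which is why a single set suffices for a cross reference of the form "see sections 1.6, 1.7, 2.3".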
To use links, a script must declare the "main" identifier of a link set ("LINKS" id) at the root of a subtree containing all its sources and targets, and textually preceding them. Once this main identifier has been introduced, nodes can be labelled as sources or targets for subsets of this link set. For example, the label "id.a.b:" would make a node a target for source nodes containing references of the sort "^id", "^id.a", or "^id.a.b".

1.4.5. Script comprehension

The Interscript standard applies to interchange among editors with widely varying capabilities. It will be important to define some structure to the space of possible scripts, just as Interpress has for printable documents. Dimensions in which we foresee reasonable variations in script comprehension are:

    Abbreviations: only editor-supplied to defined in document.
    Dominant structure: single-layer to arbitrary.
    Other structure: no links or indirections to links and indirections preserved.
    Bindings: local only to local and global (:=).
    Selection: no conditionals to conditionals.
    Numbers: integers only to floating point.

See section 2.4 for further details.

1.4.6. Internalizing a Script

The private representations of low-capability editors are not generally adequate to provide a full-fidelity internalization of every script produced by a high-capability editor. Thus, when internalizing a script, some information may not be viewable or editable. The Interscript language has been designed to simplify value-faithful internalization, even if structure is lost, and content-faithful internalization, even if form is lost, or the conversion of form to additional content to allow it to be examined (and perhaps even edited) by a low-capability editor.

The standard provides some simple conditions under which a low-capability editor can safely modify parts of a document that it understands fully, without thereby destroying the value or structure of parts that it is not prepared to deal with.
A script may be internalized into an editor's (private or file) representation as follows:

    Parse the entire script from left to right.
    As each literal is encountered in the script, convert it to the editor's representation.
    As each abbreviation (free-standing invocation) is encountered in the script, replace it with the value to which it is bound in the environment.
    As each structure is recognized in the script, represent the corresponding structure in the editor's representation, if possible; if not, use the semantics of Interscript to compute the value to be internalized.
    Update the environment whenever a binding is encountered or a scope is exited, according to the semantics of Interscript.
    Transfer the values of all attributes relevant to each piece of content from the current environment to the editor's representation, if possible; if not, apply an invertible function to convert the attribute-value binding into additional content.
    Determine the properties of each node from its tags; this list will be complete at the end of the node. A node is viewable if any of its tags denotes a property in the set of those the editor is prepared to display; it is understood if they are all in the set of those the editor is prepared to edit.
    Record the sources and targets of all links; for any link, these lists will be complete at the end of the node in which its main identifier was introduced. Translate each link to the corresponding editor structure, according to the properties of the node that introduces it.

Of course, any process yielding an equivalent result is equally acceptable.

1.5. Introduction to the Interscript Base Language

This section is intended to lead the reader through a set of examples, to show what the language looks like and how it is used to represent a number of commonly occurring features of editable documents. The examples purposely use rather long identifiers and lots of white space to make them more readable.
In actual use, programs, not people, will generate and read scripts; names will tend to be short; and logically unneeded spaces and carriage returns will tend to be omitted.

1.5.1. Simple text as a document

The following script defines a document consisting of the string "The text of the main node of example 1.5.1"; no font, paragraph structure, or formatting information is supplied. This example will gradually be expanded to represent accurately figure 1.5.1, below. The numbers at the left margin do not form part of the script; they are used to refer to the various lines in the discussion below.

    0  Interscript/Interchange/1.0
    1  {<The text of the main node of example 1.5.1>}
    2  EndScript

Line 0 is the header denoting version 1.0 of the interchange encoding. Line 1 is the entire body of this script: it contains a single node enclosed in {} which in turn contains a single string value enclosed in <>. Line 2, with the keyword "EndScript", marks the end of the script.

    The text of the main node of example 1.5.1
        The text of the first subnode of example 1.5.1

    Example 1.5.1: A simple document

The next version of the example adds the tag TEXT$ to the node. The identifier TEXT is called a universal name (or atom), which is indicated by its being composed of all uppercase letters. Universal names have no definition within the base language (they are expected to be defined in Layers 2 and 3).

    0  Interscript/Interchange/1.0
    1  {TEXT$
    2    <The text of the main node of example 1.5.1>
    3  }
    4  EndScript

A tag is denoted by placing "$" after a universal name. A node's tags are strictly local (they are not inherited by other nodes in the script) and serve as "type information" about the node. The tag TEXT$ labels this node as one that can be viewed as textual data. Tags can also create implicit indirections; see section 1.6.5.

    0  Interscript/Interchange/1.0
    1  {PARAGRAPH$
    2    leftMargin_3.25*inch rightMargin_5.0*inch
    3    <The text of the main node of example 1.5.1>
    4  }
    5  EndScript

This example shows how auxiliary information, such as margins, may be associated with a node of a script.
The binding leftMargin_3.25*inch adds the attribute leftMargin to the node's environment and binds the value of the expression 3.25*inch to it (inch is a value whose dimensions are inches/meters; meters are the standard Interscript units of distance). The bindings to leftMargin and rightMargin convey the fact that this node has margins for display. To denote the change in character of the node, we have tagged it as PARAGRAPH instead of TEXT. Figure 1.5.1 uses these margins for its first line of text.

    0 Interscript/Interchange/1.0
    1 {PARAGRAPH$
    2 leftMargin_3.25*inch rightMargin_5.0*inch
    3 <The text of the main node of example 1.5.1>
    4 {PARAGRAPH$ leftMargin_+0.5*inch
    5 <The text of the first subnode of example 1.5.1>
    6 }
    7 }
    8 EndScript

We have further elaborated the example by nesting another text node in the primary one, with its text following the primary node's text and with an indented leftMargin. The binding leftMargin_+0.5*inch is a contraction of leftMargin_leftMargin+0.5*inch. The right side of the binding is evaluated, and since there is as yet no binding in the inner node's (lines 4-6) environment for leftMargin, it is looked up in the environment of the containing node (lines 1-3). The value of the right hand side expression is thus 3.75*inch. This value is then bound to the identifier leftMargin in the inner node's environment. Since no value is bound to rightMargin in the inner node's environment, it will have the same rightMargin as its parent node.

    0 Interscript/Interchange/1.0
    1 p _ 'PARAGRAPH$ leftMargin_3.25*inch rightMargin_6.0*inch'
    2 {p rightMargin_5.0*inch
    3 <The text of the main node of example 1.5.1>
    4 {p leftMargin_+0.5*inch
    5 <The text of the first subnode of example 1.5.1>
    6 }
    7 }
    8 EndScript

One can also define an abbreviation by binding a sequence of unevaluated expressions to an identifier and subsequently using the identifier to cause those expressions to be evaluated at the point of invocation. This example binds the quoted expression 'PARAGRAPH$ leftMargin_3.25*inch rightMargin_6.0*inch' to the identifier p. When p is invoked in lines 2 and 4, the quoted expression replaces the invocation and is evaluated there.
Invoking p places the tag PARAGRAPH$ on the node, sets the leftMargin to 3.25*inch and the rightMargin to 6.0*inch. In line 2, the rightMargin is then rebound to 5.0*inch, overriding the default binding created by invoking p. Similarly, the binding for leftMargin in line 4 overrides the one resulting from invoking p, so that the inner node's leftMargin is 3.75*inch and its rightMargin is 6.0*inch.

An identifier can also be bound to an environment value as a convenient record-like manner of naming a set of related bindings. For example, a font might be defined as follows (a more complete definition is given later in section 1.6.3):

    font _ [ | family_TIMES size_10*pt
    face_[ | weight_NORMAL style_ROMAN slant_NIL] ]

This defines font to be the environment formed by taking the empty or NULL environment and altering it according to the series of bindings following the initial "[ |". In this case font is an environment having bindings for three attributes, family, size, and face. face is itself bound to an environment (with attributes weight, style, and slant). The set of default bindings in font specifies a normal weight (non-bold), non-italic Times Roman 10-point font.

We can incorporate this font definition in the example and then use it to indicate that the word "first" in the subnode should be in italics:

    0 Interscript/Interchange/1.0
    1 p _ 'PARAGRAPH$ leftMargin_3.25*inch rightMargin_6.0*inch'
    2 font _ [ | family_Times size_10*pt face_[ | weight_NORMAL style_ROMAN slant_NIL] ]
    3 {p rightMargin_5.0*inch
    4 <The text of the main node of example 1.5.1>
    5 {p leftMargin_+.5*inch
    6 <The text of the >
    7 font.face.slant_ITALIC <first> font.face.slant_NIL
    8 < subnode of example 1.5.1>
    9 }
    10 }
    11 EndScript

Bindings affect node contents to their right: so, "first" will be italic, while " subnode of example 1.5.1" will be non-italic due to the binding immediately preceding it. If we expected to switch between italics and non-italics frequently, it might be profitable to introduce abbreviations to shorten what must appear.
For example, in the scope of the definition

    l _ [ | i _ 'font.face.slant_ITALIC' nI _ 'font.face.slant_NIL']

line 7 could be abbreviated

    l.i <first> l.nI

1.6. Further Examples

This section gives some more realistic examples of the use of the Interscript language and explores the issues of making sets of standard definitions for use in scripts.

1.6.1. A Laurel Message

Here is a possible Interscript transcription of a Laurel message:

    0 Interscript/Interchange/1.0 -- standard heading --
    1 {LAURELMSG$ -- tag for a Laurel document --
    2 Sub _ 'PARAGRAPH$ leftMargin_1.0*inch rightMargin_7.5*inch' -- standard node prelude for nodes below --
    3 justified_F
    4 font.family_TIMES font.size_10
    5 leading.x_1
    6 leading.y_1 -- overridable default leadings --
    7 LINKS heading -- declare main identifier of link set --
    8 laurelInfo _ -- Laurel information for easy access --
    9 (^Heading.time ^Heading.from ^Heading.subject ^Heading.to ^Heading.cc)
    10 { {Heading.time: <18 June 1981 9:18 am PDT (Thursday)>}
    11 {Heading.from: AUTHENTICATED$}
    12 {Heading.subject: }
    13 {Heading.to: }
    14 {Heading.cc: }}
    15 leading.y_6 -- override outer y leading --
    16 {} -- node which is a paragraph --
    17 {}
    18 {}
    19 } EndScript

Line 1 tags this document (by tagging its root node) as a Laurel message, and line 2 tags its subnodes (starting on lines 10, 16, 17, and 18) as paragraphs with default margins. Lines 3-6 bind some other attributes likely to be relevant to paragraphs. Line 7 declares the main link identifier heading, and lines 8-9 bind to laurelInfo a vector of source links whose targets are the parts of the document of interest for mail transport. Lines 10-14 have similar structures: each consists of a string followed by a node containing a target link for the label heading and text for that Laurel "field." Line 11 is additionally tagged as AUTHENTICATED. Lines 16-18 contain paragraphs constituting the body of the message.
Alternatively, the external environment might well contain a definition of laurel60 that establishes a suitable environment for a Laurel 6.0 document:

    1 laurel60 _ '
    2 LINKS time LINKS from LINKS subject LINKS to LINKS bodyNodes LINKS cc
    3 LAURELMSG$
    4 cr _ <#13#> tab _ <#9#>
    5 p _ 'PARAGRAPH$ leftMargin_1.0*inch rightMargin_7.5*inch'
    6 justified_F
    7 font.family _ TIMES font.size _ 10
    8 margins.left_2540 margins.right_19050
    9 leading.x_1 leading.y_1 -- overridable default leadings --
    10 printForm _
    11 '{p ^time tab
    12 ^from cr
    13 ^subject cr
    14 ^to
    15 leading.y_6
    16 ^bodyNodes
    17 ^cc
    18 }'
    19 heading _ 'LAURELHEADING$ Sub_'TEXT$ LAURELFIELD$' '
    20 body _ 'Sub_'p bodyNodes:' '
    21 '

One advantage of using source labels for the "bodies" of the To:, From:, etc. fields (lines 11-14, 17) is that they can represent sets of nodes as well as single nodes. Now the Laurel document would be described by the following script:

    22 Interscript/Interchange/1.0 -- standard heading --
    23 {laurel60% -- invoke Laurel 6.0 definitions --
    24 {heading% -- invoke heading style --
    25 {time: <18 June 1981 9:18 am PDT (Thursday)>}
    26 {from: AUTHENTICATED$ }
    27 {subject: }
    28 {to: }
    29 {cc: }
    30 }
    31 {body% -- Invoke body style --
    32 {}
    33 {}
    34 {}
    35 }
    36 } EndScript

Invoking laurel60 in line 23 introduces the quoted expressions heading and body into the root node's environment, tags it as LAURELMSG and declares the labels time, from, etc. The root node also acquires a definition for a print form, which could be used to format the message for sending to a printer.

The "%" (indirection) operator indicates that this is intentional structure, to be preserved by each internalization, rather than merely an abbreviation. Thus the message heading and body should "see" the effects of any future changes made to laurel60, by editing its definition. By contrast, p is used as an abbreviation; when the script is rendered, its value may safely be copied at each use.
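The difference between an indirection and an abbreviation can be illustrated with a small Python sketch. It is not part of the proposal: the class Indirection, the function expand_abbreviation, and the dictionary standing in for the environment are all assumptions made for illustration, and the strings merely stand in for the values of quoted expressions.

```python
# Illustrative sketch only; every name here is hypothetical, not Interscript.
from dataclasses import dataclass


@dataclass
class Indirection:
    """Models `name%`: the editor keeps the reference, not the value."""
    name: str

    def value(self, env):
        # Looked up each time it is needed, so later edits are seen.
        return env[self.name]


def expand_abbreviation(name, env):
    """Models a plain invocation: the current value may simply be copied."""
    return env[name]


env = {"heading": "LAURELHEADING$ (first definition)"}

kept = Indirection("heading")                  # structure preserved
copied = expand_abbreviation("heading", env)   # value copied once

# The definition is edited after the document was internalized.
env["heading"] = "LAURELHEADING$ (revised definition)"

print(kept.value(env))   # follows the revised definition
print(copied)            # still the value copied at internalization
```

An editor that stored only the copied value would still render the document correctly at the moment of internalization, but it would lose the structural fidelity that the "%" operator asks it to preserve.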
Look at the definition of heading (line 19): the right side is a quoted expression sequence. The first expression of the sequence produces the tag LAURELHEADING$ and the second binds the quoted expression 'TEXT$ LAURELFIELD$' to Sub. As a result, each subnode of the one beginning on line 24 will be initialized by invoking Sub implicitly from its containing node, which gives each the tags TEXT$ and LAURELFIELD$. Similarly, the definition of body (line 20) defines Sub, and each of the nodes on lines 32-34 will be initialized by invoking p and having the target link bodyNodes placed on it. Labelling the set of body nodes this way means that the source link, ^bodyNodes, in printForm (line 16) denotes the entire sequence of body nodes, in left-to-right depth-first tree order.

1.6.2. A page of a Star document

This example is taken from page 71 of the Star Functional Specification and shows one page of a paginated document with a diagram and a footnote (we recommend that you have that page in front of you when analyzing this transcription):

    -- pages 1 .. 6 supposedly precede this one --
    {pg.a7: Sub_'PARAGRAPH$'
    { {fn.n1: -- just a unique label: fn: introduced somewhere earlier -- FOOTNOTE$ }
< which has shown our techniques to be valid.
Other data can be collected by future changes to your accounting and billing packages, which will allow us to perform even better analyses and lead to better problem discovery and correction.> }
    { }
    Sub_'FRAME$' -- change to subnode tag FRAME --
    {Alignment.horizonally_FlushLeft Alignment.vertically_Floating height_2.8*inch width_3.67*inch
    edges.expandingRightEdge_T border_dots1
    -- change to default subnode environment Rectangle with solid, double width outline --
    Sub_'RECTANGLE$ lineType.width_2 lineType.style_solid Sub_'Title''
    LINKS rect -- declare label class to be used below --
    {rect.a1: UpperLeft_(.0254 .07) shading_7 height_.01 width_.027 {} }
    {rect.a2: UpperLeft_(.073 .015) height_.01 width_.018 {} }
    height_.013 -- attribute value shared by following subnodes --
    {rect.a3: UpperLeft_(.02 .03) width_.025 {} }
    {rect.a4: UpperLeft_(.02 .03) width_.028 {} }
    {rect.a5: UpperLeft_(.042 .055) width_.016 {} }
    {rect.a6: UpperLeft_(.067 .055) width_.016 {} }
    -- default subnode environment is LINE with solid, double width outline --
    Sub_'LINE lineType.width_2 lineType.style_solid'
    LINKS ln
    {ln.out1: ^rect.a1 ^ln.in34}
    {ln.out2: ^rect.a2 ^ln.out1}
    {ln.in3: ^ln.in34 ^rect.a3}
    {ln.in4: ^ln.in34 ^rect.a4}
    {ln.in34: ^ln.in3 ^ln.in4}
    {ln.out4: ^rect.a4 ^ln.in56}
    {ln.in56: ^ln.in5 ^ln.in6}
    {ln.in5: ^ln.in56 ^rect.a5}
    {ln.in6: ^ln.in56 ^rect.a6}
    } -- end of Frame1 --
    Sub_'PARAGRAPH$' -- restore default subnode initialization to PARAGRAPH --
    {}
    {}
    } -- end of page --

1.6.3. Some Star property sheets

Here are a few of the definitions invoked in the above example (these were derived from page 148 of the Star Functional Specification). Some of them simply give default values for various attributes; some, like default.font, define a collection of related attributes as an environment; and most are quoted expression sequences for providing abbreviations or "decorating" nodes with tags and their environments with relevant attributes.

1.6.3.1. Font-related defaults and definitions

    baseline_0 -- the base line for characters --
    underlined_F -- whether or not text in node is to be underlined --
    strikeOut_F -- whether or not text in node is to have strike-out line through it --
    -- there is no rhyme and little reason behind the names of type fonts. The following definition is intended to provide enough choice, using standard "terms" to name any existing font in an arbitrary font catalog (of course, it doesn't, but perhaps it is close enough) --
    default.font _ [ | -- Definition --
        family_Times -- a font family name --
        face_[ | -- Definition --
            weight_NORMAL -- In (EXTRALIGHT, LIGHT, BOOK, NORMAL, MEDIUM, DEMIBOLD, SEMIBOLD, BOLD, EXTRABOLD, ULTRABOLD, HEAVY, EXTRAHEAVY, BLACK, GROTESQUE) --
            lineType_SOLID -- In (SOLID, INLINE, OPEN, OUTLINE, DISPLAY, SHADED) --
            proportions_NORMAL -- In (NORMAL, CONDENSED, EXPANDED, EXTENDED, WIDE, BROAD, ELONGATED) --
            style_ROMAN -- In (ROMAN, GOTHIC, EGYPTIAN, CURSIVE, SCRIPT) --
            slant_NIL -- In (NIL, ITALIC, OBLIQUE) --
            swash_F -- T => use swash capitals --
            lowercase_T -- T => use lowercase letters --
            uppercase_T -- T => use uppercase letters --
            smallCaps_F -- T => use small capitals --
            ]
        size_10*pt -- distance --
        ]
    -- some useful font shorthands: --
    Helvetica _ 'font _ [default.font% | family_HELVETICA]'
    Italic _ 'font.face.slant_ITALIC'
    Bold _ 'font.face.weight_BOLD'
    Helvetica10BI _ 'Helvetica font.size_10*pt Bold Italic'

1.6.3.2. Footnote-related definitions

    fnCount:=0 -- global variable for counting footnotes --
    FOOTNOTE _ 'fnCount:=+1 font.size_8*pt FootnoteRef%'
    FootnoteRef _ '{FOOTREF$ baseline_+5*pt fnCount}' -- raise 5 pts --

1.6.3.3. Paragraph-related definitions

    Tab _ [ | position_0 type_LEFT -- In (LEFT, CENTERED, RIGHT, DECIMAL) -- ]
    MakeTabs _ 'n_0 tabs_(RecursiveMakeTab[Value])'
    RecursiveMakeTab _ '(EQ[Value 0] | NIL | n_+.25*inch [Tab | position_n ] RecursiveMakeTab[Value-1])'
    Default.PARAGRAPH _ 'Indent _ [ | Left_0.0 Right_0.0] -- distance --
        Alignment_FLUSHLEFT -- In (FLUSHLEFT, FLUSHRIGHT, BOTH, CENTERED) --
        Justified_F
        leading_[leading | between_1*pt above_12*pt below_0]
        charStyle_[| Normal_'font_default.font' Emphasis1_'font_default.font Italic' Emphasis2_'font_default.font Bold' ]
        Hyphenation_F
        KeepOn_NIL -- In (NIL, SamePageAsNextParagraph) --
        MakeTabs[8] -- binds tabs to a sequence of 8 tabs (0, .25 inch, .50 inch, . . .) --
        charStyle.Normal -- initializes to normal style --

1.6.3.4. Frame, rectangle, and line definitions

    Def.UpperLeft _ 'UpperLeft_(0.0 0.0)' -- Def is just a convenient place to put useful auxiliary definitions --
    Def.lineType _ ' lineType_[ | Visible_T Width_1 Style_SOLID] -- IN (SOLID, DOT, DASH, DOTDASH, DOUBLE, . . .) -- '
    Def.Shading _ 'Shading_0'
    Def.Box _ 'Def.UpperLeft Def.lineType Def.Shading'
    Frame _ 'FRAME$ Def.Box'
    Rectangle _ 'RECTANGLE$ Def.Box Constraint_MagnifyOnly -- IN (NIL MagnifyOnly) -- '
    Def.LineEnd _ ' LineEnd_(LeftUpper_Flush RightLower_Flush) -- IN (Flush Round Square arrow1 arrow2 arrow3) -- '
    Line _ 'LINE$ constraint_FixedAngle Def.lineType Def.LineEnd'
    Title _ 'CAPTION$ Paragraph'

1.6.4. Using links

Links are intended to provide the means for associating nodes in non-hierarchical ways. They can be used for referring to figures, examples, tables, etc., for describing tables of contents, for denoting index items, keeping lists, etc.

1.6.4.1. References to figures

The following outlines how the labelling facilities and global bindings can be used to generate references to (source links for) a figure whose number may not be known at the point of reference.
The identifier n5 is assumed to have been generated by the program that produced the script and is assumed to be unique over the target labels with naming prefix "figures." in the script.

    LINKS figures figCount:= 0 -- should appear in a script's root node --
    makeFigureNum _ 'HIDDEN$ figCount:=+1 figCount'
    {. . . ^figures.n5 . . .} -- ref to node with label figures.n5: --
    { . . . {figures.n5: makeFigureNum} . . .} -- a hidden node holding the figure number --

The node in which the figure number for figure n5 is defined contains a tag, HIDDEN$, which means that the node is not to be considered a part of the dominant structure for display purposes even though it is part of it. The node's sole content is the value of figCount after it has been incremented by 1. Because figCount is bound with ":=", the scope of the binding is global.

1.6.4.2. Collections of index items

Assume that the word "diarchy" is to be considered an index item in certain places where it occurs in a document. The link class Indexable should be introduced at the root of the document, and each to-be-indexed occurrence of "diarchy" in a string, e.g., , should be replaced by the sequence diarchy% < is established, it . . .>. Somewhere in the script within the scope of the declaration of Indexable, at the root of a subtree containing all the uses of diarchy, should be the following definition:

    diarchy _ '{HIDDEN$ indexable.diarchy: pageNumber} '

Invoking diarchy results in the appearance of a hidden node containing the current page number (assumed to be held in the attribute pageNumber) and labelled as being in the set of target links indexable and indexable.diarchy. The index for the document might then contain the following entry for "diarchy":

    {INDEXENTRY$ ^indexable.diarchy}

This entry contains the minimal information needed to generate the sequence of page numbers corresponding to indexable occurrences of diarchy.
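As a rough illustration of how an editor might resolve such a source link, the Python sketch below gathers the hidden nodes into link sets and reads off the page numbers for ^indexable.diarchy. The function names, the list of (label, content) pairs, the second index item "oligarchy", and the page numbers are all invented for the example; only the rule that a target belongs to the link set of every prefix of its name comes from the proposal.

```python
# Illustrative sketch only; names and data are hypothetical.
from collections import defaultdict


def link_prefixes(target):
    """'indexable.diarchy' -> ['indexable', 'indexable.diarchy']."""
    parts = target.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def collect_targets(nodes):
    """nodes: (target_label, content) pairs in document order."""
    link_sets = defaultdict(list)
    for label, content in nodes:
        # A target joins the link set of every prefix of its label.
        for prefix in link_prefixes(label):
            link_sets[prefix].append(content)
    return link_sets


# Hidden nodes left behind by invocations of index-item definitions;
# the content of each is the page number current at that point.
hidden_nodes = [
    ("indexable.diarchy", 12),
    ("indexable.oligarchy", 15),
    ("indexable.diarchy", 31),
    ("indexable.diarchy", 47),
]

link_sets = collect_targets(hidden_nodes)
print(link_sets["indexable.diarchy"])   # [12, 31, 47]
print(link_sets["indexable"])           # [12, 15, 31, 47]
```

Because the targets are collected in document order, the page numbers come out in the order in which the index entry would normally list them.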
If some occurrences are considered primary and some secondary, then these mechanisms can be generalized to have diarchy defined as

    diarchy _ [ | primary _ '{HIDDEN$ indexable.diarchy.primary: pageNumber} '
        secondary _ '{HIDDEN$ indexable.diarchy.secondary: pageNumber} ']

Primary references are denoted in the script as diarchy.primary% and secondary ones as diarchy.secondary%. Similarly, the index entry takes the form:

    {INDEXENTRY$ ^indexable.diarchy.primary ^indexable.diarchy.secondary}

1.6.5. Using indirections

Indirections provide a way to centralize (and delay) the binding of information within a document. They can be used to share information that is intended to be consistent.

1.6.5.1. Styles and style sheets

Documents generally follow stylistic conventions for presenting different kinds of content. E.g., major headings may be in bold face with twelve points of extra leading, minor headings in italic with six points of extra leading. If this information is explicitly bound for each piece of content, then a stylistic change may require locating and changing all the relevant bindings (note that italic is likely to be also used for other purposes, such as emphasis). If, however, the binding is done indirectly, through a style, a single change will be effective for all places where the style is referenced. Note that each occurrence of a tag implicitly establishes an indirection through the same identifier; this is convenient in associating styles with semantically meaningful tags. For example:

    MajorHeading _ 'PARAGRAPH$ Bold leading_+12'
    MinorHeading _ 'PARAGRAPH$ Italic leading_+6'

2. The Language Basis: Syntax and Semantics

2.1. Grammar

Our notation is basically BNF with terminals quoted and augmented by the following conventions: a sequence enclosed in [ ] brackets may occur zero or one times; a construct followed by * may occur zero or more times; parentheses ( ) are used purely for grouping.
    script ::= header node trailer
    header ::= "Interscript/Interchange/1.0 "
    trailer ::= "EndScript"
    item ::= content | binding | label
    content ::= term | node
    term ::= primary | primary op term
    op ::= "+" | "-" | "*" | "/"
    primary ::= literal | invocation | indirection | application | selection | vector
    literal ::= Boolean | integer | real | string | universal
    invocation ::= name
    name ::= id ( "." id )*
    indirection ::= name "%"
    application ::= ( name | universal ) "[" item* "]"
    universal ::= ucID
    selection ::= "(" term "|" item* "|" item* ")"
    vector ::= "(" item* ")"
    node ::= "{" item* "}"
    binding ::= localBind | globalBind
    localBind ::= name "_" rhs
    globalBind ::= ( name | universal ) ":=" rhs
    rhs ::= content | op term | "'" item* "'" | "[" item* "|" binding* "]"
    label ::= tag | link
    tag ::= universal "$"
    link ::= "LINKS" id | "^" name | name ":"

2.2. Discussion of Features

[Note that we have a formal semantic definition for this language that is every bit as precise as the grammar above. However, we have not yet figured out how to present it in a form that humans find equally palatable, so we have placed it in Appendix C.]

    primary ::= literal
    literal ::= Boolean | integer | real | string

The primitive elements by which the value of a document is represented.

    term ::= primary op term
    op ::= "+" | "-" | "*" | "/"

Both the primary and the term must reduce to numbers; the arithmetic operators are evaluated right-to-left (a la APL, without precedence) and bind less tightly than function application. The result is a real if either operand is.

    invocation ::= id

Id is looked up in the current environment; depending on its current binding, this may produce contents, bindings, and/or labels; if the rhs bound to id was quoted, that expression is evaluated in the current environment. In the (implicit) outermost environment, every id is bound to the corresponding universal (ID).

    invocation ::= name "." id

Qualified names represent lookup in "nested" environments; name must have been bound to an environment, in which id is looked up.

    indirection ::= name "%"

This indicates an intentional indirection through name, which should be preserved as part of the structure; replacing the indirection by its value in the current environment is a value-preserving loss of structural fidelity. (An invocation that is simply a name is an abbreviation that need not be preserved.)

    universal ::= ucID

Universals are identifiers that are written entirely in upper case letters. They are presumed to be defined externally, so they are not looked up in the environment (with one exception; see the discussion of tags below).

    application ::= ( name | universal ) "[" item* "]"

If the application involves a universal (either explicitly, or because the name is bound to a universal), the corresponding function is applied to the argument list that results from evaluating item*. Part of the definition of Layer 2 will involve the specification of a small set of standard functions, which may be expanded in various Layer 3 extensions. If name is not bound to a universal, the current environment is temporarily augmented with a binding of the value of item* to the identifier value, and the value of the application is the result of evaluating name in that environment; this allows function definition within the language. Neither form of application changes the environment of succeeding expressions because item* is evaluated in a free-standing environment that is thrown away.

    selection ::= "(" term "|" item1* "|" item2* ")"

This is a standard conditional item sequence, using syntax borrowed from Algol 68. The value and effect are those of item1* if the term evaluates to "T" in the current environment, those of item2* if it evaluates to "F".
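Application of a user-defined name and selection can be sketched together in Python. This is a loose model rather than the semantics of Appendix C: quoted expressions are represented as Python callables that receive the environment of invocation, collections.ChainMap stands in for nested environments, and the names select, apply_name, and countDown are invented for the example.

```python
# Illustrative sketch only; all names are hypothetical.
from collections import ChainMap


def select(test, if_true, if_false):
    """Models ( term | item1* | item2* ): pick a branch by a Boolean test."""
    return if_true() if test else if_false()


def apply_name(env, name, argument):
    """Models name[item*] when name is not bound to a universal."""
    # Temporarily augment the environment with a binding for `value`.
    scratch = env.new_child({"value": argument})
    # Evaluate the quoted expression there; scratch is then discarded.
    return env[name](scratch)


def count_down(env):
    """A recursive definition yielding the vector (n, n-1, ..., 1)."""
    n = env["value"]
    return select(
        n == 0,
        lambda: [],
        lambda: [n] + apply_name(env, "countDown", n - 1),
    )


env = ChainMap({"countDown": count_down})
result = apply_name(env, "countDown", 3)
print(result)            # [3, 2, 1]
print("value" in env)    # False: the caller's environment is unchanged
```

The final line reflects the rule that an application leaves the environment of succeeding expressions untouched: the binding of value exists only in the scratch environment created for the call.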
    vector ::= "(" item* ")"

Parentheses group a sequence of items as a single vector; bindings affect the environment of items to the right in the containing node, but labels have no meaning.

    node ::= "{" item* "}"

Nodes have nested environments, and affect the containing environment only through global (:=) bindings to ids. Item* is implicitly prefixed by an invocation of Sub, which may be bound to any sequence of items intended to be common to all subnodes of a node.

    item* ::= ""

The empty sequence of items has no value and no effect; this is the basis for the following recursive definition.

    item* ::= item1 item*

In general, the value of a sequence of items is just the sequence of item values; binding items change the environment of items to their right in the sequence.

    localBind ::= name "_" rhs

This adds a single binding to the current scope (i.e., to its associated environment); bindings have no other "side effects" and no value (i.e., they do not change the length of a containing vector or node value).

    globalBind ::= ( name | universal ) ":=" rhs

This adds a single binding to the outermost environment X. It makes sense to bind something to a universal only if the universal is a tag name (see tag below).

    binding ::= name mode op term

"name mode op term" is just a convenient piece of syntactic shorthand for "name mode name op term".

    mode ::= "_" | ":="

A value can be bound to a name either locally ("_") in the environment of the node in which the binding appears, or globally (":=") in the environment of the root node of a script.

    rhs ::= "'" item* "'"

A quoted rhs is evaluated in the environment of invocation, rather than the environment current at the point of binding.

    rhs ::= "[|" binding* "]"

This creates a new environment value that may be used much like a record.

    rhs ::= "[" item* "|" binding* "]"

This creates a new environment value that is an extension of the environment that is the value of item*.
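The scoping of local and global bindings, and the "name mode op term" shorthand, can be sketched with Python's collections.ChainMap, which models a chain of nested environments. The sketch is an assumption-laden illustration, not the proposal's semantics: the helper global_bind is invented, and margins are kept in inches for readability (the proposal measures distances in meters).

```python
# Illustrative sketch only; global_bind and the variable names are hypothetical.
from collections import ChainMap

root = ChainMap({})                  # outermost environment of the script

outer = root.new_child()             # environment of the outer node
outer["leftMargin"] = 3.25           # leftMargin_3.25*inch
outer["rightMargin"] = 5.0           # rightMargin_5.0*inch

inner = outer.new_child()            # environment of the nested node
# leftMargin_+0.5*inch is shorthand for leftMargin_leftMargin+0.5*inch:
# the right side is looked up through the enclosing environments,
# and the result is bound locally in the inner node.
inner["leftMargin"] = inner["leftMargin"] + 0.5


def global_bind(env, name, value):
    """Models name := rhs by binding in the outermost environment."""
    env.maps[-1][name] = value


global_bind(inner, "figCount", 1)    # figCount := 1 inside the inner node

print(inner["leftMargin"])    # 3.75
print(outer["leftMargin"])    # 3.25 (the local binding did not leak out)
print(inner["rightMargin"])   # 5.0  (inherited from the containing node)
print(outer["figCount"])      # 1    (the global binding is visible everywhere)
```

The inner node's local binding shadows the outer one without disturbing it, while the global binding made inside the inner node is visible from the outer node as well.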
    tag ::= universal "$"

This gives the containing node the property denoted by the universal. It also looks for a binding to the universal in X, the outermost environment; if one exists, it is invoked in the context of the current environment. This gives an easy way to attach a tag to a node and provide a set of defaults associated with the tag.

    link ::= "LINKS" id

This introduces the link set whose main component is id, and defines its scope.

    link ::= "^" name

This identifies the immediately containing node as a source of the link name (like a reference to the set of nodes which are link targets).

    link ::= name ":"

This identifies the immediately containing node as a target of each of the links that is a prefix of name. For example, the link target "id1.id2...idn:" would make the node containing it a target in the link sets for id1, id1.id2, ..., id1.id2...idn.

2.3. Safety Rules for Low-capability Editors

Interscript claims to make it possible for editors to manipulate the parts of documents they understand without harming parts they do not. This section develops a set of conservative rules for editor treatment of script nodes created by other editors.

We first need to define some terms. The implementor of an editor is said to understand a tag, T, if (1) she knows the set of attributes and contents that are relevant to T, and (2) she knows all the invariants among attributes that must be maintained for a node with tag T. An editing system is said to understand a tag, T, if (1) it is able to provide some rendition (display) of a node with tag T; and (2) it allows insertion or deletion of direct subnodes of that node. An editing system is said to implement a tag, T, if (1) it understands T; and (2) it is able to alter a node with tag T. Finally, an editing system is said to fully implement a tag, T, if it is capable of changing any attribute relevant to T or any contents of a node with tag T.
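One way to read the definitions for editing systems is as an ordered scale of capability per tag. The Python sketch below encodes that reading. It rests on an assumption the text does not state outright, namely that fully implementing a tag implies implementing it; the names Level and Editor, and the sample capability table, are hypothetical.

```python
# Illustrative sketch only; Level, Editor and the sample table are hypothetical.
from enum import IntEnum


class Level(IntEnum):
    NONE = 0
    UNDERSTANDS = 1        # can render the node, insert/delete direct subnodes
    IMPLEMENTS = 2         # understands the tag and can alter the node
    FULLY_IMPLEMENTS = 3   # can change any relevant attribute or any contents


class Editor:
    def __init__(self, levels):
        self.levels = dict(levels)   # tag name -> Level

    def level(self, tag):
        # Tags the editor has never heard of are at Level.NONE.
        return self.levels.get(tag, Level.NONE)

    def understands(self, tag):
        return self.level(tag) >= Level.UNDERSTANDS

    def implements(self, tag):
        return self.level(tag) >= Level.IMPLEMENTS

    def fully_implements(self, tag):
        return self.level(tag) >= Level.FULLY_IMPLEMENTS


editor = Editor({
    "TEXT": Level.FULLY_IMPLEMENTS,
    "PARAGRAPH": Level.IMPLEMENTS,
    "FRAME": Level.UNDERSTANDS,
})

print(editor.understands("FRAME"), editor.implements("FRAME"))   # True False
print(editor.understands("RECTANGLE"))                           # False
```

Because the levels are ordered, each predicate is a single comparison, and an editor that implements a tag automatically counts as understanding it, as the definitions require.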
With these definitions, we can now give some conservative rules for editors in treating parts of documents corresponding to nodes in a script:

    It's OK for an editor to display a node if it understands at least one of its tags.
    It's OK for an editor to edit within a node if it implements all of its tags, and either (a) doesn't remove any of them, or (b) also understands all tags of its parent.
    It's OK for an editor to copy a node if it understands all the tags of the node's new parent, no labels are moved outside their scope, and the two environments have the same bindings for all attributes that the editor either doesn't understand, or knows can't be relevant anywhere in the node or its subnodes.
    It's OK for an editor to delete a node if it understands all the tags of its parent.

[Less stringent rules will suffice if the document is merely to be viewed, rather than edited, using the original editor.]

2.4. Encodings

[Any resemblance between the following material and the corresponding section of the Interpress standard is purely an intentional consequence of plagiarism.]

The script for a document can be encoded in many different ways. This section gives the rules for designing encodings. The purpose of these rules is to ensure that information is not lost or added by conversions from one encoding to another.

There are two types of encodings: a single interchange encoding and many possible private encodings. The interchange encoding is used to transmit a script from one site to another when the two sites must be assumed to be arbitrarily different. A private encoding is used to transmit scripts from one site to another when the two sites share the private encoding conventions. For example, a line of document-preparation products made by the same manufacturer might share a private encoding, which can be used to transmit documents from one editor in the product line to another; presumably this encoding is designed to make these transfers simpler or more efficient.
However, when one of these editors transmits a document to an unknown editor, the interchange encoding must be used. The interchange encoding is designed to allow easy generation, transmission, and interpretation by many different editors, possibly at the expense of compactness and speed of encoding and decoding.

2.4.1. The interchange encoding

The interchange encoding is designed to simplify creation, communication and interpretation of scripts for the widest possible range of editors and systems. For this reason, a script in the interchange encoding is represented as a sequence of graphic (printable) characters taken from the ASCII set; the subset of ASCII used is also a subset of ISO 646. Communication of a script in the interchange encoding requires only the ability to communicate a sequence of ASCII characters; Interscript does not specify how the characters are encoded. In effect, we define a text representation of the commands to be executed.

The choice of a text format for the interchange encoding leads to rather lengthy scripts in some cases. The bulk of an interchange script presents no great problem for document storage, since a document need not be stored in this form. Rather, as it is transmitted, the sending editor can translate its own private encoding into the interchange encoding. Similarly, the receiving editor can translate the interchange encoding into its own, usually different, private encoding for storage. However, a bulky interchange script may be more expensive to transmit. If a document consists mostly of text, the interchange encoding is quite efficient: very few characters are required in addition to those appearing in the document itself.

Character set. The character set used in the interchange encoding is described by the ISO 646 7-bit Coded Character Set For Information Processing Interchange.
The interchange encoding interprets the 94 characters of the G1 set defined in the International Reference Version (ISO 646, Table 2) and the space character (2/0). This set of 95 characters is called the interchange set. Note that except for the concise "string" encoding of vectors described below, the interchange encoding has nothing to do with the integers corresponding to the characters, but depends only on the character set itself. It is extremely important to understand that the choice of the ISO standard for the interchange format has nothing to do with character mappings in Interscript fonts. Although these mappings must adhere to a character set standard that is shared by interchanging editors, that standard is not part of Interscript. It is expected that Xerox will develop a separate corporate standard in this area. If the underlying encoding of the ISO character set can also encode other characters (e.g., the control characters (0/0 through 1/15) and del (7/15), or another group of 128 characters if eight bits are being used to encode each character), these are ignored in interpreting an interchange script. This does not mean that these characters are converted to spaces, but that they are treated as if they were not present. There are several reasons for this choice: Control characters may be inserted freely by software that generates the interchange encoding. For example, carriage returns (0/13), line feeds (0/10), and form feeds (0/12) may be inserted at will to conform to limitations that may be imposed by an operating system. Restrictions on line length or the use of fixed-length records thus become straightforward. Control characters may be removed or inserted freely by software that receives the interchange encoding. In this way, the receiving software can adhere to any restrictions imposed by its operating system. 
The absence of control characters allows certain kinds of "non-transparent" data communication methods (such as binary synchronous communication) to be used freely. A minor disadvantage of these conventions is that if a script is typed in, care must be taken not to omit a significant space at the end of a line. Since scripts are normally generated by programs, this is not important. A system for manually generating (and perhaps interactively debugging) Interscript should provide for various convenience features on input, and for prettyprinting the script on output. Any number of space characters may also be added after any token without changing the meaning. Throughout the following, a delimiter is a space or comma, which may be omitted if the next character is not an alphanumeric, "" or ".". VersionId. The first characters of an interchange script conforming to this version of the Interscript standard must be "Interscript/Interchange/1.0 ". Note that the VersionId is of variable length, and ends with a space. These conventions simplify the design of systems that must deal with more than one kind of encoding. If a privately encoded script can be interpreted as a sequence of characters, its first characters must be "Interscript/private/i.j", where private is replaced by an appropriately chosen hierarchical name that identifies the encoding, e.g., "Xerox/860", and i.j is replaced by an appropriate version identification, e.g., "2.4"; the resulting header would be "Interscript/Xerox/860/2.4". A private encoding that cannot be interpreted as a sequence of characters (e.g., a binary, word-oriented encoding on a 36-bit machine which packs five 7-bit characters into a word) should use any available convention to make its scripts self-identifying. Following the versionId is a node constituting the body of the script which is in turn followed by the trailer of a script, "ENDSCRIPT". The body of the script contains values encoded as follows. Integer. 
An integer is represented in radix 10 notation using the characters "0" through "9" as digits, followed by a delimiter. A negative integer is preceded by a minus sign "-". Thus the decimal number 1234 is encoded as "1234", and -1234 is encoded as "-1234". The trailing delimiter may be empty if the following character is a letter.

A sequence of integer literals in the range 0..255 can be represented in radix 16 notation using the characters "A" through "P" as digits ("A" corresponds to 0, "P" to 15). The entire sequence is enclosed in "#" brackets. For example, the integer 93 is represented as "#FN#", and the sequence of integers 93, 94, 95, 96 as "#FNFOFPGA#". These sequences require only two characters for each integer (plus two characters of overhead). Note that there is no delimiter between the integers in this encoding.

Booleans are represented by the characters "F" and "T", followed by a delimiter.

Real. A real is represented using Fortran E or F notation, with a trailing delimiter. Thus "12.34" is the same as "1.234E1". Minus signs may precede the mantissa or the exponent: "-12.34E-3 ".

Identifier. An identifier is encoded by its characters (which are limited to letters and digits), followed by a delimiter: "x", "arg1". The first character of an identifier must be a letter, and must be written in lower case to distinguish identifiers from universals. Other letters may be written in either case for readability, since case is not significant in distinguishing identifiers.

Vector. A vector is encoded by surrounding a sequence of values with parentheses, "(" and ")".

String. A text vector usually contains integers that are interpreted as character codes. Often these codes lie in the range 32 to 126 inclusive, which are the numbers assigned to the characters of the interchange set by ISO 646. It is convenient to encode an element of such a vector by the character whose ISO code is the desired value.
Such a string can be encoded by surrounding the characters with "<" and ">". If the string contains elements outside the allowed range (i.e., if the value is less than 32 or greater than 126) or the value 62 or 35 (the ISO codes for the characters ">" and "#"), those elements must be represented as integers inside "#" brackets, as described above. The two-character encoding of small integers is designed to make escape sequences compact. Thus the same string generally has several equivalent encodings, which differ only in which elements are written as characters and which as "#" escapes.

Universal names. A universal is encoded by giving a name that begins with an uppercase letter followed by zero or more uppercase letters or digits, followed by a delimiter. E.g., "TEXT", "XEROX860 ".

Node. A node is encoded by a "{", followed by a sequence of items, followed by a "}".

Comment. The beginning and end of a comment are both marked by a double minus sign: a sequence of the form "--" commentString "--" is a comment and may occur between any two tokens. Comments are ignored in rendering the script.

The tokens of the interchange encoding are defined by the following BNF grammar, together with rules about delimiters:

The delimiter that terminates an identifier or universal may only be empty if the next character is not an alphanumeric, or "-".
The delimiter that terminates an integer may only be empty if the next character is not a digit, "E", "F", "-", or ".".
Extra delimiters may be inserted after any token.

token ::= literal | id | ucID | op | bracket | punctuation | comment
literal ::= Boolean | integer | real | string
Boolean ::= ( "F" | "T" ) delimiter
delimiter ::= " " | "," | empty
empty ::= ""
integer ::= [ "-" ] digit digit* delimiter
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
real ::= [ "-" ] digit digit* "." digit* [ "E" integer ] delimiter
string ::= "<" stringElem* ">"
stringElem ::= stringChar | hexSequence
stringChar ::= any character but "#" or ">"
hexSequence ::= "#" hex* "#"
hex ::= hexChar hexChar
id ::= lowerCase idChar* delimiter
idChar ::= letter | digit
letter ::= lowerCase | upperCase
lowerCase ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upperCase ::= hexChar | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
hexChar ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P"
ucID ::= upperCase ucIDchar* delimiter
ucIDchar ::= upperCase | digit
op ::= "+" | "-" | "*" | "/"
bracket ::= "(" | ")" | "{" | "}" | "<" | ">" | "[" | "]" | "'"
punctuation ::= "." | ";" | ":" | "=" | "_" | "!" | "%" | "|"
comment ::= "--" commentString "--"
commentString ::= any sequence of characters not containing "--"

A simple listing of an interchange script can just print the character sequence, with line breaks every n characters, or perhaps at the nearest convenient delimiter. Such a listing is reasonably easy to read, so that problems can be tracked down simply by studying it. Additional help in reading the file can be furnished by utility programs which format the file for more pleasant reading.

2.4.2. Normalization

Every encoding must define a normalization function N, which maps a script in the encoding into another script in the encoding which generates the same output. N must be idempotent (i.e., N(N(s)) = N(s) for every script s); it must not change the fidelity level of the script (see 2.4.3). If a script violates the definition of Interscript, a normalization function may report this fact instead of producing a normalized result. In other words, normalization need not be defined on erroneous scripts. The purpose of this function is to make possible a precise description of the rules for private encodings in section 2.4.4.
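The idempotence requirement on N lends itself to a mechanical check. The following Python sketch is only an illustration under stated assumptions: normalize_token is a hypothetical toy normalizer for single tokens, modeled on the interchange rules for integers (leading zeros dropped) and identifiers (upper case letters replaced by lower case); it is not the normalization function of any actual encoding.

```python
import re

def normalize_token(token: str) -> str:
    """Toy normalizer for one token (hypothetical; illustration only)."""
    # Integer: optional minus sign followed by digits; drop leading zeros.
    m = re.fullmatch(r"(-?)(\d+)", token)
    if m:
        sign, digits = m.groups()
        return sign + (digits.lstrip("0") or "0")
    # Identifier: first letter lower case; case of later letters is not significant.
    if re.fullmatch(r"[a-z][A-Za-z0-9]*", token):
        return token.lower()
    # Everything else (e.g. universals) is left unchanged in this sketch.
    return token

def is_idempotent_on(tokens) -> bool:
    """Check N(N(t)) == N(t) for every sample token."""
    return all(
        normalize_token(normalize_token(t)) == normalize_token(t) for t in tokens
    )

samples = ["007", "-0042", "0", "lineLeading", "arg1", "TEXT"]
print([normalize_token(t) for t in samples])
print(is_idempotent_on(samples))
```

A check of this kind can only refute idempotence on the samples it is given; it does not prove the property for all scripts.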
The idea is that when an encoding provides several ways of saying the same thing (typically a basic way, and some more concise ways which work in common special cases), the normalized script will uniformly choose one way of saying it. Note that the normalized script is not intended for any purpose other than precisely defining a notion of equivalent script; it is neither especially compact nor especially readable. The normalization function for the interchange encoding is defined as follows: Comments are omitted. Delimiters are replaced by empty if possible, otherwise with ",". Leading zeros are dropped from a digits encoding of an integer. Reals are uniformly encoded in E format with a single non-zero digit to the left of the "." and no trailing zeros; 0 is encoded by "0.0". An upper case letter in an identifier is replaced by the corresponding lower case letter. Each direct invocation (abbreviation) is replaced by its binding. 2.4.3. Level restriction For each internalization fidelity level L of Interscript, there is an (idempotent) level restriction function RIL which converts an arbitrary interchange script into an interchange script of level L. An interchange script is of level L if RIL applied to it is the identity. A restriction function replaces an excluded structure with its value according to the semantics of Interscript, converts excluded form information into additional content with a special property, and removes excluded tags. 2.4.4. Private encodings A private encoding may use any scheme for expressing the content of a script. Certain requirements are imposed on private Interscript encodings to ensure that they can express the entire content of a script at a given level, and no more. Since no general statements can be made about the bits, characters or other low level constituents of a private encoding, these constraints are stated in terms of the existence of certain functions that convert private encodings to interchange encodings and vice versa. 
An encoding for which these functions do not exist is not an Interscript encoding. The recommended way of demonstrating that the functions exist is to exhibit them as executable programs. This makes it easy to run test cases.

A particular private encoding has a fixed fidelity level. Informally, this means that it can encode any script of that level. For any private Interscript encoding P of fidelity level L, the following functions must exist:

NP, the normalization function for P; see 2.4.2.
CPI, a conversion function from a script in P to an interchange script of level L.
CIP, a conversion function from an interchange script of level L to a script in P.

If a script violates the definition of Interscript, a conversion function may report this fact instead of producing a converted result. In other words, conversion need not be defined on erroneous scripts.

Given these functions, we can define functions which convert normalized private scripts to normalized interchange scripts of level L and conversely (NI denotes the normalization function for the interchange encoding):

NPI(s) = NI(CPI(s))
NIP(s) = NP(CIP(s))

In other words, first convert to the other encoding, and then normalize. These functions must be inverses of each other. This means that after normalization (which does not change the output), a private script can be converted to an interchange script and then back to the same private script, and vice versa. Hence it seems reasonable to say that the private encoding can express exactly the same information.

Many tricks are available for designing private encodings with desirable properties. With some knowledge of the statistics of actual scripts, encodings can minimize the number of bits required to represent the average script, by Huffman or conditional coding of the primitives.
For example, if strings consist primarily of ordinary written English text, an encoding with five bits per character might be attractive, with the codes assigned as follows:

lower case letters except "q", "x", and "z" (23)
space
comma space
semicolon space
colon space
dot space space one upper case character
escape to upper case
one upper case character
escape to digits
one digit character
(32 total)

The upper case and digits sets would be analogous. A more complex, but perhaps even more compact encoding would take account of the letter frequencies in English text. Similarly, the most common labels can be encoded compactly.

There are other useful ideas for private encodings. The bracketing constructs may be replaced by constructs with explicit length fields; these can be shorter, it is easy for the decoder to skip the bracketed constructs, and if the script is damaged it is easier to recover than from the loss of a closing bracket. Hints can be associated with nodes that will speed translation to a particular editor's representation.

In designing a private encoding, it is advisable to handle all the constructs of Interscript reasonably compactly, rather than allowing some "unpopular" ones to be encoded very clumsily. Otherwise scripts originally generated in another encoding may cause terrible performance.

3. Higher-Level Issues

3.1. Standard and Editor-Specific Transcriptions

We need a two-level structure so that documents expressed in the base language can both (a) be interchanged among different editors, and (b) retain information of special significance to a specific editor. We call (a) the interchange standard information, or standard information, and (b) editor-specific information. Basically, an editor X is free to couch properties in its own terms, which can make it easy for it to consume a script produced by itself, but it must provide a set of mappings which will transform properties into the interchange standard.
The recommended method for doing this is for X to invoke its own name as the very first item in the root node of any X-specific subtree. The rules for inheritance of properties mean that often only the root node of a document will need to have this property, but there is nothing wrong with nodes being in different editor-specific terms provided they invoke the appropriate editor properties.

Now, to be a valid standard script, the document must have the definition of the name X placed in the script itself (there is nothing wrong with keeping libraries of editor-specific definitions and including them in each script). When X parses an X-specific script, it will use its X-specific attributes and never invoke the mappings from X-specific information to standard terms; i.e., it can use a null definition for the name X. However, when such a document is interpreted by some other editor Y, any time it tries to access a standard name, the mapping from that name to the corresponding expression in terms of the X-specific values in the script will have been provided by the definition of X.

What guarantee is there that this can always be done? It is worth noting first that we are speaking here of a script being internalized by an editor, Y, rather than being externalized. Consequently, it is never necessary to access standard names in left-hand contexts; i.e., to do bindings that are not part of the script in order to interpret it. Y may, however, need to access components of environments in order to internalize the script for itself. These are always values in right-hand side contexts, and must be computed in terms of the X-specific information that X put in the script. We can examine this issue on a case-by-case basis.
Below is a list of examples of possible editor-specific uses of the base language and the mappings that would allow another editor to treat the document in standard terms:

Symbolic values used instead of numbers: supply standard values for the symbolic values:
    Standard:         lineLeading _ 1*pt -- some numeric value --
    Editor-specific:  lineLeading _ single
    mapping:          single = 2*pt

Different names used for standard names: supply a binding to the standard name from the editor-specific name using a quoted expression so that it is only evaluated when needed in a righthand context:
    Standard:         lineLeading _ 2*pt
    Editor-specific:  lineSpace _ single
    mapping:          lineLeading _ 'lineSpace'

Different concepts used for standard ones: supply a binding to the standard attribute names from the editor-specific concepts using quoted expressions so that they are only evaluated when needed in righthand contexts:
    Standard:         lineLeading _ 2*pt
    Editor-specific:  lineSpacing _ [fontSize_10 on_14 leading_1] -- lineSpacing units assumed to be pts --
    mapping:          lineLeading _ 'pt*Spacing.on-Spacing.fontSize' -- compute result in standard units --

In general, one can use the facilities of the base language to write essentially arbitrary programs that can be bound as quoted expressions to a standard identifier to cause the appropriate value to be computed based on editor-specific information put in the document by the editor that externalized it.

Moreover, since the mappings provided by editor X can be overridden in any subtree of the document, an editor that does not "understand" some subtree of a document produced by another editor Y can simply leave that subtree intact when producing an edited version of the original script, except to ensure that that subtree's root node's first expression is an invocation of "Y", which will cause Y's editor-specific mappings to obtain in that subtree.

3.2. Standard External Environment

It is important to provide for a standard external environment for rendering scripts so that standard definitions need not be carried along with every script that uses them. The external environment contains definitions for units (inch, pt, etc.), various "styles" (para, figure, etc.), and useful abbreviations (italic, bold, etc.).

3.2.1. Units

The Interscript standard assumes that distances are in meters and angles are in degrees. Using the language and the following constants defined in the standard external environment, a script can readily express distances and angles in other, possibly more convenient units:

    meter=1.0              -- IN TERMS OF METERS --
    mica=1.E-5*meter       -- mica = 1.E-5 --
    inch=2540*mica         -- inch = 2540 --
    pt=.013836*inch        -- pt = 35.143 --
    pica=12*pt             -- pica = 421.752 --
    tenPitch=inch/10       -- tenPitch = 254 --
    twelvePitch=inch/12    -- twelvePitch = 211.667 --
    degree=1.0             -- ANGLES ARE IN DEGREES --
    pi=3.14159265
    radian=180*degree/pi   -- = 57.29577951 --

APPENDIX A
GLOSSARY

Italics indicate words defined in this glossary.
abbreviation
    An invocation used to shorten a script, rather than to indicate structure
attribute
    A component of an environment, identified by its name, which is bound to a value
base language
    The part of the Interscript language that is independent of the semantics of particular properties and attributes
base semantics
    The semantic rules that govern how scripts in the base language are elaborated to determine their contents, environments, and labels
binding
    The operation of associating a value with a name to add an attribute to an environment; also the resulting association
binding mode
    A value may be bound to an identifier as local, const or global
Boolean
    An enumerated primitive type (F, T) used to control selection and as primitive values
const binding
    A binding of an attribute that prevents its being rebound in any contained scope
contents
    The vector of values denoted by a node of a script
definition
    Another name for a const binding
document
    The internalization of a script in a representation suitable for some editor
dominant structure
    The tree structure of a document corresponding to the node structure of its script
editor-specific name
    A non-standard name used by a specific editor in scripts it generates; an editor may use editor-specific terms without interfering with the interchangeability of a script if it provides definitions of the standard names in terms of its editor-specific names
elaborate (verb)
    To develop the semantics of a script or a node of a script according to the Interscript semantic rules. This is a left-to-right, depth-first processing of the script
encoding
    A particular representation of scripts
environment
    A value consisting of a set of attributes. An environment may be either free-standing or nodal. A free-standing environment is a structured value much like a record, with the components being the attributes of the environment. A nodal environment is associated with a node of a script and represents the attributes bound in that node.
expression
    A syntactic form denoting a value
external environment
    A standard environment relative to which an entire script is elaborated
externalization
    The process of converting from a document to a script; also the result of that process
fidelity
    The extent to which an externalization or internalization preserves contents, form, and structure
hexInt
    A component of a hexSequence formed from a pair of letters in the set {A,B,...,O,P}, and representing an integer in the range [0..256)
hexSequence
    A sequence of hexInt pairs enclosed between "#" pairs and used to encode characters in string literals, e.g., #ENCODE#
hierarchical name
    A name containing at least one period, whose prefix unambiguously denotes the naming authority that assigned its meaning
identifier
    A sequence of letters used to identify an attribute
integer
    A mathematical integer in a limited range; one of the primitive types
interchange encoding
    The standard encoding for scripts
internalization
    The process of converting from a script to a document; also the result of that process
Interscript
    The current name of this basis for an editable document standard
invocation
    The appearance of a name in an expression, except as the attribute of a binding
label
    A tag, or a source, a target, or a link introduction placed in a node
link
    The cross product of a source and a target; in general, a link is a set of (source, target) pairs; in the special case when there is exactly one source and one target, a link behaves like a directed arc between a pair of nodes
link introduction
    The appearance of LINKS id in a node, where id is the main identifier of a link
literal
    A representation of a value of a primitive type in a script
local binding
    A binding of a value to a name, causing the current environment to be updated with the new attribute; any outer binding's scope will resume at the end of the innermost containing node
name
    A sequence of identifiers internally separated by periods; e.g., a.b.c
nested environment
    The initial environment of a node contained in another node
NIL
    A name for the empty value; it does not lengthen a vector or node in which it appears
node
    Everything between a matched pair of {}s in a script; this generally represents a branch point in a document's dominant structure
NULL
    Identifies the empty environment; the value it associates with any identifier is NIL
OUTER
    A standard attribute of every environment:
    For a free-standing environment (i.e., a record-like, structured value), OUTER=NULL.
    For a nodal environment, OUTER's value is the environment of the current node's parent just prior to the start of the current node. For the root node of a document, OUTER=X. For X, OUTER=NULL.
global binding
    A kind of binding (indicated by ":=") that modifies the environment of the root node of a document only, and hence may endure beyond the end of the current node and may be seen by nodes to the right of the current node, even those not hierarchically descended from the current node.
primitive type
    Boolean, Integer, Real, String, or Universal
primitive value
    A literal or a node, vector, or environment containing only primitive values
private encoding
    One of a number of non-standard encodings of a script
property
    Each tag on a node labels it with a property; the properties of a node determine how it may be viewed and edited
quoted expression
    A value which is an expression bracketed by single quotes ("'"); the expression is evaluated in each environment in which the identifier to which it is bound is invoked
real
    A floating point number
scope
    The region of the script in which invocations of the attribute named in a binding yield its value; the scope starts textually at the end of the binding, and generally terminates at the end of the innermost containing node
script
    An Interscript program; the interchangeable result of externalizing a document
selection
    A conditional form in a script that denotes one of two expressions, depending on the value of a Boolean expression in
    the current environment
source
    The set of nodes with REF link, which thereby refer to the set of target links.
string
    A literal which is a vector of characters bracketed by "<>"
style
    A quoted expression to be invoked in a node to modify the node's environment, labels, or contents
Sub
    A standard component of each environment, which is implicitly invoked to initialize nested environments
SUBSCRIPT
    A function that can be used to extract a value from a vector
tag
    A universal name labelling a node using the syntax universal$; the properties of a node correspond to the set of tags labelling it
target
    The set of nodes labelled with link:
transparency
    A characteristic of scripts that allows an editor to identify the nodes of a script that it understands and thereby enables it to operate on those nodes without disturbing the ones that it doesn't understand
Units
    A set of definitions relating various typographical and scientific units to the Interscript standard units, meters; e.g., inch=2.54E-2*meter, pt=.013836*inch
universal
    An identifier formed entirely of uppercase letters and digits
value
    A primitive value, node, vector, environment, universal, or quoted expression
vector
    An ordered sequence of values that may be subscripted
X
    The standard outer environment for an entire script; the value of an unbound identifier in X is the universal consisting of the same letters in upper case

APPENDIX B
ARBITRARY CHOICES

"One of the primary purposes of a standard is to be definitive about otherwise arbitrary choices."

There are many places in this proposal where we have made an arbitrary choice for definiteness. It will be important that the ultimate standard make some choice on these points; it matters little whether it is the same as ours.
To forestall profitless debate on these points, we have tried to list some of the choices that we believe can be easily changed at a later date:

Encoding choices:
The choice of representations for literals (we generally followed Interpress here).
The selection of particular characters for particular kinds of bracketing, and for particular operators.
The choice of infix and functional notation for the interchange encoding (as opposed, e.g., to Polish postfix).
The choice of particular identifiers for basic concepts.

Linguistic choices:
The choice of a particular set of basic operators for the language.
The particular set of primitive data types (we followed Interpress; its set seems about as small as will suffice).
The choice of particular syntactic sugars for common linguistic forms.

APPENDIX C
FORMAL SEMANTICS

C.1. Grammar

Our notation is basically BNF with terminals quoted and augmented by the following conventions: a sequence enclosed in [ ] brackets may occur zero or one times; a construct followed by * may occur zero or more times; parentheses ( ) are used purely for grouping.

script ::= header node trailer
header ::= "Interscript/Interchange/1.0 "
trailer ::= "EndScript"
item ::= content | binding | label
content ::= term | node
term ::= primary | primary op term
op ::= "+" | "-" | "*" | "/"
primary ::= literal | invocation | indirection | application | selection | vector
literal ::= Boolean | integer | real | string | universal
invocation ::= name
name ::= id ( "." id )*
indirection ::= name "%"
application ::= ( name | universal ) "[" item* "]"
universal ::= ucID
selection ::= "(" term "|" item* "|" item* ")"
vector ::= "(" item* ")"
node ::= "{" item* "}"
binding ::= localBind | globalBind
localBind ::= name "_" rhs
globalBind ::= ( name | universal ) ":=" rhs
rhs ::= content | op term | "'" item* "'" | "[" item* "|" binding* "]"
label ::= tag | link
tag ::= universal "$"
link ::= "LINKS" id | "^" name | name ":"

C.2. Notation for environments

Environments bind identifiers to expressions, in various modes ("=", ":=", "_"):

NULL denotes the "empty" environment
[E | id _ e] means "E with id bound to e"
locVal(id, E) denotes the value locally bound to id in E
locVal(id, NULL) = NIL = ""
locVal(id, [E | id' m e]) = if id=id' then e else locVal(id, E)

C.3. Semantic functions

R: expression, environment --> expression -- Reduction
    R is used for evaluating right-hand sides: identifiers, expressions, etc.
C: expression --> expression -- Contents
    C is basically used to indicate which evaluated expressions become part of the content of a node
B: expression, environment --> environment -- Bindings
    B indicates the effect a binding has on an environment. B and R are mutually recursive functions (e.g., the evaluation of an expression may cause some bindings to occur as well).

The following four semantic functions occur less frequently in any substantive way in the semantics below. You might wish to skip them until they occur in a nontrivial manner in the semantics.

T: expression --> expression -- Tags
    T indicates when an identifier is to be included in the tag set for a node
L: expression --> expression -- Links
    L indicates link declarations
Ls: expression --> expression -- Link sources
    Ls indicates a link to the set of nodes having associated target links
Lt: expression --> expression -- Link targets
    Lt indicates that the node is to be included in the target set of all the names which are prefixes of the name to which the expression should evaluate

C.4. Presentation by feature

[E is used to represent the value of the environment in which the feature occurs.]
script ::= header node trailer
header ::= "Interscript/Interchange/1.0 "
trailer ::= "EndScript"

The semantics of the root node of a script are equivalent to the following general semantics for a node with the initial environment being the outermost, external environment X instead of E:

node ::= "{" item* "}"
R = C = "{" R<"Sub" item*>([NULL | "OUTER" "=" E]) "}"
B = locVal("OUTER", (B<"Sub" item*>([NULL | "OUTER" "=" E])))
T = L = Ls = Lt = NIL

Nodes have nested environments, and can have more global effects only through global (:=) bindings. The items of a node are implicitly prefixed with the identifier Sub, which may be bound to any information intended to be common to all subnodes in a scope.

item* ::= ""
R = C = T = L = Ls = Lt = NIL
B = E

The empty sequence of items has no value and no effect; this is the basis for the following recursive definition.

item* ::= item1 item*
R = R<item1>(E) R<item*>(B<item1>(E))
B = B<item*>(B<item1>(E))
For F in {C, T, L, Ls, Lt}: F = F<item1> F<item*>

In general, the value of a sequence of items is just the sequence of item values; binding items affect the environment of items to their right; NIL does not change the length of a result sequence.

term ::= primary op term
op ::= "+" | "-" | "*" | "/"
R = C = R<primary>(E) op R<term>(E)
B = E
T = L = Ls = Lt = NIL

Both the primary and the term must reduce to numbers; the arithmetic operators are evaluated right-to-left (a la APL, without precedence) and bind less tightly than application.

primary ::= literal
literal ::= Boolean | integer | hexint | real | string
R = C = literal
B = E
T = L = Ls = Lt = NIL

The basic contents of a document.
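The right-to-left rule for terms can be illustrated with a small evaluator. The Python sketch below is an assumption-laden illustration, not part of the proposal: eval_term is a hypothetical helper that takes an already tokenized term (numbers alternating with operator characters) and follows the production term ::= primary | primary op term directly.

```python
import operator

# The four infix operators of the base language.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_term(tokens):
    """Evaluate [primary, op, primary, op, ...] right-to-left, without precedence."""
    if len(tokens) == 1:
        return tokens[0]                      # term ::= primary
    primary, op, rest = tokens[0], tokens[1], tokens[2:]
    return OPS[op](primary, eval_term(rest))  # term ::= primary op term

print(eval_term([2, "*", 3, "+", 4]))  # 2 * (3 + 4) = 14
print(eval_term([8, "-", 3, "-", 2]))  # 8 - (3 - 2) = 7
```

Under conventional precedence and left-to-right association the same inputs would give 10 and 3, which is why the rule deserves attention when scripts are written by hand.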
invocation ::= id
R = R<valOf(id, E)>(E)
B = B<valOf(id, E)>(E)
where
valOf(id, E) = locVal(id, whereBound(id, E)) -- Gets innermost value
whereBound(id, E) = CASE -- Gets innermost binding
    locBinding(id, E) ~= NONE => E
    locBinding("OUTER", E) ~= NONE => whereBound(id, locVal("OUTER", E))
    True => NULL

Both attributes and definitions are looked up in the current environment; depending on the current binding of id, this may produce values and/or bindings; if the binding's rhs was quoted, the expression is evaluated at the point of invocation. When an id is referred to and locBinding(id, E)=NONE, then the value is sought recursively in locVal("OUTER"). The outermost environment, X, binds each id to the "universal" name which is the uppercase equivalent of id.

invocation ::= name "." id
R = R<valOf(id, R<name>(E))>(E)
B = B<valOf(id, R<name>(E))>(E)

Qualified names are treated as "nested" environments.

universal ::= ucID
R = C = ucID
B = E
T = L = Ls = Lt = NIL

Uppercase-only identifiers are presumed to be directly meaningful and are not looked up in the environment.

application ::= invocation "[" item* "]"
R = apply(invocation, R<item*>(E), E)
B = E
where apply(invocation, value*, E) =
CASE R<invocation>(E) OF
    "EQUAL" => value1 = value2
    "GREATER" => value1 > value2
    . . .
    "SUBSCRIPT" => value1[value2] -- value1: sequence, value2: int
    "CONTENTS" => "(" C ")"
    "TAGS" => "(" T ")"
    "LINKS" => "(" L ")"
    "SOURCES" => "(" Ls ")"
    "TARGETS" => "(" Lt ")"
    ELSE => R<invocation>([[NULL | "OUTER" "=" E] | "Value" "=" value*])
inner("{" value* "}") = value*

If the invocation does not evaluate to one of the standard external function names, the current environment is augmented with a binding of the value of the argument list to the identifier Value, and the value is the result of the invocation in that environment; this allows function definition within the language.
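The lookup rule for simple identifiers (search the current environment, then follow OUTER, with X supplying the uppercase universal for anything unbound) can be sketched as follows. This Python fragment is a minimal illustration under assumptions: the class Env and the functions where_bound and val_of are hypothetical names, and the behaviour of X is modeled as a fallback rather than as an explicit environment.

```python
class Env:
    """A nodal environment: local bindings plus a reference to OUTER."""
    def __init__(self, outer=None):
        self.local = {}
        self.outer = outer

def where_bound(name, env):
    """Return the innermost environment that binds name, or None."""
    while env is not None:
        if name in env.local:
            return env
        env = env.outer
    return None

def val_of(name, env):
    """Return the innermost value of name."""
    found = where_bound(name, env)
    if found is None:
        # Assumed model of X: an unbound identifier yields its universal.
        return name.upper()
    return found.local[name]

root = Env()
root.local["margin"] = 10
child = Env(outer=root)
child.local["margin"] = 12
grandchild = Env(outer=child)

print(val_of("margin", grandchild))  # 12, the innermost binding
print(val_of("margin", root))        # 10
print(val_of("bold", grandchild))    # BOLD, from the fallback
```

Quoted right-hand sides and qualified names are deliberately left out of this sketch.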
selection ::= "(" term "|" item1* "|" item2* ")"
R = if R<term>(E) then R<item1*>(E) else R<item2*>(E)
B = if R<term>(E) then B<item1*>(E) else B<item2*>(E)

The notation for selections (conditionals) is borrowed from Algol 68: ( test | true part | false part ). This is consistent with our principles of using balanced brackets for compound constructions and avoiding syntactically reserved words; the true part and false part may each contain an arbitrary number of items (including none).

sequence ::= "(" item* ")"
R = C = "(" R<item*>(E) ")"
B = B<item*>(E)
T = L = Ls = Lt = NIL

Parentheses group a sequence of items as a single value; bindings in the sequence affect the environment of items to the right in the containing node, but labels are disallowed. Parentheses may also be used to override the right-to-left evaluation of arithmetic operators; an operand sequence must reduce to a single numeric value.

binding ::= name "_" rhs
R = NIL
B = localBind(name, R<rhs>(E), E)
where
localBind(id, value, E) = [E | id _ value]
localBind(id "." name, value, E) = [E | id _ localBind(name, value, valOf(id, E))]

This adds a single binding to E; bindings have no other "side effects" and no value.

binding ::= universal ":=" rhs
binding ::= name ":=" rhs
R = NIL
B = globalBind(name, R<rhs>(E), E)
where
globalBind(name, value, E) =
    if locVal("OUTER", E)=NIL then localBind(name, value, E)
    else [E | "OUTER" _ globalBind(name, value, locVal("OUTER", E))]

Each environment, E, initially contains only its "inherited" environment (bound to OUTER). Most bindings take place directly in E. To allow for "global" bindings, the value of a globalBind(name, R<rhs>(E), E) will change E by rebinding id in the outermost environment X (reached in the semantics by following the OUTER path from E until the outermost one is reached; if we started in a nodal environment, this will be X). Note that a global binding to some variable b does not guarantee that using b in a rhs context will result in accessing the global b because a local binding to b may intervene.
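The difference between local and global binding, including the shadowing caveat just noted, can be made concrete with a small model. The Python sketch below is illustrative only and rests on assumptions: environments are modeled as plain dictionaries with an "OUTER" entry, and local_bind, global_bind and lookup are hypothetical helpers that mirror the recursive definitions above.

```python
def local_bind(name, value, env):
    """[E | id _ value]: a copy of env with name bound to value."""
    updated = dict(env)
    updated[name] = value
    return updated

def global_bind(name, value, env):
    """Follow OUTER to the outermost environment and bind there."""
    outer = env.get("OUTER")
    if outer is None:
        return local_bind(name, value, env)
    return local_bind("OUTER", global_bind(name, value, outer), env)

def lookup(name, env):
    """Innermost value of name along the OUTER chain, or None."""
    while env is not None:
        if name in env:
            return env[name]
        env = env.get("OUTER")
    return None

x = {"OUTER": None}             # stands in for the outermost environment X
root = {"OUTER": x}
node = {"OUTER": root, "b": 1}  # b is bound locally in this node

node = global_bind("a", 5, node)
node = global_bind("b", 7, node)

print(lookup("a", node))             # 5, found in the outermost environment
print(lookup("b", node))             # 1, the local binding intervenes
print(node["OUTER"]["OUTER"]["b"])   # 7, the global binding itself
```

Because the helpers return new dictionaries instead of mutating shared ones, the original x is left unchanged, which matches the purely functional style of the semantic definitions.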
Note that in a context such as [ | a := 7], the effect of the above semantics is the same as [ | a _ 7].

binding ::= name mode op term =

This is just a convenient piece of syntactic sugar for the common case of updating a binding.

rhs ::= "'" item* "'"
    R = item*

If the rhs of a binding is surrounded by single quotes, it will be evaluated in the environments where the name is invoked, rather than the environment in which the binding is made.

rhs ::= "[|" binding* "]"
    R = [B<binding*>([NULL | "OUTER" "=" E]) | "OUTER" "=" NULL]

This creates a new environment value that may be used much like a record.

rhs ::= "[" invocation "|" binding* "]"
    R = [B<binding*>([R<invocation>(E) | "OUTER" "=" E]) | "OUTER" "=" NULL]

This creates a new environment value that is an extension of an existing one.

tag ::= universal "$"
    R = R(E)
    B = B(E)
    T = universal
    C = L = Ls = Lt = NIL

This gives the containing node the property denoted by the universal and also invokes the universal in the outermost environment (if it is not bound there, NIL will be produced, which contributes nothing to R).

link ::= "LINKS" id
    R = "LINKS" id
    L = id
    B = E
    C = T = Ls = Lt = NIL

This defines the scope of the set of links whose "main" component is id. A label N: on a node makes that node a "target" of the link N (and its prefixes); a reference ^N makes it a "source." The "main" identifier of a link must be declared (using LINKS id) at the root of a subtree containing all its sources and targets. The link represents a set of directed arcs, one from each of its sources to each of its targets. Multiple target labels make a node the target of multiple links. A target label that appears only on a single node places it in a singleton set, i.e., identifies it uniquely.

link ::= "^" name
    R = "^" name
    Ls = name
    B = E
    C = T = L = Lt = NIL

This identifies the containing node as a "source" of the link name.

link ::= name ":"
    R = name ":"
    Lt = prefixes(name)
    B = E
    C = T = L = Ls = NIL
  where
    prefixes(id) = id
    prefixes(name "." id) = name "." id prefixes(name)

This identifies the containing node as a "target" of each of the links that is a prefix of name.

C.5. Discussion

Each script is evaluated in the context of an initial environment, X, which can contain attributes global to all scripts and attributes that specify values for system-specific identifiers, and in which all global bindings are made.

Each environment, E, initially contains only its "inherited" environment (bound to OUTER). Most bindings take place directly in E. To allow for more persistent bindings, a bind(id, ":=", val, E) changes E by rebinding id in X. For the root node of a script, OUTER = X.

If the right-hand side of a binding is surrounded by single quotes, it will be evaluated in the environments where the name is invoked, rather than the environment in which the binding is made.

When an id is referred to and locBinding(id, E) = NONE, the value is sought recursively in locVal("OUTER", E). The X environment binds each id to the "universal" name that is its uppercase equivalent (e.g., the universal for iDentiFieR is IDENTIFIER).

Nodes are delimited by braces. The contents of each node are implicitly prefixed by Sub, which will generally be bound in the containing environment to a quoted expression performing some bindings, and perhaps supplying some labels (tags and links).

Parentheses are used to delimit sequence values. Square brackets are used to delimit the argument list of an operator application and to denote environment constructors, which behave much like records.

Expressions involving the four infix ops (+, -, *, /) are evaluated right-to-left (a la APL); since we expect expressions to be short, we have not imposed precedence rules.
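The right-to-left rule for the four infix operators can be shown with a short Python sketch. It assumes an expression has already been split into a flat list of numbers and operator symbols, mirroring the production term ::= primary op term; the function name is hypothetical, and mapping "/" to real division is an assumption, since the draft does not say how division behaves.

```python
import operator

# Assumption: "/" is real division; the draft does not specify this.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_right_to_left(tokens):
    """Evaluate [primary, op, primary, op, ...] with no precedence,
    grouping from the right: a op (b op (c ...))."""
    if len(tokens) == 1:
        return tokens[0]
    left, op, rest = tokens[0], tokens[1], tokens[2:]
    return OPS[op](left, eval_right_to_left(rest))

print(eval_right_to_left([2, "*", 3, "+", 4]))    # 2 * (3 + 4)
print(eval_right_to_left([10, "-", 4, "-", 3]))   # 10 - (4 - 3)
```

Under this rule 2 * 3 + 4 yields 14 rather than 10, and 10 - 4 - 3 yields 9 rather than 3, which is why a parenthesized sequence is needed whenever the conventional grouping is intended.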
The notation for selections (conditionals) is borrowed from Algol 68:

    (test | true part | false part)

This is consistent with our principles of using balanced brackets for compound constructions and avoiding syntactically reserved words; the true part and false part may each contain an arbitrary number of items (including none).

A label N: on a node makes that node a "target" of the link N (and its prefixes); a reference ^N makes it a "source." The "main" identifier of a link must be declared (using LINKS id) at the root of a subtree containing all its sources and targets. The link represents a set of directed arcs, one from each of its sources to each of its targets. Multiple target labels make a node the target of multiple links. A target label that appears only on a single node places it in a singleton set, i.e., identifies it uniquely.

C.6. Grammatical feature X Semantic function matrix

LEGEND:
    -   Semantic function produces NIL or E or does not apply.
    +   Non-trivial semantic equation.
    =   For R: passes value unchanged; for C: value same as R.

FEATURES                                          FUNCTIONS
                                                  R   C   B   T   L   Ls  Lt
term ::= primary op term                          +   =   -   -   -   -   -
primary ::= literal                               =   =   -   -   -   -   -
invocation ::= id                                 +   -   +   -   -   -   -
invocation ::= name "." id                        +   -   +   -   -   -   -
universal ::= name "$"                            =   =   -   -   -   -   -
application ::= invocation "[" item* "]"          +   -   -   -   -   -   -
selection ::= "(" term "|" item1* "|" item2* ")"  +   -   +   -   -   -   -
node ::= "{" item* "}"                            +   =   +   -   -   -   -
sequence ::= "(" ( value | binding )* ")"         +   =   +   -   -   -   -
item* ::= item1 item*                             +   +   +   +   +   +   +
binding ::= name mode rhs                         -   -   +   -   -   -   -
rhs ::= "'" item* "'"                             +   -   -   -   -   -   -
rhs ::= "[|" binding* "]"                         +   -   -   -   -   -   -
rhs ::= "[" invocation "|" binding* "]"           +   -   -   -   -   -   -
tag ::= invocation "%"                            +   -   -   +   -   -   -
link ::= "LINKS" id                               =   -   -   -   +   -   -
link ::= "^" name                                 =   -   -   -   -   +   -
link ::= name ":"                                 =   -   -   -   -   -   +
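The link rules discussed in C.5 (a label makes a node a target of the named link and of each of its prefixes, a reference makes it a source, and a link stands for one arc from every source to every target) can be sketched in Python as follows. This is only an illustration: the dictionary layout for nodes, the node identifiers, and the function names are all hypothetical, and the check that each main identifier is declared with LINKS at an enclosing root is omitted.

```python
def prefixes(name):
    """prefixes("a.b.c") -> ["a.b.c", "a.b", "a"], following
    prefixes(name "." id) = name "." id prefixes(name)."""
    parts = name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]

def link_arcs(nodes):
    """Return the set of (link, source_node, target_node) arcs.

    Assumption: `nodes` maps a node id to a dict with optional
    "sources" (names used as ^name) and "targets" (names used as name:).
    """
    sources, targets = {}, {}
    for node_id, labels in nodes.items():
        for name in labels.get("sources", []):
            sources.setdefault(name, set()).add(node_id)
        for name in labels.get("targets", []):
            # A target label applies to the name and to each of its prefixes.
            for link in prefixes(name):
                targets.setdefault(link, set()).add(node_id)
    return {
        (link, src, tgt)
        for link, srcs in sources.items()
        for src in srcs
        for tgt in targets.get(link, set())
    }

# Hypothetical script fragment: two labeled figures and two references.
nodes = {
    "n1": {"targets": ["fig.one"]},
    "n2": {"targets": ["fig.two"]},
    "n3": {"sources": ["fig"]},       # refers to every figure
    "n4": {"sources": ["fig.one"]},   # refers to one figure only
}
for arc in sorted(link_arcs(nodes)):
    print(arc)
```

Because fig.one appears as a target label on a single node, the reference from n4 identifies that node uniquely, while the reference to the prefix fig reaches both labeled nodes.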