Page Numbers: Yes X: 530 Y: 10.5" First Page: 20
Columns: 1 Edge Margin: .6" Between Columns: .4"
Margins: Top: 1.3" Bottom: 1" Binding: -5
Line Numbers: No Modulus: 5 Page-relative
Odd Heading:
Towards an Interchange Standard for Editable Documents
Even Heading:
Towards an Interchange Standard for Editable Documents
2. The Language Basis: Syntax and Semantics
2.1. Grammar
Our notation is basically BNF with terminals quoted and augmented by the following conventions:
a sequence enclosed in [ ] brackets may occur zero or one times;
a construct followed by * may occur zero or more times;
parentheses ( ) are used purely for grouping.
script::=versionId item
item::=content | binding | label
content::=term | node
term::=primary | primary op term
op::="+" | "−" | "*" | "/"
primary::=literal | invocation | indirection | application | selection | vector
literal::=Boolean | integer | intSequence | real | string | universal
invocation::=name
name::=id ( "." id )*
indirection::=name "%"
application::=( name | universal ) "[" item* "]"
universal::=ucID ( "." ucID )*
selection::="(" term "|" item* "|" item* ")"
vector::="(" item* ")"
node::="{" item* "}"
binding::=name mode rhs
mode::="←" | "=" | ":" | ":="
rhs::=content | op term | "’" item* "’" | "[" [ item* ] "|" binding* "]"
label::=tag | link
tag::=universal "$"
link::=id "@!" | name "@" | name "!"
2.2. Discussion of Features
[Note that we have a formal semantic definition for this language that is every bit as precise as the grammar above. However, we have not yet figured out how to present it in a form that humans find equally palatable, so we postpone it to an appendix.]
primary ::= literal
literal ::= Boolean | integer | intSequence | real | string
The primitive elements by which the value of a document is represented.
term ::= primary op term
op ::= "+" | "−" | "*" | "/"
Both the primary and the term must reduce to numbers; the arithmetic operators are evaluated right-to-left (a la APL, without precedence) and bind less tightly than function application. The result is a real if either operand is.
invocation ::= id
Id is looked up in the current environment; depending on its current binding, this may produce contents, bindings, and/or labels; if the rhs bound to id was quoted, that expression is evaluated in the current environment. In the (implicit) outermost environment, every id is bound to the corresponding universal (ID).
invocation ::= name "." id
Qualified names represent lookup in "nested" environments; name must have been bound to an environment, in which id is looked up.
indirection ::= name "%"
This indicates an intentional indirection through name, which should be preserved as part of the structure; replacing the indirection by its value in the current environment is a value-preserving loss of structural fidelity. (An invocation that is simply a name is an abbreviation that need not be preserved.)
universal ::= ucID ( "." ucID )*
Universals are like names, but written entirely in upper case letters. They are presumed to be defined externally, so they are not looked up in the environment.
application ::= ( name | universal ) "[" item* "]"
If the application involves a universal (either explicitly, or because the name is bound to a universal), the corresponding function is applied to the argument list that results from evaluating item*. Part of the definition of Layer 2 will involve the specification of a small set of standard functions, which may be expanded in various Layer 3 extensions.
If name is not bound to a universal, the current environment is temporarily augmented with a binding of the value of item* to the identifier value, and the value of the application is the result of evaluating name in that environment; this allows function definition within the language.
Neither form of application changes the environment of succeeding expressions.
selection ::= "(" term "|" item1* "|" item2* ")"
This is a standard conditional item sequence, using syntax borrowed from Algol 68. The value and effect are those of item1* if the term evaluates to "T" in the current environment, those of item2* if it evaluates to "F".
vector ::= "(" item* ")"
Parentheses group a sequence of items as a single vector; bindings in item* affect the environment of items to the right in the containing node, but labels have no meaning.
node ::= "{" item* "}"
Nodes have nested environments, and affect the containing environment only through persistent (:=) bindings to ids with outer var (:) bindings. Item* is implicitly prefixed by an invocation of Sub, which may be bound to any sequence of items intended to be common to all subnodes in a scope.
item* ::= ""
The empty sequence of items has no value and no effect; this is the basis for the following recursive definition.
item* ::= item1 item*
In general, the value of a sequence of items is just the sequence of item values; binding items change the environment of items to their right in the sequence.
binding ::= name mode rhs
This adds a single binding to the current environment; bindings have no other "side effects" and no value (i.e., they do not change the length of a containing vector or node value).
binding ::= name mode op term
name mode op term is just a convenient piece of syntactic shorthand for
name mode name op term.
rhs ::= "’" item* "’"
A quoted rhs is evaluated in the environment of invocation, rather than the environment current at the point of binding.
rhs ::= "[|" binding* "]"
This creates a new environment value that may be used much like a record.
rhs ::= "[" item* "|" binding* "]"
This creates a new environment value that is an extension of the environment that is the value of item*.
tag ::= universal "$"
This gives the containing node the property denoted by the universal.
link ::= id "@!"
This introduces the set of links whose main component is id, and defines their scope.
link ::= name "@"
This identifies the immediately containing node as a source of the link name.
link ::= name "!"
This identifies the immediately containing node as a target of each of the links that is a prefix of name.
2.3. Safety Rules for Low-capability Editors
Conservative rules for editor treatment of script nodes created by other editors:
It’s OK to display a node if
you understand at least one of its properties.
It’s OK to edit (the items in) a node if
you understand all of its (local) properties, and either
you don’t remove any of them, or
you also understand all properties of its parent.
It’s OK to copy a node if
you understand all properties of its new parent,
no labels are moved outside their scope, and
the two environments have the same bindings for all attributes that you don’t either
understand, or
know can’t be relevant, and anywhere in the node or its subnodes.
It’s OK to delete a node if
you understand all properties of its parent.
[Less stringent rules will suffice if the document is merely to be viewed, rather than edited, using the original editor.]
2.4. Encodings
[Any resemblance between the following material and the corresponding section of the Interpress standard is purely an intentional consequence of plagiarism.]
The Interdoc script for a document can be encoded in many different ways. This section gives the rules for designing encodings. The purpose of these rules is to ensure that information is not lost or added by conversions from one encoding to another. There are two types of encodings: a single interchange encoding and many possible private encodings.
The interchange encoding is used to transmit a script from one site to another when the two sites must be assumed to be arbitrarily different. A private encoding is used to transmit scripts from one site to another when the two sites share the private encoding conventions. For example, a line of document-preparation products made by the same manufacturer is likely to share a private encoding, which can be used to transmit documents from one editor in the product line to another; presumably this encoding is designed to make these transfers simpler or more efficient. However, when one of these editors transmits a document to an unknown editor, the interchange encoding must be used. The interchange encoding is designed to allow easy generation, transmission, and interpretation by many different editors, possibly at the expense of compactness and speed of encoding and decoding.
2.4.1. The interchange encoding
The interchange encoding is designed to simplify creation, communication and interpretation of scripts for the widest possible range of editors and systems. For this reason, a script in the interchange encoding is represented as a sequence of graphic (printable) characters taken from the ASCII set; the subset of ASCII used is also a subset of ISO 646. Communication of an Interdoc script in the interchange encoding requires only the ability to communicate a sequence of ASCII characters; Interdoc does not specify how the characters are encoded. In effect, we define a text representation of the commands to be executed.
The choice of a text format for the interchange encoding leads to rather lengthy scripts in some cases. The bulk of an interchange script presents no great problem for document storage, since a document need not be stored in this form. Rather, as it is transmitted, the sending editor can translate its own private encoding into the interchange encoding. Similarly, the receiving editor can translate the interchange encoding into its own, usually different, private encoding for storage. However, a bulky interchange script may be more expensive to transmit. If a document consists mostly of text, the interchange encoding is quite efficient−very few characters are required in addition to those appearing in the document itself.
Character set. The character set used in the interchange encoding is described by the ISO 646 7-bit Coded Character Set For Information Processing Interchange. The interchange encoding interprets the 94 characters of the G1 set defined in the International Reference Version (ISO 646, Table 2) and the space character (2/0). This set of 95 characters is called the interchange set. Note that except for the concise "string" encoding of vectors described below, the interchange encoding has nothing to do with the integers corresponding to the characters, but depends only on the character set itself.
It is extremely important to understand that the choice of the ISO standard for the interchange format has nothing to do with character mappings in Interdoc fonts. Although these mappings must adhere to a character set standard that is shared by interchanging editors, that standard is not part of Interdoc. It is expected that Xerox will develop a separate corporate standard in this area.
If the underlying encoding of the ISO character set can also encode other characters (e.g., the control characters (0/0 through 1/15) and del (7/15), or another group of 128 characters if eight bits are being used to encode each character), these are ignored in interpreting an interchange script. This does not mean that these characters are converted to spaces, but that they are treated as if they were not present.
There are several reasons for this choice:
Control characters may be inserted freely by software that generates the interchange encoding. For example, carriage returns (0/13), line feeds (0/10), and form feeds (0/12) may be inserted at will to conform to limitations that may be imposed by an operating system. Restrictions on line length or the use of fixed-length records thus become straightforward.
Control characters may be removed or inserted freely by software that receives the interchange encoding. In this way, the receiving software can adhere to any restrictions imposed by its operating system.
The absence of control characters allows certain kinds of "non-transparent" data communication methods (such as binary synchronous communication) to be used freely.
A minor disadvantage of these conventions is that if an Interdoc script is typed in, care must be taken not to omit a significant space at the end of a line. Since scripts are normally generated by programs, this is not important. A system for manually generating (and perhaps interactively debugging) Interdoc should provide for various convenience features on input, and for prettyprinting the script on output.
Any number of space characters may also be added after any token without changing the meaning. Throughout the following, a delimiter is a space or comma, which may be omitted if the next character is not an alphanumeric, "−" or ".".
VersionId. The first characters of an interchange script conforming to this version of the Interdoc standard must be "Interdoc/Interchange/1.0 ". Note that the VersionId is of variable length, and ends with a space. These conventions simplify the design of systems that must deal with more than one kind of encoding.
If a privately encoded script can be interpreted as a sequence of characters, its first characters must be "Interdoc/private/i.j", where private is replaced by an appropriately chosen hierarchical name that identifies the encoding, e.g., "Xerox/860", and i.j is replaced by an appropriate version identification, e.g., "2.4"; the resulting header would be "Interdoc/Xerox/860/2.4".
A private encoding that cannot be interpreted as a sequence of characters (e.g., a binary, word-oriented encoding on a 36-bit machine which packs five 7-bit characters into a word) should use any available convention to make its scripts self-identifying.
Following the versionId is an item (usually a node) constituting the body of the script, with values encoded as follows.
Integer. An integer is represented in radix 10 notation using the characters "0" through "9" as digits, followed by a delimiter. A negative integer is preceded by a minus sign "−". Thus the decimal number 1234 is encoded as "1234 ", and −1234 is encoded as "−1234 ". The delimiter may be empty if the following character is a letter.
A sequence of integer literals in the range 0..255 can be represented in radix 16 notation using the characters "A" through "P" as digits ("A" corresponds to 0, "P" to 15). The entire sequence is enclosed in "#" brackets. For example, the integer 93 is represented as "#FN#", and the sequence of integers 93, 94, 95, 96 as "#FNFOFPGA#". These sequences require only two characters for each integer (plus two characters of overhead). Note that there is no delimiter between the integers in this encoding. Ordinary integer literals, with their delimiters, may be included in the sequence; e.g., 7, 93, 400, 40 could be represented by "#7,FN400CI#".
Booleans are represented by the characters "F" and "T", followed by a delimiter.
Real. A real is represented using Fortran E or F notation, with a trailing delimiter. Thus "12.34 " is the same as "1.234E1 ". Minus signs may precede the mantissa or the exponent: "−12.34E−3 ".
Identifier. An identifier is encoded by its characters (which are limited to letters and digits), followed by a delimiter: "x ", "arg1 ". The first character of an identifier must be a letter, and must be written in lower case to distinguish identifiers from universals. Other letters may be written in either case for readability, since case is not significant in distinguishing identifiers.
Vector. A vector is encoded by surrounding a sequence of values with parentheses, "(" and ")".
String. A text vector usually contains integers that are interpreted as character codes. Often these codes lie in the range 32 to 126 inclusive, which are the numbers assigned to the characters of the interchange set by ISO 646. It is convenient to encode an element of such a vector by the character whose ISO code is the desired value. Such a string can be encoded by surrounding the characters with "<" and ">", thus "<Hello!>". If the string contains elements outside the allowed range (i.e., if the value is less than 32 or greater than 126) or the value 62 or XX (the ISO codes for the characters ">" and "#"), those elements must be represented as integers inside "#" brackets, as described above. The two-character encoding of small integers is designed to make escape sequences compact. Thus "<Hello!>", "<Hello#CB#>", "<Hel#GMGP#!>", and "<Hello#33#>" are all equivalent.
Universal names. A universal is encoded by giving its name in upper case letters, followed by a delimiter. E.g., "TEXT ".
Node. A node is encoded by a "{", followed by a sequence of items, followed by a "}".
Comment. The beginning and end of a comment are both marked by a double minus sign: the sequence "−−" <any characters other than "−−"> "−−" is a comment and may occur between any two tokens. Comments are ignored in rendering the script.
The tokens of the interchange encoding are defined by the following BNF grammar, together with rules about delimiters:
The delimiter that terminates an identifier or universal may only be empty if the next character is not an alphanumeric, "−", or ".".
The delimiter that terminates an integer may only be empty if the next character is not a digit, "E", "F", "−", or ".".
extra delimiters may be inserted after any token.
token::=literal | id | ucID | op | bracket | punctuation | comment
literal::=Boolean | integer | intSequence | real | string
Boolean::=( "F" | "T" ) delimiter
delimiter::=" " | "," | empty
empty::=""
integer::=[ "−" ] digit digit* delimiter
digit::="0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
intSequence::="#" intOrHex* "#"
intOrHex::=integer | hexChar hexChar
hexChar::="A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" |
"N" | "O" | "P"
real::=[ "−" ] digit digit* "." digit* [ "E" integer ] delimiter
string::="<" stringElem* ">"
stringElem::=stringChar | intSequence
stringChar::=−− any character but "#" or ">" −−
id::=lowerCase idChar* delimiter
idChar::=letter | digit
letter::=lowerCase | upperCase
lowerCase::="a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | l" | "m" | "n" |
"o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upperCase::=hexChar | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
ucID::=upperCase* delimiter
op::= "+" | "−" | "*" | "/"
bracket ::="(" | ")" | "{ " | "}" | "<" | ">" | "[" | "]" | "‘"’
punctuation::="." | ";" | ":" | "=" | "←" | "!" | "#" | "@" | "|"
comment::="−−" commentString "−−"
commentString::=−− any sequence of characters not containing "−−" −−
alphanumeric::=letter | digit
A simple listing of an interchange script can just print the character sequence, with line breaks every n characters, or perhaps at the nearest convenient delimiter. Such a listing is reasonably easy to read, so that problems can be tracked down simply by studying it. Additional help in reading the file can be furnished by utility programs which format the file for more pleasant reading.
2.4.2. Normalization
Every encoding must define a normalization function N, which maps a script in the encoding into another script in the encoding which generates the same output. N must be idempotent (i.e., N2=N); it may not change the fidelity level of the script (see 2.4.3). If a script violates the definition of Interdoc, a normalization function may report this fact instead of producing a normalized result. In other words, normalization need not be defined on erroneous scripts.
The purpose of this function is to make possible a precise description of the rules for private encodings in section 2.4.4. The idea is that when an encoding provides several ways of saying the same thing (typically a basic way, and some more concise ways which work in common special cases), the normalized script will uniformly choose one way of saying it. Note that the normalized script is not intended for any purpose other than precisely defining a notion of equivalent script; it is neither especially compact nor especially readable.
The normalization function for the interchange encoding is defined as follows:
Comments are omitted.
Delimiters are replaced by empty if possible, otherwise with ",".
An integer encoded in hex is replaced by the same integer encoded in digits; except in strings, "#" brackets are replaced by parentheses.
Leading zeros are dropped from a digits encoding of an integer.
Reals are uniformly encoded in E format with a single non-zero digit to the left of the "." and no trailing zeros; 0 is encoded by "0.0".
An upper case letter in an identifier is replaced by the corresponding lower case letter.
Each direct invocation (abbreviation) is replaced by its binding.
2.4.3. Level restriction
For each rendition fidelity level L of Interdoc, there is an (idempotent) level restriction function RIL which converts an arbitrary interchange script into an interchange script of level L. An interchange script is of level L if RIL applied to it is the identity. A restriction function replaces an excluded structure with its value according to the semantics of Interdoc, converts excluded form information into additional content with a special property, and removes excluded tags.
2.4.4. Private encodings
A private encoding may use any scheme for expressing the content of an Interdoc script. Certain requirements are imposed on private Interdoc encodings to ensure that they can express the entire content of an Interdoc script at a given level, and no more. Since no general statements can be made about the bits, characters or other low level constituents of a private encoding, these constraints are stated in terms of the existence of certain functions that convert private encodings to interchange encodings and vice versa. An encoding for which these functions do not exist is not an Interdoc encoding. The recommended way of demonstrating that the functions exist is to exhibit them as executable programs. This makes it easy to run test cases.
A particular private encoding has a fixed fidelity level. Informally, this means that it can encode any script of that level.
For any private Interdoc encoding P of fidelity level L, the following functions must exist:
NP, the normalization function for P; see 2.4.2.
CPI, a conversion function from a script in P to an interchange script of level L.
CIP, a conversion function from an interchange script of level L to a script in P.
If a script violates the definition of Interdoc, a conversion function may report this fact instead of producing a converted result. In other words, conversion need not be defined on erroneous scripts.
Given these functions, we can define functions which convert normalized private scripts to normalized interchange scripts of level L and conversely:
NPI=NIoCPI
NIP=NPoCIP
In other words, first convert to the other encoding, and then normalize. These functions must be inverses of each other.
This means that after normalization (which does not change the output), a private script can be converted to an interchange script and then back to the same private script, and vice versa. Hence it seems reasonable to say that the private encoding can express exactly the same information.
[We need to say similar things about editor representations, transcription fidelity, and rendition fidelity.]
Many tricks are available for designing private encodings with desirable properties. With some knowledge of the statistics of actual scripts, encodings can minimize the number of bits required to represent the average script, by Huffman or conditional coding of the primitives. For example, if strings consist primarily of ordinary written English text, an encoding with five bits per character might be attractive: lower case letters except "q", "x", and "z" (23), space, comma space, semicolon space, colon space, dot space space one upper case character, escape to upper case, one upper case character, escape to digits, one digit character (32 total). The upper case and digits sets would be analogous. A more complex, but perhaps even more compact encoding would take account of the letter frequencies in English text. Similarly, the most common labels can be encoded compactly.
There are other useful ideas for private encodings. The bracketting constructs may be replaced by constructs with explicit length fields; these can be shorter, it is easy for the decoder to skip the bracketted constructs, and if the script is damaged it is easier to recover than from the loss of a closing bracket. Hints can be associated with nodes that will speed translation to a particular editor’s representation.
In designing a private encoding, it is advisable to handle all the constructs of Interdoc reasonably compactly, rather than allowing some "unpopular" ones to be encoded very clumsily. Otherwise scripts originally generated in another encoding may cause terrible performance.