2.4.1. The interchange encoding
2.4.3. Level restriction
3.1. Standard and Editor-Specific Transcriptions:
APPENDIX B
ARBITRARY CHOICES
"One of the primary purposes of a standard is to be definitive about otherwise arbitrary choices."
There are many places in this proposal where we have made an arbitrary choice for definiteness. It will be important that the ultimate standard make some choice on these points; it matters little whether it is the same as ours. To forestall profitless debate on these points, we have tried to list some of the choices that we believe can be easily changed at a later date:
Encoding choices:
The choice of representations for literals (we generally followed Interpress here).
The selection of particular characters for particular kinds of bracketing, and for particular operators.
The choice of infix and functional notation for the interchange encoding (as opposed, e.g., to Polish postfix).
The choice of particular identifiers for basic concepts.
Linguistic choices:
The choice of a particular set of basic operators for the language.
The particular set of primitive data types (we followed Interpress; its set seems about as small as will suffice).
The choice of particular syntactic sugars for common linguistic forms.
We need a two-level structure so that documents expressed in the base language can both (a) be interchanged among different editors and (b) retain information of special significance to a specific editor. We call (a) the interchange standard information, or standard information, and (b) editor-specific information.
Basically, an editor X is free to couch properties in its own terms, which can make it easy for it to consume a script produced by itself, but it must provide a set of mappings which will transform properties into the interchange standard. The recommended method for doing this is to invoke its name as the very first item in the root span of any X-specific subtree. The rules for inheritance of properties mean that often only the root span of a document will need to have this property, but there is nothing wrong with spans being in different editor-specific terms provided they invoke the appropriate editor properties.
Now, to be a valid standard script, the document must have the definition of the name X placed in the script itself. (There is nothing wrong with keeping editor-specific-to-standard mappings in a library of some sort to avoid having copies of them in each script.)
When X parses an X-specific script, it will use its X-specific attributes and never invoke the mappings from X-specific information to standard terms; i.e., it can use a null definition for the name X. However, when such a document is interpreted by some other editor Y, any time it tries to access a standard name, the mapping from that name to the corresponding expression in terms of the X-specific values in the script will have been provided by the definition of X. What guarantee is there that this can always be done?
It is worth noting first that we are speaking here of a script being internalized by an editor, Y, rather than being externalized. Consequently, it is never necessary to access standard names in left-hand contexts; i.e., to do bindings that are not part of the script in order to interpret it. Y may, however, need to access components of environments in order to internalize the script for itself. These are always values in right-hand side contexts, and must be computed in terms of the X-specific information that X put in the script. We can examine this issue on a case-by-case basis. Below is a list of examples of possible editor-specific uses of the base language and the mappings that would allow another editor to treat the document in standard terms:
Symbolic values used instead of numbers:
supply standard values for the symbolic values:
Standard: lineLeading ← 2*pt -- some numeric value --
Editor-specific: lineLeading ← single
mapping: single = 2*pt
Different names used for standard names:
supply a binding to the standard name from the editor-specific name using a quoted expression so that it is only evaluated when needed in a right-hand context:
Standard: lineLeading ← 2*pt
Editor-specific: lineSpace ← single
mapping: lineLeading ← 'lineSpace'
Different concepts used for standard ones:
supply a binding to the standard attribute names from the editor-specific concepts using quoted expressions so that they are only evaluated when needed in right-hand contexts:
Standard: lineLeading ← 2*pt
Editor-specific: lineSpacing ← [fontSize on leading ← 1] -- lineSpacing units assumed to be pts --
mapping: lineLeading ← 'pt*Spacing.onSpacing.fontSize' -- compute result in standard units --
In general, one can use the facilities of the base language to write essentially arbitrary programs that can be bound as quoted expressions to a standard identifier to cause the appropriate value to be computed based on editor-specific information put in the document by the editor that externalized it. Moreover, since the mappings provided by editor X can be overridden in any subtree of the document, an editor that does not "understand" some subtree of a document produced by another editor Y can simply leave that subtree intact when producing an edited version of the original script except to ensure that that subtree's root span's first expression is an invocation of "Y", which will cause Y's editor-specific mappings to obtain in that subtree.
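The second mapping above (a different name bound to a standard name through a quoted expression) can be sketched in Python. This is only an illustration under stated assumptions, not the Interscript evaluation model: the class Env, its methods bind and lookup, and the constant PT are hypothetical names, and a Python callable stands in for a quoted expression.

```python
class Env:
    """A toy environment: names bound to values or to quoted expressions."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, value):
        # A callable stands in for a quoted expression: it is stored unevaluated.
        self._bindings[name] = value

    def lookup(self, name):
        # Right-hand context: a quoted expression is evaluated only when read.
        value = self._bindings[name]
        return value(self) if callable(value) else value


PT = 1.0  # hypothetical numeric value for one point

env = Env()
# Mappings supplied by the definition of editor X.
env.bind("single", 2 * PT)                                # single = 2*pt
env.bind("lineLeading", lambda e: e.lookup("lineSpace"))  # lineLeading <- 'lineSpace'
# Editor-specific content written by X.
env.bind("lineSpace", env.lookup("single"))               # lineSpace <- single

# Editor Y reads the standard name and gets the value implied by X's terms.
leading_seen_by_y = env.lookup("lineLeading")
```

Because the binding for lineLeading is evaluated on each read, rebinding lineSpace in a subtree changes what a later lookup of lineLeading returns, which is the behaviour the quoted expression is meant to provide.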
For each internalization fidelity level L of Interscript, there is an (idempotent) level restriction function RIL which converts an arbitrary interchange script into an interchange script of level L. An interchange script is of level L if RIL applied to it is the identity. A restriction function replaces an excluded structure with its value according to the semantics of Interscript, converts excluded form information into additional content with a special property, and removes excluded tags.
The interchange encoding is designed to simplify creation, communication and interpretation of scripts for the widest possible range of editors and systems. For this reason, a script in the interchange encoding is represented as a sequence of graphic (printable) characters taken from the ASCII set; the subset of ASCII used is also a subset of ISO 646. Communication of a script in the interchange encoding requires only the ability to communicate a sequence of ASCII characters; Interscript does not specify how the characters are encoded. In effect, we define a text representation of the commands to be executed.
The choice of a text format for the interchange encoding leads to rather lengthy scripts in some cases. The bulk of an interchange script presents no great problem for document storage, since a document need not be stored in this form. Rather, as it is transmitted, the sending editor can translate its own private encoding into the interchange encoding. Similarly, the receiving editor can translate the interchange encoding into its own, usually different, private encoding for storage. However, a bulky interchange script may be more expensive to transmit. If a document consists mostly of text, the interchange encoding is quite efficient—very few characters are required in addition to those appearing in the document itself.
Character set. The character set used in the interchange encoding is described by the ISO 646 7-bit Coded Character Set For Information Processing Interchange. The interchange encoding interprets the 94 characters of the G1 set defined in the International Reference Version (ISO 646, Table 2) and the space character (2/0). This set of 95 characters is called the interchange set. Note that except for the concise "string" encoding of vectors described below, the interchange encoding has nothing to do with the integers corresponding to the characters, but depends only on the character set itself.
It is extremely important to understand that the choice of the ISO standard for the interchange format has nothing to do with character mappings in Interscript fonts. Although these mappings must adhere to a character set standard that is shared by interchanging editors, that standard is not part of Interscript. It is expected that Xerox will develop a separate corporate standard in this area.
If the underlying encoding of the ISO character set can also encode other characters (e.g., the control characters (0/0 through 1/15) and del (7/15), or another group of 128 characters if eight bits are being used to encode each character), these are ignored in interpreting an interchange script. This does not mean that these characters are converted to spaces, but that they are treated as if they were not present.
There are several reasons for this choice:
Control characters may be inserted freely by software that generates the interchange encoding. For example, carriage returns (0/13), line feeds (0/10), and form feeds (0/12) may be inserted at will to conform to limitations that may be imposed by an operating system. Restrictions on line length or the use of fixed-length records thus become straightforward.
Control characters may be removed or inserted freely by software that receives the interchange encoding. In this way, the receiving software can adhere to any restrictions imposed by its operating system.
The absence of control characters allows certain kinds of "non-transparent" data communication methods (such as binary synchronous communication) to be used freely.
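The convention of ignoring characters outside the interchange set can be sketched as follows. This is a minimal illustration that assumes one byte per character with ASCII code values; Interscript itself does not specify the underlying encoding, and the function names strip_ignored and wrap_for_transport are hypothetical.

```python
def strip_ignored(raw: bytes) -> str:
    """Drop every byte that is not one of the 95 interchange-set characters."""
    # Codes 32..126 are the space character plus the 94 graphic characters.
    return "".join(chr(b) for b in raw if 32 <= b <= 126)


def wrap_for_transport(script: str, width: int) -> bytes:
    """Break a script into lines of at most `width` characters using CR LF."""
    lines = [script[i:i + width] for i in range(0, len(script), width)]
    return "\r\n".join(lines).encode("ascii")
```

A sender may wrap a script to any line length and a receiver recovers the original character sequence by stripping, since carriage returns and line feeds are treated as if they were not present.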
A minor disadvantage of these conventions is that if a script is typed in, care must be taken not to omit a significant space at the end of a line. Since scripts are normally generated by programs, this is not important. A system for manually generating (and perhaps interactively debugging) Interscript should provide for various convenience features on input, and for prettyprinting the script on output.
Any number of space characters may also be added after any token without changing the meaning. Throughout the following, a delimiter is a space or comma, which may be omitted if the next character is not an alphanumeric, "-" or ".".
VersionId. The first characters of an interchange script conforming to this version of the Interscript standard must be "INTERSCRIPT/INTERCHANGE/1.0 ". Note that the VersionId is of variable length, and ends with a space. These conventions simplify the design of systems that must deal with more than one kind of encoding.
If a privately encoded script can be interpreted as a sequence of characters, its first characters must be "Interscript/private/i.j", where private is replaced by an appropriately chosen hierarchical name that identifies the encoding, e.g., "Xerox/860", and i.j is replaced by an appropriate version identification, e.g., "2.4"; the resulting header would be "Interscript/Xerox/860/2.4".
A private encoding that cannot be interpreted as a sequence of characters (e.g., a binary, word-oriented encoding on a 36-bit machine which packs five 7-bit characters into a word) should use any available convention to make its scripts self-identifying.
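A system that handles several encodings might use these headers to decide how to read a script. The sketch below is an assumption-laden illustration: the function name identify_encoding is hypothetical, the interchange header is taken to be the literal text followed by exactly one space, and the private version i.j is assumed to consist of decimal digits as in the "2.4" example.

```python
import re

# Assumed literal form of the interchange VersionId, including its final space.
INTERCHANGE_HEADER = "INTERSCRIPT/INTERCHANGE/1.0 "
# Private header: "Interscript/" + hierarchical name + "/" + i.j
PRIVATE_HEADER = re.compile(r"Interscript/(\S+?)/([0-9]+\.[0-9]+)")


def identify_encoding(script):
    """Return (encoding name, version) from the header, or None if unrecognized."""
    if script.startswith(INTERCHANGE_HEADER):
        return ("INTERCHANGE", "1.0")
    match = PRIVATE_HEADER.match(script)
    if match:
        return (match.group(1), match.group(2))
    return None
```

For the header "Interscript/Xerox/860/2.4" this yields the hierarchical name "Xerox/860" and the version "2.4".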
Following the VersionId is a span constituting the body of the script, which is in turn followed by the trailer of the script, "ENDSCRIPT". The body of the script contains values encoded as follows.
Integer. An integer is represented in radix 10 notation using the characters "0" through "9" as digits, followed by a delimiter. A negative integer is preceded by a minus sign "-". Thus the decimal number 1234 is encoded as "1234", and -1234 is encoded as "-1234". The trailing delimiter may be empty if the following character is a letter.
A sequence of integer literals in the range 0..255 can be represented in radix 16 notation using the characters "A" through "P" as digits ("A" corresponds to 0, "P" to 15). The entire sequence is enclosed in "#" brackets. For example, the integer 93 is represented as "#FN#", and the sequence of integers 93, 94, 95, 96 as "#FNFOFPGA#". These sequences require only two characters for each integer (plus two characters of overhead). Note that there is no delimiter between the integers in this encoding.
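The radix 16 notation can be sketched as a pair of Python functions. The names HEX_DIGITS, encode_hex_sequence and decode_hex_sequence are hypothetical; the digit assignment ("A" for 0 through "P" for 15) and the "#" brackets follow the description above.

```python
HEX_DIGITS = "ABCDEFGHIJKLMNOP"  # "A" corresponds to 0, "P" to 15


def encode_hex_sequence(values):
    """Encode integers in the range 0..255 as a '#'-bracketed sequence."""
    pairs = []
    for value in values:
        if not 0 <= value <= 255:
            raise ValueError(f"{value} is outside the range 0..255")
        high, low = divmod(value, 16)
        pairs.append(HEX_DIGITS[high] + HEX_DIGITS[low])
    return "#" + "".join(pairs) + "#"


def decode_hex_sequence(text):
    """Decode a '#'-bracketed sequence back into a list of integers."""
    if len(text) < 2 or text[0] != "#" or text[-1] != "#":
        raise ValueError("sequence must be enclosed in '#' brackets")
    body = text[1:-1]
    if len(body) % 2 != 0:
        raise ValueError("each integer needs exactly two characters")
    # str.index raises ValueError for any character outside "A".."P".
    return [
        HEX_DIGITS.index(body[i]) * 16 + HEX_DIGITS.index(body[i + 1])
        for i in range(0, len(body), 2)
    ]
```

With these definitions, 93 is 5*16 + 13, so it encodes as "F" followed by "N", matching the example in the text.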
Boolean. A Boolean is represented by the character "F" or "T", followed by a delimiter.
Real. A real is represented using Fortran E or F notation, with a trailing delimiter. Thus "12.34" is the same as "1.234E1". Minus signs may precede the mantissa or the exponent: "-12.34E-3 ".
Identifier. An identifier is encoded by its characters (which are limited to letters and digits), followed by a delimiter: "x", "arg1". The first character of an identifier must be a letter, and must be written in lower case to distinguish identifiers from universals. Other letters may be written in either case for readability, since case is not significant in distinguishing identifiers.
Vector. A vector is encoded by surrounding a sequence of values with parentheses, "(" and ")".
String. A text vector usually contains integers that are interpreted as character codes. Often these codes lie in the range 32 to 126 inclusive, which are the numbers assigned to the characters of the interchange set by ISO 646. It is convenient to encode an element of such a vector by the character whose ISO code is the desired value. Such a string can be encoded by surrounding the characters with "<" and ">", thus "<Hello!>". If the string contains elements outside the allowed range (i.e., if the value is less than 32 or greater than 126) or the value 62 or 35 (the ISO codes for the characters ">" and "#"), those elements must be represented as integers inside "#" brackets, as described above. The two-character encoding of small integers is designed to make escape sequences compact. Thus "<Hello!>", "<Hello#CB#>", and "<Hel#GMGP#!>" are all equivalent.
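The string encoding, including its escape sequences, can be sketched as below. The function names encode_string and decode_string are hypothetical, and the encoder shown makes one particular choice (escape only what must be escaped, grouping adjacent escaped elements in one pair of "#" brackets); other equivalent encodings are equally valid.

```python
HEX_DIGITS = "ABCDEFGHIJKLMNOP"  # "A" corresponds to 0, "P" to 15


def encode_string(codes):
    """Encode a vector of character codes (0..255) as a '<...>' string."""
    parts = ["<"]
    in_escape = False
    for code in codes:
        # Codes 35 ("#") and 62 (">") must be escaped even though printable.
        if 32 <= code <= 126 and code not in (35, 62):
            if in_escape:
                parts.append("#")
                in_escape = False
            parts.append(chr(code))
        else:
            if not 0 <= code <= 255:
                raise ValueError(f"{code} is outside the range 0..255")
            if not in_escape:
                parts.append("#")
                in_escape = True
            high, low = divmod(code, 16)
            parts.append(HEX_DIGITS[high] + HEX_DIGITS[low])
    if in_escape:
        parts.append("#")
    parts.append(">")
    return "".join(parts)


def decode_string(text):
    """Decode a '<...>' string into a list of character codes."""
    if len(text) < 2 or text[0] != "<" or text[-1] != ">":
        raise ValueError("string must be enclosed in '<' and '>'")
    body = text[1:-1]
    codes = []
    i = 0
    while i < len(body):
        if body[i] == "#":
            end = body.index("#", i + 1)  # ValueError if the bracket is unclosed
            digits = body[i + 1:end]
            if len(digits) % 2 != 0:
                raise ValueError("each escaped element needs two characters")
            for j in range(0, len(digits), 2):
                codes.append(HEX_DIGITS.index(digits[j]) * 16
                             + HEX_DIGITS.index(digits[j + 1]))
            i = end + 1
        elif body[i] == ">":
            raise ValueError("'>' must be escaped inside a string")
        else:
            codes.append(ord(body[i]))
            i += 1
    return codes
```

Decoding the three equivalent strings from the text gives the same vector of codes in each case.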
Universal names. A universal is encoded by giving a name that begins with an uppercase letter followed by zero or more uppercase letters or digits, followed by a delimiter. E.g., "TEXT", "XEROX860 ".
Span. A span is encoded by a "{", followed by a sequence of items, followed by a "}".
Comment. The beginning and end of a comment are both marked by a double minus sign: the sequence "--" <any characters other than "--"> "--" is a comment and may occur between any two tokens. Comments are ignored in rendering the script.
The tokens of the interchange encoding are defined by the following BNF grammar, together with these rules about delimiters:
The delimiter that terminates an identifier or universal may only be empty if the next character is not an alphanumeric or "-".
The delimiter that terminates an integer may only be empty if the next character is not a digit, "E", "F", "-", or ".".
Extra delimiters may be inserted after any token.
token ::= literal | id | ucID | op | bracket | punctuation | comment
literal ::= Boolean | integer | real | string
Boolean ::= ( "F" | "T" ) delimiter
delimiter ::= " " | "," | empty
empty ::= ""
integer ::= [ "-" ] digit digit* delimiter
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
real ::= [ "-" ] digit digit* "." digit* [ "E" integer ] delimiter
string ::= "<" stringElem* ">"
stringElem ::= stringChar | hexSequence
stringChar ::= -- any character but "#" or ">" --
hexSequence ::= "#" hex* "#"
hex ::= hexChar hexChar
id ::= lowerCase idChar* delimiter
idChar ::= letter | digit
letter ::= lowerCase | upperCase
lowerCase ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" |
"o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upperCase ::= hexChar | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
hexChar ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" |
"N" | "O" | "P"
ucID ::= upperCase ucIDchar* delimiter
ucIDchar ::= upperCase | digit
op ::= "+" | "-" | "*" | "/"
bracket ::= "(" | ")" | "{" | "}" | "<" | ">" | "[" | "]" | "'"
punctuation ::= "." | ";" | ":" | "=" | "←" | "!" | "%" | "|"
comment ::= "--" commentString "--"
commentString ::= -- any sequence of characters not containing "--" --
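The grammar above can be approximated by a small regular-expression tokenizer. This is a rough sketch rather than a conforming implementation: the names TOKEN_SPEC, TOKEN_RE and tokenize are hypothetical, the minus sign is assumed to be the ASCII hyphen-minus, a "-" directly before a digit is always read as a sign, and the delimiter rules are simplified to skipping runs of spaces and commas.

```python
import re

# Order matters: comments before operators, reals before integers.
TOKEN_SPEC = [
    ("comment", r"--.*?--"),
    ("string", r"<[^>]*>"),
    ("real", r"-?[0-9]+\.[0-9]*(?:E-?[0-9]+)?"),
    ("integer", r"-?[0-9]+"),
    ("id", r"[a-z][A-Za-z0-9]*"),
    ("ucID", r"[A-Z][A-Z0-9]*"),
    ("op", r"[+\-*/]"),
    ("bracket", r"[(){}\[\]']"),
    ("punctuation", r"[.;:=←!%|]"),
    ("delimiter", r"[ ,]+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(text):
    """Split an interchange-encoded body into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, value = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("delimiter", "comment"):
            continue  # delimiters separate tokens; comments are ignored
        if kind == "ucID" and value in ("F", "T"):
            kind = "Boolean"  # "F" and "T" alone are the Boolean literals
        tokens.append((kind, value))
    return tokens
```

Strings are kept as single tokens, including any "#" escape sequences inside them, so a comment marker appearing within a string is not mistaken for a comment.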
A simple listing of an interchange script can just print the character sequence, with line breaks every n characters, or perhaps at the nearest convenient delimiter. Such a listing is reasonably easy to read, so that problems can be tracked down simply by studying it. Additional help in reading the file can be furnished by utility programs which format the file for more pleasant reading.