Towards an Interchange Standard for Editable Documents
by Jim Mitchell and Jim Horning
Revision 1/May 4, 1982
The Interdoc standard will define a digital representation of editable documents for exchange among different editing systems. An Interdoc script can be transmitted from one editor to another over a network, or can be stored for later editing. A script in Interdoc is not limited to any particular editor: if a script contains editable information some of which is not understandable by a particular editor, it is still possible to edit the parts of the document understood by that editor without losing or invalidating the parts it does not understand.
This document is a draft of a proposal for the technical content of the Interdoc standard. It defines and explains the proposed standard, gives examples of its use, explains how to transcribe from an editor’s private format into scripts, and how to render scripts into an editor’s private format. It also indicates a number of issues that must still be resolved to establish a practical standard.
The standard provides for documents with
a dominant hierarchical structure (e.g., book/chapter/section/paragraph . . .) while also providing for documents needing a more general structure than a tree (e.g., for graphics or cross-references in a textual document),
formatting information (e.g., margins, fonts, line widths, etc.),
definitional structure (such as styles or property sheets), and
intermixed kinds of editable information (e.g., text with embedded graphics).
This draft deals primarily with the contents of Layers 0 and 1 (the base language) of the proposed standard.
Contents
1. Introduction
2. The Language Basis: Syntax and Semantics
3. Higher-Level Issues
4. Pragmatics
Appendix A: Glossary
1. Introduction
Interdoc provides a means of representing editable documents that is independent of any particular editor and can therefore be used to interchange documents among editors.
The basis of Interdoc is a language for expressing editable documents, or scripts. Scripts are created by computer programs (usually an editor or associated program); scripts are "compiled" by programs to produce whatever private or file format a particular editor uses to represent editable documents.
1.1. Rationale for an interchange standard
An editing program generally uses a private, highly-encoded representation for documents in order to meet its performance and functionality goals. Generally, this means that different editors use different, incompatible private formats, and the user can conveniently edit a document only with the editor used to create it. This problem can be solved by providing programs to convert between one editor’s private (or file) format and another’s. However, a set of different editors with N different document representations requires N(N-1) conversion routines to be able to convert a document directly from each format to every other.
This N(N-1) problem can be reduced to 2(N-1) by noticing that we could write N-1 conversion routines to go from F1 (the format for editor1) to F2, . . ., FN, and another N-1 routines to convert from F2, . . ., FN to F1. Except when converting from or to F1, this scheme requires two conversions to go from Fi to Fj (i ≠ j); this is a minor drawback. Choosing which editor should be editor1 is a more critical issue, however, since the capabilities of that editor will determine how general a class of documents can be interchanged among the N editors.
This presents a truly difficult problem in the case that there is no single functionally dominant editor. If the pivotal editor1 doesn’t incorporate some of the structures, formats, or content types used by others, then it will not be possible to faithfully convert documents containing them. Even if we had a single editor that was functionally dominant, it would place an upper bound on the functionality of all future compatible editors. Since there are no actual candidates for a totally dominant editor, we have chosen instead to examine in general what information editors need and how that information can be organized to represent general documents.
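The routine counts given earlier in this section can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function names are ours and are not part of the proposal.

```python
def direct_converters(n: int) -> int:
    """Routines needed to convert directly from each of n formats to every other."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """Routines needed when every conversion passes through one pivotal format F1:
    N-1 routines from F1 to the others, and N-1 routines back to F1."""
    return 2 * (n - 1)


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, direct_converters(n), pivot_converters(n))
```

For five editors, for instance, direct conversion needs 20 routines while the pivotal scheme needs 8; the saving grows with N, at the price of the dependence on editor1 discussed above.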
Since we are not proposing an editor, we do not need to design a private format for its documents; we only need an external representation that is capable of conveying the structure, form, and content of editable documents. That external representation has only one purpose: to enable the interchange of documents among different editors. It must be easy to convert between real editors’ formats and this interchange encoding.
Using a standard interchange encoding has the additional advantage that many of the input and output conversion algorithms will be common to all conforming editors. In fact, when adding a new version of a previous editor, the only differences in the new version’s conversion routines will be in the areas in which its format has changed from its previous form; this represents a significant saving of programming. Finally, no special routines or human procedures would be needed to upgrade documents to a new version of an editor, since each conforming editor will be capable of understanding and producing the interchange representation anyway.
1.2. Properties that any interchange standard must have
An interchange encoding for editable documents must satisfy a number of constraints. Among these are the following:
1.2.1. Universal character set
Scripts must be encoded using the graphic (printable) subset of the ISO 646 character set. Besides the obvious rationale that these characters are guaranteed not to have control significance to any device meeting the ISO standard, this choice has the additional advantage that a script is human-readable.
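A conformance check for this constraint might look like the sketch below. It assumes that the permitted set is the code range 0x20 through 0x7E (SPACE plus the 94 graphic characters of ISO 646); whether SPACE or any line-break convention is admitted is an assumption of this sketch, not something this draft settles. The function name is hypothetical.

```python
def uses_only_graphic_characters(script: str) -> bool:
    """Return True if every character lies in the assumed printable range.

    Assumption: the allowed set is 0x20 (SPACE) through 0x7E ('~').
    Control characters, DELETE (0x7F), and anything outside the
    7-bit range are rejected.
    """
    return all(0x20 <= ord(ch) <= 0x7E for ch in script)
```

Under this assumption a string containing a tab, a line feed, or an accented letter would be rejected.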
1.2.2. Encoding efficiency
Since editable documents may be stored as scripts, may be transmitted over a network, and must certainly be processed to convert them to various editors’ private formats, it is important that the encoding be reasonably space-efficient.
Similarly, the time cost of converting between interchange encoding and private formats must be reasonably low since it will have a significant effect on how useful the interchange standard is. (If the overheads were small enough, an editor might not even use a private file format.)
1.2.3. Open-ended representation
Scripts must be capable of describing virtually all editable documents, including those containing formatted text, synthetic graphics, scanned images, etc., and mixtures of these various modes. Nor may the standard foreclose future options for documents that exploit additional media (e.g., audio) or require rich structures (e.g., VLSI circuit diagrams, database views). For the same reasons, the standard must not be tied to particular hardware or to a file format: documents will be stored and transmitted using a variety of media; it would be folly to tie the representation to any particular medium.
1.2.4. Document structure
Many documents have hierarchical structure; e.g., a book is made of chapters containing sections, each of which is a sequence of paragraphs; a figure is embedded in a frame on a page and in turn contains a textual caption and embedded graphics; and the description of an integrated circuit has levels corresponding to modular or repeated subcircuits. The standard should exploit such structure, without imposing any particular hierarchy on all documents.
Hierarchy is not sufficient, however. Parts of documents must often be related in other ways; e.g., graphics components must often be related geometrically, which may defy hierarchical structuring, and it must be possible to indicate a reference from some part of a document to a figure, footnote, or section in a way that cuts across the dominant hierarchy of the document (section 1.6.4).
Documents often contain structure in the form of indirection. For instance, a set of paragraphs may all have a common "style," which must be referred to indirectly so that changing the style once is sufficient to change the characteristics of all the paragraphs using it. Or a document may be incorporated "by reference" as a part of more than one document and may need to "inherit" many of its properties from the document into which it is being incorporated at a given time.
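The effect of referring to a style indirectly can be sketched as follows. This is an illustration in Python, not Interdoc syntax; the class names, the attribute names (font, left_margin), and the style name "body" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Style:
    font: str
    left_margin: float  # hypothetical attribute, in inches


@dataclass
class Paragraph:
    text: str
    style_name: str  # an indirect reference: a name, not a copy of the attributes


def effective_font(paragraph: Paragraph, styles: dict) -> str:
    """Resolve the paragraph's font through its named style at the time of use."""
    return styles[paragraph.style_name].font


styles = {"body": Style(font="Times", left_margin=1.0)}
paragraphs = [Paragraph("First.", "body"), Paragraph("Second.", "body")]

# Changing the style once changes every paragraph that refers to it.
styles["body"].font = "Helvetica"
fonts_after_change = [effective_font(p, styles) for p in paragraphs]
```

Had each paragraph carried its own copy of the font attribute, the same change would have required touching every paragraph individually.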
1.2.5. Document form and content
The complete description of a document component usually requires more than an enumeration of its explicit contents; e.g., paragraphs have margins, leading between lines, default fonts, etc. Scripts must record the association between attributes and pieces of content.
The contents of a document must be represented by a rich space containing scalar numbers, strings, vectors, and record-like constructs in order to describe items as varied as distances, text, coefficients of curves, graphics constraints, digital audio, scanned images, transistors, etc.
Attribute values should also be described in this rich value space.
1.2.6. Transcription fidelity
It must be possible to convert any document from any editor’s private format to a script and reconvert it back to the same editor’s private format with no observable effect on the document’s form, structure, or content. This characteristic is called transcription fidelity, and is a sine qua non for an interchange encoding; if it is not possible to accomplish this, the interchange encoding or the conversion routines (or both) must be defective.
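Transcription fidelity can be stated as a round-trip check that any pair of conversion routines ought to pass. The sketch below uses a Python dictionary as a stand-in for an editor's private format and JSON text as a stand-in for the script encoding; neither is part of the proposal, and the function names are ours.

```python
import json


def transcribe(private_doc: dict) -> str:
    """Stand-in for private format -> script (here: ASCII-only JSON text)."""
    return json.dumps(private_doc, sort_keys=True, ensure_ascii=True)


def render(script: str) -> dict:
    """Stand-in for script -> private format."""
    return json.loads(script)


def has_transcription_fidelity(private_doc: dict) -> bool:
    """True if transcribing and then rendering yields an identical document."""
    return render(transcribe(private_doc)) == private_doc


sample = {
    "kind": "paragraph",
    "margins": {"left": 1.0, "right": 1.0},
    "text": "Even complicated documents have simple pieces.",
}
```

A failure of such a check for any document the editor can produce would indicate a defect in the encoding, in the conversion routines, or in both.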
1.2.7. Rendition fidelity
Even complicated documents have simple pieces. A simple editor should be able to display document components that it understands, even in the presence of components that it does not. More precisely, an editor must, in the course of processing a script produced by a different editor, be able to discover all the information necessary to render the document in its own representation and to display the parts that it understands. This must work despite the fact that different editors may well use different data structures to render the structure, form, and content of a document.
At a minimum, this requires that a script contain information by which an editor can easily determine whether or not it understands a component well enough to render it (or parts of it) and that it be able to interpret the effect that components that it does not understand have on the ones it does. For example, if an editor does not understand figures, it should still be possible for it to display their embedded textual captions correctly, even though a figure might well dictate some of its caption’s content or attributes such as margins, font, etc.
This constraint requires that an interchange encoding have a simple syntax and semantics that can be interpreted readily, even by low-capability editors. Along with the desire for open-endedness (section 1.2.3), this suggests a language with some form of "extension by definition" built around a small core.
1.2.8. Regeneration
Processing a script to render it faithfully is only half the problem. It is equally important that an editor, in generating an Interdoc script from its private document format, be able to regenerate the structure, form, and content carried by the script from which the document came.
This problem is much less severe when an editor is transcribing a document that it "understands" completely, e.g., because the entire document was generated using that editor. However, when regenerating a script from an edited document, it should be possible to retain the structure in parts of the original script that were not affected by editing operations. For example, an editor that understands text but not figures should be able to edit the text in a document (although editing a caption may be unsafe without understanding figures) while faithfully retaining and then regenerating the figures when transcribing from its private format.
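One way an editor might meet this requirement is to carry components it does not understand through its private format as opaque subtrees. The sketch below is an illustration only: the Node class, the kind names, and the set of understood kinds are hypothetical, and nothing here is Interdoc syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # e.g. "paragraph" or "figure"; names are illustrative
    content: str = ""
    children: list = field(default_factory=list)


# Kinds that this hypothetical text-only editor understands.
UNDERSTOOD = {"document", "paragraph", "text"}


def edit_text(node: Node, transform) -> Node:
    """Apply transform to the text this editor understands.

    A node of an unknown kind is returned as the very same object, so its
    whole subtree (including any caption inside a figure) is retained
    untouched and can be regenerated exactly when transcribing.
    """
    if node.kind not in UNDERSTOOD:
        return node
    new_content = transform(node.content) if node.kind == "text" else node.content
    return Node(node.kind, new_content, [edit_text(c, transform) for c in node.children])


original = Node("document", children=[
    Node("paragraph", children=[Node("text", "hello")]),
    Node("figure", children=[Node("text", "caption")]),
])
edited = edit_text(original, str.upper)
```

Here the paragraph text is edited while the figure, caption included, is passed through unchanged, which reflects the caution above that editing a caption may be unsafe without understanding figures.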
1.3. What the Interdoc standard does not do
There are a number of issues that the Interdoc standard specifically does not discuss. Each of these issues is important in its own right, but is separable from the design of an interchange representation.
1.3.1. Interdoc is not a file format
The interchange encoding of an Interdoc script is a sequence of ASCII/ISO 646 characters. The standard is not concerned with how that representation is held in files on various media (floppy disks, hard disks, tapes, etc.), or with how it is transmitted over communications media (Ethernet, telephone lines, etc.).
1.3.2. Interdoc is not a standard for editing
An Interdoc script is not intended as a directly editable representation. It is not part of its function to make editing of various constructs easier, more efficient, or more compact: those are the purview of editors and their associated private document formats. An Interdoc script is intended to be rendered into the editor’s private format before being edited. This rendition might be done by the editor, by a utility program on the editing workstation, or by a completely separate service.
1.3.3. Combining documents is not an interchange function
This exclusion is really a corollary of the statement, "An Interdoc script is not intended as a directly editable representation." It is no easier to "glue" two arbitrary documents together than it is to edit them.
1.3.4. Interdoc does not overlap with other standards
There are a number of standards issues that are closely related to the representation of editable documents, but which are not part of the Interdoc standard because they are also closely related to other standards. For example, the issues of specifying encodings for characters in documents, how fonts should be named or described, or how the printing of documents should be specified (i.e., Interpress) are not part of this work.
HISTORY LOG
Edited by Mitchell, September 1, 1981 3:12 PM, added first version of glossary
Edited by Mitchell, September 7, 1981 2:11 PM, wrote parts of introduction
Edited by Mitchell, September 10, 1981 10:14 AM, added Tab def to Star property sheets
Edited by Mitchell, September 14, 1981 9:54 AM, renumbered chapters and did minor edits
Edited by Mitchell, September 16, 1981 8:42 AM, folding in comments from JJH’s review and added sections on rendition and transcription fidelity
Edited by Mitchell, September 18, 1981 1:56 PM, folded in comments from JJH’s review
Edited by Horning, May 3, 1982 6:02 PM, Folded in comments from Truth copy