SIGGRAPH '87 TUTORIAL COURSE NOTES DOCUMENTATION GRAPHICS

Trends and Standards in Document Representation

[Republished from Text Processing and Document Manipulation, Copyright 1986, Cambridge University Press]

Vania Joloboff
Bull Research Center
BP 68-38402 Saint Martin d'Heres Cedex, France

ABSTRACT: This paper starts by tracing the architecture of document preparation systems. Two basic types of document representation appear: at the page level and at the logical level. The paper then focuses on logical level representations and surveys three existing formalisms: SGML, Interscript and ODA.

1. Introduction

Document preparation systems may now be the most commonly used computer systems, ranging from stand-alone text processing machines to highly sophisticated systems running on mainframe computers. All of these systems internally use a more or less formal system for representing documents. Document representation formalisms differ widely according to their goals. Some define the interface with the printing device; they are oriented towards a precise geometric description of the contents of each page in a document. Others are used internally in systems as a memory representation. Yet others have to be learned by users; they are symbolic languages used to control document processing.

The trouble is that there are today nearly as many representation formalisms as document preparation systems. This makes it nearly impossible, first, to interchange documents among heterogeneous systems and, second, to have standard programming interfaces for developing systems. Standardization organizations and large companies are now trying to establish standards in the field in order to stop the proliferation of formalisms and to facilitate document interchange.
This paper focuses in its last sections on three document representation formalisms often called 'revisable formats', namely SGML [SGML], ODA [ODA], and Interscript [Ayers & al.], [Joloboff & al.]. In order to better understand what a revisable format is, the paper starts with a look at the evolution of the architecture of document preparation systems.

2. Architecture of Document Preparation Systems

Document preparation systems appeared as soon as computer printing devices were able to output documents of typewriter-like quality. Although the evolution of printing technology has been the major factor, several others have influenced the architecture of document preparation systems: low cost computing power, distributed systems, and the simple maturation of ideas in the field. The evolution of printing technology has led to the digital representation of documents ready to be printed, called final form representation. The evolution of software techniques has principally led to representations capturing the logical structure, the structure that is perceived by the author when the document is revised, i.e. constructed or modified.

2.1 Final form representation

On early document preparation systems, printing devices were basically typewriter-like terminals directly connected in character mode to the single processing computer. Those devices were driven by sequences of control characters inserted in the data stream they received in order to produce layout rendition (underlining, overstriking). A formatting system basically had to translate the formatting commands into printer control sequences. As printers from different vendors had different control sequences, device independent formats were needed in order to print the same document at different sites with different printers. Thus final form representation appeared, that is, the final digital representation of a document before it is printed.
The main property of a final form representation is that the number of pages in the document has been computed. The way each object (character string or graphics) should appear on the page is totally determined.

On non-impact printers, virtually any image is reproducible: characters in any alphabet, graphics and images as well. There is no longer a specific set of imaging functions made available by the hardware. The limit to the expressive power of the page creator is then set by the software interface. This fundamental change brought by technology has implied a fundamental change in the design of final form representations for non-impact printing. A final form representation is no longer a sequence of characters; it has to be an organized structure. A formal method must be used to describe the page layout, offering maximum expressiveness to the page creator. Such formalisms theoretically allow for the description of any page for any printer. They divide into static formats and the more recent dynamic ones. In a static format the page layout is described as a static data structure. The standard CCITT T73 [T73] is a typical example of such formats. In dynamic formats, also referred to as procedural page description languages, a page description actually describes how to compute the layout. Brian Reid's paper [Reid86] in this very conference discusses procedural page description languages, such as PostScript [PostScript], more extensively.

The point we want to emphasize now is that the architecture (figure 1) of document preparation systems now has a clean interface with printing devices. It generates a final form representation of documents in terms of a structured page description formalism.

2.2 Revisable form representation

A document has to undergo many additions or modifications before it is ready to be printed. Working on a page based representation when editing a document would be tedious and cumbersome both for users and for the editing system.
An unformatted representation of documents is necessary. This representation typically is the output of the editing system and the input of the formatting system. Figure 1 shows the three basic components of a document preparation system: editing, formatting and printing. Revisable form and final form are the two representations interfacing these components.

=== Figure 1. Typical architecture of a document preparation system. ===

The first document preparation systems naturally imitated the method used in the publishing industry for typesetting: additional information is interspersed among the document contents to produce a data stream directly processed by the typesetting device. On those early systems the revisable form representation simply consists of a text file containing control sequences, directly keyed in by the user from a standard terminal. Control sequences consist of a series of markup signs. That was the beginning of so-called procedural markup languages, since those markup signs were interpreted as instructions controlling subsequent processing in the formatting system.

Procedural markup has well known drawbacks:

- The logical structure of a document is hardly apparent once the document is marked up. For example, if chapter titles have been marked with a centering command, it is not clear that what follows a centering command is a title. If someone later wants to flush all titles right, changing all centering commands into flush commands will probably not give the expected result.

- The style of the resulting documents, i.e. the appearance of the document layout, is determined by the user who placed the markup signs. A good layout style, if there is to be any style at all, requires some typographic knowledge from the user. The lack of this knowledge is responsible for all of the ugly documents produced on procedural markup systems... Also, it makes it difficult to output the same document in a different style.
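
The first drawback can be made concrete with a small sketch. It assumes a hypothetical procedural markup in which a line reading .ce centers the following line and a line reading .fr flushes it right; these command names, the line width and the sample text are illustrative assumptions and are not taken from any particular system. Because the markup records only the processing to apply, a global change from centering to flushing also moves a centered line that is not a title.

```python
# Hypothetical procedural markup: ".ce" centers the next line,
# ".fr" flushes it right. Both names are assumptions for illustration.
WIDTH = 30

SOURCE = [
    ".ce",
    "Chapter 1",        # a chapter title
    "Some body text.",
    ".ce",
    "* * *",            # a centered ornament, not a title
]


def render(lines, width=WIDTH):
    """Format lines, applying each command to the line that follows it."""
    out, pending = [], None
    for line in lines:
        if line in (".ce", ".fr"):
            pending = line
            continue
        if pending == ".ce":
            out.append(line.center(width))
        elif pending == ".fr":
            out.append(line.rjust(width))
        else:
            out.append(line)
        pending = None
    return out


def flush_titles_right(lines):
    """Naive global edit: turn every centering command into a flush command."""
    return [".fr" if line == ".ce" else line for line in lines]


if __name__ == "__main__":
    for text in render(flush_titles_right(SOURCE)):
        print(repr(text))
```

In this sketch the ornament is flushed right together with the title, since nothing in the source distinguishes one centered line from the other.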
The disadvantages of procedural markup have been avoided with a new method, known as declarative markup. The standpoint in declarative markup is that the user should describe the logical structure of a document: what is to be processed rather than how the document content is to be processed. A user enters markup signs indicating logical properties of data, for example paragraph or heading, expressing its logical structure, which sounds more familiar and does not imply any particular processing. The responsibility for making consistent styles, or applying specific functions, is left to the system. GML [Goldfarb] and Scribe [Reid83] are two examples of declarative markup systems; the reader is referred to [Furuta & al.] for an extensive survey of such formatting systems.

The SGML formalism is essentially the definition of an international standard by ISO covering these systems. Although an SGML entity may refer to non-character data, as shown in the next section, SGML has been designed in the spirit of all markup systems. As the standard says (page 3): ``The millions of existing text entry devices must be supported. SGML documents can easily be keyboarded and understood by humans.'' A user does not need a specific editor to build a marked up document. As long as there are only characters, any editor will do on any standard terminal. The revisable form representation of a document in a markup system, be it declarative or procedural, is (or should be) fully known to users, since they have to key it in...

More recent approaches have a different viewpoint. They assume the revisable form representation is not directly accessed by users, but solely by the editing system. Thus a specific editor is needed, which generates that representation. It is intended that such editors will not expose users to the revisable representation; they will actually hide the internal representation of documents from the user, constructing this representation themselves from the user input.
These editors are expected to provide a friendlier user interface. Most of the editors of this new generation, for example Grif, presented in this conference [Quint & Vatton], do not run on standard terminals. Instead, they use bitmap display terminals, a window system and a pointing device. The new type of document representation used in this approach may then be designed to be quite complex, nearly unmanageable by human beings, but very suitable for handling by computers. Graphics and images may be directly inserted in documents more easily than with markup formats. Graphics may rely on existing standard graphics representations, and images may be stored through specific data compression techniques, while the user only sees a real layout on the screen. Interscript and ODA both belong to this new generation of formalisms. They assume more computing power from the editing system and lose the possibility of being directly entered from a standard terminal, but promise many more possibilities.

3. Generalized Markup Language

SGML stands for Standard Generalized Markup Language. It is essentially a declarative markup language, which has inherited mainly from its ancestor GML. However, it includes many interesting new features.

A first difference from its predecessors is that markup is defined rigorously. It is possible from the SGML standard definition to build a general syntactic parser that will not give rise to ambiguities. Thanks to this rigorous syntax, SGML documents may be processed very much like programs by a compiler. A document may be parsed to build an abstract syntax tree together with its attributes. The semantics of that tree may be evaluated by semantic functions according to the attribute values. Thus, SGML can be used for tasks other than formatting. The semantics of markup tags and attributes might be used for machine translation, automatic indexing or any other process needing parsing of documents.

A markup sign in SGML is named a tag.
Any element which needs to be tagged starts with a start-tag and ends with an end-tag. Any tag is delimited by the characters < and >. A tag is defined by an identifier, which appears first in the start-tag. An end-tag repeats the same identifier preceded by /. Note that all of these mark characters are redefinable for each document. End tags may also be omitted under conditions specified in the standard. For example, a paragraph will appear as:
This is a short paragraph.
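
The tag syntax just described can be sketched as a small tokenizer. This is only an illustration under stated assumptions: it uses the default delimiters <, > and /, a hypothetical element identifier p, and it ignores attributes, delimiter redefinition and end-tag omission, all of which a real SGML parser must handle.

```python
import re

# Matches "<name>" or "</name>" using the default delimiters.
# Attributes, redefined delimiters and omitted end-tags are NOT handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def tokenize(text):
    """Split marked-up text into ('start', id), ('end', id), ('data', str)."""
    tokens, pos = [], 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens


def is_balanced(tokens):
    """Check that each start-tag is closed by an end-tag with the same identifier."""
    stack = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            if not stack or stack.pop() != value:
                return False
    return not stack


# "p" is a hypothetical identifier chosen for this example.
SAMPLE = "<p>This is a short paragraph.</p>"

if __name__ == "__main__":
    print(tokenize(SAMPLE))
    print(is_balanced(tokenize(SAMPLE)))
```

The stack in is_balanced reflects the fact that elements nest: an end-tag must repeat the identifier of the most recently opened element.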
A drawback of usual declarative markup systems is that one is forced to use the catalog of markup tags offered by the system. Since markup tags express the logical structure of documents, this means one cannot define the logical structure in terms other than the general tags set up once and for all by the system. A property of SGML is that tags are themselves described through a formal language: the SGML meta-language, which may be used within SGML documents to dynamically define new symbols. The syntax to introduce a meta-language construct simply follows < by !.

The SGML meta-language allows for the definition of complex constructs, named elements. An element declaration defines a class of objects, i.e. an element type. Subsequent objects in the document may be tagged with the element name. Elements may have a hierarchical structure, and each element in the hierarchy may have its own attributes. Element types may be used to facilitate the interactive creation of documents, to control the validity of a document structure, or to associate a layout style with a particular document type. For example, one might define a document type for a conference paper as follows:

This document type declaration specifies that a paper has a title, an abstract, and a body. The title consists of characters, the abstract is one paragraph and the body one or more paragraphs. A paper has a language attribute to indicate in which language it is written. More complex combinations can be designed to define document types that have some commonality.

The facility to define new elements causes trouble when laying out those elements, because the formatting system then does not know how to format such constructs. SGML provides two ways of handling that situation. The first one is naturally to add to the SGML system a procedure to take care of the new tags.
This requires a good knowledge of the system and prevents further interchange of documents containing such tags with systems that do not have this procedure. The second one is to use a LINK tag. A LINK tag tells the system that a construct should be handled as another one, presumably known to the system, with possible modifications of attributes. For example, if one says , it means an abstract has to be formatted like a paragraph, however using a different indentation value.

It is often required in a document to be able to refer to other parts of the document. Some binding mechanism is needed in the formalism to attach a value to some identifier, which resembles programming language variables. Binding is achieved in SGML through entity declarations and entity references. An entity (a value, a character string or any valid SGML constituent) may be bound to a name by the notation . From then on, that entity may be referenced by its name either to set an attribute value, or to be included in the running text.

Entities also provide a means to handle non-character data. An external entity is declared . It is then known that this entity is not in the document stream. The processing system will find in the system information how to access that content. If the document is to be interchanged among different computers with different operating systems, this system information is specific to each system. SGML provides an IGNORE/INCLUDE mechanism for that purpose. Information relative to some particular system, let us say osx, has to be encoded within the magic declaration ]]>. Then a user only needs to turn a switch at the beginning of the document to the local system for the document to be processed correctly.

4. Interscript

We mentioned previously that Interscript is a representation formalism of a new generation.
Interscript, which was originally designed at Xerox PARC, starts from the idea that a document representation should be suited to processing by computers, not by the humans who manipulate documents. Such things as traversing trees, evaluating expressions, and searching for values of variables within contexts are among what computers can easily do. Thus, a fundamental notion in Interscript is to rely on a formal language to describe document constructs: not only a document's logical structure, but all formal constructs that could be necessary in a document representation. These abstract constructs may be data structures such as paragraphs, fonts, geometric shapes, but may also represent computations, like setting a context or evaluating expressions within some context. The Interscript approach is very much like the approach used in software engineering: general programming languages are used by people to build abstract constructs and procedures to solve their particular problem. A document representation problem should be solved using a document representation language.

The Interscript base language is simple (around 25 grammar rules) and powerful. Its semantics are well defined but its syntax rapidly leads to documents that cannot be managed by humans. A document encoded in the Interscript base language is called a script. A script is very much like a program. The processing paradigm (figure 2) is that a script should first be internalized by a system. Internalizing a script implies the execution of computations, which are dictated only by the semantics of the Interscript base language, and results in the construction of another representation available to the client process. This simply means that one translates a standard disk representation into a non-standard memory representation, while performing computations.
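
One kind of computation performed while internalizing can be sketched as follows. This is a minimal sketch, assuming that bindings are stored in nested contexts and that a name not bound locally is looked up in the enclosing context. The Context class, the numeric values, and the use of Python callables to stand for expressions are illustrative assumptions; they are not Interscript syntax, and the actual semantics are defined by the Interscript base language.

```python
class Context:
    """A set of name bindings with an optional enclosing (parent) context."""

    def __init__(self, parent=None):
        self.parent = parent
        self.bindings = {}

    def bind(self, name, value):
        """Bind a name to a constant or to an expression (a callable)."""
        self.bindings[name] = value

    def lookup(self, name):
        """Return the value bound to name, searching enclosing contexts outward."""
        context = self
        while context is not None:
            if name in context.bindings:
                value = context.bindings[name]
                # Assumption of this sketch: an expression is evaluated
                # in the context where the lookup started.
                return value(self) if callable(value) else value
            context = context.parent
        raise KeyError(name)


# Hypothetical document-level bindings.
document = Context()
document.bind("leftmargin", 10)
document.bind("linelength", 60)
document.bind(
    "rightmargin",
    lambda c: c.lookup("leftmargin") + c.lookup("linelength"),
)

# A nested context that overrides one binding locally.
paragraph = Context(parent=document)
paragraph.bind("leftmargin", 15)

if __name__ == "__main__":
    print(document.lookup("rightmargin"))
    print(paragraph.lookup("rightmargin"))
```

Evaluating an expression in the context where the lookup started is a design choice of this sketch; it lets a nested context override one name and obtain a different result for an expression bound further up the hierarchy.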
Computations are necessary in the internalizing process because the base language includes a binding mechanism and the evaluation of expressions within hierarchical contexts. For example, evaluating the expression:

rightmargin = leftmargin + linelength

requires obtaining the values bound to the variable names.

=== Figure 2. Interscript processing model. ===

We will not enter in this paper into the details of the internalizing process, which looks like the evaluation of any interpreted programming language, in order to focus on the central concepts of node and tag. A script is a hierarchy of nodes. Nodes have contents and tags. The authors have compared an Interscript node to a bottle of wine. The contents of the bottle are qualified by several tags on the bottle: a price tag, a product number tag. Interscript tags similarly qualify the node contents. To some extent an Interscript tag is similar to an SGML tag: it introduces an element, it has attributes, it denotes structural properties of the contents. The difference is that, first, an Interscript node may have several simultaneous tags and, second, attributes of a tag may be bound to an expression which must be evaluated. For example, a figure caption could be affixed with both a CAPTION and a PARAGRAPH tag. The paragraph tag says that the caption text has to be laid out as a paragraph; the caption tag restricts the placement of that paragraph relative to the figure picture. The leftmargin attribute of the paragraph might be set equal to the margin of some object X. Then the node hierarchy is searched for that X.

Interscript syntax denotes nodes between curly braces. Tags are character strings followed by a dollar sign. A typical node is:

{ PARAGRAPH$ PARAGRAPH.leftmargin = 10 {CHARS$