1 INTRODUCTION
====================

1.1 Overview

This thesis addresses the problem of formatting complex documents using electronic document composition tools. In particular, the two problems of incorporating high-quality illustrations into documents and of laying out tabular information will be treated in depth. The research reported in this thesis concentrates on techniques and tools for producing electronic documents that will meet the graphic arts standards of quality, both in electronic form and in subsequent hardcopy form. The notion of style applied to text composition systems, separating form from content, is extended to both graphical illustrations and tables.

Document composition systems have supported the inclusion of scanned illustrations and computer-generated line drawings before, but frequently the results have lacked the quality expected in a typeset document. This thesis presents techniques for specifying illustrations within the document manuscript and for controlling the quality of the resulting artwork by separating form from content in the specification of illustrations.

Document composition systems have also formatted tabular information before, but they have often permitted only limited control over the table formatting parameters and they have imposed restrictions on the possible layouts for tables. This thesis presents a new framework for controlling the table formatting process and for interactively designing a table layout. The table formatting paradigm introduced here can be extended to other two-dimensional layout problems, such as mathematical notation or complete page makeup.

The notion of style, a way of maintaining consistency, runs throughout the thesis. It is the underlying mechanism for managing the complexities of formatting high-quality illustrations and tables. The use of style both simplifies the design choices and enhances the disciplined use of high-quality typesetting.
The thesis includes aids to assist the reader in understanding the typographic subject matter: a glossary of typesetting terms, an index to the thesis, and plentiful diagrams and illustrations. Glossary terms appear italicized in a distinctive sans-serif typeface when they first appear in the thesis, such as the term style in the preceding paragraph.

Prototype implementations of the document formatting tools were developed in the Cedar programming environment [Teitelman, Cedar] at the Xerox Palo Alto Research Center. The Tioga editor and typesetter were extended to incorporate illustrations and tables using the prototypes. The Tioga document model and style mechanism were extended to accommodate both the graphical and tabular style requirements. The lessons learned from developing and using these prototype systems will be valuable during ongoing research designing a new integrated document composition environment based on Tioga.

This chapter introduces the notion of a document and particular problems in document composition that have led to the research reported in the thesis. The chapter concludes with a brief summary of the material presented in subsequent chapters.

1.2 What is a document?

A document communicates information both textually and pictorially. An author collects, organizes, and presents the information contained in a document. The document may be intended to inform, to persuade, to argue, or simply to entertain. This thesis deals with some of the problems of producing and reproducing documents for effective communication of information.

Historically, documents were handwritten manuscripts reproduced laboriously by scribes. Many of these early documents were handsomely illustrated with hand-drawn sketches or images. Early printing processes required hand-carved printing plates. The invention of movable type enabled people with less skill to prepare printing plates in less time. Movable type also created more freedom to correct and change pages.
It is worth noting that the early success of these printing systems depended on meeting the quality standards set by handwritten and hand-illustrated manuscripts. There has since been a long history of development in the printing process, with a keen regard for quality always at the forefront.

The traditional skills of reproducing modern documents are called the graphic arts. These crafts represent a high standard of quality and readability. The artistry of letter forms, the inclusion of fine quality photographs and line drawings, and the uniformity of color when printing many copies of a document are all testaments to the concern for standards in the graphic arts that have evolved over many years.

Traditional graphic arts processes are labor intensive and often time consuming. Electronic tools and techniques have already been introduced into these processes, but the standards and skills remain largely traditional. We wish to preserve the quality aspects of the graphic arts, while making the quality more consistent and more accessible through the use of electronic document composition tools.

Authors assemble the text, illustrations, and reference material for a document and then compose a manuscript. The graphic arts staff receives the manuscript and transforms it into a book, report, pamphlet, poster, or newsletter. The author often works with a manuscript that is very different from the final printed result. A great deal of trust must be placed in the graphic artists to produce a document that resembles what the author expects.

The traditional document production process works best when creating a static and long-lived document. The author must plan ahead to guide the reader through this static form of the document. Reading aids, like tables of contents, indices, and cross references between parts of a document, are often necessary to permit different readers to control their individual reading paths through a large document.
The most exciting prospect for introducing electronic composition tools is not in producing traditional documents faster, but rather in producing entirely new kinds of documents more easily. The author working with draft manuscripts should be able to see his information in its final form. The all too common phenomenon of author's alterations (changes requested by the author to the typeset galleys supposedly in `final' form) is convincing evidence that authors often do not really comprehend the implications of their manuscripts until they see them in their final form. This is a costly phenomenon that begs to be corrected. Electronic tools can help authors produce better draft manuscripts and thus better (and cheaper) final documents.

Engelbart's vision of electronic tools to augment human intellect, published in 1968 [Engelbart, NLS], contained many elements only now becoming standard in current document preparation systems. His work was based on a timesharing computer with each user (sometimes several participants in an electronic conference) viewing a document on graphic displays and interacting with two-handed input (a five-key handset, a keyboard, and the original mouse pointing device) [Engelbart, Terminals].

Engelbart's on-line document system NLS first introduced structured files. These files represented a document as a hierarchy of information statements with cross reference links among related statements. NLS documents could contain both text and simple line drawing graphics. The reader of an NLS document guided his own exploration of the document by selecting the content of interest, which of the cross references to be followed, the formatting parameters that would apply to the display, the levels of the hierarchy to be made visible, and the amount of each statement to be displayed.
Later, the Hypertext [Carmody, Hypertext] and Xanadu [Nelson, Xanadu] projects evolved similar systems that organized document content in a structured fashion and used graphic displays to present the document.

Electronic documents need not be restricted to presenting static information as are documents printed in hardcopy form. The potential exists for electronic documents to react to the specific reader, perhaps by choosing parts of the document based on the reader's experience or interests, or by connecting a dynamic document to a database and extracting the most recent information available.

Nevertheless, the presentation of information through electronic composition tools must strive to meet the reader's expectations for readability, quality typography, illustration, and organization that have been established by the graphic arts community. The challenge to electronic documents is effectively presenting (in a timely fashion) a wide range of material that includes textual statements (possibly in foreign languages), notation such as mathematical or chemical formulae, tabular presentation of information, photographs, line drawings (possibly with shaded and colored elements), and more.

1.3 Personal Reflections on Document Production

The author of this thesis has composed several scholarly books and journal articles through a typesetting company, Waterloo Computer Typography (WATTYPE), founded in partnership with a graphic designer. This experience involved the development and application of electronic composition tools for preparing and typesetting scholarly manuscripts for publishers who insisted on traditional graphic arts standards. Despite the benefits of those composition tools, many deficiencies in the tools and a multitude of production difficulties due to a lack of integration among the tools had to be circumvented. Those difficulties have helped to focus the author's current research into illustration and table formatting problems.
A sequence of three introductory texts for computer science, typeset by WATTYPE, demonstrates the benefits of storing and editing the manuscript with electronic tools. The first book was based on the WATFIV-S programming language [Dyck, WATFIV-S]. The manuscript was created by the authors using a text formatting processor that could only produce draft copies on a line printer. When the book was contracted for publication, the computer files were translated, using automated text processing tools, into formatting commands for a typesetting system [Beach, Typeset]. Later, two variations of this book were developed for PASCAL [Dyck, PASCAL] and FORTRAN77 [Dyck, FORTRAN77]. In each case, the manuscript files were methodically reviewed and edited to produce the preliminary version of the next book. Much of the material, especially the mathematical presentations and algorithms, remained unchanged and could be used without modification. Completely new sets of computer programs for each programming language were incorporated from the computer files used to compile and test the programs. A robust file system and archiving tools available on the host timesharing system made the job of managing all of these manuscripts and changes practical.

Editorial consistency within a document was achieved more easily using electronic tools than by manual proofreading. Checking words in a foreign language lexicon for a Chaucer bibliography [Peck, Chaucer] or checking the citations of figure captions for heavily illustrated text books were easily accomplished through the facilities of word processing or text editing programs by making global edits. The organization of documents into chapters, sections, paragraphs, tables, figure captions, and reference citations was regulated by defining standard formatting tags or commands within the computer manuscript files.
Families of documents, such as the sequence of three text books or all of the articles in a conference proceedings [Lusignan, ICCH3], shared a common design and layout by using the same composition tools with the same design parameters. With sufficient care and foresight, the resulting documents have significantly greater editorial consistency yet meet the same quality standards of traditional methods.

The accurate presentation of computer programs and computer generated data through electronic composition tools has seen a significant improvement over traditional graphic arts techniques. Manual transcription of data and the misinterpretation of the unusual appearance of computer programs lead to inevitable errors when using traditional methods. WATTYPE contracted to typeset an APL manual [, APL/66] because the authors refused to proofread the APL notation if it was manually transcribed from their draft manuscript that had been prepared on a computer line printer. Other authors in private conversations have related problems with publishing their programs and data.

Typographers are experienced in transforming typewritten manuscripts into typeset books, but not in reproducing the line printer output from computers. Monospaced fonts, typical of the fixed pitch characters on computer output devices, are rarely used in traditional typesetting. Typographers often substitute other fonts and treat the material as they would typewritten manuscripts. Authors of computer science texts did not appreciate the unexpected changes made to their programs, which of course invalidated them as computer programs. One example concerns quoted strings in programs being treated as quotations. Many style books demand that the quotation be set off by matching open- and close-quote marks and further indicate that quotations may have their punctuation moved inside the quote marks and otherwise changed for clarity [van Leunen, Handbook, p 60].
Computer programs require extremely precise placement of punctuation outside the quotation marks, and the computer character set often does not distinguish between open- and close-quote marks. Authors who use electronic composition tools can control the interpretation of unusual material, like computer programs, to ensure the accuracy and consistency of the published form.

One aspect of electronic composition that remains difficult is formatting complex mathematical notation. Tools like eqn and TeX provide abstract languages for an author to describe mathematical notation. These tools require converting the notation into a complicated syntax, especially complicated when nonstandard mathematical notation is involved. While automated syntax checkers do exist, correcting mathematical notation typically requires typesetting proof copies, marking corrections, and making revisions. There are few interactive `what you see is what you get' (WYSIWYG) editing tools for composing mathematical notation.

Few systems understand mathematical concepts to help authors avoid errors. While the lack of understanding permits notational schemes like eqn to be quite flexible, accommodating new notation is a major frustration in these systems. Authors frequently invent new notation to serve their purposes, especially in Engineering disciplines, or they make heavy use of notation that is not well supported by the mathematical composition tools. For example, the matrix algebra notation in a sparse matrix text [George & Liu, Sparse Matrices] stretched the capabilities of an eqn-like formatter to compose square matrices and align rows across matrix equations. More flexible notation schemes designed to accommodate authors who create new notations are needed. Incorporating support from a symbolic algebra package for checking the mathematical notation, analogous to spelling and diction analysis tools, would provide a better mathematical composition environment.
Most composition systems treat mathematical notation separately from normal text. Thus one cannot freely use mathematical notation in all parts of a document, even though mathematical notation is quite natural in chapter or section headings of technical documents. When the text of the heading in the sparse matrix book was automatically duplicated for use as a running head, the running head did not format correctly because the text fonts were different. Similarly, mathematical notation cannot be easily used in figure captions where the size of type is different, in the table of contents where headings have been automatically copied from chapter or section headings, or in index entries where phrases have been automatically collected from throughout the manuscript. A more integrated document content model for objects like text, mathematical notation, and illustrations would help to solve the problem of reusing the same material consistently in different contexts.

Tables of information are also awkward to compose. Each table in the documents composed by WATTYPE staff tended to be treated as a separate design problem, requiring special coding for each one. Table formatting tools are less well developed than text formatting tools, with many special table formatting features not possible or not provided. The content of table entries may be restricted, so that mathematical notation or illustrations may not be acceptable as table entries. Because tables are treated differently than text or mathematical notation, it may be awkward to use the same document style for tables as for the textual parts of a document. Simple tables, especially spreadsheets or tables of computed numeric data, are frequently formatted by special purpose programs, making it difficult to incorporate such tables within the body of an editable document.

Illustrations remain outside the mainstream of document formatting.
The illustration packages currently available either produce results crude by graphic arts standards, or are limited in the range of artwork they produce. Almost all the illustrations for books produced by WATTYPE were drawn by draftsmen at a larger scale and reduced to improve the quality of the reproduction. This led to several difficulties with inconsistent line thicknesses, varying typefaces for labels and captions, and differences among a set of similar drawings.

Formatting large documents into pages is another difficulty with electronic composition tools. Because text books contain hundreds of pages, automated pagination techniques are desirable. Unfortunately, in complicated situations, the current algorithms are likely to create unpleasant and unacceptable results, especially in placing figures and footnotes. Each special case has to be handled by manually coding special formatting instructions on how to properly break the page. Of course, each time the document changes, these instructions also have to be changed, and consequently, pagination is left until the very last moment.

A contributing factor to the pagination problem is the cost of formatting an entire document all at once. Some systems are noninteractive and the processing is actually run a batch at a time, typically one batch for each chapter. Multiple runs are needed when the document contains cross references between chapters in order to get the page numbers correct, when parts are automatically numbered in order to get the sequencing correct, and when index entries are automatically collected in order to get the page references correct. These formatting cycles often involve reprocessing a lot of the document that has not changed, thereby wasting resources, increasing costs, and introducing delays.
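The need for multiple runs to settle cross references can be illustrated with a minimal sketch. It assumes a toy layout model that is not taken from any of the systems discussed here: every word occupies its length plus one character, a page holds a fixed number of characters, and the hypothetical tokens `@label:name` and `@ref:name` mark a reference target and a page reference to it. The function names `paginate` and `format_until_stable` are likewise invented for illustration.

```python
def paginate(words, page_of, chars_per_page):
    """Run one formatting pass and return the {label: page} table it produces.

    Hypothetical toy model: every word occupies len(word) + 1 characters and a
    page holds chars_per_page characters.
    """
    new_pages = {}
    used = 0  # characters laid out so far
    for word in words:
        if word.startswith("@label:"):
            # A label takes no space; record the page it falls on.
            new_pages[word[len("@label:"):]] = used // chars_per_page + 1
            continue
        if word.startswith("@ref:"):
            # Print the page number from the previous pass, or "??" if unknown.
            word = str(page_of.get(word[len("@ref:"):], "??"))
        used += len(word) + 1
    return new_pages


def format_until_stable(words, chars_per_page, max_passes=10):
    """Repeat formatting passes until the page table stops changing."""
    page_of = {}
    for passes in range(1, max_passes + 1):
        new_pages = paginate(words, page_of, chars_per_page)
        if new_pages == page_of:
            return page_of, passes
        page_of = new_pages
    raise RuntimeError("page references did not converge")


words = ["See", "page", "@ref:fig", "for", "details.", "@label:fig", "Figure"]
pages, passes = format_until_stable(words, chars_per_page=16)
print(pages, passes)
```

Each pass in this sketch lays out the entire word list again even though only the printed page reference has changed, which is the wasted reprocessing described above. An incremental formatter would instead redo only the parts affected by the change.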
Another implication, even with interactive systems, is the `pregnant pause' syndrome, where reproduction-quality output is delayed until the moment when everything has been completely and finally formatted. Though WATTYPE had provided publishers with several drafts, each of which appeared much like the final result, none of the pages could be considered final pages. Publishers gain confidence when they see final pages coming out of the production pipeline. With the pregnant pause syndrome there are no final pages until the last minute. This places considerable faith and stress on the composition system to handle the surge of demand to output a complete document. Several failures in the computer hardware, operating system, storage system, communications system, and typesetter delayed book projects for WATTYPE. Furthermore, last minute touch-ups were always necessary to correct overlooked mistakes or to include artwork not produced with the system, and these must be handled outside the normal production cycle.

These difficulties, experienced first-hand by the author, led to a concern for developing incremental and integrated electronic document composition tools. The research reported in this thesis is directed towards an interactive `what you see is what you get' environment that supports a variety of document content and produces the high-quality results expected by graphic arts standards.

1.4 The Concept of Document Style

Electronic aids for document production have contributed to the concept of document style. A crucial insight is the notion of separating form from content in a document, made explicit in document compilers like Scribe [Reid, Scribe thesis] and implicit in many earlier macro packages like the -ms package [Lesk, -ms] for troff. Style deals with issues of form: appearance, aesthetics, and understandability of document content.
Generally styles have expressed how text is formatted, the typography of text:

``The practice of typography, if it be followed faithfully, is hard work -- full of detail, full of petty restrictions, full of drudgery, and not greatly rewarded as men now count rewards. There are times when we need to bring to it all the history and art and feeling that we have, to make it bearable. But in the light of history, and of art, and of knowledge, and of man's achievement, it is as interesting a work as exists -- a broad and humanizing employment which can indeed be followed merely as a trade, but which if perfected into an art, or even broadened into a profession, will perpetually open new horizons to our eyes and new opportunities to our hands.'' [Updike, Printing Types, quoted in [Williamson, Book Design, p 4]]

Electronic composition systems have been more concerned with simple typography [Beach, Computerized Typesetting] and less concerned with higher levels of style that apply to nontextual components, such as pages, illustrations and tables [Furuta, Survey].

On the surface, it appears that specifying a style is easy.

``To lay down rules of style would be easy enough -- we need only consider how things were done yesterday, or how they are done today, or how we prefer to do them ourselves, and to elevate these practices or preferences to the status of dogma.'' [Williamson, Book Design, p 2]

For most people who prepare documents, many decisions are institutionalized and thus already made for them. When one must create a document style with the rigorous detail demanded by a document compiler or macro package, the quantity of detail is enormous. Some publishers provide style manuals with hundreds of pages that capture this detail [, The Chicago Manual of Style, 1982].
Other publishers have felt threatened by revealing the style details to those creating document composition tools [Johnson, JACM style] or they can only provide sufficient detail after several iterations of critiquing samples [Bell, Sc.Am. illustration]. Document style appears to be an area that might benefit from the application of expert systems techniques to capture style rules. For now, because we do not understand very well how or why things are done, we must fall back on replicating how documents were formatted in the past.

1.5 Roadmap to the Thesis

The rest of this thesis reviews the state of electronic tools for document composition and details solutions to some of the difficult problems in handling illustrations and tables.

Chapter 2, Document Composition, presents a survey of the traditional graphic arts process for producing a document. This includes a review of how books get published and the roles of the people involved in producing a book. Typesetting systems, including early computer typesetting systems, document compilers, and integrated document composition systems are reviewed for their handling of document style, illustrations, and tables. A survey of existing document models highlights the need for more structured models to integrate various kinds of document content.

Chapter 3, Graphical Style, extends the style mechanism to illustrations. The same `form versus content' separation so successfully applied to textual objects is applied to graphical objects. A prototype implementation demonstrates the effectiveness of graphical style in achieving this separation and consistency in illustrations. Graphical style is revealed to be insufficient because it does not deal adequately with layout. The observation that specifying positioning constraints within illustrations would help control the layout leads to the consideration of a concentrated layout problem of formatting tables as a constraint satisfaction problem.
Chapter 4, Tabular Composition, examines the problems and difficulties in formatting tables. The earliest computer-typesetting programs were for preparing numeric tables but their approaches were simplistic and limited. A survey of the typographic features required for formatting tables leads to an examination of current table formatting capabilities available in document composition systems.

Chapter 5, A New Framework for Tabular Composition, introduces the use of grid systems and mathematical constraint solvers to the table formatting problem. A review of grid systems and their application to table layout provides the basis for incorporating many typographic features into a document structure suitable for tables. The constraint solver provides the general layout engine for formatting tables as well as the basis for an interactive table design tool. A prototype table formatter demonstrates the capabilities for handling complex tables.

Chapter 6, Future Directions, discusses several research problems that evolve from the graphical style and table formatting work reported in Chapters 3 and 5.

The Glossary explains terms used by typographic specialists. The glossary assumes the reader has a computer science background, and thus does not include common terms from computer science. Terms that appear in the glossary are identified in the thesis by use of a distinctive italic typeface.

An extensive list of References is provided for more detailed reading about typography, document composition systems, and graphic design.