2 DOCUMENT COMPOSITION
====================

This chapter surveys existing techniques for producing documents, beginning with the traditional graphic arts process for turning a manuscript into a finished book. The chapter investigates the concept of document style that arises from the graphic design discipline and pervades modern electronic composition systems. The use of computers and electronics in document composition is surveyed next, first examining the early typesetting systems, then document compilers such as troff, Scribe, and TEX, and finally integrated document composition systems such as Etude, Janus, and the Xerox Star. The survey of document composition techniques concludes with a discussion of several issues concerning the structure of information in documents and a description of some models of structured documents.

2.1 Traditional Document Production Techniques

Researchers make substantial use of books and journals in their everyday work. However, few people understand how those documents are produced. Only when they decide to write their own book or to edit a scholarly journal do they become involved in the mysterious world of the graphic arts. This survey is intended to help the reader to understand document production, to appreciate the many diverse roles and skills necessary, and to realize the vast number of details and decisions involved in producing high-quality documents.

2.1.1 How do books get produced?

An interesting review of how books are produced is contained in the anthology One Book/Five Ways [AAUP, One Book/Five Ways]. It reports on a comparative publishing experiment in which five university presses prepared the same book for publication: the University of Chicago Press, the MIT Press, the University of North Carolina Press, the University of Texas Press, and the University of Toronto Press. The procedures used in each press were remarkably similar.
Although the approaches varied somewhat, all involved the stages of acquisition, market and preliminary cost estimation, editorial revision, design, production, sales, and promotion. Each press documented its procedures, its forms, and the guidelines it applied to the various processes. One Book/Five Ways contains a rich collection of raw material for anyone interested in the publishing process. In particular, the report includes the style guidelines from each of the presses. These guidelines establish the publisher's house style, and govern editorial, graphic design, illustration, composition, and typesetting decisions. Perhaps the best-known style guideline for scholarly documents is The Chicago Manual of Style, which was referenced by several presses in this experiment, although most have their own refinements and special instructions.

An important feature of the traditional book production process is the parallelism achieved through several groups working on distinct aspects of a book. When a manuscript arrives at the press for consideration, it is quickly copied and sent out for two or more independent reviews to decide whether to publish the work. Once the decision to publish is made and the completed manuscript arrives from the author, copies are sent simultaneously to (1) the production editor, who establishes a job docket to track all of the subsequent stages of the publication, (2) the copy editor, who makes editorial revisions, and (3) the graphic designer, who designs the book and its illustrations. This parallelism is shown in Figure 2-1 for a simplified and hypothetical publication process.

====================
Figure 2-1. TRADITIONAL GRAPHIC ARTS PROCESSES involve considerable parallelism in the procedures for publishing a manuscript.
The author's manuscript is copied and sent to the production editor, the copy editor, and the design/illustration department. Edited pages are typeset by the composition staff, who are guided by the design of the document. The typeset manuscript and the illustrations are then assembled into pages in preparation for printing.
====================

Other parts of the document publication process also involve parallelism. If the book is to have a jacket or cover illustration, that illustration is undertaken while the insides of the book are prepared. The table of contents and Library of Congress submission forms are prepared as soon as the book enters production to ensure that the imprint page and the front matter of the book are ready for printing.

The index is often on the critical path near the end of the document production cycle. Since index entries must have the correct page numbers, the index cannot be fully completed until all of the pages have been assembled. Typically the index entries are compiled in parallel with the book composition. After the page numbers are assigned on the reproduction pages (or page repros), the index manuscript is completed in parallel with the final proofreading of the book pages.

Even with the use of electronic composition tools, preparation of back matter is on the critical path and inconsistent page numbering occasionally results. Such problems appear in the appendices of the second edition of Newman and Sproull's Principles of Interactive Computer Graphics [Newman&Sproull, Computer Graphics], in which the reference citations all refer to a preliminary draft version, because the authors forgot to make `one last revision pass' over the reference citations in the appendices. The second edition was typeset by the authors using facilities at Xerox PARC because they could complete revisions up to the last minute and control the accuracy of computer programs contained in the text.
In a normal production process, more people check the work, and hence there is less chance of oversights such as the one that occurred in those appendices.

An area of great concern to the publisher is administration of the production process. Publishers usually have several projects underway at the same time because of the delays involving revisions and approvals from the author of a single project. The production editor controls the document publication process for the publisher, determining time and cost estimates for the publication, selecting and contracting with suppliers, tracking the parallel stages of the composition process, and keeping records of deadlines and expenses. In a journal publishing situation, the problem is compounded by the dual pressures of multiple authors and frequent publication deadlines for each issue.

These process control functions are the most important contributions of publishers. Some publishing companies employ little more than production and marketing editors in house, subcontracting most of the skilled jobs such as copy editing, design, illustration, composition, and printing. In the electronic publishing or self-publishing process, these subcontracted jobs are performed by the manuscript author, and electronic document production tools will have to handle them successfully.

The traditional document production process in the graphic arts routinely accommodates difficult manuscripts. Typically tables, mathematical notation, illustrations, and page layout are aspects of document production that are considered difficult by traditional publishers. The following sections discuss how each of these areas was handled in the comparative publishing experiment.

Tables

There were only a small number of tables in the One Book/Five Ways experiment, but they were always treated separately from the main body of text.
Many publishers rely on the skill of the compositor or typesetter to handle tables:

``A good composing room can translate almost any tabular copy in a reasonably clear and presentable example of tabular composition.'' [Williamson, Book Design, p 160]

The Chicago Manual of Style provides authors with the ``dos and don'ts'' for preparing tables in manuscripts. In particular, authors are expected to prepare tables on separate pages because the tables will be composed separately from the text. There are some cautions also. For instance, the University of Chicago Press no longer prefers vertical rules in tables because Monotype composition (using molten metal casting of individual letters), which could insert a vertical rule easily, is no longer economical. With phototypesetter composition, vertical rules are difficult and expensive:

``In line with a nearly universal trend among scholarly and commercial publishers, the University of Chicago Press has given up vertical rules as a standard feature of tables in the books and journals that it publishes. The handwork necessitated by including vertical rules is costly no matter what mode of composition is used, and in the Press's view the expense of it can no longer be justified by the additional refinement it brings.'' [, The Chicago Manual of Style, 1982, p 325-326]

Mathematics

Although there was no mathematics in this experiment, publishers treat mathematical notation very differently from textual material. Kernighan and Cherry note this difficulty in their paper on computer typesetting of mathematics [Kernighan&Cherry, eqn], where they quote the following from The Chicago Manual of Style:

``Mathematics is known in the trade as difficult, or penalty, copy because it is slower, more difficult and more expensive to set in type than any other kind of copy normally occurring in books and journals.'' [, A Manual of Style, 1969, p 295]

Some publishers specialize in mathematical and scientific documents.
They utilize both skilled copy editors and special suppliers to handle the difficult mathematical material. Other North American publishers send mathematics copy to the Far East, where hot metal composition provides the quality and cheap labor rates reduce the cost.

Illustrations

The treatment of illustrations varied widely in the publishing experiment described in One Book/Five Ways. In one instance, a publisher chose to have an artist prepare line drawings rather than include halftone photographs because there were no convenient local suppliers to create halftone screens for the photographs. In contrast, another publisher planned photographs for the opening page of each chapter as well as for most of the illustrations. Generally, illustrations are prepared separately while the book is being copy-edited, and are then manually assembled onto the completed pages.

Page Layout

Examining the book design and page layout used by most publishers reveals mainly the results rather than the design process itself. Page dummies and sample pages are the usual products of the design process. Page dummies are sketches of the page layouts prepared by the graphic designer for approval. Sample pages are pages typeset and assembled by the composition supplier. Both techniques may require several iterations between designer, supplier, and publisher to make certain that the publisher is satisfied and that all the style guidelines are followed. Unfortunately, such an iterative design process generally means that the publisher's guidelines have never been completely specified, frustrating those attempting to become a supplier with new technology.

2.1.2 Roles involved in producing a book

The document production process is complex. To help understand the process better, this section examines the individual roles of people involved in producing a published document.
Anthropomorphism, or the attribution of human behavior to some problem, has proven beneficial in making complex parallel processes more easily understood [Dyment, Corkscrew] [Booth&Gentleman, Anthropomorphism]. An interactive paint program [Beach, Paint] was implemented using multiple processes, where anthropomorphism served to clarify and simplify the relationships of the parallel processes. Through cataloguing the roles involved in document production, the structure of the problem becomes apparent as a set of integrated processes.

====================
Figure 2-2. A HYPOTHETICAL PUBLISHING PROCESS indicating the roles and their interactions at various stages. The horizontal axis represents elapsed time and the thin vertical lines join activities that begin or end at the same time. Delays or inactivity are not shown, but may exist at many places in the process.
====================

(Aside: An example of the lack of integration in electronic tools occurred when preparing Figure 2-2. There are 15 text labels and the first version of the illustration contained two spelling mistakes. Because the illustration was prepared with a separate illustration tool and was not integrated with the document, the spelling tool used on the text of this chapter was unable to find the mistakes in the illustration.)

An important thing to remember while reading this categorization of roles is that the descriptions relate to activities and not people. Sometimes people may fill several roles at once, such as an author who types and composes the manuscript, or a graphic designer who does the layout, illustration, and paste-up. The use of document composition tools in universities and research labs has tended to encourage (or force) authors to take on multiple roles.
From this experience, people may wrongly conclude that each job is easier than it is, especially when they are not aware of what they are doing wrong. Concentrating on each role separately helps us to understand the process and to realize the skills necessary to accomplish all aspects of that specialist's job.

· Author of the manuscript

The author creates the original manuscript. Generally, the manuscript is textual material, although for some subject areas there will be vast quantities of mathematical notation, computer programs, tables, line drawings, or photographs. The author may produce several draft manuscripts with the assistance of a typist. Some authors now do their own typing with word processors or text editors. Sophisticated editorial tools, such as the diction and writing style analysis tools offered in the UNIX Writer's Workbench [Cherry, Writing Tools] [Macdonald, Writer's Workbench] and in other commercial editing systems [Alexander, Editor Aids], may be used by an author to improve the quality of the writing.

A draft manuscript is submitted by an author to an acquisition editor or journal editor for consideration. After a favorable publishing decision, the author completes the manuscript and adds front matter that may include a preface, an introduction, acknowledgements, etc. If the document is to be indexed or have other reference material, the author may need to prepare this material also. The completed manuscript is sent to the production editor, who begins the publication process. Some publishers will now accept manuscript submission in electronic form, such as word processor diskettes or magnetic tape.

The author may be involved in reviewing decisions made by the publisher. The copy editor will mark the manuscript with suggested changes and questions to be dealt with by the author. The graphic designer or illustrator may send drafts of the book design and illustration artwork for review and approval.
There may also be an indexer involved, who may send the preliminary index entries to the author for review. The author must also check the composition process by first looking at the galleys and later at proofs of the assembled pages.

· Typist

The typist prepares the draft manuscript for the author using a typewriter, word processor, or text editor program. Typewriter composition involves only simple typography, typically with only a small number of type styles. Technical typing with many mathematical symbols is much more difficult and time consuming; some typists resort to hand printing symbols that are unavailable on the typewriter. The layout of typewritten material is free form and requirements are quite relaxed. Tables are easily laid out with fixed-width characters on a typewriter. The human typist frequently acts as a built-in spelling checker and copy-editing service while transcribing the manuscript.

Several drafts are prepared during the creation of a manuscript. If each draft is retyped to incorporate changes, there is a strong tendency to reduce the number of drafts because of the effort required. Often, the completed manuscript contains partial page inserts pasted or stapled together.

· Acquisition Editor or Journal Editor

The acquisition editor solicits and reviews new manuscripts from authors. Opinions of reviewers are sought to determine if the manuscript should be published. The publishing decision is made by a publication board or a committee of journal editors and is concluded by the signing of a publication contract or agreement between the publisher and the author.

· Reviewer or Referee

A manuscript reviewer may be asked by a publisher to give one of several opinions. Book publishers refer to these people as reviewers, and journal editors refer to them as referees. Reviews made early in the process seek to establish the marketability of a manuscript or the appropriateness of a journal article.
Later, more comprehensive reviews seek to assess the subject coverage, research contributions, and technical accuracy of the manuscript. Reviewers are generally most concerned with document content, although in some special cases they may also consider the format or style of a manuscript. Some reviewers of technical material may use their own typesetting capabilities to capture their comments in the complex notation of the subject area, such as mathematics or computer programming. In some cases, such as computer science journals, the reviews may even be transmitted electronically via electronic mail networks.

· Production Editor

The production editor controls the document production process. Initially the production editor deals with the author to ensure that the manuscript has all necessary illustrations, that all the sections of the manuscript are finished, and that permission is obtained to reproduce items from other sources. Copies of the completed manuscript are sent in parallel to the copy editor for editorial revisions and to the graphic designer for book design and illustration. Production editors contact and select appropriate suppliers for graphic arts services when those services are not available within the publisher.

To help manage and track the various stages of several publications going on simultaneously, the production editor maintains a production database recording the expected services, the date and time each service began and finished, the estimated and actual costs incurred, and the current status of ongoing services. This database exists either on paper as the job docket (a large envelope that contains all the partially completed results) or in a computer file.

· Graphic Designer

The graphic designer provides the book design and layout guidelines. This design can only be done effectively when the entire manuscript is available, although some designs are attempted with incomplete information and later revised during publication.
The design guidelines are written in a specification sheet or in a style sheet to be sent to the compositor with the copy-edited manuscript (see the example in the next section). As difficult typographic situations arise, graphic designers may design special guidelines for those not covered in the general scheme, such as designing the layout for tables, and specifying the typography for nested lists of material or for foreign language extracts. Artwork for the illustrations may or may not be the responsibility of a graphic designer, depending on the designer's agreement, talents, or interests. Jacket or cover designs may also be the graphic designer's responsibility.

· Copy Editor

The copy editor ensures that the manuscript meets the publisher's house style for language usage, grammar, spelling, citations, references, illustration captions, table arrangements, headings, lists of items, foreign language phrases, etc. The copy editor deals with all the irksome details that would annoy the reader if they were not treated consistently. For example, the copy editor checks cross references from one section to another for completeness and verifies that captions, footnotes, and citations are numbered sequentially. Missing information or references and questionable corrections are sent to the author for action.

Electronic editing tools greatly assist the copy editor in accomplishing these consistency checks. Displaying both the cross reference and its referent through multiple views (or windows) of a manuscript helps to check cross references; pattern-matching search operations permit quick global checks; style and diction analysis tools may be of assistance in checking the grammar, spelling, and language usage.

The copy editor marks the manuscript for the compositor by identifying the logical parts of the document, such as chapter openings, various levels of section headings, types of lists of items, and captions for tables and illustrations.
Selecting the typographic treatment of those logical parts is the responsibility of the graphic designer, who specifies to the compositor the typography for each part in the style guidelines.

· Indexer

The indexer prepares the index entries for a manuscript, assigns page or reference numbers to each entry, sorts them, and creates an index manuscript. The indexing job may or may not be done by the author, although the author usually must approve the index manuscript. The indexer works with the manuscript in two stages: the copy-edited manuscript prior to composition to determine the index entries, and the page proofs to assign the correct page numbers to the sorted index entries. The requirement for correct page numbers places the index on the critical path for publication and some publications omit the index to reduce the delay.

Electronic aids for indexing have not proven to be a panacea. Winograd and Paxton created a general set of indexing tools [Winograd&Paxton, TEX Indexing], yet the index still required hand editing and fine tuning. The difficulty in preparing an index is the proper selection and cross referencing of index entry terms or phrases. Skilled indexers still produce better indices than most computer-generated ones because they index on meaning, not on a precise phrase found in the manuscript.

· Illustrator, Draftsman, Graphic Artist

The illustrations for a publication are prepared from initial artwork provided by the author. The range of illustrations found in technical documents spans fine hand-drawn illustrations produced by a graphic artist, engineering drawings prepared by a draftsman, and photographs supplied by the author or a photographic service. Often illustrations are produced by tracing the author's sketches, which results in revision cycles as the author more clearly indicates his intentions.
The graphic designer may produce illustration artwork personally or may establish artwork guidelines for original size, reduction factors, line weight, typography, shading textures, materials, and so on. Reducing the original artwork improves the quality of the line drawings by making the line weights appear more consistent (small variations are less noticeable) and by sharpening the contrast in the image. Careful coordination of dimensions and text size on the original artwork is necessary to ensure that the reduced artwork suits the surrounding typography when assembled on the page.

· Keyboarder, Coder, or Inputter

The composition of a document is accomplished in two stages: entering the marked-up manuscript into a typesettable file, and then outputting the file on a typesetting device. Typically there is one format code for each logical part of the document marked by the copy editor. For example, there might be a code for the chapter opening, for each level of section heading, for beginning an indented list of items, and for a line of a table.

The job of entering the marked-up manuscript may be further subdivided into several phases: assigning format codes to the copy editor's marks, designing the typesetter codes for each format, and inputting the manuscript codes and text. The style sheet provided by the graphic designer determines the appearance of marked-up parts of the manuscript and hence the typesetter codes required.

The typesettable files may either be entered directly, on less expensive slow typesetting devices, or kept on some storage medium (perhaps paper tape, floppy diskettes, rigid disks, or magnetic tape) for more expensive high-speed typesetters. Corrections to the typeset galley proofs are most often made by typesetting corrected pieces of the manuscript, rather than correcting the files and retypesetting the entire galley.
In the case of large documents, management of the corrections is a concern and poses difficulties for subsequent uses of the document.

· Compositor, Typesetter

The compositor produces the actual typeset output. This person may also do the keyboarding, but a compositor must have the skill to enter specific typographic codes for unusual or difficult typesetting jobs, such as for mathematics, tables, illustration labels, copy fit text that must fit certain dimensions, and so on. The compositor runs the typesettable file through the typesetting device and produces the typeset galleys or pages.

· Paste-up Artist

Most documents are typeset in galley form and later cut and pasted into page assemblies. The paste-up artist collects all the pieces of the manuscript in their final form: typeset text, running heads with page numbers, mechanical artwork for the illustrations, and photographs. Pages are assembled by cutting apart the galleys into pieces that will fit on each individual page and pasting the pieces onto page layout forms. These layout forms are typically printed with light blue lines that will not reproduce on photographic negatives for printing. The paste-up process requires a sharp knife and a waxing machine, which coats the back of photopaper lightly with wax that helps the paper adhere to the layout forms when the two are pressed together. The wax adhesive is pliable so that the pieces can be safely separated if the layout needs to change.

Paste-up only applies to photocomposition systems that produce paper or film original type. With metal foundry type, the assembly process involves moving metal type slugs into place and performing craft operations, like surrounding type slugs with furniture to provide the spacing for page layout, or kerning individual letter slugs by cutting off the corners to make them fit together better.
Some legal organizations have required metal type for legal documents to avoid potential errors in electronic composition systems using phototypesetters [Leith, Metal type]; they wanted to see and verify the final type.

The graphic designer may paste up a document, especially if the manuscript requires frequent design decisions. In such cases it is quite difficult afterward to determine the rules and logic that were applied to accomplish some of the creative layouts.

· Process Camera Operator, Stripper

After the page assembly stage, completed pages are ready for printing. Depending on the printing process, it may be necessary to use a large-format graphic arts process camera to prepare photographic negatives of each page. The negatives are in turn used to expose printing plates. Text and line art illustrations are photographed directly on very high contrast negative film, whereas photographs are screened or halftoned to provide the tonal variations on high contrast film. If the printer is capable of printing several pages in one pass, then the stripper must prepare an imposition of several pages into one printing signature.

The graphic arts process of producing printing plates from assembled pages (master images) has been imitated by the concept of rendering device-independent image masters through page description languages like Interpress from Xerox [, Interpress] and PostScript from Adobe Systems [, PostScript].

· Printer

The printing process selected by the publisher depends on the number of copies or impressions required. Short-run printing (up to 50 copies) can be done cost-effectively with a photocopier from a paper original. Medium-run printing (from 50 to 1,000 copies) can be done with an offset duplicator using an inexpensive paper-based printing plate. Long-run printing (from 1,000 to 10,000 copies) is generally done with high-speed offset printing presses in signatures containing several pages and using metal printing plates.
If the document requires color, then there must be separate impressions made for each printing ink color. Each impression requires a separate master image, one for each color of ink. To print images with a full range of colors, separations may be prepared by an outside supplier working from a slide transparency of the colored image. For a small number of flat colors (typically black plus one or two colors) the separations may be made by the process camera operator from color-keyed parts of the original document.

· Binder

The printed pages must be collated and bound together to form a completed document. The bindery specializes in taking the bulk pages, possibly in signature form, folding them, collating them in the correct sequence, sewing or otherwise fastening the pages together, and trimming the pages to finished size. The cover, whether a cloth-covered hard-cardboard case or a strong paper back, is attached around the document. Any printing on the cover or jacket must be designed and printed in time for binding. The result is a completed publication ready for distribution.

2.2 The Concept of Style

It is important to observe that there are an incredible number of choices in the design parameters that go into producing a document. How do people make the choices? What controls the choices? How are the choices communicated when they are made?

2.2.1 Style as a Series of Design Choices

Many design choices are involved in the process of producing a document. For example, the copy editor chooses names for the logical parts of the document and communicates them to the graphic designer and compositor on the marked-up manuscript. The graphic designer chooses the typographical parameters for these marked parts of the manuscript and communicates them to the compositor on type-specification sheets, such as the one shown in Figure 2-3. The compositor acts on the mark-up codes, using the type specifications, and enters typographical formatting codes in the typesettable file.
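
The hand-off among copy editor, graphic designer, and compositor described above can be pictured as a small lookup pipeline. The Python sketch below is illustrative only and rests on assumptions: the part names (`chapter_opening`, `heading_1`, `body`), the typographic values, and the angle-bracket formatting codes are all hypothetical, and they do not reproduce any real typesetter's code set or the style sheet of Figure 2-3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSpec:
    """Typographic parameters chosen by the graphic designer (hypothetical values)."""
    face: str
    size_pt: int
    leading_pt: int


# Type-specification sheet: one entry per logical part named by the copy editor.
# Both the part names and the parameters are invented for this sketch.
STYLE_SHEET = {
    "chapter_opening": TypeSpec("Times Bold", 18, 22),
    "heading_1": TypeSpec("Times Italic", 12, 14),
    "body": TypeSpec("Times Roman", 10, 12),
}


def format_code(spec: TypeSpec) -> str:
    """Render a spec as a made-up typesetter formatting code."""
    return f"<f={spec.face};s={spec.size_pt};l={spec.leading_pt}>"


def compose(marked_up):
    """Play the compositor: turn (part name, text) pairs into typesettable lines."""
    lines = []
    for part, text in marked_up:
        if part not in STYLE_SHEET:
            # A part the designer never specified has to go back for a decision.
            raise KeyError(f"no type specification for part {part!r}")
        lines.append(format_code(STYLE_SHEET[part]) + text)
    return lines


# The copy editor's mark-up: each piece of text tagged with its logical part.
manuscript = [
    ("chapter_opening", "Document Composition"),
    ("body", "This chapter surveys existing techniques."),
]
typesettable = compose(manuscript)
```

In this sketch the copy editor's names and the designer's parameters live in separate structures, so a design change would touch only the style sheet and not the marked-up manuscript. An unspecified part raises an error, which loosely mirrors the way an uncovered situation is sent back to the designer.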
====================
Figure 2-3. TYPOGRAPHIC STYLE SHEET typical of the specifications that graphic designers provide compositors to control the parameters of typeset documents.
====================

All of these choices influence the publishing style of the organization. The American Heritage Dictionary's definitions of `style' and `style book' help clarify what style means and how it can be used:

``style n. 1. The way something is said or done, as distinguished from its substance . . . 7. A customary manner of presenting printed material, including usage, punctuation, spelling, typography, and arrangement.'' [, Dictionary]

``style book n. 1. A book giving rules and examples of usage, punctuation, and typography, used in the preparation of copy for publication.'' [, Dictionary]

Each publishing house develops its own house style, a way of doing things that will distinguish documents from that publisher. In the publishing experiment described in One Book/Five Ways, the University of Toronto Press provided the most concise set of composition style guidelines, covering the following topics:

text composition: word spacing, word division (hyphenation), letterspacing, paragraphs, leading, small capitals, figures (numerals).
punctuation: dashes, periods, apostrophes, colons, semi-colons, exclamations, question marks, ellipses, quotations.
special settings: capitals, tables (avoid vertical rules), footnotes, extracts, quotations.
page makeup: facing pages, widows.

People at different levels contribute to a publisher's distinctive style. The editorial staff establishes the guidelines for authors and copy editors, such as recommended forms of presentation, spelling, language usage, or the avoidance of vertical rules in tables. Graphic designers select the typography and layout for book designs. The composition staff determines the final typesetting choices through interpreting the typographic specifications.
A publisher's style is developed through an iterative process. The high level plan is established by the publisher and the editorial staff; they request a certain `look' or `feel' for a publication. The graphic designer reduces that high level plan into more specific guidelines, but the compositor still has some freedom to interpret typographic choices. The result is sample pages. These pages are passed up the chain for approval and are returned for correction. The changes iterate among publisher, graphic designer, and compositor until the publishing staff finally `sees' what they want. For large documents, this leads to inconsistencies in how variations not covered in the sample pages are handled, or even differences due to different people working on the manuscript. The solution has traditionally been ``Try it again until you get it right.'' For automated composition systems that rely on algorithms to carry out repetitive actions, the traditional design process makes it hard to extract the formatting algorithms from style guidelines. The guidelines are expressed in terms of what people are doing, rather than the process of doing it, or the cause and effect decisions that lead to the result. Therefore, it takes several iterations with sample pages that cover all the expected situations before a creative programmer can express the style rules as an algorithm. 2.2.2 What Do Styles Affect? Style may seem to affect or control more than just the appearance of a document. For instance, consider the choice between Canadian and American spelling, something that might be treated as a style choice. Clearly different spellings contain different letters, as in `colour' versus `color', `labelling' versus `labeling', but the same letters may appear in a different order, as in `centre' versus `center'. The concept of style must accommodate these apparent changes in substance. We need to realize that style can accomplish changes at many different levels. 
The change in spelling does not affect the meaning of the sentence containing those words, and therefore the substance of the meaning remains constant while the spelling varies. In fact, many Canadian and American readers easily pass over these different spellings. The style may have changed the characters but not the meaning of the words. Consider the language processing trichotomy of lexical, syntactic, and semantic analysis. Style can be seen to affect primarily the first two stages of analysis. Style at the lexical level affects a token's appearance, such as the choice of spelling. More common lexical style changes are the use of distinctive typefaces for section headings, the inclusion of whitespace above and below section headings, etc. In fact, most typographic parameters fall into this lexical category of style. Style at the syntactic level affects the order of information in the document. One example is the order of names in a bibliographic citation; one style places the surname before initials, while another style places initials before the surname. Another example of syntactic style is the placement rule for parts of a document during page layout, such as locating figures at the top or bottom of a page and collecting all footnotes at the bottom of each column. Style is also possible at the semantic level by providing different readers with different views of the document. For instance, a document on how to use the Cedar mail system on a new kind of file server [van Leunen, One Document] was prepared for readers with different backgrounds. The document contained written modules of information for one of three kinds of audiences: those who had never used the mail system, those who had used the mail system but stored their files locally, and those who had used the mail system and had some experience with the new file server.
A map of which modules applied to which experience categories was used to compile three versions of the document from the various modules. Cargill presents similar ideas for managing different views of software source code [Cargill, Views]. In his scheme, multiple software versions for differently configurable systems were maintained in the same file structure. Depending on the configuration desired, different software versions would be extracted. 2.2.3 Styles for Specific Media Another style dimension is differentiation in media. Traditional printing processes provide some variation in colors and papers, but other reproduction technology and electronic documents span a broader range of possibilities. Documents that become projection slides, posters, or video displays represent some of these. The notion of device independence in computer graphics can be applied to document formatting. The survey article on document formatting [Furuta, Survey] presents the notion of a `view' of a document as the device-independent post-processing of a formatted document for a particular device. However, media and device capabilities may influence the appearance and readability of information in a document. In this case, device independence is less desirable. Rather, we wish to reformat the document to take advantage of device characteristics or, put another way, to change the style to suit the medium in which the information will be presented. Low-resolution devices without color must obviously use different techniques than high-resolution color laser printers. Type families are hard to distinguish on low-resolution devices: 8-point Times Roman on a display screen is difficult to distinguish from any other serif typeface (such as Garamond or Baskerville) because there are so few `bits' available to display subtle differences. A color image may lose a great deal when viewed in black and white, especially on low-resolution devices that display only a few, if any, grey levels. 
2.3 Early Typesetting Systems The early use of computers in graphic arts typesetting systems has been chronicled in several interesting books. One report of a computer composition system is Barnett's Computer Typesetting [Barnett, Computer Typesetting], which describes his work at MIT in the early 1960's. Arthur Phillips's compendium Computer Peripherals and Typesetting [Phillips, Computer Typesetting] describes the computing and typesetting technologies that were being applied in the graphic arts industry up to the late 1970's. Seybold's classic book, Fundamentals of Modern Photocomposition [Seybold, Fundamentals], surveys the first three photocomposition generations and the state of document preparation systems, as well as presenting his seminal thoughts on the problems of area composition (page layout), computer-generated halftones, and integrated system solutions. Phillips's later book, Handbook of Computer-Aided Composition [Phillips, Handbook], describes the evolution of electronic tools in the publishing and printing industries. Berg's Electronic Composition [Berg, Composition] provides a complete assessment (much in the style of a consultant's report) of the issues in composition systems, the options available, and the pitfalls to be avoided. Most of the early graphic arts systems used rather small resources and simple approaches to the complex problem of producing typeset documents: Barnett [Barnett, Computer Typesetting] used the IBM 709 at MIT; Seybold [Seybold, Fundamentals] describes composition software run on an IBM 1130; the first stand-alone typesetter at Waterloo, a Photon 737 Econosetter, had only a 4K 12-bit program memory [Beach, PROFF]. These computer programs accepted typographic codes that mimicked the manual actions of a typographer using a hot-metal type-casting machine. The coding structure intermixed action codes with text character codes. 
Due to the use of shift-codes, super-shift codes, and even upper-rail and lower-rail shift codes, the text was often inscrutable for editing purposes. Table formatting was an early application of computers in typesetting. The earliest such publication, found after an extensive literature search, was the 1962 NBS Monograph 53, Experimental Transition Probabilities for Spectral Lines of Seventy Elements, by Corliss and Bozman [Corliss&Bozman, NBS53]. Since computers were generating numeric data and since typesetting equipment was being driven from magnetic tape, it was natural to combine the two together. This monograph contained only a single table and the table formatting was accomplished by a special purpose program. The program is described in more detail in Chapter 4. Another class of document composition systems evolved from the text formatting programs developed on general purpose computer systems. The evolution of such formatters from Saltzer's RUNOFF document formatter [Saltzer, RUNOFF] is chronicled in Brader's Masters thesis, An Incremental Text Formatter [Brader, Incremental Formatter], and later in the Computing Surveys article by Furuta et al., ``Document Formatting Systems'' [Furuta, Survey]. Documents for such formatters were presented as a stream of characters that included embedded control codes. The earliest RUNOFF systems used a period at the beginning of a line of input, an unlikely occurrence in normal written material, to indicate the presence of a formatting command. Later systems escaped from the `line of input per command' restriction by designating command delimiters as infrequently used characters like braces [Beach, Typeset], backslashes [Ossanna, troff] or at-signs [, SCRIBE]. Macro and conditional execution facilities for commands extend the range of document formatting possibilities. 
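The leading-period convention can be illustrated with a minimal sketch in Python. The command names `.center` and `.skip` and the function `split_runoff` are hypothetical, chosen only for illustration; they are not taken from any particular RUNOFF implementation.

```python
def split_runoff(source):
    """Classify each input line as a formatting command or as text.

    A line whose first character is a period is treated as a command;
    every other line is ordinary text. Command names are hypothetical.
    """
    items = []
    for line in source.splitlines():
        if line.startswith("."):
            # Everything after the period up to the first space is the
            # command name; the remainder is its argument, if any.
            name, _, argument = line[1:].partition(" ")
            items.append(("command", name, argument.strip()))
        else:
            items.append(("text", line))
    return items


manuscript = """.center
Document Composition
.skip 2
This chapter surveys existing techniques."""

for item in split_runoff(manuscript):
    print(item)
```

Because the command test looks only at the first character of a line, a command can never begin in the middle of a line, which is the restriction the later delimiter-based systems removed.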
One tenet of documentation folklore at that time was that if you could make writing a document more like programming, then programmers would take the time to prepare documentation for their work, something which proved difficult to ensure. Unfortunately this was the wrong paradigm. It did not make writing a document easy and it did not get programmers to write better documentation. The model of a document as a stream of text with embedded commands survives today as a prevalent document formatting model. One consequence of the stream document model in both the early graphic arts systems and the early document formatters is the need to accept the document stream as an abstraction of the formatted document. An early system by Engelbart provided an alternative document model and several alternative views of the document. The editing and formatting part of Engelbart's augmented human intellect system, NLS [Engelbart, NLS], provided a concrete view of formatted documents as they would appear when printed, without the intrusion of formatting commands. The NLS system was the original `what you see is what you get' document formatting system and Engelbart coined the phrase WYSIWYG (pronounced whizy-wig) to describe it. Due to the limitations of the display and printing devices, NLS was exactly a WYSIWYG system. Many later systems also claim to be WYSIWYG, but cannot claim to render printed output exactly on the display, mainly because of differences in fonts and character widths between the display and the printing devices. In a further departure from the stream of text and embedded commands model, the NLS system represented the document contents in a tree-structured hierarchy of text blocks, such as the common hierarchy of chapters, sections, subsections, and paragraphs. A reader of the on-line document could display one of several views. 
For example, one viewing parameter controlled whether the structure labelling was visible or not, another parameter controlled the number of hierarchy levels displayed, and yet another controlled the number of lines displayed in each text block. NLS could also incorporate line drawings within documents by allowing a graphical object to take the place of a paragraph. Sadly, these ideas were not widely accepted at the time when they were first introduced in the late 1960's. Almost a generation passed before Engelbart began to receive the appropriate credit for the ideas of the mouse pointing device, multiple windows, and WYSIWYG formatting systems. Many early graphic arts typesetting systems did not attempt to deal with page layout but only produced typeset galleys to be pasted-up manually in the normal way. The RUNOFF-style formatters provided some limited page breaking capabilities and they could print running heads and footnotes. Such formatters relied on the simple and easily-handled dimensions of fixed-width characters on a line printer or teletype page to make the algorithms workable. Typesetting document formatters based on extensions to the RUNOFF model could produce output for typesetters. They produced typeset pages by executing page breaking algorithms coded as macros. Some early typesetting work with PROFF [Beach, PROFF], a RUNOFF-like formatter for the University of Waterloo's Photon Econosetter, used simple page depth measurements to break large documents into pages. This was done mainly to avoid the manual paste-up stage due to a lack of available manpower to handle the number of pages produced. Seybold [Seybold, Fundamentals] outlines many of the concerns and difficulties with page layout or area composition addressed by commercial typesetting suppliers. 
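Breaking galleys into pages by simple depth measurement can be sketched as follows. This is an illustrative sketch in Python, not the PROFF algorithm itself; the function name `break_into_pages`, the line depths, and the page depth are assumptions, and the sketch ignores footnotes, widows, and figures.

```python
def break_into_pages(line_depths, page_depth):
    """Greedily assign lines to pages by accumulated depth.

    line_depths: depth of each typeset line, in points.
    page_depth: usable depth of a page, in points.
    Returns a list of pages, each a list of line indices.
    """
    pages = []
    current = []
    used = 0
    for index, depth in enumerate(line_depths):
        # Start a new page when the next line would overflow this one.
        if current and used + depth > page_depth:
            pages.append(current)
            current = []
            used = 0
        current.append(index)
        used += depth
    if current:
        pages.append(current)
    return pages


# Ten 12-point lines on a page 48 points deep: four lines fit per page.
print(break_into_pages([12] * 10, 48))
```

A measurement of this kind is workable only because every line has a known depth before the page is assembled; it makes no attempt to improve the appearance of the break.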
More elaborate typesetting systems for high speed typesetters, like the Page-1 composition language [Pierson, PAGE-1] for the RCA Videocomp, permitted more complex page breaking logic that made better use of the typesetter for very large documents. Page-1 is one of the few early composition systems with widely available published documentation. A programmer could write a page breaking algorithm and style handling routines in the Page-1 language, have them compiled, and then execute the resulting composition program against the document input data. 2.4 Document Compilers A significant stage in the evolution of document formatters occurred when the embedded formatting commands in the document began to describe the logical content of the document. These logical commands, or formatting tags, require an additional level of indirection to associate the detailed formatting attributes with each tag. Initially this association was provided by a macro processor. Each tag was treated as a macro name. Expanding the macro produced the primitive formatting commands necessary to format that part of the document. For example, with logical commands one could specify that a part of a document was a heading. By including a tag like .heading, one could replace a sequence of detailed commands like ``leave 24 points of whitespace, select Times Roman bold typeface, use 14 point type size, and produce unjustified line endings.'' Later systems like Reid's Scribe introduced the notion of compiling a document [Reid, Scribe thesis]. The tags in a Scribe document identify the document parts; these are compiled using a tag definition database that supplies the formatting attributes for a suite of built-in formatting algorithms. Since more processing is required to interpret macros or compile a document, the development of document compilers was restricted to large general purpose computer systems, typically in universities and industrial research laboratories.
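The indirection from a logical tag to primitive formatting commands can be sketched as a table lookup. The sketch below is in Python; the primitive command names (`.space`, `.font`, `.size`, `.nojustify`) and the macro table are hypothetical and do not correspond to the command set of any real formatter. Only the values in the `.heading` example come from the text above.

```python
# Hypothetical style table: each logical tag expands to primitive commands.
MACROS = {
    "heading": [
        ".space 24pt",
        ".font TimesRoman-Bold",
        ".size 14pt",
        ".nojustify",
    ],
}


def expand_tags(lines, macros):
    """Replace each line of the form '.tag' by its primitive commands.

    Lines that are not known tags pass through unchanged.
    """
    output = []
    for line in lines:
        tag = line[1:] if line.startswith(".") else None
        if tag in macros:
            output.extend(macros[tag])
        else:
            output.append(line)
    return output


manuscript = [".heading", "Document Compilers"]
for line in expand_tags(manuscript, MACROS):
    print(line)
```

Changing the entries in the table changes the appearance of every heading without touching the manuscript, which is the property that makes tags a vehicle for style.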
Most commercial graphic arts systems remained on less expensive and smaller mini-computers and chose not to provide these `more expensive' features. This introduction of document compilers coincides with increasing support for document style. The indirection from tags to detailed formatting instructions emulates the style sheet concept used by graphic designers. Designing tag macros or formatting databases is separated from the marking up of a manuscript. A document style can be shared among a set of documents, for example, among the chapters of a book, the theses written at a university, or journal articles submitted to a particular journal. With such tools, authors who lack the skills for document design can still produce good-looking documents by choosing a document style database and inserting the appropriate tags within their document. The separation of document design and markup enables the document content to be reused in different situations by changing the style definitions associated with the formatting tags, without changing the manuscript or the tags themselves. At Bell Laboratories, where troff was developed, a manuscript could be published in three forms: first as an internal memorandum circulated within the lab, second as a technical report cleared for external review, and finally as a published journal article. A single set of tags within the document sufficed; different style parameters were substituted for each of the three forms. The notion of compiling a document implies a massive undertaking. Indeed, problems with compiling monolithic documents occur frequently. Large documents often evolve from smaller ones rather than being planned; they require more computing resources to format, lengthen turnaround time, and delay the production of drafts of the document. There is a constant tension between the simplicity of making the document out of smaller modules and the complexity of managing the pieces.
Even simple tasks, like numbering pages sequentially across pieces, can be difficult with some document compilers. Document compilers exhibit debugging problems similar to those found in compilers for programming languages. An example of a bug in a compiled document is the production of fifty typeset pages with a column width of 1.5 inches because the logic of a macro failed to reset a temporary change in line width. Debugging tools for document compilers have been modeled on program debugging tools: syntax checkers, simulators of the final output device on less expensive or faster devices, and interactive previewers to display the typeset document on a graphics display. The complexity of writing document format designs in the language of the document compiler leads to the need for `gurus,' `experts,' and `wizards,' just as for complex programming languages. Certainly, document compilers make some kinds of changes much easier. For example, correcting a chapter heading in one place can automatically affect the chapter opening, the table of contents, and the running heads for that chapter. Some aspects of documents may not be handled very well or at all by a particular compiler. Difficult composition features are sometimes left for future development, such as mathematical and tabular composition, the incorporation of line drawings and scanned images, or complex page layout designs. The inability to integrate all aspects of the document leads to special handling of the unintegrated parts of the publication, resulting in pasting up artwork for illustrations or special notation typeset separately. Other typographic problems may require specific commands to override the automatic compiled algorithms, such as forcing page breaks to avoid one-line widows, and inserting explicit line breaks to avoid rivers of whitespace or awkward hyphenation problems.
Final corrections and revisions are frequently done by manual cut and paste methods because recompiling the corrected document would take too long or would create new problems, especially with page breaks. Of course, various document compilers do better than others with these problems. The following sections describe aspects of three document compilers in widespread use, troff on UNIX, the Scribe portable document compiler, and Knuth's TEX. Of special interest will be the way these systems handle document style, mathematics composition, illustrations, table formatting, and page layout. The survey articles mentioned above [Brader, Incremental Formatter] [Furuta, Survey] discuss additional aspects of document compilers. 2.4.1 troff The troff document formatting language developed by Ossanna [Ossanna, troff] and distributed for UNIX systems is perhaps the most widely used document compiler. The earliest UNIX application was preparing patent applications with troff [Ritchie, Turing Lecture, p 758]. troff encompasses a family of document compilers. All accept the same formatting commands but differ in their formatting algorithms, which are sensitive to output device characteristics: nroff formats for typewriter and line-printer devices with fixed width characters; troff formats for typesetting devices with multiple fonts and variable width characters. Porting troff to other typesetting devices was very difficult. An output device independent version, ditroff [Kernighan, ditroff], was created by Kernighan to handle a wide variety of typesetting devices and laser printers, although the formatting algorithms were essentially unchanged from troff. The troff formatting language has remained essentially constant since the late 1970's. There are primitive functions for controlling the formatting algorithms and the output device, establishing parameter values, selecting type fonts and sizes, positioning characters, and drawing lines. 
Additional primitives provide programming support for writing macros and building data structures, such as strings and diversions of formatted text. The command name space is severely limited to two-character tags. Generally lower case letter tags are reserved by convention for troff primitive commands and combinations of upper case letters and graphic symbols are used for macro commands. Commands are embedded in the document, either occupying an entire line of input beginning with a command character, or included within lines of input delimited by a backslash character. The strength of the troff document formatting system is the collection of tools implemented as preprocessors. These preprocessors include tbl for formatting tables [Lesk, tbl], eqn for typesetting mathematics notation [Kernighan&Cherry, eqn], pic [Kernighan, pic] and ideal [van Wyk, ideal] for drawing illustrations, and refer [Lesk, refer] for producing bibliographic references. The filter/pipe model from UNIX has determined the architecture of the troff document formatting system. The filter model forces the document file to be a linear stream of characters. Each tool reads the entire document file and produces a modified version for the next tool in the pipeline. The recommended processing order is refer, pic, tbl, eqn, then troff, a convenient order for the majority of documents. Occasionally, when it is not possible to establish a sequential processing order, this scheme breaks down and elaborate techniques to break circular dependencies are needed. Otherwise, the material cannot be formatted by troff. Nonetheless, collecting several types of diverse content in a complete document manuscript is more convenient for the author than managing the separate pieces. Each tool distinguishes its commands in some unique way. For instance, eqn processes embedded mathematical notation by recognizing its own delimiters different from other formatting commands. 
This leads to information being hidden among the various tools. For example, the spelling checker does not investigate any misspelled words inside eqn or tbl commands even if they are English phrases. Some document commands are treated differently at different stages in the pipeline. For example, .TS and .EQ are tbl and eqn commands respectively to begin formatting tables and displayed equations. Later, these commands are passed on to troff, which treats them as ordinary macro commands to lay out a particular table or displayed equation. An unfortunate consequence of executing the macro processor last in the troff pipeline is the preclusion of style facilities or indirect definitions of formats for tables, mathematical notation, illustrations, or any other preprocessor to troff. Some preprocessors furnish their own simple and different macro languages while some users invoke their own preprocessor to provide the missing macro facilities. Yet the unifying concept of the pipe mechanism provides the troff document formatting system with its simplicity. Should the need arise, it is easy to create your own tools to solve difficult document content problems. The ubiquitous document model of a stream of characters with embedded commands makes this possible. Document style in troff is provided by its macro packages. Two frequently used packages are the -ms and -me packages, the former created by Mike Lesk at Bell Labs [Lesk, -ms] and the latter by Eric Allman at UC Berkeley [Allman, -me]. Macro packages provide two alternative techniques for defining different document styles: one can either parameterize the behavior of the macros or replace the macro package with another that defines the same commands with different effects. As an example, the Bell Labs -ms package can format title pages in different ways by initializing parameters from the .RP command (released paper format) rather than the .TM command (Bell Labs technical memorandum format).
Also, several variants of the -ms package exist for formatting documents in styles suitable for the Journal of the ACM, Communications of the ACM, and ACM conference papers [Johnson, CACM]. Mathematics composition is provided within troff by the eqn preprocessor. This mathematics typesetting system has become widely emulated and variations have appeared at other research centers [Gruhn, YFL] and in commercial typesetting systems [Alexander, Micros]. The basic technique is to define a notation language that expresses various two-dimensional relationships among boxes. These relationships may affect the size of boxes, such as making brackets larger around large fractions, or their relative arrangement, such as positioning superscripts and subscripts. eqn knows nothing about mathematical concepts or the actual dimensions of the boxes. It relies on the author to provide the precise spacing or line breaking of mathematical notation, and on troff to do the actual positioning and formatting of the boxes. Unfortunately, eqn guesses about some size relationships and it must be told the current type size explicitly. eqn provides built-in relationships for common mathematical notation, but the set of notations is not extensible. However, the eqn macro facility (separate from troff) and some low-level positioning primitives do provide an escape mechanism for creating new notation as macros. As there are different versions of troff for different device classes, there are also two versions of the mathematical typesetting system: neqn for typewriter devices and eqn for typesetting. This lack of knowledge of mathematical concepts in eqn is both a strength and a weakness. Without any knowledge in the mathematical formatter, one is forced to supply in tedious detail all the necessary spacing for operators. On the other hand, the absence of built-in knowledge avoids having to circumvent inadequate rules when they must be broken. 
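The idea of composing boxes in two dimensions can be sketched as follows. This Python sketch is not eqn's algorithm or data structure; the `Box` class, the functions `beside`, `superscript`, and `fraction`, and the shift and gap values are all assumptions made for illustration. Dimensions are in arbitrary units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    width: float   # horizontal extent
    height: float  # extent above the baseline
    depth: float   # extent below the baseline


def beside(left, right):
    """Set two boxes side by side on a common baseline."""
    return Box(left.width + right.width,
               max(left.height, right.height),
               max(left.depth, right.depth))


def superscript(base, script, shift):
    """Attach a script box to the right of base, raised by shift."""
    return Box(base.width + script.width,
               max(base.height, shift + script.height),
               max(base.depth, script.depth - shift))


def fraction(numerator, denominator, gap):
    """Stack numerator over denominator.

    Simplification: the fraction rule is placed on the baseline, with
    the same gap above and below it.
    """
    return Box(max(numerator.width, denominator.width),
               gap + numerator.depth + numerator.height,
               gap + denominator.height + denominator.depth)


x = Box(width=5, height=7, depth=0)
two = Box(width=3, height=5, depth=0)
x_squared = superscript(x, two, shift=4)
print(x_squared)
print(fraction(x_squared, x, gap=1))
```

The enclosing box grows with its contents, which is the information a formatter needs in order to, for example, enlarge brackets around a tall fraction. Nothing in the sketch knows what the boxes mean mathematically.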
The two illustration preprocessors, pic and ideal, provide elementary facilities for including line drawings within documents. The two differ in the mechanisms for defining the line drawings: pic uses a line and curve paradigm while ideal uses nonlinear constraints to define boundaries and connected lines. The illustration tools have only rudimentary style facilities for solid and dashed lines and for arrow heads. More elaborate styles, such as various line weights, fancier arrows, textures, and shadows are not provided. Through the preprocessor architecture, and subject to the pipeline order of preprocessors, it is possible to include any troff material in the illustrations, including equations and formatted text. The tbl table formatter is a very comprehensive facility capable of formatting almost any table design. Evidence of its power is shown by one author who created boxed illustrations for his paper with tbl when a line drawing tool was unavailable [Rosenthal, Graphical Resources]. Even with its flexibility and generality for table formatting, it is awkward to achieve consistent table styles in tbl. The user of tbl must carefully specify all the layout parameters in a consistent fashion. There are no separate macro facilities within tbl to help, and the troff macro processor executes later in the pipeline, after tbl has processed all of the table information. There is no interactive design tool for tables to assist with the specification. An extensive search for such tools found only a prototype built by Biggerstaff as part of an experiment in object-oriented program design [Biggerstaff, TABLE], described more fully in Chapter 4. The page layout mechanisms in troff depend on two powerful ideas: traps and diversions [Witten, Traps]. A trap is a macro to be executed at a measured distance from the most recent page break in the output stream. The trap macro might, for example, emit the footnotes at the bottom of a page of text. 
A diversion is an alternate output stream, distinct from the normal stream which is directed to the output device or file. Internal diversions are used to capture formatted information, such as footnotes, for later inclusion on a page. Floating a table or illustration from where it occurs in the manuscript to where it will next fit on a page is accomplished with diversions in troff. When a table is first encountered, a new diversion is started to capture the formatted table. After the table has been formatted, the diversion is closed and the macro package can check for sufficient space on the current page to hold the diverted table. If there is room then the diversion is copied immediately onto the normal stream, otherwise it is held until the beginning of the next page. (Complications arise if the table is larger than the page but they need not be considered here.) Implementation restrictions within troff have persisted for a decade. troff is an old program, originally coded in assembly language for the DEC PDP-11, subsequently translated into C. It has changed little since. The facelift for device independent troff concentrated mainly on font data structures and generalizing the output device model. The most notorious restriction is the two-character name limitation. Macro packages for troff have used elaborate conventions to avoid catastrophe due to name conflicts. Each preprocessor requires some set of macro and register names to support its operations and therefore each reserves some set of two-character names. Users of troff must tread gently when creating their own macros or extending the existing packages to avoid name conflicts because the name space is so limited. Limited internal data structures are another annoyance, restricting the complexity of lines of text, the number of columns in a table, or how many entries may exist within a matrix. 
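The floating of a table through a diversion, described earlier in this section, can be sketched as follows. This Python sketch models only the decision logic; the class `PageBuilder`, its method names, and the depths used are hypothetical, and troff itself expresses this logic with macros, traps, and diversions rather than in a general-purpose language. As in the text, tables deeper than a page are not considered, and the sketch does not recheck for overflow after held tables are placed.

```python
class PageBuilder:
    """Toy model of a page stream with one queue of held floats."""

    def __init__(self, page_depth):
        self.page_depth = page_depth
        self.pages = [[]]   # each page is a list of (name, depth)
        self.used = 0       # depth consumed on the current page
        self.held = []      # diverted floats waiting for a new page

    def _place(self, name, depth):
        self.pages[-1].append((name, depth))
        self.used += depth

    def _new_page(self):
        self.pages.append([])
        self.used = 0
        # Held floats are emitted at the top of the new page.
        held, self.held = self.held, []
        for name, depth in held:
            self._place(name, depth)

    def add_text(self, name, depth):
        if self.used + depth > self.page_depth:
            self._new_page()
        self._place(name, depth)

    def add_float(self, name, depth):
        # The float has already been formatted into a diversion, so its
        # depth is known before deciding where it goes.
        if self.used + depth <= self.page_depth:
            self._place(name, depth)
        else:
            self.held.append((name, depth))


builder = PageBuilder(page_depth=100)
builder.add_text("para 1", 60)
builder.add_float("table 1", 50)  # does not fit on page 1, so it is held
builder.add_text("para 2", 30)    # text continues on page 1
builder.add_text("para 3", 30)    # starts page 2, after the held table
for number, page in enumerate(builder.pages, start=1):
    print(number, page)
```

The essential point is that the table is formatted before its position is chosen; the diversion supplies the measured depth on which the placement decision depends.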
The author of this thesis led the development of a document compiler, TYPESET [Beach, Typeset], to alleviate many of the shortcomings of troff. A small typesetting business, WATTYPE, used TYPESET extensively to create technical and scholarly publications for a variety of publishers, mainly in mathematics and computer science. TYPESET was built with full knowledge of the troff system and attempted to eliminate many of its limitations. The macro processor has a similar syntax and flavor to GPM [Strachey, GPM]. Conditional execution and control structures were added to the macro language, providing document layout programmers with more freedom in expressing their designs. Register names of any length were stored in a hashed symbol table. Most data structures were dynamic and could grow as necessary. Math typesetting was based on the eqn design [Kernighan&Cherry, eqn] but the implementation incorporated many enhancements to improve the quality of formatted equations and to facilitate more options for aligning equations and handling matrices. A table formatting package was built using the macro language and provided several additional typographic facilities suitable for style control over the table design. Pagination and layout algorithms were based on trap and diversion paradigms similar to Page-1 [Pierson, PAGE-1] and troff. TYPESET was not fully exploited because it required significant programming skill to design new documents and it lacked sufficient documentation. Nonetheless, it did serve well to express difficult book designs and produce competitive graphic arts quality typesetting. TYPESET serves as an interesting contrast in goals with Scribe, described next. troff is in widespread use and continues to have a significant impact on technical document production. Its strengths are the simple document model and the large number of preprocessors and tools that can manipulate documents. 
It provides the broadest range of functional capabilities for mathematics, tables, and illustrations. There are several difficulties that limit its effectiveness: implementation limitations, expensive and resource intensive computation, and restrictions on the inclusion of mathematics, illustrations, and tables. 2.4.2 Scribe The second document formatting system in widespread use is Scribe [Reid, Scribe thesis], which was the first to use the term `document compiler.' The goal for Scribe was to format documents in a portable fashion across document styles, across output devices, and across various computer system installations. Scribe is widely used among the ARPANET community and is distributed commercially [, SCRIBE]. The notion of the form of a document, or its style, as opposed to the content of a document or marked up manuscript was made explicitly separate in Scribe. Style information is maintained in a database under the control of a database administrator. The database contains various formatting environments, each specifying a vast number of formatting attributes. Normal users of the document compiler cannot create new environments or attributes. The compilation process takes in a marked up manuscript file and creates a formatted document file suitable for printing. Unlike troff with its macro packages, Scribe provides only built-in formatting algorithms that are parameterized by the environment attributes. The Scribe document formatting language is declarative only. Reid defends the absence of procedural facilities in the database language because 1) a procedural language would reduce the feedback from users when they were unable to do something in Scribe, and 2) without enforcement or user training, ``programmability invariably leads to a diversity of style'' [Reid, Scribe thesis, p108-109]. Reid concedes that an algorithmic language would increase the usability of the compiler (the goal of this author's TYPESET system). 
Reid concludes that: ``Furthermore, a programmed system implemented by a diverse variety of people without central control, namely the union of the procedural extensions with the basic system, will invariably be more obtuse and difficult to understand and use than a unified one.'' [Reid, Scribe thesis, p109] The Scribe system provides many services for document writers and Reid coined the phrase `writer's workbench' [Reid, Scribe thesis, p71] to describe the collection. Included among these facilities are the automatic collection of entries for tables of contents, indices, and glossaries, the automatic cross referencing within a document through symbolic labels, the collection and sorting of index entries, the management of large documents composed of many component files, and the extraction of bibliographic citations from a database of reference entries. The lack of an algorithmic formatting language has prevented the proliferation of special purpose formatting preprocessors like those for troff. There are some mathematical formatting capabilities, but they are of limited capacity, sufficient for some technical documentation but inadequate for documents with dense mathematics. The lack of recursion in Scribe has been a serious impediment to building a mathematical formatter; overprinting is the only readily available technique that can handle the recursive nature of mathematical expressions [Monier, Scribe math]. Table formatting in Scribe is simple to use, but again limited in functionality. The scheme is based on extending the notion of typewriter tab stops that define column boundaries. Scribe provides several typographic capabilities, such as centering within tab stops and filling with leaders, but more general facilities such as centering headings over several columns require changes to the tab stop settings. Scribe does provide a nonportable capability for scanned illustrations.
Image files scanned for a particular class of output device may be incorporated into a document. Scribe will manage the whitespace layout described for the image, but expects the device to output the image file. The major accomplishment of Scribe was its successful separation of form from content in a document. The database of formatting environments takes advantage of the special skills of document designers and shares the design among document creators. The range of document content beyond text is limited, and there are few options for building special purpose formatters or preprocessors due to the lack of a formatting language. The next document compiler deals directly with formatting algorithms. 2.4.3 TEX Donald Knuth has made document formatting a legitimate topic for study in computer science by his work on TEX [Knuth, The TEXbook]. The boxes-and-glue model serves as the basis for algorithmic research into document formatting. Three algorithms have resulted from this work: optimal line breaking [Knuth, Line Breaking], hyphenation [Liang, Hyphenation], and optimal page breaking [Plass, pagination]. The TEX document compiler incorporates this model and these algorithms to provide a comprehensive document formatting system that extends beyond text to mathematics, tabular matter and complex typography. Typesetting mathematics was a primary goal of Knuth's work on TEX [Knuth, AMS lecture]. The notion of composing mathematical expressions from boxes surrounding each character and composing equation boxes for the arrangement of other boxes applies directly to mathematical notation. TEX relies on both a large font library to represent mathematical symbols, and a set of positioning operators. The TEX typesetting language is a linear expression of the boxes-and-glue model for formatting two-dimensional notation. Related work on METAFONT [Knuth, METAFONT] resulted in a font design tool capable of producing the many mathematical symbols and alphabets used in TEX. 
METAFONT relies on linear optimization and equation solvers to determine the outline shape of character designs specified by small METAFONT programs. TEX provides a macro definition capability which permits the introduction of shorthand for complex formatting commands and repeated text. There have been a small number of macro packages developed for TEX. Perhaps the most widely distributed is Lamport's LaTEX package [Lamport, LaTEX]. Document layout is expressed through implicit controls in TEX. TEX uses one global algorithm for many layout situations. Thus when breaking lines, there is no explicit notion of centering. Instead one surrounds centered text with two gobs (a technical term in TEX) of glue with large but equal stretchiness values. The justification (glue-setting) algorithm fixes the glue size to accommodate all the boxes within the given line measure. Similarly, page justification algorithms are influenced indirectly by gobs of glue between lines of text or parts of a page to accomplish the vertical layout. Plass's work with dynamic programming optimization algorithms led to the development of line- and page-breaking algorithms for TEX [Knuth, Line Breaking] [Plass, pagination]. The optimization goal is to minimize some badness criterion, such as the sum of penalties for breaking a line in some way. Examples of line-breaking penalties are inserting a hyphen between syllables in a word, hyphenating very short syllables, and introducing hyphens in two or more successive lines in a paragraph. Given the set of boxes and the set of penalties, the optimization algorithm determines the optimal break points. The line-breaking algorithm is part of the current TEX82 release, but the page-breaking algorithm is not because it is too resource-intensive and/or too slow. Sometimes the algorithms in TEX produce beautiful results, but they require very clever designers.
In this regard, the following summary comment appeared in the Seybold Report on Publishing Systems when discussing Tyxset, the first implementation of TEX available commercially on a microcomputer: ``There are, however, some serious flaws. The greatest of these is the need for access to Xenix and TEX `gurus.' This would be necessary, we think, for all but the most trivial work.'' [Alexander, Tyxset, p 14] Many situations that can be handled by TEX are collected into the `Dirty Tricks' appendix of The TEXbook. One example of both the power of TEX and the excessive cleverness required to master TEX is the inclusion of leaders in an index entry [Knuth, The TEXbook, p 392-394]. The example in Figure 2-4 provides the TEX codes needed to format the given input for various line measures. Tables can be handled within TEX. However, TEX tables rely on horizontal and vertical justification primitives that align in one direction or the other but not both simultaneously. The TEXbook demonstrates the ability of TEX to reproduce some of the sample tables from the tbl manual as evidence of the functionality of TEX. TEX is valuable for the algorithmic foundations it brings to document formatting. The interface to those algorithms remains a stream document model without any structure. No WYSIWYG interactive composition system is yet based on TEX. 2.5 Integrated Composition Systems An integrated document composition system provides a more direct way of working with documents. One aspect of integration is the combining of the editing and formatting tasks. Document compilers require one to first create or edit a separate manuscript file, then to ask the compiler to turn it into a formatted document. In an integrated system, changes to the document become visible as they are made. Another aspect of integration is the integration of a variety of document content beyond simple text, such as line drawings, scanned images, mathematics and tables.
The first integrated composition system was Engelbart's NLS, developed during the late 1960's. The NLS system introduced the notion of `what you see is what you get,' presented a visual interface to accelerate human understanding of the document, accepted direct manipulation of the document structure and appearance, separated the form of the document from its content, and integrated many abstract objects into a uniform representation.
====================
///Beach/Thesis/Figure2-4-CleverTeX.press leftMargin: 72 pt, topMargin: 82 pt, width: 232 pt, height: 239 pt
\hyphenpenalty10000 \exhyphenpenalty10000 \pretolerance10000 % no hyphens
\newbox\dbox \setbox\dbox=\hbox to .4em{\hss.\hss} % dot box for leaders
\newskip\rrskipb \rrskipb=0.5em plus3em % ragged right space before break
\newskip\rrskipa \rrskipa=-0.17em plus-3em % ragged right space after break
\newskip\rlskipa \rlskipa=0pt plus3em % ragged left space after break
\newskip\rlskipb \rlskipb=0.33em plus-3em % ragged left space before break
\newskip\lskip \lskip=3.3\wd\dbox plus1fil minus0.3\wd\dbox % for leaders
\newskip\lskipa \lskipa=-2.67em plus-3em minus0.11em % after leaders
\mathchardef\rlpen=1000 \mathchardef\leadpen=600 % constants used
\def\rrspace{\nobreak\hskip\rrskipb\penalty0\hskip\rrskipa}
\def\rlspace{\penalty\rlpen\hskip\rlskipb\vadjust{}\nobreak\hskip\rlskipa}
\uccode`~=` \uppercase{
 \def\:{\nobreak\hskip\rrskipb \penalty\leadpen \hskip\rrskipa
  \vadjust{}\nobreak\leaders\copy\dbox\hskip\lskip
  \kern3em \penalty\leadpen \hskip\lskipa
  \vadjust{}\nobreak\hskip\rlskipa \let~=\rlspace}
 \everypar{\hangindent=1.5em \hangafter=1 \let~=\rrspace}}
\uccode`~=0 \parindent=0pt \parfillskip=0pt
Figure 2-4. CLEVERNESS REQUIRED TO MASTER TEX is exemplified by one of the `dirty tricks' from The TEXbook, page 392-394. The top three examples show the same index entry formatted with different line lengths. Note that the dots behave differently depending on whether the two parts fit on the same line or not.
The TEX code fragment is shown at the bottom. ==================== In the following survey, several more recent integrated document composition systems are reviewed. The first two, Etude and Janus, are research projects and deal mainly with document structure and interaction issues. The Xerox Star system tackled the integration of various document contents from the perspective of an office information system rather than a typesetting system. Evolving research at Xerox PARC into document structure and integrated composition will be highlighted briefly. 2.5.1 Etude The Etude project at MIT [Ilson, Etude] [Hammer, Etude] was part of a larger office automation research project. Etude (easy to use display editor) concentrated on the integration of document editing and formatting. The document formatting functionality was similar to Scribe while the internal formatting model was based on TEX's boxes-and-glue model. Etude operated on `high level typographical objects,' such as a chapter, section, paragraph, or italic phrase. A document design database mapped these objects into formatting attributes. The system was designed to be a `what you see is what you get' document composition system with a high standard of typographic quality, oriented towards the `professional user.' User interface issues and minimizing training were major goals of the research effort in Etude [Good, Etude interface]. Documents in Etude are hierarchical structures. A document exists as two structures, one for the internal representation of the content and another for the representation of the formatted document broken into lines, columns and pages. The document structure has the potential for accommodating nontextual content but this has not been described in published papers. Support for mathematics and tables appears to have been deferred. Etude supports the notion of a document style by providing formatting environments for each high level typographical object. 
These environments supply attribute values for formatting parameters. Relative values are permitted and the current attribute value is determined by an inheritance scheme that traverses the path from the root to the current node in the hierarchical structure. The Etude formatter displays changes as they are entered into the document. An incremental formatting algorithm minimizes the recomputation necessary to display the changes. Extra state information is maintained in the formatted document structure to support the incremental displayer. The incremental algorithms provide the basis for developing formatters and displayers for more general document objects represented as boxes and glue. 2.5.2 Janus The Janus project at IBM Research [Chamberlin, JANUS] took a slightly different approach to integrating document composition. The Janus workstation uses two different displays of the document, one the abstract manuscript file and the other the formatted document. Editing changes are made to the manuscript file and periodically the formatted document view is updated. Janus is a declarative formatter like Scribe, rather than a procedural document compiler like troff or TEX that relies on macro packages to execute primitive formatting operations. A declarative tag on part of the document annotates the intention of an author to compose a heading or an itemized list. The document may contain images as well as text since the tag may interpret the document content as it wishes, perhaps as words of text, or as a line drawing or scanned image. Incorporating mathematical and tabular material is planned but has not yet been accomplished. The definition of tags involves specifying the names of the tags, coding a tag action routine, and designing a page layout template. The tag markup language is a direct descendant of the `Generalized Markup Language' of IBM's Document Composition Facility [, DCF] [, GML]. 
A document designer creates a library of tag routines that captures the formatting attributes and layout actions related to the tagged content. Tag routines are coded in a Pascal-like language. Page templates control the placement of the tagged object and are designed using a graphical design tool. The Janus formatting algorithm is based on the boxes-and-glue model of TEX. Tag routines produce boxes that are collected into a galley. Out of sequence boxes, such as the text of a footnote or a floating illustration, are represented by an anchor in the galley pointing to a galley fragment. The packer algorithm places boxes from the galley into the page templates. The resulting structure is the formatted document. Janus provides for local intervention in the final positioning of boxes on a page by moving pieces of the document. Any such changes are lost when the document is reformatted. The published papers on Janus give no specifics on how illustrations are accommodated by the tag routines. The tag routines provide a very general capability for interpreting style and rendering nontextual content. The Janus prototype was expected to provide the base for future research in formatting mathematics and tables, but no published information is available. Janus provides insight into the customization of formatting documents by providing a formatting language for writing the tag routines to accommodate new classes of document content objects. To support a new class, one writes a tag routine and links it into the existing software. The style machinery in Janus is not centralized; each tag routine can accept its own set of attributes. This will cause difficulty with tables where the style information comes from the document containing the table, the table, and from each row and column. 2.5.3 Xerox Star The Xerox Star [Smith, Star Interface] [Seybold, Xerox's Star] approaches integrated document composition from the office information perspective.
Since office documents are its major focus, Star supplies less capability to describe and control all of the possible typographic features. Nonetheless, Star integrates several classes of document objects, such as text in various fonts and sizes, simple business graphics, mathematical notation, tables, and forms. The Star user interface design is carefully managed to ensure that common actions operate across all document object classes. For instance, the act of copying part of a bar chart is the same as copying part of a mathematical formula or a part of a sentence of text. The style mechanism in Star provides only for the specification of individual formatting attributes. There is no indirection or central registry of named collections of attributes; changing the appearance of an entire document requires changing the attributes of each instance. Illustrations created by Star Graphics [Lipkie, Star Graphics] are mainly simple business graphics. Predefined categories of graphic images are provided: bar charts, pie charts, and simple line drawings such as organization charts. Scanned images are not supported. The centrally designed user interface extends to illustrations and its property sheet mechanism provides style attributes for graphic illustrations. Mathematical notation is handled very well by Star. The notation is displayed in a WYSIWYG fashion using special fonts for the mathematical symbols. When symbols take their sizes from the expressions nearby, such as summation or large parentheses, these symbols grow automatically as the expressions are changed. The mathematical notations are built-in and not extensible, but cover most of the rudimentary algebraic notation. Star also supports the interactive editing and formatting of tables and forms. Tables are defined as a matrix of rows and columns with some distinguished entries spanning multiple columns. 
Table appearance is controlled through specifications in a set of property sheets for the table, the rows, columns, headings, and the rules (or lines) within the table. Because the fundamental premise for Star assumes an office environment, there are many extensions and quality issues that were avoided. Adapting the Star to more typographically demanding environments will require addressing many of these design issues. 2.5.4 Xerox PARC Research Xerox PARC has continued to research integrated document composition systems. Prior to the development within Xerox of the Star office workstation, PARC had built the Bravo editor [Lampson, Bravo]. Bravo is an integration of both the editing and formatting of text and provides a WYSIWYG display of the document. When a document was to be printed Bravo produced a Press format file. The Press format is a device independent representation of the marks on paper. Text, line drawings, and scanned images are all treated in a resolution and device independent notation. However, Bravo could not incorporate illustrations directly and relied on the PressEdit utility to merge Press-file versions of the illustration into the document Press file. This digression into how illustrations were incorporated in documents serves to highlight the distinction in meaning between integrated editor/formatters and integrated document composition systems. An integrated editor/formatter permits editing a text document which you see presented in the form that it will be printed; an integrated document composition system stresses the integration of various kinds of document content, typically text and illustrations, into a single integrated document. The Tioga editing system that is part of the Cedar programming environment [Teitelman, Cedar] attempted to provide an extensible document structure to integrate document object classes. Tioga is noteworthy for its document structure and style machinery. 
The hierarchical node structure in Tioga evolved from NLS. Each node contains the document content and formatting properties. The editor provides interactive operations on that structure, such as selecting subtrees of nodes, displaying nodes at various depths, and applying properties to subtrees. Special node properties can be interpreted by the formatter to extend the kinds of content supported. Current artwork properties designate Press files, scanned images, line drawings, and tables. The Tioga style machinery maps node properties into formatting attributes. Style rules are explicitly named, written in an interpreted style language, and stored in style dictionaries. Attributes for a node are inherited through the node hierarchy. Relative attributes, such as ``make the indent of this node so much more than the indent of its parent,'' are easily accommodated. The style machinery is extensible by defining new style dictionaries and new attributes. The implementation has not progressed beyond a robust integrated editor/formatter with some extension capabilities. Tioga served as the test bed for various experiments in document objects discussed in later chapters of this thesis. 2.5.5 WYSIWYG, or is it? What is it that you get with WYSIWYG formatters? There are two answers: either a preview of the final appearance or the real appearance of the document on the output device. If what you get is the real appearance, the acronym might be renamed `what you see is all that you get.' When a WYSIWYG formatter displays the real appearance, there is no compensation for the difference in resolution between printers and displays, sometimes a ratio as high as 20:1. In true WYSIWYG the final hardcopy output is artificially limited by the display technology. The MacWrite formatter produces printed output on the ImageWriter that is almost identical to the displayed output because the screen and printer fonts are the same [Seybold, MacWrite].
If what you get is a preview, then the acronym should be `what you see is almost what you get.' Resolution differences result in positioning inaccuracies and limited font discrimination on the display. With limited resolution and small type sizes, there are simply too few bits to convey the distinctions between typefaces like Times Roman, Garamond, and Baskerville. Therefore WYSIWYG formatters often provide only generic typefaces that distinguish major type families and characteristics, like serif versus sans serif, or bold versus italic. The Xerox Star provides this kind of preview capability. A WYSIWYG formatter constitutes a dilemma. On the one hand, creators of documents may tend to spend excess time and energy on the wrong aspect of document production. Instead of creating information, there is a tendency to make beautiful documents of little value. On the other hand, authors are notorious in the publishing world for making substantial author's alterations only after their manuscript is typeset. A typeset manuscript is more readable so authors read it more carefully and often see a different message in that context. Significant benefits in improving the quality of documents accrue to authors working with a document in its close to final appearance, using the ability to incorporate those insights into the manuscript prior to publication. WYSIWYG formatters are expensive. Displaying typographic fonts implies using higher resolution displays and more computational power than simple text editors. These formatters are more complex because they must maintain formatted data structures while accepting changes to the document. Operating all the formatting controls requires investing time in learning to use such a formatter, although the time can be reduced with good user interface design such as evidenced by the Xerox Star. Further study of these problems seems warranted, but is beyond the scope of this thesis. 
The work reported here is designed for future incorporation into WYSIWYG systems. 2.6 Document Content Models and Views of Documents The representation of documents changes radically through this survey of document composition systems. Traditional graphic arts processes produce a master document only on paper or photographic film. Electronic processes produce computer files for the electronic master document with all the text, illustrations, and formatting information included. The prevalent document representation is a simple stream of text with embedded commands, as used by troff, Scribe, TEX, and Janus. This simple model is ubiquitous and may be used by preprocessing tools for these formatters. More complicated representations involving a structured document organized into a hierarchical tree or linked directed graph structure are used by NLS, Tioga, and Etude. Typically the structure contains property lists or a labelling of the content associated with parts of the structure. Abstract document structures have recently been studied by Kimura [Kimura, thesis] [Kimura&Shaw, Abstract Documents]. This representation is a graph-like structure composed of abstract document objects. The abstract objects are mapped into concrete objects by a formatting process and these concrete objects are made visible by a viewing process. This structure has introduced ordered and unordered sets of objects, and the sharing of objects within the document structure. Structured documents have several advantages at the expense of a more complex representation. The scope of operations can be specified in terms of substructures of the document, such as rearranging sections within a chapter, the items in a list, or rows and columns of a table. A class mechanism for building extensible document object classifications can be superimposed on the document structure.
Each object in the document structure can have a content class associated with it, and a specific set of procedures to perform editing and formatting operations. A distinct advantage of such a class mechanism is that all text is marked as text even if it is used in the context of other objects, such as illustrations or tables. Therefore spelling checkers can check all of the text objects anywhere in the document when it is represented as a structured document. More general document structures, such as integrated documents and databases, are suggested in Chapter 6. As the document structure becomes more complex, the difficulty in managing the pieces of a document increases. Almost all formatters provide a mechanism to include parts of a document within a larger document. Scribe does the best job of managing components of large documents, especially cross references. However, few schemes exist to provide interactive formatters for nonhierarchical document structures. Another advantage of structure within the document representation is the ability to present the reader with different views of the information. Hierarchical structures permit showing only a few levels of the hierarchy (level clipping) to reveal the outline structure of the chapter and section headings. NLS also provided a view of only a few lines of each paragraph (line clipping) to compress more thoughts onto the same display space [Engelbart, NLS]. Additional views might be based on selecting content matching a pattern string or on matching properties of the reader to the nodes in the document structure [van Leunen, One Document]. Cargill's notion of different views of software based on configuration properties [Cargill, Views] demonstrates how a single comprehensive structure, possibly containing redundant information, may be accessed to produce several different configurations of a document.
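Two of the advantages claimed above for structured documents, level clipping and the ability of a tool such as a spelling checker to reach every text object wherever it occurs, can be illustrated with a small sketch. It is a hypothetical Python model and not the representation used by NLS, Tioga, or Etude: the `Node` class, its `kind` field standing in for a content class, and the functions `level_clip` and `all_text` are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # content class, e.g. "chapter", "section", "table", "text"
    content: str = ""
    children: list = field(default_factory=list)


def level_clip(node, levels, depth=0):
    """Return indented lines for the nodes in the first `levels` levels only."""
    if depth >= levels:
        return []
    lines = ["  " * depth + node.content]
    for child in node.children:
        lines.extend(level_clip(child, levels, depth + 1))
    return lines


def all_text(node):
    """Collect the content of every text object, whatever object contains it."""
    found = [node.content] if node.kind == "text" else []
    for child in node.children:
        found.extend(all_text(child))
    return found


doc = Node("chapter", "Document Composition", [
    Node("section", "Traditional Techniques", [
        Node("text", "Few people understand how books are produced."),
    ]),
    Node("section", "Integrated Systems", [
        Node("table", "Systems surveyed", [
            Node("text", "Etude"),
            Node("text", "Janus"),
        ]),
    ]),
])

print("\n".join(level_clip(doc, 2)))   # outline: chapter and section headings only
print(all_text(doc))                   # text inside the table is found as well
```

Because the table entries carry the same content class as ordinary paragraphs, `all_text` reaches them without any knowledge of tables, which is the property a spelling checker would rely on.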