Building the Cedar TeX82

To            The TeX wizard
Date          February 14, 1984
From          Lyle Ramshaw
Location      PARC
Subject       Building TeX in Cedar
Organization  CSL

XEROX

Filed on BuildingTeX.tioga in Tex.df
Last edited by Ramshaw.pa, February 15, 1984 9:23:46 pm PST

Abstract

TeX82 is a Pascal program that Michael Plass and Lyle Ramshaw succeeded in porting into Cedar. This memo describes some of the funny things that we did and why we did them. After a Cedar programmer reads this memo, she should be able to rebuild TeX from the sources, taking it step by step through all three of the compilers involved.

The big picture

TeX is distributed from Stanford as a source file in the Web language. Web is a programming system built by Prof. Donald E. Knuth of Stanford University expressly for the purpose of producing a TeX implementation of publishable quality. A Web source file consists of executable code fragments in Pascal and documentation fragments in TeX, combined with macro definitions of various flavors. This source file is operated on by two translators: Tangle takes it, evaluates all of the macros, throws away the documentation, and produces one huge and complete Pascal program; Weave decorates the source file with still more typesetting stuff and produces a TeX source file for a beautiful listing of the program, complete with documentation and indices of various sorts.

As a programming language, Mesa is almost a superset of Pascal; and the syntaxes of the two languages differ only in the details. Capitalizing on this, Edward M. McCreight wrote a Pascal-to-Mesa translator some years ago; it takes Pascal programs, parses them by recursive descent, and outputs the source of an equivalent Mesa program. I converted PasMesa to run in Cedar, and used the resulting PasMesa to port Knuth's Tangle processor. Then, Michael Plass and I used PasMesa and Tangle to port TeX itself. As of this writing, no one has yet bothered to port Weave, although it wouldn't be hard.
Thus, TeX goes through three compilers. First, it goes through Tangle to be converted to Pascal. Then, it goes through PasMesa to be converted to Mesa. And finally, it goes through the Cedar compiler and binder to be converted into a runnable BCD. At each step in the process, there are special files on the side that adjust how things are done and supply small changes.

The introductory documentation for TeX is in the file TeXDoc.tioga, available through TeX.df.

It's not what you know, it's who you know.

Don Knuth at Stanford wrote TeX. But he is a busy man, and he doesn't know too much about the detailed hassles of porting TeX to other machines. You probably want to ask your question of Dave Fuchs instead. He has guided innumerable people through ports of TeX, and he is generally very helpful. If your question concerns fonts, either fonts for Dovers or TFM files, you might also be interested in speaking with Arthur Keller, who maintains the Dover font dictionary at Stanford.

Fine points about TeX itself

TeX's memory array. TeX does all of its own storage allocation by working out of a big array. That makes good sense from one point of view, since it would have been risky for Knuth to trust the storage allocators of many Pascal compilers. But it is rather a pain from our point of view, since the Mesa world is not prepared to handle very large arrays. The biggest block of storage that the normal allocator can construct is only 128 KBytes. Thus, it was not possible to implement TeX's memory array as a Cedar array. Instead, we took advantage of the ProcArray feature of PasMesa (which I had implemented to handle a similar problem with Tangle). When you ask for a ProcArray, PasMesa compiles the Pascal array of thingies into a Mesa inline procedure that takes the index type (or types) of the array as an argument (or arguments) and returns a long pointer to a thingy as result.
Every use of the array in either a fetch or a store gets a dereference tacked on by PasMesa as well, so that everything works. TeX's main memory is an array of TexTypes.MemMax (currently 58,000) 32-bit thingies, implemented as a ProcArray named Mem. The storage for the array is actually allocated by the start code of TexSysdepInlineImpl by calling VM.Allocate. The body of the inline proc that does the array calculation is in the definitions module TexSysdepInline.

By the way, we used inlines only very sparingly in the TeX port because of their tendency to exacerbate the problem of modules too big for the Mesa compiler. The memory ProcArray and a few calls to SirPress for showing characters are the only inlines used. In particular, the file is input via calls to the PascalRuntime that turn into calls on IO.GetChar, doing at least two xfers per character. I have done essentially no performance work on TeX, so I don't know if this is a significant performance problem.

Writing Press files. The current TeX gives the user the choice of writing either DVI or Press file output. I left the DVI stuff in just to make the TRIP torture test easier to perform; I presume that everyone will actually use Press format, at least as long as the Dovers are our workhorse printers. When the Imager comes along, TeX should probably be converted to it from its current dependence upon SirPress.

I wrote the Press output module for TeX by taking the DVI output module and replicating the procedures hlist_out and vlist_out, giving two pairs of procedures: hlist_out and vlist_out for DVI output, and hlist_press_out and vlist_press_out for Press. The file TeX.changes includes a long change that changes the DVI output code from itself to itself; this is so that any bug fixes in that code that might occur in later releases of TeX will cause error messages from Tangle, pointing out where the corresponding Press procedures might have to be changed. In general, the Press code is just a simplification of the DVI code.
For example, the DVI code has to keep track of the stack of positions, while Press has no stack, just one global position. The DVI code also has lots of fanciness in it for doing what is essentially register allocation on the DVI abstract machine; none of that is relevant for Press.

There are two non-trivial parts to the Press code. First, as an efficiency move, I chose to use SirPress pipes to handle the character output. Thus, if TeX is walking along a horizontal list just outputting character boxes, spaces, and kerns, all of these commands will be stored in a pipe and dumped all at once by a call to ClosePipe when something else happens, such as a font change or a vertical move. This adds a small amount of complexity to the code, from two directions: I have to be careful to keep track of whether a pipe is open or not, and I have to do my own bounds checking to make sure that the pipe doesn't run off the end. Since TeX is compiled with bounds checking off (it shouldn't need it, and the bounds checking code also exacerbates the problem of the compiler size limits), running off the end of a pipe inside of Cedar would crash the world instead of raising a signal.

The other non-trivial part of the Press output code is the units conversion. SirPress tries to be very nice about units, working in a very small unit and letting you specify your conversion factor. But pipes are different: with them, you must use micas. Hence, I just call TeX's own arithmetic procedures and do the conversion to micas before calling SirPress. This appears in the code as multiplication by an unexplained and mysterious-looking rational number; that number is the closest approximation to the correct conversion factor between scaled points and micas for which the arithmetic doesn't overflow.

While we are on the topic of units, I should mention that TeX seems to be getting the mica sizes of some fonts off by one from what Spruce thinks they are, in particular the five point fonts at magstephalf.
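To make the units arithmetic above concrete, here is a minimal Python sketch of how such a rational conversion factor can be derived. The memo does not state the actual number used in TeX.changes, nor the exact overflow bound, so the 16-bit LIMIT below is purely an assumption, and the function names are hypothetical. What is standard: a scaled point is 1/65536 of a point, TeX takes 72.27 points to the inch, and a mica is 10 micrometres (2540 to the inch).

```python
from fractions import Fraction

SP_PER_POINT = 65536                    # TeX scaled points per point
POINTS_PER_INCH = Fraction(7227, 100)   # 72.27 pt to the inch, as in TeX
MICAS_PER_INCH = 2540                   # one mica is 10 micrometres

# Exact number of micas in one scaled point.
EXACT = Fraction(MICAS_PER_INCH) / (POINTS_PER_INCH * SP_PER_POINT)

# Assumed overflow limit (not from the memo): both terms fit in 16 bits.
LIMIT = 2**16 - 1
APPROX = EXACT.limit_denominator(LIMIT)


def sp_to_micas(sp: int) -> int:
    """Convert scaled points to micas in integer arithmetic, rounding to nearest."""
    n, d = APPROX.numerator, APPROX.denominator
    return (2 * sp * n + d) // (2 * d)


def font_size_in_micas(points: float, magnification: float) -> float:
    """Unrounded mica size of a font, handy for spotting near-half-mica cases."""
    return float(points * magnification * MICAS_PER_INCH / POINTS_PER_INCH)
```

Printing APPROX and EXACT side by side shows how little is lost by the approximation. Evaluating font_size_in_micas for a small font at a fractional magnification shows how close the unrounded size can sit to a rounding boundary, which is where an off-by-one disagreement between two programs would be expected to appear.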
Spruce won't bitch about an off-by-one-mica request for a ten point font, but, by the time you get down to five points, it will bitch (the error threshold is a percentage of the font size). I vaguely recall that I used to get lots more of these font substitution messages, and that I put an extra plus one-half in on the call to SirPress.GetFontPipeCode and fixed lots of them. But I haven't gotten things exactly right yet. Exactly right, in this context, means that TeX does the calculation the same way that the Sail Metafont did when it wrote the OC's. I'm not sure, by the way, whether the Sail Metafont used 1.095 as the magnification for magstephalf or a more accurate approximation to the square root of 1.2; it might make a difference, sad to say. I wonder why they don't get these messages at Stanford. You might ask Dave Fuchs what their DVI-to-Press converter does for font sizes, if the fix doesn't look obvious.

Operational hints

Feature changes in TeX. Suppose that you have fixed a bug in TeX, and you want to release a new one. But suppose that Tangle and Cedar and PasMesa and the other parts of the runtime support system are working and haven't changed. Then, here is what you do.

First, you type ``Tangle TeX''. This tells Tangle to read TeX.web and TeX.changes and do its processing. Remember that TeX.web is a big file; if Tangle seems to take a long time getting started, it might be that FS is busy flushing your disk cache in order to free up space for TeX.web. Once Tangle gets started, it plugs along at a respectable rate. I changed the error messages so that they report errors by character number, so that should help in tracing down bugs that Tangle reports. The abbreviated module names that Knuth insisted on using are one frequent source of problems here.
I suspect that Tangle reports at most one error message per pass; in any case, don't be surprised if you fix the one error that it reports and then it stumbles across some totally unrelated error on the next try.

When Tangle is done, the next processor that you want to run is PasMesa, and the command is ``PasMesa TeX.mod''. Be warned, however, that PasMesa is quite a hog of virtual memory; it allocates many, many short ropes and long ropes, holding onto some of them well beyond their useful lifetime because of the plethora of global variables in this papered-over Alto Mesa program. Thus, I suggest that you Rollback just before and just after running PasMesa. The elapsed time will be less if you include the Rollbacks, I assure you. PasMesa starts by reading the file ``TeX.mod'', and gets the rest of its instructions from there. It ends by writing out many Mesa source files. I haven't ever tried running PasMesa in a working directory, so there might be some problems in that area; be careful.

In the unlikely event that PasMesa finds a bug in TeX.pas, it will probably have something to do with an undeclared variable of some kind. Perhaps you put in a change that calls an external procedure and forgot to declare the external procedure, for example. PasMesa reports one error at a time.

When the Rollback returns, you are ready to put the resulting Mesa modules through the compiler and binder. PasMesa has written a command file named ``CompileTeX.cm'' that will do just the right thing. This is the longest step of the process, and takes roughly 15 minutes. I stopped working on the performance of PasMesa as soon as I got it to run at least as fast as the Mesa compiler, so that it was no longer the bottleneck.

A correct compile of TeX includes no errors, of course, but does include several warning messages. TexScanImpl will report a warning of a signed/unsigned ambiguity.
I looked at the types in that expression with great care, and I'll be damned if I can figure out why the compiler is getting confused. Russ Atkinson couldn't figure it out either. But we looked at the code that is generated, and the compiler is doing the right thing. I recommend that you just tolerate this warning.

Next, there are three modules that will report one warning each for unreachable code: TexRest2Impl, TexFinalizeImpl, and TexMainControlImpl. The first two arise because the TeX that you are making is an IniTeX. There are two things that IniTeX can do that regular TeX's can't: build the hyphenation trie and do a \dump. In each case, the code for IniTeX has an unreachable error message in it; a vanilla TeX would hit the error message instead, and it would report that only IniTeX can do what you ask.

The third instance of unreachable code is more subtle, and arises from PasMesa's clever way of translating goto's. The block in the procedure MainControl ends with an unconditional goto; that is to say, control never exits this block by falling off the bottom. But PasMesa has inserted an EXIT into the Mesa code to arrange that control paths that fall off the bottom will get out correctly past all of the intermediate blocks that are handling the Pascal gotos. It is this EXIT that is unreachable code. Life is hard.

There is one more module with warnings, four of them this time; but they're very minor. The module TexExtensionsImpl declares four unused variables on purpose so that people who are trying to debug with a Pascal debugger can store integers somewhere. The Mesa compiler quite correctly notices that these variables are unused. These warnings would go away if TeX's debug switch were turned off, which probably wouldn't hurt anything. The code that it enables is essentially useless in comparison with the Cedar debugger.

The bind of TeX is next, and has no surprises. When the bind is done, you should have a runnable TeX.bcd, ready to try out.
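The warnings that a clean compile is expected to produce, as enumerated above, can be kept as a small table and compared against what a given compile actually reported. The following Python sketch is only an illustration: the memo does not describe the format of the compiler's log, so the function takes warning and error counts per module that the reader has already tallied, and the name surprises is hypothetical.

```python
# Expected warning counts per module, taken from the description in this memo.
EXPECTED_WARNINGS = {
    "TexScanImpl": 1,         # signed/unsigned ambiguity
    "TexRest2Impl": 1,        # unreachable code in an IniTeX build
    "TexFinalizeImpl": 1,     # unreachable code in an IniTeX build
    "TexMainControlImpl": 1,  # unreachable EXIT inserted by PasMesa
    "TexExtensionsImpl": 4,   # unused variables kept for Pascal debuggers
}


def surprises(warnings_seen, errors_seen=None):
    """List every module whose results differ from a clean compile.

    Both arguments map module names to counts; how those counts are
    extracted from the compiler's typescript is left to the reader.
    """
    errors_seen = errors_seen or {}
    problems = []
    for module, count in sorted(errors_seen.items()):
        if count:
            problems.append(f"{module}: {count} error(s)")
    for module in sorted(set(EXPECTED_WARNINGS) | set(warnings_seen)):
        expected = EXPECTED_WARNINGS.get(module, 0)
        seen = warnings_seen.get(module, 0)
        if seen != expected:
            problems.append(f"{module}: {seen} warning(s), expected {expected}")
    return problems
```

An empty result means that the compile matched the description above; anything else is worth a look before going on to test the new TeX.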
There are two auxiliary files to worry about as well: TeX.pool and Plain.fmt. TeX.pool is handed out with TeX, since TeX isn't really ready to start reading user profiles and looking around for things on remote servers before the strings are working. Tangle produced a new TeX.pool, and your new TeX will access it by its short name, and all should be well. When you SModel TeX.df, this TeX.pool will go out along with TeX.bcd for other folk to retrieve via their Bringover.

The other basic file is Plain.fmt, and that one works a little differently. Format files are generally referenced on a remote server. You might get into trouble if you have made any changes that would invalidate the old format file. If so, and you try and test your new TeX by typing ``TeX story'', or something like that, TeX will try and load the default format file, which will be the old one (because you haven't SModel'ed a new one yet). Thus, early on in your testing, you should produce a new format file by typing ``IniTeX plain \dump''. And then reference that format file instead of the default one by trying out the story with ``TeX &plain story''. At least, this is what I did. You might be able to devise a better procedure now that the default format is set in your user profile; just changing your profile to point to the local Plain.fmt as the default might work. Note that you probably shouldn't put Plain.fmt out on Indigo until you are really ready to SModel the new TeX, because other folk are referencing Plain.fmt without an explicit version stamp (to avoid polluting their working directories with the short name Plain.fmt; if this gets to be a problem, change TeX.df to export Plain.fmt and to hell with working directories).

Once you have found a bug, you have to decide how much of this tedious loop you have to go back around in order to fix it and continue debugging. If you are lucky, your bug was in TexSysdepImpl. You can recompile just that module and rebind TeX, and you are back in business.
If your bug is anywhere else, you almost certainly have to run PasMesa again. And, if the bug involved TeX.changes, Tangle as well. In either of these latter cases, you'll have time for a pleasant coffee break.

It is somewhat annoying, by the way, to have to watch your machine for the better part of an hour just in order to type short command lines and do rollbacks. Back in Cedar 4.4, I wrote a program called RollBackHack; it registered a proc to be called after Rollback that looked at a text file of ropes (called ``Com.cm'' for sentimental historical reasons). If Com.cm had any ropes in it, the Rollback proc would take the first one off, write the rest back onto Com.cm, then get a UserExecutive, and hand the rope to that executive. The net result was that I didn't have to babysit the three compilers. (The Rollback proc was careful to wait for 30 seconds or so before starting off, so that I could intervene in the process if necessary!) I did not update this hack to Cedar 5.0 because it depended on the UserExecutive session log stuff: when you return to your Dorado after an hour, you want to see the typescripts of all three compiles in order to check for error messages. With work, of course, a Cedar 5.0 version of RollBackHack could implement session logging on its own.

A new release of Cedar. Things are a little bit more complicated when there is a new release of Cedar, especially if interfaces have changed. Remember that Tangle is a bootstrap processor; think twice before doing anything to it or deleting any version of it. You must have a relatively competent Tangle that you can run in the current Cedar before you can work on fixing bugs in any Web program, including Tangle itself.
Fortunately, the worst problems in this area are probably over, since Tangle itself is likely to be very stable from now on; the trickiest bootstrap occurred back in September of 1983, when I wanted to build, in a new Cedar, a new Tangle for which the format of change files had changed.

Back to the subject of a Cedar release. The first thing to do is to convert PasMesa. This should be old hat, considering how much converting of old code to new releases of Cedar we all get to do. Also, convert the PascalRuntime. You will find this a more tedious job, because the PascalRuntime is in a pretty ugly state at the moment and because a runtime package has to have its fingers in many pies.

Note that the runtime services needed by a Pascal program have been divided up into various classes, with different interfaces for each class. Given that there is no way in Mesa to bind up several interfaces into a bigger interface, I couldn't figure out any better way to proceed. The issue is that different Pascal programs want to have different file systems under them, and some Pascal programs don't use Sets at all, while the rest of the runtime stuff is common to all Pascal programs. Hence, there is a PascalBasic with the basic stuff, three different file interfaces, and a Sets interface, along with implementations for each.

The PascalNoviceFiles package tries to be really nice to the novice programmer. There is code to make text files avoid reading one character ahead (as most Pascal files do), so that terminal interaction can work correctly. In Cedar 4.4, there was code to have the PascalOutput file appear in a viewer on your screen as well as in your file system, so that you could see your program at work; but I fear that I didn't get that working in Cedar 5.0 (the Cedar 4.4 implementation used DribbleStreams, which went away).
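The trick of not reading one character ahead can be sketched as follows. This is a minimal Python illustration of the general technique (often called lazy I/O in Pascal implementations), not the PascalNoviceFiles code; the class and method names are hypothetical. The idea is that advancing the file merely marks the buffer as stale, and the underlying stream is touched only when the program actually inspects the current character or asks about end of file.

```python
import io


class LazyTextFile:
    """Pascal-style text file whose buffer variable is filled only on demand."""

    def __init__(self, stream):
        self._stream = stream
        self._buffer = ""        # what the Pascal buffer variable would show
        self._filled = False     # True when _buffer reflects the stream

    def _fill(self):
        if not self._filled:
            self._buffer = self._stream.read(1)   # '' means end of file
            self._filled = True

    def peek(self) -> str:
        """Inspect the current character, reading it only at this moment."""
        self._fill()
        return self._buffer

    def eof(self) -> bool:
        """Asking about end of file also forces the pending read."""
        self._fill()
        return self._buffer == ""

    def get(self) -> None:
        """Advance past the current character without touching the stream."""
        self._fill()             # the current character must exist to be skipped
        self._filled = False     # the next one is fetched only when needed

    def read_char(self) -> str:
        """The usual read: take the current character, then advance."""
        c = self.peek()
        self.get()
        return c


def from_text(text: str) -> LazyTextFile:
    """Convenience constructor for trying the class out on a string."""
    return LazyTextFile(io.StringIO(text))
```

With an eager implementation, opening a terminal file would block waiting for a keystroke before the program had printed its first prompt; with the lazy version, the first read happens only when the program first looks at its input.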
PascalWizardFiles is a much thinner layer on top of IO.STREAM; this is more to the liking of big applications programs like TeX, which generally open files and the like by calling Cedar procedures that are declared as external to the Pascal program in any case. PascalInlineFiles is an inline version of PascalWizardFiles. Be warned that using the inline version will make the modules into which you have broken your Pascal program somewhat less likely to make it past the size limits of the Mesa compiler.

Having converted PasMesa and the PascalRuntime, turn your attention next to Tangle. Start with Tangle.pas, which was carefully saved away in addition to Tangle.web and Tangle.bcd by the DF file. Run Tangle.pas through PasMesa and the compiler to get (I hope) a working Tangle. Then, run this Tangle over Tangle.web, and check (using Waterlily) that the resulting Tangle.pas file is identical to the one that you started with. At this point, you can breathe a little easier; and you can SModel the new Tangle.

You could jump right on to producing a new TeX, but it probably wouldn't be a bad idea to run through the four little programs that constitute the TexWare package first. This will give you more practice before you hit the big time. And, if you don't do them now, you will be tempted not to do them at all. They should be easy. Each of PoolType, DviType, PLtoTF, and TFtoPL has a separate DF file; they each go through the three compilers in order in the obvious way.

Then take TeX through the three compilers as described in the last section. And you're done.

A new release of TeX itself from Stanford. All of the stuff from Stanford is described in the DF file TexWeb.df. This DF file should only be SModel'ed if you have retrieved a new version of TeX82 from Stanford, probably over the ArpaNet. If Maxc is still alive, the right place to go is the directory on SU-SCORE.
On that directory, you will find ``-read-.me'' and ``textap.cmd''; the former describes what is in the various files in English, while the latter tells you what directory they are on at SCORE. Note that TexWeb.df has almost all of the same files as textap.cmd; the only differences are: (i) Tangle.pas and Tangle.changes aren't included, because we have already done the Tangle bootstrap, and (ii) the TFM's aren't included because they are included instead in /Indigo/Tioga/TFM/tfm.df.

Log in at SU-SCORE as the user with name ``Anonymous'' and any password. Retrieve text files in the default mode, but change the ``structure'' to ``F'' and the ``type'' to ``L 8'' for binary files (such as TFM's). (I'm pretty sure that ``F'' is the right structure; if it doesn't work, try the other one. ``L 8'' is definitely the correct Type.)

To retrieve new TFM's, if Maxc is still alive, I would recommend going to directory [tex,sys] on SU-AI, since that is the directory that Arthur Keller considers as the ultimate truth for the fonts that he maintains, and he maintains the Stanford fonts. Remember that directories come after the file names on SU-AI rather than before, and remember that font file names on SU-AI are shortened to six letters by dropping all but the first three and last three letters of a longer name. Then, copy the new TFM's from Maxc to your Dorado, and SModel the TFM DF file again. SModel will give you warnings about all of the 250 other TFM's that you don't have on your disk, but the warnings don't indicate a real problem in this case.

Nifty hacks for writing changes files

The files of changes to the standard Web source files have been formatted to be relatively convenient to browse around in with Tioga Levels. Each change is a top-level branch; the root node of this branch consists of the position number of the beginning of the change in the Web source followed by a comment describing the nature and purpose of the change.
Then come children nodes with the change itself. If you look at the first level only, you see just the index of changes, which is convenient. If you want to add a change, the most convenient way is to type a control-return to get yourself a new top-level node. Then, type ``web.ch'' followed by control-E. Along with the sources for TeX comes a web.abbreviations file that defines the abbreviation ``ch'' to expand into a template for a change branch.

There is also a hack program contributed by Michael Plass to help in dealing with position numbers. It is very helpful to include the position numbers in the changes for several reasons. First, the file TeX.web is huge, and getting to the right place by other means would be a pain. Second, the changes must be in the correct order in order for the merge performed by Tangle to work. With the position numbers included, it is easy to figure out where to put a new change. (One could even use the EditTool sort-branches stuff to reorder a change file that was out of order, although I have never tried that myself.)

The helpful hack program is called ShowPosition. If you run it, it posts a button at the top of the screen, which is backed up by a register that can hold an integer. The value of the register is displayed as the label of the button. Left-clicking the button stores into the register the position count for the current selection. Right-clicking the button inserts at the current caret position a six-digit text representation of the integer in the register. Thus, when making a new change, select the location of the change (I use the first character of the old text as the reference point); left-click the ShowPosition button; select the six-digit number at the beginning of the change template in pending-delete mode; then right-click the ShowPosition button.

Ideas for improvements to TeX in Cedar

Automatic spooling

It is a minor annoyance that TeX doesn't send your output to your favorite printer automatically.
It should be possible to fix this by programming the shell in some way, at least if the Commander is really getting up into the Unix class. But I haven't designed a scheme for this. It might be easier to make such a scheme work well if TeX returned an ATOM result from its CommandProc that revealed the worst level of error message. TeX computes this level already, and it would be pretty easy to get it returned, although it might demand a minor change to PascalBasic (in order to get the right hook in for the return result in the ExclusiveProc).

Making TeX a server

The InterLisp folk would love it if TeX could be made into a server called via RPC from Lisp before the demise of Maxc. I estimate that such a project would take me a couple of weeks.

IncludePress

The only way to merge illustration Press files into a TeX document at the moment is by going back to the Alto world and running PressEdit/M. It should be relatively straightforward to do a lot better. Various schemes are possible. One big decision is what you believe the lingua franca of illustrations is: Press or Interpress or Imager.

Let us explore first schemes in which Press format is the way that all of the programs in the world produce the illustrations (which is pretty close to true at the moment). One-page Press files have a weakness as a way to encode illustrations: there is no standard way to give the bounding box and the origin of the illustration on the Press page. This can be fixed in one of two ways: either make the user specify these four or five dimensions when she invokes the illustration; or settle on a hacky way to encode the information in the file itself that any illustrator will be able to handle, such as the strings ``<<<'' and ``>>>'' in the lower-left and upper-right corners, for example. Once you decide these questions, the rest of a TeX IncludePress should be a lot like the Tioga version.
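For the marker-based scheme, the arithmetic that turns two corner markers into a box and an origin shift is small. The Python sketch below is only an illustration under stated assumptions: it supposes that the positions of the ``<<<'' and ``>>>'' strings have already been found in the Press file (the memo does not say how), that positions are in micas, and that the page origin is at the lower left. All of the names are hypothetical.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: int   # assumed: micas from the left edge of the Press page
    y: int   # assumed: micas from the bottom edge of the Press page


class Box(NamedTuple):
    width: int
    height: int
    shift: Point   # translation that moves the lower-left marker to (0, 0)


def illustration_box(lower_left: Point, upper_right: Point) -> Box:
    """Derive the illustration's size and origin shift from its two markers."""
    if upper_right.x < lower_left.x or upper_right.y < lower_left.y:
        raise ValueError("markers are in the wrong corners")
    return Box(
        width=upper_right.x - lower_left.x,
        height=upper_right.y - lower_left.y,
        shift=Point(-lower_left.x, -lower_left.y),
    )
```

The width and height are what TeX would need in order to reserve space for the illustration, and the shift is what the copying step would add to every position in the illustration so that it lands at the current point.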
In detail, you should pick a name to serve as a \special operator, such as ``includepress'', and an argument structure. I would recommend that ``includepress'' take just one argument, the name of the press file. Its semantics would be to copy all of the commands from the Press file into the TeX output, shifting the origin of the Press page to the current position.

In the scheme where the user has to type in the offsets and bounding box dimensions, the rest can all be done with TeX macros and appropriate use of negative glue. The top-level macro call might look something like \includePressIllustration{fig1.press width 5truept depth 1in height 7in xoffset 3in yoffset 2in}.

The scheme where the Press file includes its own markers is trickier to implement, since the dimensions of the illustration box must be known by glue-setting time, which is before the output routine runs. And TeX, as currently set up, assumes that \special processing will be done only by the output routine. But something could be worked out, I'm sure.

On the other hand, maybe this whole thing should wait until the Imager conversion, and either Interpress masters or Imager display lists should become the lingua franca of illustrations. Then, the design issues would be somewhat different.

Current bugs

5.0 bugs that 5.1 fixes

Without any change to TeX (other than to the TSetter reference in the DF file), Cedar 5.1 fixes two problems with TeX. First, a bug in SirPress positioning is fixed which caused characters to come out in the wrong places if you backed up to exactly the same place you had been before when using a pipe. Second, a CommandProc can successfully open the terminal for input even if it has just been loaded for the first time, which it couldn't in 5.0.
Unfortunately, 5.1 seems to introduce a new bug: TeX is careful to arrange that the commands ``TeX'' and ``IniTeX'' will be made uninterpreted, since both ``&'' and ``\'' are quite useful characters to type in command lines with their TeX semantics, and it is annoying to have to quote them. (ShiftInterp gives the user access to the interpretation functionality if that is what they want instead.) This worked fine in 5.0, but uninterpreted commands seem to be broken in 5.1; I sent Larry Stewart a message about it.

Output routine narrows

The 32-bit INT's that TeX works with get narrowed to 16 bits at some point on the way into SirPress pipe positioning commands (inside of ClosePipe, I think). Thus, users who position things in funny places off the page may end up looking at a bounds fault, which is a little impolite.
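One way to be more polite about the narrowing problem would be to check the range before the value ever reaches the pipe. The Python sketch below shows two possible policies, under the assumption that the narrow type is a signed 16-bit quantity; the memo does not say whether it is signed, and neither function corresponds to anything in SirPress or in the TeX port.

```python
# Assumed range of the narrow type: signed 16 bits.
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def narrow_checked(value: int) -> int:
    """Return value unchanged if it fits in 16 signed bits, else raise.

    The caller can catch the exception and report the offending
    position in its own words instead of faulting.
    """
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"position {value} does not fit in 16 bits")
    return value


def narrow_clamped(value: int) -> int:
    """Pin an out-of-range position to the nearest representable value."""
    return max(INT16_MIN, min(INT16_MAX, value))
```

Checking lets the program name the problem for the user; clamping keeps the output flowing at the cost of silently moving whatever was positioned off the page. Which is preferable depends on how much the misplaced material matters.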