CharacterDiscussion.mail

18-May-90 To: Mike Spreitzer Re: Character Proposals (Part 1)
Date: 18 May 90 10:10:25 PDT
From: Kenneth A Pier:PARC:Xerox
Subject: Re: Character Proposals (Part 1)
In-reply-to: "Mike Spreitzer:PARC:Xerox's message of Thu, 17 May 90 08:44:39 PDT"
To: Mike Spreitzer:PARC:Xerox
cc: PCedarImplementors:PARC:Xerox, SchemeXeroxImplementors:PARC:Xerox

I am keeping track of the messages Re: Character Proposals as a PFUDGe type discussion on [CedarCommon2.0]CharacterDiscussion.mail.

21-May-90 Mike Spreitzer Character Proposals (Part 0)
Date: Mon, 21 May 90 08:13:24 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 0)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

I should mention, for the DLs, why I think this is interesting to think about now. There are three problems I'm addressing:

1. We want to use more than the ASCII characters inside PARC. I personally am interested in using mathematical symbols in a new programming language, and in using Greek letters in existing programming languages. I also want to use mathematical symbols in English text; this includes comments in programs. While it is true that Tioga already handles more than ASCII, Tioga is not all of Cedar and Scheme. For example, a program pretty printer could (and mine actually does) use multiple ROPEs and STREAMs to move the contents of comments through several packages. Also, I think Xerox as a company should support non-English texts, as nearly as possible, as well as it does English ones. While we're not responsible for making products, sticking our heads in the ASCII sand makes it that much harder to transfer our work to product programs, and assumes that there are no interesting problems to be solved concerning multinational text.

2. We want to interoperate with systems external to PARC that use more than the ASCII characters.
I deduce this from two slightly smaller assertions: (A) we want to interoperate with systems external to PARC, (B) some external systems will, and some already do, use character codings other than ASCII. One such system is Viewpoint. Pavel alleges that much of Europe uses an 8-bit coding called `Latin 1', which coincides with ASCII in the first 128 codes, and diverges in the second 128. The X consortium is using 32-bit character/key codes. Surely other organizations are also feeling the pressure, and we can expect more widespread use of non-ASCII codings in the future.

3. We are already interoperating with non-Cedar systems that use character codings other than PARC ASCII, but we don't properly translate the character codes. PARC ASCII associates up-arrow and left-arrow with the numbers that real, current ASCII associates with circumflex and underscore. You can tell that we don't even interoperate properly with UNIX or other external ASCII systems because the underscore/left-arrow confusion shows up all over the place. XCC associates the international currency symbol with the number that PARC ASCII associates with the dollar-sign. We are so confused about this that if you switch between screen and print StyleKind (try it! [look in the Places menu]) this character ($) changes appearance! The problem is that Cedar doesn't recognize when it is importing characters from, or exporting characters to, a foreign system, and thus doesn't do any of the necessary translations. Note that merely recognizing system boundaries and doing character translation does not necessarily require a larger character set; but if you accept point (2), it does.

I think this is interesting to think about now because of the youth of PCedar and Scheme.
With PCedar3.0 coming up, there will soonish be an opportunity to recompile everything, which means that changes to the Rope and IO interfaces are within the Pale (for DCedar, I think the expectation is that we will never go to a higher major version, and thus will never again recompile everything). SchemeXerox is also quite young; there is very little in it now that depends on the character coding being used (eg, there is no editor, no file reader/writer, and nothing that images text --- it gets all these things from Cedar). We haven't even got Modula-3 on PCR yet. I think that if we are going to expand the character set, there will be fewer software engineering obstacles if we do it now than any time later.

The biggest downside I see is that this requires work, and person-cycles are in very short supply. Is this likely to ease in the future?

I think the biggest open question is: is it worth the work? In order to answer that, we must explore just how little work we can get away with. To do that means sketching out a design, which is what I'm trying to do with this discussion. Still to come: the ROPE and STREAM level interfaces question, and the compiler support question.

Thoughts?
Mike

17-May-90 Mike Spreitzer Character Proposals (Part 1)
Date: Thu, 17 May 90 08:44:39 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 1)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

I propose that Cedar and Scheme eventually adopt XCC as their standard character code.

I claim that one of the biggest bogons in Cedar's current handling of characters is that it doesn't know its boundaries: Cedar doesn't take care to note when it's getting characters from a foreign system and convert them into its own code (of course, with Cedar's code currently having only 8 bits, that's a pretty hopeless thing to do anyway). I propose that the Cedar and Scheme interfaces with foreign systems be expanded to discuss character translation.
I propose to start developing interfaces now for the new characters. The rest of this message is concerned with the design of those interfaces.

I think it would be great if we could expand Rope.ROPE and IO.STREAM to handle the new characters, because that could save us a lot of editing. But we can only do that when we're ready to recompile all of PCedar and either freeze DCedar or recompile all of it too. Did I hear the sentiment that it's nearly time to freeze DCedar expressed at PFUDGE yesterday?

Regardless of whether we expand Rope and IO or make new interfaces, I think we should start with an interface defining characters. It might look something like this:

Char: CEDAR DEFINITIONS = {
  CHR: TYPE = RECORD [CARD]; <>
  noCHR: CHR = [CARD32.LAST]; <>
  Valid: PROC [CHR] RETURNS [BOOL]; <>
  Ord: PROC [CHR] RETURNS [CARD];
  Val: PROC [CARD] RETURNS [CHR];
  Widen: PROC [CHAR] RETURNS [CHR]; <>
  Narrow: PROC [CHR] RETURNS [CHAR]; <>
  Coding: TYPE = REF CodingPrivate;
  CodingPrivate: TYPE; <>
  LookupCoding: PROC [ATOM] RETURNS [Coding]; <>
  CodedChar: TYPE = RECORD [cdg: Coding, chr: CHR]; <>
  Defined: PROC [CodedChar] RETURNS [BOOL]; <>
  Describe: PROC [CodedChar] RETURNS [ATOM]; <>
  Corresponds: PROC [CHR, CodedChar] RETURNS [BOOL]; <>
  NumCorrespndents: PROC [CodedChar] RETURNS [INT]; <>
  NthCorrespondent: PROC [CodedChar, INT] RETURNS [CHR];
    <= NumCorrespndents. There is no significance to the ordering, except that correspondent 0 is the one returned by Import.>>
  Import: PROC [CodedChar] RETURNS [CHR]; <>
  Export: PROC [chr: CHR, to: Coding] RETURNS [CHR]; <>
  NoTranslation: SIGNAL [CHR] RETURNS [CHR]; <>
  }.

One of the issues raised by this interface is the data type used to represent our new characters. Scheme is fortunately opaque --- programs can't tell how characters are represented, and there are no constraints on the behavior of integer->char and char->integer (the Scheme equivalents of VAL and ORD) that prevent expansion.
The SchemeXerox representation for characters currently allows 16 bits, and could relatively easily be expanded by another 2 bits; this should be able to hold all the numbers we'll need in the foreseeable future. I'm assuming that we don't want to widen Cedar's built-in type CHAR. I chose to declare the Cedar representation to take at least 32 bits, instead of 16, because (1) you should always prepare for growth, and (2) the XCC standard explicitly states that it may require more than 16 bits in the future. Alternatives for the Cedar representation that come to mind are:

  CHR: TYPE = CARD;
  CHR: TYPE[SIZE[CARD32]];
  CHR: TYPE = RECORD [CHRRep]; CHRRep: TYPE[SIZE[CARD32]];

I disfavor the first alternative because it prevents Cedar's type system from distinguishing characters and cardinals. I actually think I prefer the second alternative to what I proposed earlier, but refrained from making that my actual proposal because I seem to remember that such constructions are disfavored by Cedar programmers in general. I favor making the representation opaque because information hiding is good in general. Alternative 3 makes the representation opaque, but in a way that may not trouble most programmers --- for example, I think it makes NARROW[.., REF CHR] possible.

Another issue is the names for the Cedar interface and type. Alternatives that come to mind are: Char.X, Char.CHAR (does the compiler allow this?), Char.ECHAR, Char.Char, CH.AR, and Char.XCHAR.

One problem that always arises in this area is the poorly standardized, overloaded English terms used to discuss it. We need names for: a number that represents a character (sometimes called a character code; I refrain from using that term because it's ambiguous), an association between numbers and characters (sometimes called a character code, sometimes called a character set; I chose to call this a coding), and a set of characters (sometimes called a repertoire, sometimes a character set; I've avoided needing this term so far).
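
[Editorial sketch, not part of the original message.] The Import, Export and NoTranslation items of the proposed Char interface can be illustrated roughly as below, in Python rather than Cedar. Everything here is an assumption made for illustration: the class and function names, the table layout, and the use of Unicode code points as a stand-in for whatever standard coding is adopted. Only the PARC ASCII up-arrow/left-arrow behaviour is taken from the discussion above.

```python
from typing import Dict, List


class NoTranslation(Exception):
    """No character in the target coding corresponds to the given one."""


class Coding:
    """A foreign coding: maps its numbers to standard characters (hypothetical)."""

    def __init__(self, name: str, table: Dict[int, List[int]]) -> None:
        self.name = name
        # number in this coding -> corresponding standard characters;
        # correspondent 0 is the one returned by import_char.
        self.table = table

    def import_char(self, number: int) -> int:
        """Foreign number -> standard character (cf. Import)."""
        correspondents = self.table.get(number)
        if not correspondents:
            raise NoTranslation(number)
        return correspondents[0]

    def export_char(self, std_char: int) -> int:
        """Standard character -> foreign number (cf. Export)."""
        for number, correspondents in self.table.items():
            if std_char in correspondents:
                return number
        raise NoTranslation(std_char)


# Assumption: Unicode code points stand in for the standard coding.
LEFT_ARROW, UP_ARROW = 0x2190, 0x2191

real_ascii = Coding("ASCII", {n: [n] for n in range(128)})

# PARC ASCII puts up-arrow and left-arrow where ASCII has ^ and _.
parc_table = {n: [n] for n in range(128)}
parc_table[0x5E] = [UP_ARROW]
parc_table[0x5F] = [LEFT_ARROW]
parc_ascii = Coding("PARC-ASCII", parc_table)


def translate(number: int, source: Coding, target: Coding) -> int:
    """Cross a system boundary: import from one coding, export to another."""
    return target.export_char(source.import_char(number))
```

With these tables, translating the letter A between the two codings succeeds, while translating PARC ASCII's left-arrow to real ASCII raises NoTranslation instead of silently turning it into an underscore, which is the confusion described in Part 0.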
Should the translation stuff go in the basic interface, or somewhere else? It needs to be extremely available: wherever Cedar or Scheme meets a foreign system --- and that may be very deep inside Cedar's or Scheme's own implementation --- characters must be translated. I think putting it in the basic interface is good because that emphasizes the fact that translation is available wherever these new characters are.

Should Codings be looked up by name, or fetched from an interface that simply exports the codings we need as individual items (eg, Coding.NewAscii, Coding.JIS6226, Coding.PressMath)? I favor both. Having an interface that exports some particular codings as individual items is good because references to interface items are better than indirection through ATOMs (or whatever we use for names). Having the lookup by name is good because that's part of the support for extending our interoperability without recompiling or running any code.

I think we should be able to extend our interoperability without recompiling and running any code. This means extending the set of codings, adding associations between CodedChars and CHRs, and adding CHRs. This can be done, for example, by keeping these things in files in standard places --- somewhat analogously to the way we define fonts by files.

Gotta go; more later.

Thoughts?
Mike

17-May-90 Doug Wyatt Re: Character Proposals (Part 1)
Date: 17 May 90 11:03:12 PDT
From: Doug Wyatt:PARC:xerox
Subject: Re: Character Proposals (Part 1)
In-reply-to: "Mike Spreitzer:PARC:Xerox's message of Thu, 17 May 90 08:44:39 PDT"
To: Mike Spreitzer:PARC:Xerox
cc: PCedarImplementors:PARC:Xerox, SchemeXeroxImplementors:PARC:Xerox

Here's another possible representation for CHR ...

  CHR: TYPE = MACHINE DEPENDENT {nil(CARD.LAST)};

ORD and VAL will work on this, and you can test chr=nil.
It's tempting to name some of the values too, but 1) you're using the same CHR type for different codings, and 2) the compiler might not deal gracefully with an enumeration containing hundreds or thousands of names.

Actually, I prefer XCHAR (or ECHAR) to CHR; X is too short and uninformative, CHAR and Char are unacceptable name conflicts. You're not serious about CH.AR, are you?

-- D.

17-May-90 Mike Spreitzer Re: Character Proposals (Part 1)
Date: Thu, 17 May 90 11:25:19 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Re: Character Proposals (Part 1)
In-reply-to: "Doug Wyatt:PARC:xerox's message of 17 May 90 11:03:12 PDT"
To: Doug Wyatt:PARC:xerox
Cc: Mike Spreitzer:PARC:Xerox, PCedarImplementors:PARC:Xerox, SchemeXeroxImplementors:PARC:Xerox

I like the enumerated type idea. The main use will be for only one coding, so I think it would be OK to name some of the values. That brings up the issue of what happens when XCC-1-... becomes XCC-2-...; it's easy enough to deal with additions to the coding, but what happens when changes are made? We have this problem regardless of whether we put some name-code bindings in interfaces. Note that the current technology (CHAR) simply forbids changes (and doesn't do too well on additions, either).

My thought behind the name `Char.X' is that clients (and impls) would use the full name `Char.X'; the name only appears short in the interface itself. I'm not sure how seriously to take `CH.AR'; do we believe in the idea of designing for the full name only?

I'm not sure how tightly we want to bind ourselves to XCC. For example, as we discover places where XCC fails to hold to One-Semantic-One-Code we may wish to diverge our standard coding a bit from XCC. If we admit we may not always use XCC, then maybe using X in the name is misleading; that's why I suggested using E, and other schemes.
Mike

17-May-90 Foote:OSBU North Re: Character Proposals (Part 1)
Date: 17-May-90 12:02:31 PDT
Subject: Re: Character Proposals (Part 1)
In-Reply-To: Originator: "::", UniqueString: "Mike Spreitzer:PARC's message of 17 May 90 11:26:12 PDT (Thursday)"
Message-ID: Originator: "James K. Foote:OSBU North:Xerox", UniqueString: "17-May-90 12:02:31 PDT"
To: Mike Spreitzer:PARC:Xerox
Cc: Doug Wyatt:PARC:Xerox, PCedarImplementors:PARC:Xerox, SchemeXeroxImplementors:PARC:Xerox
Reply-To: Foote:OSBU North:Xerox
From: Foote:OSBU North:Xerox

> I'm not sure how tightly we want to bind ourselves to XCC.

Make sure that the benefits of not using XCC are high enough to justify the costs. The costs may include converters for interoperability and might include new fonts. Further, I'd recommend that if you're not going to use XCC then it would be better to invent your own standard than to be close to, but not exactly the same as, XCC.

But then what do I know.

-- Jim

18-May-90 Mike Spreitzer Character Proposals (Part 1, cont'd)
Date: Fri, 18 May 90 08:01:51 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 1, cont'd)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

Another issue raised by the Char interface is the use of `name' data in that interface --- which is more primitive than the Rope or whatever interface is used to define our standard for `name' data. That is, what should LookupCoding take as an argument, in an interface `below' Rope? What should Describe return? I think the right answer is that those procedures don't belong in the Char interface --- they can be lifted to some other interface that is above Rope (or equivalent).

This brings us to the issue of how to layer a system so that nothing depends on something above it. I think the following organization would work:

  Higher Stuff
  Knowledge of how to import/export with an arbitrary system
  Knowledge of how to import/export with base operating system (SunOS, Mac, ..)
  Knowledge of pure CHR operations and CHR <-> CHAR operations

Here's how the arbitrary import/export operations could be implemented in a way that allows extensibility without changing code. Imagine that we assign a name to every foreign coding we understand. With each foreign coding we associate a file whose name is derived (in some deterministic way that's accessible to people) from the coding's name. Using the lower-level knowledge of how to translate into the coding of the base operating system, we can translate that file name, open the file, and translate its contents. The file's content is a listing (in the base operating system's coding) that gives, for each number that the foreign coding associates with a character, the set of corresponding CHRs and an English (how chauvinistic!) description. People can thus extend the set of understood foreign codings by creating such files, and can extend the association between CodedChars and CHRs by editing such files; these edits can be done in Cedar or Scheme or the base operating system.

This scheme relies on something else, probably code, to implement the translation between CHRs and the base operating system's coding; extending this association would be more difficult, but also less likely to be required.

Mike

18-May-90 Mike Spreitzer Character Proposals (Part 2)
Date: Fri, 18 May 90 08:22:46 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 2)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

Not only do we need the ability to translate to/from a foreign coding, we need to know when to translate. For plain-text files, that's probably easy: assume it's in the base operating system's coding, but provide clients (and thus, indirectly, users) the ability to indicate otherwise.

What about Tioga files? In current Tioga files, you cannot determine what character is meant at a given position just by looking at the CHAR.
For some positions, XCC is used, and you can determine the intended character by looking also at the CharSet (which ain't easy from a program, unless you're using TiogaAccess). However, most of our Tioga characters don't use XCC --- they use "PARC ASCII", or worse. Worse is using `looks' (or Postfix properties, or other Tioga style hackery) to select a Press font that images a glyph corresponding to an entirely different character than what anybody's ASCII assigns to the CHAR. Each Press font uses an independent coding --- do most Press fonts (ie, all of them except the Math, Greek, and symbol ones) share one or a few codings? Anyway, if the Tioga document isn't so old that its style annotations are too broken to tell you what font applies at a given position, you can tell which coding applies.

I'm not going to ask for changes in Tioga (nor would I turn them down!). But I think that for programs to properly read and write Tioga documents, we need a package that correctly converts a Tioga document to/from a stream of CHR. This means that the TStyle level of Tioga has to be below programs that read/write Tioga documents --- which doesn't seem too unreasonable.

Do we want to make a new kind of file, which is not full Tioga, nor plain Ascii, but has CHRs in it?

18-May-90 Christian P Jacobi Re: Character Proposals
Date: Fri, 18 May 90 11:58:18 PDT
From: Christian P Jacobi:PARC:Xerox
Subject: Re: Character Proposals
In-reply-to: "Mike Spreitzer:PARC:Xerox's message of Fri, 18 May 90 08:22:46 PDT"
To: Mike Spreitzer:PARC:Xerox
Cc: PCedarImplementors:PARC:Xerox, SchemeXeroxImplementors:PARC:Xerox

Just for the curious: I have stored what I have not yet deleted from the last two weeks mail about internationalization for X windows.

---> /net/palain/palain/jacobi/charcodemail.maillog

For my taste pretty incomprehensive... but can we afford to ignore it?
Christian

21-May-90 Mike Spreitzer Character Proposals (Part 3)
Date: Mon, 21 May 90 08:26:54 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 3)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

Maybe XCC is not the coding we should adopt. It has been alleged that there is other work (I'm looking into some) on >8-bit standards that stands a better chance of winning in the marketplace. On the other hand, XCC is currently being used by Mother X.

If we are going to build one system with some parts written in Scheme and others in Cedar (and others in Modula-3?), these languages will either have to agree on one character coding (political yuk) or translate character codes as part of inter-language interoperability (performance yuk). Note that SchemeXerox and PCedar already store their character codes differently (SchemeXerox shifts the number and adds tag bits).

Mike

22-May-90 Mike Spreitzer Character Proposals (Part 4)
Date: Tue, 22 May 90 09:33:32 PDT
From: Mike Spreitzer:PARC:Xerox
Subject: Character Proposals (Part 4)
To: PCedarImplementors, SchemeXeroxImplementors
Cc: Mike Spreitzer:PARC:Xerox

Much of the impact on existing code comes from whatever is done about the Rope and IO interfaces. Following are some alternatives.

I. Make new interfaces (eg, ERope and EIO). Least disruptive to the existing system, but requires the most work to get to a new system where expanded characters are used ubiquitously. PCedar.depends lists 1630 modules that depend directly on one or both of Rope and IO. To get to a point where expanded characters are used ubiquitously, each of those 1630 modules would have to be edited. Since some of those modules are themselves interfaces --- some of them pretty central, like FS and PFS --- other problems ensue.
For interfaces like FS and PFS, whose use of ROPEs and STREAMs is essentially confined to procedure arguments and results, alternate procedures that take and return EROPEs and ESTREAMs can be added. But FS and PFS also declare Error, which passes a ROPE; that should be changed at some point to pass an EROPE. In other interfaces, the non-argument-or-result use of ROPEs and STREAMs is more central. For example, a Viewer has a name field that is declared to be a ROPE; that should eventually change to an EROPE. Each use of ROPE or STREAM in the type of an interface item will require either: (1) adding an alternate E-item to the interface, and thus eventual editing of all the clients, or (2) changing the type of that item to use EROPE and ESTREAM, and editing and recompiling the implementation and all the clients in one release step.

II. Extend Rope and IO, without changing the types of existing public items. Redefine a ROPE to mean a sequence of ECHAR, instead of a sequence of CHAR, and a STREAM analogously; this is natural since most clients are using them for sequences of characters, and our notion of characters is changing. The existing Rope and IO procedures that take or yield CHARs are left as they are, and new alternate versions that traffic in ECHARs are added. When a procedure is required to convert an ECHAR into a CHAR and there is no correct conversion, an error is raised. The text variant of RopeRep can be left as it is, adding all the new variants (including the inevitable new wide flat one) to those grouped under the node variant; thus REF TEXT and Rope.Text are still bitwise equivalent.
Since extremely few Cedar modules know about the internal structure of the node variant of RopeRep, and IO's impl can easily be made to deal with stream classes that don't provide the expanded class procedures, simply expanding these definitions --- without changing the rest of the system to actually pass or receive larger characters --- can be done with very little disruption to the existing system. It does require recompiling most of Cedar --- but that's expected to be done for PCedar3.0 anyway. It also requires either freezing or recompiling most of DCedar. Is it time to freeze DCedar? Anyway, as far as PCedar is concerned, there is very little disruption at first; what I'm talking about is adding RopeRep variants and STREAM class procedures that nobody uses at first. Then, the conversion of the rest of the system to actually use larger characters can proceed without requiring further interface changes (except for interfaces with items that use CHAR in their type --- I think there are far fewer of these than ones that use ROPE or STREAM, but I don't know of an easy way to check that). The lack of interface changes means that the changes to the rest of the system don't all need to be done at the same time; the rest of the system can be upgraded incrementally, and each upgrade to a package does not disturb its clients (unless and until one client tries to use that upgraded package to pass an expanded character to another client that isn't yet prepared to deal with them, in which case the receiving client will get a runtime error). A disadvantage is that Cedar's type system doesn't help a client tell when a service has been upgraded to work with expanded characters; this is the flip side of not having the type system require editing or recompilation every time a service is upgraded. 
I think this is a win because many modules that use ROPEs and STREAMs just do `pipefitting' with them, and don't fondle characters individually; these modules never need to be edited in this proposal (but they must in proposal I). I did a little study to quantify this claim; it's quick and dirty and only approximate, but I think informative. I consulted PCedar.depends to find all the modules that reference either or both of Rope and IO; these are the 1630 mentioned earlier. I studied those whose names begin with either A or B; there are 111 of these. I examined the Rope and IO interfaces, and identified the items that yield CHARs; they are:

  Rope: Fetch, InlineFetch, Map, Translate, Flatten, InlineFlatten, ToRefText, AppendChars, UnsafeMoveChars, ContainingPiece
  IO: GetChar, GetBlock, UnsafeGetBlock, PeekChar, TOS, TextFromTOS, GetCedarToken, GetToken, BreakProc, CharClass, GetLine, GetByte, GetHWord, GetFWord, CreateStreamProcs, CreateStream, STREAMRecord, StreamProcs, Value, ValueType

Using XRef, I found and examined all the modules that reference these items. Of the 111 Rope/IO-referencing modules, only 27 depend on a ROPE or STREAM element fitting into a CHAR. This suggests that following proposal II instead of I would save over 3/4 of the editing. I noticed that most of the editing that would be required is pretty brainless. This study overestimates the saving because it doesn't look at other interfaces that mention CHAR.

Some clients use ROPEs and STREAMs to pass 8-bit bytes, rather than characters. The expanded ROPEs and STREAMs can, of course, also pass bytes; there will even be implementations that only pass bytes, and thus pay no time or space penalty.

How should sequences of characters be encoded in files? All the popular encodings should be supported.

III. Extend Rope and IO, changing public items that traffic in CHAR to traffic in ECHAR instead; add ECHAR to the built-in types of the compiler, and make CHAR a subrange of ECHAR.
Russ says adding a new built-in type requires very little compiler work, and a recompilation of everything (every .mob contains a list of all the built-in types). He wasn't sure how much work would be involved in making CHAR a subrange of ECHAR. Proposal III impacts the rest of the system much like proposal II, except that the ECHAR -> CHAR conversion (and potential error) would occur outside Rope.Fetch or IO.GetChar instead of inside. The advantage of this proposal is that clients whose only use of Fetch or GetChar is to compare the returned character to a literal do not require eventual editing (unless they logically should be comparing against more characters) under proposal III (but do under I and II). In the study mentioned above, this reduces the number of modules that must be edited from 27 to 24 (assuming their character comparisons remain appropriate). One disadvantage is that the conversion must then be done for every ROPE and STREAM, even those that only implement 8-bit characters. Another disadvantage is that it is harder to tell which packages have been upgraded --- there is no longer the clue of their using EFetch instead of Fetch. I think proposal II is the best. What do you think? Mike
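
[Editorial sketch, not part of the original thread.] The behaviour described under proposal II can be modelled in a few lines of Python: the existing Fetch keeps its narrow contract and raises an error when an element does not fit in a CHAR, a new EFetch returns the wide value, and `pipefitting' operations such as concatenation never look at individual characters. The class, the CharTooWide error and the 8-bit limit constant are illustrative assumptions, not Cedar's actual Rope implementation.

```python
from typing import Iterable, Tuple

CHAR_LAST = 0xFF  # assumption: the existing CHAR holds 8 bits


class CharTooWide(Exception):
    """An expanded character was requested through a CHAR-only procedure."""


class Rope:
    """Toy rope: an immutable sequence of (possibly wide) character numbers."""

    def __init__(self, codes: Iterable[int]) -> None:
        self._codes: Tuple[int, ...] = tuple(codes)

    @classmethod
    def from_text(cls, text: str) -> "Rope":
        return cls(ord(c) for c in text)

    def __len__(self) -> int:
        return len(self._codes)

    def efetch(self, i: int) -> int:
        """New procedure: yields the expanded character unchanged."""
        return self._codes[i]

    def fetch(self, i: int) -> int:
        """Old procedure: unchanged for narrow data, error for wide data."""
        code = self._codes[i]
        if code > CHAR_LAST:
            raise CharTooWide(code)
        return code

    def concat(self, other: "Rope") -> "Rope":
        """Pipefitting: needs no editing, since it never inspects elements."""
        return Rope(self._codes + other._codes)


def old_client_count_spaces(rope: Rope) -> int:
    """A not-yet-upgraded client; works until it meets a wide character."""
    return sum(1 for i in range(len(rope)) if rope.fetch(i) == ord(" "))
```

An un-upgraded client such as old_client_count_spaces keeps working on 8-bit data and fails at run time, not compile time, the first time it is handed a rope containing an expanded character, which is the trade-off the message describes.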