SetsAndStuffDoc.Tioga
Last tweaked by Mike Spreitzer on September 21, 1987 12:12:52 pm PDT
SetsAndStuff
CEDAR 7.0 — FOR INTERNAL XEROX USE ONLY
SetsAndStuff
Abstract Data Types for Sets, Relations, &etc.
Mike Spreitzer
© Copyright 1987 Xerox Corporation. All rights reserved.
Abstract: This package defines some abstract data types for abstractions like sets, relations, functions, sequences, and so on. Some implementations are also provided.
Created by: Mike Spreitzer
Maintained by: Mike Spreitzer <Spreitzer.pa>
Keywords: Collection, Set, Relation, Function, Sequence, Series, Array
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

For Internal Xerox Use Only
1. The Package
Basic Philosophy
An abstract data type is one that specifies the operations the may be performed without specifying the implementation. This is done, as usual, by having a REF ANY that holds the actual implementation, and a record of class procedures. For each abstraction, or related set of abstractions, there is an interface. That interface include a section containing the generic operations, a section for some special cases of the general abstraction and operations peculiar to the special case, and a section dealing with how to make an implementation for one of the abstractions.
Particulars
The basic interface is Collections. It defines the basic abstraction, the Collection, which is the union of the Set and the Sequence. The next interface is PairCollections, which specializes Collection to PairCollection, which is a collection of pairs. This abstraction includes Relations. The interface IntStuff is a digression to define extended precision integers. Finally, there is the interface IntFunctions, which concerns partial functions over the INTs.
2. Design Issues
There are lots of them. Discussion and suggestions are welcome.
Organization of Abstractions
Why should we mess with Collection, why not have Set and Sequence? There are lots of questions like this that can be asked. Should we have separate abstract data types (ADTs) for functions vs. relation? Note that some implementations of Set may be able to test membership, but not enumerate; should there be a special sub-ADT for them? I am not convinced I have really good answers for this, but I have followed two principles.
One is that it's OK to have one ADT for two abstractions that share a lot. In particular, Sequences and Sets share a lot. Consider that for any sequence you can think about the set of things in that sequence — and in fact, you often do. Consider that in SETL, sets and tuples share a lot of operations. Consider that in Smalltalk they chose to do it this way.
The other principle is derived from an ugly fact about Cedar (it may be true of a lot of other programming environments as well): the memory management doesn't work very well — dynamic storage allocation and collection are expensive. Thus, I consider it valid to make a new ADT to eliminate major usage of dynamic memory. This is why I felt it was OK to have a separate ADT for collections of pairs, even though that's really just a specialization of collections. The ADT for partial functions of INTs was introduced for similar reasons. But it's also true that these specializations add a lot of semantics to the general abstractions they specialize; I might not have felt OK about introducing new ADTs if their abstractions weren't so substantially different.
Specializing Without Introducing New ADTs
OK, so the ADTs correspond to an abstraction that has a lot of interesting special cases. Some of those special cases have their own ADTs, and some don't. What happens when a client wants to deal with one of those special cases that doesn't have its own ADT? That client uses the general ADT. For instance, the client who wants to use Set uses Collection. The client who wants to use Filter (a Boolean function of one variable) uses Collection. A Collection can test membership, and thus represent a Filter. A Collection can also enumerate members, and thus provides the two most interesting operations of Sets.
Each ADT defines a largish group of operations that it may perform. Not every datum of that ADT is obliged to be able to perform every one of those operations. For instance, some Collections may be unable to enumerate their members, even though they can test membership. This is necessary to allow that ADT to represent some special case abstraction that doesn't include all the operations of the general one. When a client tries to invoke an operation that is not supported, an ERROR (Cant) is raised. There are procedures for asking what's supported by a given datum (and arguments --- it may depend on some).
Some of the class procedures take arguments that are more or less interesting for various special cases of the general abstraction. For instance, the procedure for adding an element to a Collection takes an argument specifying where the element is to be put; this is meaningful for Sequences, not for Sets. And so the TYPE of this argument includes a way to say "don't care", and that's the default value for that argument. Thus, Set clients needn't talk about this argument at all (but Set implementors should check that it is "don't care"); and Sequence clients can either leave it "don't care" or give a more specific value. This technique is generally applied.
You may be asking about type-checking for the special cases. There are two classes of specialization. One involves removing operations. An example is the specialization from Set to Filter: a Set can do everything a Filter can. In such cases, a definition like "Filter: TYPE = Set" is provided. The client code can mention the name Filter for clarity, but the distinction between Filter and Set is not checked by the Cedar type system. That's because it will be completely adequately checked at runtime: the only mistake you can make is to try to use a Filter in an operation that Sets have but Filters don't, and in that case the problem will be detected and reported (by raising Cant).
The other kind of specialization involves more subtle semantics. There are two good examples. One involves the issue of mutability, and the other the difference between Set and Sequence. For these distinctions it would be nice to have type checking, because mistakes will not always be caught at runtime. We do have a nice mechanism for doing type checking for specialization. Unfortunately, it only works for one dimension of specialization. We thus have to choose which gets this treatment. I chose the dimension of mutability, because I think it's easiest to make mistakes in this area, and so it would be the most beneficial use of the one opportunity for type checking.
Sugar and Spice and Everything Nice
In Cedar, some TYPEs are painted. This means they compare by name, rather than by structure. Records, and references to them, are painted. Arrays are not.
When a procedure Zork is defined in the same interface as some painted TYPE Foo, and the first argument of Zork is a Foo, then Zork can be invoked in `object-oriented' style: "foo.Zork[other args]", rather than the usual "Interface.Zork[foo, other args]". This is noticeably more concise, as it elides the interface name. (Don't whine about making it difficult to tell where Zork is coming from --- let TiogaDWIM make your "Def" button smarter.) For basic data types, such as we are talking about, this may be quite worthwhile. And if you don't like it, you don't have to do it --- the normal-style invocations also work.
So an effort has been made to make it possible to take advantage of object-oriented invocations. Another convenience is that for every class procedure, there is a procedure in the interface. So instead of writing "coll.class.Op[coll, other args]", a client can write "coll.Op[other args]". To avoid wasting time on procedure call overhead, these procedures are INLINEs. (Don't whine to me about how stupid the interpreter is --- whine to the interpreter maintainers. And note that it's not hard to replace such code with drivel acceptable to the interpreter.)
Type-Checking Special Cases
When you have a Cedar TYPE Foo, you can introduce a special case Bar with the constructor "Bar: TYPE = RECORD [Foo]". This is because Cedar has the peculiar notion that a record with only one component, and that one being unnamed, can be used everywhere that the component could be used. So your special Bar can be passed to any procedure that takes a Foo, and it can be assigned to any variable that ranges over Foos. And since records defined in interfaces are painted, if you make another specialization "Zot: TYPE = RECORD [Foo]", it will be different, according to the Cedar type system, from Bar.
Mutability
There are values, and there are variables. A variable takes on different values at different times. A variable is typed --- it can only take on certain values. A variable is a value. Everything is a value.
Thus, there are the immortal, immutable, mathematically pure collections. And then there are variables that range over collections. Should there be separate ADTs for these two abstractions? I chose `no'. They share a lot of operations --- those of a collection and those accessing the current value of a collection variable. If they were separate ADTs, they would either have to duplicate these operations, or the current value would have to be produced as an immutable collection by the variable for every access. Now, some implementations are inherently mutable, in the sense that it's cheaper to modify an existing instance of some kinds of data structure than it is to produce a new instance that differs in some small way. And some clients are happy with the semantic restriction that places -- they may only access the current value of a variable (or do an explicit copy to keep an old value). It would be a shame for the ADT to keep such happy couples apart.
There is a Cedar type, Mutability, that is used to describe a datum's position in this mess. It is an enumeration, of three values. One, constant, means the datum is not describing a variable; the other two, variable and readonly, mean the datum is describing a variable. The difference between variable and readonly is that in the former case, the datum grants `change rights' to its holder; in the latter case, although the variable may change value, the datum does not provide a way to make changes.
What is Equality?
We're trying to achieve polymorphism here, even though Cedar doesn't much support it. We need a Cedar data type that can represent anything. Well, lots of them will theoretically do, but in practice REF ANY works best, and that's what's used. But values occasionally need to be compared for equality. And sometimes you even want to do things like compare them to see which is `greater' (for some client-specifiable notion of greater), or to compute a client-specified hash function. How should these procedures be associated with the values? One bad answer would be to use address equality, address comparison, and a hash function of the address that the REF is. But that wouldn't let clients make, for example, a function of character strings (represented by ROPEs). A better answer would be by runtime type of the referent. But that still has limitations. A better answer is to store those procedures with each Collection. So the data type Space is introduced, to hold all those polymorphic procedures that apply to values, and each collection uses some Space.
Still Yet More To Come
Probably