DESIGN AND IMPLEMENTATION OF A RELATIONSHIP-ENTITY-DATUM DATA MODEL
5. Data model design issues
A number of decisions were necessary in the design of the Cypress data model described in this paper. A few of these decisions were made arbitrarily because no solution was obviously best. For the most part, however, the criteria of simplicity and utility discussed in Section 1, as we interpret them for the applications we envision, clearly point towards the model we have developed. The decisions that led to the Cypress model chosen are discussed in this section.
5.1 Relations and attributes
The Relational model, as described by Codd [1970], was the obvious starting point for the Cypress model. The paper by Codd is easily the most referenced paper in the database literature, and the Relational model is widely acknowledged as simple yet reasonably powerful. There have actually been a number of interpretations of "Relational model" in various implementations (such as System R by Astrahan et al [1976] and INGRES by Stonebraker et al [1976]). However, all implementations share the most basic idea of a database as a set of relations whose columns are attributes and whose rows are tuples, and most of them share the idea that the tuple attribute values may be strings, numbers, and booleans.
The fundamental type constraint, which states that all the relationships in a relation have the same number and types of attributes, is almost the same in all data models defined since the Relational model. The wording changes somewhat with the introduction of entities or type hierarchies, and some additional constraints are added with the introduction of keys, names, and normalization. We will discuss these minor differences as we contrast the various data model features.
5.2 Entities and domains
Codd [1979] introduced the idea of referential integrity: that attribute values come from domains such as "People" or "Parts." However, few implementations or theoretical studies of the Relational model carry this idea through to enforcing that persons or parts with unique names actually exist when referenced and that all references to these domains in a database are consistent. Thus, we introduce the concept of an entity to provide the important function of recognizing the objects we are trying to represent in the database. The entity idea frees applications from the integrity checking necessary in the Relational model on strings and numbers that are being used as unique identifiers for objects. We are used to the idea of objects as distinguished from their name, attributes, and relationships to other objects; the introduction of entities thus fulfills the simplicity (understandability) as well as utility criteria.
The introduction of entities has the disadvantage that it tends to "cast in stone" the decision to define a particular real-world object as an entity as opposed to a datum value (or vice versa). For example, if it is decided that the publisher attribute of the author relation is a string datum (the name of the publisher), and later it is changed to be an entity instead, then programs referring to the author relation must be modified. If a client uses the SetFS / GetFS operations, however, it is possible to obtain or even change the publisher without knowing whether it is an entity or datum value. This feature allows an application additional data independence at the expense of ignoring knowledge of entity existence and entity types.
The introduction of entities encodes more knowledge about the structure of the underlying information than the Relational data model. We can use this knowledge to provide better performance, and we do so in the Cypress Database System. For example: (1) entities indicate where physical links are desirable, (2) it is not necessary to store or use artificial numbers to identify objects because unique entity names provide this, (3) entities (atoms) may be stored and manipulated more efficiently than strings or integers (e.g., string matching person-names is more expensive than checking entity equality).
Note that the Cypress data model includes the Relational model as a special case in two senses:
1. If no domains are defined (i.e., all relationship fields are datum values), then we have exactly the Relational model: there is no domain type checking (e.g., that a value is a person’s name as opposed to just a string). Any correspondence between strings in one relation and strings in another, and their correspondence to objects, is left to the application programs.
2. The Relational model can be exercised on top of the Cypress data model even in the presence of entities by using only the translucent attribute operations. These operations treat all fields as strings as in the Relational model, but entities and any underlying physical structures required by the implementation are automatically maintained as well. For example, a new entity may automatically be created by using SetFS on a relation tuple. Thus different users may examine and manipulate the same database as datum values (Relational model), links (graph model), or both (our model).
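The second sense can be illustrated with a minimal sketch, written in Python rather than Cedar. It is not the Cypress implementation: the classes Database, Entity, and Relship, the schema dictionary, and the methods set_fs and get_fs are hypothetical stand-ins for SetFS and GetFS, and we assume that a schema entry is either a domain name or None for a datum-valued attribute.

```python
class Entity:
    """An atom identified by its domain and its name within that domain."""
    def __init__(self, domain, name):
        self.domain = domain
        self.name = name


class Database:
    def __init__(self):
        self.entities = {}  # (domain, name) -> Entity

    def declare_entity(self, domain, name):
        # Return the existing entity, or create it on first reference.
        key = (domain, name)
        if key not in self.entities:
            self.entities[key] = Entity(domain, name)
        return self.entities[key]


class Relship:
    """A tuple; schema maps attribute -> domain name, or None for a datum."""
    def __init__(self, db, schema):
        self.db = db
        self.schema = schema
        self.fields = {}

    def set_fs(self, attr, text):
        # Translucent assignment: the caller always passes a string.
        domain = self.schema[attr]
        if domain is None:
            self.fields[attr] = text  # stored as a datum value
        else:
            # Stored as an entity reference, created if necessary.
            self.fields[attr] = self.db.declare_entity(domain, text)

    def get_fs(self, attr):
        # Translucent fetch: the caller always receives a string.
        value = self.fields[attr]
        return value.name if isinstance(value, Entity) else value
```

A client that calls only set_fs and get_fs sees strings throughout, so under these assumptions it would keep working if the publisher attribute were later changed from a datum to an entity.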
The introduction of entities to the Relational model is now fairly widely accepted as a good idea, and Chen [1976] is probably most responsible for spreading this notion. Unfortunately, Chen introduces entities in a framework which is more complicated than it needs to be. Entities extend the representational capability of the Relational model in one basic way: they make it possible to distinguish between database items that represent objects and those that represent properties of (or relationships between) objects. Chen uses entities for this basic function, as unique "atoms." However, Chen and others have combined this function with two other data model features:
1. Treating entities as relationships, by allowing entities to have attributes just as relationships do. Chen does this, and most others have followed his lead.
2. Treating relationships as entities, by allowing attributes of relationships to reference other relationships as well as entities. Chen does not do this, but others (McLeod [1978], Codd [1979], Smith [1977b]) do, as do the older hierarchical and network models.
Combining the basic idea of an entity as an atom and a relationship as a data record can lead to some confusion. In the next two subsections we specifically discuss the merits of (1) and (2), respectively.
5.3 Entities as relationships: "attributes" of entities
Chen [1976] and others allow entities to have attributes. That is, they define an entity to have the features of a relationship: entity-valued or datum-valued attributes. This definition divides the information about an entity into two classes: the information that is given by its attributes, and the information that is given by relationships. For example, age might be an attribute of a person, while the projects he works on might be represented by member relationships between persons and projects. In many cases, the choice is not clear: should spouse or boss be attributes of persons or relationships between persons? Should birthday be a quaternary relation between a person, month, date, and year, or should the birthmonth, birthdate, and birthyear be attributes of a person? These decisions must be made at data definition time and changes later would necessitate changing application programs or providing some kind of translation mechanism.
Applying the simplicity and utility criteria to the entity attribute question suggests that:
1. [Utility] Entity attributes provide a short-hand making application programs and user interfaces simpler: the age of a person can be obtained in one operation instead of two, the extra operation being fetching the relationship. There could also be a performance issue here, but the database system can store relationships physically co-located with the entity about as easily as it can co-locate attributes, so this is not an issue in practice.
2. [Simplicity] Entity attributes introduce additional complexity, both in understandability and the parsimony of the representation. For example, there are now two different mechanisms to represent most facts about entities. There are also indirect, but perhaps more annoying, effects: the syntax and semantics of the query language will be more complicated and/or verbose, if we must handle both cases.
A solution we have not seen in the literature is to allow the best of both worlds, by providing a shorthand to obtain or assign in a single operation the entity properties stored as relationships, while retaining relationships as the real underlying logical storage primitive. This line of thinking is the motivation for the introduction of properties in the Cypress model. The result is a model whose utility and simplicity are an improvement on either extreme.
We should note two variations on the idea of allowing entities to have attributes: list-valued attributes and the characteristic/association scheme.
A few data models allow list-valued attributes, rather than insisting that only single-valued properties be attributes of entities. For example, the projects a person works on would be an attribute of the person, and the persons working on a project would be an attribute of a project. Note, however, that these attributes of persons and projects are really representing the same fact. By representing the fact as attributes of the entities instead of as relationships the danger of inconsistent data arises (unless yet another feature is added to the model to automatically maintain the interdependencies). Serendipitously, the property mechanism we introduced as a shorthand solves this problem too. The convenient list-valued properties are a veneer on top of the underlying relations, and the consistency of the database is retained at the level of facts stored as atomic relationships.
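The veneer idea can be sketched as follows. This is an illustrative Python fragment under our own assumptions, not the Cypress property operations: the class MemberRelation and the methods declare, projects_of, and members_of are hypothetical names. Each fact is stored once as an atomic relationship, and both list-valued "properties" are derived from the same stored facts.

```python
class MemberRelation:
    """Binary member relation stored as atomic (person, project) facts."""
    def __init__(self):
        self.facts = set()

    def declare(self, person, project):
        # The only place a fact is stored.
        self.facts.add((person, project))

    def projects_of(self, person):
        # List-valued property of a person, derived on demand.
        return sorted(proj for (pers, proj) in self.facts if pers == person)

    def members_of(self, project):
        # List-valued property of a project, derived from the same facts.
        return sorted(pers for (pers, proj) in self.facts if proj == project)
```

Because neither view is stored separately, the two properties cannot disagree with each other in this sketch.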
Another variation on the entity attribute feature is to categorize relations into two types: associations and characteristics (Pirotte [1977], Codd [1979]). Only entity values may be attributes of associations. Only datum values may be attributes of characteristics, except for the first attribute, which is always an entity (characteristic relations relate an entity to datum values that are its "characteristics"). However, because a real-world logical dependency may involve both entities and datum values, it is also necessary in these models to allow relationships to be entities, so that the datum values associated with a functional dependency can be expressed as characteristics of the relationship. In some cases, previous authors have further restricted associations and characteristics to be binary relations. There doesn’t seem to be any advantage to the association/characteristic scheme, and it was not adopted in the Cypress model.
5.4 Relationships as entities: Allowing references to relationships
Some data models introduce another feature to the entity/relationship idea: they allow relationships to act as entities in the sense that they can be referenced by other relationships. In the models the author knows of, this is done by eradicating all distinctions between entities and relationships. That is, there is only one "object" type of primitive, that can have attributes like a relationship, and can be referenced elsewhere like an entity.
The single object type has the attractiveness of simplicity. One still needs the same operations present before on relationships and on entities, but there is only one type to deal with. Unfortunately, it falls back a bit on the utility criterion, as we lose some features that the entity/relationship distinction was providing us. In addition to the obvious fact that the user can no longer tell which data items are thought of as real-world objects, the system cannot tell the difference and so cannot:
1. automatically check that objects intended only as relationships are never referenced,
2. perform optimizations that depend on knowing whether a database object can ever be referenced elsewhere, or
3. print the database in a simple linear or human-readable way, as it can when only entities can be referenced and all entities have names (or names can be invented).
Consider the motivation for dropping the entity/relationship distinction. Some relationships, say a purchase relationship between a customer, part, and quantity of the part ordered, may act as an entity in and of itself. For example, the purchase may take part in a commission relationship between the purchase, a salesman, and a commission in dollars. If we drop the entity/relationship distinction, this kind of relationship can be added with no change to the existing data schema, as the previously unreferenced purchase relationships can just as easily be treated as entities. Keeping the entity/relationship distinction, such a change would require modifications to database application programs or at least the use of some intermediate interface to shield them from such changes (such as translucent attributes or views).
On the other hand, we can argue that when we choose to treat a relationship as representing an abstract object in and of itself, we should then and only then introduce a domain to represent that kind of event or transaction. Quite simply, that’s what an entity is for: it says that we are thinking of this database item as an object, that can participate in relationships.
In a more practical and perhaps more convincing vein, not allowing explicit references to the relationships in a database permits a much simpler query language, since it need not deal with nesting of relations. Relationships referencing relationships provide a potential spaghetti of interconnections. In particular, as noted in earlier sections, we may still use a relational query language in our data model, despite the introduction of the entities, names, domains, subtypes, and so on. This compatibility is possible exactly because of the dichotomy we introduced between relationships (objects that may reference other objects but may not themselves be referenced), and entities (objects that may be referenced but may not reference others). We have a simple, two-level structure.
In applying the simplicity and utility criteria here, the choice is difficult. Our choice, to keep the entity/relationship distinction, was motivated by the observations that (1) dropping the distinction results in no fewer operations either in the database interface or the application program, it merely condenses two types into one; (2) there is a loss of some integrity checking and data semantics by dropping the distinction; (3) a linear and human-readable notation of entities and relationships, using entity names, is possible; and (4) the query language is simplified by permitting only one information-carrying element, relations.
There is a variation on dropping the entity/relationship distinction that allows references to relationships only in a non-cyclic hierarchical fashion; this middle ground seems to be more complex and/or less powerful than either of the extreme positions, however.
5.5 Lists and sets
A few data models permit set values for relationship attributes. We discussed this in the previous subsection in the dual form of multi-valued attributes of entities. Both have the same drawback in maintaining consistency of the database. Both violate Codd’s [1970] first normal form designed to avoid such inconsistencies.
On the other hand, we have found it useful in database applications to have some mechanism to represent the concept of a list, an ordering upon database entities or relationships. Almost none of the data models of which the author is aware provide any help in representing the concept of a list. The standard way to represent ordering in the Relational model is by adding another attribute to the relation, specifying the ordinal position of the tuple. A list thus constructed can only be enumerated efficiently if an index is built upon the attribute, and it is quite awkward to insert new tuples in the middle of the list.
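The awkwardness of the ordinal-attribute representation can be seen in a small sketch. The Python below is illustrative only; the row layout (dictionaries with hypothetical "pos" and "item" attributes) and the function names are our assumptions.

```python
def insert_at(rows, position, item):
    """Insert item at an ordinal position; return how many rows were renumbered."""
    renumbered = 0
    for row in rows:
        if row["pos"] >= position:
            row["pos"] += 1  # every later tuple must be updated
            renumbered += 1
    rows.append({"pos": position, "item": item})
    return renumbered


def enumerate_in_order(rows):
    # Without an index on "pos", enumeration requires sorting the relation.
    return [row["item"] for row in sorted(rows, key=lambda r: r["pos"])]
```

Inserting near the front of a long list touches nearly every tuple, which is the cost the text refers to.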
We do not claim to have solved the list representation problem in the Cypress data model, either. However, in practice we have used the list representation our implementation provides: the order maintained on references to an entity. When a RelationSubset is executed with an entity-valued attribute-value pair [a, e] as the constraint, the tuples that reference the entity are returned in a particular order. This ordering is guaranteed by the group lists we will describe in Section 6. It is the same order as the entities were specified in the call to SetPList, or in the order they were created if SetP or DeclareRelship calls were used. We are experimenting with an optional extra argument to SetF which specifies where in the list the new tuple should be inserted when a new reference to an entity is established:
SetF: PROC[t: Relship, a: Attribute, v: Value, after: Relship ← NIL];
Using this mechanism, our applications can define binary member relations as connectors to entities that represent ordered (or unordered) sets. Elements can easily be added and removed from a set with SetF or SetP, and the sets can easily be enumerated with RelationSubset.
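A minimal sketch of the intended behavior follows, in Python rather than Cedar. It is not the Cypress group-list implementation: the class GroupLists and its methods are hypothetical, a Python list stands in for the linked list, and we assume that omitting the after argument appends the new reference at the end. The sketch also ignores the removal of a tuple from its previous group when an attribute is reassigned.

```python
class Relship:
    """A tuple handle; compared by identity, as a database handle would be."""
    def __init__(self, label):
        self.label = label
        self.fields = {}


class GroupLists:
    def __init__(self):
        # (attribute, entity) -> ordered list of tuples referencing the entity
        self.groups = {}

    def set_f(self, t, a, v, after=None):
        group = self.groups.setdefault((a, v), [])
        if after is None:
            group.append(t)  # assumed default: add at the end
        else:
            group.insert(group.index(after) + 1, t)  # place just after `after`
        t.fields[a] = v

    def relation_subset(self, a, e):
        # Tuples referencing e through attribute a, in maintained order.
        return list(self.groups.get((a, e), []))
```

With this structure an element can be placed in the middle of an ordered set without renumbering the other members.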
However, we have not advertised this feature in the description of the data model or the DBModel interface because it is only a tentative solution. Not only does this solution depend upon the implementation of the entity references via linked lists, but the feature is not accessible at all from a higher-level Relational query language. More work is badly needed, to extend data models and query languages to cover the concept of ordering in a clean and integrated way.
5.6 Entity names
The introduction of entity names to the data model is a comparatively easy choice. They easily satisfy the criteria of simplicity and utility. What is less clear is the form in which they should be introduced.
For the applications we have encountered, nearly all real-world objects have human-meaningful names that are unique within their domain or can easily be made to be so: people, organizations, articles, programs, calendar years, or electronic messages. The natural real-world name is preferred to an internally generated identifier for the object, if the real-world name is guaranteed to be unique, because the name is the way in which a user will typically identify an object to the computer and represent the object outside the database.
A number of applications use names that logically have multiple parts. The name of a person, for example, might consist of the concatenation of a first, middle, and last name. Some applications have dependent names. A dependent name is a multiple-part name one of whose parts is another entity. The name for a file system subdirectory, for example, might be a two-part name consisting of the entity for its super-directory plus the string name of the subdirectory. The name of a wine, e.g. a 1979 Sebastiani Zinfandel, has three parts: the vintage year, the winery entity, and the variety.
The ubiquity of multi-part and dependent names easily satisfies our utility criteria. They probably also satisfy our simplicity criteria, but this is less obvious. Closer examination reveals that the implementation of names must be reasonably complex in order to satisfy typical applications’ needs. Different separators for the name parts and different conventions for their combination and sort order may be required: e.g., "Lastname, FirstName Initial" for persons, or "<SuperDirectory>SubDirectory>" for file system directories. Also, the data schema mechanism must define the name in terms of independent existing relations. That is, one probably does not want to think of the name components as "attributes" of an entity in the sense that relationships have attributes. The relation between wines and wineries or the relation between organizations and their parent organizations are queried independently but are also used in derivation of wine and organization names.
Our conclusion is that the data model should allow for a procedural specification of the name derivation so that names can automatically be updated when data change, rather than allowing some fixed name derivation primitives. This allows an arbitrary definition of names, just as views may allow arbitrary relation definitions. It also removes the need for the database system to understand names at all, since it treats the name function as a "black box." Until storage of procedural information is easier in the Cedar programming environment, however, the name function has been omitted from the implementation.
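As an illustration of a procedural name derivation, consider the dependent directory name used as an example above. The Python sketch below is hypothetical and is not part of the implementation, which omits the name function: the relation names "local_name" and "parent", the NAME_FUNCTIONS registry, and both functions are assumptions made for the example.

```python
def directory_name(entity, relations):
    """Derive a directory's name from independent, separately queryable relations."""
    local = relations["local_name"][entity]
    parent = relations["parent"].get(entity)
    if parent is None:
        return "<" + local + ">"
    # A dependent name: the super-directory's name plus the local part.
    return directory_name(parent, relations) + local + ">"


# The database system treats each name function as a black box.
NAME_FUNCTIONS = {"Directory": directory_name}


def entity_name(domain, entity, relations):
    return NAME_FUNCTIONS[domain](entity, relations)
```

Because the name is computed from the underlying relations each time, renaming a super-directory changes the derived names of its subdirectories without any separate update.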
One could imagine applications that use integers or dates as entity names rather than strings. Again, this variation could be handled by a procedural specification of the name derivation, and we found insufficient demand to explicitly incorporate this into Cypress.
Real-world objects can have more than one name, so it might be reasonable to allow this in the data model as well. Multiple names were omitted from the Cypress model since the user can build an alias relation on top of the system that is checked whenever names are looked up. The only common example of aliases we have encountered is for person entities, so name aliasing does not satisfy the utility criterion. However, if multiple names come to be used widely, it does make more sense to augment the model. In that case it is still desirable for every entity to have a primary name to allow for correct operation of GetFS and SetFS.
Another more difficult design decision for names in the data model is whether to allow null names for entities (Codd [1979], Kent [1978], Date [1981]). Without the guarantee that every entity has a unique identifier, a number of database operations become more complex in both semantics and implementation: expression of queries, printing of entities, dumping of databases, cross-database references (augments), data entry, and maintaining the correspondence between relational and entity-based database access (GetFS, SetFS). Unfortunately some applications do not require names. For example the transistors in an LSI layout may be represented as entities accessible only through their relationships to connected entities in the layout. The best solution seems to be to require that entities have names, but provide a default system facility for generating unique names. The GENSYM facility for generating unique atoms in LISP does this. The DeclareEntity procedure in our implementation acts similarly. Thus the data model’s invariants remain simple but add no additional complexity to application programs not requiring names.
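The default-name facility can be sketched in a few lines. The Python below is only an illustration in the spirit of GENSYM: the class Domain, the generated name format, and the choice to reject an explicitly supplied duplicate name are our assumptions and do not describe the behavior of DeclareEntity.

```python
import itertools


class Domain:
    def __init__(self, name):
        self.name = name
        self.names = set()  # names of existing entities, unique in the domain
        self._counter = itertools.count(1)

    def declare_entity(self, name=None):
        if name is None:
            name = self._generate()  # the application did not care about a name
        elif name in self.names:
            raise ValueError(f"name {name!r} already used in domain {self.name}")
        self.names.add(name)
        return name

    def _generate(self):
        # Skip over any generated candidate that is already in use.
        while True:
            candidate = f"{self.name}#{next(self._counter)}"
            if candidate not in self.names:
                return candidate
```

An application such as the LSI layout example could call declare_entity() with no argument and never look at the result, while the invariant that every entity has a unique name still holds.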
5.7 Keys, normalization, and dependencies
Any kind of key mechanism is only an approximation to a description of arbitrarily complex dependency relationships between entities and datum values involved in relationships. Functional and multi-valued dependencies and their implications for relational normalization have been extensively studied in the literature (Codd [1970], Fagin [1977, 1981], Sadri & Ullman [1980]). Keys have the advantage of simplicity over a more general, open-ended scheme requiring constraint checking on a list of axioms on each database update: 80% of the checking can be handled with 20% of the complexity. We are forced once more to apply our utility constraint: how complex a key mechanism would most database applications use? Uniqueness of entity names in domains? Single-attribute primary keys for relations? Multiple-attribute primary keys? Optional keys (unique if present)? We include all of these mechanisms in the Cypress data model, but not arbitrary assertions.
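The kinds of key checking listed above can be sketched as follows. This Python fragment is a hedged illustration, not the Cypress mechanism: the class Relation, the KeyViolation exception, and the representation of a key as a tuple of attribute names are all assumptions. An optional key is checked only when all of its attributes are present.

```python
class KeyViolation(Exception):
    pass


class Relation:
    def __init__(self, keys=(), optional_keys=()):
        # Each key is a tuple of attribute names (single or multiple).
        self.keys = [tuple(k) for k in keys]
        self.optional_keys = [tuple(k) for k in optional_keys]
        self.tuples = []

    def _duplicates(self, key, new):
        new_value = tuple(new.get(attr) for attr in key)
        return any(
            tuple(old.get(attr) for attr in key) == new_value
            for old in self.tuples
        )

    def insert(self, new):
        for key in self.keys:
            if self._duplicates(key, new):
                raise KeyViolation(f"duplicate value for key {key}")
        for key in self.optional_keys:
            present = all(new.get(attr) is not None for attr in key)
            if present and self._duplicates(key, new):
                raise KeyViolation(f"duplicate value for optional key {key}")
        self.tuples.append(new)
```

Each check is a simple comparison against existing tuples on update, in contrast to evaluating arbitrary assertions.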
Normalization is typically not enforced by a database system. The database administrator attempts to design the data schema in a manner to minimize various anomalies, through normalization. The subject of database normalization, as noted in the previous paragraph, has been extensively studied in the literature. It is my opinion that arbitrarily complex kinds of normalization and dependencies will continue to be discovered. The easiest solution, therefore, is to take the extreme position: adopt the convention of normalizing databases as much as possible at the conceptual level. We define functionally normalized to mean that relationships in the model are single irreducible facts. The result in practice is that most relations are binary; the physical database representation may in fact join relations in storage for efficiency, but this optimization is not visible at the client’s conceptual level.
Hall et al [1976], Biller [1979], and Schmidt & Swenson [1975] argue that the irreducible relational form we are advocating is preferable to joining unrelated functional dependencies into a single relation. There are several arguments: (1) this is easier to understand because the relational representation matches the real-world dependencies, (2) if there is efficiency in actually joining them, this can be achieved by storing the join in the physical representation and defining the real relations by views that are projections of this underlying relation, (3) this representation is more "restrictive" than any normal form that could be derived. The irreducible form has therefore been used by convention, although not enforced, in the Cypress system applications. Features of the database system and associated tools make this form easy to use.
Our definition of irreducibility is not the same as irreducibility to relations whose join can reproduce the original relation, as in Hall et al [1976], but rather what Biller [1979] calls s-irreducibility. Biller proves that our definition is more restrictive.
We backed off one step from irreducibility to what we called "functional irreducibility": in this form we allow two or more relationships to be combined into a single relationship when they must logically all be present or all be absent for each external (real world) entity. In practice we find that functional irreducibility differs from irreducibility only when we separate components of a single real world value into its constituent parts. For example, the birthday relation given as an example of a functionally irreducible 4th-order relation in Section 2.7 would have been a binary relation had the month, day, and year been combined into a single "date" value. Kent [1982] discusses the issue of combining or separating components of values or entity names.
The last topic we cover in this section is "existence dependencies" between entities and relationships, in particular the fact that DestroyEntity is defined to include the destruction of relationships that reference the entity. One could imagine other semantics:
1. When an entity is destroyed, the attributes of relationships that reference it could become NIL.
2. We could disallow the destruction of entities altogether, destroying only relationships.
3. When two database applications share entities of a particular domain, e.g. both a message system and a phone directory share information about people, deletion of an entity by one application should only delete relationships pertaining to the entity for that application, not the entity itself.
Alternative (1) was ruled out on the basis of our utility criterion. For the database applications we have and envision, the desired semantics is to destroy all information about an entity when the entity is destroyed. Alternative (2) is attractive, as it simplifies the model by reducing the number of operations we must understand and implement. However, it was also ruled out on the basis of our utility criterion; there are cases where an application program really needs to remove an entity completely from the database, e.g. when old data is purged. Furthermore, the author knows no simple implementation for this alternative not requiring "garbage collection" of entities created but no longer used.
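The difference between the chosen semantics and alternative (1) can be made concrete with a small sketch. The Python below is illustrative only; the function names and the representation of relationships as dictionaries are hypothetical, and entities are assumed to be distinct objects compared by identity.

```python
class Entity:
    def __init__(self, name):
        self.name = name


def destroy_entity(entities, relships, entity):
    """Chosen semantics: also destroy every relationship referencing the entity."""
    entities.remove(entity)
    relships[:] = [
        r for r in relships if all(v is not entity for v in r.values())
    ]


def destroy_entity_nullifying(entities, relships, entity):
    """Alternative (1): keep the relationships, setting references to None (NIL)."""
    entities.remove(entity)
    for r in relships:
        for attr in r:
            if r[attr] is entity:
                r[attr] = None
```

Under the first function no information about the entity survives; under the second, relationships with empty attributes remain for the application to interpret.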
Alternative (3) is an important issue, as the support of applications with overlapping database schemas is crucial. Of course, applications that share domains will have to be in agreement as to which relations belong to whom anyway, and could delete only their own data. But a more powerful mechanism that could delete the relevant data in one operation is desirable. The solution to that problem, in the opinion of the author, is to use the data segmentation mechanisms we consider further in Section 5.9. These mechanisms not only allow physical independence of the applications, but deletion of the entity representative of a real-world object in one segment has no effect on the entity or its relationships in another segment.
5.8 Generalization and type hierarchies
A wide variety of mechanisms have been put forward in the literature for specifying types and subtypes of data objects, or, equivalently, allowing the same object to participate in more than one class of relationship. These mechanisms have variously been called type hierarchies, type hierarchies with multiple inheritance, type lattices, roles, and flavors.
There are two major data modelling questions to address: how powerful a type-subtype mechanism is desirable, and whether it should apply to domains, relations, and/or datatypes. Both of these choices are difficult, and were made on the basis of the applications and schemas of the applications we envision.
At least some sort of type lattice ranks high on our utility criterion, as many applications define relations that are constrained to a subset of the entities of a type to which more general relations also apply. A database of documents for example, would separate those documents that are individual works (books, articles, technical reports) from those that are collections (journals, conference proceedings, or collected books).
Although we can construct examples where more complex type mechanisms such as roles or flavors are desirable, for example treating a person entity as either a homeowner or employee or both, none of the applications we envision would use this generality and the more complex mechanisms are harder to understand as well as more complex in their implementation. The addition of multiple supertypes to the basic type hierarchy mechanism, on the other hand, is a relatively simple concept to understand and implement, and handles most data schemas the more complex mechanisms handle. Thus we arrive at the type hierarchy with multiple supertypes.
Yet another variation on the type-subtype mechanism is to define subtypes by a predicate computed at run-time from data in the database (McLeod [1978]). For example, an entity would be defined to have type mother if it has type person and also has a parent relation to one or more children. Computing types at run-time unfortunately introduces much of the complexity of arbitrary constraint satisfaction tests earlier rejected in our database normalization discussion. It also provides more power than called for by our applications and the utility criterion. Subtypes computed by predicate were therefore omitted from the Cypress data model.
In their well-known paper on generalization (type hierarchies), Smith and Smith [1977] apply the concept to all kinds of database objects, and we would naturally apply it to relations, domains, and datatypes. This idea certainly satisfies our simplicity criterion, as it makes sense to apply generalization universally in the data model if at all. In our implementation, however, type hierarchies are provided only for domains. Type hierarchies of relations are not allowed, because they do not seem particularly useful to applications. Since user-defined datatypes are not currently permitted, there is no need for a subtype mechanism for them, either. However, a subtype mechanism for datatypes does satisfy our simplicity and utility criteria when user-defined datatypes are defined. The hierarchy of relation types is difficult to implement when multiple supertypes are allowed and attributes of relationships are stored in physically contiguous memory, because the multiple supertypes may define overlapping incompatible fields. Furthermore, the hierarchy of relation types seems to be of less utility.
The effect of subdomains could be achieved by allowing relation attributes to accept one of a list of types, instead of using the predefined SubType relation to define a lattice of types and specifying the one least common domain for each attribute. These two alternative representations do not differ much in the flexibility of the type mechanism. The former might be better if we found it necessary in the latter scheme to define a large number of "artificial" intermediate types in order to achieve the type constraints we desire. However, this has not been the case in our experience. The single type for attributes might be simpler to understand, but the argument is not a convincing one. Therefore the choice we made, to use the SubType relation, was arbitrary.
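The two alternatives can be compared in a small sketch. It is written in Python purely for illustration and is not the Cypress interface. The type names, the dictionary standing in for the SubType relation, and the functions accepts_by_list and accepts_by_domain are all assumptions. The type LegalParty plays the part of an "artificial" intermediate type introduced only so that one attribute can accept both persons and companies.

```python
# Illustrative stand-in for the SubType relation: subtype -> direct supertypes.
# All type names here are hypothetical.
SUBTYPE = {
    "LegalParty": [],                      # artificial intermediate type
    "Person": ["LegalParty"],
    "Company": ["LegalParty"],
    "Employee": ["Person"],
    "Homeowner": ["Person"],
    "EmployedHomeowner": ["Employee", "Homeowner"],  # multiple supertypes
    "Vehicle": [],
}

def is_subtype(t, ancestor):
    """True if t is ancestor or reaches it through supertype links."""
    if t == ancestor:
        return True
    return any(is_subtype(s, ancestor) for s in SUBTYPE[t])

def accepts_by_list(allowed_types, entity_type):
    """Alternative 1: the attribute carries a list of acceptable types."""
    return any(is_subtype(entity_type, a) for a in allowed_types)

def accepts_by_domain(domain, entity_type):
    """Alternative 2: the attribute names one domain in the type lattice."""
    return is_subtype(entity_type, domain)

OWNER_LIST = ["Person", "Company"]   # list-of-types form
OWNER_DOMAIN = "LegalParty"          # single-domain form

# The artificial type is assumed to have no direct instances of its own.
CONCRETE = [t for t in SUBTYPE if t != "LegalParty"]
agree = all(
    accepts_by_list(OWNER_LIST, t) == accepts_by_domain(OWNER_DOMAIN, t)
    for t in CONCRETE
)
```

Under these assumptions the two forms accept the same entity types. The single-domain form costs one extra type declaration here, which is the kind of overhead the text weighs.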
5.9 Access primitives
Most database management systems (INGRES and System R, for example) provide an interface to tuples (relationships or entities) at a coarser grain than ours. They do not allow tuple-valued variables in a client program, on which operations such as GetF and SetF can be invoked to extract or assign particular values. Instead tuples are explicitly copied from a database into buffers allocated by the client, and explicitly copied back on an update. The difference between our approach and theirs is subtle, but it may have a major impact on the design of an application. Because the "handles" for the tuples in our scheme are garbage-collected, the client need not keep track of the number, logical source, allocation, or size of variables referencing database objects.
One additional complexity is introduced by the granularity of database access in Cypress. Specifically, the problem does not result from our choice of "call by reference" over the "call by value" of other systems, but from the fact that the updates to a particular tuple are not necessarily all encapsulated in a single database call. In examining the interface to the implementation described in Section 3.4, the reader will note that relationship attributes can be assigned either by calling DeclareRelship with a list of attribute values or by successive calls to SetF. The former is preferable, as it sometimes allows the database system to update the data more efficiently. In addition, the latter complicates the implementation and/or the rules for database system use, for example in the use of the surrogate relation optimization we discuss in Section 6. For now, however, we have retained SetF in the implementation as a convenience to the client.
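The contrast between the two styles of assignment can be sketched as follows. This is a Python illustration only. The operation names DeclareRelship, SetF, and GetF come from the text, but the Python signatures, the Relship and Database classes, and the writes counter are assumptions. The counter merely tallies calls that reach the store and is not a model of actual Cypress costs.

```python
class Relship:
    """Hypothetical handle to one relationship (tuple)."""
    def __init__(self, relation, values):
        self.relation = relation
        self.values = dict(values)

class Database:
    """Toy store that counts the update calls it receives."""
    def __init__(self):
        self.relships = []
        self.writes = 0

    def declare_relship(self, relation, **values):
        # All attribute values supplied here arrive in a single call.
        r = Relship(relation, values)
        self.relships.append(r)
        self.writes += 1
        return r

    def set_f(self, relship, attribute, value):
        # Each attribute assigned this way is a separate call.
        relship.values[attribute] = value
        self.writes += 1

    def get_f(self, relship, attribute):
        return relship.values.get(attribute)

db = Database()

# Style 1: one call carries every attribute value.
a = db.declare_relship("author", book="B1", person="P1")
writes_one_call = db.writes

# Style 2: the relationship is declared empty, then filled field by field.
b = db.declare_relship("author")
db.set_f(b, "book", "B2")
db.set_f(b, "person", "P2")
writes_stepwise = db.writes - writes_one_call
```

In the second style the relationship exists in a partially filled state between calls. That is one way to see why successive SetF calls complicate the rules for database system use.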
In addition to the call by reference/value distinction, INGRES and System R differ from our design in providing a higher level of access to database clients: a query language. Of course, we plan the addition of a Query level to Cypress (or multiple different query levels). The difference remains, however, that we are willing to have clients access the database at the lower "navigational" level as well. Our introduction of entities into the Relational model makes a navigational interface compatible with concurrent use at a higher level. The introduction of entities also makes possible query languages of a different style, e.g. more like predicate calculus, or using a "chain" of domains with relations as connectors.
5.10 Views, segments, and augments
There is little question that views are quite useful and also relatively easy to understand, and thus satisfy our utility and simplicity criteria. A more difficult question in the design of views is the power of the view definition language: should an arbitrary programming language be allowed, or a more restricted, specialized query language? We have essentially avoided this issue by separating the definition of the query language from the definition of the data model itself. A particular choice will have to be made in the second phase of development of the Model Level of Cypress.
In addition to the utility of relational views as discussed in the literature, e.g. for data independence, they can be used to encode operations on entities in an object-oriented form. The operations of hiring or firing an employee or sending a message can be invoked by storing a tuple into a relation whose attributes are the arguments of the operation. For example, if the operation of sending a message takes two arguments, the message and the recipient, the actual operation would be invoked by storing a tuple in the Send view (relation) with the appropriate message entity and recipient entity as attributes.
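The Send example can be sketched in Python as a view whose insert operation runs a procedure rather than storing a tuple. The class OperationView, the function send_message, and the list delivered are hypothetical names introduced for illustration. How such views would be defined in Cypress is left open by the text.

```python
class OperationView:
    """Hypothetical view: storing a tuple invokes a procedure on its attributes."""
    def __init__(self, attributes, procedure):
        self.attributes = tuple(attributes)
        self.procedure = procedure

    def insert(self, **values):
        # The tuple must supply exactly the attributes the view declares.
        if set(values) != set(self.attributes):
            raise ValueError("tuple must supply exactly the view's attributes")
        return self.procedure(**values)

delivered = []  # stands in for the effect of actually sending a message

def send_message(message, recipient):
    delivered.append((recipient, message))

# The Send view has two attributes: the message and the recipient.
send = OperationView(("message", "recipient"), send_message)

# "Storing a tuple" in the view performs the operation.
send.insert(message="msg-1", recipient="jones")
```

The client sees only a relation with two attributes, so the same interface serves stored relations and operations alike.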
In general, we have taken a reasonably conservative philosophy in the design of the Cypress data model, integrating ideas that have each been studied theoretically in the literature, if not implemented. This philosophy increases the probability of a practical implementation of the model. The introduction of augments is an area where we have violated the conservative philosophy, however, because the functionality of augments seems so important to applications.
Augments and segments have roots in the segments of System R (Astrahan et al [1976]). Our segments are slightly more powerful than theirs, as we allow references across segment boundaries; augments are of course more powerful still. System R segments cannot be used for encapsulating updates to data or to combine private and public data. Relations cannot cross System R segment boundaries.
The use of augments to represent versions of data makes them similar to the "layers" of Goldstein and Bobrow [1980] and the "hypothetical databases" of Stonebraker [1980]. None of these mechanisms has been implemented on a database system scale at the time of this writing. The semantics of layers, hypothetical databases, and augments differ slightly at the conceptual level. Augments have the additional feature that they have semantics at a physical level: they can be used for separating data that is physically independent in the face of software and hardware failures, and the data may be distributed over machines.
Transactions provide a mechanism to encapsulate updates to a database. A natural approach to providing the capabilities of augments in a database system might be to transform transactions into permanent, full-fledged objects. Perhaps in terms of the simplicity criterion, the elaboration of transactions has some merit. However, the implementation of long-term and short-term encapsulation of data updates must be quite different and the system would at least require some hint as to which physical model to use for a particular update. We chose in the Cypress data model to make the permanent versus temporary distinction show through to the client, because there seems to be little or no utility in avoiding the distinction. The applications we envision require transactions and/or augments, but not interchangeably.
For now we have chosen to implement segments rather than augments. A segment allows new data (entities or relationships) to be inserted, possibly referencing entities in other segments. However, data may not be deleted or modified from a segment other than the one in which it resides. Furthermore, applications must simulate the effect of relations crossing segment boundaries where desired (using the "namesakes" correspondence). Segments handle many of the applications we envision, combining orthogonal databases from different applications. A few applications require augments, however (e.g., a database of different versions of systems). Others are simplified by augments (e.g., several databases of people with overlapping information).
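The segment rules just stated can be illustrated with a small Python sketch. The classes Segment and Entity and their methods are assumptions made for the example and do not reproduce the Cypress interface. The sketch also ignores the namesakes correspondence and the fate of relationships that reference a deleted entity.

```python
class Entity:
    """Hypothetical entity that remembers the segment in which it resides."""
    def __init__(self, segment, name):
        self.segment = segment
        self.name = name

class Segment:
    def __init__(self, name):
        self.name = name
        self.entities = {}
        self.relships = []

    def declare_entity(self, name):
        e = Entity(self, name)
        self.entities[name] = e
        return e

    def declare_relship(self, relation, *entities):
        # New relationships may reference entities residing in other segments.
        r = (relation, entities)
        self.relships.append(r)
        return r

    def delete_entity(self, entity):
        # Data may be deleted only through the segment in which it resides.
        if entity.segment is not self:
            raise PermissionError("entity resides in segment " + entity.segment.name)
        del self.entities[entity.name]

public = Segment("public")
private = Segment("private")

smith = public.declare_entity("Smith")
note = private.declare_entity("Note1")

# Allowed: a private relationship referencing a public entity.
private.declare_relship("about", note, smith)

# Not allowed: deleting the public entity by way of the private segment.
try:
    private.delete_entity(smith)
    blocked = False
except PermissionError:
    blocked = True
```

A personal modification to public data, as opposed to an addition, is exactly the case this sketch rejects and that augments are meant to handle.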
Augments add little or no conceptual complexity: they simply make an entity with a given name unique throughout, rather than unique only on a per-segment basis. Subtractive augments also have some utility even where they currently seem unnecessary: e.g., we might want to make personal modifications rather than simply additions to a public database. We conclude that both the utility and simplicity criteria are satisfied by augments. The implementation of augments is considerably more complex than that of segments, so it was deferred.
5.11 Summary
In this section we have explained the rationale for our choice of data model. The criteria for our selection are simplicity and utility, viewed from the perspective of existing or envisioned database applications. Surprisingly, the choice of data model features in Cypress has been reasonably clear cut. Examining the needs of actual applications rather than postulating potential uses has greatly simplified the choice of model.
We started with the relational data model, adding the concept of entities. Entities ensure referential integrity: the constraint that relations be between objects of specified types. We introduce entities in a way different from most other data models. Some models treat entities as relationships; that is, they allow entities to have fields just like relationships. We rejected this approach as it provides two redundant representations of information and thereby violates the simplicity criterion as we define it. Some models treat relationships as entities; that is, they allow references to relationships from other relationships. We rejected this approach because it results in a network "spaghetti" to which a high-level query language cannot be applied.
We augmented the basic entity-relationship model with features of high utility and little additional complexity: relational keys, unique entity names, segments, and a hierarchy of types. We also provide a limited mechanism to represent existence dependencies between data: relationships are deleted when an entity they reference is deleted, but relationships and entities in separate segments are independent. The basic features of Cypress are present in one or more other data models, although this is probably the first model combining all of these ideas.
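The existence dependency described above can be sketched in Python for the case of a single segment. The class Store and its method names are hypothetical, and entities are represented only by their names. The sketch shows two things: a relationship may reference only declared entities, and deleting an entity deletes the relationships that reference it.

```python
class Store:
    """Hypothetical single-segment store of entities and relationships."""
    def __init__(self):
        self.entities = set()
        self.relships = []  # each entry: (relation name, tuple of entity names)

    def declare_entity(self, name):
        self.entities.add(name)

    def declare_relship(self, relation, *refs):
        # Referential integrity: every referenced entity must exist.
        missing = [r for r in refs if r not in self.entities]
        if missing:
            raise KeyError(f"unknown entities: {missing}")
        self.relships.append((relation, refs))

    def destroy_entity(self, name):
        # Relationships referencing the entity are deleted along with it.
        self.entities.discard(name)
        self.relships = [r for r in self.relships if name not in r[1]]

store = Store()
for n in ("Smith", "Jones", "Report"):
    store.declare_entity(n)

store.declare_relship("author", "Report", "Smith")
store.declare_relship("reviewer", "Report", "Jones")

store.destroy_entity("Smith")
remaining = list(store.relships)
```

After Smith is destroyed, only the reviewer relationship remains, so no relationship is left referring to a nonexistent entity.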
Other data model features appear to have high utility, but more work is needed to provide a useful implementation of such features. Augments fall in this category. More complex (multi-part, multi-type) entity names would also be useful, but we know of no sufficiently general and efficient implementation. Cypress has a limited mechanism for representing lists and sets, using relationships to represent membership, but a better mechanism would be desirable.
The Query level of Cypress will provide relational views and a higher-level access language. A fair amount of research work already exists in these areas, and Cypress provides the same basis as other database systems for query and view construction. The Query level has been deferred for now, however.