Heading:
Cypress database concepts and facilities, October 1982
Page Numbers: Yes First Page: 10 X: 527 Y: 10.5"
3. Data modelling concepts
Data Independence
In this section, we give an informal description of the Cypress data model. We are dealing with a conceptual data model, the language used to define and access the types of objects and relationships between them. This should be carefully distinguished from a physical data model, the data structuring mechanisms we are given for actual storage and access of data. The physical data model is hidden as much as possible from the database client to facilitate data independence, the guarantee that a user’s program will continue to work (perhaps with a change in efficiency) even though the physical data representation is redesigned.
For any particular database using the given conceptual and physical models, the actual specifications of this database using the primitives the models provide are termed the conceptual data schema and physical data schema. Note that a mapping must be provided between the conceptual and physical levels, either automatically or with further instruction from the client; we will do some of both. The logical to physical mapping is intimately associated with the performance of the database system as viewed by the user performing operations at the conceptual level. Performance is generally not a criteria for choosing between conceptual data models, unless there is no known way to map the differing conceptual views into the same or equally efficient implementations.
Basic Primitives
Three basic primitives are defined in the model: an entity, datum, and relationship. An entity represents an abstract or concrete object in the world: a person, an organization, a document, a product, an event. In programming languages and knowledge representation entities have variously been referred to as atoms, symbols, nodes, and objects. A datum, unlike an entity, represents literal information such as times, weights, part names, or phone numbers. Character strings and integers are possible datum types. Actually, it is a policy decision as to whether something is represented as an entity or merely a datum: e.g., an employee’s spouse may be represented in a database system as a datum (the spouse’s name), or the spouse may be an entity in itself. The database system provides a higher level of logical integrity checking for entities, as we will see later.
In some programming languages, an entity is just another type of datum, like numbers and strings. To allow more preciseness, we will instead use the term value to refer to something that can be either a datum or entity.
A relationship is a tuple whose elements are [entity or datum] values. In programming languages, a relationship is commonly termed a record. As in most programming languages, we can refer to the elements (fields) of tuples by name instead of position. These names for positions are called attributes.
We also define entity types, datum types, and relationship types: domains, datatypes, and relations, respectively. We make use of these three types through one fundamental type constraint: for every relation, we define the number and types of attributes expected of relationships of that type. That is, every relationship in a given relation must have the same attributes, and the values associated with each attribute must be from a pre-specified domain or datatype. For example, a member relation might specify that a given person is a member of a given organization with a given job title as in the figure following. The person and organization might be entities, while the title might be a string datum. We relax the fundamental type constraint somewhat in allowing a lattice of types of domains: a particular value may then belong to the pre-specified domain or one of its sub-domains. For example, one could be a member of a University, a Company, or any other type of Organization one chooses to define. Other relations, e.g. an "offers-course" relation, might apply only to a University.
<==<ViewFig1.press<
Relationships represent facts about the world. A relation (a set of relationships) represents a kind of relationship in which entities and values can participate. We will schematically represent that John Smith is a member of Acme company with title "manager" by drawing lines for the relationship (labelled with the relation and its attributes), a circle for each entity, and a box for the datum:
<==<ViewFig2.press<
The reader should normally think of relations as types, not sets. A relation can be defined as the set of all relationships of its type, however, and thus can be treated as a relation in the normal mathematical sense. Note that we differ from mathematical convention in another minor point: the use of attributes to refer to tuple elements by name instead of position. Reference by position is therefore not necessary and is in fact not permitted. We can often omit attribute names without ambiguity since the types of participating entities imply their role in a relationship. However, they are necessary in the general case; e.g., a boss relation between a person and a person requires the directionality to define its semantics.
We can summarize the six basic primitives of the data model in tabular form. Familiarity with these six terms is essential to understanding the remainder of this document:
TypeInstanceInstance Represents
DomainEntityphysical or abstract object
DatatypeDatumnumerical measurement or symbolic tag
RelationRelationshiplogical correspondence between one or more objects and values
Domains, datatypes, and relations are types but can also be viewed as sets.
Names and Keys
The data model also provides primitives to specify uniqueness of relationships and entities. For relationships, this is done with keys, and for entities, with names. A key for a relation is an attribute or set of attributes whose value uniquely identifies a relationship in the relation. No two relationships may have the same values for a key attribute or attribute set. A name acts similarly for a domain. The name of an entity must uniquely specify it in its domain.
Consider a spouse relation, a binary relation between a person and a person. Both of its attributes are keys, because each person may have only one spouse (if we choose to define spouse this way!). For an author relation, neither the person nor the document are keys, since a person may author more than one document and a document may have more than one author. The person and document attributes together would comprise a key, however: there should be only one tuple for a particular person-document pair.
<==<ViewFig3.press<
We have labelled entities with names in the figures. Names are most useful when they are human-sensible tags for the entities represented, e.g. the title for a document, or the name of a person. However, their primary function is as a unique entity identifier, so non-unique human-sensible names must be represented in some other manner (via relations).
Note that the only information about an entity associated directly with the entity is the name; this contrasts with most other data models. A person’s age or spouse, for example, would be represented in the Cypress data model via age or spouse relations. Thus the relationships in a database are the information-carrying elements: entities do not have attributes. However the model provide an abbreviation, properties, to access information such as age or spouse in a single operation. We will discuss properties later. In addition, the physical data modelling algorithms can store these data directly with the person entity as a result of the relation key information (since a person can have only one age, a field can be allocated for it in the stored object representing a person.)
Normalization
A common topic in the database literature is relational normalization (Beeri et al [1978], Codd [1970], Fagin[1977, 1981], Rissanen [1977], Biller[1979], Hall et al[1976]). A relation is normalized by breaking it into two or more relations of lower order (fewer attributes) to eliminate undesirable dependencies between the attributes. For example, one could define a "publication" relation with three attributes:
Publication:
PersonBookDate
GeorgeBackgammon for Beginners1978
JohnBackgammon for Beginners1978
MaryHow to Play Chess1981
MaryHow to Cheat at Chess1982
This relation represents the fact that John and George wrote a book together entitled "Backgammon for Beginners", published in 1978, and Mary wrote two books on the subject of chess, in 1981 and 1982. Alternatively, we could encode the same information in two relations, an author relation and a publication-date relation:
Author:
PersonBook
GeorgeBackgammon for Beginners
JohnBackgammon for Beginners
MaryHow to Play Chess
MaryHow to Cheat at Chess
Publication-date:
BookDate
Backgammon for Beginners1978
How to Play Chess1981
How to Cheat at Chess1982
Although the second two relations may seem more verbose than the first one, they are actually representationally better in some sense, because the publication dates of books are not represented redundantly. If one wants to change the publication date of "Backgrammon for Beginners" to 1979, for example, it need only be changed in one place in the publication-date relation but in two places in the publication relation. This latter kind of behavior is called an update anomaly. The second two relations are said to be a normalized form (as it happens, third normal form) of the first relation, and thereby avoid this particular kind of update anomaly.
A variety of successively stricter criteria for normalization have been developed and studied, dependent on the kinds of real-world dependencies between attributes. Relational normalization is not part of the Cypress data model. However the model’s operations (and the tools we will develop in the implementation) encourage what we will call functionally irreducible form, in which relations are of the smallest order that is naturally meaningful. This form is in some sense the most fully normalized canonical form that could be defined.
Functionally irreducible normal form cannot be defined simply syntactically, but rather requires resort to the semantics of relations. Specifically, the presence or absence of a relationship in a relation represents the truth of some predicate on the state of the represented world (see Kent [1979]). For example, a "member" relationship represents the fact that a particular person is a member of a particular organization. (The absence of such a relationship may mean falsity of that predicate or lack of knowledge; we fortunately need not concern ourselves with this distinction here). We will formally define irreducible form in Section 3. For informal purposes, it suffices to say that relations should be of the smallest order that is possible without introducing a new artificial domain not otherwise desired (all relations can be reduced to binary by introducing artificial domains). We will allow a slight weakening of irreducible form, called functionally irreducible form, which permits combining two or more irreducible relations that must all be logically present or absent in the real world. For example, a birthday relation between a person, month, day, and year can be combined instead of using three relations. Operationally defined, then, a relation is in functionally irreducible form iff:
(1) it is binary, or
(2) it has no single-attribute key, and cannot be represented with lower order without introducing a new type of entity to represent all or part of the relationships themselves, e.g., a member relation between a person, an organization, and the role of the person in the organization, or
(3) it has a single-attribute key, and the non-key attributes represent a single atomic fact such that all or none would be present or absent. An example would be the birthday relation just mentioned.
The functionally irreducible relations seen by the user are independent of the physical representation chosen by the system for efficiency, so we are concerned only with the logical data access. Note also that the motivation for the functionally irreducible form is a one-to-one correspondence between the relationships in the database and the atomic facts they represent, a canonical form which is in some sense more natural than any other form.
Data Schema
Like some other database models and a few programming languages, the Cypress model is self-representing: the data schema is stored and accessible just as any other data. Thus application-independent tools can be written without coded-in knowledge of the types of data and their relationships, by reading the meta-representation at run-time.
The pre-defined domains are:
the Domain domain, with one element (entity) per client-defined domain
the Attribute domain, with one element per client-defined attribute
the Relation domain, with one element per client-defined relation
The pre-defined relations are the Subtype relation and the Attribute relation. The Subtype relation has one element (relationship) per domain - subdomain relationship, with two attributes referencing the domain and subdomain, respectively. The Attribute relation has one element per client-defined attribute. The Attribute relation’s attributes and their values are:
aAttribute: the attribute this relationship refers to
aRelation: this attribute’s relation
aType: the domain or datatype that this attribute accepts
aUniqueness: TRUE iff this attribute is a key part of the relation
Primitive operations
The data model provides the capability to define and examine the data schema, and perform operations on entities, relationships, and aggregates of entities and relationships.
Five basic operations are defined for entities:
1.CreateEntity: Creation of a new entity in a domain. An entity name may be specified.
2.DestroyEntity: Destruction of an entity; this also destroys all relationships which refer to it.
3.GetDomain: Fetching the domain of an entity.
4.GetName: Fetching the name of the entity.
5.SetName: Changing the name of the entity.
Five basic operations are defined for relationships:
1.CreateRelationship: Creation of a new relationship in a given relation.
2.DestroyRelationship: Destruction of a relationship.
3.GetRelation: Fetching the relation of a relationship.
4.GetField: Fetching a field of a relationship.
5.SetField: Specifying a new value for a relationship field.
Query Operations
The RelationSubset operation enumerates all the relationships in a relation satisfying specified equality constraints on entity-valued attributes and/or range constraints on datum-valued attributes.
The DomainSubset operation enumerates all the entities in a domain. The list may optionally be sorted by entity name, or restricted to a subset of the entities with a name in a given range.
More complex queries are implemented by a query language based on these primitive query mechanisms.
Convenience Operations
As well as the basic operations described in the previous sub-sections, four more convenient but specialized operations may be defined in terms of these by an implementation of the data model:
1.DeclareEntity[domain, entity name]: Returns an entity with the given name in the domain if one exists, otherwise creates one, automatically generating a unique name if none is given.
2.DeclareRelationship[relation, attribute value list]: Returns an existing relationship with the given attribute values if one exists, otherwise creates one.
3.GetProperty[entity, attribute1, attribute2]: Returns the values of attribute1 for all relationships in its relation that reference the entity via attribute2. Attribute2 may be omitted, in which case it is assumed to be the only attribute of the relation that could reference the entity’s type.
4.SetProperty[entity, attribute1, value list, attribute2]: Attribute1 and attribute2 must be from the same relation. Creates relationships in that relation for each value in the list, with attribute1 equal to the value, and attribute2 equal to the entity. Attribute2 can be defaulted as in GetProperty.
As an example of the use of properties, consider the following database, consisting of two member relationships and an age relationship:
<==<ViewFig4.press<
On this database, the GetProperty and SetProperty operations work as follows:
GetProperty[John Smith, memberOf] would return the set {Acme Company, State University}
GetProperty[John Smith, ageIs] would return {34}. (A version of GetProperty which returns a value rather than list is available in the implementation.)
SetProperty[John Smith, memberOf, Foo Family] would create a new member relationship specifying John to be a member of the Foo Family. (The memberAs attribute would be left undefined in this relationship.)
SetProperty[John Smith, ageIs, 35], where ageOf is a key of the age relation, would delete the relationship specifying John’s age to be 34, and insert a new one specifying John’s age to be 35. Note that SetProperty acted differently than on the member relation because memberIs is not a key.
Views
A view is a relation whose relationships are not actually stored. Instead, the primitive retrieval and storage operations on a view invoke procedures defined by a database client who defines the view. Arbitrary procedures may be defined, and views can therefore be used for a variety of purposes:
1.Defining a pseudo-relation in terms of existing relations for convenience: e.g., a father view could be implemented in terms of parent and sex relations that are actually stored.
2.Allowing changes to the logical representation of data (e.g., actually changing the database to store father and mother relations instead) without changing existing programs written in terms of the older form.
3.Implementing operations on entities in an object-based style by storing tuples in a view. E.g., a Send operation on a Message entity might be invoked by storing a tuple in a send relation. This mechanism provides the basis for encoding of procedural information in a database.
The implementor of a view must define all of the operations on a relation. Unlike view mechanisms in some database systems, views may be defined directly in the underlying programming language. In our case, this is Cedar Mesa, and the operations the implementation must export are defined as a Cedar Mesa interface exported by the view. The view can be dynamically loaded when required at run-time. The operations the view provides are the same as those for relations:
1.RelationSubset and CreateRelationship on the view
2.DestroyRelationship, GetRelation, GetField and SetField on its relationships
When a query language is defined, a view may be defined at a higher level rather than the underlying programming language; this simplifies the definition of views as well as allowing some kinds of optimizations that need no longer treat a view as a "black box" implementation of an access definition.
Views are so-called because they provide a database client a different view of the database than what was originally stored. View definitions and implementations are stored in a database as any other data, being automatically retrieved when required by the database system.
Partial Augments
Partial augments correspond to what we call level 1 augments. These are the augments described in the implementation section.
Augments provide a mechanism to divide and re-combine databases to give different perspectives or subsets of the data to different users or application programs. An augment is a set of entities and relationships that a database client chooses to treat as one logical context or state of the world. The data in an augment is stored in an underlying file physically independent from other augments, perhaps on another machine. However we will see that the entities of a domain may logically be in more than one augment, and the entities in an augment may be from more than one domain. Relations and relationships behave similarly.
Augments may be used for four purposes:
1.Data distribution: Augments are represented by physical files in an underlying file system, and can be used as a basis for distribution of data according to the location of the files.
2.Database independence (between applications): Different database application using different domains or relations may define their data in separate augments. As a result one application can continue to operate although the data for another has been logically or physically damaged. Similarly, domains or relations distributed over multiple files, possibly on a variety of machines, can continue to be used in degraded mode when one or more of the augments missing.
3.Additive databases (combined public/private): A public augment may contain information accessible and possibly writable by a number of users, e.g. a database of phone numbers and addresses. A user may add on top of this database personal entries not visible to other users but visible to himself as if an extension of the public database.
4.Versions (Additive/subtractive databases): The relations in an augment encode some state of the world the database represents. Updates, representing changes to that state, could be made by adding them in a separate augment. The database may then be viewed with and without the changes (or a number of alternate changes) by adding and removing the augment(s) on top.
We will re-examine these four goals (database distribution, database independence, additive databases, and versions) later. First we examine partial augments, which satisfy some but not all of these goals.
The data model provides the following operations on augments:
1.CreateAugment and DestroyAugment create and destroy an augment, respectively. Augments are simply referred to by name. The name must be unique over the distributed computing environment; an implementation might use the full path name of a file in which the augment’s data is stored.
2.OpenAugment takes as argument an augment and adds it to the current augment list, a list maintained by the database system. CloseAugment removes a named augment from the list.
We now extend the DeclareDomain and DeclareRelation operations to have an additional argument, namely the augment in which the defined domain or relation will reside. The entity representing a domain or relation now represents data in a particular augment. However, a logical correspondence is maintained by the system between relations or domains with the same name in two or more augments in the currently open augment list. We define the namesake in augment Q of a domain or relation as the domain or relation with the same name in Q. The namesake in a particular augment may of course be non-existent. Similarly, we define the namesake in augment Q of an attribute A to be the attribute of A’s relation’s namesake in Q with the same name as A. All of the namesakes of a relation must have attributes identical in name, type, and uniqueness or an error is generated on the first attempt to use that relation. Finally, we will refer to the namesake of any entity E in an augment Q as the entity with the same name in E’s domain’s namesake in Q.
The effect of the namesake equivalence will be the very useful property that a relation or domain may logically span two or more augments while the augments remain physically independent.
With the addition of augments to the data model, we redefine the semantics of basic access operations as follows.
1.CreateEntity and CreateRelationship create an entity or relationship, respectively, in the augment in which the respective domain or relation is defined.
2.DestroyEntity and DestroyRelationship destroy the given entity or relationship in the augment in which it is defined.
3.GetField and SetField "translate" the attributes and entity values they are given to their namesake in the relationship’s augment before performing the operation. If the namesake for an attribute does not exist, this is an error; if the namesake for an entity value does not exist, it is implicitly created (this may cause the domain itself to become defined in that augment). GetField and SetField are otherwise defined as before.
4.GetDomain and GetRelation are unchanged, i.e. they retrieve the particular domain or relation where the item is stored. A convenience operation, GetAugment, is added as an abbreviation for fetching an entity or relationship’s domain or relation.
5.DomainSubset is unchanged when applied to client-defined domains, i.e., it enumerates the domain only in the augment in which it was declared. However when DomainSubset is applied to one of the system domains, namely the Attribute domain, Relation domain, or Domain domain, it enumerates across all open augments. If a relation with the same name exists in more than one augment, each will occur in the enumeration.
6.RelationSubset enumerates client-defined relations only in the augment in which they were declared. However, any entity values in the attribute value list supplied are automatically translated to their namesakes in the relation’s augment. If they have no already existing namesakes they are not created, the RelationSubset enumeration is simply empty. When RelationSubset is applied to the system relations (the AttributeRelation and SubtypeRelation), it enumerates relationships satisfying the attribute value list in all open augments. Entity values in that list are automatically translated to each augment in this process.
7.The entity equality operation returns TRUE iff the two entities have the same name and their domains have the same name, regardless of whether the entities are in the same augment.
Full Augments
Full augments correspond to what we call level 3, or subtractive, augments. They are included for future interest, but their implementation is not described.
Note that augments as defined thus far provide some aid to a database application in allowing relations and domains to cross segment boundaries, but do not carry this illusion through to completion. We will refer to these as partial augments. The merit of partial augments over a more complete illusion is that they are simpler to understand and have good performance in an implementation. They satisfy only the first two of the purposes of augments outlined at the beginning of the previous section, however. The third and fourth uses of augments could of course be satisfied by writing application programs which simulate these effects: public / private databases, in which a relation may have private and public relationships, are relatively easy to simulate in this way. However we will also define a higher level of aspiration, namely simulating in the data model relations and domains that appear to span augment boundaries, and can even be updated in a different augment than the data is defined. We will call these augments full augments.
An implementation of the database model with full augments has the same properties and definitions just enumerated for partial augments with the following modifications:
1.There is an ordering on the augment list maintained by the database system. The OpenAugment call may specify where in the current list the new augment should be opened.
2.DeclareRelation and DeclareDomain returns a handle for the relation or domain in all augments, and the relation or domain is declared to exist in the augment passed as argument. The Declare procedure may be called additional times to define the relation or domain in other augments.
3.CreateEntity and CreateRelationship create an entity or relationship, respectively, in the top-most augment in the augment list in which a namesake of the respective domain or relation has been defined by a client call to DeclareDomain or DeclareRelation (not necessarily in this database session). Domains defined in an augment implicitly by creation of entities through translation (see semantics of SetField) do not qualify, the domain must have been explicitly defined in the augment by a call to DeclareDomain.
4.DestroyEntity and DestroyRelationship destroy the given entity or relationship if there is no namesake of the domain or relation higher in the current augment list. Otherwise, they create an anti-entity or anti-relationship, respectively, in the top-most client-declared namesake of the domain or relationship. The semantics of SetField is in effect that of a DestroyRelationship followed by a CreateRelationship with the attribute changed.
5.DomainSubset searches its domain in all open augments, and has the property that an entity in an augment will not appear in the enumeration if an anti-entity with the same name exists in a namesake of the entity’s domain that is higher in the current augment list.
6.RelationSubset similarly searches its relation in all open augments, and has the property that a relationship in an augment will not appear in the enumeration if an anti-relationship with the same attribute values, with entity values mapped to their namesakes, exists in a namesake of the relationship’s relation that is higher in the current augment list..
Full augments also act specially upon relations upon which keys have been defined:
1.If a CreateRelationship would create a relationship with the same key value as an existing relationship in the relation in some open augment, then an anti-relationship is automatically created for the existing relationship (in the same augment as the new relationship, the top-most one possible) before creating the new one.
2.If a new augment is opened containing relationships whose key values match existing ones in open augments, then anti-relationships for the matching existing relationships are automatically created in the newly opened augment at that time.
The net effect of our definition of full augments is that a user or program may make arbitrary modifications to data in an underlying augment, in a fashion which appears exactly as if a single augment contains all the data. The underlying augment, however, is completely unchanged. The unmodified data may concurrently be examined by another user, or appear to be modified through another user’s augment. Furthermore, the modifications made by the same user over time can be separated and later removed. Full augments thus satisfy all four of the purposes of augments set out at the beginning of the previous section.
Note that two augments can be merged in a straightforward way to produce an augment that behaves as the two in the same order in the same place in the current augment list. The augments are merged by combining the elements of the relations and domains in the first augment and their namesakes in the other augment, discarding entities and anti-entities which match and relationships and anti-relationships which match.
Augments are not related in definition or implementation to the atomic transactions which an implementation of the model also provides. However, the reader might find it informative to think of a transaction as a [full] augment defined on top of the current data, which is automatically merged with the data when the transaction is committed. Transactions thus serve as "short-term" augments. The two are useful for entirely different purposes, however.