Page Numbers: Yes X: 530 Y: 10.5" First Page: 5
Columns: 1 Edge Margin: .6" Between Columns: .4"
Margins: Top: 1.3" Bottom: 1"
Line Numbers: No Modulus: 5 Page-relative
Even Heading:
DESIGN AND IMPLEMENTATION OF A RELATIONSHIP-ENTITY-DATUM DATA MODEL
Odd Heading: Not-on-first-page
CYPRESS DATA MODEL CONCEPTS
2. Cypress data model concepts
In this section, we give an informal description of the Cypress data model. The evaluation and justification of the model have been deferred to Section 5. A more formal description of the model has been deferred to a future paper (some axioms can be found in the appendix).
2.1 Data independence
We deal here with the conceptual data model, the logical primitives for data access and data type definition. This should be carefully distinguished from the physical data model, the mechanisms we are given for actual storage and access of data. The physical model in the Cypress implementation corresponds to the Storage level. The physical data model is hidden as much as possible from the database client to facilitate data independence, the guarantee that a user’s program will continue to work (perhaps with a change in efficiency) even though the physical data representation is redesigned.
For any particular database using the given conceptual and physical models, the actual specifications of this database using the primitives the models provide are termed the conceptual data schema and physical data schema. Note that a mapping must be provided between the conceptual and physical levels, either automatically or with further instruction from the client; we will do some of both. The logical to physical mapping is intimately associated with the performance of the database system as viewed by the user performing operations at the conceptual level. Performance is generally not a criteria for choosing between conceptual data models, unless there is no known way to map the differing conceptual views into the same or equally efficient implementations.
2.2 Basic primitives
Three basic primitives are defined in the model: an entity, datum, and relationship.
An entity represents an abstract or concrete object in the world: a person, an organization, a document, a product, an event. In programming languages and knowledge representation entities have variously been referred to as atoms, symbols, and nodes. A datum, unlike an entity, represents literal information such as times, weights, part names, or phone numbers. Character strings and integers are possible datum types.
It is a policy decision whether something is represented as an entity or merely a datum: e.g., an employee’s spouse may be represented in a database system as a datum (the spouse’s name), or the spouse may be an entity in itself. The database system provides a higher level of logical integrity checking for entities than for datum values, as we will see later: unique entity identifiers, checks on entity types, and removal of dependent data upon entity deletion. We shall discuss the entity/datum choice further in Section 4.2.
We will use the term value to refer to something that can be either a datum or an entity. In many programming languages, there is no reason to distinguish entity values from datum values. Indeed, most of the Cypress operations deal with any kind of value, and some make it transparent to the caller whether an entity or datum value is involved. The transparent case makes Relational operations possible in our model, as we will see in Section 2.5.
A relationship is a tuple whose elements are [entity or datum] values. We refer to the elements (fields) of relationships by name instead of position. These names for positions are called attributes.
Note that we have separated the representatives of unique objects (entities) from the representation of information about objects (relationships), unlike some object-oriented programming languages and data models. Therefore an entity is not an "object" (or "record") in the programming language sense, although entities are representatives of real-world objects. We discuss the advantages of the entity-relationship distinction along with other data model design issues in Section 5.
We also define entity types, datum types, and relationship types. These are called domains, datatypes, and relations, respectively. We make use of these three types through one fundamental type constraint: every relationship in a relation has the same attributes, and the values associated with each attribute must be from a pre-specified domain or datatype. One might think of a relation as a "record type" in a programming language, although relations permit more powerful operations than record types.
As an example, consider a member relation that specifies that a given person is a member of a given organization with a given job title, as in the following figure. The person and organization might be entities, while the title might be a string datum. We relax the fundamental type constraint somewhat in allowing a lattice of types of domains: a particular value may then belong to the pre-specified domain or one of its sub-domains. For example, one could be a member of a University, a Company, or any other type of Organization one chooses to define. Other relations, e.g. an "offers-course" relation, might apply only to a University.
<==<MLDFig2.1.press<
Relationships represent facts about the world. A relation (a set of relationships) represents a kind of relationship in which entities and values can participate. We will schematically represent that John Smith is a member of Acme company with title "manager" by drawing lines for the relationship (labelled with the relation and its attributes), a circle for each entity, and a box for the datum:
<==<MLDFig2.2.press<
One should normally think of relations as types, not sets. A relation can be defined as the set of all relationships of its type, however, and thus can be treated as a relation in the normal mathematical sense. Note that we differ from mathematical convention in another minor point: the use of attributes to refer to tuple elements by name instead of position. Reference by position is therefore not necessary and is in fact not permitted. We can often omit attribute names without ambiguity since the types of participating entities imply their role in a relationship. However, they are necessary in the general case; e.g., a boss relation between a person and a person requires the directionality to define its semantics.
We can summarize the six basic primitives of the data model in tabular form. Familiarity with these six terms is essential to understanding the remainder of this document:
TypeInstanceInstance Represents
DomainEntityphysical or abstract object
DatatypeDatumnumerical measurement or symbolic tag
RelationRelationshiplogical correspondence between one or more objects and values
Our terminology might be more consistent if we called a domain an "entity type," and a relation a "relationship type." Instead we have compromised on the terms most widely used in the literature for all six of the basic concepts. The reader will find the remainder of this document much more understandable if these terms are commited to memory before proceeding.
2.3 Names and keys
The data model also provides primitives to specify uniqueness of relationships and entities. For relationships, this is done with keys, and for entities, with names. A key for a relation is an attribute or set of attributes whose value uniquely identifies a relationship in the relation. No two relationships may have the same values for a key attribute or attribute set (a relation may have more than one key, in which case relationships must have unique values for all keys). A name acts similarly for a domain. The name of an entity must uniquely specify it in its domain.
Consider a spouse relation, a binary relation between a person and a person. Both of its attributes are keys, because each person may have only one spouse (if we choose to define spouse this way!). For an author relation, neither the person nor the document are keys, since a person may author more than one document and a document may have more than one author. The person and document attributes together would comprise a key, however: there should be only one tuple for a particular person-document pair.
<==<MLDFig2.3.press<
We have labelled entities with names in the figures. Names are most useful when they are human-sensible tags for the entities, e.g. the title for a document, or the name of a person. However, their primary function is as a unique entity identifier, so non-unique human-sensible names must be represented as relations. If entities of a domain have more than one unique identifier, e.g. social security numbers and employee numbers, then one identifier must be chosen as the domain’s entity names and the other represented as a relation connecting the entities with the unique alternate identifier (a key of that relation).
We require that every entity have a unique name, although the name may automatically be generated by the database system. Thus every entity may be uniquely identified by the pair:
[domain name, entity name].
Some authors in the database literature use the term entity to refer to a real-world object rather than its representation in the database system. When we use the term entity, we refer to the internal entity, the entity "handle" returned by database operations and stored in entity-valued variables, not the external entity that the internal entity represents or the entity identifier [domain, name] that may be used to uniquely identify an internal entity. The three are interchangeable, however, since they must always be in one-to-one correspondence.
The reader may find it simple to think of entity-valued attributes of relationships as pointers to the entities, in fact bi-directional pointers, since the operations we provide allow access in either direction. This is a useful analogy. However, there is no constraint that the model be implemented with pointers, and the relationships of a relation could equally well be conceptualized as rows of a table whose columns are the attributes of the relation and whose entries for entity-valued attributes are the string names of the entities. For example, the author relationships in the previous figure could also be displayed in the form:
Author:
PersonBook
GeorgeBackgammon for Beginners
JohnBackgammon for Beginners
JohnAdvanced Backgammon
TomAdvanced Backgammon
Thus our introduction of entities to the Relational model does not entail a different representation than, say, a Network model might imply, but simply additional integrity checks on entity names and types, and new operations upon entities. This compatibility with the Relational data model is important, as it allows the application of the powerful Relational calculus or algebra as a query language. We return to query languages in Section 2.5.
Note that the only information about an entity associated directly with the entity is the name; this contrasts with most other data models. A person’s age or spouse, for example, would be represented in the Cypress data model via age or spouse relations. Thus the relationships in a database are the information-carrying elements: entities do not have attributes. However the model provide an abbreviation, properties, to access information such as age or spouse in a single operation. We will discuss properties later. In addition, the physical data modelling algorithms can store these data directly with the person entity as a result of the relation key information (since a person can have only one age, a field can be allocated for that field in the stored object representing a person.)
2.4 Basic operations
The data model provides the capability to define and examine the data schema, and perform operations on entities, relationships, and aggregates of entities and relationships. In this section we discuss the basic operations on entities and relationships. In Section 2.5, we discuss the operations on aggregate types, i.e. domains and relations. We defer to Section 2.6 the discussion of "convenience" operations built upon the basic and aggregate operations.
Four basic operations are defined for entities:
1.DeclareEntity[domain, name]: Returns a new or existing entity in a domain. An entity name must be specified.
2.DestroyEntity[entity]: Destroys an entity; this also destroys all relationships that refer to it.
3.DomainOf[entity]: Returns the domain of an entity (its type).
4.NameOf[entity]: Returns the string name of an entity.
Five basic operations are defined for relationships:
1.DeclareRelationship[relation, list of attribute values]: Returns a relationship with the given attribute values in the given relation.
2.DestroyRelationship[relationship]: Destroys a relationship.
3.RelationOf[relationship]: Returns a relationship’s relation.
4.GetF[relationship, attribute]: Returns the value associated with the given attribute of the given relationship.
5.SetF[relationship, attribute, value]: Sets the value of the given relationship attribute.
The operations upon relationships recognize a specially-distinguished undefined value for an attribute. Unassigned attributes of a newly-created relationship have this value. A client of the data model may retrieve a value with GetF and test whether it equals the distinguished undefined value, and may set a previously defined value to be the distinguished undefined value with SetF.
Other "convenience" operations are built on top of the basic operations on entities and relationships: properties and translucent attributes. They are described in Section 2.6. Although these operations are not essential to the basis of the Cypress model, they do furnish a fundamentally different perspective on the model. They provide a mechanism to associate information directly with entities (instead of through relationships) and to write programs largely independent of attribute types.
The reader will also note that we have ignored issues of concurrent access and protection in the basic operations. We will see later that an underlying transaction, file, and protection is associated with the relation and domain handles used in the basic operations. This convenience allows us to treat concurrency, protection, and data location orthogonally.
2.5 Aggregate operations
There are two kinds of operations upon domains and relations, the aggregate types in our model: the definition of domains and relations, and queries on domains and relations. We first discuss their definitions.
Schema definition
As in other database models and a few programming languages, the Cypress model is self-representing: the data schema is stored and accessible as data. Thus application-independent tools can be written without coded-in knowledge of the types of data and their relationships.
Client-defined domains, relations, and attributes are represented by entities. These entities belong to special built-in domains, called the system domains:
the Domain domain, with one element (entity) per domain
the Attribute domain, with one element per attribute
the Relation domain, with one element per relation
There is also a predefined Datatype domain, with pre-defined elements StringType, BoolType, and IntType, called built-in types. An implementation may also allow client-defined datum types, but the implementation described in Section 3 currently does not.
There may be other system domains, depending upon the implementation of the Relationship-Entity-Datum model, for example to represent indices on relations. The Domain domain, Attribute domain, and all other domains are members of the Domain domain.
Information about domains, relations, and attributes are represented by system relations in which the system entities participate. The pre-defined SubType relation is a binary relation between domains and their subdomains. There are also predefined binary relations that map attributes to information about the attributes:
aRelation: maps an attribute entity to its relation entity.
aType: maps an attribute to its type entity (a domain or a built-in type)
aUniqueness: maps an attribute entity to {TRUE, FALSE}, depending whether it is part of a key of its relation. We are assuming only one key per relation, here; our implementation relaxes this assumption in the case of single-attribute keys.
The following diagram graphically illustrates a segment of a data schema describing the member relation and several domains. The left side of the figure shows two subdomains of Organization, (Company and University), and the right shows the types and uniqueness properties of the member relation’s attributes memberOf, memberIs, and memberAs.
<==<MLDFig2.4.press<
New domains, relations, and attributes are defined by creating entities and relationships in these pre-defined system domains and relations. However, an implementation of the model may choose to provide special operations to define the data schema, to simplify error checking. These operations are:
DeclareDomain[name], and for each subtype relationship:
DeclareSubType[superdomain, subdomain]
DeclareRelation[name], and for each attribute:
DeclareAttribute[name, relation, type, uniqueness].
Queries
The operation RelationSubset[relation, attribute value list] enumerates relationships in a relation satisfying specified equality constraints on entity-valued attributes and/or range constraints on datum-valued attributes. For example, RelationSubset might enumerate all of the relationships that reference a particular entity in one attribute and have an integer in the range 23 to 52 in another. Attributes of a relationship with an undefined value satisfy no range or equality constraints on the attribute.
The operation DomainSubset[domain, name range] enumerates entities in a domain. The enumeration may optionally be sorted by entity name, or restricted to a subset of the entities with a name in a given range.
More complex queries can be implemented in terms of DomainSubset and RelationSubset. A good implementation of the model might also provide a MultiRelationSubset operation to enumerate single queries spanning more than one relation. MultiRelationSubset operates upon a parsed representation of the query language, and produces the same kind of enumeration as RelationSubset.
We will not precisely define the query language for the Cypress data model (the Query level) in this report, and MultiRelationSubset has not yet been designed or implemented. A variety of query languages (and MultiRelationSubset implementations) could be built on top of the operations we have defined thus far. For the purposes of this report, however, we will assume that our query language is the Relational algebra (Codd[1970]), providing the join, projection, and selection operators on relations. We choose the relational algebra due to its wide acceptance in the literature, and now commercially. Our choice does not preclude an entity-centric query language in future work, however. We and others have designed but have not implemented such languages.
To demonstrate that the relational algebra can be used with our system, we will define a mapping of the relational operations onto the Cypress model. Some augmentation of the relational algebra is necessary to include the operations our model provides upon domains. The important aspects of the mapping are:
1.The Cypress relations exist in our relational mapping essentially unchanged, except that some type constraints exist on attribute values; these constraints simply appear as exceptional conditions when violated, they do not effect the form of the relational operations.
2.For each domain in our schema, we define a unary relation in our mapping with the entity name as its only attribute. The sole purpose of the unary domain relations is to specify the existence of entities in the domain. For example, a Person relation would contain one tuple per person entity in the database. The unary domain relations allow DomainSubset to be performed in the relational algebra as a select operation on the domain.
3.Entity-valued attributes of relations appear to have the name of the corresponding entity stored in them. There are no "atomic" entity values, so the ChangeName operation is expensive in this model. All references to an entity in the database appear as the string name of the entity, and the system generates an error if an entity-valued attribute is initiaized to an entity which does not exist in the domain.
Our representation of the data schema is as before: schema entities and relationships in the domain domain, relation domain, attribute domain, subtype relation, and attribute relations.
The join, project, and select operations in the relational algebra can now be performed by MultiRelationSubset, or by the appropriate calls to RelationSubset. Note that we must typically declare new relations for the result and intermediate results of computing such queries. A select operation on a domain may be used to determine whether a particular entity exists, or to enumerate the entities of a domain. A select operation on a relation may be used to find relationships that reference a particular entity. Query operations on the schema can be used to determine what relations might reference an entity of a particular domain.
The unary relations representing domains are used only to determine the existence of entities. An implementation could automatically create entities in these domains when an entity-valued attribute of a relationship is assigned a previously non-existent entity name. Note that domain relations may in fact be ignored altogether by the user if this is done. Domain relations are purely an extension to the Relational model. We chose not to automatically create entities in our mapping, as we would lose the logical integrity check the entities provide.
2.6 Convenience operations
Some more convenient specialized operations are built upon the basic operations described in the previous two sections. They implement what we call properties and translucent attributes. Although theoretically speaking these operations add no power to the model, they permit a significantly different perspective on the data access and so should be thought of as part of the model.
Properties
Properties allow the client to treat entities as if they, like relationships, had "attributes." They provide the convenience of treating attributes of relationships that reference an entity as if they were attributes (or properties) of the entity itself. The property operations are:
1.GetPList[entity, attribute1, attribute2]: Attribute1 and attribute2 must be from the same relation. Returns the values of attribute1 for all relationships in the relation that reference the entity via attribute2. Attribute2 may be omitted, in which case it is assumed to be the only other entity-valued attribute of the relation.
2.GetP[entity, attribute1, attribute2]: this is identical to GetPList except exactly one relationship must reference the entity via attribute2; otherwise an error is generated. GetP always returns one value.
3.SetPList[entity, attribute1, value list, attribute2]: Attribute1 and attribute2 must be from the same relation. Destroys any existing relationships whose attribute2 equals the entity, and creates new ones for each value in the list, with attribute1 equal to the value, and attribute2 equal to the entity. Attribute2 can be defaulted as in GetPList.
4.SetP[entity, attribute1, value, attribute2]: this is identical to SetPList except it simply adds a new relationship referencing the entity instead of destroying any existing ones (unless attribute1 is a key of its relation, in which case the existing one must be replaced).
Thus the property operations allow information specified through relationships to be treated as properties of the entity itself, in single operations. The property operations and the operations defined in earlier sections may be used interchangeably, as there is only one underlying representation of information: the relationships. As an example of the use of properties, consider the following database:
<==<MLDFig2.5.press<
The figure shows the entity John Smith, and three relationships in which he participates: an age relationship and two member relationships, The member relationships are ternary, the age relationship binary. On this database, the property operations work as follows:
GetPList[John Smith, memberOf] would return the set {Acme Company, State University}.
GetP[John Smith, ageIs] (or GetPList) would return 34.
SetP[John Smith, memberOf, Foo Family] would create a new member relationship specifying John to be a member of the Foo Family. SetPList would do the same, but would destroy the two existing member relationships referencing John. In either case, the memberAs attribute would be left undefined in the new relationship.
SetP[John Smith, ageIs, 35], where ageOf is a key of the age relation, would delete the relationship specifying John’s age to be 34, and insert a new one specifying John’s age to be 35. Note that SetP acted differently than on the member relation because memberIs is not a key.
Again, the property operations are simply a convenience, although they provide a different perspective on the data model by allowing an entity-based view of a database.
Translucent attributes
Some database application programs may not wish to be concerned with whether an attribute is entity-valued, string-valued, or integer-valued. They might prefer to have all values mapped to some common denominator, e.g. a string. An example would be a program that is simply displaying tuples on the screen.
Another class of applications would like to be independent of whether a particular attribute is represented as an entity or datum value. Consider the member relation in the previous figure. If we choose to define an Organization domain, then the memberOf attribute is entity-valued; but instead we might choose to make the memberOf attribute be string-valued, merely giving the name of the organization without defining organizations as entities. This might be appropriate, for example, if we did not wish to invoke the type checking on uniqueness of names and the correctness of entity types. We would like to write programs that are independent of whether an attribute is string-valued or entity-valued (as in the Relational data model).
We introduce translucent attributes to avoid dependence on attribute types. Any attribute may be treated as a translucent attribute, by using the GetFS and SetFS operations to retrieve or assign its value.
GetFS[relationship, attribute] is identical to the GetF operation, except it returns a string regardless of the attribute’s type. If the attribute is datum-valued, e.g. an integer or boolean, it is converted to a string equivalent. If the attribute is entity-valued, the name of the entity is returned.
SetFS[relationship, attribute, value] performs the inverse mapping. If the attribute is datum-valued, e.g. an integer or boolean, a string equivalent is accepted. If the attribute is entity-valued, the name of the entity is passed to SetFS. If an entity with the given name does not exist in the domain that is the attribute’s type, then one is automatically created.
Changing entity names
Another convenience operation is provided on entities to change an entity’s name: ChangeName[entity, new name]. This operation is semantically equivalent to destroying the given entity and creating a new one with the new name, participating in the same relationships that the old one did. See the description of ChangeName in Section 3.4 for precise semantics in our implementation, however.
2.7 Normalization
A common topic in the database literature is relational normalization (Codd [1970], Armstrong [1974], Hall et al[1976], Rissanen [1977], Beeri et al [1978], Fagin[1977, 1981], Biller[1979]). A relation is normalized by breaking it into two or more relations of lower order (fewer attributes) to eliminate undesirable dependencies between the attributes. For example, one could define a "publication" relation with three attributes:
Publication:
PersonBookDate
GeorgeBackgammon for Beginners1978
JohnBackgammon for Beginners1978
MaryHow to Play Chess1981
MaryHow to Cheat at Chess1982
This relation represents the fact that John and George wrote a book together entitled "Backgammon for Beginners," published in 1978, and Mary wrote two books on the subject of chess, in 1981 and 1982. Alternatively, we could encode the same information in two relations, an author relation and a publication-date relation:
Author:
PersonBook
GeorgeBackgammon for Beginners
JohnBackgammon for Beginners
MaryHow to Play Chess
MaryHow to Cheat at Chess
Publication-date:
BookDate
Backgammon for Beginners1978
How to Play Chess1981
How to Cheat at Chess1982
Although the second two relations may seem more verbose than the first one, they are actually representationally better in some sense, because the publication dates of books are not represented redundantly. If one wants to change the publication date of "Backgrammon for Beginners" to 1979, for example, it need only be changed in one place in the publication-date relation but in two places in the publication relation. If the date were changed in only one place in the publication relation, the database would become inconsistent. This kind of behavior is called an update anomaly. The second two relations are said to be a normalized form (as it happens, third normal form) of the first relation, and thereby avoid this particular kind of update anomaly.
A variety of successively stricter criteria for normalization have been developed and studied, based on different kinds of real-world dependencies between attributes. Work on normalization will almost certainly continue through the forseeable future.
Relational normalization is not strictly part of the Cypress data model. However the model’s operations (and the tools we will develop in the implementation) encourage what we will call functionally irreducible form, in which relations are of the smallest order that is naturally meaningful. This form is in some sense the most fully normalized canonical form that could be defined.
Functionally irreducible normal form cannot be defined simply syntactically, but rather requires resort to the semantics of relations. Specifically, the presence or absence of a relationship in a relation represents the truth of some predicate on the state of the represented world (Kent [1979]). For example, a "member" relationship represents the fact that a particular person is a member of a particular organization. (The absence of such a relationship may mean falsity of that predicate or lack of knowledge; we fortunately need not concern ourselves with this distinction here). A relation is in irreducible form if it is of the smallest order possible without introducing new artificial domain(s) not otherwise desired (all relations can be reduced to binary by introducing artificial domains). Biller [1979] provides a more precise definition of irreducible form. We will allow a slight weakening of irreducible form, functionally irreducible form, which permits combining two or more irreducible relations only when their semantics are mutually dependent (and therefore all present or absent in our world representation). For example, a birthday relation between a person, month, day, and year can be combined instead of using three relations. Another example would be an address relation between a person, street, city, and zip code. Combining an age and phone relation would not result in functionally irreducible form, however, as their semantics are not mutually dependent.
The functionally irreducible relations seen by the user are independent of the physical representation chosen by the system for efficiency, so we are concerned only with the logical data access. Note that in addition to avoiding update anomalies, functionally irreducible form provides a one-to-one correspondence between the relationships in the database and the atomic facts they represent, a canonical form that is in some sense more natural than any other form.
2.8 Segments
We would like a mechanism to divide up large databases, to provide different perspectives or subsets of the data to different users or application programs. In this section we discuss a mechanism to provide this separation: segments. A segment is a set of entities and relationships that a database client chooses to treat as one logical and physical part of a database.
In introducing segments, we will slightly change the definition of an entity, previously defined to be uniquely determined by its domain and name. We will treat entities with the same name and domain in different segments as different entities, although they may represent the same external entity. The unique identifier of an internal entity is now the triple
[segment, domain, name].
A consequence of this redefinition of entities is that relations and domains do not span segments, either. Application programs must maintain any desired correspondence between entities, domains, or relations with the same name in different segments. We will return to this later. In the next section, we will discuss a more powerful but more complex and expensive mechanism, augments, in which the database system itself maintains the correspondence.
We introduce three new operations to the data model in conjunction with segments:
DeclareSegment[segment, file]: opens a segment with the given name, whose data is stored in the given file.
GetSegments[] returns a list of all the segments which have been opened.
SegmentOf[entity or relationship] returns the segment in which a given entity or relationship exists. It may also be applied to relations or domains, since they are entities.
With the addition of segments to the data model, we redefine the semantics of the basic access operations as follows:
1.DeclareDomain and DeclareRelation take an additional argument, namely the segment in which the defined domain or relation will reside. The entity representing a domain or relation now represents data in a particular segment.
2.DeclareEntity and DeclareRelationship are unaffected: they implicitly refer to the segment in which the respective domain or relation was defined. By associating a segment (and therefore a transaction and underlying file) with each relation or domain entity returned to the database client, we conveniently obviate the need for additional arguments to every invocation of the basic operations in the data model.
3.DestroyEntity, DestroyRelationship, GetF, SetF, DomainOf, RelationOf, and Eq are similarly unaffected: they deal with entities and relationships in whatever segment they are defined. Note that by our definition, entities in different segments are never Eq. Also note that nothing in our definition makes a SetF across a segment boundary illegal (i.e. SetF[relationship, attribute, entity] where the relationship and entity are in different segments). Our current implementation requires that special procedures GetFR and SetFR be used on attributes that can cross segment boundaries, see Section 3.
4.DomainSubset and RelationSubset are unchanged when applied to client-defined domains or relations, i.e., they enumerate only in the segment in which the relation or domain was declared. However an optional argument may be used when applied to one of the system domains or relations (e.g. the Domain domain), allowing enumeration over a specific segment or all segments. RelationSubset’s attribute-value-list arguments implictly indicate the appropriate segment even for system relations, so a segment is not normally needed unless the entire relation is enumerated.
Note that the data in a segment is stored in an underlying file physically independent from other segments, perhaps on another machine. Introducing a file system into the conceptual data model may seem like an odd transgression at this point. From a practical point of view, however, we believe it better to view certain problems at the level of file systems. This point of view allows segments to be used for the following purposes:
1.Physical independence: Different database applications typically define their data in separate segments. As a result one application can continue to operate although the data for another has been logically or physically damaged. One application can entirely rebuild its database without affecting another, or an application can continue to operate in a degraded mode missing data in an unavailable segment.
2.Logical independence: Different database applications may have information which pertains to the same external entity, e.g. a person with a particular social security number. When one application performs a DestroyEntity operation, however, we would like the entity to disappear only from that application’s point of view. Information maintained by other applications should remain unchanged.
3.Protection: Clients can trust the protection provided by a file system more easily than a complex logical protection mechanism provided by the database system. An even higher assurance of protection can be achieved by physical isolation of the segment at a particular computer site. A more complex logical protection mechanism would be desirable for some purposes, but was deemed beyond the scope of Cypress.
4.Reliability: Because segment files are physically as well as logically separate, the probability of simultaneous physical failure is lower than for data in the same file. An even higher degree of independence can be achieved using segments at different sites. Replication of data can be provided at the level of segments to provide recovery from failure. Our implementation’s file systems do not do this, however.
5.Performance: Data may be distributed to sites where they are most frequently used. For example, personal data may reside on a client’s machine while publicly accessed data reside on a file server. If the file system provides replication, it can be used to improve performance for commonly accessed data.
Concurrency control may also be handled by the file system, although locks should be at a granularity finer than whole segments (e.g., file pages).
As noted earlier, information about an external entity may be distributed over multiple segments. One or more database applications may cooperate in maintaining the illusion that entities, domains, and relations span segment boundaries. This illusion may be used in at least two ways:
1.Private additions may be added to a public segment by adding entities or relationships in a private segment. The new relationships may reference entities in the public segment by creating representative entities with the same name in the private segment. An example would be personal phone numbers and addresses added to a public database of phone numbers and addresses: an application program would make the two segments appear to the user as one database.
2.If two applications use separate segments A and B, they may safely reference each other’s data yet remain physically independent. One of the applications may destroy and reconstruct its segment if it uses the same unique names for its entities. If both applications have relationships referencing an entity e, and application A does a DestroyEntity operation on e, the entity and relationships referencing it disappear from application A’s point of view, but application B’s representative entity and relationships remain.
2.9 Augments
In this section, we discuss a more elaborate mechanism for segmenting databases, with more powerful properties. We call these augments, because they are "additive" segments as we will see. Entities, domains, and relations are defined independently of augments. An entity with the same name and domain appears as the same entity regardless of particular augments in which data are stored. However, the entity may or may not be defined in a particular augment. Also, relationships continue to be associated with a particular segment.
NOTE: The ideas in the remainder of Section 2 have not been implemented, they are included for future interest. These features are not discussed again in this report except for Section 5.10, in contrasting data models.
Augments are intended for two main purposes:
1.Additive databases: Information about an entity may be distributed among multiple augments. The database system will make all the augment boundaries invisible, i.e. all the data will appear as if it is one augment from the application program’s point of view.
2.Subtractive databases (versions): The relations in an augment encode some state of the world the database represents. Updates, representing changes to that state, can be made by adding them in a separate augment. The database may then be viewed with and without the changes (or a number of alternate changes) by adding and removing the augment(s) on top.
The operations on augments themselves appear very similar to those on segments. However, an ordering on the open augment list is maintained by the database system. The DeclareAugment call may specify where in the current list the new augment should be opened: DeclareAugment[augment, file, previous augment] opens an augment with the given name, whose data is stored in the given file. It is defined to appear before the given previous augment in the ordering, or at the end if none is given. GetAugments[] returns a list of all the augments which have been opened in order. AugmentOf[relationship] may be applied to a relationship to determine in which augment it is stored.
With the addition of augments to the data model, we redefine the semantics of the basic access operations as follows:
1.DeclareRelation and DeclareDomain return a handle representing the relation or domain in all augments. The relation or domain is declared to exist in the augment passed as an argument; the declare procedure may be called additional times to define the same relation or domain to extend into other augments.
2.DeclareEntity and DeclareRelationship create an entity or relationship, respectively, in the top-most augment in the augment list in which the respective domain or relation is defined.
3.GetF, DomainOf, RelationOf, and Eq are unchanged, defined in the obvious way. Any entities returned by GetF, DomainOf, or RelationOf are of course augment-independent. Eq returns TRUE iff the two entities have the same name and their domains have the same name, regardless of whether the entities are in the same augment.
4.SetF is defined as before, with two exceptions. Consider the call
SetF[r, a, e]
where r is a relationship in augment A, a is an attribute of RelationOf[r] with type T, and e is an entity value in domain D. This call will cause e to exist in A, if it does not already. D must already exist in A. Otherwise, the call fails and nothing is created in A. (The client may create D and retry the operation).
5.DestroyEntity and DestroyRelationship destroy the given entity or relationship if there is no namesake of the domain or relation higher in the current augment list. Otherwise, they create an anti-entity or anti-relationship, respectively, in the top-most client-declared namesake of the domain or relationship. The semantics of SetF is, in effect, that of a DestroyRelationship followed by a CreateRelationship with the attribute changed.
6.DomainSubset searches its domain in all open augments, and has the property that an entity in an augment will not appear in the enumeration if an anti-entity with the same name exists in the entity’s domain higher in the current augment list. DomainSubset never returns the same entity twice, even though it exists in more than one augment.
7.RelationSubset similarly searches its relation in all open augments, and has the property that a relationship in an augment will not appear in the enumeration if an anti-relationship with the same attribute values exists in the relationship’s relation higher in the current augment list.
Augments act specially upon relations upon which keys have been defined:
1.If a CreateRelationship would create a relationship with the same key value as an existing relationship in the relation in some open augment, then an anti-relationship is automatically created for the existing relationship (in the same augment as the new relationship, the top-most one possible) before creating the new one.
2.If an augment is opened containing relationships whose key values match existing ones in open augments, then anti-relationships for the matching existing relationships are automatically created in the newly opened augment at that time.
The net effect of our definition of augments is that a user or program may make arbitrary modifications to data in an underlying augment, in a fashion which appears exactly as if a single augment contains all the data. The underlying augment, however, is completely unchanged. The unmodified data in the underlying augment may concurrently be examined by another user, or appear to be modified through another user’s augment. Furthermore, the modifications made by the same user over time are separated and can later be removed.
Note that two augments can be merged in a straightforward way to produce an augment that behaves as the two in the same order in the same place in the current augment list. The augments are merged by combining the elements of the relations and domains in the two augments, discarding entities and anti-entities that match and relationships and anti-relationships that match.
Augments are not related in definition or implementation to the atomic transactions that an implementation of the model also provides. However, the reader might find it informative to think of a transaction as an augment defined on top of the current data, which is automatically merged with the data when the transaction is committed. Transactions thus serve as "short-term" augments. The two mechanisms are useful for entirely different purposes in practice, however.
2.10 Views
A view is a relation whose relationships are not actually stored. Instead, the primitive retrieval and storage operations on a view invoke procedures defined by a database client who defines the view. Arbitrary procedures may be defined, and views can therefore be used for a variety of purposes:
1.Defining a pseudo-relation in terms of existing relations for convenience: e.g., a father view could be implemented in terms of parent and sex relations that are actually stored.
2.Allowing changes to the logical representation of data (e.g., actually changing the database to store father and mother relations instead) without changing existing programs written in terms of the older form.
3.Implementing operations on entities in an object-based style by storing tuples in a view. E.g., a Send operation on a Message entity might be invoked by storing a tuple in a send relation with two attributes, the message and the recipient. Any result of such a send operation would be stored in the database (e.g. as a third attribute of the send relationship returned). Thus the view mechanism provides a basis for encoding of procedural information in a database.
Views are so-called because they provide a database client a different view of the database than what was originally stored. View definitions and implementations are stored in a database as any other data, being automatically retrieved when required by the database system.
The implementor of a view must define the same operations as needed for any relation. Unlike view mechanisms in some database systems, views may be defined directly in the underlying programming language. In our case, this is Cedar, and the operations the implementation must export are defined as a Cedar interface exported by the view. The view can be dynamically loaded when required at run-time. The operations the view provides are:
1.RelationSubset and CreateRelationship on the view
2.DestroyRelationship, RelationOf, GetF and SetF on its relationships
A view may also be defined at a higher level than the underlying programming language, in a query language; this simplifies the definition of views as well as allowing some kinds of optimizations that need no longer treat a view as a "black box" implementation of an access definition.
2.11 Summary
We have introduced the three basic primitives of the Cypress data model: entity values, datum values, and relationships. The basic type checking provided by the model insures that the attributes of relationships are of the proper entity or datum types. Relationship types are called relations. Entity types are called domains, and a hierarchy of types of domains is permitted.
Entities have names, which uniquely identify them within their domain. Keys may be defined for relations, which consist of one or more attributes whose value must collectively be distinct for every relationship in the relation.
The model provides query operations to enumerate relationships with particular attribute values, or entities with particular names.
The model provides property and translucent attribute operations defined in terms of the basic operations. These operations provide a different perspective on the model, as they allow an entity-centric viewpoint (as in object-based languages) or a relation-centric viewpoint (as in the Relational model), respectively. As a result a client can simultaneously enjoy the advantages of both viewpoints, evaluating relational queries or navigating in a network of objects.
A segment mechanism permits overlapping databases maintained by a variety of applications, or databases distibuted over multiple machines or files by a single application. A more advanced mechanism, augments, duplicates the features of segments and also allows database updates to be encapsulated to maintain different database versions or viewpoints.
The rationale for the Cypress model’s features will be discussed further in Section 5; the reader may skip to Section 5 at this point without loss of continuity, or continue with the discussion of our implementation and examples in the next two sections.