

We deal here with the conceptual data model: the logical primitives for data access and data type definition. This should be carefully distinguished from the physical data storage and access mechanisms. The physical representation of data is hidden as much as possible from the database client to facilitate data independence: the guarantee that a user’s program will continue to work (perhaps with a change in efficiency) even though the physical data representation is redesigned.

For any particular database using our conceptual data model, the actual specification of the types of data in the database, using the primitives the model provides, is termed the data schema. Note that a mapping must be provided between the conceptual data model and the physical representation, either automatically or with further instruction from the client; we will do some of both. The logical to physical mapping is intimately associated with the performance of the database system as viewed by the user performing operations at the conceptual level.

An entity represents an abstract or concrete object in the world: a person, an organization, a document, a product, an event. In programming languages and knowledge representation, entities have variously been referred to as atoms, symbols, and nodes. A datum, unlike an entity, represents literal information such as times, weights, part names, or phone numbers. Character strings and integers are possible datum types.

It is a policy decision whether something is represented as an entity or merely a datum: e.g., an employee’s spouse may be represented in a database system as a datum (the spouse’s name), or the spouse may be an entity in itself. The database system provides a higher level of logical integrity checking for entities than for datum values, as we will see later: unique entity identifiers, checks on entity types, and removal of dependent data upon entity deletion. We shall discuss the entity/datum choice further in Section 4.2.

We will use the term value to refer to something that can be either a datum or an entity. In many programming languages, there is no reason to distinguish entity values from datum values. Indeed, most of the Cypress operations deal with any kind of value, and some make it transparent to the caller whether an entity or datum value is involved. The transparent case makes relational operations possible in our model, as we will see in Section 2.5.

A relationship is a tuple whose elements are [entity or datum] values. We refer to the elements (fields) of relationships by name instead of position. These names for positions are called attributes.

Note that we have separated the representatives of unique objects (entities) from the representation of information about objects (relationships), unlike some object-oriented programming languages and data models. Therefore an entity is not an "object" (or "record") in the programming language sense, although entities are representatives of real-world objects.

We also define entity types, datum types, and relationship types. These are called domains, datatypes, and relations, respectively. We make use of these three types through one fundamental type constraint: every relationship in a relation has the same attributes, and the values associated with each attribute must be from a pre-specified domain or datatype. One might think of a relation as a "record type" in a programming language, although relations permit more powerful operations than record types.

As an example, consider a member relation that specifies that a given person is a member of a given organization with a given job title, as in the following figure. The person and organization might be entities, while the title might be a string datum. We relax the fundamental type constraint somewhat in allowing a lattice of types of domains: a particular value may then belong to the pre-specified domain or one of its sub-domains. For example, one could be a member of a University, a Company, or any other type of Organization one chooses to define. Other relations, e.g. an "offers-course" relation, might apply only to a University.
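The fundamental type constraint and its relaxation to sub-domains can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cypress interface: the classes Domain, Entity, and Relation, the method declare_relationship, and the attribute names of the "offers-course" relation are hypothetical, and the lattice is simplified to a single parent per domain. The member attribute names memberIs, memberOf, and memberAs are the ones used later in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """An entity type; `parent` places it below another domain (simplified lattice)."""
    name: str
    parent: Optional["Domain"] = None

    def is_within(self, other: "Domain") -> bool:
        # Walk up the parent chain looking for `other`.
        d = self
        while d is not None:
            if d == other:
                return True
            d = d.parent
        return False


@dataclass(frozen=True)
class Entity:
    domain: Domain
    name: str


class Relation:
    """A relationship type: each attribute is typed by a Domain or a Python datum type."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes      # attribute name -> Domain or type
        self.relationships = []

    def declare_relationship(self, **values):
        # Every relationship in a relation has exactly the same attributes.
        if set(values) != set(self.attributes):
            raise TypeError("attributes must match the relation exactly")
        for attr, value in values.items():
            expected = self.attributes[attr]
            if isinstance(expected, Domain):
                # Entity values may come from the domain or any of its sub-domains.
                ok = isinstance(value, Entity) and value.domain.is_within(expected)
            else:
                ok = isinstance(value, expected)
            if not ok:
                raise TypeError(f"{attr}: {value!r} is not of type {expected}")
        self.relationships.append(values)
        return values


Person = Domain("Person")
Organization = Domain("Organization")
Company = Domain("Company", parent=Organization)
University = Domain("University", parent=Organization)

member = Relation("member", {"memberIs": Person, "memberOf": Organization, "memberAs": str})
offers_course = Relation("offers-course", {"offeredBy": University, "course": str})

john = Entity(Person, "John Smith")
acme = Entity(Company, "Acme")

# Accepted: Company is a sub-domain of Organization.
member.declare_relationship(memberIs=john, memberOf=acme, memberAs="manager")
```

In this sketch, declaring an offers-course relationship whose offeredBy value is acme raises TypeError, since Company is not within University.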

Relationships represent facts about the world. A relation (a set of relationships) represents a kind of relationship in which entities and values can participate. We will schematically represent that John Smith is a member of Acme company with title "manager" by drawing lines for the relationship (labelled with the relation and its attributes), a circle for each entity, and a box for the datum:

One should normally think of relations as types, not sets. A relation can be defined as the set of all relationships of its type, however, and thus can be treated as a relation in the normal mathematical sense. Note that we differ from mathematical convention in another minor point: the use of attributes to refer to tuple elements by name instead of position. Reference by position is therefore not necessary and is in fact not permitted. We can often omit attribute names without ambiguity since the types of participating entities imply their role in a relationship. However, they are necessary in the general case; e.g., a boss relation between a person and a person requires the directionality to define its semantics.

We can summarize the six basic primitives of the data model in tabular form. Familiarity with these six terms is essential to understanding the remainder of this document:

Type        Instance        Instance represents
Domain      Entity          physical or abstract object
Datatype    Datum           numerical measurement or symbolic tag
Relation    Relationship    logical correspondence between one or more objects and values

Our terminology might be more consistent if we called a domain an "entity type," and a relation a "relationship type." Instead we have compromised on the terms most widely used in the literature for all six of the basic concepts. The reader will find the remainder of this document much more understandable if these six terms are committed to memory before proceeding.

The data model also provides primitives to specify uniqueness of relationships and entities. For relationships, this is done with keys, and for entities, with names. A key for a relation is an attribute or set of attributes whose value uniquely identifies a relationship in the relation. No two relationships may have the same values for a key attribute or attribute set (a relation may have more than one key, in which case relationships must have unique values for all keys). A name acts similarly for a domain. The name of an entity must uniquely specify it in its domain.

Consider a spouse relation, a binary relation between a person and a person. Both of its attributes are keys, because each person may have only one spouse (if we choose to define spouse this way!). For an author relation, neither the person nor the document is a key, since a person may author more than one document and a document may have more than one author. The person and document attributes together would comprise a key, however: there should be only one tuple for a particular person-document pair.
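The key rules for the spouse and author relations can be illustrated with a small Python sketch. The class KeyedRelation, its insert method, and the spouse attribute names are hypothetical names chosen for illustration; the sketch only shows the uniqueness check, assuming relationships are held as dictionaries in memory.

```python
class KeyedRelation:
    """A relation whose keys are attribute sets that must be unique across relationships."""

    def __init__(self, name, attributes, keys):
        self.name = name
        self.attributes = tuple(attributes)
        self.keys = [tuple(k) for k in keys]   # each key is a tuple of attribute names
        self.relationships = []

    def insert(self, **values):
        if set(values) != set(self.attributes):
            raise ValueError("attributes must match the relation exactly")
        # A relation may have several keys; the new relationship must be unique on all.
        for key in self.keys:
            new_key = tuple(values[a] for a in key)
            for existing in self.relationships:
                if tuple(existing[a] for a in key) == new_key:
                    raise ValueError(f"duplicate value {new_key} for key {key}")
        self.relationships.append(values)


# Both attributes of spouse are keys (two single-attribute keys).
spouse = KeyedRelation("spouse", ["spouseOf", "spouseIs"], keys=[["spouseOf"], ["spouseIs"]])
spouse.insert(spouseOf="Ann", spouseIs="Bob")

# For author, only the pair (person, document) is a key.
author = KeyedRelation("author", ["person", "document"], keys=[["person", "document"]])
author.insert(person="John", document="Backgammon for Beginners")
author.insert(person="George", document="Backgammon for Beginners")  # same document: allowed
```

A second spouse relationship for Ann, or a repeated John/"Backgammon for Beginners" pair, is rejected with ValueError in this sketch.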

We have labelled entities with names in the figures. Names are most useful when they are human-sensible tags for the entities, e.g. the title for a document, or the name of a person. However, their primary function is as a unique entity identifier, so non-unique human-sensible names must be represented as relations. If entities of a domain have more than one unique identifier, e.g. social security numbers and employee numbers, then one identifier must be chosen as the domain’s entity names and the other represented as a relation connecting the entities with the unique alternate identifier (a key of that relation).

We require that every entity have a unique name, although the name may automatically be generated by the database system. Thus every entity may be uniquely identified by the pair [domain, name].

Some authors in the database literature use the term entity to refer to a real-world object rather than its representation in the database system. When we use the term entity, we refer to the internal entity, the entity "handle" returned by database operations and stored in entity-valued variables, not the external entity that the internal entity represents or the entity identifier [domain, name] that may be used to uniquely identify an internal entity. The three are interchangeable, however, since they must always be in one-to-one correspondence.

The reader may find it simple to think of entity-valued attributes of relationships as pointers to the entities, in fact bi-directional pointers, since the operations we provide allow access in either direction. This is a useful analogy. However, there is no constraint that the model be implemented with pointers, and the relationships of a relation could equally well be conceptualized as rows of a table whose columns are the attributes of the relation and whose entries for entity-valued attributes are the string names of the entities. For example, the author relationships in the previous figure could also be displayed in the form:

Thus our introduction of entities to the Relational model does not entail a different representation than, say, a Network model might imply, but simply additional integrity checks on entity names and types, and new operations upon entities. This compatibility with the Relational data model is important, as it allows the application of the powerful Relational calculus or algebra as a query language. We return to query languages in Section 2.5.

Note that the only information about an entity associated directly with the entity is the name; this contrasts with most other data models. A person’s age or spouse, for example, would be represented in the Cypress data model via age or spouse relations. Thus the relationships in a database are the information-carrying elements: entities do not have attributes. However, the model provides an abbreviation, properties, to access information such as age or spouse in a single operation. We will discuss properties later. In addition, the physical data modelling algorithms can store these data directly with the person entity as a result of the relation key information (since a person can have only one age, a field can be allocated for it in the stored object representing a person).

The data model provides the capability to define and examine the data schema, and perform operations on entities, relationships, and aggregates of entities and relationships. In this section we discuss the basic operations on entities and relationships. In Section 2.5, we discuss the operations on aggregate types, i.e. domains and relations. We defer to Section 2.6 the discussion of "convenience" operations built upon the basic and aggregate operations.

The operations upon relationships recognize a specially-distinguished undefined value for an attribute. Unassigned attributes of a newly-created relationship have this value. A client of the data model may retrieve a value with GetF and test whether it equals the distinguished undefined value, and may set a previously defined value to be the distinguished undefined value with SetF.
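As an illustration of the undefined value, the following Python sketch models a relationship as a set of named fields that start out undefined. The names UNDEFINED, Relationship, get_f, and set_f are hypothetical stand-ins for the distinguished value and for GetF and SetF; the argument order follows the [relationship, attribute, value] form used elsewhere in this document.

```python
UNDEFINED = object()   # hypothetical stand-in for the distinguished undefined value


class Relationship:
    """A newly created relationship: every attribute starts out undefined."""

    def __init__(self, attributes):
        self.values = {a: UNDEFINED for a in attributes}


def get_f(relationship, attribute):
    return relationship.values[attribute]


def set_f(relationship, attribute, value):
    if attribute not in relationship.values:
        raise KeyError(attribute)
    relationship.values[attribute] = value


m = Relationship(["memberIs", "memberOf", "memberAs"])
set_f(m, "memberAs", "manager")

was_unassigned = get_f(m, "memberOf") is UNDEFINED   # never assigned
set_f(m, "memberAs", UNDEFINED)                      # reset a defined value
now_undefined = get_f(m, "memberAs") is UNDEFINED
```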

Other "convenience" operations are built on top of the basic operations on entities and relationships: properties and translucent attributes. They are described in Section 2.6. Although these operations are not essential to the basis of the Cypress model, they do furnish a fundamentally different perspective on the model. They provide a mechanism to associate information directly with entities (instead of through relationships) and to write programs largely independent of attribute types.

The reader will also note that we have ignored issues of concurrent access and protection in the basic operations. We will see later that an underlying transaction, file, and protection are associated with the relation and domain handles used in the basic operations. This convenience allows us to treat concurrency, protection, and data location orthogonally.

There are two kinds of operations upon domains and relations, the aggregate types in our model: the definition of domains and relations, and queries on domains and relations. We first discuss their definitions.

As in other database models and a few programming languages, the Cypress model is self-representing: the data schema is stored and accessible as data. Thus application-independent tools can be written without coded-in knowledge of the types of data and their relationships.

There is also a predefined Datatype domain, with pre-defined elements StringType, BoolType, and IntType, called built-in types. We do not allow client-defined datum types at present.

Information about domains, relations, and attributes is represented by system relations in which the system entities participate. The pre-defined SubType relation is a binary relation between domains and their subdomains. There are also predefined binary relations that map attributes to information about the attributes:

aUniqueness: maps an attribute entity to {TRUE, FALSE}, depending on whether it is part of a key of its relation. We are assuming only one key per relation here; our implementation relaxes this assumption in the case of single-attribute keys.

The following diagram graphically illustrates a segment of a data schema describing the member relation and several domains. The left side of the figure shows two subdomains of Organization, (Company and University), and the right shows the types and uniqueness properties of the member relation’s attributes memberOf, memberIs, and memberAs.

New domains, relations, and attributes are defined by creating entities and relationships in these pre-defined system domains and relations. However, our implementation provides special operations to define the data schema, to simplify error checking. These operations are:

The operation RelationSubset[relation, attribute value list] enumerates relationships in a relation satisfying specified equality constraints on entity-valued attributes and/or range constraints on datum-valued attributes. For example, RelationSubset might enumerate all of the relationships that reference a particular entity in one attribute and have an integer in the range 23 to 52 in another.

The operation DomainSubset[domain, name range] enumerates entities in a domain. The enumeration may optionally be sorted by entity name, or restricted to a subset of the entities with a name in a given range.
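The two enumeration operations can be approximated by Python generators over in-memory data. This is a sketch under assumptions, not the Cypress implementation: relationships are plain dictionaries, entities are represented by their names, an equality constraint is written as a value and a range constraint as a (low, high) tuple treated as inclusive, and the function names relation_subset and domain_subset are hypothetical.

```python
def _matches(value, constraint):
    # A tuple is read as an inclusive range; anything else as an equality test.
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high
    return value == constraint


def relation_subset(relation, constraints):
    """Enumerate relationships (dicts) that satisfy every attribute constraint."""
    for relationship in relation:
        if all(_matches(relationship[attr], c) for attr, c in constraints.items()):
            yield relationship


def domain_subset(domain, low=None, high=None):
    """Enumerate entity names in sorted order, optionally limited to [low, high]."""
    for name in sorted(domain):
        if (low is None or name >= low) and (high is None or name <= high):
            yield name


age = [
    {"ageOf": "John Smith", "ageIs": 34},
    {"ageOf": "Mary Jones", "ageIs": 60},
    {"ageOf": "Ann Adams", "ageIs": 23},
]

in_range = list(relation_subset(age, {"ageIs": (23, 52)}))
johns = list(relation_subset(age, {"ageOf": "John Smith", "ageIs": (23, 52)}))
names = list(domain_subset(["Smith", "Adams", "Jones"], low="B", high="R"))
```

Because both functions are generators, a caller consumes results one at a time, which matches the enumeration style described above without materializing the whole result.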

More complex queries can be implemented in terms of DomainSubset and RelationSubset. A future implementation will provide a MultiRelationSubset operation to efficiently enumerate single queries spanning more than one relation. MultiRelationSubset operates upon a parsed representation of the query language, and produces the same kind of enumeration as RelationSubset. See CSL-83-4 for more details.

Some more convenient specialized operations are built upon the basic operations described in the previous two sections. They implement what we call properties and translucent attributes. Although theoretically speaking these operations add no power to the model, they permit a significantly different perspective on the data access and so should be thought of as part of the model.

Properties allow the client to treat entities as if they, like relationships, had "attributes." They provide the convenience of treating attributes of relationships that reference an entity as if they were attributes (or properties) of the entity itself. The property operations are:

1. GetPList[entity, attribute1, attribute2]: Attribute1 and attribute2 must be from the same relation. Returns the values of attribute1 for all relationships in the relation that reference the entity via attribute2. Attribute2 may be omitted, in which case it is assumed to be the only other entity-valued attribute of the relation.

2. GetP[entity, attribute1, attribute2]: this is identical to GetPList except exactly one relationship must reference the entity via attribute2; otherwise an error is generated. GetP always returns one value.

3. SetPList[entity, attribute1, value list, attribute2]: Attribute1 and attribute2 must be from the same relation. Destroys any existing relationships whose attribute2 equals the entity, and creates new ones for each value in the list, with attribute1 equal to the value, and attribute2 equal to the entity. Attribute2 can be defaulted as in GetPList.

4. SetP[entity, attribute1, value, attribute2]: this is identical to SetPList except it simply adds a new relationship referencing the entity instead of destroying any existing ones (unless attribute2 is a key of its relation, in which case the existing one must be replaced).

Thus the property operations allow information specified through relationships to be treated as properties of the entity itself, in single operations. The property operations and the operations defined in earlier sections may be used interchangeably, as there is only one underlying representation of information: the relationships. As an example of the use of properties, consider the following database:

The figure shows the entity John Smith, and three relationships in which he participates: an age relationship and two member relationships. The member relationships are ternary, the age relationship binary. On this database, the property operations work as follows:

SetP[John Smith, memberOf, Foo Family] would create a new member relationship specifying John to be a member of the Foo Family. SetPList would do the same, but would destroy the two existing member relationships referencing John. In either case, the memberAs attribute would be left undefined in the new relationship.

SetP[John Smith, ageIs, 35], where ageOf is a key of the age relation, would delete the relationship specifying John’s age to be 34, and insert a new one specifying John’s age to be 35. Note that SetP acted differently than on the member relation because memberIs is not a key.

Again, the property operations are simply a convenience, although they provide a different perspective on the data model by allowing an entity-based view of a database.
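The four property operations can be sketched in Python over a single store of relationships. Several things here are assumptions made for illustration: the class PropertyStore and its method names, the schema dictionary format, the use of strings to stand for entities, None as the undefined value, and the second organization "Chess Club", which is hypothetical. Following the John Smith example, the sketch replaces an existing relationship in set_p when the attribute that references the entity (ageOf) is a key.

```python
class PropertyStore:
    def __init__(self, schema):
        # schema: relation name -> {"attributes": {name: "entity" | "datum"}, "keys": set}
        self.schema = schema
        self.data = {name: [] for name in schema}

    def _resolve(self, attribute1, attribute2):
        for relation, spec in self.schema.items():
            if attribute1 in spec["attributes"]:
                break
        else:
            raise KeyError(attribute1)
        attrs = spec["attributes"]
        if attribute2 is None:
            # Default: the only other entity-valued attribute of the relation.
            others = [a for a, kind in attrs.items() if kind == "entity" and a != attribute1]
            if len(others) != 1:
                raise ValueError("attribute2 cannot be defaulted")
            attribute2 = others[0]
        elif attribute2 not in attrs:
            raise ValueError("attributes must be from the same relation")
        return relation, attribute2

    def _add(self, relation, entity, attribute1, value, attribute2):
        new = {a: None for a in self.schema[relation]["attributes"]}   # None = undefined
        new[attribute1], new[attribute2] = value, entity
        self.data[relation].append(new)

    def _destroy(self, relation, entity, attribute2):
        self.data[relation] = [r for r in self.data[relation] if r[attribute2] != entity]

    def get_p_list(self, entity, attribute1, attribute2=None):
        relation, attribute2 = self._resolve(attribute1, attribute2)
        return [r[attribute1] for r in self.data[relation] if r[attribute2] == entity]

    def get_p(self, entity, attribute1, attribute2=None):
        values = self.get_p_list(entity, attribute1, attribute2)
        if len(values) != 1:
            raise LookupError("exactly one relationship must reference the entity")
        return values[0]

    def set_p_list(self, entity, attribute1, values, attribute2=None):
        relation, attribute2 = self._resolve(attribute1, attribute2)
        self._destroy(relation, entity, attribute2)
        for value in values:
            self._add(relation, entity, attribute1, value, attribute2)

    def set_p(self, entity, attribute1, value, attribute2=None):
        relation, attribute2 = self._resolve(attribute1, attribute2)
        if attribute2 in self.schema[relation]["keys"]:
            self._destroy(relation, entity, attribute2)   # key: replace, do not add
        self._add(relation, entity, attribute1, value, attribute2)


db = PropertyStore({
    "member": {"attributes": {"memberIs": "entity", "memberOf": "entity",
                              "memberAs": "datum"}, "keys": set()},
    "age": {"attributes": {"ageOf": "entity", "ageIs": "datum"}, "keys": {"ageOf"}},
})
db.set_p("John Smith", "ageIs", 34)
db.set_p("John Smith", "memberOf", "Acme")
db.set_p("John Smith", "memberOf", "Chess Club")      # hypothetical second membership

db.set_p("John Smith", "memberOf", "Foo Family")      # adds a third member relationship
after_set_p = db.get_p_list("John Smith", "memberOf")
db.set_p("John Smith", "ageIs", 35)                   # ageOf is a key: 34 is replaced
age_now = db.get_p("John Smith", "ageIs")
db.set_p_list("John Smith", "memberOf", ["Foo Family"])   # destroys the other memberships
after_set_p_list = db.get_p_list("John Smith", "memberOf")
```

The store holds only relationships, so values written through set_p remain visible to any code that reads db.data directly, consistent with there being one underlying representation.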

Some database application programs may not wish to be concerned with whether an attribute is entity-valued, string-valued, or integer-valued. They might prefer to have all values mapped to some common denominator, e.g. a string. An example would be a program that is simply displaying tuples on the screen.

Another class of applications would like to be independent of whether a particular attribute is represented as an entity or datum value. Consider the member relation in the previous figure. If we choose to define an Organization domain, then the memberOf attribute is entity-valued; but instead we might choose to make the memberOf attribute be string-valued, merely giving the name of the organization without defining organizations as entities. This might be appropriate, for example, if we did not wish to invoke the type checking on uniqueness of names and the correctness of entity types. We would like to write programs that are independent of whether an attribute is string-valued or entity-valued (as in the Relational data model).

We introduce translucent attributes to avoid dependence on attribute types. Any attribute may be treated as a translucent attribute, by using the GetFS and SetFS operations to retrieve or assign its value.

GetFS[relationship, attribute] is identical to the GetF operation, except it returns a string regardless of the attribute’s type. If the attribute is datum-valued, e.g. an integer or boolean, it is converted to a string equivalent. If the attribute is entity-valued, the name of the entity is returned.

SetFS[relationship, attribute, value] performs the inverse mapping. If the attribute is datum-valued, e.g. an integer or boolean, a string equivalent is accepted. If the attribute is entity-valued, the name of the entity is passed to SetFS. If an entity with the given name does not exist in the domain that is the attribute’s type, then one is automatically created.
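A possible reading of GetFS and SetFS is sketched below in Python. The classes Domain, Entity, and Relationship and the functions get_fs and set_fs are hypothetical; the built-in types StringType, IntType, and BoolType are modelled by Python's str, int, and bool, and the string form chosen for booleans ("True"/"False", compared case-insensitively) is an assumption, since the text does not specify one.

```python
class Entity:
    def __init__(self, domain, name):
        self.domain = domain
        self.name = name


class Domain:
    def __init__(self, name):
        self.name = name
        self.entities = {}          # entity name -> Entity

    def fetch_or_create(self, name):
        if name not in self.entities:
            self.entities[name] = Entity(self, name)
        return self.entities[name]


class Relationship:
    def __init__(self, types):
        self.types = types          # attribute name -> Domain, str, int, or bool
        self.values = {}


def get_fs(relationship, attribute):
    """Return the attribute's value as a string, whatever its type."""
    value = relationship.values[attribute]
    if isinstance(relationship.types[attribute], Domain):
        return value.name           # entity-valued: the entity's name
    return str(value)               # datum-valued: a string equivalent


def set_fs(relationship, attribute, text):
    """Assign the attribute from a string, converting according to its type."""
    kind = relationship.types[attribute]
    if isinstance(kind, Domain):
        # An entity with this name is created if it does not exist yet.
        relationship.values[attribute] = kind.fetch_or_create(text)
    elif kind is bool:
        relationship.values[attribute] = text.strip().lower() == "true"
    else:
        relationship.values[attribute] = kind(text)


Organization = Domain("Organization")
as_entity = Relationship({"memberOf": Organization, "memberAs": str})
as_string = Relationship({"memberOf": str, "memberAs": str})

# The same client code works whether memberOf is entity-valued or string-valued.
for rel in (as_entity, as_string):
    set_fs(rel, "memberOf", "Acme")
    set_fs(rel, "memberAs", "manager")

shown = [get_fs(rel, "memberOf") for rel in (as_entity, as_string)]
```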

Another convenience operation is provided on entities to change an entity’s name: ChangeName[entity, new name]. This operation is semantically equivalent to destroying the given entity and creating a new one with the new name, participating in the same relationships that the old one did. See the description of ChangeName in Section 3.4 for precise semantics in our implementation, however.

A relation is normalized by breaking it into two or more relations of lower order (fewer attributes) to eliminate undesirable dependencies between the attributes. For example, one could define a "publication" relation with three attributes:

This relation represents the fact that John and George wrote a book together entitled "Backgammon for Beginners," published in 1978, and Mary wrote two books on the subject of chess, in 1981 and 1982. Alternatively, we could encode the same information in two relations, an author relation and a publication-date relation:

Although the second two relations may seem more verbose than the first one, they are actually representationally better in some sense, because the publication dates of books are not represented redundantly. If one wants to change the publication date of "Backgammon for Beginners" to 1979, for example, it need only be changed in one place in the publication-date relation but in two places in the publication relation. If the date were changed in only one place in the publication relation, the database would become inconsistent. This kind of behavior is called an update anomaly. The second two relations are said to be a normalized form (as it happens, third normal form) of the first relation, and thereby avoid this particular kind of update anomaly.
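The update anomaly described above can be reproduced with a short Python sketch. Relations are modelled as lists of tuples, the function name decompose is hypothetical, and only the "Backgammon for Beginners" rows are included because they are the only ones whose contents the text gives in full.

```python
def decompose(publication):
    """Split (person, book, date) tuples into author and publication-date relations."""
    author = [(person, book) for person, book, _ in publication]
    dates = {}
    for _, book, date in publication:
        # A book with two different dates means the source relation is inconsistent.
        if dates.setdefault(book, date) != date:
            raise ValueError(f"inconsistent publication dates for {book!r}")
    return author, sorted(dates.items())


publication = [
    ("John", "Backgammon for Beginners", 1978),
    ("George", "Backgammon for Beginners", 1978),
]
author, publication_date = decompose(publication)

# Normalized form: the date is stored once, so the change is made in one place.
publication_date = [
    (book, 1979 if book == "Backgammon for Beginners" else date)
    for book, date in publication_date
]

# Unnormalized form: changing only one of the two rows leaves the data inconsistent.
partially_updated = [("John", "Backgammon for Beginners", 1979), publication[1]]
try:
    decompose(partially_updated)
    anomaly_detected = False
except ValueError:
    anomaly_detected = True
```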

Relational normalization is not strictly part of the Cypress data model. However, the model’s operations (and the tools we will develop in the implementation) encourage what we will call functionally irreducible form, in which relations are of the smallest order that is naturally meaningful.

A relation is in irreducible form if it is of the smallest order possible without introducing new artificial domain(s) not otherwise desired (all relations can be reduced to binary by introducing artificial domains). We will allow a slight weakening of irreducible form, functionally irreducible form, which permits combining two or more irreducible relations only when their semantics are mutually dependent (and therefore all present or absent in our world representation). For example, a birthday relation between a person, month, day, and year can be combined instead of using three relations. Another example would be an address relation between a person, street, city, and zip code. Combining an age and phone relation would not result in functionally irreducible form, however, as their semantics are not mutually dependent.

The functionally irreducible relations seen by the user are independent of the physical representation chosen by the system for efficiency, so we are concerned only with the logical data access. Note that in addition to avoiding update anomalies, functionally irreducible form provides a one-to-one correspondence between the relationships in the database and the atomic facts they represent, a canonical form that is in some sense more natural than any other form.

We would like a mechanism to divide up large databases, to provide different perspectives or subsets of the data to different users or application programs. In this section we discuss a mechanism to provide this separation: segments. A segment is a set of entities and relationships that a database client chooses to treat as one logical and physical part of a database.

In introducing segments, we will slightly change the definition of an entity, previously defined to be uniquely determined by its domain and name. We will treat entities with the same name and domain in different segments as different entities, although they may represent the same external entity. The unique identifier of an internal entity is now the triple [segment, domain, name].

A consequence of this redefinition of entities is that relations and domains do not span segments, either. Application programs must maintain any desired correspondence between entities, domains, or relations with the same name in different segments. We will return to this later. In the next section, we will discuss a more powerful but more complex and expensive mechanism, augments, in which the database system itself maintains the correspondence.

SegmentOf[entity or relationship] returns the segment in which a given entity or relationship exists. It may also be applied to relations or domains, since they are entities.

1. DeclareDomain and DeclareRelation take an additional argument, namely the segment in which the defined domain or relation will reside. The entity representing a domain or relation now represents data in a particular segment.

2. DeclareEntity and DeclareRelationship are unaffected: they implicitly refer to the segment in which the respective domain or relation was defined. By associating a segment (and therefore a transaction and underlying file) with each relation or domain entity returned to the database client, we conveniently obviate the need for additional arguments to every invocation of the basic operations in the data model.

3. DestroyEntity, DestroyRelationship, GetF, SetF, DomainOf, RelationOf, and Eq are similarly unaffected: they deal with entities and relationships in whatever segment they are defined. Note that by our definition, entities in different segments are never Eq. Also note that nothing in our definition makes a SetF across a segment boundary illegal (i.e. SetF[relationship, attribute, entity] where the relationship and entity are in different segments). Our current implementation requires that special procedures GetFR and SetFR be used on attributes that can cross segment boundaries; see Section 3.

4. DomainSubset and RelationSubset are unchanged when applied to client-defined domains or relations, i.e., they enumerate only in the segment in which the relation or domain was declared. However, an optional argument may be used when applied to one of the system domains or relations (e.g. the Domain domain), allowing enumeration over a specific segment or all segments. RelationSubset’s attribute-value-list arguments implicitly indicate the appropriate segment even for system relations, so a segment is not normally needed unless the entire relation is enumerated.

Note that the data in a segment is stored in an underlying file physically independent from other segments, perhaps on another machine. Introducing a file system into the conceptual data model may seem like an odd transgression at this point. From a practical point of view, however, we believe it better to view certain problems at the level of file systems. This point of view allows segments to be used for the following purposes:

1. Physical independence: Different database applications typically define their data in separate segments. As a result one application can continue to operate although the data for another has been logically or physically damaged. One application can entirely rebuild its database without affecting another, or an application can continue to operate in a degraded mode missing data in an unavailable segment.

2. Logical independence: Different database applications may have information which pertains to the same external entity, e.g. a person with a particular social security number. When one application performs a DestroyEntity operation, however, we would like the entity to disappear only from that application’s point of view. Information maintained by other applications should remain unchanged.

3. Protection: Clients can trust the protection provided by a file system more easily than a complex logical protection mechanism provided by the database system. An even higher assurance of protection can be achieved by physical isolation of the segment at a particular computer site. A more complex logical protection mechanism would be desirable for some purposes, but was deemed beyond the scope of Cypress.

4. Performance: Data may be distributed to sites where they are most frequently used. For example, personal data may reside on a client’s machine while publicly accessed data reside on a file server. If the file system provides replication, it can be used to improve performance for commonly accessed data.

As noted earlier, information about an external entity may be distributed over multiple segments. One or more database applications may cooperate in maintaining the illusion that entities, domains, and relations span segment boundaries. This illusion may be used in at least two ways:

1. Private additions may be made to a public segment by adding entities or relationships in a private segment. The new relationships may reference entities in the public segment by creating representative entities with the same name in the private segment. An example would be personal phone numbers and addresses added to a public database of phone numbers and addresses: an application program would make the two segments appear to the user as one database.

2. If two applications use separate segments A and B, they may safely reference each other’s data yet remain physically independent. One of the applications may destroy and reconstruct its segment if it uses the same unique names for its entities. If both applications have relationships referencing an entity e, and application A does a DestroyEntity operation on e, the entity and relationships referencing it disappear from application A’s point of view, but application B’s representative entity and relationships remain.
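The segment behavior described in this section can be sketched in Python as follows. All names here are hypothetical (EntityId, Segment, same_external_entity, and the phone relation with its sample number); each segment is modelled as an independent in-memory store, and the correspondence between representative entities is maintained by application code, as the text requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityId:
    """Identifier of an internal entity: the triple of segment, domain, and name."""
    segment: str
    domain: str
    name: str


class Segment:
    def __init__(self, name):
        self.name = name
        self.entities = set()
        self.relationships = []     # list of (relation name, {attribute: value})

    def declare_entity(self, domain, name):
        entity = EntityId(self.name, domain, name)
        self.entities.add(entity)
        return entity

    def declare_relationship(self, relation, **values):
        self.relationships.append((relation, values))

    def destroy_entity(self, entity):
        if entity.segment != self.name:
            raise ValueError("entity belongs to another segment")
        self.entities.discard(entity)
        # Relationships that reference the entity are removed with it.
        self.relationships = [
            (rel, values) for rel, values in self.relationships
            if entity not in values.values()
        ]


def same_external_entity(a, b):
    """Application-level correspondence: same domain and name, in any segment."""
    return (a.domain, a.name) == (b.domain, b.name)


seg_a, seg_b = Segment("A"), Segment("B")
e_a = seg_a.declare_entity("Person", "John Smith")
e_b = seg_b.declare_entity("Person", "John Smith")    # representative entity in B

seg_a.declare_relationship("age", ageOf=e_a, ageIs=34)
seg_b.declare_relationship("phone", phoneOf=e_b, phoneIs="555-0100")

seg_a.destroy_entity(e_a)   # disappears from A's point of view only
```

After the final call, segment A holds no entity or relationship for John Smith, while segment B's representative entity and its phone relationship are untouched.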