Cypress database concepts and facilities, October 1982
CSL Notebook Entry
To: CSL Cedar Interest Group		Date: October 6, 1982
From: Cedar Database Group		Location: PARC/CSL, Palo Alto
	(Rick Cattell, Mark Brown)
Subject: Cypress database concepts and facilities (phase two implementation)
1. Introduction 
2. Database system concepts
3. Data modelling concepts
4. Client interface
5. A sample program
6. Database design
7. References
*** PRELIMINARY DRAFT: NOT FOR CIRCULATION ***
1. Introduction
Why a database system?
CSL needs a database system. A database system provides many data management facilities that are currently re-invented by program after program, and thereby eliminates a large duplication of effort. It also provides a capability that is needed by many programs but actually implemented by very few: a means of sharing information among programs that are distributed around the network.
Why this paper?
We hope that this paper will be read by potential users of the Cypress database system. It is certainly required reading for those who want to make use of the system now.
This paper begins by developing database concepts that are necessary to put the Cypress database facilities in perspective. We do not expect most people in CSL to become familiar with database jargon, or to read a Mesa interface, in order to discover whether or not the database system can be useful to them. Thus we begin (Section 2) by defining the scope of the general database problem, and measuring our ambitions relative to this standard. We also describe some current and potential applications of our system.
The main body of this paper describes the interface to the Cypress database system. We first give an overview (Section 3), then present the full story (Section 4). This requires defining the class of data structures that can be stored in a database, and then describing the procedures for manipulating these data structures. For illustration, we present a short Mesa program that manipulates a database (Section 5). We then give an example of how a particular real-world information structure might be represented as a database, and discuss design tradeoffs (Section 6).
Where are we going with database systems?
The form of our database system has evolved and will continue to evolve with experience with database applications in CSL. The Cypress database system described here incorporates two major changes to the original Cedar Database System (Brown, Cattell, Suzuki [1981]), motivated by early experience with applications: a more powerful file system (Alpine, by Brown et al [1982]), and more powerful data structuring mechanisms (DBView, by Cattell [1982]). Also, some database tools have been developed (Squirrel, see Donahue [1982]). We plan to build other tools and interfaces on top of the one described here, and eventually to integrate some set of database facilities into the Cedar Mesa language. Because other documents discuss these longer range plans, however, we shall give such discussions secondary status in this paper. We do wish to emphasize that we are open to suggestions concerning what the future facilities should look like; it is certain that our earliest and most energetic clients will have the greatest influence on the evolution of the system.
What are applications of Cypress?
As mentioned earlier, applications of the original Cedar database system were helpful in its revision. These applications included a database of Mesa programs, the CSL notebook, and a personal database browser. These applications were experiments, though we expect them to be carried over to the Cypress system in the coming year.
At the time of this writing, three major application programs run on the Cypress database system. The first is Walnut, which provides facilities in the Cedar programming environment to read, send, and organize electronic mail in a database. The second is Hickory, which manages an appointment calendar database and provides an alarm mechanism in Cedar to remind the user of meetings, seminars, etc. The third is Squirrel, which provides facilities to edit, organize, and browse a personal database of any type of data. In addition to these application-independent data manipulation facilities, Squirrel provides facilities to application programs such as Hickory and Walnut. Cypress clients planning to build application programs should consult the Squirrel documentation (Donahue [1982]) to see if these facilities are useful for the envisioned application.
Other database applications are planned but have not yet been undertaken due to manpower limitations. Most of these are public databases such as the CSL notebook and the "Whitepages" database of people, phone numbers, and addresses.
2. Database system concepts
Before discussing the facilities to be provided in the Cypress database system, we must address the question "what properties make something a database system?" We shall often contrast databases with more familiar data structures, such as Mesa record structures.
We find it useful to give a few definitions here. A client is a program using some facility, such as a database system. A user, on the other hand, is a human using a facility, such as an interactive interface to a database system. A database query is the operation of reading information from a database; depending on the type of interface provided, a query may be limited to reading one item or may involve thousands. An update is a write operation on a database.
2.1 Three basic properties of databases
Persistent
Unlike information encoded into a typical Mesa data structure, information stored in a database may last longer than one invocation of a program. This persistence of data has far-reaching consequences.
Information in a long-lived database can become more valuable than the programs that manipulate it. A database may exist long enough to be operated on by several different generations of the same application program, or to be used for applications that were not anticipated when the database was first designed. Thus the gradual evolution of the database becomes a concern analogous to the well-known problems of evolving programs.
The continued soundness of a persistent database is important. Soundness in the face of physical mishaps such as disk head crashes or tape vault fires is called physical integrity. A guarantee of perfect physical integrity is impossible, but a database system should reduce the probability of disaster to a level that is acceptable in light of the value of information in the database.
A much more difficult problem comes in maintaining the logical integrity of a database. A database is generally used to model some real-world state of affairs, such as the structure of a Mesa system, the layout of a document, or the contents of a person's appointment schedule. Logical integrity means the correct correspondence between a database and the real-world situation being modeled. A database may be damaged in a logical sense by the keying of incorrect input data, or the running of an incorrect program. These two examples should make it clear that perfect logical integrity is an ideal that cannot be guaranteed; unfortunately, the loss of confidence caused by errors in a database can be just as damaging as a catastrophic physical failure (Hoare [1975]).
One simple form of insurance for the logical integrity of a database is always provided in a database system. This is the persistent storage and enforced use of data that tells how to interpret the representation of the database. This data is somewhat analogous to the type declarations required in Mesa programs that use data structures. It should be clear that using the wrong declarations when accessing the database could lead to disaster. Some database systems go beyond this, allowing the user to state more application-specific properties that his data should satisfy. These are sometimes called integrity constraints or integrity assertions.
Of course, there was long-lived information before there were any database systems: this information was stored in file systems. But file systems at best gave some assistance in preserving the physical integrity of stored data, and were no help with any of the other problems associated with persistence. Hence these problems were solved in an ad hoc manner by various applications.
Shared
The persistence of information in a database introduces the possibility of data sharing. This sharing is qualitatively different from the sharing that can take place among separate processes of a Mesa program.
One aspect of sharing is consistent concurrent access to data by programs. Here "concurrent access" does not mean that the programs' commands to read or update the database are actually executed at the same instant of time, but rather that the precise order in which these commands are executed from different programs is unpredictable.
In general, consistency of a database can only be defined with respect to an application: we say that a database state is consistent if it satisfies a certain application-specific invariant. For instance, an accounting system might have as an invariant the property that the total debits and credits in the system are equal. In some database systems a transaction facility is provided to help applications maintain consistency invariants. Briefly, a transaction is a sequence of read and write commands. The system supports the property that the entire sequence of commands executes atomically with respect to all other transactions, that is, the transaction executes as if no other transactions were in progress at the same time. Because there may in fact be other transactions accessing the same data at the same time, it is possible (in most systems) that two transactions may deadlock, in which case one of them must be aborted. So the price paid for concurrent access is the necessity that programs be prepared to retry aborted transactions.
An example of an application in which concurrent access capability is necessary is an interactive reservation system, say for CSL meeting rooms. By structuring the reservation system's database accesses into transactions, it should be possible to ensure that the database will remain consistent despite reservations entered concurrently from different machines on the network. The consistency invariants in this case would include having each time period for each room reserved at most once.
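To make the transaction discipline concrete, here is a minimal Mesa sketch of the retry paradigm for the reservation example. The names OpenTransaction, Commit, and the signal Aborted are placeholders, not the actual Cypress interface; RETRY is the standard Mesa catch-phrase action that re-executes the enclosing block.
    BEGIN ENABLE Aborted => RETRY;  -- if the transaction aborts, start over
    OpenTransaction[];
    -- query the reservation relation to check that the requested room is
    -- still free at the requested time, then record the new reservation
    Commit[];
    END;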
Another aspect of sharing is the integration of information into a single database. Integration means that information is kept in a consistent form in one database, even if this is not strictly necessary for the purposes of existing applications. Integration allows new problems to be solved more easily because all of the necessary information is maintained in the same database. It also encourages sharing of information at a more macroscopic level than in the reservation system example above, since it provides a set of uniform naming conventions and access methods for all users. In principle, integration makes logical integrity easier to maintain, because there is less of a tendency to duplicate data for use by different applications.
Unfortunately, sharing introduces new possibilities for logical damage to a database. For instance, one user's undebugged program can mistakenly modify the data of another user. A database's protection system is designed to make such damage less likely by requiring an authorization check before querying or updating the database. Some database systems provide elaborate protection facilities that give fine-grained control over access to information; this is especially necessary if sensitive information is to be shared through the database.
Large
Databases often contain large volumes of data. By one definition, a database qualifies as "large" when it becomes unacceptable to make the database inaccessible during the amount of time required to copy it.
In principle, largeness is independent of persistence. In practice, largeness generally implies persistence, but not the other way around: it is quite common to require small data objects to survive between the executions of a program.
2.2 Consequences of the properties
The discussions above were limited to immediate results of the three basic properties. We now deal with some of the more indirect consequences.
Interface independent of storage structure
Because of the shared, persistent nature of a database, its usage changes with time. To provide good performance as the access pattern shifts, it is necessary to change the database representation to keep the most common operations inexpensive. Because of the value of existing applications software, the database system should make these programs functionally immune to changes in the storage structure used in representing the database. That is, such changes should at most affect a program's performance, and not the results that it gives.
This data independence can be achieved by ensuring that the interface used by programmers to specify algorithms involving the database refers to a model of data, the conceptual model, that is a useful abstraction of stored structures (independent of their precise form). Ideally the database system would then choose a good underlying representation based on its knowledge of the user's programs, or adaptively based on execution statistics. This remains a research problem: no existing system can perform such optimizations automatically. An alternative is for the system to accept advice from the user on pragmatic matters (such as the use of internal indexes), where this advice takes the form of declarations that cannot affect the semantics of an application program.
Simple and regular structures
Another consequence of the more fundamental database properties above is that database structures should be kept simple and regular. Persistence argues for simplicity because if a structure is not simple, it is difficult for different programs to maintain it in a consistent and accurate way over time. Regular structures are easier to share for the same reason, and also because they tend to contain larger meaningful substructures that can serve as the units for locking. With a large, irregular structure it is very likely that consecutive accesses to data will seem uncorrelated from the system's point of view, meaning that a disk access is required for each one. In more regular structures the access pattern often involves scanning data in an order that is meaningful to the system, so that caching and prefetching will increase performance.
Aggregate-level operations
By structuring a database in a very regular fashion, we make it feasible to operate on it with very concise high-level languages. This approach was first demonstrated with the relational model of data (Date [1977]), which is probably the simplest model with respect to the mechanisms it provides for expressing relationships among data items. The use of such high-level languages makes it very easy to develop new applications that mainly involve queries, even when the queries use portions of the database that were developed independently. These high-level languages may also help somewhat in avoiding programming errors and maintaining the logical integrity of the database.
Multiple views
Because databases represent real-world objects and relationships, changes in the world (or in our preferred abstraction of it) may force a change in the structure of an existing database. For example, we may have to add a new field to each record of a given type, or delete an existing field. A stronger form of data independence holds when any application that is not logically affected by a given alteration of the conceptual model is functionally immune to the change. (It is impossible for a nontrivial application to be immune to all possible alterations, because some of the data that it interprets may be expunged in a change of the conceptual model.) This degree of data independence requires a layer of mapping between the conceptual model and the external model presented to an application. Such a map is called a view. There may be many views that involve the same underlying data; views also allow different external models to be used while keeping a common conceptual model. It is not generally possible to perform an update through a view unless the view is very simple.
2.3 Database versus persistent virtual memory
Is it productive to view the data structures and variables found in normal Mesa programming as database variables that don't happen to be persistent, shared, or large? To be more concrete, is the distinction between Mesa records and database objects really necessary?
To some extent this is a function of our aspirations. If we consider persistence to be the primary goal, with sharing and largeness being much less important, then we may be satisfied with a database that is essentially a long-term Mesa virtual memory. If the typing of data objects in this memory could be preserved automatically over long periods, the results would have some of the character of a database.
JaM, Interlisp and Smalltalk are examples of existing systems at PARC that support this kind of persistent virtual memory. Databases built in these environments can be very flexible by normal database standards. The user is able to create arbitrarily complex data structures and is able to bind procedures to individual data objects, giving a programmable view mechanism (in the sense of the section above). JaM, Interlisp and Smalltalk do not have a conventional database system's facilities for ensuring integrity, do not allow concurrent access, and don't support large databases (by the standards of "Business Data Processing" applications).
2.4 Cypress database system aspirations
Persistent
Physical integrity is aided by storing data on the Alpine file system (or Juniper, which we use right now). Declarations live in the database, and are read using the same interface as the rest of the database. (This commonality of interface is not true for updates, for consistency reasons).
Shared
Concurrent access will be provided by using page-level locking on Alpine. Since this does not allow the locking of logical units of data, it will perhaps not support very high degrees of concurrent access efficiently; we won't know any more about this without getting some experience. We expect the database to be used in an integrated way (although probably not at first, until confidence is built up.) The initial protection mechanism will be based on file protection. This is not ideal, but the best way to improve it may be to build finer-grained protection at a level above the interface we are now developing.
Large
The database may get large in the sense that it occupies a large portion of a disk volume, but we do not expect that 24-hour real-time performance will be necessary. Thus we will be satisfied with fairly crude algorithms for responding to restructuring requests that appear to affect large volumes of data (but in principle could be handled by a more sophisticated incremental algorithm.) We do not plan to use very sophisticated algorithms to gain good performance for queries involving huge amounts of data. We shall assume that the amount of data used to process a query will just about fit in main storage. We are primarily concerned with interactive systems rather than "data processing" applications that might involve scanning a huge database. Our performance aspirations are also limited by the bandwidth available to our file server (though we shall do local caching of file pages to reduce this traffic).
Interface independent of storage structure
This is true in our system to a large extent. The presence or absence of indexes ("inverted files") is not reflected in the statement of algorithms. Declarations are used to specify indexes, which are then maintained automatically by the system. The model does contain a form of pointer, but much of its underlying mechanism is abstracted away.
Allowed structures are simple and regular
The model of data is a simple extension of the relational data model. It consists of records, integers, strings, and atoms. Cypress helps maintain the logical integrity of the data by checking type constraints on these data. One goal of this document is to explain how to represent information within the Cypress data model in a well-structured way.
Aggregate-level operations and views
These are part of the next phase of development of Cypress. Our current interface allows only one-at-a-time operations on data that are explicitly stored; this functionality is quite adequate for many applications. We shall build aggregate operations and multiple views of data on top of this interface.
2.5 Database System Overview
Database structures and their manipulation
A database is a time-varying collection of data, all of which is stored as relationships and entities. A relationship is the database analogue of a Cedar record, and an entity is the database analogue of a Cedar atom. Fields of relationships are called attributes. Attributes may take on integer, string, boolean, and entity values. We do not allow any analogue of variant records for relationships. We also exclude the possibility of a "repeating" attribute, that is, an attribute whose value is a sequence of data values.
The database package provides operations for creating new relationships and entities of a specified type, and for destroying them. Other operations allow the value associated with any particular attribute of a given relationship to be fetched or updated.
We can contrast database relationships with Mesa records as follows. First, in the database model there is a mechanism that allows all relationships of a given type to be retrieved. This means that no tuple ever becomes "garbage", since there is always an access path to it. With a Mesa record, once the last user-maintained pointer to a record disappears the record is truly inaccessible. Second, it is possible to locate all of the tuples that reference a given atom or have a given range of integer or string values; this is not true with Mesa records. Third, nesting of relationships is not allowed; a Mesa record may contain a component that is also a Mesa record, or a pointer to one. Fourth, there is no analogue of variant records for tuples in the current implementation, although there is a hierarchy of types of entities (atoms).
Databases and files
A database is represented as a collection of files. A file in database format is called a segment. The segments comprising a database are identified by the client of the system. The set of segments may dynamically change; the segments are physically independent and may be independently opened, closed, and modified.
All of the relationships or entities of some client-created type lie in a single segment, which must be specified when the type is defined. This partitioning provides a crude mechanism for clustering related data, and may reduce the potential for conflict in concurrent accesses to a database by different applications. A default segment is provided for clients who don't care about segment structure.
Data definition
The phase two Cypress system is not integrated with Mesa; the data manipulation operations sketched above are implemented as Mesa procedures in the interface. This leaves the problem of declarations: how do we create new relations and domains, or access existing ones? The answer is to define a small number of system entities to contain information about types.
This technique for storing the "data dictionary" type information has the advantage that the same interface is used for data dictionary retrieval as for normal data retrieval. We do not extend the commonality of interface to updates, since the consistency of the types of the stored data is difficult to ensure without constraints on the type updates.
Concurrency
The database package uses the notion of transaction to structure database accesses. Clients can ensure the logical integrity of their databases, even in the face of concurrent access, by proper grouping of queries and updates into transactions.
Transactions do not always succeed. Sometimes this is due to hardware or software failures, and sometimes due to conflicts in concurrent accesses. Concurrent updates to data in different segments almost never cause a conflict; data within the same segment are also arranged to minimize the potential for conflicts. But all clients must be prepared for the possibility of an aborted transaction. By far the simplest response to a transaction abortion is to restart the transaction from the beginning (though this may be painful when the updates were performed manually).
Protection
Database protection is limited to the protection provided for segments by the file system. This has its strong and weak points. On the positive side, file protection is a robust mechanism. Since it is implemented on a remote host, the malfunction of a user program cannot breach it. Users in our environment have experience with file protection and they trust it.
Environmental requirements
The Cypress system uses Alpine or the local Pilot file system to store the database. All of the database software will run in the client machine; thus the system's speed may depend a great deal upon the amount of main memory available on the client machine for caching database pages, as well as on the raw speed of the client machine.
3. Data modelling concepts
Data Independence
In this section, we give an informal description of the Cypress data model. We are dealing with a conceptual data model, the language used to define and access the types of objects and relationships between them. This should be carefully distinguished from a physical data model, the data structuring mechanisms we are given for actual storage and access of data. The physical data model is hidden as much as possible from the database client to facilitate data independence, the guarantee that a user's program will continue to work (perhaps with a change in efficiency) even though the physical data representation is redesigned.
For any particular database using the given conceptual and physical models, the actual specifications of this database using the primitives the models provide are termed the conceptual data schema and physical data schema. Note that a mapping must be provided between the conceptual and physical levels, either automatically or with further instruction from the client; we will do some of both. The logical to physical mapping is intimately associated with the performance of the database system as viewed by the user performing operations at the conceptual level. Performance is generally not a criterion for choosing between conceptual data models, unless there is no known way to map the differing conceptual views into the same or equally efficient implementations.
Basic Primitives
Three basic primitives are defined in the model: the entity, the datum, and the relationship. An entity represents an abstract or concrete object in the world: a person, an organization, a document, a product, an event. In programming languages and knowledge representation, entities have variously been referred to as atoms, symbols, nodes, and objects. A datum, unlike an entity, represents literal information such as times, weights, part names, or phone numbers. Character strings and integers are possible datum types. Whether something is represented as an entity or merely as a datum is a policy decision: e.g., an employee's spouse may be represented in a database system as a datum (the spouse's name), or the spouse may be an entity in itself. The database system provides a higher level of logical integrity checking for entities, as we will see later.
In some programming languages, an entity is just another type of datum, like numbers and strings. For precision, we will instead use the term value to refer to something that can be either a datum or an entity.
A relationship is a tuple whose elements are [entity or datum] values. In programming languages, a relationship is commonly termed a record. As in most programming languages, we can refer to the elements (fields) of tuples by name instead of position. These names for positions are called attributes.
We also define entity types, datum types, and relationship types: domains, datatypes, and relations, respectively. We make use of these three types through one fundamental type constraint: for every relation, we define the number and types of attributes expected of relationships of that type. That is, every relationship in a given relation must have the same attributes, and the values associated with each attribute must be from a pre-specified domain or datatype. For example, a member relation might specify that a given person is a member of a given organization with a given job title as in the figure following. The person and organization might be entities, while the title might be a string datum. We relax the fundamental type constraint somewhat in allowing a lattice of types of domains: a particular value may then belong to the pre-specified domain or one of its sub-domains. For example, one could be a member of a University, a Company, or any other type of Organization one chooses to define. Other relations, e.g. an "offers-course" relation, might apply only to a University.



[Figure: ViewFig1.press — the member relation schema]
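Anticipating the client interface of Section 4, the member relation of the figure might be declared as follows. DeclareDomain and DeclareRelation are described later in this document; DeclareAttribute and the exact argument forms are assumptions of this sketch. The attribute names match those used in the property examples later in this section.
    Person: Domain ← DeclareDomain["Person"];
    Organization: Domain ← DeclareDomain["Organization"];
    member: Relation ← DeclareRelation["member"];
    memberIs: Attribute ← DeclareAttribute["is", member, Person];        -- the person
    memberOf: Attribute ← DeclareAttribute["of", member, Organization];  -- the organization
    memberAs: Attribute ← DeclareAttribute["as", member, StringType];    -- the title, a string datum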
Relationships represent facts about the world. A relation (a set of relationships) represents a kind of relationship in which entities and values can participate. We will schematically represent that John Smith is a member of Acme company with title "manager" by drawing lines for the relationship (labelled with the relation and its attributes), a circle for each entity, and a box for the datum:


[Figure: ViewFig2.press — John Smith's membership in Acme Company with title "manager"]
The reader should normally think of relations as types, not sets. A relation can be defined as the set of all relationships of its type, however, and thus can be treated as a relation in the normal mathematical sense. Note that we differ from mathematical convention in another minor point: the use of attributes to refer to tuple elements by name instead of position. Reference by position is therefore not necessary and is in fact not permitted. We can often omit attribute names without ambiguity since the types of participating entities imply their role in a relationship. However, they are necessary in the general case; e.g., a boss relation between a person and a person requires the directionality to define its semantics.
We can summarize the six basic primitives of the data model in tabular form. Familiarity with these six terms is essential to understanding the remainder of this document:
Type        Instance        Instance Represents
Domain      Entity          physical or abstract object
Datatype    Datum           numerical measurement or symbolic tag
Relation    Relationship    logical correspondence between one or more objects and values
Domains, datatypes, and relations are types but can also be viewed as sets.
Names and Keys
The data model also provides primitives to specify uniqueness of relationships and entities. For relationships, this is done with keys, and for entities, with names. A key for a relation is an attribute or set of attributes whose value uniquely identifies a relationship in the relation. No two relationships may have the same values for a key attribute or attribute set. A name acts similarly for a domain. The name of an entity must uniquely specify it in its domain.
Consider a spouse relation, a binary relation between a person and a person. Both of its attributes are keys, because each person may have only one spouse (if we choose to define spouse this way!). For an author relation, neither the person nor the document is a key, since a person may author more than one document and a document may have more than one author. The person and document attributes together would comprise a key, however: there should be only one tuple for a particular person-document pair.


[Figure: ViewFig3.press — the spouse and author relations]
We have labelled entities with names in the figures. Names are most useful when they are human-sensible tags for the entities represented, e.g. the title for a document, or the name of a person. However, their primary function is as a unique entity identifier, so non-unique human-sensible names must be represented in some other manner (via relations).
Note that the only information about an entity associated directly with the entity is the name; this contrasts with most other data models. A person's age or spouse, for example, would be represented in the Cypress data model via age or spouse relations. Thus the relationships in a database are the information-carrying elements: entities do not have attributes. However, the model provides an abbreviation, properties, to access information such as age or spouse in a single operation. We will discuss properties later. In addition, the physical data modelling algorithms can store these data directly with the person entity as a result of the relation key information (since a person can have only one age, a field can be allocated for it in the stored object representing a person).
Normalization
A common topic in the database literature is relational normalization (Beeri et al [1978], Codd [1970], Fagin [1977, 1981], Rissanen [1977], Biller [1979], Hall et al [1976]). A relation is normalized by breaking it into two or more relations of lower order (fewer attributes) to eliminate undesirable dependencies between the attributes. For example, one could define a "publication" relation with three attributes:
Publication:
Person    Book                        Date
George    Backgammon for Beginners    1978
John      Backgammon for Beginners    1978
Mary      How to Play Chess           1981
Mary      How to Cheat at Chess       1982
This relation represents the fact that John and George wrote a book together entitled "Backgammon for Beginners", published in 1978, and Mary wrote two books on the subject of chess, in 1981 and 1982. Alternatively, we could encode the same information in two relations, an author relation and a publication-date relation:
Author:
Person    Book
George    Backgammon for Beginners
John      Backgammon for Beginners
Mary      How to Play Chess
Mary      How to Cheat at Chess
Publication-date:
Book                        Date
Backgammon for Beginners    1978
How to Play Chess           1981
How to Cheat at Chess       1982
Although the second two relations may seem more verbose than the first one, they are actually representationally better in some sense, because the publication dates of books are not represented redundantly. If one wants to change the publication date of "Backgammon for Beginners" to 1979, for example, it need only be changed in one place in the publication-date relation but in two places in the publication relation. This latter kind of behavior is called an update anomaly. The second two relations are said to be a normalized form (as it happens, third normal form) of the first relation, and thereby avoid this particular kind of update anomaly.
A variety of successively stricter criteria for normalization have been developed and studied, depending on the kinds of real-world dependencies between attributes. Relational normalization is not part of the Cypress data model. However, the model's operations (and the tools we will develop in the implementation) encourage what we will call functionally irreducible form, in which relations are of the smallest order that is naturally meaningful. This form is in some sense the most fully normalized canonical form that could be defined.
Functionally irreducible normal form cannot be defined simply syntactically, but rather requires resort to the semantics of relations. Specifically, the presence or absence of a relationship in a relation represents the truth of some predicate on the state of the represented world (see Kent [1979]). For example, a "member" relationship represents the fact that a particular person is a member of a particular organization. (The absence of such a relationship may mean falsity of that predicate or lack of knowledge; we fortunately need not concern ourselves with this distinction here). We give an operational definition of irreducible form below. For informal purposes, it suffices to say that relations should be of the smallest order that is possible without introducing a new artificial domain not otherwise desired (all relations can be reduced to binary by introducing artificial domains). We will allow a slight weakening of irreducible form, called functionally irreducible form, which permits combining two or more irreducible relations that must all be logically present or absent in the real world. For example, a birthday relation between a person, month, day, and year can be used instead of three separate relations. Operationally defined, then, a relation is in functionally irreducible form iff:
(1) it is binary, or
(2) it has no single-attribute key, and cannot be represented with lower order without introducing a new type of entity to represent all or part of the relationships themselves, e.g., a member relation between a person, an organization, and the role of the person in the organization, or
(3) it has a single-attribute key, and the non-key attributes represent a single atomic fact such that all or none would be present or absent. An example would be the birthday relation just mentioned.
The functionally irreducible relations seen by the user are independent of the physical representation chosen by the system for efficiency, so we are concerned only with the logical data access. Note also that the motivation for the functionally irreducible form is a one-to-one correspondence between the relationships in the database and the atomic facts they represent, a canonical form which is in some sense more natural than any other form.
Data Schema
Like some other database models and a few programming languages, the Cypress model is self-representing: the data schema is stored and accessible just as any other data. Thus application-independent tools can be written without coded-in knowledge of the types of data and their relationships, by reading the meta-representation at run-time.
The pre-defined domains are:
the Domain domain, with one element (entity) per client-defined domain
the Attribute domain, with one element per client-defined attribute
the Relation domain, with one element per client-defined relation
The pre-defined relations are the Subtype relation and the Attribute relation. The Subtype relation has one element (relationship) per domain-subdomain relationship, with two attributes referencing the domain and subdomain, respectively. The Attribute relation has one element per client-defined attribute. The Attribute relation's attributes and their values are:
aAttribute: the attribute this relationship refers to
aRelation: this attribute's relation
aType: the domain or datatype that this attribute accepts
aUniqueness: TRUE iff this attribute is a key part of the relation
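Because the schema is self-representing, an application-independent tool can discover the attributes of an arbitrary relation r at run time by enumerating the Attribute relation itself, using the RelationSubset operation defined below. The RelshipSet type, NextRelship, PrintName, and the constraint-list form are assumptions of this sketch, not part of the definitions above.
    rs: RelshipSet ← RelationSubset[AttributeRelation, LIST[[aRelation, r]]];
    FOR t: Relship ← NextRelship[rs], NextRelship[rs] UNTIL t = NIL DO
      PrintName[NARROW[GetField[t, aAttribute]]];  -- print each attribute's name
    ENDLOOP;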
Primitive operations
The data model provides the capability to define and examine the data schema, and perform operations on entities, relationships, and aggregates of entities and relationships.
Five basic operations are defined for entities:
1. CreateEntity: Creation of a new entity in a domain. An entity name may be specified.
2. DestroyEntity: Destruction of an entity; this also destroys all relationships which refer to it.
3. GetDomain: Fetching the domain of an entity.
4. GetName: Fetching the name of the entity.
5. SetName: Changing the name of the entity.
Five basic operations are defined for relationships:
1. CreateRelationship: Creation of a new relationship in a given relation.
2. DestroyRelationship: Destruction of a relationship.
3. GetRelation: Fetching the relation of a relationship.
4. GetField: Fetching a field of a relationship.
5. SetField: Specifying a new value for a relationship field.
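As an illustration, the relationship of the earlier figures (John Smith is a member of Acme Company with title "manager") could be built with these operations as follows. This is a sketch in the Mesa interface of Section 4, reusing the declarations from the earlier sketch; the exact signatures appear with the client interface.
    john: Entity ← CreateEntity[Person];
    acme: Entity ← CreateEntity[Organization];
    SetName[john, "John Smith"];
    SetName[acme, "Acme Company"];
    m: Relship ← CreateRelationship[member];
    SetField[m, memberIs, john];       -- the person
    SetField[m, memberOf, acme];       -- the organization
    SetField[m, memberAs, "manager"];  -- the title, a string datum
    -- GetRelation[m] now returns member; GetDomain[john] returns Person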
Query Operations
The RelationSubset operation enumerates all the relationships in a relation satisfying specified equality constraints on entity-valued attributes and/or range constraints on datum-valued attributes.
The DomainSubset operation enumerates all the entities in a domain. The enumeration may optionally be sorted by entity name, or restricted to the subset of entities whose names fall in a given range.
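For example, John's memberships and a sorted listing of all Person entities might be obtained as follows. The result-set types, the constraint-list form, and the sorted option are assumptions of this sketch; the actual enumeration machinery is given in Section 4.
    rs: RelshipSet ← RelationSubset[member, LIST[[memberIs, john]]];  -- John's memberships
    es: EntitySet ← DomainSubset[Person, sorted: TRUE];  -- all Persons, ordered by name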
More complex queries will be supported by a query language built upon these primitive query mechanisms.
Convenience Operations
In addition to the basic operations described in the previous subsections, four more convenient but specialized operations may be defined in terms of them by an implementation of the data model:
1. DeclareEntity[domain, entity name]: Returns an entity with the given name in the domain if one exists, otherwise creates one, automatically generating a unique name if none is given.
2. DeclareRelationship[relation, attribute value list]: Returns an existing relationship with the given attribute values if one exists, otherwise creates one.
3. GetProperty[entity, attribute1, attribute2]: Returns the values of attribute1 for all relationships in its relation that reference the entity via attribute2. Attribute2 may be omitted, in which case it is assumed to be the only attribute of the relation that could reference the entity's type.
4. SetProperty[entity, attribute1, value list, attribute2]: Attribute1 and attribute2 must be from the same relation. Creates relationships in that relation for each value in the list, with attribute1 equal to the value, and attribute2 equal to the entity. Attribute2 can be defaulted as in GetProperty.
As an example of the use of properties, consider the following database, consisting of two member relationships and an age relationship:



[Figure: ViewFig4.press — two member relationships and an age relationship for John Smith]
On this database, the GetProperty and SetProperty operations work as follows:
GetProperty[John Smith, memberOf] would return the set {Acme Company, State University}
GetProperty[John Smith, ageIs] would return {34}. (A version of GetProperty which returns a value rather than list is available in the implementation.)
SetProperty[John Smith, memberOf, Foo Family] would create a new member relationship specifying John to be a member of the Foo Family. (The memberAs attribute would be left undefined in this relationship.)
SetProperty[John Smith, ageIs, 35], where ageOf is a key of the age relation, would delete the relationship specifying John's age to be 34, and insert a new one specifying John's age to be 35. Note that SetProperty acts differently here than on the member relation, because memberIs is not a key.
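In the Mesa interface, with john the entity for John Smith, these calls might read as follows. The LIST OF Value result type and the NEW allocation of a REF INT follow the value representation described in Section 4; the exact signatures are assumptions of this sketch.
    orgs: LIST OF Value ← GetProperty[john, memberOf];  -- {Acme Company, State University}
    SetProperty[john, ageIs, LIST[NEW[INT ← 35]]];      -- replaces the age-34 relationship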
Views
A view is a relation whose relationships are not actually stored. Instead, the primitive retrieval and storage operations on a view invoke procedures supplied by the database client who defines the view. Arbitrary procedures may be defined, and views can therefore be used for a variety of purposes:
1. Defining a pseudo-relation in terms of existing relations for convenience: e.g., a father view could be implemented in terms of parent and sex relations that are actually stored.
2. Allowing changes to the logical representation of data (e.g., actually changing the database to store father and mother relations instead) without changing existing programs written in terms of the older form.
3. Implementing operations on entities in an object-based style by storing tuples in a view. E.g., a Send operation on a Message entity might be invoked by storing a tuple in a send relation. This mechanism provides the basis for encoding of procedural information in a database.
The implementor of a view must define all of the operations on a relation. Unlike view mechanisms in some database systems, views may be defined directly in the underlying programming language. In our case this is Cedar Mesa, and the operations a view must implement are defined as a Cedar Mesa interface exported by the view. The view can be dynamically loaded when required at run-time. The operations the view provides are the same as those for relations:
1. RelationSubset and CreateRelationship on the view
2. DestroyRelationship, GetRelation, GetField and SetField on its relationships
When a query language is defined, a view may be defined at this higher level rather than in the underlying programming language; this simplifies the definition of views and also allows some kinds of optimizations, since the system need no longer treat a view as a "black box" implementation of an access definition.
Views are so called because they provide a database client with a view of the database different from what is actually stored. View definitions and implementations are stored in a database like any other data, and are automatically retrieved when required by the database system.
Partial Augments
Partial augments correspond to what we call level 1 augments. These are the augments described in the implementation section.
Augments provide a mechanism to divide and re-combine databases to give different perspectives or subsets of the data to different users or application programs. An augment is a set of entities and relationships that a database client chooses to treat as one logical context or state of the world. The data in an augment is stored in an underlying file physically independent from other augments, perhaps on another machine. However we will see that the entities of a domain may logically be in more than one augment, and the entities in an augment may be from more than one domain. Relations and relationships behave similarly.
Augments may be used for four purposes:
1. Data distribution: Augments are represented by physical files in an underlying file system, and can be used as a basis for distribution of data according to the location of the files.
2. Database independence (between applications): Different database applications using different domains or relations may define their data in separate augments. As a result, one application can continue to operate although the data for another has been logically or physically damaged. Similarly, domains or relations distributed over multiple files, possibly on a variety of machines, can continue to be used in degraded mode when one or more of the augments are missing.
3. Additive databases (combined public/private): A public augment may contain information accessible and possibly writable by a number of users, e.g. a database of phone numbers and addresses. A user may add, on top of this database, personal entries not visible to other users but visible to himself as an extension of the public database.
4. Versions (Additive/subtractive databases): The relations in an augment encode some state of the world the database represents. Updates, representing changes to that state, could be made by adding them in a separate augment. The database may then be viewed with and without the changes (or a number of alternate changes) by adding and removing the augment(s) on top.
We will re-examine these four goals (database distribution, database independence, additive databases, and versions) later. First we examine partial augments, which satisfy some but not all of these goals.
The data model provides the following operations on augments:
1. CreateAugment and DestroyAugment create and destroy an augment, respectively. Augments are simply referred to by name. The name must be unique over the distributed computing environment; an implementation might use the full path name of a file in which the augment's data is stored.
2. OpenAugment takes as argument an augment and adds it to the current augment list, a list maintained by the database system. CloseAugment removes a named augment from the list.
We now extend the DeclareDomain and DeclareRelation operations to have an additional argument, namely the augment in which the defined domain or relation will reside. The entity representing a domain or relation now represents data in a particular augment. However, a logical correspondence is maintained by the system between relations or domains with the same name in two or more augments in the currently open augment list. We define the namesake in augment Q of a domain or relation as the domain or relation with the same name in Q. The namesake in a particular augment may of course be non-existent. Similarly, we define the namesake in augment Q of an attribute A to be the attribute of A's relation's namesake in Q with the same name as A. All of the namesakes of a relation must have attributes identical in name, type, and uniqueness or an error is generated on the first attempt to use that relation. Finally, we will refer to the namesake of any entity E in an augment Q as the entity with the same name in E's domain's namesake in Q.
The effect of the namesake equivalence will be the very useful property that a relation or domain may logically span two or more augments while the augments remain physically independent.
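As a concrete sketch of these operations: the augment names below are illustrative full file path names (as suggested above, an implementation might name an augment by the path name of the file holding its data), and whether DeclareDomain takes the augment by name or by handle is an assumption of this example.
    CreateAugment["[Indigo]<Smith>Private.database"];  -- a new private augment
    OpenAugment["[Indigo]<CSL>Whitepages.database"];   -- shared public data
    OpenAugment["[Indigo]<Smith>Private.database"];
    -- a Person domain residing in the private augment; it is logically
    -- merged with any namesake Person domain in the public augment
    Person: Domain ← DeclareDomain["Person", "[Indigo]<Smith>Private.database"];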
With the addition of augments to the data model, we redefine the semantics of basic access operations as follows.
1. CreateEntity and CreateRelationship create an entity or relationship, respectively, in the augment in which the respective domain or relation is defined.
2. DestroyEntity and DestroyRelationship destroy the given entity or relationship in the augment in which it is defined.
3. GetField and SetField "translate" the attributes and entity values they are given to their namesake in the relationship's augment before performing the operation. If the namesake for an attribute does not exist, this is an error; if the namesake for an entity value does not exist, it is implicitly created (this may cause the domain itself to become defined in that augment). GetField and SetField are otherwise defined as before.
4. GetDomain and GetRelation are unchanged, i.e. they retrieve the particular domain or relation where the item is stored. A convenience operation, GetAugment, is added as an abbreviation for fetching an entity or relationship's domain or relation.
5. DomainSubset is unchanged when applied to client-defined domains, i.e., it enumerates the domain only in the augment in which it was declared. However when DomainSubset is applied to one of the system domains, namely the Attribute domain, Relation domain, or Domain domain, it enumerates across all open augments. If a relation with the same name exists in more than one augment, each will occur in the enumeration.
6. RelationSubset enumerates client-defined relations only in the augment in which they were declared. However, any entity values in the attribute value list supplied are automatically translated to their namesakes in the relation's augment. If they have no already existing namesakes, they are not created; the RelationSubset enumeration is simply empty. When RelationSubset is applied to the system relations (the AttributeRelation and SubtypeRelation), it enumerates relationships satisfying the attribute value list in all open augments. Entity values in that list are automatically translated to each augment in this process.
7. The entity equality operation returns TRUE iff the two entities have the same name and their domains have the same name, regardless of whether the entities are in the same augment.
Full Augments
Full augments correspond to what we call level 3, or subtractive, augments. They are included for future interest, but their implementation is not described.
Note that augments as defined thus far provide some aid to a database application in allowing relations and domains to cross segment boundaries, but do not carry this illusion through to completion. We will refer to these as partial augments. The merit of partial augments over a more complete illusion is that they are simpler to understand and perform well in an implementation. They satisfy only the first two of the purposes of augments outlined at the beginning of the previous section, however. The third and fourth uses of augments could of course be satisfied by writing application programs which simulate these effects: public/private databases, in which a relation may have private and public relationships, are relatively easy to simulate in this way. However, we will also define a higher level of aspiration, namely simulating in the data model relations and domains that appear to span augment boundaries and can even be updated in a different augment than the one in which the data is defined. We will call these full augments.
An implementation of the database model with full augments has the same properties and definitions just enumerated for partial augments with the following modifications:
1. There is an ordering on the augment list maintained by the database system. The OpenAugment call may specify where in the current list the new augment should be opened.
2. DeclareRelation and DeclareDomain return a handle for the relation or domain in all augments, and the relation or domain is declared to exist in the augment passed as argument. The Declare procedure may be called additional times to define the relation or domain in other augments.
3. CreateEntity and CreateRelationship create an entity or relationship, respectively, in the top-most augment in the augment list in which a namesake of the respective domain or relation has been defined by a client call to DeclareDomain or DeclareRelation (not necessarily in this database session). Domains defined in an augment implicitly by creation of entities through translation (see semantics of SetField) do not qualify; the domain must have been explicitly defined in the augment by a call to DeclareDomain.
4. DestroyEntity and DestroyRelationship destroy the given entity or relationship if there is no namesake of the domain or relation higher in the current augment list. Otherwise, they create an anti-entity or anti-relationship, respectively, in the top-most client-declared namesake of the domain or relation. The semantics of SetField is in effect that of a DestroyRelationship followed by a CreateRelationship with the attribute changed.
5. DomainSubset searches its domain in all open augments, and has the property that an entity in an augment will not appear in the enumeration if an anti-entity with the same name exists in a namesake of the entity's domain that is higher in the current augment list.
6. RelationSubset similarly searches its relation in all open augments, and has the property that a relationship in an augment will not appear in the enumeration if an anti-relationship with the same attribute values, with entity values mapped to their namesakes, exists in a namesake of the relationship's relation that is higher in the current augment list.
Full augments also act specially upon relations upon which keys have been defined:
1. If a CreateRelationship would create a relationship with the same key value as an existing relationship in the relation in some open augment, then an anti-relationship is automatically created for the existing relationship (in the same augment as the new relationship, the top-most one possible) before creating the new one.
2. If a new augment is opened containing relationships whose key values match existing ones in open augments, then anti-relationships for the matching existing relationships are automatically created in the newly opened augment at that time.
The net effect of our definition of full augments is that a user or program may make arbitrary modifications to data in an underlying augment, in a fashion which appears exactly as if a single augment contains all the data. The underlying augment, however, is completely unchanged. The unmodified data may concurrently be examined by another user, or appear to be modified through another user's augment. Furthermore, the modifications made by the same user over time can be separated and later removed. Full augments thus satisfy all four of the purposes of augments set out at the beginning of the previous section.
Note that two augments can be merged in a straightforward way to produce a single augment that behaves as the two did, in the same position in the current augment list. The augments are merged by combining the elements of the relations and domains in the first augment and their namesakes in the other augment, discarding entities and anti-entities which match, and relationships and anti-relationships which match.
Augments are not related in definition or implementation to the atomic transactions which an implementation of the model also provides. However, the reader might find it informative to think of a transaction as a [full] augment defined on top of the current data, which is automatically merged with the data when the transaction is committed. Transactions thus serve as "short-term" augments. The two are useful for entirely different purposes, however.
4. Client interface
We now describe the Cedar Mesa interface to the implementation of the Cypress data model, called the DBView interface (stored on [Indigo]<Cedar>CedarDB>). A knowledge of Mesa (Mitchell et al [1979]) and the Cedar programming environment is helpful but not essential to understanding this section.
Types
In this subsection we describe the most important types in the interface. Less pervasive types are treated at the point where they are first used.
Entity: TYPE;
Relship: TYPE;
DataType: TYPE = Entity;
An Entity or Relship is not the actual database entity or relationship; it is a handle for the actual database object. All accesses to database objects are performed by calling interface procedures with the handles as parameters. Even comparisons of two entities for equality must be done in this way. The Entity and Relship handles are allocated from storage and automatically freed by the garbage collector when no longer needed.
Value: TYPE = REF ANY;
StringType, IntType, BoolType, AnyDomainType: DataType;
Storing Mesa data values in tuples presents several problems. First, since we would like to define a single operation to store a new value into a specified attribute of a Relship (for instance), there must be a single type for all values that pass through this "store-value" procedure. This is the type Value above, represented as untyped REFs (a REF is a garbage-collected pointer in Cedar). The DataTypes will be discussed in the next section. Entities, strings, integers, and booleans are the types of values the system currently recognizes and allows as attribute values. More precisely, these four types are Entity, ROPE (Cedar strings), REF INT, and REF BOOL. In the case of an entity-valued attribute, the attribute's type may be AnyDomainType, or a specific domain may be specified. The latter is highly preferred, as AnyDomainType is a loophole in the type mechanism and limits the kinds of operations that can be performed automatically by the database system or associated tools. We currently provide no mechanism to store more complex Cedar data structures such as arrays, lists, or records in a database; the database system's data structuring mechanisms should be used instead. (The database system can only provide useful query and update mechanisms if it has knowledge of the data's structure.)
System domains, relations, and properties
Many procedures in the DBView interface take parameters that describe the type or portion of a data item to be manipulated. The permanent repository of such data about user-defined data in a database is the database's data schema, represented by dictionary entities and their properties. Dictionary entities are members of one of the pre-defined system domains:
DomainDomain, RelationDomain, AttributeDomain, DatatypeDomain: Domain;
Every client-defined domain, relation, and attribute has a representative entity in one of these domains. Client-defined datatypes are not currently permitted, so the only entities in DatatypeDomain are the pre-defined IntType, StringType, and BoolType declared above.
The information about the client-defined domains and attributes is encoded by relationships in the database. Domains participate in the system relation SubTypeRelation, which encodes a domain type hierarchy:
SubTypeRelation: Relation;
sSubType: Attribute; -- the domain in this attribute is a sub-type of..
sSuperType: Attribute; -- the domain in this attribute
The SubTypeRelation has one element per direct domain-subdomain relationship; it does not contain the transitive closure of that relation. However, it is guaranteed to contain no cycles. That is, the database system checks that there is no set of domains d1, d2, ..., dN, N>1, such that d1 is a subtype of d2, d2 is a subtype of d3, and so on to dN, and d1=dN.
The information about attributes is available through use of the pre-defined system attribute relation:
AttributeRelation: Relation;
aAttribute: Attribute; -- the attribute to which this relationship refers
aType: Attribute; -- type of the attribute (DataType or Domain)
aRelation: Attribute; -- relation of the attribute (Relation)
aUniqueness: Attribute; -- one of: Key, KeyPart, OptionalKey, NonKey
aLength: Attribute; -- only for string attributes: hint at max length
The AttributeRelation and SubTypeRelation can be queried with the ordinary data query operations we will define.
The data schema representation of attributes, relations, domains, and their properties may only be read, not written, by the database client. They are written indirectly through the use of the procedures for their creation: DeclareDomain, DeclareRelation, DeclareAttribute, and DeclareSubtype. Discussion of the semantics of the system domains and relations is deferred to the description of these procedures later in this section.
B-tree indexes are automatically maintained for domains; that is, an index contains entries for all of the entities in a domain keyed by their name so that sorting or lookup by entity name is quick. Key comparisons are performed in the usual lexicographic fashion. B-Tree indexes are also maintained for relations if explicitly requested by the client.
That completes our description of the system-defined domains, relations, and properties. These contain all of the accessible data about the user-defined data in a database. A client program can read this data using the same interface that the program uses for all other data. Attempts to perform updates through this interface result in the ERROR DictionaryUpdate; all updates are performed explicitly by client calls to procedures defined in the next subsections, in order to ensure the consistency of this data.
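For example, here is a sketch of a routine that prints the names of all attributes of a given relation r, by querying AttributeRelation with the query operations (RelationSubset, NextRelship, ReleaseRelshipSet) and conversion routines defined later in this section; tty is an output stream like the one in the sample program of Section 5:
PrintAttributes: PROC[r: Relation] =
BEGIN
t: Relship; rs: RelshipSet;
-- Enumerate the AttributeRelation relationships whose aRelation attribute is r:
rs← RelationSubset[AttributeRelation, LIST[[aRelation, r]]];
WHILE (t← NextRelship[rs])#NIL DO
tty.Put[rope[GetName[V2E[GetF[t, aAttribute]]]], char[CR]];
ENDLOOP;
ReleaseRelshipSet[rs];
END;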
Starting up; transactions and segments
A transaction is a sequence of read and write commands. The system supports the property that the entire sequence of commands executes atomically with respect to all other data retrieval and updates; that is, the transaction executes as if no other transactions were in progress at the same time. Because there may in fact be other transactions accessing the same data at the same time, it is possible for two transactions to deadlock, in which case one of them must be aborted. So the price paid for concurrent access is that programs must be prepared to retry aborted transactions.
One requirement on transactions is that packages used in conjunction with the database be able to group their operations into the same transaction when this is appropriate. Of course, when this is done the interactions among the packages must be understood, since one package might commit the transaction while another had some persistent data structure in an inconsistent state.
The database system provides the capability to access a database stored on the same machine as the database client, using the Pilot file system (Redell et al [1979]), or on Juniper file servers (Israel et al [1978]). (Juniper will be replaced by Alpine, a new transaction-based file server, when that project is completed.) There is currently only one public and one private transaction per instance of the database software on a client machine. All data on a server is accessed via the public transaction; all data on the client's machine via the private one. This single-transaction scheme is a major simplification of the package itself, and also removes the need to pass a Transaction on every call to procedures in the interface.
This is how the database package uses transactions. An OpenDatabase and OpenTransaction call start the database session. A transaction is either passed in by the client, or created by the database package (the latter is just a convenience feature). The operation MarkTransaction below forms the end of a database transaction and the start of a new one; this is implemented by executing a "checkpoint" (commit and hold read locks) on the Juniper file server, or by closing and re-opening the transaction on the Pilot file system. The operation AbortTransaction may be used to abort a transaction. The public and private transactions supported by the database system are independently opened, committed, or aborted. The segments in which data are stored may not be opened for use until the appropriate transaction (local or remote, depending on the location of the segment) has been opened. The operation CloseDatabase commits all transactions, closes all segments, and kills any transactions created by the package.
Clients must decide when to tell the system that a transaction is complete, and must be prepared to deal with unsolicited notification that the current transaction has been aborted because of system failure or lock conflict. Note that when the database package cooperates with other packages under the same transaction, a database transaction may be forced to abort because of the failure of some activity in the other package that is part of the same transaction.
The client's interaction with the database system begins with a call to OpenDatabase and ends with a call to CloseDatabase:
OpenDatabase: PROC[
userName, password: Rope.Ref←NIL,
nCachePages: CARDINAL ← DefaultNCachePages,
nBytesInitial: LONG CARDINAL ← DefaultNBytesInitial,
nBytesPerExtent: LONG CARDINAL ← DefaultNBytesPerExtent ];

CloseDatabase: PROC;
OpenDatabase informs the database system of the client's name and password, and various system parameters: nCachePages tells the system how many pages of database to keep in virtual memory on the client's machine, nBytesInitial is the initial size to assign to segments, and nBytesPerExtent is the incremental increase in segment size used when a segment is full. OpenDatabase should be called before any other operation in the interface. CloseDatabase closes the database session, committing and destroying any open transactions and closing any open database files.
OpenTransaction: PROC[
location: LocalOrRemote ← local,
useTrans: Transaction← DefaultTransaction,
transAbort: TransactionAbortProc← NIL ]
RETURNS[trans: Transaction, welcomeMsg: Rope.Ref];
LocalOrRemote: TYPE = {local, remote};
If useTrans is NIL then OpenTransaction establishes a new connection and transaction with the corresponding (local or remote) file system. Otherwise it uses the supplied transaction. In either case, OpenTransaction returns the resulting transaction. If a transAbort procedure is passed to OpenTransaction then it is called when the transaction is spontaneously aborted (currently, the local transaction is never spontaneously aborted). Any database operation after a transaction abort will generate the ABORTED signal. The client's transAbort procedure should block any further database operations and wait for completion of any existing ones. Then it may re-open the aborted transaction by calling OpenTransaction again. When the remote transaction is successfully re-opened, the client's remote database operations may resume. Operations on data in local segments may proceed even when the remote transaction has been aborted or is not open, and vice versa.
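A sketch of one possible retry scheme (DoDatabaseWork is a hypothetical client procedure bracketing a group of remote database operations; the retry policy is the application's choice):
DO
[]← OpenTransaction[remote]; -- open (or re-open after an abort) the remote transaction
DoDatabaseWork[ ! ABORTED => LOOP ]; -- on abort, go around the loop and re-open
MarkTransaction[remote]; -- commit the work
EXIT;
ENDLOOP;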
In addition to the opening and closing of transactions by OpenTransaction and CloseDatabase, transactions may be manipulated by the procedures:
AbortTransaction: PROC [location: LocalOrRemote← local];
MarkTransaction: PROC[location: LocalOrRemote← local];
AbortTransaction aborts the current database transaction, either local or remote. The effect on the database is as if the transaction had never been started: the state is as it was just after the OpenTransaction call or the most recent MarkTransaction call. A call to OpenTransaction is necessary to do more database operations, and all user variables referencing database items created or retrieved under the corresponding transaction must be re-initialized (they may reference entities or relationships that no longer exist, and in any case they are marked invalid by the database system).
MarkTransaction commits the current database transaction, and immediately starts a new one. User variables which reference database entities or relationships are still valid. When the local or remote transaction is committed, the other is unaffected.
After an OpenDatabase and OpenTransaction, but before database operations may be invoked, the client must open the file(s) in which the data are stored. Database files are called segments. Segments are represented as Cedar ATOMs whose name is the full path name of the data file, minus the implicit extension ".database". If the file is on the local file system, its name is preceded by "[Local]". For example, "[Local]Foo" refers to a segment file on the local disk named Foo.database; "[Juniper]<CedarDB>Baz" refers to a segment named Baz.database on the <CedarDB> directory on the Juniper server. It is generally a bad idea to access database files other than through the database interface. However, because segments are physically independent and contain no references to other files by file identifier or explicit addresses within files, the segment files may be moved from machine to machine or renamed without effect on their contents. If a segment file in the set that comprises a client database is deleted, the others may still be opened to produce a database missing only that segment's entities and relationships.
A segment is opened or closed by the operations OpenSegment and CloseSegment:
OpenSegment: PROC[
fileName: ROPE, number: INT← 0, version: Version← OldOnly ] RETURNS [Segment];
CloseSegment: PROC[
fileName: Segment];
Segment: TYPE = ATOM; -- a Cedar ATOM whose name is the full file path name
Version: TYPE = {NewOnly, OldOnly, NewOrOld};
The version parameter to OpenSegment defaults to OldOnly to open an existing file. The signal FileNotFound is generated if it does not exist, and the signal IllegalFileName is generated if the directory or machine name is missing from fileName. For the convenience of the client, OpenSegment returns the Cedar ATOM which represents the segment in future database operations (the client can also use Atom.MakeAtom to create the same, but the segment must still be opened before use). If the version NewOnly is passed, a new segment file will be created, erasing any existing one. In this case, a number assigned to the segment by the database administrator must also be passed. The client program can pass version=NewOrOld to open a new or existing segment file; in this case the segment number must also be passed, of course.
A simple client program using the database system might then have the form:
[]← OpenDatabase[];
[]← OpenTransaction[];
[]← OpenSegment["[Local]Temp"];
...
... database operations, including zero or more MarkTransaction calls ...
...
CloseDatabase[];
Defining the data schema
The definition of the client's data schema is done through a procedural interface, although the schema can be read through the normal data operations defined later. In general, any of the data schema may be extended or changed at any time; i.e., data operations and data schema definition may be intermixed. However, there are a few specific ordering constraints on schema definition that we will note shortly. Also, the database system performs better if the entire schema is defined before any data are entered. An interactive schema editing tool described in the Squirrel documentation allows the schema to be changed without regard for ordering constraints, by recreating schema items and copying data invisibly to the user when necessary.
All the data schema definition operations take a Version parameter which specifies whether the schema element is a new or existing one. The version defaults to allowing either (NewOrOld): i.e., the existing entity is returned if it exists, otherwise it is created. This feature avoids separate application code for creating the database schema the first time the application program is run.
DeclareDomain: PROC [name: Rope.Ref, segment: Segment← DefaultSegment,
version: Version← NewOrOld, estRelations: INT← 5] RETURNS [d: Domain];
DeclareSubType: PROC[sub, super: Domain];
DeclareDomain defines a domain with the given name and returns its representative entity. The domain is declared to exist in the given segment, or a default local one if none is given. The parameter estRelations may be used to estimate the largest number of relations in which entities of this domain may participate. Currently, an error will be generated at run time if this estimate is exceeded. If the domain already exists and version=NewOnly, the signal AlreadyExists is generated. If the domain does not already exist and version=OldOnly, then NIL is returned. The client may define one domain to be a subtype of another by calling DeclareSubType. This permits entities of the subdomain to participate in any relations in which entities of the superdomains may participate. All client DeclareSubType calls should be done before declaring relations on the superdomains (to allow some optimizations). The error WrongSegment is generated if the sub-domain and super-domain are not in the same segment.
DeclareRelation: PROC [
name: Rope.Ref, segment: Segment← DefaultSegment, version: Version← NewOrOld] RETURNS [r: Relation];

DeclareAttribute: PROC [
r: Relation, name: Rope.Ref, type: DataType← AnyDomainType,
uniqueness: Uniqueness ← NonKey, length: INT← 0, version: Version← NewOrOld]
RETURNS[a: Attribute];
Uniqueness: TYPE = {NonKey, Key, KeyPart, OptionalKey};
DeclareRelation defines a new or existing relation with the given name and returns its representative entity. The relation is declared to exist in the given segment, or the default one if none is given. If the relation already exists and version=NewOnly, the signal AlreadyExists is generated. If the relation does not already exist and version=OldOnly, then NIL is returned.
DeclareAttribute is called once for each attribute of the relation, to define their names, types, and uniqueness. If version=NewOrOld and the attribute already exists, the database system checks that the new type, uniqueness, and length match the existing attribute. The error WrongExistingAttribute is generated if there is a discrepancy.
The attribute type should either be one of the pre-defined types (IntType, StringType, BoolType, AnyDomainType) or the entity representative for a client-defined domain. For pre-defined types, the actual values assigned to attributes of the relationship instances of the relation must have the corresponding type: REF INT, ROPE, REF BOOL, or Entity. If the attribute has a client-defined domain as type, the attribute values must be entities of the client-defined domain or some sub-domain thereof.
The attribute uniqueness indicates whether the attribute is a key of the relation. If its uniqueness is NonKey, the attribute is not a key of the relation. If its uniqueness is OptionalKey, the system will ensure that no two relationships in r have the same value for this attribute (if a value has been assigned). The ERROR NonUniqueKeyValue is generated if a non-unique key value results from a SetP, SetF, SetFS, or DeclareRelship call. Key acts the same as OptionalKey, except that in addition to requiring that no two relationships in r have the same value for the attribute, it requires that every entity in the domain referenced by this attribute be referenced by a relationship in the relation: the relationships in the relation and the entities in the domain are in one-to-one correspondence. Finally, if an attribute's uniqueness is KeyPart, the system will ensure that no two relationships in r have the same values for all key attributes of r, though two may have the same values for some subset of them. KeyPart is not yet implemented.
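For example, a sketch of a Phone relation in which the "of" attribute is an OptionalKey, so that each person has at most one phone number (Person, harry, and the conversion routines are as in the sample program of Section 5):
Phone: Relation← DeclareRelation["Phone"];
phoneOf: Attribute← DeclareAttribute[Phone, "of", Person, OptionalKey];
phoneIs: Attribute← DeclareAttribute[Phone, "is", IntType];
t: Relship;
SetP[harry, phoneIs, I2V[4999]]; -- creates the one Phone relship for harry
t← DeclareRelship[Phone];
SetF[t, phoneOf, harry]; -- ERROR NonUniqueKeyValue: harry already has a Phone relship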
DestroyRelation: PROC[r: Relation];
DestroyDomain: PROC[d: Domain];
DestroySubType: PROC[sub, super: Domain];
Relations, domains, and subdomain relationships may be destroyed by calls to the above procedures. Destroying a relation destroys all of its relationships. Destroying a domain destroys all of its entities and also any relationships which reference those entities. Destroying a sub-domain relationship has no effect on existing domains or their entities; it simply makes entities of domain sub no longer eligible to participate in the relations in which entities of domain super can participate. Existing relations and domains may only be modified by destroying them with the procedures above, with one exception: the operation SetName (described later) may be used to change the name of a relation or domain.
DeclareIndex: PROC [
relation: Relation, indexedAttributes: AttributeList, version: Version];
DeclareIndex has no logical effect on the database; it is a performance hint, telling the database system to create a B-Tree index on the given relation for the given indexedAttributes. The index will be used to process queries more efficiently. Each index key consists of the concatenated values of the indexedAttributes in the relationship the index key references. For entity-valued attributes, the value used in the key is the string name of the entity. The version parameter may be used as in DeclareDomain and other schema definition procedures, to indicate a new or existing index. If any of the attributes are not attributes of the given relation then the signal IllegalIndex is generated. Multiple indices on one relation are not implemented.
DeclareProperty: PROC [
relationName: ROPE, of: Domain, type: ValueType,
uniqueness: Uniqueness← NonKey, vers: Version← NewOrOld] RETURNS [property: Attribute];
DeclareProperty provides a shorthand for definition of a binary relation between entities of the domain "of" and values of the specified type. The definitions of type and uniqueness are the same as for DeclareAttribute. A new relation relationName is created, and its attributes are given the names "of" and "is". The "is" attribute is returned, so that it can be used to represent the property in GetP and SetP described later.
Manipulating entities and relationships
Since attributes, domains, and relations are all entities, and datum values are represented as REF ANYs, all type checking must currently be done at run-time. The procedures below indicate illegal arguments by generating the signals IllegalAttribute, IllegalDomain, IllegalRelation, IllegalValue, IllegalEntity, and IllegalRelship, according to the type of argument expected.
DeclareEntity: PROC[
d: Domain, name: Rope.Ref← NIL, version: Version← NewOrOld]
RETURNS [e: Entity];
DeclareEntity finds or creates an entity in domain d with the given name. The name may be omitted if desired, in which case an entity with a unique name is automatically created. If version is OldOnly and an entity with the given name does not exist, NIL is returned. If version is NewOnly and an entity with the given name already exists, the signal NonUniqueEntityName is generated.
DeclareRelship: PROC [
r: Relation, avl: AttributeValueList←NIL, version: Version← NewOrOld]
RETURNS [Relship];
DeclareRelship finds or creates a relship in r with the given attribute values. If version is NewOnly, a new relship with the given attribute values is generated. If version is OldOnly, the relship in r with the given attribute values is returned if it exists, otherwise NIL is returned. If version is NewOrOld, the relship with the given attribute values is returned if it exists, otherwise one is created. If the creation of a new relship violates the key constraints specified by DeclareAttribute, the signal NonUniqueAttributeValue is generated.
DestroyEntity: PROC[e: Entity];
DestroyEntity removes e from its domain, destroys all relationships referencing it, and destroys the entity representative itself. Any client variables that reference the entity automatically take on the null value (Null[e] returns TRUE), and cause ERROR NullifiedArgument if passed to database system procedures.
DestroyRelship: PROC[rel: Relship];
DestroyRelship removes rel from its relation, and destroys it. Any client variables that reference the relationship automatically take on the null value, and will cause ERROR NullifiedArgument if subsequently passed to database system procedures.
SetF: PROC[rel: Relship, a: Attribute, v: Value];
SetF assigns the value v to attribute a of relationship rel. If the value is not of the same type as the attribute (or a subtype thereof if the attribute is entity-valued), the ERROR WrongAttributeValueType is generated. If a is not an attribute of rel's relation, IllegalAttribute is generated. If v is an entity from a different segment than rel, it is mapped to its namesakes in rel's segment. That is, v is replaced with the entity with the same name in the domain with the same name in rel's segment. Mapping v may create a new entity in rel's segment. It may also create new domains and subtype relationships in rel's segment, as necessary to make v conform to the domain with the same name as a's type in v's segment.
GetF: PROC[rel: Relship, a: Attribute] RETURNS [Value];
GetF retrieves the value of attribute a of relationship rel. If a is not an attribute of rel's relation, ERROR IllegalAttribute is generated. The client should use the V2x routines described in the next section to coerce the value into the expected type.
SetFS: PROC [rel: Relship, a: Attribute, v: Rope.Ref];
GetFS: PROC[rel: Relship, a: Attribute] RETURNS [Rope.Ref];
These procedures provide a convenient veneer on top of GetF and SetF that provides the illusion that all relation attributes are string-valued. The effect is something like the Relational data model, and is useful for applications such as a relation displayer and editor that deal only with strings. The semantics of GetFS and SetFS depend upon the actual type of the value v of attribute a:
type           GetFS returns                              SetFS assigns attribute to be
StringType     the string v                               the string v
IntType        v converted to a decimal string            the string converted to a decimal integer
BoolType       "TRUE" or "FALSE"                          true if "TRUE", false if "FALSE"
a domain D     the name of the referenced entity          the entity with name v
               (or the null string if the value is NIL)   (or NIL if v is the null string)
AnyDomainType  the same, but prefixed with the domain:    the entity with the given domain and name
               <domain-name>:<entity-name>                (or NIL if v is the null string)
The same signals generated by GetF and SetF, such as IllegalAttribute, can also be generated by these procedures. The signal NotFound is generated in the last case above if no entity with the given name is found.
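For instance, a display routine can print any Phone relationship without knowing the attribute types (a sketch; names from the sample program of Section 5):
PrintPhone: PROC[rel: Relship] =
BEGIN
-- GetFS returns a ROPE whatever the attribute's underlying type:
tty.Put[rope[GetFS[rel, phoneOf]], char[TAB]]; -- entity-valued: the person's name
tty.Put[rope[GetFS[rel, phoneIs]], char[CR]]; -- IntType: a decimal string
END;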
GetName: PROC [e: Entity] RETURNS [s: ROPE];
SetName: PROC [e: Entity, s: ROPE];
GetName and SetName retrieve or assign the name of an entity, respectively. They may generate the signal NotAnEntity if e is null or not an entity.
DomainOf: PROC[e: Entity] RETURNS [Domain];
RelationOf: PROC[r: Relship] RETURNS [Relation];
DomainOf and RelationOf can be used to find the entity representative of an entity's domain or a relationship's relation, respectively. The signal NotAnEntity is generated if e is null or not an entity. The signal NotARelship is generated if r is null or not a relship.
SegmentOf: PROC[x: EntityOrRelship] RETURNS [Segment];
SegmentOf returns the segment in which an entity or relship is stored. It can be applied to domain, relation, or attribute entities.
Eq: PROC [e1: Entity, e2: Entity] RETURNS [BOOL];
Eq returns TRUE iff the same database entity is referenced by e1 and e2. This is not equivalent to the Mesa expression "e1 = e2", which computes Cedar Mesa REF equality. If e1 and e2 are in different segments, Eq returns true iff they have the same name and their domains have the same name.
Null: PROC [x: EntityOrRelship] RETURNS [BOOL];
Null returns TRUE iff its argument has been destroyed, is NIL, or has been invalidated by the abort of the transaction under which it was created.
GetP: PROC [e: Entity, to: Attribute, from: Attribute← NIL] RETURNS [Value];
SetP: PROC [e: Entity, to: Attribute, v: Value, from: Attribute← NIL] RETURNS[Relship];
GetP and SetP are convenience routines for a common use of relships: to represent "properties" of entities. Properties allow the client to think of values stored in relships referencing an entity as if they were directly accessible fields (or "properties") of the entity itself. See the figure on page 10 illustrating properties. GetP finds a relship whose from attribute is equal to e, and returns that relship's to attribute. The from attribute may be defaulted if the relation is binary; it is then assumed to be the other attribute of to's relation. If the relation is not binary, the current implementation will find the "first" other attribute, where "first" is defined by the order of the original calls to DeclareAttribute. SetP defaults the from attribute similarly to GetP, but operates differently depending on whether from is a key of the relation. If it is a key, any previous relship that referenced e in the from attribute is automatically deleted. In either case, a new relship is created whose from attribute is e and whose to attribute is v. SetP returns the relship it creates for the convenience of the client. GetP and SetP can generate the same errors as SetF and GetF, e.g. if e is not an entity or to is not an attribute. In addition, GetP and SetP can generate the error IllegalProperty if to and from are from different relations. GetP generates the error WrongPropertyCardinality if more than one relship references e in the from attribute; if no such relships exist, it returns a null value of the type of the to attribute. SetP allows any number of existing relships referencing e; it simply adds another one (when from is a key, of course, there will always be at most one existing relship).
GetPList: PROC [e: Entity, to: Attribute, from: Attribute← NIL] RETURNS [LIST OF Value];
SetPList: PROC [e: Entity, to: Attribute, vl: LIST OF Value, from: Attribute← NIL];
GetPList and SetPList are similar to GetP and SetP, but assume that any number of relships may reference e with their from attribute. They generate the signal WrongPropertyCardinality if this is not true, i.e. the from attribute is a key. GetPList returns the list of values of the to attributes of the relships that reference e with their from attribute. SetPList destroys any existing relships that reference e with their from attribute, and creates Length[vl] new ones, whose from attributes reference e and whose to attributes are the elements of vl. GetPList and SetPList may generate any of the errors that GetF and SetF may generate, and the error IllegalProperty if to and from are from different relations.
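For example, a sketch of a multi-valued "child" property, declared from Person to Person with NonKey uniqueness so that a person may have any number of children (childProp, child1, and child2 are hypothetical; mark and tty are as in the sample program of Section 5):
childProp: Attribute← DeclareProperty["child", Person, Person];
...
SetPList[mark, childProp, LIST[E2V[child1], E2V[child2]]];
FOR vl: LIST OF Value← GetPList[mark, childProp], vl.rest UNTIL vl=NIL DO
tty.Put[rope[GetName[V2E[vl.first]]], char[CR]];
ENDLOOP;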
Examples of the use of the property procedures for data access can be found at the end of the previous section. Properties are also useful for obtaining information about the data schema. For example, GetP[a, aRelation, aAttribute] will return the attribute a's relation, and GetPList[d, aAttribute, aType] will return all the attributes that can reference domain d.
Type conversion routines
E2V: PROC[e: Entity] RETURNS[v: Value];
B2V: PROC[b: BOOLEAN] RETURNS[v: Value];
I2V: PROC[i: LONG INTEGER] RETURNS[v: Value];
S2V: PROC[s: ROPE] RETURNS[v: Value];
The x2V routines convert the various Mesa types to Values. The conversion is not normally required for ropes and entities since the compiler will widen these into the REF ANY type Value.
V2E: PROC[v: Value] RETURNS[Entity];
V2B: PROC[v: Value] RETURNS[BOOLEAN];
V2I: PROC [v: Value] RETURNS[LONG INTEGER];
V2S: PROC [v: Value] RETURNS[ROPE];
The V2x routines convert Values to the various Mesa types. The WrongValueType error is raised if the value is of the wrong type. It is recommended that these routines be used rather than user-written NARROWs, as the representation of Values may change. Also, NARROWs of opaque types don't work in the current Cedar compiler.
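For instance, a fragment (assuming t is a relship in the Phone relation of the sample program in Section 5):
SetF[t, phoneIs, I2V[4466]]; -- I2V widens the LONG INTEGER into a Value
ext: LONG INTEGER← V2I[GetF[t, phoneIs]]; -- V2I narrows it back; WrongValueType if not an integer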
Query operations
RelationSubset: PROC[
r: Relation, constraint: AttributeValueList← NIL, searchSegments: BOOL← TRUE]
RETURNS [RelshipSet];

NextRelship: PROC[rel: RelshipSet] RETURNS [Relship];

ReleaseRelshipSet: PROC [rs: RelshipSet];

AttributeValueList: TYPE = LIST OF AttributeValue;
AttributeValue: TYPE = RECORD [
attribute: Attribute,
low: Value,
high: Value←NIL -- omitted where same as low or not applicable --];
The basic query operation is RelationSubset. It returns a generator of all the relationships in relation r which satisfy a constraint list of attribute values. The relationships are enumerated by calling NextRelship repeatedly; it returns NIL when there are no more relationships. ReleaseRelshipSet should be called when the client is finished with the query.
The constraint list may be NIL, in which case all of the relationships in r will be enumerated. Otherwise, relationships which satisfy the conjunction of constraints on attributes in the list will be enumerated. If an index exists on some subset of the attributes, the relationships will be enumerated sorted on the concatenated values of those attributes. For a StringType or IntType attribute a of r, the constraint list may contain a record of the form [a, b, c], where the attribute value must be greater than or equal to b and less than or equal to c to satisfy the constraint. For any type of attribute, the list may contain a record of the form [a, b], where the value of the attribute must exactly equal b. The special values infString and minfString are defined in the DBView interface, and represent an infinitely large and infinitely small string value, respectively. The signal WrongAttributeValueType is generated by RelationSubset if one of the low or high values in the list is of a different type than its corresponding attribute.
If searchSegments is TRUE, relations in all open segments with the same name as r are searched. The attributes and entity values in the constraint list are automatically mapped to their namesakes in the other segments.
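For instance, a fragment using a hypothetical relation title with a string-valued attribute titleIs (we assume infString may be passed directly as a high value):
rs: RelshipSet;
rs← RelationSubset[title, LIST[[titleIs, S2V["A"], S2V["B"]]]]; -- titles between "A" and "B"
... enumerate and release ...
rs← RelationSubset[title, LIST[[titleIs, S2V["M"], infString]]]; -- titles at or after "M"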
DomainSubset: PROC[
d: Domain,
lowName, highName: ROPE← NIL,
searchSubDomains, searchSegments: BOOL← TRUE]
RETURNS [EntitySet];

NextEntity: PROC[EntitySet] RETURNS [Entity];

ReleaseEntitySet: PROC[EntitySet];
DomainSubset enumerates all the entities in a domain. If lowName and highName are NIL, the entire domain is enumerated, in no particular order. Otherwise, only those entities whose names are lexicographically greater than lowName and less than highName are enumerated, in lexicographic order. If searchSubDomains is TRUE, subdomains of d are also enumerated. Each subdomain is sorted separately. If searchSegments is TRUE, d and domains with the same name in all open segments are enumerated.
Just as with relation enumeration, NextEntity is used to enumerate the entities satisfying the DomainSubset, and ReleaseEntitySet should be called upon completion.
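For instance, a fragment enumerating a sorted range of Persons by name (Person and tty as in the sample program of Section 5):
person: Entity; es: EntitySet;
es← DomainSubset[Person, "A", "N"]; -- Persons whose names lie between "A" and "N", in order
WHILE (person← NextEntity[es])#NIL DO
tty.Put[rope[GetName[person]], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];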
GetDomainRefAttributes: PROC [d: Domain] RETURNS [AttributeList];
This procedure returns a list of all attributes, of any relation defined in d's segment, which reference domain d or one of its superdomains. The list does not include AnyDomainType attributes, which can of course reference any domain. GetDomainRefAttributes is implemented via queries on the data schema. It is useful for application-independent tools; most specific applications can name the relevant attributes directly in their code.
GetEntityRefAttributes: PROC [e: Entity] RETURNS [AttributeList];
This procedure returns a list of all attributes in which some existing relship actually references e, including AnyDomainType attributes. This is currently not implemented as specified; the list may include key attributes that do not actually reference e. The ones that actually reference e must be determined by RelationSubset.
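A sketch of the application-independent use suggested above, assuming AttributeList is a LIST OF Attribute and that a relation may be held in a Relation variable as elsewhere in this section; it prints the relation name of every relationship that actually references e:
PrintReferences: PROC[e: Entity] =
BEGIN
t: Relship; rs: RelshipSet; r: Relation;
FOR al: AttributeList← GetEntityRefAttributes[e], al.rest UNTIL al=NIL DO
r← V2E[GetP[al.first, aRelation, aAttribute]]; -- the attribute's relation, from the data schema
rs← RelationSubset[r, LIST[[al.first, e]]]; -- only relationships actually referencing e
WHILE (t← NextRelship[rs])#NIL DO
tty.Put[rope[GetName[r]], char[CR]];
ENDLOOP;
ReleaseRelshipSet[rs];
ENDLOOP;
END;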
5. A sample program
As a simple example, we will consider the following concocted program, which defines a small schema and database of persons with their names, friends, and phone numbers. It also defines two other domains, "frogs" and "things", and performs a number of queries on the database. It illustrates the use of most of the procedures defined in the previous section.

VLTest1Impl: PROGRAM
IMPORTS DBView, IOStream =

BEGIN OPEN IOStream, DBView;

tty: IOStream.Handle← CreateTTYStreams["VLTest1Impl.log"].out;

Person, Frog, Thing: Domain;
Friend, Phone: Relation;
ageProp, heightProp: Attribute;
friendOf, friendIs: --Friend-- Attribute;
phoneOf, phoneIs: --Phone-- Attribute;
wsh, harry, mark, nori: --Person-- Entity;
george: --Frog-- Entity;
rock, pebble: --Thing-- Entity;
myName: ROPE ← "Rick Cattell";
myExt: REF INT← NEW[INT ← 4466];

The following routine defines the domains and relations through calls to the database system. Since the version parameter to DeclareDomain, DeclareRelation, and DeclareAttribute defaults to NewOrOld, the declarations return the existing schema items instead of raising an ERROR if this program is executed a second time.

Initialize: PROC =
BEGIN
tty.Put[rope["Defining data dictionary..."], char[CR]];
-- Declare domains and make Persons and Frogs be subtypes of Things:
Person← DeclareDomain["Person"];
Thing← DeclareDomain["Thing"];
Frog← DeclareDomain["Frog"];
DeclareSubType[sub: Person, super: Thing];
DeclareSubType[sub: Frog, super: Thing];
-- Declare age and height properties of Things
ageProp← DeclareProperty["Age", Thing, IntType];
heightProp← DeclareProperty["Height", Thing, StringType];
-- Declare Friend relation between Persons
Friend← DeclareRelation["Friend"];
friendOf← DeclareAttribute[Friend, "of", Person];
friendIs← DeclareAttribute[Friend, "is", Person];
-- Declare Phone relation between Persons and integers
Phone← DeclareRelation["Phone"];
phoneOf← DeclareAttribute[Phone, "of", Person];
phoneIs← DeclareAttribute[Phone, "is", IntType];
END;

The following routine creates various relationships and entities. Several methods of assigning the attributes are used, to illustrate the alternatives.

InsertData: PROC =
BEGIN t: Relship; rick: Entity;
tty.Put[rope["Creating data..."], char[CR]];
harry← DeclareEntity[Person, "Harry Q. Bovik"];
mark← DeclareEntity[Person, "Mark Brown"];
rick← DeclareEntity[Person, myName];
nori← DeclareEntity[Person]; SetName[nori, "Nori Suzuki"];
-- Data can be assigned with SetP, SetF, or DeclareRelship's initialization list:
SetP[harry, phoneIs, I2V[4999]];
SetP[mark, phoneIs, I2V[4464]];
SetP[nori, phoneIs, I2V[4425]];
SetP[rick, phoneIs, I2V[4466]];
SetP[rick, ageProp, I2V[29]];
t← DeclareRelship[Friend]; SetF[t, friendOf, rick]; SetF[t, friendIs, harry];
[]← DeclareRelship[Friend, LIST[[friendOf, nori], [friendIs, harry]]];
END;

InsertMoreData: PROCEDURE =
BEGIN t: Relship; rick: Entity; ok: BOOL← FALSE;
tty.Put[rope["Creating more data..."], char[CR]];
-- Create some new entities and fetch old one (rick)
wsh← DeclareEntity[Person];
rick← DeclareEntity[Person, "Rick Cattell", OldOnly];
rock← DeclareEntity[Thing, "Harry the rock"];
pebble← DeclareEntity[Thing, "Fred the pebble"];
george← DeclareEntity[Frog, "George the frog"];
[]← DeclareEntity[Frog, "Larry the frog"];
-- Use SetP to assign names and heights
SetName[wsh, "Willie-Sue H"];
SetP[wsh, heightProp, S2V["medium"]];
SetP[rick, heightProp, S2V["medium"]];
SetP[rock, heightProp, S2V["big"]];
SetP[pebble, heightProp, S2V["little"]];
SetP[george, heightProp, S2V["little"]];
-- Check that Person can't be friend of frog
t← DeclareRelship[Friend];
SetF[t, friendOf, rick];
SetF[t, friendIs, george ! WrongAttributeValueType => ok ← TRUE];
IF NOT ok THEN ERROR;
END;

The following routine illustrates the destruction of entities and domains.

DestroySomeData: PROCEDURE =
-- Destroy one person entity and all frog entities
BEGIN
tty.Put[char[CR], rope["Deleting some data: Rick and all the frogs..."], char[CR]];
DestroyEntity[DeclareEntity[Person, "Rick Cattell", OldOnly]];
DestroyDomain[Frog];
END;

The following routine prints all of the people in the Person domain, by doing a DomainSubset retrieval with no constraints.

PrintPeople: PROC =
-- Use DomainSubset with no constraints to enumerate all Persons
BEGIN
person: -- PersonDomain -- Entity;
es: EntitySet;
lint: LONG INTEGER;
tty.Put[char[CR], rope["PersonDomain:"], char[CR], char[CR]];
tty.Put[rope["Name  Phone"], char[CR]];
es← DomainSubset[Person];
WHILE (person← NextEntity[es])#NIL DO
tty.Put[rope[GetName[person]], char[TAB]];
lint← V2I[GetP[person, phoneIs]];
tty.Put[int[lint], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];
END;

The following routine prints all elements of the Thing domain using DomainSubset, which automatically enumerates sub-domains as well.

PrintEverything: PROCEDURE =
-- Use DomainSubset to enumerate all Things (includes Frogs and Persons).
-- Print "unknown" if height field is null string.
BEGIN
thing: -- Thing -- Entity;
es: EntitySet;
r: ROPE;
tty.Put[char[CR], rope["Things:"], char[CR], char[CR]];
tty.Put[rope["Type Name  Height"], char[CR]];
es← DomainSubset[Thing];
WHILE (thing← NextEntity[es])#NIL DO
tty.Put[rope[GetName[DomainOf[thing]]], char[TAB]];
tty.Put[rope[GetName[thing]], char[TAB]];
r← V2S[GetP[thing, heightProp]];
tty.Put[rope[IF r.Length[]=0 THEN "[unknown]" ELSE r], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];
END;

The following routine prints the people in the Person domain with a phone number in the range 4400 to 4499.

PrintCSLPhones: PROC =
-- Use RelationSubset to enumerate phones between 4400 and 4500
BEGIN
rs: RelshipSet; r: Relship;
tty.PutF["\nPhone numbers between 4400 and 4499:\n"];
tty.PutF["Name Phone\n"];
rs← RelationSubset[Phone, LIST[[phoneIs, I2V[4400], I2V[4499]]]];
UNTIL (r← NextRelship[rs])=NIL DO
tty.PutF["%g %g\n", rope[GetFS[r, phoneOf]], rope[GetFS[r, phoneIs]] ];
ENDLOOP;
ReleaseRelshipSet[rs];
END;

The following routine prints all of the persons related through the Friend relation to the entity in the Person domain named Harry.

PrintFriends: PROC =
-- Use RelationSubset to enumerate friends of Harry
BEGIN
friendRS: --Friend-- Relship;
friendName: --Person-- ROPE;
rs: RelshipSet;
tty.Put[char[CR], rope["Harry's friends:"], char[CR]];
rs← RelationSubset[Friend, LIST[[friendIs, harry]]];
WHILE (friendRS← NextRelship[rs])#NIL DO
friendName← GetFS[friendRS, friendOf];
tty.Put[rope[friendName], rope[" "], char[CR]];
ENDLOOP;
ReleaseRelshipSet[rs];
END;

The main program opens the default database, calls the above routines, and calls CloseDatabase to cause the updates to be made.


tty.Put[rope["Creating database..."], char[CR]];
[]← OpenDatabase[];
[]← OpenTransaction[];
[]← OpenSegment["[Local]Temp"];
Initialize[];
InsertData[];
PrintPeople[];
PrintCSLPhones[];
PrintFriends[];
PrintEverything[];
InsertMoreData[];
DestroySomeData[];
PrintEverything[];
CloseDatabase[];
tty.Close[];


END.
6. Database design
This section is a discussion of database design: the process of representing abstractions of real-world information structures in a database. This is an important, poorly understood topic, just as the design of programs is; don't expect any startling revelations in these pages. To some extent the discussion here is specialized to the data structures available in the Cedar database package.
What are the properties of a well-designed database? To a large extent these properties follow from the general properties of databases, as discussed in Section 2. For instance, we would like our databases to extend gracefully as new types of information are added, since the existing data and programs are likely to be quite valuable.
It may be useful to consider the following point. The distinguishing aspect of information stored in a database system is that at least some of it is stored in a form that can be interpreted by the system itself, rather than only by some application-specific program. Hence one important dimension of variation among different database designs is in the amount of the database that is system-interpretable, i.e. the kinds of queries that can be answered by the system.
As an example of variation in this dimension, consider the problem of designing a database for organizing a collection of Mesa modules. In the present Mesa environment, this database would need to include at least the names of all the definitions modules, program modules, configuration descriptions, and current .Bcd files. A database containing only this information is little more than a file directory, and therefore the system's power to answer queries about information in this database is very limited. A somewhat richer database might represent the DIRECTORY and IMPORTS sections of each module as collections of tuples, so that queries such as "which modules import interface Y?" can be answered by the system. This might be elaborated further to deal with the use of individual types and procedures from a definitions module, and so on. There may be a limit beyond which it is useless to represent smaller objects in the database; if we aren't interested in answering queries like "what procedures in this module contain IF statements?", it may be attractive to represent the body of a procedure (or some smaller naming scope) as a Mesa data structure that is not interpretable by the database system, even though it is stored in a database.
We shall illustrate a few more database design ideas with the following problem: design a database of information about a collection of documents. Our current facilities, which again are simply file directories, leave much to be desired. The title of a document on the printed page does not tell the reader where the document is stored or how to print a copy. Relationships between different versions of the same basic document are not explicit. Retrievals by content are impossible. Our goal here is not to solve all of these problems, but to start a design that has the potential of dealing with some of them.
Each document has a title and a set of authors. Hence we might represent a collection of documents with a domain whose title is the name of the document, and an author property specifying the authors:
Document: Domain = DeclareDomain["Document"];
dAuthors: Property = DeclareProperty["author", Document, StringType];
Here the authors' names are concatenated into a single string, using some punctuation scheme to allow the string to be decoded into the list of authors. This is a very poor database design because it does not allow the system to respond easily to queries involving authors; the system cannot parse the encoded author list.
Another property of this declaration of authors may be viewed as an advantage or a drawback: authors are strings, so anything is acceptable as an author. This weak typing gives some flexibility: the database will never complain that it doesn't know the author you just attached to a certain document. The corresponding drawback, as you might expect, is that the system is not helpful in catching errors when a new document is added to the database. If "Mark R. Brown" is mistakenly spelled "Mark R. Browne", then one of Mark's papers will not be properly retrieved by a later search. A step in the direction of stronger type checking is to provide a separate domain for authors.
To represent authors as entities, and to allow a variable number of authors for a document, a better design would be:
Person: Domain = DeclareDomain["Person"];
author: Property = DeclareProperty["author", Document, Person];
Incidentally, we define a property rather than a relation for brevity here. Instead of the author property declaration we could have written:
author: Relation = DeclareRelation["author"];
authorOf: Attribute = DeclareAttribute[author, "of", Document];
authorIs: Attribute = DeclareAttribute[author, "is", Person];
The property declaration has exactly the same effect on the database as the relation declaration, since it automatically declares an author relation with an "of" and an "is" attribute. However, in the property case the relation is not available in a Cedar Mesa variable, so operations such as RelationSubset cannot be applied to it; applications that use DeclareProperty will therefore access the data with operations such as GetP and SetP.
In either case, we have one Document entity per document plus, for each document, one author relationship per author of that document. Each author relationship points to its Document entity via the entity-valued "of" attribute of the author relation. Now a search of the author relation can be used to locate all documents authored by a given person.
Documents have other interesting properties. Some of these, for example the date on which the document was produced, are in one-to-one correspondence with documents. Such properties can be defined by specifying a relation or property as being keyed on the document:
publDate: Property = DeclareProperty["publ-date", Document, StringType, Key];
We are using the convention that domain names are capitalized and relation, attribute, and property names are not capitalized, both for the Cedar Mesa variable names and in the names used in the database system itself. When the database system is integrated with Mesa, the Mesa and database names will be one and the same. Other conventions are desirable, for example preceding all relation names with the name of the database application in which they are used, to avoid confusion when databases overlap.
A few entities and relationships from a database of documents, defined according to this design, are shown graphically in Figure 4.
[Figure 4: cffig4.press]
We might wish to include additional information for particular kinds of documents, for example conference papers. Conference papers may participate in the same relations as other documents. For example, they have authors. In addition, we may want to define relations in which only conference papers may participate, for example a presentation relation which defines who presented the paper, and where. We can define a conference paper to be a sub-domain of documents, and define relations which pertain specifically to conference papers:
ConferencePaper: Domain = DeclareDomain["ConferencePaper"];
... DeclareSubType[sub: ConferencePaper, super: Document]; ...
Conference: Domain = DeclareDomain["Conference"];
presentation: Relation = DeclareRelation["presentation"];
presentationOf: Attribute = DeclareAttribute[presentation, "of", ConferencePaper];
presentationAt: Attribute = DeclareAttribute[presentation, "at", Conference];
presentationBy: Attribute = DeclareAttribute[presentation, "by", Person];
Figure 5 illustrates a fragment of a database using this extended design.
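With this design, a query such as "who presented paper p, and where?" is a straightforward RelationSubset; a sketch (p is assumed to be a ConferencePaper entity, and tty an output stream as in Section 5):
t: Relship;
rs: RelshipSet← RelationSubset[presentation, LIST[[presentationOf, p]]];
WHILE (t← NextRelship[rs])#NIL DO
tty.PutF["%g at %g\n", rope[GetFS[t, presentationBy]], rope[GetFS[t, presentationAt]]];
ENDLOOP;
ReleaseRelshipSet[rs];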
The reader will note that we have defined our database schema in the functionally irreducible form described in Section 2: i.e., the relations have as few attributes as possible so as to represent atomic facts. This normalization is not necessary in the design of a schema, but it often makes the database easier to understand and use, and it simplifies the development of applications (which need not be concerned with the update anomalies that result from storing the same information redundantly).
Note that the presentation relation is an example of a functionally irreducible relation that is not binary. It cannot be decomposed into smaller relations without losing information or introducing artificial entities to represent the presentations themselves.
What other information should be present in our document database? Subject keywords would certainly be useful. Since one document will generally have many associated keywords, we would introduce another relation, say docKeyword, to represent the new information. Should keywords be entities? Again there is a tradeoff, but the argument for entities seems persuasive: limiting the range of keywords increases the value of the database for retrieval. The keyword entities could also participate in relationships with a dictionary of synonyms, Computing Reviews categories, etc.
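One possible set of declarations for this, following the conventions above and treating keywords as entities in their own domain (a sketch; the names are ours):
Keyword: Domain = DeclareDomain["Keyword"];
docKeyword: Relation = DeclareRelation["docKeyword"];
docKeywordOf: Attribute = DeclareAttribute[docKeyword, "of", Document];
docKeywordIs: Attribute = DeclareAttribute[docKeyword, "is", Keyword];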
This is certainly not a complete design, and the reader is encouraged to fit his or her own ideas into the database framework.
7. References
Brown, M., Cattell, R., Suzuki, N. "The Cedar Database Management System: A Preliminary Report", Proceedings ACM SIGMOD Conference, Ann Arbor, 1981.
Brown, M. "How to Use Walnut", Internal memorandum, 1982.
Cattell, R. G. G. "An entity-based database user interface", Proceedings ACM SIGMOD Conference 1981, Santa Monica, 1980.
Cattell, R. G. G. "Design of an Relationship-Entity-Datum data model", to appear as CSL report, 1982.
Date, C. J. An Introduction to Database Systems, Addison-Wesley, 1977.
Donahue, J. Documentation of Squirrel and how to build database applications, in preparation, October 1982.
Hoare, C. A. R. "Data reliability", Proc. International Conference on Reliable Software, Los Angeles, April 1975, pp. 528-533.
Israel, J. E., Mitchell, J. G., Sturgis, H. E. "Separating data from function in a distributed file system", CSL-78-5, September 1978.