DESIGN AND IMPLEMENTATION OF A RELATIONSHIP-ENTITY-DATUM DATA MODEL
1.  Introduction
1.1 Data modelling
A data model is a scheme for describing the types of data that may be stored in a database:  how these data may be structured, and how they may be accessed.  Any particular database consists of the data themselves plus a data schema that describes the types of data in terms of the data description primitives of the data model.  The three data models most widely used in the past are the network, hierarchical, and relational models.  A good summary of these models can be found in Computing Surveys, March 1976.
In recent years, data models with more sophisticated representation schemes have been proposed, often called semantic data models.  Unfortunately, this more recent work is voluminous and difficult to understand:  the terminology is mutually conflicting, the models are complex, and related work was done concurrently by a number of authors.  Furthermore, almost none of the models were actually implemented.  Related problems of data structure specification have also been addressed extensively in the programming language and artificial intelligence literature, as abstract data type mechanisms and knowledge representation languages.  A survey of the literature is beyond the scope of this document.  An annotated bibliography is included in the last section, however, and Tsichritzis & Lochovsky [1982] summarize much of the recent data modelling work.
The data model described herein is the result of analyzing the strengths and weaknesses of a number of proposed models, and integrates a variety of viewpoints.  It includes only those features that are accepted in some form in a number of models, or that proved particularly useful for our database applications.  We will call our model the Cypress data model.  It might alternately be called the Relationship-Entity-Datum model, after its three basic primitives.
The Cypress data model is described in Section 2.  Sections 3 and 4 provide a description of the programmer’s interface to our implementation, and an example of its use.  Section 5 contrasts the Cypress data model to others in the literature, and describes how we arrived at this particular integration of data modelling ideas.  Section 6 describes issues in the implementation of the Cypress data model; as none of the models we reference have actually been implemented, this is an important result of the work.  To complete our description of the Cypress work, Section 7 covers experience with database access tools and applications built upon the system.  Sections 8 and 9 provide a summary and annotated bibliography.  An appendix of formal axioms for the model and an index of important terms are also included.
For the sake of clean exposition, Section 2 covers only the abstract model, not its implementation or rationale.  The reader interested only in the basic ideas may read about the data model in Section 2 and about the applications it enables in Section 7, without loss of continuity.  A potential client of Cypress should read Sections 2, 3, and 4.
1.2 Cypress and Cedar
The motivation for design and implementation of a database system in the Computer Science Laboratory arose from the needs of anticipated and existing database applications running in the Cedar Programming Environment.  The design of a data model was deferred to the second phase of the earlier Cedar Database Management System (DBMS) project to allow a choice of model after some experience with our needs.  With the second phase of development, the name of the Cedar DBMS was changed to Cypress, to distinguish it from the Cedar DBMS which it replaced, the Cedar Programming Environment in which it provides database facilities, and the Cedar Mesa Programming Language in which it is written.
The original Cedar DBMS, described by Brown, Cattell, and Suzuki [1980], implemented tuples whose elements are integers, strings, and references to other tuples.  It provides a large virtual memory of tuples which can be moved or deleted safely, optional B-trees for indexing of tuples, and concurrent transaction-based access to data over a network.  It was built in three levels:  the Cache, Storage, and Tuple Levels.  The last (top) level provides primitive query mechanisms, and does little or no checking of the integrity of data type, form, uniqueness, or referents; the addition of the Model Level introduces types to the essentially typeless underlying system.
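To make the contrast between the Tuple Level and the Model Level concrete, the following is a minimal Python sketch.  It is not the Cedar or Cypress interface:  the names Tuple, RelationType, and checked_insert are hypothetical, and the list-based store is an assumption made only to show how a type layer might sit above an essentially typeless tuple store.

```python
from dataclasses import dataclass


# Tuple Level (hypothetical): a tuple is just a sequence of elements, each an
# integer, a string, or a reference to another tuple. Nothing is checked here.
@dataclass(eq=False)
class Tuple:
    elements: list


# Model Level (hypothetical): a relation type declares what each element must be.
@dataclass
class RelationType:
    name: str
    field_types: list  # one of int, str, or Tuple per element


def checked_insert(store: list, rel: RelationType, elements: list) -> Tuple:
    """Add a tuple to the store only if it conforms to the relation type."""
    if len(elements) != len(rel.field_types):
        raise TypeError(f"{rel.name}: expected {len(rel.field_types)} elements")
    for value, expected in zip(elements, rel.field_types):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{rel.name}: {value!r} is not a {expected.__name__}")
    new_tuple = Tuple(list(elements))
    store.append(new_tuple)
    return new_tuple
```

In this sketch a client could still construct Tuple objects directly, bypassing all checks; that corresponds to an application building on the typeless level, whereas checked_insert stands in for the kind of integrity checking a model layer can add.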
Initial clients of the Cedar DBMS included:
1. The CSL Notebook, a database of memoranda from laboratory members,
2. PDB, a personal database including notes, bibliographies, addresses, phone numbers, etc.,
3. A database of components of large Mesa (Mitchell et al. [1976]) systems and their inter-dependencies,
4. A database of Mesa source programs decomposed to the level of procedures, types, and variables,
5. A database of data and events in an automated office system.
Most of these applications built their own data model on top of the existing system, to enable the data structures and access mechanisms they required.  In some cases, building on top as opposed to integrating the data model with the existing system (an option not available to them) led to difficulties with the performance or integrity of the total system.  This is not surprising, even though the original system was designed to allow several data modelling choices at a future date; the choice of data model must influence the choices at all levels of the system, if adequate performance is to be obtained.
The addition of a data model to the database system makes possible the development of general-purpose tools.  The more data semantics encoded in the database, the more the database system and its associated tools may provide without specific knowledge of an application’s semantics.  In order to print out data in a meaningful form, for example, a tool must know which data comprise names for objects.  In order to abbreviate and/or check the type of input data, a tool must know what types of data may be related to what others in what ways.  In order to efficiently represent potentially circular pointer structures linearly, a tool needs a specification of allowed data relationships.  And so on.
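As an illustration of the first and third points, the sketch below shows a generic display routine that relies only on schema information.  The schema layout, the "name_field" key, and the display function are hypothetical and are not part of Cypress; the sketch assumes entities are plain dictionaries and that a reference field is declared by giving the name of the target entity type.

```python
# Hypothetical schema: for each entity type, which field serves as the
# entity's name, and what each field holds (a Python type, or the name of
# another entity type for a reference).
SCHEMA = {
    "Person": {
        "name_field": "name",
        "fields": {"name": str, "phone": str, "advisor": "Person"},
    },
}


def display(entity: dict, schema: dict = SCHEMA) -> str:
    """Render an entity on one line; referenced entities print as their names."""
    spec = schema[entity["type"]]
    parts = []
    for field, field_type in spec["fields"].items():
        value = entity.get(field)
        if value is None:
            continue
        if isinstance(field_type, str):
            # A reference: print the target's name instead of recursing, so
            # circular structures still have a finite linear form.
            value = value[schema[field_type]["name_field"]]
        parts.append(f"{field}={value}")
    return f'{entity["type"]}({", ".join(parts)})'
```

Because the routine consults the schema rather than application code, the same function would serve any entity type added to the schema, which is the sense in which encoded data semantics enable general-purpose tools.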
The introduction of a data model also simplifies sharing a single database among multiple applications.  A hierarchy of types allows different applications to have differing perspectives on the same objects.  The logical integrity checks help protect the applications from one another.  More sophisticated data modelling mechanisms allow applications different independent views of the data or different physical environments (files) for their data.  Sharing a database between applications is important to avoid redundant, inconsistent representation of the same information, to benefit mutually from multiple data input sources, and, most importantly, to present one simple data access and manipulation mechanism to computer users.  This contrasts, for instance, with three separate personnel databases (and access tools) for applications dealing with phone numbers, electronic mail, and laboratory bibliographies.
Finally, the introduction of the Model Level data model is important to future plans for the next level of the Cypress DBMS, the Query Level.  At the Query Level, a database query language will be implemented to allow access and/or updates to individual data items or data aggregates, in a concise form amenable to optimization and decoupled operation in a database server on a computer network.  The Model Level provides the functions to enumerate the database objects satisfying the queries parsed by the Query Level.
In summary, the development of the Model Level of the Cypress DBMS is a follow-on to the Cache, Storage, and Tuple Levels and a prerequisite of the Query Level, and is motivated by:
1. the perceived needs of, or shortcomings for, future and existing database applications,
2. facilitation of shared databases for partial or complete integration of multiple applications,
3. the need for more data semantics to enable useful database utilities, e.g. a browser, and
4. the need for a query language for database access.
1.3 Design criteria and motivation
How do we choose a data model?  Why should one data model be better than another?  In most cases we can find a representational isomorphism between two models, such that we can automatically map between the two equivalent representation schemes.  However, the models may still differ in the data access primitives they provide and the ease with which a user can understand and operate within the model.  The arguments concerning ease of understanding are generally quite subjective, however, and will be avoided here except where these arguments are generally well-accepted in the literature.
The choices in the development of the Cypress Data Model were made on the basis of two criteria:  simplicity and utility.  Simplicity means ease of understanding and also representational parsimony, i.e., avoiding two or more mechanisms to represent the same semantics.  Utility means the degree to which the data model spares its clients from writing application-specific code for some function or integrity check; not every feature need be of use to all applications, though each feature should be of negligible cost to those that do not require it.
Simplicity and utility can of course be mutually conflicting; the result must be a balance incorporating the features deemed important for most applications, in the simplest form discovered.  To make the presentation less confusing, we separate the discussion of the model itself in the sections to follow from the analysis of the simplicity and utility of its features in Section 5 and the discussion of its implementation in Section 6.