DESIGN AND IMPLEMENTATION OF A RELATIONSHIP-ENTITY-DATUM DATA MODEL
3. Model Level Interface
We now describe the Cedar interface to the implementation of the Cypress data model. A knowledge of the Cedar or Mesa programming language (Mitchell et al [1979]) and the Cedar programming environment is not essential to understanding this section. We will explain Cedar features as they are encountered.
We do assume that the reader is familiar with the basic conceptual data model, i.e., has read the previous section. Our presentation is therefore slightly different in this section: we describe the procedures in the database interface in roughly the order that a client will want to use them in a program. We present types and initialization, schema definition, the basic operations, and then queries.
It should be emphasized that the interface we are about to describe is only one possible implementation of the abstract data model described in Section 2. For example, we have chosen to implement a procedural interface called by Cedar programs, and to do type checking at run-time. We will discuss some of the trade-offs in our choice of interface in Section 6. The introduction of new interfaces, such as a Query level with a compiled access language, will provide a different perspective on the Cypress data model.
3.1 Types
In this subsection we describe the most important types in the interface. Less pervasive types are treated at the point where they are first used.
Entity: TYPE;
Relship: TYPE;
An Entity or Relship is not the actual database entity or relationship; it is a handle for the actual database object. All accesses to database objects are performed by calling interface procedures with the handles as parameters. Even comparisons of two entities for equality must be done in this way. The Entity and Relship handles are allocated from storage and automatically freed by the garbage collector when no longer needed.
Value: TYPE = REF ANY;
ValueType: TYPE;
Datatype: TYPE;
StringType, IntType, BoolType, AnyDomainType: Datatype;
Storing Cedar data values in tuples presents several problems. First, since we would like to define a single operation to store a new value into a specified attribute of a Relship (for instance), there must be a single type for all values that pass through this "store-value" procedure. This is the type Value above, represented as untyped REFs in Cedar (a REF is a garbage-collectable Cedar pointer). The DataTypes will be discussed in the next section. Entities, strings, integers, and booleans are the types of values the system currently recognizes and allows as attribute values. More precisely, these four types are Entity, ROPE (the Cedar name for "heavy-duty" strings), REF INT, and REF BOOL. In the case of an entity-valued attribute, an attribute’s type may be AnyDomainType or a specific domain may be specified. The latter is highly preferred, as AnyDomainType is a loophole in the type mechanism and limits the kinds of operations that can be performed automatically by the database system or associated tools. We currently provide no mechanism to store compound Cedar data structures such as arrays, lists, or records in a database; the database system’s data structuring mechanisms should be used instead. (Cypress query operations such as RelationSubset cannot be composed upon data that appears as uninterpreted bits in the database. We return to this issue in Section 4.)
Note that a Value may be either an Entity or a Datum. Some operations accept any Value, e.g. SetF; others require an Entity, e.g. NameOf. Others may require an Entity from a particular client-defined domain, e.g. a Person. We might think of the hierarchy of built-in and client defined types and instances of values like this:
Value type hierarchy                      Database representative of type
Value (REF ANY)                           ValueType
    Datum                                 DatumType
        ROPE                              StringType
        INT                               IntType
        BOOL                              BoolType
    Entity                                AnyDomainType
        person Entity                     Person domain
        employee Entity                   Employee domain
        ... other client-defined entities ...    ... other client-defined domains ...
As Cedar doesn’t have a good mechanism for defining type hierarchies or new types for client-defined domains, most Cypress operations simply take a REF ANY or an Entity as argument, performing further type checking at run-time.
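The run-time check described above can be illustrated with a small model. The following is a minimal sketch in Python rather than Cedar, and it is not the Cypress implementation: the Entity class, the string tags, and the conforms function are all hypothetical names introduced for illustration. It also ignores sub-domains, which are introduced in Section 3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an Entity handle: a name and a domain name."""
    name: str
    domain: str


# Tags for the built-in value types named in the text (illustrative encoding).
STRING_TYPE = "StringType"
INT_TYPE = "IntType"
BOOL_TYPE = "BoolType"
ANY_DOMAIN_TYPE = "AnyDomainType"


def conforms(value, value_type):
    """Return True if value is acceptable where value_type is required.

    value_type is one of the built-in tags above or, by assumption,
    the name of a client-defined domain.
    """
    if value_type == STRING_TYPE:
        return isinstance(value, str)
    if value_type == BOOL_TYPE:
        return isinstance(value, bool)
    if value_type == INT_TYPE:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if value_type == ANY_DOMAIN_TYPE:
        # The loophole: any entity is accepted, whatever its domain.
        return isinstance(value, Entity)
    # Otherwise value_type names a specific domain.
    return isinstance(value, Entity) and value.domain == value_type
```

Under this model an entity of the Person domain conforms to both "Person" and AnyDomainType, which shows why a specific domain gives the system more to check than AnyDomainType does.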
3.2 Transactions and segments
In this section we describe the basic operations to start up a database application’s interaction with Cypress. The client application’s data is stored in one or more segments, accessed under transactions. The Cypress system currently runs on the same machine as the client program; however, transactions are implemented by the underlying file system, which may reside on another machine. Data in remote segments may therefore be concurrently accessed by other instances of Cypress on other client machines.
A transaction is a sequence of read and write commands. The system supports the property that the entire sequence of commands executes atomically with respect to all other data retrieval and updates, that is, the transaction executes as if no other transactions were in progress at the same time. Because there may in fact be other transactions accessing the same data at the same time, it is possible that two transactions may deadlock, in which case one of them must be aborted. So the price paid for concurrent access is that programs be prepared to retry aborted transactions.
The database system provides the capability of accessing a database stored on the same machine as the database client, using the Pilot file system (Redell et al [1979]), or on Alpine file servers (Brown et al [1983]). We currently permit only one transaction per segment per instance of the database software on a client machine. That is, data in remote segments may concurrently be updated by application programs under separate transactions, but on the same machine transactions are used simply to make application transactions on their respective segments independent. This transaction-per-segment scheme is a major simplification of the Cypress package. In addition, as we shall see presently, nearly all Cypress procedures can automatically infer the appropriate segment and transaction from the procedure arguments, avoiding the need to pass the transaction or segment for every database operation.
Calls to Initialize, DeclareSegment, and OpenTransaction start the database session. A transaction is either passed in by the client, or created by the database package (the latter is just a convenience feature). The operation MarkTransaction below forms the end of a database transaction and the start of a new one. The operation AbortTransaction may be used to abort a transaction. Data in a database segment may not be read or updated until the segment and transaction have been opened. Clients must decide when to tell the system that a transaction is complete (with CloseTransaction), and must be prepared to deal with unsolicited notification that the current transaction has been aborted because of system failure or lock conflict.
The client’s interaction with the database system begins with a call to Initialize:
Initialize: PROC[
nCachePages: CARDINAL← 256,
nFreeTuples: CARDINAL← 32,
cacheFileName: ROPE← NIL ];
Initialize initializes the database system and sets various system parameters: nCachePages tells the system how many pages of database to keep in virtual memory on the client’s machine, nFreeTuples specifies the size to use for the internal free list of Entity and Relship handles, and cacheFileName is the name of the disk file used for the cache backing store. Any or all of these may be omitted in the call; they will be given default values. Initialize should be called before any other operation; the schema declaration operations generate the error DatabaseNotInitialized if this is violated.
Before database operations may be invoked, the client must open the segment(s) in which the data are stored. The location of the segment is specified by using the full path name of the file, e.g. "[MachineName]<Directory>SubDirectory>SegmentName.segment". Each segment has a unique name, the name of a Cedar ATOM which is used to refer to it in Cypress operations. The name of the Cedar ATOM is normally, though not necessarily, the same as that of the file in which it is stored, except that the extension ".segment" and the prefix specifying the location of the file are omitted in the ATOM. If the file is on the local file system, its name is preceded by "[Local]". For example, "[Local]Foo" refers to a segment file on the local disk named Foo.segment; "[Alpine]<CedarDB>Baz" refers to a segment named Baz.segment on the <CedarDB> directory on the Alpine server. It is generally a bad idea to access database segments other than through the database interface. However, because segments are physically independent and contain no references to other files by file identifier or explicit addresses within files, the segment files may be moved from machine to machine or renamed without effect on their contents. If a segment file in a set of segments comprising a client database is deleted, the others may still be opened to produce a database missing only that segment’s entities and relationships. A segment is defined by the operation DeclareSegment:
DeclareSegment: PROC[
filePath: ROPE, segment: Segment, number: INT← 0,
readOnly: BOOL← FALSE, version: Version← OldOnly,
nBytesInitial, nBytesPerExtent: LONG CARDINAL← 32768]
RETURNS [Segment];
Segment: TYPE = ATOM;
Version: TYPE = {NewOnly, OldOnly, NewOrOld};
The version parameter to DeclareSegment defaults to OldOnly to open an existing file. The signal IllegalFileName is generated if the directory or machine name is missing from filePath, and FileNotFound is generated at the time a transaction is opened on the segment if the file does not exist. If version NewOnly is passed, a new segment file will be created, erasing any existing one. In this case, a number assigned to the segment by the database administrator must also be passed. This number is necessitated by our current implementation of segments (it specifies the section of the database address space in which to map this segment). Finally, the client program can pass version=NewOrOld to open a new or existing segment file; in this case the segment number must also be passed, of course.
The other parameters to DeclareSegment specify properties of the segment. If readOnly=TRUE, then writes are not permitted on the segment; any attempt to invoke a procedure which modifies data will generate the error ProtectionViolation. nBytesInitial is the initial size to assign to the segment, and nBytesPerExtent is the incremental increase in segment size used when more space is required for data in the file.
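The three Version values for segments can be summarized as a small decision function. This is a sketch in Python rather than Cedar, written only from the description above; the function name resolve_segment_file and its string results are hypothetical, and the real system defers the FileNotFound check until a transaction is opened on the segment.

```python
from enum import Enum


class Version(Enum):
    NEW_ONLY = "NewOnly"
    OLD_ONLY = "OldOnly"
    NEW_OR_OLD = "NewOrOld"


class FileNotFound(Exception):
    """Stand-in for the FileNotFound signal."""


def resolve_segment_file(version, file_exists):
    """Return "create" or "open" for a segment file, per the text.

    NewOnly always creates a file, erasing any existing one.
    OldOnly opens an existing file and fails if there is none.
    NewOrOld opens the file if it exists and creates it otherwise.
    """
    if version is Version.NEW_ONLY:
        return "create"
    if file_exists:
        return "open"
    if version is Version.OLD_ONLY:
        raise FileNotFound("segment file does not exist")
    return "create"
```

Both outcomes that can create a file (NewOnly and NewOrOld) are the cases in which the text requires a segment number to be passed.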
For convenience, a call is available to return the list of segments that have been declared in the current Cypress session:
GetSegments: PROC RETURNS[LIST OF Segment ];
A transaction is associated with a segment by using OpenTransaction:
OpenTransaction: PROC[
segment: Segment,
userName, password: ROPE← NIL,
useTrans: Transaction← NIL ];
If useTrans is NIL then OpenTransaction establishes a new connection and transaction with the corresponding (local or remote) file system. Otherwise it uses the supplied transaction. The same transaction may be associated with more than one segment by calling OpenTransaction with the same useTrans argument for each. The given user name and password, or by default the logged in user, will be used if a new connection must be established.
Any database operations upon data in a segment before a transaction is opened or after a transaction abort will invoke the Aborted signal. The client should catch this signal on a transaction abort, block any further database operations and wait for completion of any existing ones. Then the client may re-open the aborted transaction by calling OpenTransaction. When the remote transaction is successfully re-opened, the client’s database operations may resume.
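The recovery sequence just described (catch the abort, re-open the transaction, resume) amounts to a retry loop. The following Python sketch models that loop under stated assumptions: Aborted is a stand-in exception for the Cedar signal, and run_with_retry, open_transaction, and work are hypothetical names, not part of the Cypress interface.

```python
class Aborted(Exception):
    """Stand-in for the Aborted signal raised on a transaction abort."""


def run_with_retry(open_transaction, work, max_attempts=3):
    """Run work() under a freshly opened transaction, retrying on abort.

    open_transaction is called before every attempt, mirroring the
    requirement that an aborted transaction be re-opened before any
    further database operations. work must re-fetch any entities or
    relationships it needs, since handles obtained under an aborted
    transaction are no longer valid.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        open_transaction()
        try:
            return work()
        except Aborted:
            if attempt == max_attempts:
                raise  # give up and let the caller see the abort
```

Bounding the number of attempts is a design choice of this sketch; the text only requires that clients be prepared to retry.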
Note that operations on data in segments under different transactions are independent. Normally there will be one transaction (and one or more segments) per database application program. A client may find what transaction has been associated with a particular segment by calling
TransactionOf: PROC [segment: Segment] RETURNS [Transaction];
Transactions may be manipulated by the following procedures:
MarkTransaction: PROC[trans: Transaction];
AbortTransaction: PROC [trans: Transaction];
CloseTransaction: PROC [trans: Transaction];
MarkTransaction commits the current database transaction, and immediately starts a new one. User variables which reference database entities or relationships are still valid.
AbortTransaction aborts the current database transaction. The effect on the data in segments associated with the transaction is as if the transaction had never been started: the state is as it was just after the OpenTransaction call or the most recent MarkTransaction call. Any attempt to use variables referencing data fetched under the transaction will invoke the NullifiedArgument error. A call to OpenTransaction is necessary to do more database operations, and all user variables referencing database items created or retrieved under the corresponding transaction must be re-initialized (they may reference entities or relationships that no longer exist, and in any case they are marked invalid by the database system).
A simple client program using the database system might then have the form:
Initialize[];
DeclareSegment["[Local]Test", $Test];
OpenTransaction[$Test];
...
... database operations, including zero or more MarkTransaction calls ...
...
CloseTransaction[TransactionOf[$Test]];
3.3 Data schema definition
The definition of the client’s data schema is done through calls to procedures defined in this section. The data schema is represented in a database as entities and relationships, and although updates to the schema must go through these procedures to check for illegal or inconsistent definitions, the schema can be read via the normal data operations described in the next section. Each domain, relation, etc., has an entity representative that is used in data operations which refer to that schema item. For example, we pass the domain entity when creating a new entity in the domain. The types of schema items are:
Domain, Relation, Attribute, Datatype, Index, IndexFactor: TYPE = Entity;
Of course, since the schema items are entities, they must also belong to domains; there are pre-defined domains, which we call system domains, in the interface for each type of schema entity:
DomainDomain, RelationDomain, AttributeDomain, DatatypeDomain, IndexDomain: Domain;
There are also pre-defined system relations, which contain information about sub-domains, attributes, and indices. Since these are not required by the typical (application-specific) database client, we defer the description of the system relations to Section 3.6.
In general, any of the data schema may be extended or changed at any time; i.e., data operations and data schema definition may be intermixed. However, there are a few specific ordering constraints on schema definition we will note shortly. Also, the database system optimizes for better performance if the entire schema is defined before any data are entered. The interactive schema editing tool described in Section 7 allows the schema to be changed regardless of ordering constraints and existing data, by recreating schema items and copying data invisibly to the user when necessary.
All the data schema definition operations take a Version parameter which specifies whether the schema element is a new or existing one. The version defaults to allowing either (NewOrOld): i.e., the existing entity is returned if it exists, otherwise it is created. This feature avoids separate application code for creating the database schema the first time the application program is run.
DeclareDomain: PROC [name: ROPE, segment: Segment,
version: Version← NewOrOld, estRelations: INT← 5] RETURNS [d: Domain];
DeclareSubType: PROC[sub, super: Domain];
DeclareDomain defines a domain with the given name in the given segment and returns its representative entity. If the domain already exists and version=NewOnly, the signal AlreadyExists is generated. If the domain does not already exist and version=OldOnly, then NIL is returned. The parameter estRelations is used to estimate the largest number of relations in which entities of this domain are expected to participate.
The client may define one domain to be a subtype of another by calling DeclareSubType. This permits entities of the subdomain to participate in any relations in which entities of the superdomains may participate. All client DeclareSubType calls should be done before declaring relations on the superdomains (to allow some optimizations). The error MismatchedSegment is generated if the sub-domain and super-domain are not in the same segment.
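The behaviour of DeclareDomain and DeclareSubType described above can be modelled in a few lines. This is a Python sketch, not the Cedar implementation: the Schema class and its method names are hypothetical, domains are assumed to be identified by (segment, name), and subtyping is assumed to be transitive.

```python
class AlreadyExists(Exception):
    """Stand-in for the AlreadyExists signal."""


class MismatchedSegment(Exception):
    """Stand-in for the MismatchedSegment error."""


class Schema:
    def __init__(self):
        # Maps a domain key (segment, name) to its direct superdomains.
        self.supers = {}

    def declare_domain(self, name, segment, version="NewOrOld"):
        key = (segment, name)
        exists = key in self.supers
        if exists and version == "NewOnly":
            raise AlreadyExists(name)
        if not exists:
            if version == "OldOnly":
                return None  # models the NIL result
            self.supers[key] = set()
        return key

    def declare_subtype(self, sub, super_):
        if sub[0] != super_[0]:
            raise MismatchedSegment(sub, super_)
        self.supers[sub].add(super_)

    def is_subtype(self, sub, super_):
        """True if sub is super_ or is reachable from it via subtype links."""
        seen, stack = set(), [sub]
        while stack:
            domain = stack.pop()
            if domain == super_:
                return True
            if domain not in seen:
                seen.add(domain)
                stack.extend(self.supers.get(domain, ()))
        return False
```

With the default NewOrOld, calling declare_domain twice with the same arguments returns the same domain both times, which is what lets an application run the same schema code on its first and later runs.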
DeclareRelation: PROC [
name: ROPE, segment: Segment, version: Version← NewOrOld] RETURNS [r: Relation];
DeclareAttribute: PROC [
r: Relation, name: ROPE, type: ValueType← NIL,
uniqueness: Uniqueness← NonKey, length: INT← 0,
link: {Linked, Unlinked, Colocated, Remote}← Linked, version: Version← NewOrOld]
RETURNS[a: Attribute];
Uniqueness: TYPE = {NonKey, Key, KeyPart, OptionalKey};
DeclareRelation defines a new or existing relation with the given name in the given segment and returns its representative entity. If the relation already exists and version=NewOnly, the signal AlreadyExists is generated. If the relation does not already exist and version=OldOnly, then NIL is returned.
DeclareAttribute is called once for each attribute of the relation, to define their names, types, and uniqueness. If version=NewOrOld and the attribute already exists, Cypress checks that the new type, uniqueness, etc. match the existing attribute. The error MismatchedExistingAttribute is generated if there is a discrepancy. The attribute name need only be unique in the context of its relation, not over all attributes. Note this is the only exception to the data model’s rule that names be unique in a domain. Also note that we could dispense with DeclareAttribute altogether by passing a list into the DeclareRelation operation; we define a separate procedure for programming convenience.
The attribute type should be a ValueType, i.e. it may be one of the pre-defined types (IntType, StringType, BoolType, AnyDomainType) or the entity representative for a domain. For pre-defined types, the actual values assigned to attributes of the relationship instances of the relation must have the corresponding type: REF INT, ROPE, REF BOOL, or Entity. If the attribute has a domain as type, the attribute values in relationships must be entities of that domain or some sub-domain thereof. The type is permitted to be one of the pre-defined system domains such as the DomainDomain, thereby allowing client-defined extensions to the data schema (for example, a comment for each domain describing its purpose).
The attribute uniqueness indicates whether the attribute is a key of the relation. If its uniqueness is NonKey, then the attribute is not a key of the relation. If its uniqueness is OptionalKey, then the system will ensure that no two relationships in r have the same value for this attribute (if a value has been assigned). The error NonUniqueKeyValue is generated if a non-unique key value results from a call to the SetP, SetF, SetFS, or CreateRelship procedures we define later. Key acts the same as OptionalKey, except that in addition to requiring that no two relationships in r have the same value for the attribute, it requires that every entity in the domain referenced by this attribute must be referenced by a relationship in the relation: the relationships in the relation and the entities in the domain are in one-to-one correspondence. Finally, if an attribute’s uniqueness is KeyPart, then the system will ensure that no two relationships in r have the same value for all key attributes of r, though two may have the same values for some subset of them.
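The uniqueness rules can be made concrete with a small in-memory model. The sketch below is Python, not Cedar, and the Relation class and create_relship method are hypothetical. It models only the duplicate checks: the additional requirement that Key puts relationships and domain entities in one-to-one correspondence is not modelled, and the treatment of unassigned values in KeyPart attributes is an assumption (they are compared like any other value).

```python
class NonUniqueKeyValue(Exception):
    """Stand-in for the NonUniqueKeyValue error."""


class Relation:
    def __init__(self, uniqueness):
        # uniqueness maps attribute name to one of
        # "NonKey", "Key", "OptionalKey", "KeyPart".
        self.uniqueness = dict(uniqueness)
        self.relships = []

    def create_relship(self, **values):
        # Unassigned attributes are represented by None.
        new = {attr: values.get(attr) for attr in self.uniqueness}
        key_parts = [a for a, u in self.uniqueness.items() if u == "KeyPart"]
        for old in self.relships:
            # Key and OptionalKey: each such attribute is unique by itself,
            # but only when a value has actually been assigned.
            for attr, u in self.uniqueness.items():
                if u in ("Key", "OptionalKey") and new[attr] is not None:
                    if new[attr] == old[attr]:
                        raise NonUniqueKeyValue(attr)
            # KeyPart: only the combination of all KeyPart attributes
            # must be unique; individual values may repeat.
            if key_parts and all(new[a] == old[a] for a in key_parts):
                raise NonUniqueKeyValue(tuple(key_parts))
        self.relships.append(new)
        return new
```

For example, with two KeyPart attributes a relation accepts (1, "a") and (1, "b") but rejects a second (1, "a").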
The length and link arguments to DeclareAttribute have no functional effect on the attribute, but are hints to the database system implementation. For StringType fields, length characters will be allocated for the string within the space allocated for a relationship in the database. There is no upper limit on the size of a string-valued attribute; if it is longer than length, it will be stored separately from the relationship with no visible effect except for the performance of database applications. The link field is used only for entity-valued fields; it suggests whether the database system should link together relationships which reference an entity in this attribute. In addition, it can suggest that the relationships referencing an entity in this attribute be physically co-located as well as linked. Again, its logical effect is only upon performance, not upon the legal operations.
DestroyRelation: PROC[r: Relation];
DestroyDomain: PROC[d: Domain];
DestroySubType: PROC[sub, super: Domain];
Relations, domains, and subdomain relationships may be destroyed by calls to the above procedures. Destroying a relation destroys all of its relationships. Destroying a domain destroys all of its entities and also any relationships which reference those entities. Destroying a sub-domain relationship has no effect on existing domains or their entities; it simply makes entities of domain sub no longer eligible to participate in the relations in which entities of domain super can participate. Existing relationships violating the new type structure are allowed to remain. Existing relations and domains may only be modified by destroying them with the procedures above, with one exception: the operation ChangeName (described in Section 3.4) may be used to change the name of a relation or domain.
DeclareIndex: PROC [
relation: Relation, indexedAttributes: AttributeList, version: Version];
DeclareIndex has no logical effect on the database; it is a performance hint, telling the database system to create a B-Tree index on the given relation for the given indexedAttributes. The index will be used to process queries more efficiently. Each index key consists of the concatenated values of the indexedAttributes in the relationship the index key references. For entity-valued attributes, the value used in the key is the string name of the entity. The version parameter may be used as in other schema definition procedures, to indicate a new or existing index. If any of the attributes are not attributes of the given relation then the signal IllegalIndex is generated.
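The construction of an index key can be sketched as follows. This is a Python model, not the Cypress B-Tree code: Entity and index_key are hypothetical names, only string-valued and entity-valued attributes are handled because the text does not say how integers or booleans are encoded, and plain concatenation is used because the text mentions no delimiter or padding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an Entity handle; only the name matters here."""
    name: str


def index_key(relship, indexed_attributes):
    """Concatenate the values of the indexed attributes, in order.

    relship maps attribute names to values. An entity-valued attribute
    contributes the entity's string name, as described in the text.
    """
    parts = []
    for attr in indexed_attributes:
        value = relship[attr]
        if isinstance(value, Entity):
            parts.append(value.name)
        elif isinstance(value, str):
            parts.append(value)
        else:
            raise TypeError("encoding of %r is not specified" % (value,))
    # Caveat of this sketch: without a delimiter, ("ab", "c") and
    # ("a", "bc") yield the same key.
    return "".join(parts)
```

Sorting relationships by such keys orders them first by the leading indexed attribute, which is why the order of indexedAttributes matters for range queries.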
The optimal use of indices, links, and colocation, as defined by DeclareIndex and DeclareAttribute, is complex. It may be necessary to do some space and time analysis of a database application to choose the best trade-off, and a better trade-off may later be found as a result of unanticipated access patterns. Note, however, that a database may be rebuilt with different links, colocation, or indices, and thanks to the data independence our interface provides, existing programs will continue to work without change.
If a relation is expected to be very small (fewer than 100 relationships), then it might reasonably be defined with neither links nor indices on its attributes. In the typical case of a larger relation, one should examine the typical access paths: links are most appropriate if relationships that pertain to particular entities are involved; indices are more useful if sorting or range queries are desired.
B-tree indices are always maintained for domains; that is, an index contains entries for all of the entities in a domain, keyed by their name, so that sorting or lookup by entity name is quick. String comparisons are performed in the usual lexicographic fashion.
DeclareProperty: PROC [
relationName: ROPE, of: Domain, type: ValueType,
uniqueness: Uniqueness← NonKey, version: Version← NewOrOld]
RETURNS [property: Attribute];
DeclareProperty provides a shorthand for definition of a binary relation between entities of the domain "of" and values of the specified type. The definitions of type and uniqueness are the same as for DeclareAttribute. A new relation relationName is created, and its attributes are given the names "of" and "is". The "is" attribute is returned, so that it can be used to represent the property in GetP and SetP defined in the next section.