6. Data model implementation issues
In this section we describe important aspects of our implementation of the Cypress data model defined in the previous sections. In the first subsection, we give an overview of the Cypress design and the storage primitives with which we are dealing. Then we discuss a number of specific optimizations we have found useful in the implementation. In the last subsections, we describe the implementation of each of the major operations in more detail and summarize our results.
The Cypress DBMS runs on personal computers, in the Cedar programming environment developed in the Computer Science Laboratory. It is written in the Cedar language, a descendant of Mesa (Mitchell et al [1979]) with changes aimed at making Cedar particularly convenient for developing prototype systems.
Cypress is built in four levels: the Model level, which implements the data model; the Storage level, which provides basic storage allocation and access structures; the Segment level, which manages the database system’s virtual memory; and the File level, which provides transaction-based file access.
The File level can be one of a variety of file systems that provide page-level transaction-based access to disk files. We currently use the Pilot system for files on the database client’s machine, and the Alpine file server for remote files. In the latter case, the Model, Storage, and Segment levels run on the user’s personal machine, and the File level on an Alpine file server.
The Segment level may communicate with the File level by local or remote procedure call. As sketched below, the File level provides procedures to:
1. open, close, and abort transactions,
2. open disk files under transactions, and
3. read and write file pages.
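As a rough illustration, the shape of this interface can be sketched in Python as follows; the method names and signatures are our own illustrative rendering, not the actual Cedar procedure names.

    from abc import ABC, abstractmethod

    class FileLevel(ABC):
        """Hypothetical rendering of the File level's three groups of procedures."""

        @abstractmethod
        def open_transaction(self) -> int:
            """Begin a transaction; returns a transaction handle."""

        @abstractmethod
        def close_transaction(self, trans: int) -> None:
            """Commit the transaction's updates."""

        @abstractmethod
        def abort_transaction(self, trans: int) -> None:
            """Undo the transaction's updates."""

        @abstractmethod
        def open_file(self, trans: int, name: str) -> int:
            """Open a disk file under a transaction; returns a file handle."""

        @abstractmethod
        def read_page(self, file: int, page: int) -> bytes:
            """Read one file page."""

        @abstractmethod
        def write_page(self, file: int, page: int, data: bytes) -> None:
            """Write one file page."""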
The Segment level provides an LRU cache of database pages from the various segments, reducing network traffic when the File level resides on a different machine. The Segment level implements the segment abstraction, including a mapping from 32-bit database addresses to underlying file pages.
The lower levels of Cypress (the Segment and File levels) will not be of central concern to us for the purposes of this report. In the next subsection, we describe the Storage level interface, and in the remainder of Section 6 we discuss the Model level implementation.
6.1 Storage level structures
The underlying Storage level provides the physical data structures used by the Model level: recordtypes, group lists, and indices. There is little that is particularly new in the Storage level; it uses reasonably well-understood mechanisms (the design is much like the RSS level of System R; see Astrahan et al [1976]).
Recordtypes are the basic storage mechanism for data. A recordtype is a set of records with fields that can be (1) fixed length blocks of words, (2) variable-length byte strings, or (3) references to other records. Recordtypes are manipulated using recordtype and field handles that describe what the Storage level needs to know to perform an operation on a recordtype or one of its fields, respectively. (Specifically, we need to know the byte length of records in a recordtype, and the byte position and size for its fields.)
Both entities and relationships are represented as records. Entities are represented as records whose first field is the name. Relationships are normally represented as records whose fields are the attributes of the relation. Each domain or relation maps to a particular recordtype.
Every record has a 32-bit tuple identifier, or TID, which uniquely identifies it. The Cypress TID for a record contains its segment number, the page number where it is stored in that segment, and its record number on that page. TIDs can be used to find a record relatively efficiently, normally in one page access. The record number on a page is mapped to a record through an array at the beginning of the page. A record could be moved to another page without changing all references to its TID, simply by entering in this array a forward reference to its new location on an overflow page. We do not currently use this feature, however.
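As an illustration, TID packing and unpacking might be sketched as follows in Python; the field widths shown (8-bit segment, 16-bit page, 8-bit record slot) are assumptions for the example, since the actual split of the 32 bits is not specified here.

    SEG_BITS, PAGE_BITS, SLOT_BITS = 8, 16, 8   # assumed widths; they sum to 32

    def make_tid(segment: int, page: int, slot: int) -> int:
        assert segment < (1 << SEG_BITS)
        assert page < (1 << PAGE_BITS)
        assert slot < (1 << SLOT_BITS)
        return (segment << (PAGE_BITS + SLOT_BITS)) | (page << SLOT_BITS) | slot

    def split_tid(tid: int) -> tuple[int, int, int]:
        segment = tid >> (PAGE_BITS + SLOT_BITS)
        page = (tid >> SLOT_BITS) & ((1 << PAGE_BITS) - 1)
        slot = tid & ((1 << SLOT_BITS) - 1)
        return segment, page, slot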
Groups are doubly-linked lists maintained through records; they are also known as parent-child sets (System R). A "parent" record resides at the head of the group. The other records in the list contain three pointers:
next: the TID of the next record in the group
prev: the TID of the previous record in the group
parent: the TID of the parent record
Groups are declared by defining a group field of a recordtype. Every record referenced by one of these group fields is the parent of a group list containing all records which reference that record in that group field.
The Model level currently uses groups as follows. If an attribute a is an entity-valued attribute declared with link=Linked (see DeclareAttribute), it is defined as a group field at the Storage level. Thus, a group list is maintained for each entity e referenced by attribute a of one or more records in a’s relation. Those records (the relationships which reference e) are linked together, with the entity e at the head of the list. For example, an entity of type "person" might be at the head of a number of group lists, one for the first attribute of the "author" relation, two for the first and second attributes of the "friend" relation, one for the second attribute of the organization-member relation, and so on. We can quickly find all the author relationships referencing a particular person by following the group list associated with the first author attribute. Group lists are updated by the SetF operation on entity-valued attributes, and are used by the RelationSubset operation to process a query when an attribute value list contains an entity-valued attribute.
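A minimal sketch of a group list and its traversal follows; the record class is hypothetical, and we assume a nil-terminated (rather than circular) list for simplicity.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GroupRecord:
        tid: int
        next: Optional["GroupRecord"] = None    # next record in the group
        prev: Optional["GroupRecord"] = None    # previous record in the group
        parent: Optional["GroupRecord"] = None  # entity at the head of the group

    def group_members(parent: GroupRecord):
        """Yield every relationship in the group headed by the entity `parent`,
        e.g. all "author" relationships referencing a particular person."""
        r = parent.next
        while r is not None:
            yield r
            r = r.next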
Indices are B-Trees with variable-length string keys indexing TID values. More than one index entry may be present with the same key. Indices are maintained by the Model level for the names of entities in a domain, using the TID to reference the record representing the entity. The B-Tree indices on names are updated by the ChangeName and DeclareEntity operations, and may be used by the DomainSubset operation. Indices are also maintained as requested by the client for relations; these are updated by the SetF operation, and are used in the implementation of the RelationSubset operation. We describe DomainSubset and RelationSubset in Section 6.8.
6.2 Entities and relationships
It is worth noting that Cypress application programs reference database items via handles rather than the cursors provided by most database systems. A handle is returned by Cypress whenever an application program retrieves a relationship or entity from a database. Application programs may have any number of entity-valued or relationship-valued variables, each containing a handle that uniquely identifies the database object it currently references. Each call to NextRelship, GetF, etc., produces another such handle. In other database systems (e.g. INGRES or System R), references to tuples are made through cursors, with a higher overhead than handles. In such systems, client programs copy whole blocks of data from a client buffer into the database tuple at the current cursor position, or vice versa.
We do not claim that handles are any better than cursors; we emphasize the difference because it affects the design of Cypress. Handles have the advantage of tighter coupling between the programming language and DBMS: database items simply appear as values in the programming language. Handles have the disadvantage of requiring garbage collection to determine which database references are no longer in use; in a cursor-based DBMS, the client indicates this explicitly.
As already discussed, the Model level represents both entities and relationships as records. The records may be of three internal types: virtual, surrogate, or stored. Virtual records are system schema entities, e.g. DomainDomain or StringType. Stored records are schema or data records; these are the Storage level records we described in Section 6.1. They may represent relations, attributes, domains, relationships, or entities. Surrogate records are used in conjunction with the surrogate relation optimization, which we describe in Section 6.6.
Note that a TID can uniquely define an entity, just as a [segment name, domain name, entity name] triple does. However, TIDs are used only inside Cypress and are never provided to clients. The Entities (and Relships) returned to the client are pointers to records that contain the TID and other information; the client may not access these data, and the database system keeps a pointer to this record so that the record may later be modified.
Keeping TIDs a "secret" in this way has a number of advantages. Note that we can find all references to records in running programs, as we keep a list of the record handles in use by clients of the system. We can find all references to records within databases, as the group lists and indices can quickly be traversed to find references to a record, and we do not store TIDs outside of groups and indices. As a result, we can:
1. invalidate references to a record when it is deleted,
2. move a record and update all references to it (we do not yet do this), and
3. re-use TIDs formerly belonging to deleted records.
Among other things, these features allow us to get by with TIDs as short as 32 bits, as we need only as many TIDs as there are records in the largest possible database. (Actually slightly more TIDs, since there is internal fragmentation due to the fixed maximum number of records per page).
We never use TIDs for references across segment boundaries, although in principle they could be so used, since TIDs contain a segment number. The lack of cross-segment references makes segments independent, so that one segment can be destroyed or rebuilt without affecting others. It also avoids the need for a common network database mapping segments to the database address space; destruction or inaccessibility of such a common database would make all segments inaccessible.
6.3 Values
We chose to represent values as REF ANYs in Cedar. A REF ANY is a pointer to an object whose type is determined at run-time, via a type code stored with the object. REF ANYs have some inconveniences and inefficiencies, particularly the overhead of using a pointer when it is not required (e.g., with INTs) and the overhead of the run-time NARROW operation which coerces a REF ANY to a particular type. However, REF ANY seems the best solution, since convenience and simplicity of the interface are important.
Because only a few of the many datatypes made available to alpha clients of the original Cedar DBMS were widely used, we implement only REF INT, REF BOOL, REF GreenwichMeanTime, ROPE, and Entity values. Some other types, e.g. pointers to Mesa records, are still used by internal procedures but are not currently exercised by client programs.
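In Python terms, this run-time-typed value representation might be sketched as follows; the names are illustrative, and a float stands in for GreenwichMeanTime.

    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Entity:
        tid: int   # internal identifier; never exposed to clients

    # INT, BOOL, GreenwichMeanTime (as a float here), ROPE, and Entity values
    Value = Union[int, bool, float, str, Entity]

    def narrow(v: Value, expected: type):
        """Run-time coercion in the spirit of Cedar's NARROW."""
        if not isinstance(v, expected):
            raise TypeError(f"expected {expected.__name__}, got {type(v).__name__}")
        return v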
Note that in our system all Relship and Entity variables have the same Cedar type; i.e., there is not a different programming language type for each domain or relation. Run-time operations are provided to fetch the domain or relation to which an entity or relationship belongs. In the associative programming language LEAP (Feldman and Rovner [1969]), this run-time typing is taken to the extreme of treating a relationship's relation not as its type, but as simply another field of the record: relationships are not typed. If this approach were taken in our implementation, for example, RelationSubset would not need the first (relation) argument: the user could specify the desired relation as part of the AttributeValueList or omit it. Typeless relationships would provide little or no additional utility in our system, however, and in fact might complicate the world for a typical application. The utility of complete typelessness does not seem to justify its cost.
The run-time typing of relationships, entities, and values means that a fair amount of the code in the dozen Model level operations is concerned with error checking. If at the Query level we introduce a compiler for database operations, we can save execution time by performing many of these type checks at compile-time. Fortunately the caching of schema information avoids database accesses for most of this checking, so the current overhead is not overwhelming.
The Storage level implements strings of arbitrary and variable length. The length hint supplied to DeclareAttribute is used to decide how much space to allocate in the record itself, but the characters are stored elsewhere if the field overflows. It is currently not efficient to store strings with thousands of characters in Cypress if the strings are frequently updated, because the storage allocation function is not designed to deal efficiently with data chunks larger than a page or so. We are implementing a supplementary package to Cypress which provides efficient storage of these large text strings, such as the body of a document or program. The actual text is stored in conventional files; the file, position, and size of a text string is stored in the database where it is referenced. The supplementary package also implements a log of all database updates so that a database client may replay or back up database actions, independent of the database transaction mechanism.
6.4 Caching the data schema
In order to improve the performance of the database system, we cache information about the client’s data schema in a quickly-accessible form. On the first reference to an attribute, for example, the system saves the type of value the attribute demands, its relation, and its uniqueness. On subsequent uses the type-checking performed by GetF and SetF can be performed without access to the data schema stored in the database. The attribute cache information is flushed whenever an AbortTransaction or DestroyRelation call is invoked.
Schema caches are also maintained for commonly-accessed information about domains and relations (subtypes of a domain, indices on a relation).
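As a rough sketch, the attribute cache just described might look as follows; the field names and the stub schema fetch are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class AttributeInfo:
        value_type: str   # e.g. "Entity", "Rope", "Int"
        relation: int     # TID of the owning relation
        unique: bool      # whether the attribute is a key

    _attribute_cache: dict[int, AttributeInfo] = {}

    def _read_schema_record(attr_tid: int) -> AttributeInfo:
        """Stub standing in for a real fetch from the stored data schema."""
        return AttributeInfo("Entity", relation=0, unique=False)

    def attribute_info(attr_tid: int) -> AttributeInfo:
        info = _attribute_cache.get(attr_tid)
        if info is None:                      # first reference: consult the schema
            info = _read_schema_record(attr_tid)
            _attribute_cache[attr_tid] = info
        return info

    def flush_attribute_cache() -> None:
        """Invoked on AbortTransaction or DestroyRelation."""
        _attribute_cache.clear()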
The schema cache is distinct from the Segment level's LRU cache of data pages. Although the schema is cached just as any other data at the Segment level, we wish to avoid all database-operation overhead when examining schema entities. The schema-caching optimization makes a considerable improvement in Cypress performance: it gives a factor of ten speedup in GetF in nearly all database applications.
Since the Storage level does not know how to deal with system records such as the DomainDomain or IntType, the Model level must make exceptions for these entities in the implementation of a number of operations. There are currently a few limitations in the way this is done; for example, the DomainDomain is not actually in the enumeration of DomainSubset[DomainDomain], and user-defined Relships may not reference system entities. GetDomainRefAttributes and DomainSubset on the DomainDomain and RelationDomain had to be coded specially, and so on. Since Datatypes are system entities not stored in the database, the underlying Storage level cannot store references to them; instead, integers corresponding to IntType, StringType, and so on are stored, and these are mapped to the system entities when passed to or from the client program.
The data schema is stored in four underlying Storage-level recordtypes: the domain, relation, attribute, and index recordtypes. The domain recordtype has as attributes the domain name and the recordtype handle used by the underlying Storage level. The relation recordtype has the same two attributes, plus a special attribute which indicates whether the surrogate relation optimization has been performed on the relation. The attribute recordtype has several attributes, indicating each attribute's type, uniqueness, and relation, and, if applicable, its length and the index it participates in. Indices are represented in an index recordtype.
There are some ordering constraints on schema definition, as noted in Section 3. All of these constraints could be removed by enough special-case code, including deletion of attributes. We chose to pass the buck to the data schema editing tool described in Section 7.2 for now.
6.5 Links and co-location
As discussed in Section 6.1, relationships with entity-valued attributes declared with link=Linked are inserted in a doubly-linked list (a group) with the entity they reference at the head of the list. Thus groups provide an efficient way to find all references to an entity via a particular attribute. There are three other possible values for the link hint: Unlinked, Colocated, and Remote.
Colocated is the same as Linked, except that an attempt is also made to store each referencing relationship on the same file page as the entity it references. This provides even faster access to an entity from a relationship, or vice versa. Only one attribute of a relation can be declared colocated; relationships which reference an entity in this attribute are colocated with the entity and with each other.
Colocation must be performed by DeclareRelship when it creates a new relationship for which a value has been specified for the colocated attribute. (SetP calls DeclareRelship, so it achieves the same result; but SetF causes no colocation, as the record has already been allocated.) When DeclareRelship is called to create a relationship with value e for a colocated attribute a, it proceeds as follows (a sketch follows the list):
1. DeclareRelship attempts to store the new relationship on the same page where e is stored;
2. if that page is full, DeclareRelship attempts to store the new relationship on the same page as the most recently created relationship referencing e in a;
3. if that page is also full, or there are no other relationships referencing e in a, then the relationship is stored anywhere.
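A minimal sketch of this page-selection policy, under a hypothetical Page type:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Page:
        free_bytes: int

    def choose_page(entity_page: Optional[Page],
                    last_sibling_page: Optional[Page],
                    allocate_anywhere: Callable[[], Page],
                    record_size: int) -> Page:
        """Pick the page on which to store a new colocated relationship."""
        for page in (entity_page, last_sibling_page):   # steps 1 and 2
            if page is not None and page.free_bytes >= record_size:
                return page
        return allocate_anywhere()                      # step 3: any free page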
One can imagine a number of variations on this algorithm; we have not experimented with them at the time of this writing. One might profitably reserve pages for relationships referencing an entity on the basis of an estimate of the space that will be required.
NOTE: colocation is not yet implemented at the time of this writing. Relationships are simply placed on pages in the order they are created. We currently achieve some approximation to colocation by dumping a segment in textual form, with all relationships that reference an entity grouped together, and then reconstructing the segment from the dump file.
For Unlinked attributes, the string name of the referenced entity is stored in the relationship rather than maintaining a group list. Thus groups cannot be used to find an entity from an unlinked reference to it, nor vice versa. When an attribute is declared Unlinked, the database client typically declares an index on the attribute with DeclareIndex, thereby making access to a relationship reasonably fast (logarithmic in the size of the relation instead of a single page access, but rarely more than a few page accesses). In some cases an index is actually better than groups: indices can use less space if entity names are short (groups require three 32-bit TIDs per record), and if TIDs are simply being compared for equality, e.g. for a relational join, the TIDs may be compared directly in B-tree pages rather than fetching the pages containing the records themselves.
Note that when GetF is performed on an unlinked attribute, a B-Tree lookup must also be performed, to map the entity name onto the entity record that is returned. If GetFS is used, this lookup is bypassed.
Remote indicates that an attribute may reference entities in other segments (in a different segment than the relation). Remote attributes are currently treated much like Unlinked: the name is stored instead of a physical link (recall that we do not want links, which are based on TIDs, to cross segment boundaries). Unlike an Unlinked attribute, however, a Remote attribute must store the segment and domain name in the reference as well as the entity name. The Remote link hint is not technically necessary, as the Model level could determine how to store an entity-valued attribute "on the fly"; it greatly simplifies the implementation, however.
If an index is not declared on an unlinked or remote attribute, the RelationSubset operation will take time linear in the size of the relation to process a query whose attribute value list asks for a particular entity value for the attribute. An unlinked and unindexed entity-valued attribute would be useful only if the relation is known to be of a small fixed size, or queries are never performed on the attribute.
A time comparison of three simple cases of entity-valued attributes follows: an attribute declared linked but not indexed, an attribute declared indexed but not linked, and an attribute that is neither linked nor indexed. We show the times for the three common operations of fetching the attribute, setting the attribute, and doing a DeclareRelship operation with the OldOnly option to find the relationship with a particular value for the attribute (this is equivalent to a RelationSubset and NextRelship). These and all other measurements were performed on the Xerox Dorado computer; see Section 8.2 for details.
Operation      linked   indexed   unlinked, unindexed
GetF           1 ms     4 ms      1 ms + N*.5 ms (N = domain size)
SetF           7 ms     40 ms     4 ms
FetchRelship   4 ms     32 ms     about N*.5 ms (N = domain size)
These measurements were performed on relationships whose pages were already in the Segment level's cache; thus the "linked" column is near the ideal time one would obtain if records were colocated. The difference between linked and colocated times depends on a particular application's data schema, of course. A page read or write requires approximately 60 ms, so GetF and SetF could take as much as 60 ms in the worst case. For typical applications the payoff from colocation is much smaller than that, but it still seems to produce more than a factor of two speedup in both GetF and SetF.
If the surrogate relation optimization is applicable, the basic operations are still faster. GetF on a linked entity-valued attribute, for example, averages about 0.3 ms. This compares to 1 ms in the above table.
Note that the introduction of entities to the relational model allows the DBMS to make some intelligent guesses as to where to use colocation and links. By default, an attempt is made to link and colocate relationships with the entities they reference. This may be overridden by the client, of course.
Our current database applications most frequently retrieve together all or most of the relationships, from a variety of relations, that reference a particular entity. Most implementations of the relational model store the records of a relation contiguously; that arrangement results in many more page reads than ours for the entity-centric access we find most common.
6.6 Surrogate relations
The surrogate relation optimization is used when DeclareAttribute is called to define the first attribute of a relation, and: (1) the attribute is entity-valued, (2) it is a primary key of its relation, (3) its type is a domain with no sub-types, and (4) that domain does not yet have any entities defined in it. The relation is then specially tagged as being a surrogate relation, and the 2nd through Nth attributes will be added to the domain’s recordtype, i.e. the data will be stored in the entities rather than separate records. The relation itself will not exist as a recordtype, and any relationships in the relation will be represented as surrogate records. Thus the records that represent entities may have fields in addition to the entity name, containing surrogate relation attributes.
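The eligibility test might be sketched as follows, with hypothetical schema objects standing in for the Cypress data schema:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Domain:
        name: str
        subtypes: list = field(default_factory=list)
        entity_count: int = 0

    @dataclass
    class Attribute:
        is_entity_valued: bool
        is_primary_key: bool
        target: Optional[Domain] = None   # the domain the attribute references

    def surrogate_applicable(first_attr: Attribute) -> bool:
        """May the relation's data be folded into the target domain's records?
        `first_attr` must be the first attribute declared for the relation."""
        return (first_attr.is_entity_valued                 # (1) entity-valued
                and first_attr.is_primary_key               # (2) a primary key
                and first_attr.target is not None
                and not first_attr.target.subtypes          # (3) no sub-types
                and first_attr.target.entity_count == 0)    # (4) domain still empty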
The data schema must record the relations for which the surrogate relation optimization has been performed, for internal use. This information can also be useful to select clients such as the data schema update tool.
When compared to colocation (i.e., the first attribute declared Colocated), the surrogate relation optimization is an improvement in two ways. First, a surrogate relationship uses less space (there is no additional record and link overhead), which may allow the information to be stored on the same page; avoiding an additional page access (60 ms) can make a considerable difference in an operation that otherwise takes less than 1 ms (GetF). Second, accessing the field directly rather than following the link to another record is somewhat faster (currently by about .4 ms) even when the other record is on the same page.
The effect of the surrogate relation optimization on Cypress performance is of course dependent upon its applicability to a particular application's data schema. In general, we find that the surrogate optimization can be performed on about half the relations in the schema. For the electronic mail database application, construction of database entries runs about 30% faster with the surrogate relation optimization enabled (instead of just colocation). Other operations, e.g. displaying a message, show an equal or better improvement.
We could allow database clients to indicate specifically when the surrogate relation optimization should be performed, as we do with links. The optimization appears desirable whenever possible, however, so this is unnecessary.
Note the restrictions mentioned earlier on the application of the surrogate relation. The restriction that a domain have no subtypes is made to simplify the problem of assigning field positions in the recordtypes. The restriction that the domain be empty is made to avoid reconstructing or marking existing entities which would not have the field. The latter restriction is similar to one on relations, that attributes may only be added to a relation while it is empty. The schema editing tool described in Section 7 makes such restrictions invisible by copying the domain or relation when necessary.
6.7 Basic operations
In this subsection, we discuss more details of the basic operations on entities and relationships.
DeclareEntity and DeclareRelship simply create a record in the underlying domain or relation recordtype, updating any associated index entries. The DeclareEntity operation uses the DomainSubset operation to determine whether an entity with the given name already exists. DeclareRelship similarly uses RelationSubset to determine whether the given relationship already exists, unless version is NewOnly.
A small complication is introduced by the interaction of the surrogate relation optimization and the use of indices. The attributes of relationships of a surrogate relation are stored as "attributes" of entities in the target domain, and index entries reference the entities. This would suggest that we must create index entries when the entities are created. The semantics of null values let us do better, however: no index entries need be made until surrogate relationships are actually created and values assigned to their attributes.
GetF and SetF must make special checks for the surrogate relation optimization; when this optimization has been performed, the storage level "field handles" for attributes of the relation actually reference fields of the domain, in which the relation’s attributes are stored. With SetF, a particular constraint has currently been added: the key attribute of a record must be assigned before others, so that the Model level knows which entity in the domain is being referenced. Note this would not be a problem if we disallowed SetF, requiring that whole relationships be created or destroyed as units.
Type checking is necessary only for one operation, SetF. If the domain sub-type hierarchy is not used, i.e., an entity stored in an entity-valued field is precisely of the domain required by the attribute, the check is fast. Furthermore, if type-checking is thwarted altogether, i.e., the attribute is of type AnyDomainType, the check is even faster: the system need only check that the base type (i.e., entity-valued, string-valued, etc.) is correct. In the case that a check on the attribute’s domain type fails, SetF recursively searches up the hierarchy from the actual entity type looking for the attribute’s type. It fails upon finding no more super-types. (Searching upward is faster than searching downward for randomly-selected examples in the domain type lattice in the largest application schema we have so far.) More efficient schemes could be implemented if necessary, e.g., pre-computing the super-domains and redundantly storing the list with each domain.
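A minimal sketch of the upward search, assuming each domain records a list of its super-domains:

    from dataclasses import dataclass, field

    @dataclass
    class DomainType:
        name: str
        supertypes: list["DomainType"] = field(default_factory=list)

    def type_ok(actual: DomainType, required: DomainType) -> bool:
        """Search upward from the entity's actual domain for the
        attribute's declared domain; fail when no super-types remain."""
        if actual is required:
            return True
        return any(type_ok(s, required) for s in actual.supertypes)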
SetF must update group lists when they are being maintained for entity-valued attributes, as described earlier. Conversely, if links are not maintained for an entity-valued attribute, GetF must perform an index lookup to map the stored entity name into the entity handle returned to the client (GetFS need not do this lookup, however, as it is only required to return the name). SetF uses RelationSubset to determine whether key constraints are violated by the new attribute value.
The algorithm for DestroyRelship is relatively simple. The Storage level operation to destroy the record automatically deletes the record itself and removes it from the linked list maintained through all records of its recordtype. Variable-length strings which have overflowed the space allocated in the record (the length argument to DeclareAttribute) are stored in separate pseudo-records that must also be de-allocated. Also, any index entries for the destroyed Relship must be deleted, and the relationship must be removed from any group lists in which it participates.
DestroyEntity is similar to DestroyRelship, but any relationships that reference the entity must also be destroyed. There are three special cases of relationships referencing an entity (a sketch combining them follows the list):
1. Surrogate relationships stored in the entity record itself must be destroyed; the comments about deleting variable-length strings and updating groups and indices apply as in DestroyRelship.
2. Relationships linked to the entity via a Linked attribute may be found and deleted quickly.
3. Relationships referencing the entity via an Unlinked attribute must be found by a RelationSubset operation on attributes that can reference an entity of the kind we are deleting. A group list of these unlinked attributes is maintained in the database for each domain, to make this operation more efficient.
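Putting the three cases together, DestroyEntity might be sketched as follows; db is a hypothetical facade over the operations described above, so the helper names are illustrative.

    def destroy_entity(db, e) -> None:
        # Case 1: destroy surrogate relationships stored in the entity record
        # itself (freeing overflowed strings, updating groups and indices).
        db.destroy_surrogate_fields(e)
        # Case 2: relationships reaching e through a Linked attribute are on
        # e's group lists and can be deleted directly.
        for rel in db.linked_referencers(e):
            db.destroy_relship(rel)
        # Case 3: for each Unlinked attribute that can reference e's domain
        # (kept in a per-domain group list), query for referencing relationships.
        for attr in db.unlinked_attributes(e.domain):
            for rel in db.relation_subset(attr.relation, [(attr, e.name)]):
                db.destroy_relship(rel)
        # Finally remove the entity record and its index entries.
        db.delete_record(e)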
The Eq operation returns TRUE iff its arguments have identical TIDs. If its arguments are from different segments they will have different TIDs, of course, as the segment numbers are stored as the high order bits of the TIDs. The correspondence between segment numbers and atoms is maintained to implement the SegmentOf operation.
The procedure GetAllRefAttributes is implemented using a special Storage level operation that returns all the groups in which an [entity] record participates. As unlinked attributes will not be found by this operation, they must be added to the list returned by GetAllRefAttributes as well.
6.8 Aggregate operations
In this section, we discuss details of the DomainSubset and RelationSubset algorithms that have not already been covered.
The algorithm for DomainSubset is relatively simple. If a name or range of names is specified in the call, the index for the domain is used; otherwise the linked list the Storage level provides through all the records of the domain recordtype is used. If it is requested that subdomains be searched, a DomainSubset operation is invoked on each in sequence. Whatever the method, the information about the enumeration is packaged up in the EntitySet returned so that NextEntity can sequence through the indices, lists, or domains as required.
The algorithm for RelationSubset is somewhat more complex than DomainSubset. We are given an AttributeValueList, composed of triples of the form [attribute, low, high] constraining the given attribute to have a value greater or equal to low and less than or equal to high. The high value is typically omitted and is then assumed to equal low. The high value must be omitted for entity-valued attributes, since a range of values makes no sense in that case. RelationSubset must choose an efficient manner in which to find the relationships satisfying the list.
The following strategies are used by RelationSubset, in this order (a sketch of the selection follows the list):
1. If the surrogate relation optimization has been performed on the relation, then (a) if a key-valued attribute is constrained to equal a particular entity in the attribute-value list passed to RelationSubset (the constraint list), we may return a single-element RelshipSet; (b) otherwise, we set a boolean in the RelshipSet (the one we will return) to indicate that, whichever of strategies 2, 3, and 4 is used below, NextRelship must convert the entities we obtain to surrogate relationships.
2. If one of the attributes in the constraint list is entity-valued, then we use the group list for the given entity value. That constraint is removed from the list, and the remaining reduced list is saved to be checked serially by NextRelship against records in the group list.
3. If there is an index on one or more of the attributes in the constraint list, the index containing the largest subset of the attributes is used to find the relationships satisfying that subset of the list. The remaining reduced list is checked by NextRelship, to filter the relationships enumerated from the index.
4. If all of the above strategies fail, the underlying Storage-level operation to enumerate all records of the relation's recordtype is invoked, and NextRelship serially selects the relationships satisfying the constraint list.
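The strategy selection might be sketched as follows; the plan tuples and the flags assumed on attribute and relation objects are illustrative, not the actual Cypress representation.

    def choose_access_path(relation, constraints):
        """constraints is a list of (attribute, low, high) triples."""
        convert_entities = False
        if relation.is_surrogate:
            for attr, low, _high in constraints:
                if attr.is_primary_key and attr.is_entity_valued:
                    return ("single", low)        # 1(a): one relationship, directly
            convert_entities = True               # 1(b): entities must be converted
        for c in constraints:
            attr, low, _high = c
            if attr.is_entity_valued and attr.linked:
                rest = [x for x in constraints if x is not c]
                return ("group", low, rest, convert_entities)   # 2: group list
        index = best_index(relation, constraints)
        if index is not None:
            return ("index", index, constraints, convert_entities)  # 3: index
        return ("scan", constraints, convert_entities)  # 4: whole recordtype

    def best_index(relation, constraints):
        """Pick the index covering the largest subset of constrained attributes."""
        attrs = {c[0] for c in constraints}
        scored = [(len(attrs & set(ix.attributes)), ix) for ix in relation.indices]
        scored = [(n, ix) for n, ix in scored if n > 0]
        return max(scored, key=lambda s: s[0])[1] if scored else None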
The NextRelship operation enumerates the appropriate group, index, or recordtype as specified by the RelshipSet that RelationSubset produced. As all three of these enumerations are bi-directional at the Storage level, PrevRelship is a straightforward inverse of NextRelship. Note the slight complication in NextRelship and PrevRelship introduced by the surrogate relation optimization. In that case, the indices, group lists, or recordtypes can be scanned as usual, but refer to entities rather than the surrogate relationships which have been mapped to them. Surrogate relationships are created on-the-fly.
The time required by the RelationSubset operation is of course dependent upon the path taken through the above algorithm, whether an index or group is scanned, and so forth. A typical application averages about 7 ms for a variety of RelationSubset calls. Thanks to the schema cache, only about half of this time is spent examining the data schema to check the validity of the query and determine which physical access path to use to process it.
The time for NextRelship is typically 0.4 ms, and is nearly the same for a group, index, or recordtype enumeration. Currently most of this time is for the Cedar allocation of the Relship handle itself! However, the NextRelship time will be much larger for a database application without a working set in the database page cache: the time to fetch pages will dominate.
We will not discuss the MultiRelationSubset operation and Query level here, as they have not yet been constructed. One possible plan would be to (1) build a parser at the Query level to map text queries into a tree representation, (2) make a major extension of the RelationSubset operation to optimize access path selection over the entire query, (3) make a minor extension to RelshipSets (and NextRelship) to allow them to act upon nested relations (still producing a single relation at the top level), and (4) build a Query-level formatter which prints the RelshipSet enumeration for the client.
6.9 Summary
The Cypress DBMS is built in four levels: the Model, Storage, Segment, and File levels; in this section we have concentrated on the implementation of the Model level. We summarized the optimizations made in that implementation:
1. caching the data schema,
2. use of B-tree indices upon entity names and relation attributes,
3. use of "group" links through relationships referencing an entity,
4. co-locating relationships which reference the same entity on the same page, and
5. storing relationships which reference an entity as surrogates in the entity itself.
Caching the data schema is crucial to good performance; almost all operations must examine the data schema to check the legality of the client’s request or to decide how to perform the operation. The payoff can be as much as 90%.
The payoff from the use of indices is variable, but they are crucial upon entity names and are often beneficial upon relations. There is a trade-off between using indices and links on relations; indices can be used on multiple attributes and use less space if the keys are short, but links are typically a factor of three faster.
It is enlightening to observe the progressively better performance for the GetF operation on an entity-valued attribute that is (1) indexed, (2) linked, (3) colocated, or (4) surrogate. As just noted, links are typically a factor of three faster than an index, as at most one page (other than the page on which the relationship resides) is accessed. Colocation typically produces another factor of three or so, though this varies widely with database application. The surrogate relation optimization provides another factor of two or so, by avoiding another tuple lookup (on the same page). The best time for a GetF is currently about .2 ms.
In Section 5, we showed how the Cypress data model was designed to provide a higher-level conceptual model, simplifying the development of database applications. In this section we have shown that a higher-level data model can also be used to improve the performance of a database system, by capitalizing on the additional knowledge of data semantics and interconnectivity. In Section 7 we will concentrate on the third advantage of a higher-level model: the more powerful tools and integration of database applications it enables.