** DRAFT ** The Cedar DBMS: a preliminary report
Inter-Office Memorandum
To: Cedar database committee        Date: November 23, 1980
From: Mark Brown                    Location: Palo Alto
Subject: Draft of SIGMOD paper      Organization: CSL
XEROX
Filed on: <CedarDocs>Database>SIGMOD.paper
<This claims to be a full draft of the paper. It is somewhat too long. The stated limit is 5000 words. I did a lot of cutting and rewriting of all sections, including my own.
We should not aim for total perfection here, but I am sure we can improve this some now. I think we should do a fuller version, maybe as a blue and white and for submission to a journal, at the time we revise this for the final conference version (model paper and all that). -- MRB>
1. Introduction
We formed a small group in early 1979 to discuss the design and construction of a database management system. The system was to run in the Cedar programming environment (which was itself being designed at the time), and its primary goal was to support various applications within our laboratory. The various members of the group had research interests in data models, physical database design, user interfaces, and other topics related to databases; building the system would give them a chance to try out some of their ideas in these areas.
A functioning database management system has now been implemented; we shall call it the Cedar DBMS. Several applications have been built using this system, which is still under development.
This paper will focus primarily on the most novel aspects of the Cedar DBMS. These include: its clean separation between transaction management and data management, its distribution of computation and data, and its interface based on dynamic rather than static storage allocation. To a large degree these design features are a function of our starting assumptions, concerning both the system’s intended applications and the system’s computing environment. Therefore we shall devote Section 2 to a description of these assumptions. Section 3 gives an overview of the design, and Section 4 discusses selected design issues in more detail. Section 5 gives some conclusions and mentions directions we see for future work.
Some conventions: in the sequel, we shall refer to programs that use a certain facility as clients of the facility. Human users of a service are called users. This sans-serif font will be used to emphasize identifiers that might appear in a program.
2. Background
2.1 Projected Applications
As mentioned above, our work in constructing the Cedar DBMS is motivated not only by our interest in DBMS research problems, but by a desire to address a range of applications that arise in our laboratory (CSL). These applications include computer-design automation, graphics and page imaging, knowledge representation for Artificial Intelligence, source program storage, and office information systems. By applying the Cedar DBMS to some of these problems, we can get direct feedback about the success of our system without the overhead of supporting the system for use outside of CSL.
The applications listed above present a range of requirements, in terms of the expected database size, necessary performance, most desirable data model and interface to the DBMS, and so on. Yet there are certain similarities among most of the applications. Apart from the AI application, none requires extremely high performance for a large number of accesses, but most would like the overhead for the first access to be low (for < 2 second interactive response). None of them involves extremely large volumes of data (10^4 - 10^6 data objects is the upper range). A typical query seems to involve only a small number (< 50) of related data objects. There is a strong requirement for a program interface to the DBMS, and less of a need for a standard user interface.
2.2 Computing Environment
Hardware. Our laboratory’s computing facility includes a large number of personal computers, connected by an experimental 3 Mbit/sec local network [Metcalfe & Boggs, 1976]. The local network is connected to other local nets, forming an internetwork [Boggs et al., 1979]. Many of the personal computers are Altos [Thacker et al., 1979] or similar machines; a growing number of higher-speed Dorado computers [Lampson & Pier, 1980] are also available. Each machine is equipped with a disk storage device.
Juniper. An important service provided in the internetwork environment is Juniper, the distributed file system [Israel et al., 1978]. Juniper is implemented by a cooperating set of file server computers. A client of Juniper performs file read and write actions to some set of files, possibly not all residing on the same machine. These operations can be grouped into a transaction to guarantee that they will occur as a single, atomic action with respect to other activity in the system. This atomicity is guaranteed even in the face of file server crashes.
Juniper uses locking to ensure the serial consistency implied by atomic transactions. For some of what follows it will be necessary to understand Juniper’s locking algorithm in some detail. When a client reads a sequence of bytes from Juniper, he implicitly obtains a read lock on the bytes; writing a byte range obtains a write lock. At a given instant of time, a single byte may be covered by any number of read locks and at most one write lock. An attempt to obtain a write lock on an already write-locked byte causes the second writer to wait; eventually, a waiter either proceeds because the conflict has disappeared, or times out and is aborted. These timeouts resolve potential deadlock situations.
When a transaction attempts to commit (finish and make its actions visible to the rest of the world), it may be holding a write lock that overlaps a read lock of another transaction. In this case the reading transaction will be sent a broken read lock indication. The reader must then either release the broken lock (to certify that it does not depend upon the data formerly protected by the lock) or abort itself.
After commit, a transaction loses all of the read and write locks that it had acquired. A transaction may choose to checkpoint rather than commit; in this case, the transaction first commits, then changes all of its write locks into read locks, and is able to proceed with these locks in force.
Juniper files are named by 64-bit IDs, which Juniper guarantees to be unique across the internetwork. A directory service, providing text names for files, is written as a Juniper application. Juniper provides standard facilities for file protection, based on access control lists. Being personal computers, the client machines cannot provide strong protection guarantees; therefore the physical separation of Juniper from client machines is useful.
Cedar. The pending arrival in CSL of new, more powerful hardware (the Dorado) forced the lab to evaluate what sort of programming environment should be provided on the new machines [Deutsch & Taft, 1980]. It was apparent that this programming environment must serve the needs of experimental programming: the production of moderate-sized systems that are usable by moderate numbers of people in order to test ideas about such systems. Our present programming facilities definitely limit the rate at which we can conduct such experiments.
The Cedar project is developing an advanced programming environment to serve as the basis for most of CSL’s programming in the next few years. Cedar is a single-language environment; the language is a derivative of Mesa [Mitchell et al., 1979] that we shall call Cedar Mesa. (In what follows, we shall use the name "Mesa" when referring to language features that are common to both Mesa and Cedar Mesa).
Cedar is perhaps best viewed as a coordinated confederation of projects (graphics, user interface, file system, database management system, etc.). Cedar provides a common language, operating system, and set of programming conventions to make the results of these (largely) separate efforts compatible. Hence an important thrust of the project is in developing techniques that make the sharing of such "packages" successful. On the other hand, there will always be situations where for efficiency or other reasons, it is necessary to integrate an important facility more tightly. In a single-language environment the barriers to tighter coupling of this sort are not high; for instance, facilities for multiple processes in Mesa evolved from being a separate package to being part of the Mesa language [Lampson & Redell, 1980].
We shall now describe the features of Cedar Mesa that have significant influence on the design of the Cedar DBMS. These features may be summarized as: interfaces, the type system, and garbage collection.
Systems built in Mesa are collections of modules. Definitions modules, often called interfaces, are the connectors between programs. Interfaces contain declarations of types, constants, procedures, and variables. Only the names and types of procedures are specified in interfaces, not their implementations. A program module, or program, contains a collection of variables and the implementations of procedures with direct access to those variables. A program may import an interface in order to use facilities defined there and implemented in other programs. A program may also export an interface, thereby making the interface procedures implemented by the program available to other programs.
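As a rough illustration of these conventions, an interface and an exporting program might look like the following sketch; the Stack interface and StackImpl program are invented for this illustration and are not part of Cedar.

Stack: DEFINITIONS =
  BEGIN
  Push: PROCEDURE [x: INTEGER];
  Pop: PROCEDURE RETURNS [x: INTEGER];
  END.

StackImpl: PROGRAM EXPORTS Stack =
  BEGIN
  contents: ARRAY [0..100) OF INTEGER;  -- variables with direct access only from this program
  top: CARDINAL ← 0;
  Push: PUBLIC PROCEDURE [x: INTEGER] = BEGIN contents[top] ← x; top ← top + 1; END;
  Pop: PUBLIC PROCEDURE RETURNS [x: INTEGER] = BEGIN top ← top - 1; RETURN[contents[top]]; END;
  END.

A program that listed Stack among its imports could then call Stack.Push and Stack.Pop with no knowledge of StackImpl’s variables.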
Because Cedar is a single-language environment, all components of the system are structured as programs connected by interfaces. For instance, the entire operating system is written in Mesa, and operating system facilities are presented to client programs via standard Mesa interfaces. It is therefore natural that a Cedar DBMS be accessible to Mesa programs through a Mesa interface.
Mesa is a fully typed language, much in the style of PASCAL. The type system allows the compiler to detect certain types of errors in using procedures or data items. The type system may be viewed as a form of protection: if a type and its operations are defined in an interface, a client must take extraordinary steps in order to apply a non-interface operation to a variable of this type. Mesa allows the programmer to breach the type system through the "loophole" construct. An operating system implementation needs loopholes in order to control devices and allocate storage; many application programs use no loopholes at all, and this is good Mesa style since it allows the compiler to provide maximum checking. Ideally, a Cedar DBMS should be able to store away typed data items, and return them to applications programs, without forcing these programs to contain loopholes.
The presence of interfaces and the type system may influence another bit of Mesa philosophy: no macro facility is associated with Mesa, and no preprocessors for Mesa are presently in use. The use of macros can make sharing programs very difficult, since it encourages the evolution of many similar but incompatible language dialects. A preprocessor into Mesa would surely generate some programs containing compile-time or run-time errors; all of the user’s tools for diagnosing such problems would operate in terms of the Mesa code produced by the preprocessor, not the preprocessor’s source code. A Cedar DBMS that required a preprocessor would therefore be an oddity in the Cedar environment.
The two principal extensions to Mesa that are present in Cedar are runtime types and garbage collection. In Mesa, all type checking is done at compile time, and types tend to be highly specific. This makes it difficult to write a general package for, say, manipulation of lists. Cedar Mesa contains dynamically-typed pointers (REF ANY), along with a facility for examining types at runtime, to address this problem.
Storage management is a constant problem in writing general packages. It is difficult for an interface to describe the precise responsibility that each party to the interface has for storage allocation and deallocation; this gives rise to problems of dangling pointers and storage leaks. Cedar Mesa includes a garbage-collecting storage manager to deal with this problem, and we expect many applications programs to use garbage-collected storage exclusively.
3. Design overview
3.1 General Approach
The Cedar DBMS is a relatively simple, low-level system. In considering alternative, higher-level designs we felt incapable of choosing an appropriate high-level data model for the DBMS without actually trying to implement some of our intended applications. By building a simple system and some applications quickly we hoped to gain the experience necessary to design a practical future system with a higher degree of data independence.
We also hoped that in addition to supporting some applications and furnishing us with useful experience, the Cedar DBMS would contribute some software components toward the development of a future high-level system. Strict upward-compatibility would be very pleasant, but is not required. This freedom is a result of having a limited number of very cooperative users.
While we characterize our system as low-level, it does in fact support a useful degree of abstraction from the representation of data items in ferrous oxide. For instance, clients are never exposed to disk addresses or the equivalent ("tuple IDs").
We have borrowed ideas freely from the Research Storage System (RSS), a component of System R [Astrahan et al., 1976]. As in that system, we do not address many of the real issues of very large databases, since these are not expected to be significant in our applications. We are more concerned with fast access at the individual tuple level for moderate-sized databases.
Client access to the Cedar DBMS is through a standard Mesa interface. This is the only design that is consistent with the goal of building both the system and some applications quickly.
3.2 System Levels
The Cedar DBMS is constructed in four levels: the tuple level, storage level, segment level, and file level. This represents a strict hierarchy in that each level communicates only with adjacent levels, and does so through Mesa interfaces.
The file level exports DBFile. This interface contains operations to create a file, set a file’s length (in pages), read or write a page from a file, etc.
The segment level exports DBSegment, an interface that supports the abstraction of a database as a uniformly addressable set of pages. Such databases consist of a number of distinct segments, which are independent units for allocating database pages. DBSegment contains operations to make a database addressable, to create a new segment as part of the current database, to allocate or free pages from a particular segment, to generate the address of an in-core copy of page p of the current database, etc.
The storage level exports DBStorage. DBStorage supports abstractions such as tuple and index, and a set structure on tuples that is equivalent to RSS’ binary links. It provides operations to copy primitive types such as byte strings, numbers, and word arrays to and from tuple fields.
The tuple level exports DBTuples, the Cedar DBMS’ only client interface. DBTuples will be discussed in detail in the next section. The primary responsibilities of the tuple level are to implement the "system tuple sets" (data dictionary), to maintain indexes in the face of tuple updates, and to perform access path selection for retrieval requests. It also translates between Mesa types and the primitive types supported at the storage level.
4. Design specifics
4.1 Client Interface
For reasons of space we cannot give a complete listing of DBTuples here. To aid in the exposition we shall intersperse the most relevant portions of the interface with the description below.
4.1.1 Tuples
The primitive data type provided by DBTuples is the tuple. The DBTuples notion of tuple is fairly standard; it is not too far wrong to think of a tuple as a row of a table. A tuple consists of a set of fields, each with an associated value. Each tuple belongs to a type, called a tuple set. A tuple set is a template for making tuples: it has a fixed set of named attributes, and each attribute has a value type. Each field of a tuple is in 1-1 correspondence with an attribute of the tuple’s tuple set, in that it is accessed by the attribute’s name and its value is constrained to match the attribute’s value type.
The system allows a limited set of Mesa types as value types: INTEGER, LONG INTEGER, STRING, and so on. Variable-length values, such as strings, do not require a specific upper bound on their length. These Mesa types are uninterpreted by the DBMS, apart from some knowledge of ordering among values (as required for indexing). The system also supports an interpreted value type, tupleRef. A tupleRef is a reference to a tuple from any tuple set. The presence of tupleRef values makes the Cedar DBMS non-relational.
Various procedures in DBTuples take tuples as parameters or return tuples as results. The interface type is
Tuple: TYPE = REF TupleObject;
TupleObject: TYPE;
That is, a client program is given a pointer ("handle") to a data object whose structure is not revealed in the interface. This ensures that only procedures defined in the interface can be applied to tuples. For instance, to test two tuples for identity (a distinctly non-relational operation), we call
Eq: PROCEDURE [t1, t2: Tuple] RETURNS [BOOLEAN];
The four primitive operations on tuples are to create or destroy tuples, and to retrieve or store tuple fields. These four are also the primitive operations on records in most programming languages, but not in the application programmer’s interface to most data management systems.
CreateTuple: PROCEDURE [ts: TupleSet] RETURNS [t: Tuple];
DestroyTuple: PROCEDURE [t: Tuple];
SetF: PROCEDURE [t: Tuple, a: Attribute, v: Value]; -- t.a ← v
GetF: PROCEDURE [t: Tuple, a: Attribute] RETURNS [v: Value]; -- v ← t.a
Value: TYPE = REF ANY; -- runtime conversion to/from specific type
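As a sketch of how a client might manipulate tuples: the tuple set personTS, the attributes nameAttr and ageAttr, and the Values nameValue and ageValue below are hypothetical, and the construction of Values from specific Mesa types is not shown.

person: Tuple ← CreateTuple[personTS];  -- add a new tuple to the tuple set personTS
SetF[person, nameAttr, nameValue];      -- person.name ← the STRING held by nameValue
SetF[person, ageAttr, ageValue];        -- person.age ← the INTEGER held by ageValue
v: Value ← GetF[person, ageAttr];       -- read it back; the client converts v (a REF ANY) to an INTEGER
DestroyTuple[person];                   -- remove the tuple from the database; person becomes null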
The system also includes a notion of "null tuple", and a predicate to test a Tuple for equality with the null tuple. Null tuples are generated for a variety of reasons. An uninitialized tupleRef field is null, as is a tupleRef that once pointed to a tuple that was later destroyed. Also, if a client generates two Tuples t1 and t2 such that Eq[t1, t2], then DestroyTuple[t1] causes both t1 and t2 to become null.
Null: PROCEDURE [t: Tuple] RETURNS [BOOLEAN];
An interesting problem arises in the implementation of tuples. A Tuple is a REF to a TupleObject, a record whose structure is hidden from the client; the intention is that one of these records R should be reclaimed by the garbage collector when the last client REF to R is destroyed. The problem is that the DBMS wishes to retain its own REF to R for as long as any client REF remains. The reason for this is as follows. The representation of R includes a tuple ID, which is in effect a reference to a particular location on the disk. When a client calls DestroyTuple, it is important that all such references be invalidated. Otherwise the client might use the reference to perform a tuple access after the disk storage for the destroyed tuple had been used to create some new tuple. There seem to be two possibilities: either ensure that all such disk references are unique, or be prepared to find all such references when invalidation of one is necessary. But guaranteeing uniqueness also requires the ability to find all references. So in either case, the DBMS needs to maintain a REF to a TupleObject for longer than the client. This seems to frustrate the desire for garbage collection.
Cedar’s solution to this problem is to allow a user-written finalization procedure to be associated with each collectible record type. The garbage collector arranges for the finalization procedure to be called when the number of REFs to a TupleObject (say) reaches a specified value greater than zero, say 2. If the DBMS always holds exactly 2 internal REFs to each TupleObject, the finalization procedure then knows that no client REFs remain, and can destroy the system’s REFs. Later, the collector finds that the number of REFs to the TupleObject has reached zero, and reclaims it.
4.1.2 Retrieval
DBTuples provides two forms of retrieval from a tuple set to yield a collection of tuples. A result of the first form, a match set, contains all tuples whose field values fall into specified ranges. A result of the second form, a ref set, contains all tuples with a specific tupleRef field pointing to a particular tuple. Both kinds of retrieval are implemented as generators: an initial call is made to define the query, and this returns a handle for the match set or ref set. Then succeeding calls to the "next" operation are used to scan through the elements of the set. When all elements of the set have been enumerated, the null tuple is returned.
ForMatchSet: PROCEDURE [ts: TupleSet, tm: TupleMatch] RETURNS [ms: MatchSet];
NextMatch: PROCEDURE [ms: MatchSet] RETURNS [t: Tuple];
ForRefSet: PROCEDURE [t: Tuple, a: Attribute] RETURNS [rs: RefSet];
NextRef: PROCEDURE [x: RefSet] RETURNS [y: Tuple];
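A typical enumeration therefore looks like the following sketch; the tuple set personTS is hypothetical, and the construction of the TupleMatch tm is omitted.

ms: MatchSet ← ForMatchSet[personTS, tm];  -- define the query
t: Tuple ← NextMatch[ms];
UNTIL Null[t] DO
  -- ... process t, e.g. v ← GetF[t, someAttr] ...
  t ← NextMatch[ms];
  ENDLOOP;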
Match sets and ref sets are represented in the same way as tuples — the client is given a handle on a dynamically-generated, garbage-collected opaque object. A tuple is not bound in any way to the match set or ref set that generated it. In fact, a client is free to build arbitrary Mesa data structures that include Tuples. However, these structures only remain valid during a single invocation of the client; it is meaningless to store a Tuple in a database.
A common use of ForMatchSet is to retrieve a tuple by its unique name (some application-defined combination of attributes), so DBTuples provides a more specialized procedure for this case. If the match set contains either no tuples or more than one tuple, the system raises an exception (Mesa SIGNAL) for the client to deal with.
FetchTuple: PROCEDURE [ts: TupleSet, tm: TupleMatch] RETURNS [t: Tuple];
NotFound, MultipleMatch: SIGNAL;
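For example, a client might look up a person by unique name as in the following sketch (personTS and nameMatch are hypothetical); the catch phrase on the call transfers control to the EXITS label if no matching tuple exists, while MultipleMatch is allowed to propagate.

BEGIN
person: Tuple;
person ← FetchTuple[personTS, nameMatch ! NotFound => GOTO noSuchPerson];
-- ... use person ...
EXITS
noSuchPerson => NULL;  -- handle the missing-tuple case here
END;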
In some applications it is useful to perform the following query: "given tuple t, find all tupleRef attributes a such that there exists a tuple ta with ta.a = t, i.e. ta’s field a references t". This is especially useful when a database makes use of the fact that a tupleRef is not constrained to reference a tuple from any particular tuple set. To support such network-style databases, DBTuples includes this operation.
GetAllRefAttributes: PROCEDURE [t: Tuple] RETURNS [al: AttributeList];
4.1.3 Data definition; starting and stopping
The client’s data schema, i.e. the tuplesets and their attributes, is described in the database itself. Tuples that represent tuple sets and tuples that represent attributes are referred to as dictionary tuples. In addition, it is possible for the client to specify what indexes (implemented as B-Trees) he would like to have maintained on tuple sets to increase the performance of the system; the indexes and the attributes that make up the index’s key (termed index factors) are also represented as dictionary tuples in the client’s database.
TupleSet, Attribute, Index, IndexFactor: TYPE = Tuple;
In addition to data tuples, and the dictionary tuples that describe the data schema for data tuples, the tuple level exports system tuples that define the data schema for dictionary tuples. These are simply variables in the DBTuples interface, and their values are manufactured during system startup. There are 17 system tuples, consisting of 4 tuple sets with a total of 13 attributes. The system tuple sets are:
TupleSetTS, AttributeTS, IndexTS, IndexFactorTS: PUBLIC READONLY TupleSet;
Read operations such as FetchTuple and GetF work equally well on system, dictionary, or data tuples. We do not allow the client to modify the data schema by directly modifying the data in dictionary and system tuples; instead, DBTuples includes specialized procedures for performing such updates. A fundamental difficulty with allowing clients to update even selected fields of dictionary tuples is the fine granularity of updates: an individual update may temporarily leave the dictionary in an inconsistent state, but the DBMS may access the dictionary before it becomes consistent. The data definition procedures include
CreateTupleSet: PROCEDURE [name: STRING, owner, segment: STRING← NIL] RETURNS [TupleSet];
CreateAttribute: PROCEDURE [ts: TupleSet, name: STRING, type: ValueType← null] RETURNS [Attribute];
CreateIndex: PROCEDURE [ts: TupleSet, order: TupleOrder] RETURNS [Index];
Analogous procedures destroy tuple sets, their attributes, and indexes.
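As a sketch, a client might define a Person tuple set as follows. All names are hypothetical; stringType and intType stand for ValueType values, and nameOrder for a TupleOrder, whose construction is not shown.

personTS: TupleSet ← CreateTupleSet["Person"];  -- owner and segment are defaulted
nameAttr: Attribute ← CreateAttribute[personTS, "name", stringType];
ageAttr: Attribute ← CreateAttribute[personTS, "age", intType];
nameIndex: Index ← CreateIndex[personTS, nameOrder];  -- maintain a B-Tree index on name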
Finally, DBTuples includes procedures for starting and ending a database session, and procedures to allow updates made in a session to be committed or aborted. A commit or abort applies to all changes since the last open, commit, or abort.
OpenDatabase: PROCEDURE [
useTransID: Transaction← NIL,
userName, password, databaseName: STRING← NIL]
RETURNS[transID: Transaction, welcomeMsg: STRING];
CloseDatabase: PROCEDURE; -- commit
MarkTransaction: PROCEDURE; -- commit
AbortTransaction: PROCEDURE;
All parameters of OpenDatabase may be defaulted, allowing simple Mesa programs to obtain a scratch database in which to work.
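A minimal client session might therefore be sketched as:

[] ← OpenDatabase[];  -- all parameters defaulted: a scratch database under a new transaction
-- ... define tuple sets, create and retrieve tuples ...
MarkTransaction[];    -- commit the changes made so far without ending the session
-- ... further updates ...
CloseDatabase[];      -- commit and end the session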
4.2 Distribution of Computation and Data
We use the Juniper file system to implement DBFile, the file component of the Cedar DBMS. The rest of the system runs on the client machine, i.e. on the same machine as the application program that imports DBTuples. This means that network communication takes place at the level of database page accesses. On a fast local network, the extra communications overhead is less than the cost of a disk access. Furthermore, there is the possibility of caching pages locally, since each personal computer includes a disk unit.
The alternatives to this choice would have been to run the Cedar DBMS on each Juniper server, or to provide a number of separate machines as "database servers". The first alternative is impractical because Juniper is not written in Cedar Mesa; the second was rejected on the grounds that performance would be unacceptable. This is largely a function of the fact that DBTuples is a tuple-at-a-time interface. A server that processes queries written in a higher-level language would be practical. A "DBTuples server" may become more attractive with the advent of a fast remote procedure call facility [Nelson, 1981]; this would also provide access to Cedar databases from non-Cedar programs.
We chose to allow a database to be structured as a collection of files, instead of making a database consist of a single file. A number of considerations entered here: performance, size, data distribution, and protection. We can get the file server to preserve the locality of files on the disk, so by placing more closely related information in a single file we can get better performance. A Juniper file cannot span physical volumes, so in order to create a database larger than a volume we need multiple files. We provide data distribution at a low level by allowing several files, perhaps located on different file servers, to contribute to the same database. We also provide low-level protection by using Juniper’s file protection facilities.
It is worth expanding on the protection issue. Protecting parts of databases by protecting files seems like a weak idea. Much more natural protection facilities can be provided at a high level, as is done by associating protection with views in System R [Griffiths & Wade, 1976]. But our system has nothing corresponding to the RDS level of System R. Protection facilities built into DBTuples would almost certainly have to be thrown away in building a system with a high-level data model. So we chose not to build protection in DBTuples. The main protection issue in our environment is not privacy, but inadvertent modifications, especially by undebugged programs. We want to provide natural barriers between applications, but not to make these barriers impermeable.
A segment is the database unit that lives in a single file. A tupleset (including all of its dictionary tuples) must be contained in a single segment, but a segment may contain any number of tuplesets. A database is a set of segments; a segment belongs to exactly one database.
4.3 Transaction Management and Concurrency Control
In designing the Cedar DBMS’ concurrency control mechanism we relied heavily on the concurrency control facilities already provided by Juniper. We wanted database transactions, like Juniper transactions, to guarantee atomicity and hence serial consistency; we did not feel it necessary to support lower degrees of consistency [Gray, 1980].
To first approximation, the DBMS operates as follows: it opens a Juniper transaction during DBTuples.OpenDatabase, performs page reads and writes as required by the client’s DBTuples operations, and commits the Juniper transaction during DBTuples.CloseDatabase. Juniper maintains locks as described in Section 2.2, and may abort a transaction due to conflicts. It is the client’s responsibility to retry an aborted transaction if desired, and also to invalidate any writes made to other media (e.g. a display terminal) based on information from the aborted transaction. Note that in this scheme, locks are held and lock processing is performed on the machine that stores files, not on the client machine that executes DBMS code.
Giving the file server all of the responsibility for locks, and thus locking physical units (pages), has some interesting consequences. For instance, it makes the published algorithms for concurrency control in B-Trees [Bayer & Scholnick, 1977] inapplicable to the Cedar DBMS; from the file server’s point of view, a B-Tree page is no different from any other page. We have tried to reduce the likelihood of unnecessary lock conflicts by segregating unrelated data structures that are updated frequently, such as TupleSetTS tuples, on separate pages. Another potential locus of contention is the page allocator for a segment (i.e., extending a file). This contention can be reduced by employing separate allocators for individual logical structures within a segment (e.g. tuple sets), and having these employ the segment’s allocator only when necessary.
An important benefit of using a transaction-based file server in this way is that transactions can span ordinary files as well as databases. For instance, in a database representing a file directory we’d like the combined operation of (1) creating a new file and then (2) making a directory entry for it to be atomic, avoiding an orphan file if the system crashes between (1) and (2). To allow this kind of application to be written, Cedar DBMS clients can perform Juniper actions under the same transaction as the DBMS is using for database page transfers.
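A sketch of this file-directory example follows. JuniperCreateFile stands in for whatever Juniper operation creates a file under a given transaction (the actual Juniper interface is not shown in this paper), and directoryTS and fileAttr are hypothetical.

trans: Transaction;
msg: STRING;
[trans, msg] ← OpenDatabase[databaseName: "FileDirectory"];  -- hypothetical directory database
fileValue: Value ← JuniperCreateFile[trans];  -- (1) create the file under the same transaction
entry: Tuple ← CreateTuple[directoryTS];      -- (2) make its directory entry,
SetF[entry, fileAttr, fileValue];             --     recording the new file's ID
CloseDatabase[];                              -- commit: either both (1) and (2) happen, or neither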
To obtain better performance, the Cedar DBMS actually uses an elaboration of the file-accessing method described above. First, we maintain a cache on the client machine of recently-accessed pages. This cache is implemented from virtual memory, backed by local disk storage, and provides two benefits. For cache hits on core-resident pages, the access time is obviously very fast. For cache hits that generate paging activity to the local disk, the access time is comparable to the time to access a page on a lightly-loaded Juniper server, but the local disk access generates no load on the remote server. Hence caching pages on the local disk improves the performance for the entire community of users.
The second elaboration is that for a sequence of database transactions we can checkpoint, rather than commit, at the end of each database transaction except the last. This holds read locks on all pages involved in the transaction, including locks on pages in the local cache. Hence the cache remains valid through the sequence of transactions, improving the cache’s efficiency.
This improvement is not entirely free, however, since the more locks a transaction holds the greater its chances are of conflict with another transaction. But Juniper’s broken read lock facility gives the basis for solving this problem. The DBMS records, in the client machine, the pages that have actually been used in a database transaction (i.e. since the previous Juniper checkpoint). If a read lock for one of these pages breaks then the database transaction must abort, but other broken locks may be released without compromising the transaction’s integrity.
The third elaboration involves control of when writes into the local cache are propagated to Juniper. Unless the number of writes in a transaction is extremely large, we can cache all of the pages written in the transaction on the client machine, and write these pages to Juniper immediately before the commit. Furthermore, we can write the pages back to Juniper in a standard order for all clients. This eliminates the possibility of deadlock between clients. (Of course, Juniper would break the deadlock eventually by aborting one of the participants).
The concurrency control method we have described is an "optimistic" one [Kung & Robinson, 1979]. That is, readers almost never wait due to locking conflicts, but are sometimes aborted because data they depend upon has been updated from another transaction. This is in contrast to methods that lock objects for exclusive access in order to update.
The combination of the optimistic approach and physical locking has problems with certain patterns of transactions. One bad situation is when there is a very high traffic of short transactions trying to update the same logical entity, such as a log. In this case, some sort of logical locking seems to be required for adequate performance.
Another bad situation is when there is a very long transaction that accesses many pages. If a short transaction modifies some page that has been used by the long transaction and then commits, the long transaction will be aborted. If a small transaction does this often enough it can keep the large transaction from making any progress at all. Juniper’s present facilities do not include exclusive-mode locks, so it is awkward for us to cope with this situation. Should this become a problem for the Cedar DBMS we can either convince the Juniper implementors to provide exclusive-mode locks, or implement them ourselves by using Juniper files to store the lock state.
We feel that it is important to acknowledge the great amount of effort that we saved by building the Cedar DBMS on top of an existing transaction-based file server. The difficult problems in building a robust database facility in a distributed environment are associated with managing transactions, and Juniper solved these problems for us.
5. Conclusion
We do not have space to describe the Cedar DBMS applications that have been implemented to date, but we can list them here:
* An experimental system for the storage and incremental update of Mesa modules, which also allows efficient "browsing" of the stored information.
* The CSL Notebook, a database of informal laboratory memoranda that can be updated via the electronic mail system.
* An experimental system for maintaining and querying a database of CSL’s Mesa source and object code files.
* A prototype version of Palm, an entity-based database browser that has been used in conjunction with all of the above systems.
Our strategy of building a simple system quickly has been a success; these applications were undertaken less than a year after we started to build the Cedar DBMS. We are presently considering a number of specific improvements to the lower levels of the system, and awaiting Cedar facilities that will allow us to store values of arbitrary user-defined type in a database. We are also pondering designs for a high-level system and greater integration of the DBMS with Cedar Mesa.
Acknowledgements: Peter Deutsch and Jerry Popek were members of the initial Cedar DBMS design committee. Jim Saxe, John Ellis, Ray Cheng, and Eric Schmidt built the first Cedar DBMS-based application programs. Howard Sturgis and Karen Kolling are responsible for the continuing development of Juniper.
References
[Astrahan et al., 1976] M. M. Astrahan et al., "System R: A Relational Approach to Database Management," ACM Trans. on Database Systems 1, 2 (June 1976), pp. 97 - 137.
[Bayer & Scholnick, 1977] Rudolf Bayer and Mario Scholnick, "Concurrency of Operations on B-Trees," Acta Informatica 9, 1 (1977), pp. 1 - 21.
[Boggs et al., 1979] David R. Boggs, John F. Shoch, Edward A. Taft, and Robert M. Metcalfe, "Pup: An Internetwork Architecture," Xerox PARC technical report CSL-79-10, July 1979.
[Deutsch & Taft, 1980] L. Peter Deutsch and Edward A. Taft, "Requirements for an Experimental Programming Environment," Xerox PARC technical report CSL-80-10, June 1980.
[Gray, 1980] Jim Gray, "A Transaction Model," IBM San Jose Research Laboratory, February 1980.
[Griffiths & Wade, 1976] Patricia P. Griffiths and Bradford W. Wade, "An Authorization Mechanism for a Relational Database System," ACM Trans. on Database Systems 1, 3 (September 1976), pp. 242 - 255.
[Israel et al., 1978] Jay E. Israel, James G. Mitchell, and Howard E. Sturgis, "Separating Data From Function in a Distributed File System," Xerox PARC technical report CSL-??-??, ??.
[Kung & Robinson, 1979] H. T. Kung and John T. Robinson, "Optimistic Methods for Concurrency Control," ACM Trans. on Database Systems (to appear).
[Lampson & Sturgis] Butler W. Lampson and Howard E. Sturgis, "Crash Recovery in a Distributed Data Storage System," Comm. ACM (to appear).
[Lampson & Pier, 1980] Butler W. Lampson and Kenneth A. Pier, <Dorado processor paper>.
[Lampson & Redell, 1980] Butler W. Lampson and David D. Redell, "Experience with Processes and Monitors in Mesa," Comm. ACM 23, 2 (February 1980), pp. 105 - 117.
[Metcalfe & Boggs, 1976] Robert M. Metcalfe and David R. Boggs, "Ethernet: Distributed Packet Switching for Local Computer Networks," Comm. ACM 19, 7 (July 1976), pp. 395 - 403.
[Mitchell et al., 1979] James G. Mitchell, William Maybury, and Richard Sweet, "Mesa Language Manual," Xerox PARC technical report CSL-79-3, April 1979.
[Nelson, 1981] Bruce Nelson, "Remote Procedure Call," Ph.D. thesis, Carnegie-Mellon University (to appear).
[Thacker et al., 1979] Charles P. Thacker, Edward M. McCreight, Butler W. Lampson, Robert F. Sproull, and David R. Boggs, "Alto: A personal computer," Xerox PARC technical report CSL-79-11, August 1979.