Cypress database concepts and facilities, October 1982

CSL Notebook Entry

To: CSL Cedar Interest Group                Date: October 6, 1982
From: Cedar Database Group                  Location: PARC/CSL, Palo Alto
      (Rick Cattell, Mark Brown)
Subject: Cypress database concepts and facilities (phase two implementation)

XEROX

Archive ID: yyCSL-xxx
Attributes: technical, Cypress, Cedar, Database
References: [Indigo]<Database>ViewLevelDesign.press

Abstract: The Cypress database system is intended to provide a convenient, high-performance data management facility for use by clients of the Cedar programming environment. The goal of Cypress is an integrated, unified representation of data across programs and machines, aiding information exchange and avoiding repeated implementations of specialized data management facilities. We present the basic concepts and capabilities of the Cypress database system, and describe the programmer's interface to the system.

Note: This document is a conversion from Bravo to Tioga format of the files [Indigo]<Database>c+f* by Bob Hagmann. The ViewFig*.press files appear to be lost, but placeholders have been left in. CFFig4.Griffin exists, but Griffin does not run in Cedar 6.0 yet. I think that the DBView referred to in the paper is what is documented in ModelLevelDesign*, that is, Blue-and-White CSL-83-4. ViewLevelDesign.press may be the SIGMOD paper (SIGMOD.paper).

TABLE OF CONTENTS

1. Introduction
2. Database system concepts
3. Data modelling concepts
4. Client interface
5. A sample program
6. Database design
7. References

*** PRELIMINARY DRAFT: NOT FOR CIRCULATION ***

1. Introduction

Why a database system?

CSL needs a database system. A database system provides many data management facilities that are currently re-invented by program after program, and it can thereby eliminate a large duplication of effort. It also provides a capability that is needed by many programs but actually implemented by very few: a means of sharing information among programs that are distributed around the network.

Why this paper?

We hope that this paper will be read by potential users of the Cypress database system. It is certainly required reading for those who want to make use of the system now.

This paper begins by developing database concepts that are necessary to put the Cypress database facilities in perspective. We do not expect most people in CSL to become familiar with database jargon, or to read a Mesa interface, in order to discover whether or not the database system can be useful to them. Thus we begin (Section 2) by defining the scope of the general database problem, and measuring our ambitions relative to this standard. We also describe some current and potential applications of our system.

The main body of this paper describes the interface to the Cypress database system. We first give an overview (Section 3), then present the full story (Section 4). This requires defining the class of data structures that can be stored in a database, and then describing the procedures for manipulating these data structures. For illustration, we present a short Mesa program that manipulates a database (Section 5). We then give an example of how a particular real-world information structure might be represented as a database, and discuss design tradeoffs (Section 6).

Where are we going with database systems?

The form of our database system has evolved, and will continue to evolve, in light of experience with database applications in CSL.
The Cypress database system described here incorporates two major changes to the original Cedar Database System (Brown, Cattell, Suzuki [1981]), motivated by early experience with applications: a more powerful file system (Alpine, by Brown et al [1982]), and more powerful data structuring mechanisms (DBView, by Cattell [1982]). Also, some database tools have been developed (Squirrel, see Donahue [1982]). We plan to build other tools and interfaces on top of the one described here, and eventually to integrate some set of database facilities into the Cedar Mesa language. Because other documents discuss these longer range plans, however, we shall give such discussions secondary status in this paper. We do wish to emphasize that we are open to suggestions concerning what the future facilities should look like; it is certain that our earliest and most energetic clients will have the greatest influence on the evolution of the system.

What are applications of Cypress?

As mentioned earlier, applications of the original Cedar database system were helpful in its revision. These applications included a database of Mesa programs, the CSL notebook, and a personal database browser. These applications were experiments, though it is expected they will be carried over to the Cypress system in the coming year.

At the time of this writing, three major application programs run on the Cypress database system. The first is Walnut, which provides facilities in the Cedar programming environment to read, send, and organize electronic mail in a database. The second is Hickory, which manages an appointment calendar database and provides an alarm mechanism in Cedar to remind the user of meetings, seminars, etc. The third application is Squirrel, which provides facilities to edit, organize, and browse a personal database of any type of data. In addition to these application-independent data manipulation facilities, Squirrel provides facilities to application programs such as Hickory and Walnut. Cypress clients planning to build application programs should consult the Squirrel documentation (Donahue [1982]) to see if these facilities are useful for the envisioned application.

Other database applications are planned but have not yet been undertaken due to manpower limitations. Most of these are public databases such as the CSL notebook and the "Whitepages" database of people, phone numbers, and addresses.

2. Database system concepts

Before discussing the facilities to be provided in the Cypress database system, we must address the question "what properties make something a database system?" We shall often contrast databases with more familiar data structures, such as Mesa record structures.

We find it useful to give a few definitions here. A client is a program using some facility, such as a database system. A user, on the other hand, is a human using a facility, such as an interactive interface to a database system. A database query is the operation of reading information from a database; depending on the type of interface provided, a query may be limited to reading one item or may involve thousands. An update is a write operation on a database.

2.1 Three basic properties of databases

Persistent

Unlike information encoded into a typical Mesa data structure, information stored in a database may last longer than one invocation of a program. This persistence of data has far-reaching consequences. Information in a long-lived database can become more valuable than the programs that manipulate it.
A database may exist long enough to be operated on by several different generations of the same application program, or to be used for applications that were not anticipated when the database was first designed. Thus the gradual evolution of the database becomes a concern analogous to the well-known problems of evolving programs.

The continued soundness of a persistent database is important. Soundness in the face of physical mishaps such as disk head crashes or tape vault fires is called physical integrity. A guarantee of perfect physical integrity is impossible, but a database system should reduce the probability of disaster to a level that is acceptable in light of the value of the information in the database.

A much more difficult problem comes in maintaining the logical integrity of a database. A database is generally used to model some real-world state of affairs, such as the structure of a Mesa system, the layout of a document, or the contents of a person's appointment schedule. Logical integrity means the correct correspondence between a database and the real-world situation being modeled. A database may be damaged in a logical sense by the keying of incorrect input data, or the running of an incorrect program. These two examples should make it clear that perfect logical integrity is an ideal that cannot be guaranteed; unfortunately, the loss of confidence caused by errors in a database can be just as damaging as a catastrophic physical failure (Hoare [1975]).

One simple form of insurance for the logical integrity of a database is always provided in a database system. This is the persistent storage and enforced use of data that tells how to interpret the representation of the database. This data is somewhat analogous to the type declarations required in Mesa programs that use data structures. It should be clear that using the wrong declarations when accessing the database could lead to disaster. Some database systems go beyond this, allowing the user to state more application-specific properties that his data should satisfy. These are sometimes called integrity constraints or integrity assertions.

Of course, there was long-lived information before there were any database systems: this information was stored in file systems. But file systems at best gave some assistance in preserving the physical integrity of stored data, and were no help with any of the other problems associated with persistence. Hence these problems were solved in an ad hoc manner by various applications.

Shared

The persistence of information in a database introduces the possibility of data sharing. This sharing is qualitatively different from the sharing that can take place among separate processes of a Mesa program.

One aspect of sharing is consistent concurrent access to data by programs. Here "concurrent access" does not mean that the programs' commands to read or update the database are actually executed at the same instant of time, but rather that the precise order in which these commands are executed from different programs is unpredictable. In general, consistency of a database can only be defined with respect to an application: we say that a database state is consistent if it satisfies a certain application-specific invariant. For instance, an accounting system might have as an invariant the property that the total debits and credits in the system are equal. In some database systems a transaction facility is provided to help applications maintain consistency invariants.
Briefly, a transaction is a sequence of read and write commands. The system supports the property that the entire sequence of commands executes atomically with respect to all other transactions; that is, the transaction executes as if no other transactions were in progress at the same time. Because there may in fact be other transactions accessing the same data at the same time, it is possible (in most systems) that two transactions may deadlock, in which case one of them must be aborted. So the price paid for concurrent access is the necessity that programs be prepared to retry aborted transactions.

An example of an application in which concurrent access capability is necessary is an interactive reservation system, say for CSL meeting rooms. By structuring the reservation system's database accesses into transactions, it should be possible to ensure that the database will remain consistent despite reservations entered concurrently from different machines on the network. The consistency invariants in this case would include having each time period for each room reserved at most once.

Another aspect of sharing is the integration of information into a single database. Integration means that information is kept in a consistent form in one database, even if this is not strictly necessary for the purposes of existing applications. Integration allows new problems to be solved more easily because all of the necessary information is maintained in the same database. It also encourages sharing of information at a more macroscopic level than in the reservation system example above, since it provides a set of uniform naming conventions and access methods for all users. In principle, integration makes logical integrity easier to maintain, because there is less of a tendency to duplicate data for use by different applications.

Unfortunately, sharing introduces new possibilities for logical damage to a database. For instance, one user's undebugged program can mistakenly modify the data of another user. A database's protection system is designed to make such damage less likely by requiring an authorization check before querying or updating the database. Some database systems provide elaborate protection facilities that give fine-grained control over access to information; this is especially necessary if sensitive information is to be shared through the database.

Large

Databases often contain large volumes of data. By one definition, a database qualifies as "large" when it becomes unacceptable to make the database inaccessible during the amount of time required to copy it. In principle, largeness is independent of persistence. In practice, largeness generally implies persistence, but not the other way around: it is quite common to require small data objects to survive between the executions of a program.

2.2 Consequences of the properties

The discussions above were limited to immediate results of the three basic properties. We now deal with some of the more indirect consequences.

Interface independent of storage structure

Because of the shared, persistent nature of a database, its usage changes with time. To provide good performance as the access pattern shifts, it is necessary to change the database representation to keep the most common operations inexpensive. Because of the value of existing applications software, the database system should make these programs functionally immune to changes in the storage structure used in representing the database.
That is, such changes should at most affect a program's performance, and not the results that it gives. This data independence can be achieved by ensuring that the interface used by programmers to specify algorithms involving the database refers to a model of data, the conceptual model, that is a useful abstraction of stored structures (independent of their precise form). Ideally the database system would then choose a good underlying representation based on its knowledge of the user's programs, or adaptively based on execution statistics. This remains a research problem: no existing system can perform such optimizations automatically. An alternative is for the system to accept advice from the user on pragmatic matters (such as the use of internal indexes), where this advice takes the form of declarations that cannot affect the semantics of an application program.

Simple and regular structures

Another consequence of the more fundamental database properties above is that database structures should be kept simple and regular. Persistence argues for simplicity because if a structure is not simple, it is difficult for different programs to maintain it in a consistent and accurate way over time. Regular structures are easier to share for the same reason, and also because they tend to contain larger meaningful substructures that can serve as the units for locking. With a large, irregular structure it is very likely that consecutive accesses to data will seem uncorrelated from the system's point of view, meaning that a disk access is required for each one. In more regular structures the access pattern often involves scanning data in an order that is meaningful to the system, so that caching and prefetching will increase performance.

Aggregate-level operations

By structuring a database in a very regular fashion, we make it feasible to operate on it with very concise high-level languages. This approach was first demonstrated with the relational model of data (Date [1977]), which is probably the simplest model with respect to the mechanisms it provides for expressing relationships among data items. The use of such high-level languages makes it very easy to develop new applications that mainly involve queries, even when the queries use portions of the database that were developed independently. These high-level languages may also help somewhat in avoiding programming errors and maintaining the logical integrity of the database.

Multiple views

Because databases represent real-world objects and relationships, changes in the world (or in our preferred abstraction of it) may force a change in the structure of an existing database. For example, we may have to add a new field to each record of a given type, or delete an existing field. A stronger form of data independence holds when any application that need not be affected by some alteration of the conceptual model is functionally immune to that change. (It is impossible for a nontrivial application to be immune to all possible alterations, because some of the data that it interprets may be expunged in a change of the conceptual model.) This degree of data independence requires a layer of mapping between the conceptual model and the external model presented to an application. Such a map is called a view. There may be many views that involve the same underlying data; views also allow different external models to be used while keeping a common conceptual model.
It is not generally possible to perform an update through a view unless the view is very simple.

2.3 Database versus persistent virtual memory

Is it productive to view the data structures and variables found in normal Mesa programming as database variables that don't happen to be persistent, shared, or large? To be more concrete, is the distinction between Mesa records and database objects really necessary?

To some extent this is a function of our aspirations. If we consider persistence to be the primary goal, with sharing and largeness being much less important, then we may be satisfied with a database that is essentially a long-term Mesa virtual memory. If the typing of data objects in this memory could be preserved automatically over long periods, the results would have some of the character of a database. JaM, Interlisp, and Smalltalk are examples of existing systems at PARC that support this kind of persistent virtual memory. Databases built in these environments can be very flexible by normal database standards. The user is able to create arbitrarily complex data structures and is able to bind procedures to individual data objects, giving a programmable view mechanism (in the sense of the section above). JaM, Interlisp, and Smalltalk do not, however, have a conventional database system's facilities for ensuring integrity, do not allow concurrent access, and do not support large databases (by the standards of "Business Data Processing" applications).

2.4 Cypress database system aspirations

Persistent

Physical integrity is aided by storing data on the Alpine file system (or Juniper, which we use right now). Declarations live in the database, and are read using the same interface as the rest of the database. (This commonality of interface is not true for updates, for consistency reasons.)

Shared

Concurrent access will be provided by using page-level locking on Alpine. Since this does not allow the locking of logical units of data, it will perhaps not support very high degrees of concurrent access efficiently; we won't know any more about this without getting some experience. We expect the database to be used in an integrated way (although probably not at first, until confidence is built up). The initial protection mechanism will be based on file protection. This is not ideal, but the best way to improve it may be to build finer-grained protection at a level above the interface we are now developing.

Large

The database may get large in the sense that it occupies a large portion of a disk volume, but we do not expect that 24-hour real-time performance will be necessary. Thus we will be satisfied with fairly crude algorithms for responding to restructuring requests that appear to affect large volumes of data (but in principle could be handled by a more sophisticated incremental algorithm). We do not plan to use very sophisticated algorithms to gain good performance for queries involving huge amounts of data. We shall assume that the amount of data used to process a query will just about fit in main storage. We are primarily concerned with interactive systems rather than "data processing" applications that might involve scanning a huge database. Our performance aspirations are also limited by the bandwidth available to our file server (though we shall do local caching of file pages to reduce this traffic).

Interface independent of storage structure

This is true in our system to a large extent. The presence or absence of indexes ("inverted files") is not reflected in the statement of algorithms.
Declarations are used to specify indexes, which are then maintained automatically by the system. The model does contain a form of pointer, but much of its underlying mechanism is abstracted away.

Allowed structures are simple and regular

The model of data is a simple extension of the relational data model. It consists of records, integers, strings, and atoms. Cypress helps maintain the logical integrity of the data by checking type constraints on these data. One goal of this document is to explain how to represent information within the Cypress data model in a well-structured way.

Aggregate-level operations and views

These are part of the next phase of development of Cypress. Our current interface allows only one-at-a-time operations on data that are explicitly stored; this functionality is quite adequate for many applications. We shall build aggregate operations and multiple views of data on top of this interface.

2.5 Database System Overview

Database structures and their manipulation

A database is a time-varying collection of data, all of which is stored as relationships and entities. A relationship is the database analogue of a Cedar record, and an entity is the database analogue of a Cedar atom. Fields of relationships are called attributes. Attributes may take on integer, string, boolean, and entity values. We do not allow any analogue of variant records for relationships. We also exclude the possibility of a "repeating" attribute, that is, an attribute whose value is a sequence of data values. The database package provides operations for creating new relationships and entities of a specified type, and for destroying them. Other operations allow the value associated with any particular attribute of a given relationship to be fetched or updated.

We can contrast database relationships with Mesa records as follows. First, in the database model there is a mechanism that allows all relationships of a given type to be retrieved. This means that no tuple ever becomes "garbage", since there is always an access path to it. With a Mesa record, once the last user-maintained pointer to a record disappears the record is truly inaccessible. Second, it is possible to locate all of the tuples that reference a given atom or have a given range of integer or string values; this is not true with Mesa records. Third, nesting of relationships is not allowed; a Mesa record may contain a component that is also a Mesa record, or a pointer to one. Fourth, there is no analogue of variant records for tuples in the current implementation, although there is a hierarchy of types of entities (atoms).

Databases and files

A database is represented as a collection of files. A file in database format is called a segment. The segments comprising a database are identified by the client of the system. The set of segments may change dynamically; the segments are physically independent and may be independently opened, closed, and modified. All of the relationships or entities of some client-created type lie in a single segment, which must be specified when the type is defined. This partitioning provides a crude mechanism for clustering related data, and may reduce the potential for conflict in concurrent accesses to a database by different applications. A default segment is provided for clients who don't care about segment structure.

Data definition

The phase two Cypress system is not integrated with Mesa; the data manipulation operations sketched above are implemented as Mesa procedures in the interface.
This leaves the problem of declarations: how do we create new relations and domains, or access existing ones? The answer is to define a small number of system entities to contain information about types. This technique for storing the "data dictionary" type information has the advantage that the same interface is used for data dictionary retrieval as for normal data retrieval. We do not extend the commonality of interface to updates, since the consistency of the types of the stored data is difficult to ensure without constraints on the type updates.

Concurrency

The database package uses the notion of transaction to structure database accesses. Clients can ensure the logical integrity of their databases, even in the face of concurrent access, by proper grouping of queries and updates into transactions. Transactions do not always succeed. Sometimes this is due to hardware or software failures, and sometimes due to conflicts in concurrent accesses. Concurrent updates to data in different segments almost never cause a conflict; data within the same segment are also arranged to minimize the potential for conflicts. But all clients must be prepared for the possibility of an aborted transaction. By far the simplest response to a transaction abortion is to restart the transaction from the beginning (though this may be painful when the updates were performed manually).

Protection

Database protection is limited to the protection provided for segments by the file system. This has its strong and weak points. On the positive side, file protection is a robust mechanism. Since it is implemented on a remote host, the malfunction of a user program cannot breach it. Users in our environment have experience with file protection and they trust it.

Environmental requirements

The Cypress system uses Alpine or the local Pilot file system to store the database. All of the database software will run on the client machine; thus the system's speed may depend a great deal upon the amount of main memory available on the client machine for caching database pages, as well as on the raw speed of the client machine.

3. Data modelling concepts

Data Independence

In this section, we give an informal description of the Cypress data model. We are dealing with a conceptual data model, the language used to define and access the types of objects and the relationships between them. This should be carefully distinguished from a physical data model, the data structuring mechanisms we are given for the actual storage and access of data. The physical data model is hidden as much as possible from the database client to facilitate data independence, the guarantee that a user's program will continue to work (perhaps with a change in efficiency) even though the physical data representation is redesigned. For any particular database using the given conceptual and physical models, the actual specifications of this database using the primitives the models provide are termed the conceptual data schema and physical data schema. Note that a mapping must be provided between the conceptual and physical levels, either automatically or with further instruction from the client; we will do some of both. The logical-to-physical mapping is intimately associated with the performance of the database system as viewed by the user performing operations at the conceptual level.
Performance is generally not a criterion for choosing between conceptual data models, unless there is no known way to map the differing conceptual views into the same or equally efficient implementations.

Basic Primitives

Three basic primitives are defined in the model: an entity, a datum, and a relationship. An entity represents an abstract or concrete object in the world: a person, an organization, a document, a product, an event. In programming languages and knowledge representation, entities have variously been referred to as atoms, symbols, nodes, and objects. A datum, unlike an entity, represents literal information such as times, weights, part names, or phone numbers. Character strings and integers are possible datum types. Actually, it is a policy decision whether something is represented as an entity or merely a datum: e.g., an employee's spouse may be represented in a database system as a datum (the spouse's name), or the spouse may be an entity in its own right. The database system provides a higher level of logical integrity checking for entities, as we will see later. In some programming languages, an entity is just another type of datum, like numbers and strings. To be more precise, we will instead use the term value to refer to something that can be either a datum or an entity.

A relationship is a tuple whose elements are [entity or datum] values. In programming languages, a relationship is commonly termed a record. As in most programming languages, we can refer to the elements (fields) of tuples by name instead of position. These names for positions are called attributes.

We also define entity types, datum types, and relationship types: domains, datatypes, and relations, respectively. We make use of these three types through one fundamental type constraint: for every relation, we define the number and types of attributes expected of relationships of that type. That is, every relationship in a given relation must have the same attributes, and the values associated with each attribute must be from a pre-specified domain or datatype. For example, a member relation might specify that a given person is a member of a given organization with a given job title, as in the figure following. The person and organization might be entities, while the title might be a string datum.

We relax the fundamental type constraint somewhat in allowing a lattice of types of domains: a particular value may then belong to the pre-specified domain or one of its sub-domains. For example, one could be a member of a University, a Company, or any other type of Organization one chooses to define. Other relations, e.g. an "offers-course" relation, might apply only to a University.

4. Client interface

A knowledge of Mesa (Mitchell et al [1979]) and the Cedar programming environment is helpful but not essential to understanding this section.

Types

In this subsection we describe the most important types in the interface. Less pervasive types are treated at the point where they are first used.

Entity: TYPE;
Relship: TYPE;
DataType: TYPE = Entity;

An Entity or Relship is not the actual database entity or relationship; they are handles for the actual database objects. All accesses to database objects are performed by calling interface procedures with the handles as parameters. Even comparisons of two entities for equality must be done in this way. The Entity and Relship handles are allocated from storage and automatically freed by the garbage collector when no longer needed.
Value: TYPE = REF ANY;
StringType, IntType, BoolType, AnyDomainType: DataType;

Storing Mesa data values in tuples presents several problems. First, since we would like to define a single operation to store a new value into a specified attribute of a Relship (for instance), there must be a single type for all values that pass through this "store-value" procedure. This is the type Value above, represented as untyped REFs in Cedar (a REF is a garbage-collected pointer). The DataTypes will be discussed in the next section. Entities, strings, integers, and booleans are the types of values the system currently recognizes and allows as attribute values. More precisely, these four types are Entity, ROPE (Cedar strings), REF INT, and REF BOOL. In the case of an entity-valued attribute, the attribute's type may be AnyDomainType, or a specific domain may be specified. The latter is highly preferred, as AnyDomainType is a loophole in the type mechanism and limits the kinds of operations that can be performed automatically by the database system or associated tools. We currently provide no mechanism to store more complex Cedar data structures such as arrays, lists, or records in a database; the database system's data structuring mechanisms should be used instead. (The database system can only provide useful query and update mechanisms if it has knowledge of the data's structure.)

System domains, relations, and properties

Many procedures in the DBView interface take parameters that describe the type or portion of a data item to be manipulated. The permanent repository of such data about user-defined data in a database is the database's data schema, represented by dictionary entities and their properties. Dictionary entities are members of one of the pre-defined system domains:

DomainDomain, RelationDomain, AttributeDomain, DatatypeDomain: Domain;

Every client-defined domain, relation, or attribute has a representative entity in these domains. Client-defined datatypes are not currently permitted, so the only entities in DatatypeDomain are the pre-defined IntType, StringType, and BoolType declared above. The information about client-defined domains and attributes is encoded by relationships in the database. Domains participate in the system relation SubTypeRelation, which encodes a domain type hierarchy:

SubTypeRelation: Relation;
sSubType: Attribute;    -- the domain in this attribute is a sub-type of ...
sSuperType: Attribute;  -- ... the domain in this attribute

The SubTypeRelation has one element per direct domain-subdomain relationship; it does not contain the transitive closure of that relation. However, it is guaranteed to contain no cycles. That is, the database system checks that there is no set of domains d1, d2, ..., dN, N>1, such that d1 is a subtype of d2, d2 is a subtype of d3, and so on to dN, and d1=dN.

The information about attributes is available through use of the pre-defined system attribute relation:

AttributeRelation: Relation;
aAttribute: Attribute;   -- the attribute to which this relationship refers
aType: Attribute;        -- type of the attribute (DataType or Domain)
aRelation: Attribute;    -- relation of the attribute (Relation)
aUniqueness: Attribute;  -- one of: Key, KeyPart, OptionalKey, NonKey
aLength: Attribute;      -- only for string attributes: hint at max length

The AttributeRelation and SubTypeRelation can be queried with the ordinary data query operations we will define.
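For example, the sketch below collects the attributes of a client-defined relation by querying AttributeRelation. It is a minimal sketch, not part of the interface: it uses the operations RelationSubset, NextRelship, ReleaseRelshipSet, GetF, V2E, and Null, all of which are defined later in this section, and the procedure name is hypothetical.

AttributesOf: PROC[r: Relation] RETURNS[al: LIST OF Attribute _ NIL] =
  BEGIN
  -- enumerate the AttributeRelation relationships whose aRelation field is r
  rs: RelshipSet = RelationSubset[AttributeRelation, LIST[[aRelation, r]]];
  rel: Relship;
  DO
    rel _ NextRelship[rs];
    IF Null[rel] THEN EXIT;
    al _ CONS[V2E[GetF[rel, aAttribute]], al];  -- the attribute's entity
  ENDLOOP;
  ReleaseRelshipSet[rs];
  END;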
The data schema representation of attributes, relations, domains, and their properties may only be read, not written, by the database client. They are written indirectly through the use of the procedures for their creation: DeclareDomain, DeclareRelation, DeclareAttribute, and DeclareSubType. The semantics of the system domains and relations we have discussed will be deferred to the discussion of these procedures later in this section.

B-tree indexes are automatically maintained for domains; that is, an index contains entries for all of the entities in a domain, keyed by their name, so that sorting or lookup by entity name is quick. Key comparisons are performed in the usual lexicographic fashion. B-tree indexes are also maintained for relations if explicitly requested by the client.

That completes our description of the system-defined domains, relations, and properties. These contain all of the accessible data about the user-defined data in a database. A client program can read this data using the same interface that the program uses for all other data. Attempts to perform updates through this interface result in the ERROR DictionaryUpdate; all updates are performed explicitly by client calls to the procedures defined in the next subsections, in order to ensure the consistency of this data.

Starting up; transactions and segments

A transaction is a sequence of read and write commands. The system supports the property that the entire sequence of commands executes atomically with respect to all other data retrieval and updates; that is, the transaction executes as if no other transactions were in progress at the same time. Because there may in fact be other transactions accessing the same data at the same time, it is possible that two transactions may deadlock, in which case one of them must be aborted. So the price paid for concurrent access is that programs must be prepared to retry aborted transactions.

One requirement on transactions is that packages used in conjunction with the database be able to group their operations into the same transaction when this is appropriate. Of course, when this is done the interactions among the packages must be understood, since one package might commit the transaction while the other had some persistent data structure in an inconsistent state.

The database system provides the capability to access a database stored on the same machine as the database client, using the Pilot file system (Redell et al [1979]), or on Juniper file servers (Israel et al [1978]). (Juniper will be replaced by Alpine, a new transaction-based file server, when that project is completed.) There is currently only one public and one private transaction per instance of the database software on a client machine. All data on a server is accessed via the public transaction; all data on the client's machine, via the private one. This single-transaction scheme is a major simplification of the package itself, and also removes the need to pass a Transaction on every call to procedures in the interface.

The database package uses transactions as follows. An OpenDatabase and OpenTransaction call start the database session. A transaction is either passed in by the client, or created by the database package (the latter is just a convenience feature). The operation MarkTransaction below forms the end of a database transaction and the start of a new one; this is implemented by executing a "checkpoint" (commit and hold read locks) on the Juniper file server, or by closing and re-opening the transaction on the Pilot file system.
The operation AbortTransaction may be used to abort a transaction. The public and private transactions supported by the database system are independently opened, committed, or aborted. The segments in which data are stored may not be opened for use until the appropriate transaction (local or remote, depending on the location of the segment) has been opened. The operation CloseDatabase commits all transactions, closes all segments, and kills any transactions created by the package.

Clients must decide when to tell the system that a transaction is complete, and must be prepared to deal with unsolicited notification that the current transaction has been aborted because of system failure or lock conflict. Note that when the database package cooperates with other packages under the same transaction, a database transaction may be forced to abort because of the failure of some activity in the other package that is part of the same transaction.

The client's interaction with the database system begins with a call to OpenDatabase and ends with a call to CloseDatabase:

OpenDatabase: PROC[
  userName, password: Rope.Ref _ NIL,
  nCachePages: CARDINAL _ DefaultNCachePages,
  nBytesInitial: LONG CARDINAL _ DefaultNBytesInitial,
  nBytesPerExtent: LONG CARDINAL _ DefaultNBytesPerExtent ];

CloseDatabase: PROC;

OpenDatabase informs the database system of the client's name and password, and of various system parameters: nCachePages tells the system how many pages of database to keep in virtual memory on the client's machine, nBytesInitial is the initial size to assign to segments, and nBytesPerExtent is the incremental increase in segment size used when a segment is full. OpenDatabase should be called before any other operation in the interface. CloseDatabase closes the database session, committing and destroying any open transactions and closing any open database files.

OpenTransaction: PROC[
  location: LocalOrRemote _ local,
  useTrans: Transaction _ DefaultTransaction,
  transAbort: TransactionAbortProc _ NIL ]
  RETURNS [trans: Transaction, welcomeMsg: Rope.Ref];

LocalOrRemote: TYPE = {local, remote};

If useTrans is NIL then OpenTransaction establishes a new connection and transaction with the corresponding (local or remote) file system. Otherwise it uses the supplied transaction. In either case, OpenTransaction returns the resulting transaction. If a transAbort procedure is passed to OpenTransaction then it is called when the transaction is spontaneously aborted (currently, the local transaction is never spontaneously aborted). Any database operation after a transaction abort will invoke the ABORTED signal. The client's transAbort procedure should block any further database operations and wait for completion of any existing ones. Then it may re-open the aborted transaction by calling OpenTransaction again. When the remote transaction is successfully re-opened, the client's remote database operations may resume. Operations on data in local segments may proceed even when the remote transaction has been aborted or is not open, and vice versa.

In addition to the opening and closing of transactions by OpenTransaction and CloseDatabase, transactions may be manipulated by the procedures:

AbortTransaction: PROC[location: LocalOrRemote _ local];

MarkTransaction: PROC[location: LocalOrRemote _ local];

AbortTransaction aborts the current database transaction, either local or remote.
The effect on the database is as if the transaction had never been started: the state is as it was just after the OpenTransaction call or the most recent MarkTransaction call. A call to OpenTransaction is necessary to do more database operations, and all user variables referencing database items created or retrieved under the corresponding transaction must be re-initialized (they may reference entities or relationships that no longer exist, and in any case they are marked invalid by the database system).

MarkTransaction commits the current database transaction, and immediately starts a new one. User variables which reference database entities or relationships are still valid. When the local or remote transaction is committed, the other is unaffected.

After an OpenDatabase and OpenTransaction, but before database operations may be invoked, the client must open the file(s) in which the data are stored. Database files are called segments. Segments are represented as Cedar ATOMs whose name is the full path name of the data file, minus the implicit extension ".database". If the file is on the local file system, its name is preceded by "[Local]". For example, "[Local]Foo" refers to a segment file on the local disk named Foo.database; "[Juniper]Baz" refers to a segment named Baz.database on the Juniper server. It is generally a bad idea to access database files other than through the database interface. However, because segments are physically independent and contain no references to other files by file identifier or explicit addresses within files, the segment files may be moved from machine to machine or renamed without effect on their contents. If a segment file in the set that comprises a client database is deleted, the others may still be opened to produce a database missing only that segment's entities and relationships.

A segment is opened or closed by the operations OpenSegment and CloseSegment:

OpenSegment: PROC[
  fileName: ROPE, number: INT _ 0, version: Version _ OldOnly ]
  RETURNS [Segment];

CloseSegment: PROC[ fileName: Segment ];

Segment: TYPE = ATOM;  -- a Cedar ATOM whose name is the full file path name
Version: TYPE = {NewOnly, OldOnly, NewOrOld};

The version parameter to OpenSegment defaults to OldOnly, to open an existing file. The signal FileNotFound is generated if the file does not exist, and the signal IllegalFileName is generated if the directory or machine name is missing from fileName. For the convenience of the client, OpenSegment returns the Cedar ATOM which represents the segment in future database operations (the client can also use Atom.MakeAtom to create the same, but the segment must still be opened before use). If the version NewOnly is passed, a new segment file will be created, erasing any existing one. In this case, a number assigned to the segment by the database administrator must also be passed. The client program can pass version=NewOrOld to open a new or existing segment file; in this case the segment number must also be passed, of course.

A simple client program using the database system might then have the form:

[]_ OpenDatabase[];
[]_ OpenTransaction[];
[]_ OpenSegment["[Local]Temp"];
... database operations, including zero or more MarkTransaction calls ...
CloseDatabase[];
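A client using remote segments must also be prepared for a spontaneous transaction abort. The sketch below suggests one shape such handling might take; it is only a sketch: the argument-free signature assumed for TransactionAbortProc, and all of the names, are illustrative assumptions rather than part of the interface.

remoteAborted: BOOLEAN _ FALSE;

MyTransAbort: TransactionAbortProc =
  -- called by the database system when the remote transaction is
  -- spontaneously aborted; just record the fact (we assume here that
  -- a TransactionAbortProc takes no arguments)
  BEGIN remoteAborted _ TRUE; END;

DoRemoteWork: PROC =
  BEGIN
  trans: Transaction; welcome: Rope.Ref;
  [trans, welcome] _ OpenTransaction[location: remote, transAbort: MyTransAbort];
  DO
    -- ... database operations on remote segments ...
    IF NOT remoteAborted THEN EXIT;  -- the work committed; we are done
    -- otherwise re-open the transaction and retry from the beginning,
    -- re-initializing any variables that referenced database items
    remoteAborted _ FALSE;
    [trans, welcome] _ OpenTransaction[location: remote, transAbort: MyTransAbort];
  ENDLOOP;
  END;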
Defining the data schema

The definition of the client's data schema is done through a procedural interface, although the schema can be read through the normal data operations defined later. In general, any part of the data schema may be extended or changed at any time; i.e., data operations and data schema definition may be intermixed. However, there are a few specific ordering constraints on schema definition that we will note shortly. Also, the database system optimizes for better performance if the entire schema is defined before any data are entered. An interactive schema editing tool described in the Squirrel documentation allows the schema to be changed without regard for ordering constraints, by recreating schema items and copying data invisibly to the user when necessary.

All the data schema definition operations take a Version parameter which specifies whether the schema element is a new or existing one. The version defaults to allowing either (NewOrOld): i.e., the existing entity is returned if it exists; otherwise it is created. This feature avoids separate application code for creating the database schema the first time the application program is run.

DeclareDomain: PROC[
  name: Rope.Ref, segment: Segment _ DefaultSegment,
  version: Version _ NewOrOld, estRelations: INT _ 5 ]
  RETURNS [d: Domain];

DeclareSubType: PROC[sub, super: Domain];

DeclareDomain defines a domain with the given name and returns its representative entity. The domain is declared to exist in the given segment, or a default local one if none is given. The parameter estRelations may be used to estimate the largest number of relations in which entities of this domain may participate. Currently, an error is generated at run time if this estimate is exceeded. If the domain already exists and version=NewOnly, the signal AlreadyExists is generated. If the domain does not already exist and version=OldOnly, then NIL is returned.

The client may define one domain to be a subtype of another by calling DeclareSubType. This permits entities of the subdomain to participate in any relations in which entities of the superdomains may participate. All client DeclareSubType calls should be done before declaring relations on the superdomains (to allow some optimizations). The error WrongSegment is generated if the sub-domain and super-domain are not in the same segment.

DeclareRelation: PROC[
  name: Rope.Ref, segment: Segment _ DefaultSegment,
  version: Version _ NewOrOld ]
  RETURNS [r: Relation];

DeclareAttribute: PROC[
  r: Relation, name: Rope.Ref, type: DataType _ AnyDomainType,
  uniqueness: Uniqueness _ NonKey, length: INT _ 0,
  version: Version _ NewOrOld ]
  RETURNS [a: Attribute];

Uniqueness: TYPE = {NonKey, Key, KeyPart, OptionalKey};

DeclareRelation defines a new or existing relation with the given name and returns its representative entity. The relation is declared to exist in the given segment, or the default one if none is given. If the relation already exists and version=NewOnly, the signal AlreadyExists is generated. If the relation does not already exist and version=OldOnly, then NIL is returned.

DeclareAttribute is called once for each attribute of the relation, to define the attributes' names, types, and uniqueness. If version=NewOrOld and the attribute already exists, the database system checks that the new type, uniqueness, and length match the existing attribute; the error WrongExistingAttribute is generated if there is a discrepancy. The attribute type should be either one of the pre-defined types (IntType, StringType, BoolType, AnyDomainType) or the entity representative for a client-defined domain.
For the pre-defined types, the actual values assigned to attributes of the relationship instances of the relation must have the corresponding type: REF INT, ROPE, REF BOOL, or Entity. If the attribute has a client-defined domain as type, the attribute values must be entities of the client-defined domain or some sub-domain thereof.

The attribute uniqueness indicates whether the attribute is a key of the relation. If its uniqueness is NonKey, then the attribute is not a key of the relation. If its uniqueness is OptionalKey, then the system will ensure that no two relationships in r have the same value for this attribute (if a value has been assigned). The ERROR NonUniqueKeyValue is generated if a non-unique key value results from a SetP, SetF, SetFS, or DeclareRelship call. Key acts the same as OptionalKey, except that in addition to requiring that no two relationships in r have the same value for the attribute, it requires that every entity in the domain referenced by this attribute be referenced by a relationship in the relation: the relationships in the relation and the entities in the domain are in one-to-one correspondence. Finally, if an attribute's uniqueness is KeyPart, then the system will ensure that no two relationships in r have the same values for all key attributes of r, though two may have the same values for some subset of them. KeyPart is not yet implemented.

DestroyRelation: PROC[r: Relation];

DestroyDomain: PROC[d: Domain];

DestroySubType: PROC[sub, super: Domain];

Relations, domains, and subdomain relationships may be destroyed by calls to the above procedures. Destroying a relation destroys all of its relationships. Destroying a domain destroys all of its entities and also any relationships which reference those entities. Destroying a sub-domain relationship has no effect on existing domains or their entities; it simply makes entities of domain sub no longer eligible to participate in the relations in which entities of domain super can participate. Existing relations and domains may only be modified by destroying them with the procedures above, with one exception: the operation SetName (described later) may be used to change the name of a relation or domain.

DeclareIndex: PROC[
  relation: Relation, indexedAttributes: AttributeList, version: Version ];

DeclareIndex has no logical effect on the database; it is a performance hint, telling the database system to create a B-tree index on the given relation for the given indexedAttributes. The index will be used to process queries more efficiently. Each index key consists of the concatenated values of the indexedAttributes in the relationship the index key references. For entity-valued attributes, the value used in the key is the string name of the entity. The version parameter may be used as in DeclareDomain and the other schema definition procedures, to indicate a new or existing index. If any of the attributes are not attributes of the given relation then the signal IllegalIndex is generated. Multiple indices on one relation are not implemented.
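As a concrete illustration of these declarations, the sketch below defines a small schema. It is a minimal sketch, not part of the interface: the names (Person, Employee, Friend, fOf, fIs) are hypothetical, and we assume an AttributeList is constructed with LIST.

Person, Employee: Domain;
Friend: Relation;
fOf, fIs: Attribute;
...
Person _ DeclareDomain["Person"];
Employee _ DeclareDomain["Employee"];
DeclareSubType[sub: Employee, super: Person];  -- before relations on Person
Friend _ DeclareRelation["Friend"];
fOf _ DeclareAttribute[Friend, "of", Person];  -- uniqueness defaults to NonKey
fIs _ DeclareAttribute[Friend, "is", Person];
DeclareIndex[Friend, LIST[fOf, fIs], NewOrOld];  -- a performance hint only

Because the Version parameters default to NewOrOld, the same code serves both to create the schema the first time the application is run and to bind to the existing schema thereafter.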
The "is" attribute is returned, so that it can be used to represent the property in GetP and SetP described later. Manipulating entities and relationships Since attributes, domains, and relations are all entities, and datum values are represented as REF ANYs, all type checking must currently be done at run-time. The procedures below indicate illegal arguments by generating the signals IllegalAttribute, IllegalDomain, IllegalRelation, IllegalValue, IllegalEntity, and IllegalRelship, according to the type of argument expected. DeclareEntity: PROC[ d: Domain, name: Rope.Ref_ NIL, version: Version_ NewOrOld] RETURNS [e: Entity]; DeclareEntity finds or creates an entity in domain d with the given name. The name may be omitted if desired, in which case an entity with a unique name is automatically created. If version is OldOnly and an entity with the given name does not exist, NIL is returned. If version is NewOnly and an entity with the given name already exists, the signal NonUniqueEntityName is generated. DeclareRelship: PROC [ r: Relation, avl: AttributeValueList_NIL, version: Version_ NewOrOld] RETURNS [Relship]; DeclareRelship finds or creates a relship in r with the given attribute values. If version is NewOnly, a new relship with the given attribute values is generated. If version is OldOnly, the relship in r with the given attribute values is returned if it exists, otherwise NIL is returned. If version is NewOrOld, the relship with the given attribute values is returned if it exists, otherwise one is created. If the creation of a new relship violates the key constraints specified by DeclareAttribute, the signal NonUniqueAttributeValue is generated. DestroyEntity: PROC[e: Entity]; DestroyEntity removes e from its domain, destroys all relationships referencing it, and destroys the entity representative itself. Any client variables that reference the entity automatically take on the null value (Null[e] returns TRUE), and cause ERROR NullifiedArgument if passed to database system procedures. DestroyRelship: PROC[rel: Relship]; DestroyRelship removes rel from its relation, and destroys it. Any client variables that reference the relationship automatically take on the null value, and will cause ERROR NullifiedArgument if subsequently passed to database system procedures. SetF: PROC[rel: Relship, a: Attribute, v: Value]; SetF assigns the value v to attribute a of relationship rel. If the value is not of the same type as the attribute (or a subtype thereof if the attribute is entity-valued), the ERROR WrongAttributeValueType is generated. If a is not an attribute of rel's relation, IllegalAttribute is generated. If v is an entity from a different segment than rel, it is mapped to its namesakes in rel's segment. That is, v is replaced with the entity with the same name in the domain with the same name in rel's segment. Mapping v may create a new entity in rel's segment. It may also create new domains and subtype relationships in rel's segment, as necessary to make v conform to the domain with the same name as a's type in v's segment. GetF: PROC[rel: Relship, a: Attribute] RETURNS [Value]; GetF retrieves the value of attribute a of relationship rel. If a is not an attribute of rel's relation, ERROR IllegalAttribute is generated. The client should use the V2x routines described in the next section to coerce the value into the expected type. 
SetFS: PROC[rel: Relship, a: Attribute, v: Rope.Ref];

GetFS: PROC[rel: Relship, a: Attribute] RETURNS [Rope.Ref];

These procedures provide a convenient veneer on top of GetF and SetF that presents the illusion that all relation attributes are string-valued. The effect is something like the relational data model, and is useful for applications such as a relation displayer and editor that deal only with strings. The semantics of GetFS and SetFS depend upon the actual type of the attribute a:

attribute type  GetFS returns                       SetFS assigns the attribute to be
StringType      the string v                        the string v
IntType         v converted to a decimal string     the string converted to a decimal integer
BoolType        "TRUE" or "FALSE"                   true if "TRUE", false if "FALSE"
a domain D      the name of the referenced entity   the entity with name v
                (or the null string if it is NIL)   (or NIL if v is the null string)
AnyDomainType   the same, but includes the domain   the entity with the given domain and name
                                                    (or NIL if v is the null string)

The same signals generated by GetF and SetF, such as IllegalAttribute, can also be generated by these procedures. The signal NotFound is generated in the last case above if no entity with the given name is found.

GetName: PROC[e: Entity] RETURNS [s: ROPE];

SetName: PROC[e: Entity, s: ROPE];

GetName and SetName retrieve or assign the name of an entity, respectively. They may generate the signal NotAnEntity if e is null or not an entity.

DomainOf: PROC[e: Entity] RETURNS [Domain];

RelationOf: PROC[r: Relship] RETURNS [Relation];

DomainOf and RelationOf can be used to find the entity representative of an entity's domain or a relationship's relation, respectively. The signal NotAnEntity is generated if e is null or not an entity; the signal NotARelship is generated if r is null or not a relship.

SegmentOf: PROC[x: EntityOrRelship] RETURNS [Segment];

SegmentOf returns the segment in which an entity or relship is stored. It can be applied to domain, relation, or attribute entities.

Eq: PROC[e1: Entity, e2: Entity] RETURNS [BOOL];

Eq returns TRUE iff the same database entity is referenced by e1 and e2. This is not equivalent to the Mesa expression "e1 = e2", which computes Cedar Mesa REF equality. If e1 and e2 are in different segments, Eq returns TRUE iff they have the same name and their domains have the same name.

Null: PROC[x: EntityOrRelship] RETURNS [BOOL];

Null returns TRUE iff its argument has been destroyed, is NIL, or has been invalidated by abortion of the transaction under which it was created.
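A brief sketch of these routines in use, continuing the running example (the names, as before, are hypothetical):

s: Rope.Ref;
...
s _ GetFS[rel, fIs];       -- "Bob": the name of the entity rel references
SetFS[rel, fIs, "Carol"];  -- assigns the Person entity named Carol;
                           -- NotFound if no such entity exists
IF Eq[DomainOf[alice], Person] THEN ...  -- TRUE: alice is in domain Person
DestroyEntity[bob];
IF Null[bob] THEN ...                    -- TRUE once bob has been destroyed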
GetP: PROC [e: Entity, to: Attribute, from: Attribute_ NIL] RETURNS [Value];
SetP: PROC [e: Entity, to: Attribute, v: Value, from: Attribute_ NIL] RETURNS [Relship];

GetP and SetP are convenience routines for a common use of relships: to represent "properties" of entities. Properties allow the client to think of values stored in relships referencing an entity as if they were directly accessible fields (or "properties") of the entity itself. See the figure on page 10 illustrating properties. GetP finds a relship whose from attribute is equal to e, and returns the value of that relship's to attribute. The from attribute may be defaulted: if the relation is binary, from is assumed to be the other attribute of to's relation. If the relation is not binary, the current implementation will find the "first" other attribute, where "first" is defined by the order of the original calls to DeclareAttribute. SetP defaults the from attribute similarly to GetP, but operates differently depending on whether from is a key of the relation. If it is a key, any previous relship that referenced e in the from attribute is automatically deleted. In either case, a new relship is created whose from attribute is e and whose to attribute is v. SetP returns the relship it creates for the convenience of the client.

GetP and SetP can generate the same errors as SetF and GetF, e.g. if e is not an entity or to is not an attribute. In addition, GetP and SetP can generate the error IllegalProperty if to and from are from different relations. GetP generates the error WrongPropertyCardinality if more than one relship references e in the from attribute; if no such relships exist, it returns a null value of the type of the to attribute. SetP allows any number of existing relships referencing e; it simply adds another one (when from is a key, of course, there can be at most one existing relship).

GetPList: PROC [e: Entity, to: Attribute, from: Attribute_ NIL] RETURNS [LIST OF Value];
SetPList: PROC [e: Entity, to: Attribute, vl: LIST OF Value, from: Attribute_ NIL];

GetPList and SetPList are similar to GetP and SetP, but assume that any number of relships may reference e with their from attribute. They generate the signal WrongPropertyCardinality if this is not true, i.e. if the from attribute is a key. GetPList returns the list of values of the to attributes of the relships that reference e with their from attribute. SetPList destroys any existing relships that reference e with their from attribute, and creates Length[vl] new ones, whose from attributes reference e and whose to attributes are the elements of vl. GetPList and SetPList may generate any of the errors that GetF and SetF may generate, and the error IllegalProperty if to and from are from different relations.

Examples of the use of the property procedures for data access can be found at the end of the previous section. Properties are also useful for obtaining information about the data schema. For example, GetP[a, aRelationIs] will return the attribute a's relation, and GetPList[d, aTypeOf] will return all the attributes that can reference domain d.

Type conversion routines

E2V: PROC[e: Entity] RETURNS[v: Value];
B2V: PROC[b: BOOLEAN] RETURNS[v: Value];
I2V: PROC[i: LONG INTEGER] RETURNS[v: Value];
S2V: PROC[s: ROPE] RETURNS[v: Value];

The x2V routines convert the various Mesa types to Values. The conversion is not normally required for ropes and entities, since the compiler will widen these into the REF ANY type Value.

V2E: PROC[v: Value] RETURNS[Entity];
V2B: PROC[v: Value] RETURNS[BOOLEAN];
V2I: PROC[v: Value] RETURNS[LONG INTEGER];
V2S: PROC[v: Value] RETURNS[ROPE];

The V2x routines convert Values to the various Mesa types. The WrongValueType error is raised if the value is of the wrong type. It is recommended that these routines be used rather than user-written NARROWs, as the representation of Values may change. Also, NARROWs of opaque types don't work in the current Cedar compiler.
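Combining the property and conversion routines, a typical client reads and writes properties in a line or two each. The following is a sketch, again assuming the schema and entities of the Section 5 sample program; the age value is illustrative:

	age: LONG INTEGER;
	friends: LIST OF Value;
	SetP[mark, ageProp, I2V[33]];
	age _ V2I[GetP[mark, ageProp]]; -- age is now 33
	-- friendIs is not a key, so the list form must be used:
	friends _ GetPList[harry, friendOf, friendIs]; -- Harry's friends, as in PrintFriends below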
Query operations

RelationSubset: PROC[
	r: Relation, constraint: AttributeValueList_ NIL, searchSegments: BOOL_ TRUE]
	RETURNS [RelshipSet];
NextRelship: PROC[rs: RelshipSet] RETURNS [Relship];
ReleaseRelshipSet: PROC[rs: RelshipSet];

AttributeValueList: TYPE = LIST OF AttributeValue;
AttributeValue: TYPE = RECORD [
	attribute: Attribute,
	low: Value,
	high: Value_ NIL -- omitted where same as low or not applicable --];

The basic query operation is RelationSubset. It returns a generator of all the relationships in relation r which satisfy a constraint list of attribute values. The relationships are enumerated by calling NextRelship repeatedly; it returns NIL when there are no more relationships. ReleaseRelshipSet should be called when the client is finished with the query.

The constraint list may be NIL, in which case all of the relationships in r will be enumerated. Otherwise, the relationships which satisfy the conjunction of the constraints on the attributes in the list will be enumerated. If an index exists on some subset of the attributes, the relationships will be enumerated sorted on the concatenated values of those attributes. For a StringType or IntType attribute a of r, the constraint list may contain a record of the form [a, b, c], where the attribute value must be greater than or equal to b and less than or equal to c to satisfy the constraint. For any type of attribute, the list may contain a record of the form [a, b], where the value of the attribute must exactly equal b. The special values infString and minfString are defined in the DBView interface, and represent an infinitely large and an infinitely small string value, respectively. The signal WrongAttributeValueType is generated by RelationSubset if one of the low or high values in the list is of a different type than its corresponding attribute. If searchSegments is TRUE, relations in all open segments with the same name as r are searched. The attributes and entity values in the constraint list are automatically mapped to their namesakes in other segments.

DomainSubset: PROC[
	d: Domain, lowName, highName: ROPE_ NIL,
	searchSubDomains, searchSegments: BOOL_ TRUE]
	RETURNS [EntitySet];
NextEntity: PROC[es: EntitySet] RETURNS [Entity];
ReleaseEntitySet: PROC[es: EntitySet];

DomainSubset enumerates all the entities in a domain. If lowName and highName are NIL, the entire domain is enumerated, in no particular order. Otherwise, only those entities whose names are lexicographically greater than lowName and less than highName are enumerated, in lexicographic order. If searchSubDomains is TRUE, subdomains of d are also enumerated; each subdomain is sorted separately. If searchSegments is TRUE, d and the domains with the same name in all open segments are enumerated. Just as with relation enumeration, NextEntity is used to enumerate the entities satisfying the DomainSubset, and ReleaseEntitySet should be called upon completion.

GetDomainRefAttributes: PROC [d: Domain] RETURNS [AttributeList];

This procedure returns a list of all attributes, of any relation defined in d's segment, which reference domain d or one of its superdomains. The list does not include AnyDomainType attributes, which can of course reference any domain. GetDomainRefAttributes is implemented via queries on the data schema. It is useful for application-independent tools; most specific applications can simply code in the relevant attributes.

GetEntityRefAttributes: PROC [e: Entity] RETURNS [AttributeList];

This procedure returns a list of all attributes in which some existing relship actually references e, including AnyDomainType attributes. This is currently not implemented as specified: the list may include key attributes that do not actually reference e. The ones that actually reference e must be determined with RelationSubset.
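To close this section, here is a sketch of a sorted domain enumeration, a facility not exercised by the sample program below; it assumes the Person domain and tty stream declared there, and prints the persons whose names begin with "M", in lexicographic order:

	es: EntitySet _ DomainSubset[Person, "M", "N"];
	p: --Person-- Entity;
	WHILE (p _ NextEntity[es]) # NIL DO
		tty.Put[rope[GetName[p]], char[CR]]; -- names between "M" and "N", in sorted order
		ENDLOOP;
	ReleaseEntitySet[es];

5. A sample program

As a simple example, we will consider the following concocted program, which defines a small schema and database of persons with their names, friends, and phone numbers. It also defines two other domains, "frogs" and "things", and does a number of queries on the database.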
It illustrates the use of most of the procedures defined in this section.

DIRECTORY DBView, IOStream;

VLTest1Impl: PROGRAM
IMPORTS DBView, IOStream =
BEGIN OPEN IOStream, DBView;

tty: IOStream.Handle_ CreateTTYStreams["VLTest1Impl.log"].out;
Person, Frog, Thing: Domain;
Friend, Phone: Relation;
ageProp, heightProp: Attribute;
friendOf, friendIs: --Friend-- Attribute;
phoneOf, phoneIs: --Phone-- Attribute;
wsh, harry, mark, nori: --Person-- Entity;
george: --Frog-- Entity;
rock, pebble: --Thing-- Entity;
myName: ROPE _ "Rick Cattell";
myExt: REF INT_ NEW[INT _ 4466];

The following routine defines the domains, relations, and attributes through calls to the database system. Since the version parameter to DeclareDomain, DeclareRelation, and DeclareAttribute defaults to NewOnly, an ERROR would be raised if this program were executed a second time.

Initialize: PROC =
BEGIN
tty.Put[rope["Defining data dictionary..."], char[CR]];
-- Declare domains and make Persons and Frogs be subtypes of Things:
Person_ DeclareDomain["Person"];
Thing_ DeclareDomain["Thing"];
Frog_ DeclareDomain["Frog"];
DeclareSubType[of: Thing, is: Person];
DeclareSubType[of: Thing, is: Frog];
-- Declare age and height properties of Things:
ageProp_ DeclareProperty["Age", Thing, IntType];
heightProp_ DeclareProperty["Height", Thing, StringType];
-- Declare Friend relation between Persons:
Friend_ DeclareRelation["Friend"];
friendOf_ DeclareAttribute[Friend, "of", Person];
friendIs_ DeclareAttribute[Friend, "is", Person];
-- Declare Phone relation between Persons and integers:
Phone_ DeclareRelation["Phone"];
phoneOf_ DeclareAttribute[Phone, "of", Person];
phoneIs_ DeclareAttribute[Phone, "is", IntType];
END;

The following routines create various entities and relationships. Several different methods are used to assign the attribute values, to illustrate the alternatives.

InsertData: PROC =
BEGIN
t: Relship;
rick: Entity;
tty.Put[rope["Creating data..."], char[CR]];
harry_ DeclareEntity[Person, "Harry Q. Bovik"];
mark_ DeclareEntity[Person, "Mark Brown"];
rick_ DeclareEntity[Person, myName];
nori_ DeclareEntity[Person];
SetName[nori, "Nori Suzuki"];
-- Data can be assigned with SetP, SetF, or DeclareRelship's initialization list:
SetP[harry, phoneIs, I2V[4999]];
SetP[mark, phoneIs, I2V[4464]];
SetP[nori, phoneIs, I2V[4425]];
SetP[rick, phoneIs, I2V[4466]];
SetP[rick, ageProp, I2V[29]];
t_ DeclareRelship[Friend];
SetF[t, friendOf, rick];
SetF[t, friendIs, harry];
[]_ DeclareRelship[Friend, LIST[[friendOf, nori], [friendIs, harry]]];
END;

InsertMoreData: PROCEDURE =
BEGIN
t: Relship;
rick: Entity;
ok: BOOL_ FALSE;
tty.Put[rope["Creating more data..."], char[CR]];
-- Create some new entities and fetch an old one (rick):
wsh_ DeclareEntity[Person];
rick_ DeclareEntity[Person, "Rick Cattell", OldOnly];
rock_ DeclareEntity[Thing, "Harry the rock"];
pebble_ DeclareEntity[Thing, "Fred the pebble"];
george_ DeclareEntity[Frog, "George the frog"];
[]_ DeclareEntity[Frog, "Larry the frog"];
-- Use SetName and SetP to assign names and heights:
SetName[wsh, "Willie-Sue H"];
SetP[wsh, heightProp, S2V["medium"]];
SetP[rick, heightProp, S2V["medium"]];
SetP[rock, heightProp, S2V["big"]];
SetP[pebble, heightProp, S2V["little"]];
SetP[george, heightProp, S2V["little"]];
-- Check that a Person can't be a friend of a frog:
t_ DeclareRelship[Friend];
SetF[t, friendOf, rick];
SetF[t, friendIs, george ! WrongAttributeValueType => ok _ TRUE];
IF NOT ok THEN ERROR;
END;

The following routine illustrates the destruction of entities and domains.
DestroySomeData: PROCEDURE =
-- Destroy one person entity and all frog entities
BEGIN
tty.Put[char[CR], rope["Deleting some data: Rick and all the frogs..."], char[CR]];
DestroyEntity[DeclareEntity[Person, "Rick Cattell", OldOnly]];
DestroyDomain[Frog];
END;

The following routine prints all of the people in the Person domain, by doing a DomainSubset retrieval with no constraints.

PrintPeople: PROC =
-- Use DomainSubset with no constraints to enumerate all Persons
BEGIN
person: --Person-- Entity;
es: EntitySet;
lint: LONG INTEGER;
tty.Put[char[CR], rope["PersonDomain:"], char[CR], char[CR]];
tty.Put[rope["Name Phone"], char[CR]];
es_ DomainSubset[Person];
WHILE (person_ NextEntity[es])#NIL DO
tty.Put[rope[GetName[person]], char[TAB]];
lint_ V2I[GetP[person, phoneIs]];
tty.Put[int[lint], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];
END;

The following routine prints all elements of the Thing domain using DomainSubset, which automatically enumerates sub-domains as well.

PrintEverything: PROCEDURE =
-- Use DomainSubset to enumerate all Things (includes Frogs and Persons).
-- Print "[unknown]" if the height field is the null string.
BEGIN
thing: --Thing-- Entity;
es: EntitySet;
r: ROPE;
tty.Put[char[CR], rope["Things:"], char[CR], char[CR]];
tty.Put[rope["Type Name Height"], char[CR]];
es_ DomainSubset[Thing];
WHILE (thing_ NextEntity[es])#NIL DO
tty.Put[rope[GetName[DomainOf[thing]]], char[TAB]];
tty.Put[rope[GetName[thing]], char[TAB]];
r_ V2S[GetP[thing, heightProp]];
tty.Put[rope[IF r.Length[]=0 THEN "[unknown]" ELSE r], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];
END;

The following routine prints the people in the Person domain with a phone number in the range 4400 to 4499.

PrintCSLPhones: PROC =
-- Use RelationSubset to enumerate phones between 4400 and 4499
BEGIN
rs: RelshipSet;
r: Relship;
tty.PutF["\nPhone numbers between 4400 and 4499:\n"];
tty.PutF["Name Phone\n"];
rs_ RelationSubset[Phone, LIST[[phoneIs, I2V[4400], I2V[4499]]]];
UNTIL (r_ NextRelship[rs])=NIL DO
tty.PutF["%g %g\n", rope[GetFS[r, phoneOf]], rope[GetFS[r, phoneIs]]];
ENDLOOP;
ReleaseRelshipSet[rs];
END;

The following routine prints all of the people in the Person domain that are related through the Friend relation to the entity named harry.

PrintFriends: PROC =
-- Use RelationSubset to enumerate friends of Harry
BEGIN
friendRS: --Friend-- Relship;
friendName: --Person-- ROPE;
rs: RelshipSet;
tty.Put[char[CR], rope["Harry's friends:"], char[CR]];
rs_ RelationSubset[Friend, LIST[[friendIs, harry]]];
WHILE (friendRS_ NextRelship[rs])#NIL DO
friendName_ GetFS[friendRS, friendOf];
tty.Put[rope[friendName], rope[" "], char[CR]];
ENDLOOP;
ReleaseRelshipSet[rs];
END;

The main program opens the default database, calls the above routines, and calls CloseDatabase to cause the updates to be made.

tty.Put[rope["Creating database..."], char[CR]];
[]_ OpenDatabase[];
[]_ OpenTransaction[];
[]_ OpenSegment[];
Initialize[];
InsertData[];
PrintPeople[];
PrintCSLPhones[];
PrintFriends[];
PrintEverything[];
InsertMoreData[];
DestroySomeData[];
PrintEverything[];
CloseDatabase[];
tty.Close[];
END.

6. Database design

This section is a discussion of database design: the process of representing abstractions of real-world information structures in a database. This is an important but poorly understood topic, just as the design of programs is; don't expect any startling revelations in these pages.
To some extent the discussion here is specialized to the data structures available in the Cedar database package.

What are the properties of a well-designed database? To a large extent these properties follow from the general properties of databases, as discussed in Section 2. For instance, we would like our databases to extend gracefully as new types of information are added, since the existing data and programs are likely to be quite valuable.

It may be useful to consider the following point. The distinguishing aspect of information stored in a database system is that at least some of it is stored in a form that can be interpreted by the system itself, rather than only by some application-specific program. Hence one important dimension of variation among database designs is the amount of the database that is system-interpretable, i.e. the kinds of queries that can be answered by the system.

As an example of variation in this dimension, consider the problem of designing a database for organizing a collection of Mesa modules. In the present Mesa environment, this database would need to include at least the names of all the definitions modules, program modules, configuration descriptions, and current .Bcd files. A database containing only this information is little more than a file directory, and therefore the system's power to answer queries about information in this database is very limited. A somewhat richer database might represent the DIRECTORY and IMPORTS sections of each module as collections of tuples, so that queries such as "which modules import interface Y?" can be answered by the system. This might be elaborated further to deal with the use of individual types and procedures from a definitions module, and so on. There may be a limit beyond which it is useless to represent smaller objects in the database: if we aren't interested in answering queries like "what procedures in this module contain IF statements?", it may be attractive to represent the body of a procedure (or some smaller naming scope) as a Mesa data structure that is not interpretable by the database system, even though it is stored in a database.

We shall illustrate a few more database design ideas with the following problem: design a database of information about a collection of documents. Our current facilities, which again are simply file directories, leave much to be desired. The title of a document on the printed page does not tell the reader where the document is stored or how to print a copy. Relationships between different versions of the same basic document are not explicit. Retrievals by content are impossible. Our goal here is not to solve all of these problems, but to start a design that has the potential of dealing with some of them.

Each document has a title and a set of authors. Hence we might represent a collection of documents with a domain whose entities are named by document title, and an author property specifying the authors:

Document: Domain = DeclareDomain["Document"];
dAuthors: Property = DeclareProperty["author", Document, StringType];

Here the authors' names are concatenated into a single string, using some punctuation scheme to allow the string to be decoded into the list of authors. This is a very poor database design, because it does not allow the system to respond easily to queries involving authors; the system cannot parse the encoded author list.
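To see the difficulty concretely, consider what a client must do under this design to find the documents by a given author. The following sketch is hypothetical: AuthorListed stands for client code that must itself decode the concatenated string, since the database system cannot, and we reuse the tty stream of Section 5 for output.

FindByAuthor: PROC [authorName: ROPE] =
BEGIN
d: --Document-- Entity;
es: EntitySet _ DomainSubset[Document]; -- must enumerate every document
WHILE (d _ NextEntity[es]) # NIL DO
IF AuthorListed[authorName, V2S[GetP[d, dAuthors]]] THEN
tty.Put[rope[GetName[d]], char[CR]];
ENDLOOP;
ReleaseEntitySet[es];
END;

No index can help here; the search is linear in the number of documents.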
Another property of this declaration of authors may be viewed as an advantage or a drawback: authors are strings, so anything is acceptable as an author. This weak typing gives some flexibility: the database will never complain that it doesn't know the author you just attached to a certain document. The corresponding drawback, as you might expect, is that the system is not helpful in catching errors when a new document is added to the database. If "Mark R. Brown" is mistakenly spelled "Mark R. Browne", then one of Mark's papers will not be properly retrieved by a later search.

A step in the direction of stronger type checking is to provide a separate domain for authors. To represent authors as entities, and to allow a variable number of authors per document, a better design would be:

Person: Domain = DeclareDomain["Person"];
author: Property = DeclareProperty["author", Document, Person];

Incidentally, we define a property rather than a relation for brevity here. Instead of the author property declaration we could have written:

author: Relation = DeclareRelation["author"];
authorOf: Attribute = DeclareAttribute[author, "of", Document];
authorIs: Attribute = DeclareAttribute[author, "is", Person];

The property declaration has exactly the same effect on the database as the relation declaration, since it automatically declares an author relation with an "of" and an "is" attribute. However, the relation is not available in a Cedar Mesa variable in the property case, so operations such as RelationSubset cannot be applied to it (thus applications that use DeclareProperty will not use RelationSubset directly, but will use operations such as GetP and SetP). In either case, we have one Document entity per document, plus, for each document, one author relationship per author of that document. Each author relationship points to its Document entity via the entity-valued "of" attribute of the author relation. Now a search of the author relation can be used to locate all documents authored by a given person.

Documents have other interesting properties. Some of these, for example the date on which the document was produced, are in one-to-one correspondence with documents. Such properties can be defined by specifying a relation or property as being keyed on the document:

publDate: Property = DeclareProperty["publ-date", Document, StringType, Key];

We are using the convention that domain names are capitalized, and relation, attribute, and property names are not, both for the Cedar Mesa variable names and for the names used in the database system itself. When the database system is integrated with Mesa, the Mesa and database names will be one and the same. Other conventions are desirable, for example prefixing all relation names with the name of the database application in which they are used, to avoid confusion when databases overlap.

A few entities and relationships from a database of documents, defined according to the declarations above, are shown graphically in Figure 4. <==
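With the relation form of the declarations above, the retrieval that defeated the string-encoded design becomes a single query the system can answer. The following is a sketch: the procedure is our own illustration, and it assumes the named person already exists in the database.

WhoWroteWhat: PROC [personName: ROPE] =
BEGIN
p: --Person-- Entity _ DeclareEntity[Person, personName, OldOnly];
rs: RelshipSet _ RelationSubset[author, LIST[[authorIs, p]]];
r: Relship;
WHILE (r _ NextRelship[rs]) # NIL DO
tty.Put[rope[GetFS[r, authorOf]], char[CR]]; -- the title of one of p's documents
ENDLOOP;
ReleaseRelshipSet[rs];
END;

Here the system locates the author relationships directly, and, if an index exists on the "is" attribute, it need not scan the whole relation.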