Seven Ideas Rationale June 16, 1989 1:34:52 pm PDT Seven Ideas Rationale Bob Hagmann I extracted the seven ideas from the Blue and White draft. The paragraphs in italics at the end of each section is a rationale for the idea. 1. Nodes ({ documents or objects without semantics) In keeping with the theme of literature, nodes in the database are also called documents. A node may not have a real world analogy that the reader might consider a document. The node's contents might be the number three. The node might be a source or object "file." The system can handle a large number of nodes (e. g., millions). A node can be of any size: one byte long, or a gigabyte. Nodes (or node parts) will be archived to on-line tertiary storage (e. g., an optical disk jukebox). Nodes have a contents. The contents of a node is a primitive value: it is a type and a value (of that type). Only a dozen or so types are understood by the server. For example, strings (any length), dates, floats (32 bit), and integers (32 bit and infinite precision) are supported. In addition, other types may be stored without the server understanding what the value means. For example, ODA documents and compressed scanned images can be stored, but the server treats them as uninterpreted bytes. The client code, not running on the server, is expected to provide the semantics for these types. The type space is fairly large (almost 32 bits) and is administrated by a database administrator without much support from the system. The notion of an object, node, document, entity, or file seems pervasive in many of the areas that we want this system to function. If we want to support hypertext, file servers, and object-oriented databases then this seems like the right place to start with the model. Also, Xerox's document processing strategy seems to be quite document based. 2. Typed links Nodes can be connected by links (hypertext) [Bush45, Engl??, Nels81]. A link is a directed pointer between two nodes. The link has a type, which is just a string. An inlink or a to link are links that point at a node, while an outlink or a from link are links leaving a node. Nodes can efficiently determine both their inlinks and outlinks. Although links are directional, they can be followed in either way. Links are themselves nodes, so they can have attributes. Yggdrasil provides only node to node links. If a client wants a link to point at part of a node, then the client has to encode this as a property on the link. Not only is the notion of node (or whatever) important, but also the relationships between nodes. Links are a very flexible and simple notion to have binary relations between nodes. Links are directional since most relations are not reflexive. Links are nodes (or at least can be reified as nodes) for three reasons. First, sometimes the dual view of the "schema" is important. That is, the entities and relationships are backwards: the entities should be relationships and the relationships should be entities. Without the links being nodes this is hard to model. Second, it is sometimes useful to have links as nodes. And finally, the system only understands node to node links. It does not understand links inside of nodes. Sub-node links are a very desirable feature, but it seems to require some semantic understanding of the node itself. Since we do not have full knowledge of the semantics of all objects, it is hard to make a sub-node link that can be maintained by the system. Hence, this feature is left to the client. The client needs some place to hang client specific data about the link. Hence, it seems natural to have the link have properties (see next section). 3. Nodes have properties ({ attributes) Nodes have a set of attributes or properties. Both terms are used interchangeably. The attributes of a node are a set of attribute name and attribute value pairs. The attribute name is just a string that identifies the attribute. The attribute value is a set of fields. Each field has a field name and a field value. The field value is a set of primitive values. Each of the above sets may be ordered or unordered (default is unordered). There are several parts to this. Let me work my way up. Primitive values: A primitive value is a type and value pair. Having the type explicit seems lots cleaner than hiding it as the first word of the value (as is done in UNIX files). Only a few of the types are understood by the system. They are the boring types that we all see from databases (at least string, date, float, and integer) and extensions (such as big string, microsecond time, double precision float, infinite precision integer). Maybe there are others such as decimal and money. There are also types from the data model such as link or reference (a document id). However, the type system is meant to be somewhat extensible. The type is a 32-bit value, with only 1024 of it reserved. Applications are supposed to grab a type and use it. The idea here (by example) is that Yggdrasil really, really does not want to understand ODA documents. So clients store ODA documents with a client supplied type that means ODA. We just mess with the bits and don't understand the internals. (I think that not interpreting some of the stored data is a key design point). Object-oriented systems think of their objects as having a class or type, as well as some state. Primitive values as type and value pairs is meant to model this. Sets of primitive values as a field value: Why not have a primitive value as the field value instead of a set? I don't have a real firm answer, but it seems to enable several other things so that it seems useful. Ordered sets give n-ary relations as a field value (see next subsection). Sets instead of singletons seems like a good idea. This has been proposed for relational systems. It makes lots of modeling easier. You can get by with a unique id generator and an auxiliary relation (stick the id in the field in the table, and join with the auxiliary relation to get the set), but you have view update problems and poor execution time (the bits are in the wrong place and you have to do this join - sure, it's a real easy join, but ...). See "Three levels of sets" below. Ordering: Ordering can be applied to any of the sets. Sometimes it is useful. An ordered field value is a n-tuple. This breaks the binary relationship restriction imposed by links. When you need n-ary relations, you can use ordered sets. Ordering is also a natural for many applications. Relational systems are guaranteed to be unordered. This is a problem for documents where we really do want to see chapter 1 first, for example. Attribute and field names: Naming of attributes is common in systems we are trying to supplant or make persistent. File servers have named attributes. Object oriented systems have named slots. So attributes should have names. By the way, names do not have to be unique for a node. Fields were named to allow for an attribute called "keywords" to have fields that were the keywords themselves. So the field name would be the keyword and the field value was the set of positions in the document. Naming attributes and fields should make it easier to filter out attributes to select those necessary to do a query. Both attribute and field naming is optional. Three levels of sets: A node has a set of named attributes. Each attribute has a set of named fields. Each field is set valued. What's so magic about three? Why not one or two? Why restrict it to three? Three levels of sets were needed so that an attribute called "keywords" to have fields that were the keywords themselves. So the field name would be the keyword and the field value was the set of positions in the document. 4. Containers group nodes Yggdrasil is designed to hold millions of nodes. Without some help, it is easy to get "lost in (hyper) space." Both databases and file systems have grouping mechanisms. Relational databases have relations that group similar nodes together. File systems have (hierarchical) directories that are a well accepted improvement over a flat name space. Some sort of context or grouping is needed by any large system. Containers are the context mechanism in Yggdrasil. A container is a (possibly ordered) set of nodes. A container is associated with a root node of the container (analogous to directory in a file system). A node can belong to any number of containers. In addition, a container can be a sub-container of another container. A container can be the sub-container of any number of containers. Loops are allowed in the sub-container relation (i. e., it is a directed graph, not just a DAG). The rationale for containers is simple. Gathering nodes together to form a higher level cluster is a very useful function. Relational databases and file systems both do this. If we want to be big, we got to try every trick we can think of to deal with scale. The question is "can you emulate containers with links"? The first problem with links is that they are nodes. This is more than is required. It makes the implementation harder, since we don't want to pay too much for the links in this case. We want to make links without attributes cheap anyway, so this is not a real big deal. The second problem with links is that by making containers special, different semantics can be associated with them. In particular, currently the indeces kept by the system are built off of the container structure. If we built indices based on any and all kinds of links, then whenever an object changed we would have to examine all of the links (running backwards) to do the index update. restricting it to just containers decreases the index overhead during update. It costs in that now there is only one notion of contains. You have to build one container each for each notion you want. The real reason that it is a separate notion is that to me it is a first class notion. It may share lots of code with links, but so what? It's really part of the way to do business. 5. Set oriented non-navigational access One of the powers of relational databases is the non-navigational, relational access to data. Relations are just sets of similar items. The access to data is non-navigational in that the client characterizes what the matching data looks like, as opposed to providing the systems with an (imperative) program to get the data. Sometimes this is called non-procedural access. Both navigational and non-navigational access are desirable. Hypertext systems are outstanding for browsing (navigating), but are not good at finding starting points for browsing. Containers can be configured to automatically maintain indices. For any attribute, all nodes that can be reached in the transitive closure of the containment graph starting at the container are indexed. Index maintenance is specified by the "owner" of the subtree or by the database administrator. An index at a level replicates data of indices at lower levels. We intend that the query language will use the notion of container, so that often the indices can be used during query evaluation. Filters (similar to SELECT and join in a RDBMS) will be used to restrict the results of expansions. For a large system, this is a necessity. It is one of the key ideas from relational systems. Much of this would be applying the ideas and techniques from relational systems to hypertext. There are some unique problems such as graph matching. 6. Documents can be named Any node can act as a directory. It is natural to expect the root nodes of containers to be directories, but the system does not make this restriction. A special system maintained property provides the name and node binding. Not all nodes have to be named and a node can have many names. Naming can also help with the "lost in (hyper) space" problem. However, a major consequence of providing naming is that Yggdrasil can provide a view of a database that is a file system or file server. A file server front-end can be built that talks a file server protocol, and provides access to some of the data in the server. Links are not modeled in all file server protocols, so they usually disappear from the file server view. A particular file server protocol may have a restricted notion of attributes (e. g., a fixed number of them of fixed types) or the protocol may not type the contents of files. Directories may or may not have contents. In any case, the file server front-end provides a view of the database that is suitable for programs that are designed to talk to a file server. A few front-ends are anticipated. NFS, Xerox's NS Filing Protocol, and ANSI/CCITT/OSI Filing and Retrieval are all candidates for front-ends. A server is not restricted to only one front-end filing protocol. One prime benefit of this is to unify file systems and databases. There is one permanent storage mechanism. Different views are provided for different software. It is possible to maintain information in the database, and "publish" it via the naming system. Naming is part of the "roll as many types of systems into one" religion. With naming we can do file servers. 7. Versions and alternatives for nodes Nodes can have have versions, revisions, and alternatives. A node, then, is really a tree of states. States can differ in any of contents, properties, links, containers to which it belongs, and its value as a container. A new branch may be started from any state in the tree. This is called an alternative. Major changes along a branch are called versions while minor changes are called revisions. Changes are (usually) applied at the leaves of the tree. A new version or revision is built whenever a transaction commits that changes some part of the node. The version or revision is added to the version chain for the current alternative. A new version collapses the revisions between it and the previous version. Yggdrasil will use some "delta" and compression techniques. The system will be optimized for access to the current version. Yggdrasil will not directly provide for the merging of alternatives and/or versions. Client level code can do the merge in an application specific manner for a container. Versions and alternatives are a requirement for many systems. File servers files, documents, and CAD objects all benefit from having versioning. One question is whether the server has to do it. If you want for the server to be able to do historical queries, then it has to understand enough to do the queries. Hence, the server has to have at least a simple notion of versions. One requirement is that the server has to be able to do versions like an IFS or an XNS file server. There has to be some trick or layer that makes this all possible. It can hide in the file server interface code, but it has to be somewhere. A strong desire is that significant data compression can be achieved for may types of large storage objects. Text versions are easy to compress, while object files output by a compiler typically do not compress well from version to version. Depending on the mix of data stored on the server, deltaing between versions internally on the server can be a big win. The current design has all the versions and alternatives folded into the object storage. Looking at an object you can see it's whole version chain. The alternatives have to be named (if I recall correctly). That creates yet another name space to be managed (this is a deficit) but I can't get all excited about that. Instead of the indirect map Hector was talking about, the current design uses the objects themselves together with a context that defines the current way to select the version/alternative. The same effect is achieved, but without the addition of extra objects and their copying, plus the overhead of the map, and the additional complexity of map management. I don't have a strong opinion on the implementation here. Hector's is nice since there is a place you can look to see what an alternative effects. Mine is nice since the objects wholly contain their versions. Both have naming or management problems. Why not triples?: A triple folds too much into the "type." The type defines "class" of the triple (e. g., keyword) as well as the "types" of the key and data items. Note that the two uses of type are different in the last two sentences. Hence, the type of the triple defines that the key will be a string. That's a pain. And very anti-object-oriented. Order is missing. It has to be synthesized in various ways. Making a linked list for order is a problem if you want to share the parts of the list between several top level nodes. Putting in numbers for the order in the data item wastes the data item. You are down to one real value per triple (the type doesn't count that much). You have no natural way to model n-ary relations. You could point at another object that encoded the n-ary relation in its triples, but this lacks locality and would make a mess of the query language. Without links being nodes, there is no natural way to model sub-node links and no natural way to express the dual schema. Key points remaining: Schemas and invariants Views Complex objects Locking Long lived transactions Object-oriented database storage system