Yggdrasil Query Language Proposal June 22, 1989 8:18:04 am PDT Yggdrasil Query Language Proposal Bob Hagmann Here's a shot at it. 1 Data model Note that versions and alternatives are not discussed here. The selection of the version and alternative could be via an external environment, so might be orthogonal. 1.1 Nodes ({ documents or objects without semantics) Nodes have a contents. The contents of a node is a primitive value. It is a singleton list of the attribute $contents (see §1.3). 1.2 Typed links Nodes can be connected by links (hypertext). A link is a directed pointer between two nodes. The link has a type, which is just a string. An inlink or a to link are links that point at a node, while an outlink or a from link are links leaving a node. Nodes can efficiently determine both their inlinks and outlinks. Although links are directional, they can be followed in either way. Links are themselves nodes, so they can have attributes. Yggdrasil provides only node to node links. If a client wants a link to point at part of a node, then the client has to encode this as a property on the link. Links are special properties of a node (see §1.3). They have attribute names $inlinks and $outlinks respectively. They are encoded as a two level list. The first level is a list for each link type. Each first level list has a field name matching the link type. The second list is a list of document ids. 1.3 Nodes have properties ({ attributes) Nodes have a set of attributes or properties. Both terms are used interchangeably. The attributes of a node are a set of attribute name and attribute value list pairs. The attribute name is just a string that identifies the attribute. It does not have to be unique for the node. An attribute value list is a sequence of attribute value lists and primitive values. Each attribute value list has a special list item called its name. Attribute value list names are always strings. A primitive value is a type and value pair. The type is a descriptor for the value. Only a dozen or so types are understood by the server. For example, strings (any length), dates, floats (32 bit), and integers (32 bit and infinite precision) are supported. In addition, other types may be stored without the server understanding what the value means. For example, ODA documents and compressed scanned images can be stored, but the server treats them as uninterpreted bytes. The client code, not running on the server, is expected to provide the semantics for these types. The type space is fairly large (almost 32 bits) and is administrated by a database administrator without much support from the system. Note that is a fundamental change in what has been the model up to now. There is no depth to the recursion in the attribute values. 1.4 Containers group nodes Containers are the context mechanism in Yggdrasil. A container is a (possibly ordered) set of nodes. A container is associated with a root node of the container (analogous to directory in a file system) (where the information about membership is kept in the $children attribute in the root node) A node can belong to any number of containers. In addition, a container can be a sub-container of another container (where the information about membership is kept in the $child-containers attribute). A container can be the sub-container of any number of containers. Loops are allowed in the sub-container relation (i. e., it is a directed graph, not just a DAG). 1.5 Documents can be named Any node can act as a directory. It is natural to expect the root nodes of containers to be directories, but the system does not make this restriction. A special system maintained property, $directoryContents, provides the name and node binding. Not all nodes have to be named and a node can have many names. Naming a node that is a root node means naming a subdirectory. 2 Query language motivation There are three motivations for the query language: 1) Perform a subset of SQL 2) Bring in the notion of object into the query language 3) Incorporate the primitive notions in the Yggdrasil data model into the query language. Hence, links, set valued attributes, and containers are directly representable in the language. Note that is a fundamental change in what has been proposed before. The system now wants to provide support to: 1) Hypertext  this is the basic model 2) File server  not all that applicible at the query level, but needed in the data model 3) Object-oriented database storage backend  again, not that applicible at the query level, but needed in the data model 4) Relational database  this is where the subset of SQL comes in. This is not an attempt to have a fully relational database, but only some features of it available. For example, I am not proposing referential integrity be guarenteed. 2.1 Subset of SQL When I visited IBM, Bruce Lindsay's major comment (I paraphrase) was that any system that is trying to supplant another should have a plan of how to do the functions of the old system. So for us, that means how is Yggdrasil going to do SQL? I think the answer is in architecting a SQL subset that can then be extended into our query language. That is, we should look at SQL (or the relational algebra - syntax is not the big thing here) and decide what makes sense for us. From this subset, enhance it with the things we need to do our job. If a client just wants relational, then the client can just restrict its use of the system to the SQL subset (provided that the subset is enough). A SQL compatibility layer might be constructed to allow real SQL use of the system. I don't think we are in the business of building the fastest relational system. However, giving basic support to this way of dealing with some of your data makes sense. 2.2 Objects in the query language Where relational databases and the systems we want to support differ is in the concept of object. File servers, object-oriented databases, and hypertext systems all have objects. Relational databases have tables and fields. The elements in these fields are atomic values. Object identity (tuple id's or tid's) is not preserved over certain operations such as join. What comes out is just another table with fields and atomic values. Since Yggdrasil is an object based system, it has to understand objects in the query language. But this has to be integrated with the other ways to deal with queries. 2.3 Yggdrasil primitives The query language should incorporate the primitive notions in the Yggdrasil data model in the query language. Hence, links, set valued attributes, and containers are directly representable in the language. 3 Query concepts 3.1 Sets Set appear again and again in the data model. It is natural that they show up in the query model too. A tuple is a ordered set of primitive values. A node set is an unordered set of tuples. Some examples of node sets of objects are a container (where the singleton tuples of the nodes in the containter are in the node set) or a set formed from the singleton of a single object. S1 = { , , } (A note on notation: "{}" means unordered set and "<>" means an ordered set (a list or sequence)) Here the dij's are primitive values which are often node ids, but can be of any type (e. g., integers or strings). The dij's are not sets! The dij's internals are addressed positionally within the tuple, although some higher level may use naming to make this more human sensible and less susceptible to bugs. The basic idea is to start with some number of a node sets (of objects?). Process those node sets to produce new node sets. These node sets usually are only ephemeral, but can be saved as a property of a node on the server. New node sets can be constructed by normal set operations (union, difference, and intersect), by select, by joins, or by dereferencing. What's going on here? Isn't S1 just a relation? No it isn't. First of all, the dij's can be references to nodes. They are not necessarily atomic values as required by the relational model. Second, there is no guarantee that items in the same position are of the same type. Third, things are dealt with positionally where relational systems deal with attributes by name. Finally, the size of each tuple is not necessarily the same. 3.2 Relational operations on sets Normal set operations are real simple. "S1 UNION S2" produces the set S3 where a tuple is in S3 iff it is in S1 or S2. Difference and intersect are similarly defined. Select is a filtering operation where an old node set is restricted, via a Condition, to construct a new node set. For example, a new node set can be constructed by restricting an existing node set based on the properties of the members of the existing node set. The new node set is a subset of the old node set. "S5 JOIN S6 WHERE Condition" produces the node set S7 where a tuple, r, is in S7 iff it is constructed from tuples t1 and t2 from S5 and S6 respectively and t1 and t2 satisfy the condition. If the number of primitive values in t1 is n and the number of primitive values in t2 is m, tuple r has primitive values 1 thru n from t1 and primitive values n+1 thru n+m from t2, where all of these are copied in order. Join can be combined with Project to remove and reorder the primitive values in the result. Conditions are discussed below. Project produces a new node set from an old node set where some of the fields of the ordered tuples have been removed and/or reordered. For example, projecting S1 above on fields 1 and 2 and swapping the fields would give: { , , ..., } If a container is used as a starting point for some operation, then it is viewed as a set of singleton tuples for the documents in the container. S2 = {, , ...> 3.3 Dereferencing links Dereferencing involves looking at the links and attributes of a primitive value for a node set. Not all primitive values can be dereferenced (e. g., an integer can't be dereferenced). For notation, I'm going to use the symbol "^" for explicitly following one level of dereferencing. Dot notation is used to select attributes. Assume that S is a node set of singleton object refereneces (e. g., a container). To construct a new node set R from S where we follow all the refersTo links and only take those who have a co-author of Hagmann: SELECT S1.$outlinks.refersTo ^ FROM S WHERE S1.$outlinks.refersTo^author Using set notation this means: O (S For links, each tuple of the node set is examined. The various primitive values in the tuple are the domain for the start of the dereferencing. Links can be followed in two ways: to the link itself (" selectively walked (see Conditions below). For example, only follow "author" links or only follow links that have the "linkPage" property set to "1". 3.3 Transative closure Transative closure for a query Q starting from node set D is written: Transative closure means to run the fullowing algorithm until two iterations produce the same result. Take the result of the previous iteration or D for the first iteration, and the query Q, and produce a new node set E via standard query exectuion. The result of the iteration is D Transative closures can be iterated to a fixed depth (depth three is denoted as reached ( not guaranteed to exist for all queries (sigh). Maybe a "leaves" only form of transative closure should be included. The normal algorithm is run, but any tuple that spawns another is marked. The union operation treats marked as sticky: two identical tuples appart from the mark may be reduced to one tuple with a mark. All marked tuples are deleted from the result. 3.4 Attribute and field selection When a primitive value that refers to a node, then a form of dereferencing can also be applied to the attributes of the node. For tuple t = , if t2 is a reference to a node, then it can be examined. t2.attributeName dereferences t2 and selects from it the attributeName property. As this property is set valued (if it exists for the node), each member of the set is considered seperately. The member can be further dereferenced or be required to pass some condition or be used to compute some expression. Eventually the ground case is reached where the result of the dereferencing is either a primitive value, a set that is not a singleton of a primitive value, or something has gone wrong during evaluation (e. g., the attributeName didn't exist, t2 was not a reference, or the result is set valued). For those evaluations that produce a primitive value or a singleton set of a primitive value, then one of the results of the evaluation is the primitive value. 3.5 Expressions There has to be some kind of expression evaluator that makes use of the same ideas in dereferencing. 3.6 Conditons xxx 3.7 Aggregation SQL provides COUNT, SUM, AVG, MAX and MIN functions on fields. Yggdrasil should do at least this. We also need SIZE (for size of a set). COUNT[S] = number of tuples in node set S SUM[Si] = sum of values in position "i" in all tuples in S 3.8 Update Just like SQL? In essence, you run a query to select those tuples of interest, and then set attributes in them. Is this OK? 3.9 Examples 3.9.1 Query 1: Transitive closure to leaves The Documents: Book documents point to one or more chapter documents. A chapter document points to sections. Sections point to (sub)sections and/or paragraphs. Paragraphs have a contents (a long string) and keywords (a set of short strings). The Query: Given book X, find all paragraphs in it containing the keyword "cat". O (S T _ O*({X}); u Variation: Given book X, find all paragraphs in it containing the keywords "cat" and "dog". Same as above but replace {"cat"} with {"cat", "dog"} Variation: Given book X, find all paragraphs in it containing the pattern "c?t*" in its contents. Same as above but the last line reads: u 3.9.2 Query 2: Transitive closure, all nodes The Documents: Same as Query 1, except that book, chapters, and sections also have keywords. The Query: Given book X, find all documents in it (of any type) containing the keyword "cat". O (S T _ O*({X}); u 3.9.3 Query 3: Weighted Keyword Search The Documents: A set of papers. Each paper is stored as a single document. Each paper contains a set of keywords. For each keyword, there is a set of (l, w) pairs. The pair (l, w) means that the keyword occurs at location "l" (e.g., byte offset into body of paper) with a weight "w". The wight represents the "importance" of the keyword to that occurrence. The Query: I am given a set of pairs (k, z). In a pair (k, z), "k" is a keyword and "z" is the search weight I want to use. I want to examine the papers in the set. For each paper, I wish to compute the function: sum _ 0; For each in the paper, multiply w by z, where z is search weight corresponding to k; and add to sum if sum > 0.734 then select paper Assume that S is the set of "The Documents" references and W is the "set of pairs (k, z)". S is coded as follows: each keyword K is kept in property name formed by concatenating "$keywords-" and the string K. SELECT S FROM S, W WHERE SUM[ SELECT u2* w2 WHERE u = s1.%k^ AND k = ConCat["$keywords-*", w1] > 0.734 SELECT u2* w2 WHERE u = s1.%k^ AND k = ConCat["$keywords-*", w1] u is a set of keyword location and confidence weight pairs (l, w). u2 is the confidence weight. W is the set of pairs. W1 is the keyword and W2 is the search weight. SELECT * FROM R WHERE R.id = Keywords.id AND SUM[W.weight * SELECT Keywords.weight FROM Keywords.id = R.id AND W.keyword = Keywords.keyword] > 0.734 r FOREACH[f = s1.$keywords ^ | f.name = "cat" O (S T _ O*({X}); u 3.9.4 Query 4: Join The Documents: A set of departments. Each department document has a property where "str" is the name of the manager of the department. There is also a set of employees. Each employee has a property representing the name of the employee's manager. There are no outgoing links for department or employees. The Query: For each employee, department pair that share a common manager, create a tuple . Return the set of pairs obtained. Variation: Instead of creating a pair , create a new document that contains the attributes of both e and d. Variation: Assume employee documents have salaries. Find all employees that make more than their managers. Or try any other relational type query. But remember that all associations are via names, not via links. 4 Relational algebra This section should compare and contrast the above model and relational algebra. 1.3.1 SELECT Extract specified tuples from a specified relation. 1.3.2 PROJECT Extract specified attributes from a specified relation . Not applicable in an object oriented system (?). Yggdrasil's project tosses whole objects out of node sets, not just some fields. 1.3.3 PRODUCT Cross product of two relations formed by pairs of tuples being concatenated. 1.3.4 UNION 1.3.5 INTERSECT 1.3.6 DIFFERENCE 1.3.7 JOIN Cross product of two relations formed by pairs of tuples being concatenated where the two tuple satisfy some condition. 1.3.8 DIVIDE: n/a