Here's a shot at a query language. I think that we should be concerned that the concepts are OK, not how good the language is. This proposal retains "the three levels of sets" for attributes.
Section 1 has a review of the data model. Section 2 motivates the choises for the query language. Section 3 introduces the query concepts. Section 4 shows how the query language can solve the sample queries Hector proposed. In an appendix is a brief summary of the Relational Algebra.
1 Data model
This is mostly a review and to make this document mostly self contained. Note that versions and alternatives are not discussed here. The selection of the version and alternative could be via an external environment, so it might be orthogonal.
1.1 Nodes (
{ documents or
objects without semantics)
Nodes have a contents. The contents of a node is a primitive value. It is a singleton list of the attribute $contents (see §1.3).
1.2 Typed links
Nodes can be connected by links (hypertext). A link is a directed pointer between two nodes. The link has a type, which is just a string. An inlink or a to link are links that point at a node, while an outlink or a from link are links leaving a node.
Nodes can efficiently determine both their inlinks and outlinks. Although links are directional, they can be followed in either way. Links are themselves nodes, so they can have attributes. Yggdrasil provides only node to node links. If a client wants a link to point at part of a node, then the client has to encode this as a property on the link.
Links are special properties of a node (see §1.3). They have attribute names $inlinks and $outlinks respectively. They are encoded as a two level list. The first level is a list for each link type. Each first level list has a field name matching the link type. The second list is a list of document ids.
1.3 Nodes have properties (
{ attributes)
Nodes have a set of attributes or properties. Both terms are used interchangeably. The attributes of a node are a set of attribute name and attribute value list pairs. The attribute name is just a string that identifies the attribute. It does not have to be unique for the node. The attribute value is a set of fields. Each field has a field name and a field value. The field value is a set of primitive values. Each of the above sets may be ordered or unordered (default is unordered).
A primitive value is a type and value pair. The type is a descriptor for the value. Only a dozen or so types are understood by the server. For example, strings (any length), dates, floats (32 bit), and integers (32 bit and infinite precision) are supported. In addition, other types may be stored without the server understanding what the value means. For example, ODA documents and compressed scanned images can be stored, but the server treats them as uninterpreted bytes. The client code, not running on the server, is expected to provide the semantics for these types. The type space is fairly large (almost 32 bits) and is administrated by a database administrator without much support from the system.
This proposal sticks to the three level model. With three levels, the common case of an auxiliary table used to simulate repeating groups can be modeled as part of the object. The benefit if the elimination of the otherwise useless auxiliary table. Not all cases of the use of auxiliary tables are eliminated! Where the auxiliary table serves some other purpose (e. g., it is used to indicate common structure) or if the repeating group is multi-level, then an auxiliary table is preferable to putting the data "inline" the document. Greater ease of processing (both is CPU and disk I/O) and simplicity of modeling are achieved for many common cases.
1.4 Containers group nodes
Containers are the context mechanism in Yggdrasil. A container is a (possibly ordered) set of nodes. A container is associated with a root node of the container (where the information about membership is kept in the $children attribute in the root node.) Containers are analogous to directory in a file system. A node can belong to any number of containers. A node has a $parents property that tells all the containers the node is part of. In addition, a container can be a sub-container of another container (where the information about membership is kept in the $child-containers attribute). A container can be the sub-container of any number of containers. A container can find its parents by looking at its $parent-containers attribute. Loops are allowed in the sub-container relation (i. e., it is a directed graph, not just a DAG).
1.5 Documents can be named
Any node can act as a directory. It is natural to expect the root nodes of containers to be directories, but the system does not make this restriction. A special system maintained property, $directoryContents, provides the name and node binding. Not all nodes have to be named and a node can have many names. Naming a node that is a root node means naming a subdirectory.
2 Query language motivation
There are three motivations for the query language:
1) Perform a subset of SQL
2) Bring in the notion of object into the query language
3) Incorporate the primitive notions in the Yggdrasil data model into the query language. Hence, links, set valued attributes, and containers are directly representable in the language.
A further requirement is that the basis of the query language also be a reasonable substrate for the other features of the system such as file server attribute queries (e. g., from XNS file servers).
Note that is a fundamental change in what has been proposed before. The system now wants to provide support to:
1) Hypertext — this is the basic model
2) File server — not all that applicable at the query level, but needed in the data model
3) Object-oriented database storage backend — again, not that applicable at the query level, but needed in the data model
4) Relational database — this is where the subset of SQL comes in. This is not an attempt to have a fully relational database, but only some features of it available. For example, I am not proposing referential integrity be guaranteed.
2.1 Subset of SQL
When I visited IBM, Bruce Lindsay's major comment (I paraphrase) was that any system that is trying to supplant another should have a plan of how to do the functions of the old system. So for us, that means how is Yggdrasil going to do SQL?
I think the answer is in architecting a SQL subset that can then be extended into our query language. That is, we should look at SQL (or the relational algebra - syntax is not the big thing here) and decide what makes sense for us. From this subset, enhance it with the things we need to do our job.
If a client just wants relational, then the client can just restrict its use of the system to the SQL subset (provided that the subset is enough). A SQL compatibility layer might be constructed to allow real SQL use of the system.
I don't think we are in the business of building the fastest relational system. However, giving basic support to this way of dealing with some of your data makes sense.
2.2 Objects in the query language
Where relational databases and the systems we want to support differ is in the concept of object. File servers, object-oriented databases, and hypertext systems all have objects. Relational databases have tables and fields. The elements in these fields are atomic values. Object identity (tuple id's or tid's) is not preserved over certain operations such as join. What comes out is just another table with fields and atomic values.
Since Yggdrasil is an object based system, it has to understand objects in the query language. But this has to be integrated with the other ways to deal with queries.
2.3 Yggdrasil primitives
The query language should incorporate the primitive notions in the Yggdrasil data model in the query language. Hence, links, set valued attributes, and containers are directly representable in the language.
3 Query concepts
3.1 Sets
Set appear again and again in the data model. It is natural that they show up in the query model too. A
tuple is a ordered set of primitive values. A
node set is an unordered set of tuples. Some examples of node sets of objects are a container (where the singleton tuples of the nodes in the container are in the node set) or a set formed from the singleton of a single object.
S1 = { <d11, d12, ..., d1n>, <d21, d22, ..., d2p>, <dm1, dm2, ..., dmq>}
(A note on notation: "{}" means unordered set and "<>" means an ordered set (a list or sequence))
(The subscripts are inconvenient on character terminals. Hence, the use of ()'s will replace them in textual queries.)
Here the dij's are primitive values which are often node ids, but can be of any type (e. g., integers or strings). The dij's are not sets! The dij's internals are addressed positionally within the tuple, although some higher level may use naming to make this more human sensible and less susceptible to bugs. The node sets are just a special case of the attributes.
The basic idea is to start with some number of a node sets (of objects?). Process those node sets to produce new node sets. These node sets usually are only ephemeral, but can be saved as a property of a node on the server. New node sets can be constructed by normal set operations (union, difference, and intersect), by select, by joins, by examining properties, or by dereferencing.
What's going on here? Isn't S1 just a relation? No it isn't. First of all, the dij's can be references to nodes. They are not necessarily atomic values as required by the relational model. Second, there is no guarantee that items in the same position are of the same type. Third, things are dealt with positionally where relational systems deal with attributes by name. Finally, the size of each tuple is not necessarily the same.
3.2 Relational operations on sets
Normal set operations are real simple. "S1 UNION S2" produces the set S3 where a tuple is in S3 iff it is in S1 or S2. Difference and intersect are similarly defined.
Select is a filtering operation where an old node set is restricted, via a Condition, to construct a new node set. For example, a new node set can be constructed by restricting an existing node set based on the properties of the members of the existing node set. The new node set is a subset of the old node set.
"S5 JOIN S6 WHERE Condition" produces the node set S7 where a tuple, r, is in S7 iff it is constructed from tuples t1 and t2 from S5 and S6 respectively and t1 and t2 satisfy the condition. If the number of primitive values in t1 is n and the number of primitive values in t2 is m, tuple r has primitive values 1 thru n from t1 and primitive values n+1 thru n+m from t2, where all of these are copied in order. Join can be combined with Project to remove and reorder the primitive values in the result. Conditions are discussed below.
Project produces a new node set from an old node set where some of the fields of the ordered tuples have been removed and/or reordered. For example, projecting S1 above on fields 1 and 2 and swapping the fields would give:
{ <d12, d11>, <d22, d21>, ..., <dm2, dm1>}
If a container is used as a starting point for some operation, then it is viewed as a set of singleton tuples for the documents in the container.
S2 = {<c1>, <c2>, ...<cn>>
3.3 Dereferencing links
Dereferencing involves looking at the links and attributes of a primitive value for a node set. Not all primitive values can be dereferenced (e. g., an integer can't be dereferenced).
For notation, I'm going to use the symbol "^" for explicitly following one level of dereferencing. Dot notation is used to select attributes. Assume that S is a node set of singleton object references (e. g., a container). To construct a new node set R from S where we follow all the refersTo links and only take those who have a co-author of Hagmann:
SELECT S1.$outlinks.refersTo ^ FROM S WHERE S1.$outlinks.refersTo^author ƒ {"Hagmann"}
Using set notation this means:
O (S b R) { <r1> R Ä s S | s1.$outlinks.refersTo ^ = r1 ' r1.author ƒ {"Hagmann"}
For links, each tuple of the node set is examined. The various primitive values in the tuple are the domain for the start of the dereferencing. Links can be followed in two ways: to the link itself ("g"), or to the linked node ("^"). Links may be selectively walked (see Conditions below). For example, only follow "author" links or only follow links that have the "linkPage" property set to "1".
3.3 Attribute and field selection
When a primitive value that refers to a node, then a form of dereferencing can also be applied to the attributes of the node. For tuple t = <t1, t2, ..., tm>, if t2 is a reference to a node, then it can be examined. t2.attributeName dereferences t2 (if needed) and selects from it the attributeName property. A "%" may be used as a prefix to insure that this means selecting an attribute. As this property is set valued (if it exists for the node), each member of the set is considered separately. The member can be further dereferenced or be required to pass some condition or be used to compute some expression.
Eventually the ground case is reached where the result of the dereferencing is either a primitive value or something has gone wrong during evaluation (e. g., the attributeName didn't exist, t2 was not a reference, or the result is set valued). For those evaluations that produce a primitive value or a singleton set of a primitive value, then one of the results of the evaluation is the primitive value.
3.4 Transitive closure
Transitive closure for a query Q starting from node set D is written:
TRANSITIVE CLOSURE [ SEED D IN QUERY]
Transitive closure means to run the following algorithm until two iterations produce the same result. Take the result of the previous iteration or D for the first iteration, and the query QUERY, and produce a new node set E via standard query execution. The result of the iteration is D * E.
Transitive closures can be iterated to a fixed depth (depth three is denoted as TRANSITIVE CLOSURE(3) - not really the transitive closure, fo course) or can be performed until a fixed point is reached. An obvious optimization of fixed depth accumulation is to check for a fixed point. Note that fixed points are not guaranteed to exist for all queries (sigh).
Sometimes a "leaves" only form of transitive closure should be included. The normal algorithm is run, but any tuple that spawns another is marked. The union operation treats marked as sticky: two identical tuples apart from the mark may be reduced to one tuple with a mark. All marked tuples are deleted from the result. For notation:
LEAF TRANSITIVE CLOSURE[ SEED D IN QUERY]
3.5 Expressions
There has to be some kind of expression evaluator that makes use of the same ideas in dereferencing. Addition, multiplication, subtraction, and division are all possible for integers and floats. String operations such as concatenate, pattern match, and pattern match with extraction of a substring. Set based operations are supported.
3.6 Conditions
Conditions are provided that are modeled after Conditions in SQL. Set conditionals (e. g. , ‚, †, î, and =) are also supported.
3.7 Aggregation
SQL provides COUNT, SUM, AVG, MAX and MIN functions on fields. Yggdrasil should do at least this. We also need SIZE (for size of a set).
COUNT[S] = number of tuples in node set S
SUM[Si] = sum of values in position "i" in all tuples in S
3.8 Update
Just like SQL? In essence, you run a query to select those tuples of interest, and then set attributes in them. Is this OK?
4 Examples
4.1 Query 1: Transitive closure to leaves
The Documents: Book documents point to one or more chapter documents. A chapter document points to sections. Sections point to (sub)sections and/or paragraphs. Paragraphs have a contents (a long string) and keywords (a set of short strings).
The Query: Given book X, find all paragraphs in it containing the keyword "cat".
Assume that S is the set of "The Book" chapter documents. chapter documents point to the sections via section links. Sections point to (sub)sections via subsection links and to paragraphs via paragraph links.
R ← LEAF TRANSITIVE CLOSURE[ SEED S IN SELECT * FROM R WHERE r(1).%section UNION SELECT r(1).%subsection]
SELECT * FROM R WHERE COUNT[r(1).%keyword1 = "cat"] > 1
Variation: Given book X, find all paragraphs in it containing the keywords "cat" and "dog".
Same as above but replace the second line with:
SELECT * FROM R WHERE COUNT[r(1).%keyword1 = "cat"] > 1 AND COUNT[r(1).%keyword1 = "dog"] > 1
Variation: Given book X, find all paragraphs in it containing the pattern "c?t*" in its contents.
Same as above but the last line reads:
SELECT * FROM R WHERE MATCH["c?t*", r.$contents]
4.2 Query 2: Transitive closure, all nodes
The Documents: Same as Query 1, except that book, chapters, and sections also have keywords.
The Query: Given book X, find all documents in it (of any type) containing the keyword "cat".
R ← TRANSITIVE CLOSURE[ SEED S IN SELECT * FROM R WHERE r(1).%section UNION SELECT r(1).%subsection]
SELECT * FROM R WHERE COUNT[r(1).%keyword1 = "cat"] > 1
4.3 Query 3: Weighted Keyword Search
The Documents: A set of papers. Each paper is stored as a single document. Each paper contains a set of keywords. For each keyword, there is a set of (l, w) pairs. The pair (l, w) means that the keyword occurs at location "l" (e.g., byte offset into body of paper) with a weight "w". The wight represents the "importance" of the keyword to that occurrence.
The Query: I am given a set of pairs (k, z). In a pair (k, z), "k" is a keyword and "z" is the search weight I want to use. I want to examine the papers in the set. For each paper, I wish to compute the function:
sum ← 0;
For each <k, (l, w)> in the paper,
multiply w by z, where z is search weight corresponding to k; and add to sum
if sum > 0.734 then select paper
Assume that S is the set of "The Documents" references and W is the "set of pairs (k, z)". S is coded as follows: all keywords are kept in a property name (keywords). The property is a set of ordered sets. Each ordered set is a three-tuple: <keyword-string, location, weight>. Multiple uses of the same keyword are represented by replicating the three-tuple with a different location and weight. The keyword-string is a string, so it can be understood by the server. The weight is a floating point number, so it can also be understood by the server. The location can be anything. All that the server has to do is return it to the client application. It could be a character position in the document, but it also could be a client supplied type that encodes chapter/section/sub-section/paragraph.
SELECT S FROM S, W WHERE SUM[
SELECT u3* w2 WHERE u = s1.%keywords AND u1 = w1] > 0.734
u is a set of three-tuples. u1 is the keyword-string, u2 is the location (uninterpreted), and u3 is the confidence weight.
W is the set of pairs. w1 is the keyword and w2 is the search weight.
4.4 Query 4: Join
The Documents: A set of departments. Each department document has a property <manager, str> where "str" is the name of the manager of the department. There is also a set of employees. Each employee has a property <manager, str> representing the name of the employee's manager. There are no outgoing links for department or employees.
The Query: For each employee, department pair that share a common manager, create a tuple <e,t>. Return the set of pairs obtained.
Variation: Instead of creating a pair <e,t>, create a new document that contains the attributes of both e and d.
Variation: Assume employee documents have salaries. Find all employees that make more than their managers. Or try any other relational type query. But remember that all associations are via names, not via links.
Joins are supported so that this is easy.