Yggdrasil Query Language Proposal
June 22, 1989 8:18:04 am PDT
Yggdrasil Query Language Proposal
Bob Hagmann
    Here's a shot at it.
1  Data model
    Note that versions and alternatives are not discussed here.  The selection of the version and alternative could be via an external 
    environment, so might be orthogonal.
1.1 Nodes ({ documents or objects without semantics)
    Nodes have a contents.  The contents of a node is a primitive value.  It is a singleton list of the attribute $contents (see §1.3).
1.2 Typed links
    Nodes can be connected by links (hypertext).  A link is a directed pointer between two nodes.  The link has a type, which is just a 
    string.  An inlink or a to link are links that point at a node, while an outlink or a from link are links leaving a node.
    Nodes can efficiently determine both their inlinks and outlinks.  Although links are directional, they can be followed in either 
    way.  Links are themselves nodes, so they can have attributes.  Yggdrasil provides only node to node links.  If a client wants 
    a link to point at part of a node, then the client has to encode this as a property on the link.
    Links are special properties of a node (see §1.3).  They have attribute names $inlinks and $outlinks respectively.  They are 
    encoded as a two level list.  The first level is a list for each link type.  Each first level list has a field name matching 
    the link type.  The second list is a list of document ids.
1.3 Nodes have properties ({ attributes)
    Nodes have a set of attributes or properties.  Both terms are used interchangeably.  The attributes of a node are a set of 
    attribute name and attribute value list pairs.  The attribute name is just a string that identifies the attribute.  It does not 
    have to be unique for the node.
    An attribute value list is a sequence of attribute value lists and primitive values.  Each attribute value list has a special list 
    item called its name.  Attribute value list names are always strings.
    A primitive value is a type and value pair.  The type is a descriptor for the value.  Only a dozen or so types are understood by 
    the server.  For example, strings (any length), dates, floats (32 bit), and integers (32 bit and infinite precision) are 
    supported.  In addition, other types may be stored without the server understanding what the value means.  For example, ODA 
    documents and compressed scanned images can be stored, but the server treats them as uninterpreted bytes.  The client code, not 
    running on the server, is expected to provide the semantics for these types.  The type space is fairly large (almost 32 bits) 
    and is administrated by a database administrator without much support from the system.
    Note that is a fundamental change in what has been the model up to now.  There is no depth to the recursion in the attribute values.
1.4 Containers group nodes
    Containers are the context mechanism in Yggdrasil.  A container is a (possibly ordered) set of nodes.  A container is associated 
    with a root node of the container (analogous to directory in a file system) (where the information about membership is kept in 
    the $children attribute in the root node)  A node can belong to any number of containers.  In addition, a container can be a 
    sub-container of another container (where the information about membership is kept in the $child-containers attribute).  A 
    container can be the sub-container of any number of containers.  Loops are allowed in the sub-container relation (i. e., it is 
    a directed graph, not just a DAG).
1.5 Documents can be named
    Any node can act as a directory.  It is natural to expect the root nodes of containers to be directories, but the system does not 
    make this restriction.  A special system maintained property, $directoryContents, provides the name and node binding.  Not all 
    nodes have to be named and a node can have many names.  Naming a node that is a root node means naming a subdirectory.
2  Query language motivation
    There are three motivations for the query language:
        1) Perform a subset of SQL
        2) Bring in the notion of object into the query language
        3) Incorporate the primitive notions in the Yggdrasil data model into the query language.  Hence, links, set valued attributes, and 
        containers are directly representable in the language.
    Note that is a fundamental change in what has been proposed before.  The system now wants to provide support to:
        1) Hypertext  this is the basic model
        2) File server  not all that applicible at the query level, but needed in the data model
        3) Object-oriented database storage backend  again, not that applicible at the query level, but needed in the data model
        4) Relational database  this is where the subset of SQL comes in.  This is not an attempt to have a fully relational database, but 
        only some features of it available.  For example, I am not proposing referential integrity be guarenteed.
2.1 Subset of SQL
    When I visited IBM, Bruce Lindsay's major comment (I paraphrase) was that any system that is trying to supplant another should have 
    a plan of how to do the functions of the old system.  So for us, that means how is Yggdrasil going to do SQL?
    I think the answer is in architecting a SQL subset that can then be extended into our query language.  That is, we should look at 
    SQL (or the relational algebra - syntax is not the big thing here) and decide what makes sense for us.  From this subset, 
    enhance it with the things we need to do our job.
    If a client just wants relational, then the client can just restrict its use of the system to the SQL subset (provided that the 
    subset is enough).  A SQL compatibility layer might be constructed to allow real SQL use of the system.
    I don't think we are in the business of building the fastest relational system.  However, giving basic support to this way of 
    dealing with some of your data makes sense.
2.2 Objects in the query language
    Where relational databases and the systems we want to support differ is in the concept of object.  File servers, object-oriented 
    databases, and hypertext systems all have objects.  Relational databases have tables and fields.  The elements in these fields 
    are atomic values.  Object identity (tuple id's or tid's) is not preserved over certain operations such as join.  What comes 
    out is just another table with fields and atomic values.
    Since Yggdrasil is an object based system, it has to understand objects in the query language.  But this has to be integrated with 
    the other ways to deal with queries.
2.3 Yggdrasil primitives
    The query language should incorporate the primitive notions in the Yggdrasil data model in the query language.  Hence, links, set 
    valued attributes, and containers are directly representable in the language.
3  Query concepts
3.1 Sets
    Set appear again and again in the data model.  It is natural that they show up in the query model too.  A tuple is a ordered set of 
    primitive values.  A node set is an unordered set of tuples.  Some examples of node sets of objects are a container (where the 
    singleton tuples of the nodes in the containter are in the node set) or a set formed from the singleton of a single object.
        S1 = { <d11, d12, ..., d1n>, <d21, d22, ..., d2p>, <dm1, dm2, ..., dmq>} 
        (A note on notation: "{}" means unordered set and "<>" means an ordered set (a list or sequence))
    Here the dij's are primitive values which are often node ids, but can be of any type (e. g., integers or strings).  The dij's are 
    not sets!  The dij's internals are addressed positionally within the tuple, although some higher level may use naming to make 
    this more human sensible and less susceptible to bugs.  
    The basic idea is to start with some number of a node sets (of objects?).  Process those node sets to produce new node sets.  These 
    node sets usually are only ephemeral, but can be saved as a property of a node on the server.  New node sets can be constructed 
    by normal set operations (union, difference, and intersect), by select, by joins, or by dereferencing.
    What's going on here?  Isn't S1 just a relation?  No it isn't.  First of all, the dij's can be references to nodes.  They are not 
    necessarily atomic values as required by the relational model.  Second, there is no guarantee that items in the same position 
    are of the same type.  Third, things are dealt with positionally where relational systems deal with attributes by name.  
    Finally, the size of each tuple is not necessarily the same.
3.2 Relational operations on sets
    Normal set operations are real simple.  "S1 UNION S2" produces the set S3 where a tuple is in S3 iff it is in S1 or S2.  Difference 
    and intersect are similarly defined.
    Select is a filtering operation where an old node set is restricted, via a Condition, to construct a new node set.  For example, a 
    new node set can be constructed by restricting an existing node set based on the properties of the members of the existing node 
    set.  The new node set is a subset of the old node set.
    "S5 JOIN S6 WHERE Condition" produces the node set S7 where a tuple, r, is in S7 iff it is constructed from tuples t1 and t2 from 
    S5 and S6 respectively and t1 and t2 satisfy the condition.  If the number of primitive values in t1 is n and the number of 
    primitive values in t2 is m, tuple r has primitive values 1 thru n from t1 and primitive values n+1 thru n+m from t2, where all 
    of these are copied in order.  Join can be combined with Project to remove and reorder the primitive values in the result.  
    Conditions are discussed below.
    Project produces a new node set from an old node set where some of the fields of the ordered tuples have been removed and/or 
    reordered.  For example, projecting S1 above on fields 1 and 2 and swapping the fields would give:
        { <d12, d11>, <d22, d21>,  ..., <dm2, dm1>}
    If a container is used as a starting point for some operation, then it is viewed as a set of singleton tuples for the documents in 
    the container.
        S2 = {<c1>, <c2>, ...<cn>>
3.3 Dereferencing links
    Dereferencing involves looking at the links and attributes of a primitive value for a node set.  Not all primitive values can be 
    dereferenced (e. g., an integer can't be dereferenced).
    For notation, I'm going to use the symbol "^" for explicitly following one level of dereferencing.  Dot notation is used to select 
    attributes.  Assume that S is a node set of singleton object refereneces (e. g., a container).  To construct a new node set R 
    from S where we follow all the refersTo links and only take those who have a co-author of Hagmann:
    SELECT S1.$outlinks.refersTo ^ FROM S WHERE S1.$outlinks.refersTo^author 
    Using set notation this means:
    O (S 
    For links, each tuple of the node set is examined.  The various primitive values in the tuple are the domain for the start of the 
    dereferencing.  Links can be followed in two ways: to the link itself ("    selectively walked (see Conditions below).  For example, only follow "author" links or only follow links that have the 
    "linkPage" property set to "1".
3.3 Transative closure
    Transative closure for a query Q starting from node set D is written:
        
        Transative closure means to run the fullowing algorithm until two iterations produce the same result.  Take the result of the 
        previous iteration or D for the first iteration, and the query Q, and produce a new node set E via standard query 
        exectuion.  The result of the iteration is D 
    Transative closures can be iterated to a fixed depth (depth three is denoted as     reached (    not guaranteed to exist for all queries (sigh).
    Maybe a "leaves" only form of transative closure should be included.  The normal algorithm is run, but any tuple that spawns 
    another is marked.  The union operation treats marked as sticky: two identical tuples appart from the mark may be reduced to 
    one tuple with a mark.  All marked tuples are deleted from the result.
3.4 Attribute and field selection
    When a primitive value that refers to a node, then a form of dereferencing can also be applied to the attributes of the node.  For 
    tuple t = <t1, t2, ..., tm>, if t2 is a reference to a node, then it can be examined.  t2.attributeName dereferences t2 and 
    selects from it the attributeName property.  As this property is set valued (if it exists for the node), each member of the set 
    is considered seperately.  The member can be further dereferenced or be required to pass some condition or be used to compute 
    some expression.
    Eventually the ground case is reached where the result of the dereferencing is either a primitive value, a set that is not a 
    singleton of a primitive value, or something has gone wrong during evaluation (e. g., the attributeName didn't exist, t2 was 
    not a reference, or the result is set valued).  For those evaluations that produce a primitive value or a singleton set of a 
    primitive value, then one of the results of the evaluation is the primitive value.
3.5 Expressions
    There has to be some kind of expression evaluator that makes use of the same ideas in dereferencing.
3.6 Conditons
    xxx
3.7 Aggregation
    SQL provides COUNT, SUM, AVG, MAX and MIN functions on fields.  Yggdrasil should do at least this.  We also need SIZE (for size of 
    a set).
        COUNT[S] = number of tuples in node set S
        SUM[Si] = sum of values in position "i" in all tuples in S
3.8 Update
    Just like SQL?  In essence, you run a query to select those tuples of interest, and then set attributes in them.  Is this OK?
3.9 Examples
3.9.1 Query 1:  Transitive closure to leaves
    The Documents:  Book documents point to one or more chapter documents.  A chapter document points to sections.  Sections point to 
    (sub)sections and/or paragraphs.  Paragraphs have a contents (a long string) and keywords (a set of short strings).
    The Query:  Given book X, find all paragraphs in it containing the keyword "cat".
    O (S 
    T _ O*({X});
    u 
    Variation: Given book X, find all paragraphs in it containing the keywords "cat" and "dog".
    Same as above but replace {"cat"}  with  {"cat", "dog"}
    Variation: Given book X, find all paragraphs in it containing the pattern "c?t*" in its contents.
    Same as above but the last line reads:
    u 
3.9.2 Query 2:  Transitive closure, all nodes
    The Documents:  Same as Query 1, except that book, chapters, and sections also have keywords.
    The Query:  Given book X, find all documents in it (of any type) containing the keyword "cat".
    O (S 
    T _ O*({X});
    u 
3.9.3 Query 3:  Weighted Keyword Search
    The Documents:  A set of papers.  Each paper is stored as a single document.  Each paper contains a set of keywords.  For each 
    keyword, there is a set of (l, w) pairs.  The pair (l, w) means that the keyword occurs at location "l" (e.g., byte offset into 
    body of paper) with a weight "w".  The wight represents the "importance" of the keyword to that occurrence.
    The Query:  I am given a set of pairs (k, z).  In a pair (k, z), "k" is a keyword and "z" is the search weight I want to use.  I 
    want to examine the papers in the set.  For each paper, I wish to compute the function:
        sum _ 0;
        For each <k, (l, w)> in the paper,
            multiply w by z, where z is search weight corresponding to k; and add to sum
        if sum > 0.734 then select paper
        Assume that S is the set of "The Documents" references and W is the "set of pairs (k, z)".  S is coded as follows: each keyword K 
        is kept in property name formed by concatenating "$keywords-" and the string K.  
        SELECT S FROM S, W WHERE SUM[
            SELECT u2* w2 WHERE u = s1.%k^ AND k = ConCat["$keywords-*", w1]   > 0.734

        SELECT u2* w2 WHERE u = s1.%k^ AND k = ConCat["$keywords-*", w1]

        u is a set of keyword location and confidence weight pairs (l, w).  u2 is the confidence weight.
        W is the set of pairs.  W1 is the keyword and W2 is the search weight.

        SELECT * FROM R 
            WHERE R.id = Keywords.id AND SUM[W.weight * SELECT Keywords.weight FROM Keywords.id = R.id AND W.keyword = Keywords.keyword] > 0.734

        r 
        FOREACH[f = s1.$keywords ^ | f.name = "cat" 
        O (S 
        T _ O*({X});
        u 
3.9.4 Query 4:  Join
    The Documents:  A set of departments. Each department document has a property <manager, str> where "str" is the name of the manager 
    of the department.  There is also a set of employees.  Each employee has a property <manager, str> representing the name of the 
    employee's manager.  There are no outgoing links for department or employees.
    The Query:  For each employee, department pair that share a common manager, create a tuple <e,t>.  Return the set of pairs obtained.
    Variation: Instead of creating a pair <e,t>, create a new document that contains the attributes of both e and d.
    Variation:  Assume employee documents have salaries.  Find all employees that make more than their managers. Or try any other 
    relational type query.  But remember that all associations are via names, not via links.
4 Relational algebra
    This section should compare and contrast the above model and relational algebra.
1.3.1 SELECT
    Extract specified tuples from a specified relation.
1.3.2 PROJECT
    Extract specified attributes from a specified relation .  Not applicable in an object oriented system (?).  Yggdrasil's project 
    tosses whole objects out of node sets, not just some fields.
1.3.3 PRODUCT
    Cross product of two relations formed by pairs of tuples being concatenated.
1.3.4 UNION
1.3.5 INTERSECT
1.3.6 DIFFERENCE
1.3.7 JOIN
    Cross product of two relations formed by pairs of tuples being concatenated where the two tuple satisfy some condition.
1.3.8 DIVIDE: n/a