DRAFT -- March 15, 1988 11:48:09 am PST -- DRAFT

XEROX Internal Memo

To: Distribution
From: Bob Hagmann, PARC/CSL
Subject: Yggdrasil Overview
Date: March 15, 1988

Introduction

Xerox corporate strategy requires multimedia, large scale, unstructured, robust, and efficient servers for the storage and retrieval of documents. Yggdrasil is a project in Xerox's Palo Alto Research Center, Computer Science Laboratory (CSL), to address many of the research issues that must be solved to provide such a service. This research is needed for Xerox to understand how to implement a filing strategy.

Many organizations require the more or less permanent storage of information. Over the years, three types of electronic services have evolved that each satisfy part of the problem: databases, file systems, and file servers. Each of these systems has its strengths and weaknesses. Over time, the problems and the technologies change; what was appropriate at one time is not a good solution at another.

The cost of secondary storage has continued to decrease over the years. In addition, the range of sizes of individual objects that we wish to store has grown (one storage object in CSL is a 160 megabyte file). We are not always satisfied with the simple yet elegant hierarchical naming systems that file systems provide. Nor is the relational database a panacea: it lacks several features, which makes it difficult to model some applications (e.g., good solutions for repeating groups, hypertext, stored procedures, triggers, transitive closure, large objects, ...). Object-oriented databases are a current research topic, but significant problems remain in their performance, locking, security, evolution, archiving, execution model, and robustness, among other areas.

Although much of the research in database systems is concerned with increasing the semantics of the system, many of the potential clients of these systems simply want data storage. They want the database to ``store the bits and get out of the way.'' Semantics will be provided by the applications. The database must be fast, efficient, and robust.

There seems to be a good deal of overlap between the needs of several groups to store and retrieve documents. The needs include the storage of documents of vastly varying sizes, the naming and indexing of some of the documents, the interconnection of (some of) the documents, and notification of significant events involving documents. The following is a partial list of CRG projects, proposals, groups, and needs that could be served by a Document Server.

- FolioPub project at WRC
- ERSG proposal at WRC
- VLSI DA (CAD) group in CSL
- Voice project in CSL
- Software storage for the Cedar Programming Environment in CSL
- Storage of scanned images
- Distributed Notecards in ISL
- Large capacity and high performance file server
- Mail storage
- TIC support, both at WRC and PARC

Xerox also has a current focus on ``document processing.'' Central to most (if not all) document processing systems is the storage and retrieval of documents. We are interested in the subclass of documents called electronic documents; these are ``documents'' that can be handled in some way via computers.
Although there is no precise definition of ``electronic document,'' we believe that a stored electronic document should have (some of) the following features:

- Primitive elements include plain text, structured text (e.g., a Tioga or ViewPoint file), scanned image, mail message, digitized voice, digitized video, and compressed versions of the above. This should be an extensible list.
- A document is a primitive element with an optional ordered collection of documents. Note that this is a recursive definition.
- Documents have attributes. Some attributes are interpreted and some are uninterpreted. Keywords may be distinguished attributes.
- Documents may be interconnected with links. Links have names. Links are also documents.
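To make this recursive definition concrete, the following is a minimal sketch (in Python, which is not the implementation language; all class and field names are hypothetical illustrations, not part of the Yggdrasil design). A document pairs a primitive element with an ordered list of child documents, carries attributes, and may be connected to other documents by named links, which are themselves documents:

    # A minimal sketch of the recursive document model described above.
    # Class and field names are hypothetical, not part of the Yggdrasil design.

    class Document:
        def __init__(self, element=None, attributes=None):
            self.element = element              # primitive element: text, image, voice, ...
            self.children = []                  # optional ordered collection of sub-documents
            self.attributes = attributes or {}  # interpreted and uninterpreted attributes
            self.links = []                     # named links to other documents

        def add_child(self, doc):
            self.children.append(doc)           # order of children is significant

        def link_to(self, name, target):
            # A link is itself a document; here it records its name, source, and
            # target as attributes (purely illustrative).
            link = Document(attributes={"link-name": name, "source": self, "target": target})
            self.links.append(link)
            return link

    # Example: a report whose first chapter is a Tioga file, with keyword attributes.
    report = Document(attributes={"keywords": ["Yggdrasil", "overview"]})
    report.add_child(Document(element="chapter1.tioga"))
    report.link_to("cites", Document(element="earlier-memo.tioga"))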
The basic idea is to build an ``Interim Document Storage and Retrieval System'' called Yggdrasil. This system will be the central code of several server systems. Yggdrasil could have front ends that make it look like some or all of an XNS File Server, an OSI File Server, an NFS File Server, or a native Yggdrasil server. A single server could export multiple interfaces over the same ``Document Base'' if they are compatible, possibly allowing only restricted or stylized access from the simpler protocols. Yggdrasil should be configurable with various secondary storage devices to meet a variety of requirements for response time, storage capacity, crash recovery, availability, archival storage, backup, and robustness.

We in CSL are interested in providing the basic storage and retrieval system, together with its associated administrative functions. We hope to cooperate with other groups to make this a useful system. Other groups could provide, for example, automatic keyword generation, selective dissemination of information (SDI), inverse typesetting, document capture (e.g., scanning), and other facilities.

The name Yggdrasil was chosen because of its definition:

    Ygg|dra|sil n. Also Yg|dra|sil. Norse Mythology. The great ash tree that holds together earth, heaven, and hell by its roots and branches. [Old Norse, probably ``the horse of Yggr'': Yggr, name of Odin, from yggr, variant of uggr, frightful (see ugly) + drasill, horse.]

[Picture of Yggdrasil]

The rest of this memo outlines the ideas for the system. For compact presentation, an outline in bullet form is used.

Basic Features

- large number of documents
- documents can be large
- documents can be small
- document ``types'' understood by the system include text, bitmaps, digitized audio, digitized video (?), graphics, ...
- document ``types'' are extensible
- documents can be connected via links (hypertext)
- documents can be named via a ``hierarchical'' name system
- documents can have attributes and keywords
- documents are grouped into contexts called containers
- keyword and other indices maintained per container
- support for versions and alternatives
- data compression and decompression
- support for archival storage
- alerters (send a network message when a system event occurs)
- query language and query optimizer
- page level access within a document
- transactions
- robust
- good performance
- fast recovery
- good availability
- access control
- hooks for multi-server and foreign server support

Hardware Base

- > 3 MIP CPU (or much more)
- > 8 megabytes of main memory possible
- Good communications hardware
    - back-to-back packets
    - sufficient bandwidth (10 megabits/second or much more)
- SCSI bus (or other way to connect efficiently to a large amount of storage) for WORM optical and magnetic hard disks (if WORM optical is not on a separate server)
- Online archiving (optional)
    1) WORM optical disks, most likely with a jukebox; might be accessible over the net as an ``archive server'' instead of connected by a bus
    2) Other archival device with appropriate size and speed
- Magnetic hard disks
    - possibly in large numbers even if WORM optical disks are used!!
    - 5-20+ gigabytes
    - varying capacity and performance
    - mirrored hardware for the log
- Plan for read/write optical disks
    - will not perform as well as magnetic hard disks (at least at first)
- Plan for stable main memory
    - battery backed RAM
- Flexible design to accept new storage and communications technology

Operating System and Programming Environment

- Supports sharing and concurrency
- Good existing communications
    - supports protocol(s) needed for the server
    - support likely for new protocols
- Reasonable programming environment support
    - Cedar seems to be a clear choice
- Server should be able to run under
    1) SUN OS
    2) Cedar on a Dorado
    3) Cedar on a Dragon

High Level Protocols

- ``Native'' protocol with (eventual) full support for the data model
- Build the server so that a protocol veneer can be built later to handle (at least)
    - XNS Filing
    - ANSI/CCITT/OSI/... Filing and Retrieval
    - NFS
    - if the system talks Pup, then consider Pup FTP

Data Storage

- The system will implement a storage hierarchy including some or all of these levels
    - memory
    - stable main memory (battery backed up DRAM)
    - magnetic hard disk log
    - magnetic hard disk buffers
    - magnetic hard disk permanent storage
    - WORM optical disk jukebox
    - WORM optical disk backup or tape backup
- Log based recovery
- Use group commit and/or stable main memory for the log tail for high update load (if needed); see the sketch after this list
- All naming, indexing, buffering, linkages, etc. are saved and updated on magnetic hard disk (or other fast secondary storage)
- Delay processing (e.g., indexing)
    - accept the ``image'' quickly; index at leisure
- Large object support
- Hints for data storage (clustering)
- ``Real time'' support -- at least enough for voice
- Archival storage possibilities
    a) write to the archival device when not referenced recently (current design)
    b) write to the archival device ASAP but when convenient
    c) write to the archival device preserving the time order of documents (possibly due to legal requirements)
- Plan for new technologies and changes in old technologies
    - XDM might replace WORM optical disk
    - optical tape
    - digital video tape
- Key is to build a framework for varying performance storage devices; the actual devices might change
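The group-commit idea mentioned above can be sketched as follows. This is only an illustration (in Python; the Log class, its method names, and the force_to_disk callback are all hypothetical): transactions append their commit records to an in-memory log tail, and one force of the tail to disk makes every transaction whose record it contains durable, so concurrent committers share a single I/O.

    import threading

    # Minimal group-commit sketch. When several transactions reach their commit
    # point close together, the first one to force the log tail commits them all;
    # the rest find their records already durable and avoid a separate disk write.
    # (A real implementation might also delay briefly to let a larger group form.)

    class Log:
        def __init__(self, force_to_disk):
            self.force_to_disk = force_to_disk  # writes a list of records durably
            self.lock = threading.Lock()
            self.tail = []                      # buffered (or stable-RAM) log tail
            self.forced_upto = 0                # number of records known to be on disk

        def append(self, record):
            with self.lock:
                self.tail.append(record)
                return len(self.tail)           # position of this record in the log

        def commit(self, position):
            with self.lock:
                if position <= self.forced_upto:
                    return                      # a concurrent force already covered us
                self.force_to_disk(self.tail[self.forced_upto:])  # one I/O for the group
                self.forced_upto = len(self.tail)

    # Usage: append a commit record, then ensure it has been forced.
    log = Log(force_to_disk=lambda records: None)   # stand-in for a real disk write
    log.commit(log.append(("T1", "commit")))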
Data Model

Objects

- An object is either a document or a link
- All objects (both documents and links) can have ``contents'' as well as attributes (a property list)
- Links have both source and destination objects
- Links are first class objects: they can have ``contents'' and attributes, and can be the source and destination of other links
- A link has a ``type''
    - the ``type'' is just a character string
    - any number of links of the same ``type'' may be attached to a node, but they are unordered
- Links point at objects. To get finer granularity, attributes may be used in an application specific way.

Naming of documents

- Hierarchical directories
    - a slash separated path name (``[]<>'' format accepted and converted)
    - loops allowed, but a server specific evaluation depth will be imposed
    - directories have as members directories, hard links to other directories, and symbolic links
- Ordered or unordered children links
    - a document can have subdocuments (children); there is an order to the children
- Not all objects have names

Containers

- An object belongs to some object, its container
- The container does not have to have a name
- The containment graph describes the container relation

Versions and alternatives

- Objects can have versions and named alternatives
- Directories have numbered versions and named alternatives
- A partial name looks like objectName%alternative!version (see the sketch below)
    - ``/Luther.alpine/Hagmann.pa/Buffering%FCFS!5/TechnicalReport.tioga!3'' is a valid file name
- Versions and alternatives are maintained for both named and unnamed objects
- A forward or backward ``delta'' technique is used to conserve storage
- Minor versions, called revisions, are kept until the next version
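As an illustration of this naming syntax, the following sketch parses a slash-separated path and extracts the optional %alternative and !version suffixes from each component (the function name and the shape of its result are hypothetical, not part of the design):

    # Hypothetical parser for partial names of the form
    #   /dir/.../objectName%alternative!version
    # where %alternative and !version are optional on each component.

    def parse_name(path):
        parts = []
        for component in path.strip("/").split("/"):
            alternative, version = None, None
            if "!" in component:
                component, version_text = component.rsplit("!", 1)
                version = int(version_text)
            if "%" in component:
                component, alternative = component.rsplit("%", 1)
            parts.append({"name": component,
                          "alternative": alternative,
                          "version": version})
        return parts

    # The example name from above:
    parse_name("/Luther.alpine/Hagmann.pa/Buffering%FCFS!5/TechnicalReport.tioga!3")
    # => [{'name': 'Luther.alpine', 'alternative': None, 'version': None},
    #     {'name': 'Hagmann.pa', 'alternative': None, 'version': None},
    #     {'name': 'Buffering', 'alternative': 'FCFS', 'version': 5},
    #     {'name': 'TechnicalReport.tioga', 'alternative': None, 'version': 3}]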
Attributes

- May be associated with an object (document or link)
- An attribute has a value
- A value is an unordered or ordered set of values, or a primitive document element
- Interpreted attributes may be used in queries (e.g., keywords)

Keywords

- Similar to attributes, but are distinguished
- Keywords have a string name

Indices

- Maintain indices for attributes and keywords in the containment graph at the levels specified by the ``owner'' of the subtree
- An index at a level replicates data of indices at lower levels

Updates

- Primitive navigational updates (e.g., link A to B, add attribute C of type D value E to F, remove G, store document H)
- One server interaction can create an object, set its contents, set its attributes, and create links from/to it.

Alerters

- Simple triggers on insert, delete, read, or update
- A trigger queues a message for some (probably external) server
- Uses for alerters:
    - screen refresh due to update of a document being viewed
    - cache maintenance on client machines
    - triggering of services (e.g., automatic indexing of a new document)

Queries

This is only partly specified at this time.

- A query results in a set or an enumerator (a list of dynamically constructed handles) of the matching objects. Each object may be examined further by the code under the enumerator (e.g., the attributes may be examined or the document read).
- A subquery is specified by an object-selection and a match-condition, which yields a new subquery.
- A primitive subquery is specified by a name-pattern and a match-condition.

Transactions and concurrency control

Locking

- a document (object) is the natural unit of locking
- locks come in (at least) three flavors (see the sketch at the end of this memo)
    - browse lock: the weakest of all the locks. Can break when a stronger lock is granted or when a transaction that holds a write lock commits. Useful for alerters.
    - read lock: locks the object so that it cannot be modified
    - write lock: locks the object for modification

Transactions

- they are supposed to be short

Security

- Use physical security. Don't bother to encrypt the disk.
- Use ``secure'' RPC, or other ``secure'' protocols.
- Access control lists permit access.
- Restrict access via naming, indexing, and links.

Administration

- Implement historical logging and monitoring
- Have hooks for debugging
- Backup
- Garbage collection via delete

Real system

- Standards
    - authentication protocols
    - secure RPC
    - character sets
- Replication and availability
    - all committed data always stored in two stable places
    - multiple CPUs per server for availability and load balancing
    - redundancy using mirrored buses, communications networks, controllers, and disks, and/or dual porting of controllers and disks
    - parity disk trick to mask the failure of a disk drive
    - mirror the log
- Robustness
- Fast recovery
    - recover the server in under two minutes after boot completes if 1) there is power, 2) the net is up, and 3) there are not multiple severe failures
- Performance
    - performance, performance, performance
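Finally, as an illustration of the three lock flavors described under Locking above, here is a minimal compatibility sketch (in Python; the class, method, and mode names are hypothetical, and a real lock manager would also need queuing, deadlock handling, and the commit-time breaking of browse locks). Browse locks are compatible with everything but may be broken when a write lock is granted; read locks exclude writers; write locks are exclusive:

    # Hypothetical sketch of the three lock modes described under Locking.
    BROWSE, READ, WRITE = "browse", "read", "write"

    # COMPATIBLE[(held, requested)]: may the requested mode be granted while
    # another transaction holds the given mode?
    COMPATIBLE = {
        (BROWSE, BROWSE): True,  (BROWSE, READ): True,  (BROWSE, WRITE): True,
        (READ,   BROWSE): True,  (READ,   READ): True,  (READ,   WRITE): False,
        (WRITE,  BROWSE): True,  (WRITE,  READ): False, (WRITE,  WRITE): False,
    }

    class ObjectLock:
        """Per-object lock table: mode -> set of holding transaction ids."""
        def __init__(self):
            self.holders = {BROWSE: set(), READ: set(), WRITE: set()}

        def request(self, txn, mode):
            # Grant only if the requested mode is compatible with every mode
            # currently held by some other transaction.
            for held_mode, txns in self.holders.items():
                if any(t != txn for t in txns) and not COMPATIBLE[(held_mode, mode)]:
                    return False
            self.holders[mode].add(txn)
            if mode == WRITE:
                # In this sketch, a newly granted write lock breaks other
                # transactions' browse locks; an alerter could notify those browsers.
                self.holders[BROWSE] = {t for t in self.holders[BROWSE] if t == txn}
            return True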