DRAFT -- October 11, 1988 11:11:21 am PDT -- DRAFT

XEROX Internal Memo

To:      Distribution
From:    Bob Hagmann, PARC/CSL
Subject: Yggdrasil Storage Design
Date:    October 11, 1988

Introduction

This memo specifies the overall storage design for Yggdrasil. There are many additional sections to this memo. They are as follows:

Requirements
  These requirements have mostly been gathered from other memos.
General description
Object manager buffering policy
  This section deals with issues like read-ahead, buffering of metadata and data, and interactions with logging.
Object representations
  Formats for objects, links, and attributes are discussed at a high level.
Interoperation of optical disk subsystem with full system
  Since the bulk of the data stored by the server will reside in the archival backend, the interoperation of the frontend and backend must be addressed. Problems occur with the write policy and interaction with backup and crash recovery.
Magnetic disk page allocator and clustering
  This section deals with the magnetic disk page allocator and associated issues about clustering on disk. It also discusses flushing of the magnetic disk cache.
Naming
  This section describes the update techniques for naming.
Garbage collection
  Garbage collection is used to clean up the naming and data space. Explicit object deletion and cleansing of the name tables are both covered in this section.
Versions, revisions, and alternatives
  This section deals with the encoding of versions, revisions, and alternatives. It also describes how to modify encodings after objects have been changed.
Index data structures
  Indices are maintained up the containment hierarchy. This section describes the data structures kept at the container objects.
Data compression
Scavenger
  Both scavenging the frontend and the backend are covered.
Replication
  Replication is necessary to avoid loss of data. This section describes several facets in the use of replication.
Frontend backup
DID map
  Document Identifiers (DIDs) to storage address mapping is dealt with in this section.

Requirements

- Large numbers of documents
- Document "types" include text, bitmaps, digitized audio, digitized video (?), graphics, ...
- Documents can be connected via links (hypertext)
- Large object support
- Page level access
- Storage hierarchy: memory; low and high performance magnetic hard disk for logs, names, attributes, and buffers; WORM optical disk, ...
- Read a track (or so), but keep smaller units in memory
- Will rotational position become more important than radial position?
- Data compression
  The service will store compressed data. The data can be compressed/decompressed by either the clients or the server. The service policy will be to forbid the storage of compressed data where the server does not have the decompression algorithm.
- Log based recovery
  - "new file" trick used on Alpine: write new data to free pages, and commit and log only small amounts of data. This avoids much of the double writing - once to the log and once to the file.
  - ``high water mark'' trick used on Alpine: only log pages before the HWM, treating pages past the HWM as new pages.
- Use group commit for high update load, if needed
- Sequential read: do read-ahead
  An "FTP retrieve" type of sequential read should be blindingly fast. On an otherwise unloaded server, it should run at the slower of the network protocol speed for memory to memory copy and of the disk bandwidth (whether magnetic or optical). From [indigo]Doc>EarlyDesign>FileStoreImplOutline0.bravo:

  Priorities. Sequential I/O will always be very important, and Alpine should try hard to do a good job on sequential I/O. Applications that exhibit sequentiality: file transfer, external sorting, searching (pattern matching in text, no-index query processing), logs (both low and high level). Random I/O can really only be optimized to the extent that it is non-random, e.g.
it exhibits locality of reference and benefits from an LRU buffer manager.

- Delayed write (watch out for overflow!)
- Double write of all non-backed-up information (e. g., double write of the log and new objects)
- All naming, indexing, buffering, linkages, etc. are saved and updated on magnetic hard disk (or other fast secondary storage)
- WORM optical disks will be self-identifying upon removal (assuming that the optical disks are directly connected to the server)
  - not self-identifying when they are de-selected or after every update
  - includes naming, indexing, and linkages (assumes we use the optical disks directly instead of through some other server)
- Storage hierarchy: data/naming information/indices/etc. can be in/on: memory buffer; stable storage (?) log; magnetic hard disk buffer; WORM optical disk
- Delay processing (e.g., indexing): accept the "image" quickly; index at leisure (?? via external server)
- "Real time" support -- enough for voice (?? video later)
- Archival storage:
  a) write to archival device when not referenced recently
  b) write to archival device ASAP but when convenient
  c) write to archival device preserving time order of documents (legal requirement)
- Plan for new technologies and changes in old technologies: XDM might replace WORM optical disk; optical tape
- Space accounting
- Key is to build a framework for varying performance storage devices; the actual devices might change

General description

There are three entities involved in the service: the client, the frontend, and the backend. The client is the software running somewhere (e. g., a workstation) on behalf of a user. The frontend is Yggdrasil. It provides the data model and a moderate amount of medium term storage (e. g., 5 gigabytes of magnetic disk). The backend is a server for the optical disk jukebox. We would like to think that there is not a one-to-one mapping of backends and frontends.
Several frontends may provide the same service (for availability and performance), while one frontend may use several backends to store data. The backends are intended to be captive subsystems: they are never directly accessed by clients.

The problems in storage design are:
- give the illusion of the frontend as providing the data model
- hide the performance problems of the backend(s)
- perform multi-machine crash recovery

Object manager buffering policy

The challenge of any storage hierarchy is to make it appear to the client that the hierarchy does not exist: there is only fast memory. With a three (or more) level hierarchy, it is critical for system performance to hide the fact that there is an optical jukebox, as well as to provide extremely efficient use of magnetic disk.

The system should be built to parameterize the use of the storage hierarchy. ``Gap filler'' technology may (finally) appear in between magnetic disk and memory, or performance differences between high and low performance disk drives may be great enough to distinguish another level in the hierarchy. Note that the hierarchy is a lattice, not a linearly ordered set. Each item of the storage hierarchy may use multiple lower level items (e. g., multiple optical disk jukeboxes).

As we get experience in using the system, two technologies may become of interest. First is high performance disks for logging. They can decrease the commit time and make logging more attractive. Second is stable memory. With a megabyte or so of battery backed up RAM (or other fast stable memory) the commit time can be shortened. Stable memory may be a requirement if we implement multiple cooperating frontends for the same data.

It is intended that all metadata of the system, the names and attributes, will be kept on magnetic disk. Only rarely, if ever, will some metadata be written to optical disk because there is insufficient room on magnetic disk.
An object written to optical disk will have the full object, including all attributes, but the names and attributes are never (nearly never) removed from magnetic disk, except for large attributes. An attribute can be huge, so we must allow for the possibility that they will not all fit on magnetic disk, but we should design the system to anticipate that they do all fit.

LRU is the basic management policy. However, it is well known that this policy is not optimal for many types of access. The concerns are:
- a single touch of an object (or page) should not promote the object (or page) for a long time
- big objects should not dominate storage
- a single user/xact should not dominate storage
- volatile objects should be kept in faster storage
- buffer large amounts of nearby data to faster storage on a fault: read a track, but keep pages; data buffered this way has to be aged quickly
- transient objects should be kept in faster storage and not written to longer term storage (they will disappear soon)
- permanent objects that are written for archive should be written to longer term storage quickly and removed from faster storage
- sequential access of a large object should not dominate faster storage (sequential access often is a scan, where LRU is the worst policy to follow)
- different classes of data and metadata should be buffered separately: buffer names, attributes, and contents differently

New and changed objects will be logged and/or written to magnetic disk on transaction commit (or before). They may be buffered in faster storage, but the optical disk write decision is made as described in the section Interoperation of optical disk subsystem with full system of this memo.

To sequentially access large objects (e. g., FTP retrieve), the system must do read-ahead. Sequential access can either be specified as a hint by the client or be dynamically inferred by the server code. The goal of buffering in sequential access is to be communications limited.
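Dynamic inference of sequential access can be sketched as follows. This is a minimal illustration, not the memo's design: the class name, the threshold/window parameters, and the fetch callback are all invented for the example.

```python
# Hedged sketch of inferring sequential access: after `threshold`
# consecutive page reads of an object, schedule read-ahead of the
# next `window` pages. All names here are illustrative.

class ReadAheadTracker:
    def __init__(self, fetch, threshold=3, window=8):
        self.fetch = fetch          # callback: fetch(obj, page) into the cache
        self.threshold = threshold  # consecutive reads before we call it sequential
        self.window = window        # pages to prefetch once sequential
        self.state = {}             # obj -> (last page read, run length)

    def note_read(self, obj, page):
        last, run = self.state.get(obj, (None, 0))
        run = run + 1 if last is not None and page == last + 1 else 1
        self.state[obj] = (page, run)
        if run >= self.threshold:
            # Looks sequential: prefetch the next window of pages.
            for p in range(page + 1, page + 1 + self.window):
                self.fetch(obj, p)
```

A client-supplied hint would simply set the run length to the threshold up front, bypassing the inference.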
Read-ahead should be done in chunks that match the logical storage. Hence, don't read ahead some pages from magnetic disk and some from optical disk, and don't read ahead pages from very different disk addresses. But do read ahead across cylinder boundaries.

There is a separate memo on the design of the crash recovery mechanism. It is important to point out a few requirements here about the logging procedures. The ideas from Alpine still seem relevant. The following passages are taken from FileStoreImplOutline0 by Mark Brown.

Basic transaction strategy

Execute file page write, file deletion, file shortening, and file immutable-making by writing a redo log entry before the end of phase one. During phase two, carry out the operation. Execute file creation and file lengthening by writing an undo log entry, then executing the operation, before the end of phase one. Undo the operation if the transaction aborts. Optimize important special cases. For example, page writes do not need to be deferred if the file was created in this transaction.

How various interesting operations are implemented

Write Page
  Obtain page lock (U or W mode) if required. Consult the log map. If a committed update (necessarily by another transaction) to this page exists, then read the data from the log, write it through FilePageMgr, and erase the log entry. Write the log record "transaction t wrote data d on page p of file f". It is not necessary to force the log. Insert an entry containing the log ID of this log record into the log map.

Read Page
  Obtain page lock (R mode) if required. Consult the log map. If a committed update (necessarily by another transaction) to this page exists, then read the data from the log (and, as a side effect, perform the write through FilePageMgr and erase the log entry). If an update from this transaction exists, then read the data from the log. Otherwise read the data from the file.

Yggdrasil will use the log more often for written data.
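The Write Page / Read Page protocol quoted above can be sketched as a small in-memory model. This is only an illustration of the log-map bookkeeping: `PageStore`, its dictionary "disk", and the deferred phase-two commit are stand-ins for FilePageMgr, real locking, and the real recovery machinery.

```python
# Minimal model of the redo-log Write Page / Read Page protocol.
# "files" stands in for FilePageMgr; page locking is assumed and omitted.

class PageStore:
    def __init__(self):
        self.files = {}      # (file, page) -> data, the "disk"
        self.log = []        # append-only: (txn, file, page, data)
        self.log_map = {}    # (file, page) -> (txn, log index, committed?)

    def write_page(self, txn, file, page, data):
        key = (file, page)
        entry = self.log_map.get(key)
        if entry and entry[2] and entry[0] != txn:
            # A committed update by another transaction exists:
            # write it through to the file and erase the log entry.
            _, idx, _ = entry
            self.files[key] = self.log[idx][3]
            del self.log_map[key]
        # Log "transaction t wrote data d on page p of file f"; no force.
        self.log.append((txn, file, page, data))
        self.log_map[key] = (txn, len(self.log) - 1, False)

    def read_page(self, txn, file, page):
        key = (file, page)
        entry = self.log_map.get(key)
        if entry:
            owner, idx, committed = entry
            if committed and owner != txn:
                # Side effect: write through and erase the log entry.
                self.files[key] = self.log[idx][3]
                del self.log_map[key]
                return self.files[key]
            if owner == txn:
                return self.log[idx][3]   # read our own logged update
        return self.files.get(key)        # otherwise read the file

    def commit(self, txn):
        # Phase two is deferred: just mark this txn's entries committed.
        for key, (owner, idx, _) in list(self.log_map.items()):
            if owner == txn:
                self.log_map[key] = (owner, idx, True)
```

Note how the write-through of another transaction's committed data happens lazily, on the next touch of the page, which is exactly what lets commit avoid forcing data pages.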
Small writes (< 8K bytes or so), whether to create, replace, or change part of the contents of an object, will be logged. They will be written by some kind of write-behind scheme. The scheme has to be adaptive to the load on the server. It has to avoid the arm contention problems we see on Alpine (writes are delayed past the end of the transaction, but overly interfere with further transactions). Large writes (> 100K bytes) will be done by writing shadow copies of the pages. Medium writes will probably have to be done by logging for update and shadow for create (this should be a tuning parameter).

The virtual state of the disk must be reconstructed after a crash. Data that must survive crashes includes (but is not limited to):
- free page map
- name and attribute metadata
- object roots
- location of buffered optical disk contents
- committed transaction table

To be completed:
- interaction with backup
- how much to read when doing the access

Object representations

Objects

Requirements
- Contents may be large (at least a good piece of a gigabyte)
- Names and attributes are (mostly) smaller than the contents
- Objects have id's. The id's are unique. A server should never run out of id's during the life of the service.
- Support primitive types including text, bitmaps, digitized audio, PDL descriptions, document interchange, graphics, ...
- Page level access (for large values)
- All naming, indexing, buffering, linkages, etc. are saved and updated on magnetic hard disk (or other fast secondary storage) in the normal case
- When contents are small, keep them with the object data

Discussion

Every object owned by the service has a Document Identifier (DID). The DIDs are mapped (see DID map below) to get the secondary storage address(es), size(s), and (optional) slot number of the object.
The slot number is like the slots used in many database systems (including Cypress): the slot is essentially a part of a page (page/slot tells the system to look at a page and "indirect" through a "slot table" to find the current position and size of the record of interest). The DID map has two purposes: first to do the mapping, and second to find pages that are already buffered. It is also used for flushing and some forms of scavenge.

DIDs can map to multiple addresses since the object may be large. The root storage address and the slot number describe the root object for small objects. Other pages form the extension of the root object and the contents. The root extension pages do not need slots: they consume the whole page. If the contents cannot fit on the root page, then they are allocated their own page(s). The DID map can span main memory, magnetic disk memory, and archival memory (tertiary or quaternary). Fragments may be kept on any and all levels. Committed objects never have only main memory buffers for any of their pages. Buffering may be on magnetic disk or in memory. The mapping is not trivial (as it is in most DB systems) since we want the objects to be able to move without changing DIDs.

DIDs have enough bits so that they are never reused and are unique per instance of the service. DIDs are concatenated (somehow) with the server's ID (whatever that is) to get a universal DID (UDID). Machines that are servers may export interfaces for multiple instances of the service, but each of these is separate internally and has its own server ID. Hence, we only store DIDs on secondary storage, not UDIDs (except for links to foreign objects). Links to foreign objects are done using UDIDs. UDIDs encode the service type, protocol, some way to find the service (e. g., net address, or service name and name service), and a unique identifier as specified by the foreign host. The frontend tries to guarantee that any object that is linked to always exists.
It can make no such guarantee for foreign objects.

Once the secondary storage address is known, the root page can be buffered. The slot table points at the object's representation on the page. The representation consists of:
- the DID (for internal checking, scavenge, or something)
- the ``from'' and ``to'' DIDs or UDIDs (keep in links only?)
- access control list: may be nil if it inherits protection from its parent
- attributes (see Attributes below)
- contents:
  - if short, stored directly in the object (immediate)
  - if not short, stored as a file, and the file run table is kept here (note that this is replicated in the DID map)
  - for the actual encoding of the contents, see the Versions, revisions, and alternatives section below

DIDs should be stored in as few bits as reasonable. For example, if we want about 2^64 DIDs (over 10^19), then the first 2^31 DIDs on a server might take one word, and all other DIDs might take two words. DIDs stored inside of an object may elide the high (say) 49 bits of the DID. It will take any server a while to get to 2^31 objects, and by then memory and disk will be cheaper. However, we should never, ever ``run out of DIDs.''

Links

Requirements
- Links are directed from one object to another. The from part of the link is called either the "from link" or the "out link". The to part of the link is called either the "to link" or the "in link".
- From an object, it is easy to find its in and out links. Out links must be extremely cheap, and in links must be inexpensive.
- Documents can be connected via links (hypertext)

Discussion

Store out link id's with the object. Keep a table for the ``to'' links for the default version of the object under the attribute $defaultOutLinks. B-tree or hash are possible implementations for this table. Also, keep a set of objects that this object links to that are not in the default object under the attribute $nondefaultOutLinks. Links are first class objects as far as the data model, but their representation is optimized.
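The variable-length DID encoding suggested above (one word for early DIDs, two words otherwise) might look like the following sketch. The 32-bit word size and the use of the high bit of the first word as a length tag are assumptions for illustration; the memo does not fix either.

```python
# Hedged sketch: DIDs below 2^31 fit in one 32-bit word (high bit clear);
# larger DIDs take two words with the high bit of the first word set.
# Word size and tag layout are assumptions, not the memo's specification.

def encode_did(did: int) -> bytes:
    if did < 2**31:
        return did.to_bytes(4, "big")               # short form: one word
    assert did < 2**63                               # 63 usable bits here
    return (did | (1 << 63)).to_bytes(8, "big")      # long form: two words

def decode_did(buf: bytes, offset: int = 0):
    """Return (did, bytes consumed) starting at offset."""
    first = int.from_bytes(buf[offset:offset + 4], "big")
    if first < 2**31:
        return first, 4                              # high bit clear: short
    long_form = int.from_bytes(buf[offset:offset + 8], "big")
    return long_form & ~(1 << 63), 8                 # strip the tag bit
```

The point of the scheme is the one in the text: a young server pays one word per stored DID, and the format never runs out because the long form is always available.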
Many links are simple links: they have no contents or properties. No storage for these links is ever allocated. For slightly more complicated links, storage is allocated but is clustered near the ``from'' object.

Attributes

Requirements
- Fast access from the object representation
- Compressed attribute names for common attributes

Discussion

Attributes (also called properties) of an object are encoded as a list that may be interpreted as ordered or unordered. Each attribute name has a list of attribute values. The general form of this is: {{attribute name, {attribute value, ...}}, {attribute name, {attribute value, ...}}, ...}. An attribute value may also be a list of attribute values. This grounds out into primitive objects. The attribute names do not have to be unique for non-system attributes.

Encode attribute names so that they take only a byte or two for ``common attributes'' - those attribute names that are used a lot. Common attributes (e. g., ``depends on'') should cover an entire system. They could be discovered by administrator specification, by wired-in defaults (particularly for system attributes), and dynamically by use (e. g., promote the day's most active non-common attribute to be a common attribute from now on). Use some semi-obvious encoding scheme. Cypress has an example of an encoding. The main properties of a good encoding are:
- good locality (no off-page string overflows, please)
- no (real) restrictions
- extensible
- moderately compact
- moderately fast decode/encode

A system attribute is used for out links ($outlinks). The value of this attribute is: {{out link name, {DID, ...}}, {out link name, {DID, ...}}, ...}. There is at most one $outlinks attribute per object. A system attribute is used for children ($children). The value of this attribute is a list of DIDs: {{DID, ...}}. There is at most one $children attribute per object. A system attribute is used for the directory ($directoryContents).
The value of this attribute is a directory object, a primitive type understood internally by the server. There is at most one $directoryContents attribute per object.

Interoperation of optical disk subsystem with full system

Write policy to optical disk jukebox backend

As objects are modified and created, updates only occur to magnetic disk and main memory. Archival storage is not involved. As the magnetic disk and main memory start to fill with ``dirty'' data, some objects have to be written to the optical disk jukebox backend (tertiary storage). Once written and backed up (to quaternary storage), these objects may be removed from magnetic disk and main memory.

What policy should be used to write objects to the backend? The basic idea is to use LRU. For small and moderate sized objects, LRU is for the entire object. For large objects, LRU is for pages of the object. LRU will not work well all by itself. LRU needs to be tweaked for:
- big objects should not dominate magnetic disk storage
- object volatility
- some big objects are just created, are immutable, and have to be archived
- transient/permanent objects
- big objects have to be subdivided and archived in pieces when they are incrementally updated; otherwise, database-like objects will dominate magnetic disk

How often will a reference miss magnetic disk and main memory, and have to retrieve the object from archival memory? Clearly, the goal is to hide the fact that there is an optical disk out there. The system has to be able to complain if it is behind so that backup can be run on the backend.

Interaction with backup and crash recovery

Assumptions:
- There is an autonomous system handling the details of the optical disk jukebox, the backend. This system is accessed over the net. It has its own backup and recovery mechanisms.
- It does backup every night, and the frontend service can tell if the backup works.
- For the purpose of recovery, backup always works, so data that is backed up is always recoverable.
After data is written to the backend, it still cannot be removed from magnetic disk. Only after a backup has completed that writes the object to tape (or wherever) can the object be deleted from magnetic disk. If, in the interim, the backend has to be reloaded from backup (a very lengthy process), then the object is still available in the frontend. It will be written to the backend soon after recovery.

The frontend has its own backup system. It operates independently of the backend's backup, but must be coordinated. If we have to reload the frontend from backup, then extra stuff will be on the optical disk that we do not know about. A scavenger could enumerate the backend to discover objects that have been archived, but this is probably not worth the effort.

Magnetic disk page allocator and clustering

Requirements
- Keep an object, its attributes, and its contents on a page - if it fits (see Object representations). If an attribute value or the contents is bigger than a page, store it on its own page set.
- Place no (real) restriction on the number of attributes or their sizes. An acceptable restriction is that an object cannot exceed the size of a tertiary storage device, tape, or platter.
- Place no (real) restriction on the size of the contents of an object.
- Concurrent allocation
- Goals: cluster leaf objects near parents, cluster links near ``from'' objects, allocate big objects in big contiguous chunks, have fast allocation, and recover from crashes quickly.

Object allocation on magnetic disk

An object is a collection of information about the object. It is kept in an encoded form specified in the section Object representations above. Here's how to allocate pages for an object. Objects come in three kinds: small, medium, and large. Assume for the moment that a page is 4K bytes. Object size is discovered dynamically, and the policy in use for the object changes. Clustering is done for ``leaf'' objects (objects that do not have children) near parent objects.
Parent objects, while buffered in memory, maintain a list of cylinders where some of their children are located. This is a hint list for allocation. A run is a set of contiguous pages. Some file systems have called this an extent. Large objects are broken into a set of runs for the various parts of the object.

Small objects

Small objects are less than or equal to a page for the root data and the contents. Locate ``leaf'' objects (objects that do not have children) near parent objects, and locate small links near their ``from'' object, using the hint list for allocation. ``Near'' here means prefer the same cylinder (cylinder group?), try for the same sector as the parent (or ``from'' object), but use the hint list if required. Non-leaf objects are not clustered (on purpose, to break the cluster-near paradox). Never split small objects. Always allocate a sector just for the object if required. Note that this may commit us to doing some sort of compaction later on. The rationale for this choice of allocation is that other database systems have used similar clustering, and it should work well for NoteCards.

Medium objects

Medium objects have their root data less than or equal to a page, but are not small objects, and their contents are less than or equal to 16 pages. Prefer to locate links near their ``from'' object. Prefer to locate ``leaf'' objects near parent objects. Non-leaf objects are not clustered, and the root data and contents are stored in one run. The policies for leaf objects are:
- contents less than or equal to four pages:
  - cluster the root data and contents
  - if not enough room, cluster the root data and put the contents in one run wherever it fits
  - if still not enough room, put the object wherever it fits
- contents more than four pages:
  - cluster the root data and put the contents in one run wherever it fits
  - if not enough room, try to put the root data and the contents in one run wherever it fits

Large objects

Large objects are everything else.
Try to cluster the root data if it is less than four pages. Try to allocate (the rest of the object) as one disk page run. If the object is larger than 100 pages, try to allocate in runs of at least 100 pages.

Allocating space on magnetic disk

The goal (not done in version 0)

Cylinders are grouped together to form a cylinder group. The number of cylinders in a group is a property of the configuration and disk. The number in a group is small (maybe one); for understanding, suppose that five cylinders make up a group. The first few sectors (replicated on the last few sectors) of the group store the cylinder group head. This stores the free page map (a bitmap) and a count for each sector of the free bytes. Part of the initialization reads all cylinder group heads into memory. They have latches that synchronize access. They are updated and logged via physical logging. For cylinder groups undergoing modification, a shadow cylinder group is created by copying the cylinder group. All allocations occur in the shadow, with the modification identified with the (sub-)transaction that made it. If the (sub-)transaction commits, the modification is applied to the real cylinder group (using the normal dance that Camelot has for modifying recoverable data). If the (sub-)transaction aborts, the modifications are undone from the shadow.

Camelot

For Camelot, the first few sectors (~80) of a segment are reserved for the segment metadata. The data is stored at low numbered pages. If necessary, it is replicated. The replication is at a fixed address within the segment (e.g., all segments reserve 80 sectors for the first copy (pages 0-79) and the second copy (80-159)). Replication is fixed so that the segment can be expanded without undue pain, and recovery can always find the replicated data without reading anything in the first copy. Page 0 and page 79 are replicas of each other (and replicas of 80 and 159 if metadata replication is used).
They contain the following data:
- segment ID
- use of segment (normal or system use only)
- metadata replicated flag
- size of allocation map
- segment maximum size so far
- segment logical page 0

The pages just past the metadata hold the allocation bitmap. Enough space is left so that the segment can be expanded later.

One design for clustering

[The rest of this section is a description of one way to try to achieve the goal of clustering. There are lots of other designs; this one is presented just to show one. Whoever implements this should choose what to do.]

When trying to allocate an object in a cylinder group, first the ``cluster near'' object is looked at, if one is given. If the object fits on the same page, put it there. If it fits on any page on the same track, put it on the sector using ``best fit'', but try to leave a little breathing space (5-10% of the sector -- select the percentage as a configuration parameter for tuning) for small objects. If there is no ``cluster near'' object, or if allocation on the same track doesn't look good, put it on some sector using ``best fit''. Always try to leave a little breathing space for small objects. For multi-page objects, always allocate in runs of 100 pages or more, unless they are smaller than that.

If an object, or parts of it, do not fit in a cylinder group, then another group must be tried. For some parents, there is a list of other groups where their children have already been located. Once this hint list is also exhausted, it is time to try a completely different group. The group is selected randomly from the groups that are somewhat empty.

Keep the cylinder groups in lists that represent how full they are. The ranges are 0-80%, 80-85%, 85-89%, 89-92%, 92-94%, 94-96%, 96-97%, 97-98%, and 98-99%. Groups that are over 99% full appear on no list. [These percentages, and the percentages below, are made up; tuning or configuration parameters should be used.] The goal is to build a set of cylinder groups that are somewhat empty.
To do this: add to the set all cylinder groups that are less than 80% full. If this is at least 10% of the groups, the set is complete. Otherwise, successively add the next band of cylinder groups (80-85%, 85-89%, and so on) until there are 15% or more of the cylinder groups in the set. If there is a good set, then randomly select a group from it. Otherwise, start the flusher (if it isn't already running) and randomly select any group in the set. The flusher is described in the next subsection. If the set is empty, randomly select any group that is large enough.

Flushing objects from magnetic disk

The system monitors its free magnetic disk storage and asynchronously initiates flushing if required. It needs an algorithm to notice when it is getting behind, and it needs a way to force backup to run on the archival server.

Part of the DID map is the last-used-time of an object's run (or a finer granularity of the object). The granularity of the time is kept coarse to avoid excessive update. The DID map is used to keep the last-used-time since the map is already a write hot spot. Last-used-time is kept for each run of the object.

Many flush algorithms are possible. Here is one proposal. The system keeps and/or computes the cutoff time. A few other times are computed after this time (e. g., say the cutoff was 21 days; then 20, 19, and 18 days could be the "other times"). The DID map is scanned. For medium and large objects, the LRU times for the runs of the DID are checked. For each run, if it is older than the cutoff, then it is flushed. If it fits into any of the other times, its size is recorded in that time interval.

Small objects present a problem. The page should be flushed based on the earliest LRU time. I estimate that a hash table of a million or ten million entries should suffice. Each entry has one of three values: skip, candidate for flush, or not a candidate for flush. The table is initially set to skip.
As a DID is seen that is older than the cutoff date, its hash entry (the hash function is based on sector address) is set to "candidate for flush", provided that it was previously skip. As a DID is seen that is newer than the cutoff date, its hash entry is set to "not a candidate for flush". A second scan is necessary to find old DIDs whose hash entry is "candidate for flush". Note that after pass one, care must be taken not to flush DIDs that have just been touched. Note that collisions present a problem, but it is statistically so small that we can ignore it; changing the hash each scan (not each pass) makes the problem small enough to ignore.

Periodically through the scan, the sum of sizes is stably recorded. The cutoff time is computed from the values stored from the last scan. The amount of storage reclaimed from medium and large objects can be estimated from these numbers.

Optical disk

If we buy our optical disk management code, we will have little or no control here. We should not plan to make the optical disks removable.

Naming

Yggdrasil has a "hierarchical" directory naming system similar to names in file servers and file systems. See the data model memo for more details.

[Forget the next paragraph. We are doing delayed update.]

Name tables are hot spots and have locking problems. To avoid many of these problems, Yggdrasil will use a variant of the Postgres immediate update model. Write entries in the name table immediately (before xact commit). These entries indicate creation, deletion, or modification of a value. Values are viewed as immutable, but are garbage collected and are removed by other optimizations. A search only finds an entry if its xact has committed or if it is the current xact. This requires Yggdrasil to maintain a map of committed xacts. To keep the size of this list moderate, the name tables are swept periodically so that all xacts before some time are either committed, aborted (with all entries garbage collected), or in progress.
For old entries, all you have to remember is the new/old xact id transition and a short list (probably null) of in-progress xacts.

The locking and latching mechanisms operate separately from each other. A latch is required to make the update described above, but it is a short-term lock. Once the entry is inserted, the latch is dropped. Locking just prevents conflicting updates from occurring.

Forget the next paragraph. We are doing delayed update.

All of this means that locking is used to prevent conflicting updates to a directory, but updates to the directory are performed immediately without concern for the ultimate fate of the transaction.

Garbage collection

Object deletion

Indexing and containment make nearly all objects accessible (possibly through some tortuous path). Hence, it is nearly impossible to ever garbage collect objects. Thus, the system does no automatic garbage collection and relies on explicit object deletion.

Objects without children are called leaf objects, while those with children are called non-leaf objects. Either leaf objects or non-leaf objects may be deleted. When a leaf object is deleted, it is necessary to remove the root object, the contents, and all attributes, and to modify the DID map. If the object being deleted is the ``to'' object of some link(s), then the link(s) must be deleted too. This process is recursive.

Deleting a non-leaf object roughly means that the non-leaf object and all its children (discovered recursively) are deleted. This is not quite accurate. The container object, all its children, all their children, and so on, are removed from the appropriate container. If an object is only in this container, then it is deleted. Object deletion here has the same side effects as above (if the object is the ``to'' object of some link).

Uncommitted items in name tables

Since the naming code updates name tables prior to transaction commit, it is likely that some entries in the name tables will be for aborted transactions.
These will tend to litter the name tables and slow searches. As mentioned above, the name tables must be periodically swept to remove old aborted entries, both to cut down the size of the name tables and to keep the committed transaction map of reasonable size.

Versions, revisions, and alternatives

To set the context, here is part of the description of versions, revisions, and alternatives from YggdrasilDataModel.tioga:

Objects can have versions, revisions and alternatives. An object is really a tree of states. A new branch may be started from any state in the tree. This is called an alternative. Alternatives are distinguished by some attribute(s). These may include attribute values as well as time of updates. Major changes along a branch are called versions while minor changes are called revisions. Changes are applied at the leaves of the tree, which is called the current alternative. A new alternative at some state is viewed as a leaf (even though it has branches).

A new version or revision is built whenever a transaction commits that changes some part of the object. The version or revision is added to the version chain for the current alternative. Hence, Yggdrasil must store a tree of versions. Sub-branches have names. Non-branching nodes may be revisions (minor changes) or versions. Branches are always from versions. The term modification will be used for a new version, revision, or alternative.

Requirements

- The tree of versions must be represented and must be browsable
- All versions must be retrievable
- Random access to large versioned objects
- No (or very little) cost to get the default version
- Quick store of a new version

Version identifier string

A way is needed to encode the tree of versions. A compact string representation has several advantages. First, it is compact. Second, it is human-readable. Finally, it is encoded so that either trees or individual nodes can be represented. To parse the string, six reserved characters are needed.
These characters can either be excluded from alternative names or they can be quoted (preferred). For the purposes here, we will assume that ``$'', ``('', ``)'', ``.'', ``~'', and ``,'' are reserved.

Briefly, the encoding starts at the root of the tree. The root is not represented. As the tree is walked, a ``.'' represents a revision, a ``~'' represents a revision that has been eliminated, and a ``$'' represents a version. An alternative split is represented by a parenthesis-enclosed, comma-separated list of alternatives. For example, ``(string1,string2,string3)'' represents a version node that splits into three alternatives encoded as string1, string2, and string3. Each alternative starts with the alternative name and is followed by its tree of versions. The first alternative is the default alternative and may not have an alternative name.

In BNF, the language is:

tree-of-versions ::= $ inner-tree-of-versions
inner-tree-of-versions ::= empty | $ inner-tree-of-versions | . inner-tree-of-versions | ~ inner-tree-of-versions | ( alternative-list )
alternative-list ::= alternative | alternative , alternative-list
alternative ::= alternative-name-string inner-tree-of-versions

Examples:

$($.$.,foo($,bar.))
The root has one child, a version. That child has two alternatives: the default with identifier string ``$.$.'' and one named ``foo'' with identifier string ``($,bar.)''. The default alternative is a simple branch of two versions and revisions. The ``foo'' alternative starts with another alternative. The default here has only a single version, while the ``bar'' alternative has only a single revision.

$$.$..
This is an extension to the above string. A new revision has been added to the default. The union of the strings is ``$($.$..,foo($,bar.))''. Note the two ``.'' in a row.
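As a sanity check on the encoding, here is a small recursive-descent parser sketch (Python chosen arbitrarily). It is an illustration, not the system's parser: quoting of reserved characters inside alternative names is not handled, and the returned nested-list form is an assumption.

```python
RESERVED = set("$().~,")

def parse_version_string(s):
    """Parse a version identifier string into a nested list:
    '$'/'.'/'~' nodes in sequence; an alternative list becomes a list of
    (alternative-name, subtree) pairs."""
    pos = 0
    def inner():
        nonlocal pos
        nodes = []
        while pos < len(s):
            c = s[pos]
            if c in "$.~":                      # version / revision / eliminated
                nodes.append(c); pos += 1
            elif c == "(":                      # alternative list
                pos += 1
                alts = [alternative()]
                while pos < len(s) and s[pos] == ",":
                    pos += 1
                    alts.append(alternative())
                if pos >= len(s) or s[pos] != ")":
                    raise ValueError("missing ')'")
                pos += 1
                nodes.append(alts)
                break                           # an alternative list ends this branch
            else:
                break                           # ',' or ')' of an enclosing list
        return nodes
    def alternative():
        nonlocal pos
        start = pos                             # name may be empty (the default)
        while pos < len(s) and s[pos] not in RESERVED:
            pos += 1
        return (s[start:pos], inner())
    if not s or s[0] != "$":
        raise ValueError("must start with '$'")
    pos = 1
    tree = ["$"] + inner()
    if pos != len(s):
        raise ValueError("trailing characters")
    return tree
```

Both example strings above parse under this sketch, including the two-``.''-in-a-row case.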
Modification encoding

A modification is either stored as a separate entity, with its own field(s) in the DID map and an appropriate version identifier string, or it is encoded with other modifications using a ``delta'' technique.

Deltas can be either forward or backward. A forward delta encodes how to take a previous version of the object and modify it to be the new version. A backward delta takes a new version of the object and encodes how to modify it to get the previous version. Details of the delta encoding will not be given here, but it must specify how bytes are deleted and inserted.

Yggdrasil stores the most recent default version directly. This version may be compressed, but this is a detail. It also stores the backward deltas needed to move the object back to any previous default version. Finally, forward deltas are stored for the non-default alternatives and their successors.

Objects are divided into segments. Each segment is 100 kilobytes to 1 megabyte long (a configuration parameter). The segment is stored on some number (most likely one) of runs of disk sectors.
Its format is:

default value of the segment
    if the value is not compressed, this is the value of the data and/or metadata of the object
    compression and size are obtained from the DID map
some number of unused pages (optional, used for growth)
backward deltas along the default version chain
    these deltas have a preamble with the size, so it is easy to parse them
    there is an extra sentinel delta at the end
some number of unused pages (optional, used for growth)
forward deltas
    these deltas have a preamble with the size and the version identifier string
    they are stored in the order they were committed

To build any modification of the segment, do the following:

fetch the disk pages
decompress the value if needed
apply backward deltas to get back to the maximal node common to the default and the desired modification
apply forward deltas along the path to the desired modification

Discussion

To cover the expected cases, an array of techniques is needed.

large new modification made via complete rewrite
If a new modification of an object is stored by completely rewriting it, then the system will allocate some disk storage, update the DID map, and write the new modification to the disk storage. This modification is coalesced into the main tree of versions ``at night'' on a time-available basis. The DID map has to be scanned in some semi-fair order to clean up these modifications. If the modification is extreme (e.g., a new bcd), it should be left alone.

To find deltas, Yggdrasil will use the algorithm given by Paul Heckel in ``A Technique for Isolating Differences Between Files,'' CACM 21, 4 (April 1978), 264-268. The folks who did the DOMAIN system liked this algorithm the best. Luckily, this is what Waterlily uses.

small modification made via partial update
An example of this is the update of an attribute that stores a small string. The change is immediately encoded in the object and the DID map is updated.
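The rebuild procedure above (fetch, decompress, walk backward deltas, then forward deltas) can be sketched as follows. The delta representation here, a list of (offset, delete-length, insert-bytes) edits, is an assumption for illustration; the real preamble/sentinel encoding is not modeled, and compression is left aside.

```python
def apply_delta(data, ops):
    """Apply one delta: each op deletes `dlen` bytes at `off`, then inserts `ins`."""
    out = bytearray(data)
    # Apply right-to-left so earlier offsets remain valid as lengths change.
    for off, dlen, ins in sorted(ops, reverse=True):
        out[off:off + dlen] = ins
    return bytes(out)

def build_modification(default_value, backward_deltas, forward_deltas,
                       n_back, forward_path):
    """Rebuild a modification from a segment image.
    default_value: most recent default version (assumed uncompressed here).
    n_back: how many backward deltas to apply to reach the branch point.
    forward_path: indices of forward deltas to apply, in commit order."""
    value = default_value
    for delta in backward_deltas[:n_back]:   # walk back along the default chain
        value = apply_delta(value, delta)
    for i in forward_path:                   # walk out to the alternative
        value = apply_delta(value, forward_deltas[i])
    return value
```

For instance, backing a default value up one version and then applying one forward delta yields the requested alternative's value.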
``page level'' update of a large object
An example of this is changing a database page. The new value is written into the value part and the old value is added to the backward deltas. If this object is not keeping revisions, no backward deltas are kept.

Index data structures

Requirements

- Maintain indices at containers for specified attributes
- Keep the index exact for default versions
- Keep the index for all objects where this container is on the primary container chain
- Keep a list of objects directly contained in the container, where this container is not the primary container

Discussion

Four attributes are reserved for the index data structure. They are:

$indices-patterns: The set of patterns for attributes that are indexed at this node.

$indices-non-hierarchical-objects: The set of objects whose default version is directly contained in this container but that do not have this container as their primary container. This is how the system represents non-hierarchical containment, without extremely complicated update algorithms.

$indices-non-default-objects: The set of objects where some version is directly contained in this container, but the default version may not be directly contained. Used in non-default version searches.

$indices-bTree-bTree: A B-tree of B-trees for the indices. The first-level B-tree is indexed by attribute string name and has as its value a different B-tree for each attribute. This is potentially a massive structure for higher level containers. Each second-level B-tree is for a different attribute. It is indexed by attribute value (as described below) and contains the DID for the object.

Each second-level B-tree keeps in its root page an indication of the types of keys that are stored in the tree. It is anticipated that only one type (or objects that conform to this type) will be kept in any tree. That is, all the objects in the B-tree will have, for example, integer, string, or date values.
The root page will store an indication of the type (e.g., date) or a wildcard for more than one type. If all the objects have the same type of key, then ordering the B-tree is easy. If different types are used, then the objects will be primarily ordered by type, and secondarily ordered by key. If the type does not have a natural ordering (e.g., uninterpreted bytes), then ASCII ordering of the bytes will be used.

Index maintenance

Whenever a document attribute is modified, the indices for that attribute have to be updated. Also, whenever container membership is changed, indices have to be updated. Finally, changes in indexing patterns can cause indices to be built or destroyed.

The indices are maintained by ASAP update, and hence are not guaranteed to be atomic with the transaction that made the update. It is intended that the update proceed very quickly, but not using the client's transaction. The updates do have transaction guarantees of their own, just not inside the client's transaction.

The attribute may have been added, modified, or deleted. The DID for the document and its parent container DIDs, together with the add, modify, or delete indication, are appended to an "index update list." This list is kept in recoverable storage as a circular list with a start and an end pointer. One of the last things a transaction does before commit is to update this list (if needed).

A process does the deferred updates from the "index update list". It takes successive items off the list and tries to do the updates needed. If it cannot acquire a latch, it goes on to the next item. The goal is to do the update recursively for all needed objects, indices, or containers. However, the rules of the game can change. That is, while an indexable value is changing, the containment membership may concurrently be changing. The invariant that we wish to enforce is that the object is always indexed in at least all of the places required. Spurious extra indexing is OK, as it should be quite infrequent.
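The deferred-update machinery can be sketched as below. This is a stand-in, not the real thing: an in-memory deque substitutes for the recoverable circular list, the latch and index-update operations are assumed callbacks, and the recursion up to grandparent containers is omitted.

```python
from collections import deque

class IndexUpdateList:
    """Sketch of the deferred "index update list"."""
    def __init__(self):
        self.items = deque()

    def append(self, did, parent_dids, op):
        # Called as one of the last steps before transaction commit;
        # op is the add / modify / delete indication.
        self.items.append((did, tuple(parent_dids), op))

    def process(self, try_latch, apply_update):
        """Drain one round of the list. An item whose latch cannot be
        acquired is re-queued rather than blocking the whole worker."""
        for _ in range(len(self.items)):
            item = self.items.popleft()
            did, parents, op = item
            if not try_latch(did):
                self.items.append(item)      # skip for now, retry later
                continue
            for parent in parents:           # update each parent's index
                apply_update(parent, did, op)
```

Because items survive in the list until their updates are applied, replaying the list on recovery at worst redoes some updates, which matches the at-least-once invariant above.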
So updates go in the following phases (cyclically). The first phase starts by finding all the work that has to be done. Objects are created to indicate the dependence of the update on the data in the object (e.g., the parent objects of a container are important to index update due to a changed attribute). As objects (or maybe parts of objects) are modified by index update, they are timestamped (just a counter). The update keeps track of the largest timestamp seen. When it appears that the update is complete, the second phase starts. The dependencies are examined again. If a larger timestamp is found than was current during the first phase, we start over at the first phase.

What follows is old text.

The goal is to do the update for each parent. Depending on the way indices are being maintained (in a file or as a property), various locking must be done. Hopefully, an algorithm can be devised that allows the update to occur no matter what the state of the locks on the parent. For the interim, the following algorithm can be tried. For each parent, a read lock is attempted. If this fails, then an entry is added to the end of the list for just this DID and parent (with maybe a loop counter, conflicting transaction, and delay added - after several failures, the transaction holding the write lock is blown away). If the lock succeeds but the container is not keeping an index that matches the update, then no index update is done. If the index is being kept, then the index is opened (either as a property of the object, which needs a write lock, or as a file that needs no lock). The update is performed (as a server transaction). After the index is updated (if needed), the parent DIDs for the parent are obtained. An entry is added to the update list for the original DID and these parents. The old entry is removed from the list. The document lock is dropped. Processing is similar for changed container membership.
Upon recovery, the index update list is processed again. Some updates may be redone, but who cares.

Data compression

If compressed data via some algorithm is ever stored on the server, the server must forever retain the ability to decompress it itself! (Unless all copies of data using this compression are deleted.)

Yggdrasil will provide for compression hints associated with individual data values. On a per-service basis, defaults for compression for data types will be set by configuration parameters. Multiple compressions can be selected, and the system will try them all and store the best.

When data is written to the server, it may arrive uncompressed. When memory and CPU power are available, it can be compressed by the server. This activity should be invisible to clients. When accessed, it must be decompressed. Since large objects can be accessed at the page level, compression must allow for random access and decompression of data. More data than is needed may be decompressed, but the page data and the decompressed data sizes should be somewhat comparable (e.g., buffering 64K compressed and doing page level access for 4K is fine).

Scavenger

Scavenge of the backend

If the backend is reloaded or otherwise is in an unknown state while the frontend is undamaged, a scavenge might be necessary. It would involve going through the directory (or whatever the backend keeps as a name table or a database) on the backend and finding all the DIDs and their versions. The DID map would have to be updated to reflect lost objects and discovered objects on the backend.

Scavenge of the frontend

Why do we have to scavenge? What went wrong? What do we know?

Loss of DID map
Scan the cylinder group headers. Select the pages that are marked as having root objects. Fetch (in some semi-optimal way) the pages with root objects. Parse the pages and rebuild the DID map. Do a full backup.

Loss of cylinder group headers
Scan the DID map. Find all root object pages and used pages.
Rewrite the cylinder group headers.

Log failure
Process the log as far as possible. Use the cylinder group headers to discover lost DIDs. Scan the DID map to discover root objects and free pages.

Backup buffer failure
Nil the backup buffer. Do a full backup.

Replication

Basic principle
Always have two or more copies of all committed data. This can be accomplished by double writing to disk (e.g., for the log) or by relying on backup.

Normal operations
In normal operations, Yggdrasil mirrors the log (unless the log is on a mirrored disk!). As data is committed, it is logged. When it is time to write it back to the magnetic disk, it is written both ``home'' and to the backup buffer (see below). The backup buffer is a different disk volume where data to be backed up resides. When backup runs, it copies this data to ``tape'' and nils the buffer.

Frontends
Multiple frontends can be placed on a bus (at least they can on a SCSI bus) to avoid loss of the system during a frontend machine failure. Multiple buses can be used to increase bandwidth and allow for more devices. Dual-ported disk drives (ported to two controllers on different buses) and multiple buses also can be used to avoid bus and controller failures. Any frontend can take over the role of any service. There might be three frontends and six services; when one of the computers fails, three instead of two services would run on each remaining frontend.

Disks
Design for mirrored disk facilities. However, the ``basic principle'' above may make mirrored disks unnecessary to protect against loss of data. Availability, though, may force Yggdrasil to run with mirrored disks.

Backends
Backends may also be replicated. Data is not viewed as reliably written to the backends until it is written and backed up on all backends serving the service.

Frontend backup

When it is time to write some data to the magnetic disk, it is written both ``home'' to the disk and to the backup buffer.
The backup buffer is a different disk volume where data to be backed up resides. When backup runs, it copies this data to ``tape'' and nils the buffer. Two or more copies of the tape are written. Some coordination (TBD) is necessary to synchronize all of this.

DID map

The DID map is used to translate DIDs into information about the object. Small objects with few links are stored with all the contents, properties, and outlinks in one place. As parts of the object grow in size, each part is broken out into its own piece. For example, when the contents for a DID are small, they are stored "inline." As the contents grow, they are moved "out of line." The remaining components of the object that are "inline" are called the root. A modification is a version, revision, or alternative of the object. The information about a DID includes:

- for each modification (a version, revision, or alternative) stored separately:
  - a version identifier string for the modification
  - the slot number of the object (optional) for the first storage address
  - compression used, if any
  - a flag for keeping modifications
  - root, contents, or outlinks flags (the root may contain the contents and/or the outlinks for small or sparsely connected objects)
  - for each segment of the modification:
      secondary storage address(es) (disk and/or optical storage) encoded as storage address runs
      byte number of the first byte
      size of stored data
      offset to modification deltas
      last-used-time
      backup time and identifier

Runs may overlap. The whole object may be stored on optical storage with just part of it buffered on disk.

The DID map must be able to:
- map a DID to the information
- be scanned in no particular order for scavenge and backup

Hence, many data structures will work. Hashed, B-tree, and heap are all acceptable data structures. The DID map is a disk resident data structure. It is double written. To be resolved: what is the memory resident form of the DID map?
Is it just cached pages of the map, or is it entries extracted from pages? The former may take too much storage and the latter is harder to update.

A design using hashing and B-trees

The DID map is stored on disk. As a DID entry is needed, it is cached in a memory resident data structure. This data structure is periodically pruned to remove entries that have not been active.

When an entry is updated, the DID has already been locked (it says here). An intention is created (or a nested intention for a nested transaction) and the update is recorded in the intention. If the (sub-)transaction aborts, the intention is destroyed. If a sub-transaction commits, the intention is promoted. If the commit is for a top level transaction, then during the pre-commit the intention is applied to the cache entry, and it must be applied to the disk resident DID map. If the transaction later fails in commit, the cache entry is destroyed and the undo is applied to the disk resident DID map in virtual memory.

The basic structure of the disk resident DID map is a hash on the DID leading to a B-tree. The DID is hashed into one of 64 values (64 seems to be enough to get lots of parallelism). This hash value indicates to which of the 64 B-trees the DID belongs. The Cedar Btree code controls these B-trees. The B-trees are kept in replicated recoverable storage.

The disk resident DID map is kept in a separate segment. At a fixed location, a set of 64 addresses is kept. These are the addresses of the first header page for each B-tree. A header page contains an ordered set of addresses and sizes where the B-tree is stored. Extension header pages are forward chained. Clustering is attempted on a per-B-tree basis. By reading all the header pages for a B-tree, the storage for the B-tree is found. The "underlying page storage primitives" provided to the Cedar Btree code use this storage.

When a top level transaction is committing, it may have to update some number of DID map entries.
All the DIDs being updated are hashed and the hash values sorted; the DIDs are then processed in this order. This is to avoid deadlock (lock ordering). As the commit logic starts a new B-tree, it latches it. [The latch may not be needed, but I am pretty sure that otherwise concurrent updates inside the B-tree could deadlock.] All the updates needed by the transaction in that B-tree are performed. The normal dance that Camelot has for modifying recoverable data is used by the "underlying page storage primitives." Once all updates are done for this B-tree, the latch is released.

By having multiple B-trees, the problems with B-tree locking are reduced. With multiple B-trees, concurrent update by multiple processors is allowed.
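The commit-time ordering discipline can be sketched as follows. The `latch`, `unlatch`, and `btree_insert` callbacks are assumed stand-ins for the real latch manager and the Cedar Btree code; the point of the sketch is only the grouping and the fixed acquisition order.

```python
NUM_TREES = 64  # the DID map is split across 64 B-trees

def commit_did_updates(updates, latch, unlatch, btree_insert):
    """updates: dict mapping DID -> new map entry.
    Group DIDs by their B-tree and process trees in ascending hash order,
    holding each tree's latch across all of that tree's updates. Because
    every committer acquires latches in the same order, no two committing
    transactions can deadlock on the B-tree latches."""
    by_tree = {}
    for did, entry in updates.items():
        by_tree.setdefault(hash(did) % NUM_TREES, []).append((did, entry))
    for tree in sorted(by_tree):          # fixed lock order across transactions
        latch(tree)
        try:
            for did, entry in by_tree[tree]:
                btree_insert(tree, did, entry)
        finally:
            unlatch(tree)
```

Each B-tree is latched at most once per committing transaction, so a transaction touching many DIDs in the same tree pays for one latch acquisition, not one per DID.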