BTreeSimple.mesa
Copyright Ó 1985, 1986, 1992 by Xerox Corporation. All rights reserved.
Taft, January 13, 1984 11:18 am
Russ Atkinson (RRA) March 11, 1985 9:42:39 pm PST
Brian Oki February 28, 1989 12:18:18 pm PST
BTreeSimple is a simplified interface to the BTree package, suitable for clients not requiring the full flexibility of the BTree interface. In particular, the representation of a BTree entry is fixed by this package (a text key and a sequence value); and the BTree is backed up on an FS file, using a cache similar to that used by BTreeVM. In return for this loss of flexibility, the package takes care of most of the boilerplate and bookkeeping automatically.
This interface defines nearly all the types and operations ordinarily required for BTree access; so for most purposes there should be no need to refer to the BTree interface. But there are two important exceptions: the signals Error and UpdateInProgress are defined only in the BTree interface, so catching them requires referring to that interface.
Willie-s, April 23, 1992 11:34 am PDT
DIRECTORY
Basics USING [Comparison],
BTree USING [PageNumber, PageSize, PathStk, Reason, Relation, State, Tree, UpdateType],
IO USING [STREAM],
Rope USING [ROPE, Text];
BTreeSimple: CEDAR DEFINITIONS = BEGIN
BTree representation
Tree: TYPE = BTree.Tree; -- This represents the volatile state of an open BTree
A BTree concepually contains an ordered set of Records, each of which consists of two REFs designating a Key and a Value. A Key is a ROPE (internally flattened to a Rope.Text) identifying a Record (or potential Record) in the BTree. The contents of the Key are ordinarily treated as a text string, though this default can be overridden by client-provided procedures. A Value is an uninterpreted array of words, which may be of different length in different entries. The client will typically need to LOOPHOLE the contents of the Value to the actual representation of the data being stored in the Value.
Record: TYPE = RECORD [ -- Not actually used; Key and Value are always passed separately
key: InternalKey,
value: Value];
Key: TYPE = Rope.ROPE;
InternalKey: TYPE = Rope.Text; -- The package always hands out this version of a Key. InternalKey is a bound variant of Key (see Rope package).
Value: TYPE = REF ValueObject;
ValueObject: TYPE = RECORD [words: SEQUENCE length: CARD OF WORD];
EntryKey: TYPE = POINTER TO EntryKeyObject;
EntryKeyObject: TYPE = RECORD [text: PACKED SEQUENCE length: CARD OF CHAR];
EntryValue: TYPE = POINTER TO ValueObject;
Since a BTree is an external data structure, each Record actually has an external "pickled" representation called an Entry. An unsophisticated client never needs to see an Entry; for such a client, all BTreeSimple procedures are SAFE. However, a more sophisticated client (in particular, one that treats a Key as other than a text string or that requires the ability to enumerate the BTree very efficiently) will have to deal with Entries to a limited extent; and the procedures that deal with Entries are necessarily UNSAFE.
KeyFromEntry:
UNSAFE
PROC [entryKey: EntryKey]
RETURNS [key: InternalKey];
Returns a newly-allocated collectible object containing a copy of the specified EntryKey; returns NIL if entryKey=NIL.
ValueFromEntry:
UNSAFE
PROC [entryValue: EntryValue]
RETURNS [value: Value];
Returns a newly-allocated collectible object containing a copy of the specified EntryValue; returns NIL if entryValue=NIL.
The following procedures determine the ordering relation among keys in a BTree. You need to worry about this stuff only if you want to treat Keys other than as ASCII text strings.
Comparison: TYPE = Basics.Comparison;
Compare:
TYPE =
UNSAFE
PROC [key: InternalKey, entryKey: EntryKey]
RETURNS [Comparison];
Compares key^ to entryKey^ and returns one of {less, equal, greater}.
CompareEntries:
TYPE =
UNSAFE
PROC [entryKey1, entryKey2: EntryKey]
RETURNS [Comparison];
Compares entryKey1^ to entryKey2^ and returns one of {less, equal, greater}. This is called only by Validate, and is used to verify that the entries are in order. To bypass the checking of order, it suffices to supply a CompareEntries procedure that always returns less.
CompareProcs:
TYPE =
RECORD [
compare: Compare,
compareEntries: CompareEntries];
defaultCompareProcs: CompareProcs; -- the default interpretation of a Key is as an ASCII text string with upper-case letters mapped to lower-case for the purpose of comparison (the same as Rope.Compare[..., case: TRUE])
A BTree is backed up on a STREAM. Reading records from the BTree requires only read access to the file. Inserting, deleting, or updating records will cause writes to occur, and may cause the file to grow. A BTree file never shrinks, no matter how few records remain; the only way to shrink a BTree file is to create a new one and insert all the entries from the old.
The BTree consists of "logical" pages, each of which is some integral number of contiguous FS pages. The ratio of page size to average entry size, times about .75, defines a tree's "branching factor", i.e., the number of descendants of each non-leaf page. A branching factor of about 7 should be considered the minimum acceptable (this gives you about 400 entries in a 3-level tree); much larger branching factors (10 to 25) are preferable. (Actually, with a uniform distribution of variable-size entries, the BTree package achieves a higher branching factor than the ratio given above, because it endeavors to keep smaller keys in the non-leaf pages).
The BTree file is accessed through a cache kept in virtual memory. The size of the cache is specified at Open time. For very large trees, a cache larger than the default will improve performance.
BTree global operations
State:
TYPE = BTree.State;
-- {closed, suspended, open}
A Tree object can be in one of three states. A tree that is closed is inaccessible for any operation besides Open or SetState; an attempted call will result in Error[closed]. A suspended tree is likewise inaccessible, but an attempted call will wait indefinitely until the tree is changed to some other state. An open tree is available for regular operations.
New:
PROC [compareProcs: CompareProcs ¬ defaultCompareProcs, initialState: State[closed..suspended] ¬ closed]
RETURNS [tree: Tree];
Creates a new Tree object with no associated backing file, and sets its initial state as requested. The compareProcs define the key representation and ordering for all BTrees to be accessed via this Tree handle.
CacheSize: TYPE = [8..4000];
PageSize: TYPE = BTree.PageSize;
ROPE: TYPE = Rope.ROPE;
Open: PROC [tree: Tree, readFile, writeFile: IO.STREAM, pageSize: PageSize, cacheSize: CacheSize ¬ 40, offset: CARD ¬ 0, initialize: BOOLEAN ¬ FALSE, maintainRecomputableState: BOOLEAN ¬ TRUE];
! Error {badSeal, wrongPageSize};
! UpdateInProgress;
Associates a Tree handle with a BTree instance, and changes its state to open. (If it is already open, it is first closed as described under SetState.) The BTree is backed up on the supplied file, which must previously have been opened (in the appropriate mode) by the client. CacheSize is the maximum number of "logical" BTree pages to keep in the local VM cache. If initialize=TRUE, an empty BTree is created; if FALSE, the file is assumed to contain a well-formed BTree already. If maintainRecomputableState=TRUE then the BTree package will maintain the tree's entryCount and "update in progress" flag; the situations under which doing so is undesirable are discussed below under "Locking and consistency".
"offset" specifies which byte offset is the origin for BTree page zero.
SetState:
PROC [tree: Tree, state: State[closed..suspended]];
If the tree is open, waits until there is no BTree operation in progress and the "update in progress" flag is off (which can take arbitrarily long, depending on the activities of other processes accessing the tree). Then sets the tree's state to the specified value. This procedure is intended to force the tree into a quiescent, stable state, e.g., immediately before initiating a checkpoint or rollback.
GetState: PROC [tree: Tree] RETURNS [state: State, entryCount: CARD, greatestPage: BTree.PageNumber, depth: CARD, file: IO.STREAM];
GetState: PROC [tree: Tree] RETURNS [state: State, entryCount: CARD, greatestPage: BTree.PageNumber, depth: CARD];
Returns the Tree's state, the number of entries contained in the tree, the highest PageNumber of any page which is or ever has been part of the tree, the tree's depth (number of levels; by convention an empty tree has depth=0), and the handle on the BTree backing file. Returns entryCount=LAST[LONG CARDINAL] if the tree has ever been opened and updated with maintainRecomputableState=FALSE. The entryCount is only approximate if the BTree has ever been updated using any Tree handle besides this one, unless Validate has been executed since the most recent update. If state#open, the remaining results are left over from the last time the tree was open; if it has never been open, they are indeterminate.
GetStatistics:
PROC [tree: Tree]
RETURNS [hits, misses, reads, writes, cumChainLength, cumReadWriteTime:
CARD];
Returns performance statistics, measured from the time the tree was last Opened. Hits and misses are the number of ReferencePage operations for pages that were and were not already in the cache. Reads and writes are the number of actual reads and writes to the backing file; and cumReadWriteTime is the total amount of time spent waiting for reads and writes. cumChainLength is the total number of hash chain entries searched during all lookups; the average search length is cumChainLength/(hits+misses).
Validate:
PROC [tree: Tree];
! Error {closed, entriesOutOfOrder, entrySizesWrong, other};
Scans the entire BTree and verifies that it is well-formed. Any error (except closed) indicates a fatal structural problem whose presence may cause the BTree package to malfunction if any access operations are performed.
If maintainRecomputableState=TRUE, this procedure counts the entries in the BTree and updates the entryCount field in both the volatile and the permanent state of the tree.
SetUpdateInProgress:
PROC [tree: Tree, updateInProgress:
BOOLEAN];
! Error {closed};
Sets the "update in progress" flag to the specified value. Ordinarily this flag is maintained automatically as a reliable indication of the internal consistency of the BTree. However, a client desiring an indication of higher-level consistency in the face of operations involving multiple updates should bracket those updates with calls to this procedure. If this is done, a subsequent Open will raise UpdateInProgress if such an update sequence was interrupted, even if the tree itself is well-formed. (Note: calling SetUpdateInProgress makes sense only if the tree was opened with maintainRecomputableState=TRUE.)
Higher-level BTree access
There are two classes of BTree access operations. The higher-level operations traffic exclusively in REFs and are SAFE in the Cedar sense. In general, each BTree access involves the creation of a new Value object, and perhaps of a new InternalKey object as well. Thus, the higher-level operations are good for isolated accesses, but not for applications requiring good performance such as exhaustive enumerations.
Corresponding to each of the higher-level operations there is a lower-level UNSAFE operation which does not require allocating any new objects. These are defined in the next section.
Relation: TYPE = BTree.Relation; -- {less, lessEqual, equal, greaterEqual, greater};
ReadRecord:
PROC [tree: Tree, key: Key, relation: Relation ¬ equal, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE]
RETURNS [actualKey: InternalKey, value: Value];
! Error {closed, entrySizesWrong};
Finds the closest BTree entry having the specified relation to key, and returns newly-allocated InternalKey and Value object for the entry that was found. Returns NIL if no such entry exists. That is, relation=equal finds an entry that is equal to key; relation=lessEqual finds the greatest entry less than or equal to key; relation=greater finds the least entry strictly greater than key; etc.
A NIL key corresponds to a key smaller than the least entry in the BTree. If relation=greaterEqual or greater, the result is the least entry in the BTree (assuming one exists); otherwise the result is always NIL.
The pathStk and useExistingPath arguments are described below under "PathStk hints".
ReadValue:
PROC [tree: Tree, key: Key, relation: Relation ¬ equal, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE]
RETURNS [value: Value];
! Error {closed, entrySizesWrong};
Same as ReadRecord, but returns (and incurs the cost of allocating) only the Value.
EnumerateRecords:
PROC [tree: Tree, key: Key, relation: Relation ¬ equal, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE, Proc:
PROC [key: InternalKey, value: Value]
RETURNS [continue:
BOOLEAN]]
RETURNS [exhausted:
BOOLEAN];
! Error {closed, entrySizesWrong};
Looks up an entry as for ReadRecord, except that if key=NIL or key is less than the least entry in the BTree then finds the least entry in the BTree (regardless of the value of relation), and relation=equal is the same as relation=greaterEqual. Then calls Proc[key, value] with (copies of) the key and value for the entry that was found. If Proc returns FALSE then EnumerateRecords terminates with exhausted=FALSE. If Proc returns TRUE then it is called again with the entry's successor; this continues until Proc returns FALSE or the BTree is exhausted. If the BTree is exhausted then EnumerateRecords returns exhausted=TRUE.
The tree is locked, so Proc may not invoke any BTree operations. To achieve an unlocked enumeration, repeatedly call ReadRecord[relation: greater] and copy the returned actualKey back into the Key. (A backward enumeration may likewise be accomplished using ReadRecord[relation: less].)
UpdateType: TYPE = BTree.UpdateType; -- {insert, replace, insertOrReplace};
UpdateRecord:
PROC [tree: Tree, key: Key, value: Value, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE, updateType: UpdateType ¬ insertOrReplace];
! Error {closed, depthExceeded, entrySizesWrong, wrongUpdateType};
Inserts the supplied value into the BTree at the location specified by key. If there already exists an entry with the same key, it is replaced. A key of NIL is illegal.
If updateType=insert then the update is required to be an insertion of a key not already present in the tree. Similarly, if updateType=replace then the update is required to be a replacement of an entry already present in the tree. If the updateType is violated, Error[wrongUpdateType] is raised and the operation is not performed.
DeleteKey:
PROC [tree: Tree, key: Key, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE]
RETURNS [found:
BOOLEAN];
! Error {closed, entrySizesWrong};
Deletes the BTree entry at the location specified by key. Returns TRUE if this was done, FALSE if no such entry existed in the BTree. A key of NIL is illegal.
Lower-level BTree access
These are all UNSAFE, but more efficient than the corresponding SAFE procedures since no storage allocation is required. The client call-back procedures may obtain collectible (safe) objects by calling KeyFromEntry or ValueFromEntry.
ReadEntry:
UNSAFE
PROC [tree: Tree, key: Key, relation: Relation ¬ equal, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE, Proc:
UNSAFE
PROC [key: EntryKey, value: EntryValue]];
! Error {closed, entrySizesWrong};
Finds the closest BTree entry having the specified relation to key, and calls Proc[key, value], passing pointers into the BTree page itself. Calls Proc[NIL, NIL] if no such entry exists. That is, relation=equal finds an entry that is equal to key; relation=lessEqual finds the greatest entry less than or equal to key; relation=greater finds the least entry strictly greater than key; etc.
A NIL key corresponds to a key smaller than the least entry in the BTree. If relation=greaterEqual or greater, the result is the least entry in the BTree (assuming one exists); otherwise the result is always NIL.
The pathStk and useExistingPath arguments are described below under "PathStk hints".
The tree is locked and a reference is in progress to the BTree page containing the entry, so Proc may not invoke any BTree operations. It is illegal for Proc to write into key^ or value^.
EnumerateEntries:
UNSAFE
PROC [tree: Tree, key: Key, relation: Relation ¬ equal, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE, Proc:
UNSAFE
PROC [key: EntryKey, value: EntryValue]
RETURNS [continue:
BOOLEAN]]
RETURNS [exhausted:
BOOLEAN];
! Error {closed, entrySizesWrong};
Looks up an entry as for ReadEntry, except that if key=NIL or key is less than the least entry in the BTree then finds the least entry in the BTree (regardless of the value of relation), and relation=equal is the same as relation=greaterEqual. Then calls Proc[key, value], passing pointers into the BTree page itself. If Proc returns FALSE then EnumerateEntries terminates with exhausted=FALSE. If Proc returns TRUE then it is called again with the entry's successor; this continues until Proc returns FALSE or the BTree is exhausted. If the BTree is exhausted then EnumerateEntries returns exhausted=TRUE.
The tree is locked and a reference is in progress to the BTree page containing the entry, so Proc may not invoke any BTree operations. It is illegal for Proc to write into key^ or value^. To achieve an unlocked enumeration, repeatedly call ReadEntry[relation: greater] and copy the resulting EntryKey back into the Key. (A backward enumeration may likewise be accomplished using ReadEntry[relation: less].)
UpdateEntry:
UNSAFE
PROC [tree: Tree, key: Key, pathStk: PathStk ¬
NIL, useExistingPath:
BOOLEAN ¬
FALSE, valueLength:
CARD, Proc:
UNSAFE
PROC [value: EntryValue], updateType: UpdateType ¬ insertOrReplace];
! Error {closed, depthExceeded, entrySizesWrong, wrongUpdateType};
Inserts an entry with the specified Value.length into the BTree at the location specified by key. Proc is called to produce the contents of the entry's value; that is, Proc should store the value into the sequence elements value[0] through value[valueLength-1]. (This sometimes requires a lot of LOOPHOLEs and extreme care; it is intended for clients who don't want to preallocate the storage for the Value but instead generate it on-the-fly.) If there already exists an entry with the same key, it is replaced. A key of NIL is illegal.
If updateType=insert then the update is required to be an insertion of a key not already present in the tree. Similarly, if updateType=replace then the update is required to be a replacement of an entry already present in the tree. If the updateType is violated, Error[wrongUpdateType] is raised and the operation is not performed.
SalvageEntries:
UNSAFE
PROC [tree: Tree, Proc:
UNSAFE
PROC [key: EntryKey, value: EntryValue]
RETURNS [continue:
BOOLEAN]]
RETURNS [exhausted:
BOOLEAN];
! Error {closed};
Enumerates all well-formed entries in the tree in an arbitrary order, skipping over any malformed portions of the tree. If the tree is well-formed, SalvageEntries will enumerate all the entries. If the tree is malformed, SalvageEntries may return a subset or superset of the entries that should be present, depending on what operation was being performed when the tree was damaged. This procedure facilitates recovering all possible information from a broken tree, presumably in preparation for rebuilding it.
For each entry, SalvageEntries calls Proc[key, value], passing pointers into the BTree page itself. If Proc returns FALSE then SalvageEntries terminates with exhausted=FALSE. If Proc returns TRUE then it is called again with the entry's successor; this continues until Proc returns FALSE or the BTree is exhausted. If the BTree is exhausted then SalvageEntries returns exhausted=TRUE.
The tree is locked and a reference is in progress to the BTree page containing the entry, so Proc may not invoke any BTree operations. It is illegal for Proc to write into key^ or value^.
PathStk hints
All procedures that look up keys accept optional PathStk and useExistingPath arguments. If a PathStk is supplied then the operation fills it in with the path to the BTree entry that was located (the last such entry in the case of an enumeration). If useExistingPath=TRUE (in which case the PathStk must not be NIL or Error[nilPathStk] will be raised) then the BTree package will attempt to access the entry described by the PathStk; only if the entry is incorrect or the PathStk is otherwise invalid will a new path be constructed. Use of PathStk hints can substantially speed up certain operations such as repeated Reads, resuming a suspended Enumerate, or performing an Update of an entry that was just looked up.
It would be illegal for the same PathStk to be used in multiple concurrent calls to BTree package procedures; no check is made for this.
PathStk: TYPE = BTree.PathStk;
NewPathStk:
PROC
RETURNS [pathStk: PathStk];
Returns a new PathStk for subsequent use. It is marked invalid until the first lookup is done. The size of a PathStk is approximately 70 words.
Exceptions
The signals are not actually defined in this interface; the client must refer to BTree for them. This is because the compiler doesn't provide any way to copy a signal (or procedure) value from one interface to another. (Nor is it possible for the implementation to export to one interface a signal imported from another.) The commented statements contain the desired definitions, which are not compilable at the present time.
Error: ERROR [reason: Reason] = BTree.Error;
Reason:
TYPE = BTree.Reason;
-- {
badSeal, -- statePage does not contain a valid seal (during Open)
closed, -- attempt to access BTree when its state is closed
depthExceeded, -- insertion caused the tree to exceed the maximum depth permitted by the implementation (for all practical purposes this error is impossible)
entriesOutOfOrder, -- Validate found entries out of order
entrySizesWrong, -- inconsistency detected between the sizes of entries and the space used in BTree pages
nilPathStk, -- pathStk=NIL in call with useExistingPath=TRUE
other, -- Validate found unspecified structural problem with BTree
wrongEntryProduced, -- not possible when using BTreeSimple interface
wrongPageSize, -- attempt to Open existing BTree with a page size different from the one it was built with
wrongUpdateType}; -- the updateType specified in call to UpdateRecord or UpdateEntry was violated
UpdateInProgress: SIGNAL = BTree.UpdateInProgress;
This resumable signal is raised by Open if there is an uncompleted update in progress for this BTree; the contents of the tree should be treated with suspicion.
Locking and consistency
This package can accomodate two rather different approaches for maintaining consistency in the face of concurrent access and crashes in mid-update. These are distinguished by whether or not accesses to the backing file are covered by a transaction which includes locking, e.g., an Alpine transaction. It is required that all clients agree on which approach is being used for any given BTree.
If transactions are not being used, it is required that at most one Tree handle be active for a given BTree (i.e., a given backing file). For this case, the BTree package internally locks the entire Tree during every operation, enforcing a conventional "readers and writers" discipline (any number of simultaneous readers or exactly one writer, but not both).
If transactions are being used, clients may access a BTree with different Tree handles, so long as the PageStorage accesses performed for each one are covered by a different transaction. In this case, the BTree package's locking mechanism is subverted, but consistency is assured by the page locking that occurs as part of the PageStorage references. This permits increased concurrency at the risk of possible deadlocks. (With Alpine, maximum concurrency is obtained by covering write references with "update" as opposed to "write" locks so as not to block concurrent reads.) Note that it is still permissible for multiple clients to use the same Tree handle concurrently (i.e., under the same transaction), since the internal Tree lock is still in effect and keeps the clients out of each others' way.
The top-level permanent state of the BTree is kept in the statePage. Besides vital information such as the PageNumber of the root of the tree, the statePage optionally holds two pieces of recomputable information: the (approximate) number of entries in the BTree, and an "update in progress" flag which is set at the beginning of an update, cleared at the end, and checked during Open. Whether or not these two fields are maintained is controlled by the maintainRecomputableState argument of Open.
If accesses are performed under a transaction, maintaining the recomputable state is inappropriate because doing so would result in a write lock being set on the statePage during every update, thereby reducing concurrency (only one transaction that updates the BTree could be in progress at a time). Therefore, maintainRecomputableState should be FALSE when operating in this fashion.
If the PageStorage primitives stop performing accesses under one transaction and begin using another transaction (e.g., because the client has finished the first transaction and started a new one), the Open procedure should be called to inform the BTree package that any state it has cached in the Tree object (or associated PathStk objects) is no longer valid and must be re-read from the PageStorage. This is logically equivalent to creating a new Tree object, but is less costly.
END.