BTree.mesa
Last Edited by: Taft, June 6, 1983 6:10 pm
DIRECTORY
Environment USING [Comparison, wordsPerPage];
BTree:
CEDAR
DEFINITIONS =
BEGIN
BTree entries, records, and keys
A BTree contains an ordered set of Entries. From the BTree package's point of view, an Entry is simply an uninterpreted array of words. A Record is a collectible object containing a copy of an Entry. A Key identifies an Entry (or potential Entry) in the BTree; there is a one-to-one relationship between Keys and Entries. Each Entry contains (some encoding of) the information in the corresponding Key, such that it is possible to determine the ordering relation between any Key and any Entry. Also, each Entry typically contains additional information which is the "value" associated with the Key; but the BTree package has no cognizance of this. (An Entry will typically contain a copy of the Key that addresses it, but the package doesn't require that; a Key might contain, for example, auxiliary lookup information that makes Compare go faster.)
All knowledge of the representation of Entries, Records, and Keys is vested in the client program. Each client may have its own TYPEs defining the concrete representations of these objects. To permit the BTree package to be shared among all such clients without recompilation, this interface defines these objects in such a way that type checking is mostly bypassed. A cautious client will define a specialized BTree interface as a veneer over this one, exporting INLINE procedures which "translate" between the specific concrete types being used externally and the promiscuous ones defined here.
Tree: TYPE = REF TreeObject;
TreeObject: TYPE; -- This represents the volatile state of an open BTree
Entry: TYPE = LONG POINTER --TO READONLY EntryObject--;
EntryObject: TYPE;
Record: TYPE = REF ANY --RecordObject--;
RecordObject: TYPE;
Key: TYPE = REF ANY --KeyObject--;
KeyObject: TYPE;
Operations on client-provided objects
These operations are provided by the client when the tree is opened.
When an Entry is passed to a client-provided procedure, it is valid only until that procedure returns; the client must not store it in any variable whose lifetime exceeds that of the procedure call.
Compare:
TYPE =
UNSAFE
PROCEDURE [key: Key, entry: Entry]
RETURNS [Environment.Comparison];
Compares key^ to entry^ and returns one of {less, equal, greater}.
CompareEntries:
TYPE =
UNSAFE
PROCEDURE [entry1, entry2: Entry]
RETURNS [Environment.Comparison];
Compares entry1^ to entry2^ and returns one of {less, equal, greater}. This is called only by Validate, and is used to verify that the entries are in order. To bypass the checking of order, it suffices to supply a CompareEntries procedure that always returns less.
EntSize: TYPE = [0..LAST[PageSize]];
EntrySize:
TYPE =
UNSAFE
PROCEDURE [entry: Entry]
RETURNS [words: EntSize];
Returns the number of words occupied by entry^; i.e., the information in the entry consists of [entry^ .. (entry+words)^). This procedure must return unvarying results for any given value of entry^; the BTree package will malfunction in inscrutable ways if this is not so. An Entry's size may not exceed the BTree page size minus reservedWordsPerPage.
This procedure should be efficient, as it is called for every entry on every BTree page searched.
reservedWordsPerPage: CARDINAL = 3;
EntryFromRecord:
TYPE =
UNSAFE
PROCEDURE [record: Record]
RETURNS [entry: Entry];
Returns the pointer to the Entry contained within record^.
NewRecordFromEntry:
TYPE =
UNSAFE
PROCEDURE [entry: Entry]
RETURNS [record: Record];
Creates a new Record containing a copy of entry^.
RepresentationPrimitives:
TYPE =
RECORD [
compare: Compare,
compareEntries: CompareEntries,
entrySize: EntrySize,
entryFromRecord: EntryFromRecord,
newRecordFromEntry: NewRecordFromEntry];
Underlying page storage
The permanent storage for a BTree is represented by a client-provided PageStorage object; BTree pages are stored in an array of "pages" numbered [0..LAST[PageNumber]] relative to that object. Each page is typically the underlying physical page size (e.g., Environment.wordsPerPage) or some multiple thereof; PageSize defines the "reasonable" range of page sizes, within which one may expect the BTree package to operate with reasonable performance.
The ratio of page size to average entry size, times about .75, defines a tree's "branching factor", i.e., the number of descendants of each non-leaf page. A branching factor of about 7 should be considered the minimum acceptable (this gives you about 400 entries in a 3-level tree); much larger branching factors (10 to 25) are preferable. (Actually, with a uniform distribution of variable-size entries, the BTree package achieves a higher branching factor than the ratio given above, because it endeavors to keep smaller keys in the non-leaf pages).
The BTree package uses lower-numbered pages first. Precisely, when the package references page i for the first time, it is because all pages in [0..i) are in use and a new page is required. However, at any given time, the interval of in-use pages may be sparse.
The page storage is assumed to behave much like a virtual memory in which the Mesa VM is the "real" memory and the page storage array is the "virtual" memory. That is, the BTree package makes an explicit read or write reference with a "virtual" address which is a PageNumber and obtains a LONG POINTER to a page in the Mesa VM either mapped onto or containing a copy of that storage page. This LONG POINTER must remain valid until explicitly released by the BTree package. Any time after it is released, the Mesa VM page may be rewritten to the storage page (if dirty) and then reclaimed.
The page storage is expected to include a substantial amount of caching. That is, if a given PageNumber was referenced and released recently, a new reference to that PageNumber is expected to have very low cost on the average. In the same vein, instant rewriting of dirty pages upon their release is not recommended; systematic rewriting (if desired) should be deferred until the end of an operation. The cache should have room for at least log to the base branching-factor of the number of entries (i.e., the number of levels in the tree) so that all the pages in the lookup path to an entry will fit at once. A cache size at least several times this is recommended, because updates typically touch more pages than lookups do. (Precisely, a lookup references precisely one page per level, and an update references at most three pages per level.)
During any single BTree operation, the BTree package promises never to have more than 3 simultaneous references in progress; these may include multiple references to the same PageNumber. There are no references in progress between BTree operations.
PageStorage: TYPE = REF ANY --PageStorageObject--;
PageStorageObject: TYPE;
PageNumber: TYPE = CARDINAL;
statePage: PageNumber = 0;
Page containing the top-level permanent state of the BTree. The reason this may be of interest to the PageStorage primitives is explained below under "Locking and consistency".
PageSize: TYPE = [Environment.wordsPerPage .. 16*Environment.wordsPerPage];
PagePtr: TYPE = LONG POINTER --TO ARRAY [0..pageSize) OF WORD--;
ReferenceType: TYPE = {read, write, new};
UpdateState: TYPE = {startOfUpdate, endOfUpdate, unchanged};
Operations on page storage
ReferencePage:
TYPE =
UNSAFE
PROCEDURE [storage: PageStorage, number: PageNumber, type: ReferenceType]
RETURNS [ptr: PagePtr];
Makes a reference to the specified BTree page. Type=read means that the BTree package will not modify the page; write means that it intends to modify the page before releasing it. Type=new is the same as type=write except that the BTree package asserts that the page has not been used since the BTree was initialized (i.e., is not presently part of the tree) and that its former contents are uninteresting.
No provision exists for this operation to fail (e.g., because page storage is exhausted or a transaction aborts). Any errors that it raises must be caught by the client of the BTree package, and thereafter the state of the BTree is undefined. (If PageStorage accesses are made under a transaction, what is actually true is that the state of the Tree object and of the BTree as viewed by that transaction becomes undefined; the stable state of the BTree is of course perfectly ok.)
ReleasePage:
TYPE =
UNSAFE
PROCEDURE [storage: PageStorage, number: PageNumber, update: UpdateState ← unchanged];
Releases a page referenced by ReferencePage. If update=startOfUpdate (which occurs only when number=statePage) then an update operation involving multiple page writes has started, and the reference being released is the first write reference of the update. If update=endOfUpdate then an update operation has ended, and the reference being released is the last write reference of the operation; the BTree package asserts that there are no remaining references in progress for that update. (Unchanged means that the reference being released is neither the beginning nor the end of an update.) If an update operation writes only a single single page, that page reference will be released with endOfUpdate. If an update operation writes multiple pages (and maintainRecomputableState=TRUE), the operation will be bracketed with write references to the statePage released with startOfUpdate and endOfUpdate respectively. For an elaboration on this, see "Locking and consistency" below.
PageStoragePrimitives:
TYPE =
RECORD [
referencePage: ReferencePage,
releasePage: ReleasePage];
BTree global operations
Open:
PROCEDURE [repPrim: RepresentationPrimitives, storage: PageStorage, storPrim: PageStoragePrimitives, pageSize: PageSize ← Environment.wordsPerPage, minEntrySize:
CARDINAL ← 1, initialize:
BOOLEAN ←
FALSE, maintainRecomputableState:
BOOLEAN ←
TRUE]
RETURNS [tree: Tree];
Establishes global state for subsequent access to a BTree. The properties of BTree entries are defined by the RepresentationPrimitives; the set of pages comprising the BTree is identified by the PageStorage; and the PageStorage operations are defined by the PageStoragePrimitives. PageSize is the "logical" size of a BTree page in words. MinEntrySize is the size of the smallest possible entry in words; specifying this accurately minimizes the BTree package's use of temporary storage. If initialize=TRUE, an empty BTree is created; if FALSE, the storage pages are assumed to contain a well-formed BTree already. If maintainRecomputableState=TRUE then the BTree package will maintain the tree's entryCount and "update in progress" flag; the situations under which doing so is undesirable are discussed below under "Locking and consistency".
Raises Error[badSeal] if the statePage does not contain a valid seal; raises Error[wrongPageSize] if pageSize differs from the one that the tree was built with.
Raises the resumable signal UpdateInProgress if there is an uncompleted update in progress for this tree. The contents of the tree should be treated with suspicion.
ReOpen:
PROCEDURE [tree: Tree];
Informs the BTree package that any state it has cached in the Tree object (or associated PathStk objects) is no longer valid and must be re-read from the PageStorage. This is logically equivalent to creating a new Tree object, but is less costly. The situations under which this procedure should be called are discussed below under "Locking and consistency".
Validate:
PROCEDURE [tree: Tree];
Scans the entire BTree and verifies that it is well-formed. Raises Error[entriesOutOfOrder] if there are entries out of order. Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages. Raises Error[other] if there are other (unspecified) problems with the BTree. All these errors indicate fatal structural problems whose presence may cause the BTree package to malfunction if any access operations are performed.
If maintainRecomputableState=TRUE, this procedure counts the entries in the BTree and updates the entryCount field in both the volatile and the permanent state of the tree.
GetState:
PROCEDURE [tree: Tree]
RETURNS [entryCount:
LONG
CARDINAL, greatestPage: PageNumber, depth:
CARDINAL, storage: PageStorage];
Returns the number of entries contained in the tree, the highest PageNumber of any page which is or ever has been part of the tree, the tree's depth (number of levels; by convention an empty tree has depth=0), and the PageStorage handle currently in use. Returns entryCount=LAST[LONG CARDINAL] if the tree has ever been opened and updated with maintainRecomputableState=FALSE. The entryCount is only approximate if the BTree has ever been updated using any Tree handle besides this one, unless Validate has been executed since the most recent update.
SetUpdateInProgress:
PROCEDURE [tree: Tree, updateInProgress:
BOOLEAN];
Sets the "update in progress" flag to the specified value. Ordinarily this flag is maintained automatically as a reliable indication of the internal consistency of the BTree. However, a client desiring an indication of higher-level consistency in the face of operations involving multiple updates should bracket those updates with calls to this procedure. This procedure is called, for example, from the StartLongUpdate and EndLongUpdate procedures in BTreeVM.
Higher-level BTree access
There are two classes of BTree access operations. The higher-level operations traffic exclusively in REFs and are SAFE in the Cedar sense. In general, each BTree access involves the creation of a new Record object. Thus, the higher-level operations are good for isolated accesses, but not for applications requiring good performance such as exhaustive enumerations.
Corresponding to each of the higher-level operations there is a lower-level UNSAFE operation which does not require allocating any new Records. These are defined in the next section.
ReadRecord,
ReadRecordL,
ReadRecordLE,
ReadRecordGE,
ReadRecordG:
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE]
RETURNS [record: Record];
ReadRecord finds the BTree entry equal (according to the Compare procedure) to key, and returns a new Record containing it. Returns NIL if no such entry exists. The pathStk and useExistingPath arguments are described below under "PathStk hints".
ReadRecordL is the same as ReadRecord except that it finds the greatest entry strictly less than key; likewise for ReadRecordLE (less than or equal), ReadRecordGE (greater than or equal), and ReadRecordG (strictly greater).
A NIL key corresponds to a key smaller than the least entry in the BTree. For ReadRecordGE and ReadRecordG, the result is the least entry in the BTree; for ReadRecord, ReadRecordL, and ReadRecordLE, the result is always NIL.
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
EnumerateRecords:
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE, Proc:
PROCEDURE [record: Record]
RETURNS [continue:
BOOLEAN]]
RETURNS [exhausted:
BOOLEAN];
Looks up an entry as for ReadRecordLE, except that if key=NIL or key is less than the least entry in the BTree then finds the least entry in the BTree. Then calls Proc[record] with a Record containing the entry that was found. If Proc returns FALSE then EnumerateRecords terminates with exhausted=FALSE. If Proc returns TRUE then it is called again with another record containing the entry's successor; this continues until Proc returns FALSE or the BTree is exhausted. If the BTree is exhausted then EnumerateRecords returns exhausted=TRUE.
The tree is locked, so Proc may not invoke any BTree operations. To achieve an unlocked enumeration, repeatedly call ReadRecordG and copy the returned Record back into the Key. (A backward enumeration may likewise be accomplished using ReadRecordL.)
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
UpdateType: TYPE = {insert, replace, insertOrReplace};
UpdateRecord:
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE, record: Record, updateType: UpdateType ← insertOrReplace];
Inserts the entry contained in record into the BTree at the location specified by key. If there already exists an entry with the same key, it is replaced. It is required that Compare[key, EntryFromRecord[record]]=equal, and Error[wrongEntryProduced] is raised if this does not hold. A key of NIL is illegal.
If updateType=insert then the update is required to be an insertion of a key not already present in the tree. Similarly, if updateType=replace then the update is required to be a replacement of an entry already present in the tree. If the updateType is violated, Error[wrongUpdateType] is raised and the operation is not performed.
If inserting the entry causes the tree to exceed the maximum depth permitted by the implementation, Error[depthExceeded] is raised, and the state of the BTree is undefined. For all practical purposes this error is impossible.
Raises Error[entrySizeWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
DeleteKey:
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE]
RETURNS [found:
BOOLEAN];
Deletes the BTree entry at the location specified by key. Returns TRUE if this was done, FALSE if no such entry existed in the BTree. A key of NIL is illegal.
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
Lower-level BTree access
These are all UNSAFE, but more efficient than the corresponding SAFE procedures since no storage allocation is required.
ReadEntry,
ReadEntryL,
ReadEntryLE,
ReadEntryGE,
ReadEntryG:
UNSAFE
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE, Proc:
UNSAFE
PROCEDURE [entry: Entry]];
ReadEntry finds the BTree entry equal (according to the Compare procedure) to key, and calls Proc[entry], passing a pointer into the BTree page itself. Calls Proc[NIL] if no such entry exists. The pathStk and useExistingPath arguments are described below under "PathStk hints".
ReadEntryL is the same as ReadEntry except that it finds the greatest entry strictly less than key; likewise for ReadEntryLE (less than or equal), ReadEntryGE (greater than or equal), and ReadEntryG (strictly greater).
A NIL key corresponds to a key smaller than the least entry in the BTree. For ReadEntryGE and ReadEntryG, the result is the least entry in the BTree; for ReadEntry, ReadEntryL, and ReadEntryLE, the result is always NIL.
The tree is locked and a reference is in progress to the BTree page containing the entry, so Proc may not invoke any BTree operations. It is illegal for Proc to write into entry^.
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
EnumerateEntries:
UNSAFE
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE, Proc:
UNSAFE
PROCEDURE [entry: Entry]
RETURNS [continue:
BOOLEAN]]
RETURNS [exhausted:
BOOLEAN];
Looks up an entry as for ReadEntryLE, except that if key=NIL or key is less than the least entry in the BTree then finds the least entry in the BTree. Then calls Proc[entry], passing a pointer into the BTree page itself. If Proc returns FALSE then EnumerateEntries terminates with exhausted=FALSE. If Proc returns TRUE then it is called again with the entry's successor; this continues until Proc returns FALSE or the BTree is exhausted. If the BTree is exhausted then EnumerateEntries returns exhausted=TRUE.
The tree is locked and a reference is in progress to the BTree page containing the entry, so Proc may not invoke any BTree operations. It is illegal for Proc to write into entry^.
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
UpdateEntry:
UNSAFE
PROCEDURE [tree: Tree, key: Key, pathStk: PathStk ←
NIL, useExistingPath:
BOOLEAN ←
FALSE, words: EntSize, Proc:
UNSAFE
PROCEDURE [entry: Entry], updateType: UpdateType ← insertOrReplace];
Inserts an entry whose size is the specified number of words into the BTree at the location specified by key. Proc is called to produce the contents of the entry; that is, Proc should store data into [entry^ .. (entry+words)^). If there already exists an entry with the same key, it is replaced.
If updateType=insert then the update is required to be an insertion of a key not already present in the tree. Similarly, if updateType=replace then the update is required to be a replacement of an entry already present in the tree. If the updateType is violated, Error[wrongUpdateType] is raised and the operation is not performed.
After Proc is called, it is required that Compare[key, entry]=equal and EntrySize[entry]=words, and Error[wrongEntryProduced] is raised if this does not hold; the state of the tree is undefined if this error occurs. A key of NIL is illegal.
If inserting the entry causes the tree to exceed the maximum depth permitted by the implementation, Error[depthExceeded] is raised, and the state of the BTree is undefined. For all practical purposes this error is impossible.
Raises Error[entrySizesWrong] if an inconsistency is detected between the sizes of entries and the space used in BTree pages; the state of the tree is undefined if this occurs.
PathStk hints
All procedures that look up keys accept optional PathStk and useExistingPath arguments. If a PathStk is supplied then the operation fills it in with the path to the BTree entry that was located (the last such entry in the case of an enumeration). If useExistingPath=TRUE (in which case the PathStk must not be NIL or Error[nilPathStk] will be raised) then the BTree package will attempt to access the entry described by the PathStk; only if the entry is incorrect or the PathStk is otherwise invalid will a new path be constructed. Use of PathStk hints can substantially speed up certain operations such as repeated Reads, resuming a suspended Enumerate, or performing an Update of an entry that was just looked up.
It would be illegal for the same PathStk to be used in multiple concurrent calls to BTree package procedures; no check is made for this.
PathStk: TYPE = REF PathStkObject;
PathStkObject: TYPE;
NewPathStk:
PROCEDURE
RETURNS [pathStk: PathStk];
Returns a new PathStk for subsequent use. It is marked invalid until the first lookup is done. The size of a PathStk is approximately 50 words.
Exceptions
Error: ERROR [reason: Reason];
Reason: TYPE = {badSeal, depthExceeded, entriesOutOfOrder, entrySizesWrong, nilPathStk, other, wrongEntryProduced, wrongPageSize, wrongUpdateType};
UpdateInProgress: SIGNAL;
Locking and consistency
This package can accomodate two rather different approaches for maintaining consistency in the face of concurrent access and crashes in mid-update. These are distinguished by whether or not accesses to the underlying page storage are covered by a transaction which includes locking, e.g., an Alpine transaction. It is required that all clients agree on which approach is being used for any given BTree.
If transactions are not being used, it is required that at most one Tree handle be active for a given BTree (i.e., a given set of underlying PageStorage pages). For this case, the BTree package internally locks the entire Tree during every operation, enforcing a conventional "readers and writers" discipline (any number of simultaneous readers or exactly one writer, but not both).
If transactions are being used, clients may access a BTree with different Tree handles, so long as the PageStorage accesses performed for each one are covered by a different transaction. In this case, the BTree package's locking mechanism is subverted, but consistency is assured by the page locking that occurs as part of the PageStorage references. This permits increased concurrency at the risk of possible deadlocks. (With Alpine, maximum concurrency is obtained by covering write references with "update" as opposed to "write" locks so as not to block concurrent reads.) Note that it is still permissible for multiple clients to use the same Tree handle concurrently (i.e., under the same transaction), since the internal Tree lock is still in effect and keeps the clients out of each others' way.
The top-level permanent state of the BTree is kept in the statePage. Besides vital information such as the PageNumber of the root of the tree, the statePage optionally holds two pieces of recomputable information: the (approximate) number of entries in the BTree, and an "update in progress" flag which is set at the beginning of an update, cleared at the end, and checked during Open. Whether or not these two fields are maintained is controlled by the maintainRecomputableState argument of Open.
If the PageStorage accesses are not performed under a transaction, the "update in progress" flag may be used as a reliable indication of the consistency of the BTree if the PageStorage writes are managed in the following way. When the BTree package makes a (write) reference to the statePage and releases it with update=startOfUpdate, the statePage should be immediately rewritten to permanent storage. When the BTree package makes a (write) reference to the statePage and releases it with update=endOfUpdate, all dirty pages other than the statePage should be rewritten to permanent storage, followed by the statePage itself. (It is ok for the PageStorage implementation to write dirty pages at other times as well, except when references are in progress to them.) Note that updates are bracketed by statePage writes in this fashion only if maintainRecomputableState is TRUE. When the BTree package makes a (write) reference to a page other than the statePage and releases it with update=endOfUpdate, the update consists of that single write, so the tree is consistent both before and after the page is rewritten to permanent storage. Therefore, single-page updates are not bracketed with statePage writes. (This is why the permanent value of the entryCount is only approximate.)
If accesses are performed under a transaction, maintaining the recomputable state is inappropriate because doing so would result in a write lock being set on the statePage during every update, thereby reducing concurrency (only one transaction that updates the BTree could be in progress at a time). Therefore, maintainRecomputableState should be FALSE when operating in this fashion.
If the PageStorage primitives stop performing accesses under one transaction and begin using another transaction (e.g., because the client has finished the first transaction and started a new one), the ReOpen procedure should be called to inform the BTree package that any state it has cached in the Tree object (or associated PathStk objects) is no longer valid and must be re-read from the PageStorage. This is logically equivalent to creating a new Tree object, but is less costly. Note that the PageStorage cache must also be flushed so that subsequent references will give rise to real storage accesses that set locks under the new transaction.
END.