BTree Documentation
To        Cedar users        Date    January 17, 1984
From    Ed Taft        Location    Palo Alto
Subject    BTree, BTreeSimple, and BTreeTest packages    Organization    PARC/CSL
XEROX
<<Copyright © 1985 by Xerox Corporation.  All rights reserved.>>
Release as    [Indigo]<Cedar>Documentation>BTreeDoc.tioga
Draft    [Indigo]<PreCedar>Documentation>BTreeDoc.tioga
Last edited by    Taft, January 17, 1984 11:39 am
Abstract    This Cedar package maintains an ordered collection of objects as a BTree. The objects may be of different sizes, and 
there may be a large number of them (tens or hundreds of thousands). The amount of virtual memory required does not depend on the 
size of the BTree, and the cost of finding, inserting, and deleting objects increases only very slowly as the BTree gets larger. 
The package makes very few assumptions about the representation of the objects being stored or about the properties of the storage 
itself. (There also exists a separate package, BTreeSimple, that makes conventional assumptions and provides simplified access.) 
This memo gives an overview of the operation of the BTree package and presents the results of some performance measurements.
Introduction
    A BTree is a data structure for maintaining an ordered collection of objects (or entries) stored as a tree of fixed-size pages. The 
    entries are each smaller than a page and are stored many per page; thus, each interior page of the tree has many branches, so 
    even a very large tree is quite shallow (a depth of 3 or 4 is typical). This means that finding an arbitrary entry is 
    relatively inexpensive in terms of the number of pages that must be accessed. The cost of an update is also reasonable and has 
    an upper bound which is linear in the depth of the tree. Finally, the tree has very good locality, so references to consecutive 
    entries usually access the same page.
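
    The relationship between branching and depth can be illustrated with a small calculation. The sketch below is not part of the 
    package; it assumes an idealized tree in which every page holds exactly the same number of entries (the hypothetical parameter 
    entries_per_page) and every interior page has one more branch than it has entries.

```python
def levels_needed(entries: int, entries_per_page: int) -> int:
    """Smallest number of levels that can hold `entries` entries.

    Idealized model (an assumption, not the package's algorithm): every page
    holds exactly `entries_per_page` entries, and every interior page has
    `entries_per_page + 1` children.
    """
    if entries_per_page < 1:
        raise ValueError("entries_per_page must be at least 1")
    levels = 1
    capacity = entries_per_page  # a one-level tree is a single page
    while capacity < entries:
        # A new root adds k entries of its own and k + 1 subtrees,
        # each with the capacity of the previous, shallower tree.
        capacity = capacity * (entries_per_page + 1) + entries_per_page
        levels += 1
    return levels


if __name__ == "__main__":
    for k in (20, 100):
        print(k, "entries/page ->", levels_needed(1_000_000, k), "levels")
```

    Under this model a million entries fit in 3 levels at 100 entries per page and in 5 levels at 20 entries per page. A real tree 
    holds variable-size entries and partly filled pages, so its depth may differ somewhat.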
    For further information on the properties of BTrees in general, please read Knuth, The Art of Computer Programming, vol. 3, section 
    6.2.4.
    This BTree package is a Cedar transliteration of one written in BCPL by Ed McCreight and used for maintaining the file directory in 
    IFS. The important properties that distinguish it from existing Mesa BTree packages are:
        Generalitythe BTree package makes no assumptions about the representation of entries; it simply stores uninterpreted blocks of 
        words. It need know nothing about what portion of an entry contains the key'' or the value''. All knowledge about the 
        representation of entries is vested in client-provided procedures. The package also makes no assumptions about the means 
        used to store the BTree pages, but instead calls client-provided procedures to access them.
        Capacitythere is no practical limit to the number of entries that can be stored. In particular, there is no requirement that all 
        of the pages of the BTree occupy virtual memory at the same time. The ultimate limit to the size of the tree is 65,535 
        pages, which should be sufficient to store at least a million entries (even more if pages'' larger than 256 words are 
        used).
        Storage efficiencyoverhead consists of only one word per entry and two words per page; and the package maintains pages between 60 
        and 80 percent full on the average. The package also attempts to keep larger entries in the leaf pages and smaller ones in 
        the interior pages so as to minimize the tree's depth.
        Proven reliabilitysince the last'' bug was fixed in the BCPL version of the package, there have been many tens of thousands of 
        IFS-hours of operation with no suspicion of a BTree-related problem. No changes were made to the BTree maintenance 
        algorithms during the transliteration to Cedar.
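
    The capacity figure can be checked with rough arithmetic using the overheads stated above (one word per entry, two words per 
    page). The sketch below is illustrative only: the function name, the 8-word client entry, and the use of the 60 percent lower 
    bound on fullness are assumptions chosen for the example, not values taken from the package.

```python
def estimated_capacity(pages=65_535, page_words=256, entry_words=8, fill_percent=60):
    """Rough number of entries a tree of `pages` pages can hold.

    entry_words is the client's entry size (hypothetical value); one word of
    per-entry overhead and two words of per-page overhead are added, as stated
    in the memo. Branch pointers in interior pages are ignored.
    """
    usable_words = pages * (page_words - 2)           # two overhead words per page
    occupied_words = usable_words * fill_percent // 100
    return occupied_words // (entry_words + 1)        # one overhead word per entry


if __name__ == "__main__":
    print(estimated_capacity())                  # 256-word pages, 60% full
    print(estimated_capacity(page_words=1024))   # larger "pages"
```

    With these assumed values the estimate for 256-word pages comes out a little over 1.1 million entries, which is consistent with 
    the "at least a million" figure; larger entries reduce it proportionally.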
    It is intended that this package be suitable for use in maintaining any ordered data structure that must be stored externally, 
    i.e., is either permanent or is (potentially) too large to keep in virtual memory. The most obvious application is a file 
    directory. For smaller applications involving temporary data only, an internal data structure such as a RedBlackTree is more 
    efficient to access and maintain.
Instructions for use
    The primary documentation for the BTree package is the interface BTree.mesa, obtained via BTree.df. The following is an overview of 
    how the package is intended to be used; consult the interface itself for details.
    The client must provide two sets of procedures, called the representation primitives and the storage primitives. The BTree package 
    is object-oriented, i.e., a single instantiation of the package suffices to deal with an arbitrary number of Tree objects; and 
    each Tree object instance can have an independent set of representation and storage primitives associated with it.
    The representation primitives enable the package to find out everything it needs to know about BTree entries and keys. The most 
    important are EntrySize, which gives the size of an entry in words, and Compare, which gives the result of comparing a key with 
    an entry (less, equal, or greater). The package does not do anything with keys besides Compare them to entries. Additionally, 
    there are procedures for converting between REF and LONG POINTER (i.e., safe and unsafe) references to entries.
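
    The division of labor between the package and the representation primitives might be pictured as follows. This is a Python 
    sketch, not the Cedar interface: the names RepresentationPrimitives, my_entry_size, my_compare, and find_entry are hypothetical, 
    as is the entry layout (word 0 holds the entry length, word 1 the key, and the rest the value). A page is modeled as a flat list 
    of words, and a word offset stands in for a pointer to an entry.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Page = List[int]  # a page modeled as a flat list of words


@dataclass(frozen=True)
class RepresentationPrimitives:
    # Both callables receive the page and the word offset of an entry,
    # standing in for the pointer a real primitive would be given.
    entry_size: Callable[[Page, int], int]       # size of the entry, in words
    compare: Callable[[object, Page, int], int]  # <0, 0, >0: key versus entry


# Hypothetical client layout: word 0 = total length, word 1 = key, rest = value.
def my_entry_size(page: Page, offset: int) -> int:
    return page[offset]


def my_compare(key, page: Page, offset: int) -> int:
    entry_key = page[offset + 1]
    return (key > entry_key) - (key < entry_key)


PRIMS = RepresentationPrimitives(my_entry_size, my_compare)


def find_entry(key, page: Page, used_words: int,
               prims: RepresentationPrimitives) -> Optional[int]:
    """Return the offset of the entry matching key, or None.

    The search never looks inside an entry itself; it relies only on
    prims.compare and prims.entry_size.
    """
    offset = 0
    while offset < used_words:
        order = prims.compare(key, page, offset)
        if order == 0:
            return offset
        if order < 0:
            return None  # entries are ordered, so the key cannot appear later
        offset += prims.entry_size(page, offset)
    return None


if __name__ == "__main__":
    # Three packed entries with keys 10, 20, 30 and different sizes.
    page = [3, 10, 111, 4, 20, 221, 222, 2, 30]
    print(find_entry(20, page, len(page), PRIMS))
```

    Because the search code touches entries only through the two callables, a client with a different layout (for example, text keys) 
    would supply different primitives without any change to the search itself.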
    The storage primitives are the means by which the package accesses the pages in which the BTree is stored. The package gains access 
    to the contents of a page by calling the ReferencePage procedure, which returns a LONG POINTER to a block of virtual memory 
    containing a copy of that page. When the package is finished with the page, it calls ReleasePage; subsequently, the page 
    storage implementation may rewrite the page to permanent storage (if dirty) and reclaim the virtual memory.
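
    The reference/release protocol can be sketched in a few lines. The class below is a toy model written for this illustration and 
    is not BTreeVM: the names PageStore, reference_page, release_page, and reclaim are hypothetical, and the modified argument is an 
    assumption about how a storage implementation might learn that a page is dirty.

```python
class PageStore:
    """Toy stand-in for a page storage implementation (not BTreeVM itself)."""

    def __init__(self, page_words=256):
        self.page_words = page_words
        self.permanent = {}  # page number -> tuple of words ("the file")
        self.cache = {}      # page number -> mutable list of words ("virtual memory")
        self.pins = {}       # page number -> count of outstanding references
        self.dirty = set()   # pages modified since they were last written
        self.reads = 0
        self.writes = 0

    def reference_page(self, number):
        """Return the in-memory copy of a page, reading it in if necessary."""
        if number not in self.cache:
            stored = self.permanent.get(number, (0,) * self.page_words)
            self.cache[number] = list(stored)
            self.reads += 1
        self.pins[number] = self.pins.get(number, 0) + 1
        return self.cache[number]

    def release_page(self, number, modified=False):
        """Give up one reference; the page may later be written and reclaimed."""
        if self.pins.get(number, 0) == 0:
            raise ValueError(f"page {number} is not referenced")
        self.pins[number] -= 1
        if modified:
            self.dirty.add(number)

    def reclaim(self):
        """Write back dirty pages and drop every page that is not referenced."""
        for number in list(self.cache):
            if self.pins.get(number, 0) > 0:
                continue  # still in use by a caller; must stay in memory
            if number in self.dirty:
                self.permanent[number] = tuple(self.cache[number])
                self.dirty.discard(number)
                self.writes += 1
            del self.cache[number]


if __name__ == "__main__":
    store = PageStore(page_words=4)
    page = store.reference_page(7)
    page[0] = 42
    store.release_page(7, modified=True)
    store.reclaim()
    print(store.permanent[7], store.reads, store.writes)
```

    The point of the split is that a page can be rewritten or discarded only after it has been released; while it is referenced, the 
    caller may hold a pointer into it.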
    The package comes in two parts. The main part, BTreeImpl.bcd, exports the BTree interface. Additionally, there is an example 
    implementation of the page storage primitives called BTreeVMImpl, which exports the BTreeVM interface. This implementation is 
    intended for building a temporary or permanent BTree on raw Cedar Nucleus files. It maintains a VM cache whose size is 
    specified by the client; and it reads and writes the file explicitly, using File.Read and File.Write. BTree and BTreeVM are 
    both exported from the BasicCedar boot file.
    The idea is that the client program first calls BTreeVM.Open, passing a file capability and obtaining a storage handle and a set of 
    storage primitives. These, along with the client's representation primitives, should then be passed to BTree.Open, which 
    returns a Tree handle upon which BTree operations may be performed.
    There is no explicit Close operation. The Tree and storage handles are collectible objects, which vanish when the last REFs to them 
    disappear.
    There also exists a separate package, BTreeSimple, intended for use by clients with unsophisticated needs. It makes some 
    conventional assumptions about how the entries are interpreted (text keys, sequence values) and how the BTree is stored (an FS 
    file). In return for this loss of flexibility, the BTreeSimple package takes care of providing the representation and storage 
    primitives and does certain other bookkeeping that the basic BTree package requires the client to do. The implementation is 
    BTreeSimpleImpl, which is not part of the boot file.
Performance
    The performance of the BTree package depends on many variables: tree size, page size, access patterns, caching behavior of the 
    storage primitives, and others. A few general statements about performance are offered here, followed by some measured results.
    It is crucial that the storage implementation include a substantial amount of caching, managed in an LRU fashion. In many 
    situations, the BTree package references a given page multiple times in the course of a single update. Even ignoring this, in 
    most applications it is common for clients to make references to the same or closely-related entries over a short interval of 
    time; the resulting locality of reference has a performance benefit only when redundant storage reads are avoided by use of a 
    cache.
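
    The effect of locality on an LRU cache can be seen in a small simulation. The sketch below is an illustration only: it models 
    page references as a flat stream of page numbers rather than as traversals of a real BTree, and the function names and workload 
    parameters are hypothetical.

```python
import random
from collections import OrderedDict


def hit_rate(references, cache_size):
    """Fraction of page references satisfied by an LRU cache of cache_size pages."""
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")
    cache = OrderedDict()  # keys are page numbers, oldest first
    hits = 0
    total = 0
    for page in references:
        total += 1
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return hits / total if total else 0.0


def random_references(count, pages, rng):
    """Every reference goes to a page chosen uniformly at random."""
    return [rng.randrange(pages) for _ in range(count)]


def clustered_references(count, pages, run_length, rng):
    """Hypothetical workload: short runs of references to the same page."""
    refs = []
    while len(refs) < count:
        refs.extend([rng.randrange(pages)] * run_length)
    return refs[:count]


if __name__ == "__main__":
    rng = random.Random(1984)
    uniform = hit_rate(random_references(5000, 200, rng), cache_size=20)
    clustered = hit_rate(clustered_references(5000, 200, 5, rng), cache_size=20)
    print(f"uniform: {uniform:.2f}  clustered: {clustered:.2f}")
```

    With the same cache size, the clustered stream scores a much higher hit rate than the uniform one, which is the benefit the 
    storage implementation must be able to deliver.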
    The size of the cache relative to the size of the BTree can have a substantial effect on performance, as the results below show. 
    Additionally, it should be noted that for a large BTree it is desirable to have a large page size; given a reasonably large 
    amount of space for a cache, it is better to have a modest number of relatively large pages than a larger number of small 
    pages. This is because the main cost of performing an operation on a large BTree stored on a disk is disk access time (as 
    opposed to transfer time); and larger pages mean more entries per page, a shallower tree, and fewer accesses per operation.
    When updating a BTree, a tradeoff must be made between the desire for good performance and the desire to maintain a consistent 
    permanent state of the tree. If it is desired that the permanent state be consistent between every update then more writes are 
    required, thus reducing performance. The client program controls this tradeoff by use of the maintainRecomputableState argument 
    of BTree.Open and the BTree.SetUpdateInProgress procedure. Of course, if consistency is being maintained at a lower level, 
    e.g., by keeping the BTree in an Alpine file under a transaction, then no provisions need be made at the BTree level for 
    maintaining consistency.
    The measurements below were obtained using the BTreeTest tool, which is available as a separate package. This program uses BTreeVM 
    to manage a BTree stored in a temporary file (which is deleted when the tool is destroyed). It obtains parameters from the 
    user, randomly performs various operations on the tree, and displays a number of performance statistics.
    Running BTreeTest from the UserExec opens a viewer containing a command menu, several user-adjustable parameters, and a table of 
    results. The parameters are:
        DiskPages/BTreePagethe number of file pages (256 words) per logical BTree page. The legal range is [1..16].
        CacheSizethe number of logical BTree pages kept in BTreeVM's cache [8..255].
        MaxTreeEntriesthe maximum number of entries the test program will ever permit the tree to contain [100..65535]. (Note that the 
        upper limit is a restriction of the test program, not one imposed by the BTree package itself.)
        LongUpdatethis boolean switch controls whether or not the BTree package is to keep the permanent state consistent between updates. 
        LongUpdate: yes'' means that writes are performed only when necessary to obtain a free cache entry.
        ValidateEveryUpdateturning on this switch causes the entire tree to be checked after each update; this is useful only during basic 
        debugging of the BTree package, and is very costly.
    A test is started by clicking Start, and is ended by clicking Stop. The results are updated every few seconds while the test is 
    running. Clicking InitTree resets the tree to empty before the beginning of a test. (It is necessary to do this after changing 
    DiskPages/BTreePage or MaxTreeEntries.)  ResetStats resets the statistics to zero but does not change the tree. The results are:
        Tree size, Levels, Entriesthese describe the current state of the tree. The tree size is in units of logical BTree pages.
        Operations, ms/opthese are the number of each type of BTree operation (lookup, enumerate, insert, delete, replace) that have been 
        performed and the average number of milliseconds taken by each. In the case of enumerate, what is counted is the total 
        number of entries enumerated and the average time per entry. An insert is an update that inserts an entry not previously 
        present, while a replace is an update that replaces an existing entry with a new one having the same key. Inserted and 
        replacement entries have sizes selected at random from [2..33].
        CacheRefs, Hit%the total number of calls to BTreeVM.ReferencePage, and the percentage of those calls that accessed a page already 
        present in the cache.
        Storage Reads, Writesthe number of times BTreeVM transferred a BTree page between the cache and permanent storage.
        Writes/updatethe average number of storage writes per update operation (insert, delete, or replace).
        Total elapsed timethe total number of seconds the test has run, excluding test overhead such as updating the results in the 
        viewer, but including any concurrent activity elsewhere in Cedar.
        %R+W time, ms/(R or W)the percentage of the total elapsed time spent waiting for storage reads and writes, and the average number 
        of milliseconds taken by each one.
    Here are some results, obtained on a Dorado. (These numbers are for Cedar 4, using Pilot file facilities. No complete set of 
    corresponding measurements has yet been made for Cedar 5, though some preliminary tests demonstrated an improvement of about 
    30%.)
        Test number            1       2       3       4       5       6       7
        DiskPgs/BTreePg        1       1       1       1       1       1       4
        CacheSize            200     200      20      20     200     200      50
        MaxTreeEntries      2500    2500    2500    2500   60000   60000   60000
        LongUpdate           Yes      No     Yes      No     Yes      No      No
        TreeSize             190     190     190     190    4361    4360    1037
        Levels                 3       3       3       3       4       4       3
        Entries             1961    1949    1944    1940   46913   47027   47048
        ms/op Lookup        1.90    1.91    38.8    26.2    72.8    55.0    40.1
        ms/op Enumerate     0.08    0.08    1.82    1.68    2.75    2.99    0.84
        ms/op Insert        4.72    48.8    48.0    75.7    86.4     119    71.9
        ms/op Delete        6.63    60.8    69.5     101     120     154    85.9
        ms/op Replace       3.96    30.1    44.6    52.8    79.8    85.5    62.4
        Hit%                 100     100      82      81      81      81      85
        Writes/update          0    1.74    1.28    1.78    1.35    1.80    1.16
        %R+W time              0      87      91      93      94      94      91
        ms/(R or W)            -    21.7    17.0    18.7    27.4    28.3    28.8
    A few things are worth noting about these results. Test 1 uses a cache large enough to hold the entire tree and never writes any 
    pages to permanent storage (except at the end of the test, where it isn't measured). This gives a good approximation of the CPU 
    cost of operations performed on a 3-level BTree. As is evident from comparing test 1's results with the others, normally most 
    of the time required for a BTree operation is spent waiting for the disk.
    Test 2 is the same as test 1 except that the permanent state is written to storage at the end of every update. Tests 3 and 4 are 
    similar to 1 and 2 except that the cache is much smaller than the tree, so many more reads and writes occur. Interestingly, the 
    choice of whether or not to write storage after each update has only a modest effect on %R+W time in tests 3 and 4. This is 
    presumably because most dirty pages get written anyway as a result of being displaced during later references to other parts of 
    the tree. (The decrease in lookup time from test 3 to test 4 is a good indication that this is what is happening.)
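    Tests 5 through 7 are for a much larger BTree. Tests 6 and 7 show the effect of varying the page size, given a fixed-size cache 
    (200 one-page BTree pages in test 6 versus 50 four-page BTree pages in test 7, or 200 disk pages in both cases). The larger page 
    size yields better times for every type of operation, a higher hit rate, and fewer writes per update.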
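
    The tree sizes and entry counts reported for the tests can be used to estimate how many entries fit on a page and how full the 
    pages are. The sketch below is a rough cross-check, not a measurement: it assumes that the test's entry sizes (uniform in 
    [2..33], so 17.5 words on average) exclude the one-word per-entry overhead, that each page carries two words of overhead, and 
    that branch pointers in interior pages can be ignored. The function names are hypothetical.

```python
MEAN_ENTRY_WORDS = (2 + 33) / 2  # test entries have sizes uniform in [2..33]
ENTRY_OVERHEAD_WORDS = 1         # per-entry overhead stated in the memo
PAGE_OVERHEAD_WORDS = 2          # per-page overhead stated in the memo
DISK_PAGE_WORDS = 256

# (test number, disk pages per BTree page, tree size in BTree pages, entries)
RESULTS = [
    (1, 1, 190, 1961),
    (5, 1, 4361, 46913),
    (6, 1, 4360, 47027),
    (7, 4, 1037, 47048),
]


def entries_per_page(tree_pages, entries):
    """Average number of entries stored on each logical BTree page."""
    return entries / tree_pages


def estimated_fill(disk_pages_per_page, tree_pages, entries):
    """Estimated fraction of usable page space occupied by entries."""
    usable_per_page = disk_pages_per_page * DISK_PAGE_WORDS - PAGE_OVERHEAD_WORDS
    occupied = entries * (MEAN_ENTRY_WORDS + ENTRY_OVERHEAD_WORDS)
    return occupied / (tree_pages * usable_per_page)


if __name__ == "__main__":
    for test, disk_pages, tree_pages, entries in RESULTS:
        print(f"test {test}: "
              f"{entries_per_page(tree_pages, entries):5.1f} entries/page, "
              f"fill ~{estimated_fill(disk_pages, tree_pages, entries):.2f}")
```

    Under these assumptions tests 6 and 7 hold about 11 and 45 entries per page respectively, which is consistent with the shallower 
    tree in test 7, and the estimated fill for the four tests falls between roughly 75 and 82 percent.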
    The variations in time per read or write across all the tests are partially explainable as being related to the size of the tree (a 
    larger tree means increased time spent seeking between different pages of it). Variations among tests with equal-size trees are 
    probably Pilot artifacts.
    Finally, it should be noted that the test program probes the BTree at random. Real applications may be expected to exhibit more 
    locality in tree references, and consequently a higher hit rate and less time spent reading and writing storage.