BTree Documentation To Cedar users Date January 17, 1984 From Ed Taft Location Palo Alto Subject BTree, BTreeSimple, and BTreeTest packages Organization PARC/CSL XEROX <> Release as [Indigo]Documentation>BTreeDoc.tioga Draft [Indigo]Documentation>BTreeDoc.tioga Last edited by Taft, January 17, 1984 11:39 am Abstract This Cedar package maintains an ordered collection of objects as a BTree. The objects may be of different sizes, and there may be a large number of them (tens or hundreds of thousands). The amount of virtual memory required does not depend on the size of the BTree, and the cost of finding, inserting, and deleting objects increases only very slowly as the BTree gets larger. The package makes very few assumptions about the representation of the objects being stored or about the properties of the storage itself. (There also exists a separate package, BTreeSimple, that makes conventional assumptions and provides simplified access.) This memo gives an overview of the operation of the BTree package and presents the results of some performance measurements. Introduction A BTree is a data structure for maintaining an ordered collection of objects (or entries) stored as a tree of fixed-size pages. The entries are each smaller than a page and are stored many per page; thus, each interior page of the tree has many branches, so even a very large tree is quite shallow (a depth of 3 or 4 is typical). This means that finding an arbitrary entry is relatively inexpensive in terms of the number of pages that must be accessed. The cost of an update is also reasonable and has an upper bound which is linear in the depth of the tree. Finally, the tree has very good locality, so references to consecutive entries usually access the same page. For further information on the properties of BTrees in general, please read Knuth, The Art of Computer Programming, vol. 3, section 6.2.4. This BTree package is a Cedar transliteration of one written in BCPL by Ed McCreight and used for maintaining the file directory in IFS. The important properties that distinguish it from existing Mesa BTree packages are: Generalitythe BTree package makes no assumptions about the representation of entries; it simply stores uninterpreted blocks of words. It need know nothing about what portion of an entry contains the key'' or the value''. All knowledge about the representation of entries is vested in client-provided procedures. The package also makes no assumptions about the means used to store the BTree pages, but instead calls client-provided procedures to access them. Capacitythere is no practical limit to the number of entries that can be stored. In particular, there is no requirement that all of the pages of the BTree occupy virtual memory at the same time. The ultimate limit to the size of the tree is 65,535 pages, which should be sufficient to store at least a million entries (even more if pages'' larger than 256 words are used). Storage efficiencyoverhead consists of only one word per entry and two words per page; and the package maintains pages between 60 and 80 percent full on the average. The package also attempts to keep larger entries in the leaf pages and smaller ones in the interior pages so as to minimize the tree's depth. Proven reliabilitysince the last'' bug was fixed in the BCPL version of the package, there have been many tens of thousands of IFS-hours of operation with no suspicion of a BTree-related problem. No changes were made to the BTree maintenance algorithms during the transliteration to Cedar. It is intended that this package be suitable for use in maintaining any ordered data structure that must be stored externally, i.e., is either permanent or is (potentially) too large to keep in virtual memory. The most obvious application is a file directory. For smaller applications involving temporary data only, an internal data structure such as a RedBlackTree is more efficient to access and maintain. Instructions for use The primary documentation for the BTree package is the interface BTree.mesa, obtained via BTree.df. The following is an overview of how the package is intended to be used; consult the interface itself for details. The client must provide two sets of procedures, called the representation primitives and the storage primitives. The BTree package is object-oriented, i.e., a single instantiation of the package suffices to deal with an arbitrary number of Tree objects; and each Tree object instance can have an independent set of representation and storage primitives associated with it. The representation primitives enable the package to find out everything it needs to know about BTree entries and keys. The most important are EntrySize, which gives the size of an entry in words, and Compare, which gives the result of comparing a key with an entry (less, equal, or greater). The package does not do anything with keys besides Compare them to entries. Additionally, there are procedures for converting between REF and LONG POINTER (i.e., safe and unsafe) references to entries. The storage primitives are the means by which the package accesses the pages in which the BTree is stored. The package gains access to the contents of a page by calling the ReferencePage procedure, which returns a LONG POINTER to a block of virtual memory containing a copy of that page. When the package is finished with the page, it calls ReleasePage; subsequently, the page storage implementation may rewrite the page to permanent storage (if dirty) and reclaim the virtual memory. The package comes in two parts. The main part, BTreeImpl.bcd, exports the BTree interface. Additionally, there is an example implementation of the page storage primitives called BTreeVMImpl, which exports the BTreeVM interface. This implementation is intended for building a temporary or permanent BTree on raw Cedar Nucleus files. It maintains a VM cache whose size is specified by the client; and it reads and writes the file explicitly, using File.Read and File.Write. BTree and BTreeVM are both exported from the BasicCedar boot file. The idea is that the client program first calls BTreeVM.Open, passing a file capability and obtaining a storage handle and a set of storage primitives. These, along with the client's representation primitives, should then be passed to BTree.Open, which returns a Tree handle upon which BTree operations may be performed. There is no explicit Close operation. The Tree and storage handles are collectible objects, which vanish when the last REFs to them disappear. There also exists a separate package, BTreeSimple, intended for use by clients with unsophisticated needs. It makes some conventional assumptions about how the entries are interpreted (text keys, sequence values) and how the BTree is stored (an FS file). In return for this loss of flexibility, the BTreeSimple package takes care of providing the representation and storage primitives and does certain other bookkeeping that the basic BTree package requires the client to do. The implementation is BTreeSimpleImpl, which is not part of the boot file. Performance The performance of the BTree package depends on many variables: tree size, page size, access patterns, caching behavior of the storage primitives, and others. A few general statements about performance are offered here, followed by some measured results. It is crucial that the storage implementation include a substantial amount of caching, managed in an LRU fashion. In many situations, the BTree package references a given page multiple times in the course of a single update. Even ignoring this, in most applications it is common for clients to make references to the same or closely-related entries over a short interval of time; the resulting locality of reference has a performance benefit only when redundant storage reads are avoided by use of a cache. The size of the cache relative to the size of the BTree can have a substantial effect on performance, as the results below show. Additionally, it should be noted that for a large BTree it is desirable to have a large page size; given a reasonably large amount of space for a cache, it is better to have a modest number of relatively large pages than a larger number of small pages. This is because the main cost of performing an operation on a large BTree stored on a disk is disk access time (as opposed to transfer time); and larger pages mean more entries per page, a shallower tree, and fewer accesses per operation. When updating a BTree, a tradeoff must be made between the desire for good performance and the desire to maintain a consistent permanent state of the tree. If it is desired that the permanent state be consistent between every update then more writes are required, thus reducing performance. The client program controls this tradeoff by use of the maintainRecomputableState argument of BTree.Open and the BTree.SetUpdateInProgress procedure. Of course, if consistency is being maintained at a lower level, e.g., by keeping the BTree in an Alpine file under a transaction, then no provisions need be made at the BTree level for maintaining consistency. The measurements below were obtained using the BTreeTest tool, which is available as a separate package. This program uses BTreeVM to manage a BTree stored in a temporary file (which is deleted when the tool is destroyed). It obtains parameters from the user, randomly performs various operations on the tree, and displays a number of performance statistics. Running BTreeTest from the UserExec opens a viewer containing a command menu, several user-adjustable parameters, and a table of results. The parameters are: DiskPages/BTreePagethe number of file pages (256 words) per logical BTree page. The legal range is [1..16]. CacheSizethe number of logical BTree pages kept in BTreeVM's cache [8..255]. MaxTreeEntriesthe maximum number of entries the test program will ever permit the tree to contain [100..65535]. (Note that the upper limit is a restriction of the test program, not one imposed by the BTree package itself.) LongUpdatethis boolean switch controls whether or not the BTree package is to keep the permanent state consistent between updates. LongUpdate: yes'' means that writes are performed only when necessary to obtain a free cache entry. ValidateEveryUpdateturning on this switch causes the entire tree to be checked after each update; this is useful only during basic debugging of the BTree package, and is very costly. A test is started by clicking Start, and is ended by clicking Stop. The results are updated every few seconds while the test is running. Clicking InitTree resets the tree to empty before the beginning of a test. (It is necessary to do this after changing DiskPages/BTreePage or MaxTreeEntries.) ResetStats resets the statistics to zero but does not change the tree. The results are: Tree size, Levels, Entriesthese describe the current state of the tree. The tree size is in units of logical BTree pages. Operations, ms/opthese are the number of each type of BTree operation (lookup, enumerate, insert, delete, replace) that have been performed and the average number of milliseconds taken by each. In the case of enumerate, what is counted is the total number of entries enumerated and the average time per entry. An insert is an update that inserts an entry not previously present, while a replace is an update that replaces an existing entry with a new one having the same key. Inserted and replacement entries have sizes selected at random from [2..33]. CacheRefs, Hit%the total number of calls to BTreeVM.ReferencePage, and the percentage of those calls that accessed a page already present in the cache. Storage Reads, Writesthe number of times BTreeVM transferred a BTree page between the cache and permanent storage. Writes/updatethe average number of storage writes per update operation (insert, delete, or replace). Total elapsed timethe total number of seconds the test has run, excluding test overhead such as updating the results in the viewer, but including any concurrent activity elsewhere in Cedar. %R+W time, ms/(R or W)the percentage of the total elapsed time spent waiting for storage reads and writes, and the average number of milliseconds taken by each one. Here are some results, obtained on a Dorado. (These numbers are for Cedar 4, using Pilot file facilities. No complete set of corresponding measurements have yet been made for Cedar 5, though some preliminary tests demonstrated an improvement of about 30%.) Test number 1 2 3 4 5 6 7 DiskPgs/BTreePg 1 1 1 1 1 1 4 CacheSize 200 200 20 20 200 200 50 MaxTreeEntries 2500 2500 2500 2500 60000 60000 60000 LongUpdate Yes No Yes No Yes No No TreeSize 190 190 190 190 4361 4360 1037 Levels 3 3 3 3 4 4 3 Entries 1961 1949 1944 1940 46913 47027 47048 ms/op Lookup 1.90 1.91 38.8 26.2 72.8 55.0 40.1 ms/op Enumerate 0.08 0.08 1.82 1.68 2.75 2.99 0.84 ms/op Insert 4.72 48.8 48.0 75.7 86.4 119 71.9 ms/op Delete 6.63 60.8 69.5 101 120 154 85.9 ms/op Replace 3.96 30.1 44.6 52.8 79.8 85.5 62.4 Hit% 100 100 82 81 81 81 85 Writes/update 0 1.74 1.28 1.78 1.35 1.80 1.16 %R+W time 0 87 91 93 94 94 91 ms/(R or W)  21.7 17.0 18.7 27.4 28.3 28.8 A few things are worth noting about these results. Test 1 uses a cache large enough to hold the entire tree and never writes any pages to permanent storage (except at the end of the test, where it isn't measured). This gives a good approximation of the CPU cost of operations performed on a 3-level BTree. As is evident from comparing test 1's results with the others, normally most of the time required for a BTree operation is spent waiting for the disk. Test 2 is the same as test 1 except that the permanent state is written to storage at the end of every update. Tests 3 and 4 are similar to 1 and 2 except that the cache is much smaller than the tree, so many more reads and writes occur. Interestingly, the choice of whether or not to write storage after each update has only a modest effect on %R+W time in tests 3 and 4. This is presumably because most dirty pages get written anyway as a result of being displaced during later references to other parts of the tree. (The decrease in lookup time from test 3 to test 4 is a good indication that this is what is happening.) Tests 5 through 7 are for a much larger BTree. Tests 6 and 7 show the effect of varying the page size, given a fixed-size cache. It is evident that the larger page size yields better performance in all respects. The variations in time per read or write across all the tests are partially explainable as being related to the size of the tree (a larger tree means increased time spent seeking between different pages of it). Variations among tests with equal-size trees are probably Pilot artifacts. Finally, it should be noted that the test program probes the BTree at random. Real applications may be expected to exhibit more locality in tree references, and consequently a higher hit rate and less time spent reading and writing storage.