XEROX HASH-FILE 2 4 1 HASH-FILE 1 4 By: Doug Cutting (cutting.pa@xerox.com) This document last edited on August 4, 1987. INTRODUCTION This is a new implementation of a facility similar to but not compatible with the HASH library module. It instead tries to be very similar to the Common LISP hash table facility. As hash files are not stored in memory, hashing for EQ or EQL keys is not terribly useful. Memory references written to file in one session will probably not be valid in another. For this reason, the default hashing is for EQUAL keys, and then only those which can be dependably printed and read. All of the code for HASH-FILE is in a package called HASH-FILE. Througout this document LISP symbols will be printed as though in a package which uses the packages HASH-FILE and LISP. FUNCTIONAL INTERFACE (make-hash-file file-name size . keys-args) [Function] Creates and returns an empty hash file in file-name of size size. The keyword arguments are explained as this document progresses. (get-hash-file key hash-file &optional default) [Function] Retrieves the value stored under key in hash-file. Returns default if there is nothing stored under key. The default for default is nil. Also returns a second value which is true if something was found under key and false otherwise. (get-hash-file key hash-file) [setf place] Values can be stores in a hash file with: (setf (get-hash-file key hash-file) new-value) Accordingly incf, decf, push, pop and any other macro that accepts generalized variables will work with get-hash-file. (close-hash-file hash-file) [Function] Closes the file for hash-file, ensuring that all data has been saved. Note that the backing file is always kept coherent % the only reason to close the hash-file is to ensure that the backing file is written properly to disk. All the functions mentioned in this document which operate on hash files will open the file when necesary, thus it is safe to call close-hash-file at almost any time. (open-hash-file file-name &key :direction . other-keys-args) [Function] Opens a hash file created by make-hash-file and returns it. The :direction argument must be one of :input or :io. If opened for :input then storing values in the hash file will be disallowed. The default for :direction is :input. Other key arguments are the same as for make-hash-file and will be discussed later in this document. (map-hash-file function hash-file) [Function] For each entry in hash-file, function is called with the key and value stored. Caution: It is in general unsafe to change a hash file while mapping over it. This includes calling close-hash-file. (rem-hash-file key hash-file) [Function] Removes any entry for key in hash-file. Returns t if there was such an entry, nil otherwise. (copy-hash-file hash-file file-name &optional new-size) [Function] Make a hash file in file-name with the same contents as hash-file. Much slower than il:copyfile, but performs garbage collection, often resulting in a smaller file. Implemented with make-hash-file, map-hash-file, get-hash-file and setf of get-hash-file. Called when rehashing (below). (hash-file-count hash-file) [Function] Returns the number of entries in hash-file. (hash-file-p object) [Function] Returns t if object is a hash file, nil otherwise. (hash-file-p object) r(typep object 'hash-file) FILE FORMAT We use a linked bucket implementation as illustrated in the following diagram. ((SKETCH NIL VERSION 3 PRIRANGE (109 . 0) SKETCHCONTEXT ((ROUND 1 BLACK) (MODERN 10 (MEDIUM REGULAR REGULAR)) (LEFT BASELINE) (CURVE 18.0 8) NIL NIL (CENTER CENTER) (NIL NIL NIL) T NIL POINTS 1 NIL)) ((0.67550004 357.0 (PRI 109)) (GROUP (-128.0 85.99994 714.0 675.50006) (((0.025 72.0 (PRI 5)) (TEXTBOX (72.0 655.0 144.0 25.0) ("size") 1 (CENTER CENTER) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((128.5 658.0 31 19)) BLACK (ROUND 2 BLACK) NIL (NIL NIL NIL))) ((0.07 19.6 (PRI 8)) (TEXT (56.0 . 640.0) ("1") 1.4 (RIGHT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((46.2 634.2 9.8 19.6)) BLACK)) ((0.025 72.0 (PRI 11)) (TEXTBOX (72.0 632.0 144.0 25.0) ("count") 1 (CENTER CENTER) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((120.0 635.0 48 19)) BLACK (ROUND 2 BLACK) NIL (NIL NIL NIL))) ((0.025 72.0 (PRI 13)) (TEXTBOX (72.0 568.0 144.0 25.0) ("pointer-1") 1 (CENTER CENTER) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((105.0 571.0 78 19)) BLACK (ROUND 2 BLACK) NIL (NIL NIL NIL))) ((0.05 19.0 (PRI 15)) (TEXT (60.0 . 576.0) ("hash-fn(key-1, size)+2" "hash-fn(key-2, size)+2") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((-128.0 580.0 188 19) (-128.0 561.0 188 19)) BLACK)) ((0.025 72.0 (PRI 18)) (TEXTBOX (72.0 216.0 144.0 25.0) ("pointer-3") 1 (CENTER CENTER) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((105.0 219.0 78 19)) BLACK (ROUND 2 BLACK) NIL (NIL NIL NIL))) ((0.05 19.0 (PRI 20)) (TEXT (60.0 . 392.0) ("size+2") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((1.0 386.5 59 19)) BLACK)) ((0.0 48.0 (PRI 24)) (WIRE ((264.0 . 383.99994) (272.0 . 383.99994) (272.0 . 479.99994)) (ROUND 2 BLACK) NIL NIL 1 (262.0 381.99994 12.0 100.0) NIL)) ((0.0 48.0 (PRI 24)) (WIRE ((264 . 624) (272 . 624) (272 . 528)) (ROUND 2 BLACK) NIL NIL 1 (262 526 12 100) NIL)) ((0.05 19.0 (PRI 25)) (TEXT (272.0 . 504.0) ("Table") 1 (CENTER BASELINE) (MODERN 18 (BOLD REGULAR REGULAR)) ((249.0 498.5 46 19)) BLACK)) ((0.05 19.0 (PRI 30)) (TEXT (56.0 . 664.0) ("0") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((46.0 658.5 10 19)) BLACK)) ((0.05 15.0 (PRI 31)) (TEXT (56.0 . 688.0) ("file position in pointers") 1 (RIGHT BASELINE) (MODERN 14 (BOLD REGULAR REGULAR)) ((-100 683.5 156 15)) BLACK)) ((0.025 72.0 (PRI 49)) (TEXTBOX (72.0 440.0 144.0 25.0) ("pointer-2") 1 (CENTER CENTER) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((105.0 443.0 78 19)) BLACK (ROUND 2 BLACK) NIL (NIL NIL NIL))) ((0.05 19.0 (PRI 53)) (TEXT (60.0 . 228.0) ("pointer-1") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((-18.0 222.5 78 19)) BLACK)) ((0.05 15.0 (PRI 57)) (TEXT (56.0 . 352.0) ("file position in bytes") 1 (RIGHT BASELINE) (MODERN 14 (BOLD REGULAR REGULAR)) ((-81 347.5 137 15)) BLACK)) ((0.05 19.0 (PRI 58)) (TEXT (60.0 . 328.0) ("pointer-2") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((-18.0 322.5 78 19)) BLACK)) ((0.05 19.0 (PRI 59)) (TEXT (60.0 . 448.0) ("hash-fn(key-3, size)+2") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((-128.0 442.5 188 19)) BLACK)) ((0.07 19.6 (PRI 61)) (TEXT (72.0 . 304.0) ("key-1") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 298.19998 49.0 19.6)) BLACK)) ((0.07 19.6 (PRI 65)) (TEXT (72.0 . 284.0) ("value-1") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 278.6 68.6 19.6)) BLACK)) ((0.07 19.6 (PRI 66)) (TEXT (72.0 . 200.0) ("key-3") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 194.59999 49.0 19.6)) BLACK)) ((0.07 19.6 (PRI 67)) (TEXT (72.0 . 180.0) ("value-3") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 175.0 68.6 19.6)) BLACK)) ((0.07 19.6 (PRI 71)) (TEXT (72.0 . 112.0) ("key-2") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 106.4 49.0 19.6)) BLACK)) ((0.07 19.6 (PRI 72)) (TEXT (72.0 . 92.0) ("value-2") 1.4 (LEFT BASELINE) (TERMINAL 12 (MEDIUM REGULAR REGULAR)) ((71.4 86.799995 68.6 19.6)) BLACK)) ((0.05 19.0 (PRI 73)) (TEXT (60.0 . 140.0) ("pointer-3") 1 (RIGHT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((-18.0 134.5 78 19)) BLACK)) ((0.05 19.0 (PRI 74)) (TEXT (320.0 . 664.0) ("The size of the table") 1 (LEFT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((320.0 658.5 167 19)) BLACK)) ((0.05 19.0 (PRI 75)) (TEXT (320.0 . 640.0) ("The number of entries in the file") 1 (LEFT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((320.0 634.5 266 19)) BLACK)) ((0.05 19.0 (PRI 76)) (TEXT (320.0 . 576.0) ("A collision leading to a" "bucket of lenght two") 1 (LEFT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((320.0 580.0 188 19) (320.0 561.0 175 19)) BLACK)) ((0.05 19.0 (PRI 78)) (TEXT (320.0 . 512.0) ("An empty bucket") 1 (LEFT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((320.0 506.5 141 19)) BLACK)) ((0.0 36.0 (PRI 81)) (OPENCURVE ((216.0 . 220.0) (224.0 . 207.31706) (224.0 . 161.65854) (216.0 . 148.0)) (ROUND 2 BLACK) NIL (NIL (SOLID 18.0 8.669468)) 1 (213.0 132.0 15.2 104.799995) (NIL ((216.0 . 148.0) (223.10548 . 152.96707) (224.66803 . 147.84195))))) ((0.0 170.0 (PRI 81)) (OPENCURVE ((216 . 576) (232 . 512) (232 . 292) (216 . 236)) (ROUND 2 BLACK) NIL (NIL (SOLID 18.0 10)) 1 (211 166 26.4 480.0) (NIL ((216 . 236) (225.16692 . 239.99597) (225.76495 . 233.84464))))) ((0.0 54.0 (PRI 81)) (OPENCURVE ((216.0 . 447.99997) (224.0 . 428.09756) (224.0 . 355.2195) (216.0 . 339.99997)) (ROUND 2 BLACK) NIL (NIL (SOLID 18.0 11.076687)) 1 (213.0 316.99997 15.2 155.2) (NIL ((216.0 . 339.99997) (227.06137 . 339.41757) (224.6065 . 333.0271))))) ((0.05 19.0 (PRI 82)) (TEXT (320.0 . 448.0) ("A bucket of length one") 1 (LEFT BASELINE) (MODERN 18 (MEDIUM REGULAR REGULAR)) ((320.0 442.5 191 19)) BLACK)) ((0.0 64.0 (PRI 89)) (WIRE ((264.0 . 87.99994) (272.0 . 87.99994) (272.0 . 215.99994)) (ROUND 2 BLACK) NIL NIL 1 (262.0 85.99994 12.0 132.0) NIL)) ((0.0 56.0 (PRI 90)) (WIRE ((264.0 . 376.0) (272.0 . 376.0) (272.0 . 264.0)) (ROUND 2 BLACK) NIL NIL 1 (262.0 262.0 12.0 116.0) NIL)) ((0.05 19.0 (PRI 91)) (TEXT (272.0 . 240.0) ("Data") 1 (CENTER BASELINE) (MODERN 18 (BOLD REGULAR REGULAR)) ((252.0 234.5 40 19)) BLACK)) ((0.0 4.0 (PRI 93)) (WIRE ((264 . 680) (272 . 680) (272 . 672)) (ROUND 2 BLACK) NIL NIL 1 (262 670 12 12) NIL)) ((0.0 4.0 (PRI 94)) (WIRE ((264 . 632) (272 . 632) (272 . 640)) (ROUND 2 BLACK) NIL NIL 1 (262 630 12 12) NIL)) ((0.05 19.0 (PRI 95)) (TEXT (272.0 . 652.0) ("Header") 1 (CENTER BASELINE) (MODERN 18 (BOLD REGULAR REGULAR)) ((241.5 646.5 61 19)) BLACK)) ((0.05 25.0 (PRI 97)) (TEXT (204.0 . 744.0) ("Hash File Format") 1 (CENTER BASELINE) (MODERN 24 (BOLD REGULAR REGULAR)) ((112.0 736.5 184 25)) BLACK)) ((0.028825851 74.249344 (PRI 106)) (GROUP (70 501.17413 148.49869 28.825851) (((0.0259314 72.51187 (PRI 103)) (BOX (72.474945 502.17413 145.02374 25.931398) (ROUND 2 BLACK) NIL 1 (NIL NIL NIL))) ((0.0 72.0 (PRI 105)) (WIRE ((72 . 504) (216 . 528)) (ROUND 2 BLACK) NIL NIL 1 NIL NIL))) (144 . 520))) ((0.028825851 74.249344 (PRI 107)) (GROUP (70 133.17415 148.49869 28.825851) (((0.0259314 72.51187 (PRI 103)) (BOX (72.474945 134.17415 145.02374 25.931398) (ROUND 2 BLACK) NIL 1 (NIL NIL NIL))) ((0.0 72.0 (PRI 105)) (WIRE ((72 . 136) (216 . 160)) (ROUND 2 BLACK) NIL NIL 1 NIL NIL))) (144 . 152))) ((0.028825851 74.249344 (PRI 108)) (GROUP (70 317.17413 148.49869 28.825851) (((0.0259314 72.51187 (PRI 103)) (BOX (72.474945 318.17413 145.02374 25.931398) (ROUND 2 BLACK) NIL 1 (NIL NIL NIL))) ((0.0 72.0 (PRI 105)) (WIRE ((72 . 320) (216 . 344)) (ROUND 2 BLACK) NIL NIL 1 NIL NIL))) (144 . 336)))) (224 . 424)))) (-127.556076 84.41211 714.6892 677.1727) 1.8758247 8 Pointers are 32 bit integers written as four 8-bit bytes. There are two pointers of header (holding the size and count) followed by size pointers of table. Except for in the header and null pointers all pointers are file-positions in bytes. Every such pointer points to the position on the file of the next pointer in the bucket. Immediately following the next pointer on the file are the printed representation of the key and value for the entry. New entries, including ones for old keys, are always added at the end of the file. REHASHING When the number of keys with values in the file reaches a threshold we rehash to keep bucket lengths from getting too long. This threshold is expressed as a fraction of the table size. rehash-threshold [keyword arg] Should be floating point number between zero and one. When the product of the table size and the rehash threshold of a hash file is greater than its hash-file-count then the hash file is automatically rehashed. The default for this keyword argument is the value of the special variable hash-file::*rehash-threshold* whose global binding is by default 0.875. Rehashing is accomplished by having copy-hash-file make a new hash file with a larger size in a new version of the file. The new hash file structure is then smashed into the old one so that pointers to the old one are still valid. rehash-size [keyword arg] Should be floating point number larger than one. The next prime larger than the product of this and the old table size is used to as the size for the new table. The default for this keyword argument is the value of the special variable hash-file::*rehash-size* whose global binding is by default 2.0. hash-file::*delete-old-version-on-rehash* [special variable] If true, when rehashing generates a new version of the backing file the old version will be automatically deleted. The default top-level value for this variable is nil. Note that rehashing is very expensive. Thus when possible one should attempt to make good estimates for the size argument to make-hash-file. PROGRAMMERS' INTERFACE One can imagine applications which one might want to store things in hash files which could not be printed and read by the functions print and read. We provide the following hooks for this purpose. value-read-fn [keyword arg] Called by get-hash-file with one argument of a stream to read a value. The file postion will be set to the same position as it was when this value was written. Default is hash-file::default-read-fn which binds *package* to the XCL package, *readtable* to the XCL readtable and *print-pretty* to nil before calling read. value-print-fn [keyword arg] Called by the setf method for get-hash-file with the object to be stored and the stream to print it on. The file position of the stream will be at the end of the file and there are no limitations as to how much can be printed. Default is hash-file::default-print-fn which binds *package* to the XCL package, *readtable* to the XCL readtable and *print-base* to 10 before calling print. Example: A hash file with circular values. (defun print-circular-object (object stream) (let ((*print-circle* t)) (hash-file::default-print-fn object stream))) (setq hash-file-with-circular-values (make-hash-file "{core}foo" 10 :value-print-fn #'print-circular-object)) (setq l (list "foo")) (setf (cdr l) l) O #1= ("foo" . #1#) (setf (get-hash-file "bar" hash-file-with-circular-values) l) (get-hash-file "bar" hash-file-with-circular-values) O #1= ("foo" . #1#) (eq * l) O nil key-read-fn [keyword arg] Called by get-hash-file with one argument of a stream to read a key. The file postion will be set to the same position as it was when this key was written. Default is hash-file::default-read-fn, described above. key-print-fn [keyword arg] Called by the setf method for get-hash-file with the object to be stored and the stream to print it on. The file position of the stream will be at the end of the file and there are no limitations as to how much can be printed. Default is hash-file::default-print-fn, described above. Note: The value reader is called immediately after the key reader, thus the key reader must be sure to read all that the key printer printed so that the file position is appropriate for the value reader. However the value reader is free to not read all that the value printer printed. One might now think that one could make a hash file whose keys were circular by simply specifying our circular reader and printer for the key print and read functions, but this would not be sufficient. One would also need the following hooks: key-compare-fn [keyword arg] Called when searching a bucket to determine whether the correct key/value pair has been reached yet. Default is equal. key-hash-fn [keyword arg] Called with a key and a range. Should return an integer between zero and range-1 with the following property: key-hash-fn(x) = key-hash-fn(y) iff key-compare-fn(x,y) The default key-hash-fn is hash-file::hash-object which works on symbols, strings, lists, bit-vectors, pathnames, characters and numbers. (Any object whose printed representation can be dependably read in as an object equal to the original.) Note in particualar that this function will work on circular lists, as it only proceeds a fixed depth down a structure. Thus to hash on circular keys we would also need to provide a key comparer which is able to compare circular keys, as most defintions of equal are not. PERFORMANCE A linked bucket implementation generally gives shorter bucket lengths, but uses more file space. The effects of this upon performance are difficult to judge. The following table shows the distribution of bucket lengths in a WHERE-IS hash file containing 27,157 entries with a table size of 50,021. length number of buckets this length 2 0 29,279 (empty buckets) 1 15,461 2 4334 3 794 4 125 5 23 6 4 7 1 This information was gathered by the function hash-file::histogram. (LIST ((PAGE NIL (PAPERSIZE LETTER STARTINGPAGE# 106) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO NIL) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL))) (PAGE NIL (PAPERSIZE NIL . LETTER) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO NIL) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL))) (PAGE NIL (PAPERSIZE NIL . LETTER) (0 0 612 792) ((FOLIO NIL (PARALOOKS (QUAD CENTERED) CHARLOOKS (SUPERSCRIPT 0 INVISIBLE OFF SELECTPOINT OFF PROTECTED OFF SIZE 10 FAMILY MODERN OVERLINE OFF STRIKEOUT OFF UNDERLINE OFF EXPANSION REGULAR SLOPE REGULAR WEIGHT MEDIUM INVERTED OFF USERINFO NIL STYLE NIL) FORMATINFO NIL) (174 36 288 36) NIL) (HEADING NIL (HEADINGTYPE RUNNINGHEAD) (84 744 528 36) NIL) (TEXT NIL NIL (84 96 456 600) NIL))))) 0 HTT/T(. / T4 . . ((8(8D PAGEHEADING RUNNINGHEADTERMINAL MODERN MODERN MODERN MODERNMODERN LOGO    HRULE.GETFNMODERN   HRULE.GETFNMODERN   HRULE.GETFNMODERN    HRULE.GETFNMODERN   HRULE.GETFNMODERN  )- +!  *  D !  "J *   G    ]A +  K h  *e     &    X   #  !        O SKIO.GETFN.2MODERN  {+$)=m 4          ,~'>4    ~ !qo9  <  HRULE.GETFNMODERN K.E ~z