HASH A Hash-Coded Dictionary Facility for Interlisp-10 & Interlisp-D (Implemented by Christopher Lane at Stanford University, and now maintained by Xerox) 1. Introduction HASH is a new implementation of the Interlisp-10 hashfile package which works in both Interlisp-10 and Interlisp-D, being written in Interlisp and operating on streams of bytes. This package implements the functional interface of both the current Interlisp-10 package implemented by R. Kaplan as well as the original implementation of L. Masinter and W. van Melle which is still used in EMYCIN based systems (which will be referred to as the EMYCIN package). This document mainly deals with differences between the various packages. Users should consult the Interlisp Reference Manual and the EMYCIN documentation for more detail. 2. Basics Hashfiles are created by calling CREATEHASHFILE. A "HashFile", as referenced in this document, is the datum returned by CREATEHASHFILE or OPENHASHFILE, currently an array record containing the hashfile name, and the number of slots in the file, the used slots and other details. All other functions with hashfile arguments use this datum. A NIL hashfile argument refers to SYSHASHFILE in the EMYCIN sense of the current hashfile, not a global user hashfile as in Interlisp-10. Keys are strings or atoms, as in the other systems. Interlisp-10 hashfiles come in several flavors, according to the values stored in them. The EMYCIN system provides even more flexibility. This system only supports the most general EXPR type of hashfiles and EMYCIN style TEXT entries, in the same file. The VALUETYPE and ITEMLENGTH arguments are for the most part ignored. Two key hashing is supported in this system but is discouraged as it is only in EMYCIN, not in the Interlisp-10 system. The functions GETPAGE, DELPAGE and GETPNAME which manipulate `secret pages' do not exist in this implementation. However, it is permissible to write data to the end of a HashFile. The package sysloads the DFOR10.COM package from the LispUsers directory for Interlisp-10 users. 3. Functions The functions implemented are listed below. Arguments in italics are those that only the EMYCIN system used. (CREATEHASHFILE File ValueType ItemLength #Entries Smash CopyFn) Creates a new HashFile called File. All other arguments are optional. ValueType can be EXPR or TEXT, however both kinds of entries can exist on the same file so EXPR is best. ItemLength is not used by the system but is currently saved on the file (if less than 256) for future use. #Entries is an 1 estimate of the number of entries the file will have. This should be a realistic guess as the system triples it anyway. Smash is a HashFile datum to reuse and CopyFn is a function to be applied to entries when the file is Rehashed. (OPENHASHFILE File Access ItemLength #Entries Smash) Opens HashFile File with Access of either INPUT (Synonyms are: READ OLD NIL RETRIEVE) or 'BOTH (Synonyms are: WRITE OUTPUT T INSERT DELETE REPLACE). ItemLength and #Entries are for backward compatibility with EMYCIN where OPENHASHFILE also created new HashFiles; these arguments should be avoided. Smash is a HashFile datum to reuse. (HASHFILEP HashFile Write?) Returns HashFile if it is a valid, open HashFile datum or returns the HashFile datum associated with HashFile if it is the name of an open hashfile. If Write? is non-NIL, HashFile must also be open for write access. (PUTHASHFILE Key Value HashFile Key2) Puts Value under Key in HashFile. Key2 is for EMYCIN two key hashing. Key2 is internally appended to Key and they are treated as a single key. (GETHASHFILE Key HashFile Key2) Gets the value stored under Key in HashFile. Key2 is necessary if it was supplied to PUTHASHFILE. (LOOKUPHASHFILE Key Value HashFile CallType Key2) Implements PUTHASHFILE and GETHASHFILE among other options. CallType can be any combination of RETRIEVE, DELETE, REPLACE or INSERT. GETHASHFILE does a RETRIVE, PUTHASHFILE does a DELETE if VALUE is NIL otherwise a (REPLACE INSERT). Other combinations are possible, for example, (RETRIEVE DELETE) will delete a key and return the old value. (HASHFILEPROP HashFile Property) Returns the value of a Property of a HashFile datum. Currently accepted properties are NAME, ACCESS, VALUETYPE, ITEMLENGTH, SIZE, #ENTRIES, CopyFn and STREAM. (HASHFILENAME HashFile) Same as (HASHFILEPROP HashFile 'NAME). (CLOSEHASHFILE HashFile ReOpen) Closes HashFile. If ReOpen is non-NIL it should be one of the accepted access types. In this case the file is closed and the immediately reopened with access = ReOpen. This is used to make sure the HashFile is valid on the 2 disk. (MAPHASHFILE HashFile MapFn Double) Maps over HashFile applying MapFn. If MapFn takes two arguments, it is applied to Key and Value. If MapFn only takes one argument, it is only applied to Key and saves the cost of reading value from the file. If Double is non-NIL then MapFn is applied to (Key1 Key2 Value) or (Key1 Key2) if the MapFn only takes two arguments. (REHASHFILE HashFile NewName NewValueType) As keys are replaced, space in the data section of the file is not reused (though space in the key section is). Eventually the file may need rehashing to reclaim the wasted data space. REHASHFILE is really a special case of COPYHASHFILE, and creates a new file. If NewName is non-NIL, it is taken as the name of the rehashed file. NewValueType is a no-op. The system automatically rehashes files when 7/8 of the key section is filled. The system will print a message when automatically rehashing a file if the global variable REHASHGAG is non-NIL. (COPYHASHFILE HashFile NewName FN ValueType LeaveOpen) Makes a copy of HashFile under NewName. Each key and value pair is moved individually and if FN is supplied, is applied to (KEY VALUE HASHFILE NEWHASHFILE) and what it returns is used as the value of the key in the new HashFile. ValueType is a no-op. If LeaveOpen is non-NIL then the new HashFile datum is returned open, otherwise the new HashFile is closed and the name is returned. (HASHFILESPLST HashFile XWord) Returns an Interlisp generator for the keys in HashFile, usable with the spelling corrector. If XWord is supplied, only keys starting with the prefix in XWord are generated. The following were only in the EMYCIN hash package: (HASHFILEDATA HashFile) Returns a list of the file name, value type, item length and number of entries in HashFile. (CLEARHASHFILES Close Release) Closes all hashfiles if Close is non-NIL otherwise a no-op. Used on AFTERSYSOUTFORMS to clean up hashfiles. Release is not implemented. (COLLECTKEYS HashFile Double MakeString) Returns a list of keys in HashFile. If Double is non-NIL returns keys as 3 double key pairs. If MakeString is non-NIL, converts keys to strings. Although TEXT HashFiles were implemented under the EMYCIN system, the following two function are unique to this implementation: (PUTHASHTEXT Key SRCFIL HashFile Start End) Puts text from SRCFIL onto HashFile under Key. Start and End are passed directly to COPYBYTES. (GETHASHTEXT Key HashFile DSTFIL) Uses COPYBYTES to retrieve text stored under Key on HashFile. 4. Global Variables The variables used by the system of interest to the user. HASHFILEDEFAULTSIZE {512} Size used when #Entries is omitted or is too small. HASHFILERDTBL {ORIG} The hashfile read table. HASHLOADFACTOR {.875} The ratio, used slots/total slots, at which the system rehashes the file, initially 7/8. HASHTEXTCHAR {^A} The character separating two key hashkeys. HFGROWTHFACTOR {3} The ration of total slots to used slots when a hashfile is created. REHASHGAG {NIL} Flags whether to print message when rehashing; initially off. SYSHASHFILE {NIL} The current hashfile. SYSHASHFILELST {NIL} An alist of open hashfiles. 5. Implementation The hash package views files as a sequence of bytes, randomly accessible. No notice is made of pages and it is assumed that the host computer buffers I/O sufficiently. Hashfiles consist of a short header section (8 bytes), a layer of pointers (4*HASHFILE:Size bytes) followed by ascii data. Pointers are 3 bytes wide, preceeded by a status byte. The pointers point to key PNAMES in the data section, where each key is followed by its value. Deleted key pointers are reused, deleted data space is not, so rehashing is required if many items have been "replaced". The data section starts at 4*HASHFILE:Size + 9, and consists of alternating keys and values. As deleted data is not re-written, not all data in the data section is valid. When a key hashes into a used slot, a probe value is added to it to find the 4 next slot to search. The probe value is a small prime derived from the original hash key. 6. Limitations The system currently is able to manipulate files on {CORE}, {DSK}, {FLOPPY} and over the network, via leaf servers. HashFiles cannot be used with NS servers until they support random access files. Due to the pointer size, only hashfiles of < 6 million intial entries can be created, though these can grow to 14 million entries before automatic rehashing exceeds the pointer limit. The total file length is limited to 16 million bytes. No range checking is done for these limits.