HASH
A Hash-Coded Dictionary Facility for Interlisp-10 & Interlisp-D
(Implemented by Christopher Lane at Stanford University, and now maintained
by Xerox)
1. Introduction
HASH is a new implementation of the Interlisp-10 hashfile package which works
in both Interlisp-10 and Interlisp-D, being written in Interlisp and operating
on streams of bytes. This package implements the functional interface of both
the current Interlisp-10 package implemented by R. Kaplan as well as the
original implementation of L. Masinter and W. van Melle which is still used in
EMYCIN based systems (which will be referred to as the EMYCIN package). This
document mainly deals with differences between the various packages. Users
should consult the Interlisp Reference Manual and the EMYCIN documentation for
more detail.
2. Basics
Hashfiles are created by calling CREATEHASHFILE. A "HashFile", as referenced
in this document, is the datum returned by CREATEHASHFILE or OPENHASHFILE,
currently an array record containing the hashfile name, and the number of slots
in the file, the used slots and other details. All other functions with
hashfile arguments use this datum. A NIL hashfile argument refers to
SYSHASHFILE in the EMYCIN sense of the current hashfile, not a global user
hashfile as in Interlisp-10. Keys are strings or atoms, as in the other
systems.
Interlisp-10 hashfiles come in several flavors, according to the values
stored in them. The EMYCIN system provides even more flexibility. This system
only supports the most general EXPR type of hashfiles and EMYCIN style TEXT
entries, in the same file. The VALUETYPE and ITEMLENGTH arguments are for the
most part ignored.
Two key hashing is supported in this system but is discouraged as it is only
in EMYCIN, not in the Interlisp-10 system. The functions GETPAGE, DELPAGE and
GETPNAME which manipulate `secret pages' do not exist in this implementation.
However, it is permissible to write data to the end of a HashFile.
The package sysloads the DFOR10.COM package from the LispUsers directory for
Interlisp-10 users.
3. Functions
The functions implemented are listed below. Arguments in italics are those
that only the EMYCIN system used.
(CREATEHASHFILE File ValueType ItemLength #Entries Smash CopyFn)
Creates a new HashFile called File. All other arguments are optional.
ValueType can be EXPR or TEXT, however both kinds of entries can exist on the
same file so EXPR is best. ItemLength is not used by the system but is
currently saved on the file (if less than 256) for future use. #Entries is an
1
estimate of the number of entries the file will have. This should be a
realistic guess as the system triples it anyway. Smash is a HashFile datum to
reuse and CopyFn is a function to be applied to entries when the file is
Rehashed.
(OPENHASHFILE File Access ItemLength #Entries Smash)
Opens HashFile File with Access of either INPUT (Synonyms are: READ OLD NIL
RETRIEVE) or 'BOTH (Synonyms are: WRITE OUTPUT T INSERT DELETE REPLACE).
ItemLength and #Entries are for backward compatibility with EMYCIN where
OPENHASHFILE also created new HashFiles; these arguments should be avoided.
Smash is a HashFile datum to reuse.
(HASHFILEP HashFile Write?)
Returns HashFile if it is a valid, open HashFile datum or returns the
HashFile datum associated with HashFile if it is the name of an open hashfile.
If Write? is non-NIL, HashFile must also be open for write access.
(PUTHASHFILE Key Value HashFile Key2)
Puts Value under Key in HashFile. Key2 is for EMYCIN two key hashing. Key2
is internally appended to Key and they are treated as a single key.
(GETHASHFILE Key HashFile Key2)
Gets the value stored under Key in HashFile. Key2 is necessary if it was
supplied to PUTHASHFILE.
(LOOKUPHASHFILE Key Value HashFile CallType Key2)
Implements PUTHASHFILE and GETHASHFILE among other options. CallType can be
any combination of RETRIEVE, DELETE, REPLACE or INSERT. GETHASHFILE does a
RETRIVE, PUTHASHFILE does a DELETE if VALUE is NIL otherwise a (REPLACE
INSERT). Other combinations are possible, for example, (RETRIEVE DELETE) will
delete a key and return the old value.
(HASHFILEPROP HashFile Property)
Returns the value of a Property of a HashFile datum. Currently accepted
properties are NAME, ACCESS, VALUETYPE, ITEMLENGTH, SIZE, #ENTRIES, CopyFn and
STREAM.
(HASHFILENAME HashFile)
Same as (HASHFILEPROP HashFile 'NAME).
(CLOSEHASHFILE HashFile ReOpen)
Closes HashFile. If ReOpen is non-NIL it should be one of the accepted
access types. In this case the file is closed and the immediately reopened
with access = ReOpen. This is used to make sure the HashFile is valid on the
2
disk.
(MAPHASHFILE HashFile MapFn Double)
Maps over HashFile applying MapFn. If MapFn takes two arguments, it is
applied to Key and Value. If MapFn only takes one argument, it is only applied
to Key and saves the cost of reading value from the file. If Double is non-NIL
then MapFn is applied to (Key1 Key2 Value) or (Key1 Key2) if the MapFn only
takes two arguments.
(REHASHFILE HashFile NewName NewValueType)
As keys are replaced, space in the data section of the file is not reused
(though space in the key section is). Eventually the file may need rehashing
to reclaim the wasted data space. REHASHFILE is really a special case of
COPYHASHFILE, and creates a new file. If NewName is non-NIL, it is taken as
the name of the rehashed file. NewValueType is a no-op.
The system automatically rehashes files when 7/8 of the key section is
filled. The system will print a message when automatically rehashing a file if
the global variable REHASHGAG is non-NIL.
(COPYHASHFILE HashFile NewName FN ValueType LeaveOpen)
Makes a copy of HashFile under NewName. Each key and value pair is moved
individually and if FN is supplied, is applied to (KEY VALUE HASHFILE
NEWHASHFILE) and what it returns is used as the value of the key in the new
HashFile. ValueType is a no-op. If LeaveOpen is non-NIL then the new HashFile
datum is returned open, otherwise the new HashFile is closed and the name is
returned.
(HASHFILESPLST HashFile XWord)
Returns an Interlisp generator for the keys in HashFile, usable with the
spelling corrector. If XWord is supplied, only keys starting with the prefix
in XWord are generated.
The following were only in the EMYCIN hash package:
(HASHFILEDATA HashFile)
Returns a list of the file name, value type, item length and number of
entries in HashFile.
(CLEARHASHFILES Close Release)
Closes all hashfiles if Close is non-NIL otherwise a no-op. Used on
AFTERSYSOUTFORMS to clean up hashfiles. Release is not implemented.
(COLLECTKEYS HashFile Double MakeString)
Returns a list of keys in HashFile. If Double is non-NIL returns keys as
3
double key pairs. If MakeString is non-NIL, converts keys to strings.
Although TEXT HashFiles were implemented under the EMYCIN system, the
following two function are unique to this implementation:
(PUTHASHTEXT Key SRCFIL HashFile Start End)
Puts text from SRCFIL onto HashFile under Key. Start and End are passed
directly to COPYBYTES.
(GETHASHTEXT Key HashFile DSTFIL)
Uses COPYBYTES to retrieve text stored under Key on HashFile.
4. Global Variables
The variables used by the system of interest to the user.
HASHFILEDEFAULTSIZE {512} Size used when #Entries is omitted or is too small.
HASHFILERDTBL {ORIG} The hashfile read table.
HASHLOADFACTOR {.875} The ratio, used slots/total slots, at which the system
rehashes the file, initially 7/8.
HASHTEXTCHAR {↑A} The character separating two key hashkeys.
HFGROWTHFACTOR {3} The ration of total slots to used slots when a hashfile is
created.
REHASHGAG {NIL} Flags whether to print message when rehashing; initially off.
SYSHASHFILE {NIL} The current hashfile.
SYSHASHFILELST {NIL} An alist of open hashfiles.
5. Implementation
The hash package views files as a sequence of bytes, randomly accessible. No
notice is made of pages and it is assumed that the host computer buffers I/O
sufficiently.
Hashfiles consist of a short header section (8 bytes), a layer of pointers
(4*HASHFILE:Size bytes) followed by ascii data. Pointers are 3 bytes wide,
preceeded by a status byte. The pointers point to key PNAMES in the data
section, where each key is followed by its value. Deleted key pointers are
reused, deleted data space is not, so rehashing is required if many items have
been "replaced".
The data section starts at 4*HASHFILE:Size + 9, and consists of alternating
keys and values. As deleted data is not re-written, not all data in the data
section is valid.
When a key hashes into a used slot, a probe value is added to it to find the
4
next slot to search. The probe value is a small prime derived from the
original hash key.
6. Limitations
The system currently is able to manipulate files on {CORE}, {DSK}, {FLOPPY}
and over the network, via leaf servers. HashFiles cannot be used with NS
servers until they support random access files.
Due to the pointer size, only hashfiles of < 6 million intial entries can be
created, though these can grow to 14 million entries before automatic rehashing
exceeds the pointer limit. The total file length is limited to 16 million
bytes. No range checking is done for these limits.