File [eris]<Lisp>JCAI>library>HASH.MSS!1

@make(article)
@disable(contents)
@center{@b[HASH]}


A Hash-Coded Dictionary Facility for Interlisp-10 & Interlisp-D

(Implemented by Christopher Lane at Stanford University, and now
maintained by Xerox)

@section(Introduction)

HASH is a new implementation of the Interlisp-10 hashfile package which
works in both Interlisp-10 and Interlisp-D, being written in Interlisp
and operating on streams of bytes.  This package implements the
functional interface of both the current Interlisp-10 package
implemented by R.  Kaplan as well as the original implementation of
L.  Masinter and W.  van Melle which is still used in EMYCIN based
systems (which will be referred to as the EMYCIN package).  This
document mainly deals with differences between the various packages.
Users should consult the Interlisp Reference Manual and the EMYCIN
documentation for more detail.

@section(Basics)

Hashfiles are created by calling CREATEHASHFILE.  A "HashFile", as
referenced in this document, is the datum returned by CREATEHASHFILE
or OPENHASHFILE, currently an array record containing the hashfile
name, and the number of slots in the file, the used slots and other
details.  All other functions with hashfile arguments use this datum.
A NIL hashfile argument refers to SYSHASHFILE in the EMYCIN sense of
the current hashfile, not a global user hashfile as in Interlisp-10.
Keys are strings or atoms, as in the other systems.

Interlisp-10 hashfiles come in several flavors, according to the values
stored in them.  The EMYCIN system provides even more flexibility.
This system @b(only) supports the most general EXPR type of hashfiles
and EMYCIN style TEXT entries, in the same file.  The VALUETYPE and
ITEMLENGTH arguments are for the most part ignored.

Two key hashing @b(is) supported in this system but is discouraged
as it is only in EMYCIN, not in the Interlisp-10 system.  The functions
GETPAGE, DELPAGE and GETPNAME which manipulate `secret pages' do not
exist in this implementation.  However, it is permissible to write data
to the end of a HashFile.

The package sysloads the DFOR10.COM package from the LispUsers directory
for Interlisp-10 users.

@section(Functions)

The functions implemented are listed below.  Arguments in italics are
those that only the EMYCIN system used.


(@b[CREATEHASHFILE] File ValueType ItemLength #Entries @i[Smash CopyFn])

Creates a new HashFile called File.  All other arguments are optional.
ValueType can be EXPR or TEXT, however both kinds of entries can exist
on the same file so EXPR is best.  ItemLength is not used by the system
but is currently saved on the file (if less than 256) for future use.
#Entries is an estimate of the number of entries the file will have.
This should be a realistic guess as the system triples it anyway.  Smash
is a HashFile datum to reuse and CopyFn is a function to be applied
to entries when the file is Rehashed.


(@b[OPENHASHFILE] File Access @i[ItemLength #Entries] Smash)

Opens HashFile File with Access of either INPUT (Synonyms are:  READ
OLD NIL RETRIEVE) or 'BOTH (Synonyms are:  WRITE OUTPUT T INSERT DELETE
REPLACE).  ItemLength and #Entries are for backward compatibility with
EMYCIN where OPENHASHFILE also created new HashFiles; these arguments
should be avoided.  Smash is a HashFile datum to reuse.


(@b[HASHFILEP] HashFile @i[Write?])

Returns HashFile if it is a valid, open HashFile datum or returns the
HashFile datum associated with HashFile if it is the name of an open
hashfile.  If Write?  is non-NIL, HashFile must also be open for write
access.


(@b[PUTHASHFILE] Key Value HashFile @i[Key2])

Puts Value under Key in HashFile.  Key2 is for EMYCIN two key hashing.
Key2 is internally appended to Key and they are treated as a single
key.


(@b[GETHASHFILE] Key HashFile @i[Key2])

Gets the value stored under Key in HashFile.  Key2 is necessary if
it was supplied to PUTHASHFILE.


(@b[LOOKUPHASHFILE] Key Value HashFile CallType @i[Key2])

Implements PUTHASHFILE and GETHASHFILE among other options.  CallType
can be any combination of RETRIEVE, DELETE, REPLACE or INSERT.
GETHASHFILE does a RETRIVE, PUTHASHFILE does a DELETE if VALUE is NIL
otherwise a (REPLACE INSERT).  Other combinations are possible, for
example, (RETRIEVE DELETE) will delete a key and return the old value.


(@b[HASHFILEPROP] HashFile Property)

Returns the value of a Property of a HashFile datum.  Currently accepted
properties are NAME, ACCESS, VALUETYPE, ITEMLENGTH, SIZE, #ENTRIES,
CopyFn and STREAM.


(@b[HASHFILENAME] HashFile)

Same as (HASHFILEPROP HashFile 'NAME).


(@b[CLOSEHASHFILE] HashFile @i[ReOpen])

Closes HashFile.  If ReOpen is non-NIL it should be one of the accepted
access types.  In this case the file is closed and the immediately
reopened with access = ReOpen.  This is used to make sure the HashFile
is valid on the disk.


(@b[MAPHASHFILE] HashFile MapFn @i[Double])

Maps over HashFile applying MapFn.  If MapFn takes two arguments, it
is applied to Key and Value.  If MapFn only takes one argument, it
is only applied to Key and saves the cost of reading value from the
file.  If Double is non-NIL then MapFn is applied to (Key1 Key2 Value)
or (Key1 Key2) if the MapFn only takes two arguments.


(@b[REHASHFILE] HashFile @i[NewName NewValueType])

As keys are replaced, space in the data section of the file is not
reused (though space in the key section is).  Eventually the file may
need rehashing to reclaim the wasted data space.  REHASHFILE is really
a special case of COPYHASHFILE, and creates a new file.  If NewName
is non-NIL, it is taken as the name of the rehashed file.  NewValueType
is a no-op.

The system automatically rehashes files when 7/8 of the key section
is filled.  The system will print a message when automatically rehashing
a file if the global variable REHASHGAG is non-NIL.

(@b[COPYHASHFILE] HashFile NewName FN ValueType LeaveOpen)

Makes a copy of HashFile under NewName.  Each key and value pair is
moved individually and if FN is supplied, is applied to (KEY VALUE
HASHFILE NEWHASHFILE) and what it returns is used as the value of the
key in the new HashFile.  ValueType is a no-op.  If LeaveOpen is non-NIL
then the new HashFile datum is returned open, otherwise the new HashFile
is closed and the name is returned.


(@b[HASHFILESPLST] HashFile @i[XWord])

Returns an Interlisp generator for the keys in HashFile, usable with
the spelling corrector.  If XWord is supplied, only keys starting with
the prefix in XWord are generated.


The following were only in the EMYCIN hash package:

(@b[HASHFILEDATA] HashFile)

Returns a list of the file name, value type, item length and number
of entries in HashFile.


(@b[CLEARHASHFILES] Close Release)

Closes all hashfiles if Close is non-NIL otherwise a no-op.  Used on
AFTERSYSOUTFORMS to clean up hashfiles.  Release is not implemented.


(@b[COLLECTKEYS] HashFile Double MakeString)

Returns a list of keys in HashFile.  If Double is non-NIL returns keys
as double key pairs.  If MakeString is non-NIL, converts keys to
strings.


Although TEXT HashFiles were implemented under the EMYCIN system,
the following two function are unique to this implementation:


(@b[PUTHASHTEXT] Key SRCFIL HashFile Start End)

Puts text from SRCFIL onto HashFile under Key. Start and End are passed
directly to COPYBYTES.


(@b[GETHASHTEXT] Key HashFile DSTFIL)

Uses COPYBYTES to retrieve text stored under Key on HashFile.


@section(Global Variables)

The variables used by the system of interest to the user.

@begin(description)
HASHFILEDEFAULTSIZE {512} Size used when #Entries is omitted or is too
small.

HASHFILERDTBL {ORIG} The hashfile read table.

HASHLOADFACTOR {.875} The ratio, used slots/total slots, at which
the system rehashes the file, initially 7/8.

HASHTEXTCHAR {↑A} The character separating two key hashkeys.

HFGROWTHFACTOR {3} The ration of total slots to used slots when a
hashfile is created.

REHASHGAG {NIL} Flags whether to print message when rehashing;
initially off.

SYSHASHFILE {NIL} The current hashfile.

SYSHASHFILELST {NIL} An alist of open hashfiles.
@end(description)

@section(Implementation)

The hash package views files as a sequence of bytes, randomly
accessible.  No notice is made of pages and it is assumed that the
host computer buffers I/O sufficiently.

Hashfiles consist of a short header section (8 bytes), a layer of
pointers (4*HASHFILE:Size bytes) followed by ascii data.  Pointers
are 3 bytes wide, preceeded by a status byte.  The pointers point to
key PNAMES in the data section, where each key is followed by its value.
Deleted key pointers are reused, deleted data space is not, so rehashing
is required if many items have been "replaced".

The data section starts at 4*HASHFILE:Size + 9, and consists of
alternating keys and values.  As deleted data is not re-written, not
all data in the data section is valid.

When a key hashes into a used slot, a probe value is added to it to
find the next slot to search.  The probe value is a small prime derived
from the original hash key.

@section(Limitations)

The system currently is able to manipulate files on {CORE}, {DSK},
{FLOPPY} and over the network, via leaf servers.  HashFiles cannot
be used with NS servers until they support random access files.

Due to the pointer size, only hashfiles of < 6 million intial entries
can be created, though these can grow to 14 million entries before
automatic rehashing exceeds the pointer limit.  The total file length
is limited to 16 million bytes.  No range checking is done for these
limits.