File [eris]<Lisp>JCAI>library>hash.doc!1




                                     HASH

  A Hash-Coded Dictionary Facility for Interlisp-10 & Interlisp-D

  (Implemented  by  Christopher Lane at Stanford University, and now maintained
by Xerox)

1. Introduction
  HASH is a new implementation of the Interlisp-10 hashfile package which works
in both Interlisp-10 and Interlisp-D, being written in Interlisp and operating
on streams of bytes.  This package implements the functional interface of both
the current Interlisp-10 package, implemented by R. Kaplan, and the original
implementation by L. Masinter and W. van Melle, which is still used in
EMYCIN-based systems (and will be referred to as the EMYCIN package).  This
document deals mainly with the differences between the various packages.  Users
should consult the Interlisp Reference Manual and the EMYCIN documentation for
more detail.

2. Basics
  Hashfiles are created by calling CREATEHASHFILE.  A "HashFile", as referenced
in  this  document,  is  the  datum returned by CREATEHASHFILE or OPENHASHFILE,
currently an array record containing the hashfile name, the number of slots in
the file, the number of used slots and other details.  All other functions with
hashfile  arguments  use  this  datum.    A  NIL  hashfile  argument  refers to
SYSHASHFILE in the EMYCIN sense of the current  hashfile,  not  a  global  user
hashfile  as  in  Interlisp-10.    Keys  are  strings or atoms, as in the other
systems.

  Interlisp-10 hashfiles come in  several  flavors,  according  to  the  values
stored in them.  The EMYCIN system provides even more flexibility.  This system
supports only the most general EXPR type of hashfile and EMYCIN-style TEXT
entries, which may coexist in the same file.  The VALUETYPE and ITEMLENGTH
arguments are for the most part ignored.

  Two-key hashing is supported in this system but is discouraged, as it exists
only in EMYCIN and not in the Interlisp-10 system.  The functions GETPAGE,
DELPAGE and GETPNAME, which manipulate `secret pages', do not exist in this
implementation.  However, it is permissible to write data to the end of a
HashFile.

  The package sysloads the DFOR10.COM package from the LispUsers directory  for
Interlisp-10 users.

3. Functions
  The  functions  implemented are listed below.  Arguments in italics are those
that only the EMYCIN system used.

  (CREATEHASHFILE File ValueType ItemLength #Entries Smash CopyFn)

  Creates a new HashFile called  File.    All  other  arguments  are  optional.
ValueType can be EXPR or TEXT; however, both kinds of entries can exist on the
same file, so EXPR is best.  ItemLength is not used by the system but is
currently  saved on the file (if less than 256) for future use.  #Entries is an
estimate  of  the  number  of  entries  the  file  will have.  This should be a
realistic guess as the system triples it anyway.  Smash is a HashFile datum  to
reuse  and  CopyFn  is  a  function  to  be applied to entries when the file is
Rehashed.
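
  As a rough illustration of the sizing rule, the following Python sketch
models how a slot count might be derived from #Entries.  The function name and
the exact rule (raise the estimate to HASHFILEDEFAULTSIZE if it is missing or
smaller, then multiply by HFGROWTHFACTOR) are assumptions drawn from the
variable descriptions in section 4; this is not the package's own code.

```python
# Hypothetical model of hashfile sizing; not the package's actual code.
HASHFILEDEFAULTSIZE = 512  # documented default when #Entries is omitted/too small
HFGROWTHFACTOR = 3         # documented ratio of total slots to used slots


def initial_slots(entries=None):
    """Return an assumed slot count for a newly created hashfile."""
    # Assumption: "too small" means smaller than HASHFILEDEFAULTSIZE.
    if entries is None or entries < HASHFILEDEFAULTSIZE:
        entries = HASHFILEDEFAULTSIZE
    # The estimate is tripled, so a realistic guess is enough.
    return HFGROWTHFACTOR * entries
```

  Under these assumptions an estimate of 1000 entries would yield 3000 slots,
which is why padding the estimate by hand is unnecessary.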

  (OPENHASHFILE File Access ItemLength #Entries Smash)

  Opens HashFile File with Access of either INPUT (synonyms are: READ OLD NIL
RETRIEVE) or BOTH (synonyms are: WRITE OUTPUT T INSERT DELETE REPLACE).
ItemLength and #Entries  are  for  backward  compatibility  with  EMYCIN  where
OPENHASHFILE  also  created  new  HashFiles; these arguments should be avoided.
Smash is a HashFile datum to reuse.

  (HASHFILEP HashFile Write?)

  Returns HashFile if it is  a  valid,  open  HashFile  datum  or  returns  the
HashFile  datum associated with HashFile if it is the name of an open hashfile.
If Write?  is non-NIL, HashFile must also be open for write access.

  (PUTHASHFILE Key Value HashFile Key2)

  Puts Value under Key in HashFile.  Key2 is for EMYCIN two-key hashing.  Key2
is internally appended to Key and they are treated as a single key.
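
  The combination of the two keys can be pictured with the Python sketch
below.  It assumes that the separator placed between Key and Key2 is
HASHTEXTCHAR (control-A, see section 4); the function names are hypothetical
and the sketch is a model, not the package's code.

```python
# Hypothetical model of two-key hashing; not the package's actual code.
HASHTEXTCHAR = "\x01"  # control-A, the documented separator character


def combined_key(key, key2=None):
    """Join Key and Key2 into the single key that is actually hashed."""
    if key2 is None:
        return key
    # Assumption: the separator sits between the two keys.
    return key + HASHTEXTCHAR + key2


def split_key(stored):
    """Recover (Key1, Key2) from a stored key; Key2 is None for single keys."""
    key, sep, key2 = stored.partition(HASHTEXTCHAR)
    return (key, key2 if sep else None)
```

  A consequence of this scheme is that a value stored with a Key2 can only be
found again by supplying the same Key2, since the lookup is on the joined key.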

  (GETHASHFILE Key HashFile Key2)

  Gets  the  value  stored  under Key in HashFile.  Key2 is necessary if it was
supplied to PUTHASHFILE.

  (LOOKUPHASHFILE Key Value HashFile CallType Key2)

  Implements PUTHASHFILE and GETHASHFILE among other options.  CallType can  be
any combination of RETRIEVE, DELETE, REPLACE or INSERT.  GETHASHFILE does a
RETRIEVE; PUTHASHFILE does a DELETE if Value is NIL, otherwise a (REPLACE
INSERT).  Other combinations are possible; for example, (RETRIEVE DELETE) will
delete a key and return the old value.
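
  The interplay of the call types can be modelled with an ordinary Python
dictionary, as sketched below.  The precise rules used here (a found key
returns its old value only under RETRIEVE and T otherwise; REPLACE acts only on
existing keys; INSERT acts only on missing keys) are assumptions about the
semantics, and the function names are hypothetical.

```python
# Hypothetical in-memory model of the LOOKUPHASHFILE call types.
def lookup(table, key, value=None, calltype=("RETRIEVE",)):
    """Apply the options in calltype to key; table is a plain dict."""
    if isinstance(calltype, str):
        calltype = (calltype,)
    if key in table:
        # Assumption: the old value is returned only when RETRIEVE is given.
        result = table[key] if "RETRIEVE" in calltype else True
        if "DELETE" in calltype:
            del table[key]
        elif "REPLACE" in calltype:
            table[key] = value
        return result
    # Key not found: only INSERT has any effect.
    if "INSERT" in calltype:
        table[key] = value
    return None


def gethash(table, key):
    """Model of GETHASHFILE: a plain RETRIEVE."""
    return lookup(table, key, calltype="RETRIEVE")


def puthash(table, key, value):
    """Model of PUTHASHFILE; None stands in for NIL and deletes the key."""
    calltype = "DELETE" if value is None else ("REPLACE", "INSERT")
    lookup(table, key, value, calltype)
    return value
```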

  (HASHFILEPROP HashFile Property)

  Returns the value of a Property of a  HashFile  datum.    Currently  accepted
properties  are NAME, ACCESS, VALUETYPE, ITEMLENGTH, SIZE, #ENTRIES, CopyFn and
STREAM.

  (HASHFILENAME HashFile)

  Same as (HASHFILEPROP HashFile 'NAME).

  (CLOSEHASHFILE HashFile ReOpen)

  Closes HashFile.  If ReOpen is non-NIL it  should  be  one  of  the  accepted
access types.  In this case the file is closed and then immediately reopened
with access = ReOpen.  This is used to make sure the HashFile is valid  on  the
disk.

  (MAPHASHFILE HashFile MapFn Double)

  Maps  over  HashFile  applying  MapFn.    If MapFn takes two arguments, it is
applied to Key and Value.  If MapFn takes only one argument, it is applied only
to Key, which saves the cost of reading the value from the file.  If Double is
non-NIL then MapFn is applied to (Key1 Key2 Value), or to (Key1 Key2) if MapFn
takes only two arguments.
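
  Dispatching on the number of arguments MapFn accepts can be sketched in
Python as follows.  The dictionary stands in for the hashfile and the function
name is hypothetical; the Double case is left out to keep the model short.

```python
import inspect


# Hypothetical model of MAPHASHFILE's arity dispatch (single keys only).
def map_hash(table, mapfn):
    """Apply mapfn to each key, or to each key and value, by its arity."""
    nargs = len(inspect.signature(mapfn).parameters)
    for key in list(table):
        if nargs >= 2:
            # The value is fetched only when the function asks for it.
            mapfn(key, table[key])
        else:
            mapfn(key)
```

  In the real package the one-argument form matters because the value need not
be read from the file at all; in this in-memory model the saving is only
notional.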

  (REHASHFILE HashFile NewName NewValueType)

  As  keys  are  replaced,  space in the data section of the file is not reused
(though space in the key section is).  Eventually the file may  need  rehashing
to  reclaim  the  wasted  data  space.   REHASHFILE is really a special case of
COPYHASHFILE, and creates a new file.  If NewName is non-NIL, it  is  taken  as
the name of the rehashed file.  NewValueType is a no-op.

  The  system  automatically  rehashes  files  when  7/8  of the key section is
filled.  The system will print a message when automatically rehashing a file if
the global variable REHASHGAG is non-NIL.
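
  The rehashing threshold amounts to a simple ratio test, sketched here in
Python.  Whether the package rehashes when the ratio reaches the threshold or
only when it exceeds it is not stated, so the comparison used below is an
assumption, as is the function name.

```python
# Hypothetical model of the automatic rehash test.
HASHLOADFACTOR = 0.875  # documented default, 7/8


def needs_rehash(used_slots, total_slots, load_factor=HASHLOADFACTOR):
    """Return True when the key section is considered full."""
    # Assumption: rehash as soon as the ratio reaches the load factor.
    return used_slots / total_slots >= load_factor
```

  For a file of 1536 slots this test first succeeds at 1344 used slots.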

  (COPYHASHFILE HashFile NewName FN ValueType LeaveOpen)

  Makes a copy of HashFile under NewName.  Each key and value pair is copied
individually; if FN is supplied, it is applied to (KEY VALUE HASHFILE
NEWHASHFILE) and what it returns is used as the value of the key in the new
HashFile.  ValueType is a no-op.  If LeaveOpen is non-NIL then the new HashFile
datum  is  returned  open, otherwise the new HashFile is closed and the name is
returned.
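
  The role of FN during a copy can be illustrated with the Python sketch below,
in which dictionaries stand in for the old and new hashfiles.  The function
name is hypothetical and file handling (NewName, LeaveOpen) is omitted.

```python
# Hypothetical model of COPYHASHFILE's per-entry copy.
def copy_hash(old, fn=None):
    """Copy every key of old into a new table, transforming values with fn."""
    new = {}
    for key, value in old.items():
        # fn sees the key, the old value and both tables, and returns the
        # value to store under key in the new table.
        new[key] = fn(key, value, old, new) if fn is not None else value
    return new
```

  Because only live entries are visited, a copy made this way carries none of
the dead data space of the original, which is what makes REHASHFILE a special
case of COPYHASHFILE.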

  (HASHFILESPLST HashFile XWord)

  Returns an Interlisp generator for the keys  in  HashFile,  usable  with  the
spelling  corrector.   If XWord is supplied, only keys starting with the prefix
in XWord are generated.

  The following were only in the EMYCIN hash package:

  (HASHFILEDATA HashFile)

  Returns a list of the file name,  value  type,  item  length  and  number  of
entries in HashFile.

  (CLEARHASHFILES Close Release)

  Closes all hashfiles if Close is non-NIL; otherwise a no-op.  Used on
AFTERSYSOUTFORMS to clean up hashfiles.  Release is not implemented.

  (COLLECTKEYS HashFile Double MakeString)

  Returns a list of keys in HashFile.  If Double is  non-NIL  returns  keys  as
double key pairs.  If MakeString is non-NIL, converts keys to strings.

  Although TEXT HashFiles were implemented under the EMYCIN system, the
following two functions are unique to this implementation:

  (PUTHASHTEXT Key SRCFIL HashFile Start End)

  Puts text from SRCFIL onto HashFile under  Key.  Start  and  End  are  passed
directly to COPYBYTES.

  (GETHASHTEXT Key HashFile DSTFIL)

  Uses COPYBYTES to copy the text stored under Key on HashFile to DSTFIL.

4. Global Variables
  The following variables used by the system are of interest to the user.

HASHFILEDEFAULTSIZE {512} Size used when #Entries is omitted or is too small.

HASHFILERDTBL {ORIG} The hashfile read table.

HASHLOADFACTOR  {.875}  The  ratio, used slots/total slots, at which the system
                rehashes the file, initially 7/8.

HASHTEXTCHAR {↑A} The character separating the two keys of a two-key hash key.

HFGROWTHFACTOR {3} The ratio of total slots to used slots when a hashfile is
                created.

REHASHGAG {NIL} Flags whether to print message when rehashing; initially off.

SYSHASHFILE {NIL} The current hashfile.

SYSHASHFILELST {NIL} An alist of open hashfiles.

5. Implementation
  The hash package views files as a sequence of bytes, randomly accessible.  No
notice is taken of pages, and it is assumed that the host computer buffers I/O
sufficiently.

  Hashfiles consist of a short header section (8 bytes), a layer of pointers
(4*HASHFILE:Size bytes) followed by ASCII data.  Pointers are 3 bytes wide,
preceded by a status byte.  The pointers point to key PNAMES in the data
section, where each key is followed by its value.  Deleted key pointers are
reused, but deleted data space is not, so rehashing is required if many items
have been "replaced".

  The  data  section starts at 4*HASHFILE:Size + 9, and consists of alternating
keys and values.  As deleted data is not re-written, not all data in  the  data
section is valid.
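
  The layout arithmetic can be checked with the short Python sketch below.  It
assumes that byte positions are counted from 1, which is one reading that makes
the 8-byte header and the "+ 9" agree, and that the 3-byte pointer is stored
most significant byte first.  Both points, and all the names, are assumptions.

```python
# Hypothetical model of the hashfile layout; positions are 1-based.
HEADER_BYTES = 8
SLOT_BYTES = 4  # 1 status byte followed by a 3-byte pointer


def slot_position(index):
    """Byte position of slot number index (slots numbered from 0)."""
    return HEADER_BYTES + SLOT_BYTES * index + 1


def data_start(size):
    """Byte position where the data section begins: 4*size + 9."""
    return SLOT_BYTES * size + HEADER_BYTES + 1


def decode_slot(raw):
    """Split a 4-byte slot into (status, pointer)."""
    if len(raw) != SLOT_BYTES:
        raise ValueError("a slot is exactly 4 bytes")
    # Assumption: the pointer is big-endian.
    return raw[0], int.from_bytes(raw[1:4], "big")
```

  With 512 slots the pointer layer occupies positions 9 through 2056 and the
data section begins at position 2057.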

  When  a key hashes into a used slot, a probe value is added to it to find the
next  slot  to  search.    The  probe  value  is a small prime derived from the
original hash key.
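
  Probing of this kind can be sketched in Python as below.  The hash function,
the table of primes and the way a prime is chosen from the hash key are all
stand-ins invented for the illustration; the package's own choices are not
documented here.

```python
# Hypothetical model of collision handling by a key-dependent prime step.
SMALL_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23)  # illustrative choice only


def hash_key(key):
    """Stand-in hash: a simple polynomial over the key's bytes."""
    h = 0
    for byte in key.encode("ascii"):
        h = (h * 31 + byte) & 0xFFFFFF
    return h


def find_slot(slots, key):
    """Return the index holding key or the first free slot probed, else None."""
    size = len(slots)
    h = hash_key(key)
    index = h % size
    probe = SMALL_PRIMES[h % len(SMALL_PRIMES)]  # step derived from the hash
    for _ in range(size):
        if slots[index] is None or slots[index] == key:
            return index
        index = (index + probe) % size  # slot in use: add the probe value
    return None
```

  Deriving the step from the key means that two keys which collide on their
first slot will usually follow different probe sequences afterwards.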

6. Limitations
  The system currently is able to manipulate files on {CORE},  {DSK},  {FLOPPY}
and  over  the  network,  via  leaf  servers.  HashFiles cannot be used with NS
servers until they support random access files.

  Due to the pointer size, only hashfiles of < 6 million initial entries can be
created, though these can grow to 14 million entries before automatic rehashing
exceeds  the  pointer  limit.    The total file length is limited to 16 million
bytes.  No range checking is done for these limits.
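
  The figures above can be reproduced from the 3-byte pointer size with the
following Python arithmetic.  Treating the pointer range as the bound on the
slot count, and combining it with HFGROWTHFACTOR and HASHLOADFACTOR in this
way, is one reading that matches the quoted numbers; whether the package
derives them so is an assumption.

```python
# Arithmetic check of the documented limits (a reading, not package code).
POINTER_BYTES = 3
POINTER_LIMIT = 256 ** POINTER_BYTES  # 16,777,216: about 16 million

HFGROWTHFACTOR = 3        # slots created per estimated entry
LOAD_NUM, LOAD_DEN = 7, 8  # HASHLOADFACTOR as an exact fraction

# Largest initial estimate whose tripled slot count stays within the limit.
max_initial_entries = POINTER_LIMIT // HFGROWTHFACTOR

# Entries present when a table at the limit reaches the 7/8 load factor.
max_entries_before_rehash = POINTER_LIMIT * LOAD_NUM // LOAD_DEN
```

  These evaluate to 5,592,405 and 14,680,064, in line with the "< 6 million"
and "14 million" figures.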