File [eris]<LispManual>Stuff.HashArrays!1


{Note The next few paragraphs seem to be a little too explicit.  Do we really want to go into this much detail?  How about describing hash arrays as 'black-boxes' that you give a hash-item to and receive a hash-value in return?  Perhaps there could be a separate section later describing the algorithm in a little more detail.  ---mjs}


{index hash links}

A hash link is an association between any Interlisp pointer (atoms, numbers, arrays, strings, lists, etc.), called the {index hash-item}{term hash-item}, and any other Interlisp pointer, called the {index hash-value}{term hash-value}.
Hash links are implemented by computing an address, called the {term hash-address}{index hash-address}, in a specified array, called the {term hash-array},{index hash-array} and storing the {term hash-item}{index hash-item} and the {term hash-value}{index hash-value} into the cell with that address.  The contents of that cell, i.e. the {term hash-item} and {term hash-value}, are then called the {index hash-link}{term hash-link}.{foot
The term {term hash link} (unhyphenated) refers to the process of associating information this way, or the "association" as an abstract concept.
}{comment endfootnote}

Since the hash-array is obviously much smaller than the total number of possible hash-items,{foot
which is the total number of Interlisp pointers, i.e. in Interlisp-10, 256K.
}{comment endfootnote}
the hash-address computed from {arg ITEM} may already contain a hash-link.
If this link is from {arg ITEM},{foot
{fn EQ} is used for comparing {arg ITEM} with the hash-item in the cell.
}{comment endfootnote}
the new hash-value simply replaces the old hash-value.
Otherwise, another hash-address (in the same hash-array) must be computed, and so on, until an empty cell is found,{foot
When the hash array becomes 7/8 full, it is considered to be full, and the array is either enlarged, or an error is generated, as described below in the discussion of overflow.
}{comment endfootnote}
or a cell containing a hash-link from {arg ITEM}.

{index hash links}

When a hash link for {arg ITEM} is being retrieved, the hash-address is computed using the same algorithm as
that employed for making the hash link.
If the corresponding cell is empty, there is no hash link for {arg ITEM}.
If it contains a {index hash-link}hash-link from {arg ITEM}, the {index hash-value}hash-value is returned.
Otherwise, another hash-address must be computed, and so forth.
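
The storing and retrieving procedures described above can be sketched in Python, as a rough model rather than the actual implementation.  The text does not say how successive hash-addresses are computed, so linear probing from Python's `id` is an assumption here.  The class and method names are hypothetical, Python's `is` stands in for {fn EQ}, and the 7/8 threshold is taken from the footnote on overflow.

```python
class HashArray:
    """Toy model of a hash-array: a vector of cells, each empty or holding one hash-link."""

    def __init__(self, size):
        self.cells = [None] * size   # None marks an empty cell
        self.count = 0               # number of hash-links currently stored

    def _find(self, item):
        """Return the first probed address that is empty or holds a link from item.

        Linear probing is an assumption; the probe sequence is not specified in the text.
        Returns None if every cell is occupied by some other hash-item.
        """
        size = len(self.cells)
        start = id(item) % size
        for step in range(size):
            addr = (start + step) % size
            link = self.cells[addr]
            if link is None or link[0] is item:   # identity test, like EQ
                return addr
        return None

    def puthash(self, item, value):
        addr = self._find(item)
        if addr is None or self.cells[addr] is None:
            # A new hash-link is needed: refuse once the array is 7/8 full.
            if self.count * 8 >= len(self.cells) * 7:
                raise OverflowError("HASH TABLE FULL")
            self.count += 1
        self.cells[addr] = (item, value)   # store, or replace the old hash-value
        return value

    def gethash(self, item):
        addr = self._find(item)
        if addr is None or self.cells[addr] is None:
            return None                    # no hash link for item
        return self.cells[addr][1]
```

Because comparison is by identity, two distinct lists with the same contents are different hash-items in this model, just as two non-{fn EQ} pointers would be.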

Note that more than one hash link can be associated with a given {index hash-item}hash-item by using more than one {index hash-array}hash-array.

----------

{var SYSHASHARRAY} is not used by the system; it is provided solely for the user's benefit.  It is initially 512 words large, and is automatically enlarged by 50% whenever it is "full".  See page {PageRef L!6}.

-----------

In Interlisp-10, the size of the hash array may be increased so that it is relatively prime to possible probe intervals.{note is this worth mentioning?}


{Begin Note}
Date: 28 Oct 1981 11:16 PST
From: KAPLAN at PARC-MAXC

FYI, hash arrays in Interlisp-D currently ARE a power of 2, in the interests of
efficiency.  That might change if we get microcode support for hashing.
{End Note}

-----------

By using an array argument of a special form, the user can provide for automatic enlargement of a {index hash-array}hash-array when it overflows,
i.e., is full and an attempt is made to store a hash link into it.
The array argument takes one of the following forms:
[1] {lisp ({arg HARRAY} . {arg N})}, {arg N} a positive integer;
[2] {lisp ({arg HARRAY} . {arg F})}, {arg F} a floating point number;
[3] {lisp ({arg HARRAY})};
or
[4] {lisp ({arg HARRAY} . {arg FN})}, {arg FN} a function name or a lambda expression.
In the first case, a new hash-array is created with {arg N} more cells than the current hash-array.
In the second case, the new hash array will be {arg F} times the size of the current hash-array.
The third case, {lisp ({arg HARRAY})},
is equivalent to {lisp ({arg HARRAY} . 1.5)}.
In the fourth case, {lisp ({arg HARRAY} . {arg FN})}, {arg FN} is called with {lisp ({arg HARRAY} . {arg FN})} as its argument.  If {arg FN} returns a number, that number will be the size of the new hash array.
Otherwise, the new size defaults to 1.5 times the size of the old hash array.  Thus {arg FN} could be used, for example, to print a message or to perform some monitor function.
In each case, the new hash-array is {fn RPLACA}ed into the dotted pair, and the computation continues.
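
The four cases above can be sketched in Python as a way of checking the arithmetic.  `new_size` is a hypothetical helper, not an Interlisp function.  It assumes that fractional sizes are truncated (the text does not say how they are rounded), and it passes only the old size to the function, whereas the text passes the whole dotted pair.

```python
def new_size(old_size, growth=None):
    """Return the size of the enlarged hash-array.

    growth models the CDR of the dotted pair:
      a positive int N  -> N more cells than before
      a float F         -> F times the old size
      None              -> same as 1.5
      a callable FN     -> FN's result if it is a number, else 1.5 times the old size
    """
    default = int(old_size * 1.5)          # truncation is an assumption
    if callable(growth):
        result = growth(old_size)          # simplification: FN sees only the old size
        is_number = isinstance(result, (int, float)) and not isinstance(result, bool)
        return int(result) if is_number else default
    if growth is None:
        return default
    if isinstance(growth, float):
        return int(old_size * growth)
    if isinstance(growth, int) and growth > 0:
        return old_size + growth
    raise ValueError("not one of the four forms")
```

For a 512-cell array, `new_size(512)` and `new_size(512, 1.5)` both give 768, `new_size(512, 100)` gives 612, and a function that only prints a message (returning `None`) also yields 768.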

If a {index hash-array}hash-array overflows, and the array argument used was not one of these four forms, the error
{lisp HASH TABLE FULL}{index HASH TABLE FULL EM} is generated, which will either cause a break or unwind to the last {index ERRORSET}{fn ERRORSET}, as per treatment of errors described in
{SectionRef L!ERRORS}.

The system hash array,{index SYSHASHARRAY SY} {var SYSHASHARRAY}, is automatically enlarged by a factor of 1.5 when it is full.

-----------


{Begin Note}
Date: 23 Sep 1981 0740-PDT
From: Dave Dyer <DDYER at USC-ISIB>
To: Lispdiscussion↑.pa at PARC-MAXC

 I'd like opinions to what extent hash arrays should be usable
interchangeably with ordinary arrays.

 Should SETA and ELT work?
 Should both SETA and SETD retrieve the key and data fields respectively?
 Should ARRAYSIZE work?  And which number, actual size or HARRAYSIZE
   should it return if those are different?

 I tend to favor the view that ARRAYS and HARRAYS should not be
interchangeable, to allow maximum flexibility in the choice of
the representation of hash arrays across various Interlisps.
(in which case, Interlisp-10 should be made more restrictive)

Date: 25 Sep 1981 10:09 PDT
From: Bobrow at PARC-MAXC
In-reply-to: DDYER's message of 23 Sep 1981 0740-PDT

I agree that ARRAYS and HARRAYS should not be interchangeable.  The garbage collection properties of the two should be able to be different.  Having something as a key in a hash array should not necessarily allow one to hold on to it.

Date: 25 Sep 1981 14:17 PDT
From: Masinter at PARC-MAXC
Subject: Re: Arrays and hash arrays

The issue has very little to do with GC semantics, which relates as much to
MAPHASH rather than ELT and SETA, but whether HARRAYP can be
implemented other than as an array.

My opinions are:

HARRAYP can be a separate datatype from ARRAYP. All current lisp system
code is believed to be written in such a way that that is true (it is stated so in
the VM). Note that TYPENAME must return the appropriate token when given
an ARRAYP or HARRAYP.

Thus, ARRAYSIZE, SETA, ELT, SETD, ELTD are NOT expected to work on
HARRAYPs; the only operations valid for things created by HARRAY are
GETHASH, PUTHASH, CLRHASH, MAPHASH, HARRAYSIZE. 

The requirement on HARRAYSIZE is as follows:
	(HARRAYSIZE (HARRAY n)) ge n
	X must hold (HARRAYSIZE X) - (HARRAYSIZE Y) more key/value
		attributes than Y.

(In Interlisp-D, while things created by HARRAYP will be ARRAYP and respond
to ARRAYSIZE, the ELT and SETA functions will complain when given a
HARRAYP.)
{End Note}


{Begin Note}
Date: 19 Apr 1982 16:03 PST
From: JonL at PARC-MAXC
Subject: Re: COPYARRAY

I agree pretty strongly with Dyer that "hash-arrays" should not be thought
of as arrays -- I've a version of a hashing package written in NIL/CommonLisp
which implements a HASH-TABLE as a semi-intelligent "object", and generally
one array isn't enough to hold the appropriate data (about a dozen more state
variables were needed, and in some cases a second data array).  The
functionalities implemented were a superset of those documented in the
LISPMachine manual for hash-on-eq and hash-on-equal.

Perhaps the current Interlisp hasharray should be kept for backwards
compatibility only, and some future package could implement a more
developed facility.  If so, it would be fairly important to have a type
of array which holds pointers, but which does not cause the pointers to be
protected/copied during a GC; alternatively, there could be a type of
pointer array which the GC sweeps after marking everything else, and
just deletes entries which haven't been marked elsewhere.  Thus it would be
possible for "GC" methods to take care of the case where you want a table entry
to go away when no one else points to its key.  

Incidentally, the use of EXTENDs (i.e., "object-oriented" programming) made
it easy to put in special PRINT methods for these hash-tables, but let the default
class hierarchy values stand for most other methods, e.g., COPY.

Date: 19 APR 1982 2246-PST
From: MASINTER.PA

It is clear that Interlisp says that HARRAYPs can be different
from ARRAYP, and it is a quirk of Interlisp-10 that COPYARRAY
works (I'm actually not sure).

But what is this about GC's "sweeping"?
{End Note}


{Begin Note}

-------------------------------------the following few messages deal with the problem of hash array entries being deleted by the garbage collector -----mjs


Date: 3 NOV 1977 1925-PST
Date: 29 APR 1975 1115-PDT
From: DEUTSCH
Subject: GC
To:   hartley at BBNB
cc:   teitelman, bobrow

I have (apparently) finally encountered a situation where I really
need to know under precisely what circumstances the G.C. will delete
an entry from a hash array.  I have a (unfortunately large) program
which works properly if I do MINFS(25000) and RECLAIM) before I
start running it, but if I don't, it blows up with a NON-NUMERIC ARG
NIL which results from something which it "knows" is in a hash array
not being there.  In the latter case, a GC: 8 occurs during a phase
of the computation when there are many hash arrays being
filled which will be examined later.

Just to explain my current understanding: it is my impression that
the G.C. will delete an entry from a hash array if the key
(the thing you give to GETHASH) is otherwise reclaimable, i.e. if
there are no references to the key other than from entry keys in
hash arrays, and the key is not a small number or litatom.
If this is true, I think it may be what is screwing me, since
I think I may have hash arrays which at some point in their lives
only exist to be scanned with MAPHASH and whose keys no longer have
references to them in some cases.  (This is not a contrived situation:
imagine a system for manipulating sets which represents sets as
hash arrays.  Then the keys almost certainly have no other references
to them, and the ONLY important use of the arrays is MAPHASH.)
-------
29-APR-75 13:21:41-PDT,1050;000000000001
Net mail from site BBN-TENEXB rcvd at 29-APR-75 13:21:33
Date: 29 APR 1975 1619-EDT
From: HARTLEY at BBN-TENEXB
Subject: HASH ARRAYS AND GC
To:   DEUTSCH at PARC
cc:   TEITELMAN at PARC, BOBROW at PARC, LEWIS

PETER,
YOUR SITUATION IS A MORE COMPLICATED VERSION OF A SITUATION
THAT AROSE HERE AWHILE AGO AND HAS BEEN BUGGING
ME EVER SINCE.
A USER SAVED A HASH ARRAY WITH DUMPHASH(SPELLING?) AND
FOUND THAT SHE COULDNT GET IT LOADED IN BECAUSE
A GC OCCURRED WHILE IT WAS LOADING
AND ALL THE HASH ENTRIES WENT AWAY. HER INTENT WAS
TO LOAD THE HASH ARRAY AND THEN MAPHASH THRU IT
TO CONSTRUCT THE REST OF HER DATA STRUCTURE (HAD SHE
GOTTEN TO THE STAGE OF CREATING THE REST OF THE
STRUCTURE BEFORE THE GC THEN THE HASH ARRAY WOULD NOT
HAVE DISAPPEARED (THE CONTENTS OF THE HASH ARRAY THAT IS)
).

THE LESSON TO BE LEARNED IS THAT EITHER MAPHASH SHOULD NOT
EXIST, OR THAT THE GARBAGE COLLECTOR SHOULD NEVER
DELETE ENTRIES FROM A HASH ARRAY.
THIS PROBLEM IS RELATED TO MY OBJECTIONS TO THE
DESIRE FOR MAPATOMS.

Date: 29 APR 1975 1414-PDT
From: DEUTSCH
Subject: hash arrays & gc
To:   hartley at BBNB
cc:   teitelman, bobrow, lewis at BBNB

Well, I'm glad to learn I'm not alone in being bitten by this bug.

The logical problem is fairly subtle.  I'm not willing to give up
MAPHASH (and I see it as being different than MAPATOMS in that it
requires an explicit act to create and delete hash associations whereas
atoms are created "at need" and therefore may more reasonably be
reclaimed "when no longer needed").  What is really going on is that
some hash arrays are sort of like property list associations,
which persist indefinitely even if the user has forgotten all about
the atom that possesses them, whereas others are more like associations
by direct pointers (like list-records), which disappear when the key
is no longer accessible.  Litatoms can possess both types of
associations: an example of the former is the EXPR property, whereas
a "memo function" like a numeric hash from the name might be an example
of the latter property.

I confess I don't see a good solution to this problem.  The one that
comes to mind is a bit in the hash array that tells the g.c. whether
it is allowed to delete entries, but somehow this is unsatisfying.
However, if (as it appears) the difference between the two situations
is really the user's "intent", then there is no way the system can
always do the right thing, and some kind of explicit user-settable bit
is really required.

Comments?


-----------------------------------the following few messages deal with the problem of MAPHASH (in Interlisp-10) working incorrectly if a garbage collection occurs at the wrong time-----mjs

Date: 1 Mar 1982 13:09 PST
From: Masinter at PARC-MAXC
Subject: Re: MAPHASH problems in Interlisp??

(1) You have unfortunately run into one of the more subtle problems with
MAPHASH in Interlisp-10: if a garbage collection which MOVES STORAGE
occurs during the middle of a MAPHASH, it is possible for the hash
pointers to move around, and for entries to be missed and for some entries
to be visited twice. This is the only situation in which MAPHASH will
omit items or present them twice (note that "rehashing" actually copies the
original array into another one, so that if a rehash occurs because of overflow,
you may get outdated information but not any duplicates.)

(2) The problem is that if a reclaim needs to increase the size of one of the
contiguous areas (such as array space or string pointer space), it may actually
move around pages of atoms. It isn't that atoms get compacted but rather that
other spaces have to increase which causes the atoms to get moved around.

(3) The way that I worked around this problem when I ran into it was
	(a) MAPHASH down the array, collecting a list of the "keys"
	(b) MAPC down that list, performing the operation

This guarantees that no string/array/pname garbage collection will occur during
MAPHASH.

There are some proposals for fixing this problem in Interlisp-10 (e.g., marking the
array that it is being maphashed, and if so marked, not rehashing during a
reclaim but fixing it the next puthash) but so far (for the last 4 years) no
progress on fixing it.

This bug is not present in other Interlisp implementations, as far as I know. 

Date:  1 Mar 1982 1732-PST
From: Steve Crocker <Crocker at USC-ISIF>
Subject: Re: MAPHASH problems in Interlisp??

An alternative strategy is not to use MAPHASH at all.  An auxiliary
list may be kept, suitably updated whenever items are added to or deleted
from the table.

If this sounds ridiculously expensive, I submit that it competes with the
proposed solution for some frequencies of MAPHASHing, adding, deleting and
accessing elements.

It would be interesting to see the crossover point.
{End Note}