{Note The next few paragraphs seem to be a little too explicit. Do we really want to go into this much detail? How about describing hash arrays as 'black-boxes' that you give a hash-item to and receive a hash-value in return? Perhaps there could be a separate section later describing the algorithm in a little more detail. ---mjs}

{index hash links} A hash link is an association between any Interlisp pointer (atoms, numbers, arrays, strings, lists, etc.) called the {index hash-item}{term hash-item}, and any other Interlisp pointer called the {index hash-value}{term hash-value}. Hash links are implemented by computing an address, called the {term hash-address}{index hash-address}, in a specified array, called the {term hash-array},{index hash-array} and storing the {term hash-item}{index hash-item} and the {term hash-value}{index hash-value} into the cell with that address. The contents of that cell, i.e. the {term hash-item} and {term hash-value}, are then called the {index hash-link}{term hash-link}.{foot The term {term hash link} (unhyphenated) refers to the process of associating information this way, or the "association" as an abstract concept. }{comment endfootnote} Since the hash-array is obviously much smaller than the total number of possible hash-items,{foot which is the total number of Interlisp pointers, i.e. in Interlisp-10, 256K. }{comment endfootnote} the hash-address computed from {arg ITEM} may already contain a hash-link. If this link is from {arg ITEM},{foot {fn EQ} is used for comparing {arg ITEM} with the hash-item in the cell. }{comment endfootnote} the new hash-value simply replaces the old hash-value. Otherwise, another hash-address (in the same hash-array) must be computed, and so on, until an empty cell is found,{foot When the hash array becomes 7/8 full, it is considered to be full, and the array is either enlarged, or an error is generated, as described below in the discussion of overflow.
}{comment endfootnote} or a cell containing a hash-link from {arg ITEM}.

{index hash links} When a hash link for {arg ITEM} is being retrieved, the hash-address is computed using the same algorithm as that employed for making the hash link. If the corresponding cell is empty, there is no hash link for {arg ITEM}. If it contains a {index hash-link}hash-link from {arg ITEM}, the {index hash-value}hash-value is returned. Otherwise, another hash-address must be computed, and so forth.

Note that more than one hash link can be associated with a given {index hash-item}hash-item by using more than one {index hash-array}hash-array.

----------

{var SYSHASHARRAY} is not used by the system; it is provided solely for the user's benefit. It is initially 512 words large, and is automatically enlarged by 50% whenever it is "full". See page {PageRef L!6}.

-----------

In Interlisp-10, the size of the hash array may be increased so that it is relatively prime to possible probe intervals.{note is this worth mentioning?}

{Begin Note}
Date: 28 Oct 1981 11:16 PST
From: KAPLAN at PARC-MAXC

FYI, hash arrays in Interlisp-D currently ARE a power of 2, in the interests of efficiency. That might change if we get microcode support for hashing.
{End Note}

-----------

By using an array argument of a special form, the user can provide for automatic enlargement of a {index hash-array}hash-array when it overflows, i.e., is full and an attempt is made to store a hash link into it. The array argument is either of the form [1] {lisp ({arg HARRAY} . {arg N})}, {arg N} a positive integer; [2] {lisp ({arg HARRAY} . {arg F})}, {arg F} a floating point number; [3] {lisp ({arg HARRAY})}; or [4] {lisp ({arg HARRAY} . {arg FN})}, {arg FN} a function name or a lambda expression. In the first case, a new hash-array is created with {arg N} more cells than the current hash-array. In the second case, the new hash array will be {arg F} times the size of the current hash-array.
The third case, {lisp ({arg HARRAY})}, is equivalent to {lisp ({arg HARRAY} . 1.5)}. In the fourth case, {lisp ({arg HARRAY} . {arg FN})}, {arg FN} is called with {lisp ({arg HARRAY} . {arg FN})} as its argument. If {arg FN} returns a number, the number will be the size of the new hash array. Otherwise, the new size defaults to 1.5 times the size of the old hash array; for example, {arg FN} could be used to print a message, or perform some monitor function. In each case, the new hash-array is {fn RPLACA}ed into the dotted pair, and the computation continues.

If a {index hash-array}hash-array overflows, and the array argument used was not one of these four forms, the error {lisp HASH TABLE FULL}{index HASH TABLE FULL EM} is generated, which will either cause a break or unwind to the last {index ERRORSET}{fn ERRORSET}, as per treatment of errors described in {SectionRef L!ERRORS}.

The system hash array,{index SYSHASHARRAY SY} {var SYSHASHARRAY}, is automatically enlarged by a factor of 1.5 when it is full.

-----------

{Begin Note}
Date: 23 Sep 1981 0740-PDT
From: Dave Dyer
To: Lispdiscussion^.pa at PARC-MAXC

I'd like opinions to what extent hash arrays should be usable interchangeably with ordinary arrays. Should SETA and ELT work? Should both SETA and SETD retrieve the key and data fields respectively? Should ARRAYSIZE work? And which number, actual size or HARRAYSIZE should it return if those are different?

I tend to favor the view that ARRAYS and HARRAYS should not be interchangeable, to allow maximum flexibility in the choice of the representation of hash arrays across various Interlisps. (in which case, Interlisp-10 should be made more restrictive)

Date: 25 Sep 1981 10:09 PDT
From: Bobrow at PARC-MAXC
In-reply-to: DDYER's message of 23 Sep 1981 0740-PDT

I agree that ARRAYS and HARRAYS should not be interchangeable. The garbage collection properties of the two should be able to be different. Having something as a key in a hash array should not necessarily allow one to hold on to it.
Date: 25 Sep 1981 14:17 PDT
From: Masinter at PARC-MAXC
Subject: Re: Arrays and hash arrays

The issue has very little to do with GC semantics, which relates as much to MAPHASH rather than ELT and SETA, but whether HARRAYP can be implemented other than as an array.

My opinions are: HARRAYP can be a separate datatype from ARRAYP. All current lisp system code is believed to be written in such a way that that is true (it is stated so in the VM). Note that TYPENAME must return the appropriate token when given an ARRAYP or HARRAYP.

Thus, ARRAYSIZE, SETA, ELT, SETD, ELTD are NOT expected to work on HARRAYPs; the only operations valid for things created by HARRAY are GETHASH, PUTHASH, CLRHASH, MAPHASH, HARRAYSIZE.

The requirement on HARRAYSIZE is as follows:
(HARRAYSIZE (HARRAY n)) ge n
X must hold (HARRAYSIZE X) - (HARRAYSIZE Y) more key/value attributes than Y.

(In Interlisp-D, while things created by HARRAYP will be ARRAYP and respond to ARRAYSIZE, the ELT and SETA functions will complain when given a HARRAYP.)
{End Note}

{Begin Note}
Date: 19 Apr 1982 16:03 PST
From: JonL at PARC-MAXC
Subject: Re: COPYARRAY

I agree pretty strongly with Dyer that "hash-arrays" should not be thought of as arrays -- I've a version of a hashing package written in NIL/CommonLisp which implements a HASH-TABLE as a semi-intelligent "object", and generally one array isn't enough to hold the appropriate data (about a dozen more state variables were needed, and in some cases a second data array). The functionalities implemented were a superset of those documented in the LISPMachine manual for hash-on-eq and hash-on-equal.

Perhaps the current Interlisp hasharray should be kept for backwards compatibility only, and some future package could implement a more developed facility.
If so, it would be fairly important to have a type of array which holds pointers, but which does not cause the pointers to be protected/copied during a GC; alternatively, there could be a type of pointer array which the GC sweeps after marking everything else, and just deletes entries which haven't been marked elsewhere. Thus it would be possible for "GC" methods to take care of the case where you want a table entry to go away when no one else points to its key.

Incidentally, the use of EXTENDs (i.e., "object-oriented" programming) made it easy to put in special PRINT methods for these hash-tables, but let the default class hierarchy values stand for most other methods, e.g., COPY.

Date: 19 APR 1982 2246-PST
From: MASINTER.PA

It is clear that Interlisp says that HARRAYPs can be different from ARRAYP, and it is a quirk of Interlisp-10 that COPYARRAY works (I'm actually not sure). But what is this about GC's "sweeping"?
{End Note}

{Begin Note}
-------------------------------------the following few messages deal with the problem of hash array entries being deleted by the garbage collector -----mjs

Date: 3 NOV 1977 1925-PST
Date: 29 APR 1975 1115-PDT
From: DEUTSCH
Subject: GC
To: hartley at BBNB
cc: teitelman, bobrow

I have (apparently) finally encountered a situation where I really need to know under precisely what circumstances the G.C. will delete an entry from a hash array. I have a (unfortunately large) program which works properly if I do MINFS(25000) and RECLAIM() before I start running it, but if I don't, it blows up with a NON-NUMERIC ARG NIL which results from something which it "knows" is in a hash array not being there. In the latter case, a GC: 8 occurs during a phase of the computation when there are many hash arrays being filled which will be examined later.

Just to explain my current understanding: it is my impression that the G.C. will delete an entry from a hash array if the key (the thing you give to GETHASH) is otherwise reclaimable, i.e.
if there are no references to the key other than from entry keys in hash arrays, and the key is not a small number or litatom. If this is true, I think it may be what is screwing me, since I think I may have hash arrays which at some point in their lives only exist to be scanned with MAPHASH and whose keys no longer have references to them in some cases. (This is not a contrived situation: imagine a system for manipulating sets which represents sets as hash arrays. Then the keys almost certainly have no other references to them, and the ONLY important use of the arrays is MAPHASH.)

-------
29-APR-75 13:21:41-PDT,1050;000000000001
Net mail from site BBN-TENEXB rcvd at 29-APR-75 13:21:33
Date: 29 APR 1975 1619-EDT
From: HARTLEY at BBN-TENEXB
Subject: HASH ARRAYS AND GC
To: DEUTSCH at PARC
cc: TEITELMAN at PARC, BOBROW at PARC, LEWIS

PETER, YOUR SITUATION IS A MORE COMPLICATED VERSION OF A SITUATION THAT AROSE HERE AWHILE AGO AND HAS BEEN BUGGING ME EVER SINCE. A USER SAVED A HASH ARRAY WITH DUMPHASH(SPELLING?) AND FOUND THAT SHE COULDN'T GET IT LOADED IN BECAUSE A GC OCCURRED WHILE IT WAS LOADING AND ALL THE HASH ENTRIES WENT AWAY. HER INTENT WAS TO LOAD THE HASH ARRAY AND THEN MAPHASH THRU IT TO CONSTRUCT THE REST OF HER DATA STRUCTURE (HAD SHE GOTTEN TO THE STAGE OF CREATING THE REST OF THE STRUCTURE BEFORE THE GC THEN THE HASH ARRAY WOULD NOT HAVE DISAPPEARED (THE CONTENTS OF THE HASH ARRAY THAT IS)). THE LESSON TO BE LEARNED IS THAT EITHER MAPHASH SHOULD NOT EXIST, OR THAT THE GARBAGE COLLECTOR SHOULD NEVER DELETE ENTRIES FROM A HASH ARRAY. THIS PROBLEM IS RELATED TO MY OBJECTIONS TO THE DESIRE FOR MAPATOMS.

Date: 29 APR 1975 1414-PDT
From: DEUTSCH
Subject: hash arrays & gc
To: hartley at BBNB
cc: teitelman, bobrow, lewis at BBNB

Well, I'm glad to learn I'm not alone in being bitten by this bug. The logical problem is fairly subtle.
I'm not willing to give up MAPHASH (and I see it as being different than MAPATOMS in that it requires an explicit act to create and delete hash associations whereas atoms are created "at need" and therefore may more reasonably be reclaimed "when no longer needed"). What is really going on is that some hash arrays are sort of like property list associations, which persist indefinitely even if the user has forgotten all about the atom that possesses them, whereas others are more like associations by direct pointers (like list-records), which disappear when the key is no longer accessible. Litatoms can possess both types of associations: an example of the former is the EXPR property, whereas a "memo function" like a numeric hash from the name might be an example of the latter property.

I confess I don't see a good solution to this problem. The one that comes to mind is a bit in the hash array that tells the g.c. whether it is allowed to delete entries, but somehow this is unsatisfying. However, if (as it appears) the difference between the two situations is really the user's "intent", then there is no way the system can always do the right thing, and some kind of explicit user-settable bit is really required. Comments?

-----------------------------------the following few messages deal with the problem of MAPHASH (in Interlisp-10) working incorrectly if a garbage collection occurs at the wrong time-----mjs

Date: 1 Mar 1982 13:09 PST
From: Masinter at PARC-MAXC
Subject: Re: MAPHASH problems in Interlisp??

(1) You have unfortunately run into one of the more subtle problems with MAPHASH in Interlisp-10: if a garbage collection which MOVES STORAGE occurs during the middle of a MAPHASH, it is possible for the hash pointers to move around, and for entries to be missed and for some entries to be visited twice.
This is the only situation in which MAPHASH will omit items or present them twice (note that "rehashing" actually copies the original array into another one, so that if a rehash occurs because of overflow, you may get outdated information but not any duplicates.)

(2) The problem is that if a reclaim needs to increase the size of one of the contiguous areas (such as array space or string pointer space), it may actually move around pages of atoms. It isn't that atoms get compacted but rather that other spaces have to increase which causes the atoms to get moved around.

(3) The way that I worked around this problem when I ran into it was
(a) MAPHASH down the array, collecting a list of the "keys"
(b) MAPC down that list, performing the operation
This guarantees that no string/array/pname garbage collection will occur during MAPHASH.

There are some proposals for fixing this problem in Interlisp-10 (e.g., marking the array that it is being maphashed, and if so marked, not rehashing during a reclaim but fixing it the next puthash) but so far (for the last 4 years) no progress on fixing it. This bug is not present in other Interlisp implementations, as far as I know.

Date: 1 Mar 1982 1732-PST
From: Steve Crocker
Subject: Re: MAPHASH problems in Interlisp??

An alternative strategy is not to use MAPHASH at all. An auxiliary list may be kept, suitably updated whenever items are added to or deleted from the table. If this sounds ridiculously expensive, I submit that it competes with the proposed solution for some frequencies of MAPHASHing, adding, deleting and accessing elements. It would be interesting to see the crossover point.
{End Note}
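
The storing and retrieving procedure described at the start of this section, together with the overflow rules, can be illustrated with a small sketch. It is written in Python rather than Interlisp and is not the Interlisp implementation: the class and method names, the use of `id()` as the hash-address computation, linear probing as the way of computing "another hash-address", and the encoding of the overflow argument are all assumptions made for illustration. Only the identity comparison (standing in for EQ), the 7/8 fullness threshold, and the growth rules come from the text.

```python
class HashTableFull(Exception):
    """Stands in for the HASH TABLE FULL error described in the text."""


class HashArray:
    """Hypothetical open-addressing table; items are compared by identity."""

    NO_GROWTH = object()  # plain hash-array: overflow is an error

    def __init__(self, size, overflow=NO_GROWTH):
        # size is assumed to be at least 1.
        # overflow: int N (N more cells), float F (F times the size),
        # None (same as 1.5), or a function of the table.
        self.cells = [None] * size  # each cell is None or (item, value)
        self.count = 0
        self.overflow = overflow

    def _probe(self, item):
        """Return (index, found) for item's cell or the first empty cell."""
        size = len(self.cells)
        start = id(item) % size  # assumed address computation
        for step in range(size):  # assumed probe sequence: linear
            i = (start + step) % size
            cell = self.cells[i]
            if cell is None:
                return i, False
            if cell[0] is item:  # identity test, like EQ
                return i, True
        return None, False  # every cell holds a link from another item

    def get(self, item):
        i, found = self._probe(item)
        return self.cells[i][1] if found else None

    def put(self, item, value):
        i, found = self._probe(item)
        # A new link may not be stored once the array is 7/8 full.
        if not found and self.count * 8 >= len(self.cells) * 7:
            self._grow()
            i, found = self._probe(item)
        if not found:
            self.count += 1
        self.cells[i] = (item, value)  # new value replaces any old one
        return value

    def _grow(self):
        old = len(self.cells)
        spec = self.overflow
        if spec is HashArray.NO_GROWTH:
            raise HashTableFull("HASH TABLE FULL")
        if callable(spec):
            result = spec(self)  # may print a message, monitor, etc.
            if isinstance(result, (int, float)):
                new_size = int(result)  # a number is taken as the new size
            else:
                new_size = int(old * 1.5)
        elif isinstance(spec, int):
            new_size = old + spec
        elif isinstance(spec, float):
            new_size = int(old * spec)
        else:
            new_size = int(old * 1.5)
        new_size = max(new_size, old + 1)  # sketch-only guard: always grow
        links = [cell for cell in self.cells if cell is not None]
        self.cells = [None] * new_size
        for item, value in links:  # rehash every link into the new array
            i, _ = self._probe(item)
            self.cells[i] = (item, value)
```

In this sketch, `HashArray(8, overflow=1.5)` plays the role of the dotted pair with a floating point number, `HashArray(8, overflow=None)` the role of the one-element list, and `HashArray(8)` a plain hash-array that raises `HashTableFull` on the eighth distinct item. Because items are compared by identity, two separately created lists with equal contents are different hash-items.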