The Spelling Tool.
The Spelling Tool performs one of the steps of proofreading a document: that of pointing out words whose spellings seem doubtful. The Tool breaks a document up into words, checks whether the words are in a list of known words, and complains about those words that are not. It lets you fix the offending words by using Tioga or by using a semi-automatic correction facility. Or, if you think that a word is correctly spelled, the Tool lets you add it to your list of known words. This is all the Spelling Tool does for you; the rest of your proofreading must be handled by other means.
The Spelling Tool is a standard sort of Cedar Tool. As of this writing, you retrieve and run it in the usual way:
& bringover /a /p [Indigo]<PostCedar>Top>SpellingTool.df
& run SpellingTool
When started, the Spelling Tool reads an encoded representation of the standard system word list from the file [Indigo]<SpellingToolData>SpellingTool.BitTable. It then reads your private word list from the local file SpellingTool.Wordlist and merges those words into its copy of the standard system word list. Once these lists have been read, the Tool is ready to check your spelling.
For example.
As a first example, suppose you want to know whether the word auxilliary is correctly spelled. Select it and left-click the Spelling Tool's Check button. The word auxilliary is not in the Spelling Tool's word list, so the Tool tells you that the word is misspelled. Note that the Spelling Tool's word list is far from complete, so this forthright statement may not be correct; however, in this case auxilliary is indeed a misspelling. If you know the correct way to spell auxilliary, then you can use Tioga to correct the document. On the other hand, if you don't know how auxilliary should be spelled, then the Spelling Tool offers you some assistance by displaying the words in its word list that are similar to auxilliary:
auxilliary --> auxiliary
The Tool is telling you that its word list contains the word auxiliary, which is spelled similarly to auxilliary. The proposed correction looks like it might be right, so you can bug the ``auxiliary'' button, which performs the correction by replacing the text auxilliary with the text auxiliary.
As a second example, suppose that you've written a long memo:
Walnut is a mail system that runs in Cedar. Walnut provides facilties to send and retrieve mail (using the Grapevine mail transport system), and to display and classify stored (i.e. previously retrieved) messages. Walnut uses the Cypress database system to maintain information about stored messages: we hope that this will allow Walnut...
And on and on. You're about to send this opus off to your local rhetorician for comment; you think that it's sterling prose, and you'd be most embarrassed if she found spelling mistakes in the first paragraph (or, for that matter, in subsequent ones). To check the spelling of the text, you make a point or single-character selection at the beginning of the document, say before the W in Walnut, and left-click the Check button in the Spelling Tool. Left-clicking Check when you've selected a point or single character tells the Spelling Tool to search from that point to the end of the document for the next misspelled word. In this case, it finds the word ``facilties'', which has two plausible corrections:
facilties --> facilities faculties
The first correction, ``facilities'', is the one you want, so you bug that word with a right click, which performs the correction and invokes Check on the text that follows it. The next misspelled word is Grapevine. You think that Grapevine is spelled just fine; the Spelling Tool disagrees because Grapevine is not in its list of known words. You add it to the list by left-clicking the Insert command, and you continue searching for misspelled words by moving the selection to the point after Grapevine and left-clicking Check. Alternatively, you could have combined these two operations of inserting and continuing your search by right-clicking the Insert command. The next word found is Cypress; it too is simply an unknown word that must be inserted. Fifty-four insertions and twenty-two misspellings later, you have a little more confidence in your document's correctness.
How it works.
The current implementation of the Spelling Tool uses a compact representation of the word list that was suggested in the paper ``Space/Time tradeoffs in hash coding with allowable errors'' by Bloom (CACM, July 1970) and in the paper ``Exact and approximate membership testers'' by Carter, Floyd, Gill, Markowsky, and Wegman (Proc. 10th ACM Symp. on Theory of Computing, May 1978, pp. 59-65). To illustrate the technique, suppose that we wish to represent a list L of 1,000 words so that the question ``Does a word x occur in L?'' can be answered in a space-efficient manner. The algorithm represents the list using a 20,000-element bit table T and accesses T through ten independent hashing functions h1, h2, ..., h10 that map words to numbers in the range 1 to 20,000. All 20,000 bits in T are initially set to zero. Each word w in L is inserted into T by setting bits T[h1(w)], T[h2(w)], ..., T[h10(w)] to one. After all 1,000 words have been inserted into T, one would expect somewhat less than 10,000 of the 20,000 bits to be one, and at least 10,000 to be zero.
This collection of 20,000 bits can be used to test membership in L. A word x is looked up by testing bits T[h1(x)], T[h2(x)], ..., T[h10(x)]. If any of these bits is zero, then x is definitely not a member of L, because all of the bits in T that correspond to members of L have been set to one. If all of the bits T[h1(x)], T[h2(x)], ..., T[h10(x)] are one, then the algorithm says that x is a member of L. The algorithm is probabilistic because it can give a false positive: it will occasionally say that a word x is in L when in fact it is not.
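The Tool itself is written in Cedar, but the technique is small enough to sketch in a few lines of Python. Everything below (the table size, the number of hash functions, and the use of MD5 to manufacture hash values) is an illustrative stand-in for the 20,000-bit, ten-hash example above, not the Tool's actual code:

import hashlib

class BitTable:
    # Probabilistic word-list membership, after Bloom (1970).
    # N_BITS and N_HASHES mirror the example in the text; they are not
    # the Spelling Tool's real parameters.
    N_BITS = 20000
    N_HASHES = 10

    def __init__(self):
        self.bits = bytearray(self.N_BITS)   # one byte per bit, for clarity

    def _probes(self, word):
        # Derive N_HASHES table indices from the word; MD5 with a varying
        # prefix stands in for independent hash functions.
        for i in range(self.N_HASHES):
            digest = hashlib.md5(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.N_BITS

    def insert(self, word):
        # Turn on every bit the hash functions point at.
        for p in self._probes(word):
            self.bits[p] = 1

    def lookup(self, word):
        # A single zero bit proves the word was never inserted; all ones
        # means only "probably present".
        return all(self.bits[p] for p in self._probes(word))

table = BitTable()
for w in ["auxiliary", "facilities", "faculties"]:
    table.insert(w)
print(table.lookup("auxiliary"))    # True
print(table.lookup("auxilliary"))   # almost certainly False

A zero bit is proof of absence; all ones is only strong evidence of presence, which is where the false positives discussed next come from.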
False positives will not happen very often. Suppose that the table is less than half filled with ones and that the ten hashing functions produce uniformly distributed and independent results. The probability that T[h1(x)] will contain a one on a randomly chosen word x that is not in L is no more than 1/2, because no more than half of the bits in T are ones. Since the hash functions are independent, the probability that all 10 of the bits T[h1(x)], T[h2(x)], ..., T[h10(x)] are equal to one is no more than (1/2)^10. Given these assumptions about the hash functions, the chances that the algorithm will accept a word that is not actually in the list are less than one in a thousand.
To generalize the analysis a bit, suppose that there are n words in the word list L, that there are k different hash functions h1, h2, ..., hk, and that there are b bits per word, for a total of bn bits in T. At most kn of those bn bits are ones, so the probability of a false positive on a randomly chosen word x that is not in L is less than or equal to (kn/bn)^k, or (k/b)^k. For a particular value of b, (k/b)^k is minimized by choosing k = b/e, that is, b = ek (e ≈ 2.71828...).
In the current version of the Spelling Tool, k is set to 7 and b is set to 19 (7 * 2.71828 = 19.02796). Ideally, then, the chances of the Tool falsely declaring that a word x is correctly spelled are less than (7/19)^7, which is around 0.00092, or 1 in 1086.
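These figures are easy to check; the following lines of Python are just arithmetic on the bound derived above, not part of the Tool:

from math import e

def false_positive_bound(k, b):
    # Upper bound (k/b)^k on the false-positive probability, assuming
    # uniform, independent hash functions and a table at most k/b full.
    return (k / b) ** k

print(false_positive_bound(10, 20))   # the 10-hash, 20-bits-per-word example: ~0.00098
print(7 * e)                          # ~19.03, so b = 19 is close to e*k for k = 7
print(false_positive_bound(7, 19))    # ~0.00092, about 1 in 1086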
The principal advantage of the bit-table word list representation is that it is space efficient. The average English word consists of around 8.6 characters. If such a word were represented as an ASCII string, it would take up 68.8 bits of storage. Even if the word were more compactly represented as a base-26 integer, it would still take up around 40.4 bits (lg 26 ≈ 4.7). A bit-table representation of the word list that delivers 99.9% accuracy uses 19 bits per word, which is about 28% of the space used by the ASCII representation, and 47% of the space used by the base-26 representation.
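For the record, here is that arithmetic, assuming 8-bit ASCII characters and the 8.6-character average quoted above (again a throwaway Python calculation, not the Tool's code):

from math import log2

avg_len = 8.6                      # average English word length, from the text
ascii_bits = avg_len * 8           # 68.8 bits per word as an ASCII string
base26_bits = avg_len * log2(26)   # ~40.4 bits per word as a base-26 integer
table_bits = 19                    # bits per word in the bit-table representation

print(table_bits / ascii_bits)     # ~0.28
print(table_bits / base26_bits)    # ~0.47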
Why is space efficiency important? Let me tell you a little history. This same algorithm was implemented by Steve Wood and myself to run within Wood's Z editor (Steven R. Wood, ``Z - The 95% program editor,'' Proc. of the ACM SIGPLAN/SIGOA Symp. on Text Manipulation, June 1981, pp. 1-7, and Robert Nix, ``Experience with a space-efficient way to store a dictionary,'' CACM 24(5), May 1981, pp. 297-298). Z runs on crowded time-sharing systems that sit atop machines (DEC-20s) in which programs have a fairly small virtual address space. Z runs very well in this environment, providing seemingly instantaneous response to keystrokes even when there are more than 50 users on the machine. It was important that a spelling checker for Z not take up too much of the editor's address space. There were at least two reasons for this: first, Z had more important things to do with its address space than to give an enormous chunk of it to a spelling checker; and second, programs with large working sets tend to perform poorly on heavily loaded DEC-20s. We chose the bit-table representation because it was space-efficient and easy to implement. The bit-table representation of a 34,958-element word list takes up 18K 36-bit words, or 7.0% of the DEC-20's address space. This can be contrasted with Gorin's well-known DEC-20 SPELL program, whose representation for the same word list takes up 95K words, or 36.4% of the DEC-20's address space. A program like SPELL is a resource hog; Z's spelling checker is not.
What does this have to do with a Cedar spelling checker? Are these same performance considerations important within Cedar? I think so. Are the same constants significant? I doubt it. Wouldn't a spelling/dictionary server provide more efficient and general service? Perhaps it could. In the meantime, the Spelling Tool can serve as a performance yardstick by which to judge a more sophisticated spelling checker, and a user interface setting in which a more sophisticated spelling checker can be presented.
A second advantage of the bit-table representation is that it makes the insertion operation easy to implement, so that the user's own word list can easily be merged with a copy of the standard word list. The insertion code is a simple variant of the lookup code: rather than testing the bits indicated by the hash functions, it turns them on. The table can be over-allocated by a few percent to allow for insertions, and even if the table is not over-allocated, its performance will degrade only slightly when a few more words are added.
A final advantage of this representation is that the lookup and insertion procedures are very easy to implement. The insertion, lookup, and hashing code is 79 lines long and has a trivial control structure. The full implementation of the bit-table abstraction, including all ancillary procedures for making new bit tables and schlepping them into and out of the file system, comes to 223 lines of Cedar.
This representation has several disadvantages. The obvious one is the probability of error in deciding that a word is a member of the word list. A little back-of-the-envelope calculation seems in order: if your 10,000-word document actually contains 100 different misspellings, what are the chances that the Spelling Tool will get a false positive on at least one of those 100 ``words'' and cause you endless embarrassment? Well, 1 - (1 - 0.00092)^100 is about 0.088, so your chances are a pretty good 11 to 1 that the Spelling Tool will let you off. And if you make 100 different spelling mistakes in 10,000 words, I suggest that you cultivate your local proofreader, because you've probably also made a large number of grammatical errors that even a perfect spelling checker wouldn't catch.
The checker's probability of error becomes significant when it is used by the automatic word-correction procedure. The correction procedure finds plausible corrections by fiddling with a misspelled word: adding, deleting, replacing, and swapping characters, and testing whether each modified word is in the word list. It tries a large number of words; for example, in the course of finding corrections to a nine-letter misspelled word, 502 possible corrections will be tried. The probability that the Tool will get a false positive on at least one of these corrections is around 1 - (1 - 0.00092)^502, which is about 0.37. Thus there is better than a 1 in 3 chance that the Tool will screw up and propose a nonsense word as a correction. In practice, the probability of encountering a false positive while correcting a word seems to be closer to 1 in 5. Most false positives are outrageous words, e.g. algkebra as a correction for algebra or clonstant as a correction for constant, so most of them are easy to ignore. Still, because these bogus corrections show up directly in the Spelling Tool's user interface, they are a serious drawback of this representation.
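The edit-generation idea is easy to see in a sketch. Here, again in illustrative Python rather than the Tool's Cedar, is one way to enumerate the single-character deletions, adjacent swaps, replacements, and insertions of a word; in the Tool, each candidate is then looked up in the word list, much as in the earlier sketch. The counts below reproduce the 502 and 0.37 figures quoted above:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def candidates(word):
    # All single-edit variants of a word: deletions, adjacent swaps,
    # replacements by a different letter, and insertions.
    cands = []
    for i in range(len(word)):                        # deletions
        cands.append(word[:i] + word[i+1:])
    for i in range(len(word) - 1):                    # adjacent swaps
        cands.append(word[:i] + word[i+1] + word[i] + word[i+2:])
    for i in range(len(word)):                        # replacements
        for c in ALPHABET:
            if c != word[i]:
                cands.append(word[:i] + c + word[i+1:])
    for i in range(len(word) + 1):                    # insertions
        for c in ALPHABET:
            cands.append(word[:i] + c + word[i:])
    return cands

print(len(candidates("facilties")))        # 502 candidates for a nine-letter word
print(1 - (1 - 0.00092) ** 502)            # ~0.37 chance of at least one false positive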
Another serious disadvantage of this representation is that the text of the word list is not available. Therefore no accurate affix analysis can be done, no information about parts of speech is available, etc. This is a problem for those who want such information.
A minor disadvantage is that this method is slower than most alternatives by a significant constant factor. If one ignores space-usage considerations like working-set size, the bit-table lookup procedure is three or four times slower than, for example, a lookup procedure using an in-core tree representation. This is because the bit-table lookup procedure must compute seven hash functions, incurring a significant overhead for each character, whereas a tree representation simply chases one pointer per character. The Spelling Tool currently scans text at a rate of 2150 words per second on a Dorado; one could probably push this to over 4000 words per second with a faster lookup procedure, but 2150 seems to be adequate for interactive use.