KeyNoteDoc.tioga
Jack Kent January 4, 1988 7:27:26 pm PST
KeyNote
CEDAR 7.0 — FOR INTERNAL XEROX USE ONLY
KeyNote
Jack Kent
© Copyright 1987 Xerox Corporation. All rights reserved.
key|note n. An underlying or general tone, spirit, or idea: "Liveliness is the keynote of all Dunbar's work." (Tucker Brooke).
Abstract: Traditional database management systems are well suited for handling the formatted portions of homogeneous documents (such as mail headers or abstract headers). Unformatted free text requires an alternative browsing solution. KeyNote is a package for easy, fast and intuitive full-text browsing. Easy means that building a KeyNote database on an arbitrary collection of files is as simple as specifying a file pattern. Fast means that queries execute quickly (e.g. a five-word query against the collection of files known as the CSL-Notebook took on the order of five to ten seconds). Intuitive means that documents are presented to the user in order of the likelihood that they satisfy the query.
Created by: Jack Kent
Maintained by: <kent.pa>
Keywords: keys, files, matching, queries
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

1. Introduction
Traditional database management systems are well suited for handling the formatted portions of homogeneous documents (such as mail headers or abstract headers). Unformatted free text requires an alternative browsing solution. KeyNote is one such solution.
KeyNote is a package that enables easy, fast and intuitive full-text browsing. Easy means that building a KeyNote database on an arbitrary collection of files is as simple as specifying a file pattern. Fast means that queries execute quickly (e.g. a five-word query against the collection of files known as the CSL-Notebook took on the order of five to ten seconds). Intuitive means that documents are presented to the user in order of the likelihood that they satisfy the query.
2. How to Use it
First, type "Install KeyNote" to a command tool. This registers the three KeyNote command tool operations described below.
KeyStop stopListCutOff pattern stopListFileName
This command examines all the tokens (words) in the files matching pattern. The stopListCutOff most frequently occurring tokens are then stored in the file named by stopListFileName. The stopList is used by the operation that builds KeyNote databases.
Example
% KeyStop 100 /indigo/csl-notebook/entries/*tioga!H ///Users/johnDoe.pa/csl-notebook/stopList.txt
Builds a stopList consisting of the 100 most frequently occurring words in the Tioga csl-notebook entries.
Warning: stopListFileName must be a local file.
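The stop-list construction described above can be sketched in Python as follows. This is only an illustration, not KeyNote's Cedar implementation: the tokenization rule (lowercased alphanumeric runs), the function names, and the one-token-per-line file format are all assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed token rule: runs of letters and digits. KeyNote's real rule is not documented here.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed rule)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_stop_list(texts, cutoff):
    """Return the `cutoff` most frequent tokens across all the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [token for token, _ in counts.most_common(cutoff)]


def write_stop_list(directory, pattern, cutoff, out_path):
    """Hypothetical file-based wrapper: writes one token per line to out_path."""
    paths = sorted(Path(directory).glob(pattern))
    texts = (p.read_text(errors="ignore") for p in paths)
    Path(out_path).write_text("\n".join(build_stop_list(texts, cutoff)) + "\n")
```

For instance, build_stop_list(["the cat and the dog", "the bird and the fish"], 2) yields the two most frequent tokens, "the" and "and".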
KeyBD databaseName pattern stopListFileName
This command builds the KeyNote database named by databaseName from the files matching pattern. The stop list (specified by stopListFileName) is used to keep the database down to a manageable size.
Example
% KeyBD /ebbetts.alpine/Doc/Doc.segment [Cedar]<CedarChest7.0>Documentation>*.tioga /ivy/johnDoe.pa/StopForDoc.txt
Builds a KeyNote database (specified by /ebbetts.alpine/Doc/Doc.segment) on the Cedar chest documentation using the stopList /ivy/johnDoe.pa/StopForDoc.txt (presumably the stopList was built from [Cedar]<CedarChest7.0>Documentation>*.tioga or some representative fraction of it).
Warning: databaseName must be an alpine fileName.
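To show how a stop list keeps a database small, here is a minimal Python sketch of an in-memory index that skips stop-listed tokens. The actual KeyNote database is an Alpine file whose layout is not described in this document, so the data structure, the tokenization rule and the name build_index are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed rule)."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(documents, stop_list):
    """Index `documents` (file name -> text), ignoring stop-listed tokens.

    Returns (postings, lengths), where postings maps
    token -> {file name: frequency} and lengths maps
    file name -> total token count (stop words included).
    """
    stop = set(stop_list)
    postings = defaultdict(dict)
    lengths = {}
    for name, text in documents.items():
        tokens = tokenize(text)
        lengths[name] = len(tokens)
        for token, freq in Counter(tokens).items():
            if token not in stop:  # stop words never enter the index
                postings[token][name] = freq
    return dict(postings), lengths
```

Because the most frequent tokens account for a large share of all occurrences, leaving them out removes the largest entries from such an index.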
KeyWM {-twq} databaseName (token)+
This command performs a KeyNote query on databaseName, using the list of tokens given at the end of the command. Results (fileNames) are presented in a single column (in the command tool output stream), in order of the likelihood that they satisfy the query. The number displayed immediately after each fileName is the fileName weight; it represents how well the file satisfied the query. The tokens displayed after the fileName weight are the tokens used in the query. The first number following a token is the frequency with which the token occurs in the file. The second number is the token's contribution to the fileName weight.
The switches operate as follows: -t suppresses the token display; -w suppresses the fileName weight display; -q threshold displays only files whose weight is greater than threshold times the weight of the best match (the default threshold is 0.1).
Example
% KeyWM [ebbetts.alpine]<keynote>CSL-Notebook.segment database browser
[NoteBook]81CSLN-0039.tioga!1 377.9 database 49 55.45 browser 15 322.5
[NoteBook]87CSLN-0022.tioga!1 47.77 database 9 47.77
(1) Pseudo-server names are substituted (whenever possible) for fileName prefixes. Because the user (me) had specified that NoteBook maps to the path "[indigo]<CSL-Notebook>entries>", the substitution was performed.
(2) The files are presented in order of the likelihood that they satisfy the query, i.e. by the file's weight with respect to the query, which is the number following the fileName. (For example, the weight of file "[NoteBook]81CSLN-0039.tioga!1" with respect to the query tokens "database" and "browser" is 377.9.) The higher the number, the better the fit. For some qualitative intuition: the weighting scheme was designed with three things in mind:
(a) The less frequently a word W occurs within the document universe, the higher the weight given to documents that contain W.
(b) The more frequently a word W occurs within a document, the higher the weight given to that document.
(c) Weights should be normalized according to the length of a file.
(3) The component keyword weights (and frequencies) are shown following the file's weight with respect to the query. For example, for file "[NoteBook]81CSLN-0039.tioga!1", "database" contributed 55.45 and "browser" contributed 322.5 of the file's 377.9 weight (the contributions sum to the file's weight, up to display rounding). As for frequency, "database" occurred 49 times and "browser" occurred 15 times in the file.
(4) Note that the last (second) file is displayed because its weight, 47.77, is greater than the default threshold of 0.10 times the weight of the first file (0.10 * 377.9 = 37.79).
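The three weighting criteria in note (2) and the -q threshold rule can be illustrated with a small Python sketch. The formula below (a logarithmic rarity factor times frequency divided by file length) is an assumed TF-IDF-style stand-in chosen only because it satisfies (a), (b) and (c); it is not KeyNote's actual formula, which this document does not give, and it will not reproduce the weights shown in the example. The names token_weight and rank, and the postings/lengths structures, are hypothetical.

```python
import math


def token_weight(freq_in_file, file_length, files_with_token, total_files):
    """Contribution of one token to one file's weight (illustrative formula)."""
    if freq_in_file == 0 or files_with_token == 0:
        return 0.0
    rarity = math.log(total_files / files_with_token) + 1.0  # (a) rarer tokens weigh more
    density = freq_in_file / file_length  # (b) frequency, (c) normalized by length
    return rarity * density


def rank(postings, lengths, query_tokens, threshold=0.1):
    """Return (file name, weight, {token: (frequency, contribution)}) tuples.

    postings maps token -> {file name: frequency}; lengths maps
    file name -> token count. Only files whose weight exceeds
    threshold * (best weight) are returned, best match first.
    """
    total_files = len(lengths)
    results = []
    for name, length in lengths.items():
        parts = {}
        for token in query_tokens:
            files = postings.get(token, {})
            freq = files.get(name, 0)
            if freq:
                weight = token_weight(freq, length, len(files), total_files)
                parts[token] = (freq, weight)
        if parts:
            results.append((name, sum(w for _, w in parts.values()), parts))
    results.sort(key=lambda r: r[1], reverse=True)
    if not results:
        return []
    cutoff = results[0][1] * threshold
    return [r for r in results if r[1] > cutoff]
```

Each returned tuple mirrors one line of KeyWM output: the file name, its weight, and for every matched token its frequency and its contribution to the weight.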
3. Public KeyNote Databases
There are now two public keynote databases.
CSL-Notebook:    /ebbetts.alpine/keynote/CSL-Notebook.segment
CedarChest Documentation:   /ebbetts.alpine/keynote/Doc.segment   
4. Miscellaneous
Due to Cypress limitations (one open database of a given type per Cypress instance), you can run only one KeyNote operation at a time. Sorry.