The IMCOMPARE text file comparison program:

author:  Michael Sannella
file: {Phylum}<LispUsers>IMCOMPARE. (& .DCOM)
loads in: {Phylum}<Lisp>Library>GRAPHER.DCOM

IMCOMPARE is a rather non-standard text file comparison program which tries to address two problems: (1) detecting certain types of changes, such as a paragraph being moved to a different part of a document; and (2) showing the user what changes have been made in a document.

The text comparison algorithm is an adaptation of the one described in the article "A Technique for Isolating Differences Between Files" by Paul Heckel, in CACM, V21, #4, April 1978.  The main idea is to break each of the two text files into "chunks" (words, lines, paragraphs, ...), hash each chunk into a hash value, and match up chunks with the same hash value in the two files.  This method detects switching two chunks, or moving a chunk anywhere else in the document.
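
As a rough illustration of the matching idea (not the IMCOMPARE source, which is written in Interlisp), the following Python sketch pairs up chunks whose hash value occurs exactly once in each file.  This covers only the unique-chunk matching step; Heckel's paper goes on to extend matches to neighboring chunks, and IMCOMPARE's adaptation may differ in detail.  The names chunk_hash and match_chunks, and the use of CRC-32 as the hash, are assumptions made for the example.

```python
import zlib
from collections import Counter

def chunk_hash(chunk):
    # Hypothetical hash function; any stable hash would do.
    return zlib.crc32(chunk.encode("utf-8"))

def match_chunks(new_chunks, old_chunks):
    """Return (new_index, old_index) pairs for chunks whose hash
    occurs exactly once in each list."""
    new_hashes = [chunk_hash(c) for c in new_chunks]
    old_hashes = [chunk_hash(c) for c in old_chunks]
    new_counts = Counter(new_hashes)
    old_counts = Counter(old_hashes)
    # Position of each hash in the old file (only used for unique hashes).
    old_index = {h: i for i, h in enumerate(old_hashes)}
    pairs = []
    for i, h in enumerate(new_hashes):
        if new_counts[h] == 1 and old_counts[h] == 1:
            pairs.append((i, old_index[h]))
    return pairs

# "c" has moved from the end to the front; "b" and "d" are unmatched.
pairs = match_chunks(["a", "b", "c"], ["c", "a", "d"])
```

Because chunks are matched by hash value rather than by position, a chunk that has moved still finds its partner, which is what lets this approach report moves and swaps.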


Two text files can be compared with the following function:

(IMCOMPARE newFile oldFile hashType graphRegion)

newFile and oldFile are the names of the two files to compare.  The order is not important, except that in the resulting graph the newFile information will appear on the left, and the oldFile information on the right.

hashType determines how "chunks" of text are defined, that is, how fine-grained the comparison will be.  This can be PARA to hash by paragraphs (delimited by two consecutive CRs), LINE to hash by lines (delimited by one CR), or WORD to hash by words (delimited by any white space).  hashType=NIL defaults to PARA.
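
The three chunk granularities can be pictured with a small Python sketch.  This is an illustration only: split_chunks is a hypothetical name, the newline character stands in for CR, and treating more than two consecutive newlines as a single paragraph break is an assumption.

```python
import re

def split_chunks(text, hash_type=None):
    """Split text into chunks; hash_type is "PARA", "LINE" or "WORD"."""
    if hash_type is None:
        hash_type = "PARA"              # mirrors hashType=NIL defaulting to PARA
    if hash_type == "PARA":
        parts = re.split(r"\n{2,}", text)   # two or more consecutive newlines
    elif hash_type == "LINE":
        parts = text.split("\n")            # a single newline
    elif hash_type == "WORD":
        parts = text.split()                # any run of white space
    else:
        raise ValueError("unknown hash type: %r" % (hash_type,))
    # Drop chunks that are empty or contain only white space.
    return [p for p in parts if p.strip()]

sample = "one two\nthree\n\nfour"
paragraphs = split_chunks(sample)           # default is PARA
lines = split_chunks(sample, "LINE")
words = split_chunks(sample, "WORD")
```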

graphRegion is the region on the display screen used for the file comparison graph.  If graphRegion=NIL, the system asks the user to specify a region.  If graphRegion=T, a region in the lower left corner is used.

IMCOMPARE creates a graph with two columns.  Each column contains the file name of one of the files, and lists the chunks from that file.  Each chunk is represented by an atom NNN:MMM, where NNN is the file pointer of the beginning of the chunk within the file, and MMM is the length of the chunk.  Lines are drawn from one column to the other to show which chunks in one file are the same as those in the other file.  Chunks with no lines going to them do not exist in the other file.  [Note: a series of chunks in one file which is the same as a series of chunks in the other file is merged into one big chunk.  A series of unconnected chunks is also merged.]
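
The merging of matched series and the NNN:MMM labels can be sketched as follows.  This is a guess at the bookkeeping, not the actual implementation: chunk_spans, merge_runs and label are hypothetical names, file pointers are assumed to be zero-based character offsets, and a merged chunk is assumed to span the white space between its members.

```python
import re

def chunk_spans(text):
    """Return (start, length) for each white-space-delimited word."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"\S+", text)]

def merge_runs(pairs):
    """Merge (new_index, old_index) pairs that are consecutive in both
    files into runs of the form (new_start, old_start, count)."""
    runs = []
    for i, j in pairs:
        if runs:
            ni, oj, n = runs[-1]
            if ni + n == i and oj + n == j:
                runs[-1] = (ni, oj, n + 1)   # extend the current run
                continue
        runs.append((i, j, 1))
    return runs

def label(spans, first, count):
    """Build an NNN:MMM label covering `count` chunks starting at `first`."""
    start = spans[first][0]
    last_start, last_len = spans[first + count - 1]
    return "%d:%d" % (start, last_start + last_len - start)

spans = chunk_spans("aa bb cc")
runs = merge_runs([(0, 1), (1, 2), (2, 0)])
merged_label = label(spans, 0, 2)
```

Here the first two chunks of the new file match two adjacent chunks of the old file, so they collapse into one node; the third chunk matches elsewhere and stays separate.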

Pressing the LEFT mouse button over one of the chunk nodes causes the node to be boxed, and a Tedit window to be opened on the file, with the appropriate text selected.  If a Tedit window to the file is already active, the selection is simply moved.

Pressing the MIDDLE mouse button over a chunk node raises a pop-up menu with the items: PARA, LINE, and WORD.  If one of these is selected, IMCOMPARE is called to compare the selected chunk with the last selected chunk (the one that is boxed), using the hash type selected.  IMCOMPARE creates a new graph window, using the region occupied by its "parent" graph.

If the mouse is buttoned outside of the PARA/LINE/WORD menu, no comparison is done, but the selected node is boxed.  The PARA/LINE/WORD menu is always brought up a little away from the cursor, so pressing the MIDDLE button twice over a chunk node is a way to change the boxed node without calling Tedit.


Important note:  white space (space, tab, CR, LF) is used to delimit chunks, but is ignored when computing the hash value of a chunk.  Therefore, if two paragraphs are identical except that one has a few extra CRs after it, they will be considered identical by IMCOMPARE.  This is a useful feature for my purposes (comparing documentation files), but may bite someone else --- contact me if it is a problem.
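
The effect described in this note can be illustrated with a short Python sketch.  It assumes that "ignored" means every white space character is skipped before hashing, including white space inside the chunk; whitespace_insensitive_hash is a hypothetical name and CRC-32 is a stand-in for whatever hash IMCOMPARE actually uses.

```python
import zlib

def whitespace_insensitive_hash(chunk):
    # Remove all white space (space, tab, CR, LF) before hashing.
    stripped = "".join(chunk.split())
    return zlib.crc32(stripped.encode("utf-8"))

a = whitespace_insensitive_hash("Hello world\n\n")
b = whitespace_insensitive_hash("Hello  world")
c = whitespace_insensitive_hash("Hello there")
same = (a == b)        # extra CRs and spaces do not change the hash
different = (a != c)   # a change in the text does
```

Under this assumption, chunks that differ only in spacing or trailing blank lines are reported as identical, which is the behavior the note warns about.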


Known Bugs:

(1)  The first time a Tedit window is brought up, the selection is not underlined.  Pressing the LEFT mouse button over the chunk node again seems to do it correctly.  This is a Tedit bug.

(2)  When the MIDDLE button is used to run IMCOMPARE on two chunks while the files are open in Tedit (a natural thing to do), the files may be left open.  Tedit does not lose track of the file (it is smart enough to re-open it after IMCOMPARE closes it), but for some reason the file is still open after you later Quit from Tedit.  I think this may be a bug in Tedit.  You may want to check (OPENP) occasionally.