COMPARETEXT

By: Michael Sannella (Sannella.pa@Xerox)

Uses: TEDIT.DCOM, GRAPHER.DCOM

INTRODUCTION

COMPARETEXT is a rather non-standard text file comparison program which tries to address two problems: (1) the problem of detecting certain types of changes, such as detecting when a paragraph is moved to a different part of a document; and (2) the problem of showing the user what changes have been made in a document.

The text comparison algorithm is an adaptation of the one described in the article "A Technique for Isolating Differences Between Files" by Paul Heckel, in CACM, V21, #4, April 1978. The main idea is to break each of the two text files into "chunks" (words, lines, paragraphs, ...), hash each chunk into a hash value, and match up chunks with the same hash value in the two files. This method detects when two chunks are switched, or when a chunk is moved anywhere else in the document.

COMPARING TEXT FILES

Two text files can be compared with the following function:

(COMPARETEXT NEWFILENAME OLDFILENAME HASH.TYPE GRAPH.REGION)   [Function]

NEWFILENAME and OLDFILENAME are the names of the two files to compare. The order is not important, except that in the resulting graph the NEWFILENAME information will appear on the left, and the OLDFILENAME information on the right.

HASH.TYPE determines how "chunks" of text are defined, that is, how fine-grained the comparison will be. This can be PARA to hash by paragraphs (delimited by two consecutive CRs), LINE to hash by lines (delimited by one CR), or WORD to hash by words (delimited by any white space). HASH.TYPE=NIL defaults to PARA.

GRAPH.REGION is the region on the display screen used for the file comparison graph. If GRAPH.REGION=NIL, the system asks the user to specify a region. If GRAPH.REGION=T, a region in the lower left corner is used.

COMPARETEXT creates a graph with two columns. Each column contains the file name of one of the files, and lists the chunks from that file.
Each chunk is represented by an atom NNN:MMM, where NNN is the file pointer of the beginning of the chunk within the file, and MMM is the length of the chunk. Lines are drawn from one column to the other to show which chunks in one file are the same as those in the other file. Chunks with no lines going to them do not exist in the other file. [Note: a series of chunks in one file which are the same as a series of chunks in the other file are merged into one big chunk. A series of unconnected chunks is also merged.]

Pressing the LEFT mouse button over one of the chunk nodes causes the node to be boxed, and a TEdit window to be opened on the file, with the appropriate text selected. If a TEdit window to the file is already active, the selection is simply moved.

Pressing the MIDDLE mouse button over a chunk node raises a pop-up menu with the items PARA, LINE, and WORD. If one of these is selected, COMPARETEXT is called to compare the selected chunk with the last selected chunk (the one that is boxed), using the hash type selected, and a new graph window is created. If the mouse is buttoned outside of the PARA/LINE/WORD menu, no comparison is done, but the selected node is boxed. The PARA/LINE/WORD menu is always brought up a little away from the cursor, so pressing the MIDDLE button twice over a chunk node is a way to change the boxed node without calling TEdit.

Important note: white space (space, tab, CR, LF) is used to delimit chunks, but is ignored when computing the hash value of a chunk. Therefore, if two paragraphs are identical except that one has a few extra CRs after it, they will be considered identical by COMPARETEXT.
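
The chunk-and-match idea described above can be illustrated with a short sketch. This is not the COMPARETEXT implementation (which is written in Interlisp); it is a simplified Python illustration under several assumptions: a CR is approximated by "\n", positions are character offsets rather than file pointers, the whitespace-stripped chunk text stands in for the hash value, and only chunks that occur exactly once in each text are linked. The names DELIMITERS, chunks, chunk_key, label, and match_chunks are hypothetical, and the merging of adjacent chunks is not attempted.

```python
import re
from collections import defaultdict

# Hypothetical delimiter patterns modeled on the PARA / LINE / WORD hash types.
DELIMITERS = {
    "PARA": re.compile(r"\n{2,}"),  # two or more consecutive line breaks
    "LINE": re.compile(r"\n+"),     # one or more line breaks
    "WORD": re.compile(r"\s+"),     # any run of white space
}

def chunks(text, hash_type="PARA"):
    """Return a list of (start, length) pairs, one per chunk of text."""
    result = []
    pos = 0
    for m in DELIMITERS[hash_type].finditer(text):
        if m.start() > pos:
            result.append((pos, m.start() - pos))
        pos = m.end()
    if pos < len(text):
        result.append((pos, len(text) - pos))
    return result

def chunk_key(text, chunk):
    """Chunk text with all white space removed (stands in for the hash value)."""
    start, length = chunk
    return "".join(text[start:start + length].split())

def label(chunk):
    """Format a chunk as NNN:MMM (start position, then length)."""
    return f"{chunk[0]}:{chunk[1]}"

def match_chunks(new_text, old_text, hash_type="PARA"):
    """Link chunks whose key occurs exactly once in each of the two texts."""
    new_chunks = chunks(new_text, hash_type)
    old_chunks = chunks(old_text, hash_type)
    seen = defaultdict(lambda: ([], []))
    for i, c in enumerate(new_chunks):
        seen[chunk_key(new_text, c)][0].append(i)
    for j, c in enumerate(old_chunks):
        seen[chunk_key(old_text, c)][1].append(j)
    links = sorted(
        (new_idx[0], old_idx[0])
        for new_idx, old_idx in seen.values()
        if len(new_idx) == 1 and len(old_idx) == 1
    )
    return new_chunks, old_chunks, links

# Two paragraphs are swapped, and one paragraph differs between the texts.
new_text = "alpha\n\nbeta\n\ngamma"
old_text = "beta\n\n\nalpha\n\ndelta"
new_chunks, old_chunks, links = match_chunks(new_text, old_text, "PARA")
for i, j in links:
    print(label(new_chunks[i]), "<->", label(old_chunks[j]))
# prints:
# 0:5 <-> 7:5
# 7:4 <-> 0:4
```

In this example the swapped paragraphs are linked even though their positions differ and the old text has an extra line break between them, while "gamma" and "delta" are left unlinked because neither appears in the other text.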