ACFind.mesa
Copyright © 1985 by Xerox Corporation. All rights reserved.
Dave Rumph, June 18, 1986 10:11:32 am PDT
The Aho-Corasick pattern matching algorithm, as described in "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM 18:6 (June 1975). It is very effective in searching repeatedly for a finite number of fixed patterns containing no wildcards. The efficiency of the algorithm is based on building a Finite State Automaton (FSA), which is then run over the input data. The FSA can be either deterministic or non-deterministic; the deterministic version, implemented here, requires more overhead time to setup, but then runs faster.
DIRECTORY
Rope USING [ROPE];
ACFind: CEDAR DEFINITIONS
~ BEGIN
ROPE: TYPE ~ Rope.ROPE;
Building and Using FSA's
Ref: TYPE ~ REF Rep;
ActionProc: TYPE ~ PROC[position: INT, keyFound: ROPE] RETURNS [quit: BOOLFALSE];
Procedure to be called for each key matched. Position is at the end of keyFound. Subtract Rope.Size[keyFound] (a fast operation) to get the beginning.
Create: PROC [keys: LIST OF ROPE, caseSensitive: BOOLFALSE] RETURNS [Ref];
Build the FSA, creating a new instance if required.
Find: PROC [self: Ref, text: ROPE, action: ActionProc] RETURNS [foundAny: BOOLEANFALSE];
Step through the text, calling the ActionProc for each match.
Data Structures
All the global stuff for one instance of a search
MoveRep: TYPE ~ RECORD[SEQUENCE l: NAT OF State];
State: TYPE ~ REF StateRep ← NIL;
StateRep: TYPE ~ RECORD [
keys: LIST OF ROPENIL, -- what keys have been matched, if any?
failure: State ← NIL, -- the next state, by default
offset: CHAR ← '\000, -- offset to first char of the move sequence
move: REF MoveRep ← NIL -- maps characters into States
];
Rep: TYPE ~ RECORD [
initialState: State ← NIL,
caseSensitive: BOOLEANFALSE
];
END.