ACFind.mesa
Copyright © 1985 by Xerox Corporation. All rights reserved.
Dave Rumph, September 30, 1985 5:23:33 pm PDT
The Aho-Corasick pattern matching algorithm, as described in "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM 18:6 (June 1975). It is very effective in searching repeatedly for a finite number of fixed patterns containing no wildcards. The efficiency of the algorithm is based on building a Finite State Automaton (FSA), which is then run over the input data. The FSA can be either deterministic or non-deterministic; the deterministic version, implemented here, requires more overhead time to setup, but then runs faster.
DIRECTORY
Rope USING [ROPE];
ACFind: CEDAR DEFINITIONS
~
BEGIN
ROPE: TYPE ~ Rope.ROPE;
Building and Using FSA's
ActionProc:
TYPE ~
PROC[position:
INT, keyFound:
ROPE]
RETURNS [quit:
BOOL ←
FALSE];
Procedure to be called for each key matched.
Create:
PROC [keys:
LIST
OF
ROPE, caseSensitive:
BOOL ←
FALSE]
RETURNS [Ref];
Build the FSA, creating a new instance if required.
Find:
PROC [self: Ref, text:
ROPE, action: ActionProc]
RETURNS [foundAny:
BOOLEAN ←
FALSE];
Step through the text, calling the ActionProc for each match.
Data Structures
All the global stuff for one instance of a search
MoveRep: TYPE ~ RECORD[SEQUENCE l: NAT OF State];
State: TYPE ~ REF StateRep ← NIL;
StateRep:
TYPE ~
RECORD [
keys: LIST OF ROPE ← NIL, -- what keys have been matched, if any?
failure: State ← NIL, -- the next state, by default
offset: CHAR ← '\000, -- offset to first char of the move sequence
move: REF MoveRep ← NIL -- maps characters into States
];
Rep:
TYPE ~
RECORD [
initialState: State ← NIL,
caseSensitive: BOOLEAN ← FALSE
];
END.