RegularExpressionDoc.tioga
Copyright © 1985 by Xerox Corporation. All rights reserved.
Russ Atkinson (RRA) June 26, 1985 6:25:08 pm PDT
REGULAR EXPRESSION
CEDAR 6.0 — FOR INTERNAL XEROX USE ONLY
Regular Expression
-- a package for parsing and matching regular expressions
Russ Atkinson
© Copyright 1985 Xerox Corporation. All rights reserved.
Abstract: RegularExpression is a package that implements a syntax for specifying a kind of text matching pattern called regular expressions.
Created by: Bob Nix
Maintained by: Russ Atkinson <Atkinson.pa>
Keywords: regular expressions, matching
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

For Internal Xerox Use Only
Overview
The pattern matcher is a package that implements a syntax for specifying a kind of text matching pattern called regular expressions. The simplest sort of regular expression is just a string, like ``hello''. This pattern will match occurrences of the word ``hello'', or substrings of words like ``othello''. It will even match the string ``Hello'', if your ignore-capitalization switch is set.
Special characters
Most characters in a regular expression match themselves, but some characters (one of ``'#[..]~^$*+(|)~\<,:>''), have a special meaning to the pattern matcher. These special characters are used to specify more interesting patterns:
'
Quote
; ignore the special meaning of the following character, and just match that character. ``'#'' matches a sharp, and ``'''' matches an apostrophe. If the ' is followed by a stream of digits, e.g. ``'012'', then the Ascii character with that code is specified.
#
Wildcard; match any character.
[class]
Character class; this is used to match one of a set of characters. ``[abc]'' matches an ``a'', or a ``b'', or a ``c''. The characters ``..'' are used here to specify a range of characters: ``[a..zA..Z0..9'.]'' matches a letter, number, or the character ``.''. If the ignore-capitalization switch is set, then every occurence of an alphabetic character in one case also signifies the corresponding character in the other case, e.g. ``[aBc]'' is the same as ``[aAbBcC]''.
[~class]
Not character class; matches any character except those in the class, e.g. ``[~a..zA..Z0..9]'' matches a non-alphanumeric character.
^
Beginning of node; matches the beginning of a node, or the beginning of a line in line-based searches.
$
End of node; matches the end of a node, or the end of a line in line based searches.
P*
Closure; this matches zero or more occurences of the pattern P. It will match as few P's as it can, e.g. if the pattern is ``#*ab'' then it will match the first ``ab'' in ``ababababab''. If the pattern is ``#*ab$'', then the it will match the whole string ``ababababab'', with the ``#*'' portion of the pattern matching ``abababab''.
P+
Plus; this is just a shorthand for PP*. It will match one or more occurences of the pattern P.
P**
Greedy closure; this matches zero or more occurences of the pattern P, but it starts out by matching as many P's as it can, and backs off on this choice only if the rest of the pattern doesn't match. So, the pattern ``#**ab'' will match the whole string ``ababababab'', unlike ``#*ab'' above.
P++
Greedy plus; a shorthand for PP**.
(P1|P2|...|Pn)
Alternation; this will match either P1, or P2, ..., or Pn. It tries them in that order, and it will switch from Pi to Pi+1 only if the rest of the pattern fails. There may be a single alternative P1, in which case the parentheses act as groupers (a large bottom fish).
P!n
Power; a shorthand for matching n consecutive copies of P.
<field:P>
Fields; this is used to delimit parts of the string being matched so that you can refer to them later. For example, if the pattern is ``^<name:[a-zA-Z]*> *<number:[0-9]*>$'', and the string is ``Fahrenheit 451'', the the first field will match ``Fahrenheit'', and the second field will match ``451''. If the pattern P is not given, then it defaults to ``#*''. Fields are valid only at the top level of a pattern, i.e. they must not occur inside ()'s or as the target of a * or +. The special field name ``ALL'' is reserved, and is always bound to the text matched by the entire pattern.
{ }
Beginning and end; A single pair of matched curly braces may be given at the top syntactic level of the pattern. They delimit the virtual start and end of the text matched by the pattern: e.g. ``match {this}'' will match the text ``this'', but only if it is preceded by the text ``match ''. These curly braces determine the extent of the ``ALL'' field.
A technical note: the characters matched by the pattern matcher must be in the range ASCII '001 to ASCII '177. In particular, ``#'' will match only characters in this range, and the octal character code notation can be used solely to specify characters in this range.
Common patterns
The following commonly used patterns are defined for your typing convenience, to use them, just put ``\<letter>'' in your pattern, e.g. ``\w\b\w\b\d'' will match text that consists of two words followed by a number, all separated by blanks. Capitalization is not significant in the abbreviation, e.g. ``\a'' is the same as ``\A''.
\aAlphanumeric. ``([a..zA..Z0..9]++)''
\bBlanks & whitespace. ``([ '011..'015]++)''
\dDecimal number. ``([0..9]+.[0..9]**|[0..9]*.[0..9]++|[0..9]++)''
\nNewline.  ``('015)''
\qQuoted string. ``("[~"]*"|``[~'']*''''|`[~'']*'')''
\sCharacter stream. ``([~ '011..'015]++)''
\wA word.  ``([a..zA..Z]++)''
\^Control character. ``['001..'037]''
Examples
The abbreviations given above provide good examples of use of the pattern matcher, here are some others that will give you some idea of its power:
This finds all of the ";"'s in a Cedar program that don't have a blank before them:
``[~ ];''
This finds all occurences of one of four sentences, even if they are split across several lines:
``(Is|Isn''t)\bthis\b(funny|stupid).''
This matches a sequence of x's on a line if its length is of the form 5l+7m, for non-negative integers l and m.
``^(xxxxx|xxxxxxx)*$''
This tells if a sequence of binary digits is of even parity, that is, if it contains an even number of 1's:
``^0*(10*10*)*0*$''
This finds all words in a word list that have at least seven vowels in them:
``[aeiou]#*[aeiou]#*[aeiou]#*[aeiou]#*[aeiou]#*[aeiou]#*[aeiou]''
This finds all words in a word list that have five consecutive vowels in them:
``[aeiou][aeiou][aeiou][aeiou][aeiou]''
Syntax
For those who care, this is the actual syntax of the patterns accepted by the pattern matcher. The alternatives for each production are split out on separate lines.
First, the tipity-top level syntax, which allows an optional single pair of matched {}'s:
TT ::=
T
T{T}T  -- Delimits the virtual start of the text matched by <ALL>.
The top level syntax, which allows field delimiters:
T ::=
<name,n:P>T -- Named portions valid only at top level. P and n optional.
P<name,n:P>T -- Force P's to parse as much as they can.
P  -- Or just a pattern.
The pattern matching fragments for full patterns:
P ::=
 null  -- The empty pattern.
A  --
A*P  -- Min-matching closure.
A**P  -- Greedy closure.
A+P  -- AA*
A++P  -- AA**
A!n  -- AAA...AAA (n times)
A production that makes alternation bind tightly:
A ::=
S  -- A single-character pattern.
(P|P|...|P)  -- Alternation.
And finally, the single character patterns:
S ::=
character -- But not one of the special characters.
'special character -- One of the special characters.
'nnn   -- Octal characters; nnn is an octal number.
[character class]  -- Character class notation
[~character class] -- Not the characters in this class.
#  -- Any character.
$  -- Node break.
^  -- Node start.
\x  -- Abbreviation for a predefined pattern.
The numbers that occur in a few of the productions above are specified by a string of octal or decimal digits, ending in a nondigit. If the non-digit is a ``.'', then that is taken to be part of the number, otherwise it is not. For example, ``x!10.2'' will match 10 x's followed by a ``2''; ``x!10..2'' will match 10 x's followed by a single ``.'' and a ``2''; ``x!102'' will match 102 x's; and ``x!10z'' will match 10 x's followed by a z.