# A tree-based regular expression matching algorithm

## cmcconne@reed.edu (Carl McConnell)Sun, 12 Mar 1995 19:52:29 GMT

From comp.compilers

Related articles
A tree-based regular expression matching algorithm cmcconne@reed.edu (1995-03-12)
| List of all articles for this month |

 Newsgroups: comp.compilers From: cmcconne@reed.edu (Carl McConnell) Keywords: AST, DFA Organization: Reed College, Portland, Oregon Date: Sun, 12 Mar 1995 19:52:29 GMT

I recently came up with a regular expression matching algorithm that I
thought others might find interesting. It uses the abstract syntax
tree (AST) for the regular expression rather than an NFA. It is
essentially an AST-based analog of the traditional NFA simulation
algorithm. The running time of this algorithm is the same as for NFA
simulation: O(rs), where r is the length of the regular expression and
s is the length of the string being matched. However, its space
consumption is worse than that of NFA simulation: O(rs), compared to
O(r) for NFA simulation. Furthermore, the AST-based algorithm requires
the ability to seek back over the input, rather than being able to
examine each character just once. Thus, the AST algorithm is
basically inferior to NFA simulation. However, I think it's
interesting conceptually.

If anyone has seen it published somewhere, I'd be interested in
hearing about it.

Carl McConnell

=== algorithm description begins ===

The intuition behind the algorithm is that regular expressions are
somewhat analogous to the usual control structures. For example,
the regular expression R* vaguely resembles the loop
"while match(R) do nothing", where "match(R)" is an as-yet undefined
function that can match a regular expression against the input.
Likewise, the regular expression R|S resembles the conditional
"match(R) or match(S)", and RS the conditional "match(R) and
match(S)". Thus, the regular expression a*ab resembles
the expression "(while match(a) do nothing) and match(a) and match(b)",
where the loop returns a boolean. (This analogy is just meant to be
suggestive, as it obviously breaks down pretty fast.)

Of course, the problem is how to account for the non-determinism.
Continuing with the previous regular expression, in matching against
the string aaab, the "loop" for a* mustn't read all the a's, or else
match(a) will fail when it could have succeeded. To get around this,
we define match() to return a set of positions in the input stream at
which subsequent matching may begin. (The drawback here is that the
number of positions is unbounded; instead of a finite state machine,
we essentially have an infinite data machine.) As arguments,
match(R, stream, positions) takes a regular expression, an input
stream, and a set of positions in that input stream where matching may
begin. (We'll assume the first position in the input stream is 1.)

This leads to the definition for match() given below. The invocation
is match(R, stream, {1}), and the result is the set of positions for
which the input stream matches R: if p is a position, then the
contents of the input stream from 1 to p-1 match R. We can find the
usually desired longest match by using the maximum element of the
result. If the result is empty, there were no matches.

match(R|S, stream, positions) =
match(R, stream, positions) | match(S, stream, positions)
match(RS, stream, positions) =
match(R, stream, positions) & match(S, stream, positions)
match(R*, stream, positions) =
closure(R, stream, positions)
match(c, stream, positions) =
matchCharacter(c, stream, positions)

where "|" is set union and "&" set intersection, and the auxiliary functions
are

closure(R, stream, positions)
newPositions := match(R, stream, positions);
if (newPositions = positions)
return positions;
return closure(R, stream, positions | newPositions);

matchCharacter(c, stream, positions)
newPositions := {};
for each position p in positions do
if (c = stream[p])
newPositions := newPositions | {p+1};
return newPositions;
--

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.