|Summary: NFA to DFA, minimize DFA email@example.com (1992-01-14)|
|Re: Summary: NFA to DFA, minimize DFA brnstnd@KRAMDEN.ACF.NYU.EDU (1992-01-17)|
|From:||brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein)|
|Date:||Fri, 17 Jan 92 20:12:20 GMT|
I missed the preceding discussion of this issue, and I must confess to
almost complete ignorance of the wide variety of techniques which have
been applied to pattern matching. However, my New Year's resolution was
to write a regexp package, and the result, dynapat 0.90, is available
from wuarchive.wustl.edu:usenet/alt.sources/articles/4566.Z. Some of its
characteristics may be of interest here:
The ``automaton'' is deterministic. You compile any number of patterns
together into a single structure. Then you feed one character after
another into the structure, and after each you can find out which
patterns match the stream of characters so far. (A separate library
provides a more traditional string-matching interface.)
The space used in compiling N patterns of maximum length M and total
complexity T is O(MT), with a very small constant of proportionality.
The space used once all the patterns are compiled is O(T). For
instance, ((abc)*d|e*f)* takes under 80 bytes of dynamic storage after
The time used in compiling the patterns is at most O(MT^2), with a
much smaller average case and again a tiny constant of
proportionality. The time used to match each character is O(T) for
patterns satisfying the restrictions of most regexp packages---i.e.,
no closures of an expression which may be empty---though dynapat has
no trouble matching arbitrary patterns. (The utterly inachievable
upper bound is T^2.) The usual behavior is thus O(T) space and O(T)
time per character, which seems a reasonable tradeoff.
I must also admit to never having had the patience to dredge through the
available explanations of regexp matching. My implementation is based on
three trivial observations: 1. pc matches a stream if and only if p
matches the stream-before-now and c matches the latest character. 2.
p(q)* matches the stream if and only if either p or p(q)*q matches the
stream. 3. p(q|r) matches the stream if and only if pq or pr matches the
stream. It is obvious how to use these rules to solve the problem of
computing the match state of a set of patterns against a stream, given
the previous match state and the latest character---simply take the
inductive closure of that problem under the rules. And that's exactly
what dynapat does.
Please note that dynapat 0.90 was created on the same time scale as Man
(on the seventh day I rested and sent it out to the net) and hence
should be expected to have a few bugs. ontmoh!rwh (Russ somebody) has
already caught a bug, namely that patmatch_trashrl frees the same memory
twice. I also plan to introduce a few optimizations so that dynapat can
compete with JAW egrep for typical patterns. But the package is
basically stable, and since I emphasized simplicity and modularity,
I don't anticipate any serious problems.
Return to the
Search the comp.compilers archives again.