Re: Summary: NFA to DFA, minimize DFA

brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein)
Fri, 17 Jan 92 20:12:20 GMT

          From comp.compilers

Related articles
Summary: NFA to DFA, minimize DFA roberto@cernvax.cern.ch (1992-01-14)
Re: Summary: NFA to DFA, minimize DFA brnstnd@KRAMDEN.ACF.NYU.EDU (1992-01-17)
| List of all articles for this month |
Newsgroups: comp.compilers
From: brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein)
In-Reply-To: 92-01-052
Keywords: lex
Organization: IR
Date: Fri, 17 Jan 92 20:12:20 GMT

I missed the preceding discussion of this issue, and I must confess to
almost complete ignorance of the wide variety of techniques which have
been applied to pattern matching. However, my New Year's resolution was
to write a regexp package, and the result, dynapat 0.90, is available
from wuarchive.wustl.edu:usenet/alt.sources/articles/4566.Z. Some of its
characteristics may be of interest here:


    The ``automaton'' is deterministic. You compile any number of patterns
    together into a single structure. Then you feed one character after
    another into the structure, and after each you can find out which
    patterns match the stream of characters so far. (A separate library
    provides a more traditional string-matching interface.)


    The space used in compiling N patterns of maximum length M and total
    complexity T is O(MT), with a very small constant of proportionality.
    The space used once all the patterns are compiled is O(T). For
    instance, ((abc)*d|e*f)* takes under 80 bytes of dynamic storage after
    compilation.


    The time used in compiling the patterns is at most O(MT^2), with a
    much smaller average case and again a tiny constant of
    proportionality. The time used to match each character is O(T) for
    patterns satisfying the restrictions of most regexp packages---i.e.,
    no closures of an expression which may be empty---though dynapat has
    no trouble matching arbitrary patterns. (The utterly inachievable
    upper bound is T^2.) The usual behavior is thus O(T) space and O(T)
    time per character, which seems a reasonable tradeoff.


I must also admit to never having had the patience to dredge through the
available explanations of regexp matching. My implementation is based on
three trivial observations: 1. pc matches a stream if and only if p
matches the stream-before-now and c matches the latest character. 2.
p(q)* matches the stream if and only if either p or p(q)*q matches the
stream. 3. p(q|r) matches the stream if and only if pq or pr matches the
stream. It is obvious how to use these rules to solve the problem of
computing the match state of a set of patterns against a stream, given
the previous match state and the latest character---simply take the
inductive closure of that problem under the rules. And that's exactly
what dynapat does.


Please note that dynapat 0.90 was created on the same time scale as Man
(on the seventh day I rested and sent it out to the net) and hence
should be expected to have a few bugs. ontmoh!rwh (Russ somebody) has
already caught a bug, namely that patmatch_trashrl frees the same memory
twice. I also plan to introduce a few optimizations so that dynapat can
compete with JAW egrep for typical patterns. But the package is
basically stable, and since I emphasized simplicity and modularity,
I don't anticipate any serious problems.


---Dan


--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.