Fri, 17 Jan 92 20:12:20 GMT

Related articles |
---|

Summary: NFA to DFA, minimize DFA roberto@cernvax.cern.ch (1992-01-14) |

Re: Summary: NFA to DFA, minimize DFA brnstnd@KRAMDEN.ACF.NYU.EDU (1992-01-17) |

Newsgroups: | comp.compilers |

From: | brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein) |

In-Reply-To: | 92-01-052 |

Keywords: | lex |

Organization: | IR |

Date: | Fri, 17 Jan 92 20:12:20 GMT |

I missed the preceding discussion of this issue, and I must confess to

almost complete ignorance of the wide variety of techniques which have

been applied to pattern matching. However, my New Year's resolution was

to write a regexp package, and the result, dynapat 0.90, is available

from wuarchive.wustl.edu:usenet/alt.sources/articles/4566.Z. Some of its

characteristics may be of interest here:

The ``automaton'' is deterministic. You compile any number of patterns

together into a single structure. Then you feed one character after

another into the structure, and after each you can find out which

patterns match the stream of characters so far. (A separate library

provides a more traditional string-matching interface.)

The space used in compiling N patterns of maximum length M and total

complexity T is O(MT), with a very small constant of proportionality.

The space used once all the patterns are compiled is O(T). For

instance, ((abc)*d|e*f)* takes under 80 bytes of dynamic storage after

compilation.

The time used in compiling the patterns is at most O(MT^2), with a

much smaller average case and again a tiny constant of

proportionality. The time used to match each character is O(T) for

patterns satisfying the restrictions of most regexp packages---i.e.,

no closures of an expression which may be empty---though dynapat has

no trouble matching arbitrary patterns. (The utterly inachievable

upper bound is T^2.) The usual behavior is thus O(T) space and O(T)

time per character, which seems a reasonable tradeoff.

I must also admit to never having had the patience to dredge through the

available explanations of regexp matching. My implementation is based on

three trivial observations: 1. pc matches a stream if and only if p

matches the stream-before-now and c matches the latest character. 2.

p(q)* matches the stream if and only if either p or p(q)*q matches the

stream. 3. p(q|r) matches the stream if and only if pq or pr matches the

stream. It is obvious how to use these rules to solve the problem of

computing the match state of a set of patterns against a stream, given

the previous match state and the latest character---simply take the

inductive closure of that problem under the rules. And that's exactly

what dynapat does.

Please note that dynapat 0.90 was created on the same time scale as Man

(on the seventh day I rested and sent it out to the net) and hence

should be expected to have a few bugs. ontmoh!rwh (Russ somebody) has

already caught a bug, namely that patmatch_trashrl frees the same memory

twice. I also plan to introduce a few optimizations so that dynapat can

compete with JAW egrep for typical patterns. But the package is

basically stable, and since I emphasized simplicity and modularity,

I don't anticipate any serious problems.

---Dan

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.