Re: regular expressions (bug-report)

pardo@cs.washington.edu (David Keppel)
Fri, 19 Mar 1993 19:09:38 GMT

          From comp.compilers

Related articles
regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14)
Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15)
Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15)
regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16)
Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19)
Re: regular expressions wendt@CS.ColoState.EDU (1993-03-22)
Re: regular expressions (bug-report) henry@zoo.toronto.edu (1993-03-25)
Re: regular expressions (bug-report) kanze@us-es.sel.de (1993-03-26)
| List of all articles for this month |
Newsgroups: comp.compilers
From: pardo@cs.washington.edu (David Keppel)
Keywords: lex, DFA
Organization: Computer Science & Engineering, U. of Washington, Seattle
References: 93-03-046 93-03-050
Date: Fri, 19 Mar 1993 19:09:38 GMT

>megatest!plethorax!djones@uu4.psi.com (Dave Jones) writes:
>[This textbook regexp matcher has a bug...]


mab@wdl39.wdl.loral.com (Mark A Biggar) writes:
>[Most Unix-based RE pkgs interpretive backtracking.]


An NFA package can, in principal, have at most as many states in the queue
as there are total states in the automaton. For an NFA, this number is
"reasonable" (e.g., linear in the size of the regular expression).


A DFA RE algorithm can represent one DFA state as a list of the
correspnding NFA states. A straightforward construction of the DFA from
the NFA, however, yields a huge DFA (the number of DFA states is, I think,
exponential in the number of NFA states).


`egrep', (and, in principle, `gnu?egrep') use a DFA-based mechanism. Al
Aho made several important observations. One is that you only *need* to
keep one DFA state at a time. On a transition, the next state can be
computed dynamically from the current state and a simulation of the
correspnding NFA states running on the NFA automaton. A second
observation is that the DFA can keep a cache of recently used DFA states
so that new states only need to be generated "rarely". A third
observation is that checking for a cache hit can be made cheap.


In principal, the running time of Aho's algorithm can be bad, since every
transition can require construction of a new DFA state. In practice, the
DFA traverses a limited number of states, and a small cache does a good
job. Combined with a Boyer-Moore prematcher, the cache mostly suffers
capacity misses (for common search patterns).


A related tool is `agrep', Udi Mamber's approximate string matching
algorithm. I don't know the details of the algorithm, but it does several
clever things very well. Source via anonymous ftp from sites that carry
linux, also (according to `archie') in


`cs.dal.ca:/pub/comp.archives/agrep',
`files1zrz.zrz.tu-berlin.de:/pub/unix/tools/agrep',
`gatekeeper.dec.com:/contrib/src/pa/agrep',
`harpo.seas.ucla.edu:/mnt/fs01/agrep',


and, probably, lots of other places.


;-D on ( Unleaded expressions ) Pardo
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.