Re: regular expressions (bug-report)

henry@zoo.toronto.edu (Henry Spencer)
Thu, 25 Mar 1993 20:21:18 GMT

          From comp.compilers

Related articles
regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14)
Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15)
Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15)
regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16)
Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19)
Re: regular expressions (bug-report) henry@zoo.toronto.edu (1993-03-25)
Re: regular expressions (bug-report) kanze@us-es.sel.de (1993-03-26)
| List of all articles for this month |

Newsgroups: comp.compilers
From: henry@zoo.toronto.edu (Henry Spencer)
Keywords: lex, DFA
Organization: U of Toronto Zoology
References: 93-03-046 93-03-050
Date: Thu, 25 Mar 1993 20:21:18 GMT

mab@wdl39.wdl.loral.com (Mark A Biggar) writes:
>Most of the Unix based RE packages don't use the NFA or DFA methods (a
>notable exception is egrep). They use an interpretive backtracking
>algorithm using a stack of backtrack points. Prime examples of this are
>grep, Henry Spenser's RE package and Perl (which is a heavly modified
>version of Henry's package)...


(Apologies for slow response, I've been busy...)


Actually, what's being referred to as the "NFA method" is simulating a
DFA; done right, it's equivalent to the DFA method except in speed. The
backtracking method is the one that's usually considered to be an NFA
approach.


The major advantage of the backtracking method, apart from smaller tables,
is that it's easy to code.


Something that many people overlook is that the original backtracking
algorithms *did* match the longest possible string, without added fudges.
If you make the first-guess choices right, the first match is the longest.


The gotcha is, this works only with the traditional RE syntax. When you
add | or \<digit>, it becomes possible to sucker the backtracking
algorithm into making a poor choice early on that produces a
less-than-longest match in the end. (I don't believe this can be done
without one of those two constructs or the equivalent, although I'll admit
I don't have a proof.) Many of the implementors didn't notice this;
documentation for things like ed still claims longest match, even though
it uses a backtracking algorithm and implements \<digit>.


Unfortunately, virtually all RE implementations nowadays include | or
\<digit>, so an efficient algorithm really has to use a DFA or simulate
one. You really must explore all possible matches to be sure of finding
the longest, and that can be spectacularly expensive if you do them one at
a time with a hacked backtracking version. The DFA approach, exploring
all of them in parallel, is a practical necessity. (I know of folks who
have done it the other way, but I don't recommend it.)


My old package implements | but explicitly documents that you get first
match rather than longest match.


POSIX.2 REs require longest match, which is really exciting to implement
since the obsolete ("basic") variant also includes \<digit>. I haven't
found a better way of tackling this than doing a preliminary match using a
DFA (or simulation) on a modified RE that just replicates subREs for
\<digit>, and then doing a backtracking match to determine whether the
subRE matches were right. This can be rather slow, but I console myself
with the thought that people who use \<digit> deserve very slow execution.
(Pun unintentional but very appropriate.)
--
Henry Spencer @ U of Toronto Zoology, henry@zoo.toronto.edu utzoo!henry
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.