Related articles |
---|
Regular expression search algorithm harvinder.singh@patni.com (Harry) (2002-07-15) |
Re: Regular expression search algorithm joachim_d@gmx.de (Joachim Durchholz) (2002-07-21) |
Re: Regular expression search algorithm ralph@inputplus.co.uk (Ralph Corderoy) (2002-07-21) |
Re: Regular expression search algorithm peter.tillier@btinternet.com (Peter S Tillier) (2002-07-24) |
Re: Regular expression search algorithm arnold@skeeve.com (Aharon Robbins) (2002-09-03) |
regular expression search algorithm forsyth@minster.york.ac.uk (1993-03-18) |
Re: regular expression search algorithm pardo@cs.washington.edu (1993-03-19) |
From: | "Aharon Robbins" <arnold@skeeve.com> |
Newsgroups: | comp.compilers |
Date: | 3 Sep 2002 00:32:34 -0400 |
Organization: | Pioneer Consulting, Ltd. |
References: | 02-07-036 02-07-069 02-07-102 |
Keywords: | lex |
Posted-Date: | 03 Sep 2002 00:32:34 EDT |
Peter S Tillier <peter.tillier@btinternet.com> wrote:
> awk, egrep and others usually use DFAs, which tend to be slower to compile
> and faster to execute than NFAs, but don't support back references.
> gawk uses both - DFA most of the time, but NFA when back references
> are needed.
Gawk doesn't do back references. (Regexes of the form (abc)foo\1 )
They're not in the awk language.
Gawk uses dfa matching only for the cases of:
x ~ y
x !~ y
Other matches, such as split(), match(), and regex-based field and record
splitting, require knowing the start and end position of the match,
which the dfa matcher doesn't provide.
The dfa code is that from GNU grep/egrep, and the NFA code is GNU regex.
(Which I think does NFA stuff deep down under the hood, but as the code
is A Big Ball Of Mud, I'm not sure. I treat it strictly as a black box,
and I loathe the interface.)
Fortunately, there's improvement on the horizon. GNU Libc now has a
written-from-scratch version of the GNU regex routines that attempt to
use a dfa-based algorithm where possible, and which falls back to an
NFA algorithm when not. This code can be gotten from the glibc public
CVS archive at sources.redhat.com.
In the in-testing version of gawk I've removed the dfa code entirely and
now use the new version of regex. (The code in gawk is now cleaner as
a result, which is a pleasant side-effect.) Said version is currently
undergoing the trial-by-fire of portability to different Unix systems,
non-Unix systems, and various compilers. So it's getting closer to
being ready for prime time, but isn't quite there yet.
For what it's worth. :-)
Arnold Robbins
gawk's maintainer
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold@skeeve.com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 928 569 9018
Nof Ayalon Cell Phone: +972 51 297-545
D.N. Shimshon 99785 ISRAEL
Return to the
comp.compilers page.
Search the
comp.compilers archives again.