Re: Regular expression search algorithm

"Aharon Robbins" <>
3 Sep 2002 00:32:34 -0400

          From comp.compilers

Related articles
Regular expression search algorithm (Harry) (2002-07-15)
Re: Regular expression search algorithm (Joachim Durchholz) (2002-07-21)
Re: Regular expression search algorithm (Ralph Corderoy) (2002-07-21)
Re: Regular expression search algorithm (Peter S Tillier) (2002-07-24)
Re: Regular expression search algorithm (Aharon Robbins) (2002-09-03)
regular expression search algorithm (1993-03-18)
Re: regular expression search algorithm (1993-03-19)
| List of all articles for this month |

From: "Aharon Robbins" <>
Newsgroups: comp.compilers
Date: 3 Sep 2002 00:32:34 -0400
Organization: Pioneer Consulting, Ltd.
References: 02-07-036 02-07-069 02-07-102
Keywords: lex
Posted-Date: 03 Sep 2002 00:32:34 EDT

Peter S Tillier <> wrote:
> awk, egrep and others usually use DFAs, which tend to be slower to compile
> and faster to execute than NFAs, but don't support back references.
> gawk uses both - DFA most of the time, but NFA when back references
> are needed.

Gawk doesn't do back references. (Regexes of the form (abc)foo\1 )
They're not in the awk language.

Gawk uses dfa matching only for the cases of:

x ~ y
x !~ y

Other matches, such as split(), match(), and regex-based field and record
splitting, require knowing the start and end position of the match,
which the dfa matcher doesn't provide.

The dfa code is that from GNU grep/egrep, and the NFA code is GNU regex.
(Which I think does NFA stuff deep down under the hood, but as the code
is A Big Ball Of Mud, I'm not sure. I treat it strictly as a black box,
and I loathe the interface.)

Fortunately, there's improvement on the horizon. GNU Libc now has a
written-from-scratch version of the GNU regex routines that attempt to
use a dfa-based algorithm where possible, and which falls back to an
NFA algorithm when not. This code can be gotten from the glibc public
CVS archive at

In the in-testing version of gawk I've removed the dfa code entirely and
now use the new version of regex. (The code in gawk is now cleaner as
a result, which is a pleasant side-effect.) Said version is currently
undergoing the trial-by-fire of portability to different Unix systems,
non-Unix systems, and various compilers. So it's getting closer to
being ready for prime time, but isn't quite there yet.

For what it's worth. :-)

Arnold Robbins
gawk's maintainer
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 928 569 9018
Nof Ayalon Cell Phone: +972 51 297-545
D.N. Shimshon 99785 ISRAEL

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.