Re: --- Regexps, compilers etc., prog help ---

Nick Kramer <>
20 Feb 1997 00:31:24 -0500

          From comp.compilers

Related articles
--- Regexps, compilers etc., prog help --- (1997-02-11)
Re: --- Regexps, compilers etc., prog help --- (Quinn Tyler Jackson) (1997-02-16)
Re: --- Regexps, compilers etc., prog help --- (Paul David Fox) (1997-02-16)
Re: --- Regexps, compilers etc., prog help --- (1997-02-20)
Re: --- Regexps, compilers etc., prog help --- (Nick Kramer) (1997-02-20)
Re: --- Regexps, compilers etc., prog help --- (1997-02-22)
Re: --- Regexps, compilers etc., prog help --- (1997-03-14)
Re: --- Regexps, compilers etc., prog help --- (1997-03-16)
Re: --- Regexps, compilers etc., prog help --- (Henry Spencer) (1997-03-21)
Re: --- Regexps, compilers etc., prog help --- (Henry Spencer) (1997-03-21)
| List of all articles for this month |

From: Nick Kramer <>
Newsgroups: comp.compilers
Date: 20 Feb 1997 00:31:24 -0500
Organization: School of Computer Science, Carnegie Mellon
References: 97-02-072
Keywords: lex

Hursh Jain <> wrote:
>I am interested in writing a regular expression evaluator, and also a
>toy compiler..
>[I'd sit down with the Dragon Book and read the chapter on regular
>expresssions and finite state machines. In this case, an ounce of
>theory is worth several hundred pounds of hacking, since it's such a
>thoroughly studied field. -John]

I agree with our moderator that you should get a good understanding of
the theory behind regular expressions; the Dragon Book is as good as
any a place for this. However, if you actually want to write a
regular expression evaluator, you need a bit more than theory.

Basically, there are four representations of a regexp: The string
representation, the parse-tree representation, the non-deterministic
finite automaton (NFA) representation, and the deterministic finite
automaton (DFA) representation. Most theory books provide algorithms
for converting parse trees to NFAs and NFAs to DFAs, and algorithms to
interpret both NFAs and DFAs. The string to parse tree problem is
generally left as an exercise to the reader. (If you're familiar with
compilers, it isn't hard, but if you're not, coding this piece will
probably teach you as much about compilers as the rest of the regexp

The "obvious" way to implement a regexp engine, then, is to use one of
the interpreter algorithms you get out of the theory book. In my
experience, this is a complete dead end. All of the regexp packages
I'm familiar with work straight from the parse tree. Reason is, while
the NFA/DFA stuff works fine for no-frills regular expressions, it
breaks badly when you start adding frills. The biggest problem I had
was when I wanted more than a yes or no answer to my regexp queries --
I wanted to know which substrings matched which pieces of my regexp.
I struggled mightily and unsuccessfully to get Perl semantics into an
NFA-based algorithm; eventually I switched to a parse-tree based
algorithm and never looked back. (Back-references also pose problems
for automaton-based algorithms. After all, if your regexp language
supports back-references, you're no longer dealing with true regular

On a more down to earth note, I'd recommend using an existing regexp
package if at all possible. As it relates to compilers, knowing how
to use regexps is far more important than knowing how to write a
regexp package.

(Caveat: When I talk of regexp packages, I mean something like Perl or
Henry Spencer's regexp package. Some programs (like lex) that deal
with regular expressions do have a lot of use for finite automata,
because they are dealing with true regular expressions and because
they have lots of time to convert NFAs into DFAs.)

So, does anyone have pointers to any papers which elaborate on what I
just wrote?

-Nick Kramer

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.