Re: tips for writing regular expression interpreter/compiler?

markwh04@yahoo.com
23 Dec 2005 18:02:29 -0500

          From comp.compilers

Related articles
tips for writing regular expression interpreter/compiler? derekhaas@gmail.com (Derek Haas) (2005-11-26)
Re: tips for writing regular expression interpreter/compiler? jburgy@gmail.com (2005-11-30)
Re: tips for writing regular expression interpreter/compiler? rsc@swtch.com (Russ Cox) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? jeffrey.kenton@comcast.net (Jeff Kenton) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? mefrill@yandex.ru (mefrill) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? markwh04@yahoo.com (2005-12-23)
Re: tips for writing regular expression interpreter/compiler? markwh04@yahoo.com (2005-12-23)
Re: tips for writing regular expression interpreter/compiler? rsc@swtch.com (Russ Cox) (2005-12-23)
| List of all articles for this month |
From: markwh04@yahoo.com
Newsgroups: comp.compilers
Date: 23 Dec 2005 18:02:29 -0500
Organization: http://groups.google.com
References: 05-11-119
Keywords: lex
Posted-Date: 23 Dec 2005 18:02:29 EST

Derek Haas wrote:
> I am trying to implement a small subset of the regular expression
> languages found in perl/python/et cetera (maybe just Kleene closure
> and alternation for now).


There should still be the demo software for RE->FA in the
comp.compilers archive. Along with each demo (each about 400-500 lines
of programming) are texts that provide a general description of the
algorithms used to make the conversions.


The dfa and nfa programs perform conversions as above. The rex program
is grep, souped up to handle general boolean combinations (A intersect
B, A - B) and interleave products (e.g. x^y^z = x(y^z) | y(z^x) |
z(x^y); x^y = xy | yx).


The simplest approach: just represent a regular expression as you would
any operator expression, (with the operators being A*, A+, A?, A|B, AB
and constants 0, 1 and S for character sets S); and perform operations
by doing algebraic computations on these expressions (that's how nfa,
dfa and rex work; rex does computations on a need-to-use basis while
reading through input, which is the same strategy most grep
implementations take).


What NFA does is reduce an expression to normal form:
                                                      0, 1, S normal form
                                                    A* -> 1 | A+ = normal form
                                                    A+ -> A A* -> ...
                                                    A? -> 1 | A = normal form
                                                    A | B = normal form
                                                    0A -> 0 = normal form
                                                    1A -> A -> ...
                                                    xA = normal form
                                                    A* B -> B | A+ B = normal form
                                                    A+ B -> A (A* B) -> ...
                                                    A? B -> B | A B = normal form
                                                    (A|B) C -> AC | BC = normal form
                                                    (AB) C -> A (BC) -> ...
Denoting the normal form of an expression A by NF(A), then when it
proceeds to do is compute sets by the following:
                                                    <0> = {}; <1> = {1}; <S> = {S 1}
                                                    <A | B> = <A> union <B>
                                                    <xA> = {xA}
                                                    <A> = <NF(A)>, if A is not in normal form.
In the process of finding these sets, repeats may occur (e.g. <1+> = <1
1*> = <1*> = <1 | 1+> = {1} union <1+>), which are tracked and absorbed
as they occur.


Each expression A produces a set <A> whose elements are either 1 or
other expressions of the form xB. This yields the rules (A final) <->
(1 is in <A>); (x: A -> B) <-> {xB} is in A).


The starting expression is equated to the start state; and the
processing is run on <S>, and then on any of the other expressions
spawned from this until the full automaton is determined.


For nfa, and dfa, the GREP subset syntax is not used for S; S may only
be a symbolic name. But they both support non-recursive definitions.
(If you allow recursive definitions, then you no longer had regular
expressions but context-free expressions).


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.