Re: tips for writing regular expression interpreter/compiler?
23 Dec 2005 20:54:31 -0500

          From comp.compilers

Related articles
tips for writing regular expression interpreter/compiler? (Derek Haas) (2005-11-26)
Re: tips for writing regular expression interpreter/compiler? (2005-11-30)
Re: tips for writing regular expression interpreter/compiler? (Russ Cox) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? (Jeff Kenton) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? (mefrill) (2005-12-02)
Re: tips for writing regular expression interpreter/compiler? (2005-12-23)
Re: tips for writing regular expression interpreter/compiler? (2005-12-23)
Re: tips for writing regular expression interpreter/compiler? (Russ Cox) (2005-12-23)
| List of all articles for this month |

Newsgroups: comp.compilers
Date: 23 Dec 2005 20:54:31 -0500
References: 05-11-11905-12-006
Keywords: lex
Posted-Date: 23 Dec 2005 20:54:31 EST

mefrill wrote:
> re --> re_list
> re_list --> re_term re_list | re_term
> re_term --> re_atom | re_alternation | re_kleene
> re_alternation --> re_atom | re_atom
> re_kleene --> re_atom '*'
> re_atom --> 'one of simbols except '*' symbol and '|' symbol'

The syntax can be substantially simplified:
E -> E [a] E | E b | '(' E ')' | '[' E ']' | c | c '=' E ',' E
a: infix operators ('|', maybe '-' and '&' for relative complement and
b: postfix operators ('*', '+', '?').
c: constant or variable name (or, if using a grep-like grammar, may
also include set).

The last item can be used to make definitions and/or abbreviations;
                                  a (a = b b, a b a) would be equivalent to a ((b b)
b (b b))

Precedence conflict resolution shouldn't be hard-defined in the syntax.

> By basing on this grammar you may build the parser. Define separate
> routine for each grammar's nonterminal.

A parser is best understood as applying not to a context-free language,
but a context-free translator. The translation, here, is into a
language whose words comprise sequences of build actions. If the build
actions corresponding to E E; E|E; E*; [E]; c are written, respectively
as y, x, w, v and u; and the accept action as z; then this would read:

start -> E z
E -> E E y | E '|' E x | E '*' w | '(' E ')' | '[' E ']' v | c u | c
'=' E ',' E
or equivalently
start -> E z
E -> ('(' E ')' | '[' E ']' v | c u | c '=' E ',' E t) (E y | '|' E x |
'*' w)*

Using the operators:
                  u0: create empty stack; ui(...): push i (and additional
content) (i = 1, 2, ...)
                  v0: test for empty stack; vi(...): test for i and pop
(extracting additional content)
you can then define a transition for the corresponding automata

This works with 2 state, E_S, E_F and a third state E_Q, defined by:
E_Q = (E y | '|' E x | '*" E) E_F

>From (start -> E z), you get:
start -> u0 E_S
E_F -> v0 {z: return E}

>From (E -> ( E )), you get:
E_S -> '(' u1 E_S
E_F -> v1 ')' E_F

>From (E -> [ E ] v), you get:
E_S -> '[' u2 E_S
E_F -> ']' v2 {v:E <- E?} E_F

>From (E -> c u), you get:
E_S -> c {u:E <- c} E_Q

>From (E -> c = E, E), you get:
E_S -> c '=' u3(c) E_S
E_F -> v3(c) ',' u4(c, value(c)) {value(c) <- E} E_S
E_F -> v4(c,F) {value(c) <- F} E_F

This threads in operations to save and restore the old value of c.

>From E_Q = (E y | '|' E x | '*' w)* E_F, you get
                  E_Q -> E_F
                  E_Q -> E y E_F
                  E_Q -> '|' E x E_F
                  E_Q -> '*' w E_F

>From (E_Q -> E y E_Q), you get:
E_F -> u5(E) E_S
E_F -> v5(F) {y:E <- E F} E_F

>From (E_Q -> '|' E x E_Q), you get:
E_F -> '|' u6(E) E_S
E_F -> v6(F) {x:E <- E | F} E_F

>From (E_Q -> '*" E E_Q), you get:
E_F -> '*' {w:E <- E*} E_F

The lambda transition (E_Q -> E_F) is removed, resulting in an
effective merger of E_Q and E_F, which is then redefined as E_F.

Combining everything, this reads:
start -> u0 E_S

E_S -> '(' u1 E_S |
                      '[' u2 E_S |
                      c u { E<-c} E_F |
                      c '=' u3(c) E_F | (takes precendence over the previous

E_F -> v0 { return E; }
                      v1 ')' E_F
                      v2 ']' {E <- E?} E_F
                      v3(c) ',' u4(c,value(c)) {value(c)<-E} E_S
                      v4(c,F) {value(c)<-E} E_F
                      v5(F) {E <- E F} E_F
                      v6(F) {E <- E | F} E_F
E_F -> u5(E) E_S
                      '|' u6(E) E_S
                    '*' {E <- E*} E_F

A simple table can be defined to handle the resolution of the conflicts
from the first vs. second set of rules for E_F that came originally
from E_F and E_Q respectively.

> So, you will have re(), re_list(), re_term(), re_alternation(), re_kleene() and
> re_atom() functions.

This is best handled within a single routine of the form
Exp MakeExp(char Tag, ...)

which retrieves (or, if need be, builds anew) the desired expression. A
hash-table can be used to stored built-up expressions for later reuse.

For constants, a call might look like MakeExp(c); for operators with an
argument A, B, ...; it might look like MakeExp(op, A, B, ...). In C++,
you can use the name "Exp", assuming you defined this as a class
instead of as a structure, as the name of the constructor function,
itself. Then a variable could actually be initialized with the
constructure (e.g. in place of Exp E = MakeExp('|', A, B) you'd have
something like Exp E('|', A, B).)

In C++ you can use the native operator facility to further simplify the
notation (e.g., the operator '|' routine, with arguments A and B, would
call GetExp('|', A, B)). This will allow you to use infix notation in
the program (but you're stuck with prefix notation, if you encode
either of the postfix operators, since C++ doesn't have built-in
user-defineable postfix operators).

This is the approach adopted by the NFA and DFA conversion programs
that (I believe) are still in the comp.compilers archive under the
regex directory.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.