Related articles |
---|
tips for writing regular expression interpreter/compiler? derekhaas@gmail.com (Derek Haas) (2005-11-26) |
Re: tips for writing regular expression interpreter/compiler? jburgy@gmail.com (2005-11-30) |
Re: tips for writing regular expression interpreter/compiler? rsc@swtch.com (Russ Cox) (2005-12-02) |
Re: tips for writing regular expression interpreter/compiler? jeffrey.kenton@comcast.net (Jeff Kenton) (2005-12-02) |
Re: tips for writing regular expression interpreter/compiler? mefrill@yandex.ru (mefrill) (2005-12-02) |
Re: tips for writing regular expression interpreter/compiler? markwh04@yahoo.com (2005-12-23) |
Re: tips for writing regular expression interpreter/compiler? markwh04@yahoo.com (2005-12-23) |
Re: tips for writing regular expression interpreter/compiler? rsc@swtch.com (Russ Cox) (2005-12-23) |
From: | markwh04@yahoo.com |
Newsgroups: | comp.compilers |
Date: | 23 Dec 2005 20:54:31 -0500 |
Organization: | http://groups.google.com |
References: | 05-11-11905-12-006 |
Keywords: | lex |
Posted-Date: | 23 Dec 2005 20:54:31 EST |
mefrill wrote:
> re --> re_list
> re_list --> re_term re_list | re_term
> re_term --> re_atom | re_alternation | re_kleene
> re_alternation --> re_atom | re_atom
> re_kleene --> re_atom '*'
> re_atom --> 'one of simbols except '*' symbol and '|' symbol'
The syntax can be substantially simplified:
E -> E [a] E | E b | '(' E ')' | '[' E ']' | c | c '=' E ',' E
a: infix operators ('|', maybe '-' and '&' for relative complement and
intersection)
b: postfix operators ('*', '+', '?').
c: constant or variable name (or, if using a grep-like grammar, may
also include set).
The last item can be used to make definitions and/or abbreviations;
e.g.
a (a = b b, a b a) would be equivalent to a ((b b)
b (b b))
Precedence conflict resolution shouldn't be hard-defined in the syntax.
> By basing on this grammar you may build the parser. Define separate
> routine for each grammar's nonterminal.
A parser is best understood as applying not to a context-free language,
but a context-free translator. The translation, here, is into a
language whose words comprise sequences of build actions. If the build
actions corresponding to E E; E|E; E*; [E]; c are written, respectively
as y, x, w, v and u; and the accept action as z; then this would read:
start -> E z
E -> E E y | E '|' E x | E '*' w | '(' E ')' | '[' E ']' v | c u | c
'=' E ',' E
or equivalently
start -> E z
E -> ('(' E ')' | '[' E ']' v | c u | c '=' E ',' E t) (E y | '|' E x |
'*' w)*
Using the operators:
u0: create empty stack; ui(...): push i (and additional
content) (i = 1, 2, ...)
v0: test for empty stack; vi(...): test for i and pop
(extracting additional content)
you can then define a transition for the corresponding automata
This works with 2 state, E_S, E_F and a third state E_Q, defined by:
E_Q = (E y | '|' E x | '*" E) E_F
>From (start -> E z), you get:
start -> u0 E_S
E_F -> v0 {z: return E}
>From (E -> ( E )), you get:
E_S -> '(' u1 E_S
E_F -> v1 ')' E_F
>From (E -> [ E ] v), you get:
E_S -> '[' u2 E_S
E_F -> ']' v2 {v:E <- E?} E_F
>From (E -> c u), you get:
E_S -> c {u:E <- c} E_Q
>From (E -> c = E, E), you get:
E_S -> c '=' u3(c) E_S
E_F -> v3(c) ',' u4(c, value(c)) {value(c) <- E} E_S
E_F -> v4(c,F) {value(c) <- F} E_F
This threads in operations to save and restore the old value of c.
>From E_Q = (E y | '|' E x | '*' w)* E_F, you get
E_Q -> E_F
E_Q -> E y E_F
E_Q -> '|' E x E_F
E_Q -> '*' w E_F
>From (E_Q -> E y E_Q), you get:
E_F -> u5(E) E_S
E_F -> v5(F) {y:E <- E F} E_F
>From (E_Q -> '|' E x E_Q), you get:
E_F -> '|' u6(E) E_S
E_F -> v6(F) {x:E <- E | F} E_F
>From (E_Q -> '*" E E_Q), you get:
E_F -> '*' {w:E <- E*} E_F
The lambda transition (E_Q -> E_F) is removed, resulting in an
effective merger of E_Q and E_F, which is then redefined as E_F.
Combining everything, this reads:
start -> u0 E_S
E_S -> '(' u1 E_S |
'[' u2 E_S |
c u { E<-c} E_F |
c '=' u3(c) E_F | (takes precendence over the previous
rule)
E_F -> v0 { return E; }
v1 ')' E_F
v2 ']' {E <- E?} E_F
v3(c) ',' u4(c,value(c)) {value(c)<-E} E_S
v4(c,F) {value(c)<-E} E_F
v5(F) {E <- E F} E_F
v6(F) {E <- E | F} E_F
E_F -> u5(E) E_S
'|' u6(E) E_S
'*' {E <- E*} E_F
A simple table can be defined to handle the resolution of the conflicts
from the first vs. second set of rules for E_F that came originally
from E_F and E_Q respectively.
> So, you will have re(), re_list(), re_term(), re_alternation(), re_kleene() and
> re_atom() functions.
This is best handled within a single routine of the form
Exp MakeExp(char Tag, ...)
which retrieves (or, if need be, builds anew) the desired expression. A
hash-table can be used to stored built-up expressions for later reuse.
For constants, a call might look like MakeExp(c); for operators with an
argument A, B, ...; it might look like MakeExp(op, A, B, ...). In C++,
you can use the name "Exp", assuming you defined this as a class
instead of as a structure, as the name of the constructor function,
itself. Then a variable could actually be initialized with the
constructure (e.g. in place of Exp E = MakeExp('|', A, B) you'd have
something like Exp E('|', A, B).)
In C++ you can use the native operator facility to further simplify the
notation (e.g., the operator '|' routine, with arguments A and B, would
call GetExp('|', A, B)). This will allow you to use infix notation in
the program (but you're stuck with prefix notation, if you encode
either of the postfix operators, since C++ doesn't have built-in
user-defineable postfix operators).
This is the approach adopted by the NFA and DFA conversion programs
that (I believe) are still in the comp.compilers archive under the
regex directory.
Return to the
comp.compilers page.
Search the
comp.compilers archives again.