Related articles |
---|
Regular expression grammar? bediger@teal.csn.net (Bruce Ediger) (1999-09-16) |
Re: Regular expression grammar? jjan@cs.rug.nl (J.H.Jongejan) (1999-09-20) |
Re: Regular expression grammar? terryg@uswest.net (1999-09-20) |
Re: Regular expression grammar? lex@cc.gatech.edu (1999-09-20) |
Re: Regular expression grammar? cbarron3@ix.netcom.com (1999-09-20) |
Re: Regular expression grammar? zalman@netcom18.netcom.com (Zalman Stern) (1999-09-24) |
Re: Regular expression grammar? zalman@netcom18.netcom.com (Zalman Stern) (1999-09-24) |
Re: Regular expression grammar? zalman@netcom18.netcom.com (Zalman Stern) (1999-09-24) |
Re: Regular expression grammar? cbarron3@ix.netcom.com (1999-09-27) |
From: | Zalman Stern <zalman@netcom18.netcom.com> |
Newsgroups: | comp.compilers |
Date: | 24 Sep 1999 22:52:58 -0400 |
Organization: | Netcom |
References: | 99-09-051 |
Keywords: | lex, parse |
I sent email to the original poster 'cause I was embarassed to post my
grammar, but seeing the responses, I think its ok. My main point to the
poster is that the grammar needs to tell the parser generator how to handle
the following two cases:
is "ab|cd" equivalent to "a(b|c)d" or equiv. to "(ab)|(cd)"
and
is "ab*" equivalent to "a(b*)" or equiv. to "(ab)*"
I don't think this can be resolved by operator precedence because there
isn't an explicit operator for concatenation, but one might be able to
finesse that by setting up the rules in some particualr way. (Its been
forever since I worked with yacc.)
Any way, here is a lexer and grammar for the problem at hand. it chooses:
"ab|cd" equiv. to "(ab)|(cd)"
and
"ab*" equiv. to "a(b*)"
which is how all regexp parsers I know of work. There is a top level rule
that lets one type multiple regexps, one per line.
Practically speaking, this is perhaps not a great problem for using a full
parser tool, but I think its a good test problem for learning what parsing
tools are about. And I'm loathe to ever tell anyone a language is too
simple for "real" parsing tools as most people err in the other direction
and don't use them when they should. Which pollutes the world with many ill
defined unspecified "ad hoc" languages. (On the flipside, many of these ad
hoc languages are a hassle to parse with grammar's 'cause they're ambiguous
out the ying yang. Think of the work it takes to write a grammar for one as
an improvement to the universe :-))
Also one poster mentioned that regexps should be handled by finite state
machine. I don't think one can parse balanced paranthesis with a mere FSM.
Of course the next step after getting a parse tree from this yacc grammar
is going to be to generate an FSM so you can match stuff against the
regexp, but I don't think regexps are powerful enough to match themselves.
-Z-
-----regexp.y-----
%token SYMBOL OR STAR LPAREN RPAREN EOL
%{
#define yyerror puts
%}
%%
input : regexp EOL input ;
regexp : alternation ;
optalt : OR alternation | ;
alternation : concat optalt;
optcat : concat | ;
concat : kleen optcat ;
optstar : STAR | ;
kleen : lowterm optstar ;
lowterm : SYMBOL | LPAREN regexp RPAREN ;
;
%%
main() {
yyparse();
}
-----regexp.l----
%{
#include "y.tab.h"
/* Noise to make generated output compile. */
extern int yylval;
#define yywrap() 1
%}
%%
[a-zA-Z] yylval = yytext[0]; return SYMBOL;
\* return STAR;
\| return OR;
\( return LPAREN;
\) return RPAREN;
\n return EOL
%%
Return to the
comp.compilers page.
Search the
comp.compilers archives again.