Regular expression grammar?

Bruce Ediger <bediger@teal.csn.net>
16 Sep 1999 01:56:12 -0400

          From comp.compilers

Related articles
Regular expression grammar? bediger@teal.csn.net (Bruce Ediger) (1999-09-16)
Re: Regular expression grammar? jjan@cs.rug.nl (J.H.Jongejan) (1999-09-20)
Re: Regular expression grammar? terryg@uswest.net (1999-09-20)
Re: Regular expression grammar? lex@cc.gatech.edu (1999-09-20)
Re: Regular expression grammar? cbarron3@ix.netcom.com (1999-09-20)
Re: Regular expression grammar? zalman@netcom18.netcom.com (Zalman Stern) (1999-09-24)
Re: Regular expression grammar? zalman@netcom18.netcom.com (Zalman Stern) (1999-09-24)
[2 later articles]
| List of all articles for this month |
From: Bruce Ediger <bediger@teal.csn.net>
Newsgroups: comp.compilers
Date: 16 Sep 1999 01:56:12 -0400
Organization: Compilers Central
Keywords: parse, question

Dr Dobb's Journal, April 1999 carried an article by Brian Kernighan
and Rob Pike titled "Regular Expressions". I read the article, and
I was struck by the beginning sentences of the second-to-last paragraph:


      For regular expressions with parentheses and alternatives,
      the implementation must be more sophisticated. One approach
      is to compile the regular expression into a parse tree that
      captures its grammatical structure.


How the heck do they do that compilation? I've tried to come up with
a "yacc" grammar for regular expressions based on the definition of
"regular expression" in "Automata and Formal Languages" by Dean
Kelley, ISBN 0-13-497777-7, Prentice Hall, 1995.


Kelly defines regular languages like this (page 38):


      (a) The empty set is a regular language
      (b) The set containing the zero-length string is a regular language
      (c) The set containing every single character in the alphabet is regular
      (d) For two regular languages A and B, A | B, AB, A* are regular
      (e) No other languages are regular


His notation for regular expressions follows directly, and seems
pretty standard to me.


My "yacc" grammar for regular expressions:


--------
%token SYMBOL OR STAR LPAREN RPAREN
%%
regexp
        : SYMBOL /* any letter of the alphabet (c) */
        | regexp OR regexp /* alternation (d) */
        | regexp regexp /* concatenation (d) */
        | regexp STAR /* Kleene Closure (d) */
        | RPAREN regexp LPAREN /* grouping */
;
%%
--------
This gives a bunch of shift/reduce conflicts. I can't seem to get around
them by putting in operator precedences, or the analogy of algebraic
terms and factors.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.