LL(1) grammar for regex

"James Brown \[MVP\]" <not@www-no.ucsd.edu>
22 Mar 2006 23:41:45 -0500

          From comp.compilers

Related articles
LL(1) grammar for regex not@www-no.ucsd.edu (James Brown \[MVP\]) (2006-03-22)
Re: LL(1) grammar for regex schmitz@i3s.unice.fr (Sylvain Schmitz) (2006-03-27)
Re: LL(1) grammar for regex cfc@shell01.TheWorld.com (Chris F Clark) (2006-03-27)
| List of all articles for this month |

From: "James Brown \[MVP\]" <not@www-no.ucsd.edu>
Newsgroups: comp.compilers
Date: 22 Mar 2006 23:41:45 -0500
Organization: Compilers Central
Keywords: lex, question
Posted-Date: 22 Mar 2006 23:41:45 EST


I'm currently trying to build a regex engine by hand (using C++). The aim is
to build a top-down parser for regex expressions. Therefore I am attempting
to build a LL(1) grammar for regex's. I am at the early stage of building a
syntax-tree from my regex grammar, but I am having a hard time with regex

I am basically copying the code (and grammar) found in the following regex


Here's the BNF grammar:

// expr ::= concat '|' expr
// | concat
// concat ::= rep . concat
// | rep
// rep ::= atom '*'
// | atom
// atom ::= chr
// | '(' expr ')'
// char ::= alphanumeric character

The grammar above represents regex concatenation by a '.' (dot) operator.
This means that regular expressions of the form "abcde" must be written as
"a.b.c.d.e". I can appreciate that it is much simpler to parse in this form,
however, I'd like to remove the reliance on the concatenation operator.

The current pseudo-code for parsing the concatenation is:

node *concat()
      node *left = rep();

      if(ch == '.')
              ch = nextch();
              node *right = concat();
              return new node(CONCAT, 0, left, right);

      return left;

I briefly tried to describe concatenation as:

concat ::= rep concat
                      | rep

node *concat()
      node *left = rep();

      if(ch) // note I don't test for '.'
              node *right = concat();
              return new node(CONCAT, 0, left, right);

      return left;

Of course, this destroys the parsing because I've introduced a recursion
which consumes the rest of the input-stream. I think my version of the
'concat' rule is no longer LL(1)..??

Can anyone advise me on where I should look for help with this? The article
I am using is perfect for me because the grammar is understandable (to me)
and the source-code is nice and simple also....but I just can't figure out
how to proceed from here.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.