Re: Parsing implicit operators with recursive descent?

Kaz Kylheku <kkylheku@gmail.com>
Mon, 1 Feb 2010 21:08:28 +0000 (UTC)

          From comp.compilers

Related articles
Parsing implicit operators with recursive descent? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2009-02-06)
Re: Parsing implicit operators with recursive descent? mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-02-07)
Re: Parsing implicit operators with recursive descent? armelasselin@hotmail.com (Armel) (2009-02-07)
Re: Parsing implicit operators with recursive descent? torbenm@pc-003.diku.dk (2009-02-09)
Re: Parsing implicit operators with recursive descent? barry.j.kelly@gmail.com (Barry Kelly) (2009-02-12)
Re: Parsing implicit operators with recursive descent? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2010-01-30)
Re: Parsing implicit operators with recursive descent? kkylheku@gmail.com (Kaz Kylheku) (2010-02-01)
| List of all articles for this month |

From: Kaz Kylheku <kkylheku@gmail.com>
Newsgroups: comp.compilers
Date: Mon, 1 Feb 2010 21:08:28 +0000 (UTC)
Organization: A noiseless patient Spider
References: 09-02-013 09-02-022 10-01-097
Keywords: parse
Posted-Date: 01 Feb 2010 18:26:56 EST

On 2010-01-30, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.com> wrote:
> A year ago, torbenm@pc-003.diku.dk (Torben Fgidius Mogensen) wrote:
>
>> "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> writes:
>>
>>> Is it possible to parse implicit operators, like the regular
>>> expression concatenation operator, with a recursive descent parser?
>>
>> Yes, you can do it mostly by standard grammar-rewriting techniques.
>>
>> Let us take regular expressions as an example. We start with the
>> grammar:
>>
>>
>> R -> R | R
>> R -> R R
>> R -> R *
>> R -> a
>>
>> Note that the | above is a terminal. a stands for any alphabet
>> symbol.


The above grammar doesn't account for regular expressions
with empty branches;


    abc|


i.e. (abc|): match abc, or empty string.


Simply adding a production R -> <epsilon> causes a problem.


Now R -> R * can derive R -> <epsilon> *. I.e. a * is
a R, all by itself!


What you can do in a recursive parser is recognize that certain lookaheads in
only certain situations generate an empty R.


whole_regex := regex


regex := branch | regex
            | branch


/*
  * Here is where your implicit catenation is represented:
  * a branch is either one term, or a term followed by a branch.
  * We allow a branch to be empty, which is crucial in regexes.
  */


branch := term branch
              | term
| /* empty! */ /* thus, regex derives empty! */


term := primary
          | primary * /* useless compound star X**... syntax is invalid */


primary | letter
                | [ class ]
| ( regex ) /* ( ) is valid due to regex's empty derivation. */




So your recursive parser basically is something like this pseudocode:




function parse_whole_regex(tokstream)
{
    parse_regex(tokstream);
    if (lookahead(tokstream) != EOF)
        syntax_error(tokstream, "stray token %t", lookahead(tokstream));
}


function parse_regex(tokstream)
{
    parse_branch(tokstream);


    while (lookahead(tokstream) == '|') {
        consume(tokstream);
parse_branch(tokstream);
    }


    /* At this point lookahead is EOF (we parsed a whole regex)
      * or the character ')' (we parsed a parenthesized regex).
      */
}


function parse_branch(tokstream)
{
    loop {
        case (lookahead(tokstream)) {
        EOF, '|', ')' : break;
        }


        parse_term(tokstream);
    }
}


function parse_term(tokstream)
{
    parse_primary();


    if (lookahead(tokstream) == '*')
        consume(tokstream);
}


function parse_primary(tokstream)
{
    case (lookahead(tokstream)) {
    '*' : syntax_error("operator * without operand");
    '(' : consume(tokstream);
                        parse_regex(tokstream);
                        if (lookahead(tokstream) != ')')
                            syntax_error("unbalanced parenthesis");
                        consume(tokstream);
    '[' : parse_charclass(tokstream);
    letter : parse_letter(tokstream)
    default : syntax_error("symbol %t misused as a regex term",
                                                  lookahead(tokstream));
    }
}


function parse_charclass(tokstream)
{
  ...
}


function parse_letter(tokstream)
{
  ...
}



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.