Re: Parsing implicit operators with recursive descent?

Kaz Kylheku <kkylheku@gmail.com>
Mon, 1 Feb 2010 21:08:28 +0000 (UTC)

From comp.compilers

Related articles
Parsing implicit operators with recursive descent? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2009-02-06)
Re: Parsing implicit operators with recursive descent? mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-02-07)
Re: Parsing implicit operators with recursive descent? armelasselin@hotmail.com (Armel) (2009-02-07)
Re: Parsing implicit operators with recursive descent? torbenm@pc-003.diku.dk (2009-02-09)
Re: Parsing implicit operators with recursive descent? barry.j.kelly@gmail.com (Barry Kelly) (2009-02-12)
Re: Parsing implicit operators with recursive descent? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2010-01-30)
*Re: Parsing implicit operators with recursive descent? kkylheku@gmail.com (Kaz Kylheku)* (2010-02-01)**

| List of all articles for this month |

From:	Kaz Kylheku <kkylheku@gmail.com>
Newsgroups:	comp.compilers
Date:	Mon, 1 Feb 2010 21:08:28 +0000 (UTC)
Organization:	A noiseless patient Spider
References:	09-02-013 09-02-022 10-01-097
Keywords:	parse
Posted-Date:	01 Feb 2010 18:26:56 EST

On 2010-01-30, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.com> wrote:
> A year ago, torbenm@pc-003.diku.dk (Torben Fgidius Mogensen) wrote:
>
>> "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> writes:
>>
>>> Is it possible to parse implicit operators, like the regular
>>> expression concatenation operator, with a recursive descent parser?
>>
>> Yes, you can do it mostly by standard grammar-rewriting techniques.
>>
>> Let us take regular expressions as an example. We start with the
>> grammar:
>>
>>
>> R -> R | R
>> R -> R R
>> R -> R *
>> R -> a
>>
>> Note that the | above is a terminal. a stands for any alphabet
>> symbol.

The above grammar doesn't account for regular expressions
with empty branches;

    abc|

i.e. (abc|): match abc, or empty string.

Simply adding a production R -> <epsilon> causes a problem.

Now R -> R * can derive R -> <epsilon> *. I.e. a * is
a R, all by itself!

What you can do in a recursive parser is recognize that certain lookaheads in
only certain situations generate an empty R.

whole_regex := regex

regex := branch | regex
            | branch

/*
  * Here is where your implicit catenation is represented:
  * a branch is either one term, or a term followed by a branch.
  * We allow a branch to be empty, which is crucial in regexes.
  */

branch := term branch
              | term
| /* empty! */ /* thus, regex derives empty! */

term := primary
          | primary * /* useless compound star X**... syntax is invalid */

primary | letter
                | [ class ]
| ( regex ) /* ( ) is valid due to regex's empty derivation. */

So your recursive parser basically is something like this pseudocode:

function parse_whole_regex(tokstream)
{
    parse_regex(tokstream);
    if (lookahead(tokstream) != EOF)
        syntax_error(tokstream, "stray token %t", lookahead(tokstream));
}

function parse_regex(tokstream)
{
    parse_branch(tokstream);

    while (lookahead(tokstream) == '|') {
        consume(tokstream);
parse_branch(tokstream);
    }

    /* At this point lookahead is EOF (we parsed a whole regex)
      * or the character ')' (we parsed a parenthesized regex).
      */
}

function parse_branch(tokstream)
{
    loop {
        case (lookahead(tokstream)) {
        EOF, '|', ')' : break;
        }

        parse_term(tokstream);
    }
}

function parse_term(tokstream)
{
    parse_primary();

    if (lookahead(tokstream) == '*')
        consume(tokstream);
}

function parse_primary(tokstream)
{
    case (lookahead(tokstream)) {
    '*' : syntax_error("operator * without operand");
    '(' : consume(tokstream);
                        parse_regex(tokstream);
                        if (lookahead(tokstream) != ')')
                            syntax_error("unbalanced parenthesis");
                        consume(tokstream);
    '[' : parse_charclass(tokstream);
    letter : parse_letter(tokstream)
    default : syntax_error("symbol %t misused as a regex term",
                                                  lookahead(tokstream));
    }
}

function parse_charclass(tokstream)
{
  ...
}

function parse_letter(tokstream)
{
  ...
}

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Parsing implicit operators with recursive descent?

Kaz Kylheku <kkylheku@gmail.com>Mon, 1 Feb 2010 21:08:28 +0000 (UTC)

Kaz Kylheku <kkylheku@gmail.com>
Mon, 1 Feb 2010 21:08:28 +0000 (UTC)