Advice for recursive descent parser

pk <pk@pk.invalid>
Tue, 31 Mar 2009 17:32:20 +0200

From comp.compilers

Related articles
*Advice for recursive descent parser pk@pk.invalid (pk)* (2009-03-31)**
Re: Advice for recursive descent parser jamin.hanson@googlemail.com (2009-04-01)
Re: Advice for recursive descent parser cfc@shell01.TheWorld.com (Chris F Clark) (2009-04-01)

| List of all articles for this month |

From:	pk <pk@pk.invalid>
Newsgroups:	comp.compilers
Date:	Tue, 31 Mar 2009 17:32:20 +0200
Organization:	Not organized
Keywords:	parse, question, comment
Posted-Date:	31 Mar 2009 14:32:51 EDT

Hi,

Foreword: I'm a beginner in parsing, so there may be errors in what I'm
doing or writing - any correction welcome, thanks.

I'm trying to write a recursive descent parser for this grammar (simplified
regexes - sorry for the non formal grammar):

RE -> BR ('|' BR)* # alternation

BR -> EX (EX)* # concatenation

EX -> ITEM (Q)? # single letters plus optional quantifier

ITEM -> <any character except *>

Q -> '*'

Using this pseudocode it sort of works (cursym() returns the current
symbol - just characters here-, accept() consumes the current symbol if it
matches its argument - pretty standard functions as used in most theory RD
explanations):

RE() {
    if !BR() error()
    while accept("|") {
        if !BR() error()
    }
    return 1
}

BR() {
    if !EX() error()
    while cursym()!="|" && !end_of_RE {
        if !EX() error()
    }
    return 1
}

EX() {
    if !ITEM() error()
    if(cursym()=="*" {
        if !Q() error()
    }
    return 1
}

ITEM() {
    if cursym()=="*" error() # stray "*"
    accept(cursym())
    return 1
}

Q() {
    accept("*")
    return 1 # to be expanded in the future
}

Now I want to add parenthesized groups as ITEMs, so they can be quantified
too. To that end, I did these changes to the grammar:

RE -> BR ('|' BR)* # alternation

BR -> EX (EX)* # concatenation

EX -> ITEM (Q)?

ITEM -> '(' RE ')' | <any character except *>

Q -> '*'

And this new code (only the changed part):

ITEM() {

    if accept('(') { # parenthesized group
        if !RE error()
        if !accept(")") error()
        return 1
    }

    if (cursym()=="*" ||
            cursym()==")" ) error() # stray "*" or ")"

    # single character
    accept(cursym())
    return 1
}

This doesn't work because an input like "a(bc)d", which is correct, is not
parsed correctly because the ) is detected as stray ), since BR() does not
know where to stop and goes on trying to parse EXs (which in turn parse
ITEMs) even when the closing ) is reached. I could do this:

BR() {
    if !EX() error()
    while cursym()!="|" && !cursym()!=")" && !end_of_RE {
        if !EX() error()
    }
    return 1
}

But it looks kind of hackish, and I'm not really sure that is the way to go.
I could change the grammar and introduce an explicit concatenation operator
for EXs (which I don't really like too much), but before doing that, can
anybody please advise?

Thank you!
[I'd implement one-token lookahead so each rule can look at a token and
if it's not something it recognizes, put the token back and return so
the upper level can do something different. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Advice for recursive descent parser

pk <pk@pk.invalid>Tue, 31 Mar 2009 17:32:20 +0200

pk <pk@pk.invalid>
Tue, 31 Mar 2009 17:32:20 +0200