Advice for recursive descent parser

pk <pk@pk.invalid>
Tue, 31 Mar 2009 17:32:20 +0200

          From comp.compilers

Related articles
Advice for recursive descent parser pk@pk.invalid (pk) (2009-03-31)
Re: Advice for recursive descent parser jamin.hanson@googlemail.com (2009-04-01)
Re: Advice for recursive descent parser cfc@shell01.TheWorld.com (Chris F Clark) (2009-04-01)
| List of all articles for this month |
From: pk <pk@pk.invalid>
Newsgroups: comp.compilers
Date: Tue, 31 Mar 2009 17:32:20 +0200
Organization: Not organized
Keywords: parse, question, comment
Posted-Date: 31 Mar 2009 14:32:51 EDT

Hi,


Foreword: I'm a beginner in parsing, so there may be errors in what I'm
doing or writing - any correction welcome, thanks.


I'm trying to write a recursive descent parser for this grammar (simplified
regexes - sorry for the non formal grammar):


RE -> BR ('|' BR)* # alternation


BR -> EX (EX)* # concatenation


EX -> ITEM (Q)? # single letters plus optional quantifier


ITEM -> <any character except *>


Q -> '*'


Using this pseudocode it sort of works (cursym() returns the current
symbol - just characters here-, accept() consumes the current symbol if it
matches its argument - pretty standard functions as used in most theory RD
explanations):


RE() {
    if !BR() error()
    while accept("|") {
        if !BR() error()
    }
    return 1
}


BR() {
    if !EX() error()
    while cursym()!="|" && !end_of_RE {
        if !EX() error()
    }
    return 1
}


EX() {
    if !ITEM() error()
    if(cursym()=="*" {
        if !Q() error()
    }
    return 1
}


ITEM() {
    if cursym()=="*" error() # stray "*"
    accept(cursym())
    return 1
}




Q() {
    accept("*")
    return 1 # to be expanded in the future
}




Now I want to add parenthesized groups as ITEMs, so they can be quantified
too. To that end, I did these changes to the grammar:




RE -> BR ('|' BR)* # alternation


BR -> EX (EX)* # concatenation


EX -> ITEM (Q)?


ITEM -> '(' RE ')' | <any character except *>


Q -> '*'


And this new code (only the changed part):


ITEM() {


    if accept('(') { # parenthesized group
        if !RE error()
        if !accept(")") error()
        return 1
    }


    if (cursym()=="*" ||
            cursym()==")" ) error() # stray "*" or ")"


    # single character
    accept(cursym())
    return 1
}


This doesn't work because an input like "a(bc)d", which is correct, is not
parsed correctly because the ) is detected as stray ), since BR() does not
know where to stop and goes on trying to parse EXs (which in turn parse
ITEMs) even when the closing ) is reached. I could do this:


BR() {
    if !EX() error()
    while cursym()!="|" && !cursym()!=")" && !end_of_RE {
        if !EX() error()
    }
    return 1
}


But it looks kind of hackish, and I'm not really sure that is the way to go.
I could change the grammar and introduce an explicit concatenation operator
for EXs (which I don't really like too much), but before doing that, can
anybody please advise?


Thank you!
[I'd implement one-token lookahead so each rule can look at a token and
if it's not something it recognizes, put the token back and return so
the upper level can do something different. -John]


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.