Re: Generating a simple hand-coded like recursive descent parser

Chris F Clark <cfc@shell01.TheWorld.com>
19 Dec 2006 11:24:42 -0500

          From comp.compilers

Related articles
[33 earlier articles]
Re: Generating a simple hand-coded like recursive descent parser ajo@andrew.cmu.edu (Arthur J. O'Dwyer) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser 148f3wg02@sneakemail.com (Karsten Nyblad) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser ik@unicals.com (Ivan A. Kosarev) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser boldyrev@cgitftp.uiggm.nsc.ru (Ivan Boldyrev) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser DrDiettrich1@aol.com (Hans-Peter Diettrich) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser walter@bytecraft.com (Walter Banks) (2006-12-19)
Re: Generating a simple hand-coded like recursive descent parser cfc@shell01.TheWorld.com (Chris F Clark) (2006-12-19)
Re: parsing C and C++, Generating a simple hand-coded like gneuner2/@comcast.net (George Neuner) (2006-12-21)
Re: parsing C and C++, Generating a simple hand-coded like cfc@shell01.TheWorld.com (Chris F Clark) (2006-12-22)
Re: parsing C and C++, Generating a simple hand-coded like DrDiettrich1@aol.com (Hans-Peter Diettrich) (2006-12-22)
Re: parsing C and C++, Generating a simple hand-coded like derek@knosof.co.uk (Derek M. Jones) (2006-12-22)
Re: parsing C and C++, Generating a simple hand-coded like ik@unicals.com (Ivan A. Kosarev) (2006-12-22)
Re: Generating a simple hand-coded like recursive descent parser bobduff@shell01.TheWorld.com (Robert A Duff) (2006-12-22)
[6 later articles]
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 19 Dec 2006 11:24:42 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 06-09-029 06-09-042 06-09-048 06-09-060 06-09-078 06-09-093 06-12-064 06-12-066
Keywords: C, C++, parse
Posted-Date: 19 Dec 2006 11:24:42 EST

Robert A Duff <bobduff@shell01.TheWorld.com> writes:
> Has anyone ever tried to solve these C / C++ parsing problems without
> "feedback"?


Although you've gotten lots of feedback, there still seems to be 2
points worth addressing.


1) James Roskind did build a C grammar which attempts to eliminate the
      need for feedback. (We were consultants to Honeywell building a C
      compiler at the time.) He analyzed all the cases where the use of
      an identifier and a typename could be confusing (many of them had
      to do with function prototypes). As I recall, he had some success.
      However, the truly ambiguous examples being discussed in this
      thread, make me doubt exactly what I am remembering. Perhaps, it
      was just that he had success identifying places, where a new type
      or variable *could* be introduced.


      In any case, he then tried to apply the same analysis to C++, and
      successfully proved that the same technique could not be used to
      distinguish the ambiguities in C++. All that could be done in C++
      is to push the ambiguities far-enough away that the amount of
      lookahead required made them impractical to distinguish.


2) As to "feedback" (and this is in fact the normal term used for one
      solution to this problem), the term is somewhat misleading. The
      term comes from the fact that the parser would like certain tokens
      recategorized based upon the parsers state. One easy way to do
      that is to track the parsers state in the lexer (often done by
      setting variables shared between the lexer and parser when certain
      parser states are entered) and using that to guide which tokens the
      lexer emits to the parser. Because, the flow of tokens from the
      lexer to the parser is generally considered one-way, this passing
      of information back from the parser to the lexer is considered a
      "feedback loop", and thus the term.


      One thing of particular interest in this feedback, it is rare for
      the feedback to change the token boundaries (i.e. which regular
      expressions are grouped into tokens). Thus, Frank Deremer long ago
      suggessted separating the lexer into two parts, the scanner and the
      screener. The scanner is the part of the lexer which separates the
      incoming character stream into lexemes and does some preliminary
      classification based upon which regular expression was matched.
      The screener then looks at some (or all) of those lexeme and
      recategorizes them based upon symbol table (dictionary)
      information.


      Notably, the timing of those two phases can be separated. You can
      delay the screening process until the information is actually
      needed. This resolves the key problem of "feedback". Generally,
      only the screener needs the feedback, although that is not
      universally true (for example, in Verilog parsing context is
      required to distinguish hex numbers from identifiers and that can
      change token boundaries). Thus, in cases where the feedback does
      not affect the scanner, the scanner is decoupled from the parser,
      and the feedback loop is broken, which allows the scanner to run
      arbitrarily ahead of the parser. And that ability to run the
      scanner ahead of the parser defuses the "feedback" issue, whicdh is
      that the two need to be kept in lock-step, which can be
      particularly difficult if your parser needs lookahead tokens or
      uses backtracking, since the state necessary for determining the
      exact token type can be unavailable at the time the token needs to
      be scanned.


      However, in the C/C++ case, screening can be delayed until parsing.
      In your Ada example, it sounds like some parts of screening can be
      delayed until after parsing. The key issue of screening is that it
      needs a symbol table (dictionary) and not just a simple stack.
      Thus, you cannot convert a screener directly into a context-free
      grammar. My favorite solution to this problem is the one Terence
      Parr came up with "semantic predicates", which are bits of code
      which run at specific parts of the parsing process (i.e. when the
      parser gets to a specific point in the grammar) and provide a
      go/no-go answer. So, one could write a semantic predicate, that
      queries the symbol table for, "in the current context is this
      identifier a typedef" and then depending on whether you wanted
      typedefs to be legal or not, the parse would either continue with
      the current rule or chose another alternative. Note, semantic
      predicates are essentially equivalent to attribute grammars, so
      that is an alternate solution.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.