From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 19 Dec 2006 11:24:42 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 06-09-029 06-09-042 06-09-048 06-09-060 06-09-078 06-09-093 06-12-064 06-12-066
Keywords: C, C++, parse
Posted-Date: 19 Dec 2006 11:24:42 EST
Robert A Duff <bobduff@shell01.TheWorld.com> writes:
> Has anyone ever tried to solve these C / C++ parsing problems without
> "feedback"?
Although you've gotten lots of feedback, there still seem to be two
points worth addressing.
1) James Roskind did build a C grammar which attempts to eliminate the
need for feedback. (We were consultants to Honeywell building a C
compiler at the time.) He analyzed all the cases where the use of
an identifier and a typename could be confused (many of them had
to do with function prototypes). As I recall, he had some success.
However, the truly ambiguous examples being discussed in this
thread make me doubt exactly what I am remembering. Perhaps it
was just that he had success identifying places where a new type
or variable *could* be introduced.
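To make the kind of case concrete, here is the classic illustration
(a sketch for this discussion, not an example taken from Roskind's
grammar):

    typedef int x;

    void f(void)
    {
        x * y;  /* with the typedef above: declares y, pointer to x */
    }

    /* If "typedef int x;" were replaced by "int x;", the same three
     * tokens "x * y" would instead parse as an expression multiplying
     * x by y -- which way to parse hinges on what the symbol table
     * says x is. */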
In any case, he then tried to apply the same analysis to C++, and
proved that the same technique could not be used to distinguish
the ambiguities in C++. All that could be done in C++ was to push
the ambiguities far enough away that the amount of lookahead
required made them impractical to distinguish.
2) As to "feedback" (and this is in fact the normal term used for one
solution to this problem), the term is somewhat misleading. The
term comes from the fact that the parser would like certain tokens
recategorized based upon the parsers state. One easy way to do
that is to track the parsers state in the lexer (often done by
setting variables shared between the lexer and parser when certain
parser states are entered) and using that to guide which tokens the
lexer emits to the parser. Because, the flow of tokens from the
lexer to the parser is generally considered one-way, this passing
of information back from the parser to the lexer is considered a
"feedback loop", and thus the term.
One thing about this feedback is of particular interest: it is rare
for the feedback to change the token boundaries (i.e. which regular
expressions are grouped into tokens). Thus, Frank DeRemer long ago
suggested separating the lexer into two parts, the scanner and the
screener. The scanner is the part of the lexer which separates the
incoming character stream into lexemes and does some preliminary
classification based upon which regular expression was matched.
The screener then looks at some (or all) of those lexemes and
recategorizes them based upon symbol table (dictionary)
information.
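A sketch of the split (the types and the toy dictionary are
stand-ins for whatever a real front end provides):

    #include <stdbool.h>
    #include <string.h>

    /* What the scanner emits: lexemes carrying only the preliminary
     * class derived from the regular expression that matched. */
    enum lex_class { LC_IDENT, LC_NUMBER, LC_PUNCT };
    struct lexeme  { enum lex_class cls; const char *text; };

    /* What the screener emits: fully categorized tokens. */
    enum tok_kind { TK_IDENT, TK_KEYWORD, TK_TYPE_NAME, TK_NUMBER,
                    TK_PUNCT };
    struct token  { enum tok_kind kind; const char *text; };

    /* Toy dictionary queries standing in for symbol-table lookups. */
    static bool is_keyword(const char *s)
    { return strcmp(s, "while") == 0; }
    static bool is_type_name(const char *s)
    { return strcmp(s, "size_t") == 0; }

    /* The screener: recategorizes lexemes but never moves a token
     * boundary. */
    struct token screen(struct lexeme lx)
    {
        struct token t = { TK_PUNCT, lx.text };
        switch (lx.cls) {
        case LC_IDENT:
            t.kind = is_keyword(lx.text)   ? TK_KEYWORD
                   : is_type_name(lx.text) ? TK_TYPE_NAME
                   :                         TK_IDENT;
            break;
        case LC_NUMBER: t.kind = TK_NUMBER; break;
        case LC_PUNCT:  t.kind = TK_PUNCT;  break;
        }
        return t;
    }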
Notably, the timing of those two phases can be separated: you can
delay the screening process until the information is actually
needed. This resolves the key problem of "feedback". Generally,
only the screener needs the feedback, although that is not
universally true (for example, in Verilog parsing, context is
required to distinguish hex numbers from identifiers, and that can
change token boundaries). Thus, in cases where the feedback does
not affect the scanner, the scanner is decoupled from the parser
and the feedback loop is broken, which allows the scanner to run
arbitrarily ahead of the parser. That ability defuses the real
"feedback" problem, which is that the two otherwise need to be kept
in lock-step. Lock-step operation can be particularly difficult if
your parser needs lookahead tokens or uses backtracking, since the
parser state necessary for determining the exact token type can be
unavailable at the time the token needs to be scanned.
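Continuing the sketch above, the decoupling might look like this:
the scanner fills a buffer as far ahead as it likes, and screening
happens only when the parser actually consumes a token (again, all
names are hypothetical):

    #include <stddef.h>

    /* Minimal stand-ins; in practice these are the scanner/screener
     * types from the previous sketch. */
    struct lexeme { int cls; const char *text; };
    struct token  { int kind; const char *text; };
    struct token screen(struct lexeme lx);  /* screener, as above */

    /* The scanner may run arbitrarily far ahead filling this buffer;
     * nothing in scanning depends on parser state. */
    static struct lexeme buffer[4096];
    static size_t        cursor;

    /* The parser pulls tokens one at a time.  Screening happens only
     * here, at consumption time, when the symbol-table context the
     * screener needs is up to date -- so parser lookahead or
     * backtracking never stalls the scanner. */
    struct token next_token(void)
    {
        return screen(buffer[cursor++]);
    }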
However, in the C/C++ case, screening can be delayed until parsing.
In your Ada example, it sounds like some parts of screening can be
delayed until after parsing. The key issue with screening is that
it needs a symbol table (dictionary) and not just a simple stack;
thus, you cannot convert a screener directly into a context-free
grammar. My favorite solution to this problem is the one Terence
Parr came up with, "semantic predicates", which are bits of code
that run at specific parts of the parsing process (i.e. when the
parser gets to a specific point in the grammar) and provide a
go/no-go answer. So, one could write a semantic predicate that
queries the symbol table, asking "in the current context, is this
identifier a typedef?", and then, depending on whether you wanted
typedefs to be legal at that point, the parse would either continue
with the current rule or choose another alternative. Note that
semantic predicates are essentially equivalent to attribute
grammars, so attribute grammars are an alternate solution.
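In a hand-written recursive-descent parser, that predicate might
look like the sketch below (the grammar-level feature is Parr's;
this C rendering and every helper named in it are hypothetical):

    #include <stdbool.h>

    /* Hypothetical helpers a real parser would provide. */
    bool        lookahead_is_identifier(void);
    const char *lookahead_text(void);
    bool        scope_has_typedef(const char *id); /* symbol table */
    void        parse_declaration(void);
    void        parse_expression_statement(void);

    /* The predicate gates which alternative the parser commits to:
     * exactly the go/no-go answer described above. */
    void parse_statement(void)
    {
        if (lookahead_is_identifier()
            && scope_has_typedef(lookahead_text()))
            parse_declaration();          /* "x * y;" declares y  */
        else
            parse_expression_statement(); /* "x * y;" multiplies  */
    }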
Hope this helps,
-Chris
*****************************************************************************
Chris Clark                  Internet : compres@world.std.com
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
23 Bailey Rd                 voice    : (508) 435-5016
Berlin, MA 01503 USA         fax      : (978) 838-0263 (24 hours)