Re: A Grammar Writing Question

Chris F Clark <cfc@shell01.TheWorld.com>
Sat, 28 Jul 2007 13:44:55 -0400

From comp.compilers

Related articles
Re: Grammar for optional elements cfc@shell01.TheWorld.com (Chris F Clark) (2007-06-21)
A Grammar Writing Question lowell@coasttocoastresearch.com (Lowell Thomas) (2007-07-23)
Re: A Grammar Writing Question cfc@shell01.TheWorld.com (Chris F Clark) (2007-07-26)
Re: A Grammar Writing Question gah@ugcs.caltech.edu (glen herrmannsfeldt) (2007-07-26)
Re: A Grammar Writing Question stephenhorne100@aol.com (Steve Horne) (2007-07-27)
Re: A Grammar Writing Question DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-07-27)
*Re: A Grammar Writing Question cfc@shell01.TheWorld.com (Chris F Clark)* (2007-07-28)**
Re: A Grammar Writing Question schmitz@i3s.unice.fr (Sylvain Schmitz) (2007-07-29)
A Grammar Writing Question lowell@coasttocoastresearch.com (Lowell Thomas) (2007-08-07)

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	Sat, 28 Jul 2007 13:44:55 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	<07-06-053@comp.compilers 07-07-081 07-07-093 07-07-095
Keywords:	parse, C++
Posted-Date:	28 Jul 2007 15:29:17 EDT

glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> Chris F Clark wrote:
>
> (snip regarding C++ parsing)
>
>> Worse, I think that some of the devious examples he came up with
>> required not only semantic input (i.e. which identifiers were
>> declared as what coming into the sequence), but could change which
>> identifiers were being declared (as well as what they were being
>> declared as).
>
> I am not so sure what features you are considering, but typedef does
> complicate the separation of lexing and parsing. Do you have any
> examples of the ambiguous cases?

This is my recollection of the examples he was having trouble with,
which did essentially all have to do with whether a typedef (or class
name) was in scope or not, and the varieties of declarations
(particularly prototypes), combined with function calls (especially
constructor calls), casts, and assignments (the = operator) versus the
= which beings an initialization list.

Essentially, it was hard to tell a function prototype (which is a
declaration) from a function call. Inside a prototype, one might use
typename variable name to be a local parameter declaration
(introducing a new copy of that variable name) or a cast of the value
of variable (if the variable name was legally declared outside) to the
type described by the typename (an expression). Thus, you had to keep
parsing to see if one of those choices caused the entire containing
epxression to be invalid. Thus, you had to exhaustively keep that
ambiguity undetermined until you reached the conclussion of the
statement, in which case you could determine that it was a
declaration.

Now, of course, in protoptype declarations, you can have nested
prototype declarations, and in expressions you can have nested
expressions. I can't recall whether you can nest either within the
other, but I believe there is a case where you can do so (perhaps a
prototype declaration can be used as part of a cast, similarly an
expression can be used in an initializer which is part of a
declaration).

In the worst case, if you can nest both within each other, then you
can describe an arbitrary list of yes/no decisions that all depend on
the final context. In the best case, if one can not nest either
within the other (in a meaningful way), you get a long chain of yes/no
decisions (that all resolve to the same answer), but again depend upon
arbitrary context.

In addition to the above, Jim showed that certain features were
"locally unhelpful". For example, one can wrap a declarator (and an
expression) in an arbitrary nesting of parenthesis with no effect.
thus, even in the simple case, one could easily bound any fixed
lookahead simply by wrapping the innermost things in deeper and deeper
nests of parenthesis.

As I recall, Jim worked through these examples and developed a grammar
which for a certain fixed amount of ambiguities (I recall it was 4,
but that was long ago) could accurately resolve them using LALR
techniques (that's all encoded in his Yacc-able C++ grammar).

He even went to the committee with his results and made some specific
recommendations with things which could be fixed to may the problem
less onerous. However, the problem was dismissed as being too
pedantic. After all there was C-front which was the defacto standard
(and also g++ and several other C++ compilers, most of which mostly
came up with the same answers for the same examples as long as the
nesting wasn't too complex).

Moreover, one shouldn't fault the committee too much, for any given
finite input there is exactly one correct interpretation of what it
means in C++. Moreover, the rule that descriminates the choices is
simple to state. Just because one can't describe that in an LR
grammar is not the worst offense.

In fact, I only have one problem with that attitude, which goes to the
root of this thread. The (declaration overrides expression) rule may
be simple to state. However, it puts us in the position where we
can't independently verify correctness. This is the same problem I
have with PEGs (OCP). One has to trust the author that one has
covered all the cases (and only the cases). Only the linear ordered
evaluation of the rules produces the desired result. In the C++
cases, that means that there are fragments of the language which one
can't look at (independent of context) and determine what they mean.

Context switch (getting general again):

In fact, I have that one of the chief values of LL parsing (even over
LR parsing, which I prefer to use as an implementor). If you know the
entire grammar is LL, then for any given fragment you know exactly how
it will parse independent of context (i.e. independent of other rules
that may be in effect at the time). With LR, one only has the weaker
claim, that if you have a fragment and know the context, you can tell
exactly how it will parse (and that's because LR allows one to use
tokens of the context in resolving how things will parse). Moreover,
because of the theoretical underpinning, you know that these
properties hold wherever the fragment is used.

Now, a PEG seems to be closer to the LL case than the LR, because you
don't need following context. However, you do have to worry about the
disjointness issue, which LR protects you from. In an LR grammar, two
non-disjoint rules are never used in a context where they cannot be
disambiguated in context.

Maybe that suggests a good choice, if you can't make your grammar LL
is to make certain that it is both LR and a valid PEG. That is,
develop a way to show that if you treated the LR grammar as a PEG
(i.e. that the rule ordering enforced the same derivation that an LR
grammar would make), then you would have a reliable method that was
broader than LL and yet still had both local and global readability
properties. I think some bright PhD student could solve that problem
by next year. (And, if some student (or anyone for that matter) wants
to attack that problem, I'll be happy to consult, exchange ideas,
whatever to help them along.)

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: A Grammar Writing Question

Chris F Clark <cfc@shell01.TheWorld.com>Sat, 28 Jul 2007 13:44:55 -0400

Chris F Clark <cfc@shell01.TheWorld.com>
Sat, 28 Jul 2007 13:44:55 -0400