Re: C++ Grammar - Update

"Mike Dimmick" <mike@dimmick.demon.co.uk>
3 May 2001 13:40:45 -0400

From comp.compilers

Related articles
C++ Grammar - Update mike@dimmick.demon.co.uk (Mike Dimmick) (2001-04-26)
Re: C++ Grammar - Update loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-04-30)
Re: C++ Grammar - Update idbaxter@semdesigns.com (Ira D. Baxter) (2001-05-03)
*Re: C++ Grammar - Update mike@dimmick.demon.co.uk (Mike Dimmick)* (2001-05-03)**
Re: C++ Grammar - Update gahide@lil.univ-littoral.fr (Patrice Gahide) (2001-05-03)
Re: C++ Grammar - Update michael_spencer@btclick.com (Michael Spencer) (2001-05-07)
Re: C++ Grammar - Update michael_spencer@btclick.com (Michael Spencer) (2001-05-13)
Re: C++ Grammar - Update loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-05-13)

| List of all articles for this month |

From:	"Mike Dimmick" <mike@dimmick.demon.co.uk>
Newsgroups:	comp.compilers,comp.compilers.tools.pccts
Date:	3 May 2001 13:40:45 -0400
Organization:	Compilers Central
References:	01-04-141 01-04-155
Keywords:	C++, parse
Posted-Date:	03 May 2001 13:40:44 EDT

"Martin von Loewis" <loewis@informatik.hu-berlin.de> wrote in message
> "Mike Dimmick" <mike@dimmick.demon.co.uk> writes:
[...]
> > one can follow the technique of Ed Willink
> > (http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.html)
>
> Please note that the goal of that parser is restricted to parsing
> declarations only (see 4.4, Ambiguity resolution).

I don't think you read it fully. That section deals only with
versions up to that _prior_ to the superset method detailed in the
following subsection and the remainder of the thesis. The full
parser, as implemented in appendix B (as a YACC grammar) certainly
covers declarations, expressions and statements.

> It seems that the parser accepts a *very* large superset of C++,
> e.g. the provided Solaris binary accepts
>
> void foo(){
> +
> }
>
> without complaints. So I still doubt that you can do meaningful C++
> parsing w/o semantic analysis in the lexer.

I hadn't noticed that; perhaps you should report that as an error!
The two arms of an additive_expression should not be epsilon producing
but clearly are. It may be that something's going wrong with the
error suppression technique on backtracking, but my exact recollection
of the method used is hazy. It involves a separate piece of code
steering the parser by pushing a token onto the input stream depending
on whether a template was found or not, or whether a member-specifier
was found or not. I seem to recall that Willink used '+' as the
symbol in one of these cases.

I rejected Willink's method mainly due to my unfamiliarity with YACC
and that I could not understand his method of resolving the
constructed syntax trees - you will note that the syntax tree
construction code is missing from the YACC grammar and as such it is
quite difficult to follow his explanation of forming hypotheses about
the semantic particle each subtree represents. The semantic
information is still needed, left-to-right, to perform correct
analysis of the C++ program; however, this can be deferred until the
program is complete. This helps in resolving some issues in class
declarations, where inline member functions can refer to items
declared after the function definition, providing there is nothing in
a surrounding scope that can conflict with it.

[...]

It is true that semantic information is required in order to parse
correctly (if Willink's method is discarded); PCCTS/ANTLR do not force
it as part of the lexer, or even as a layer between the lexer and
parser. The problem with this approach is that the parser rules must
be duplicated for every valid possibility, and in C++, there are a lot
of possibilities for classification of an identifier. In some cases,
plain 'identifier' is what the rule requires, which means the rule
must be duplicated for each of the ten or so classifications.

The PCCTS/ANTLR approach (also followed by Yacc++, I believe) is to
allow the parser to call arbitrary functions to perform the exact
lookup where it is necessary. This ends up being compiled into part
of the condition of an if statement necessary to steer the parse.
Essentially this means asking a specific question - 'is this
identifier a class name?' - to parse accurately.

> > Qualified names are another circumstance which require unlimited
> > semantic lookahead. This is due to template names with attached
> > argument lists being permitted in a qualified name.
>
> There are actually ambiguities in this area, consider
>
> class X{
> friend A::B::C();
> };
>
> Is this ::C, returning A::B, or is it ::B::C, returning A? This is
> currently an ambiguity in C++, which is not resolved in the '98
> edition of the standard.

I'm not sure that is actually an ambiguity - it depends on the types
in question. Remember that it says that '[a] type definition
introduces a new keyword.' If A is a class or a namespace, and B is a
class or namespace, it names the function C() in A::B. It's then a
syntax error by my terms because there's no return type. However, if
A does not name a class or namespace, it's ::B::C returning A. If A
names a class or namespace but B names a type that isn't a class or
struct (and is therefore a union, an enum, or a typedef that is itself
a union, an enum, or a built-in type - see what I mean about
classification?), it's ::C returning A::B.

In the first case, if there is no C() in A::B, it's a semantic error.
C++ parsing involves a fair amount of 'in what possible way could this
be correct?'.

I note that section 7 para 2 of the standard says: "The longest
sequence of _decl-specifiers_ that could possibly be a type name is
taken as the _decl-specifier-seq_ of a declaration."

> > It is necessary to resolve the exact instantiation of the template
> > to determine whether the contents of the template themselves name a
> > class (in which case a following "::" should continue the qualified
> > name).
>
> You mean, to see whether
>
> A<k>::B
>
> is a typename or not? In C++, it is never a typename; to make it a
> typename, you have to write
>
> typename A<k>::B

That depends on whether A has been previously declared. If you have
the declaration

template <int c>
class A {
        enum B { D = c, E };
};

then A<k>::B is a type.

If you had

template <int c>
class A {
        class B {
                void f() { }
        };
};

A<k>::B would be a type and also would be a scope, so that
A<k>::B::f() can be looked up. The typename keyword is not required
in this context (because the compiler is supposed to be able to
resolve that A<k>::B is a type through lookup).

If on the other hand you have

template <template <int k> class A>
class C {
        void f() {
                typename A<k>::B D;
        }
};

'Typename' is needed to indicate to the compiler that A<k>::B names a
type (that is, that B in A<k> is a type, which it cannot know). See
for example table 32 of the standard (section 20.1.5, allocator
requirements) for

typename X::template rebind<U>::other

which casts that 'other' is a type living in the template 'rebind'
with parameter U, which is a member template of class X. This cannot
be known in the context of template parameters, for which a full
declaration is not seen until the template is instantiated.

> > I believe I have previously posted on at least one of these two
> > newsgroups regarding the rule in the standard which requires this
> > behaviour; it can be summarised as "the members of one instantiation
> > of a template need bear no relation to any other instantiation of a
> > template." This leaves us in the ridiculous situation of requiring
> > full template instantiation and expression evaluation in order to
> > produce an AST.
>
> That is surely not the case. Whether something is a typename or not
> can be determined without instantiation.

I refer you to John Lilley's parser at
http://www.empathy.com/pccts/download.html. In the archive (choose
Zip or GZipped TAR at your discretion) you should find a Word 95
document titled 'Theory_and_operation.doc'. I quote from it:

"The second and most important reason that template specialization is
horrible, is that it requires the syntax phase to perform
near-complete type, scope, and expression analysis, including function
overloading. The original CFront-style templates were not so bad, but
the introduction of template specialization made templates truly
awful."

The rule in the standard that requires this behaviour is embedded in
the middle of section 14.7.3, paragraph 5, and reads:

"The definition of an explicitly specialized class is unrelated to the
definition of a generated specialization. That is, its members need
not have the same names, types, etc. as the members of the a generated
specialization."

[...]
> So out of curiosity: What does your parser with my friend example
> above?

It depends on the prevailing surrounding declarations, as described
above. I should have said 'usually resolvable'. In the particular
quoted paragraph I only intended to cover the case where the scope
resolution operator leads a name, which can also be ::new or ::delete
to invoke the global allocation or deallocation function. This
requires two tokens of lookahead or a left-factor of the grammar,
'remembering' that the scope resolution operator has been seen.

> > The "declaration specifiers" rule (decl-specifiers) has been modified
> > to accommodate only one user-defined type or a sequence of built in
> > types. This is slightly complicated by the fact that modifiers may be
> > interspersed between the built-in types (e.g. "unsigned const long
> > static int") but this removes the problem of whether a name in a
> > declaration is the type or the declarator. This decision was taken
> > because the C++ standard has now disallowed implicit 'int' - and
> > therefore all declarations must be "type-name declarator-list;".
>
> I think this is also an error in the FOG thesis: The only case where
> the decl-specifier-seq can be ommitted is the constructor/destructor;
> so I can't see why "i=0;" is ambiguous.

Because Willink has chosen to support the moderately large amount of
C++ code that was written when implict int was permitted. It's only a
recent drop from the standard. As such, any declaration is:

{ decl-specifiers } declarator { '=' initialiser } ';'

where { } represents optionality.

The declarator syntax in C++ is extremely strange with different
precedences being applied to the declarative operators '*', '(' ')',
'[' ']' (which are applied to the declared name, which is one reason
it's so hard to extract). In order to get a different precedence, one
must use parentheses to change it. We've all seen this and written
this with the use of function pointers (well, if we've used C or C++
function pointers, we have). Therefore the only rule that can work
for declarator is one which resolves these operators and the
precedence-altering parens correctly.

Therefore, the two declarations

int i = 0;

virtual int i() = 0;

conflict, as for the first we get

decl-specifiers = 'int'
declarator = 'i'
initialiser(constant-expression) = '0'

and the second

decl-specifiers = 'virtual int'
declarator = 'i()'
initialiser(pure-specifier) = '0'

But the parser can't tell, without using context information, that a
virtual function is being declared. This is because 'virtual' is just
a decl-specifier, it isn't a different form of declaration in its own
right. The second could equally be written

int virtual i() = 0;

with the same semantic meaning. Therefore pure-specifier and
constant-expression conflict, and the only thing to do is to perform
resolution at a later stage.

--
Mike Dimmick

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: C++ Grammar - Update

"Mike Dimmick" <mike@dimmick.demon.co.uk>3 May 2001 13:40:45 -0400

"Mike Dimmick" <mike@dimmick.demon.co.uk>
3 May 2001 13:40:45 -0400