Re: parsing C and C++, was Compiler Compiler Compiler

"Mike Dimmick" <mike@dimmick.demon.co.uk>
31 Mar 2001 02:55:11 -0500

          From comp.compilers

Related articles
Compiler Compiler Compiler danwang+news@cs.princeton.edu (Daniel C. Wang) (2001-03-22)
Re: Compiler Compiler Compiler mike@dimmick.demon.co.uk (Mike Dimmick) (2001-03-26)
Re: Compiler Compiler Compiler kszabo@nortelnetworks.com (Kevin Szabo) (2001-03-27)
Re: parsing C and C++, was Compiler Compiler Compiler loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-03-31)
Re: parsing C and C++, was Compiler Compiler Compiler mike@dimmick.demon.co.uk (Mike Dimmick) (2001-03-31)
Re: parsing C and C++, was Compiler Compiler Compiler mike@dimmick.demon.co.uk (Mike Dimmick) (2001-03-31)
| List of all articles for this month |

From: "Mike Dimmick" <mike@dimmick.demon.co.uk>
Newsgroups: comp.compilers
Date: 31 Mar 2001 02:55:11 -0500
Organization: Compilers Central
References: 01-03-095 01-03-122 01-03-133
Keywords: parse, C
Posted-Date: 31 Mar 2001 02:55:11 EST

"Kevin Szabo" <kszabo@nortelnetworks.com> wrote in message
> I've never tried to parse C/C++. Could you give an example or two
> of the problems (or point me to a reference).


I realise yesterday's message may not have been as helpful as it could
have been. I apologise now for the length of this reply; it is likely
to form the basis of a section in my project report, so I've gone into
fairly hairy detail.


Most of the problems with C/C++ parsing stem from the fact that in C,
names for structs, unions and enums occupy one namespace, while names
for objects and functions occupy another. However, you can of course
use "typedef" to create a name in the object/function namespace for a
type in the struct/union/enum namespace.


This is additionally complicated by the fact that typedef is _not_ a
syntactically distinct concept - it isn't a declaration, it's just a
declaration-specifier, which means that


struct node_s {
        int item;
        struct node_s *next;
} typedef node;


is a perfectly valid (although irregular to our eyes) declaration of
both a struct node_s in the struct namespace and an equivalent type
name 'node' in the unqualified namespace. This is a factor of the
syntax being permitted to perform as much as possible in one step,
including lots of qualifiers. Unfortunately this tends to end up
looking like noise.


Likewise the positions of all the other declaration-specifiers are
completely arbitrary. One cannot declare two classes in the same
declaration, semantically, but the syntax (as specified in the
standard, at least) permits it. In my own grammar, I intend to
restrict it to


decl-specifiers type-specifier decl-specifiers


with the type-specifiers being either ONE user-defined type or
multiple built-in types followed by decl-specifiers, because you are
permitted to have


int const long === const int long


Again, I'm not sure how many people actually _write_ declarations like
that, but it must be supported, because the standard says that
declarations like that are permitted. I don't think C99 is helping
much with the introduction of 'long long' - while it's parsable, a
single keyword would have helped more.


Because of the 'typedef' situation, and the form of C declarations,
type information is needed in the parse (proceeding as always from
left-to-right) to resolve:


node (x);


If 'node' is a type (for example, the one above), this declares 'x' to
be of 'node', with redundant parentheses, but if 'node' is not a type,
this declares a function 'node' with a single parameter 'x'
(implicitly 'int') and a return type of int (also implicit). Without
the parentheses, it would have been a syntax error (declaration of
'node' as int, missing comma, declaration of 'x' as the same type as
'node').


The implicit 'int' has been deprecated in the C++ standard, which
might mean it could be removed in a number of years (and possibly not
introduced in new tools) which may give assistance.


Moving on to C++, the first additional problem occurs because the
structs, unions and enums (and classes, of course) can now be used
without qualification - they occupy the same namespace as the objects,
functions and typedefs. This makes the declaration at the top
basically unnecessary. However, to ensure compatibility with C, it's
still possible to declare an object or a function with the same name
as a type. This ensured compatibility with AT&T UNIX, which declares
a function stat() in addition to a "struct stat".


Therefore, a different resolution operator was required to indicate to
the compiler that it should be looking at nested types (another C++
introduction over C) as opposed to objects. The scope resolver "::"
was introduced into C++ - it's unnecessary for almost every other
language with nested types and packages (for example, '.' suffices for
Ada and Java, for both purposes).


The next major ambiguity is due to the relative positioning of
declarations and expressions. In C++ declarations and expressions can
be mixed freely as statements in the body, but due to the redundant
parenthesising and the ability to call a constructor directly, one
must make a guess. This tends to take the form of 'try to parse a
declaration, if you can't, it must be an expression.' Essentially
this works out to marking your place in the input, trying to parse,
then rewinding if incorrect. This suits syntactic predicates in PCCTS
(and Yacc++, from my understanding). Doing this in yacc itself
requires additional work - see Ed Willink's thesis for more
information on the technique.


Templates introduce more problems. Because the '<' symbol was used to
open a template argument list, and '>' to close it, type and template
information is needed to determine whether


name1<object1, object2> name2;


means 'name2 is an object of type name1 with parameters object1,
object2' or whether it means 'evaluate name1 < object1, discard the
result, then evaluate object2 > name2 and use that as the result.' In
a qualified name, both template and type information are needed to
resolve


a::b::c<d, e>::f::g h;


If a and b are namespaces or classes, c is a template, and when d and
e are passed as parameters to instantiate a::b::c<>, it contains a
class f and a type g, h is declared of the type 'a::b::c<d, e>::f::g.'
If some of that doesn't hold, it could be a comparison, a definition
of a static object 'g' in a global class 'f' of type 'a::b::c<d, e>',
and a number of other possibilities.


Note the reference to 'when d and e are passed as parameters to
instantiate a::b::c<>'. This cannot be determined from a single
canonical version of the template, because C++ permits specialisations
of templates. The C++ standard says that "The definition of an
explicitly specialized class is unrelated to the definition of a
generated specialization. That is, its members need not have the same
names, types, etc. as the members of the a generated specialization."
(section 14.7.3, paragraph 5). This extremely irritating rule
requires evaluation of template parameter expressions and selection of
an appropriate instantiation of a template, AT PARSE TIME.


Much of this could have been avoided by a more tightly defined rule (I
have suggested one on another newsgroup, comp.compilers.tools.pccts)
and either using a different symbol for introducing template
parameters or by requiring the keyword 'template' to introduce any use
of a template name followed by parameters. It's required at present
to disambiguate cases where the compiler just cannot work it out.


There's a slight lexical ambiguity too: consider 'name1<name2<param>>'
This will not parse correctly, because '>>' is the right-shift
operator. To parse correctly, the final '>'s must be separated with a
space.


The final source of ambiguity in C++ I will mention here is an obscure
one between bit-fields and nested classes. Willink gives a good
description of this - his source was Jim Roskind's grammar of 1991.
An example:


class A { class B :


could begin an anonymous bit-field or a nested class, depending on the
following token(s):


const int C = 3;
class A { class B : C, D, E = 5; };


(there is an anonymous bit-field of size 3 of type 'class B'; D and E are
variables of the same type)


class C {};
class D {};
class E {};
class A { class B : C, D, E {}; };


(A::B is a nested class inheriting privately from C, D and E).


This is another problem that appears to require backtracking.


It's no real wonder that there are very few simple tools around for
C++. And practically no free ones. This job is just way too hard for
casual use!
--
Mike Dimmick


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.