C++ Grammar - Update

"Mike Dimmick" <mike@dimmick.demon.co.uk>
26 Apr 2001 21:18:56 -0400

          From comp.compilers

Related articles
C++ Grammar - Update mike@dimmick.demon.co.uk (Mike Dimmick) (2001-04-26)
Re: C++ Grammar - Update loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-04-30)
Re: C++ Grammar - Update idbaxter@semdesigns.com (Ira D. Baxter) (2001-05-03)
Re: C++ Grammar - Update mike@dimmick.demon.co.uk (Mike Dimmick) (2001-05-03)
Re: C++ Grammar - Update gahide@lil.univ-littoral.fr (Patrice Gahide) (2001-05-03)
Re: C++ Grammar - Update michael_spencer@btclick.com (Michael Spencer) (2001-05-07)
Re: C++ Grammar - Update michael_spencer@btclick.com (Michael Spencer) (2001-05-13)
[1 later articles]
| List of all articles for this month |
From: "Mike Dimmick" <mike@dimmick.demon.co.uk>
Newsgroups: comp.compilers,comp.compilers.tools.pccts
Date: 26 Apr 2001 21:18:56 -0400
Organization: Compilers Central
Keywords: C++, parse
Posted-Date: 26 Apr 2001 21:18:56 EDT

Some of you are probably aware that I have been trying to develop a
C++ parser grammar using PCCTS. One person even e-mailed me to ask if
he could have a copy! I also know that a number of others
periodically ask whether such a beast exists. Here's how I've been
getting on. A quick summary would be: too many ambiguities, too
little remaining time; abandoned in favour of using Java as the
language to process. [1]


I started pretty much from scratch, aiming to be LL(1) where possible
(although this was before I read Terence Parr's paper "LL and LR
Translators Need k > 1 Lookahead", available at
http://www.antlr.org/papers/needlook.ps - LL(1) may make the parser a
little smaller but it _really_ makes things complicated). I did have
John Lilley's C++ grammar and hacked PCCTS available for study, but
couldn't actually understand it. It's too big to understand in one go
(not John's fault), and lacks a real overview of exactly how he's
broken it down into rules. It doesn't help that he's sorted the rules
alphabetically as opposed to breaking them down in some logical
scheme, so far as I can see. I also didn't want to rely on a tool
that had no formal documentation accompanying it; PCCTS stock release
documentation is good, John's was not clear exactly what changes he
had made.


As I have proceded, I have included semantic predicates for
resolution where necessary; however, I never actually completed the
grammar such that ANTLR produced no warnings. The actual predicate
functions and symbol table are unimplemented. Semantic actions
performing symbol table manipulations have also been omitted. It is
therefore unsurprising to say that testing was never performed upon
this grammar.


C++ is big. Very big. The grammar works out (in its incomplete
state) at 2,528 lines (including comments, token definitions, etc).
That doesn't include statements - I cut down the grammar to reduce the
number of reported ambiguities. It should perform brace matching when
it hits a function body to steer it correctly to the next declaration.
This mostly removed the problem of having to handle the well-known
declaration/expression ambiguities.


The major reported problem with the C++ syntax is that it requires
semantic information to parse correctly. This isn't strictly true,
one can follow the technique of Ed Willink
(http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.html)
who uses purely syntactic methods. However, this technique needs a
lot of subsequent analysis to correct the misparsed syntax trees, and
requires backtracking, and also needs some complicated binary-search
methods to resolve template usage into a consistent AST. [2] I
therefore decided to stick with the classic method of producing a
symbol table "on the fly" as it were.


The real problem I encountered was the number of constructs which
require unlimited lookahead to resolve. PCCTS provides the syntactic
predicates to try to resolve these situations, but I had reckoned
without the requirement to build and traverse the symbol table. One
must know whether a construct names a type in order to correctly parse
in some circumstances. For example, in a cast-expression, the cast
notation is similar to that of a parenthesised expression. If it is a
type, a following '*' or '&' is part of the type-id (pointer and
reference respectively); however, if not, it is the equivalent binary
operator (multiplication, binary AND). The two require somewhat
different ASTs. To determine whether a construct names a type, one
must inspect the last name in a qualified identifier in the
appropriate scope, which requires traversing the symbol table.


PCCTS syntactic predicates do not execute actions whilst in guess
mode. They do execute init actions, but I could not formulate a
method which performed all the required traversal solely in init
actions (this would have been very strange and non-intuitive in any
case).


Specific ambiguities that were handled:


The template syntax specifies that the first encountered ">" symbol
inside a template argument list (but not inside a parenthesised
expression) marks the end of the list, and is not a relational
operator. The grammar was not duplicated as for example Willink
demonstrates; instead a boolean argument to each rule between
"relational_expression" and "expression" was used to control a
semantic predicate in relational_expression to treat ">"
appropriately. I am not entirely sure that this was the right
decision - testing may have revealed a problem with this approach.


I had originally thought that there would be conflicts between the
name of a destructor "~X" and a unary complement expression "~ x". I
consulted the standard on this issue and determined that in fact "~X"
on its own is always considered the complement of X; the destructor
can only be called through explicit qualification or through one of
the member access operators. I therefore had two rules for
'unqualified-id', one for use after leading qualified names and member
operators, and another for use without.


Qualified names are another circumstance which require unlimited
semantic lookahead. This is due to template names with attached
argument lists being permitted in a qualified name. It is necessary
to resolve the exact instantiation of the template to determine
whether the contents of the template themselves name a class (in which
case a following "::" should continue the qualified name). I believe
I have previously posted on at least one of these two newsgroups
regarding the rule in the standard which requires this behaviour; it
can be summarised as "the members of one instantiation of a template
need bear no relation to any other instantiation of a template." This
leaves us in the ridiculous situation of requiring full template
instantiation and expression evaluation in order to produce an AST.


C++ name resolution is complicated by the fact that the global
namespace has no name; it is referred to by prefixing a name
(qualified or not) with the scope resolution operator "::". This
causes more ambiguities resolvable by left-factoring the grammar.


As previously mentioned, resolution of a name between a type-id and an
expression is necessary in some places (cast, "typeid" clause,
"sizeof" clause, template argument). This has been done by
left-factoring type-id and expression, duplicating _all_ the
expression rules and removing the left-hand side. An example (code to
reattach the left-hand side in the appropriate place has not yet been
added):


name_inclusive_or_expression [bool inTemplate]:
                name_exclusive_or_expression [inTemplate]
                        ( BitOr exclusive_or_expression [inTemplate] )*
        | ( BitOr exclusive_or_expression [inTemplate] )+
        ;


"new" expressions also cause some difficulties. The new-type-id may
be a parenthesised type-id, which is syntactically similar to a
new-placement. This is resolved using the type-id-or-expression rule
to determine which it was; if a type, we are then expecting a possible
parenthesised initialiser, if not, the next item must be the type-id,
followed by a possible initialiser.


The "declaration specifiers" rule (decl-specifiers) has been modified
to accommodate only one user-defined type or a sequence of built in
types. This is slightly complicated by the fact that modifiers may be
interspersed between the built-in types (e.g. "unsigned const long
static int") but this removes the problem of whether a name in a
declaration is the type or the declarator. This decision was taken
because the C++ standard has now disallowed implicit 'int' - and
therefore all declarations must be "type-name declarator-list;".


Function bodies have been treated as a special form of initialiser.
This is because of the attachment rules for the parameter list of the
function declarator and the use of redundant parentheses; "func()" is
considered a declarator in C++. This causes problems in
enum_specifier and class_specifier due to the "operator <type-name>"
conversion declarators which are not required to have an attached
argument list. This could probably be resolved by duplicating rules.


Elaborated class and enum specifiers (where the keyword "class",
"struct", "union" or "enum" qualifies the name of the type itself)
have been absorbed into the same rule as the equivalent definitions
(due to their leading context).


This one was the final straw - I realised at this point that I still
had 21 outstanding warnings and little time to complete the project.
A declarator is an optional sequence of pointer operators followed by
possibly qualified identifier. A pointer-operator may be a
pointer-to-member operator which has the format:


{"::"} (namespace_name "::")* (class_name "::")+ "*"


and conflicts with the qualified identifier naming the declarator itself.


This is deducible with left-factoring, however, it became clear that
the repetition was going to cause problems.


Incidentally, the reason that a declarator may be a qualified name is
to permit the separate initialisation of class statics. Prior to one
of the very latest drafts of the standard, this was required; a static
class member field could not be initialised within the class
declaration itself. It was supposed to be declared once only in a
single translation unit, all others linking to that translation unit.
This ensures a single copy of the static shared between all instances
of the class.


Once I have checked the legal position (I have a copyright agreement
with my University that copyright on any work carried out towards my
degree is assigned to them; a fait accompli presented when I enrolled)
I may post my (assuredly incomplete) grammar on a website somewhere,
should anyone else wishes to study it or make use of it. I would
expect that I will place a limited license similar to that for zlib on
it, if I do so.


There are further ambiguities I haven't detailed here. These are
mostly well known, see Willink's thesis (chapter 4 and appendix F) for
more details, if interested.


To solve the remaining ambiguities, I suggest that at least one of the
following techniques is employed:


* Rewriting the code generator such that actions are executed in
      guess mode; adding support so that erroneously executed actions
      can be backed out [3]. This inevitably is likely to be prone to
      error.


* Manually handling guesses by marking the token buffer, then unmarking
      in case of success [4] or rewinding the buffer and trying an alternative
      in case of failure.


* Further left-factoring of the grammar.


* Multiple stage parsing.


I conclude that C++ requires some very strong parsing methods if one
is to be successful. An LALR(1) grammar is sufficiently strong if
actions are placed very carefully so as not to interfere with the
actual parsing method. My experience of this is minor, but it
appears, with the tool I have experienced (bison) that any ambiguities
introduced by doing so are quite difficult to resolve. The problem
with the freely available tools is that they require that every use of
an identifier is translated into a distinct token type depending on
its semantic meaning (having no equivalent of ANTLR/PCCTS's semantic
predicates). YACC++ is said to have these but it costs a fair amount
of money, which I and my department do not have. An LL(k) parser,
with any fixed amount of lookahead, cannot resolve many C++ constructs
even with semantic information. Both tools require a great deal of
manipulation of the grammar which makes it extremely hard to
understand.


I shall conclude this article by repeating a statement I have made
before. I recommend that the syntaxes of C and C++ are no longer
considered as a basis for developing a new programming language.
Parsing them for a simple tool (and even for a complex tool) is far
harder than it should be.


--
Mike Dimmick
Final Year Undergraduate, Aston University, Birmingham, UK.


[1] The ultimate project is to produce a tool which processes a source
program (language to be determined by student, suggested C++ or Java)
to produce a UML class diagram of the static structure of the program.
I'm now using ANTLR 2.7.1 with the supplied example Java grammar,
which is LL(2). I still hold my prejudices against Java as an actual
working language; I'm using the version which compiles to C++. This
is also because I don't know sufficient Java to use JFC, Swing or the
AWT; I _do_ know Microsoft Foundation Classes, so I need a parser that
can link with a C++ program, which I know no way to do with the Java 2
SDK. No suggestions please, this needs to be completed by Monday...
This post is helping me with my documentation, so I'm not skiving.


[2] The problem comes with expressions as template parameters. C++
permits full generality of constant expressions, so the use of '<' as
the less-than operator is permitted. This leads to trouble with:


a::b<c < d>::e


where correct resolution must be either "a::b is less than c<d>::e" or
"a::b< (c < d) >::e [is a primary expression]" (assuming one of the
two is a template). Willink's method requires that each "Identifier
LT" combination is processed as a template argument list, backing out
and retrying as an expression if it could be an expression. I
couldn't work out exactly how this worked (the code being interwoven
between the lexer and parser, in order to force the appropriate
resolution symbol into YACC's front end) nor how to translate it to
PCCTS. I had already put some considerable study into the choice of
parser generators by this time and considerable time into
understanding PCCTS properly.


[3] This may be a worthwhile addition to ANTLR and/or PCCTS in any
case; requiring that all actions are implemented as command objects
supporting backing out if required with some standard interface. This
could easily add true language-independence so long as other minor
differences (e.g. Java "." vs. C++ "->" with regard to use of named
tokens) were in some way removed.


[4] A PCCTS implementation facet. While there are outstanding marks
on the buffer, PCCTS cannot garbage-collect tokens from the buffer, as
token pointers might be needed in a semantic action. By default,
because it always rewinds the buffer to do the action pass where
syntax predicates are involved, it treats a rewind() as unmarking the
buffer and reducing the mark count. Adding an unmark facility
requires subclassing the token buffer class.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.