New C++ grammar release

jar@hq.ileaf.com (Jim Roskind x5570)
Thu, 11 Jul 91 18:28:27 EDT

          From comp.compilers

Related articles
New C++ grammar release jar@hq.ileaf.com (1991-07-11)
| List of all articles for this month |
Newsgroups: comp.compilers
From: jar@hq.ileaf.com (Jim Roskind x5570)
Keywords: C++, C, parse, code
Organization: Compilers Central
Date: Thu, 11 Jul 91 18:28:27 EDT

This is a cross posting notification of a new release of my C/C++
grammar. This release also contains a paper (autodoc5.txt) that
discusses LR and LALR, and LALR-only conflicts, and might prove quite
interesting to some of the comp.compilers readers. Some of the
analysis in that paper also addresses a recent question in this forum
about the ratio of the number of states in an LR parser to the number
of states in an LALR parser.


The almost complete release should be available in a posting to
comp.lang.c++. The complete release, which includes a *LARGE* machine
generated file that analyses the conflicts, is available as part of
the packet at each of the ftp sites shown below.


Jim


cut here------------------------------------------------
FILENAME: FREEGRAM5.TXT
AUTHOR: Jim Roskind
                Independent Consultant
                516 Latania Palm Drive
                Indialantic FL 32903
                (407)729-4348
                jar@hq.ileaf.com
                or ...uunet!leafusa!jar


                                                                                                                7/4/91


Dear C++ and C Grammar User,


I have written a YACC debugging tool, and a set of grammars for C and
C++ in order to use them within my own personal project development. I
have made the results of my work in this area available to other
developers at no charge with the hope that they would use my work. I
believe the entire C++ community can benefit from such
standardization. If any of the copyright notices on the grammars
(which are VERY liberal) prevent using my work, please notify me of
the problem.


Note that the grammars can each be processed by YACC, but they are
very clean, and make NO USE of the precedence setting (i.e.: %prec) or
associativity setting (i.e.:%assoc) constructs of YACC. This feature
should make them easily portable to other parser generator input
format. This "cleanliness" fact also provides brutal exposure of all
the complex constructs in C++, and the complexity of the grammar as a
whole (the C++ grammar is 2 to 3 times as large as the C grammar)
reflects this exposure.


The files included in this set are:


        FREEGRM5.TXT This introductory file
        GRAMMAR5.TXT Parsing ambiguities in C++, and in my grammar
        CPP5.Y My YACC compatible C++ grammar
        C5.Y My YACC compatible, ANSI C conformant grammar
        CPP5.L Flex input file defining a C++ lexical analyzer
        SKELGRPH.C A hacked file from the Berkeley YACC distribution
        AUTODOC5.TXT Documentation for my machine generated analysis
        Y.OUTPUT Machine generated analysis of the C++ grammar.


Aside from the addition of several files, this release of my grammar
corrects a few problems located in my prior release. I have also
transitioned to using names in my grammar that are more acceptable to
a wider variety of parser generators. This release also includes
support for nested types (at least grammatically, as there is no
symbol table provided). It does not support templates and exception
handling, as the ANSI C++ Committee is still discussing variations
(and trying to deal with a variety of ambiguities that the initial
proposals, such as what is described in the ARM, would entail).


Since my first public release of my grammar, I have received a number
of requests. One of the most common requests was for a lexical
analyzer to go with the grammar. This release of the grammar
continues to provides such a a "bare bones" lexical analyzer. The
analyzer does not support preprocessing, or even comment removal. In
addition, since I have not included a symbol table, or semantic
actions in the grammar to maintain proper context (i.e., current
scope), typedef names and struct/class/union/enum tags are not
*really* defined. To allow users to experiment with my grammar
without a symbol table, my lexer assumes that if the first letter of
the name is upper case, then then name is a type name. This hack is
far from sufficient for parsing full blown programs, but it is more
than sufficient for experimenting with the grammar to determine the
acceptability of a token sequence, and to understand how my grammar
parsed the sequence.


Since I did not believe that a lexical analyzer alone would be
sufficient to assist many people with playing with my grammar, I have
also provided the basis for a tool to explain what a grammar is doing.
Specifically, I have modified a file that is included in the Berkeley
YACC distribution so that parsers generated by such a YACC would
automatically display a syntax tree in graphical-ASCII format during a
parse. The instructions for using and building this yacc tool are
presented in the next section. Note that there are no significant
special hooks in my grammar or parser to excite this yacc tool, and
the tool can be used equally well on any grammar that you are working
with. This graphical debugging tool is probably one of the most
popular aspects of my releases, and its presence and usefulness to
grammar developers should not be underestimated.


Significantly new to this release is a large file that contains
machine generated documentation (re: Y.OUTPUT). This file goes well
beyond what is provided in a typical verbose output, and provides both
detailed conflict analysis, and a number of cross-references which
make it **MUCH** easier to read the associated grammar. I have not
yet decided whether to market, shareware, or plain give-away my
program, so the best I can do at this point is to release the machine
generated documentation. Unfortunately, this file is *very* large,
and I have decided (for the time being) to distribute it only via the
ftp sites only. I am doing this to lessen the global bandwidth
utilization during my grammar posting to the network. I will however
post the file (AUTODOC5.TXT) which documents the contents of the
Y.OUTPUT file, so that users can decide if they want to download the
larger file. To hint at what is included in the automatic
documentation, the following are the sections:


Reference Grammar
Alphabetized Grammar
Sample Expansions for the Non-terminal Tokens
Summary Descriptions of the Terminal Tokens
Symbol and Grammar Cross Reference for the Tokens
Sample Stack Context and Accessing Sentences for each State
Concise list of Conflicts
Canonical Description of Conflicts
Verbose listing of state transitions
Explanations for all reductions suggested in conflicts


Please see AUTODOC5.TXT for more details.


I have posted 7 of the 8 files to comp.lang.c++ (I will not post
Y.OUTPUT due to its size), to make this information as available as
possible to users and developers. I will also post this introductory
note to comp.compilers, and comp.lang.c. I am arranging for archival
support via several ftp sites, and updates will be posted to those
sites. I will also try to get the source to Berkeley YACC posted to
these ftp sites, although it is certainly available at more central
sites.


Currently, Doug Lea and Doug Schmidtt have graciously offered to
provide anonymous ftp sites for all 8 of files, as well as the
Berkeley YACC source (if you need it). The ftp addresses are:


ics.uci.edu (128.195.1.1) in the ftp/pub directory as:


c++grammar2.0.tar.Z
byacc1.8.tar.Z


mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:


c++grammar2.0.tar.Z
byacc1.8.tar.z






HOW TO EXPERIMENT WITH THE C++ GRAMMAR


The following describes how to use the graphical debugging extensions
to Berkeley YACC to explore the grammar.


Note that the following instructions assume that you have the Berkeley
YACC source on hand. You can actually use any YACC to process the
grammar, but running the resulting demo (which has no semantic
actions) will tend to be quite boring. If you can't get hold of the
Berkeley YACC, you should at least try to enable the "debugging
options" in your parser to so that you can see in some way what
reductions are taking place. (Hint: search for the letters "debug" in
the C file that your yacc produces...).


                1) Get the entire source for Berkeley YACC 1.8 1/2/91
                2) Verify that you can make the YACC
                3) Rename SKELETON.C to SKELOLD.C, and implant my SKELGRPH.C
                                in that directory as SKELETON.C
                4) Make the yacc using this new SKELETON.C
                5) Using the above yacc, process my CPP5.Y file
                                yacc -dvl cpp5.y
                      The result should be a file y.tab.c, and y.tab.h
                6) Using Flex (replacement for lex) to process my CPP5.L file
                                flex cpp5.l
                      the result should be yy.lex.c
                7) Compile the two files
                                cc -o cpp5 y.tab.c yy.lex.c
                      the result should be an executable called cpp5
                8) Set the environment variable YYDEBUG to 6
                                setenv YYDEBUG 6
                      If you don't do this, the graphical output will not appear!
                9) Run the program cpp5
                                cpp5
                10) Try the input:
                                int a;
                11) You should see a nice parse tree. Enjoy. Note that
                        the lexer DOES NOT INCLUDE A SYMBOL TABLE, and does
                        NOT KEEP TRACK OF CURRENT SCOPES. The hack (see the
                        CPP5.L file for details) is to assume that any identifier
                        that begins with a capital letter is a typedef name
                        Send complaints about code that doesn't parse "correctly".








HISTORICAL NOTES




Developing the C grammar (that is intended to be compatible with the
ANSI C standard) was relatively straight forward (compared to the C++
grammar). The one difficulty in this process was the desire to avoid
use of %prec and %assoc constructs in YACC, which would tend to
obscure ambiguities. Since I didn't know what ambiguities were lying
in wait in C++, obscuring ambiguities was unacceptable. It took
several weeks to remove the conflicts that typically appear, and the
tedious process exposed several ambiguities that are not currently
disambiguated by the ANSI standard. The quality of the C grammar is
(IMHO) dramatically higher than what has been made available within
the public domain. Specifically, a C grammar's support of
redefinition of typedef names within inner scopes (the most difficult
area of the grammar) is typically excluded from public domain grammar,
and even excluded from most grammars that are supplied commercially
with parser generators! I expect that this grammar will be very
useful in the development of C related tools.


The development of the C++ grammar (initially compatible with version
1.2, but enhanced to support version 2.0 specifications as they were
made available) was anything but straight forward. The requirement
that I set to NOT USE %prec and %assoc proved both a blessing and a
curse. The blessing was that I could see what the problems were in
the language, the curse was that there were A LOT of conflicts (I can
recall times during the development effort when the number of
conflicts was well in excess of 200). The most recent addition of
nested types probably took about 2 weeks to implement. On the other
hand, I probably spent several months developing the automated
documentation tools that allowed me to debug the grammar additions
this quickly.


Towards the end of the initial development of the C++ grammar, which
took roughly 3 months of my time (circa summer 1989), I began to see
the folly in part of my quest. I came to the conclusion that further
attempts to modify my grammar, so as to defer resolutions of
ambiguities, would lead to an unreadable language. Specifically, my
feeling was that I was entering into a battle of wits with the
compiler, and the compiler was starting to win. It was beginning to
be the case that the parser COULD figure out what I said, but I
couldn't. Indeed, even examples in a version of the C++ 2.0 reference
manual (and published in the ARM) demonstrated this problem (my parser
could parse some examples that neither I nor the authors parsed
correctly!). At this point I decided to stop my quest to FURTHER
defer resolutions of ambiguities, and let the grammar commit in one
direction (always in favor of declarations), at the late point that is
provided by my grammar. If this direction proved "incorrect in light
of the context that followed", then I generated a syntax error. I
believe this strategy provides ample room for expressiveness. In
support of this expressiveness, I have (based on my discussions with
language experts) deferred disambiguation far longer than other
attempts at producing an LR(1) grammar. I would strongly argue that
any code that my grammar identifies as having a "syntax error" (based
on "premature" disambiguation), but cfront allows, should ABSOLUTELY
be rewritten in a less ambiguous (and hence more portable) form.


It should be noted that my grammar cannot be in constant agreement
with such implementations as cfront because a) my grammar is
internally consistent (mostly courtesy of its formal nature and YACC
verification), and b) YACC generated parsers don't dump core. (I will
probably take a lot of flack for that last snipe, but.... every time I
have had difficulty figuring what was meant syntactically by some
construct that the ARM was vague about, and I fed it to cfront, cfront
dumped core.)


One major motivation for using the C++ grammar that I have provided is
that it is capable of supporting old style function definitions (e.g.:
main(argc, argv) int argc; char*argv[]; {...} ). I believe this
capability was removed from the C++ specification in order to reduce
ambiguities in a specific implementation (cfront). As my grammar
demonstrates, this action was not necessary. Supporting old style
function definition GREATLY eases the transition to the use of C++
from traditional C. I expect that as some parsers begin to support
this option, that other parsing engines will be forced in this
direction by a competitive marketplace. Using my grammar, and the
standards it implies, appears to be a very straightforward approach to
this support.


A second motivation for using my grammar is that it can be processed
by YACC. The advantage in this fact lies with YACC's capability to
identify ambiguities. For software manufacturers that are heavily
concerned with correctness, this is an INCREDIBLE advantage. My
experience with hand written parsers (which usually represent a
translation by a fallible human from a grammar to parsing code) is
that they evolve and become more correct with time. Ambiguous cases
are often misparsed, without the author ever realizing there was a
conflict! In contrast, either a YACC grammar supports a given
construct, or it doesn't. If a YACC grammar supports a construct, the
semantic interpretation is usually rather straight forward. The
likelihood of internal errors in the parser is therefore SIGNIFICANTLY
reduced. The fact the the grammars I supplied are free of %assoc and
%prec operators, implies the grammar are fairly portable, and the
conflicts are open to peer code review (and not obscured).


Most recently I have joined the ANSI C++ committee (X3J16), and have
tried to follow their progress with hopes of maintaining compliance in
my grammar. Unfortunately, political pressures within X3J16 appear to
make it IMHO more desirable to quickly approve a standard that matches
cfront's performance (when it is not dumping core), than to provide a
clean, consistent and formal syntax as part of the standard. Rather
than fixing inconsistent hackery within the syntax, there is IMHO a
tendency to want to "hack further" to match cfront's current
performance (or the ARM's prophesy). As an example of this, the
fundamental hack in all of C is the feedback from the parser to the
lexer to identify typedefnames. There is discussion afoot to (for no
reason other than compliance with a *proposed* cfront feature) extend
this another notch and require feedback to distinguish template names.
This hackery was not required by the syntax, rather it was "desirable"
to match the performance of beta-cfront (and the ARM). When cfront
changes, and old code is obsoleted, the arguments abound that it is
for the good of humanity. When cfront is hacking inconsistently, then
no change can be made, because of the thousands of lines of code that
depend on this psuedo-standard. Perhaps my grammar will help in some
small way Microsoft, Zortech, Borland, and dozens of other
entrepreneurs work toward building a standard for a language that has
enough consistency to grow and flourish (note that none of the above
vendors use my grammar in their products, but I think they would all
share my desire for a cleaner syntax). If I am successful with my
grammar, then I will be able to write C++ tools in a consistent and
open marketplace. From my perspective, the outcome is not clear. If
you have a channel to support the use of a cleaner syntax in the X3J16
standard, I would heartily invite you to exercise that channel.


As it currently stands, my grammar teeters on the edge of being
unusable due to its size. The size in turn is due to the variety of
special cases that must be dealt with within C++ parsing. With only a
few more inconsistent additions to the "standard language", my grammar
will surely become completely unusable. I am trying to develop a yacc
preprocessor that will allow me to rein back in the complexity of the
grammar. If I can do this, then it will continue to be possible to
update my grammar to match the emerging ANSI Standard. I can only
promise to try.




FEEDBACK ABOUT THE GRAMMARS


If you find any errors in my grammars, I would be DELIGHTED to hear
mention of them!!!! These should fall into one of the following
categories:


                1) The grammar left out the following features of C++...
                or
                2) The grammar mis-parses the following sequences...
                or
                3) The discussion of the following ambiguity is
                incorrect...
                4) The grammar could be simplified as follows...


Please send correspondence of this form to jar@hq.ileaf.com. My
response to 1's will be to add the feature (if possible!); feel sad
that I made a mistake; and feel glad that YOU found it. I will have a
similar response to 2's. Responses of type 3 are GREAT, but I haven't
found many folks that really get into YACC ambiguities, so I have low
expectations... feel free to surprise me!!! :-) :-). Items of type 3
are interesting, but since simplicity is in the eye of the beholder,
such suggestions are subject to debate. I would be interested in
seeing suggestions in this area with the constraint that they do not
increase the number of conflicts in the grammar! Please use YACC to
check your suggestions before submitting them. (It is often AMAZING
how the slightest change in the grammar can change the number of
conflicts, and it took a great deal of work to reach the current
state). Distribution site(s) will be set up to distribute updates and
or corrections. Postings about the presence of corrections will be
made on the net.


Since the two grammars (C and C++) were generated in parallel, you
should be able to compare non-terminals directly. This will hopefully
make it easier to identify the complexities involved with the C++
grammar, and "ignore" those that result from standard ANSI C. In both
cases I have left the standard if-if-else ambiguity intact. In the
case of ANSI C grammar, this is the only shift-reduce conflict in the
grammar. Although there are a number of conflicts in the C++ grammar,
there are actually very few classes of problems. In order to
disambiguate the C++ grammar enough that YACC can figure out what to
do, I was commonly forced to "inline expand" non-terminals found in
the C grammar. This expansion allowed YACC to defer disambiguation
until it was possible for an LR(1) parser to understand the context.
The unfortunate consequence of this inline expansion is a large growth
in the number of rules, and the presence of an effective "multiplier"
in most cases where conflicts do occur. As a result, any conflicts
that arise are multiplied by a factor corresponding to the number of
rules I had to list separately. I have grouped the C++ grammar
conflicts in the "Status" section of the GRAMMAR5.TXT paper, but you
are welcome to explore my grammars using YACC directly (be warned that
you will need a robust version of YACC to handle the C++ sized
grammar). PLEASE do not be put off by the number of conflicts in the
C++ grammar. There are VERY FEW CONFLICTS, but my elaborated grammar
confuses the count.


The GRAMMAR5.TXT paper is FAR from a publishable quality paper, but it
discusses many of the issues involved in ambiguities in my grammar,
and more generally in the C++ language. I hope GRAMMAR5.TXT it is a
vast improvement over "nothing at all", but apologize in advance for
lack of polished structure, and the presence of many typos (which must
surely be present). I hope you find this almost-paper interesting. My
attempts at documenting conflicts have certainly clarified the
problems in my mind. Based on my experience with the conflicts I have
identified, most current compilers and translator fall prey to the
situations that I have uncovered. I hope that other compilers, that
do not make use of the grammar I have made available, will at least
seek to standardize the resolution of the problems identified.




The AUTODOC5.TXT file provides interesting reading for both readers
interested in LR and LALR parsing (and the subtle connections and
distinctions between them), as well as any user that wishes to fully
comprehend the contents of the Y.OUTPUT file. It includes an
extensive discussion of ambiguities, how they are removed, how
LALR-only ambiguities arise, and how they can be dealt with.


With this release of the grammar I have begun to distribute machine
generated documentation for my grammar. As a result, if my analysis
of conflicts are questionable, the supporting data is immediately
present to confirm or deny my analysis. If you wish to correct any of
my analysis, please use and refer to the Y.OUTPUT file that I have
provided.


As a small commercial message... I am a freelance consultant, and I
travel far and wide to perform contracts. If you like the work that I
am presenting in this group of documents, and would like to see a
resume or at least talk, please feel free to contact me.


I hope that the grammars that I have provided, will lead to many
successful C++ processing projects.


Jim Roskind
Independent Consultant
516 Latania Palm Drive
Indialantic FL 32903
(407)729-4348
jar@hq.ileaf.com or ...!uunet!leafusa!jar
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.