Re: Compiler Compiler Compiler

"Mike Dimmick" <mike@dimmick.demon.co.uk>
26 Mar 2001 13:45:05 -0500

          From comp.compilers

Related articles
Compiler Compiler Compiler danwang+news@cs.princeton.edu (Daniel C. Wang) (2001-03-22)
Re: Compiler Compiler Compiler jjan@cs.rug.nl (J.H.Jongejan) (2001-03-26)
Re: Compiler Compiler Compiler mike@dimmick.demon.co.uk (Mike Dimmick) (2001-03-26)
Re: Compiler Compiler Compiler nr@labrador.eecs.harvard.edu (2001-03-26)
Re: compiler compiler compiler Dr_Feriozi@prodigy.net (2001-03-26)
Re: Compiler Compiler Compiler kszabo@nortelnetworks.com (Kevin Szabo) (2001-03-27)
Re: Compiler Compiler Compiler Trevor.Jenkins@suneidesis.com (Trevor Jenkins) (2001-03-27)
Re: Compiler Compiler Compiler cfc@world.std.com (Chris F Clark) (2001-03-27)
Re: Compiler Compiler Compiler i.dittmer@fh-osnabrueck.de (Ingo Dittmer) (2001-03-27)
[14 later articles]
| List of all articles for this month |

From: "Mike Dimmick" <mike@dimmick.demon.co.uk>
Newsgroups: comp.compilers
Date: 26 Mar 2001 13:45:05 -0500
Organization: Compilers Central
References: 01-03-095
Keywords: parse, tools
Posted-Date: 26 Mar 2001 13:45:04 EST

"Daniel C. Wang" <danwang+news@cs.princeton.edu> wrote in message
> The first non-trivial thing anyone does when they invent a new
> programming language seems to be rewriting yacc so they can bootstrap
> their compiler. This sounds like a great master's thesis for
> someone... :)


The reason for re-writing YACC tends to be that it, and its
derivatives, are somewhat difficult to work with. Let me expand:


Many 'new' programming languages tend to be C or C++ derivatives.
There are a number of problems with the C and C++ syntaxes which can
only be solved by absorbing semantic information into the parser.
Unfortunately, this usually means that information must be entered
into semantic tables whilst processing a complete rule. YACC really
doesn't handle this at all well.


The reason is that every action in a YACC grammar that does _not_ come
at the end of the rule is treated as a separate epsilon (empty)
production rule, and the action is executed when the epsilon is
reduced. An LALR(1) grammar tool produces its syntax trees from the
bottom up; it therefore processes its input keeping as many
possibilities in mind (the set of possible reductions) as it can, then
selecting one when the input becomes unambiguous. The introduction of
these semantic actions causes problems - the generator cannot know
which rule is to be reduced, so should it execute the action or not?
YACC treats the rule containing the action differently from those
without, I believe. This usually means including the action at the
equivalent point in all such rules, then backing out if it turns out
to be incorrect.


Another problem is the subsequent use of such semantic information.
Not having been designed for such an idea, the only place to put such
information in a YACC grammar is to change the meanings of the
identifiers (by changing the token code appropriately) or 'steering'
by introducing fake tokens as they come out of the lexical scanner.
This tends to require duplication of rules for the different
possibilities - in C++, some steering is dependent on whether a name
is a class, or whether it is a template (at all) or a template class.


It seems to me, after some study, that although LR(k) grammars are
stronger, LL(k) parsers tend to be faster. This is probably a good
case for making your programming language grammar LL(k) (especially
LL(1)) if you can - to give people who must write compilers the
greatest number of options possible. Utility of the language should
take first place, but anything that makes the life of the compiler
writer easier is a good idea. Users of your language will thank you
too, because the compiler should go one heck of a lot faster if it
doesn't have to backtrack.


It's not all bad news though. There are other parser generators
around besides YACC. A lot of them suffer from the same problems,
unfortunately. You can of course write your programming language so
as to avoid these problems! Here's a tip - avoid basing it on C or
C++! I would dearly love to see this be the end of the road for those
languages as a base for new ones, syntactically - parsing them is just
too damned hard.


As for theses, this was the subject of Terence Parr's Ph.D. thesis.
The tool he wrote, PCCTS, is a "predicated LL(k) parser generator" and
is available on the net at http://www.polhode.com/pccts.html and the
second generation, ANTLR 2.x, at http://www.antlr.org (although as
this second edition is written in Java, I wouldn't personally
recommend its use for a production compiler tool). I'm using PCCTS
1.33 Maintenance Release 22 to write a C++ parser as part of my final
year undergraduate project.


--
Mike Dimmick


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.