Re: [QUERY] A "ignorant newbie" question about compiler-writing. (John Lilley)
3 Jan 1997 23:15:28 -0500

          From comp.compilers


From: (John Lilley)
Newsgroups: comp.compilers
Date: 3 Jan 1997 23:15:28 -0500
Organization: Nerds for Hire, Inc.
References: 97-01-013
Keywords: practice

synaptik wrote:

> My question is... what is it that makes compiler development so
> difficult? Forgive my ignorance, but I have always thought it would
> be a fun project, but then become disillusioned when I pick up a book
> on compiler theory and it doesn't appear "straight forward."

Hmmm, this is a good question and hard to answer simply. It's one of
those things that should be simple, because it's simple to describe
the process. But as ever, the devil is in the details. I'll offer
some opinions on the matter, but first let me recommend some
"practical" books that strip away the theory and show how a (simple)
compiler is really implemented:

Wirth "Compiler Construction" -- Implements a recursive-descent Oberon
compiler. Oberon is similar to Pascal.

Fischer/LeBlanc, "Crafting a Compiler with C" -- Implements a subset
of Ada using "C" and some tools.

Now some opinions. I know that my parser bias shows here, and each of
these categories could probably be as large as the first one. I think
that the complexity comes from several places:


It would be "easy" to write a non-deterministic grammar and parser
that just tried all the alternatives of the grammar, picked the ones
that worked, and when one didn't work, backed out and tried the next
alternative. This is known as "exponential parsing" because it can
take close to forever, so hardly anyone ever uses it.
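To make that concrete, here is a minimal sketch of such a backtracking recognizer for a toy expression grammar (the grammar and names are illustrative, not from the post): each rule tries its alternatives in order and backs out on failure.

```python
# A naive backtracking recognizer for the toy grammar:
#   E -> T '+' E | T
#   T -> 'n' | '(' E ')'
# Each parse function returns the index after the match, or None on failure;
# on failure we "back out" to the saved position and try the next alternative.

def parse_E(toks, i):
    # Alternative 1: T '+' E
    j = parse_T(toks, i)
    if j is not None and j < len(toks) and toks[j] == '+':
        k = parse_E(toks, j + 1)
        if k is not None:
            return k
    # Alternative 2: plain T (backtrack to position i and retry)
    return parse_T(toks, i)

def parse_T(toks, i):
    if i < len(toks) and toks[i] == 'n':
        return i + 1
    if i < len(toks) and toks[i] == '(':
        j = parse_E(toks, i + 1)
        if j is not None and j < len(toks) and toks[j] == ')':
            return j + 1
    return None

def accepts(toks):
    return parse_E(toks, 0) == len(toks)
```

On pathological inputs the retries nest, which is where the exponential behavior comes from.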

LL techniques (aka recursive descent) improve efficiency by analyzing
the lookahead token sets that are used to discriminate which
alternates to try before trying them. But they suffer from problems
with ambiguity, because often you need more than one token of
lookahead (or even an unbounded amount of lookahead) to figure out
which alternate to take. To write an LL grammar you need to analyze
the lookahead token sets (or have a tool do it for you). LL is harder
than non-deterministic parsing because you have to worry about
ambiguities and massage the grammar by left-factoring, looking further
ahead, and other transformations to make it unambiguous. Ambiguities
are often very difficult to resolve in a large grammar, because it is
hard to get an idea of what language features are "really" causing the
problem.
LR and LALR techniques can reduce ambiguity over LL techniques because
they defer decisions of which rules are matched by keeping a stack of
tokens and productions waiting for more tokens to clear up any
ambiguity. LALR is a technique for analyzing a subset of LR grammars
and producing a push-down automaton to implement the parser. LR and
LALR are stronger than LL because they have fewer ambiguities and can
process a larger set of grammars, in general. However, IMHO LALR is
harder to create and understand than LL because you lose the nice
one-to-one correspondence between rules, the code used to implement
the rules, and the actions taken when the rules are matched. The
state machine gets in the middle of everything and adds an extra layer
of indirection.
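The "stack of tokens and productions" and the extra layer of indirection can both be seen in a tiny hand-built LR parser (the grammar and tables below are my own illustration): the ACTION/GOTO tables are the state machine, and decisions are deferred until a reduce fires.

```python
# Hand-built LR tables for the grammar:
#   rule 1: E -> E '+' 'n'
#   rule 2: E -> 'n'
# ACTION maps (state, lookahead) to ('shift', state), ('reduce', rule),
# or ('accept', None); GOTO maps (state, nonterminal) to a state.  The
# explicit state stack is what lets the parser defer the decision of
# which rule matched until enough input has been seen.

ACTION = {
    (0, 'n'): ('shift', 2),
    (1, '+'): ('shift', 3),
    (1, '$'): ('accept', None),
    (2, '+'): ('reduce', 2), (2, '$'): ('reduce', 2),
    (3, 'n'): ('shift', 4),
    (4, '+'): ('reduce', 1), (4, '$'): ('reduce', 1),
}
GOTO = {(0, 'E'): 1}
RULES = {1: ('E', 3), 2: ('E', 1)}   # rule -> (lhs, length of right side)

def lr_accepts(toks):
    toks = list(toks) + ['$']        # end-of-input marker
    stack = [0]                      # stack of automaton states
    i = 0
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:
            return False             # no table entry: syntax error
        op, arg = act
        if op == 'shift':
            stack.append(arg)
            i += 1
        elif op == 'reduce':
            lhs, n = RULES[arg]
            del stack[-n:]           # pop one state per right-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:                        # accept
            return True
```

Notice how the connection between a rule and the code that recognizes it has disappeared into the tables, which is the loss of one-to-one correspondence described above.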

The more complex the language, the harder it is to write an
unambiguous grammar. Pascal and Oberon were designed to be easily
parsed, and they have keywords marking the start of productions, so
they are "easily" implemented using LL/recursive-descent techniques.
"C" and Java are medium-difficulty because they have certain
ambiguities that require extra lookahead, and they allow for the
definition of named types. They also require the discrimination of
types and identifiers, so you have to connect the lexer to the symbol
table to figure out which symbols are which. To reduce ambiguity, you
either work harder on the LL grammar, or use LALR or both. C++ is
considered to be horribly obnoxious to parse in all respects, and
defies both LL(k) and LALR(k) for any finite value of the lookahead
window size "k". Because of this, you must add advanced techniques
like backtracking or "ambiguous parsing followed by reconciliation".
C++ also adds many other horrible twists having to do with templates.
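The "connect the lexer to the symbol table" point can be sketched in a few lines (a toy model, not any real compiler's code). Consider the classic C fragment `foo * bar;` -- a declaration if `foo` is a typedef, a multiplication otherwise:

```python
# Sketch of the "lexer feedback" hack for C-style typedefs: the lexer asks
# the symbol table whether an identifier currently names a type, because
# the parser cannot choose a production without knowing.

typedef_names = set()                # names introduced by typedef so far

def classify(word):
    return 'TYPE_NAME' if word in typedef_names else 'IDENTIFIER'

def interpret(stmt):
    # stmt is a 3-tuple like ('foo', '*', 'bar'); we assume only this
    # trivial shape, purely to show the two readings.
    first, _star, second = stmt
    if classify(first) == 'TYPE_NAME':
        return f"declaration: {second} is pointer to {first}"
    return f"expression: {first} times {second}"
```

The same token sequence parses two different ways depending on what the symbol table says, which is exactly why the lexer and symbol table cannot be kept cleanly separate in these languages.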

With every tweak and hack that is performed to massage the grammar or
pull some other trick to avoid ambiguities, you make the thing more
complex, harder to understand, and more likely to break for a special
case.


It is hard to write an application that can store and process hundreds
of thousands of symbols in an efficient manner. Because of this,
optimization is the rule rather than the exception, and with
optimization comes more complexity and opportunity for error.
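One standard trick for handling huge numbers of symbols (my illustration, not from the post) is interning: each distinct name gets exactly one shared record, so the thousands of occurrences of the same identifier compare by identity rather than by repeated string comparison.

```python
# String interning for identifiers: one shared record per distinct name,
# so later lookups and comparisons are cheap pointer comparisons.

class SymbolInterner:
    def __init__(self):
        self._table = {}

    def intern(self, name):
        sym = self._table.get(name)
        if sym is None:
            sym = {'name': name}     # the single record for this name
            self._table[name] = sym
        return sym
```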


Compilers are complex because the input is highly structured and
variable. The permutations of the input (in a non-strict sense) are
much more variable than the permutations of input for a "normal"
application, and because of that there always seems to be one more
case that you didn't think about until you hit it in someone's code.


Compilers are subject to "error avalanche" meaning that a single error
in syntax or semantics can propagate to spawn a zillion more errors
later in the file. It is usually not good enough to bail out after
the first error. It is an art to determine the nature of the error,
report it sensibly, and synchronize the input to a state where the
probability of continued successful parse is reasonably high.
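The simplest version of that synchronization is "panic mode": report the error once, skip ahead to a safe token such as `;`, and resume. A toy sketch (the statement shape `name = name ;` is assumed purely for illustration):

```python
# Panic-mode error recovery: on a syntax error, report once, then skip
# tokens until a synchronizing token (here ';') before resuming, so one
# bad statement does not spawn an avalanche of follow-on errors.

SYNC = {';'}

def parse_stmts(toks):
    stmts, errors, i = [], [], 0
    while i < len(toks):
        # assume every statement has the shape: IDENT '=' IDENT ';'
        if (i + 4 <= len(toks) and toks[i].isidentifier()
                and toks[i + 1] == '=' and toks[i + 2].isidentifier()
                and toks[i + 3] == ';'):
            stmts.append((toks[i], toks[i + 2]))
            i += 4
        else:
            errors.append(f"syntax error at token {i}")
            while i < len(toks) and toks[i] not in SYNC:
                i += 1               # skip to the synchronizing token
            i += 1                   # consume the ';' itself and resume
    return stmts, errors
```

With this, one garbled statement produces one diagnostic and the parse picks up again at the next statement.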


As the parser recognizes the structured input, it must build a symbol
table representing the declarations that are encountered. The symbol
table in some languages like Java, C or C++ is necessary even for a
syntactic parse, because typedefs determine what is an identifier vs a
type, and there are places in the grammar where it must be known what
is a type and what is not for the parse to proceed correctly.

In C++ it gets far worse, because the symbol table has various forms
of nested scope, and symbols can be overloaded in their meaning. So
just getting the symbol table right is rather involved. To make
matters worse, correct instantiation of specialized templates requires
that a complete type analysis be performed on the arguments of the
template, which means that complete type information must be collected
for all the declarations.
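The "various forms of nested scope" part has a classic data-structure answer (a minimal sketch, ignoring overloading entirely): a stack of scopes, with lookups walking outward from the innermost one so that inner declarations shadow outer ones.

```python
# A symbol table as a stack of scopes.  enter_scope/exit_scope bracket a
# block; lookup walks from the innermost scope outward, which is what
# makes shadowing work.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]           # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, info):
        self.scopes[-1][name] = info

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                  # undeclared
```

Real C++ tables must add namespaces, class scopes, and overload sets on top of this skeleton, which is where "rather involved" comes from.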

ASTs are a representation of the code itself, which includes most
declarations. Whereas the symbol table's purpose is mainly to store
information about identifiers to be used during the parse and code
generation, the ASTs main purpose is to store a structured
representation of the source code so that the backend can walk the
structured representation and generate machine code that corresponds
to the source code. ASTs and symbol tables can overlap in the
information they store.

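A minimal sketch of what "a structured representation that the backend can walk" means (node names are my own): a couple of node classes and a recursive walk that reconstructs source text from the tree.

```python
# A tiny AST for additive expressions, plus one walk over it.  Any backend
# pass -- code generation, constant folding, pretty-printing -- has this
# same recursive shape.

class Num:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

def to_source(node):
    """Walk the tree and reconstruct a fully parenthesized source form."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Add):
        return f"({to_source(node.left)} + {to_source(node.right)})"
    raise TypeError(f"unknown node {node!r}")
```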

Wirth considers code generation to be the hardest part of writing a
compiler. I don't have sufficient experience with backends to know
myself, but it seems reasonable that it is at least as hard as
parsing. Code generation may involve creating an "abstract machine"
such as a register-transfer machine, generating byte-code for this
machine, and then revisiting the byte code and transforming it into
the machine code for the target machine. Code generation also
requires that type sizes and field offsets are calculated, that local
symbols are reconciled, and that non-local symbol references are
placed into the object file.
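The two-step shape described above -- generate code for an abstract machine, then deal with the real target -- can be sketched with a toy stack machine (instruction names are invented for illustration):

```python
# Compile a small expression tree to byte-code for an abstract stack
# machine, then execute the byte-code.  A real backend would instead
# translate these abstract instructions into target machine code.

def compile_expr(node, code):
    kind = node[0]
    if kind == 'num':                # ('num', 7)
        code.append(('PUSH', node[1]))
    elif kind == 'add':              # ('add', left, right)
        compile_expr(node[1], code)
        compile_expr(node[2], code)
        code.append(('ADD', None))
    elif kind == 'mul':              # ('mul', left, right)
        compile_expr(node[1], code)
        compile_expr(node[2], code)
        code.append(('MUL', None))
    return code

def run(code):
    stack = []
    for op, arg in code:
        if op == 'PUSH':
            stack.append(arg)
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]
```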


Once machine code is generated, it must be tagged with all sorts of
symbolic information representing what global symbols it contains, and
what global symbols are needed. Anywhere an undefined global symbol
is needed, a "placeholder" or "fixup" record must be put in the object
file so that the linker can match up needed symbol with supplied
symbols from various object files.
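The fixup mechanism reduces to a small sketch (object-file layout here is invented for illustration, nothing like a real format): each object carries its code, the symbols it defines, and fixup records naming undefined symbols to be patched once every object has been placed.

```python
# Toy link step: lay each object's code into one image, record where each
# defined symbol landed, collect fixup records, then patch every fixup
# slot with the final address of the symbol it names.

def link(objects):
    image, defs, fixups = [], {}, []
    for obj in objects:
        base = len(image)
        image.extend(obj['code'])
        for name, off in obj['defines'].items():
            defs[name] = base + off            # symbol's final address
        for name, off in obj['fixups'].items():
            fixups.append((name, base + off))  # placeholder to patch
    for name, loc in fixups:
        if name not in defs:
            raise ValueError(f"undefined symbol: {name}")
        image[loc] = defs[name]                # patch the placeholder
    return image
```

The "undefined symbol" error every programmer has seen from a real linker falls out of exactly this matching step.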


Finally, a real compiler must implement the standard runtime
environment, which is often hundreds of thousands of lines of code in
its own right.

Whew! Glad I'm not doing this :)

