Re: Compiler Compilers: Design and Implementation

Hans-Peter Diettrich <>
Tue, 10 Feb 2009 18:03:08 +0100

          From comp.compilers

Related articles
Compiler Compilers: Design and Implementation (Alexander Morou) (2009-02-08)
Re: Compiler Compilers: Design and Implementation (Ira Baxter) (2009-02-10)
Re: Compiler Compilers: Design and Implementation (Hans-Peter Diettrich) (2009-02-10)
Re: Compiler Compilers: Design and Implementation (Alexander Morou) (2009-02-11)
| List of all articles for this month |

From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers
Date: Tue, 10 Feb 2009 18:03:08 +0100
Organization: Compilers Central
References: 09-02-020
Keywords: practice
Posted-Date: 11 Feb 2009 05:38:36 EST

Alexander Morou schrieb:

> 1. Are context-aware lexical analyzers favorable?

(see below)

> 2. I've written my share of lexers/parsers/interpreters, and I've
> looked at code from code generators. A lot of the generators emit
> large sets of integers that relate to what I'm guessing is
> state->state transitions based upon the character index and current
> state...

Right. State machines are the easiest implementation, in contrast to
code generators.

> 3. What kind of speed should I be focusing on when I've written the
> parser? I ask this question because I write programs that write
> code, quite frequently, and I've noticed that they tend to emit very
> large files. What's a good test to throw at it to ensure that it's
> not only valid but capable of handling large files (somewhere in the
> range of 10+ MB).

IMO 10 MB are neglectable, with regards to the currently available
amount of RAM. Having entire files in RAM can speed up the character
and line handling in the lexer. But when files (include files...) are
kept in memory, they can sum up, and it might be necessary to use
buffers of a more affordable size.

> 4. Beyond issues with Left Recursion, what are some issues with
> LL(k) parsers that I should be aware of, and why should I ever
> prefer LR (or LALR) parsers? I personally dislike LR parsers due to
> their ambiguity issues (think: hanging else), when writing for an
> LL(k) parser generator, you're aware that it's recursive in nature,
> and the most nested if statement gets the 'else' when the curly
> braces are omitted. Are there any other LL(k) ambiguities that I
> need be aware of?

IMO both techniques have their pro's and con's. The described handling
of the dangling else is a possible workaround in LL parsers, which
should treat such constructs as errors (same symbol in FIRST and
FOLLOW sets); there may exist cases where such automatic default
handling is inacceptable.

> 5. I want to write an LL(k) parser generator program, and to that
> end I'm hand-writing a parser for C# to get used to writing like a
> generator (by writing state machines myself, and handling context
> awareness, this stuff sucks by hand). I'm thinking about adding
> context awareness to the lexical analyzer, because in certain cases
> it'll be necessary (a simple example is C#'s generics, nesting a
> generic declaration like so: A<B<C>> would be annoying to parse
> without context-awareness, since '>>' is right-shift, unless the
> parser itself dipped into the character stream while parsing a
> generic type. Unless I'm missing something very simple.)

For such ambiguous constructs, like sequences of '>' of arbitrary
length, I'd let the lexer recognize only a single '>' in the lowest
level, and recognize ">>" in an higher level, taking the context into
account. An integrated preprocessor deserves more context specific
handling, like the handling of '#', included files, conditional
compilation, macro expansion etc. This summed up to about 7 levels in
my handwritten C lexer, before the tokens finally are passed to the

> 6. I'm planning on writing the C# parser by hand, which means I'll
> be writing the parser's state system, the tokens are few in actual
> count, but each are individually complex (there's not a different
> token for each keyword, but JUST a keyword token, 423 states in
> total, 97 keywords) Context awareness means that each parser state
> will require not only the individual type of token to be declared
> but the individual sub-structure information (ie. keywords and what
> keywords),

I distinguish between fixed and variable tokens, where fixed tokens
have one unique description, whereas variable tokens (identifiers,
literals) have a different "wording". My lexer stores the literals in
a global data structure, and only in the last stage the identifiers
are looked up and stored in the appropriate (type or name, optionally
string) symbol or literal tables. After that mapping all symbol
related information is accessible through an pointer or reference to
its data structure. That structure can contain the file position of
the symbol declaration, last use (for error handling), or (optionally)
a list of occurences for cross reference purposes. The set of tokens
is extended as appropriate, so that type names are distinct from other
identifiers, and tokens with multiple meanings (like mentioned '>' or
'>>') can be disambiguated by the lexer or parser, when their actual
meaning has been determined.

The resulting token information is small enough to be stored in an AST
or macro recorder, for later macro expansion or function inlining.

> will there be speed issues attributed to this, or is passing ~20
> bytes of information each lexical step fine? The parse method will
> be simple, all state machines will process in tandem (that is: one
> character at a time until none process further, limited execution
> based upon first character at the start of analysis). Based upon
> the length of the token, it'll return back to the parser the longest
> token. To avoid worst case scenario (re-evaluation due to following
> the improper path), does more than one token need returned if there
> are more than one that succeeds, or does that depend on the
> necessary look ahead for the given grammar (since the identification
> of a token could change based upon the next token)?

There exist various approaches for the implementation of an lexer,
including scannerless parsers. When you *want* an explicit tokenizer,
it IMO should return only one token in any case, with a chance for
later disambiguation, or combination of multiple "basic" tokens, into
another unique token.

> 7. If I write a parser generator, should I have options for multiple
> result parsers? Different parsers have different uses, some are
> focused more on simple validation and light structure (ie. Visual
> Studio integration, varying degrees of awareness depending on
> desired functionality), others require full context awareness and
> the need to build a fully structured version of the CST for
> transformation into an AST for compilation/interpretation, and
> others might be useful for simple syntax validation and even weaker
> for code colorization (such as emitting html for the code).

I prefer parsers that create a tree structure (CST), which can be
transformed as appropriate for the actual task. Recursive descent (LL)
parsers have so much information in their call stack, that the output
tree and/or tables can be created in callback functions, for every
recognized production, and these callbacks can be exchanged without
any impact on the grammar or on the lexer and parser code.

I even use an mix of LL and LR strategies in an parser, when e.g.
expressions parse better with a state machine, whereas other parts are
easier to debug and to tweak in recursive descent code.

> 8. Similar to 7, but focused on the parse stages. Some languages
> have more than one parse phase, and different rules on how to handle
> them. Should the program be built, what would be a good way to
> handle the different phases (from a language description standpoint,
> I'll be writing a DSL specifically for this project)?

IMO specific languages deserve specific strategies for the
lexer/parser implementation. I'd not bother with one unique model,
that should cover all cases, but will have to be adopted to the
requirements of every new language.

Many languages are context sensitive, in some places, so that I found
it a waste of time to write down a precise formal grammar for some
specific parser generator. Of course some kind of formal grammar
should exist for every language, but it should allow for some
sloppyness in the critical places, which have to be implemented in
semantic code anyway.

> If I sound very naive: I apologize. I'm researching language theory
> on my own, and have an interest in programs that build programs.
> I'm sure a lot here have used compiler compilers, but I'm not sure
> how many have written them so there's likely things I'm just
> oblivious to at this point. Regular expression state machines are
> probably a walk in the park compared to the phase that comes after.

I've already assisted several people, in writing their parser
generators or compiler compilers. But I never would write one myself,
because I know about the problems in various approaches, which arise
for the user of such a tool. Most tools require so much knowledge
about their features and limitations, that IMO it's easier to
handcraft an parser, instead of working out the input to an parser
generator, until the created code will do what it should do. Many
application projects effectively are done or at least are finished by
the companies, which have provided the choosen generator tool. Most
tools effectively only help their inventors, to provide the code for
some task, but are of very little use to other people.

I'm still waiting for an parser generator that produces a parser
skeleton only, which helps to avoid trivial errors in the
implementation of an parser from scratch, or when the language is
extended later. More an verifier for some implementation, than a


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.