Re: Compiler Compilers: Design and Implementation

Alexander Morou <>
Wed, 11 Feb 2009 09:38:00 -0600

          From comp.compilers

Related articles
Compiler Compilers: Design and Implementation (Alexander Morou) (2009-02-08)
Re: Compiler Compilers: Design and Implementation (Ira Baxter) (2009-02-10)
Re: Compiler Compilers: Design and Implementation (Hans-Peter Diettrich) (2009-02-10)
Re: Compiler Compilers: Design and Implementation (Alexander Morou) (2009-02-11)

From: Alexander Morou <>
Newsgroups: comp.compilers
Date: Wed, 11 Feb 2009 09:38:00 -0600
Organization: Compilers Central
References: 09-02-020 09-02-026
Keywords: design, practice
Posted-Date: 11 Feb 2009 17:25:25 EST

In reply to Ira Baxter:

The end goal of this project is mostly just research. Sure I could
pull in the source of some open-source project and see how they did
it, but that's time consuming, and there's a good chance I'll miss
something important.

The primary reason I'm writing a parser on my own is that I have a few
things I want to try. I could write a generator, but that only lets me
apply large-scale transformations once I know exactly what I'll be
doing; since I'm not 100% sure yet, I can use this as a prototyping
phase to get a better idea of what to try when I do create the
compiler compiler. As far as the regular expressions and parse data
go, I'm using an older program I wrote to boot-strap things;
Hans-Peter Diettrich replied to a topic I made back in 2007 about it.
As far as the CST goes, I'm implementing that by deriving from a
common AST I'm writing for languages targeting .NET (this way I don't
have to further transform the structure). The AST itself is focused on
being malleable and extensible, so I can handle the operations
necessary for short-circuiting, operator overloads, implicit/explicit
type casts, and the other compiler work needed to take a normal
high-level expression and put it into a format that's valid in Common
Intermediate Language (such as transforming lambda expressions,
Language Integrated Query, and all sorts of fun type-inference stuff
that I've only started to scratch the surface of).
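
To make the short-circuiting point concrete, here's a toy sketch (the
node types are entirely my own invention, not the actual AST under
discussion) of rewriting a short-circuit "and" into an explicit
conditional, which is roughly the shape a .NET back end needs before
it can emit brfalse/brtrue branches in CIL:

```python
class Const:
    def __init__(self, value):
        self.value = value

class And:  # short-circuit boolean "and", as it appears in source
    def __init__(self, left, right):
        self.left, self.right = left, right

class If:  # If(cond, then, other): the branch-based, lowered form
    def __init__(self, cond, then, other):
        self.cond, self.then, self.other = cond, then, other

def lower(node):
    """Rewrite And(a, b) as If(a, b, Const(False)), so the right-hand
    side is only evaluated when the left-hand side is true."""
    if isinstance(node, And):
        return If(lower(node.left), lower(node.right), Const(False))
    return node

def evaluate(node):
    if isinstance(node, Const):
        return node.value
    if isinstance(node, If):
        return evaluate(node.then) if evaluate(node.cond) else evaluate(node.other)
    raise TypeError(type(node))
```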

As for the lexical phase, when I said '20 bytes per lexical step'
(it's actually more like 40+ bytes, though this data could be moved
into a global cache shared between the parser and lexer and referred
to by an index or other shorter form), I meant that the parser passes
the contextual information necessary for context awareness (which
keywords are active, as per the last example). Each individual
state->state transition of the state machine consumes a single
character. As for the lexer returning more than one token, I was
thinking that it could return a precursor-type token that holds the
information about the lex that just finished, so that when the next
look-ahead was performed, it could resolve the ambiguity as it arose.
The parser would hold onto these lexical precursors only for as long
as its look-ahead requires, and only go as far as there is ambiguity
(because the look-ahead requirement varies from point to point). The
ambiguity will also change with the individual source being parsed, as
I'm sure you know; so if the look-ahead is supposed to be, say, 3
tokens for 100% certainty, but the second precursor identifies only
one valid token, the parser can omit the third look-ahead.
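
A rough sketch of what I have in mind (the names Precursor and
lookahead_needed are invented for illustration): the lexer hands back
a lightweight record of the candidate token kinds, and the parser only
keeps buffering look-ahead while more than one candidate survives:

```python
from dataclasses import dataclass

@dataclass
class Precursor:
    text: str
    candidates: frozenset  # token kinds this lexeme could still be

KEYWORDS = {"if", "for"}  # toy keyword table

def lex_precursor(lexeme):
    """Return a precursor recording every kind this lexeme might be."""
    kinds = {"identifier"}
    if lexeme in KEYWORDS:
        kinds.add("keyword")
    return Precursor(lexeme, frozenset(kinds))

def lookahead_needed(precursors, max_k=3):
    """Buffer up to max_k precursors, but stop as soon as one of them
    is unambiguous -- the 'omit the third look-ahead' case above."""
    buffered = []
    for p in precursors:
        buffered.append(p)
        if len(p.candidates) == 1 or len(buffered) == max_k:
            break
    return buffered
```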

I'm also glad, to a degree, that I decided to write the state machines
for this by hand. I'm learning quite a lot about how I would properly
optimize such state machines if I were writing a program to generate
them. In DMS, the project you're working on, how would someone define
the actions to be taken on each state->state transition? An example is
a token for numbers, or literals in general: it would build the
numeric representation of the text in the background as it
transitioned, instead of running a standard parsing routine
afterwards. Since you're already actively scanning the text, why parse
it again?
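
Here's a toy version of that per-transition action, for a hand-rolled
decimal-integer machine (my own sketch, not anyone's production
lexer): each digit transition folds the character into the accumulated
value, so the text never has to be re-parsed after the token is
recognized.

```python
def lex_integer(source, pos=0):
    """Scan a decimal integer starting at `pos`. Return (value, end_pos),
    or None if no digit is found there. The value is built as a side
    effect of each state->state transition rather than by a second pass."""
    value = 0
    start = pos
    while pos < len(source) and source[pos].isdigit():
        value = value * 10 + (ord(source[pos]) - ord('0'))  # transition action
        pos += 1
    return (value, pos) if pos > start else None
```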

As for recursive aspects and resolving issues with brackets, that's
where the aforementioned contextual awareness comes in. If your parser
used a numeric identifier to tell the tokenizer which look-up to use
to determine the next proper look-ahead, couldn't you keep a stack of
these context lookup identifiers? Then, based upon whether the parser
was at a valid end-point (where more could follow, but is optional),
the lexer could determine the valid token types from the individual
parse states of all the active areas.

This way, if you were parsing, say, an expression containing a series
of array accesses and/or method calls, such as a(b[c]): the look-ahead
for "b[c]" would say that a '.' could come next (to signify a member
access/call), but the stack says the parse state for that expression
is also at a valid end-point, so the lexer could look higher in the
stack, see that ')' and ',' are valid as well, and merge the two state
sets together to determine the next valid token; the parse wouldn't
terminate until something accepted it, since whatever the case, the
state transition tree says it was valid. Naturally, when I try to
implement this, I'll probably have missed something about the concept
that'll bring it crumbling down, but you learn more by failing than by
succeeding. ;]
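
A small sketch of that stack merge for the a(b[c]) case (the context
names and follow sets below are invented for illustration): each open
construct pushes a context id, each id maps to the token set it would
accept next, and any frame sitting at a valid end-point lets the lexer
merge in the sets of the contexts beneath it.

```python
# Follow sets per context, keyed by a context lookup identifier.
FOLLOW = {
    "index_expr": {"."},         # after b[c], a member access may follow
    "call_args":  {")", ","},    # inside a(...), ')' or ',' may follow
}

def valid_next_tokens(context_stack):
    """context_stack: list of (context_id, at_valid_endpoint) pairs,
    innermost last. Merge follow sets outward for as long as each
    frame sits at a point where its construct could legally end."""
    allowed = set()
    for context_id, at_endpoint in reversed(context_stack):
        allowed |= FOLLOW[context_id]
        if not at_endpoint:
            break  # outer contexts can't accept anything yet
    return allowed
```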

As strange as it sounds, I find programs that build other programs fun
to write. Better yet is when you find a bug: you simply regenerate the
program, and the fix lands in every place where the deficiency
occurred.
