Re: Modularize compiler construction?

"BGB / cr88192" <cr88192@hotmail.com>
Sun, 31 Jan 2010 16:41:14 -0700

          From comp.compilers


From: "BGB / cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Sun, 31 Jan 2010 16:41:14 -0700
Organization: albasani.net
References: 10-01-080 10-01-082 10-01-086 10-01-095
Keywords: design
Posted-Date: 01 Feb 2010 18:26:02 EST

"Ira Baxter" <idbaxter@semdesigns.com> wrote in message
> "cr88192" <cr88192@hotmail.com> wrote in message


>> however, given as much emphasis as so many people seem to put on parser
>> generators, it is not so hard to see how some people could be misled into
>> thinking that the parser IS the compiler.
>
> Parser generators are only a small part of the tools one would want for
> compiler construction. Why people believe that they are a significant
> part is a complete mystery to me, especially if they've had a compiler
> class or attempted to build a compiler.
>


yeah.


even with a hand-written parser which handles C, and partly handles Java
and C# syntax, the whole parser is still only a few kloc (basically, the
same parser is used in all 3 cases, with parts enabled and disabled for
the different languages).
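

as a rough sketch of what I mean by this (hypothetical names, not my
actual code), something like:

#include <stdio.h>

/* rough sketch (hypothetical names): one hand-written parser serving
   several C-family languages, with constructs enabled or disabled
   depending on which language is selected. */

enum lang { LANG_C, LANG_JAVA, LANG_CSHARP };

struct parser { enum lang lang; /* token stream, etc. would live here */ };

static int allows_classes(const struct parser *p) {
    return p->lang == LANG_JAVA || p->lang == LANG_CSHARP;
}

static void parse_toplevel(const struct parser *p) {
    if (allows_classes(p))
        printf("would accept a class declaration here\n");
    else
        printf("C mode: only declarations and functions at top level\n");
}

int main(void) {
    struct parser pc = { LANG_C }, pj = { LANG_JAVA };
    parse_toplevel(&pc);
    parse_toplevel(&pj);
    return 0;
}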


admittedly, the Java and C# syntax support is still incomplete, and I have
not as of yet attempted to deal with the full-on horrors of C++ syntax (and
I probably will not...).






> Well, there are toolsets that are designed to support compiler
> construction. The New Jersey Machine-Code Toolkit comes to mind. There's
> a quite good list at http://catalog.compilertools.net/.
>


yes, ok.




>> BNF, sort of like traditional math, and I also suspect SSA,
>> ... represent a "mode of thinking" that I personally don't seem
>> particularly skilled with (the great oddity is that programming is
>> not particularly difficult, but I am apparently
>> mathematically-unskilled, and find that many "formalisms" seem to
>> defy understanding...).
>
> They don't, and they are well worth the trouble. Formal notations
> with precise meaning lead to tools which lead to huge leverage.
>


well, I can typically modularize and define things well enough...


I can utilize BNF, granted, but some of the more "subtle" aspects (much
beyond its general behavior and left vs right recursion) pose conceptual
problems.
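

one part I do follow well enough to illustrate (a minimal sketch of my
own, not from any real parser): a left-recursive rule like
"expr ::= expr '+' term | term" can't be transcribed directly into
recursive descent (it would recurse forever), so it gets rewritten as a
loop, which conveniently also keeps '+' left-associative:

#include <stdio.h>

static const char *src;

static int term(void) {               /* term ::= digit */
    return *src ? *src++ - '0' : 0;
}

static int expr(void) {               /* expr ::= term { '+' term } */
    int v = term();
    while (*src == '+') {             /* the loop replaces left recursion */
        src++;
        v = v + term();
    }
    return v;
}

int main(void) {
    src = "1+2+3";
    printf("%d\n", expr());           /* prints 6 */
    return 0;
}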


the problem is that I find many of these existing formalisms difficult to
fully understand.
for example, SSA is something I have been trying to make sense of for
several years now, but it still escapes me (mostly confusion related to
the use of temporaries and the phi operation).


IMO, RPN is actually much easier to understand, and I don't seem to have
that much difficulty figuring out how to convert RPN to SSA; converting
SSA into C or ASM also makes sense. but going directly from ASTs to SSA I
still don't fully understand (as then I have little idea when and how
exactly to go about generating the temporaries and phi's).
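

the closest I have gotten to making sense of phi's is a small worked
example (my own illustration, so take it with a grain of salt): a phi
shows up exactly where two control-flow paths join and each path
assigned its own version of the same variable:

#include <stdio.h>

/* for the function below, the SSA form is roughly:
       x1 = a + 1         ; assignment on the then-path
       x2 = b * 2         ; assignment on the else-path
       x3 = phi(x1, x2)   ; at the join block after the if/else
       ret x3
   so the phi is generated at the join block, with one argument per
   incoming edge, naming "whichever version of x reached this point". */

static int example(int c, int a, int b) {
    int x;
    if (c) x = a + 1;   /* becomes x1 */
    else   x = b * 2;   /* becomes x2 */
    return x;           /* becomes x3 = phi(x1, x2) */
}

int main(void) {
    printf("%d %d\n", example(1, 4, 5), example(0, 4, 5));   /* 5 10 */
    return 0;
}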


in the RPN->SSA case, the "when and where" aspects of temporaries and
phi's tend to be a bit more obvious.
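

for example (again a minimal sketch of my own, not actual compiler
code): since each stack push can just grab a fresh temporary, every name
is assigned exactly once, so straight-line RPN is already more or less
in SSA form; the phi's only enter the picture at control-flow joins:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *rpn[] = { "a", "b", "+", "c", "*" };   /* (a+b)*c */
    int n = sizeof rpn / sizeof rpn[0];

    const char *stack[64];    /* virtual stack holding SSA value names */
    char tmpname[64][8];
    int sp = 0, ntmp = 0;

    for (int i = 0; i < n; i++) {
        const char *tok = rpn[i];
        if (!strcmp(tok, "+") || !strcmp(tok, "*")) {
            const char *rhs = stack[--sp];   /* pop both operands */
            const char *lhs = stack[--sp];
            snprintf(tmpname[ntmp], sizeof tmpname[ntmp], "t%d", ntmp);
            printf("%s = %s %s %s\n", tmpname[ntmp], lhs, tok, rhs);
            stack[sp++] = tmpname[ntmp++];   /* push the fresh temporary */
        } else {
            stack[sp++] = tok;               /* variable: push its name */
        }
    }
    printf("; result in %s\n", stack[sp - 1]);   /* t0 = a + b; t1 = t0 * c */
    return 0;
}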


however, I guess the basic point of SSA is that people can do useful stuff
while in SSA form, but then again, I am not sure what can be done or how to
do it, ...


once again, I end up finding it easier to imagine performing optimizations
on the RPN version of the program...




the next issue of formalisms is that of things like first-order logic and
set theory:
in the simple cases, both make sense;
a set is then a finite collection of objects, and one can imagine sets in
terms of operations on collections;
similarly, first-order logic is, in its simple form, a collection of
operations which evaluate to either true or false, and one can easily
enough imagine a formula as a big structure of such operations.




but, in both cases, one finds that fairly soon something "malevolent"
happens:
the sets expand beyond being regarded as simple aggregates, take on many
aspects like those of predicate logic, and start being used to refer to
things which can't easily be conceptualized as sets ("the set of all
reals", "the set of real numbers a<x<b", ...).


this is confusing, as "x<y" is something which makes sense to me as a
predicate, but doesn't make sense to me as a set (yet a teacher will
claim that it is a set, rather than a predicate...).
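

about the only way I can half-way reconcile the two views is in code (my
own illustration): identifying a set with its membership predicate (what
the math people call its "characteristic function"), so that
"{ x : x < y }" and "x < y" are the same thing seen two ways:

#include <stdio.h>
#include <stdbool.h>

typedef bool (*set_of_double)(double);    /* a set as its membership test */

static double y = 10.0;
static bool less_than_y(double x) { return x < y; }   /* the predicate... */

int main(void) {
    set_of_double s = less_than_y;    /* ...reused, unchanged, as a set */
    printf("5 in s?  %d\n", (int)s(5.0));    /* 1, since 5 < 10 */
    printf("12 in s? %d\n", (int)s(12.0));   /* 0 */
    return 0;
}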


similarly, predicate logic starts taking on properties like those of
sets, ...
and then there is this combined entity which no longer makes sense (to
me, predicates and sets seem like "fundamentally different entities", yet
somehow, it seems this is not the case...).


I keep running into these beasts in math classes, and am at a loss as to
how to make sense of them...


like, one can see that things are there, but how to understand them is
itself a problem...




then again, AFAICT there are also some apparent limits as to my ability to
deal even with English semantics, so I guess conceptual limitations here and
there are unavoidable.




>> sadly, it does not take one long to find, if implementing a compiler
>> for a non-trivial language (such as the main C-family languages),
>> that the parser is not really the big source of complexity (but, at
>> least with C and C++, it can be a big source of slowness, which is
>> not quite the same issue...).
>
> The *language* C++ is the "source of complexity". The problem of parsing
> it has been killed dead several times. You can do it (clumsily) with
> LALR parsers and tangling of symbol tables and parsing actions.
> You can do it by recursive descent and similar tangling (GCC, I think
> EDG).
> You can do it using GLR parsers, with *complete* separation
> of parsing and symbol table collection (our DMS Software Reengineering
> Toolkit does this), which means you can write a grammar that almost
> mirrors the reference grammar.
>


interesting...




> Further, AFAIK, GLR parsers aren't necessarily slow. (The DMS parser
> isn't particularly fast, but then we capture comments and source
> format information as we go). Adrian Johnstone and his students have
> done a bunch of work on tuning GLR parsers. I think it was Scott
> McPeak who implemented an optimization that makes GLR parsers run
> like LALR(1) [in terms of computation effort] in the highly frequent
> case that there are not multiple parses in a particular part of the
> grammar. And Pennello showed how to "compile" LALR parsers to machine
> code for extremely fast parsing ("Very fast LR parsing", ACM SIGPLAN,
> July 1986). [We'll get around to composing this set of ideas
> someday]. Once you do this, much of the front end slowness is in the
> lexer.
>


yes, ok.


admittedly, I have not gone that far into matters of parsing...
I have used mostly recursive descent, but I suspect that much past a
basic C-style syntax, the parser logic starts to become overly
complicated.


then again, even as such, the parser logic is still nowhere near as awkward
as the codegen logic...




in my case, the main source of slowness is the tokenizer (typically the
top-ranking position in the profiler), with the time spread (for the
upper end) being around 30-40% preprocessor, 50% main parser, and 10-20%
other stuff (dumping the AST, main compiler logic, ...).


AFAICT, walking the AST and "doing stuff" is apparently much cheaper
than building the AST, although I am not sure exactly why this is the
case.




I have observed that MSVC seems to have a very fast C parser (apparently
much faster than either my compiler or GCC), although I am at a loss as to
the exact reason.


so, my observations:
MSVC: often fairly fast compile times, but x64 seems to have a notably
poor optimizer;
GCC: a bit slower WRT compile times;
my compiler: a little slower than GCC (though not by much at this point,
as lots of internal optimizations have made it much faster than it was
originally, which is to say, absurdly slow...).


I am not sure how either MSVC's or GCC's parsers work, or why MSVC seems to
be somewhat faster than GCC wrt compile times...




>> yes, however GCC does have a few down-sides:
>> like many GNU projects, it is often very difficult to get it to build on
>> Windows, and so help you if you want to build it with MSVC...
>> the code is large, and the code is nasty.
>
> That's all relative. You're dealing with a complicated beast, whose
> complications are driven by the complications of the language(s) and
> multiple targets they are trying to manage. You can get much simpler
> beasts, but you won't be happy with them because they won't address
> real languages or real targets. And GNU shows its roots: it grew
> organically from an early compiler. It suffers somewhat from not
> being designed to be modular from the very beginning.


I got much simpler with a compiler which parses C and targets x86 and
x86-64.
admittedly, even with this much, my compiler is still fairly nasty...




> There are more organized attacks on building compiler-like tools. The
> DMS Software Reengineering Toolkit
> (http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html) is a set
> of tools for defining lexers, parsers with automatic tree building and
> built-in error recovery, attribute evaluators, symbol tables, control
> flow analysis, local data flow analysis, local and global points to
> analysis, source-to-source program transformations, prettyprinting,
> ... And like GCC, it is attempting to handle many languages (and it
> does: C, C++, Java, C#, COBOL, PHP, ...). And it's pretty complicated,
> too, because the machinery it implements is designed to support the
> union of the complications from all these languages (you can't do C++
> unless your symbol table machinery can handle all of its peccadilloes).
> I'd like to think that DMS is "cleaner" than GCC, but we aren't trying
> to build high performance compilers (yet!). So there's hope yet :-}


yes, ok.


not sure what to say here.




my strategy is mostly to modularize things by tasks and use "general
representations" between components (rather than directly coupling together
the machinery and/or semantics between the parts).
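

as a sketch of the general idea (illustrative only, with made-up names):
two stages coupled only through a flat serialized form, so neither side
ever sees the other's internal structures:

#include <stdio.h>
#include <string.h>

/* stage 1 owns its own AST internally and exports only a plain string */
static void front_end(const char *src, char *out, size_t n) {
    (void)src;   /* pretend we parsed "a+b" into an AST and dumped it */
    snprintf(out, n, "(add (var a) (var b))");
}

/* stage 2 consumes only the serialized form */
static void back_end(const char *ir) {
    if (!strncmp(ir, "(add ", 5))
        printf("emit: load a; load b; add\n");
}

int main(void) {
    char ir[128];
    front_end("a+b", ir, sizeof ir);
    back_end(ir);   /* the only contract between stages is the textual IR */
    return 0;
}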


I also try to avoid the "single unified superstructure" design common to
many VM projects (which I personally view as an enemy to modularity).


however, as I have found, this is itself easier said than done, and
sadly there is a fair amount of "internal coupling" in many places (the
worst offender being the "main codegen", also known as the RPNIL / SXIL
compiler). this is a big complicated beast I have thus far been unable
to do much about (either by dissolving the thing or by finding a better
IL).


my project also partly supports JBC (Java-ByteCode), but JBC is too weak
to be used as a general-purpose IL (for example, it has no raw pointers
and no pass-by-value structs, both of which compiling C requires).


MSIL / CIL is a much more powerful IL than JBC, but thus far I do not
have much at all of a working implementation (due mostly to its
complexity and awkwardness). actually, the main issue "to be resolved"
at the moment is how to "generally" work with the .NET metadata, which
is stored as relational-ish tables, a form which I am not sure how to
access or work with in a good or general way (or even how to convert to
a more convenient representation).


the ideas which currently come to mind are either to use structured
keys, or to rig up some sort of query interface, ...


however, the "cleanest" way I can think up to rig up a query interface,
namely via textual key-strings, seems like it would risk being overly slow
(whereas a more efficient interface would likely be less modular, as it
would lead to a lot of dependency on the internal structuring and
representation of the data...).


this is awkward, and I guess I just find it much easier to work with a
hierarchical database (sort of like the Windows Registry), which is sort
of why I originally ended up using this representation for my metadata
format.
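

for example (hypothetical names, and a much-simplified sketch of the
registry-like scheme): metadata gets addressed via path-like textual
keys, and callers never depend on the internal layout:

#include <stdio.h>
#include <string.h>

typedef struct { const char *key, *value; } entry;

/* a flat array standing in for the real data store; the internal
   representation can change freely behind the key-string interface */
static entry store[] = {
    { "Assembly/Foo/Types/Bar/Methods/Baz/Sig", "(int)->void" },
    { "Assembly/Foo/Types/Bar/Fields/Count",    "int" },
};

static const char *lookup(const char *key) {
    for (size_t i = 0; i < sizeof store / sizeof store[0]; i++)
        if (!strcmp(store[i].key, key))
            return store[i].value;
    return NULL;
}

int main(void) {
    printf("%s\n", lookup("Assembly/Foo/Types/Bar/Methods/Baz/Sig"));
    return 0;
}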


initially, I started trying to beat together some .NET-like
relational-table-based stuff, but after a short while gave up on the
strategy, as it seemed like it would be unreasonably awkward to work
with... (whereas semantic hierarchical keys allow a fairly complex
data-store to be managed with both a relatively simple interface and a
relatively simple internal structure...).

