Re: Compiler Compilers: Design and Implementation

"Ira Baxter" <idbaxter@semdesigns.com>
Tue, 10 Feb 2009 10:31:16 -0600

          From comp.compilers

Related articles
Compiler Compilers: Design and Implementation Alex@alexandermorou.com (Alexander Morou) (2009-02-08)
Re: Compiler Compilers: Design and Implementation idbaxter@semdesigns.com (Ira Baxter) (2009-02-10)
Re: Compiler Compilers: Design and Implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-10)
Re: Compiler Compilers: Design and Implementation alexander.morou@gmail.com (Alexander Morou) (2009-02-11)
| List of all articles for this month |
From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Tue, 10 Feb 2009 10:31:16 -0600
Organization: Compilers Central
References: 09-02-020
Keywords: practice, tools
Posted-Date: 11 Feb 2009 05:36:15 EST

"Alexander Morou" <Alex@alexandermorou.com> wrote in message
> I'm researching language design in general, as well as another area called
> 'Generative Programming'. Into that goal I have a few questions for other
> programmers who are interested in parser/compiler design:
>
> 1. Are context-aware lexical analyzers favorable?


If you mean "useful", yes. We build a tool call DMS, that is used to
analyze/modify/generate code. Often we need to parse complex, real
languages. It is pretty handy for lexers to have some context
awareness. Two examples a) a single bit of history, to distinguish
"/" as a "divide by" operator lexeme in the context of operator
position, or "/" as "regular-expression-start" when found in operand
position as JavaScript requires. b) matching nesting of parentheses,
to allow so-called literal strings in languages like PHP to be parsed.
This is needed so when a string like "abc${bar[(foo+b)*3]}def" is
lexed, the lexer can see the expression embedded in the string
literal, lex its tokens, and on finding the appropriate parentheses
match, continue with lexing the balance of the string literal. We
implement this context-awareness by use of lexical modes that are
declared explicitly to the lexer generator. I think most lexer
generators (e.g., Flex) has such modes, too. DMS lexical modes can be
stacked, which is how we get tricks like parentheses matching; I don't
know if standard lexer generators do this, but we've found this
amazingly useful.
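
To make the mode-stack idea concrete, here is a toy sketch in Python
(not DMS code; the token names and the "${...}" syntax are just
illustrative). A mode stack plus a per-mode parenthesis depth is
enough to lex an interpolated string like the one above:

    # Minimal sketch of a mode-stack lexer (illustrative only).
    # It lexes a PHP/JS-style string "abc${(x+y)*3}def" by pushing an
    # EXPR mode at "${" and popping it at the matching "}".

    def lex(text):
        modes = ["STRING"]          # mode stack; top decides how to scan
        depth = [0]                 # parenthesis depth per EXPR mode
        tokens, i = [], 0
        while i < len(text):
            c = text[i]
            if modes[-1] == "STRING":
                if text.startswith("${", i):
                    tokens.append(("INTERP_START", "${"))
                    modes.append("EXPR")
                    depth.append(0)
                    i += 2
                else:
                    tokens.append(("CHAR", c))
                    i += 1
            else:                   # EXPR mode: lex expression tokens
                if c == "(":
                    depth[-1] += 1
                    tokens.append(("LPAREN", c))
                elif c == ")":
                    depth[-1] -= 1
                    tokens.append(("RPAREN", c))
                elif c == "}" and depth[-1] == 0:
                    tokens.append(("INTERP_END", c))
                    modes.pop()
                    depth.pop()
                elif not c.isspace():
                    tokens.append(("EXPR_CHAR", c))
                i += 1
        return tokens

    print(lex("abc${(x+y)*3}def"))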


One might argue that if we're going to include PDAs in the lexer,
we'd be better off using the full power of a GLR parser to process
lexemes, as the scannerless-parser folks seem to do. But it appears
from informal comparisons of speed that the scannerless GLR parsers
don't seem to "lex" as fast (probably because they are tracking vast
numbers of ambiguities), and speed at the lexing level is important
when dealing with large source code bases.


> 2. I've written my share of lexers/parsers/interpreters, and I've
> looked at code from code generators. A lot of the generators emit
> large sets of integers that relate to what I'm guessing is
> state->state transitions based upon the character index and current
> state. Is this really the preferred, or would a series of highly
> optimized state machines be better (for speed, wherein they handle
> the state->state via a switch over the current state, and a
> sub-switch over the character; yielding true if another character is
> possible in the series, and false if it's invalid or at a proper
> edge that terminates that line).


Our lexer generator builds a set of state tables (those "large sets of
integers") that are interpreted by the lexer machinery. We don't use
a switch statement; the lexer always knows which mode and state it is
in. If I had my choice, I'd prefer to go back and generate code to
simulate the state machine for speed reasons. We'll get back to that
someday, but there always seem to be bigger fish to fry.
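
For what it's worth, the difference looks roughly like this in
miniature (a hypothetical identifier recognizer sketched in Python;
neither version is DMS's actual machinery). The first interprets the
table of integers; the second bakes the same transitions into dispatch
code, which is the form a code-generating lexer would emit:

    # Sketch: the same identifier recognizer, table-driven vs. hard-coded.
    # States: 0 = start, 1 = in identifier, -1 = reject.

    def char_class(c):
        if c.isalpha() or c == "_": return "letter"
        if c.isdigit():             return "digit"
        return "other"

    # Table-driven: the "large set of integers" a generator typically emits.
    TABLE = {
        (0, "letter"): 1, (0, "digit"): -1, (0, "other"): -1,
        (1, "letter"): 1, (1, "digit"):  1, (1, "other"): -1,
    }

    def accepts_table(s):
        state = 0
        for c in s:
            state = TABLE[(state, char_class(c))]
            if state < 0:
                return False
        return state == 1

    # Generated-code style: the same automaton as nested dispatch.
    def accepts_code(s):
        state = 0
        for c in s:
            if state == 0:
                state = 1 if (c.isalpha() or c == "_") else -1
            elif state == 1:
                state = 1 if (c.isalnum() or c == "_") else -1
            if state < 0:
                return False
        return state == 1

    assert accepts_table("foo_1") and accepts_code("foo_1")
    assert not accepts_table("1foo") and not accepts_code("1foo")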


> 3. What kind of speed should I be focusing on when I've written the
> parser? I ask this question because I write programs that write
> code, quite frequently, and I've noticed that they tend to emit very
> large files. What's a good test to throw at it to ensure that it's
> not only valid but capable of handling large files (somewhere in the
> range of 10+ MB).


The right answer is: as much speed as you can get, in the most
expensive place. The largest system we've processed with DMS as a
monolithic analysis had 19,000 (nineteen thousand) C compilation units
to parse (with include file and macro expansion), comprising some 15+
million lines of code. As a general rule, anything you can do to
speed up lexing/parsing/flow analysis is highly welcome for the
customer. We
haven't focused on our lexers, because the really costly part of this
analysis is computing a points-to analysis across the entire system;
that takes about 60% of the 3 solid days to process this stuff. We're
working (slowly) to parallelize the points-to analysis.


> 4. Beyond issues with Left Recursion, what are some issues with
> LL(k) parsers that I should be aware of, and why should I ever
> prefer LR (or LALR) parsers? I personally dislike LR parsers due to
> their ambiguity issues (think: hanging else), when writing for an
> LL(k) parser generator, you're aware that it's recursive in nature,
> and the most nested if statement gets the 'else' when the curly
> braces are omitted. Are there any other LL(k) ambiguities that I
> need be aware of?


This discussion occurs over and over. We use GLR parsers (complete
with that ambiguity ickiness) because it means, in practice, that if
you write a context-free grammar, you've pretty much got a parser.
When you try to deal with dozens of languages, simplicity of
specification is really, really nice. If you use LR, you end up
having to hack the parsers to handle real languages, because none of
them (with the exception of Wirth's original PASCAL) are actually LR.
If you use LL(k), you have to figure out what to say for the
lookaheads, which messes up that simplicity idea. And objecting to
the ambiguity doesn't really help you; you still have to handle it
somehow. Here's a classic in the C language: how do you parse


              a * b ;


Looks like a multiply, but it might be a data declaration. With GLR,
you simply pick up both readings and resolve the problem later, after
symbol table construction. With LL or LR, you have to weave symbol
table collection into the parsing process, which makes both of them
a lot harder to understand. I see people repeating the "folk theorem"
that this is how you parse C/C++, and we don't do that at all.
Our parsers and symbol table builders are completely isolated,
which makes both much easier to maintain.
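
Just to make "pick up both and resolve later" concrete, here is a toy
Python sketch (nothing like DMS internals, and a real GLR parser
shares subtrees in a parse forest rather than listing alternatives):
the parse records both readings, and a separate post-parse pass
consults the symbol table to choose:

    # Toy sketch of deferring "a * b;" disambiguation until the symbol
    # table exists (illustrative only).

    def parse_ambiguous(stmt):
        # Both readings of "a * b;" are kept side by side.
        a, _, b = stmt.rstrip(";").split()
        return {"ambiguous": [
            ("multiply_expr", a, b),          # a times b
            ("pointer_decl", a, b),           # declare b as pointer to a
        ]}

    def resolve(node, symbols):
        # Post-parse pass: keep the reading consistent with the symbols.
        for alt in node["ambiguous"]:
            kind, a, b = alt
            if kind == "pointer_decl" and symbols.get(a) == "type":
                return alt
            if kind == "multiply_expr" and symbols.get(a) == "variable":
                return alt
        raise ValueError("still ambiguous")

    forest = parse_ambiguous("a * b ;")
    print(resolve(forest, {"a": "type"}))      # -> pointer declaration
    print(resolve(forest, {"a": "variable"}))  # -> multiplication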


> 5. I want to write an LL(k) parser generator program, and to that
> end I'm hand-writing a parser for C# to get used to writing like a
> generator (by writing state machines myself, and handling context
> awareness, this stuff sucks by hand). I'm thinking about adding
> context awareness to the lexical analyzer, because in certain cases
> it'll be necessary (a simple example is C#'s generics, nesting a
> generic declaration like so: A<B<C>> would be annoying to parse
> without context-awareness, since '>>' is right-shift, unless the
> parser itself dipped into the character stream while parsing a
> generic type. Unless I'm missing something very simple.)


Unless you want to do this for educational reasons, I don't see the
value in writing yet another parser generator. Wikipedia lists nearly
a hundred already.


If you do it, context aware is good :-}
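
On the A<B<C>> point specifically: one common trick (I believe several
C++ and C# front ends do something like this; the sketch below is
illustrative Python, not DMS's approach) is to let the parser split a
">>" token into two ">" tokens whenever it is expecting a closing
angle bracket, so the lexer itself stays context-free:

    # Sketch of the ">>" splitting trick for A<B<C>> (illustrative only).

    class TokenStream:
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos]

        def take(self, expected):
            tok = self.tokens[self.pos]
            if tok == expected:
                self.pos += 1
                return tok
            if expected == ">" and tok == ">>":
                # Split ">>": consume one ">", leave the other in place.
                self.tokens[self.pos] = ">"
                return ">"
            raise SyntaxError(f"expected {expected}, got {tok}")

    def parse_type(ts):
        name = ts.peek()
        ts.take(name)                       # identifier
        if ts.peek() == "<":
            ts.take("<")
            arg = parse_type(ts)
            ts.take(">")                    # may split a ">>" here
            return (name, arg)
        return name

    print(parse_type(TokenStream(["A", "<", "B", "<", "C", ">>", ";"])))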


> 6. I'm planning on writing the C# parser by hand,


Get a life :-} or at least a parser generator. There's so much to do
beyond this that there is little point in spending a lot of time doing
this part.


> ... Context awareness means
> .... passing ~20 bytes of information each lexical step...


You don't want to pay this cost per character. Context state can
generally be kept in lexer-global variables, where it is accessible
quickly and doesn't cost you any overhead to pass from character-state
to character-state.
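
Concretely, the sketch below (toy Python, not DMS) keeps the context
in one lexer-owned record that the rules consult, rather than
threading it through every per-character call; the single bit of
history is the JavaScript divide-versus-regex distinction mentioned
earlier:

    # Sketch: context lives in one lexer-owned record instead of being
    # passed along with every character (illustrative only).

    class LexerContext:
        def __init__(self):
            self.last_was_operand = False   # the "one bit of history"

    def lex_slash(ctx):
        # Decide what "/" means from context kept in the lexer.
        return "DIVIDE" if ctx.last_was_operand else "REGEX_START"

    ctx = LexerContext()
    ctx.last_was_operand = True
    print(lex_slash(ctx))    # DIVIDE      (e.g. after "a" in "a / b")
    ctx.last_was_operand = False
    print(lex_slash(ctx))    # REGEX_START (e.g. after "=" in "x = /ab/")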


> 7. If I write a parser generator, should I have options for multiple
> result parsers? Different parsers have different uses,


You can do that, but it seems like a lot of work for small gain. Our
parsers simply build CSTs compressed to ASTs, which are useful for a
wide variety of tasks.
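
As a rough illustration (a toy Python sketch; DMS's compression is
automatic and its tree representation is different), compressing a CST
to an AST largely amounts to dropping punctuation leaves and
collapsing chains of single-child nonterminals:

    # Toy sketch: compress a CST to an AST by dropping punctuation and
    # collapsing unit productions (illustrative only).

    PUNCTUATION = {"(", ")", ";", ","}

    def compress(node):
        # Leaves are plain strings; interior nodes are (name, children).
        if isinstance(node, str):
            return None if node in PUNCTUATION else node
        name, children = node
        kept = [c for c in map(compress, children) if c is not None]
        if len(kept) == 1:               # unit production: collapse it
            return kept[0]
        return (name, kept)

    cst = ("stmt", [("expr", [("term", ["a"]), "*", ("term", ["b"])]), ";"])
    print(compress(cst))   # ('expr', ['a', '*', 'b'])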


> 8. Similar to 7, but focused on the parse stages. Some languages
> have more than one parse phase, and different rules on how to handle
> them. Should the program be built, what would be a good way to
> handle the different phases (from a language description standpoint,
> I'll be writing a DSL specifically for this project)?


We use lexical modes and nested parsers to handle this. Where one
language is embedded inside another, we simply pick up the embedded
language as a string and pass it to the embedded language's parser to
process as a nonterminal. This keeps the language definitions pretty
much isolated and easier to maintain. The scannerless folks suggest
simply throwing the grammars for both languages into the heap and
letting the parser sort it out, and in theory this works out pretty
well. In our experience, it doesn't seem to make a lot of difference.
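
In outline, the nested-parser idea looks like the toy Python sketch
below (illustrative only; the EXEC SQL delimiters are just an example
of an embedded language): the host parser captures the embedded
fragment as one string and hands it to the embedded language's own
parser:

    # Sketch of the "nested parser" idea for embedded languages
    # (illustrative only).

    import re

    def parse_sql(text):
        # Stand-in for the embedded language's real parser.
        return ("sql", text.strip())

    def parse_host(source):
        # Host language with fragments like EXEC SQL ... END-EXEC.
        ast = []
        pos = 0
        for m in re.finditer(r"EXEC SQL(.*?)END-EXEC", source, re.S):
            ast.append(("host_code", source[pos:m.start()].strip()))
            ast.append(parse_sql(m.group(1)))     # hand off the fragment
            pos = m.end()
        ast.append(("host_code", source[pos:].strip()))
        return ast

    print(parse_host("x = 1; EXEC SQL SELECT * FROM t END-EXEC y = 2;"))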


> I'm researching language theory on my own, and have an interest in
> programs that build programs. I'm sure a lot here have used
> compiler compilers,


:-}


> but I'm not sure how many have written them


:-}}}}


> so there's likely things I'm just oblivious to at this point. Regular
> expression state machines are probably a walk in the park compared to the
> phase that comes after.


Now, that's a profound statement that most people who start down this
path seem to completely miss. The usual mantra is "If I only had a
parser for language XYZ, then..." After the problem of parsing,
there's still the production of trees, building symbol tables,
building flow analyzers, handling multiple (e.g., sometimes tens of
thousands of) files, actually defining means to build custom analyzers
using that machinery, regenerating text, ... and battling scale. I
keep telling people that implementing a parser is like climbing the
foothills of the Himalayas; it seems like progress until you look up
at the peaks. And until you have enough machinery to do something
pretty interesting, nobody will care much. It's hard enough to get
them to care when you do have lots of machinery :-{{


For a comparison of various tools, see
http://www.semdesigns.com/Products/DMS/DMSComparison.html


Ira Baxter, CTO
Semantic Designs

