From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Sun, 16 Nov 2008 15:54:15 -0600
Organization: Compilers Central
References: 08-11-061
Keywords: lex, parse, practice
Posted-Date: 16 Nov 2008 17:55:42 EST
<tuxisthebirdforme@gmail.com> wrote in message
> I know most people nowadays use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed. If you feel forthcoming,
> please tell me a little bit about your implementation or what you
> did for intermediate representation. I'm curious and just a bit nosy
> and would like to know how your experience went doing things this
> way. Thanks.
Building lexers by hand is generally pretty easy. The easiest way is
to draw a finite state machine covering the set of lexemes you want to
handle, and then implement that directly. That's all that FLEX does
for you anyway :-} For older languages (PASCAL, MSBasic, several
metalanguages) this is pretty straightforward. Newer, uglier languages
like PHP make this harder only because they contain messy things like
HERE strings (which are strings whose quotes are identifiers, ick).
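By way of illustration (mine, not from any particular compiler), here
is a minimal C sketch of coding such a machine directly; the token
codes and the input handling are invented, and each labeled region of
the routine is one state of the drawn FSM:

#include <ctype.h>

enum token { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_ERROR };

static const char *input;        /* assumed: next character to scan */

enum token next_token(void)
{ /* start state: skip whitespace */
  while (*input == ' ' || *input == '\t' || *input == '\n')
    input++;
  if (*input == '\0')
    return TOK_EOF;
  /* "in identifier" state */
  if (isalpha((unsigned char)*input) || *input == '_')
  { do input++;
    while (isalnum((unsigned char)*input) || *input == '_');
    return TOK_IDENT;
  }
  /* "in number" state */
  if (isdigit((unsigned char)*input))
  { do input++;
    while (isdigit((unsigned char)*input));
    return TOK_NUMBER;
  }
  return TOK_ERROR;              /* anything else is illegal here */
}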
It is useful to have a character buffering scheme that allows you to
put an arbitrary number of characters back in the input stream, but in
practice putting back just one is generally sufficient.
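For instance (a sketch, with invented names), a one-character putback
layer over stdio looks like this:

#include <stdio.h>

static int pushed_back = -1;     /* -1 means nothing pushed back */

int get_char(FILE *in)
{ if (pushed_back >= 0)
  { int c = pushed_back;
    pushed_back = -1;
    return c;
  }
  return getc(in);
}

void unget_char(int c)
{ pushed_back = c; }             /* holds exactly one character */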
One trick I regularly use(d) is a character-classification lookup
table. This consists of a table of indicator bits indexed by the
ASCII character code to indicate the type of characters. Useful
character classes are "illegal" (e.g., most of the ASCII control
characters), "whitespace", "identifier start", "identifier continue",
"decimal digit", "hexadecimal digit", etc. At each point where the
lexer has acquired a character, it may inspect it directly (check for
the "E" in exponent, if you are collecting the exponent), or look it
up to avoid a set of tests (e.g., "is it a digit").
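A sketch of such a table in C (the exact flag set and initialization
are just illustrative):

enum { CC_ILLEGAL = 0x01, CC_SPACE = 0x02, CC_ID_START = 0x04,
       CC_ID_CONT = 0x08, CC_DIGIT = 0x10, CC_HEXDIGIT = 0x20 };

static unsigned char char_class[256];

void init_char_class(void)
{ int c;
  for (c = 0; c < 32; c++) char_class[c] = CC_ILLEGAL;
  char_class[' '] = char_class['\t'] = char_class['\n'] = CC_SPACE;
  for (c = 'A'; c <= 'Z'; c++) char_class[c] = CC_ID_START | CC_ID_CONT;
  for (c = 'a'; c <= 'z'; c++) char_class[c] = CC_ID_START | CC_ID_CONT;
  char_class['_'] = CC_ID_START | CC_ID_CONT;
  for (c = '0'; c <= '9'; c++)
    char_class[c] = CC_DIGIT | CC_HEXDIGIT | CC_ID_CONT;
  for (c = 'A'; c <= 'F'; c++) char_class[c] |= CC_HEXDIGIT;
  for (c = 'a'; c <= 'f'; c++) char_class[c] |= CC_HEXDIGIT;
}

/* one lookup replaces a chain of comparisons: */
#define IS_DIGIT(c) (char_class[(unsigned char)(c)] & CC_DIGIT)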
For the start state, at least, it is useful to do an indexed branch
on the first character, and it pays to hand-code the state that
notices the first whitespace character so that it collects as long a
run of whitespace as it can before rebranching on the next character.
Remarkably, I'm still working on a compiler that has such a
hand-coded lexer, and it regularly compiles 500K SLOC in a few
minutes.
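Sketched in C, building on the table and putback routines above (a
revised next_token; scan_identifier and scan_number are hypothetical
helpers, declared here only to keep the sketch compilable):

enum token scan_identifier(FILE *in, int c);   /* hypothetical */
enum token scan_number(FILE *in, int c);       /* hypothetical */

enum token next_token(FILE *in)
{ int c = get_char(in);
  for (;;)
  { switch (c)                   /* indexed branch on 1st character */
    { case ' ': case '\t': case '\n':
        do c = get_char(in);     /* gobble the whole whitespace run */
        while (c == ' ' || c == '\t' || c == '\n');
        continue;                /* then rebranch on the next one */
      case EOF:
        return TOK_EOF;
      default:
        if (char_class[(unsigned char)c] & CC_ID_START)
          return scan_identifier(in, c);
        if (char_class[(unsigned char)c] & CC_DIGIT)
          return scan_number(in, c);
        return TOK_ERROR;
    }
  }
}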
Similarly, building parsers for many languages is pretty easy. Write
out a non-left-recursive grammar with rules of the form
A = B C D;
For each nonterminal A, code a subroutine that returns
a boolean flag:
boolean subroutine A()
{ if (!B()) return false;
  if (!C()) syntax_error();
  if (!D()) syntax_error();
  return true;
}
You need similar boolean-returning subroutines for terminals produced
by the lexer. This doesn't work directly for languages that you can't
define LL rules for, but it's amazing what you can hack in to step
around such things. Of course, this is just parsing; you have to
interleave actions to do semantic checking, code generation or tree
generation as you see fit, and that's really where the bulk of
compiler work is.
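To make the scheme concrete, here's a toy instance in C (my
illustration, with invented names and a stub lexer standing in for a
real one, so the whole thing runs) for the little grammar

stmt = IDENT "=" expr ";"
expr = term { "+" term }
term = IDENT | NUMBER

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

enum token { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_EQ, TOK_PLUS, TOK_SEMI };

/* stub lexer: pretends to have scanned "a = 1 + b ;" */
static enum token script[] =
  { TOK_IDENT, TOK_EQ, TOK_NUMBER, TOK_PLUS, TOK_IDENT, TOK_SEMI, TOK_EOF };
static int pos;
static enum token scan(void) { return script[pos++]; }

static enum token lookahead;

static void syntax_error(void)
{ fprintf(stderr, "syntax error\n");
  exit(1);
}

static bool match(enum token t)        /* the "terminal subroutine" */
{ if (lookahead != t) return false;
  lookahead = scan();
  return true;
}

static bool term(void)
{ return match(TOK_IDENT) || match(TOK_NUMBER); }

static bool expr(void)
{ if (!term()) return false;
  while (match(TOK_PLUS))              /* { "+" term } */
    if (!term()) syntax_error();
  return true;
}

static bool stmt(void)
{ if (!match(TOK_IDENT)) return false; /* may still fail quietly */
  if (!match(TOK_EQ)) syntax_error();  /* committed now: errors fatal */
  if (!expr()) syntax_error();
  if (!match(TOK_SEMI)) syntax_error();
  return true;
}

int main(void)
{ lookahead = scan();                  /* prime the lookahead */
  return stmt() ? 0 : 1;
}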
You can find writeups of this kind of parsing in most compiler texts.
But the background for this comes from a paper that I think everybody
in the compiler business should have memorized :-} This is Val
Schorre's 1964 (not a typo) paper on Meta II, a syntax-directed
metacompiler.
In ten pages, he shows you how to build compilers using metacompiler
methods, by defining a metacompiler machine, showing you how to build
it, and defining two languages (including the metacompiler language)
based on it. If you haven't seen this paper, I *strongly* encourage
you to get and read it. This for me is computer science at its best.
http://portal.acm.org/citation.cfm?id=808896&dl=ACM&coll=
Having said all this, my favorite method for building parsers these
days is to use a really powerful tool for converting sets of regular
expressions for token definitions into the FSA to detect such tokens,
and to use a GLR parser generator to convert a grammar directly into
an engine to parse a context-free language including the automatic
building of syntax trees. There are just so many other compiler
things to do in life that hand-coding lexers/parsers just isn't worth
it anymore.
[You still have to do ugly things with HERE docs as they aren't context-free
:-{]
If you want to learn more about that, check out our web site.
--
Ira Baxter, CTO
www.semanticdesigns.com