Re: Good practical language and OS agnostic text?

BGB <>
Sat, 21 Apr 2012 18:58:29 -0700

          From comp.compilers


Newsgroups: comp.compilers
References: 12-04-019 12-04-056
Keywords: design
Posted-Date: 22 Apr 2012 01:08:41 EDT

On 4/21/2012 2:22 AM, Uli Kusterer wrote:
> On 17.04.2012, at 23:28, wrote:
>> Guys, I'm having a bear of a time finding a good practical language
>> and OS agnostic text on writing a compiler. I'm weak in math and not
>> interested in the theoretical details. I want to understand the hows
>> and whys of compiler writing.
> Hi,
> I usually do C-style programming languages, and I'm by no means a
> professional compiler writer, but I enjoy parsers, so I've been doing
> stuff like this for ages and sponging up any bit of knowledge that
> sounded useful.


I have mostly been implementing my own custom scripting language (for my
own uses), and wrote a C compiler before, although the C compiler
"decayed" mostly into being a code-processing tool (largely because
maintaining a full C compiler turned out to be "not terribly useful" in
my use-case).

as-is, the scripting language is probably at a vaguely similar
complexity level to C (in some ways it is simpler, in others likely
more complex), as it gradually moves in the direction of being
"essentially" statically-typed, has pointers, class/instance OO, ...

(decided to leave out a bunch of stuff related to some of the hassles of
static type-checking, and dealing with the matter of having a scoping
model where the scope is both mutable and involves indirect visibility,
meaning that variable assignments can alter what bindings may be
potentially visible from a given piece of code, making determining all
this statically a relatively non-trivial problem).

I have recently been looking at moving more of the logic for doing some
of this from the VM back-end into the front-end, as there are likely
some cases that the front-end can deal with more easily but which would
be a hassle to add to the back-end (at the cost of adding a big pile
more VM bytecode ops).

now, how is any of this simpler than a C compiler?
mainly in that, at least at present, my main target is an interpreter,
and weak performance is generally passable.

> IMO if you know assembler or BASIC and general algorithms (i.e. you
> could implement a binary tree and walk its nodes), and you can somehow
> figure out what bytes your code gets compiled to (at worst by writing
> a dummy assembler program and looking at what bytes show up between a
> bunch of nop instructions whose bytes you know), you should be able to
> at least get a basic working knowledge of how a compiler works. Just
> write the naive approach of how you would design any program.


> The things that I benefited most from learning about compilers were:
> - Recursive descent parsers: It's the obvious way to write a parser.
> You write a function ReadProgram(), which calls ReadLine() in a loop,
> which looks at the first word and then calls ReadIfStatement() if it's
> "if" etc. If you find you go wrong, you add a way to either get back
> to the last known good state (backtracking) or you just give an error
> message. On the way you keep track of the variables in each function
> and can later calculate their stack offsets once you know how many you
> need etc.

although I use recursive descent, the above sounds different from what I
usually do.

the general structure of my parsers is more like:

ReadBlock: reads statements until EOF or '}' is seen;
      calls ReadBlockStatement in a loop.

ReadBlockStatement:
      peeks token;
      sees if it is a known block-statement keyword (if/for/while/...):
          invokes the appropriate parsing logic for that keyword;
      tries to determine if it is a declaration:
          sees if parsing as a declaration works;
      calls ReadStatement;
      calls EatSemicolon.

ReadStatement:
      checks for and handles various statement types;
      calls ReadExpression.

ReadExpression:
      (actually, this is a tower of functions, one for each precedence
level, working from lowest to highest precedence)
      typically (not all precedence levels are exactly the same):
          read expression at next lower level (n1)
          while(next token is a recognized operator)
              read operator token
              read expression at next lower level (n2)
              replace n1 with 'n1 <op> n2'
      (eventually) calls ReadLiteral.

ReadLiteral:
      peeks token;
      dispatches to the appropriate handler if a keyword;
      checks for token types, parsing out whatever:
          numbers become numeric literals;
          identifiers become loads;
          string tokens become string literals;

typically, the result of all of this is to produce a syntax tree.

C is a little uglier here, since it requires keeping track of typedefs
and checking for typedef when trying to parse declarations.

> - Tokenizing: Essentially grab all the words in your source text and
> build an array with an entry for each so you can more quickly walk
> forward and backward without having to scan individual characters. You
> can also attach additional information to a token, i.e. an ID number
> so you can quickly distinguish user-defined identifiers from known
> identifiers like "if" or "repeat". Keep around a token's start offset
> so you can report errors.

partly agreed, except that it isn't really strictly necessary to build
an array up-front.

most of my parsers don't bother using an array, but instead just call
the tokenizer function directly to peek and parse tokens based on the
current "stream position" (generally a raw char pointer, in my
language's case).

the version used in my C compiler currently uses a small hash table
(sort of, the "hash function" is simply to keep only the low-order bits
of the address) to avoid re-reading the same token multiple times.

practically though, it may be both higher performance and generally
cleaner to just build a token array up-front.

I had considered using a table in the C compiler, but the "hash" option
was less effort (the parser was already written in this case, so the
hash could be added via a quick hack, rather than by making significant
changes to the rest of the parser to use a token array instead of
direct function calls).

a potential advantage of not using an array though is not needing to
allocate memory for it.

> - Syntax tree: Build a tree structure that represents the parsed program. You
> can then do things like evaluate constant nodes ahead of time (e.g. turn 5+4
> into 9, but leave 5 + n as an addition node with a 5 and a reference to
> variable 'n') by replacing groups of nodes with simpler nodes recursively.
> Keep around the line number/first token offset in each node so you can report
> errors.


in my case, I have generally used either "lists" built from "cons cells"
for this purpose, or in other cases a variant of XML DOM nodes.

I had actually thought this was fairly standard practice, until looking
into it and observing that, no, apparently most people store their
parse trees in raw structures. either way works, I guess.

typically, my parsers do almost no direct processing on the ASTs while
building them, apart from building a semantic representation of the code.

typically, things like simplifying expressions, propagating constants,
... are done in a stage I have typically called "reduction", which
happens between parsing and producing the (bytecode) output.

> - Code generation: A good starting point I've found is to generate a single
> function. Write a little application that takes a file containing the raw
> bytes of your function, load the pointer (mark it as executable if your
> platform requires that), and then just call it. You write the prolog, the
> epilog, and the raw instructions in between, and maybe do some labels. You may
> have to remember the offset of a particular field and fill in the address
> later on (e.g. for a return statement the offset to the epilog, so you don't
> duplicate your clean-up code). Then advance to two functions that call each
> other in the same block etc.
> - Complete code generation: you generate several functions and put them in a
> proper executable file, the way your OS expects. You may have to generate link
> table entries ("trampolines") to call dynamically-linked system functions etc.
> If you want to start simpler, write your own primitive linker just for the fun
> of seeing how two functions that don't reside at fixed (relative) addresses
> could call each other.

at this point, things go a bit differently, and several more stages are
involved in my case:

- typically, after the AST has been "reduced", it will be transformed
into bytecode (basically, walk the AST, figure out expression types when
relevant, and spit out any relevant bytecode ops).

previously, only a very small amount of type-analysis was done here, but
I am looking to expand this (partly to simplify the work being done in
the interpreter back-end), such that the output produced here will be
"mostly statically typed".

I am looking at producing bytecode output more similar to
Java bytecode, basically where many opcodes are qualified for specific
types, rather than leaving it solely to the back-end to figure out what
the types are. this is a little lame, given that I found it elegant how,
for example, MSIL/CIL just used a few simple operations and let the
back-end do its thing; but some things are currently more awkward
here, and at present this leaves some awkward holes in the type-system.

it all still leaves a bit of uncertainty though (as to whether
adding/using piles of type-annotated opcodes is really "the right
direction to be going").

- typically (assuming the interpreter), the bytecode will be processed
and then transformed into threaded code, producing a threaded-code
version of the program. there are some parts which use generated native
code thunks, but for the most part, pretty much all of the machinery
takes place in native C functions. as-is it proves problematic to do any
non-trivial scope and type-analysis at this stage, short of making the
threaded-code logic contain most of the "hard parts" of a full native
code-generator (as well as likely making it considerably slower).

some past experiments (special-purpose tests) have implied that
threaded code can potentially pull off "reasonably solid" interpreter
performance (potentially within 3-5x of native).

however, at present it is considerably slower (at least with tests like
recursive Fibonacci and bubble-sort; with most "practical" code the
slowdown seems smaller, but typically because most of the actual logic
takes place in native code in those cases).

- if a JIT / native code-generator is being used, the process works like
this: complex operations are generally decomposed into simpler ones, and
much of the type information from the prior stage is discarded (the
code-generator does its own type-analysis, unlike the "dumb"
interpreter; see note 1);
it will basically start figuring out how types propagate through the
stack and through variables;
it will map operations to physical storage and/or registers;
it will then spit out the generated assembler code.

1: the interpreter already uses some amount of explicit type
information, as well as a number of "complex compound operations".
these are fairly helpful for interpreter performance, but fairly
pointless for generating native code, so most such operations are broken
down into "simpler primitive operations" (for example, a single opcode
for "perform 2 loads, compare the values, and perform a conditional
jump" is less useful than the individual operations for doing so).

the JIT for my scripting language is not fully written or operational;
partial results thus far have looked promising, but this is a low
priority. mostly it has big holes in the ISA and type-system coverage
(currently, only the dynamically-typed path is really implemented).

- any ASM produced by the JIT is fed into the assembler, converting it
into a COFF module (COFF is used here regardless of whether the target
OS is Windows or Linux).

(note: the assembler is currently used by a number of different
components, so the lack of a currently functional JIT doesn't mean it
goes unused.)

- COFF modules are fed into an in-program linker, which proceeds to link
them against the current program image. there is some amount of trickery
here, potentially involving both invoking code-generation handlers and
mechanisms for deciding whether or not to "proxy" function calls
(IOW: whether to link a function call directly against the target
address, or use an indirect jump through another pointer).

reasons a proxy may be used: the call is out-of-range (on x86-64, this
usually means the call has gone outside the +-2GB window); or the code
has been "hot patched", or the linker has reason to suspect that such
"hot patching" is likely to occur (currently, this involves ...)

> I can't really say anything about optimization on the compiled code level. I
> suppose you'd build a data structure with additional information about each
> instruction and then loop over it, looking for patterns that you know can be
> simplified. Essentially the same as any other search-and-replace.

this is one possibility...

the strategy used by the codegen backend for my C compiler involved
"large numbers of special cases", but this ultimately proved a bit
problematic to debug (this type of "optimization" turns out to be a
debugging mess in most cases where one doesn't have fairly immediate
feedback as to when/where/how something has broken).

funny though: even just using "straightforward" strategies, like
caching variables in registers, ..., the code still manages to look a
bit more optimized than what usually seems to be spit out by MSVC.

I have not often been entirely impressed by what I have seen being spit
out by MSVC, and my own code-generators have often been able to
outperform it (and be competitive with GCC in my tests), however...:
making my code generators "actually work" often proves a bit more difficult.

hence why I am still mostly using an interpreter-based strategy:
it is much easier IME to maintain and debug the interpreter, whereas IME
my native code generators have more often tended to blow up in my face
than actually work reliably, and it is much more difficult than it would
seem to dig through a big pile of assembler output looking for whatever
little bug is causing the code not to work.

granted, "I am using an interpreter" sounds a lot less impressive than
"I am using a JIT compiler", although I remember at least one VM which
claimed to be using a JIT but was really just using threaded code (in
pretty much the same way I am using it in my interpreter), so I was not
terribly impressed...

I guess there would always be a "cheap but lame" way to write a JIT
(and, actually, how my first working JIT worked):
just use fixed-form globs of ASM for each VM opcode, and simply append
them together in-series (emitting prologues/epilogues and labels and
so-on as-needed).

ironically, at the time I was getting nearly the same performance as
unoptimized GCC output doing this, so I guess it probably wouldn't be
entirely unworkable (compared with trying to write more complex
code-generators with type-analysis and register allocation and so on...).

( it sometimes almost makes me wonder if I am incompetent or something
in these regards. )

> Is that un-computer-sciencey enough? This blog post may help with the basics
> of my code generation approach:
> (but
> it's C, and Intel, and badly wrapped by the new theme of my blog).

