Re: writing interpreters, was Good practical language and OS agnostic text?

Uli Kusterer <ulimakesacompiler@googlemail.com>
Sun, 22 Apr 2012 12:53:45 +0200


From: Uli Kusterer <ulimakesacompiler@googlemail.com>
Newsgroups: comp.compilers
Date: Sun, 22 Apr 2012 12:53:45 +0200
Organization: Compilers Central
References: 12-04-019 12-04-056 12-04-060
Keywords: design, interpreter
Posted-Date: 22 Apr 2012 10:24:38 EDT

On 22.04.2012, at 03:58, BGB wrote:
> as-is, the scripting language is probably at a vaguely similar
> complexity level with C (in some ways it is simpler, and in others
> likely more complex), as it gradually moves in the direction of being
> "essentially" statically-typed, has pointers, class/instance OO, ...


  My programming language, Hammer, goes in the opposite direction. It's
essentially a dynamically typed scripting language (at least as far as the user
is concerned). Every variable is a variant, but I keep values around in their
native type. Also, the syntax is very English-like and the intention is that a
user *can not* make it crash. Of course, in practice that may not hold true
due to bugs, but it is different from C, where there are a lot of "user
errors" that are well within their rights to cause a crash.


> (decided to leave out a bunch of stuff related to some of the hassles of
> static type-checking, and dealing with the matter of having a scoping
> model where the scope is both mutable and involves indirect visibility,
> meaning that variable assignments can alter what bindings may be
> potentially visible from a given piece of code, making determining all
> this statically a relatively non-trivial problem).


  Yeah, that sounds more like it ;-)


> now, how is any of this simpler than a C compiler?
> at least at present my main target is an interpreter, and weak
> performance is generally passable.


  The nice part about writing your own interpreter is that you don't have to
mess as much with predefined platform specifics. You can define instructions
the way you need them to make them easy to parse. Also, it's easy to write a
debugger into your interpreter loop, and check out the contents of your stack
and lots of other things.


> although I use recursive descent, the above sounds different from what I
> usually do.
> (...) typically, the result of all of this is to produce a syntax tree.


  That may be more due to my lacking English skills than actual differences.
I've simplified a lot, and left out expression parsing in the description, to
get the big picture across. My parser and syntax tree are written in C++ and
nodes have a class hierarchy starting at a basic Node and progressing to
things like a function call (with an array of sub-nodes for parameters, and a
"name" string field) etc.


  My expression parser is kinda fun, because it essentially walks the subtree
of the expression currently being parsed, looking at the precedence of each
operator (a simple numeric value), and the new operator node then either
inserts itself as the next param of the deepest, rightmost node, or inserts
itself in the middle of the tree somewhere, taking what was previously in its
slot as its left argument. So precedence actually gets worked out while
building the tree fairly easily, with a loop in one function call, not several
functions for different precedence levels.


  That reminds me, I should check whether I've implemented right-associative
operators yet. They're fairly easy though, just a slight flip of the criterion
for when something gets appended or inserted.
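
  In code, the insertion step comes out roughly like this (a sketch with
made-up node classes, not my actual parser):

    // Insert a freshly-parsed binary operator into the expression tree by
    // precedence: walk the rightmost spine, descend past looser-binding
    // operators, then splice in, taking over the old slot as the left operand.
    struct Node {
        virtual ~Node() {}
    };

    struct OperatorNode : Node {
        int   precedence;        // higher binds tighter, e.g. '*' > '+'
        bool  rightAssociative;  // flips the comparison for equal precedence
        Node *left;
        Node *right;
        OperatorNode(int prec, bool rightAssoc)
            : precedence(prec), rightAssociative(rightAssoc), left(NULL), right(NULL) {}
    };

    Node *InsertOperator(Node *root, OperatorNode *newOp) {
        OperatorNode *parent = NULL;
        Node *curr = root;
        while (OperatorNode *op = dynamic_cast<OperatorNode *>(curr)) {
            bool descend = op->precedence < newOp->precedence
                || (op->precedence == newOp->precedence && newOp->rightAssociative);
            if (!descend)
                break;
            parent = op;
            curr = op->right;
        }
        newOp->left = curr;             // take what was previously in this slot as the left argument
        if (parent)
            parent->right = newOp;      // inserted somewhere in the middle of the tree
        else
            root = newOp;               // or became the new root
        return root;                    // the caller fills newOp->right with the next operand
    }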


> C is a little uglier here, since it requires keeping track of typedefs
> and checking for typedef when trying to parse declarations.


  My scripting language doesn't really have declarations, but it has something
similar in that any identifier can be used as a one-word string constant in
many places, and there aren't really any reserved identifiers. So originally I
would check whether a variable of that name was already in use and generate a
variable reference if so; otherwise I treated it as an unquoted string
constant.


  Later, I realized that the former is sufficient if I pre-fill the variable
with its own name as a string. That way, my "do" command (think eval(): it
executes a string as code, and has access to the current function's local
variables etc.) can contain a command that turns what I thought was a constant
into a variable. The same thing can happen in loop conditions, where the first
time round nothing has been put into the variable yet, so it's treated as a
string, then inside the loop it gets turned into a variable.


  Fun little details of programming languages :-)
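
  A minimal sketch of that pre-filling trick (names made up):

    // Locals are created on first mention, pre-filled with their own name, so
    // an identifier behaves like an unquoted string constant until assigned to.
    #include <map>
    #include <string>
    #include <utility>

    typedef std::map<std::string, std::string> LocalVariables;  // name -> value (strings only, for brevity)

    std::string &LookUpVariable(LocalVariables &locals, const std::string &name) {
        std::map<std::string, std::string>::iterator found = locals.find(name);
        if (found == locals.end())
            found = locals.insert(std::make_pair(name, name)).first;  // pre-fill with its own name
        return found->second;
    }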


> partly agreed, except that it isn't really strictly necessary to build
> an array up-front.
>
> most of my parsers don't bother using an array, but instead just call
> the tokenizer function directly to peek and parse tokens based on the
> current "stream position" (generally a raw char pointer in my language
> parsers).
>
> the version used in my C compiler currently uses a small hash table
> (sort of, the "hash function" is simply to keep only the low-order bits
> of the address) to avoid re-reading the same token multiple times.
>
> practically though, it may be both higher performance and generally
> cleaner to just build a token array up-front.


  Yeah, my main reason was that my language, Hammer, due to its verbose,
English-like syntax, can require a lot of look-ahead. So it just made the code
much simpler, and much, much more maintainable. Also, it's very convenient
while debugging to have a struct with everything we know about the current
token, including, if you want, its line number, which makes it easier to find
in the source code than a raw pointer or global offset.


> I had considered using a table in the C compiler, but the "hash" option
> was less effort (the parser was already written in this case, so the
> hash could be added via a quick hack, rather than by making significant
> changes to the rest of the parser to use a token array instead of
> direct function calls).


  I actually have myToken = GoNextToken(), myToken = GoPrevToken() convenience
calls around my token array.


> a potential advantage of not using an array though is not needing to
> allocate memory for it.


  Definitely, though the overhead is small if you reference your code instead
of storing copies of strings, and stick to small numeric values for the rest
of the ivars, too. Depending on average token length, it comes out to about
the same size as your source script, or slightly larger allowing for lots of
operators etc.
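
  For what it's worth, the token layout and those convenience calls look
roughly like this (field names illustrative only):

    // Tokens reference the source text by offset/length instead of copying it,
    // and GoNextToken()/GoPrevToken() just move an index into the token array.
    #include <cstddef>
    #include <string>
    #include <vector>

    enum TokenType { kIdentifier, kNumber, kStringLiteral, kOperator, kEndOfInput };

    struct Token {
        TokenType type;
        size_t    offset;      // start of the token's text in the source string
        size_t    length;      // ... so no copy of the text needs to be stored
        size_t    lineNumber;  // handy while debugging and for error messages
    };

    struct Tokenizer {
        std::string        source;
        std::vector<Token> tokens;   // assumed non-empty, ending in a kEndOfInput token
        size_t             current;

        Tokenizer() : current(0) {}

        const Token &GoNextToken() { if (current + 1 < tokens.size()) ++current; return tokens[current]; }
        const Token &GoPrevToken() { if (current > 0) --current; return tokens[current]; }

        // The actual text is recovered on demand from the source.
        std::string TextOf(const Token &t) const { return source.substr(t.offset, t.length); }
    };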


>> - Syntax tree: Build a tree structure that represents the parsed program. You
>> can then do things like evaluate constant nodes ahead of time (e.g. turn 5+4
>> into 9, but leave 5 + n as an addition node with a 5 and a reference to
>> variable 'n') by replacing groups of nodes with simpler nodes recursively.
>> Keep around the line number/first token offset in each node so you can report
>> errors.
>
> agreed.
>
> in my case, I have generally used either "lists" built from "cons cells"
> for this purpose, or in other cases a variant of XML DOM nodes.


  So I guess you're building a more dynamic structure, not unlike a
dictionary/hash map? My low-level programming instincts kinda yell at me that
this is inefficient, but honestly, the parse tree is probably the last place
where one needs to worry about that. At least until one actually finds a
situation where one runs out of memory or finds it to be a performance
bottleneck. Premature optimization, root of all evil, yadda, yadda.


> I had actually thought this was fairly standard practice, until looking
> into it and observing that, no, apparently most people store their
> parse-trees in raw structures. either way I guess.


  Yeah, I use C++ classes, but then it could be argued that C++ doesn't really
have *real* classes, so I guess it's just a glorified struct as well. I have
some virtual methods, which make walking the tree as easy as kicking off a
recursive function call, but apart from that, they're very struct-y.


> typically, my parsers do almost no direct processing on the ASTs while
> building them, apart from building a semantic representation of the code
> as-seen.


  I certainly don't simplify nodes *while* building the tree. I actually
intentionally start out with the stupidest representation that is still
correct (i.e. I pick where to insert expression nodes so evaluating the tree
gives the right result, but that's it). That way, I can see parser errors
easily and early on. I can debug-dump my tree with a recursive function call.


  And *then* I call Simplify(). Which does a recursive depth-first traversal
and does simple optimizations. So each type of node only needs to know how to
simplify itself. E.g. operators know how to check whether their operands are
constant (*after* calling Simplify() on them), and can then replace themselves
with their result if both are.
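
  In sketch form (class names invented, ownership handling glossed over):

    // Depth-first simplification: operands get simplified first, and a node
    // whose operands both turned out constant replaces itself with the result.
    struct Node {
        virtual ~Node() {}
        virtual Node *Simplify() { return this; }   // default: nothing to do
    };

    struct NumberNode : Node {
        double value;
        explicit NumberNode(double v) : value(v) {}
    };

    struct AddNode : Node {
        Node *left;
        Node *right;
        AddNode(Node *l, Node *r) : left(l), right(r) {}

        virtual Node *Simplify() {
            left  = left->Simplify();
            right = right->Simplify();
            NumberNode *l = dynamic_cast<NumberNode *>(left);
            NumberNode *r = dynamic_cast<NumberNode *>(right);
            if (l && r) {                            // both constant: fold
                Node *folded = new NumberNode(l->value + r->value);
                delete this;                         // (leaks the operands; fine for a sketch)
                return folded;
            }
            return this;
        }
    };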


> typically, things like simplifying expressions, propagating constants,
> ... is done in a stage I have typically called "reduction", which
> happens between parsing and prior to producing (bytecode) output.


  Yes, that sounds about what I do. I just used a dumber word :-)


> at this point, things vary go differently, and several more stages are
> involved in my case:
>
> - typically, after the AST has been "reduced", it will be transformed
> into bytecode (basically, walk the AST, figure out expression types when
> relevant, and spit out any relevant bytecode ops).


  Yup. I recursively call GenerateCode(), passing a CodeBlock object into which
the raw instructions get written.
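
  In outline, that looks something like this (instruction names invented):

    // Each node appends code for its children, then its own operation, into a
    // CodeBlock of fixed-size instructions.
    #include <vector>

    enum Opcode { kPushConstant, kPushVariable, kAdd, kCallHandler };

    struct Instruction {
        Opcode opcode;
        int    param1;
        int    param2;
    };

    struct CodeBlock {
        std::vector<Instruction> instructions;
        void Append(Opcode op, int p1 = 0, int p2 = 0) {
            Instruction ins = { op, p1, p2 };
            instructions.push_back(ins);
        }
    };

    struct Node {
        virtual ~Node() {}
        virtual void GenerateCode(CodeBlock &block) const = 0;
    };

    struct AddNode : Node {
        Node *left;
        Node *right;
        AddNode(Node *l, Node *r) : left(l), right(r) {}
        virtual void GenerateCode(CodeBlock &block) const {
            left->GenerateCode(block);   // operands end up on the VM stack...
            right->GenerateCode(block);
            block.Append(kAdd);          // ...and kAdd pops two, pushes the sum
        }
    };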


> previously, only a very small amount of type-analysis was done here, but
> I am looking to expand this (partly to simplify the work being done in
> the interpreter back-end), such that the output produced here will be
> "mostly statically typed".


  I don't have that currently, but in an earlier approach (where my code
actually compiled to C++) I had a "nail down types" phase. What it essentially
did was ask up the hierarchy which types each expression could have with its
current parameters (I have about 5 types, so it was just a bit field), and
then pick the narrowest type that fit. Often it turned out that somewhere at
the end there was a +1 or so, which meant I ended up with the widest numeric
type. Or there was concatenation involved and I knew the end result would be a
string. But of course, often it was just a variant as well.
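
  Very roughly, something like this (the five types here are examples, not
Hammer's exact set):

    // Each expression reports the set of types it might produce as a bit
    // field; NarrowestType() picks the tightest single type that covers it.
    enum {
        kTypeBoolean = 1 << 0,
        kTypeInteger = 1 << 1,
        kTypeNumber  = 1 << 2,   // floating point
        kTypeString  = 1 << 3,
        kTypeVariant = 1 << 4    // could be anything
    };

    int NarrowestType(int possibleTypes) {
        if (possibleTypes == kTypeBoolean)
            return kTypeBoolean;
        if (possibleTypes == kTypeInteger)
            return kTypeInteger;
        if ((possibleTypes & ~(kTypeInteger | kTypeNumber)) == 0)
            return kTypeNumber;  // e.g. a "+ 1" somewhere pushed it to the widest numeric type
        if (possibleTypes == kTypeString)
            return kTypeString;  // e.g. concatenation guarantees a string result
        return kTypeVariant;     // give up and stay dynamic
    }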


> I am looking at going to producing bytecode output more similar to
> Java-ByteCode, basically where many opcodes are qualified for specific
> types, rather than leaving it solely to the back-end to figure out what
> the types are. this is a little lame, given I found it elegant how, for
> example, MSIL/CIL had just used a few simple operations, and the
> back-end would do its thing, but some things are currently more awkward
> here, and at-present this leaves some awkward holes in the type-system.


  Yeah. In an early version, I just chickened out and always picked the widest
type. I.e. the + operator just always did floating point maths. These days, I
check the operand types at runtime and fall back on integer maths for stuff
like addition, subtraction and multiplication when possible.
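
  The run-time check boils down to something like this (a cut-down,
hypothetical Variant):

    // '+' stays in integer maths when both operands are integers, and falls
    // back to floating point otherwise.
    struct Variant {
        enum Type { kInt, kDouble } type;
        long long intValue;
        double    doubleValue;
    };

    Variant Add(const Variant &a, const Variant &b) {
        Variant result;
        if (a.type == Variant::kInt && b.type == Variant::kInt) {
            result.type = Variant::kInt;
            result.intValue = a.intValue + b.intValue;
        } else {
            double l = (a.type == Variant::kInt) ? (double)a.intValue : a.doubleValue;
            double r = (b.type == Variant::kInt) ? (double)b.intValue : b.doubleValue;
            result.type = Variant::kDouble;
            result.doubleValue = l + r;
        }
        return result;
    }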


> it all still leaves a bit of uncertainty though (as to whether
> adding/using piles of type-annotated opcodes is really "the right
> direction to be going").


  It might give you better performance, particularly once you go JIT (if you
do). And it saves you a bunch of branch instructions at runtime, which can be
a major slowdown on modern CPUs.


> - typically (assuming the interpreter), the bytecode will be processed
> and then transformed into threaded code, producing a threaded-code
> version of the program. there are some parts which use generated native
> code thunks, but for the most part, pretty much all of the machinery
> takes place in native C functions. as-is it proves problematic to do any
> non-trivial scope and type-analysis at this stage, short of making the
> threaded-code logic contain most of the "hard parts" of a full native
> code-generator (as well as likely making it considerably slower).


  "threaded-code"? *looks it up on Wikipedia* Ah, so that's what it's called
when you just spit out a bunch of CALL instructions... Good to know :-) I
build a bytecode, with an index into an instruction array and two param
fields. I thought about generating CALL instructions instead, but it's easier
to debug this way, and subroutines in my language all have a variable number
of parameters, so I thought the advantage of avoiding one function pointer
dereference probably won't gain me much. And I'm still in the "get this dang
thing working" phase, after all.


> some past experiments (special purpose tests), have implied that
> potentially threaded code can pull off "reasonably solid" interpreter
> performance (potentially within 3-5x of native).
>
> however, at present it is considerably slower (at least with tests like
> the recursive Fibonacci sequence and bubble-sort, with most "practical"
> code the slowdown seems to be smaller, but typically because most of the
> actual logic takes place in native code in these cases).


  Yeah. My experiences are similar. Hammer is what the developer of another
HyperTalk-descended language called a "Very High Level Language" and is more
of the CISC than the RISC mindset, so the number of bytecode instructions is
small compared to the number of native instructions making up the functions
behind each bytecode op.


> - if a JIT / native code-generator is being used, the process works like
> this:
> complex operations are generally decomposed into simpler ones, and much
> of the type information from the prior stage is discarded (the
> code-generator uses its own type-analysis, unlike the "dumb"
> interpreter, 1);


  That might actually make sense, considering all the temporaries, registers
and whatever that native code "adds" to the original source code. Will have to
keep in mind if I'm ever crazy enough to do that JIT :-)


> - COFF modules are fed into an in-program linker, which proceeds to link
> them against the current program image. there is some amount of trickery
> here, involving potentially both invoking code-generation handlers, as
> well as mechanisms to determine whether or not to "proxy" function calls
> (IOW: whether to link a function call directly against the target
> address, or use an indirect jump through another pointer).
>
> reasons a proxy may be used: the call is out-of-range (on x86-64, this
> usually means the call has gone outside the +-2GB window), or, either
> the code has been "hot patched" or the linker has reason to suspect that
> such "hot patching" is likely to occur (currently, this involves
> heuristics).


  Yeah, some of my early native-code tests also had a custom loader (seems to
be identical to your "in-program linker"). I essentially stored the offset to
the trampoline "external symbol table" in the file, and then fixed up the CALL
instructions to point to the symbols I looked up at runtime.
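
  The fix-up itself boils down to patching the rel32 field of each E8 CALL,
something like this (not my loader's actual code):

    // Point an x86/x86-64 relative CALL (opcode E8, rel32) at a symbol looked
    // up at run time; the displacement is relative to the end of the 5-byte
    // instruction.
    #include <cstring>
    #include <stdint.h>

    void PatchCall(uint8_t *callSite /* points at the E8 byte */, const void *target) {
        int64_t displacement = (int64_t)target - ((int64_t)callSite + 5);
        int32_t rel32 = (int32_t)displacement;   // only valid within +-2GB, as mentioned above
        std::memcpy(callSite + 1, &rel32, sizeof(rel32));
    }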


> the strategy used by the codegen backend for my C compiler involved
> "large numbers of special cases", but this ultimately proved a bit
> problematic to debug (this type of "optimization" turns out to be a
> debugging mess in most cases where one doesn't have fairly immediate
> feedback as to when/where/how something has broken).


  Yeah. At the least one would have to be very disciplined during code
generation and keep information on all the jump-in and jump-out points, so one
doesn't combine two instructions from different branches. But then I suppose
one would have to keep track of those anyway, so all offsets can be fixed up
if an instruction is removed.


> funny though, is even just using "straightforward" strategies, like
> caching variables in registers, ... and the code still manages to be a
> bit more optimized-seeming than what usually seems to end up being spit
> out by MSVC.


  Oooo... burrrrn ;-)


> I have not often been entirely impressed by what I have seen being spit
> out by MSVC, and my own code-generators have often been able to
> outperform it (and be competitive with GCC in my tests), however...:
> making my code generators "actually work" often proves a bit more difficult.


  TBH, it's easy to generate faster code if it doesn't have to be correct. :-p
Isn't that how SSE and OpenGL and all that stuff came about? They found some
trade-offs they could make for a particular problem domain that would make
things faster. Effectively, you're trading away a bit of correctness for
speed ;-)


> hence why I am still mostly using an interpreter-based strategy:
> it is much easier IME to maintain and debug the interpreter, whereas IME
> my native code generators have more often tended to blow up in my face
> than actually work reliably, and it is much more difficult than it would
> seem to dig through a big pile of assembler output looking for whatever
> little bug is causing the code not to work.


  Fully agreed. Hence the suggestion to generate bytecode first. But of course,
if you plan to make an actual native code compiler, pointers in that direction
don't necessarily hurt.


> only "I am using an interpreter" sounds a lot less impressive than "I am
> using a JIT compiler", although I remember at least one VM I know of
> which had claimed to be using a JIT, but the thing was using stupid
> threaded code (in pretty much the same way I am using it in my
> interpreter), so I was not terribly impressed...


  I would like to see a performance comparison between the two, say on ARM and
Intel. Does a function pointer de-ref and a loop really make that much
difference in practice? OK, Hammer has lots of slowdowns, and its main loop
isn't that simple, so there are a few exit flags to check each time through
the loop, and a call to update a source-level debugger, but hey...


> I guess there would always be a "cheap but lame" way to write a JIT
> (and, actually, how my first working JIT worked):
> just use fixed-form globs of ASM for each VM opcode, and simply append
> them together in-series (emitting prologues/epilogues and labels and
> so-on as-needed).


  Actually, I could imagine that, with a good optimizer, that is probably the
common way of doing things. And it's also easy to maintain.
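
  As a toy illustration of the glob-appending idea (x86-64 and POSIX mmap()
assumed, and obviously nobody's production JIT):

    // Concatenate a prologue, one machine-code glob per "opcode", and an
    // epilogue into executable memory, then call the result. Computes 40 + 2.
    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>
    #include <stdint.h>
    #include <vector>

    int main() {
        std::vector<uint8_t> buf;
        const uint8_t prologue[]  = { 0x55, 0x48, 0x89, 0xE5 };   // push rbp; mov rbp, rsp
        const uint8_t loadConst[] = { 0xB8, 40, 0, 0, 0 };        // mov eax, 40
        const uint8_t addConst[]  = { 0x05, 2, 0, 0, 0 };         // add eax, 2
        const uint8_t epilogue[]  = { 0x5D, 0xC3 };               // pop rbp; ret

        buf.insert(buf.end(), prologue,  prologue  + sizeof(prologue));
        buf.insert(buf.end(), loadConst, loadConst + sizeof(loadConst));
        buf.insert(buf.end(), addConst,  addConst  + sizeof(addConst));
        buf.insert(buf.end(), epilogue,  epilogue  + sizeof(epilogue));

        // copy into executable memory (some systems' W^X policies forbid this mapping)
        void *mem = mmap(NULL, buf.size(), PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
            return 1;
        std::memcpy(mem, &buf[0], buf.size());

        int (*fn)() = (int (*)())mem;
        std::printf("%d\n", fn());   // prints 42
        munmap(mem, buf.size());
        return 0;
    }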


Cheers,
-- Uli Kusterer
http://stacksmith.org

