Re: writing interpreters, was Good practical language and OS agnostic text?


Related articles
Good practical language and OS agnostic text? compilers@is-not-my.name (2012-04-17)
Re: Good practical language and OS agnostic text? ulimakesacompiler@googlemail.com (Uli Kusterer) (2012-04-21)
Re: Good practical language and OS agnostic text? cr88192@hotmail.com (BGB) (2012-04-21)
Re: writing interpreters, was Good practical language and OS agnostic ulimakesacompiler@googlemail.com (Uli Kusterer) (2012-04-22)
Re: writing interpreters, was Good practical language and OS agnostic cr88192@hotmail.com (BGB) (2012-04-22)
From: BGB <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Sun, 22 Apr 2012 12:29:27 -0700
Organization: albasani.net
References: 12-04-019 12-04-056 12-04-060 12-04-063
Keywords: code, design
Posted-Date: 22 Apr 2012 15:45:55 EDT

On 4/22/2012 3:53 AM, Uli Kusterer wrote:
> On 22.04.2012, at 03:58, BGB wrote:
>> as-is, the scripting language is probably at a vaguely similar
>> complexity level with C (in some ways it is simpler, and in others
>> likely more complex), as it gradually moves in the direction of being
>> "essentially" statically-typed, has pointers, class/instance OO, ...
>
> My programming language, Hammer, goes in the opposite direction. It's
> essentially dynamically typed scripting language (at least as far as the user
> is concerned). Every variable is a variant, but I keep values around in their
> native type. Also, the syntax is very English-like and the intention is that a
> user *can not* make it crash. Of course, in practice that may not hold true
> due to bugs, but it is different than C, where there are a lot of "user
> errors" that are well in their rights to cause a crash.


this is not to say that the language in my case is not dynamically-typed
WRT the code the user types; rather, the objective is to limit the
amount of dynamic type-checking used within the VM, as this can
hurt performance.


the language in my case is actually much closer to being an ECMAScript
variant, with some type annotations, and some basic level of
type-inference (it may attempt to "predict" the type of an untagged
variable at a given point in time, based on whatever was last assigned
into it).




as-is, in my VM, much of the code still ends up using dynamic
type-checking, and the goal is to shave this down somewhat (as well as
making what "is known" generally more effective).




>> (decided to leave out a bunch of stuff related to some of the hassles of
>> static type-checking, and dealing with the matter of having a scoping
>> model where the scope is both mutable and involves indirect visibility,
>> meaning that variable assignments can alter what bindings may be
>> potentially visible from a given piece of code, making determining all
>> this statically a relatively non-trivial problem).
>
> Yeah, that sounds more like it ;-)


yeah.


basically, I use a scoping model itself vaguely derived from the Self
language, rather than the (more traditional) "pure lexical scoping" model.


internally, packages are objects, and imports are variable declarations
(declaring a "delegate variable" which links to the imported package).


a problem though is that statically determining what all is
potentially visible is fairly difficult. at the moment, this largely
means that anything being pulled in from a package import uses dynamic
checking.


a "cheap trick" here in the language semantics is that the visibility
order for any binding not directly visible as part of the lexical scope
is not determined. this allows the VM to use what it can figure out in
preference to "falling back" to a generic dynamically-typed lookup.


besides this, caching via "lookup hashes" is also used (so, the pure
dynamic case isn't necessarily "that" bad).
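

for reference, the sort of "lookup hash" meant here is basically a
small direct-mapped cache sitting in front of the slow lookup; a rough
sketch (the names, sizes, and the assumption of interned,
pointer-compared name strings are all made up for illustration):
#include <stdint.h>

void *LookupBindingSlow(const char *name);  /* hypothetical full dynamic lookup */

#define LKUP_CACHE_SIZE 256

typedef struct {
    const char *name;       /* interned string, compared by address */
    void       *binding;    /* whatever the slow lookup resolved to  */
} LkupCacheEntry;

static LkupCacheEntry lkup_cache[LKUP_CACHE_SIZE];

void *LookupBinding(const char *name)
{
    /* "hash": just the low-order bits of the interned string's address */
    uintptr_t h = ((uintptr_t)name >> 4) & (LKUP_CACHE_SIZE - 1);
    if (lkup_cache[h].name == name)
        return lkup_cache[h].binding;       /* hit: skip the dynamic lookup */
    void *b = LookupBindingSlow(name);
    lkup_cache[h].name    = name;
    lkup_cache[h].binding = b;
    return b;
}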




>> now, how is any of this simpler than a C compiler?
>> at least at present my main target is an interpreter, and weak
>> performance is generally passable.
>
> The nice part about writing your own interpreter is that you don't have to
> mess as much with predefined platform specifics. You can define instructions
> the way you need them to make them easy to parse. Also, it's easy to write a
> debugger into your interpreter loop, and check out the contents of your stack
> and lots of other things.


pretty much.


one can also go and compile C code to run on an interpreter as well,
essentially allowing for a "portable C analogue". Quake3 actually did
something like this.




>> although I use recursive descent, the above sounds different from what I
>> usually do.
>> (...) typically, the result of all of this is to produce a syntax tree.
>
> That may be more due to my lacking English skills than actual differences.
> I've simplified a lot, and left out expression parsing in the description, to
> get the big picture across. My parser and syntax tree are written in C++ and
> nodes have a class hierarchy starting at a basic Node and progressing to
> things like a function call (with an array of sub-nodes for parameters, and a
> "name" string field) etc.


yep.


as noted before, I have used either lists or XML nodes here (either
using a Scheme-like AST structure, or a different AST structure loosely
derived from XML-RPC; see footnote 1 below).


in my case, my whole VM framework is pretty much plain C, so the
alternative would be using structs.


I am not sure what would be best overall, as each has tradeoffs.
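

for comparison, the struct-based alternative in plain C would look
something like this (a rough sketch; the node kinds and field names are
made up, not any layout actually in use):
/* sketch: a tagged-union AST node in plain C */
typedef enum { NODE_NUMBER, NODE_IDENT, NODE_BINOP, NODE_CALL } NodeKind;

typedef struct AstNode AstNode;
struct AstNode {
    NodeKind kind;
    int      line;      /* source line, for error reporting */
    union {
        struct { double value; }                           number;
        struct { const char *name; }                       ident;
        struct { int op; AstNode *lhs, *rhs; }             binop;
        struct { AstNode *func; AstNode **args; int n; }   call;
    } u;
};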




1: one of my interpreters had an origin story of roughly hacking
together a mock-up parser for a JavaScript style language on top of a
hacked-up version of XML-RPC (which I had previously implemented support
for). overall, this was probably one of the worst-performing
interpreters I had ever written.


I have little idea how slow it was, but do know this much:
at the time, all values were boxed heap objects;
at the time, checking the type of an object was O(n), essentially a
linear search over every object in the heap;
initially, it directly interpreted the XML nodes, but later switched to
using a 16-bit word-code (and using things like jumps for implementing
loops, rather than special instructions and blocks);
...


the one which directly followed (a different form of the same language)
was one of my fastest, but its use of precise GC and similar made it
very painful to work with.


then the third (partly recombining parts of the last two) was "sort of
in the middle". this one is the direct ancestor of the current
interpreter, which has since then mostly been undergoing incremental
changes.






> My expression parser is kinda fun, because it essentially walks the subtree
> of the expression currently parsed, looking at the precedence of each operator
> (a simple numeric value), and then either inserts itself as the next param of
> the deepest, rightmost node, or inserts itself in the middle of the tree
> somewhere, taking what was previously in its slot as its left argument. So
> precedence actually gets worked out during building of the tree fairly easily,
> with a loop in one function call, not several functions for different
> precedence levels.


interesting.


I had just used copy/paste/edit a bunch of times to make the various
precedence levels, so either way works I guess.


IIRC, at one point I had used naive left-associative arithmetic for all
operators (with the first horridly-slow VM), before noting that I could
copy/paste the function to get different levels.




> That reminds me, I should check whether I've implemented right-associative
> operators yet. They're fairly easy though, just a slight flip on the criteria
> when something gets appended or inserted.


yes, or one can use recursion for the right-node at the same precedence
level, for the "stack of functions" strategy.
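

for reference, one level of the "stack of functions" strategy, showing
the left-associative loop next to the right-associative recursion (a
sketch; Parser and the helper functions are hypothetical):
typedef struct Parser Parser;               /* hypothetical parser state */
AstNode *ParseMultiplicative(Parser *p);    /* next-higher precedence level */
int PeekToken(Parser *p);
int ReadToken(Parser *p);
AstNode *MakeBinop(int op, AstNode *lhs, AstNode *rhs);

/* left-associative level: loop, so "a - b - c" parses as "(a - b) - c" */
AstNode *ParseAdditive(Parser *p)
{
    AstNode *lhs = ParseMultiplicative(p);
    while (PeekToken(p) == '+' || PeekToken(p) == '-') {
        int op = ReadToken(p);
        lhs = MakeBinop(op, lhs, ParseMultiplicative(p));
    }
    return lhs;
}

/* right-associative level: recurse at the same level, so
   "a = b = c" parses as "a = (b = c)" */
AstNode *ParseAssign(Parser *p)
{
    AstNode *lhs = ParseAdditive(p);
    if (PeekToken(p) == '=') {
        ReadToken(p);
        lhs = MakeBinop('=', lhs, ParseAssign(p));
    }
    return lhs;
}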




>> C is a little uglier here, since it requires keeping track of typedefs
>> and checking for typedef when trying to parse declarations.
>
> My scripting language doesn't really have declarations, but it has something
> similar in that any identifier can be used as a one-word string constant in
> many places, and there aren't really any reserved identifiers. So originally I
> used to look if there was a variable of that name already in use, and then
> generated a variable reference, otherwise treated it as an unquoted string
> constant.
>
> Later, I realized that the former is sufficient, if I pre-fill the variable
> with its own name as a string. That way, my "do" command (think eval(), it
> executes a string as code, and has access to the current function's local
> variables etc.) can contain a command that uses what I think is a constant is
> turned into a variable. The same thing could happen in loop conditions, where
> the first time round nothing has been put into a variable yet, so it's treated
> as a string, then inside the loop it gets turned into a variable.
>
> Fun little details of programming languages :-)


I am not entirely sure I understand how this would work.




my language actually has two basic types of declarations:
via the "var" and "function" keywords;
via a different strategy (for using "C-like" declarations).


so, for example:
"var x;" or "var i:int;" or ...
"function foo(x, y) { ... }";
...


the alternative strategy relies on another detail:
<identifier> <identifier> ...


is not actually valid anywhere else in the syntax at statement level.
so, basically, if a construction like:
<identifier> <identifier>;
<modifier*> <identifier> <identifier>;
...


is seen, the parser can fairly safely assume it is a variable declaration.


actually, the rules are a little more complicated:
<modifier*> <type_expression> <identifier> ...


where a type-expression is a special expression which is also a
syntactically valid type, consisting of, potentially:
several prefix modifier tokens (*, &, ...);
a known type-name keyword, or an identifier;
several allowed suffixes ("[]", "[...]", "<...>", "...", ...).


so, for example, "*int[] arr;" would declare an array of
integer-pointers (well, as would "var arr:*int[];").
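

in parser terms this amounts to a bit of lookahead before committing to
"expression statement"; a rough sketch (the helper names are made up):
int IsModifierToken(int tok);               /* "static", "const", ...      */
int IsIdentToken(int tok);
int PeekTokenN(Parser *p, int n);           /* peek n tokens ahead         */
int SkipTypeExpression(Parser *p, int *i);  /* match "*int[]", "foo<...>", ... */

/* decide whether a statement starts with
   <modifier*> <type_expression> <identifier>, and if so treat it as a
   C-like declaration rather than an expression statement */
int LooksLikeDeclaration(Parser *p)
{
    int i = 0;
    while (IsModifierToken(PeekTokenN(p, i)))
        i++;
    if (!SkipTypeExpression(p, &i))
        return 0;
    return IsIdentToken(PeekTokenN(p, i));  /* the declared name */
}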


there is a little annoyance though, from a design POV: having two
declaration syntax forms is, well, clearly redundant.






>> partly agreed, except that it isn't really strictly necessary to build
>> an array up-front.
>>
>> most of my parsers don't bother using an array, but instead just call
>> the tokenizer function directly to peek and parse tokens based on the
>> current "stream position" (generally a raw char pointer in my language
>> parsers).
>>
>> the version used in my C compiler currently uses a small hash table
>> (sort of, the "hash function" is simply to keep only the low-order bits
>> of the address) to avoid re-reading the same token multiple times.
>>
>> practically though, it may be both higher performance and generally
>> cleaner to just build a token array up-front.
>
> Yeah, my main reason was that my language, Hammer, due to its verbose,
> english-like syntax, can require a lot of look-ahead. So it just made the code
> much simpler, and much, much more maintainable. Also, it's very convenient to
> have a struct with everything we know about the current token while debugging,
> including, if you want, its line number to more easily find it in the source
> code than using a pointer or global offset.


yeah...


if I want the line number, the typical strategy is to walk from the
start of the buffer to the current position and count lines.
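

which is about as simple as it sounds (a minimal sketch):
/* count newlines between the buffer start and the current position */
int GetLineNumber(const char *bufStart, const char *curPos)
{
    int line = 1;
    const char *s;
    for (s = bufStart; s < curPos; s++)
        if (*s == '\n')
            line++;
    return line;
}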


this didn't work for C though, due to preprocessing, so a different
strategy was used:
nearly every line of PP output is annotated with the current line-number
(though it could be potentially skipped for the case of consecutive line
numbers).


it is lame, but hell, it works...




>> I had considered using a table in the C compiler, but the "hash" option
>> was less effort (the parser was already written in this case, so the
>> hash could be added via a quick hack, rather than by making significant
>> changes to the rest of the parser to use an token array instead of
>> direct function calls).
>
> I actually have myToken = GoNextToken(), myToken = GoPrevToken() convenience
> calls around my token array.


yeah, that works.




>> a potential advantage of not using an array though is not needing to
>> allocate memory for it.
>
> Definitely, though the overhead is small if you reference your code instead
> of storing copies of strings, and stick to small numeric values for the rest
> of the ivars, too. Depending on average token length, it comes out to about
> the same size as your source script, or slightly larger allowing for lots of
> operators etc.


yes, but at a potential cost, depending on whether the language has,
and how it implements, things like unicode escapes.




<snip>


>>
>> agreed.
>>
>> in my case, I have generally used either "lists" built from "cons cells"
>> for this purpose, or in other cases a variant of XML DOM nodes.
>
> So I guess you're building a more dynamic structure, not unlike a
> dictionary/hash map? My low-level programming instincts kinda yell at me that
> this is inefficient, but honestly, the parse tree is probably the last place
> where one needs to worry about that. At least until one actually finds a
> situation where one runs out of memory or finds it to be a performance
> bottleneck. Premature optimization, root of all evil, yadda, yadda.


yeah.


it is more worrisome for the C parser because:
it parses chunks of code that are often measured in the MB range (like
what happens if "windows.h" or similar gets in the mix, esp the MSVC
version which results in about 8MB of output);
and because the C parser uses XML nodes.


but, oddly enough, this is apparently not the main source of slowness.
the tokenizer tends to bog down considerably more than any code having
much to do with XML, and the preprocessor eats up more time than
building the AST.


granted, the XML implementation is itself somewhat micro-optimized,
IIRC, with direct support for numeric-valued attributes, omission of
unnecessary features for this use-case (such as namespaces), ....


so, in all, it doesn't seem "too terrible".


however, granted, the parser may easily take 500ms to 2s or so to parse
a source module if large headers are involved, so it isn't free.


given that GCC is similarly slow, I am probably not too far off.
MSVC tears through code much more quickly though, making me suspect MSVC
has some sort of "scary-fast parser" going on.




>> I had actually thought this was fairly standard practice, until looking
>> into it and observing that, no, apparently most people store their
>> parse-trees in raw structures. either way I guess.
>
> Yeah, I use C++ classes, but then it could be argued that C++ doesn't really
> have *real* classes, so I guess it's just a glorified struct as well. I have
> some virtual methods, which make walking the tree as easy as kicking off a
> recursive function call, but apart from that, they're very struct-y.


yeah.




I don't think the present form of ASTs is all that bad though as, after
all, I can quickly dump them to or read them from files, and (at least
in concept) it allows people to more easily write independent code
compatible with the parser or compiler back-end.


not that it probably matters all that much though, more just a quirk of
my personal history it seems.




>> typically, my parsers do almost no direct processing on the ASTs while
>> building them, apart from building a semantic representation of the code
>> as-seen.
>
> I certainly don't simplify nodes *while* building the tree. I actually
> intentionally start out with the stupidest representation that is still
> correct (i.e. I pick where to insert expression nodes so evaluating the tree
> gives the right result, but that's it). That way, I can see parser errors
> easily and early on. I can debug-dump my tree with a recursive function call.


agreed.




> And *then* I call Simplify(). Which does a recursive depth-first traversal
> and does simple optimizations. So each type of node only needs to know how to
> simplify itself. E.g. operators know how to check their operands whether they
> are constant or not (*after* calling Simplify() on them), and can then replace
> themselves with their result if both are.


yeah, this sounds similar to how my "reduce" process works.
there is a little context though, used mostly for things like constant
propagation and similar.
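

the constant-folding part of such a "reduce"/"Simplify" pass looks
roughly like this (a sketch reusing the made-up node layout from
earlier; only the binary-operator case is shown, and MakeNumber is
hypothetical):
AstNode *MakeNumber(double value);

/* reduce operands first, then fold if both are constants */
AstNode *ReduceNode(AstNode *node)
{
    if (node->kind != NODE_BINOP)
        return node;
    node->u.binop.lhs = ReduceNode(node->u.binop.lhs);
    node->u.binop.rhs = ReduceNode(node->u.binop.rhs);
    if (node->u.binop.lhs->kind == NODE_NUMBER &&
        node->u.binop.rhs->kind == NODE_NUMBER)
    {
        double a = node->u.binop.lhs->u.number.value;
        double b = node->u.binop.rhs->u.number.value;
        switch (node->u.binop.op) {
        case '+': return MakeNumber(a + b);
        case '-': return MakeNumber(a - b);
        case '*': return MakeNumber(a * b);
        }
    }
    return node;
}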




>> typically, things like simplifying expressions, propagating constants,
>> ... is done in a stage I have typically called "reduction", which
>> happens between parsing and prior to producing (bytecode) output.
>
> Yes, that sounds about what I do. I just used a dumber word :-)


fair enough.




>> at this point, things vary go differently, and several more stages are
>> involved in my case:
>>
>> - typically, after the AST has been "reduced", it will be transformed
>> into bytecode (basically, walk the AST, figure out expression types when
>> relevant, and spit out any relevant bytecode ops).
>
> Yup. I recursively call GenerateCode(), passing a CodeBlock object into which
> the raw instructions get written.


sounds functionally equivalent, though terms like "Compile" and "Emit"
(ex: "CompileStatement" or "EmitOpcode") are used instead, and the
target structure is called a "Context".
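

so the general shape is roughly (only the
"CompileStatement"/"EmitOpcode"/"Context" names come from the above;
the fields and opcodes here are made up):
enum { OP_POP = 1 /* , ... */ };

typedef struct {
    short *ops;             /* emitted opcode buffer */
    int    nops, maxops;
} Context;

void GrowOpcodeBuffer(Context *ctx);                /* hypothetical resize */
void CompileExpression(Context *ctx, AstNode *node);

/* append an opcode to the Context's bytecode buffer */
void EmitOpcode(Context *ctx, int op)
{
    if (ctx->nops >= ctx->maxops)
        GrowOpcodeBuffer(ctx);
    ctx->ops[ctx->nops++] = (short)op;
}

/* an expression statement: compile the expression, discard its value */
void CompileStatement(Context *ctx, AstNode *node)
{
    CompileExpression(ctx, node);
    EmitOpcode(ctx, OP_POP);
}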




>> previously, only a very small amount of type-analysis was done here, but
>> I am looking to expand this (partly to simplify the work being done in
>> the interpreter back-end), such that the output produced here will be
>> "mostly statically typed".
>
> I don't have that currently, but what I did in an earlier approach (where my
> code actually compiled to C++), I had a "nail down types" phase. What it
> essentially did was ask up the hierarchy to find out which types each
> expression could have with its current parameters (I have about 5 types, so it
> was just a bit field), and then pick the narrowest type that fit. Often it
> turned out that somewhere at the end there was a +1 or so, which meant I ended
> up with the widest numeric type or so. Or there was concatenation involved and
> I knew the end result would be a string. But of course, often it was just a
> variant as well.


the current "real" type-system is considerably more complex.


as far as the VM itself is concerned, a simpler type-model is used:
ILFDV (Int, Long, Float, Double, Variant).


most other types map to Variant, which is (partly) subdivided:
FN: "Fixnum", actually means "any integer types represented as a variant
and within the int32 range" (originally, it did mean, specifically,
"fixnum" in prior VMs, but was relaxed to "int32 range", which may mean
either an actual fixnum or a "BVT int32");
FL: "Flonum", which represents flonum and potentially "BVT float32";
BVT: "Boxed Value Type" (should be mostly self-explanatory), these
generally represent most value types which don't fit into a fixnum.
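

just to make the fixnum-vs-boxed split concrete, here is one common way
such a variant can be represented (purely illustrative; this is not the
VM's actual tagging scheme, which isn't described here):
#include <stdint.h>

/* tag the low bits of a pointer-sized value so small integers
   ("fixnums") avoid heap allocation; everything else is a pointer
   to a boxed value ("BVT") */
typedef void *Variant;

#define TAG_MASK    3u
#define TAG_FIXNUM  1u      /* low bits 01 => inline int32      */
#define TAG_BOXED   0u      /* low bits 00 => pointer to a box  */

static int VariantIsFixnum(Variant v)
{
    return ((uintptr_t)v & TAG_MASK) == TAG_FIXNUM;
}

static int32_t VariantToInt32(Variant v)
{
    return (int32_t)(uint32_t)((uintptr_t)v >> 2);
}

static Variant Int32ToVariant(int32_t i)
{
    return (Variant)((((uintptr_t)(uint32_t)i) << 2) | TAG_FIXNUM);
}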




currently (in the interpreter), the ILFD types are themselves BVT's,
mostly for sake of simplicity.


I have also recently been working on "optimizing" the handling of BVTs
somewhat, mostly so that they will be faster and less prone to leaking
memory (and making the ILFD types use dedicated free-lists rather than
general-purpose memory-allocation functions).
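

the free-list part is the usual trick (a sketch; the actual box layout
isn't shown here, and error handling is omitted):
#include <stdint.h>
#include <stdlib.h>

/* dedicated free-list for boxed int32 values, so the common case
   avoids the general-purpose allocator (no GC interaction shown) */
typedef struct BoxedInt32 BoxedInt32;
struct BoxedInt32 { BoxedInt32 *next; int32_t value; };

static BoxedInt32 *boxedint32_freelist;

BoxedInt32 *AllocBoxedInt32(int32_t value)
{
    BoxedInt32 *p = boxedint32_freelist;
    if (p) boxedint32_freelist = p->next;   /* reuse a freed box      */
    else   p = malloc(sizeof(*p));          /* else fall back to malloc */
    p->value = value;
    return p;
}

void FreeBoxedInt32(BoxedInt32 *p)
{
    p->next = boxedint32_freelist;          /* push back for reuse */
    boxedint32_freelist = p;
}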


granted, some of this new interpreter logic consists almost more of
macros than it does of actual code (mostly because of large numbers of
very repetitive operation-handler functions).
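

the macro-heavy style amounts to something like this (illustrative; the
Frame/PopI/PushI helpers are made up):
#include <stdint.h>

typedef struct Frame Frame;             /* hypothetical interpreter frame */
int32_t PopI(Frame *frm);
void    PushI(Frame *frm, int32_t v);

/* stamp out a family of near-identical typed arithmetic handlers */
#define DEF_BINOP_XI(name, op)              \
    static void Op_##name##_XI(Frame *frm)  \
    {                                       \
        int32_t b = PopI(frm);              \
        int32_t a = PopI(frm);              \
        PushI(frm, a op b);                 \
    }

DEF_BINOP_XI(ADD, +)
DEF_BINOP_XI(SUB, -)
DEF_BINOP_XI(MUL, *)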




>> I am looking at going to producing bytecode output more similar to
>> Java-ByteCode, basically where many opcodes are qualified for specific
>> types, rather than leaving it solely to the back-end to figure out what
>> the types are. this is a little lame, given I found it elegant how, for
>> example, MSIL/CIL had just used a few simple operations, and the
>> back-end would do its thing, but some things are currently more awkward
>> here, and at-present this leaves some awkward holes in the type-system.
>
> Yeah. In an early version, I just chickened out and always picked the widest
> type. I.e. the + operator just always did floating point maths. These days, I
> check the operand types at runtime and fall back on integer maths for stuff
> like addition, subtraction and multiplication when possible.
>


mostly, in my case, it was lots of cases where the interpreter just
ends up using "variant" for everything, rather than, say, producing
threaded code specialized for 32-bit integer values or whatever else.


this means a big pile of new opcodes though, most with names like:
"ADD_XI" for "add fixed integer" and "MUL_XL_C" for "multiply fixed long
by a constant" and similar.


I don't really like this, as the interpreter "should" be smart enough to
figure all this stuff out on its own, but at present, it is not.




my native code-generator back-ends could figure this stuff out, but
granted, they had considerably more complex logic for things like
working with the type-system.


when I tried before to write an interpreter which did all of this, I
soon realized that it was working out to nearly the same functional
complexity as a native code-generator, but with worse performance,
largely defeating the point.




>> it all still leaves a bit of uncertainty though (as to whether
>> adding/using piles of type-annotated opcodes is really "the right
>> direction to be going").
>
> It might give you better performance, particularly once you go JIT (if you
> do). And it saves you a bunch of branching instructions during runtime, which
> are a major slowdown these days on modern CPUs.


well, or at least, it allows the possibility of a "better yet simpler"
JIT, since the types will already be mostly known, meaning that the JIT
won't need to be able to figure out all of this stuff to, errm, actually
work...


theoretically, with threaded-code, the run-time checks could be partly
skipped already simply by inferring the type and then choosing the
handler with the correct type, and then keeping track of type-flow and
similar when building the threaded code.


however, I have not actually generally done this, due to the relative
amount of added complexity involved.


so, although less elegant, annotating many opcodes with types could be a
lot more generally workable.


for most other cases (those that are not arithmetic ops), the annotations are
more likely to be in the form of "signatures" (already commonly used,
these basically encode type-information for a variable/argument list/...
into the form of a string).




>> - typically (assuming the interpreter), the bytecode will be processed
>> and then transformed into threaded code, producing a threaded-code
>> version of the program. there are some parts which use generated native
>> code thunks, but for the most part, pretty much all of the machinery
>> takes place in native C functions. as-is it proves problematic to do any
>> non-trivial scope and type-analysis at this stage, short of making the
>> threaded-code logic contain most of the "hard parts" of a full native
>> code-generator (as well as likely making it considerably slower).
>
> "threaded-code"? *looks it up on Wikipedia* Ah, so that's what it's called
> when you just spit out a bunch of CALL instructions... Good to know :-) I
> build a bytecode, with an index into an instruction array and two param
> fields. I thought about generating CALL instructions instead, but it's easier
> to debug this way, and subroutines in my language all have a variable number
> of parameters, so I thought the advantage of avoiding one function pointer
> dereference probably won't gain me much. And I'm still in the "get this dang
> thing working" phase, after all.




well, there is bytecode; the interpreter just converts it to threaded
code prior to actual execution (and reuses the threaded code every time
the function is called).


the actual "working logic" is, in this case, actually contained in a
mass of C functions.
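

concretely, "threaded code" here just means an array of records holding
C function pointers plus operands, walked by a trivial loop; a minimal
sketch (the record layout and helper names are made up):
#include <stdint.h>

typedef struct Frame      Frame;        /* hypothetical interpreter frame */
typedef struct ThreadedOp ThreadedOp;

void PushI(Frame *frm, int32_t v);      /* hypothetical stack helper */

/* each op holds a handler pointer plus operands; the handler does the
   work and returns the next op to run (op+1, or a branch target) */
struct ThreadedOp {
    ThreadedOp *(*fcn)(Frame *frm, ThreadedOp *op);
    int32_t     i;          /* immediate value / local index / ... */
    ThreadedOp *jump;       /* branch target, if any */
};

static ThreadedOp *Op_PushConstI(Frame *frm, ThreadedOp *op)
{
    PushI(frm, op->i);
    return op + 1;          /* fall through to the next op */
}

void RunThreaded(Frame *frm, ThreadedOp *entry)
{
    ThreadedOp *op = entry;
    while (op)
        op = op->fcn(frm, op);  /* the dispatch "loop" */
}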


I had looked it up, and concluded at the time that "threaded code" is
still technically an interpreter.




>> some past experiments (special purpose tests), have implied that
>> potentially threaded code can pull off "reasonably solid" interpreter
>> performance (potentially within 3-5x of native).
>>
>> however, at present it is considerably slower (at least with tests like
>> the recursive Fibonacci sequence and bubble-sort, with most "practical"
>> code the slowdown seems to be smaller, but typically because most of the
>> actual logic takes place in native code in these cases).
>
> Yeah. My experiences are similar. Hammer is what the developer of another
> HyperTalk-descended language called a "Very High Level Language" and is more
> of the CISC than the RISC mindset, so the number of actual bytecode
> instructions is comparatively small compared to the amount of actual
> instructions the functions behind each bytecode are made up of.


yeah.


my VM currently has a "huge" number of opcodes (several hundred),
although due to there being a number of (sometimes large) holes in the
opcode space, they are currently numbered up to 860.


previously (before adding a bunch of type-annotated arithmetic opcodes)
the limit was 540. I tried to put them all in one such "hole", but they
didn't fit, and I didn't want to break them up and spread them out.


so, I "relocated" the whole range up to a space starting at 768 (next
even multiple of 256), leading to the current range.


probably a large number of opcodes are due to "compound operations",
which perform several smaller operations in sequence. most of these
were, in turn, because a single larger piece of C code to do something
was faster than making more passes through the dispatch loop to invoke
several smaller operations.


or: a single "lpostinc" opcode is faster than "lload; dup; push_1; add;
lstore".


so, it is a tradeoff...


better performance at the cost of 100s of opcodes (at present, 459).




>> - if a JIT / native code-generator is being used, the process works like
>> this:
>> complex operations are generally decomposed into simpler ones, and much
>> of the type information from the prior stage is discarded (the
>> code-generator uses its own type-analysis, unlike the "dumb"
>> interpreter, 1);
>
> That might actually make sense, considering all the temporaries, registers
> and whatever that native code "adds" to the original source code. Will have to
> keep in mind if I'm ever crazy enough to do that JIT :-)
>


yes.


things like temporaries, register allocation, ... all serve to largely
nullify most of the gains from these complex opcodes, as the
code-generator will figure most of this out again on its own.


although it seems pointless to build compound ops merely to break them
down again in the JIT, the alternative would be generating different
bytecode depending on whether the interpreter or the JIT was the
target; and decomposing the opcodes in the JIT, although slightly more
bulky, is not particularly complicated (vs pretty much everything else
going on in the thing...).




it is much like the "complexity" of a C style syntax for a programming
language:
if a person is writing a compiler or VM for a language that is much more
than a toy, the added complexity of parsing a C-like syntax is probably
the least of the worries.




>> - COFF modules are fed into an in-program linker, which proceeds to link
>> them against the current program image. there is some amount of trickery
>> here, involving potentially both invoking code-generation handlers, as
>> well as mechanisms to determine whether or not to "proxy" function calls
>> (IOW: whether to link a function call directly against the target
>> address, or use an indirect jump through another pointer).
>>
>> reasons a proxy may be used: the call is out-of-range (on x86-64, this
>> usually means the call has gone outside the +-2GB window), or, either
>> the code has been "hot patched" or the linker has reason to suspect that
>> such "hot patching" is likely to occur (currently, this involves
>> heuristics).
>
> Yeah, some of my early native-code tests also had a custom loader (seems to
> be identical to your "in-program linker"). I essentially stored the offset to
> the trampoline "external symbol table" in the file, and then fixed up the CALL
> instructions to point to the symbols I looked up at runtime.


pretty much.


the linker is basically just a large table for symbols and relocations,
and a special-purpose memory manager (for allocating memory for those
pesky code/data/bss sections).


originally, it would link things in aggressively (get handed object,
link it in directly), but I later switched to a "lazy" 2-stage strategy
(where an object is not fully linked until one of its exported symbols
is referenced) mostly to fix some nasty problems which existed
previously (linking being sensitive to object ordering, ...).




the modules used here are basically raw COFF objects though.


as-needed, the linker will also defer to OS-specific functions
("dlsym()" and "GetProcAddress()"), in order to link against DLL exports.




>> the strategy used by the codegen backend for my C compiler involved
>> "large numbers of special cases", but this ultimately proved a bit
>> problematic to debug (this type of "optimization" turns out to be a
>> debugging mess in most cases where one doesn't have fairly immediate
>> feedback as to when/where/how something has broken).
>
> Yeah. At the least one would have to be very disciplined during code
> generation and keep information on all the jump-in and jump-out points, so one
> doesn't combine two instructions from different branches. But then I suppose
> one would have to keep track of those anyway, so all offsets can be fixed up
> if an instruction is removed.


well, potentially, except I tend to produce ASM code as a giant string
to be fed into the assembler.


the assembler is, in fact, plenty fast IMO, and I speculate that by the
time one is trying to feed 10s of MB per second into the assembler, and
having this be a bottleneck, something has likely gone horribly wrong...


at first I had tried using direct code-generation, and later
function-call driven code generation, but soon found textual ASM to be
most convenient to work with, so I generally used this.




elsewhere in the VM, there are a number of cases where code generation
follows a pattern like:
Begin()
...
Print("ASM stuff...")
...
End()
...
fptr=GetAddr("foo")
...
fptr();




>> funny though, is even just using "straightforward" strategies, like
>> caching variables in registers, ... and the code still manages to be a
>> bit more optimized-seeming than what usually seems to end up being spit
>> out by MSVC.
>
> Oooo... burrrrn ;-)


I think the issue is that MSVC often seems to spit out rather
naive-looking code.


the bigger irony though, is that as naive looking as it often is, it
still often manages to perform reasonably well.




like, it often seems to have a hard time with things like, keeping its
variables stored in registers, as it will very often spill the contents
of a register in cases where it is not necessary, often only to load the
variable again into a different register, and often not even in obvious
cases (such as spilling the register during a jump such that all of the
jump points have the value in the same place).


I am not really sure what is going on there sometimes...




that or, maybe it is just the profiler tending to get a lot more hits on
heavily used pieces of code where the compiler decided to do something
wacky.


this is combined sometimes though with the tendency of the compiler to,
every once in a great while, actually crash during a build, usually
then bringing up a "special" send-error-report box (to the effect of
"Microsoft is especially interested in collecting these error
reports"...). maybe so; a crash in the compiler is hardly a good
sign...




OTOH, GCC tends to produce rather optimized-looking spaghetti-code.
the code at least looks efficient and performs well, but, how does it
work?... sometimes it is not so obvious.




>> I have not often been entirely impressed by what I have seen being spit
>> out by MSVC, and my own code-generators have often been able to
>> outperform it (and be competitive with GCC in my tests), however...:
>> making my code generators "actually work" often proves a bit more
> difficult.
>
> TBH, it's easy to generate faster code if it doesn't have to be correct. :-p
> Isn't that how SSE and OpenGL and all that stuff came about? They found some
> trade-offs they could make for a particular problem domain that would make it
> faster. Effectively, you're making a trade-off for correctness ;-)


probably...


when I was writing the code generator, I used an optimization process like:
look at output;
identify any obvious inefficiencies;
figure out the logic which was involved in producing the code;
write a special case to detect the specific situation and emit an
alternate faster code sequence instead;
...


it then performed comparably with the other compilers, but lots of code
tended to break in non-obvious ways.




I later largely stopped using this code generator as:
I couldn't debug it sufficiently to make the output actually work reliably;
most of the code it contains is actually pretty horrid (whole thing is
basically bit-twiddly and special cases);
never mind its use of a multi-pass "magic bits" system, where the
compiler basically used multiple passes which operated (mostly) in
lock-step, and communicated primarily by using a synchronized byte-array
which was typically used to receive bit-flags from prior passes or
pass bit-flags to the next pass;
...


"magic bits" would encode things like which registers to try using (and
which have already been tried), warnings about things like impending
register spills or running out of available registers, ... (in effect,
optimization by using multiple passes and a
system analogous to "road signs"). whenever the logic ran into a problem
case, it would flag a warning about it ("do not use this register here",
"insert stack padding here", "try to perform this operation without
using any additional registers", ...).


the goal would then be for the system to, between 2 and N passes
(usually with a limit of 16 or so), "converge" on an "optimal"/"stable"
output, followed by a "final" pass to actually emit the final-state code.


typically, then, the overall process looks like:
set up initial state;
while(!OutputIsStable())
      Send Begin
      Send all opcodes for function
      Send End
Flatten final output to an ASM string.
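

as a sketch in C (only "OutputIsStable" and the pass limit of 16 come
from the description above; the other names are made up):
typedef struct CodeGen    CodeGen;      /* hypothetical codegen state   */
typedef struct BytecodeFn BytecodeFn;   /* hypothetical per-function IR */

void SetupInitialState(CodeGen *cg, BytecodeFn *fn);
int  OutputIsStable(CodeGen *cg);
void SendBegin(CodeGen *cg);
void SendOpcodes(CodeGen *cg, BytecodeFn *fn);
void SendEnd(CodeGen *cg);
void FlattenToAsmString(CodeGen *cg);

/* run passes until the output stops changing (or the pass limit is
   hit), then do a final flatten that emits the ASM text */
void CodegenFunction(CodeGen *cg, BytecodeFn *fn)
{
    int pass;
    SetupInitialState(cg, fn);
    for (pass = 0; pass < 16 && !OutputIsStable(cg); pass++) {
        SendBegin(cg);
        SendOpcodes(cg, fn);    /* handlers read/write the "magic bits" */
        SendEnd(cg);
    }
    FlattenToAsmString(cg);
}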




but, sadly, this was also my high-point here:
most of my subsequent native code generators have not actually gotten
fully written, usually falling to either "grr... this is getting a bit
hairy" or "I have more important things to be working on".


meanwhile, other parts of the VM are often moving targets, so as to add
to the challenge, ...




>> hence why I am still mostly using an interpreter-based strategy:
>> it is much easier IME to maintain and debug the interpreter, whereas IME
>> my native code generators have more often tended to blow up in my face
>> than actually work reliably, and it is much more difficult than it would
>> seem to dig through a big pile of assembler output looking for whatever
>> little bug is causing the code not to work.
>
> Fully agreed. Hence the suggestion to generate bytecode first. But of course,
> if you plan to make an actual native code compiler, pointers in that direction
> don't necessarily hurt.




actually, typically bytecode has been the intermediate stage.


however, usually most of the analysis and similar would have been left
in the back-end, leading to:
simple/naive front-end, very complex back-end.


hence, the realization that I may need to shift a little more of the
work to the front-end.




>> only "I am using an interpreter" sounds a lot less impressive than "I am
>> using a JIT compiler", although I remember at least one VM I know of
>> which had claimed to be using a JIT, but the thing was using stupid
>> threaded code (in pretty much the same way I am using it in my
>> interpreter), so I was not terribly impressed...
>
> I would like to see performance comparison between the two, say on ARM and
> Intel. Does a function pointer de-ref and a loop really make that much
> difference in practice? OK, Hammer has lots of slowdowns, and its main loop
> isn't that simple, so there's a few exit flags to check each go through the
> loop, and a call to update a source level debugger, but hey...






well, if it is like the current partially-written JIT implementation,
which just uses full dynamic checking and invokes link-time
code-generation for everything, it would probably actually work out
slower.


the major advantage of a JIT is in being able to do things like:
produce small/compact instruction sequences;
effectively utilize the system stack and registers for storage;
...


otherwise, there is not a huge advantage over threaded code.




>> I guess there would always be a "cheap but lame" way to write a JIT
>> (and, actually, how my first working JIT worked):
>> just use fixed-form globs of ASM for each VM opcode, and simply append
>> them together in-series (emitting prologues/epilogues and labels and
>> so-on as-needed).
>
> Actually, I could imagine that, with a good optimizer, that is probably the
> common way of doing things. And it's also easy to maintain.
>


yeah.


may just go back to that strategy eventually...


