Re: Managing the JIT

"BGB / cr88192" <>
Sun, 2 Aug 2009 16:28:12 -0700

          From comp.compilers


Newsgroups: comp.compilers
References: 09-07-079 09-07-093 09-07-108 09-07-113 09-07-117 09-08-003
Keywords: code, incremental
Posted-Date: 06 Aug 2009 13:59:59 EDT

"Barry Kelly" <> wrote in message
> BGB / cr88192 wrote:
>> basically, pretty much any capability of the assembler is available from
>> the textual interface.
>> however, the textual interface provides capabilities not available if
>> direct function calls were used, such as using multi-pass compaction
>> (AKA: the first pass assumes all jumps/... to be full length, but
>> additional passes allow safely compacting the jumps).
> If the assembler function interface encoded jumps specially (which it
> would need to do anyway in case of fixups, such as jumps to
> non-local entrypoints) it can do jump optimization and simply blit the
> surrounding code.

my assembler does a little trickery for compacting local jumps.
technically, it assembles the code several times, and stops once the output
no longer changes size.
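
for illustration, the fixed-point loop might look something like this toy
model (Item, assemble_pass, and assemble are hypothetical names, not the
actual assembler; jumps start at the full 5-byte form and only ever shrink
to the 2-byte form, so the loop is guaranteed to terminate):

```c
/* Toy model of fixed-point jump compaction (hypothetical names).
 * An item is either a fixed-size run of bytes or a jump to another
 * item.  Jumps start at the full 5-byte form (E9 rel32) and may
 * shrink to the 2-byte form (EB rel8); shrinking is monotone. */
#include <assert.h>

typedef struct {
    int is_jmp;        /* 1 = jump, 0 = fixed bytes */
    int len_or_target; /* byte length, or index of the target item */
    int size;          /* currently assumed encoded size */
} Item;

static int offset_of(Item *it, int idx)
{
    int off = 0;
    for (int i = 0; i < idx; i++)
        off += it[i].size;
    return off;
}

/* One assembly pass: re-measure each jump with the current offsets
 * and shrink it if the displacement now fits in a signed byte. */
static int assemble_pass(Item *it, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        if (it[i].is_jmp) {
            int from = offset_of(it, i) + it[i].size; /* end of jump */
            int to   = offset_of(it, it[i].len_or_target);
            int d    = to - from;
            if (it[i].size == 5 && d >= -128 && d <= 127)
                it[i].size = 2;
        }
        total += it[i].size;
    }
    return total;
}

/* Assemble repeatedly, stopping once the size no longer changes. */
static int assemble(Item *it, int n)
{
    int prev = -1, cur;
    for (int i = 0; i < n; i++)             /* pass 1: assume full length */
        it[i].size = it[i].is_jmp ? 5 : it[i].len_or_target;
    while ((cur = assemble_pass(it, n)) != prev)
        prev = cur;
    return cur;
}
```

a jump over 100 bytes compacts to the short form (2+100+1 = 103 bytes
total), while one over 200 bytes stays long (5+200+1 = 206).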

it also has the option of VLI-pointers, which use the same mechanism (these
were originally intended for debug info and metadata, but ended up not being
used in preference of alternative options).

as far as non-local jumps go, it emits a "default" jump, typically a
32-bit relative jump, where a generic relocation is used.

as for full 64-bit jumps: my assembler does not currently handle this case,
and it is left up to the dynamic linker to notice it and try to patch the
jump.
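
a sketch of what emitting such a default jump could look like (Sect, Reloc,
and the emit_* helpers here are made-up names, not the actual interface):

```c
/* Sketch of emitting a "default" non-local jump: E9 rel32 with a
 * placeholder displacement plus a relocation record for the linker.
 * Sect, Reloc, and the emit_* helpers are hypothetical names. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t buf[256]; int pos; } Sect;
typedef struct { int offset; const char *sym; } Reloc; /* rel32 fixup */

static void emit8(Sect *s, uint8_t b)
{ s->buf[s->pos++] = b; }

static void emit32(Sect *s, uint32_t v)
{ memcpy(s->buf + s->pos, &v, 4); s->pos += 4; }

/* Emit "jmp sym"; the rel32 field stays zero until the linker patches
 * it (or notices the target is outside the +-2GB window and has to
 * route the jump some other way). */
static Reloc emit_jmp_rel32(Sect *s, const char *sym)
{
    emit8(s, 0xE9);                 /* JMP rel32 opcode */
    Reloc r = { s->pos, sym };      /* fixup points at the displacement */
    emit32(s, 0);
    return r;
}
```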

typically this happens with labels which are DLL exports, as Windows tends
to like putting its DLLs in the 0x000007FF00000000-0x000007FFFFFFFFFF range.

in my case, I made my linker default to putting its local image between
0x0000000B00000000 and 0x0000000BFFFFFFFF.
the main reason for this being to make it more visually obvious when JIT'ed
code is in use.

the main reason for grabbing a big linear chunk of address space like this,
is so that it is much easier for the linker to ensure that code can be
linked while keeping it within the +-2GB window (note, this memory is only
committed in a piecewise manner).
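
the reachability test behind the +-2GB window is simple; a sketch
(rel32_reachable is a hypothetical helper name):

```c
/* Sketch of the +-2GB reachability check a linker can apply before
 * choosing between a plain rel32 jump and a patched indirect jump.
 * rel32_reachable is a hypothetical name. */
#include <assert.h>
#include <stdint.h>

static int rel32_reachable(uint64_t insn_end, uint64_t target)
{
    /* unsigned subtraction then signed reinterpretation handles
     * backward targets correctly */
    int64_t d = (int64_t)(target - insn_end);
    return d >= INT32_MIN && d <= INT32_MAX;
}
```

a target inside the reserved 0x0000000B00000000 region is always in reach
of image-local code, while a DLL export up in the 0x000007FF... range
generally is not.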

for 32-bit code, space is just grabbed from wherever, and it is generally
assumed that it will be within the window (granted, in 32-bit mode there
"is" a risk of code being outside the +-2GB window; however, in this case
the space will actually just wrap around, so no indirect jump would be
needed).
>> what does the binary interface buy you?...
> Speed and memory - elimination of a whole pass, both emitting and
> parsing.

in my case, neither has been much of an issue in practice...

usually though, this is because the volume of the ASM tends to be fairly
small (vs, for example, the volume of code spewed out as a result of the C
preprocessor).
>> note that wrapping every single opcode with a function would likely be
>> far more work than writing most of the assembler.
> Opcodes have patterns; addressing modes similarly have patterns, and
> usually apply in the same way to a subset of the opcodes. So one only
> needs a simple interface, along the lines of:
> GenByte
> Gen2Bytes
> Gen3Bytes
> // etc.
> GenEffectiveAddress // i.e. addressing mode like r/m & sib on x86
> GenFixup // for linker to keep track of
> GenBranch // for jump optimization to keep track of
> This kind of low-level interface doesn't need more than couple of
> hundred lines of C, including implementation, if even that.

for the "generic" portion, yes...
my assembler does use a fairly generic form (of which, about 2200 lines make
up the core of the assembler, most of this being logic related to
emitting/encoding opcodes... another 2300 lines or so are used for the
parser, ...).
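
for illustration, a generic emit layer in this spirit might look like the
following (Emit and the gen_* names are hypothetical, not the actual
interface):

```c
/* Illustrative generic emit layer along the lines described above;
 * all names here (Emit, gen_*) are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t *buf; int pos; } Emit;

static void gen_byte(Emit *e, uint8_t b)
{ e->buf[e->pos++] = b; }

static void gen_bytes(Emit *e, const uint8_t *p, int n)
{ memcpy(e->buf + e->pos, p, n); e->pos += n; }

/* x86 ModRM byte: mod(2 bits) | reg(3 bits) | rm(3 bits) */
static void gen_modrm(Emit *e, int mod, int reg, int rm)
{ gen_byte(e, (uint8_t)((mod << 6) | (reg << 3) | rm)); }

/* Example use: "add r/m32, r32" (opcode 01 /r), register-direct. */
static void gen_add_r32_r32(Emit *e, int dst, int src)
{
    gen_byte(e, 0x01);
    gen_modrm(e, 3, src, dst);   /* mod=11: both operands registers */
}
```

encoding "add eax, ecx" (dst=0, src=1) through this layer yields the two
bytes 01 C8.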

ran line counter: whole assembler+linker is about 24 kloc (24000 lines...).

a lot of the volume in the core of the assembler is related to the different
ways for structuring opcodes, ... (some added complexity likely due to AVX,
which adds opcodes with up to 4 or 5 operands...).

however, providing an individual wrapper function for every opcode (and
variant) would make the code huge...

simple reason:
you know how many opcodes (and opcode variants) there are in x86?...

>> printf-like interface, as a very convenient and usable way to drive the
>> assembler...
> Convenience doesn't always add up to performance. Of course, for a
> compiler like Delphi, compilation speed is a key priority.

for JIT, it is not exactly a low priority either...

however, in practice it has been "plenty fast enough...".

most of the internal machinery (for example, the main C parser, complex
logic trees within the compilers, ...) have used up the vast majority of the
compile times.

it might actually matter though if compiling large volumes of a language
which, in itself, required very little processing to compile (AKA: some
language not C or C++).

>> the overall performance difference either way is likely to be small, as
>> in this case, the internal processing is likely to outweigh the cost of
>> parsing (figuring out which opcode to use, ...).
> The hot path for lexing, parsing, optimizing, codegen'ing and assembling
> of a chunk of source text need never blow the CPU cache, if you're
> careful. I find it hard to see the same kind of efficiency coming from
> an intermediate text format.

it shouldn't be too much of an issue with textual ASM, since the amount of
textual ASM is likely to never really exceed a few kB or so anyways... (nor
should most of the other internal stuff within the assembler, nor the
linker, likely exceed cache limits).

note that the assembler does not use a lexing pass, or ASTs, or anything of
the sort (it does use an optional preprocessor though, but this could be
skipped if performance were an issue).

actually, the parser for the assembler fairly directly parses tokens, uses a
hash to look up items, and forces this out through the generic APIs. other
tokens directly drive logic within the assembler.

but, what will kill the cache:
all the stuff within the frontend compiler stages.

once preprocessed, I can easily end up with several MB of source text
(mostly crap pulled in from headers);
this in turn results in fairly large ASTs, and lots of intermediate data.

however, often all of this data boils down to only a few kB of ASM...

similarly, if cache is the concern, it is far more likely that the cache
will be blown by having however much space is taken up by having a separate
function for every possible opcode variant (assuming about 200 bytes each,
and for about 2500 opcodes, this would be about 500 kB).

(then again, the assembler DLL is already about 600 kB, hmm...).

note that processing ASM in my case will typically involve passes:
ASM data is fed in through printf-like calls;
ASM data is run through preprocessor;
parser/assembler invoked several times (typically for jump compaction);
results are serialized into a COFF object (in the assembler);
COFF object is parsed into an internal form (in the linker) and added to a
queue of pending objects.
attempting to reference a symbol (which is not part of the image) will then
cause the linker to look up the queued object, and link this into the image.
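
the on-demand step might look roughly like this toy sketch (Obj, queue_obj,
and resolve are hypothetical names, and real linking is reduced to setting
a flag):

```c
/* Toy sketch of on-demand linking: queued objects are linked into the
 * image only when a symbol they define is first referenced.  Obj,
 * queue_obj, and resolve are hypothetical names. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Obj {
    const char *defines;  /* toy: one exported symbol per object */
    int linked;           /* has this object been linked yet? */
    struct Obj *next;
} Obj;

static Obj *obj_queue = NULL;

static void queue_obj(Obj *o)
{ o->next = obj_queue; obj_queue = o; }

/* Look up 'sym'; link the defining object on first reference. */
static Obj *resolve(const char *sym)
{
    for (Obj *o = obj_queue; o; o = o->next)
        if (strcmp(o->defines, sym) == 0) {
            if (!o->linked)
                o->linked = 1;  /* stand-in for linking into the image */
            return o;
        }
    return NULL;
}
```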

note, in some cases, the object will be linked directly (rather than
queued).