Date: Sun, 8 Oct 2017 22:49:36 +0100
References: 17-10-001 17-10-004 17-10-009 17-10-011 17-10-014
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="10463"; mail-complaints-to="email@example.com"
Keywords: architecture, performance, interpreter
Posted-Date: 08 Oct 2017 19:26:18 EDT
On 08/10/2017 19:36, George Neuner wrote:
> On Sat, 7 Oct 2017 19:05:10 +0100, bartc <firstname.lastname@example.org> wrote:
>> I tried an experiment a few years ago, where the byte-code for a test
>> program was expanded into a sequence of instructions in a statically
>> compiled language (or it might have been done at run-time; I can't
>> remember). Each byte-code was represented by some mix of function
>> call, or some inline code.
>> This would have been expected to benefit by eliminating dispatch
>> overheads (it just steps into the next lot of code like an ordinary
>> program), and also by having dedicated code for some things that were
>> otherwise a parameter in a generic byte-code instruction.
>> But in fact the results weren't that great at all; most normal
>> dispatchers were actually faster!
> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
> call <bytecode_function>
> call <bytecode_function>
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.
I think that's the sort of thing I was doing. Of course, when it comes to
function calls, conditional code and so on, things that require jumping
about in the byte-code, it starts to get mixed up with the control flow
of the language used to express those calls.
But I thought I'd do one quick experiment on one specific program in the
interpreted language (a very poor benchmark but just trying to see if
the runtime can be improved):
while i<100 million do
So a while-loop counting to 100 million. The normal byte-code that is
generated for that fragment:
2: 00006      push_ci   0d
3: 00008      pop_f     [while.start.i:-1]
4: 00010      jump      L14
5: 00012 L12:
4: 00014 L14:
4: 00016      push_ci   100000000d
4: 00018      jumplt    L12
These are normally executed in some sort of loop (not the loop indicated
here). The experiment emulates five of those byte-codes, by implementing
them as functions callable from static code representing this program,
as shown below.
The normally interpreted byte-code took 2.2 seconds.
The static version took 1.9 seconds.
A small difference. But this code uses my own poor x64 compiler. If
converted to C and compiled with gcc -O3 -m64, then the figures are 1.7
seconds and 0.75 seconds. (1.8 and 1.1 using -m32.)
So a significant difference this time, up to twice the speed. However,
that's comparing against the slow function-pointer dispatcher; a
switch- or label-pointer-based dispatcher would be rather brisker. But
I think the static code would still be quite a bit faster.
(But note: the very small number of handlers in my test can easily be
inlined by gcc, especially as they are only called once or twice. It
might be a different story when they are called 100s of times. And if
the code were generated via JIT methods, inlining too much could have
its own problems.)
There's one more comparison to make however: normally I run my byte-code
using my ASM overlay dispatcher, coupled with my poor compiler. For this
test, it runs the normal byte-code in 0.55 seconds, faster than the
static code compiled with gcc -O3.
The test function 'testwhile()' and the five special byte-code handlers
are shown below, written in the same static language as the interpreter
and hidden away in a corner of it.
The local 'i' variable of the original has been changed to a static 'ii'
variable here to simplify matters (no stack frame needs to be created):
proc testwhile= # emulate the byte-code of the 'while' test
    if do_lt() then goto L12 fi
    println "Done, ii = ",ii.value

proc do_push_ci(int a)=

proc do_pop_m(ref varrec a)=
    vx_ufree(a) when a^.hasref

proc do_incrto_m(ref varrec a)=
    when tint then
    when trefvar then
    when trefpacked then

proc do_push_m(ref varrec a)=
    if sptr^.hasref then

ref varrec x,y
if xt<>y^.tag then
if xt:=vx_mixed(x,y) then
when tint then
if x^.value < y^.value then