From: bartc <bc@freeuk.com>
Newsgroups: comp.compilers
Date: Sun, 8 Oct 2017 22:49:36 +0100
Organization: virginmedia.com
References: 17-10-001 17-10-004 17-10-009 17-10-011 17-10-014
Keywords: architecture, performance, interpreter
Posted-Date: 08 Oct 2017 19:26:18 EDT
Content-Language: en-GB
On 08/10/2017 19:36, George Neuner wrote:
> On Sat, 7 Oct 2017 19:05:10 +0100, bartc <bc@freeuk.com> wrote:
>> I tried an experiment a few years ago, where the byte-code for a test
>> program was expanded into a sequence of instructions in a statically
>> compiled language (or it might have been done at run-time; I can't
>> remember). Each byte-code was represented by some mix of function
>> call, or some inline code.
>>
>> This would have been expected to benefit by eliminating dispatch
>> overheads (it just steps into the next lot of code like an ordinary
>> program), and also by having dedicated code for some things that were
>> otherwise a parameter in a generic byte-code instruction.
>>
>> But in fact the results weren't that great at all; most normal
>> dispatchers were actually faster!
> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
>
> call <bytecode_function>
> call <bytecode_function>
> :
>
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.
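The contrast George describes can be sketched in C. This is my own illustration, not code from the thread: the first routine is the classic table-of-function-pointers dispatcher, where every step ends in one indirect call that the predictor must guess among all handlers; the second is the call-threaded shape written out by hand, which is what a JIT emitting a plain list of `call` instructions would produce, with each call site being a direct, individually predicted branch.

```c
#include <assert.h>

static int acc;                          /* toy accumulator "VM state" */
static void op_inc(void) { ++acc; }
static void op_dbl(void) { acc *= 2; }

/* Indirect dispatch: a list of handler addresses plus a loop.
   One hot indirect call serves every opcode. */
static void (*const prog[])(void) = { op_inc, op_dbl, op_inc, 0 };

static void run_indirect(void) {
    for (int pc = 0; prog[pc]; pc++)
        prog[pc]();
}

/* Call-threaded shape: the JIT would emit literally
   "call op_inc; call op_dbl; call op_inc; ret".
   Written out in C, that is just a sequence of direct calls. */
static void run_call_threaded(void) {
    op_inc();
    op_dbl();
    op_inc();
}
```

Both compute the same result (0 -> 1 -> 2 -> 3); the difference is purely in how the hardware predicts the transfers of control.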
I think that's the sort of thing I was doing. Of course, when it comes to
function calls, conditional code and so on (anything that requires
jumping about in the byte-code), it starts to get mixed up with the
language used to express those calls.
But I thought I'd do one quick experiment on one specific program in the
interpreted language (a very poor benchmark, but enough to see whether
the runtime can be improved):
i:=0
while i<100 million do
++i
end
So a while-loop counting to 100 million. The normal byte-code that is
generated for that fragment:
2: 00006     push_ci    0d
3: 00008     pop_f      [while.start.i:-1]
4: 00010     jump       L14
5: 00012 L12:
             incrto_f   [while.start.i:-1]
4: 00014 L14:
             push_f     [while.start.i:-1]
4: 00016     push_ci    100000000d
4: 00018     jumplt     L12
These are normally executed in some sort of loop (not the loop indicated
here). The experiment emulates five of those byte-codes, by implementing
them as functions callable from static code representing this program,
as shown below.
The normally interpreted byte-code took 2.2 seconds.
The static version took 1.9 seconds.
A small difference. But this code uses my own poor x64 compiler. If
converted to C and compiled with gcc -O3 -m64, then the figures are 1.7
seconds and 0.75 seconds. (1.8 and 1.1 using -m32.)
So a significant difference this time, up to twice the speed. However,
that's comparing against the slow function-pointer dispatcher; a
switch- or label-pointer-based dispatcher would be rather brisker. But
the static code would still be quite a bit faster I think.
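For reference, here is what I mean by a switch-based dispatcher, as a minimal C sketch covering just the five byte-codes in the test. The opcode names and the operand encoding are my own invention for the sketch, not the actual interpreter's; the state is simplified to one untagged variable and a small value stack.

```c
#include <assert.h>

enum { PUSH_CI, POP_M, JUMP, INCRTO_M, PUSH_M, JUMPLT, HALT };

/* Switch-based dispatch loop: one switch per byte-code, operands
   read inline from the code stream. */
long run(const long *code) {
    long stack[16], *sp = stack + 16, var = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case PUSH_CI:  *--sp = code[pc++];        break;  /* push constant */
        case POP_M:    var = *sp++;               break;  /* pop to variable */
        case JUMP:     pc = code[pc];             break;
        case INCRTO_M: ++var;                     break;  /* increment in place */
        case PUSH_M:   *--sp = var;               break;  /* push variable */
        case JUMPLT: { long y = *sp++, x = *sp++;         /* jump if x < y */
                       if (x < y) pc = code[pc]; else pc++; break; }
        case HALT:     return var;
        }
    }
}
```

The while-loop test, counting to 100 for brevity, encodes as `{PUSH_CI,0, POP_M, JUMP,6, INCRTO_M, PUSH_M, PUSH_CI,100, JUMPLT,5, HALT}`, and `run` returns the final counter value.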
(But note: the very small number of handlers in my test can easily be
inlined by gcc, especially as they are only called once or twice. It
might be a different story when they are called 100s of times. And if
the code were generated via JIT methods, inlining too much could have
its own problems.)
There's one more comparison to make however: normally I run my byte-code
using my ASM overlay dispatcher, coupled with my poor compiler. For this
test, it runs the normal byte-code in 0.55 seconds. Faster than the
static code compiled with gcc -O3.
-------------------------------------------------------------------
The test function 'testwhile()', and the five special byte-code
handlers, written in the same static language as the interpreter, and
hidden away in a corner of the interpreter.
The local 'i' variable of the original has been changed to a static 'ii'
variable here to simplify matters (no stack frame has been created):
varrec ii
proc testwhile= # emulate the byte-code of the 'while' test
do_push_ci(0)
do_pop_m(&ii)
goto L14
L12:
do_incrto_m(&ii)
L14:
do_push_m(&ii)
do_push_ci(100 million)
if do_lt() then goto L12 fi
println "Done, ii = ",ii.value
end
proc do_push_ci(int a)=
(--sptr)^.tagx:=tint
sptr^.value:=a
end
proc do_pop_m(ref varrec a)=
vx_ufree(a) when a^.hasref
a^:=sptr++^
end
proc do_incrto_m(ref varrec a)=
switch ttbasetype[a^.tag]
when tint then
++a^.value
when trefvar then
++a^.varptr
when trefpacked then
a^.ptr+:=ttsize[a^.refelemtag]
else
pcustype("INCRTO_M",a)
endswitch
end
proc do_push_m(ref varrec a)=
(--sptr)^:= a^
if sptr^.hasref then
++sptr^.objptr^.refcount
fi
end
function do_lt:int=
ref varrec x,y
int xt,res
y:=sptr++
x:=sptr++
xt:=x^.tag
if xt<>y^.tag then
if xt:=vx_mixed(x,y) then
goto retry
fi
pcmxtypes("LT/MIXED",x,y)
fi
retry:
switch xt
when tint then
if x^.value < y^.value then
return 1
fi
..... etc
return 0
end
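For anyone who doesn't read my syntax, here is roughly what the static version looks like after the kind of conversion to C I benchmarked. This is a simplified sketch, not the exact converted code: the tagged 'varrec' is reduced to a plain long, so do_lt only handles the integer case and the reference-counting in do_pop_m/do_push_m disappears.

```c
#include <assert.h>

static long stackbuf[64], *sptr = stackbuf + 64;   /* downward value stack */
static long ii;                                    /* the static loop counter */

static void do_push_ci(long a)   { *--sptr = a; }        /* push constant int */
static void do_pop_m(long *a)    { *a = *sptr++; }       /* pop to memory */
static void do_incrto_m(long *a) { ++*a; }               /* increment in memory */
static void do_push_m(long *a)   { *--sptr = *a; }       /* push from memory */
static int  do_lt(void)          { long y = *sptr++, x = *sptr++;
                                   return x < y; }       /* compare top two */

/* Emulate the byte-code of the 'while' test */
static void testwhile(void) {
    do_push_ci(0);
    do_pop_m(&ii);
    goto L14;
L12:
    do_incrto_m(&ii);
L14:
    do_push_m(&ii);
    do_push_ci(100000000);
    if (do_lt()) goto L12;
}
```

With everything reduced to longs like this, gcc -O3 can inline all five handlers and the loop collapses to almost nothing, which is part of why the gcc figures look so good relative to the interpreted version.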
--
bartc