Re: Opcode handler dispatch in an interpreter: Implementing switch on OpCode.

bartc <>
Sun, 8 Oct 2017 22:49:36 +0100


Newsgroups: comp.compilers
References: 17-10-001 17-10-004 17-10-009 17-10-011 17-10-014
Keywords: architecture, performance, interpreter
Posted-Date: 08 Oct 2017 19:26:18 EDT
Content-Language: en-GB

On 08/10/2017 19:36, George Neuner wrote:
> On Sat, 7 Oct 2017 19:05:10 +0100, bartc <> wrote:

>> I tried an experiment a few years ago, where the byte-code for a test
>> program was expanded into a sequence of instructions in a statically
>> compiled language (or it might have been done at run-time; I can't
>> remember). Each byte-code was represented by some mix of function
>> call, or some inline code.
>> This would have been expected to benefit by eliminating dispatch
>> overheads (it just steps into the next lot of code like an ordinary
>> program), and also by having dedicated code for some things that were
>> otherwise a parameter in a generic byte-code instruction.
>> But in fact the results weren't that great at all; most normal
>> dispatchers were actually faster!

> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
> call <bytecode_function>
> call <bytecode_function>
> :
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.

I think that's the sort of thing I was doing. Of course, once function
calls, conditional code and so on are involved (all the things that
require jumping about in the byte-code), it starts to get mixed up with
the control flow of the language used to express those calls.

But I thought I'd do one quick experiment on one specific program in the
interpreted language (a very poor benchmark, but enough to see whether
the runtime can be improved):


        i := 0
        while i < 100 million do
                ++i
        od

So a while-loop counting to 100 million. The normal byte-code that is
generated for that fragment:

            2: 00006        push_ci    0d
            3: 00008        pop_f      [while.start.i:-1]
            4: 00010        jump       L14
            5: 00012 L12:
                            incrto_f   [while.start.i:-1]
            4: 00014 L14:
                            push_f     [while.start.i:-1]
            4: 00016        push_ci    100000000d
            4: 00018        jumplt     L12

These are normally executed via some sort of dispatch loop (not the loop
shown here). The experiment emulates the five byte-codes involved by
implementing them as functions called from static code representing this
program, as shown below.

The normally interpreted byte-code took 2.2 seconds.
The static version took 1.9 seconds.

A small difference. But this code uses my own poor x64 compiler. If
converted to C and compiled with gcc -O3 -m64, then the figures are 1.7
seconds and 0.75 seconds. (1.8 and 1.1 using -m32.)

So a significant difference this time, up to twice the speed. However,
that's comparing against the slow function-pointer dispatcher; a
switch- or label-pointer-based dispatcher would be rather brisker. But
the static code would still be quite a bit faster I think.

(But note: the very small number of handlers in my test can easily be
inlined by gcc, especially as each is only called once or twice. It
might be a different story when they are called hundreds of times. And
if the code were generated via JIT methods, inlining too much could
bring its own problems.)

There's one more comparison to make, however: normally I run my
byte-code using my ASM overlay dispatcher, coupled with my poor
compiler. For this test, it runs the normal byte-code in 0.55 seconds.
Faster than the static code compiled with gcc -O3.


Below are the test function 'testwhile()' and the five special
byte-code handlers, written in the same static language as the
interpreter and hidden away in a corner of it.

The local 'i' variable of the original has been changed to a static 'ii'
variable here to simplify matters (no stack frame is created):

varrec ii

proc testwhile= # emulate the byte-code of the 'while' test
          do_push_ci(0)                  # push_ci 0
          do_pop_m(&ii)                  # pop_f i
          goto L14
L12:
          do_incrto_m(&ii)               # incrto_f i
L14:
          do_push_m(&ii)                 # push_f i
          do_push_ci(100 million)        # push_ci 100000000
          if do_lt() then goto L12 fi    # jumplt L12

          println "Done, ii = ",ii.value

proc do_push_ci(int a)=
          .....

proc do_pop_m(ref varrec a)=
          vx_ufree(a) when a^.hasref
          .....

proc do_incrto_m(ref varrec a)=
          switch ttbasetype[a^.tag]
          when tint then
                  .....
          when trefvar then
                  .....
          when trefpacked then
                  .....

proc do_push_m(ref varrec a)=
          (--sptr)^:= a^
          if sptr^.hasref then
                  .....
          fi

function do_lt:int=
          ref varrec x,y
          int xt,res
          .....
retry:
          if xt<>y^.tag then
                  if xt:=vx_mixed(x,y) then
                          goto retry
                  fi
          fi

          switch xt
          when tint then
                  if x^.value < y^.value then
                          return 1
                  fi
          ..... etc
          return 0

