Re: Opcode handler dispatch in an interpreter: Implementing switch on OpCode.


From: bartc <bc@freeuk.com>
Newsgroups: comp.compilers
Date: Sun, 8 Oct 2017 22:49:36 +0100
Organization: virginmedia.com
References: 17-10-001 17-10-004 17-10-009 17-10-011 17-10-014
Keywords: architecture, performance, interpreter
Posted-Date: 08 Oct 2017 19:26:18 EDT
Content-Language: en-GB

On 08/10/2017 19:36, George Neuner wrote:
> On Sat, 7 Oct 2017 19:05:10 +0100, bartc <bc@freeuk.com> wrote:


>> I tried an experiment a few years ago, where the byte-code for a test
>> program was expanded into a sequence of instructions in a statically
>> compiled language (or it might have been done at run-time; I can't
>> remember). Each byte-code was represented by some mix of function
>> call, or some inline code.
>>
>> This would have been expected to benefit by eliminating dispatch
>> overheads (it just steps into the next lot of code like an ordinary
>> program), and also by having dedicated code for some things that were
>> otherwise a parameter in a generic byte-code instruction.
>>
>> But in fact the results weren't that great at all; most normal
>> dispatchers were actually faster!


> Some of the things to come out of the discussions in c.l.a.x86 were
> that the fastest [simple] JIT sequence for an interpreter is a list of
> explicit call instructions: e.g.,
>
> call <bytecode_function>
> call <bytecode_function>
> :
>
> rather than a list of function addresses with a dispatcher. This is
> because the call leverages the CPU branch predictor rather than
> fighting with it.


I think that's the sort of thing I was doing. Of course, when it comes to
function calls, conditional code and so on - anything that requires
jumping about in the byte-code - it starts to get mixed up with the
language used to express those calls.
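
Roughly, the contrast being discussed is this (a minimal C sketch, not my
actual code; the handler names are made up):

    #include <stdio.h>

    /* Stand-in handlers; in a real interpreter each one would execute
       one kind of byte-code. */
    static void op_a(void) {}
    static void op_b(void) {}

    typedef void (*handler_fn)(void);

    /* (a) Table-driven dispatch: one loop with a single indirect call
       site, which has to predict a different target for every
       byte-code. */
    static void run_table(const handler_fn *code, int n) {
        for (int i = 0; i < n; i++)
            code[i]();
    }

    /* (b) "List of calls": the byte-code has been expanded, ahead of
       time or by a JIT, into a straight sequence of direct calls, so
       there is no dispatch loop and every call target is fixed. */
    static void run_expanded(void) {
        op_a();
        op_b();
        op_a();
        /* ...one direct call per byte-code of the program... */
    }

    int main(void) {
        const handler_fn prog[] = { op_a, op_b, op_a };
        run_table(prog, 3);
        run_expanded();
        puts("done");
        return 0;
    }

Control flow is what spoils the neatness of (b): a byte-code jump has to
become a goto or similar in the generated code, which is where the host
language starts to intrude.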


But I thought I'd do one quick experiment on one specific program in the
interpreted language (a very poor benchmark, but just to see whether the
runtime can be improved):


        i:=0


        while i<100 million do
            ++i
        end


So a while-loop counting to 100 million. The normal byte-code that is
generated for that fragment:


            2: 00006 ----------push_ci 0d
            3: 00008 ----------pop_f [while.start.i:-1]
            4: 00010 ----------jump L14
            5: 00012 L12:
                                ----------incrto_f [while.start.i:-1]
            4: 00014 L14:
                                ----------push_f [while.start.i:-1]
            4: 00016 ----------push_ci 100000000d
            4: 00018 ----------jumplt L12


These are normally executed by some sort of dispatch loop (not the loop
shown in the fragment above). The experiment emulates five of those
byte-codes by implementing them as functions callable from static code
representing this program, as shown below.
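
(For concreteness, the "normal" execution referred to here is a dispatch
loop of roughly this shape - a cut-down C sketch with invented opcode
names and a bare untagged stack, not the real interpreter's data
structures:)

    /* A conventional dispatch loop: fetch an instruction, switch on its
       opcode, repeat. */
    enum { PUSH_CI, POP_F, PUSH_F, INCRTO_F, JUMP, JUMPLT, HALT };

    typedef struct { int op; long operand; } instr;

    static long stack[64], *sp = stack + 64;
    static long vars[16];              /* frame slots such as while.start.i */

    static void interp(const instr *code) {
        int pc = 0;
        for (;;) {
            const instr *x = &code[pc++];
            switch (x->op) {
            case PUSH_CI:  *--sp = x->operand;          break;
            case POP_F:    vars[x->operand] = *sp++;    break;
            case PUSH_F:   *--sp = vars[x->operand];    break;
            case INCRTO_F: ++vars[x->operand];          break;
            case JUMP:     pc = (int)x->operand;        break;
            case JUMPLT: { long b = *sp++, a = *sp++;
                           if (a < b) pc = (int)x->operand;
                           break; }
            case HALT:     return;
            }
        }
    }

    int main(void) {
        /* The counting loop from above, encoded for this sketch. */
        const instr prog[] = {
            {PUSH_CI, 0}, {POP_F, 0}, {JUMP, 4},     /* i := 0; goto test */
            {INCRTO_F, 0},                           /* L12: ++i          */
            {PUSH_F, 0}, {PUSH_CI, 100000000},       /* L14: push i, 100M */
            {JUMPLT, 3}, {HALT, 0}                   /* loop while i < .. */
        };
        interp(prog);
        return 0;
    }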


The normally interpreted byte-code took 2.2 seconds.
The static version took 1.9 seconds.


A small difference. But this code uses my own poor x64 compiler. If
converted to C and compiled with gcc -O3 -m64, then the figures are 1.7
seconds and 0.75 seconds. (1.8 and 1.1 using -m32.)


So a significant difference this time, up to twice the speed. However,
that's comparing against the slow function-pointer dispatcher; a
switch- or label-pointer-based dispatcher would be rather brisker. But
I think the static code would still be quite a bit faster.
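
(For anyone unfamiliar with the label-pointer style: it uses gcc's
computed goto, so each handler jumps straight to the next instruction's
handler, giving one indirect branch per byte-code instead of a single
shared dispatch point. A sketch, reusing the instr, sp and vars
definitions from the loop sketch above; it is invoked just like interp()
there:)

    /* Label-pointer dispatch via gcc's computed goto (a gcc extension). */
    #define DISPATCH() goto *labels[code[pc].op]

    static void interp_goto(const instr *code) {
        static void *const labels[] = {
            &&l_push_ci, &&l_pop_f, &&l_push_f,
            &&l_incrto_f, &&l_jump, &&l_jumplt, &&l_halt
        };
        int pc = 0;
        DISPATCH();

    l_push_ci:  *--sp = code[pc].operand;        pc++; DISPATCH();
    l_pop_f:    vars[code[pc].operand] = *sp++;  pc++; DISPATCH();
    l_push_f:   *--sp = vars[code[pc].operand];  pc++; DISPATCH();
    l_incrto_f: ++vars[code[pc].operand];        pc++; DISPATCH();
    l_jump:     pc = (int)code[pc].operand;            DISPATCH();
    l_jumplt: { long b = *sp++, a = *sp++;
                pc = (a < b) ? (int)code[pc].operand : pc + 1;
                DISPATCH(); }
    l_halt:     return;
    }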


(But note: the very small number of handlers in my test can easily be
inlined by gcc, especially as each is only called once or twice. It
might be a different story when they are called hundreds of times. Also,
the code might be generated via JIT methods; inlining too much could
bring its own problems.)


There's one more comparison to make, however: normally I run my byte-code
using my ASM overlay dispatcher, coupled with my poor compiler. For this
test, it runs the normal byte-code in 0.55 seconds - faster than the
static code compiled with gcc -O3.




-------------------------------------------------------------------


The test function 'testwhile()' and the five special byte-code
handlers are written in the same static language as the interpreter, and
hidden away in a corner of it.


The local 'i' variable of the original has been changed to a static 'ii'
variable here to simplify matters (no stack frame is created):


varrec ii


proc testwhile= # emulate the byte-code of the 'while' test


          do_push_ci(0)
          do_pop_m(&ii)


          goto L14
L12:
          do_incrto_m(&ii)
L14:
          do_push_m(&ii)
          do_push_ci(100 million)
          if do_lt() then goto L12 fi


          println "Done, ii = ",ii.value
end


proc do_push_ci(int a)=
          (--sptr)^.tagx:=tint
          sptr^.value:=a
end


proc do_pop_m(ref varrec a)=
          vx_ufree(a) when a^.hasref
          a^:=sptr++^
end


proc do_incrto_m(ref varrec a)=


          switch ttbasetype[a^.tag]
          when tint then
                  ++a^.value
          when trefvar then
                  ++a^.varptr
          when trefpacked then
                  a^.ptr+:=ttsize[a^.refelemtag]
          else
                  pcustype("INCRTO_M",a)
          endswitch
end


proc do_push_m(ref varrec a)=
          (--sptr)^:= a^


          if sptr^.hasref then
                  ++sptr^.objptr^.refcount
          fi
end


function do_lt:int=
          ref varrec x,y
          int xt,res


          y:=sptr++
          x:=sptr++
          xt:=x^.tag


          if xt<>y^.tag then
                  if xt:=vx_mixed(x,y) then
                          goto retry
                  fi
                  pcmxtypes("LT/MIXED",x,y)
          fi
          retry:


          switch xt
          when tint then
                  if x^.value < y^.value then
                          return 1
                  fi
          ..... etc
          return 0
end
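
For comparison, the C translation mentioned earlier (the one built with
gcc -O3) has essentially this shape. This is a rough reconstruction,
with varrec cut down to a bare tag/value pair rather than the
interpreter's real tagged type:

    /* The statically expanded test in C: the byte-code of the loop
       becomes straight-line calls plus two gotos, with no dispatcher. */
    #include <stdio.h>

    typedef struct { int tag; long value; } varrec;

    static varrec stack[64], *sptr = stack + 64;
    static varrec ii;

    static void do_push_ci(long a)     { (--sptr)->tag = 1; sptr->value = a; }
    static void do_pop_m(varrec *a)    { *a = *sptr++; }
    static void do_incrto_m(varrec *a) { ++a->value; }
    static void do_push_m(varrec *a)   { *--sptr = *a; }
    static int  do_lt(void) {
        varrec *y = sptr++, *x = sptr++;
        return x->value < y->value;
    }

    static void testwhile(void) {
        do_push_ci(0);
        do_pop_m(&ii);
        goto L14;
    L12:
        do_incrto_m(&ii);
    L14:
        do_push_m(&ii);
        do_push_ci(100000000);
        if (do_lt()) goto L12;

        printf("Done, ii = %ld\n", ii.value);
    }

    int main(void) { testwhile(); return 0; }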


--
bartc


