|From:||George Neuner <email@example.com>|
|Date:||Sat, 07 Oct 2017 12:55:13 -0400|
|Organization:||A noiseless patient Spider|
|Injection-Info:||gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="90294"; mail-complaints-to="firstname.lastname@example.org"|
|Keywords:||architecture, code, performance|
|Posted-Date:||07 Oct 2017 13:26:14 EDT|
On Fri, 6 Oct 2017 02:56:06 +0200, Hans-Peter Diettrich
>Am 05.10.2017 um 18:29 schrieb Robert Jacobson:
>> I am trying to wrap my mind around the issue of dynamic dispatch in the
>> context of switching on opcode in a bytecode interpreter ...
>I don't see how branch prediction will have an impact on runtime, when
>the prediction only applies to conditional branches, and the next opcode
>is one-of-many. If the cache can hold the jump table, nothing can be
>improved with this model. Consider how many *possible* comparisons and
>branches must be encoded, so that they will lead to *all* possible (161)
Unfortunately, on modern CPUs, *all* branches are predicted. An
indirect jump through the table will mispredict virtually every time.
The same will be true of an indirect jump via register based address.
The best you can do with an interpreter is to have all the code in L1
code cache. As soon as you have to go to L2 (which typically is
shared between code and data) or deeper, you risk taking large hits if
the code is not resident.
comp.lang.asm.x86 has seen extensive discussions of mispredict
problems in interpreters and JIT compiled code. The conclusions there
are applicable to most CPU architectures.
>Many years ago I saw an 68k emulator, where the opcode was multiplied by
>16 and used as an offset into the emulator code, eliminating the branch
>table. The 16 byte blocks cover the emulator code for most instructions,
>this factor can be scaled to any sufficiently large block size if
>required. If some bytecodes require more than one block of executable
>code, the codes can be made discontiguous, with gaps wherever the
>code-to-execute is longer than the chosen blocking factor.
>Of course such an approach lacks any cache support, but a large cache
>still may hold the most frequently used code sequences automatically. If
>the cache is too small to hold the code for multiple opcodes, it's
>useless at all.
Caching is relevant mainly to minimizing the effect of the mispredict
- i.e. whether it takes 15 cycles or 315 cycles to get the pipeline
moving again. Minimizing mispredicts buys you more than caching.
Return to the
Search the comp.compilers archives again.