Interpreters and caller-saved registers
Fri, 13 Oct 2023 07:44:46 +0000

          From comp.compilers

Related articles
Interpreters and caller-saved registers (2023-10-13)
Re: Interpreters and caller-saved registers (Thomas Koenig) (2023-10-15)
Re: Interpreters and caller-saved registers (2023-10-19)
Re: Interpreters and caller-saved registers (Thomas Koenig) (2023-10-22)
Re: Interpreters and caller-saved registers (2023-10-24)
Re: bug fixes, Interpreters and caller-saved registers (Kaz Kylheku) (2023-10-25)

Newsgroups: comp.compilers
Followup-To: comp.arch
Date: Fri, 13 Oct 2023 07:44:46 +0000
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Keywords: interpreter, architecture
Posted-Date: 13 Oct 2023 16:12:48 EDT

In August we discussed in comp.arch that Intel's APX will increase the
number of caller-saved registers, but not the number of callee-saved
registers. And I pointed out that more caller-saved registers are
hardly useful for interpreters.

This discussion inspired me to reconsider how we might use
caller-saved registers for stack caching in Gforth. With stack
caching as used in Gforth [ertl&gregg05], the topmost 0 to n stack
items reside in registers, with n+1 different stack representations
(one for each number of stack items held in registers).

E.g., on RISC-V the stack representations S0-S3 are:

elements   S0       S1       S2       S3
top        8(s8)    s7       s0       s2
second     16(s8)   8(s8)    s7       s0
third      24(s8)   16(s8)   8(s8)    s7
fourth     32(s8)   24(s8)   16(s8)   8(s8)

With these stack representations, the definition

: cubed dup dup * * ;

is compiled into:

$3F9F301748 dup 1->2
    3F9EFA4ACA: mv s0,s7
$3F9F301750 dup 2->3
    3F9EFA4ACC: mv s2,s0
$3F9F301758 * 3->2
    3F9EFA4ACE: mul s0,s0,s2
$3F9F301760 * 2->1
    3F9EFA4AD2: mul s7,s7,s0
$3F9F301768 ;s 1->1
    3F9EFA4AD6: ld a6,0(s9)
    3F9EFA4ADA: addi s9,s9,8
    3F9EFA4ADC: mv s10,a6
    3F9EFA4ADE: ld a5,0(s10)
    3F9EFA4AE2: jr a5
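
To see what these state transitions mean at the C level, here is a
minimal, self-contained sketch (not Gforth source; it merely reuses
the register names above as C variable names): the three topmost
stack items live in the variables s7 (top), s0 (second), and s2
(third), and CUBED then needs no memory accesses and no stack-pointer
updates at all.

#include <stdio.h>

long cubed(long s7)            /* entry in S1: top of stack in s7 */
{
    long s0, s2;
    s0 = s7;                   /* dup  1->2 */
    s2 = s0;                   /* dup  2->3 */
    s0 = s0 * s2;              /* *    3->2 */
    s7 = s7 * s0;              /* *    2->1 */
    return s7;                 /* exit in S1 */
}

int main(void)
{
    printf("%ld\n", cubed(5)); /* prints 125 */
    return 0;
}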

We use n=3 if enough registers are available, but until September 2023
we used n=1 on AMD64 because of a lack of registers. (n>3 results in
diminishing returns [ertl&gregg05].)

So on AMD64 we got:

$7F4534F89B00 dup 1->1
      0x00007f4534c41030: mov %r8,0x0(%r13)
      0x00007f4534c41034: sub $0x8,%r13
$7F4534F89B08 dup 1->1
      0x00007f4534c41038: mov %r8,0x0(%r13)
      0x00007f4534c4103c: sub $0x8,%r13
$7F4534F89B10 * 1->1
      0x00007f4534c41040: imul 0x8(%r13),%r8
      0x00007f4534c41045: add $0x8,%r13
$7F4534F89B18 * 1->1
      0x00007f4534c41049: imul 0x8(%r13),%r8
      0x00007f4534c4104e: add $0x8,%r13
$7F4534F89B20 ;s 1->1
      0x00007f4534c41052: mov (%rbx),%r10
      0x00007f4534c41055: add $0x8,%rbx
      0x00007f4534c41059: mov %r10,%r15
      0x00007f4534c4105c: mov (%r15),%rcx
      0x00007f4534c4105f: jmp *%rcx

I.e., lots of memory accesses and lots of stack pointer updates.
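
For contrast, here is a similar sketch (again not Gforth source) of
what the S0-S1-only version does: the second and third stack items
live in memory, so every dup and * goes through the memory stack and
updates the stack pointer, just like in the AMD64 code above.

#include <stdio.h>

long cubed_s1(long s7, long *sp)   /* s7: cached top of stack */
{
    *sp-- = s7;                    /* dup 1->1: store tos, grow the stack */
    *sp-- = s7;                    /* dup 1->1 */
    sp++; s7 *= sp[0];             /* *   1->1: multiply with second item */
    sp++; s7 *= sp[0];             /* *   1->1 */
    return s7;
}

int main(void)
{
    long stack[16];
    printf("%ld\n", cubed_s1(5, stack + 15));   /* prints 125 */
    return 0;
}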

As you can see in the RISC-V register names, gcc used callee-saved
registers (with names starting with "s") rather than caller-saved
registers (names starting with "t") for the stack cache registers.
That's because there are more than 100 VM instructions in Gforth that
call functions; such a VM instruction contains code like:

/* start in S1 */
char *a_addr = (char *)s7;
long wior = 0;
free(a_addr);
s7 = wior;
/* end in S1 */
... invoke next VM instruction ...

(I used s7 here instead of the name of the C variable to make the
connection with the stuff above clearer).

Now the idea for making use of caller-saved registers is that, for a
VM instruction that ends in representation S1, (the variables in)
registers s0 and s2 are dead, because any transition to S2 or S3 will
write these registers. Unfortunately, gcc does not know that, and
assumes that the invocation of the next VM instruction could also
invoke something that expects the stack to be in S3, where s0 and s2
are live. Therefore it considers s0 and s2 live across the call to
free() and allocates them to callee-saved registers. In Gforth,
the VM instructions containing calls all end in S1, so s0 and s2 are
(in reality, but unknown to gcc) dead after each call, and could be
allocated to caller-saved registers.

So the idea is to kill these two registers after the call to free()
(and after the other calls) in the eyes of gcc. More generally, my
idea was to kill s2 at the end of every VM instruction that ends in
S2, kill s2 and s0 at the end of every VM instruction that ends in S1,
and kill s2, s0, and s7 at the end of every VM instruction that ends
in S0. That should provide the maximum freedom to gcc, which is
generally believed to help the compiler produce good code.

How do we kill a variable/register? By writing to it, or by making
the compiler believe that we write to it. One way would be:

s2 = 0;

However, this costs an instruction; a cheap one on sophisticated
microarchitectures, but still. So what we actually do is something
like:

asm("" : "=r" (s2));

This tells gcc that the asm statement writes to s2, and thus kills it,
but it actually does not generate any assembly language.
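
As a self-contained sketch of the mechanism (the names and the
dispatch code are invented for illustration and are not taken from
the Gforth source; only two stack-cache registers are used to keep it
short): without the KILL, gcc has to assume that the computed goto
after free() may jump to a VM instruction that reads s0, so it keeps
s0 alive across the call; with the KILL, it does not have to.

#include <stdlib.h>

#define KILL(x)  asm("" : "=r" (x))   /* "x is overwritten", emits no code */

void engine(long *sp)
{
    /* addresses of the VM instructions; their being taken is what
       forces gcc to assume any of them may follow any other */
    static void *prims[] = { &&vm_dup_1_2, &&vm_mul_2_1, &&vm_free, &&vm_end };
    void *code[] = { prims[2], prims[3] };  /* threaded code: free, then end */
    void **ip = code;
    long s7 = sp[0], s0 = 0;           /* stack cache, starting in S1 */

    goto **ip++;                       /* dispatch the first VM instruction */

vm_dup_1_2:                            /* dup 1->2 */
    s0 = s7;
    goto **ip++;

vm_mul_2_1:                            /* * 2->1 */
    s7 = s7 * s0;
    goto **ip++;

vm_free:                               /* free ( a_addr -- wior ), ends in S1 */
    {
        char *a_addr = (char *)s7;
        long wior = 0;
        free(a_addr);
        s7 = wior;
        KILL(s0);                      /* s0 is dead when ending in S1 */
    }
    goto **ip++;                       /* invoke next VM instruction */

vm_end:
    sp[0] = s7;                        /* flush the cached top of stack */
}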

I implemented this and then enabled the S2 and S3 representations on
AMD64, and gcc now indeed keeps s2 and s0 in caller-saved registers.

Unfortunately, gcc-11.4 also introduced two additional redundant move
instructions in every VM instruction, and Bernd Paysan reported that
gcc-12 and gcc-13 introduced even more superfluous code in every VM
instruction. This is similar to what we saw from gcc-3.0 with Gforth
back then, and to what we saw from clang the last time we tried it.

I tried to work around this issue by having the kills only at the end
of VM instructions that perform a call, and indeed, that worked for
gcc-11.4. However, gcc-12 and gcc-13 still produced bad code.
Finally Bernd Paysan had the right idea and added -fno-tree-vectorize
to the list of options that we use to avoid gcc shenanigans, and now
we can also use this idea with gcc-12 and gcc-13.

The result is that the code for CUBED now looks as follows on AMD64:

$7F9D0D2531F0 dup 1->2
      0x00007f9d0cefa8b2: mov %r8,%r15
$7F9D0D2531F8 dup 2->3
      0x00007f9d0cefa8b5: mov %r15,%r9
$7F9D0D253200 * 3->2
      0x00007f9d0cefa8b8: imul %r9,%r15
$7F9D0D253208 * 2->1
      0x00007f9d0cefa8bc: imul %r15,%r8
$7F9D0D253210 ;s 1->1
      0x00007f9d0cefa8c0: mov (%r14),%rbx
      0x00007f9d0cefa8c3: add $0x8,%r14
      0x00007f9d0cefa8c7: mov (%rbx),%rax
      0x00007f9d0cefa8ca: jmp *%rax

In terms of performance, the difference is (on a 4GHz Core i5-6600K,
numbers in seconds):

  sieve   bubble  matrix  fib     fft
  0.070   0.096   0.029   0.058   0.022   with S0-S1
  0.060   0.091   0.019   0.051   0.021   with S0-S3

Concerning the discussion of caller-saved vs. callee-saved registers:
if AMD64 had more callee-saved registers, we would have seen these
performance benefits many years ago, and we would not have had to
spend the work to introduce the kills, find the problems, and work
around them.

This optimization is quite specific to stack caching.
A more general approach for VM interpreters is to surround every call
with code saving VM registers before and code restoring them
afterwards. This allows the C compiler to put the VM registers into
caller-saved registers. E.g.:

vmstate->ip = ip;
vmstate->sp = sp;
... the call ...
ip = vmstate->ip;
sp = vmstate->sp;

Of course, these saves and restores also have a cost, but at least
they allow the compiler to use caller-saved registers for the VM
registers in the VM instructions that do not perform calls. To reduce
the overhead of these saves and restores, one can keep m VM registers
in C variables across calls instead of saving them, where m is the
architecture-specific number of callee-saved registers. So
architectures with more callee-saved registers have less of this
saving and restoring overhead.
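
As a sketch (the names follow the fragment above; the code is
invented for illustration, not taken from any particular
interpreter), a VM instruction containing a call might then look like
this with m=1: ip is simply kept in its C variable across the call,
so the compiler backs it with a callee-saved register, while sp is
saved and restored and can therefore live in a caller-saved register
in the call-free parts of the interpreter.

#include <stdlib.h>

struct vmstate { void **ip; long *sp; };

void engine(struct vmstate *vmstate)
{
    void **ip = vmstate->ip;
    long  *sp = vmstate->sp;

    /* a VM instruction containing a call: free ( a_addr -- wior ) */
    {
        char *a_addr = (char *)sp[0];
        vmstate->sp = sp;          /* save */
        free(a_addr);              /* the call */
        sp = vmstate->sp;          /* restore */
        sp[0] = 0;                 /* wior */
        /* ip was kept across the call (m=1), so it needs no save/restore */
    }

    /* ... invoke next VM instruction ... */
    vmstate->ip = ip;
    vmstate->sp = sp;
}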

Followups set to comp.arch; change this to comp.compilers (moderated)
if your followup is more appropriate for that group.

@InProceedings{ertl&gregg05,
  author =    {M. Anton Ertl and David Gregg},
  title =     {Stack Caching in {Forth}},
  crossref =  {euroforth05},
  pages =     {6--15},
  url =       {},
  pdfurl =    {},
  OPTnote =   {not refereed},
  abstract =  {Stack caching speeds Forth up by keeping stack items
               in registers, reducing the number of memory accesses
               for stack items. This paper describes our work on
               extending Gforth's stack caching implementation to
               support more than one register in the canonical
               state, and presents timing results for the resulting
               Forth system. For single-representation stack
               caches, keeping just one stack item in registers is
               usually best, and provides speedups up to a factor
               of 2.84 over the straight-forward stack
               representation. For stack caches with multiple stack
               representations, using the one-register
               representation as canonical representation is
               usually optimal, resulting in an overall speedup of
               up to a factor of 3.80 (and up to a factor of 1.53
               over single-representation stack caching).}
}

@Proceedings{euroforth05,
  title =     {21st EuroForth Conference},
  booktitle = {21st EuroForth Conference},
  year =      {2005},
  key =       {EuroForth'05},
  editor =    {M. Anton Ertl},
  url =       {}
}

- anton
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <>
