Re: Register Based Interpreter Model (Anton Ertl)
2 Nov 2005 22:12:56 -0500

          From comp.compilers


From: (Anton Ertl)
Newsgroups: comp.compilers
Date: 2 Nov 2005 22:12:56 -0500
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 05-11-016
Keywords: interpreter, VM, bibliography
Posted-Date: 02 Nov 2005 22:12:56 EST

"Avatar" <> writes:
>I am doing some research on the register-based interpreter model in
>comparison to the more traditional stack-based approach. I would like
>to know if anyone has had any practical experience with register-based

There are some register-based VM interpreters, and their implementors
should have experience, and sometimes have written about it. Some
that come to mind are Inferno [winterbottom&pike97], Parrot, and IIRC
the Lua VM. The last two are especially interesting, because
originally the language was implemented using a stack-based VM, and
then converted to a register-based VM. Moreover, there is the WAM for
implementing Prolog [vanroy94].

However, I still believe that, if you want good performance, a hybrid
stack-based VM (hybrid because you usually do indexed accesses to
local variables) is a good idea, because:

- The main advantage of register VMs is fewer dispatches. If you
    really want performance, you use dynamic superinstructions (aka
    selective inlining) [piumarta&riccardi98,ertl&gregg03], which bring
    the number of dispatches to the same level (one dispatch per control
    flow change), eliminating the advantage gained from register VMs.

- It's easier to keep VM stack items in real-machine registers than VM
    registers. Keeping one stack item in real-machine registers is
    almost trivial, and can be very beneficial (factor 2 speedup on
    PPC970). For more complex schemes, see our work on stack caching
    below.

- Stack-based VMs are simpler to implement, leaving more time for
    implementing highly effective optimizations like dynamic
    superinstructions.

>or has any good suggestions for references about this

Well, this has been discussed here several times, so you can look up
the old discussions. There is one paper that does a direct empirical
comparison [davis+03] (a journal version of this paper appeared
recently in Science of Computer Programming).

@InProceedings{davis+03,
    author = {Brian Davis and Andrew Beatty and Kevin Casey and
        David Gregg and John Waldron},
    title = {The Case for Virtual Register Machines},
    booktitle = {Interpreters, Virtual Machines and Emulators (IVME '03)},
    pages = {41--49},
    year = {2003},
    url1 = {},
    url2 = {},
    abstract = {Virtual machines (VMs) are a popular target for
        language implementers. Conventional wisdom tells us that
        virtual stack architectures can be implemented with an
        interpreter more efficiently, since the location of operands
        is implicit in the stack pointer. In contrast, the operands
        of register machine instructions must be specified
        explicitly. In this paper, we present a working system for
        translating stack-based Java virtual machine (JVM) code to a
        simple register code. We describe the translation process,
        the complicated parts of the JVM which make translation more
        difficult, and the optimisations needed to eliminate copy
        instructions. Experimental results show that a register
        format reduces the number of executed instructions by
        34.88\%, while increasing the number of bytecode loads by an
        average of 44.81\%. Overall, this corresponds to an increase
        of 2.32 loads for each dispatch removed. We believe that the
        high cost of dispatches makes register machines attractive
        even at the cost of increased loads.}
}

@InProceedings{winterbottom&pike97,
    author = "Phil Winterbottom and Rob Pike",
    title = "The Design of the {Inferno} Virtual Machine",
    OPTeditor = "{IEEE}",
    booktitle = "Hot Chips {IX}: Stanford University, Stanford,
        California, August 24--26, 1997",
    publisher = "IEEE Computer Society Press",
    OPTaddress = "1109 Spring Street, Suite 300, Silver Spring, MD
        20910, USA",
    year = "1997",
    ISBN = "????",
    OPTpages = "??--??",
    bibdate = "Mon Jan 08 16:33:30 2001",
    acknowledgement = ack-nhfb,
}

@InProceedings{piumarta&riccardi98,
    author = {Ian Piumarta and Fabio Riccardi},
    title = {Optimizing Direct Threaded Code by Selective Inlining},
    crossref = {sigplan98},
    pages = {291--300},
    url = {},
    annote = {They reduce the overhead of a direct threaded
        interpreter by combining all the VM instructions in a basic
        block into a single virtual machine instruction. This is done
        by simply concatenating the machine code for the virtual
        machine instructions (except for the Next code). Multiple
        instances of the same sequence are just translated once, and
        then reused. They evaluated this technique empirically in the
        context of a fine-grained RISC-like VM, and in an Objective
        Caml interpreter. The speedup over plain direct threaded code
        for the RISC-like VM is a factor of 1.33--2.06. For the Caml
        VM the speedups varied with benchmark and processor, from 1
        to 2.2. The code expansion ranges from 2.2--4 for the Sparc,
        with larger benchmarks having less expansion (due to more
        reuse of sequences). Implementing this technique on the Caml
        VM took only one day.}
}

@Proceedings{sigplan98,
    booktitle = "SIGPLAN '98 Conference on Programming Language
        Design and Implementation",
    title = "SIGPLAN '98 Conference on Programming Language
        Design and Implementation",
    year = "1998",
    key = "SIGPLAN '98"
}

@InProceedings{ertl&gregg03,
    author = "M. Anton Ertl and David Gregg",
    title = "Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters",
    crossref = "sigplan03",
    OPTpages = "",
    url = "",
    abstract = "Interpreters designed for efficiency execute a huge
        number of indirect branches and can spend more than half of
        the execution time in indirect branch mispredictions. Branch
        target buffers are the best widely available form of indirect
        branch prediction; however, their prediction accuracy for
        existing interpreters is only 2\%--50\%. In this paper we
        investigate two methods for improving the prediction accuracy
        of BTBs for interpreters: replicating virtual machine (VM)
        instructions and combining sequences of VM instructions into
        superinstructions. We investigate static (interpreter
        build-time) and dynamic (interpreter run-time) variants of
        these techniques and compare them and several combinations of
        these techniques. These techniques can eliminate nearly all
        of the dispatch branch mispredictions, and have other
        benefits, resulting in speedups by a factor of up to 3.17
        over efficient threaded-code interpreters, and speedups by a
        factor of up to 1.3 over techniques relying on
        superinstructions alone."
}

@Proceedings{sigplan03,
    booktitle = "SIGPLAN '03 Conference on Programming Language
        Design and Implementation",
    title = "SIGPLAN '03 Conference on Programming Language
        Design and Implementation",
    year = "2003",
    key = "SIGPLAN '03"
}

@InProceedings{ertl&gregg04,
    author = {M. Anton Ertl and David Gregg},
    title = {Combining Stack Caching with Dynamic Superinstructions},
    crossref = {ivme04},
    pages = {7--14},
    URL = {},
    abstract = {Dynamic superinstructions eliminate most of the
        interpreter dispatch overhead. This results in a higher
        proportion of interpreter time spent in stack accesses (on a
        stack-based virtual machine). Stack caching reduces the stack
        access overhead. Each of these optimizations provides more
        speedup, if the other one is applied, too. Combining these
        optimizations also opens additional opportunities: we can
        insert stack state transitions without dispatch cost; this
        reduces the number of necessary VM instruction instances
        significantly. A shortest-path search can find the optimal
        sequence of state transitions and VM instructions. In this
        paper we describe an implementation of static stack caching
        employing these ideas. We also present empirical results for
        our implementation, resulting in a speedup of up to 58\% over
        a version that keeps one value in registers all the time.}
}

@Proceedings{ivme04,
    booktitle = {Interpreters, Virtual Machines and Emulators (IVME '04)},
    title = {Interpreters, Virtual Machines and Emulators (IVME '04)},
    year = {2004}
}

@InProceedings{ertl&gregg05,
    author = {M. Anton Ertl and David Gregg},
    title = {Stack Caching in {Forth}},
    crossref = {euroforth05},
    pages = {6--15},
    url = {},
    OPTnote = {not refereed},
    abstract = {Stack caching speeds Forth up by keeping stack items
        in registers, reducing the number of memory accesses for
        stack items. This paper describes our work on extending
        Gforth's stack caching implementation to support more than
        one register in the canonical state, and presents timing
        results for the resulting Forth system. For
        single-representation stack caches, keeping just one stack
        item in registers is usually best, and provides speedups up
        to a factor of 2.84 over the straight-forward stack
        representation. For stack caches with multiple stack
        representations, using the one-register representation as
        canonical representation is usually optimal, resulting in an
        overall speedup of up to a factor of 3.80 (and up to a factor
        of 1.53 over single-representation stack caching).}
}

@Proceedings{euroforth05,
    title = {21st EuroForth Conference},
    booktitle = {21st EuroForth Conference},
    year = {2005},
    key = {EuroForth'05},
    url = {}
}

@Article{vanroy94,
    author = {Van Roy, Peter},
    title = {1983--1993: The Wonder Years of Sequential Prolog Implementation},
    journal = {Journal of Logic Programming},
    year = {1994},
    volume = {19,20},
    pages = {385--441},
    url = {}
}

- anton
M. Anton Ertl
