Re: virtual machine efficiency (Anton Ertl)
30 Dec 2004 00:57:29 -0500

          From comp.compilers


From: (Anton Ertl)
Newsgroups: comp.compilers,comp.lang.misc
Followup-To: comp.compilers
Date: 30 Dec 2004 00:57:29 -0500
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 04-12-151
Keywords: VM, performance
Posted-Date: 30 Dec 2004 00:57:29 EST

"ed_davis2" <> writes:
>I have developed a simple stack based virtual machine. I would like
>it to be as efficient as possible, in terms of both speed and object
>code size. Size, as in size of the object code the VM processes
>(byte-code), and not necessarily the size of the VM.
>But I've run into one of those situations where I can't figure out how
>to have both.

You can't. The fastest techniques (read some of my papers) use more
memory than some other techniques. However, there are some techniques
that are good compromises between speed and space consumption; read:

@InProceedings{proebsting95,
    author = "Todd A. Proebsting",
    title = "Optimizing an {ANSI~C} Interpreter with Superoperators",
    crossref = "popl95",
    pages = "322--332",
    annote = "Interpreter performance is optimized by combining
        operators during code generation, when they are still
        organized as trees. So a different, optimized interpreter
        is used for each program. Speedups of 1.8--3.1 are
        achieved, but this is probably strongly dependent on the
        original set of operators. The paper uses lcc's
        intermediate code operators \cite{fraser&hanson91a}."
}

@Proceedings{popl95,
    booktitle = "Principles of Programming Languages (POPL '95)",
    title = "Principles of Programming Languages (POPL '95)",
    year = "1995",
    key = "POPL '95"
}

@string{spe="Software---Practice and Experience"}

@Article{hoogerbrugge+99,
    author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
              and Rik van de Wiel",
    title = "A code compression system based on pipelined
             interpreters",
    journal = spe,
    volume = "29",
    number = "11",
    pages = "1005--1023",
    month = sep,
    year = "1999",
    OPTannote = ""
}

@InProceedings{latendresse&feeley03,
    author = {Mario Latendresse and Marc Feeley},
    title = {Generation of Fast Interpreters for {Huffman}
             Compressed Bytecode},
    booktitle = {Interpreters, Virtual Machines and Emulators
                 (IVME '03)},
    pages = {32--40},
    year = {2003},
    abstract = {Embedded systems often have severe memory
        constraints requiring careful encoding of
        programs. For example, smart cards have on the order
        of 1K of RAM, 16K of non-volatile memory, and 24K of
        ROM. A virtual machine can be an effective approach
        to obtain compact programs but instructions are
        commonly encoded using one byte for the opcode and
        multiple bytes for the operands, which can be
        wasteful and thus limit the size of programs
        runnable on embedded systems. Our approach uses
        canonical Huffman codes to generate compact opcodes
        with custom-sized operand fields and with a virtual
        machine that directly executes this compact code. We
        present techniques to automatically generate the new
        instruction formats and the decoder. In effect, this
        automatically creates both an instruction set for a
        customized virtual machine and an implementation of
        that machine. We demonstrate that, without prior
        decompression, fast decoding of these virtual
        compressed instructions is feasible. Through
        experiments on Scheme and Java, we demonstrate the
        speed of these decoders. Java benchmarks show an
        average execution slowdown of 9\%. Compression
        factors highly depend on the original bytecode and
        the training sample, but typically vary from 30\% to
        60\%.}
}

>All of the opcodes are one byte in size. Given the instruction:
>push address
>'push' takes one byte, and the address takes 4 bytes. If I pack
>the code into a byte array, this takes 5 bytes. However, now
>that means that address isn't on a word boundary. If I load
>'address' using:
>unsigned char *code;
>operand = *(int *)code;
>I incur a speed hit on many processors, and it may even fail on
>others, since code probably isn't suitably aligned.
>But if I use:
>memcpy(&operand, code, sizeof(operand));
>or:
>operand =
>(code[pc]) |
>(code[pc + 1] << 8) |
>(code[pc + 2] << 16) |
>(code[pc + 3] << 24);
>I don't incur the misaligned speed hit, but it is nowhere near as
>fast as "operand = *(int *)code;" could be, if code were suitably
>aligned.
>I could make all opcodes a word in size, and make the code array an
>array of integers. But that would dramatically increase the size of
>the byte-code required by the VM.

Not necessarily. If you use superinstructions, you may have
relatively few VM instructions compared to operands; with dynamic
superinstructions you can get it down to one VM instruction per VM
branch target (but you need space for the dynamically generated
superinstruction code; I don't know whether you count that as part of
the VM, whose size is less of a concern for you).

>Is there a way I can have both fast loading of operands, and
>small byte-code size?

This particular problem has been addressed by Ogata et al. [ogata+02],
but I am not sure that I would use it if I could define my own
bytecode (they wanted to do JVM interpretation without rewriting the
JVM code).

Other techniques that come to mind:

- Use byte-sized operands where possible.

- Collect VM instructions and byte-sized operands within words, but
let word-sized operands start at the next word; this is somewhat like
the two-stream (sections) idea suggested by our esteemed moderator,
but reduces the overhead of dealing with two instruction pointers on
VM control flow. Example of such VM code (for a machine with 4-byte
words):

inst 1 | byte-operand 1 of inst 1 | inst 2 | inst 3
word-operand 1 of inst 1
word-operand 2 of inst 1
word-operand 1 of inst 2
word-operand 1 of inst 3
byte-operand 1 of inst 3 | inst 4 ...

Various methods for implementing such a scheme efficiently come to
mind, but I won't go into that now.

@InProceedings{ogata+02,
    author = {Kazunori Ogata and Hideaki Komatsu and Toshio
              Nakatani},
    title = {Bytecode Fetch Optimization for a {Java} Interpreter},
    crossref = {asplos02},
    pages = {58--67},
    annote = {The paper presents a Java bytecode interpreter for
        the PowerPC architecture and some optimizations and
        evaluates them: stack caching (a few variations),
        position-based handler customization and
        position-based speculative decoding (software
        pipelining of the interpreter). Position-based
        handler customization deals with different
        alignments of bytecodes by having four states in the
        interpreter for the different alignments, each state
        with its own specialized copy of the
        interpreter. For stack caching they evaluated a
        fixed one-TOS-register organization with
        write-through caching (5.6\% speedup over base), and
        dynamic stack caching with two registers (3 states,
        7.9\% speedup over base), and used the write-through
        organization for further experiments; write-through
        is not compared empirically to
        write-back. Position-based handler customization
        buys another 19\%, and software pipelining an
        additional 3.4\%. The paper also presents results on
        memory traffic (both I and D).}
}

@Proceedings{asplos02,
    title = "Architectural Support for Programming Languages and
             Operating Systems (ASPLOS-X)",
    booktitle = "Architectural Support for Programming Languages and
             Operating Systems (ASPLOS-X)",
    year = "2002",
    key = "ASPLOS-X"
}

Followups set to comp.compilers.

- anton
M. Anton Ertl
