ultra fast but too complex ... summary

hoesel@igc.ethz.ch (Frans van Hoesel)
Mon, 22 Mar 1993 18:54:44 GMT

          From comp.compilers

Related articles
ultra fast but too complex interpreter. hoesel@igc.ethz.ch (1993-03-17)
ultra fast but too complex ... summary hoesel@igc.ethz.ch (1993-03-22)
Re: ultra fast but too complex ... summary fjh@mundil.cs.mu.OZ.AU (1993-03-23)
ultra fast but too complex ... appendum hoesel@igc.ethz.ch (1993-03-24)

Newsgroups: comp.compilers
From: hoesel@igc.ethz.ch (Frans van Hoesel)
Keywords: interpreter, summary
Organization: University of Groningen, the Netherlands
References: 93-03-060
Date: Mon, 22 Mar 1993 18:54:44 GMT

Hello everybody


This is the requested summary of what I have learned so far from the
combined knowledge of the people on the net.


First, the approach of having many functions like add_i0_i1_i4 and
executing the program as a big list of function addresses.
Some people replied warning me about a possible cache problem:


From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
        This approach might be slower than you expect due to cache thrashing.
        If you want fewer instructions, make a general version (with explicit
        register numbers, like 3) and create a specialized version only for
        frequently occurring instruction/reg combinations.


        It seems you are creating a C function for every instruction. Throwing
        everything into a big switch is faster (since you can keep the
        interpreters state in registers). Best of all is direct threading with
        GNU C's "labels as values" feature.
From: "Stephen P Spackman" <spackman@dfki.uni-sb.de>
        I think that on a modern machine the many many functions approach will
        be a performance *lose* because of cache thrashing.... At least TRY the
        bitbanging approach and compare.
From: basker@diku.dk (Tom Thuneby)
        A general observation is that caches and/or TLBs can make bigger programs
        slower. I thought about writing a M68K interpreter for the PC, but I
        don't think I can get the instruction decoding fast enough: if I use some
        kind of bit-wise decoding, it would take perhaps 10 instructions (on the
        average) to decode each instruction. On the other hand, if I do it with
        a 65536-way jump table, I'll be actively using an amount of memory
        somewhat greater than the TLB can cover. I don't know the exact
        effects of this, but if each TLB miss causes a 50 cycle delay, this
        latter option doesn't look that appealing (as a two-instruction decode
        sequence would otherwise look) I don't know if this observation is of
        any help to you, but I would suggest that you did some calculations on
        your memory needs for each of the three approaches that you mention.


This surprised me! Not the fact that the many functions wouldn't fit into
the cache, but the fact that the other approaches possibly would. An
interpreter in 8K (words) on a RISC machine (SGI Indigo)??!? This was
the response that I never expected (I should have known better; I once
wrote a complete disassembler for the 6502 processor in 254 bytes... but
this isn't comp.hackers):


From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
        Unless you have dozens of data types, a stack machine interpreter or a
        register machine interpreter with explicit registers fits into the
        cache. E.g., here's the size of a stack machine with 193 primitives on
        a Decstation.


        [mips.complang.tuwien.ac.at:8] size engine.o
        text data bss dec hex
        10528 928 0 11456 2cc0


(I assume that this is in bytes! Anyway, cache sizes tend to get bigger
and bigger.) I also had objections against my second approach, a simple
stack machine. What did other people write about that, and why can a stack
machine be fast too:


From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
        First of all, the stack pointer will be in a register (in a
        switch-based interpreter). Then you can access stack locations in one
        instruction with current processors (except the 29k). Secondly, you can
        keep the top of stack (TOS) in a register, so you need only one stack
        access for an instruction like add. Thirdly, the interpreter overhead
        will outweigh the rest, especially if you use switch or functions. On
        the R3000 a switch costs 12 cycles; the add for a stack machine as
        described above costs 3 instructions+interpreter overhead.


        E.g., here's "add" for a direct threaded stack machine compiled with gcc
        for the R3000:


lw $2,4($22)
lw $8,0($21)
addu $22,$22,4
addu $21,$21,4
.set noreorder
.set nomacro
j $8
addu $23,$2,$23
.set macro
.set reorder


        $22 = sp, $21 = ip, $23 = TOS
        The one time I had a choice I chose the stack machine, for simplicity.


From: David Gay <dgay@di.epfl.ch>
        I'm also confronting a similar decision (also for my own language),
        but I've currently implemented approach 2 (for simplicity, I wanted a
        working version rapidly).


From: larry@titan.tsd.arlut.utexas.edu (Larry Maturo)
        A lot depends on if you are planning to do any optimizations. In a
        previous life I worked for a company that decided to go with the many
        unique function approach and found that it was then impossible to
        apply any standard compiler optimizations on the code. Because of
        this the interpreter was actually slower than if they had just gone
        with a normal stack machine emulator whose code could have been highly
        optimized. Also, the code that got executed was larger because of
        many unique instructions and caused lots of swapping because it was
        large, slowing things down even more.


The third approach: having a few register oriented instructions which are
decoded by bit fiddling:


From: jhummel@esp.ICS.UCI.edu
        How about (3), where your "registers" in your interpreter are actually
        stored in an array, and then simply use the small number passed to
        the one add function to index into the array. You'll have to do a little
        bit manipulation, but no big switch stmt would be needed.


From: David Gay <dgay@di.epfl.ch>
        I decided that it wasn't really possible to tell in advance and that I
        would implement option 3, and compare it with 2 ... I didn't think of
        option 1. I also found (while optimising approach 2) that things that
        make the code faster on some machines don't on others. So the answer
        may not be clear-cut.
        It all depends really ... The stack has to be stored somewhere too
        after all. If you have the registers stored in an array which is local
        to the interpreter (the interpreter could access the registers and
        pass them as parameters to the functions doing the work if these are
        not written inside the interpreter function), the cost should be
        comparable to having a stack. And I doubt if the cost of accessing a
        global array vs a local one will be very significant with respect to
        the other decoding costs.


From: Gene (No address, my mistake .. I lost your header in the editor)
        Why not just use an array for the register file and encode the
        register numbers in instruction fields? Computing two or three array
        references will require only a few instructions. Then you need only
        one add function. 1500 functions are going to have bad code cache
        performance. You may find that the array method actually runs faster
        on `real' programs with a wide variety of instructions.


Many responses were somehow related to threaded code. The best reference
to that approach, and really *good* reading, is in the archives of
comp.compilers, ftp-able from primost.cs.wisc.edu (128.105.2.115); get the
91-03 & 91-04 files. The answers:


From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
        It seems you are creating a C function for every instruction. Throwing
        everything into a big switch is faster (since you can keep the
        interpreters state in registers). Best of all is direct threading with
        GNU C's "labels as values" feature.
        > I admit that using expressions as a label would be great, but not useful
        > in my strictly ansi environment.
        One ANSI-compliant way to direct threaded code is converting the
        function calling interpreter to continuation-passing style, but that
        requires compilers that perform tail call optimization.
        There are some papers about it in the SIGPLAN proceedings and a book
        by Andrew Appel (I prefer his papers).


        Applied to the interpreter problem it would look like this:


        typedef void (*Inst)(Inst *, ...);
        /* don't know if I got this syntax right, anyway, it's a pointer to a
          * function like the one below */


        void some_instruction(Inst *ip, ... /*other interpreter state variables*/)
        {
            ... /* action of the instruction */
            (*ip)(ip+1, ...); /* and now execute the next instruction: */
        }


        If the compiler would perform tail call optimization, this would result
        in direct threaded code. Since it does not, it results in a stack
        overflow.


From: hosking@kiwi.cs.umass.edu (Tony Hosking)
        Our Smalltalk interpreter here at UMass uses a semi-threaded code
        approach. The entire interpreter is written in GNU C, with heavy use
        of the GNU C labels-as-values extension and register variables for the
        registers of the interpreter (or "virtual machine"). Smalltalk code
        is compiled to bytecodes in the standard way (see Goldberg and Robson,
        "Smalltalk-80: The Language and its Implementation", Addison-Wesley,
        1983). Each bytecode is implemented as a labelled fragment of C code,
        and we store the label's value (i.e., the code fragment's address) in
        a global array. Bytecode dispatch involves indexing this global table
        with the bytecode, and branching to the corresponding code fragment.
        Our interpreter is both portable (to any machine to which GNU C has
        been ported) and fast, benchmarking at around 300% of the Xerox Dorado
        micro-coded Smalltalk on a Sparc 2 (about half the speed of commercial
        implementations that compile to native code). If we were to go to
        fully threaded code we would see even better performance.
        > what is semi-threaded code?
        By semi-threaded code I mean that the instructions of the interpreter are
        those for a stack-based virtual machine, so all "compiled" Smalltalk code
        still consists of these stack-based bytecodes. Our dispatch mechanism
        involves indexing the global bytecode table with the next virtual machine
        instruction to be executed to obtain an address to branch to. A fully
        threaded approach would compile all Smalltalk code to sequences of target
        addresses rather than sequences of bytecodes, eliminating the need to
        index the table. To illustrate, here is the C code for one of our
        bytecodes, which pushes the contents of the 0th temporary variable on
        the stack:


        *--sp = fp[0]; /* push 0th temporary variable's value */
        goto *byte_codes[*pc++]; /* dispatch on next bytecode */


        Fully threaded code would be:


        *--sp = fp[0];
        goto *pc++;


        Code such as a=12345678 would be a sequence of bytecodes like:


        1: push constant 12345678
        2: pop & store top of stack to frame variable n (where a is the nth
        variable in the stack frame)


        a=b+1234 would come out as:


        1: push variable m (where b is mth variable in stack frame)
        2: push constant 1234
        3: add (for Smalltalk this means send "add" message to object b)
        4: pop & store top of stack (result of add) to variable n (where a is the
        nth variable in the stack frame)


        We could probably do a better job by moving to a register load-store, as
        opposed to stack-based, instruction set, but the "standard" definition as
        defined in Goldberg and Robson is what we went with. I would note that more
        recent commercial Smalltalk releases (notably that from ParcPlace) have
        moved to improved instruction sets.


and some final remarks from various people:


From: gunter@FICUS.CS.UCLA.EDU (Michial Gunter)
        > My interpreter of a language of my own (yes again a data processing
        > language)


        This seems to indicate that you'll be doing a lot of IO. If this is
        the case and/or your primitives are high-level I expect your programs
        will spend almost all of their time in these places (so it's not worth
        the effort to make the other stuff fast.)


From: pardo@cs.washington.edu
        >[How do I write a fast interpreter?]


        See:


        %A Robert Bedichek
        %T Some Efficient Architecture Simulation Techniques
        %J Winter '90 USENIX Conference
        %D 26 October, 1989
        %P 53-63


        %A Craig Chambers
        %A David Ungar
        %T Making Pure Object-Oriented Languages Practical
        %J OOPSLA '91 Proceedings; SIGPLAN Notices
        %V 26
        %N 11
        %P 1-15
        %C Phoenix, AZ
        %D November 1991


        %A Peter Deutsch
        %A Alan M. Schiffman
        %T Efficient Implementation of the Smalltalk-80 System
        %J 11th Annual Symposium on Principles of Programming Languages
        (POPL 11)
        %D January 1984
        %P 297-302


        They all said it much better than I.


;-D on ( Stock in tradeoff ) Pardo




OK, I hope I didn't mangle the original postings too much, but I had to
order them myself, to make them read even better than the originals.
Some sentences may look strange, as if you are missing parts of the
conversation, and you are! (I did not keep my 'private' mail with some of
the people above. Don't panic, I know what I wrote to them.)


My conclusions:
gnu-cc is faster for all approaches (it has global register variables and
labels-as-values), but the features needed from gnu-c are *NOT* ANSI.
Who cares, you could say; almost any machine has gnu-c... forget
about ANSI. I won't! I need optimizing compilers, and vectorizing
compilers, to handle the primitive functions that will process my vector
data (those functions would be highly optimized). Threading is only really
fast in non-standard environments; I like the concept very much, but it
will not work in ANSI C. Mike is right that lots of users will do a great
deal of IO, so optimizing the interpreter to its limits isn't worth the
trouble. There is one objection to this: casual users tend to
forget about the existence of Fortran, C, Pascal, or whatever other
language they speak, and *do* program algorithms that have nothing to do
with IO or data processing. They use it as a convenient language (I have
seen this happen more than once... the ultimate being the use of my
language simply to add 123 to 567, uughh).


I'm not really sure yet, but unless more of you (the people on the net)
convince me of the opposite, I will go for the simpler stack approach
(difficult enough in itself anyway). Maybe I'm just afraid of writing a
full-blown 'compiler'.


Any comments on the comments above, or on my 'conclusion' (note the little
quotes, indicating I'm still not very sure), are appreciated very much.


-- frans
---- === frans van hoesel hoesel@igc.ethz.ch
--

