Newsgroups: comp.compilers
From: hoesel@igc.ethz.ch (Frans van Hoesel)
Keywords: interpreter, summary
Organization: University of Groningen, the Netherlands
References: 93-03-060
Date: Mon, 22 Mar 1993 18:54:44 GMT
Hello everybody,
This is the requested summary of what I have learned so far from the
combined knowledge of the people on the net.
First, the approach of having many functions like add_i0_i1_i4 and
executing the program as a big list of function addresses.
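As a minimal sketch of what I mean (the names and the driver loop here
are invented for illustration; the real interpreter would have on the
order of the 1500 specialized functions Gene mentions below):

/* Approach 1 (sketch): one specialized C function per
 * instruction/register combination; a program is just an array of
 * function pointers that the driver calls one after another. */
typedef void (*Op)(void);

static double reg[16];          /* the interpreter's registers */
static int running;

static void add_i0_i1_i4(void) { reg[4] = reg[0] + reg[1]; }
static void op_halt(void)      { running = 0; }

static Op program[] = { add_i0_i1_i4, op_halt };

void run(void)
{
    Op *ip = program;
    running = 1;
    while (running)
        (*ip++)();              /* call the next instruction */
}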
Some people replied, warning me about possible cache problems:
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
This approach might be slower than you expect due to cache thrashing.
If you want fewer instructions, make a general version (with explicit
register numbers, like approach 3) and create a specialized version
only for frequently occurring instruction/register combinations.
It seems you are creating a C function for every instruction. Throwing
everything into a big switch is faster (since you can keep the
interpreter's state in registers). Best of all is direct threading with
GNU C's "labels as values" feature.
From: "Stephen P Spackman" <spackman@dfki.uni-sb.de>
I think that on a modern machine the many many functions approach will
be a performance *lose* because of cache thrashing.... At least TRY the
bitbanging approach and compare.
From: basker@diku.dk (Tom Thuneby)
A general observation is that caches and/or TLBs can make bigger programs
slower. I thought about writing a M68K interpreter for the PC, but I
don't think I can get the instruction decoding fast enough: if I use some
kind of bit-wise decoding, it would take perhaps 10 instructions (on the
average) to decode each instruction. On the other hand, if I do it with
a 65536-way jump table, I'll be actively using an amount of memory
somewhat greater than the TLB can cover. I don't know the exact
effects of this, but if each TLB miss causes a 50 cycle delay, this
latter option doesn't look as appealing as a two-instruction decode
sequence otherwise would. I don't know if this observation is of any
help to you, but I would suggest that you do some calculations on
your memory needs for each of the three approaches that you mention.
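(To put rough numbers on Tom's point: assuming 4-byte pointers and 4 KB
pages, a 65536-way jump table is 65536 * 4 = 256 KB, or 64 pages; a TLB
with only a few dozen entries cannot cover that, so the dispatch table
alone can miss constantly.)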
This surprised me! Not the fact that the many-functions approach
wouldn't fit into the cache, but the fact that the other approaches
possibly would. An interpreter in 8K (words) on a RISC machine (an SGI
Indigo)??!? This was the response that I never expected (I should have
known better, I once wrote a complete disassembler for the 6502
processor in 254 bytes... but this isn't comp.hackers):
From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
Unless you have dozens of data types, a stack machine interpreter or a
register machine interpreter with explicit registers fits into the
cache. E.g., here's the size of a stack machine with 193 primitives on
a Decstation.
[mips.complang.tuwien.ac.at:8] size engine.o
text data bss dec hex
10528 928 0 11456 2cc0
(I assume that this is in bytes! Anyway, cache sizes tend to get bigger
and bigger.) I also had my own objections against my second approach, a
simple stack machine. What did other people write about that, and why
can a stack machine be fast too?
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
First of all, the stack pointer will be in a register (in a
switch-based interpreter). Then you can access stack locations in one
instruction with current processors (except the 29k). Secondly, you can
keep the top of stack (TOS) in a register, so you need only one stack
access for an instruction like add. Thirdly, the interpreter overhead
will outweigh the rest, especially if you use switch or functions. On
the R3000 a switch costs 12 cycles; the add for a stack machine as
described above costs 3 instructions+interpreter overhead.
E.g., here's "add" for a direct threaded stack machine compiled with gcc
for the R3000:
lw $2,4($22)
lw $8,0($21)
addu $22,$22,4
addu $21,$21,4
.set noreorder
.set nomacro
j $8
addu $23,$2,$23
.set macro
.set reorder
$22 = sp, $21 = ip, $23 = TOS (note that the add itself sits in the
delay slot of the jump to the next instruction)
The one time I had a choice I chose the stack machine, for simplicity.
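To see what Anton's three-instruction add looks like at the source
level, here is a little sketch of my own (not Anton's code; the opcodes
and names are invented) of a switch-based stack machine that keeps the
TOS in a local variable:

/* Sketch: switch-based stack machine with the top of stack (TOS)
 * cached in a local variable, so "add" needs only one stack access. */
enum { OP_LIT, OP_ADD, OP_HALT };

long stack_interp(long *code)
{
    long stack[256];
    long *sp = stack + 256;     /* the stack grows downward */
    long *ip = code;
    long tos = 0;

    for (;;) {
        switch (*ip++) {
        case OP_LIT:  *--sp = tos; tos = *ip++; break; /* push literal */
        case OP_ADD:  tos += *sp++; break;  /* only one stack access */
        case OP_HALT: return tos;
        }
    }
}

A program like {OP_LIT,2, OP_LIT,3, OP_ADD, OP_HALT} returns 5 (the
dummy TOS pushed by the first OP_LIT is harmless in this sketch), and
with gcc the sp, ip and tos variables end up in registers, which is
exactly the point Anton makes above.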
From: " (David Gay)" <dgay@di.epfl.ch>
I'm also confronting a similar decision (also for my own language),
but I've currently implemented approach 2 (for simplicity, I wanted a
working version rapidly).
From: larry@titan.tsd.arlut.utexas.edu (Larry Maturo)
A lot depends on whether you are planning to do any optimizations. In a
previous life I worked for a company that decided to go with the many
unique function approach and found that it was then impossible to
apply any standard compiler optimizations on the code. Because of
this the interpreter was actually slower than if they had just gone
with a normal stack machine emulator whose code could have been highly
optimized. Also, the code that got executed was larger because of the
many unique functions, and its sheer size caused lots of swapping,
slowing things down even more.
The third approach: having a few register-oriented instructions which
are decoded by bit fiddling:
>From jhummel@esp.ICS.UCI.edu Thu Mar 18 12:00:03 1993
How about (3), where the "registers" in your interpreter are actually
stored in an array, and you simply use the small number passed to the
one add function to index into the array? You'll have to do a little
bit manipulation, but no big switch statement would be needed.
From: David Gay <dgay@di.epfl.ch>
I decided that it wasn't really possible to tell in advance and that I
would implement option 3, and compare it with 2 ... I didn't think of
option 1. I also found (while optimising approach 2) that things that
make the code faster on some machines don't on others. So the answer
may not be clear-cut.
It all depends really ... The stack has to be stored somewhere too
after all. If you have the registers stored in an array which is local
to the interpreter (the interpreter could access the registers and
pass them as parameters to the functions doing the work if these are
not written inside the interpreter function), the cost should be
comparable to having a stack. And I doubt if the cost of accessing a
global array vs a local one will be very significant with respect to
the other decoding costs.
From: Gene (no address; my mistake, I lost your header in the editor)
Why not just use an array for the register file and encode the
register numbers in instruction fields? Computing two or three array
references will require only a few instructions. Then you need only
one add function. 1500 functions are going to have bad code cache
performance. You may find that the array method actually runs faster
on `real' programs with a wide variety of instructions.
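To illustrate what Gene and jhummel suggest, here is my own sketch (the
5-bit fields and the names are arbitrary choices of mine):

/* Sketch: the register file is an array; each instruction packs an
 * opcode and three 5-bit register numbers into one 32-bit word. */
#define R_DST(i)  (((i) >> 10) & 0x1f)
#define R_SRC1(i) (((i) >>  5) & 0x1f)
#define R_SRC2(i) ( (i)        & 0x1f)

static long reg[32];            /* the "registers" live in an array */

static void do_add(unsigned long inst)
{
    reg[R_DST(inst)] = reg[R_SRC1(inst)] + reg[R_SRC2(inst)];
}

One add function serves all register combinations, at the cost of a few
shift-and-mask instructions per operand.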
Many responses were somehow related to threaded code. The best
reference to that approach, and really *good* reading, is in the
archives of comp.compilers, ftp-able from primost.cs.wisc.edu
(128.105.2.115); get the 91-03 & 91-04 files. The answers:
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Best of all is direct threading with
GNU C's "labels as values" feature.
> I admit that using expressions as a label would be great, but not useful
> in my strictly ansi environment.
One ANSI-compliant way to get direct threaded code is to convert the
function-calling interpreter to continuation-passing style, but that
requires a compiler that performs tail call optimization.
There are some papers about it in the SIGPLAN proceedings and a book
by Andrew Appel (I prefer his papers).
Applied to the interpreter problem it would look like this:
typedef void (*Inst)();
/* a pointer to an instruction function like the one below; ANSI C
 * cannot express the self-referential parameter type directly, so
 * the parameter list is left unspecified */
void some_instruction(Inst *ip /*, other interpreter state variables */)
{
    /* ... action of the instruction ... */
    (*ip)(ip+1 /*, ... */); /* and now execute the next instruction */
}
If the compiler performed tail call optimization, this would result
in direct threaded code. Since it does not, it results in a stack
overflow.
From: hosking@kiwi.cs.umass.edu (Tony Hosking)
Our Smalltalk interpreter here at UMass uses a semi-threaded code
approach. The entire interpreter is written in GNU C, with heavy use
of the GNU C labels-as-values extension and register variables for the
registers of the interpreter (or "virtual machine"). Smalltalk code
is compiled to bytecodes in the standard way (see Goldberg and Robson,
"Smalltalk-80: The Language and its Implementation", Addison-Wesley,
1983). Each bytecode is implemented as a labelled fragment of C code,
and we store the label's value (i.e., the code fragment's address) in
a global array. Bytecode dispatch involves indexing this global table
with the bytecode, and branching to the corresponding code fragment.
Our interpreter is both portable (to any machine to which GNU C has
been ported) and fast, benchmarking at around 300% of the Xerox Dorado
micro-coded Smalltalk on a Sparc 2 (about half the speed of commercial
implementations that compile to native code). If we were to go to
fully threaded code we would see even better performance.
> what is semi-threaded code?
By semi-threaded code I mean that the instructions of the interpreter are
those for a stack-based virtual machine, so all "compiled" Smalltalk code
still consists of these stack-based bytecodes. Our dispatch mechanism
involves indexing the global bytecode table with the next virtual machine
instruction to be executed to obtain an address to branch to. A fully
threaded approach would compile all Smalltalk code to sequences of target
addresses rather than sequences of bytecodes, eliminating the need to
index the table. To illustrate, here is the C code for one of our
bytecodes, which pushes the contents of the 0th temporary variable on
the stack:
*--sp = fp[0]; /* push 0th temporary variable's value */
goto *byte_codes[*pc++]; /* dispatch on next bytecode */
Fully threaded code would be:
*--sp = fp[0];
goto *pc++;
Code such as a=12345678 would be a sequence of bytecodes like:
1: push constant 12345678
2: pop & store top of stack to frame variable n (where a is the nth
variable in the stack frame)
a=b+1234 would come out as:
1: push variable m (where b is mth variable in stack frame)
2: push constant 1234
3: add (for Smalltalk this means send "add" message to object b)
4: pop & store top of stack (result of add) to variable n (where a is the
nth variable in the stack frame)
We could probably do a better job by moving to a register load-store, as
opposed to stack-based, instruction set, but the "standard" definition as
defined in Goldberg and Robson is what we went with. I would note that more
recent commercial Smalltalk releases (notably that from ParcPlace) have
moved to improved instruction sets.
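For completeness, here is my own minimal sketch (mine, not the UMass
code) of the labels-as-values dispatch Tony describes; this is GNU C
only, which is exactly my problem with it:

/* GNU C only: semi-threaded dispatch with the labels-as-values
 * extension.  Each bytecode is a labelled fragment of C code, and
 * dispatch indexes the byte_codes table and branches there. */
long interp(unsigned char *pc, long *sp, long *fp)
{
    static void *byte_codes[] = { &&push_temp0, &&halt };

    goto *byte_codes[*pc++];    /* initial dispatch */

push_temp0:
    *--sp = fp[0];              /* push 0th temporary variable */
    goto *byte_codes[*pc++];    /* dispatch on next bytecode */

halt:
    return *sp;                 /* result is on top of the stack */
}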
and some final remarks from various people:
From: gunter@FICUS.CS.UCLA.EDU (Michial Gunter)
> My interpreter of a language of my own (yes again a data processing
> language)
This seems to indicate that you'll be doing a lot of IO. If this is
the case and/or your primitives are high-level I expect your programs
will spend almost all of their time in these places (so it's not worth
the effort to make the other stuff fast).
From: pardo@cs.washington.edu
>[How do I write a fast interpreter?]
See:
%A Robert Bedichek
%T Some Efficient Architecture Simulation Techniques
%J Winter '90 USENIX Conference
%D 26 October, 1989
%P 53-63
%A Craig Chambers
%A David Ungar
%T Making Pure Object-Oriented Languages Practical
%J OOPSLA '91 Proceedings; SIGPLAN Notices
%V 26
%N 11
%P 1-15
%C Phoenix, AZ
%D November 1991
%A Peter Deutsch
%A Alan M. Schiffman
%T Efficient Implementation of the Smalltalk-80 System
%J 11th Annual Symposium on Principles of Programming Languages
(POPL 11)
%D January 1984
%P 297-302
They all said it much better than I.
;-D on ( Stock in tradeoff ) Pardo
OK, I hope I didn't mangle the original postings too much, but I had to
order them myself so that they make even more sense than the originals.
Some sentences may look strange, as if you are missing some parts of the
conversation, and you are! (I did not keep my 'private' mail with some
of the people above. Don't panic, I know what I wrote to them.)
My conclusions:
GNU CC is faster for all the approaches (it has global registers and
labels-as-values), but the features needed from GNU C are *NOT* ANSI.
Well, who cares, you could say; almost any machine has GNU C... forget
about ANSI. I won't! I need optimizing compilers and vectorizing
compilers to handle the primitive functions that will process my vector
data (those functions would be highly optimized). Threading is only
really fast in non-standard environments; I like the concept very much,
but it will not work in ANSI C. Mike is right that lots of users will
do a great deal of IO, so optimizing the interpreter to its limits
isn't worth the trouble. There is one objection against this: casual
users tend to forget about the existence of Fortran, C, Pascal, or
whatever other language they speak, and *do* program algorithms that
have nothing to do with IO or data processing. They use it as a
convenient language (I have seen this happen more than once... the
ultimate being the use of my language simply to add 123 to 567, uughh).
I'm not really sure yet, but unless more of you (the people on the net)
convince me of the opposite, I will go for the simpler stack approach
(difficult enough in itself anyway). Maybe I'm just afraid of writing a
full-blown 'compiler'.
Any comments on the comments above, or on my 'conclusions' (note the
little quotes, indicating I'm still not very sure) are appreciated very
much.
-- frans
---- === frans van hoesel hoesel@igc.ethz.ch