Portable Fast Direct Threaded Code

Eliot Miranda <eliot@.cs.qmw.ac.uk>
28 Mar 91 12:20:06 GMT

          From comp.compilers

Related articles
Portable Fast Direct Threaded Code eliot@.cs.qmw.ac.uk (Eliot Miranda) (1991-03-28)
Re: Portable Fast Direct Threaded Code Tom.Lane@G.GP.CS.CMU.EDU (1991-04-01)
Re: Portable Fast Direct Threaded Code metzger@watson.ibm.com (1991-04-02)
Re: Portable Fast Direct Threaded Code pardo@cs.washington.edu (1991-04-02)
Re: Portable Fast Direct Threaded Code vestal@SRC.Honeywell.COM (1991-04-03)
Re: Portable Fast Direct Threaded Code firth@sei.cmu.edu (1991-04-04)
Re: Portable Fast Direct Threaded Code pardo@cs.washington.edu (1991-04-04)
[1 later articles]
| List of all articles for this month |
Newsgroups: comp.compilers
From: Eliot Miranda <eliot@.cs.qmw.ac.uk>
Keywords: interpreter, design
Organization: Computer Science Dept, QMW, University of London, UK.
Date: 28 Mar 91 12:20:06 GMT

Various people have asked me for details on how I do threaded code
in my dynamic translation Smalltalk-80 VM. So here's the gory details
as well as the first published benchmarks for the system.


How to do "Machine-Independent" Fast Direct Threaded Code:




First off, use C (although other flexible machine-oriented imperative
languages would probably work too).


Global registers:
If you can use GCC >v1.33 you can use global register variables to
hold the threaded code machine's registers. If you have various forms of
stupid C compiler then you can get global register variables by declaring
your globals as register variables in every function, and later editing the
assembler generated by the C compiler to remove global register saves &
restores (details in [Miranda]).




Threaded Code:
Threaded code instructions (TCIs) are written as C procedures.
They are compiled to assembler by the C compiler. Subsequently a simple
editor script is run on the assembler to remove the prologues and epilogues
from the threaded code procedures, and possibly to insert direct threaded
code jumps.


How to Identify Threaded Code Instructions:
The method I prefer is to segregate TCIs from other procedures &
functions in the machine by placing related groups of TCIs in separate
source files. I give my threaded code source files the .tc extension and
they have a rule in the makefile that will run the editor script on the
assembler. An alternative is to identify each threaded code procedure with
a special prefix which is spotted by the editor script. This is probably
more error prone & doesn't fit well with leaf-routine optimization (see
below).


How to Write Threaded Code Instructions:
Each TCI is writen an a void function of no arguments. It is wise to start
and end each TCI with two special macros to replace '{' and '}'. So, using
GCC on the SPARC, given some declarations:




typedef void (*TCODE)(); /* a TCODE is a pointer to a function */
typedef ???? ObjectPointer;


. . .


register TCODE *tcip asm("%g7"); /*threaded code instruction pointer*/
register ObjectPointer *stackPointer asm("%g5");


e.g. popStack would be written:


void popStack()
TBEGIN
stackPointer--;
TEND


With GCC TBEGIN is


#define TBEGIN {


With stupid C compilers it can be defined to be the list of global register
variables. Further, if you want to build a debuger for your threaded code
machine you could compile the system with


#define TBEGIN { int frig = checkForBreakPoint();


and ignore lots of warnings about variable frig being unused :-).


TEND has to do a direct threaded code jump. In my system I want an indirect
post-increment jump on tcip; i.e. jump to *tcip++. On the SPARC with tcip
in %g7 the jump is


ld [%g7],%o0 ! get *tcip
jmpl %o0,%g0 ! jump to it
add %g7,4,%g7 ! increment tcip in the jump's delay slot


On the 68k with tcip in a5 the jump is


movl a5@+,a0
jmp a0@


With GCC this is implemented by the JUMPNEXT macro. On the SPARC:
#define JUMPNEXT do{ \
asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");\
return;}while(0)


Note the return, it tells the compiler that control does not pass this point.
On the 68k:
/* SBD = Silent But Deadly = Stack Bug Dummy. gcc has a bug with
no-defer-pop. it always depers the pop of the last function call in
a routine. SBD is a dummy call to ensure no other previous call gets
its pop deferred.
*/
extern void SBD P((void));


#define JUMPNEXT do{ \
asm("movl a5@+,a0; jmp a0@");\
SBD();return;}while(0)


SBD is then removed by the editor script.


So TEND is defined to be
#define TEND JUMPNEXT; }


On the SPARC popStack is expanded to
void popStack()
{
stackPointer--;
do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0);
}


Its compiled to:
_popStack:
!#PROLOGUE# 0
save %sp,-80,%sp
!#PROLOGUE# 1
add %g5,-4,%g5
ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7
ret
restore
The editor script then reduces this to:`
_popStack:
! [gotcher]
add %g5,-4,%g5
ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7


On the 68k you end up with:
.globl _popStack
_popStack:
subqw #4,a3
movl a5@+,a0; jmp a0@


Global Register Variables and Stupid C Compilers:
Some C compilers are stupid enough to provide straight-forward global
registers. A C compiler on a nat-semi based system I used just allocated
registers in the order they were declared. The assembler syntax was very
simple to edit. Global register variables could thus be obtained easily.


Some C compilers are stupid but think they're clever. Sun's SUN3
compiler is a case in point. It also allocates registers in the order declared.
However, it tries to be clever by removing 'dead assignments' (assignments to
subsequently unused register variables). These compilers are easy to fool.
Simply arrange that the register variables are always used before leaving a
function. I do this by always writing RETURN or RETURNV(e) instead of
return and return e. On systems with such stupid C compilers RETURN(e)
is defined thus:


#define RETURNV(e) do{DummyUseRegs(GR1,GR2,GR3); return e;}while(1)


The call on DummyUseRegs fools the compiler into thinking the registers
are live & hence saves assignments to them. The editor scripts can then
remove calls on DumyUseRegs.


Of course on systems with marginally clever C compilers (SUN4 HP-UX etc)
you're stuffed. However, in clever C compilers like GCC and Acorn's C compiler
you can declare global registers & everything is clean & wholesome :-).






Conditional TCODE Jumps:
Say we wanted a conditional tcode jump. This might be writen:
void skipIfTrue()
TBEGIN
if (*stackPointer-- == TrueObject) {
tcip += 1;
JUMPNEXT;
}
TEND


How this All Works:
With the above scheme, each threaded code procedure runs in the same C
stack frame, and jumps directly to the next procedure, eliminating an
unnecessary <epilogue, return>, <call, prolog> pair. Once we establish a
stack frame and call the first function away we go. Assuming that you've
produced your first threaded code method (a sequence of pointers to
threaded code procedures), and that tcip points at the start, then
StartTCMachine might be defined as follows:


volatile void
StartTCMachine()
{ char *enoughSpaceForAllTCIStackFrames;


enoughSpaceForAllTCIStackFrames = alloca(1024);


while(1) { (*tcip++)(); }
}


The alloca allocates space on the stack. After the first (*tcip++)()
control goes off into threaded code land and never returns.


Leaf Routine Optimization:
The threaded code machine will make calls on support routines e.g.
graphics, garbage collector etc. Any group of routines that dont access
the global registers and don't directly or indirectly call routines that
need to access the global registers can be optimized. These routines
should be compiled without declaring the global registers. These routines
will then use as many registers as the compiler will give them and will
save & restore any registers they use, preserving the values of the global
register variables.




Details of my Smalltalk Threaded Code Machine:
I use a pair of words for each TCI, a pointer to the procedure followed
by an optional operand. This avoids going out of line to access arguments.
e.g. pushLiteral is:
void pushLit()
TBEGIN
*++stackPointer = (OOP)*tcip++;
TEND
where OOP is an ordinary object pointer. So on entry to push lit we have:
<pointer to pushLit>
tcip-> <object pointer>
<pointer to next TCI>
<next TCI's operand>
and popStack must therefore be written
void popStack()
TBEGIN
stackPointer--;
tcip++;
TEND


I dynamically compile Smalltalk-80 bytecodes to threaded code. I use 128k
bytes of memory to hold all threaded code. This 'tspace' is periodically
scavenged to reclaim space. The architecture is similar to
[DeutschSchiffman]. Using an eighth of the space used by the Deutch
Schifman machine I get around 75% of the performance on the non-graphics
benchmarks. Here are the Smalltalk macro benchmarks for BrouHaHa
Smalltalk-80 v2.3.2t running on a monochrome SUN3/60 (20MHz 68020):


BitBLT 76.7308
TextScanning 222.857
ClassOrganizer 80.6667
PrintDefinition 59.0278
PrintHierachy 142.857
AllCallsOn 112.5
AllImplementors 130.0
Inspect 116.667
Compiler 86.4341
Decompiler 101.639
KeyboardLookAhead 212.5
KeyboardSingle 302.778
TextDisplay 148.837
TextFormatting 273.81
TextEditing 180.342
Performance Rating 134.198


and on the same machine under the same conditions are the timings for
ParcPlace Smalltalk-80 V2.3:


BitBLT 99.75
TextScanning 390.0
ClassOrganizer 155.128
PrintDefinition 137.097
PrintHierachy 192.308
AllCallsOn 120.0
AllImplementors 108.333
Inspect 146.774
Compiler 118.617
Decompiler 129.167
KeyboardLookAhead 303.571
KeyboardSingle 473.913
TextDisplay 172.973
TextFormatting 442.308
TextEditing 285.135
Performance Rating 189.504


134.198/189.504 = 0.708154


WARNING!! These systems ARE different, these benchmarks are included only
to give a feeling for ball-park performance.
Example differences:
BrouHaHa ParcPlace
closures blue-book BlockContexts
immediates:
characters, smallints, fixedpoints immediate smallintegers
5844 compiled methods 5136 compiled methods
(5026 ordinary methods) (4798 ordinary methods)
(818 quick methods) (338 quick methods)






More Portable File Organization:
To keep the code as clean looking as possible all machine-dependencies are
isolated in separate files. e.g. tcode.h gives machine independent
definitions for TCODE. It includes machine dependencies from another file:


/* for debugging purposes; single step breakpoint at start of each tcode
*/
#define DEBUG_FETCH_BREAK int frig = fetchBrk();


#ifdef FAST
# include "fasttcode.h"
#else


TCODE *tcip; /* the tcode ip points at TCODEs */


# define JUMPNEXT return
# ifdef BIDebug
# define TBEGIN { DEBUG_FETCH_BREAK
# else
# define TBEGIN {
# endif
# define TEND }
#endif


GCC/SPARC/fasttcode.h:
/* tcodeip in g7 */
register TCODE *tcip asm("%g7");


#define JUMPNEXT do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0)


#ifdef BIDebug
# define TBEGIN { DEBUG_FETCH_BREAK
#else
# define TBEGIN {
#endif
#define TEND JUMPNEXT; }


I also don't want to include the necessary definitions for the global registers
in every file. So for those non-leaf routines that must avoid using the
global registers there's a fastglobal.h file that gives dummy definitions for
these registers. e.g. GCC/SPARC/fastglobal.h:
/* machine specific FAST defines.
Gnu C 1.33 systems can use nice compiler provided global registers.
*/


#define BEGIN {
#define END }
#define RETURN(e) return e
#define RETURNV return


register char *GlobRegDummy1 asm("a5");
register char *GlobRegDummy2 asm("a4");
register char *GlobRegDummy3 asm("a3");
register char *GlobRegDummy4 asm("d6");


#ifdef MASKREGISTER
register char *GlobRegDummy5 asm("d7");
#endif


I use symbolic links to set up the machine dependent include files. This has the
advantage that if you add a new machine you don't have to remake all the others.




The Tedious Bit:
The only tedious bit is writing the sed-scripts. For the SPARC this took 1 day.
Here are the sed scripts I use for SUN 3, MAC2AUX (using GAS) and SUN4,
all using GCC (v1.33 upwards). There's a problem on the SPARC in that the ABI
does not seem to define the status of the global registers. Some math and
library routines stomp on the global registers (beware getwd!), so I've included
GCC/SUN4/sed.globreg.bugfix as an example of how to spot the offending math
routines:


SUN3/GCC/lib/sed.tcode.opt:
# script to strip prolog & epilog from threaded code under gcc.
# WARNING the script may strip a push of a register argument if a call is the
# first statement of a function!!
#
/^_.*:$/{n
N
N
s/ link a6,#[^\n]*\n//
/ fmovem #[^\n]*,sp@-/{
N
s/ fmovem #[^\n]*,sp@-\n//
}
s/ moveml .*,sp@-\n//
s/ movel [ad][0-7],sp@-\n//
}
/ moveml a6@(-[1-9][0-9]*),#/{N
s/ moveml a6@(-[1-9][0-9]*),#[^\n]*\n unlk a6//
}
/ movel a6@(-[1-9][0-9]*),[ad][0-7]/{N
s/ movel a6@(-[1-9][0-9]*),[ad][0-7]\n unlk a6//
}
/ movel sp@+,/d
/ moveml sp@+,#/d
/ unlk a6/d
/ rts/d
/ jbsr _SBD/d


MAC2AUX/GCC/lib.gas/sed.tcode.opt:
/COMMENT/{
i\
script to strip prolog & epilog from threaded code under gcc. WARNING \
the script may strip a push of a register argument if a call is the\
first statement of a function!!
}
/^gcc_compiled:/d
/^.[^%].*:$/{ n
/ link %a6/{
N
N
s/ link %a6,#[x0-9-]*\n//
/ fmovem #[^\n]*,%sp@-/{
N
s/ fmovem #[^\n]*,%sp@-\n//
}
s/ moveml #[x0-9a-f]*,%sp@-\n//
s/ movel %[ad][0-7],%sp@-\n//
n
}
}
/ moveml -[1-9][0-9]*%a6@,#/{ N
s/ moveml -[1-9][0-9]*%a6@,#[x0-9a-f-]*\n unlk %a6//
}
/ movel -[1-9][0-9]*%a6@,%[ad][0-7]/{ N
s/ movel -[1-9][0-9]*%a6@,%[ad][0-7]\n unlk %a6//
}
/ movel %sp@+,%/d
/ moveml %sp@+,#/d
/ movel %d0,%a0/{
N
s/ movel %d0,%a0\n unlk %a6//
/ movem*l %a6/{
N
s/ movel %d0,%a0\n movem*l %a6.*\n unlk %a6//
/ fmovem %a6/{
N
s/ movel %d0,%a0\n movem*l %a6.*\n fmovem %a6.*\n unlk %a6//
}
}
}
/ unlk %a6/d
/ rts/d
/ jbsr SBD/d




SUN4/GCC/lib/sed.tcode.opt:
# script to strip prolog & epilog from threaded code under gcc.
#
/^_.*:$/{n
N
N
s/ !#PROLOGUE# 0\n save %sp,[-0-9]*,%sp\n !#PROLOGUE# 1/ ! [gotcher]/
}
/ ret/d
/ restore/d


SUN4/GCC/lib/sed.globreg.bugfix:
# Some of the libc builtin routines (.rem .urem .div & .udiv so far known)
# stamp on %g3 which is the maskReg (it contains 0x7FFFFF).
# This script reassigns the value of maskReg after each of these routines
# has been called.
/call[ ]\.div,[0-9]/{n
n
i\
sethi %hi(0x7FFFFF),%g3 ![globregfix]\
or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.udiv,[0-9]/{n
n
i\
sethi %hi(0x7FFFFF),%g3 ![globregfix]\
or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.rem,[0-9]/{n
n
i\
sethi %hi(0x7FFFFF),%g3 ![globregfix]\
or %lo(0x7FFFFF),%g3,%g3
}
/call[ ]\.urem,[0-9]/{n
n
i\
sethi %hi(0x7FFFFF),%g3 ![globregfix]\
or %lo(0x7FFFFF),%g3,%g3
}




You can now see why I put "Machine-Independent" in quotes. Here's the count
of machine dependent code for the SPARC:


25 99 786 fastcache.h
68 262 1882 fastglobal.h
31 112 906 fasttcode.h
28 80 595 ../tcsrc/SUN4/GCC/lib/sed.globreg.bugfix
5 34 222 ../tcsrc/SUN4/GCC/lib/sed.peep.opt
9 30 173 ../tcsrc/SUN4/GCC/lib/sed.tcode.opt
166 617 4564 total


Of these 166 lines 51 lines are banner headers. 100 odd lines are
machine dependent. A whole VM is around the 20k lines mark. So
machine dependencies are down in the 0.5% range.






Use this stuff as part of what ever you like. If you try & assert ownership
I'll fight & batter you over the head with the GPL ('bout time we had some
serious steel in that thar GPL).


Share And Enjoy!


P.S. The BrouHaHa machine is available to educational institutions with a
valid ParcPlace Smalltalk-80 licence, subject to a strict non-disclosure
agreement. email me if you want it. I am slow to answer requests!
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.