Newsgroups: comp.compilers
From: andrew@rentec.com (Andrew Mullhaupt)
Keywords: registers, optimize, linker
Organization: Renaissance Technologies Corp., Setauket, NY.
References: 92-05-123 92-06-023
Date: Fri, 5 Jun 1992 16:37:57 GMT
preston@dawn.cs.rice.edu (Preston Briggs) writes:
>how much register save and restore overhead can there be? Usually a call
>to a simple leaf function can be accomplished with very few cycles of
>overhead (a call and return, with no registers saved, and no stack frame
>allocated).
>
>On the other hand, if the argument function is more complex, the overhead
>goes up somewhat, depending on how many registers must be saved. The hope
>is that a function requiring 15 or 20 registers will be spending more than
>a few cycles doing real work.
>
>Interface specifications surely help with interprocedural type checking
>and optimization. However, your specification of "dynamically loaded" and
>"alien host" make me uncertain. Probably you're wanting to do stuff that
>is hairy beyond what I've imagined (like somehow interfacing an APL
>interpreter to IMSL subroutines).
A simple example can clarify this point. Let us suppose you are using
AT&T's S interpreter (or, more likely, its cousin Splus). It is not
unusual for simple functions written in S to be very slow because of the
interpretive overhead. Here are timings on a moderately loaded SS-2 for
four ways to compute the same thing: the square root of the sum of the
integers from 1 to 10000. The first way uses the obvious S loop:
> unix.time(boo(1:10000))
[1] 121.216675 3.733333 132.000000 0.000000 0.000000
The second way uses a dynamically loaded C function which is essentially
nothing more than the S loop transliterated and compiled. Bear in mind
that this version actually has to do more shuffling to make the call into
C than the previous code did:
> unix.time(.C("S_boo", x=as.double(1:10000), n=as.integer(10000), y=0.0)$y)
[1] 0.23333740 0.01666689 1.00000000 0.00000000 0.00000000
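(For concreteness, here is a sketch of what S_boo has to look like; this
is a reconstruction, not the original source. The .C interface passes
every argument by reference, which is part of the shuffling just
mentioned.)

    #include <math.h>

    /* The S loop transliterated: sum x[0..*n-1], return sqrt of
       the sum through *y.  Everything arrives as a pointer, per
       the .C convention; whether n is int or long depends on
       your S port. */
    void S_boo(double *x, long *n, double *y)
    {
        double sum = 0.0;
        long i;

        for (i = 0; i < *n; i++)
            sum += x[i];
        *y = sqrt(sum);
    }

You compile that to an object file and attach it with dyn.load (the exact
incantation varies with the release):

    % cc -c S_boo.c
    > dyn.load("S_boo.o")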
It's about 500 times faster (user time is the first number). No wonder
dynamic loading is a big deal in S, even though a pure C program doing
the same thing shows there is still overhead left:
[penguin 97 97] time a.out
70711.031671
0.3 real 0.1 user 0.1 sys
This shows that the overhead of the dynamically loaded C + Splus
interface costs about as much as the entire pure C run.
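(The standalone version, again a sketch rather than the original source,
is nothing but the same loop with a main around it. Judging from the
printed value, sqrt of the sum from 1 to 100000 is 70711.0317, so this
particular binary was evidently built with the larger loop bound.)

    #include <stdio.h>
    #include <math.h>

    /* Freestanding version of the same computation.  The loop
       bound is inferred from the printed output above. */
    int main()
    {
        double sum = 0.0;
        long i;

        for (i = 1; i <= 100000L; i++)
            sum += (double) i;
        printf("%f\n", sqrt(sum));
        return 0;
    }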
You can easily see that Splus can be made nearly as fast as pure C by
avoiding some of the interpreter's disadvantages using dynamically loaded
C code. (Or FORTRAN, etc.) This is a lot better than writing everything as
full C programs, since not everything is performance critical, but almost
everything is easier to write in S.
In this case, you can see that roughly as much time is spent passing
arguments, etc. as is spent getting _real work_ done. But I would be
totally justified in using dynamically loaded C even if the overhead were
_fifty_ times as bad: fifty times the roughly 0.13 seconds of interface
overhead is still under 7 seconds, against 121 seconds for the pure S
loop, so my total run time would still be more than _ten_ times faster
than Splus alone. Now in this case, Splus provides built-in C functions
which can do this job:
> unix.time(sqrt(sum(1:10000)))
[1] 0.09999847 0.00000000 0.00000000 0.00000000 0.00000000
What this shows is that when you can get the interpreter to do what you
want using its primitives, it's about as fast as normal C. The point is
that you usually don't have such convenient facilities for all your
performance problems, so you end up fixing your hot spots by dynamically
loading compiled code.
So far, I've made the point that dynamic linking of alien code can be
useful _even if the calling overhead is much larger than the actual "work"
done_. I have not shown by this example that register allocation is
involved, but I could, if I had time (and the reader had patience) arrive
at such an example. Now the function we linked in here has three
arguments, all pointers. This isn't going to push anyone's stack over the
edge, but when you look at IMSL, you can easily find cases where there are
twenty or even thirty arguments, often to create many ways of calling the
same function, etc., as well as times where you have to pass logical and
physical dimensions of arrays along with strides. If you think about doing
an R-SVD of a small matix, (say 10x5) on a machine with fast enough FPU,
you might very well be at the point where the argument passing is as
costly as the crunching. Now it is quite easy to think of RLS or Kalman
filtering implementations where this would be the performance critical
portion, but because of the complexity of arriving at a good C
implementation (in other words pushing the Kalman code into the C instead
of leaving the S/C divide at the R-SVD interface) you might be better off
leaving it at that, which would mean a significant number of cycles being
spent in passing arguments.
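To see the flavor of this, consider what a C caller of a FORTRAN-linkage
SVD routine has to do. (The routine name and argument list below are
invented for illustration; they are not IMSL's actual interface, though
they are typical of the genre.) Every argument goes by reference:

    /* Hypothetical FORTRAN-style interface: twelve arguments,
       all passed by reference, counting the logical and physical
       dimensions, the job code, and the workspace. */
    extern void rsvd_(double *a, long *lda, long *m, long *n,
                      double *s, double *u, long *ldu,
                      double *v, long *ldv, double *work,
                      long *job, long *info);

    void small_svd(double *a)        /* a is 10x5, column major */
    {
        long lda = 10, m = 10, n = 5, ldu = 10, ldv = 5;
        long job = 11, info = 0;
        double s[6], u[50], v[25], work[10];

        /* Twelve address computations and stores, plus the
           callee's register saves, on every call -- a fixed
           cost that does not shrink with the matrix. */
        rsvd_(a, &lda, &m, &n, s, u, &ldu, v, &ldv, work,
              &job, &info);
    }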
People do not usually think about programs where argument passing is
taking up the bulk of the time, or even a significant fraction of it, but
this kind of situation is not unheard of in scientific computing. The
first time I came across it was in the vortex sheet code for the Blasius
problem, and then again in Concus and Proskurowski's entropy condition in
the Random Choice method. Now these were codes which were not separately
compiled, and so compiler inlining basically solved these problems. But
if we tried to implement these in the alien code situation, neither we
nor the compiler could really do anything about it.
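(A contrived sketch of the difference: when the small routine is visible
in the same compilation unit, the compiler can inline it and the argument
traffic evaporates; once it sits on the far side of a dynamic-link
boundary, every call pays full freight.)

    /* Same translation unit: most optimizing compilers will
       inline a small static function like this, so the call
       and its argument passing disappear entirely. */
    static double flux(double ul, double ur)
    {
        return 0.5 * (ul * ul + ur * ur);  /* stand-in for real work */
    }

    /* Alien code: resolved at run time, so the compiler sees
       only an opaque call.  No inlining is possible, and the
       per-call overhead is paid on every invocation. */
    extern double (*alien_flux)(double, double);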
Later,
Andrew Mullhaupt