|compilers, in a nutshell email@example.com (1994-05-09)|
|Re: compilers, in a nutshell chase@Think.COM (1994-05-09)|
|Switch code generation, correction. chase@Think.COM (1994-05-10)|
|From:||chase@Think.COM (David Chase)|
|Keywords:||errors, optimize, sparc|
|Organization:||Thinking Machines Corporation, Cambridge MA, USA|
|Date:||Tue, 10 May 1994 21:47:34 GMT|
Correction to my previous post -- if you're going to give a detailed
example, then you had better get the details right. Mark Bromley spotted
this (and I am very embarassed by the branch latency case, because that is
rather important, and I've had it wrong for some time):
> The supersparc won't issue a branch in a cycle including a branch delay
> instruction. So your asymptotic cost is 2 cycles per branch.
This is a consequence of "Split Before Delay Group CTI Unless First".
This slows down the binary and linear cases, but the code can be reordered
and tweaked to regain some of the lost cycles. The idea is that either
(a) you can execute up to 3 additional ALU instructions per branch in the
same amount of time to get more bang for your branch cycle,
or (b) you can convert the branch to annulled form, and place an
instruction from the target block there (hopefully, one that will actually
save you a cycle),
ldd ... ! LDD integer is single-issue anyway.
(I checked this with some timing loops, just to be sure, and it is
> Supersparc doesn't cascade into the shifter so you have to
> rearrange a couple things:
(That I knew, but got backwards in the example)
> An indirect jump takes two cycles.
sub %o0,K0,%o0 ! Normalize to zero
sethi %hi(table),%g2 ! (cyc 1)
bgu default ! Use unsigned comparison hack (cyc 2)
add %g2,%lo(table),%g2 ! (cyc 3)
ld [%g1+%g2],%g1 ! (cyc 4)
! pipeline bubble, I think (cyc 5) yes
jmp %g1 ! (cyc 6 and 7)
nop ! Useless delay slot (cyc 8)
So, so much for the incredibly picky side of code generation for
superscalar machines. Expected linear cost L(N) = N+1, expected table
cost T(N) = 8, expected binary cost B(N) = 2 * depth to linear fragments.
If X is the breakeven N for linear versus binary, then B(N) = L(X) +
2*log2(N/X). X is 4, B(4) = L(4) = 5, so B(8) = 7, and B(16) = 9. By
shortening the tails to length 3, you can get B(12) = 8.
Sigh. I've been working from the wrong schedules for branches for a long
I received mail asking if anything had been written on switch table
generation with knowledge of (non-uniform) case frequencies, and I sure
don't know. Upon reflection, I realized that you can (abuse) the spare
ALU slots to enable a Huffman-like encoding. If the order cannot be
changed, you can end up in situations where (for instance) cases in the
range 0-99 get 25% of the execution, 100-199 get 50%, and 200-299 get 25%.
You might like to save 2 cycles for the 50% case by making that half of
the tree 1 level shallower, but the order does not permit it. However, if
this is within the binary search tree already, you have some spare
instructions to play with, as in:
bcond elsewhere1 ! cyc N
<delay from elsewhere1>
sub %o0,100,%o0 ! cyc N+1
bgeu elsewhere2 ! cyc N+2
<delay from elsewhere2>
Under unsigned comparison, "negative" numbers look like large positive
numbers, so they compare greater than 200, which means the two (original
values) intervals 0-99 and 200-299 are trimmed away in one comparison.
This can pay even at the root, since the additional subtraction adds at
most one cycle, and not the two that a second compare-and-branch would
Note -- it is not a good idea to annul the delay slot, since that would
add an additional cycle -- a likely instruction to put there would be a
normalizing step for the cases covered "elsewhere", using a second
register (that is, the comparisons could ping-pong back and forth between
%o0 and %o1).
To my knowledge, nobody has gone quite this bananas in the generation of
binary-tree case statements, though I'd be thrilled to hear it.
David Chase, speaking for myself
Thinking Machines Corp.
Return to the
Search the comp.compilers archives again.