Tue, 10 May 1994 21:47:34 GMT

Related articles |
---|

compilers, in a nutshell ellard@endor.harvard.edu (1994-05-09) |

Re: compilers, in a nutshell chase@Think.COM (1994-05-09) |

Switch code generation, correction. chase@Think.COM (1994-05-10) |

Newsgroups: | comp.compilers |

From: | chase@Think.COM (David Chase) |

Keywords: | errors, optimize, sparc |

Organization: | Thinking Machines Corporation, Cambridge MA, USA |

References: | 94-05-018 94-05-021 |

Date: | Tue, 10 May 1994 21:47:34 GMT |

Correction to my previous post -- if you're going to give a detailed

example, then you had better get the details right. Mark Bromley spotted

this (and I am very embarassed by the branch latency case, because that is

rather important, and I've had it wrong for some time):

*> The supersparc won't issue a branch in a cycle including a branch delay*

*> instruction. So your asymptotic cost is 2 cycles per branch.*

This is a consequence of "Split Before Delay Group CTI Unless First".

This slows down the binary and linear cases, but the code can be reordered

and tweaked to regain some of the lost cycles. The idea is that either

(a) you can execute up to 3 additional ALU instructions per branch in the

same amount of time to get more bang for your branch cycle,

bcond somewhere1

ALU1

ALU2

SETCC

ALU3

bcond somewhere2

or (b) you can convert the branch to annulled form, and place an

instruction from the target block there (hopefully, one that will actually

save you a cycle),

bcond,a somewhere

ldd ... ! LDD integer is single-issue anyway.

(I checked this with some timing loops, just to be sure, and it is

correct).

*> Supersparc doesn't cascade into the shifter so you have to*

*> rearrange a couple things:*

(That I knew, but got backwards in the example)

and

*> An indirect jump takes two cycles.*

sub %o0,K0,%o0 ! Normalize to zero

sethi %hi(table),%g2 ! (cyc 1)

cmp %o0,KN

sll %o0,2,%g1

bgu default ! Use unsigned comparison hack (cyc 2)

add %g2,%lo(table),%g2 ! (cyc 3)

ld [%g1+%g2],%g1 ! (cyc 4)

! pipeline bubble, I think (cyc 5) yes

jmp %g1 ! (cyc 6 and 7)

nop ! Useless delay slot (cyc 8)

So, so much for the incredibly picky side of code generation for

superscalar machines. Expected linear cost L(N) = N+1, expected table

cost T(N) = 8, expected binary cost B(N) = 2 * depth to linear fragments.

If X is the breakeven N for linear versus binary, then B(N) = L(X) +

2*log2(N/X). X is 4, B(4) = L(4) = 5, so B(8) = 7, and B(16) = 9. By

shortening the tails to length 3, you can get B(12) = 8.

Sigh. I've been working from the wrong schedules for branches for a long

time.

I received mail asking if anything had been written on switch table

generation with knowledge of (non-uniform) case frequencies, and I sure

don't know. Upon reflection, I realized that you can (abuse) the spare

ALU slots to enable a Huffman-like encoding. If the order cannot be

changed, you can end up in situations where (for instance) cases in the

range 0-99 get 25% of the execution, 100-199 get 50%, and 200-299 get 25%.

You might like to save 2 cycles for the 50% case by making that half of

the tree 1 level shallower, but the order does not permit it. However, if

this is within the binary search tree already, you have some spare

instructions to play with, as in:

bcond elsewhere1 ! cyc N

<delay from elsewhere1>

sub %o0,100,%o0 ! cyc N+1

cmp %o0,200

bgeu elsewhere2 ! cyc N+2

<delay from elsewhere2>

Under unsigned comparison, "negative" numbers look like large positive

numbers, so they compare greater than 200, which means the two (original

values) intervals 0-99 and 200-299 are trimmed away in one comparison.

This can pay even at the root, since the additional subtraction adds at

most one cycle, and not the two that a second compare-and-branch would

cost.

Note -- it is not a good idea to annul the delay slot, since that would

add an additional cycle -- a likely instruction to put there would be a

normalizing step for the cases covered "elsewhere", using a second

register (that is, the comparisons could ping-pong back and forth between

%o0 and %o1).

To my knowledge, nobody has gone quite this bananas in the generation of

binary-tree case statements, though I'd be thrilled to hear it.

David Chase, speaking for myself

Thinking Machines Corp.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.