Re: stack access speed

rkrayhawk@aol.com (RKRayhawk)
17 Nov 2000 23:53:49 -0500

          From comp.compilers

Related articles
stack access speed j8fx@my-deja.com (2000-11-15)
Re: stack access speed dlindauer@notifier-is.net (david lindauer) (2000-11-16)
Re: stack access speed rkrayhawk@aol.com (2000-11-17)

From: rkrayhawk@aol.com (RKRayhawk)
Newsgroups: comp.compilers
Date: 17 Nov 2000 23:53:49 -0500
Organization: AOL http://www.aol.com
References: 00-11-113
Keywords: GCC, code, comment
Posted-Date: 17 Nov 2000 23:53:49 EST

The moderator has you on the right track. Let me add two notions that may
contribute to your analysis.


  You conjecture:
<< This has caused our program to crawl in certain
instances when arrays are constantly being pushed and popped from the
stack.
>>


That is not really the problem. The items are not actually being
pushed and popped. Instead, the local stack space is allocated by
a single large bump of the stack pointer on function entry (so that
other necessary pushes and pops, AND interrupts, do not stomp on the
variables local to your function). Once that section of the stack has
been 'allocated' by the bump, accesses to those variables are ordinary
loads and stores relative to the stack or frame pointer, not pushes
and pops.
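
A minimal sketch of that point, in C. The array length N and the
function names are my own assumptions, not taken from the original
post. Both functions run the same loops; the only difference is
whether the arrays are automatic (space typically reserved by one
stack pointer adjustment on entry) or static (fixed addresses).

```c
#include <stddef.h>

#define N 500  /* hypothetical array length, chosen for illustration */

/* Automatic arrays: a compiler typically reserves room for both with a
   single adjustment of the stack pointer on entry. The loops then index
   into that space; no per-element push or pop is involved. */
unsigned long sum_auto(void)
{
    unsigned long a[N], b[N];
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    for (i = 0; i < N; i++)
        total += a[i] + b[i];
    return total;
}

/* Static arrays: identical loops, but the storage is outside the stack
   frame and is shared by every call of the function. */
unsigned long sum_static(void)
{
    static unsigned long a[N], b[N];
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    for (i = 0; i < N; i++)
        total += a[i] + b[i];
    return total;
}
```

Compiling both with gcc -S and comparing the loop bodies should show
whether the compiler addresses the two kinds of storage differently.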


Compilers definitely can get into weak zones when they think your data
is on the stack, even though the segment (or base) and pointer values
can be replicated (in some cases) and the same optimizations are
possible as with any other array. In the case of GCC that would just
be legacy coding concepts in the optimizer, not necessarily restricted
to Intel architecture, but a hangover from entering optimization with
factors that indicate the data is on the stack (thus the static
declaration keeps the optimizer from stumbling).


Secondly, and less confidently, there is the problem that the stack is
not necessarily full-word aligned, which can skew your test. If I am
not mistaken, the stack on the Pentium is still byte addressable. (My
confidence issue is that I do not know whether NT runs the device in
some mode that defeats this concern.) So the alignment of the stack
upon entry to your code may not be a multiple of the 4-byte unsigned
long. The auto declarations may or may not get aligned (and I am
supposing the recommended options help).
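
One way to check this rather than guess is to look at the address the
compiler actually gave an automatic array. The following is a small
sketch; the helper names are hypothetical, and it assumes a C11
compiler (for _Alignof) and a platform that provides uintptr_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if address p is a multiple of align (align must be nonzero). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Returns 1 if an automatic array of unsigned long in this stack frame
   starts on a multiple of align bytes. */
int auto_array_is_aligned(size_t align)
{
    unsigned long buf[8];

    buf[0] = 0;  /* make sure the array is really used */
    return is_aligned(buf, align) && buf[0] == 0;
}
```

I would expect auto_array_is_aligned(_Alignof(unsigned long)) to hold
with a conforming compiler; passing 8, 16 or a cache line size is
where the answer can depend on compiler options and on how the stack
was aligned on entry.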


Performance of the addressing of the data in long loops is affected by
boundary alignments, that is, by whether the data straddles cache line
boundaries or instead sits integral to them. Depending again upon
compilation options, the static declaration probably aligns the long
unsigned ints.
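
To make the cache line effect concrete, here is a sketch that counts
how many lines a block of memory touches. CACHE_LINE = 64 is an
assumption (the real line size depends on the processor), and the
function and table names are hypothetical. _Alignas requires C11.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size in bytes; check your CPU */

/* Number of cache lines touched by `bytes` bytes starting at p.
   bytes must be greater than zero. */
size_t lines_touched(const void *p, size_t bytes, size_t line)
{
    uintptr_t first = (uintptr_t)p / line;
    uintptr_t last  = ((uintptr_t)p + bytes - 1) / line;

    return (size_t)(last - first + 1);
}

/* A table with static storage duration, explicitly forced onto a cache
   line boundary instead of relying on what the compiler happens to do. */
_Alignas(CACHE_LINE) unsigned long aligned_table[500];
```

An 8-byte item at offset 60 within a 64-byte line touches two lines,
while the same item at offset 64 touches one. An array that starts on
a line boundary touches the minimum number of lines for its size.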


Additionally, if performance is your highest priority here, it may be
possible to get more from the compiler if you could lay this out as a
struct with 8 buckets (naturally, an array of such a struct occurring
500 times). This is an idea that would have to fit your design reqs,
but the single loading of a base register instead of potentially eight
loads of the bases in each loop may be easy for the optimizer to
squeeze for you. I can't tell from your post if that is really
apropos, so forgive me if it is not relevant.
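
A sketch of the two layouts, for comparison. The eight separate
arrays are only my guess at the shape of the original code; the
names, the fill values and the summing loop are all hypothetical.

```c
#include <stddef.h>

#define ROWS    500
#define BUCKETS 8

/* Layout A: eight separate automatic arrays, so the loop needs eight
   different base addresses. */
unsigned long sum_separate(void)
{
    unsigned long a0[ROWS], a1[ROWS], a2[ROWS], a3[ROWS];
    unsigned long a4[ROWS], a5[ROWS], a6[ROWS], a7[ROWS];
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < ROWS; i++) {
        a0[i] = i;     a1[i] = i + 1; a2[i] = i + 2; a3[i] = i + 3;
        a4[i] = i + 4; a5[i] = i + 5; a6[i] = i + 6; a7[i] = i + 7;
    }
    for (i = 0; i < ROWS; i++)
        total += a0[i] + a1[i] + a2[i] + a3[i]
               + a4[i] + a5[i] + a6[i] + a7[i];
    return total;
}

/* Layout B: one array of structs. All eight values for a given i sit
   next to each other, reachable from a single base plus small offsets. */
struct row {
    unsigned long bucket[BUCKETS];
};

unsigned long sum_struct(void)
{
    struct row rows[ROWS];
    unsigned long total = 0;
    size_t i, k;

    for (i = 0; i < ROWS; i++)
        for (k = 0; k < BUCKETS; k++)
            rows[i].bucket[k] = i + k;
    for (i = 0; i < ROWS; i++)
        for (k = 0; k < BUCKETS; k++)
            total += rows[i].bucket[k];
    return total;
}
```

Both functions compute the same total; whether layout B is actually
faster would have to be measured with your compiler and options.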


Hope that helps a little.


Bob Rayhawk
rayhawk@alum.calberkeley.org
[I looked at the code that GCC generates for that test program. It's
just lousy, no misalignment or anything like that. The problem appears
to be that GCC doesn't know about the x86 double indexed addressing, and
the x86 doesn't have enough registers to let it make pointer temporaries
to all of the stack arrays, so it keeps recomputing the base address of
each array inside the loop. -John]

