Related articles |
---|
[11 earlier articles] |
Re: Parallelizing (WAS: Death by pointers.) chase@centerline.com (1995-11-06) |
Re: Parallelizing (WAS: Death by pointers.) hbaker@netcom.com (1995-11-10) |
Re: Parallelizing (WAS: Death by pointers.) dave@occl-cam.demon.co.uk (Dave Lloyd) (1995-11-13) |
Re: Parallelizing (WAS: Death by pointers.) davids@ICSI.Berkeley.EDU (1995-11-14) |
Re: Parallelizing (WAS: Death by pointers.) cliffc@ami.sps.mot.com (1995-11-14) |
Re: Parallelizing (WAS: Death by pointers.) chase@centerline.com (1995-11-19) |
Re: Parallelizing (WAS: Death by pointers.) mikey@banzai.ontek.com (1995-11-20) |
Newsgroups: | comp.compilers |
From: | mikey@banzai.ontek.com (Mike Lee) |
Keywords: | design, storage, architecture |
Organization: | Ontek Corporation -- Laguna Hills, California |
References: | 95-10-092 95-11-093 95-11-131 |
Date: | Mon, 20 Nov 1995 19:18:06 GMT |
In comp.compilers, cliffc@ami.sps.mot.com (Cliff Click) writes:
| [lots of nifty stuff, but I take exception to this one part]
|
| For resizing arrays, they double in size as they run out. Requires a quicky
| test&increment in the common case, with a doubling realloc log_N times.
| Total worst-case data movement is 2N, and slop is O(N/2). You can fiddle
| with the worst-case constants by resizing using a constant other than 2.
| Data movement is actually fairly cheap, since block copy uses the memory
| hierarchy "exactly right".
Block copy on blocks whose size is a power of two can be expensive (in
time) if these two situations hold:
- malloc likes to allocate on big power-of-two boundaries (whether
for matching up with page sizes or because of a buddy-system
allocator) for large requests
- the cache hardware algorithm is dimwitted and simply masks off the
upper bits when choosing a cache line
What happens is that the successive chunks of the source and
destination of the block move may very well occupy the same cache
line. "Predictive" and "write-behind" features are designed to
minimize this sort of problem but when the cache lines collide
the limits of the cache's designed-in intelligence are tested.
Anyway, your suggestion of choosing a number other than two may help
best-case performance, not just worst case performance. Your article
may prompt me to run some tests... offhand, I'd guess that 1+phi or any
other irrational number between 1.25 and 2.0 that isn't close to a
power-of-two fraction should minimize cache thrashing.
A further suggestion is to use realloc() rather than
malloc/memcpy/free. realloc() may be able just call sbrk() and append
some more pages at nearly no cost, especially if there is only one
growing array in your program (it'll migrate to the end of the heap
after a small number of doublings.)
| I have a friend (hi Preston!) who believes you
| should always precompute, but it's easy to resize and it's never the
| bottleneck.
Precomputing minimizes the risk of asking for twice as much memory
as you really need... if you're up against the edge of physical memory
the VM thrashing will be far more annoying than overlapping cache
lines.
Kevin Dowd, High Performance Computing, ORA, ISBN 1-56592-032-5, is a
good introduction to cache and memory performance problems. I don't
know of any more comprehensive books on the subject--it seems like
every brand and model cache is of a different capacity with different
sized cache lines and a slightly different predictive algorithm. Given
that L2 cache cards are now a commodity item on PCs and Macs it may not
be worth optimizing for a specific brand and model anyway. The
behavior of on-chip cache is usually better documented though, and
better tuned to block moves I would think.
Also I should put a disclaimer here to the effect that carefully writing
your code so as minimize cache-line collisions is usually of no help.
Run a profiler first to see what the problem is, _then_ think about
optimizing.
mikey
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.