Re: Parallelizing (WAS: Death by pointers.) (Mike Lee)
Mon, 20 Nov 1995 19:18:06 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: (Mike Lee)
Keywords: design, storage, architecture
Organization: Ontek Corporation -- Laguna Hills, California
References: 95-10-092 95-11-093 95-11-131
Date: Mon, 20 Nov 1995 19:18:06 GMT

In comp.compilers, (Cliff Click) writes:
| [lots of nifty stuff, but I take exception to this one part]
| For resizing arrays, they double in size as they run out. Requires a quicky
| test&increment in the common case, with a doubling realloc log_N times.
| Total worst-case data movement is 2N, and slop is O(N/2). You can fiddle
| with the worst-case constants by resizing using a constant other than 2.
| Data movement is actually fairly cheap, since block copy uses the memory
| hierarchy "exactly right".
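The doubling scheme quoted above can be sketched as a growable array. This is a minimal illustration (the struct, names, and initial capacity are mine, not Cliff's): the common case is a cheap test-and-increment, and the realloc happens only log_N times.

```c
#include <stdlib.h>

/* Illustrative growable array: test & increment in the common case,
 * a doubling realloc in the rare slow path. */
typedef struct {
    int   *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
} vec;

int vec_push(vec *v, int x)
{
    if (v->len == v->cap) {                       /* rare slow path */
        size_t ncap = v->cap ? v->cap * 2 : 8;    /* double (8 to start) */
        int *p = realloc(v->data, ncap * sizeof *p);
        if (!p) return -1;                        /* out of memory */
        v->data = p;
        v->cap  = ncap;
    }
    v->data[v->len++] = x;                        /* common fast path */
    return 0;
}
```

Pushing N elements costs at most 2N element moves in total, and the wasted slop is at most half the final capacity, matching the O(N/2) figure above.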

Block copy on blocks whose size is a power of two can be expensive (in
time) if these two situations hold:

    - malloc likes to allocate on big power-of-two boundaries for large
      requests (whether for matching up with page sizes or because of a
      buddy-system allocator)
    - the cache hardware algorithm is dimwitted and simply masks off
      the upper bits when choosing a cache line

What happens is that successive chunks of the source and destination
of the block move may very well occupy the same cache line.
"Predictive" and "write-behind" features are designed to minimize this
sort of problem, but when the cache lines collide, the limits of the
cache's designed-in intelligence are tested.
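To make the collision concrete, here is a sketch of a direct-mapped cache that picks its line purely by masking low address bits. The geometry (8 KB cache, 32-byte lines) is illustrative, my numbers rather than any particular chip's:

```c
#include <stdint.h>

/* An illustrative direct-mapped cache: 8 KB total, 32-byte lines,
 * 256 lines.  The line is chosen by "masking off the upper bits". */
#define LINE_SIZE  32
#define CACHE_SIZE 8192
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 256 */

unsigned cache_line(uintptr_t addr)
{
    /* discard the upper bits; keep only the line index */
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}
```

If malloc hands back source and destination blocks both aligned on big power-of-two boundaries, say 0x10000 and 0x20000, then the addresses differ by a multiple of the cache size: for every offset i, cache_line(src + i) equals cache_line(dst + i), so each chunk read evicts the very line the corresponding write needs.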

Anyway, your suggestion of choosing a number other than two may help
best-case performance, not just worst-case performance. Your article
may prompt me to run some tests... offhand, I'd guess that 1+phi
(about 1.618, taking phi as the reciprocal golden ratio 0.618) or any
other irrational number between 1.25 and 2.0 that isn't close to a
power-of-two fraction should minimize cache thrashing.

A further suggestion is to use realloc() rather than
malloc/memcpy/free. realloc() may be able to just call sbrk() and
append some more pages at nearly no cost, especially if there is only
one growing array in your program (it'll migrate to the end of the
heap after a small number of doublings).
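The two suggestions combine naturally: grow through realloc() and use a factor near the golden ratio rather than exactly two. A sketch, where the ~1.625 factor and the function names are my illustration, not anything from the article:

```c
#include <stdlib.h>

/* Grow capacity by roughly 1.625x (8 + 4 + 1 pattern), which avoids
 * landing on exact powers of two.  Illustrative policy, not gospel. */
size_t grow_cap(size_t cap)
{
    if (cap < 8) return 8;
    return cap + cap / 2 + cap / 8;   /* ~1.625x */
}

/* realloc() may extend the block in place (e.g. via sbrk at the end
 * of the heap), skipping the copy entirely. */
void *grow(void *p, size_t *cap, size_t elem_size)
{
    size_t ncap = grow_cap(*cap);
    void *np = realloc(p, ncap * elem_size);
    if (np) *cap = ncap;              /* on failure, old block survives */
    return np;
}
```

Note that realloc() leaves the old block intact on failure, so the caller must not overwrite the original pointer until the result is checked.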

| I have a friend (hi Preston!) who believes you
| should always precompute, but it's easy to resize and it's never the
| bottleneck.

Precomputing minimizes the risk of asking for twice as much memory as
you really need... if you're up against the edge of physical memory,
the VM thrashing will be far more annoying than overlapping cache
lines ever will be.

Kevin Dowd, High Performance Computing, ORA, ISBN 1-56592-032-5, is a
good introduction to cache and memory performance problems. I don't
know of any more comprehensive books on the subject--it seems like
every brand and model of cache has a different capacity, a different
cache line size, and a slightly different predictive algorithm. Given
that L2 cache cards are now a commodity item on PCs and Macs, it may
not be worth optimizing for a specific brand and model anyway. The
behavior of on-chip cache is usually better documented, though, and
better tuned to block moves, I would think.

Also, I should put a disclaimer here to the effect that carefully
writing your code so as to minimize cache-line collisions is usually
of no help. Run a profiler first to see what the problem is, _then_
think about cache behavior.


