Re: Cache size restrictions obsolete for unrolling?

"Harold Aptroot" <harold.aptroot@gmail.com>
Sat, 10 Jan 2009 17:41:59 +0100

          From comp.compilers


Newsgroups: comp.compilers
Organization: A noiseless patient Spider
References: 09-01-010 09-01-011 09-01-014
Keywords: architecture, storage
Posted-Date: 10 Jan 2009 13:14:20 EST

"Stephan Ceram" <linuxkaffee_@_gmx.net> wrote in message
> Yes, this seems reasonable for systems with separate data caches and
> even more when both data and code are in the same cache. However, in
> my system there is no data cache and all data is placed in a fast
> memory close to the processor called scratch pad memory. So, what do
> you think when data is not competing with the code for the memory
> system? Can I-cache misses be neglected at the cost of more aggressive
> optimization?


(Now, I'm not a real expert in this area, but my observations may be of
some use...)


Then you'd mostly be balancing the cost of the I-cache misses against
the cost of the loop control logic, I guess. So it would still depend
on the size of the cache, the cost of the loop control logic relative
to the cost of the loop body (which also depends on the I-cache size),
the cost of an I-cache miss (which could be related to its size), and
some other things I'm overlooking right now...
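To make the trade-off concrete, here is a small sketch (my own example,
not from the original post) of 4x unrolling: the loop-control logic
(increment, compare, branch) now runs once per four elements instead of
once per element, but the body occupies roughly four times as much
I-cache space.

```c
#include <stddef.h>

/* Sum an array with a 4x unrolled main loop plus a remainder loop.
   One compare-and-branch now covers four body copies, so the
   loop-control overhead per element drops, while the code size
   (and hence I-cache footprint) grows. */
static long sum_unrolled(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* one branch per 4 elements */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)            /* handle the leftover elements */
        s += a[i];
    return s;
}
```

Whether the saved branches outweigh the extra misses is exactly the
balance described above, and it shifts with the cache size.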


-- the following piece only applies to processors with a dynamic branch
predictor --


If the processor has a dynamic branch predictor, the loop body
contains branches, and the branch predictor uses the address (or part
of it) of the branch to remember where it was (that's a lot of
requirements...), then unrolling too far may cause a very slight
performance degradation in other parts of the code, because the flood
of branches in the unrolled loop forces the predictor to evict those
parts' entries. This effect is probably very small, though.


It may also cause extra branches in the loop itself to be
mispredicted, because branches which are essentially "the same branch"
would not be detected as the same branch, since the copies have
different addresses. This effect could be larger, though probably not
very large. If the processor clears its branch-predictor cache on
I-cache misses (some might do this to keep the branch-predictor cache
small without running the risk of accidentally using an entry to
predict a branch it wasn't meant for but that happens to share the
same low address bits), the first effect could suddenly be quite big.
I don't know how common it is for processors to work that way, though...
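To illustrate the "same branch" point (again my own sketch, not from
the post): if the body contains a conditional, 4x unrolling turns that
one branch instruction into four copies at four distinct addresses, so
an address-indexed predictor must learn each copy separately even
though they follow the same pattern.

```c
#include <stddef.h>

/* Sum only the positive elements. After 4x unrolling, the single
   data-dependent branch in the body becomes four branch instructions
   at four different addresses; an address-indexed predictor treats
   them as four unrelated branches. */
static long sum_positive_unrolled(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        if (a[i]     > 0) s += a[i];      /* branch copy 1 */
        if (a[i + 1] > 0) s += a[i + 1];  /* branch copy 2 */
        if (a[i + 2] > 0) s += a[i + 2];  /* branch copy 3 */
        if (a[i + 3] > 0) s += a[i + 3];  /* branch copy 4 */
    }
    for (; i < n; i++)                    /* remainder loop */
        if (a[i] > 0) s += a[i];
    return s;
}
```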


-- end of branch prediction piece --


Probably, the bigger the I-cache is, the easier it becomes to keep the
unrolled loop inside it.
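One way a compiler heuristic might encode that rule of thumb (a rough
sketch under my own assumptions; the function name, parameters, and
sizes are illustrative, not from the post):

```c
/* Hypothetical heuristic: cap the unroll factor so the unrolled loop
   (factor * body_bytes + overhead_bytes of loop control) still fits
   inside the I-cache. Returns at least 1, i.e. "don't unroll". */
static unsigned max_unroll(unsigned body_bytes, unsigned overhead_bytes,
                           unsigned icache_bytes)
{
    if (body_bytes == 0 || body_bytes + overhead_bytes > icache_bytes)
        return 1;  /* body alone overflows the cache: never unroll */
    return (icache_bytes - overhead_bytes) / body_bytes;
}
```

For example, a 64-byte body with 16 bytes of loop control in a 4 KB
I-cache would allow unrolling up to 63 times by this estimate, while a
body larger than the cache would not be unrolled at all.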

