From: Mayan Moudgill <email@example.com>
Date: Sat, 05 Apr 2008 21:35:56 -0400
Posted-Date: 06 Apr 2008 01:18:51 EDT
Tim Frink wrote:
> Maybe you have some ideas how to cope with this problem:
It looks like you have a multi-level memory hierarchy:
1. a single 8-byte aligned "cache"
2. an I-cache (of unspecified line-size/set-size/associativity)
3. a fast memory (of some unspecified size)
4. a slow memory
I'm assuming you have access to branch-taken and execution-frequency
information, either through profiling or computed synthetically.
To exploit 1., clearly all 8-byte (or smaller) loops should be arranged
so that the instructions do not cross the 8-byte boundary. Also, all
forward branches should be "switched" so that the fall through path is
the more common one.
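The branch-switching step can be sketched roughly as follows; this is just an illustration, and the names (Branch, taken_count, the NEGATE table) are hypothetical, not from any real compiler:

```python
# Sketch: flip a forward conditional branch so that the more frequently
# executed edge becomes the fall-through path. When the branch is
# inverted, the branch target and the fall-through successor would be
# swapped as well (not shown here).
from dataclasses import dataclass

@dataclass
class Branch:
    condition: str        # e.g. "eq", "ne", "lt", ...
    taken_count: int      # profile: times the branch was taken
    not_taken_count: int  # profile: times it fell through

NEGATE = {"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt",
          "gt": "le", "le": "gt"}

def normalize(br: Branch) -> Branch:
    """If the taken path is the common one, invert the condition so the
    common path falls through."""
    if br.taken_count > br.not_taken_count:
        return Branch(NEGATE[br.condition],
                      br.not_taken_count, br.taken_count)
    return br
```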
To exploit 2., you need to optimize memory placement so that you
minimize collision-based spilling. There is a fair bit of literature
on this.
To exploit 3., you need to place the most frequently fetched locations in
the fast memory. Let's ignore collisions in the I-cache (e.g. assume a
fully associative I-cache). In that case, we're generally talking about
capacity misses. In general, that implies that entire basic blocks will
need to be reloaded into the cache.
If a basic block is moved from slow memory to fast memory, then it may
be necessary to add extra branch instructions to stitch together an
unmoved successor/predecessor to the moved block.
There can be only one predecessor of interest (i.e. only one predecessor
can fall through into the moved block; all other blocks will already
have a branch instruction that branches to the block). If it is not also
moved into fast memory, then it will need to have a branch added, which
will then possibly perturb the alignment of the code.
For successors, if we ignore indirect branches, there are three cases:
- an unconditional branch,
- a conditional branch,
- a fall-through.
Obviously, the unconditional branch already has a branch, so there is
nothing to add.
The fall-through case can probably be ignored, since that successor
basic block will most likely have a higher priority to be moved into
the fast memory anyway.
The only other case to consider is the conditional branch; there, it may
be necessary to introduce an unconditional branch that goes to the
original fall-through basic block, which will then perturb the alignment
and cache miss pattern of the code.
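The successor cases boil down to a simple decision; here is a minimal sketch, with hypothetical names, of when a stitch branch is needed after moving a block:

```python
# Decide whether a moved block needs a stitch branch appended, based on
# how it ends and whether its fall-through successor moved with it.
def needs_stitch(block_end_kind: str, successor_moved_too: bool) -> bool:
    """block_end_kind: 'uncond' (ends in an unconditional branch),
    'cond' (conditional branch, so it also has a fall-through edge),
    or 'fallthrough' (no terminating branch at all)."""
    if block_end_kind == "uncond":
        return False                 # already has an explicit branch
    return not successor_moved_too   # fall-through edge would be broken
```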
It looks like a quick and dirty algorithm would be to add basic blocks
to fast memory, ranked by number of reloads, until the fast memory is
full. Align blocks so that 8-byte loops are within the same line. Lay
out blocks so that the most probable blocks fall through. Then add
stitch branches. If adding a stitch branch would cross an 8-byte
boundary, add enough nops so that you add exactly 8 bytes. This may
make some blocks no longer copyable to fast memory. Fix up and try
again, until it converges.
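The greedy selection step of that algorithm might look like the sketch below. All the data structures are assumptions (a real implementation would work on the compiler's CFG), and the fixup-and-retry convergence loop is omitted; only the reload-ranked packing with a stitch-branch cost is shown:

```python
# Greedily pick blocks for fast memory by reload count, charging an
# extra stitch branch to any block whose conditional fall-through edge
# would be broken by the move.
from dataclasses import dataclass

BRANCH_SIZE = 4  # assumed size of a stitch branch, in bytes

@dataclass
class BasicBlock:
    name: str
    size: int          # bytes
    reloads: int       # estimated cache reloads (profile-derived)
    ends_in_cond: bool # conditional branch -> may need a stitch branch

def place_in_fast_memory(blocks, capacity):
    """Return (names of chosen blocks, bytes used)."""
    chosen, used = [], 0
    for b in sorted(blocks, key=lambda b: b.reloads, reverse=True):
        cost = b.size + (BRANCH_SIZE if b.ends_in_cond else 0)
        if used + cost <= capacity:
            chosen.append(b.name)
            used += cost
    return chosen, used
```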