Re: Optimizations to improve icache hit rate (Anton Ertl)
29 Apr 2004 12:00:05 -0400

          From comp.compilers

Related articles
Optimizations to improve icache hit rate (Jeff Kenton) (2004-04-21)
Re: Optimizations to improve icache hit rate (A Pietu Pohjalainen) (2004-04-28)
Re: Optimizations to improve icache hit rate (2004-04-29)
| List of all articles for this month |

From: (Anton Ertl)
Newsgroups: comp.compilers
Date: 29 Apr 2004 12:00:05 -0400
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 04-04-068
Keywords: design, optimize
Posted-Date: 29 Apr 2004 12:00:05 EDT

Jeff Kenton <> writes:
>I had a discussion during an interview about optimizations designed to
>decrease icache misses. It seems to me that code re-arrangement will
>give small benefits at most, and that general optimizations,
>especially those that decrease overall code size, will be more
>productive in the end.


- Mueller and Whalley [mueller&whalley92] found that an optimization
that increases code size by 50%, but increases spatial locality,
reduces the number of I-cache misses even for very small I-caches
(IIRC they simulated 2KB caches).

- We tried to use partial replication to reduce the code size and the
I-cache misses resulting from full replication [ertl&gregg03] without
losing most of its branch prediction accuracy advantage. We found
that the code size was reduced significantly, but the number of
I-cache misses increased, again due to worse spatial locality.

    author = "Frank Mueller and David B. Whalley",
    title = "Avoiding Unconditional Jumps by Code Replication",
    booktitle = "{SIGPLAN} '92 Conference on Programming Language
                                  Design and Implementation",
    year = "1992",
    pages = "322--330",
    annote = "Nearly all unconditional jumps can be eliminated by
                                  code replication. This expands compiled C code by about
                                  50\%, but reduces the number of executed instructions
                                  and the cache work (miss ratio increases slightly).
                                  There are some nontrivial problems in this technique,
                                  which are solved heuristically in this paper."

    author = "M. Anton Ertl and David Gregg",
    title = "Optimizing Indirect Branch Prediction Accuracy in
                                  Virtual Machine Interpreters",
    booktitle = "{SIGPLAN} '03 Conference on Programming Language
                                  Design and Implementation",
    year = "2003",
    URL = "",
    abstract = "Interpreters designed for efficiency execute a huge
                                  number of indirect branches and can spend more than
                                  half of the execution time in indirect branch
                                  mispredictions. Branch target buffers are the best
                                  widely available\mn{on all recent general-purpose
                                  machines?} form of indirect branch prediction; however,
                                  their prediction accuracy for existing interpretes is
                                  only 2\%--50\%. In this paper we investigate two
                                  methods for improving the prediction accuracy of BTBs
                                  for interpreters: replicating virtual machine (VM)
                                  instructions and combining sequences of VM instructions
                                  into superinstructions. We investigate static
                                  (interpreter build-time) and dynamic (interpreter
                                  run-time) variants of these techniques and compare them
                                  and several combinations of these techniques. These
                                  techniques can eliminate nearly all of the dispatch
                                  branch mispredictions, and have other benefits,
                                  resulting in speedups by a factor of up to 3.17 over
                                  efficient threaded-code interpreters, and speedups by a
                                  factor of up to 1.3 over techniques relying on
                                  superinstructions alone."

- anton
M. Anton Ertl

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.