Re: Books and other things which cost money to improve performance

George Neuner <>
Tue, 06 Jul 2010 15:45:52 -0400

          From comp.compilers

Related articles
Books and other things which cost money to improve performance (Colin Paul Gloster) (2010-07-05)
Re: Books and other things which cost money to improve performance (Hans-Peter Diettrich) (2010-07-06)
Re: Books and other things which cost money to improve performance (George Neuner) (2010-07-06)
Re: Books and other things which cost money to improve performance (Paul Colin Gloster) (2010-07-09)
Re: Books and other things which cost money to improve performance (Matthias-Christian Ott) (2010-07-10)
Re: Books and other things which cost money to improve performance (George Neuner) (2010-07-11)
Re: Books and other things which cost money to improve performance (Paul Colin Gloster) (2010-08-31)

From: George Neuner <>
Newsgroups: comp.compilers
Date: Tue, 06 Jul 2010 15:45:52 -0400
Organization: A noiseless patient Spider
References: 10-07-006
Keywords: books, optimize
Posted-Date: 07 Jul 2010 22:55:45 EDT

On Mon, 5 Jul 2010 12:42:21 -0400, Colin Paul Gloster
<> wrote:

>It has been warned many times in books such as by Appel and the red
>dragon book that compilers need to be conservative. If I can get more
>ideas on what might be hindering compilers from applying optimizations,
>then I could rewrite the code to make the work easier for the
>compilers, or to manually perform such optimizations. After all, I do
>not need to be conservative. I do not even need to preserve the
>semantics of the program: if I find that I can get similar semantics
>which is just as correct but results in faster executables, then I

The biggest problems for compilers are indirect data access
(pointers), object aliasing (pointers again), indirect function calls
(yet more pointers), OO and flexible data structures (did I mention
pointers?) and, in general, any data structure with an irregular
access pattern.
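
As a concrete (and purely illustrative) sketch of the aliasing
problem - the function names and the 'restrict' fix are mine, not
anything from the original code:

```c
#include <stddef.h>

/* Because dst and src may alias, any store to dst[i] could change
   *src, so the compiler must re-read *src on every iteration. */
void scale(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = *src * 2.0;            /* *src reloaded each pass */
}

/* 'restrict' (C99) promises there is no overlap, so the load of
   *src can be hoisted out of the loop and the loop vectorized. */
void scale_noalias(double * restrict dst, const double * restrict src,
                   size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = *src * 2.0;            /* *src loaded once */
}
```

Passing the same object through both restrict-qualified pointers is
undefined behavior - which is exactly the promise that buys the
optimization.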

The biggest problem for developers on modern OSes is the failure to
recognize the impact of virtual memory on large data sets.

Tricky coding and trying to outguess the compiler can easily result in
_worse_ performance. Better algorithms, clean simple code, making
sure data is properly aligned and reorganizing data to better match
VMM paging behavior (tiling, etc.) buys you much more than playing at
being a compiler.
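
For instance (a toy sketch, sizes picked only for illustration), a
tiled transpose touches each cache line and page a few times in a row
instead of revisiting it N times:

```c
#include <stddef.h>

enum { N = 64, TILE = 16 };     /* illustrative sizes, not tuned */

/* Naive transpose: 'out' is walked down its columns, so every store
   lands in a different row - and, for large N, a different page. */
void transpose_naive(const double *in, double *out)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            out[j * N + i] = in[i * N + j];
}

/* Tiled transpose: work proceeds in TILE x TILE blocks small enough
   to stay cache- and page-resident. */
void transpose_tiled(const double *in, double *out)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    out[j * N + i] = in[i * N + j];
}
```

Both produce identical results; only the order in which memory is
touched differs, and that order is what the VMM and cache see.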

It is true that the compiler must be conservative ... it is not
supposed to alter the program's semantics (modulo compile "modes" and
conditional compilation) ... but it is also true that good compilers
are smarter and more observant than all but the very best developers.
If a good compiler passes up an opportunity to apply some meaningful
optimization it is *usually* because it has seen a problem the
developer has overlooked.

>Various tricks in tiger books; the red dragon book; and "Computer
>Architecture: A Quantitative Approach" (or superior variants of those
>tricks as appeared in the books "More Tricks of the Game Programming
>Gurus" and the excellent "Black Art of Macintosh Game Programming" by
>Kevin Tieskoetter) have helped me to speed up an as-yet unreleased
>version of the Monte Carlo particle simulator Cosima (
>WWW.MPE.MPG.De/MEGA/megalib.html ).

That's great. But keep in mind that *many* of those "tricks" are
special purpose and/or architecture specific ... moving up or down one
generation in CPU or changing the number of cores may invalidate
everything and result in a net loss of performance.

Low level architecture details become important in automatic data
layout (vs. programmer specified layout) and in code generation, but
as much as possible, compilers try to stick to architecture neutral
optimizations that are proven to generally improve code. Things like:
  - precomputing values at compile time,
  - simplifying math,
  - reordering steps to reduce the number of intermediate values,
  - eliminating redundant or unnecessary calculations,
  - reducing passes over the same data (loop fusion),
  - increasing (to a point) loop parallelism (loop unrolling),
  - inlining leaf functions.
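
Most of those can also be applied by hand when a compiler balks.
Loop fusion, for example (a minimal sketch, names mine):

```c
#include <stddef.h>

/* Two passes over 'a': the second loop re-reads data that may long
   since have been evicted from cache. */
void unfused(const float *a, float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2.0f;
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Fused: one pass, a[i] is read once and b[i] is still in a register
   when c[i] needs it. */
void fused(const float *a, float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] + b[i];
    }
}
```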

CPU specific optimizations apart from data alignment are much more
difficult to analyze for value and overall are worth less than a more
general code optimization in the face of multi-tasking. For example,
it is feasible to statically determine optimal data placements so as
to minimize direct mapped cache conflicts ... but there's little point
to doing this complicated analysis if the cache is expected to be
polluted by unrelated programs.
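
To see what that placement analysis is computing, here is a sketch
for a hypothetical cache geometry (32 KiB direct-mapped, 64-byte
lines - numbers chosen only to make the arithmetic concrete):

```c
#include <stdint.h>

enum { LINE = 64, CACHE = 32768, SETS = CACHE / LINE };

/* Which direct-mapped set a given address falls in. */
unsigned set_index(const void *p)
{
    return (unsigned)(((uintptr_t)p / LINE) % SETS);
}

/* Two blocks whose addresses differ by a multiple of CACHE map to
   the same sets and evict each other; padding by a single line
   breaks the conflict.  (And this is exactly the effort that is
   wasted if unrelated programs pollute the cache anyway.) */
static char buf[2 * CACHE + LINE];
```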

>Using C++ was really not a good idea. Expect to see a version with
>a lot more Ada on the Internet later this year.

YMMV but I think you are naive to expect a whole lot of performance
improvement to come simply from switching to Ada. Once you step
outside the restricted real-time subset, Ada and C++ provide equal
opportunity for a programmer to derail potential optimizations and
produce slow, bloated code. Ada's stricter type system *occasionally*
can provide an opportunity for increased optimization vs. C++, but
such opportunities are relatively rare and are almost always
architecture specific.

>We unfortunately target x86s, but we will consider targeting other
>types of architectures such as FPGAs; GPUs; and supercomputers. As we
>already have x86s and might adopt something much faster, I do not
>expect that we shall target RISC machines though they probably would
>had been better to start with than x86s.

There is little internal difference between a modern RISC and a modern
x86 - the x86 has too few programmer visible register names and a
complex instruction set, but behind the instruction decoder it is an
out-of-order superscalar RISC with hundreds of rename registers.

In actual fact, the densely packed complex instruction set of the x86
helps reduce memory pressure vs. the fixed width instructions of a
traditional RISC (not that any modern RISCs are "traditional" - they
all sport variable length instructions now).

There is an ongoing thread in comp.arch talking about this right now.

The one place that x86 compilers routinely do stumble is in stack
handling. The x86 prefers certain data alignment but none is actually
*required* (even for SSE) ... and it is quite hard to guarantee that a
value on the stack will always be properly aligned for most efficient
handling - particularly when some performance obsessed developer has
switched off stack framing and function preambles and makes
(in)judicious use of alloca() (or local equivalent).
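
The usual escape hatch is to request the alignment explicitly rather
than trust the frame layout - a C11 sketch (names mine):

```c
#include <stdalign.h>
#include <stdint.h>

int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* alignas pins the local to a 16-byte boundary no matter how the
   compiler has packed the rest of the frame, so SSE code can use
   the aligned load/store forms on it. */
int aligned_local_demo(void)
{
    alignas(16) float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    return is_aligned16(v);
}
```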

GPUs and separate memory multi-computers are a different ball game.
They require very different algorithms from general shared memory
machines. GPUs themselves are not all alike - they all try to exploit
massive data parallelism but their particular ISAs may be focused on
either SIMD or MIMD style processing or, in some cases, may be able to
go either way.

FPGAs implement algorithms directly in hardware (more or less - many
now have scratchpad RAM so you don't have to implement register level
logic) and are, in general, unlike any kind of CPU programming.

>Feedback [on] the following books ...

I haven't read many of these, but if you've studied compiler basics
already, then you can safely ignore any intro books (if there are
chapters on parsing, you can pretty much forget about it).

For optimization purposes you need to concentrate on various forms of
dependence analysis. It's important to know what the compiler looks
for - and what it does *not* want to find - when it tries to optimize

By far, the bulk of research on performance optimization has been done
for Fortran. You'll need to be able to read Fortran and transliterate
Fortran examples into your preferred implementation language(s).
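
The distinction dependence analysis draws shows up even in toy loops
(sketched in C rather than Fortran):

```c
#include <stddef.h>

/* Loop-carried dependence: iteration i reads what iteration i-1
   wrote, so the iterations cannot be reordered or run in parallel
   as written. */
void prefix_sum(double *s, const double *a, size_t n)
{
    s[0] = a[0];
    for (size_t i = 1; i < n; i++)
        s[i] = s[i - 1] + a[i];
}

/* No carried dependence: every iteration is independent, and the
   compiler is free to unroll, vectorize, or split it across cores. */
void double_all(double *b, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
}
```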

>John M. Levesque & Joel W. Williamson,
>"A Guidebook to Fortran on Supercomputers",
Don't need this unless you're actually writing Fortran.

>Utpal Banerjee,
>"Loop Parallelization"

>Steven A. Przybylski,
>Morgan Kaufmann
This will just confuse you.

>Keith D. Cooper & Linda Torczon,
>"Engineering a Compiler"
This is my favorite intro book.

>Randy Allen & Ken Kennedy,
>"Optimizing Compilers for Modern Architectures:
>A Dependence-based Approach"

>Utpal Banerjee,
>"Dependence Analysis for Supercomputing",

>Vivek Sarkar,
>"Partitioning and Scheduling Parallel Programs for Multiprocessors",
If you're going to program separate memory multi-computers.

>Richard Gerber & Aart J. C. Bik & Kevin Smith & Xinmin Tian,
>"The Software Optimization Cookbook High Performance Recipes for IA 32
>Richard Gerber,
>"The Software Optimization Cookbook: High-Performance Recipes for the Intel
Both worthwhile.

>Monica S. Lam,
>"A Systolic Array Optimizing Compiler"
Interesting but irrelevant unless you actually have a systolic array.
SAs are essentially a fixed pipeline of (relatively) simple processing
steps that together perform some complex computation on streaming
data. They are almost always dedicated, single purpose machines and
are not very common.

>Alfred V. Aho & Monica S. Lam & Ravi Sethi & Jeffrey D. Ullman,
>"Compilers: Principles, Techniques, & Tools",
>second edition, the first printing of the second edition is not acceptable
Another good intro book.

>stuff by Jon Bentley such as
> :
A lot of Bentley's stuff is architecture specific.

>J. C. Beatty
>A Global Register Assignment Algorithm

>M. Schaefer
>A Mathematical Theory of Global Program Optimization
Interesting, but will likely cramp your brain.

>Peter Lee,
Mostly irrelevant. The papers on type recovery analysis might be
worth a read.

>John R. Ellis,
>"Bulldog: A Compiler for VLIW Architectures",
Interesting, but focused on low level instruction selection which is
way below the level of optimizations you want to target.

>Leland Beck,
>"System Software: An Introduction to Systems Programming",
Haven't read this one, but Beck's stuff tends to be good.

>Alpha Architecture Handbook,
The Alpha is dead.

>Pappas & Murray,
>"80386 Microprocessor Handbook",
Way out of date. The 486 is still used in embedded systems, but it
has a different internal architecture and optimization for 486 is
quite different than for 386. The 486 and original Pentium have much
more in common internally than do the 386 and 486 (selecting
Pentium/P5 as the target is a cheap way to make some 486 programs run
faster).

Get the Intel and AMD developer manuals from their web sites instead.

>Seyed H. Roosta,
>"Parallel Processing and Parallel Algorithms, Theory and Computation",
Worthwhile. Also see:
F. Thomas Leighton, "Introduction to Parallel Algorithms and
Architectures: Arrays, Trees, Hypercubes", Morgan Kaufman, 1992.

>"Games Programming Gems 4", maybe other books in the series
The "Programming Gems" books, in general, are worth reading ... but
keep in mind that many of their tricks are architecture and
program/purpose specific rather than general principles to follow.

>Thomas Pittman,
>"Practical Code Optimization by Transformational Attribute Grammars Applied
>to Low-Level Intermediate Code Trees",
Interesting but irrelevant.

>Kevin Dowd,
>"High Performance Computing",
>O'Reilly & Associates,
>we own a copy of this but I have not had time to look at it yet: if it
>is worthwhile then we will probably want to buy another copy

>"Data-Parallel Programming on MIMD Computers"
>The MIT Press,
>we own a copy of this but I have not had time to look at it yet: if it
>is worthwhile then we will probably want to buy another copy
Haven't read this one but it might be worthwhile.

>Appel cited a book edited by Sites, R.L. supposedly called
>"Appendix A: Software Considerations" but it seems to be
Alpha is dead.

Many of the others you listed (did you cut-n-paste a biblio search?)
are too old, too basic, too general, or just irrelevant to your stated
goals.

Unless you have a very specific task that needs a special purpose
architectural tweak to achieve your performance goal, you will want to
stay with more general knowledge of what compilers are looking for.
Follow the path of dependence analysis and you won't go wrong.

>Yours faithfully,
>Paul Colin Gloster

Hope this helps.
