Summary: "Optimization opportunities unexploited: examples sought"

pshuang@athena.mit.edu (Ping Huang)
Tue, 16 May 1995 02:19:40 GMT

          From comp.compilers

Related articles
Summary: "Optimization opportunities unexploited: examples sought" pshuang@athena.mit.edu (1995-05-16)
| List of all articles for this month |

Newsgroups: comp.compilers,comp.arch.arithmetic
From: pshuang@athena.mit.edu (Ping Huang)
Followup-To: poster
Keywords: optimize, summary
Organization: Massachusetts Institute of Technology
Date: Tue, 16 May 1995 02:19:40 GMT

[This is the "summary" which I promised when I posted my question a
week ago. This article should be in RFC1153-compliant digest format.]


Due to time constraints, I cannot truly summarize all the responses I
received --- so I'll have to settle for preparing a digest of replies,
somewhat ordered by the type of information provided. Overall, I got
lots of reasonably helpful pointers to published research papers about
using profiling feedback for various languages, and lots of comments
that various vendor compilers currently do make *SOME* use of
profiling feedback for certain kinds of optimizations. By comparison,
however, I didn't get very many actual code examples. If any of you
reading this now would like to contribute additional real-world
examples, I'm still listening. :)


I'd like to thank all these respondents (listed in reverse
alphabetical order by email address).


        o stevez@code3.code3.com (Steve Zimmerman)
        o sethml@ugcs.caltech.edu (Seth LaForge)
        o sdl@austin.ibm.com (sdl)
        o schooler@apollo.hp.com
        o pshuang@athena.mit.edu (Ping Huang)
        o przemek@rrdjazz.nist.gov (Przemek Klosowski)
        o pardo@cs.washington.edu
        o paik@pvision5.eng.pko.dec.com (Samuel S Paik)
        o minton@ptolemy-ethernet.arc.nasa.gov (Steve Minton)
        o mike@raederle (Michael Wojcik)
        o metzger@bach.convex.com (Robert Metzger)
        o metzemakers@labri.u-bordeaux.fr (Timo Metzemakers)
        o lkaplan@BIX.com
        o jecel@lsi.usp.br ("Jecel Mattos de Assumpcao Jr.")
        o jdean@cs.washington.edu (Jeffrey Dean)
        o jbuck@Synopsys.COM (Joe Buck)
        o Jay.Sipelstein@cs.cmu.edu
        o guiguer@cgrnserc.uwaterloo.ca
        o greve@rs1.thch.uni-bonn.de (Thomas Greve)
        o gilbertd@cs.man.ac.uk (David Alan Gilbert)
        o gilbert@marin.cc.ca.us (Tim Gilbert)
        o ganswijk@xs4all.nl (Jaap van Ganswijk)
        o funcsoft@cix.compulink.co.uk (Mike Ruddy)
        o fhart@atlanta.twr.com (C. Frederick Hart)
        o djb@silverton.berkeley.edu ("D. J. Bernstein")
        o ddunn@netcom.com (David A Dunn)
        o bnm@central.bbt.com (Brian N. Miller)
        o bagwill@kangaroo.ncsl.nist.gov (Robert Bagwill)
        o Philipp.Hoschka@sophia.inria.fr (Philipp Hoschka)


Note that I have trimmed quotations as much as possible, and removed
most mail signatures, following the convention of using "[....]" to
indicate where significant text has been elided. Also, only the
original responses from these people are included. I have started up
email conversations with several of them, but I feel that it would be
inappropriate to include further correspondence in a public posting;
I had made only clear that (initial) responses would be summarized.


----------------------------------------------------------------------


From: pshuang@athena.mit.edu (Ping Huang)
Subject: Optimization opportunities unexploited: examples sought


[Posted-Date: Mon, 8 May 1995; various times]
[Posted-To: comp.lang.c,comp.lang.c++,comp.lang.misc]
[Posted-To: comp.compilers,comp.arch.arithmetic]
[Posted-To: comp.unix.programmer,comp.os.msdos.programmer,
                        comp.os.ms-windows.programmer.misc,comp.os.os2.programmer.misc,
                        comp.sys.mac.programmer.misc]


[....]


I am working on a thesis based on the idea is that it is worthwhile to
feed profiler output (from actual runs of the program by the real
end-users using real data on real machines) back to the COMPILER and
not just the human programmer... especially if the human programmer
used annotations in the source code to give hints to the compiler as
to the kinds of more drastic and aggessive program transformations it
can attempt to perform.


I am seeking real-world examples where this would have been useful.
Ideally, I'd like to hear both the application context and actual
(short) snippets of code illustrating the point. (C/C++ code
preferred but not necessary... I'm asking for snippets of code for
concreteness.) Also, if you know of existing run-time systems,
libraries, compilers, etc., which already do this kind of automatic
adaptation in an interesting fashion, please let me know about them.


----------------


This may seem a bit vague, so I've given some (perhaps contrived)
examples below. I don't want to overly bias the kinds of examples I
might get back from people by the choice of these examples --- I'm
interested in hearing about all sorts of cases where if the compiler
could just know how long the resulting program *ACTUALLY* takes to
run, it could make better decisions during a recompile without having
to involve a human programmer at that point in the program lifecycle.
In other words, scenarios where the compiler could have generated
better-performing code if it could have conducted experiments with the
actual compiled code on the destination platform.


  * On large data sets, common for scientific and engineering
      simulations, for example, it's pretty important to size the working
      set of data being accessed at any given time to fit into main
      memory and in the cache. The larger the "blocking size", however,
      the better the performance, at least until the working set gets
      *TOO* large. Profiling data would allow the compiler to
      dynamically experiment with the blocking size and data alignments
      to minimize cache misses and virtual memory thrashing for both
      high-end and low-end machines of a given binary-compatible
      architecture line. While it is possible to try to model such
      effects, the large variety of cache and memory organizations out
      there make analysis difficult.


  * Quicksort (implemented by qsort() in the C library) performs very
      well on average. However, other sorts sometimes perform better; in
      particular, Quicksort performs poorly on almost-sorted arrays (and
      even randomized Quicksort has higher overhead than other sorts for
      almost-sorted arrays). While in this particular instance, it would
      not be difficult to write qsort() to check its input data to see if
      it's already almost sorted and not use Quicksort if that is the
      case, that check would itself be not computationally cheap;
      furthermore, in general, it would not be easy to even see how to
      write a test to decide which implementation would run likely
      faster, much less write such a test which would be computationally
      cheap. A compiler could experiment with linking in such an
      alternative qsort() implementation, and notice that application A
      wins, while application B loses.


  * Performance of programs which use mathematical functions which are
      expensive to compute can often be improved by memoization, where
      the function remembers the results for arguments which it has
      recently been called with. However, memoization isn't free: it has
      space costs (past arguments and results need to be cached somewhere
      in memory) and time costs (every call to the function now checks if
      the result is already known... the more past calls are "memoized",
      the longer this check is likely to take). A compiler could, with
      programmer help, generate versions of functions which do or do not
      do memoization, and use the appropriate one for applications which
      do often call the function with the same arguments vs. applications
      which do not.


  * Cross-platform variance of performance of algorithms. I can't
      come up with an example off the top of my head, but I'd really love
      to see something along this line.


  * Cross-user variance of how programs are used. A very vague example
      might be a novice who uses very simple database queries vs. someone
      who almost always constructs very complex precise queries.


[....]


------------------------------


Date: Mon, 8 May 1995 22:58:52 -0400
From: guiguer@cgrnserc.uwaterloo.ca


[....]


this is a neat idea. I hope my example is of help.


[....]


As you probably know, in most C++ compilers operator new is almost
always implemented using C's malloc() function. Malloc(), however
usually uses a nice, slow linear search to find a block of memory for
an object. Creating lots of small objects of a fixed size or many
temporaries can make a C++ application crawl becuase calls to malloc
are expensive. The usual way around this, is to override the global
new operator with a class specific one the allocates from a local pool
of memory. A tool that hinted to me when this would be beneficial to
do would be extremely useful. One that added the code to my class
would be even better :)


If you think this qualifies, I'll post back with some example code if
needed, but I'm a bit tired right now to pull up an example.


Mike Johnson


------------------------------


Date: Tue, 9 May 1995 08:13:15 -0600
From: stevez@code3.code3.com (Steve Zimmerman)


Here's an example I ran into that may or may not be useful...


We have an application that removes certain unwanted bytes from a large
buffer by essentially sliding the contents of the buffer to the left
using a strcpy, like so:


while (stuff)
{
temp = strchr(ptr, TERMINATE_CHAR);
block_size = temp-ptr;
strncpy(ptr, ptr+1, block_size);
other stuff;
}


Since the algorithmn was time intensive, we assumed that using strncpy
(or memmove) would be fast since it would take advantage of multi-byte
move intructions available on the x86 architecture.


However, we found that using something like this


while (stuff)
{
while(*ptr1 != TERMINATE_CHAR)
*ptr1++ = *ptr2++;
other stuff;
}


was actually significantly faster! In fact, in our algorithmn, it was
faster for all values of block_size < ~20000.


While this result may or may not be suprising, it seems that an intelligent
compiler might choose the single-byte move algorithmn (with less overhead)
over a call to a function that handles a multiple-byte move (with more
overhead) if it knows how big block_size will be. If the programmer could
somehow inform the compiler that block_size is always less than 20000,
the compiler would be able to adjust the implementation accordingly.


[....]


------------------------------


Date: Tue, 9 May 1995 18:02:39 +0200
From: greve@rs1.thch.uni-bonn.de (Thomas Greve)


[...]


I have written a simulation Program which excessively uses matrix
multiplications. I have tried different algorithms (which have proven
to be efficient with other C++ compilers on other platforms), but the
opimizer of the compiler for OS/2 has been the bottleneck (and, yes,
the brain damaged structure of PC processors). Nevertheless it was
possible to increase the performance with a few dozen lines of
assembly code -- but this is usually not the way to go in a C++
program. I have analyzed this problem in detail, let me know if you
are interested.


------------------------------


Date: Thu, 11 May 1995 13:56:46 -0700
From: Joe Buck <jbuck@Synopsys.COM>


[....]


In C++, one common case is an inline virtual function that could
be overriden but rarely is. E.g.


class Base {
private:
int data;
public:
virtual int func() { return data;}
...
};


On seeing


void foo(Base& obj) {
... obj.func();
}


Code like


(obj.__vtable == __Base__vtable ? data : <virtual function call>)


may be much more efficient than an unconditional call, but only if
it is more common for the function not to be overriden.


[....]


------------------------------


Date: Sun, 14 May 1995 21:23:33 -0400
From: Samuel S Paik <paik@pvision5.eng.pko.dec.com>


In article 95-05-061 you write:
>I am working on a thesis based on the idea is that it is worthwhile to
>feed profiler output (from actual runs of the program by the real
>end-users using real data on real machines) back to the COMPILER and
>not just the human programmer...


This part is not new, and is an actively pursued topic, both in
production and research systems.


Profile generated feedback has been used for, minimizing i-cache
conflicts and misses, minimizing d-cache conflicts and misses,
predicting branches, selecting traces, global register allocation,
selecting functions for inlining, to name a few I can recall seeing.


>I am seeking real-world examples where this would have been useful.


Lets take a few of the above. Given the right kind of profile
information, you can place functions that are called temporally close
together so they don't interfere in the cache. I belive this is the
primary optimization that the MIPS C compiler does with profile
feedback.


You can try to place static data structures that are likely to be used
so they don't interfere. You can reorder data structures to remove
false invalidations (just remembered this one) in coherent caches
multi-processors.


Or here is a concrete example. Quite often, a branch (in an if
statement) has skewed probabilities. Given basic block profiling
data, you can reorganize the code to minimize pipeline bubbles, e.g.
on an 21064 (Alpha), forward jumps are predicted not taken initially,
so you want to place the likely branch of an if/else on the
fallthrough path, and place the unlikely branch out of line. So, if
your if is guarding an error handler, you'll want the test to branch
out of line, so that the fallthrough continues on, while error
conditions jump out.


              .
              .
      if (will_crash) { -> cmp will_crash, 0
            ask_for_help(); bne if_true
            recover(); if_return:
      } .
              . .
              . /* after function */
              . if_true:
              . ask_for_help...
                                                                  recover...
                                                                  jmp if_return


>Ideally, I'd like to hear both the application context and actual
>(short) snippets of code illustrating the point. (C/C++ code
>preferred but not necessary... I'm asking for snippets of code for
>concreteness.) Also, if you know of existing run-time systems,
>libraries, compilers, etc., which already do this kind of automatic
>adaptation in an interesting fashion, please let me know about them.


Besides the MIPS compiler, I think Om does some interesting things,
go and check out ftp://gatekeeper.dec.com/pub/DEC/WRL


[....]


------------------------------


Date: Tue, 9 May 1995 13:59:24 -0700
From: jdean@cs.washington.edu (Jeffrey Dean)


[....]


There's a huge body of work in this area. Just a few examples:


@article{chang:91,
    author = "Pohua P. Chang and Scott A. Mahlke and and Willam Y. Chen Wen-Mei W. Hwu",
    title = "Profile-guided Automatic Inline Expansion for C Programs.",
    journal = "Software Practice and Experience",
    pages = "349-369",
    month = may,
    year = 1992,
    volume = 22,
    number = 5
}
Profile information identifies program hot spots to which aggressive
inlining should be applied.


@article{pettis:pldi90,
    author="Karl Pettis and Robert C. Hansen",
    title="Profile Guided Code Positioning",
    pages="16--27",
    journal=sigplan,
    year=1990,
    month=jun,
    volume=25,
    number=6,
    note=pldi90
}
Profile data to improve the i-cache behavior ofe heavily executed code.


@article{wall:pldi91b,
    author="David W. Wall",
    title="Predicting Program Behavior Using Real or Estimated Profiles",
    pages="59--70",
    journal=sigplan,
    year=1991,
    month=jun,
    volume=26,
    number=6,
    note=pldi91
}
A study of how predictive profile data across different program inputs.


@inproceedings{calder:popl94,
    author="Brad Calder and Dirk Grunwald",
    title="Reducing Indirect Function Call Overhead in {C++} Programs",
    pages="397--408",
    booktitle=popl21,
    address="Portland, Oregon",
    year=1994,
    month=jan
}
Using profile data to improve performance of virtual function calls in C++


@article{holzle:pldi94,
    author="Urs H{\"o}lzle and David Ungar",
    title="Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback",
    pages="326--336",
    journal=sigplan,
    year=1994,
    month=jun,
    volume=29,
    number=6,
    note=pldi94 } Dynamically-dispatched message call sites are
monitored, and then optimized if one or a few receiver classes tend to
dominate at the call site.


@techreport{garrett:profiles,
    author="Charlie Garrett and Jeffrey Dean and David Grove and Craig Chambers",
    title ="Measurement and Application of Dynamic Receiver Class Distributions",
    institution = "University of Washington",
    number = "UW-CS 94-03-05",
    month = mar,
    year = 1994
}


Similar to Holzle & Ungar (above), except the profiling is done
off-line, and the stability of profile data across different program
inputs and across different versions of the program is investigated.
A paper about this will appear in OOPSLA'95, and is available from the
Cecil project WWW page:


    http://www.cs.washington.edu/research/projects/cecil/www/cecil-home.html


I hope this helps. You might just try looking under the keyword
'profile' in your school's library: there's far more work than the few
articles I've cited here. In addition, most good optimizing compilers
support using profile data to help the compiler's optimizations
(primarily to identify frequently executed portions of code).


    -Jeff


[....]


------------------------------


Date: Tue, 9 May 95 17:05:54 PDT
From: gilbert@marin.cc.ca.us (Tim Gilbert)


[....]


You might look into a paper describing the design of the RScheme
compiler. Unfortunately I have forgotten both the author and ftp site
where I found it; I'm pretty sure it wasn't published in any journals.
I think a bit of hunting from the Scheme web page should turn it up;
it is in a directory with a bunch of class notes for a scheme class.
If you're interested and can't find it mail me back and I'll try to
dig it up for you.


The paper describes the implementation of a scheme compiler which
compiles into bytecode. The interesting part is that there are plans
to optimize the bytecodes based on profiling information -- eg, only a
few of the available bytes are used for the basic virtual machine; if
an operation is found to be used enough times, then perhaps a new
instuction can be allocated for its use. So perhaps if the operation
"ADD 3" occurs several times, the next available bytecode can be taken
to mean "ADD3" (as one operation). This makes the code smaller, and
potentially faster.


Hmm... Hope that was clear enough to understand. It's a pretty neat
paper, describing the whole implementation of the compiler and some of
the ideas behind it.


Good luck; let me know if you want it and can't find it... -- Tim


[....]


------------------------------


Date: Wed, 10 May 1995 01:58:40 -0300
From: "Jecel Mattos de Assumpcao Jr." <jecel@lsi.usp.br>


The Self programming language has used type feedback since version
3.0. You can get their papers by ftp to self.stanford.edu or on the
web as http://self.stanford.edu.


I have written a paper about using this to match the compiled code's
parallelism to the runtime environment ( via the web:
http://www.lsi.usp.br/~jecel/jabs1.html ).


If you need more information, feel free to ask.


[....]


------------------------------


Date: Wed, 10 May 1995 13:51:20 +0200
From: Philipp Hoschka <Philipp.Hoschka@sophia.inria.fr>


There is definitely quite a lot of existing work on compilers that
do profile based optimisation. A quick scan of the papers I currently
have on my table:


Pettis, Hansen
"Profile Guided Code Positioning"
ACM Sigplan '90
16-27


addresses optimising the layout of the program code so that it
makes good use of the memory hierarchy (caches).


(Scheifler 1977,
"An Analysis of Inline Substitution for a Structured Programming
Language", CACM 20(9), 647-654)


uses profile information to guide the inlining of function calls.


McFarling, Hennessy
"Reducing the cost of Branches"
13th Annual Symposium on Computer Architecture
396-403


applies this for scheduling code on RISC processors


An implementation of such a compiler is described in:


Chang, Mahlke, Hwu
"Using Profile Information to Assist Classic Code Optimisations"
Software Practice and Experience, 21 (12), 1301-1321


Compilers using this technique I know of are:
- the compiler for the MIPS CPU delivered with DEC-Ultrix and with
    Silicon Graphics machines
- DEC's Alpha compiler


Current research focusses on finding heuristics that allow to
statically predict the most beneficial optimisations at compile
time, i.e. without the need for feeding back profile information
(I have just finished my Ph.D. thesis on using such techniques in
a special-purpose compiler).


[....]


------------------------------


Date: Wed, 10 May 95 18:58:34 +0200
From: metzemakers@labri.u-bordeaux.fr (Timo Metzemakers)


At Bordeaux, we are developing a compiler where profiling information
is used to guide register allocation (i.e. to allow the register
allocator to concentrate on frequently-used code. I can provide more
information if you're interested.


I suppose that you've already got this paper, but here's a reference,
just in case:


@InProceedings{wall:prediction,
    author = "David W. Wall",
    title = "Predicting Program Behavior Using Real or Estimated Profiles",
    series = "SIGPLAN Notices",
    pages = "59--70",
    booktitle = pldi,
    year = 1991,
    file = paper
}


[....]


------------------------------


Date: Thu, 11 May 95 16:41:23 GMT
From: Jay.Sipelstein@cs.cmu.edu


[....]


If you don't already know about them, you should check out the
theses of Dain Samples (Berkeley, 91) and Eric Brewer (MIT, 94).


[....]


------------------------------


Date: Thu, 11 May 95 12:12:23 PDT
From: Steve Minton <minton@ptolemy-ethernet.arc.nasa.gov>


Ping,
      In regard to your post on comp.compilers, let me refer to you the following
papers I've written:


``Integrating Heuristics for Constraint Satisfaction
Problems: A Case Study'', Proceedings of the National Conference on
Artificial Intelligence, 1993.


``Using Machine Learning to Synthesize
Search Programs'', Proceedings of the Knowledge-Based Software
Engineering Conference, 1994.


I've been working on an application generator for search programs for
several years (essentially, a fancy compiler), and the papers referred
to above are the early papers on this project. This project is more
blue-sky than the work you are pursuing, but I think you might find it
interesting nevertheless. The work is certainly related, though it's
not along the lines of traditional compiler technology. (I can email
you the PostScript if you want).


[....]


------------------------------


Date: Fri, 12 May 1995 10:36:21 -0700
From: pardo@cs.washington.edu


I don't have good contributions but if you haven't already, you MUST
read David Wall's paper on profiling.


%A David W. Wall
%T Predicting Program Behavior Using Real or Estimated Profiles
%R WRL Technical Note TN-18
%D December 1990
%I Digital Equipment Corporation, Western Research Laboratory
%W See under `/www.research.digital.com/wrl/people/wall/papers.html'


%A David W. Wall
%T Predicting Program Behavior Using Real or Estimated Profiles
%J ACM SIGPLAN '91 Conference on Programming Language Design and
Implementation; SIGPLAN Notices
%V 26
%N 6
%D June 1991


There's also a modest literature on programs that profile themselves
and recompile themselves on-the-fly to take advantage of profile
information. See e.g.


%A Gilbert Joseph Hansen
%T Adaptive Systems for the Dynamic Run-Time Optimization of Programs
%R Ph.D Thesis
%I Carnegie-Mellon University
%D March 1973
%W Copy from CMU; `Catherine_Copetas@GANDALF.CS.CMU.EDU'


%L HCU
%A Urs Ho\\*:lzle
%A Craig Chambers
%A David Ungar
%T Optimizing Dynamically-Typed Object-Oriented Languages With
Polymorphic Inline Caches
%R Proceedings of the European Conference on Object-Oriented
Programming (ECOOP)
%E P. America
%I Springer-Verlag
%C Geneva, Switzerland
%P 21-38
%K olit self ooplas ecoop91
%D July 1991
%W urs@cs.ucsb.edu


[....]


------------------------------


Date: Tue, 9 May 95 10:33:18 EDT
From: fhart@atlanta.twr.com (C. Frederick Hart)


I believe the C compiler from HP on their Unix machines does what you
are talking about. I don't know much more than that so I have no real
idea exactly what information it collects and uses or what
optimizations it makes.


[....]


------------------------------


Date: Tue, 09 May 1995 00:20:16 -0700
From: Seth LaForge <sethml@ugcs.caltech.edu>


[....]


HP's compilers do this. I've included below the relevant sections of
the cc and ld man pages. The ld man page explains it better than I
could, so take a look. I haven't actually tried these options yet,
but I'll probably try them for a Mandlebrot rendering engine and see
how much it improves; it seems like a good candidate. I'll let you
know how it goes.


I don't see any copyrights on the man pages, but I assume they exist.
I'm not sure how it applies to distributing the things...


Seth


[Note: I believe the amount excerpted here is not unreasonable usage.
However, I do have access to HP-UX machines, and take responsibility
in excerpting the below parts of the man pages. == pshuang]


  cc(1) cc(1)


[....]


+I Instrument the application for profile based
optimization. See ld(1), +P, +df, and +pgm for
more details. This option is incompatible with
-G, -g, +m, +o, -p, -S, and -y.


...


+P Optimize the application based on profile data
found in the database file flow.data, produced
by compilation with +I. See ld(1), +I, +df,
and +pgm for more details. This option is
incompatible with -G, -g, +m, +o, -p, -S, and
-y.


[....]


  ld(1) ld(1)


[....]


            The Series 700/800 linker works in conjunction with the compilers to
            support profile-based optimization: transformations that use profile
            data gathered from previous runs of a program to improve run-time
            performance. Profile data collection and profile-based optimizations
            can be performed within the bodies of procedures compiled with the +I
            option to cc or f77 (see cc(1) and f77(1)). For example, code can be
            repositioned within procedures so that fewer branches are executed and
            better code locality is achieved. The linker can also instrument
            calls between procedures, even if they were not compiled with the +I
            option. After profile data is collected, the program can be relinked
            and the linker will reposition the procedures to further improve code
            locality. See the -I, -P, +df, and +pgm options below. Also see the
            discussion of profile-based optimization in the Programming on HP-UX
            manual.


            Object files that are compiled with the +I or +P option (see cc(1) and
            f77(1)) contain a compiler intermediate representation instead of PA-
            RISC object code. The linker then invokes a PA-RISC code generator on
            these files before linking them into the executable file. Because of
            this, linking object files compiled with +I or +P may take
            significantly more time than linking ordinary object files. The
            object files that are generated by compiling the intermediate code are
            written to the directory specified by the TMPDIR environment variable,
            or /tmp if TMPDIR is not specified. The object files are deleted
            before the linker exits.


[....]


------------------------------


Date: Thu, 11 May 1995 21:00:29 -0700
From: ddunn@netcom.com (David A Dunn)


[....]


Check out the HP compiler. Not the free one, but the full fledged C
or Fortran compilers. We support whats called PBO (profile based
optimization) and its used for everything from getting loop trip
counts to figuring out which procedures should be located close to
each other in the final executable.


To use PBO you first build your product using the +I flag and then
relink using the +P flag. The second link uses the profile
information from a run of the instrumented code. You would be
surprised what kind of performance difference this can make.


For example:


cc +O3 +I -c foo.c
cc +O3 +I -c bar.c
cc +O3 +I -o foobar foo.o bar.o
foobar <sample_input
cc +O3 +P -o foobar foo.o bar.o


        Dave Dunn


------------------------------


Date: Fri, 12 May 1995 16:48:26 -0400
From: schooler@apollo.hp.com


Certainly there are commercial systems which do this. HP's compilers
(using the +I/+P switches) will generate an instrumented version of
the program which records control-flow-edge counts, then feed it back
into the compiler. This information was originally used in the back
end, to do things like basic block repositioning as well as tune
execution frequencies for other optimization phases. Later on, the
high-level optimizer used it for inlining, where knowing which call
sites are really the most frequently-executed ones is a big win.
There have been published studies (for instance by Fisher and
Freudenberger of HP Labs) that show that the profile information is
quite stable across different runs for the above applications.


Using the information for loop transforms is very attractive, as well,
but I'd point out one potential problem: the information is quite a
bit less stable across different runs. For example, in the SPEC92 and
SPEC95 suites, the profile-gathering runs must be done with alternate,
shorter test runs, which are different enough to make applying the
information potentially dangerous (in a performance sense, i.e.,
making the code slower than it would have been.)


Quite a number of people are now suggesting gathering run-time type
information to optimize object-oriented programs, by taking advantage
of peakiness in the distribution of method call invocation.


[....]


------------------------------


Date: Wed, 10 May 1995 18:57:37 -0400
From: mike@raederle (Michael Wojcik)


IBM's new compilers for AIX have a profiling utility that feeds
information to an object-code reorganizer, which post-processes the
executable to optimize it for data sets like the one the program
was profiled with.


That's not quite the same as passing profiling information back to
the compiler, of course (which sounds like a good idea, by the way).
It might make for an interesting comparison, though.


See Randall R. Heisch, "FDPR for AIX Executables", _AIXpert_, August
1994: 16-20. _AIXpert_ is an IBM publication; contact
geo@austin.ibm.com for more information. He also cites R. R. Heisch,
"Trace-directed Program Restructuring for AIX Executables", _IBM
Journal of Research and Development_, 38.4 (1994).


Michael Wojcik
Transaction Services Product Group, Micro Focus Inc.
Department of English, Miami University


[I am leaving this mail signature in because I received the email with
a non-fully-qualified From address which cannot be replied to. == pshuang]


------------------------------


Date: Thu, 11 May 1995 09:26:59 -0500
From: metzger@bach.convex.com (Robert Metzger)
Subject: compilers taking profile input


The Convex Application Compiler has accepted optional profile input
for over 2.5 years. The Application Compiler is an interprocedural
compiler that handles Fortran and C.


We use the profile input to direct two optimizations:
1) procedure inlining - we do this automatically, and the heuristic
      is driven off estimates of call execution frequency, among other
      things. these estimates can be replaced with actual counts if
      profile input is available


2) storage reorganization - we use profile input to determine which
      procedures should be adjacent to each other in the text segment so
      that instruction cache thrashing is minimized


------------------------------


Date: Fri, 12 May 95 10:36:47 BST
From: David Alan Gilbert <gilbertd@cs.man.ac.uk>


Hi,
Although its a couple of years since I've used them, I think the compilers
on Silicon Graphics workstations (MIPS processors) have facilities to feed
profiling and cache info back into the compiler. In particular I think
theyrearrange the code to help with cacheing.


[....]


------------------------------


Date: Fri, 12 May 1995 10:11:50 -0400
From: przemek@rrdjazz.nist.gov (Przemek Klosowski)


[....]


This approach is in fact used on MIPS suite of compiler tools (see DEC
MIPS Ultrix, MIPS machines and SGI machines). The profiler output may
be fed back into the compiler/linker, to e.g. optimize layout of object
modules for better cache behaviour. I don't remember off hand what other
optimizations are done by the compiler.


------------------------------


Date: Sat, 13 May 1995 19:04:32 -0400 (EDT)
From: lkaplan@BIX.com


[....]


I may be mistaken, but I believe that the C compiler used on the
DECstation 2000 had that capability (MIPS processor/ULTRIX, not VAX).
Sorry I don't have more concrete information, but it's been a few
years since I used the machine.


[....]


------------------------------


Date: Sat, 13 May 1995 22:04:44 -0700
From: "D. J. Bernstein" <djb@silverton.berkeley.edu>


Have you seen the MIPS compiler?


[....]


------------------------------


Date: Mon, 15 May 1995 06:13:55 +0200
From: ganswijk@xs4all.nl (Jaap van Ganswijk)


If I remember correctly, DEC proposes this kind of tactic, (which I
don't like), for it's Alpha processor. But let me be clear: I don't
like it as general way to compile. It's allright, I guess, if you want
to optimize just one program for just one particular architecture.
And I am not condemning you or your work: You probably found a niche
in which it's useful..


Let me know if you want me to find the particular information. It's
probably on http://www.digital.com or http://www.dec.com Look for the
appendices in the AXP-architecture manual about optimization.


Sorry, that I seem very harsh about this issue, but as a compiler-
writer I don't like the set of problems, that the hardware-boys think
we can solve. And this kind of loop-back is the last straw. (But you
probably know..)


Don't hesitate to write back, I like discussions..


[....]


------------------------------


Date: Tue, 9 May 1995 12:03:20 -0400
From: bnm@central.bbt.com (Brian N. Miller)


There's a product called Segmentor for the IBM PC which takes profiles
and uses simulated annealing not to recompile, but to relink so as to
maximize segment locality of functions in virtual memory.


The product is featured on the Microsoft Developer Network CD ROM.


Also, I believe BYTE magazine's PowerPC issue a year ago mentioned that
IBM might be using profile-guided recompilation as part of its
Motorola 68K to PowerPC instruction set translator.


------------------------------


Date: Tue, 9 May 95 22:35 BST-1
From: funcsoft@cix.compulink.co.uk (Mike Ruddy)


Hi Ping,


I am developing an application to do this.


My first version just does working set tuning of code but is a released
product. If you'd like to have a look the demo is at ftp-os2.nmsu.edu
in the os2/demos directory as LXOPT100.ZIP


This version watches a specially modified version of the app execute and
then constructs a new EXE/DLL with code rearranged at the assembler
level to minimise code load page faults. It operates directly on the
EXE/DLL so no source code changes. It is also a suitable launchpad from
which to do many of the optimisations that you describe in your msg.


IMHO direct alteration of the EXE/DLL is the way to go with this.
Compilation units are human logical divisions. Compilers are severely
limited by calls into other compilation modules where the compiler can
have no knowledge of what is going on.


Three examples:
1) a function returning a boolean value
This is typically done by returning an int in EAX, 0=false otherwise
true. An EXE processor can see how the return value is generated and
used and use a processor flag to return the value which will probably be
set as a byproduct of the function anyway. The caller need not then
test the return value, just branch on the state of the flag.


2) stack validity
The stack often remains valid above the current stack pointer (OS/2
guarantees not to corrupt the first 128 bytes above ESP in its exception
handlers). A calling function could retrieve a return value/structure
from the local variables of the called function *after* it has returned.


3) Corrupted register values
Both caller and callee work to the linkage spec when deciding what
registers need to be saved and restored. This can result in the caller
assuming a register has changed which hasnt or the callee preserving an
unneeded register.


Of course a more trivial "compilation" stage with more info passed to a
powerful optimising linker could achieve the same result but is a ground
up solution. An EXE optimiser can be applied to apps regardless of dev
method.


Hope you find the LXOPT demo interesting,


Mike Ruddy


p.s. I'd love to read your thesis if you wouldnt mind emailing it to me.
          I'll do a proof read for you if you like. It will benefit me in the
          search for new optimisation ideas and I'll try to suggest anything
          I think has been left out.


------------------------------


Date: Tue, 9 May 1995 13:29:27 -0400
From: Robert Bagwill <bagwill@kangaroo.ncsl.nist.gov>


[....]


None of your examples seem like a good requirement for compiler
optimization. For example, what does the compiler have to do with
which algorithm a sort uses? You could write a jacket routine for a
bunch of sort algorithms that checks its input, then calls the
appropriate routine. The compiler would never get involved.


Capturing a profile of a running program is a good idea, but I don't
think feeding the results back to the compiler is very useful, since
to react to a change in the data or execution environment, you'd have
to recompile the program. It makes more sense to build heuristics
that take advantage of the profiler output into the program and/or its
major components. For example, memory management is very important to
big applications. A memory management module could use the results of
profiling to allow it to make good guesses about the size and access
patterns for objects in memory.


[....]


------------------------------


Date: Thu, 11 May 1995 08:57:35 -0500
From: sdl@austin.ibm.com (sdl)


[....]


In the case of C/C++, doesn't memoization effectively change the
semantics of the langauage? That is, in addition to explicit results
library functions can have meaningful side effects that must occur
each time. For a simple example, on an IEEE machine sqrt(-1.0)
returns NaNQ, but per ANSI C it must also set errno to EDOM, and may
cause a SIGFPE signal to be delivered.


[....]


------------------------------


End of Optimization Question Responses Digest
*********************************************
--
Ping Huang <pshuang@mit.edu>, probably speaking for himself


--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.