Related articles:
  comparing compilers & architectures  peterd@june.cs.washington.edu (1989-03-15)
  comparing compilers & architectures  wsmith@m.cs.uiuc.edu (Bill Smith) (1989-03-16)
From: peterd@june.cs.washington.edu (Peter C. Damron)
Newsgroups: comp.compilers
Keywords: how to measure
Date: 15 Mar 89 22:50:22 GMT
Organization: U of Washington, Computer Science, Seattle
In a recent discussion with a friend working on compilers,
I brought up these comments about comparing system X versus system Y.
(The names have been changed to protect the innocent.)
I don't want to open up a discussion about benchmarking or choosing
benchmarks, but I am interested in any techniques for evaluating
the "effectiveness" of a compiler or compiler + architecture.
I wonder if anyone on the net has any further comments or suggestions.
I see three possibilities for comparing the systems:
1. compiler effectiveness
2. architecture effectiveness
3. compiler + architecture effectiveness
It is arguable which of these measurements is most important.
It depends on your point of view.
The end user probably cares most about number 3.
Comparing compiler + architecture
---------------------------------
Number 3 is the easiest to test and the one usually measured. The usual
method is to collect a "suitable" set of benchmarks, compile and run them.
The word suitable means the set which proves the point you want to make :-).
If you want to be as fair as possible, you should choose benchmarks
from a wide variety of application areas and they should be large programs.
If you have a particular application in mind, use that as a benchmark.
See below under benchmarks.
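As a rough sketch of such a measurement (run_benchmark() here is just a
placeholder for whatever workload you choose), a driver like the following
can be compiled with the same options on each compiler + architecture pair
and the reported times compared:

/* Minimal benchmark driver sketch: link against the workload under
   test and compile this file identically on each system.  The best
   of several runs is reported to smooth out timing noise. */

#include <stdio.h>
#include <time.h>

extern void run_benchmark(void);   /* placeholder for the real workload */

#define REPEATS 10

int main(void)
{
    double best = 1e30;
    int i;

    for (i = 0; i < REPEATS; i++) {
        clock_t start, stop;
        double secs;

        start = clock();
        run_benchmark();
        stop = clock();
        secs = (double)(stop - start) / CLOCKS_PER_SEC;
        if (secs < best)
            best = secs;
    }
    printf("best of %d runs: %.3f CPU seconds\n", REPEATS, best);
    return 0;
}

Remember that clock() measures CPU time only; elapsed time, I/O, and paging
behavior have to be measured separately if they matter for the application.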
Comparing architectures
-----------------------
To measure number 2, architecture effectiveness, you must attempt
to hold compiler technology constant. This is where using gcc or pcc
or other retargetable compilers might help. You port the compiler
to the architectures of interest, compile a set of benchmark programs
on each, and compare the generated code.
See below about benchmarks and comparing code.
The problem with this approach is that the compiler technology you
use to do the comparison can severely bias your results.
If architecture A depends on very good register allocation to run
efficiently, but the compiler you use has a bad register allocator,
the results will be biased against architecture A.
So, the best you can do is either compare architectures which require
roughly the same compiler technology for effective use, or use the
latest and greatest compiler technology in all aspects of the compiler.
The latter is too difficult, so the former is probably the best you can do.
Comparing compilers
-------------------
To measure number 1, compiler effectiveness, you must attempt to hold
the architecture constant. Again, a set of benchmarks is chosen and run
through the compilers. The output code is then evaluated.
See benchmarks and code comparison below.
If the target architectures of the compilers differ, then trying to
make any comparison is very difficult -- too many parameters.
In comparing compilers, it is important to understand the different
compiler phases and how they interact. The effectiveness of each
phase can be measured by designing artificial benchmarks to exercise
that phase. However, poor performance on a benchmark can easily be
attributed to one phase, but be caused by a bad interaction between
other phases of the compiler.
Phases which might be of interest to test include: instruction selection,
instruction scheduling, register allocation, local optimizations, and
global optimizations.
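For example (this kernel is just an illustration of mine, not part of any
standard suite), the following keeps eight scalars live across a loop, so
the quality of the generated code depends mostly on the register allocator;
whether a through h end up in registers or get spilled to memory is visible
directly in the assembly output:

/* Artificial benchmark aimed at a single phase (register allocation).
   All eight scalars are live across every iteration, so a good
   allocator keeps them in registers while a poor one spills them.
   The final sum is returned so the loop cannot be removed as dead code. */

long stress_register_allocation(long n)
{
    long a = 1, b = 2, c = 3, d = 4;
    long e = 5, f = 6, g = 7, h = 8;
    long i;

    for (i = 0; i < n; i++) {
        a += b; b += c; c += d; d += e;
        e += f; f += g; g += h; h += a;
    }
    return a + b + c + d + e + f + g + h;
}

Similar throwaway kernels can be written for instruction scheduling (long
dependence chains versus independent operations), common subexpression
elimination, loop-invariant code motion, and so on.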
Benchmarks
----------
There are three kinds of benchmarks:
A. synthetic (e.g. Dhrystone, Whetstone, the Livermore Loops, etc.)
B. actual applications
C. abstractions from actual applications
The most widely used are class A benchmarks, but they are only very rough
measures of performance, since any real application will be different.
Class B, actual applications, are what the user is actually interested in,
but these can be hard to execute and measure accurately.
Class C is a compromise. Take your application, figure out what a typical
"run" might be (or a series of them), and modify the application to
do that typical run in an automated and measurable way.
This is the most work, but it probably gives the best measure of how well
the system being measured will perform for your application.
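As a sketch of what such a driver might look like (process_record() stands
in for the application's real work, and the input file name is made up),
the point is to replay a recorded typical run from a fixed input and print
a checksum, so the run is automated, repeatable, and verifiable on every
system before any times are compared:

/* "Class C" benchmark driver sketch: replay a recorded typical run
   from a fixed input file and print a checksum of the results.
   The checksum lets you verify that every system under test did the
   same work before their times are compared. */

#include <stdio.h>

extern long process_record(const char *line);  /* placeholder for the application */

int main(void)
{
    FILE *in = fopen("typical_run.input", "r"); /* recorded representative input */
    char line[1024];
    long records = 0, checksum = 0;

    if (in == NULL) {
        fprintf(stderr, "cannot open typical_run.input\n");
        return 1;
    }
    while (fgets(line, sizeof line, in) != NULL) {
        checksum += process_record(line);
        records++;
    }
    fclose(in);
    printf("%ld records, checksum %ld\n", records, checksum);
    return 0;
}

The timing itself can then be done with an external timer or with a driver
like the one shown earlier.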
Of course, you may have several applications, so you can run more than
one benchmark. It is probably a good idea to run many benchmarks anyway,
to get an idea of what the variance is in relative performance.
Note that compiler libraries, the run-time system, the operating system,
and I/O may all affect the outcome of benchmarks.
Comparing machine code
----------------------
There are many ways to compare machine code for different machines.
Probably none of them are really adequate.
First, there are two classes of measurements, static and dynamic.
Static measurement attempts to quantify the quality of the code
without regard to how many times each instruction might be executed.
Static measurements are probably better for comparing compilers.
Dynamic measurements attempt to quantify how fast the code will execute,
taking into account how many times and in what order instructions
are executed. Dynamic measurements are probably better for comparing
architectures.
Static measurements are typically taken with some sort of instruction
counting program. You might measure number of instructions, cycles,
memory references, instruction size, or some combination.
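A minimal sketch of such a counter (mine, and deliberately naive about
assembler syntax, which varies from machine to machine) reads the
compiler's assembly output on standard input, skips labels and directives,
and reports the total instruction count plus how many distinct mnemonics
were used:

/* Static measurement sketch: count instructions in an assembly listing.
   Lines that are blank, start with '.' or '#', or whose first token
   ends in ':' are treated as directives, comments, or labels.
   The syntax rules are naive and need adapting to each assembler. */

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_OPS 512

static char ops[MAX_OPS][16];
static long counts[MAX_OPS];
static int nops;

int main(void)
{
    char line[256], mnem[16];
    long total = 0;
    int i;

    while (fgets(line, sizeof line, stdin) != NULL) {
        char *p = line;

        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0' || *p == '.' || *p == '#')
            continue;                           /* blank, directive, comment */
        if (sscanf(p, "%15s", mnem) != 1)
            continue;
        if (mnem[strlen(mnem) - 1] == ':')
            continue;                           /* label */
        total++;
        for (i = 0; i < nops; i++)              /* find or add the mnemonic */
            if (strcmp(ops[i], mnem) == 0)
                break;
        if (i == nops) {
            if (nops >= MAX_OPS)
                continue;                       /* table full; counted in total only */
            strcpy(ops[nops], mnem);
            nops++;
        }
        counts[i]++;
    }
    printf("%ld instructions, %d distinct mnemonics\n", total, nops);
    for (i = 0; i < nops; i++)
        printf("%8ld  %s\n", counts[i], ops[i]);
    return 0;
}

Feeding it the output of "cc -S" for each benchmark gives a crude but useful
picture of code size and of which instructions the compiler actually uses.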
Dynamic measurements are typically taken by executing the program
on some hardware, though instruction counting or simulation is possible.
If the code is to be executed, attention must be given to making
sure the hardware configurations are equivalent or comparable.
In measuring code, you may want to find out certain characteristics
about the code, especially for comparing compilers.
How effective is the register allocation?
How many different instructions from the architecture were actually used?
How many addressing modes were used?
Which optimizations were performed? Dead code elimination? Loop-invariant code motion? etc.
How effective was the scheduler in eliminating NOP's or filling
branch delay slots?
These questions relate directly back to various portions of the compiler.
Some of these measurements are easy, but others are difficult.
Also, these measurements are only an indirect measure of the related
portion of the compiler, because they will depend on the program too.
Summary
-------
Any of these types of measurements are difficult to make fairly.
There are many factors to consider and many parameters which cannot
be held constant for the comparison. Generalizations are probably to
be avoided, but some specific findings may hold.
(e.g. compiler A does a better job at register allocation than compiler B,
or compiler+architecture A is better at numerical analysis than
compiler+architecture B, primarily because of the library routines)
Just producing a set of numbers is not enough.
Careful analysis of the results is necessary to avoid unintended bias
and to fully understand why the numbers came out that way.
Plug
----
The following technical reports compare various instruction selection
techniques (i.e. comparing parts of compilers). Of course, these
comparisons used a very controlled experimental testbed.
Comparisons with production compilers will be much harder.
Robert R. Henry and Peter C. Damron,
"Performance of Table-Driven Code Generators Using Tree-Pattern Matching",
Technical Report 89-02-02, Dept. of Computer Science, Univ. of Washington,
Feb. 1989.
Robert R. Henry and Peter C. Damron,
"Algorithms for Table-Driven Code Generators Using Tree-Pattern Matching",
Technical Report 89-02-03, Dept. of Computer Science, Univ. of Washington,
Feb. 1989.
Peter C. Damron
Dept. of Computer Science, FR-35
University of Washington
Seattle, WA 98195
peterd@cs.washington.edu
{ucbvax,decvax,etc.}!uw-beaver!uw-june!peterd