Testing JIT compilers
From: Tim Harris <firstname.lastname@example.org>
Date: 14 Jun 2000 12:58:13 -0400
Organization: University of Cambridge Computer Laboratory
I was thinking about testing compilers a bit last week while working
on a Java-bytecode to x86-native-code compiler. I'd be interested in
feedback on the prototype system described below -- e.g. flaws in it,
or that it has been done before :-).
I am working slowly through the previous compiler-testing
bibliographies that have been posted here, but the emphasis there
seems to be more on automatic test generation.
The compiler is itself implemented in Java and so, as part of a
bootstrapping process, it compiles itself and this "compiled-compiler"
is used during subsequent execution. I had been testing the system by
executing a series of simple regression tests (essentially covering
each bytecode operation with a range of input values) and a number of
larger Java applications drawn from the SPEC JVM suite.
This approach to testing caused two problems. Firstly, when a
regression test failed, it was difficult to associate the
changed output back to an error in the compiler, particularly if the
error lay in the compiled-compiler or was only provoked by one of the
larger tests. Secondly, I found some problems for which the system
generated incorrect code but produced no visibly-different output from
the test application. A concrete example of this second problem was
when a particular kind of integer comparison was implemented
incorrectly in the compiled-compiler. This caused
inappropriate (but operationally correct) machine code instructions to
be selected -- for example using 32-bit immediate operands for small
values which could have been contained in a faster 8-bit format.
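To make that failure mode concrete, here is a small sketch (my own illustration, not code from the original compiler) of the kind of immediate-width selection involved. Both encodings are operationally correct, so an ordinary regression test sees identical program output either way; only a miscompiled comparison inside `fitsImm8` would silently force the slower 32-bit form.

```java
// Hypothetical sketch of immediate-operand selection during x86 code
// generation. All names are illustrative.
public class ImmediateSelection {
    /** True iff the value fits the sign-extended 8-bit immediate form. */
    static boolean fitsImm8(int value) {
        // A miscompiled comparison here (e.g. an unsigned compare
        // substituted for a signed one) would make this return false
        // for small values, selecting the slower encoding below.
        return value >= -128 && value <= 127;
    }

    /** Returns the immediate-encoding width, in bytes. */
    static int immediateWidth(int value) {
        return fitsImm8(value) ? 1 : 4;
    }

    public static void main(String[] args) {
        System.out.println(immediateWidth(5));    // short form: 1 byte
        System.out.println(immediateWidth(1000)); // long form: 4 bytes
        System.out.println(immediateWidth(-1));   // sign-extended: 1 byte
    }
}
```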
It seems plausible that such problems could be detected
more easily by using a more extensive suite of regression tests, for
example by including the compiled code and the contents of other
internal data structures as visible output. However, creating and
running such tests would be time consuming. Ideally, I would like to
be able to supply a large suite of applications and have some
automated test system confirm that they execute in the same way when
interpreted, when compiled with the original interpreted-compiler and
when compiled with the compiled-compiler.
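The shape of such a harness might be sketched as follows. This is only an outline of the idea: `runUnder(...)` is a placeholder standing in for actually launching the test program under each of the three configurations and collecting its recorded observations (none of these names come from a real system).

```java
// Hedged sketch of the differential check described above: run the
// same test under each configuration and confirm the logs agree.
import java.util.Arrays;
import java.util.List;

public class DifferentialCheck {
    static List<String> runUnder(String config) {
        // Placeholder: a real harness would execute the test program
        // under 'config' and return its recorded observations.
        return Arrays.asList("obs1", "obs2");
    }

    public static void main(String[] args) {
        List<String> interp = runUnder("interpreter");
        List<String> stage1 = runUnder("interpreted-compiler");
        List<String> stage2 = runUnder("compiled-compiler");
        boolean ok = interp.equals(stage1) && interp.equals(stage2);
        System.out.println(ok ? "consistent" : "DIVERGENCE");
    }
}
```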
The basic idea would be to execute the test programs in each
environment and to examine intermediate results and so on to ensure
that they were consistent. However, there is a tension between the desire
to record lots of information (with the aim of trapping differences in
behaviour as soon as possible) and the desire to avoid picking up
"false positives" by logging information which is intentionally
different between runs (perhaps because of valid transformations
applied by the compiler). Some obvious candidates cannot be used
because the information may not be available under all execution
strategies -- for example method return values are readily available
in the interpreter, but may not always be available in the compiled
code once a method has been inlined at a site at which the return
value is discarded.
The alternative I've prototyped is to compare the last values taken by
objects' fields. That is, to look at objects when they are reclaimed
by the GC, discarded from the stack (if on-stack allocation is used)
or at the end of the application's execution. The intuition is that
these last values should be reasonably resilient to common
optimizations, and so one would expect them to be consistent between
different runs of an application and between runs compiled with
different compiler configurations. If on-stack allocation is not used
then integrating the logging with the GC means that there is no
modification to the compiler itself or to the code it generates.
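A minimal sketch of the idea (my own illustration, not the original implementation) follows. A toy "heap" stands in for the collector: `reclaim()` snapshots an object's final field values into a log, and two runs of the same program should produce identical logs regardless of how the code was compiled.

```java
// Toy model of logging objects' last field values at reclamation time.
import java.util.ArrayList;
import java.util.List;

public class LastValueLog {
    static class Obj {
        int a, b;
        Obj(int a, int b) { this.a = a; this.b = b; }
    }

    /** Snapshot the object's final field values, as a GC hook would. */
    static void reclaim(Obj o, List<String> log) {
        log.add("Obj{a=" + o.a + ",b=" + o.b + "}");
    }

    /** One run of a tiny "application", returning its final-state log. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Obj o = new Obj(3, 4);
        o.a = 7;              // the program mutates the object...
        reclaim(o, log);      // ...and its last values are logged on death
        return log;
    }

    public static void main(String[] args) {
        // Logs from two runs must agree; a divergence signals a bug.
        System.out.println(run().equals(run())); // prints "true"
    }
}
```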
I extend each object with an extra hidden field recording its
allocation site and a per-allocation-site thread-local sequence
number. This information is used to attempt to convert references in
fields to a canonical form before comparing object contents.
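The canonicalization step might look like the following sketch (again illustrative: in the described system the counters are per-thread, whereas a single-threaded map suffices here). Instead of raw addresses, each object is named by its allocation site plus a per-site sequence number, so objects from different runs can be matched up before their fields are compared.

```java
// Sketch of converting references to canonical (site, sequence) names.
import java.util.HashMap;
import java.util.Map;

public class CanonicalNames {
    // Per-allocation-site allocation counters.
    static final Map<String, Integer> counters = new HashMap<>();

    /** Returns a run-independent name for the next object at this site. */
    static String canonicalName(String allocSite) {
        int seq = counters.merge(allocSite, 1, Integer::sum) - 1;
        return allocSite + "#" + seq;
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("Foo.java:42")); // Foo.java:42#0
        System.out.println(canonicalName("Foo.java:42")); // Foo.java:42#1
        System.out.println(canonicalName("Bar.java:7"));  // Bar.java:7#0
    }
}
```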
This scheme has seemed to work pretty well with some of the SPEC JVM
tests, highlighting a small number of expected differences between
runs (e.g. different file-descriptor values), and when I've
re-introduced bugs into the compiler it has spotted the different
behaviour and generated human-readable error messages.