Newsgroups: comp.compilers
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Keywords: design, optimize
Organization: College of Marine Studies, U. Del.
References: <PCG.91Aug18004243@aberda.aber.ac.uk>
Date: 18 Aug 91 14:17:49 GMT
[From comp.arch -John]
>>>>> On 18 Aug 91 00:42:43 GMT, pcg@aber.ac.uk (Piercarlo Grandi) said:
pcg> Consider the appropriate choice of abstraction level: if your
pcg> application requires a matrix multiplication, is it clearer and more
pcg> efficient to write three nested loops to do it and have them rewritten
pcg> behind your back to five only if the target is a vector machine, or to
pcg> write 'a matmult b' and have the compiler generate either the three- or
pcg> five-loop version depending on machine type?
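For concreteness, here is roughly what the two formulations look like, sketched
in C (the posting names no particular language, and the five-loop form below is
just one common strip-mined/blocked shape a vectorizing compiler might produce;
the size M and vector length VL are hypothetical):

    #include <stddef.h>

    enum { M = 256, VL = 64 };   /* hypothetical problem size, vector length */

    /* Three nested loops: the "clear" version. */
    void matmult3(double a[M][M], double b[M][M], double c[M][M])
    {
        for (size_t i = 0; i < M; i++)
            for (size_t j = 0; j < M; j++) {
                double s = 0.0;
                for (size_t k = 0; k < M; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
    }

    /* Five nested loops: the same product strip-mined and blocked so the
       innermost loop runs over a vector-length strip of a row of c.
       Assumes c[][] has been zeroed by the caller. */
    void matmult5(double a[M][M], double b[M][M], double c[M][M])
    {
        for (size_t jj = 0; jj < M; jj += VL)
            for (size_t kk = 0; kk < M; kk += VL)
                for (size_t i = 0; i < M; i++)
                    for (size_t k = kk; k < kk + VL; k++)
                        for (size_t j = jj; j < jj + VL; j++)  /* vector loop */
                            c[i][j] += a[i][k] * b[k][j];
    }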
That is fine, of course, if you are doing matrix multiplication. Since I
occasionally find myself doing other things {:-)}, I find (not too
surprisingly) that there exist no HLLs that implement the operations that I
need. If the language is suitably extensible, then I can define them myself.
But to get the required efficiency, I still have to program the kernels in an
ugly, unreadable, machine-specific way.
The example I posted a few days ago is a case in point. I want to be
able to write:
t = jacobian(h,z)
where the finite-difference Jacobian operator is defined as the
average of the three straightforward forms (J0, J+, J-), which are defined
by the three groups of lines in my previous posting (original form). In
a suitable HLL, one might provide the compiler with something like:
array function jacobian(a,b) = (J0(a,b)+JP(a,b)+JM(a,b))/3.0
array function J0(a,b) = (first two lines of "original code")
array function JP(a,b) = (next three lines of "original code")
array function JM(a,b) = (next three lines of "original code")
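Written out in full, the "clear" formulation above amounts to something like
the following C sketch. The stencils are the standard second-order Arakawa
forms that J0, J+, and J- conventionally denote, filled in here only because
the original lines are not reproduced; the grid size N and all names are
hypothetical:

    #include <stddef.h>

    #define N 128                    /* hypothetical grid dimension */
    #define IDX(i, j) ((i) * N + (j))

    /* J0: both derivatives by centered differences. */
    static void J0(const double *a, const double *b, double *out, double rd2)
    {
        for (size_t i = 1; i < N - 1; i++)
            for (size_t j = 1; j < N - 1; j++)
                out[IDX(i, j)] = rd2 *
                    ((a[IDX(i+1, j)] - a[IDX(i-1, j)]) *
                     (b[IDX(i, j+1)] - b[IDX(i, j-1)]) -
                     (a[IDX(i, j+1)] - a[IDX(i, j-1)]) *
                     (b[IDX(i+1, j)] - b[IDX(i-1, j)]));
    }

    /* J+: a at edge points times centered differences of b. */
    static void JP(const double *a, const double *b, double *out, double rd2)
    {
        for (size_t i = 1; i < N - 1; i++)
            for (size_t j = 1; j < N - 1; j++)
                out[IDX(i, j)] = rd2 *
                    (a[IDX(i+1, j)] * (b[IDX(i+1, j+1)] - b[IDX(i+1, j-1)])
                   - a[IDX(i-1, j)] * (b[IDX(i-1, j+1)] - b[IDX(i-1, j-1)])
                   - a[IDX(i, j+1)] * (b[IDX(i+1, j+1)] - b[IDX(i-1, j+1)])
                   + a[IDX(i, j-1)] * (b[IDX(i+1, j-1)] - b[IDX(i-1, j-1)]));
    }

    /* J-: a at corner points times differences of b. */
    static void JM(const double *a, const double *b, double *out, double rd2)
    {
        for (size_t i = 1; i < N - 1; i++)
            for (size_t j = 1; j < N - 1; j++)
                out[IDX(i, j)] = rd2 *
                    (a[IDX(i+1, j+1)] * (b[IDX(i, j+1)] - b[IDX(i+1, j)])
                   - a[IDX(i-1, j-1)] * (b[IDX(i-1, j)] - b[IDX(i, j-1)])
                   - a[IDX(i-1, j+1)] * (b[IDX(i, j+1)] - b[IDX(i-1, j)])
                   + a[IDX(i+1, j-1)] * (b[IDX(i+1, j)] - b[IDX(i, j-1)]));
    }

    /* jacobian = (J0 + J+ + J-)/3: three full sweeps over the grid plus an
       averaging pass, with three temporary arrays streamed through memory. */
    void jacobian(const double *h, const double *z, double *t, double d)
    {
        static double t0[N * N], tp[N * N], tm[N * N];
        double rd2 = 1.0 / (4.0 * d * d);
        J0(h, z, t0, rd2);
        JP(h, z, tp, rd2);
        JM(h, z, tm, rd2);
        for (size_t k = 0; k < (size_t)(N * N); k++)
            t[k] = (t0[k] + tp[k] + tm[k]) / 3.0;
    }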
I believe that optimizations such as those I showed in the "optimized" form are
well beyond the ability of modern compilers when presented with a
"high-level" description of the problem. I wish it were not so, but I have
not yet found *any* compiler that can fuse and unroll loops to produce code
that is within 25% of the manually "improved" versions.
The biggest problem is that the optimizations that are *crucial* on
hierarchical memory machines (function/subroutine inlining -> loop
fusion/rewriting -> loop unrolling) are still rather poorly done.
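By hand, the improvement is mechanical but ugly: inline J0, JP, and JM into the
caller, fuse the four loop nests into one, and fold the scaling constants
together, so every point is finished in a single sweep with no temporary
arrays. Reusing N and IDX from the sketch above (and the same assumed
stencils):

    /* Fused, inlined version: one pass, no temporaries; 1/(12*d*d) folds
       together the 1/(4*d*d) stencil scaling and the final 1/3 average.
       Boundary points are left untouched, as in the separate-pass version. */
    void jacobian_fused(const double *h, const double *z, double *t, double d)
    {
        double r = 1.0 / (12.0 * d * d);
        for (size_t i = 1; i < N - 1; i++)
            for (size_t j = 1; j < N - 1; j++)
                t[IDX(i, j)] = r * (
                    /* J0 terms */
                    (h[IDX(i+1, j)] - h[IDX(i-1, j)]) * (z[IDX(i, j+1)] - z[IDX(i, j-1)])
                  - (h[IDX(i, j+1)] - h[IDX(i, j-1)]) * (z[IDX(i+1, j)] - z[IDX(i-1, j)])
                    /* J+ terms */
                  + h[IDX(i+1, j)] * (z[IDX(i+1, j+1)] - z[IDX(i+1, j-1)])
                  - h[IDX(i-1, j)] * (z[IDX(i-1, j+1)] - z[IDX(i-1, j-1)])
                  - h[IDX(i, j+1)] * (z[IDX(i+1, j+1)] - z[IDX(i-1, j+1)])
                  + h[IDX(i, j-1)] * (z[IDX(i+1, j-1)] - z[IDX(i-1, j-1)])
                    /* J- terms */
                  + h[IDX(i+1, j+1)] * (z[IDX(i, j+1)] - z[IDX(i+1, j)])
                  - h[IDX(i-1, j-1)] * (z[IDX(i-1, j)] - z[IDX(i, j-1)])
                  - h[IDX(i-1, j+1)] * (z[IDX(i, j+1)] - z[IDX(i-1, j)])
                  + h[IDX(i+1, j-1)] * (z[IDX(i+1, j)] - z[IDX(i, j-1)]));
    }

The payoff is in memory traffic: each element of h and z is loaded once per
sweep instead of three or more times, which is precisely what matters on a
hierarchical-memory machine.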
On machines that have enough memory bandwidth to run in streaming
mode (e.g., the ETA-10 and the Cray X and Y), the original code gives
performance that is not too far from optimum.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET