Related articles |
---|
constant expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-01-13) |
Re: constant expressions DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-01-15) |
Re: constant expressions gneuner2@comcast.net (George Neuner) (2009-01-16) |
From: | George Neuner <gneuner2@comcast.net> |
Newsgroups: | comp.compilers |
Date: | Fri, 16 Jan 2009 02:04:18 -0500 |
Organization: | A noiseless patient Spider |
References: | 09-01-031 |
Keywords: | parallel, optimize |
Posted-Date: | 16 Jan 2009 07:02:45 EST |
On Tue, 13 Jan 2009 21:52:11 +0000 (UTC), glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:
>It seems that compilation of array expressions in general is
>improving, but that there is still room for improvement. One common
>question in comp.lang.fortran has to do with the execution time of an
>array expression compared to one using DO loops. In many cases the
>array expression is much slower, often using temporary arrays not
>needed in the DO case.
The issue is that array expressions are logically parallel whereas a
corresponding DO loop is logically sequential. Consider
DO i = 2,100
A(i) = A(i-1) + 1.0
END DO
vs
A(2:100) = A(1:99) + 1.0
They both look simple enough ... but what happens if you try to run
the array expression on a multiprocessor or a vector processor?
In the sequential version, each iteration of the loop changes the
input to a subsequent iteration. If the loop is striped across
multiple CPUs, it might happen that A(i) is fetched before A(i-1) is
updated. The solution is to introduce a temporary array and rewrite
the original code as
DO i = 2,100
T(i) = A(i-1) + 1.0
END DO
DO i = 2,100
A(i) = T(i)
END DO
Now the loops can now be striped and executed in parallel because all
the computations are INDEPENDENT - which you may recognize as an HPF
declaration. HPF doesn't parallelize loops unless they are declared
independent and relies on the programmer to be correct.
On a vector processor, the input "registers" hold multiple array
elements - possibly the whole array. In operation the register is
treated as a FIFO queue and the contents are streamed through the
processor with the results going back to memory or into another
register[1].
Because of this, a vector processor typically can't write back results
into its own input[2]. Thus the processor can see only the initial
array values loaded from memory and not any updates to them. To
produce correct program results, the compiler has to introduce a
temporary array and do something like
T(1:99) = A(1:99) + 1.0
A(2:100) = T(1:99)
In general, for an array expression, the compiler has to use a
temporary whenever the same array is used as input and output unless
it can prove there is no overlap of elements (which requires knowledge
of the hardware).
[1] machines using vector units typically provide multiple vector
registers to minimize memory traffic, particularly if there are
multiple vector units. Some can also chain the output of one
unit to the input of another.
[2] some vector units can do reduction operations by looping their
output back as one of their inputs. But AFAIK, none can write
back into their input registers on the fly. In general, if a
dependent computation can't be expressed as a reduction, it can't
be done on a vector unit without temporaries.
>[Sheesh. I turned array expressions into do loops (well, Basic for loops)
>in the compiler I wrote for my first compiler course in 1972. It's not
>that hard. -John]
Turning an array expression into loops is easy. Turning loops into an
array expression (ie. parallelizing them) is, in general, much harder.
George
Return to the
comp.compilers page.
Search the
comp.compilers archives again.