Re: constant expressions

George Neuner <gneuner2@comcast.net>
Fri, 16 Jan 2009 02:04:18 -0500

          From comp.compilers

Related articles
constant expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-01-13)
Re: constant expressions DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-01-15)
Re: constant expressions gneuner2@comcast.net (George Neuner) (2009-01-16)
| List of all articles for this month |

From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Fri, 16 Jan 2009 02:04:18 -0500
Organization: A noiseless patient Spider
References: 09-01-031
Keywords: parallel, optimize
Posted-Date: 16 Jan 2009 07:02:45 EST

On Tue, 13 Jan 2009 21:52:11 +0000 (UTC), glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:


>It seems that compilation of array expressions in general is
>improving, but that there is still room for improvement. One common
>question in comp.lang.fortran has to do with the execution time of an
>array expression compared to one using DO loops. In many cases the
>array expression is much slower, often using temporary arrays not
>needed in the DO case.


The issue is that array expressions are logically parallel whereas a
corresponding DO loop is logically sequential. Consider


DO i = 2,100
A(i) = A(i-1) + 1.0
END DO


vs


A(2:100) = A(1:99) + 1.0


They both look simple enough ... but what happens if you try to run
the array expression on a multiprocessor or a vector processor?




In the sequential version, each iteration of the loop changes the
input to a subsequent iteration. If the loop is striped across
multiple CPUs, it might happen that A(i) is fetched before A(i-1) is
updated. The solution is to introduce a temporary array and rewrite
the original code as


DO i = 2,100
T(i) = A(i-1) + 1.0
END DO
DO i = 2,100
A(i) = T(i)
END DO


Now the loops can now be striped and executed in parallel because all
the computations are INDEPENDENT - which you may recognize as an HPF
declaration. HPF doesn't parallelize loops unless they are declared
independent and relies on the programmer to be correct.




On a vector processor, the input "registers" hold multiple array
elements - possibly the whole array. In operation the register is
treated as a FIFO queue and the contents are streamed through the
processor with the results going back to memory or into another
register[1].


Because of this, a vector processor typically can't write back results
into its own input[2]. Thus the processor can see only the initial
array values loaded from memory and not any updates to them. To
produce correct program results, the compiler has to introduce a
temporary array and do something like


T(1:99) = A(1:99) + 1.0
A(2:100) = T(1:99)




In general, for an array expression, the compiler has to use a
temporary whenever the same array is used as input and output unless
it can prove there is no overlap of elements (which requires knowledge
of the hardware).




[1] machines using vector units typically provide multiple vector
        registers to minimize memory traffic, particularly if there are
        multiple vector units. Some can also chain the output of one
        unit to the input of another.
[2] some vector units can do reduction operations by looping their
        output back as one of their inputs. But AFAIK, none can write
        back into their input registers on the fly. In general, if a
        dependent computation can't be expressed as a reduction, it can't
        be done on a vector unit without temporaries.




>[Sheesh. I turned array expressions into do loops (well, Basic for loops)
>in the compiler I wrote for my first compiler course in 1972. It's not
>that hard. -John]


Turning an array expression into loops is easy. Turning loops into an
array expression (ie. parallelizing them) is, in general, much harder.


George


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.