Related articles |
---|
EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz Intel glew@ichips.intel.com (1995-11-09) |
Re: EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz In pardo@cs.washington.edu (1995-11-14) |
Re: EQNTOTT Vectors of 16 bit Numbers cdg@nullstone.com (1995-11-17) |
Re: EQNTOTT Vectors of 16 bit Numbers hbaker@netcom.com (1995-11-19) |
Re: EQNTOTT Vectors of 16 bit Numbers cliffc@ami.sps.mot.com (1995-11-20) |
Re: EQNTOTT Vectors of 16 bit Numbers bernecky@eecg.toronto.edu (1995-11-20) |
Re: EQNTOTT Vectors of 16 bit Numbers bernecky@eecg.toronto.edu (1995-11-21) |
Newsgroups: | comp.compilers |
From: | bernecky@eecg.toronto.edu (Robert Bernecky) |
Keywords: | architecture, optimize, APL, comment |
Organization: | University of Toronto, Computer Engineering |
References: | <47j9hm$tdp@caesar.ultra.net> 95-11-079 95-11-132 |
Date: | Mon, 20 Nov 1995 17:36:35 GMT |
pardo@cs.washington.edu (David Keppel) writes:
>>["Intel's special SPEC optimization."]
>
>This reminds me of APL benchmarking from twenty years ago. The
>interpreters recognized particular idioms and implemented them as
>special cases. The APL vendors (notably, IBM) were accused of
>"benchmark optimizations", though I think the original motivation
>was speeding up real use.
We call this product engineering -- it solves users' real problems.
It is traditional to blame marketeers for taking our work and
using it in benchmarks. 8^}
Actually, IBM was fairly innocent on that front. It was us competitors
in that field, primarily the I.P. Sharp Associates development
group who did a lot of it. For some reason, IBM was very slow to
adopt algorithmic improvements in their early APL products --
Work I did on fast set membership took at least 10 years to
hit the IBM products. IBM's APL2 does a very nice job on
structural operations -- probably ahead of all competitors on that
front (transpose, reverse, take, drop...)
APL implementors worked on two fronts:
- storage savings [Typical workspace size back then was
a generous 100k, up from the 36k of yore [yore wasn't that
much earlier]], to avoid workspace full problems.
Logically, these used loop fusion techniques to
avoid generating large temps. The earliest one I remember
was the "compress iota" idiom:
A common technique in APL is to use a Boolean compression
mask to compress a list of the first N integers.
For example, let's call the index generator "iota".
Then:
iota 5
0 1 2 3 4
and compress (denoted by "/") with a Boolean left argument
gives the elements of the right argument that correspond
to ones in the left argument. E.g., to locate the blanks
a character vector:
x=. ' Where are the blanks here? "
x=' '
10000010001000111000000100000111
Stick blanks between the digits above so its a numeric vector.
I left them out so you can see what's happening.
(x=' ')/ iota shape x
0 6 10 14 15 16 23 29 30 31
Now, a naive implementation would run across "iota shape n"
and build an integer vector w/as many elements as there are
elements in x. That would get fed to compress which would
discard 99% of it.
The integer temp is 4 times as big as characters (32-bit ints)
and 32 times as big if x is Boolean.
Result: Big workspace full problem for common applications.
So, we (I think Larry Breed was the originator of this
particular work. If not Larry, then it was Roger Moore -- both
Hopper Award winners for APL\360) made the system recognize
the "compress iota" idiom, replaced it by a new token in the
internal representation, and avoided generation of the integer
list temp. Much faster and much less (usually) space.
This was generalized in work by Phil Abrams (Stanford PhD thesis:
An APL Machine) as "J-vectors", aka "Arithmetic Progression Vectors".
These represent numeric vectors as a 3-element vector of
initial value, element count, increment
A system that implemented j-vectors [EBIverson: APL73, Springer&Verlag]
would have iota generate a j vector of: 0, (shape x), 1
and then have compress operate as desired on the resulting j-vector.
This appeared in the Siemens BS4000 APL system.
Lessons:
a. APVs are good, but in an interpreter have not proved worthwhile.
The bookkeeping gets excessive, and doesn't pay off in the
long run -- it's very similar to CISC vs RISC design arguments --
if your gate delays to do something fancy [per-operation
overheads] are larger than the potential gains from something
fancy, the design is a net loss.
In a compiler, APVs can pay off big time, as the analysis and
dispatch time is done once at compile time.
b. The compress iota kludge doesn't generalize as nicely as one would
like to other expressions. You can support a handful of common
idioms, but an interpreter doesn't have the luxury of time to
perform the required analysis. Again, a compiler can do much
better, and in a more general way. For example, loop fusion of
all sorts comes essentially for free.
We also engaged in a variety of time-saving algorithmic changes, such
as:
fast set membership and table lookup (Bernecky APL73)
fast matrix products (unpublished, but uses CDC STAR 100 algorithm)
compiled scalar functions and other commonly used operations
On a vaguely related topic, I happen to be evaluating the loop fusion
capabilities of APEX (The APL Parallel Executor) right now, to
determine how well it handles cases such as the above compress iota.
(APEX cranks out SISAL code from APL.)
[I believe that a lot of people came up with APVs at about the same time. I
invented them for an APL compiler I wrote as a class project in the fall of
1972, for example. -John]
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.