Re: How many vector registers are useful?
Mon, 1 Feb 1993 15:00:03 GMT

          From comp.compilers

Newsgroups: comp.sys.super,comp.arch,comp.compilers
(Volker Kurz)
Followup-To: comp.sys.super
Keywords: architecture, performance
Organization: University of Frankfurt/Main, Dept. of Mathematics
References: 93-01-174
> [is a large vector] register file useful at all ?

Definitely yes.

> A register has an optimizing effect only when the value in it can be used
> several times, at least twice, ...
> But how is this on vector machines ? The register creates a speedup only
> when it can hold an entire vector, which can be used again later. This
> requires a register long enough to do so. That means vectors of e.g. a
> length of 5000 can not be held anyway, every machine must load, process,
> and store it in pieces, and only a lot of memory bandwidth helps.

Every vector command introduces a new startup period. So if you have to
cut your original vector(s) into pieces that fit into a vector register,
it helps if you need fewer pieces. That is the advantage of configuring a
few very long registers.
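
The startup argument can be sketched with a toy cost model. This is only
an illustration under stated assumptions: each vector instruction pays a
fixed startup and then delivers one element per cycle. The function name
and the startup value of 50 cycles are hypothetical, not measured figures
for any machine.

```python
import math

def strip_mined_cycles(n, reg_len, startup):
    """Cycles for one vector operation over n elements, cut into pieces
    of at most reg_len elements (assumed model: fixed startup per piece,
    then one element per cycle)."""
    pieces = math.ceil(n / reg_len)
    return pieces * startup + n

n = 5000      # vector length from the quoted question
startup = 50  # hypothetical startup cost in cycles, not a measured value
for reg_len in (64, 256, 1024):
    print(reg_len, strip_mined_cycles(n, reg_len, startup))
```

In this model the startup overhead shrinks in proportion to the number of
pieces, so longer registers pay it less often.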

> When configured as a few long vectors the Fujitsu vector registers may
> help, but then comes the second question: Are there any statistics on the
> reusing of vectors? I know about such things for scalar registers, where
> people found that 32 is plenty enough, and only 8 help a lot. But in these
> cases registers are used for loop indexes, addresses etc., which can not
> be compared to the use of vector registers.
> So: what can be gained with such a big vector register file ? Or is it
> only of limited help ? Can the register file be traded against bandwidth to
> load and store from memory ?

Yes it can, and this may be the main reason why Fujitsu gave us such a
large register file.

If you configure more but shorter registers, then you have enough space to
keep intermediate results. This may be the most important advantage of a
large register file: avoiding memory traffic altogether.

By keeping intermediate results in vector registers, you increase the
computational intensity, which is defined as

    number of arithmetic operations
    ----------------------------------
    number of (main-)memory references

This has to be seen together with the number of data paths (max number of
memory references per pipe per cycle), which is 3 for a Cray Y-MP, 2 for a
VP1xxx (as you have in Kaiserslautern) and, alas, only 1 for a VP2xxx. As
a rule of thumb, a good estimate for an upper bound of the speed of an
arithmetic operation is

min{computational intensity * data paths, 1} * peak performance

A simple vector add has a computational intensity of 1/3 (one addition
per two loads and one store), so it requires 3 data paths for full speed.
This is the case on a Y-MP, at least theoretically; in practice you cannot
get the full speed because of memory conflicts with other processors. On a
VP2xxx, however, you get only roughly 1/3 of peak performance. On the
latter machine, increasing computational intensity has a dramatic impact
on the sustained speed. In many cases (among these is matrix
multiplication) you can increase computational intensity by unrolling
outer loops. This is where a large number of vector registers is very
useful.
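
The rule of thumb and the effect of unrolling can be sketched as follows.
This is a rough model, not a measurement. The function names are made up
for illustration, peak performance is normalized to 1, and the
matrix-vector intensity assumes that a chained multiply-add counts as one
arithmetic operation, that x[j] sits in a scalar register, and that the
partial result y stays in a vector register across the u unrolled columns.

```python
from fractions import Fraction

def speed_bound(intensity, data_paths, peak):
    """Rule-of-thumb upper bound: min{intensity * data paths, 1} * peak."""
    return min(intensity * data_paths, 1) * peak

def unrolled_matvec_intensity(u):
    """Assumed model for y = y + A[:, j] * x[j], u columns per pass.
    Per element of y: u chained multiply-adds (counted as u operations),
    u loads of A, plus 1 load and 1 store of y."""
    return Fraction(u, u + 2)

# Vector add: 1 operation per 2 loads + 1 store.
vector_add = Fraction(1, 3)
for name, paths in (("Y-MP", 3), ("VP1xxx", 2), ("VP2xxx", 1)):
    print(name, speed_bound(vector_add, paths, 1))

# Unrolling depth u on a machine with a single data path.
for u in (1, 2, 4, 8):
    ci = unrolled_matvec_intensity(u)
    print(u, ci, speed_bound(ci, 1, 1))
```

Under these assumptions the intensity grows from 1/3 towards 1 as u
increases, but each extra unrolled column needs another vector register to
hold its operand, which is where the large register file pays off.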

You can exploit this on your own machine fairly easily by using the
routines from level 2 BLAS and level 3 BLAS. To the best of my knowledge,
Kaiserslautern uses the routines that were optimized at the University of
Karlsruhe as part of the ODIN project.

Hope this helps,
Volker Kurz

Dr. Volker Kurz *** J. W. Goethe-Universitaet *** Fachbereich Mathematik

