software pipelining on s/390s (Robert Bernecky)
Sun, 12 Nov 1995 22:23:37 GMT

          From comp.compilers

Newsgroups: comp.compilers
From: (Robert Bernecky)
Keywords: architecture, optimize, IBM, comment
Organization: University of Toronto, Computer Engineering
Date: Sun, 12 Nov 1995 22:23:37 GMT

Does anyone have some hard numbers [I'll even settle for soft-boiled
numbers] on the utility of software pipelining for array operations
such as double_vector+double_vector on current S/390 machines? The
reason I ask is that I'm trying it out myself, and getting fairly
puzzling results, as if it is no help at all.

The initial loop looks roughly like:

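lp          ld   d0,0(r5)        load next element of first operand
            ld   d2,0(r8)        load next element of second operand
            la   r5,8(,r5)       advance first operand pointer by 8
            la   r8,8(,r8)       advance second operand pointer by 8
            adr  d0,d2           d0 = d0 + d2 (long FP add)
            std  d0,0(r10)       store the result
            bxle ra,r6,lp        ra += r6; loop while ra <= r7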

The pipelined loop loads d4,d6 while doing the la/la/adr, then
moves d4,d6 into d0,d2 [for code generator limitation reasons],
then starts the load of the next operand pair into d4,d6, and so on.
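
For readers who want the schedule spelled out, here is a minimal C sketch
of the two loop shapes described above. It is an illustration only, not the
actual generated S/390 code: the function names vadd_plain and
vadd_pipelined are hypothetical, the local variables are merely named after
the FP registers in the description, and a C compiler is free to reorder
these statements, so this shows the intended ordering rather than
guaranteeing it.

```c
#include <stddef.h>

/* Plain loop: load both operands, add, store (same shape as the
   initial loop). */
void vadd_plain(const double *x, const double *y, double *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* Software-pipelined loop: the loads for element i+1 are issued
   before the add and store for element i. */
void vadd_pipelined(const double *x, const double *y, double *z, size_t n)
{
    if (n == 0)
        return;

    double d4 = x[0];           /* prologue: preload the first pair */
    double d6 = y[0];

    for (size_t i = 0; i + 1 < n; i++) {
        double d0 = d4;         /* move the preloaded pair into d0,d2 */
        double d2 = d6;
        d4 = x[i + 1];          /* start the loads of the next pair */
        d6 = y[i + 1];
        z[i] = d0 + d2;         /* add and store the current pair */
    }

    z[n - 1] = d4 + d6;         /* epilogue: the last pair */
}
```

Both functions should produce identical results; only the position of the
loads relative to the add differs, which is what the timing comparison is
meant to expose.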

My numbers suggest that either caching [we ARE stride 1] is working
VERY well, or that something else in the system is working very well.
Or else, some other part of the system is running so damn slow that
it's swamping all my measurements either way.

[The 390 probably has great big cache lines. Also, they may be using
Tomasulo's scheme (invented for the 360/91), which gives the effect of
software pipelining in hardware and is particularly useful on the 360
architecture, since it has only 4 FP registers. -John]
