Related articles |
---|
200 way issue? davidm@questor.rational.com (1993-09-29) |
Re: 200 way issue? anton@mips.complang.tuwien.ac.at (1993-09-30) |
Re: 200 way issue? pop@mtu.edu (1993-09-30) |
Re: 200 way issue? grover@brahmand.Eng.Sun.COM (1993-09-30) |
Re: 200 way issue? petersen@sp51.csrd.uiuc.edu (1993-09-30) |
Re: 200 way issue? mac@coos.dartmouth.edu (1993-10-01) |
Re: 200 way issue? preston@dawn.cs.rice.edu (1993-10-01) |
Re: 200 way issue? daveg@thymus.synaptics.com (Dave Gillespie) (1993-10-04) |
Newsgroups: | comp.compilers |
From: | Dave Gillespie <daveg@thymus.synaptics.com> |
Keywords: | performance, parallel |
Organization: | Compilers Central |
References: | 93-09-142 |
Date: | Mon, 4 Oct 1993 20:24:14 GMT |
davidm@questor.rational.com (David Moore) writes:
> HOWEVER, the question I want to raise is this: How many way issue can one
> actually use on real code. ... Intuition suggests that the mode of the
> distribution would be quite small - probably 2.
A lot of it depends on what you count as an instruction.
Here's an excerpt from a loop we use on i860's to do the vector
computation "b[i] += W[i] * e". On the i860, parallelism and pipelining
must be managed by the programmer; the "d." prefix on the FP instructions
in the left column causes them to execute in parallel with the subsequent
integer instructions.
loop1:
    d.i2p1.ss b3,W6,b_0;    pfld.d 2*4(rW)++,W2
    d.i2p1.ss b4,W7,b_1;    fld.q  4*4(rb)++,b0
    d.i2p1.ss b5,W0,b_2;    pfld.d 2*4(rW)++,W4
    d.fnop;                 bla    incr,len,loop2
    d.i2p1.ss b6,W1,b_3;    fst.q  b_0,-8*4(rb)
    ;; Loop2 is similar but with a different set of registers;
    ;; it branches back to loop1.
The "i2p1" instruction is a particular variety of multiply-add. The "e"
operand has already been loaded into an implied register. The multiply
and add pipelines each have three stages, which is why the "b", "W", and
"b_" numbers are out of sync.
The "fld.q" and "fst.q" instructions load and store a quadword (four
single-precision floats) in one gulp; "++" specifies an auto-increment.
The "pfld.d" instruction loads a double-word using a different datapath
that can run simultaneously with the main fld/fst datapath. By
interleaving the two types of loads you can keep data flowing into the CPU
without any cache-related stalls.
The "bla" is basically a decrement-and-delayed-branch.
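For readers who don't speak i860, here is a rough sketch in Python of what one trip through the loop accomplishes. It only models the arithmetic, not the pipelining or the register skew, and the function names are made up for illustration. The second version groups the work by fours to mirror one "fld.q", two "pfld.d", four multiply-adds, and one "fst.q"; it assumes the vector length is a multiple of four.

```python
def vector_update(b, W, e):
    """Reference semantics: b[i] += W[i] * e for every element."""
    for i in range(len(b)):
        b[i] += W[i] * e
    return b


def vector_update_by_quads(b, W, e):
    """Same result, four elements per step (hypothetical helper).

    Assumes len(b) == len(W) and that the length is a multiple of 4.
    """
    assert len(b) == len(W) and len(b) % 4 == 0
    for base in range(0, len(b), 4):
        b_quad = b[base:base + 4]   # one quadword load of b (like fld.q)
        w_quad = W[base:base + 4]   # two double-word loads of W (like pfld.d, twice)
        # Four multiply-adds, then one quadword store (like fst.q).
        b[base:base + 4] = [x + w * e for x, w in zip(b_quad, w_quad)]
    return b
```

Both functions update "b" in place and return it, so they can be compared directly on copies of the same input.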
This loop produces four results per five clock cycles. (The newer i860 XP
chip has a "pfld.q" instruction, which should allow one to eliminate the
"nop" on the FP side and achieve optimum performance of one result per
cycle with 100% utilization of the adder, multiplier, data cache, and
external data bus.)
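The throughput figures can be checked with a little arithmetic. This sketch just counts lines in the excerpt: four "d.i2p1.ss" lines and one "d.fnop" line per trip. The XP case assumes, as the paragraph above suggests but does not demonstrate, that "pfld.q" removes the "nop" line entirely; the function name is illustrative.

```python
from fractions import Fraction


def results_per_cycle(results, cycles):
    """Results retired per clock, as an exact fraction."""
    return Fraction(results, cycles)


# Excerpt as shown: 4 multiply-add lines plus 1 fnop line = 5 cycles.
as_shown = results_per_cycle(4, 5)

# Assumed i860 XP variant: the fnop line is gone, so 4 results in 4 cycles.
without_nop = results_per_cycle(4, 4)

print(as_shown, float(as_shown))        # 4/5 0.8
print(without_nop, float(without_nop))  # 1 1.0
```

So the loop as written keeps the FP side busy 80% of the time, and dropping the "nop" would take it to one result per cycle.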
While the i860 treats each line of the above as a pair of instructions
(and the "feel" is more like a single VLIW instruction), some other
processors would require many more instructions to do the same amount of
work. From my understanding of the Alpha architecture, the line
    d.i2p1.ss b4,W7,b_1; fld.q 4*4(rb)++,b0
would require *seven* instructions to match. You'd need a multiply, an
add, four loads, and an increment. To do it in one cycle, you'd need
seven-way issue; as far as I know there are no Alpha chips that come close
to this yet. (To be fair, current Alpha chips have much higher clock
rates.)
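The count of seven is just a tally of the operations listed above. A small sketch, with the breakdown taken from my reading of the Alpha rather than from any real instruction listing (the table and function names are hypothetical):

```python
# Operations needed to match "d.i2p1.ss b4,W7,b_1; fld.q 4*4(rb)++,b0",
# per the breakdown in the text: the fused multiply-add splits in two,
# the quadword load becomes four single loads, and "++" needs its own add.
ALPHA_EQUIVALENT = {
    "multiply": 1,
    "add": 1,
    "load": 4,
    "increment": 1,
}


def issue_width_needed(ops, cycles=1):
    """Smallest issue width that fits all ops into the given cycle count."""
    total = sum(ops.values())
    return -(-total // cycles)  # ceiling division


print(issue_width_needed(ALPHA_EQUIVALENT))            # 7
print(issue_width_needed(ALPHA_EQUIVALENT, cycles=2))  # 4
```

This ignores dependencies between the operations, so it is a lower bound on the issue width rather than a schedule.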
-- Dave