Re: 200 way issue?

Dave Gillespie
Mon, 4 Oct 1993 20:24:14 GMT

          From comp.compilers

Newsgroups: comp.compilers
From: Dave Gillespie
Keywords: performance, parallel
Organization: Compilers Central
References: 93-09-142
Date: Mon, 4 Oct 1993 20:24:14 GMT

David Moore writes:
> HOWEVER, the question I want to raise is this: How many way issue can one
> actually use on real code. ... Intuition suggests that the mode of the
> distribution would be quite small - probably 2.

A lot of it depends on what you count as an instruction.

Here's an excerpt from a loop we use on i860's to do the vector
computation "b[i] += W[i] * e". On the i860, parallelism and pipelining
must be managed by the programmer; the "d." prefix on the FP instructions
in the left column causes them to execute in parallel with the subsequent
integer instructions.

loop1:
    b3,W6,b_0;  pfld.d 2*4(rW)++,W2
    b4,W7,b_1;  fld.q 4*4(rb)++,b0
    b5,W0,b_2;  pfld.d 2*4(rW)++,W4
    d.fnop;     bla incr,len,loop2
    b6,W1,b_3;  fst.q b_0,-8*4(rb)

;; Loop2 is similar but with a different set of registers;
;; it branches back to loop1.

The "i2p1" instruction is a particular variety of multiply-add. The "e"
operand has already been loaded into an implied register. The multiply
and add pipelines each have three stages, which is why the "b", "W", and
"b_" numbers are out of sync.

The "fld.q" and "fst.q" instructions load and store a quadword (four
single-precision floats) in one gulp; "++" specifies an auto-increment.
The "pfld.d" instruction loads a double-word using a different datapath
that can run simultaneously with the main fld/fst datapath. By
interleaving the two types of loads you can keep data flowing into the CPU
without any cache-related stalls.

The "bla" is basically a decrement-and-delayed-branch.

This loop produces four results per five clock cycles. (The newer i860 XP
chip has a "pfld.q" instruction, which should allow one to eliminate the
"nop" on the FP side and achieve optimum performance of one result per
cycle with 100% utilization of the adder, multiplier, data cache, and
external data bus.)
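
For readers who don't read i860 assembly, the work done by one pass
through the loop corresponds roughly to one trip through the unrolled
body in the C sketch below: four multiply-adds, four loads each from b
and W, and four stores to b. This is only a plain reference version of
"b[i] += W[i] * e"; the function name vec_madd and the cleanup loop are
made up for illustration, and nothing here models the i860's pipelines
or dual-instruction mode.

```c
#include <stddef.h>

/* Reference version of b[i] += W[i] * e, unrolled by four so that each
   trip through the main loop body produces four results.
   The name vec_madd is hypothetical, chosen for this sketch only. */
void vec_madd(float *b, const float *W, float e, size_t n)
{
    size_t i = 0;

    /* Four independent multiply-adds per iteration. */
    for (; i + 4 <= n; i += 4) {
        b[i]     += W[i]     * e;
        b[i + 1] += W[i + 1] * e;
        b[i + 2] += W[i + 2] * e;
        b[i + 3] += W[i + 3] * e;
    }

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        b[i] += W[i] * e;
}
```

The four statements in the unrolled body are independent of one another,
which is what lets hand-scheduled code overlap them in the multiplier
and adder pipelines.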

While the i860 treats each line of the above as a pair of instructions
(and the "feel" is more like a single VLIW instruction), some other
processors would require many more instructions to do the same amount of
work. From my understanding of the Alpha architecture, the line

    b4,W7,b_1;  fld.q 4*4(rb)++,b0

would require *seven* instructions to match. You'd need a multiply, an
add, four loads, and an increment. To do it in one cycle, you'd need
seven-way issue; as far as I know there are no Alpha chips that come close
to this yet. (To be fair, current Alpha chips have much higher clock

-- Dave
