Re: Hardware doing the work (Was Re: Why C is much slower than Fortran)

zalman@netcom.com (Zalman Stern)
6 Jun 1999 23:02:13 -0400

          From comp.compilers


From: zalman@netcom.com (Zalman Stern)
Newsgroups: comp.compilers,comp.arch
Date: 6 Jun 1999 23:02:13 -0400
Organization: Netcom
References: <3710584B.1C0F05F5@hotmail.com> 99-05-057 99-05-142 99-06-021 99-06-024
Keywords: architecture

Joseph H Allen (jhallen@world.std.com) wrote:
: I don't think the compiler has to do very much to support this. A C
: compiler has to unload registers around subroutine calls anyway, so instead
: of emitting:


: ld r7,16(sp)
: ...
: st r7,16(sp)
: jsr foo
: ld r7,16(sp)


: it emits:


: sld r7,16(sp) ; speculative load of r7
: ...
: jsr foo
: check r7 ; reload r7 if 16(sp) changed


I'm not sure whether this is meant to refer to IA64. If it is, and the
above example refers to control-speculative loads, it makes no
sense. (I am ignoring the offset-form loads as a mere typo.) If a
fault on the stack reference is worth considering at all, it certainly
must be raised before the function call. Also, the check instruction
for control speculation is chk.s, which branches to a block of fixup
code if the original load faulted.


Beyond all that, you wouldn't worry about aliasing to a stack local
unless a pointer to it is passed as an argument or put in a global. If
that is the intent, then the above code should show it. (And perhaps
the C code you are starting from would make the argument clearer.)
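
As a hedged sketch of the kind of C source that would raise the
aliasing question, here is a case where a local's address escapes
through a global and a case where it does not. All names (escaped,
foo, no_escape, with_escape) are hypothetical and are not taken from
the quoted post.

```c
#include <stddef.h>

/* Hypothetical example: every name here is illustrative only. */
int *escaped;                 /* global that may alias a caller's local */

void foo(void)
{
    if (escaped != NULL)
        *escaped += 1;        /* store through the alias */
}

int no_escape(void)
{
    int x = 41;               /* address never taken: x can stay in a register */
    foo();                    /* cannot legally modify x */
    return x;                 /* no reload or check is needed */
}

int with_escape(void)
{
    int x = 41;
    escaped = &x;             /* the address escapes via a global */
    foo();                    /* may store to x through the alias */
    escaped = NULL;
    return x;                 /* x must be reloaded (or checked) after the call */
}
```

Only with_escape gives the compiler any reason to reload or check x
after the call; in no_escape the value can simply stay in a register.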


[...]


: Even a more complicated situation is pretty easy:


: ald r1,16(sp)
: ald r2,32(sp)
: ald r3,48(sp)
: ... f(r1,r2,r3)->r4 ...


: jsr foo


: bcheck r1,redo ; Branch if 16(sp) changed.
: bcheck r2,redo
: bcheck r3,redo
: bra skip
: redo:
: ald r1,16(sp)
: ald r2,32(sp)
: ald r3,48(sp)
: ... f(r1,r2,r3)->r4 ...
: skip:


Once again, the looseness here makes it difficult for me to tell
what is going on. (I can't read the above as IA64 code because the
check instructions must have the effective address as an operand.) But
if it means what I think it means, the idea works on IA64. There are a
couple of notable properties of the microarchitecture to take into
account, though.


First, the associative table into which advanced loads place their
addresses is of fixed size. If it is too small, the original entry is
likely to be pushed out before the advanced load check executes after
the function call returns. Before implementing the above
optimization across function calls, I'd be very interested in knowing
the anticipated size of this table.
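
To make the table-size concern concrete, here is a toy model in C of
a fixed-size table of advanced-load entries. It is only a sketch: the
size (ALAT_ENTRIES), the FIFO replacement policy, the matching on both
register and address, and all function names are assumptions made for
illustration, not a description of any real IA64 implementation.

```c
#include <stdint.h>

/* Toy model only. The table size and FIFO replacement are assumed. */
#define ALAT_ENTRIES 4

struct alat_entry { int valid; int reg; uintptr_t addr; };
struct alat { struct alat_entry e[ALAT_ENTRIES]; unsigned next; };

void alat_init(struct alat *t)
{
    for (unsigned i = 0; i < ALAT_ENTRIES; i++)
        t->e[i].valid = 0;
    t->next = 0;
}

/* Advanced load: record "register reg holds the value at addr". */
void alat_advanced_load(struct alat *t, int reg, uintptr_t addr)
{
    /* Drop any older assertion about the same register. */
    for (unsigned i = 0; i < ALAT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].reg == reg)
            t->e[i].valid = 0;
    /* FIFO replacement: overwrite the next slot, evicting its occupant. */
    t->e[t->next].valid = 1;
    t->e[t->next].reg = reg;
    t->e[t->next].addr = addr;
    t->next = (t->next + 1) % ALAT_ENTRIES;
}

/* Store: invalidate every entry whose address matches. */
void alat_store(struct alat *t, uintptr_t addr)
{
    for (unsigned i = 0; i < ALAT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].addr == addr)
            t->e[i].valid = 0;
}

/* Check: 1 if the assertion survived, 0 if the load must be redone. */
int alat_check(const struct alat *t, int reg, uintptr_t addr)
{
    for (unsigned i = 0; i < ALAT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].reg == reg && t->e[i].addr == addr)
            return 1;
    return 0;
}
```

In this model, if the callee performs ALAT_ENTRIES advanced loads of
its own, the caller's entry is evicted and the check after the call
fails even though no store ever touched the address. The redo path is
then taken for capacity reasons alone.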


Second, the associative table also holds a "register tag." In effect,
each entry asserts that a particular register holds the value at a
certain address. These register tags must refer to physical
registers, so on IA64 they must take into account register rotation
and, more importantly, windowing. On a RISC without windowing, the
function call is likely to trash the register anyway. At some point,
you would look at having the register fill code reestablish the
assertion in the associative table.


I'm much more optimistic about using this hardware technique within a
given function than across calls. But we shall see.


-Z-

