Re: SPARC references

torek@elf.ee.lbl.gov (Chris Torek)
Fri, 12 Apr 91 16:35:14 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: torek@elf.ee.lbl.gov (Chris Torek)
Keywords: SPARC, optimize, code, registers
Organization: Lawrence Berkeley Laboratory, Berkeley
References: <1991Mar20.222801.11711@cs.cornell.edu> <1991Mar24.222923.10183@ccu.umanitoba.ca> <1991Mar28.115715.3545@maths.bath.ac.uk> <1991Mar29.214751.3045@ccu.umanitoba.ca>
Date: Fri, 12 Apr 91 16:35:14 GMT

(I will try to stick to `language' issues here.)


>In article <1991Mar28.115715.3545@maths.bath.ac.uk> John ffitch
><jpff@maths.bath.ac.uk> writes:
[about SPARC machines]
>>Some thing which really explained what happens to the stack or window
>>roll, how va_args is supposed to work, and so on would be really helpful.


In article <1991Mar29.214751.3045@ccu.umanitoba.ca> salomon@ccu.umanitoba.ca
(Dan Salomon) writes:
>... the top of the stack must at all times contain 23 words that can be
>clobbered by called procedures. So any space that you use on the stack
>must have been allocated below the stack top by the SAVE instruction.


No and yes: you must typically reserve at least 96 bytes, but the reason
given above is incomplete.


The SPARC has `register windows': some number of registers are
arranged in a circular fashion, with overlap. A five bit
field in the CPU Processor Status Register (PSR), called the Current
Window Pointer (CWP), tells which window is `current'. References
to Input, Local, and Output registers are really references to
registers in the current window.


The five bit field guarantees that no more than 32 windows will ever
exist. Actual SPARC implementations have fewer windows (e.g., this
SPARCstation-1 has 7). Call the actual number `nwindows'. Two special
unprivileged instructions allow you to alter the CWP field:


  -- SAVE: this decrements CWP. If the result is 31, it is changed to
        nwindows - 1. In other words, this computes:
                psr<4:0> <- (psr<4:0> - 1) mod nwindows;


  -- RESTORE: this increments CWP. If the result is nwindows, it is
        changed to 0. In other words, this computes:
                psr<4:0> <- (psr<4:0> + 1) mod nwindows;


Another privileged register, the Window Invalid Mask (WIM), holds a
bitmask of `invalid' windows. This is used, e.g., to keep subroutines
from `stepping on' the contents of some other subroutine's window.
SAVE and RESTORE trap to the operating system, rather than doing their
usual job, if the bit corresponding to the new CWP field is set in the
WIM. That is, SAVE and RESTORE really do this:


        new_cwp <- (psr<4:0> OP 1) mod nwindows;        // OP => + or -
        if ((1 << new_cwp) & WIM) then trap; else psr<4:0> <- new_cwp;
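
The rule above can be modelled in a few lines of C. This is only a
sketch: `struct cpu' and `try_window_op' are names made up for the
illustration, and a real trap does a good deal more than return a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the window state; the names are illustrative. */
struct cpu {
    unsigned cwp;       /* Current Window Pointer, 0 .. nwindows-1 */
    uint32_t wim;       /* Window Invalid Mask, one bit per window */
    unsigned nwindows;  /* implemented windows, at most 32 */
};

/*
 * delta is -1 for SAVE and +1 for RESTORE.  Returns true if the
 * instruction would trap; otherwise updates cwp and returns false.
 */
bool try_window_op(struct cpu *c, int delta)
{
    assert(delta == 1 || delta == -1);
    assert(c->nwindows >= 2 && c->nwindows <= 32);

    int n = (int)c->nwindows;
    /* Add n before reducing so the left operand of % is never negative. */
    int new_cwp = ((int)c->cwp + delta + n) % n;

    if ((UINT32_C(1) << new_cwp) & c->wim)
        return true;            /* window overflow or underflow trap */
    c->cwp = (unsigned)new_cwp;
    return false;
}
```

With nwindows = 7, cwp = 6 and only bit 0 set in wim, five SAVEs in a
row succeed and the sixth reports a trap, leaving cwp at 1.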


Every trap begins by doing an implicit SAVE (even if the result makes
CWP indicate an invalid window) and writing some trap recovery
information into the Local registers in the new window; thus the
operating system must always maintain at least one invalid window (for
traps). (Trap handlers must either run entirely within their special
window, or else go through some fairly major gyrations, which makes
writing the trap code very interesting, but this is mainly an
architectural issue....)


For simplicity, let us assume that the machine has 7 windows, leaving
at most 6 to user programs. Suppose that a user program is started
with CWP=6 and window 0 marked invalid. This means the user program
can use windows 6, 5, 4, 3, 2, and 1 without causing a trap. Let us
also assume that nothing else uses any windows (e.g., all interrupts
are disabled), and that each subroutine uses one new window. Then
we might have a situation like this:


        _startup        window = 6
        main()          window = 5
        init()          window = 4
        initobj()       window = 3
        initobjtab()    window = 2
        emalloc()       window = 1


Now emalloc() calls malloc(), which attempts a SAVE instruction.
Window 0 is invalid and this therefore traps. What must the trap
handler do?


Somehow, the trap handler must make window 0 `available'. For window 0
to be available, window 6 must not be in use (it must be available for
traps to scribble into)---but window 6 contains values that the C
library startup code may need. These must be saved somewhere.


SunOS, Sprite, and 4BSD all use the same technique: they write the
contents of window 6 into the place to which window 6's stack pointer
points. Clearly window 6's registers must be saved into some location
unique to this invocation of _startup. One technique, used (I believe)
on the Pyramid, is to have a separate `control stack'. The advantage
here is that if the control stack pointer is not user-modifiable, the
O/S can be sure that it points to a valid place. Existing SPARC window
save code goes through the above-noted gyrations in order to verify the
user stack pointer. Things are particularly exciting when the user
stack happens to have been paged out. With a control stack, the O/S
can guarantee a minimal in-core region whenever the user process is
runnable. Depending on time pressures and compatibility bugaboos, we
may investigate using a control stack instead of, or in addition to,
the user's stack, in 4BSD, assuming I ever get 4BSD going (maybe if I
stopped working on this news article... :-) ). Control stacks have the
disadvantage that one must partition the virtual space in advance. If
the partitioning is a mismatch for the process, this may put an
artificially low limit on the number of stack frames. One can
move the control stack in virtual space, but this quickly becomes
complex.


In any case, current systems want to do the following within the trap
handler:


        (change to window 6)
        std     %l0, [%sp + (0*8)]      ! store Local registers into stack
        std     %l2, [%sp + (1*8)]
        std     %l4, [%sp + (2*8)]
        std     %l6, [%sp + (3*8)]
        std     %i0, [%sp + (4*8)]      ! store Input registers into stack
        std     %i2, [%sp + (5*8)]
        std     %i4, [%sp + (6*8)]
        std     %i6, [%sp + (7*8)]
        (change back to window 0, set WIM to 1 << 6, return from trap)


This whole sequence imposes one constraint, and the `std' instructions
impose another:


  A. There must be at least 64 bytes at each window's %sp that are
        otherwise unused.
  B. Each window's %sp must be doubleword (8 byte) aligned.
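
The spill and the two constraints can be modelled in C. This is a
sketch under simplifying assumptions, not code from any real kernel:
`struct machine' and `window_overflow' are invented names, window
overlap is ignored (each window just holds its 8 Locals and 8 Inputs),
and the user stack is a plain array indexed by byte offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NWINDOWS = 7, WINDOW_WORDS = 16, STACK_WORDS = 1024 };

/* Hypothetical machine model; every name here is illustrative. */
struct machine {
    unsigned cwp;                           /* current window */
    uint32_t wim;                           /* window invalid mask */
    uint32_t regs[NWINDOWS][WINDOW_WORDS];  /* 8 Locals, then 8 Inputs */
    uint32_t sp[NWINDOWS];                  /* each window's %sp, as a byte
                                               offset into stack[] */
    uint32_t stack[STACK_WORDS];            /* stand-in for user memory */
};

/* Handle a SAVE that trapped because the window below cwp is invalid. */
void window_overflow(struct machine *m)
{
    unsigned target = (m->cwp + NWINDOWS - 1) % NWINDOWS;  /* wanted window */
    unsigned victim = (target + NWINDOWS - 1) % NWINDOWS;  /* oldest window */
    uint32_t sp = m->sp[victim];

    assert(m->wim & (UINT32_C(1) << target));  /* this is why we trapped */
    assert(sp % 8 == 0);                       /* B: doubleword aligned */
    assert(sp + 64 <= sizeof m->stack);        /* A: 64 bytes available */

    /* Eight std instructions store 16 words (64 bytes) at the victim's %sp. */
    memcpy(&m->stack[sp / 4], m->regs[victim],
           WINDOW_WORDS * sizeof(uint32_t));

    m->wim = UINT32_C(1) << victim;  /* the victim becomes the trap window */
    m->cwp = target;                 /* models re-executing the SAVE, which
                                        now succeeds */
}
```

Starting from cwp = 1 with window 0 invalid, the call spills window 6
to its own stack pointer, leaves only bit 6 set in wim, and ends with
cwp = 0.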


If these conditions are not met, SunOS and Sprite kill the process.
(My kernel uses a special per-process save area to hold the values
until C code can store them into the user stack, and I do not bother to
check for 8-byte alignment in this code, so in theory your program will
continue to run, albeit slowly, if you goof up the alignment.) The
obvious inverse sequence occurs when RESTORE instructions trap (the
trap handler is somewhat peculiar since the implicit SAVE on each trap
moves the CWP in the wrong direction).


This takes care of the first 64 bytes, or 16 words, that Dan Salomon
mentioned. What about the other 7 words?


Sun defined their stack frame format to include another 8 words on
every stack frame, partitioned as follows (see <machine/frame.h>):


        1*4 bytes: fr_stret, `struct return addr'
        6*4 bytes: fr_argd[6], `arg dump area'
        1*4 bytes: fr_argx[1], `array of args past the sixth'
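
Put together with the 64-byte window save area, the minimum frame can
be sketched as a C structure. This is not a copy of <machine/frame.h>:
only fr_stret, fr_argd, and fr_argx are names from the list above; the
struct name and the first two members are assumptions, and every field
is declared as a 32-bit word so that the sizes come out right on any
host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the minimum stack frame, in 32-bit words. */
struct frame_sketch {
    uint32_t fr_local[8];  /* window save area: Locals (name assumed) */
    uint32_t fr_in[8];     /* window save area: Inputs (name assumed) */
    uint32_t fr_stret;     /* struct return addr */
    uint32_t fr_argd[6];   /* arg dump area */
    uint32_t fr_argx[1];   /* args past the sixth */
};

/* 16 + 1 + 6 + 1 = 24 words = 96 bytes. */
static_assert(sizeof(struct frame_sketch) == 96, "minimum frame is 96 bytes");
static_assert(offsetof(struct frame_sketch, fr_stret) == 64, "after save area");
static_assert(offsetof(struct frame_sketch, fr_argd) == 68, "after fr_stret");
static_assert(offsetof(struct frame_sketch, fr_argx) == 92, "after dump area");
```

Note that fr_argd lands at offset 68, which is not a multiple of 8, so
with %sp doubleword aligned the arg dump area is only word aligned.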


The `struct return addr' field is normally unused. For C functions
that return a structure object, however, Sun's compiler does the
following. Suppose function f() returns a structure. Then:


  1. Routines that call f() set their own fr_stret frame element to
        point to the place in which f() should store its return value.


  2. Routines that call f() do so with the sequence:


                call    _f
                nop             ! or pass an argument
                unimp   SIZE


        Here SIZE is the number of bytes the caller expects f() to store
        through fr_stret; this is stored in an otherwise unused bit field
        within the UNIMP instruction.


  3. f() returns not with the usual `ret; restore' sequence, but rather
        through a jump to .stret1, .stret2, .stret4, or .stret8. These
        library routines are given the address of f()'s return value (which
        f() has built somewhere in its own stack frame) and the size of
        this value. They then:


          A. Check the instruction at the return location. If it is not an
             UNIMP, they just return (thus discarding f()'s return value).
             Otherwise:


          B. Read the `size' field out of the UNIMP instruction. If this
             matches the number of bytes f() wants to return, they copy that
             many bytes from f()'s return structure to wherever fr_stret
             points, and then advance the return address over the UNIMP
             instruction. If the sizes do not match, they leave the return
             address alone. They then return from f(); if the sizes did
             not match, this causes a runtime error (a core dump) because
             of the UNIMP instruction.


        The only difference between the four `.stret' routines is the loop
        used to copy the return value: .stret1 copies bytes, .stret2 copies
        halfwords, .stret4 and .stret8 copy words (.stret8 could copy
        doublewords but on SPARCstation-1s this is no faster). (I have to
        wonder why Sun do not simply have f() do the work inline and call
        bcopy() or memmove().)
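
Steps A and B can be modelled in C. This is a sketch, not the library
source: `stret_return' and its parameters are invented for the
illustration, the caller's code is an array of instruction words, and
the return location is an index into that array. The UNIMP test assumes
the usual SPARC encoding (op and op2 fields both zero, with the
constant in the low 22 bits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* UNIMP: bits 31:30 (op) and 24:22 (op2) are zero; low 22 bits are data. */
static bool is_unimp(uint32_t insn)
{
    return (insn & UINT32_C(0xC1C00000)) == 0;
}

static uint32_t unimp_size(uint32_t insn)
{
    return insn & UINT32_C(0x003FFFFF);
}

/*
 * text     caller's instructions
 * ret_idx  index of the word after the call's delay slot
 * fr_stret where the caller wants the structure stored
 * value    the structure f() built in its own frame
 * size     number of bytes f() wants to return
 *
 * Returns the index at which the caller resumes execution.
 */
size_t stret_return(const uint32_t *text, size_t ret_idx,
                    void *fr_stret, const void *value, uint32_t size)
{
    uint32_t insn = text[ret_idx];

    assert(fr_stret != NULL && value != NULL);
    if (!is_unimp(insn))
        return ret_idx;         /* A: caller discards the value */
    if (unimp_size(insn) != size)
        return ret_idx;         /* B: mismatch; caller runs into the UNIMP */
    memcpy(fr_stret, value, size);
    return ret_idx + 1;         /* B: match; step over the UNIMP */
}
```

A caller whose code is `call; nop; unimp 8; ...' resumes one word past
the UNIMP when f() returns 8 bytes, and on the UNIMP itself when f()
returns any other size.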


There are a number of reasons why this is the wrong approach, but I
will not go into them here. GCC uses the correct technique: callers of
f() pass an extra `hidden' argument which points to the place f()
should write its return value, and small structures are returned
entirely within registers, without any copying. This does not catch
runtime errors but is considerably more efficient (oops, I said I was
not going to go into this :-) ).
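
The hidden-argument convention can be written out by hand in C. This
is an illustration only, not what any compiler literally emits:
`struct point', `make_point', and `make_point_hidden' are made-up
names, and the in-register return of small structures is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct point { int x, y, z; };

/* What the programmer writes. */
struct point make_point(int x, int y, int z)
{
    struct point p = { x, y, z };
    return p;
}

/*
 * Roughly what the convention amounts to: the caller supplies the
 * address of the result as an extra leading argument, and the callee
 * writes the value there directly, so no size check or copy routine
 * is involved.
 */
void make_point_hidden(struct point *result, int x, int y, int z)
{
    assert(result != NULL);
    result->x = x;
    result->y = y;
    result->z = z;
}
```

A call such as `q = make_point(1, 2, 3);' then corresponds to
`make_point_hidden(&q, 1, 2, 3);'.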


This leaves the arg dump area and arg extension space. fr_argx is
simply an array of `arguments that did not fit into the 6 Input
registers'; its size actually varies, and is usually 0 (which is then
rounded up to 1 because of the stack alignment constraints). The arg
dump area has two essentially separate uses:


  A. Functions that run out of registers may spill some of them into
        the arg dump area. Sun's compiler only ever spills input registers
        0 through 5 into the corresponding space (probably because of the
        next use). Functions whose arguments' addresses are taken can
        likewise use the arg dump area to store these.


  B. Functions that take variable arguments can write their input
        registers here. Since the arg dump area immediately precedes the
        extension space, this puts all parameters into a single contiguous
        region on the stack. This means that old (broken) code that
        assumes contiguous addressable arguments continues to work.


Both of these can be (and, I claim, should be) done differently.
Functions that must spill registers, or take addresses of parameters,
can allocate their own stack space. Functions with variable argument
lists can write the register arguments into local stack space and
retrieve arguments using something like:


        next_arg = --regarg >= 0 ? *regblock++ : *argx++;


(note that large objects such as arguments of type `double' must be
handled differently). This slows down varargs functions, but they are
rare.
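
The one-liner above can be fleshed out as follows. This is a sketch
for word-sized arguments only: `struct arg_cursor', `arg_start', and
the wrapper function are hypothetical, and the register arguments are
assumed to have been copied into a local block of six words already.

```c
#include <assert.h>
#include <stdint.h>

enum { NREGARGS = 6 };  /* arguments passed in Input registers */

/* Hypothetical cursor over a variable argument list. */
struct arg_cursor {
    int regarg;                /* register arguments not yet consumed */
    const uint32_t *regblock;  /* next saved register argument */
    const uint32_t *argx;      /* next argument in the extension area */
};

/* `named' is the number of fixed arguments before the variable part. */
void arg_start(struct arg_cursor *ap, const uint32_t saved[NREGARGS],
               const uint32_t *argx, int named)
{
    assert(named >= 0 && named <= NREGARGS);
    ap->regarg = NREGARGS - named;
    ap->regblock = saved + named;
    ap->argx = argx;
}

/* Take from the saved registers until they run out, then from argx. */
uint32_t next_arg(struct arg_cursor *ap)
{
    return --ap->regarg >= 0 ? *ap->regblock++ : *ap->argx++;
}
```

With two named arguments, the first four calls to next_arg() read
saved[2] through saved[5], and later calls walk the extension area.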


At the least, the `fr_stret' field should not exist; this would allow
input registers to be stored with `std' instructions, which *are*
faster on some implementations. It would also allow the argument
extension area to occupy no space in the usual case.


Anyway, this is where the 96 bytes per stack frame disappears to.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA Domain: torek@ee.lbl.gov
--

