Re: Fixed stack / dynamic stack for each thread

"Ira Baxter" <idbaxter@semdesigns.com>
Sat, 10 Jan 2009 13:06:30 -0600

          From comp.compilers

Related articles
Re: Fixed stack / dynamic stack for each thread idbaxter@semdesigns.com (Ira Baxter) (2008-12-23)
Re: Fixed stack / dynamic stack for each thread jgd@cix.compulink.co.uk (2009-01-05)
Re: Fixed stack / dynamic stack for each thread idbaxter@semdesigns.com (Ira Baxter) (2009-01-10)
Re: Fixed stack / dynamic stack for each thread jgd@cix.compulink.co.uk (2009-01-25)
| List of all articles for this month |

From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Sat, 10 Jan 2009 13:06:30 -0600
Organization: Compilers Central
References: 08-12-105 09-01-009
Keywords: parallel, storage, design
Posted-Date: 12 Jan 2009 15:28:05 EST

<jgd@cix.compulink.co.uk> wrote
> idbaxter@semdesigns.com (Ira Baxter) wrote:
>
> > > It's quite possible that people theorising about dynamic stacks, and
> > > maybe even experimenting with them, have solved all these problems.
> > > But I'd want to see some evidence for it.
> > Our PARLANSE parallel programming language
>
> This is interesting; having a language that has this concept thoroughly
> built into it at the code generation level is clearly the best way to do
> it. Trying to retrofit it into an existing language, especially one like
> C/C++ that has pointers, is going to be very hard.


"Having pointers" doesn't have any specific impact on how functions
are called. The C and C++ standards don't dictate use of "contiguous
stacks" for implementating calls, precisely because that's an
implementation concept. PARLANSE has pointers, and this all works
fine.
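
To make the point concrete, here is a minimal C sketch (invented
names; PARLANSE's actual frames are laid out by its compiler, not
written by hand like this): an activation record is just a separately
allocated block, and pointers into it behave like any other pointers.

    #include <stdlib.h>

    /* One frame per call, allocated on demand rather than by
       bumping a contiguous stack pointer. */
    typedef struct {
        int n;        /* parameter */
        int result;   /* local */
    } frame_square;

    static int square(int n) {
        frame_square *fp = malloc(sizeof *fp);  /* per-call frame */
        if (!fp) abort();
        fp->n = n;
        fp->result = fp->n * fp->n;
        int r = fp->result;
        free(fp);                               /* frame dies with the call */
        return r;
    }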


> I hadn't thought of going this far with dynamic stacks, but this
> stack-per-call concept is clearly more viable than only switching
> stacks occasionally.


Dunno about more viable. Gets rid of the conditional check on out-of-stack
at the price of always fetching a new frame. We've been considering
building "medium size stacks" that hold enough stack space for a
*tree* of function calls, where the members of the tree don't recurse.
(Where they do recurse, a stack limit check could be inserted).
This would combine the nice properties of low-overhead allocation
for most calls with the arbitrary size and separate allocation needs
of "fine-grain" threading.
I'd be surprised if this idea hasn't already been implemented by somebody.
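
A rough C sketch of the limit check such a recursive entry might
carry (stack_limit and grow_stack are invented names for assumed
runtime facilities, not anything real in PARLANSE):

    #include <stddef.h>

    extern char *stack_limit;             /* low-water mark of this segment */
    extern void grow_stack(size_t need);  /* chain in a fresh segment */

    enum { FRAME_BYTES = 256 };           /* this function's frame size */

    void recursive_entry(void) {
        char probe;                       /* roughly the current stack top */
        if (&probe - FRAME_BYTES < stack_limit)
            grow_stack(FRAME_BYTES);      /* about to overflow: get more */
        /* ... body; calls within the non-recursive tree need no check ... */
    }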


> > [PARLANSE] is based on the philosophy that there are an unbounded
> > number of small computational grains.


> Makes sense. The stuff I work on won't readily break down into that
> kind of form, sadly.


If you don't try to do this, you end up with rather big chunks of
work, and they tend to be extremely hard to parallelize. So choose:
try to make smaller chunks to make parallelism possible, or use big
chunks and don't go parallel. The latter does not seem like a good
strategy given current trends in processor design.


> > It operates by allocating activation records on a per-function
> > call basis, where the activation record size is determined by the
> > needs just of the function. Each function call thus does a stack
> > switch.
>
> Makes sense. I presume that this language does not allow functions to
> be passed as parameters? I mean this in the sense of the function
> itself, where C would pass a function pointer, rather than the result
> of the function. If that is allowed, how is a function passed?


Functions are easily passed as values. The data structure representing
a function is a pair consisting of a pointer to the function code and
a pointer to a display. That function value is simply passed to a
receiver. Calling the function represented by the function value is
only a slight variation on a standard function call. We do
special-case the function call code for functions that need a
zero-length display.
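
In C terms the representation might look like this (illustrative
names, not PARLANSE's):

    /* A function value: code pointer plus display pointer. */
    typedef struct {
        int (*code)(void **display, int arg);  /* compiled body */
        void **display;                        /* lexical context; may be empty */
    } funval;

    static int call_funval(funval f, int arg) {
        /* Only a slight variation on an ordinary indirect call:
           the callee also receives its display. */
        return f.code(f.display, arg);
    }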


> > Argument handling occurs by passing arguments in the registers, or
> > passing a pointer to a block of parameters in the caller's stack.
>
> Always call-by-value? No way to pass addresses? Much safer that way
> for parallelism, of course.


PARLANSE data types are C-like with a few dynamic types
(such as dynamically sized strings and dynamically sized arrays
of other PARLANSE types). So, it has pointers.
Pointers are just values. If you want to pass a pointer to a function,
just pass it as a value. This isn't necessarily safe for parallelism,
but we assume that our coders understand the trouble they can get into when
two pointers in parallel grains might refer to the same block of storage.
It's a sharp sword, and you can clearly cut yourself badly.
PARLANSE disallows pointer arithmetic, and that seems to help.


> > Uplevel addressing to lexically scoped, addressable values in
> > surrounding function definitions is handled by constructing a
> > "display", an old Algol idea which uses a small array per function,
> > indexed by scope level number, containing a pointer to each parent
> > activation record.
>
> So it's a display of activation records, rather than of variables within
> them? I'm used to displays in work's custom language.


Yes. If you need uplevel addressing to a zillion variables in a
parent function's context, one activation record pointer is way
cheaper than passing a zillion individual pointers.
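
A C-flavored sketch of the idea (hypothetical layout):

    /* The parent's activation record, with many addressable variables. */
    struct parent_frame { int x; int y; /* ... a zillion more ... */ };

    /* The nested function receives a display: one pointer per enclosing
       scope level, not one pointer per variable. */
    static int inner(void **display) {
        struct parent_frame *parent = display[1];  /* level-1 ancestor frame */
        return parent->x + parent->y;              /* uplevel access */
    }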


> > The overhead for all this is cheaper than you'd expect. It only takes
> > a few machine instructions per function call, and this matters only if
> > you are computing fibonacci or something else equally ridiculously
> > simple.
>
> Indeed, the Windows thread activation overhead is going to be much
> bigger than this. That's handling your stacks for you, naturally enough.


We ask Windows to start threads for us, exactly once. Then we try
not to talk to Windows at all, because of its immense overheads.
PARLANSE has its own grain schedulers, split between runtime code
and compiler-generated machine code.


> What stack sizes do you set for the Windows threads? Smaller than the
> default 1MB, I assume?


It is useful to keep a traditional Windows stack per Windows thread.
When a PARLANSE computation grain wants to make a Windows call, it
switches from its small stack to the Windows stack for the underlying
thread, makes the Windows call, and then switches back. A grain owns
the thread stack while the Windows call is being made. That limits
parallelism in calling into Windows, but that doesn't seem like a
particularly big loss.
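
Conceptually, each Windows call site behaves like the sketch below.
switch_to_thread_stack and switch_to_grain_stack are assumed runtime
primitives (the real mechanism is a few instructions of generated
code, not named C functions):

    #include <windows.h>

    extern void switch_to_thread_stack(void);  /* assumed runtime primitive */
    extern void switch_to_grain_stack(void);   /* assumed runtime primitive */

    static BOOL grain_write(HANDLE h, const void *buf, DWORD len) {
        DWORD written;
        BOOL ok;
        switch_to_thread_stack();   /* borrow the big per-thread stack */
        ok = WriteFile(h, buf, len, &written, NULL);
        switch_to_grain_stack();    /* back to the small frame chain */
        return ok;
    }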


> > In practice, <snip>
> > The overhead seems to be maybe 5%; this is well worth
> > it because we can usefully go parallel with reasonably large numbers
> > of processors (threads).
>
> How many real processor cores have you taken this up to? It seems to me
> that simple memory contention would start to bite you at the 10-ish kind
> of level.


Probably. We don't have a lot of experience there. The memory
contention we have seen to date shows up largely in two places:
1) The single-channel-to-memory architecture of the system:
          8 CPUs all trying to read at once can swamp the memory channel.
2) As people probably expect, contention over locks.


> > In production, our code doesn't trap
>
> Can you absolutely prevent this? You can go a long way towards this by
> having no floating point trapping and limited dynamic memory allocation
> (and thus no pointer arithmetic for programmers to screw up), but what
> happens when a Windows API call messes up? Debugging would seem hard.


You can't prevent it; people write code that might trap (e.g., divide
by zero). What the compiler does is detect cases where traps might
occur in a function. (We ignore "illegal instruction" as impossible
if the compiler works, and "illegal access" as the sign of a broken
application program.) If no traps can occur in a function (the
remarkably common case), you can use a small stack frame safely. If a
trap might occur (floating point traps are the main set), we allocate
an appropriately larger stack frame.
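
The frame-size decision reduces to something like this sketch
(compiler-internal pseudocode; every name and the slop constant are
invented for illustration):

    #include <stddef.h>

    typedef struct Function Function;          /* hypothetical IR node */
    extern size_t locals_size(const Function *);
    extern size_t spill_size(const Function *);
    extern int    may_trap(const Function *);  /* FP ops, integer divide, ... */

    enum { TRAP_DISPATCH_SLOP = 4096 };        /* room for trap dispatch */

    size_t frame_size(const Function *f) {
        size_t s = locals_size(f) + spill_size(f);
        if (may_trap(f))        /* the uncommon case */
            s += TRAP_DISPATCH_SLOP;
        return s;               /* most functions stay small */
    }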


Windows API calls are made using a thread-specific "large stack".
If the Windows API is passed bad values, yes, the program may blow up.


We use a lot of "trust" checks (analogs to C "asserts"), and the
compiler has a debug mode in which it generates many sanity checks
(e.g., array bounds checks, uniitialized variables, access to
already-freed-storage). These are all pretty useful, but yes,
debugging in such a parallel enviroment can be hard.
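
A C-style analog of such a trust check (the PARLANSE syntax differs;
this just shows the flavor):

    #include <stdio.h>
    #include <stdlib.h>

    #define TRUST(cond, msg)                                    \
        do {                                                    \
            if (!(cond)) {                                      \
                fprintf(stderr, "trust failed: %s\n", (msg));   \
                abort();                                        \
            }                                                   \
        } while (0)

    static int divide(int a, int b) {
        TRUST(b != 0, "divisor must be nonzero");
        return a / b;
    }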


> > What's more remarkable is that PARLANSE propagates exceptions across
> > not only function boundaries, but cleanly across parallelism
> > boundaries. That requires a lot of complicated machinery
>
> Yup. I've done trap propagation out of Windows worker threads back to
> the thread that created them. Not a lot of code, but a lot of care needed.


It takes more than that, if one grain can asynchronously abort
another, as is possible in PARLANSE. You need to have very clean
rules about what happens when an async abort occurs, and how exception
handlers can act like transactions in the face of asynchronous
aborts.


> John


-- IDB


