Re: Instrumenting multithreaded applications

David Chase <>
25 Jan 2003 01:07:12 -0500

          From comp.compilers

Related articles
Instrumenting multithreaded applications (Matthew Legendre) (2003-01-21)
Re: Instrumenting multithreaded applications (2003-01-25)
Re: Instrumenting multithreaded applications (Dmitry A.Kazakov) (2003-01-25)
Re: Instrumenting multithreaded applications (John Stracke) (2003-01-25)
Re: Instrumenting multithreaded applications (Robert A Duff) (2003-01-25)
Re: Instrumenting multithreaded applications (Ben Elliston) (2003-01-25)
Re: Instrumenting multithreaded applications (Joachim Durchholz) (2003-01-25)
Re: Instrumenting multithreaded applications (David Chase) (2003-01-25)
Re: Instrumenting multithreaded applications (Lex Spoon) (2003-01-25)
Re: Instrumenting multithreaded applications (Ira Baxter) (2003-01-26)
| List of all articles for this month |

From: David Chase <>
Newsgroups: comp.compilers
Date: 25 Jan 2003 01:07:12 -0500
Organization: Compilers Central
Keywords: parallel, testing
Posted-Date: 25 Jan 2003 01:07:10 EST

On 21 Jan 2003 00:14:44 -0500, Matthew Legendre <> wrote:
> Does anyone have any experience with or know of any work related to
> instrumenting a multithreaded application?

Yes, some of it proprietary. I think I can provide general
information that is still useful.

> We've got a situation where we want to insert instrumentation into a
> multithreaded application (for profiling purposes). We need to use some
> form of mutual exclusion since the instrumentation writes to a global data
> structure.
> The problem is that we may insert instrumentation into a signal handler.

There are two general attacks on this problem:

1) you can rewrite your instrumentation code to use fetch-and-add or
compare-and-swap -- this is possible if you are just updating
counters. This removes the need for locks; no locks, means no
deadlocks. The locking probably uses one of these anyway, so (if
there is only one or two counters per update) it will run as fast or
faster, too.

Exactly what is done will depend on the details of the underlying
memory architecture, but assuming you have written the assembly
language subroutine for compareAndSwap properly, you get something
like this:

int preIncrement(volatile int * px, int delta) {
    /* i.e., ++(*px) */
    int old_value = * px;
    int new_value = old_value + delta;

    while (!compareAndSwap(px, old_value, new_value)) {
        old_value = * px;
        new_value = old_value + delta;
    return new_value;

int postIncrement(volatile int * px, int delta) {
    /* i.e., (*px)++ */
    int old_value = * px;
    int new_value = old_value + delta;

    while (!compareAndSwap(px, old_value, new_value)) {
        old_value = * px;
        new_value = old_value + delta;
    return old_value;

Here's a sample implementation of compareAndSwap (LOCKED) for use on
multiprocessor Pentium.

#define operands(in,out) out,in
#define pushl push
#define movl mov
#define popl pop
#define addl add
#define andl and
#define subl sub
#define testl test
#define xorl xor
#define disp(reg,offset) [reg+offset]

/* Parameters are passed in registers C, D, then memory.
      Ret must remove any in-memory parameters. */
_declspec(naked) void __fastcall compareAndSwap(int ccc, volatile int * ddd, int mmm) { __asm {
movl operands(disp(esp,4),eax)
lock cmpxchg operands(ecx,[edx])
jne ret_zero
mov operands(1,eax)
ret 4
mov operands(0,eax)
ret 4

The cost of the "LOCK" or other memory consistency barriers (on other
architectures) is quite high, and if you know that you are on a
uniprocessor, you can and should avoid it. You can either test a flag
(which will have low cost compared to overhead of the LOCK prefix) or
link against a different library, or (quite plausibly) generate or
rewrite the code in place.

You can read more about this if you go looking for papers by, or
referencing, "Maurice Herlihy", and look for the word "wait-free".

2) Plan B is collect your statistics on a per-thread basis, and avoid
locking. You will still need to worry about signal handlers, but for
that, you can use a simple compare-and-swap without fooling around
with memory consistency (a thread and its signal handler do not run at
the same time) to do your updates. If you need a global time, use
compare-and-swap to increment a counter, and use that, or periodically
reference that to a real time.

Removing shared data structures is a tremendous simplification, even
with the possibility of interrupts. In particular, you can reason
that if a signal handler interrupts, it does not return until it is
done (though it may itself be interrupted by other signal handlers).

David Chase

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.