Re: Compiler support for multicore architectures

"Ira Baxter" <>
Fri, 28 Nov 2008 13:31:09 -0600

          From comp.compilers

Related articles
[2 earlier articles]
Re: Compiler support for multicore architectures (Ian Lance Taylor) (2008-11-18)
Re: Compiler support for multicore architectures (2008-11-19)
Re: Compiler support for multicore architectures (Walter Banks) (2008-11-19)
Re: Compiler support for multicore architectures (toby) (2008-11-20)
Re: Compiler support for multicore architectures (Jatin Bhateja) (2008-11-21)
Re: Compiler support for multicore architectures (kamal) (2008-11-23)
Re: Compiler support for multicore architectures (Ira Baxter) (2008-11-28)
Re: Compiler support for multicore architectures (Ray Dillinger) (2008-11-28)
Re: Compiler support for multicore architectures ( (2008-11-30)
| List of all articles for this month |

From: "Ira Baxter" <>
Newsgroups: comp.compilers
Date: Fri, 28 Nov 2008 13:31:09 -0600
Organization: Compilers Central
References: 08-11-086
Keywords: parallel
Posted-Date: 28 Nov 2008 14:42:37 EST

"gaurav" <> wrote in message
> There has been a lot of discussions going on about compiler support
> for multicore architectures.
> What different things can a compiler do to make mutithreaded programs
> run better on multicore? Isnt it much depends on the threading
> library , OS and the programmar instead on compiler ?

There are a number of issues related to multicore.
Pretty much they are all thinly disguised issues for parallelism.
I won't claim this list is exhaustive, but I find it useful
as a first cut when I try to look at some "multicore"
or parallel solution:

1) What kind of parallelism is supported?
        a) Data flow parallelism (e.g. "functional" languages)
        b) Data parallelism (arrays and array operations)
                [People don't talk much about what happens when
                  the data is big but irregularly connected, but
                  that's an interesting case]
        c) Task parallelism (parallel, partial order, concurrent, ...)
      d) Streaming/pipeline parallelism (IMHO, a special case of a and c)
      e) Speculative parallelism (transactions, ...)
2) How is parallelism identified?
          a) Does the programmer specify it?
          b) Does the compiler detect it? If so, how good is the detection?
          c) Some hybrid of a and b
3) How is the parallelism harnessed?
        a) fine grain vs. coarse grain threads
        b) array partitions / SIMD instructions,
        c) Data/task distribution across processors
        d) Homogenous vs. Heterogenous machines
4) What's the overhead?
        a) Cost of switching between "threads"?
        b) Cost of moving data between tasks
        c) Various complications with caching data

The question posed by the OP has many answers depending on which
of these issues the "compiler" intends to address. They're all
hard, and thus they all get attention. I don't think the community
has a clear consensus on what the right combination of answers
is, or if there is even a right combination, or if it is "something else".

Let me sketch a particular combination that we have chosen.
I'm writing this using the OP's directive of
"what can your compiler do to support parallelism",
becaue I think our set of choices is interesting. YMMV.

My company is in the business of building large scale program
analysis and transformation tools ("DMS", check the signature
for our website if you want to know more about these tools,
but I've discussed them here in comp.compilers in the past).
The "large scale" part means we face big *irregular* data structures.
Imagine having abstract syntax trees, symbol tables,
control and data flow graphs for thousands of compilation
units, and trying to do whole-program analysis. So,
we need lots of cycles, too, to produce answers in shorter
times. So we have a parallel computing problem.

At the time we started building our tools (1996), there were no
obviously adequate "parallelism toolkits" or "parallel languages" that
addressed the irregular parallelsim problem well. "Parellism
toolkits" are often sets of thread APIs and synchronization. Writing C
code to call threads for complicated parallelism is pretty much a way
to ensure your engineering costs go through the roof: you can't write
the code reliably, you can't debug it well, and call OS APIs to start
threads has high overhead which limits your opportunities for
parallism. Parallel Fortran does data parallel stuff, works poorly on
graphs. Nobody had heard yet of Java, and even if we had, Java's
threads seem pretty heavyweight and I suspect were designed to handle
low degrees of concurrency (can you say, screen, keyboard, mouse?)
There were lots of academic research "solutions" out there, but robust
isn't generally a good word for those and we wanted to build large
applications. (DMS applications that are interesting tend to have
500K+ lines of code, and we've built some that have 1.5M lines).

After lots of painful discussions, we decided to roll our own
language, PARLANSE, to address the issues.

Because we thought we had irregular data, we chose to support task
parallelism as the key target, with some support for speculating
(start and abort tasks and teams of tasks).

Since identifying parallelism is hard, and our early versions of
whatever we built wasn't likely to be good at this, we punted and
required programmers to identify "potential parallelism"; the compiler
is allowed to ignore such declarations if it thinks it can do better,
and it is allowed to add implicit paralellism where it can detect it
(to date, we don't take advantage of this part of our langauge
definition, so you can justifiably say "cute but not part of your

The language has LISP syntax, but isn't LISP, rather it more like C
(ints, floats, structs, tagged unions, statically sized arrays,
pointers) but no pointer arithmetic as a bone thrown to "try to make
the langauge analyzable so parallelism could be detectable". It
includes a variety of dynamic data types (strings, dynamic arrays,
teams of worker grains, ....) that are useful when dealing with
irregular data. It includes standard synchronizing primitives

We use the word "grain" to suggest fine-grain parallel computations.
One of PARLANSE's goals is to minimize the cost of task switching,
enabling smaller code chunks to be executed as grains, enabling more
potential parallelism. We expect to have lots (thousands!) of grains
in a program, even if the running machine only has a few hundred :-}
cores. Lots of grains means lots of available work, which helps keeps
all those cores busy. More on this in a bit.

The compiler generates heap-based activation records for functions.
Each function may contain a large number of a paralleism
invoking-primitives (see below); the compiler preallocates lots of
grain data structures, grain-specific stack areas, etc. in advance in
the activation record so these require usually zero runtime to set up,
sometimes just a machine instruction or two. This helps control the
cost of the parallelism.

Parallism is expressed explicitly. PARLANSE has special operators for
forking grains, call "(spawn". More interesting are "teams", which
are dynamically collected sets of grains. You can spawn a grain
directly into a team "(draft". Spawn grains and teams can be
speculatively aborted. One of PARLANSE's more interesting results is
zero-overhead exception handling that works sensibly across grain/team
boundaries, as opposed to the Java model, where what happens when an
exception reaches a runnable thread parent doesn't seem defined. I
think it is hard to make exception management work asynchronously
across parallelism boundaries if the compiler doesn't know where they
are; you sure can't make it be zero overhead.

The explicit parallelism specificaiton scheme using explicit language
operators doesn't suffer from "its hard to code and you get it wrong".
It is easy to code the parallelism here; takes one-two lines. This is
really important if you intend to write a big program and not die from
doing it. (That doesn't mean that parallel algorithms are easy to
code; just means you don't screw up by forgetting to set bit 7 in the
thread data structure as required by you local OS's API in the 50
lines of gunk you need to set it up, wait for responses, and tear it

However, spawn and draft suffers from the same overhead as "parallism
toolkits" except that we think we implemented "our" thread libraries
very efficiently, and the compiler knows about them. (To Windows,
running parlanse programs look like a set of threads equal to the
number of CPUs on the system, with no Windows thread calls once it
start; they're too expensive).

One way to avoid more overhead is to identify "static paralellism",
and have the compiler set up as much as it can in advance of
execution. For this, we a number of PARLANSE constructs to indicate
that parallelism directly:

      (|| a b c ...) means "do actions a, b, c in (potential) parallel"
    (;; a b c ...) means "do a, then b, then c"
    (;| first a second (>> first) b third c fourth (>> first third) d )
              establishes a partial order over actions a, b, c, d
              named first second third fourth respectively
              with time relations encoded using "happens later" >>
              operators, e.g., fourth happens after first and third.
            This example is one you cannot code with the parallel
            or sequential operators. Note the parallelism
            here is also "potential", but the sequencing order is not.
            A partial can always be linearized to preserve sequencing.
      (|! a b c ...) "means do a, b, c, concurrently".
              This isn't "potential parallelism", these tasks really
              must run in parallel, usually beacuse they communicate
              bidirectionally and so none can occur after all the others.
It shouldn't be a suprise that most constructs in the existing
code based are coded (;; but a lot of this is sheer laziness.
We code (|| judiciously (lots of divide and conquer algorithms are trivial
to implement in parallel this way) and partial orders occasionally.
Concurrency is cute be we hardly ever code it.

Weirdly, in spite of not coding partial orders by hand much, partial
orders are a lot more common that you might expect. In a DMS-derived
analysis tool, we have code generators that produce analyzers, using
attribute-grammar domain-specific-languages (mostly functional but
some side effects). Such DSLs *are* easy to analyze for partial-order
parallelism, and we do. The code generators thus produce vast tracts
of PARLANSE code with partial orders, that would be impossible to code
by hand. In fact, we often synthesize 500K SLOC of PARLANSE code
(representing name and type anlaysis for e.g., C++) with several
thousand partial order constructs, which are correct by construction.
You can't even think about debugging these.

While PARLANSE does has a grain scheduler (queue-per-processor with
work stealing), static parallelism constructs allow the compiler to
generate "thread switches" that don't need to go to a scheduler. With
(;| first a second b third (>> first second) c), the compiler can
generate a single word counter initialzed to "two" to protect the
front door of c. When either a or b finishes, an atomic decrement of
the counter tells the code if c is ready to run. Rather than the
current grain scheduling c, and then destroying itself, and going to
fetch work, it simply "becomes" grain c, and considerable scheduler
overhead is saved.

We duck all the issues regarding "distribution" by insisting that
PARLANSE is only for SMP workstations.

All of this has been a huge amount of work, and there are still
glitches we occasionally have fix. We have about 10 years of
experience with this language and it only just now starting to pay
off; this mostly isn't worth it when you have only two CPUs.

We see 6/8/12 workstation systems becoming common for PCs in 2009, so
we're happy that we likely have a sweet spot coming up. Even so,
getting all those processors to work together to achieve good speedup
is still somewhat problematic, we have lots of tuning still to do with
our DMS applications.

Without 10 years of experience, I'm not sure what one can say about
the effectiveness of a parallel langauge/compiler. What I dare
anybody to do, is to achieve good speedups on large irregular
computations on large applications without some kind of sophisticated
langauge and compiler support.

So the question isn't "what can your compiler do for you?". I think
it is, "How can you do anything without a good compiler?"

Ira Baxter, CTO

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.