Re: Are these all really true ? (Preston Briggs)
Mon, 16 Oct 1995 22:53:38 GMT

          From comp.compilers

Related articles
[20 earlier articles]
Re: Are these all really true ? (Rodney Bates) (1995-10-03)
Re: Are these all really true ? (Jeremy Carroll) (1995-09-29)
Re: Are these all really true ? (Stefan Monnier) (1995-10-02)
Re: Are these all really true ? (Scott Nicol) (1995-10-02)
Re: Are these all really true ? (1995-10-02)
Re: Are these all really true ? (1995-10-03)
Re: Are these all really true ? (1995-10-16)
Re: Are these all really true ? (1995-10-04)
Re: Are these all really true ? (1995-10-11)
Re: Are these all really true ? (1995-10-12)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Preston Briggs)
Keywords: storage, performance
Organization: Tera Computer Company, Seattle, WA
References: 95-09-076 95-09-110 95-10-020
Date: Mon, 16 Oct 1995 22:53:38 GMT (Jay Finger) writes:

>I agree with the need for picking a good algorithm. But if processors
>and memory continue to advance at their current rates, then in the
>next year or two we're going to start seeing systems where cache
>misses can cost you on the order of 500 instructions. Just as (or
>maybe more) important than page faults are cache misses, and the
>"memory is free" attitude can lead you down the wrong path quickly.
>I think people are going to have to start dusting off their old 70's
>algorithms books that discussed sorting/searching on disk/tape in
>order to rekindle the thinking processes used to come up with
>algorithms that run well on tomorrow's systems.
>(Of course multi-threaded CPUs get you around this (are you out there,
>Preston? :-))

Yes, of course. And for those who haven't bothered about comp.arch in
a while... The basic idea is that we can use parallelism to cover
latency. In a small way, we see this with instruction-level
parallelism: processor issues a load but is able to continue
processing until the result of the load is needed. It's basically
doing some other instructions while the waiting on memory latency.
Same thing happens with a cache that allows several outstanding cache

A multithreaded CPU (like the one we're building at Tera) does the
same thing, but in a big way. In one of our processors, we keep many
threads active at the same time (up to 128). After an instruction is
issued from one active thread, we switch to another active thread,
issue one instruction, etc. The hope is that by the time we roll
around to the first thread, it's instruction will be complete.

For loads and stores, it works the same way. Thread 1 might issue a
load. Then the processor switches through many other threads issuing
many other instructions (perhaps loads, stores, or whatever).
Effectively, it's doing work while waiting for Thread 1's load to be
resolved. Eventually, the load will complete and Thread 1 will be
eligible to have another instruction issued.

A multithreaded processor is similar to the old barrel processors,
though it differs in that a barrel processor always rolled around the
entire barrel looking for active threads.

The idea is also similar to the way most uniprocessors do I/O. When
my workstation starts to read something from disk, it starts up the
disk controller, and then switches to another task so that the CPU
gets some work done instead of idling, waiting for the disk.

The advantages to the compiler (and compiler writer!) are pretty
large. We don't worry about cache, TLB, or virtual memory. That is,
no blocking, no concern for small-stride access, etc. In a parallel
machine (meaning many multithreaded processors), the memory is shared
and appears to have uniform access time, regardless of location. So
there's no need for data distribution or message passing.

The downside is that we must find parallelism.

But, in our case, the architecture is versatile to take advantage of
all sorts of parallelism. The easiest kind is the sort we see on
every machine: multiple unrelated jobs running at the same time.
Within a single job, there are several more levels:

task-level parallelism -- we provide language extensions
to aid the user in specifying this sort.

loop-level parallelism -- discovered by the compiler (includes
vector parallelism). Much simplified by not having to worry
about cache effects.

instruction-level parallelism - also discovered by compiler
(does software pipelining and block scheduling)

So rather than dusting off all those old algorithms books, we're
hoping to get some use out of the new ones, the ones that talk about
PRAM models and such. Such a model isn't very realistic on most
machines, but makes a lot of sense for a multithreaded machine.

Preston Briggs

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.