Towards a 4 Teraflop microprocessor
11 Feb 2005 22:29:49 -0500

          From comp.compilers


Newsgroups: comp.compilers
Date: 11 Feb 2005 22:29:49 -0500
Keywords: architecture, design
Posted-Date: 11 Feb 2005 22:29:49 EST

I'm a microprocessor designer, computer architect, and manager with 40
years' experience. For the last 3 years I have been developing a
single-device vector uni-microprocessor initially targeting 4
TERA-FLOPS (DP). It is not aimed at the supercomputer end of the
market; it is aimed at the desktop. It's a vector co-processor that
would be inserted into a workstation or PC and is intended to crunch
numbers for applications that currently run for a fairly long time.

What I've done is look at vertical integration to see how it could
be leveraged for computation-intensive applications. Obviously one
could place a cluster of micros together, but clusters are difficult
to code for, and many problems remain in getting them to distribute
their workload evenly. "Massively parallel machines can
be dramatically faster and tend to possess much greater memory than
vector machines, but they tax the programmer, who must figure out how
to distribute the workload evenly among the many processors."
(National Center for Supercomputing Applications)

My thinking on making use of vertical integration was instead to
define a superscalar vector processor. The instructions have a
general format of <opcode>, <source reg3>, <source reg2>, <source
reg1>, <destination reg0>, where the opcode specifies two operations:
(s1 op1 s2) op2 s3 --> d0. The architecture splits into 2 operand
spaces: one for scalars and one for vectors. The scalars are processed
with their own register file and scalar instructions and are intended
to handle address generation and housekeeping such as loop counts.
Scalars are currently restricted to integers only. Vectors are (for
the initial model) 128 fields of 64-bit data; they are processed with
their own register file and vector instructions and are intended to do
the heavy number crunching. The vector unit is sliced and vertically
stacked, with 8Kb data bus(es) connecting the slices. The Level 0
vector operand cache is distributed over the slices, as is the vector
register file. Vector instructions are primarily SIMD on the fields,
but there are some VLIW MIMD instructions as well (patent pending on
method). The Level 1 cache is unified, sits on a different die, and
interfaces with the vertical data bus. My estimate for in-device cache
size is a minimum of 2 GB.
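
As a rough illustration of the two-operation instruction format
described above, here is a minimal Python sketch of the SIMD case. The
names execute_vector and OPS are hypothetical, and the choice of
add/sub/mul as the available operations is an assumption; the post
does not list the actual opcodes.

```python
import operator

VECTOR_FIELDS = 128  # fields per vector register, per the post

# Hypothetical operation table; the real opcode set is not given in the post.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def execute_vector(op1, op2, s1, s2, s3):
    """Return d0, where d0[i] = (s1[i] op1 s2[i]) op2 s3[i] for each field."""
    if not (len(s1) == len(s2) == len(s3) == VECTOR_FIELDS):
        raise ValueError("each source register must hold 128 fields")
    first, second = OPS[op1], OPS[op2]
    return [second(first(a, b), c) for a, b, c in zip(s1, s2, s3)]


# A multiply-add over all 128 fields: d0 = (s1 * s2) + s3
s1 = [float(i) for i in range(VECTOR_FIELDS)]
s2 = [2.0] * VECTOR_FIELDS
s3 = [1.0] * VECTOR_FIELDS
d0 = execute_vector("mul", "add", s1, s2, s3)
```

With op1 = mul and op2 = add, a single instruction of this shape
behaves like a multiply-add applied to every field at once.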

A matrix multiply requires 0.01 * N**3 instructions while up to 128
sorts can be done in parallel.
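
One possible reading of the 0.01 * N**3 figure (an assumption, not
something the post states) is that each vector instruction performs a
multiply-add on all 128 fields. An N x N matrix multiply needs N**3
multiply-adds, so the floor would be N**3 / 128, or about
0.0078 * N**3 instructions, with the remainder going to overhead. The
function names below are hypothetical.

```python
VECTOR_FIELDS = 128  # fields per vector register, per the post


def fused_instruction_floor(n, fields=VECTOR_FIELDS):
    """Lower bound on instructions for an n x n matrix multiply,
    assuming one instruction does a multiply-add on every field."""
    multiply_adds = n ** 3
    return multiply_adds / fields


def quoted_instruction_count(n):
    """Instruction count using the 0.01 * N**3 figure from the post."""
    return 0.01 * n ** 3


n = 1024
floor = fused_instruction_floor(n)
quoted = quoted_instruction_count(n)
overhead_ratio = quoted / floor  # how far the quoted figure sits above the floor
```

Under this assumption the quoted figure sits about 28% above the
idealized floor for any N, since 0.01 * 128 = 1.28.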

The ISA is mostly complete and I have notes that I need to sift
through describing a unique method of handling virtual memory
addressing, cache handling, etc. One patent application has been
published: Pub. No.: US 2002/0144091 A1; Pub. Date: Oct. 3, 2002.
This IP allows efficient procedure activations without
saving/restoring a lot of register data.

Currently I am mining this ISA for patents. I really need to find a
couple more people who feel like working on this. At the barest
minimum:

A compiler/debugger/applications person, with good experience in
technical computing, to assess the user-level ISA as a compiler target
and to provide input on applications. This may take 2+ persons or one
person who's been doing FORTRAN compilers and scientific libraries
for 20+ years on vector machines and knows all of the pitfalls from
firsthand knowledge.

An OS person to help spec privileged operations, virtual MMU, cache
management, and support for host processor to vector processor
interactions. This has to be the kind of OS person who is used to
dealing with not yet completely specified ISAs and new hardware, and
who has some reasonable hardware knowledge.

Between them they would have to design the host software stack which
means they would need to know enough about Linux/Windows or Unix
(depending upon host environment) to do that.

A hardware systems designer familiar with high speed interfaces able
to make sure that it is possible to build plausible boards with the
proposed product.

Those with an initial interest who wish to learn more and who believe
they meet the qualifications are invited to respond. Being financially
independent would also be a plus. The goal of the effort is twofold:
to generate IP content and to refine the architecture with the intent
of obtaining funding for a startup.

Interested and qualified parties will be asked to sign an NDA for
further detailed information. Bringing an architecture to market is
extremely difficult and costly. Serious inquiries only please.

Larry Widigen
