|Is This a Dumb Idea? paralellizing byte codes firstname.lastname@example.org (Jon Forrest) (2022-10-22)|
|Parallelizing byte codes email@example.com (Christopher F Clark) (2022-10-23)|
|From:||Christopher F Clark <firstname.lastname@example.org>|
|Date:||Sun, 23 Oct 2022 10:17:23 +0300|
|Injection-Info:||gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="84287"; mail-complaints-to="email@example.com"|
|Posted-Date:||23 Oct 2022 12:30:38 EDT|
This is not necessarily a dumb idea. However, the details are important.
Relevant important factors are the cost communication and
synchronization. Fine grained parallelism often flounders on those
ILP works in things like modern x86 processors, because the x86 is
effectively a very odd byte code developed under severe constraints.
The code an out-of-order x86 processor runs doesn't really look like
the x86 and tricks like register renaming internal micro-ops make that
However, at the more generic byte code level, you have different
things to worry about (unless you are doing an actual hardware
design). The impact of the software framework that is used to
implement the byte code interpreter can take a large toll. That's one
of the reasons hotspot compilers are effective. They can remove that
framework. They can do that fine grained parallelism which mimics
what goes on behind the scenes in an out-of-order processor.
We were able to exploit that in Hyperscan (a software regular
expression accelerator) that Intel acquired, optimized, and then open
sourced. That was essentially my last project at Intel. We took an
NFA, turned it into upto 8 parallel DFAs, but then wrote a specialized
interpreter that interleaved the 8 DFAs, with the help of VTune.
Doing that and interleaving the code carefully to reduce instruction
stalls (which were mostly due to memory reads), we managed to run the
8 DFAs at the same performance as the original DFA code ran 1, simply
by effectively filling up "delay slots" that were implicit in the
code. Note the NFA and DFAs were essentially byte code, it was just
the interpreter we tweaked.
A different example was exploited at NuoDB, where we were optimizing
SQL queries. We did this by noting that most Queries deal with
multiple rows, and changed the byte code emitted to pass through the
rows. The interpreter then would do the rows as batches, meaning that
1 byte code operation could operate on 1000 rows concurrently,
reducing the overhead by the batch size. This was a speed up even
without using the vector instructions on the machine.
But notice in both cases, we did analysis to figure out what the
relevant parallelism was. It was not ad hoc. And, it was cases where
we had large enough and slow enough sections that were key
bottlenecks. Very rarely can one take serial code and find that
parallelism without having a model that exposes it.
Chris Clark email: firstname.lastname@example.org
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
Return to the
Search the comp.compilers archives again.