Re: Is This a Dumb Idea? parallelizing byte codes


From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Date: Sun, 23 Oct 2022 12:33:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 22-10-046
Keywords: optimize, interpreter

Jon Forrest <nobozo@gmail.com> writes:
>Modern CPUs employ all kinds of clever techniques to improve
>instruction level parallelism (ILP). I was wondering if it
>makes sense to try to employ similar techniques in the
>virtual machines used to execute byte code produced by language
>compilers.


I'll first assume that you mean virtual-machine interpreters.


First of all, if you use a modern CPU with out-of-order (OoO)
execution, the hardware already does much of that by itself.


If you use an in-order execution CPU, you can try to software-pipeline
the execution of the virtual-machine instructions, in particular by
performing virtual-machine instruction fetching early [hoogerbrugge+99].


It is helpful to design the virtual machine for that: have a mostly
fixed encoding, so that, e.g., the dispatch of the next instruction
can be prepared in advance, rather than having to wait until the
present instruction is decoded before the fetch of the next one can
start.
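

To make that concrete, here is a minimal sketch of such early
fetching (the code is mine, for illustration; it is not from
Hoogerbrugge et al.). It assumes a fixed one-cell-per-opcode encoding
and uses the labels-as-values extension of GNU C (gcc/clang) for
dispatch:

#include <stdio.h>

enum { OP_LIT, OP_ADD, OP_PRINT, OP_HALT };

void run(const int *ip)
{
    static void *dispatch[] = { &&lit, &&add, &&print, &&halt };
    int stack[16], *sp = stack;
    void *next = dispatch[*ip++];    /* fetch the first instruction */
    goto *next;

lit:                                 /* push the literal operand ip[0] */
    next = dispatch[ip[1]];          /* early fetch of the NEXT opcode */
    *sp++ = ip[0];
    ip += 2;                         /* past the operand and the fetched opcode */
    goto *next;
add:
    next = dispatch[ip[0]];          /* early fetch again */
    ip += 1;
    sp -= 1;
    sp[-1] += sp[0];
    goto *next;
print:
    next = dispatch[ip[0]];
    ip += 1;
    printf("%d\n", *--sp);
    goto *next;
halt:
    return;
}

int main(void)
{
    const int prog[] = { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                       /* prints 5 */
    return 0;
}

Because the fixed encoding tells each handler where the next opcode
sits relative to ip, the load of the next dispatch target can be
issued before the current instruction's work, which is exactly what an
in-order CPU needs in order to overlap the two.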


>By that I mean what if virtual machines were to examine byte code
>streams to detect when it would be safe to execute multiple
>byte codes concurrently? Then, based on its findings, the virtual
>machine would execute as many byte codes concurrently as is safe.


There has been quite a bit of work on combining sequences of
virtual-machine instructions into superinstructions
[hughes82,proebsting95,naessen+01,ertl&gregg03euroforth,eller05], but
the typical approach was to combine a sequence of dependent
instructions, which has several advantages:


* One VM instruction often communicates the data to subsequent ones
    through memory (i.e., D-cache), which often incurs a latency of
    several cycles. The superinstruction can communicate the data
    through a register.


* If the data is addressed explicitly by the VM instruction (e.g., in
    a register-oriented VM), that requires code for dealing with this
    addressing, which can be eliminated for the data passed implicitly
    in a superinstruction. E.g., "mul r3=r1*r2; add r5=r3+r4" requires
    dealing with six VM "register" accesses, while the combined
    "madd r5=r1*r2+r4" requires only four (see the sketch after this
    list).
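

Here is a minimal sketch of that madd example in a register-oriented
VM (again mine, not from the cited papers). The fused handler keeps
the product in a C local variable, i.e., in a machine register,
instead of storing it to the VM register file, and decodes four
register fields instead of six:

#include <stdio.h>

enum { OP_MUL, OP_ADD, OP_MADD, OP_HALT };

/* fixed-format VM instruction; src3 is used only by OP_MADD */
struct insn { int op, dst, src1, src2, src3; };

void run(const struct insn *ip, int *r)
{
    for (;; ip++) {
        switch (ip->op) {
        case OP_MUL:                 /* r[dst] = r[src1] * r[src2] */
            r[ip->dst] = r[ip->src1] * r[ip->src2];
            break;
        case OP_ADD:                 /* r[dst] = r[src1] + r[src2] */
            r[ip->dst] = r[ip->src1] + r[ip->src2];
            break;
        case OP_MADD: {
            /* superinstruction: the product stays in 'prod' (a machine
               register) and never passes through the VM register file;
               only four register fields are decoded instead of six */
            int prod = r[ip->src1] * r[ip->src2];
            r[ip->dst] = prod + r[ip->src3];
            break;
        }
        case OP_HALT:
            return;
        }
    }
}

int main(void)
{
    int r[8] = { 0, 6, 7, 0, 10, 0, 0, 0 };  /* r1=6, r2=7, r4=10 */

    const struct insn seq[] = {              /* mul r3=r1*r2; add r5=r3+r4 */
        { OP_MUL, 3, 1, 2, 0 },
        { OP_ADD, 5, 3, 4, 0 },
        { OP_HALT, 0, 0, 0, 0 },
    };
    const struct insn fused[] = {            /* madd r5=r1*r2+r4 */
        { OP_MADD, 5, 1, 2, 4 },
        { OP_HALT, 0, 0, 0, 0 },
    };

    run(seq, r);
    printf("sequence: r5 = %d\n", r[5]);     /* 52 */
    r[5] = 0;
    run(fused, r);
    printf("fused:    r5 = %d\n", r[5]);     /* 52 */
    return 0;
}

In the two-instruction sequence the product makes a round trip through
r[3] in memory; in the superinstruction it stays in a register, and
the accesses to the dropped r3 fields disappear entirely.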


If you are actually thinking about JIT compilation of the virtual
machine, you can use the usual techniques for extracting
instruction-level parallelism: instruction scheduling (within basic
blocks, or within traces or superblocks), and software pipelining.
Some of these techniques incur a significant overhead, however, so you
may not want to employ them in a JIT compiler.
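

As a toy illustration of what such scheduling buys (my example, not
from any of the papers above): software pipelining a dot-product loop
means issuing the loads for iteration i+1 while the multiply-add for
iteration i is still in flight; a JIT would do the same reordering on
the machine instructions it emits, here shown at the source level:

#include <stdio.h>

long dot_naive(const long *a, const long *b, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];            /* loads feed the multiply directly */
    return s;
}

long dot_pipelined(const long *a, const long *b, int n)
{
    if (n == 0) return 0;
    long s = 0;
    long x = a[0], y = b[0];         /* prologue: first loads issued early */
    for (int i = 0; i < n - 1; i++) {
        long nx = a[i + 1], ny = b[i + 1]; /* next loads overlap ... */
        s += x * y;                        /* ... this iteration's work */
        x = nx; y = ny;
    }
    return s + x * y;                /* epilogue: last multiply-add */
}

int main(void)
{
    long a[] = { 1, 2, 3, 4 }, b[] = { 5, 6, 7, 8 };
    printf("%ld %ld\n", dot_naive(a, b, 4), dot_pipelined(a, b, 4));
    return 0;
}

Note the prologue and epilogue that software pipelining introduces;
generating those, and doing the dependence analysis that justifies the
reordering, is part of the compile-time cost that may be too much for
a JIT compiler.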


>I'm also worried that internal virtual machine locking requirements
>might make this idea infeasible. For example, in a virtual machine with
>a global interpreter lock, would it be possible for there to be any
>concurrent execution?


The memory-ordering requirements may restrict the reorderings of
memory accesses, but otherwise I don't see a problem.


>This idea, if it works, would be a great way to take advantage of
>multiple cores without having to rewrite any user code. The big
>question is whether it would work.


The stuff I have outlined does not introduce additional execution
threads, and I don't think you can find thread-level parallelism at
the virtual-machine level unless you have your virtual machine (and
the programming language that it models) designed for exactly that.


- anton




@string{spe="Software---Practice and Experience"}
@Article{hoogerbrugge+99,
    author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
                                  and Rik van de Wiel",
    title = "A code compression system based on pipelined
                                  interpreters",
    journal = spe,
    volume = "29",
    number = "11",
    pages = "1005--1023",
    month = sep,
    year = "1999",
    OPTannote= ""
}


@InProceedings{hughes82,
    author = "R. J. M. Hughes",
    title = "Super-Combinators",
    booktitle = "Conference Record of the 1980 LISP Conference,
                                  Stanford, CA",
    pages = "1--11",
    publisher = "ACM",
    address = "New York",
    year = "1982",
    OPTannote = {}
}


@InProceedings{proebsting95,
    author = "Todd A. Proebsting",
    title = "Optimizing an {ANSI~C} Interpreter with Superoperators",
    crossref = "popl95",
    pages = "322--332",
    annote = "Interpreter performance is optimized by combining
operators during code generation, when they are
still organized as trees. So a different, optimized
interpreter
is used for each program. Speedups of 1.8--3.1 are
achieved, but this is probably strongly dependent on
the original set of operators. The paper uses lccs
intermediate code operators \cite{fraser&hanson91a}."
}


@Proceedings{popl95,
    booktitle = "Principles of Programming Languages (POPL '95)",
    title = "Principles of Programming Languages (POPL '95)",
    year = "1995",
    key = "POPL '95"
}


@InProceedings{naessen+01,
    author = {Henrik N\"ass\'en and Mats Carlsson and Konstantinos
                                    Sagonas},
    title = {Instruction Merging and Specialization in the
                                    {SICStus Prolog} Virtual Machine},
    booktitle = {Principles and Practice of Declarative Programming
                                    (PPDP01)},
    OPTpages = {},
    year = {2001},
    url = {http://www.csd.uu.se/%7Ekostis/Papers/sicstus.ps.gz},
    annote = {Gives an overview of various WAM optimization
                                    techniques and then evaluates combining (merging)
                                    pairs of instructions into (about 60)
                                    superinstructions, specializing WAM instructions for
                                    specific immediate arguments (in particular,
                                    specific registers, for about 200 new instructions),
                                    and a combination of both (for about 100 new
                                    instructions). Instruction merging produces small
                                    speedups (about 8\% on average), specialization
                                    produces a small slowdown on average, and both
                                    combined are about as fast as instruction merging
                                    alone. VM code size is reduced by around 10\% with
                                    these techniques, and the VM emulator size grows by
                                    up to 15KB.}
}


@InProceedings{ertl&gregg03euroforth,
    author = {M. Anton Ertl and David Gregg},
    title = {Implementation Issues for Superinstructions in
                                    {Gforth}},
    booktitle = {EuroForth 2003 Conference Proceedings},
    OPTpages = {},
    year = {2003},
    URL = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03euroforth.ps.gz},
    abstract = {Combining Forth primitives into superinstructions
                                    provides nice speedups. Several approaches to
                                    superinstructions were explored in the Gforth
                                    project. This paper discusses the effects of these
                                    approaches on performance, compilation time,
                                    implementation effort, and on programming tools such
                                    as the decompiler and backtracing.}
}


@MastersThesis{eller05,
    author = {Helmut Eller},
    title = {Optimizing Interpreters with Superinstructions},
    school = {TU Wien},
    year = {2005},
    type = {Diplomarbeit},
    url = {http://www.complang.tuwien.ac.at/Diplomarbeiten/eller05.ps.gz},
    abstract = {Superinstructions can be used to make virtual
                                    machine (VM) interpreters faster. A superinstruction
                                    is a combination of simpler VM instructions which
                                    can be executed faster than the corresponding
                                    sequence of simpler VM instructions, because the
                                    interpretative overhead, like instruction dispatch
                                    and argument fetching, is reduced. This work
                                    discusses the following three topics related to
                                    superinstructions. First, I present some heuristics
                                    to choose superinstructions. I evaluated the
                                    heuristics for Forth and Java programs. If the
                                    number of allowed superinstructions was very large,
                                    $> 1000$, then the heuristic which chooses all
                                    possible subsequences up to length 4 achieved the
                                    best results. If the number of allowed
                                    superinstructions was more limited, then a heuristic
                                    which favors short sequences and sequences which
                                    occur in many different programs and many different
                                    basic blocks performed better than the
                                    others. Second, I compare a simple greedy algorithm
                                    and an optimal algorithm to cover a program with
                                    superinstructions. I found that the greedy algorithm
                                    achieves almost optimal results. Finally, I compare
                                    superinstructions with non-sequential patterns. In
                                    my experiments, superinstructions performed slightly
                                    better than non-sequential patterns.}
}




--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

