Re: Questions about Bytecode

anton@mips.complang.tuwien.ac.at (Anton Ertl)
23 Apr 2007 07:48:51 -0400

From comp.compilers

From:	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups:	comp.compilers
Date:	23 Apr 2007 07:48:51 -0400
Organization:	Institut fuer Computersprachen, Technische Universitaet Wien
References:	07-04-061
Keywords:	interpreter
Posted-Date:	23 Apr 2007 07:48:51 EDT

"Bison" <Sean.D.Gillespie@gmail.com> writes:
>Hello everyone. I've been trying to read about VMs for quite some
>time now, and I am still a bit confused about compiling to bytecode.
>
>How would the typical structure of a bytecode-compiled file look?

Why put the stuff in a file at all? Why not use the source code as
on-disk representation and just compile to memory?

In any case, if you want to have bytecode files, their structure is
usually a serialized version of the in-memory representation. In one
case I just used XDR routines to convert between the in-memory
representation and the file format; a modern but bloated equivalent
would be to use XML as the external representation.

Of course, if you have additional requirements for the external
representation, you might choose to use something else. E.g., you
might want to read the work of Michael Franz and his students on slim
binaries and SafeTSA for some variations on external program
representations (not really representations of bytecode).

In any case, below I will answer your questions first from the
perspective of the in-memory representation, and only afterwards from
the on-disk perspective.

>More specifically, how would I represent literals, like String and
>numbers? The problem I have with this is, how could the VM
>differentiate between instructions and literal data?

Typically, in a bytecode format, the VM instructions will have
immediate (literal) arguments. A VM processor (interpreter or JIT
compiler) typically distinguishes them from instructions by processing
instructions from the start to the end; if it encounters an
instruction with n bytes of immediate arguments, it knows that it has
to skip n bytes to find the next instruction. An interpreter does not
always process the instructions front-to-back, but it only branches to
instructions, so there is no problem there, either.
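
As a minimal sketch of this front-to-back decoding, here is a small
disassembler in Python. The opcode numbers, names, and immediate sizes
are invented for the illustration and do not belong to any real VM.

```python
import struct

# Hypothetical opcodes: name and number of immediate bytes that follow.
OPCODES = {
    0x00: ("halt", 0),
    0x01: ("push8", 1),    # 1-byte signed immediate
    0x02: ("push32", 4),   # 4-byte little-endian signed immediate
    0x03: ("add", 0),
}

def disassemble(code):
    """Walk the code front to back, skipping each instruction's immediates."""
    result = []
    pc = 0
    while pc < len(code):
        name, n = OPCODES[code[pc]]
        if n == 0:
            operand = None
        else:
            raw = code[pc + 1 : pc + 1 + n]
            operand = int.from_bytes(raw, "little", signed=True)
        result.append((pc, name, operand))
        pc += 1 + n  # skip the opcode byte and its n immediate bytes
    return result

# push8 5; push32 1000; add; halt
code = bytes([0x01, 5, 0x02]) + struct.pack("<i", 1000) + bytes([0x03, 0x00])
listing = disassemble(code)
```

The immediate 1000 is encoded as the bytes E8 03 00 00, so it contains
bytes that look like the "add" and "halt" opcodes. They are never
decoded as instructions, because the decoder skips over them.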

Normally the immediate arguments of an instruction have a fixed size;
data that may have variable size (e.g., strings) is usually stored
out-of-line, and only referenced by an immediate pointer within the
code.
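
A sketch of the out-of-line approach, assuming a made-up instruction
set in which a 1-byte index into a string table plays the role of the
immediate pointer:

```python
# Hypothetical opcodes for this sketch only.
PUSH_STR, PRINT, HALT = 0x10, 0x11, 0x00

# Variable-size data lives outside the code.
strings = ["hello", "a considerably longer string"]

# The code holds only fixed-size references: PUSH_STR <index>.
code = bytes([PUSH_STR, 1, PRINT, PUSH_STR, 0, PRINT, HALT])

def run(code, strings):
    """Interpret the code; PRINT appends to a list instead of writing out."""
    stack, output, pc = [], [], 0
    while True:
        op = code[pc]
        if op == PUSH_STR:
            stack.append(strings[code[pc + 1]])  # follow the reference
            pc += 2
        elif op == PRINT:
            output.append(stack.pop())
            pc += 1
        elif op == HALT:
            return output
        else:
            raise ValueError(f"unknown opcode {op:#x}")

output = run(code, strings)
```

Every PUSH_STR instruction is two bytes long, however long the string
it refers to.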

For the external format, you have to collect all the out-of-line data,
and you have to replace the pointers with something that survives
relocation (e.g., an offset from the start of the file); the routines
of a serializing library (e.g., XDR) take a lot of that work off your
hands.
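
A hand-rolled sketch of such an external format, using Python's
standard struct module rather than XDR. The file layout (header,
offset table, code, string data) is an assumption made up for this
example, not a standard format.

```python
import struct

def serialize(code, strings):
    """Pack code and string data into one blob.

    Layout (an assumption for this sketch):
      header: code length (u32), string count (u32)
      table:  one u32 per string, its offset from the start of the blob
      code bytes
      string data, each as u32 length + UTF-8 bytes
    """
    encoded = [s.encode("utf-8") for s in strings]
    header_size = 8 + 4 * len(encoded)
    offset = header_size + len(code)
    offsets = []
    for data in encoded:
        offsets.append(offset)  # offset replaces the in-memory pointer
        offset += 4 + len(data)
    blob = struct.pack("<II", len(code), len(encoded))
    blob += struct.pack(f"<{len(offsets)}I", *offsets)
    blob += code
    for data in encoded:
        blob += struct.pack("<I", len(data)) + data
    return blob

def deserialize(blob):
    """Rebuild the code and the string table from a blob."""
    code_len, count = struct.unpack_from("<II", blob, 0)
    offsets = struct.unpack_from(f"<{count}I", blob, 8)
    code_start = 8 + 4 * count
    code = blob[code_start : code_start + code_len]
    strings = []
    for off in offsets:
        (length,) = struct.unpack_from("<I", blob, off)
        strings.append(blob[off + 4 : off + 4 + length].decode("utf-8"))
    return code, strings

blob = serialize(b"\x10\x01\x00", ["hi", "w\u00f6rld"])
restored = deserialize(blob)
```

Because the offsets are relative to the start of the blob, the result
can be loaded at any address.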

>I've heard
>someone say that I could use a delimiter to mark start and end points.

You can use delimiters, or length specifiers.
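
The two options can be sketched as follows; the NUL delimiter and the
2-byte little-endian length field are arbitrary choices for the
illustration.

```python
import struct

def read_delimited(buf, pos):
    """Read a NUL-terminated string; the payload must not contain a 0 byte.

    Returns the payload and the position just after the delimiter.
    """
    end = buf.index(b"\x00", pos)
    return buf[pos:end], end + 1

def read_length_prefixed(buf, pos):
    """Read a string preceded by a 2-byte little-endian length.

    Returns the payload and the position just after it.
    """
    (n,) = struct.unpack_from("<H", buf, pos)
    start = pos + 2
    return buf[start : start + n], start + n

first = read_delimited(b"abc\x00def\x00", 0)
second = read_delimited(b"abc\x00def\x00", first[1])
prefixed = read_length_prefixed(b"\x03\x00a\x00c", 0)
```

With a length specifier the data may contain any byte value, including
the one a delimiter scheme would reserve, and the reader can skip the
data without scanning it.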

>I'm also wondering if things would likely go in a specific order. For
>example, should code and constants be logically seperated?

Logically? Isn't that just a question of how you view it?

Should they be physically separated? For interpretation, you usually
want at least the smaller, fixed-size constants in-line, otherwise you
have to deal with two instruction pointers (one for the code and one
for the constants).
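
To make the difference concrete, here is a sketch of both variants for
a made-up three-instruction stack machine (the opcodes are
hypothetical):

```python
PUSH, ADD, HALT = 1, 2, 0

def run_inline(code):
    """Constants follow their opcode; one instruction pointer suffices."""
    stack, pc = [], 0
    while True:
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])  # constant sits right after the opcode
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

def run_separate(code, constants):
    """Constants live in their own stream, read through a second pointer."""
    stack, pc, cp = [], 0, 0
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(constants[cp])  # second pointer into the constants
            cp += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

inline_result = run_inline([PUSH, 2, PUSH, 3, ADD, HALT])
separate_result = run_separate([PUSH, PUSH, ADD, HALT], [2, 3])
```

This straight-line example hides most of the cost: once the code has
branches, every branch in the second variant would also have to set cp
to the value that matches the branch target.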

People working on code compression often separate the code from the
data, and various kinds of data from each other, in order to achieve
better compression.

>Also, I'm wondering if there are any decent readings on the subject
>other than source code.

On VM interpreters in general:

@Article{kogge82,
    author = "Peter M. Kogge",
    title = "An Architectural Trail to Threaded-Code Systems",
    journal = ieeecomputer,
    year = "1982",
    pages = "22--32",
    month = mar,
    annote = "Explains the design of (a classical
implementation of) Forth, starting with threaded
code, then adding the parameter stack, constants,
variables, control structures, dictionary, outer
interpreter and compiler."
}

I have also heard good things about Etienne Gagnon's Ph.D. thesis:

http://www.sable.mcgill.ca/publications/thesis/#gagnonPhDThesis

>Are there any VMs that would be good examples
>(and that is open source)?

SableVM would go with the thesis above.

I have also heard good things about JamVM.

Our esteemed moderator writes:
>If your bytecode
>is intended to be translated into machine code before running, then you'll
>probably need a constant pool since large in-line constants tend to be
>ugly in machine code.

But if you translate to machine code, you can do the separation or
merging (whatever is useful) during that translation; the bytecode
format is not related to that.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
