Re: Third party compiler middle and back-end

"BGB / cr88192" <cr88192@hotmail.com>
Wed, 13 Oct 2010 13:46:50 -0700

          From comp.compilers

Related articles
[3 earlier articles]
Re: Third party compiler middle and back-end redbrain@gcc.gnu.org (Philip Herron) (2010-10-10)
Re: Third party compiler middle and back-end cr88192@hotmail.com (BGB / cr88192) (2010-10-10)
Re: Third party compiler middle and back-end jm@bourguet.org (Jean-Marc Bourguet) (2010-10-11)
Re: Third party compiler middle and back-end j.o.williams.jow@gmail.com (James O. Williams) (2010-10-11)
Re: Third party compiler middle and back-end gneuner2@comcast.net (George Neuner) (2010-10-12)
Re: Third party compiler middle and back-end bobduff@shell01.TheWorld.com (Robert A Duff) (2010-10-13)
Re: Third party compiler middle and back-end cr88192@hotmail.com (BGB / cr88192) (2010-10-13)
Re: Third party compiler middle and back-end cr88192@hotmail.com (BGB / cr88192) (2010-10-13)
Re: Third party compiler middle and back-end FredJScipione@alum.RPI.edu (Fred J. Scipione) (2010-10-13)
Re: Third party compiler middle and back-end danielzazula@gmail.com (Daniel Zazula) (2010-10-17)
Re: Third party compiler middle and back-end gneuner2@comcast.net (George Neuner) (2010-10-17)
Re: Third party compiler middle and back-end gneuner2@comcast.net (George Neuner) (2010-10-18)
Re: Third party compiler middle and back-end cr88192@hotmail.com (BGB / cr88192) (2010-10-18)
[5 later articles]
| List of all articles for this month |

From: "BGB / cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Wed, 13 Oct 2010 13:46:50 -0700
Organization: albasani.net
References: 10-10-010 10-10-013 10-10-019
Keywords: code, GCC
Posted-Date: 16 Oct 2010 09:33:51 EDT

"George Neuner" <gneuner2@comcast.net> wrote in message
> On Sun, 10 Oct 2010 14:22:12 +0100, Philip Herron
> <redbrain@gcc.gnu.org> wrote:


<snip>


> Adding a new front end - or even a code generator - to GCC does not
> give one a true picture of the density of the code base. The guts of
> GCC have been pejoratively described as "write only" for very good
> reason.


agreed...


GCC is a huge pile of nasty, IMO.
Getting it to build and work is a bit of a pain, and the codebase is large
and difficult to understand, ...


I don't know about just writing frontends; admittedly, my main use for
compiler technology is not to sit around trying to be a language designer.




> Moreover, writing a front end is a relatively simple task given the
> tools available now and the front end is such a small part of a modern
> compiler that it is not worth investing a lot of intellectual effort
> (unless your goal is to write a tool like bison or antlr, etc.).


agreed.


Even in my compilers/VMs, the frontend logic is a fairly small piece of the
whole.
Parsing and generating an IL or IR are not much of a problem (in the larger
sense); it is mostly in the backend machinery that the pain sets in, IME.




> Understanding enough to modify the IR analyses or transformations can
> be extremely difficult. For the most part, GCC's IR code is a tangle
> of spaghetti which interacts with (literally) hundreds of option switches
> and was deliberately designed to minimize the number of passes over IR
> data structures. Often you will find that code for logically separate
> analyses is interleaved, and the code is spread over many different
> modules.
>
> The organization of GCC v4 is vastly improved relative to previous
> versions, but in no way can it be considered "easy" to understand.


Admittedly, my personal experience with GCC's internals is limited, so I
can't comment much here.






>>I would highly recommend gcc over llvm for language development any
>>day. GCC you have much more freedom in how you want to build your
>>front-end llvm is very tied up I've found.
>
> LLVM is much superior to GCC if you want to understand what's going on
> in the guts of your compiler. LLVM is not as mature as GCC and does
> not yet do some of the more involved technical analyses, but its IR
> and code generator passes are cleanly delineated and are pretty easy
> to figure out.


I have mixed feelings.




For GCC, I much prefer the overall process, since it is more solidly divided
into layers and representational stages (never mind whether one wants to
understand how the code itself works; at least one knows that each stage
takes a certain input and produces a certain output: it takes GIMPLE and
produces GAS, never mind what happens internally...).


LLVM seems a bit more winding and a bit more object-centric from what I have
seen (so it is more like dealing with a lot of object-plugging, rather than
with layered data filters/transforms, and one has to look at a bunch of
different classes to understand how things fit together).
It is also C++, and admittedly I am a little less comfortable with C++ and
OOP, though I guess that may not be much of an issue for people more
comfortable with them.


Sadly, though, I am far from being an expert on either codebase, so I may be
wrong.




Side note:
I had looked at LLVM a few years back, but at the time it didn't do what I
wanted, or the way I wanted it done (basically, it converted directly from
IR to machine code and put the machine code into executable buffers).
There was no dynamic relinking; instead, chunks of IR were recompiled and
the prior versions of functions were patched with jumps to the new versions.
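
Roughly, that patching scheme looks like this (a minimal C sketch of my own,
not LLVM's actual code; it assumes the old function's entry is writable and
the new version is within +/-2GB):

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: redirect callers of 'oldfn' by overwriting its entry
       point with "jmp rel32" (opcode E9) to 'newfn'. In practice the
       code page must first be made writable (mprotect/VirtualProtect). */
    static void patch_with_jump(void *oldfn, void *newfn)
    {
        uint8_t *p = (uint8_t *)oldfn;
        int32_t rel = (int32_t)((uint8_t *)newfn - (p + 5));
        p[0] = 0xE9;              /* jmp rel32 */
        memcpy(p + 1, &rel, 4);   /* displacement is from the jmp's end */
    }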




I wanted a real assembler and linker:
produce ASM -> assemble ASM into objects -> link objects into running image.


I approached the dynamic relinking issue a little differently: 'proxies'
were used (basically, small stubs dedicated to containing an indirect jump
to the real function), which I viewed as a nicer option than patching the
old function (though a cost is that it is not always clear when to do this,
apart from an explicit request). The reason not to patch the old function is
that patching would effectively prevent the memory used by the function from
being released later.
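
On x86-64, such a proxy can be as small as 14 bytes: a "jmp qword [rip+0]"
followed by the 8-byte address it jumps through. A minimal C sketch
(illustration only; 'buf' is assumed to point into an executable buffer):

    #include <stdint.h>
    #include <string.h>

    /* Emit the stub:  FF 25 00 00 00 00   jmp qword [rip+0]
                       <8-byte absolute address of the real function>
       Relinking then only rewrites the 8-byte slot; all callers keep
       jumping through the same proxy. Assumes 8-byte pointers. */
    static void emit_proxy(uint8_t *buf, void *target)
    {
        static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0, 0, 0, 0 };
        memcpy(buf, jmp_rip0, 6);
        memcpy(buf + 6, &target, 8);
    }

    static void relink_proxy(uint8_t *buf, void *new_target)
    {
        memcpy(buf + 6, &new_target, 8);  /* atomicity ignored here */
    }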


In non-proxy cases, full physical relinking is also available, but it is
slightly less safe (there is a risk of stale pointers floating around, and
added risk if code/data is in use at the same time it is being relinked,
which is very possible if multi-threading is involved).


Proxies are generally also used for any non-local jumps on x86-64 (IOW, when
the linker knows that the target is, or may be, outside the +/-2GB window),
and in several other cases. I don't use them in all cases, since they add
the overhead of an additional indirect jump, as well as some memory
overhead, ...
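
The reason is that a direct call/jmp on x86-64 only carries a signed 32-bit
displacement; something like the following check (my sketch) decides whether
the direct form can be used at all:

    #include <stdint.h>

    /* Sketch: can a 5-byte "call rel32" at 'site' reach 'target'?
       The rel32 is measured from the end of the instruction. */
    static int reachable_rel32(uint8_t *site, uint8_t *target)
    {
        int64_t disp = (int64_t)(target - (site + 5));
        return disp >= INT32_MIN && disp <= INT32_MAX;
    }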


As a cost, though, this doesn't really work with ".data" or ".bss", meaning
that resizing top-level arrays/variables/... is not currently done by
default (the linker is technically capable of it, but it is a very unsafe
operation and could easily break a running program, so by default the linker
will not relink these variables even if their size or other properties
change).




Note: for executable code on the main (GC'ed / Garbage Collected) heap, I
generally ended up using an alternate linking strategy. Namely, I link code
similarly to the ELF/GOT system (although typically without generating PIC,
so code is simply "linked" into its heap location and is from then on
treated as a normal piece of GC'ed memory).
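
Conceptually the table is just an array of slots that the linker patches; a
minimal C sketch of the idea (the names here are illustrative, not the
actual implementation):

    #include <stddef.h>
    #include <string.h>

    /* GOT-like import table: generated code calls/loads through a slot
       (e.g. "call [got[i].addr]") rather than using a direct address. */
    typedef struct { const char *name; void *addr; } GotEntry;

    static GotEntry got[] = {
        { "FooFunc", NULL },
        { "BarVar",  NULL },
    };

    /* Linker-side fixup: after a symbol is recompiled or moved on the
       GC'ed heap, only its slot needs to be updated. */
    static void got_rebind(const char *name, void *addr)
    {
        size_t i;
        for (i = 0; i < sizeof(got) / sizeof(got[0]); i++)
            if (strcmp(got[i].name, name) == 0) { got[i].addr = addr; return; }
    }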


Linking normal objects/ASM will work, but these objects will not be kept up
to date by the linker (hence, if another function or variable gets patched
or physically relocated in memory, any references to it will be stale).
Sadly, for technical reasons, the main executable heap (where normal code is
linked) is not subject to garbage collection (the linker would need to be
able to unlink modules in order to release their memory).


This means that making code GC-safe requires coding it differently:
call FooFunc            ; generates a direct call (unsafe)
call dword [G.FooFunc]  ; generates an (explicit) indirect call via a proxy or GOT (GC-safe)


This also works for variables; however, the feature is not likely to work
correctly with external static linkers (say, MS LINK, ...). Also, the 'G.'
prefix is specific to my assembler (as are the '$' and '$.' prefixes and
similar, which are used for special purposes, usually related to PIC).
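
In C terms, the difference between the two call forms is roughly this (an
analogy only; 'G_FooFunc' stands in for the assembler's 'G.FooFunc' slot):

    #include <stdio.h>

    static void FooFunc(void) { puts("FooFunc"); }

    /* stands in for the G.FooFunc proxy/GOT slot */
    static void (*G_FooFunc)(void) = FooFunc;

    static void caller(void)
    {
        FooFunc();     /* direct call: target baked into the code (unsafe) */
        G_FooFunc();   /* indirect call through the slot: survives
                          relinking, since only the slot is updated */
    }

    int main(void) { caller(); return 0; }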




Since then, LLVM has AFAIK added both an assembler and object emitters, but
I have not gone back and really looked into it. I also don't know how it
presently deals with dynamic linking.




> To the OP:
> I think generating low level C source is fine for an academic
> compiler. I would stay away from source level C# and Java simply
> because they are more complex to generate correct code for, but
> targeting byte code for either .NET or JVM would make for a good
> project.




agreed...




I guess an issue is that the OP does not specify the type or intended use
cases of the compiler they want.


For example, is it a static compiler?
Or does it need the ability to do operations like "eval" or "apply"?
What sort of memory footprint or behaviors are acceptable?
Is it a higher-level abstract machine, or does it need a lot of access to
the HW and/or lower-level C/OS/... interfacing?
...


A lot of what the best strategy is depends on what one needs to do, what
external constraints are involved, ...


For example, a person writing a static compiler for PCs will have different
requirements than someone doing a dynamic scripting engine intended to run
on an embedded device (where the requirements also depend on the specifics
of the device, what the system provides, ...), ...

