Re: The Universal High Level Assembler/Code Generator

"Randall Hyde" <rhyde@cs.ucr.edu>
10 Apr 2002 00:20:02 -0400

          From comp.compilers

Related articles
High Level Assemblers vs. High Level Language Compilers whopkins@csd.uwm.edu (2002-03-19)
Re: High Level Assemblers vs. High Level Language Compilers rhyde@cs.ucr.edu (Randall Hyde) (2002-03-21)
The Universal High Level Assembler/Code Generator whopkins@csd.uwm.edu (2002-04-06)
Re: The Universal High Level Assembler/Code Generator rhyde@cs.ucr.edu (Randall Hyde) (2002-04-10)

From: "Randall Hyde" <rhyde@cs.ucr.edu>
Newsgroups: comp.compilers
Date: 10 Apr 2002 00:20:02 -0400
Organization: Prodigy Internet http://www.prodigy.com
References: 02-03-120 02-03-127 02-04-021
Keywords: assembler
Posted-Date: 10 Apr 2002 00:20:02 EDT

>> This problem doesn't get resolved via multi-phase assembly? Maybe I
>> don't understand the exact nature of the problem you are describing.


The short answer to the question, which you elaborate on below, is yes.


>In the following, I'll describe the most general situation and then
>outline an actual definition for a universal code generator.
>
>In the most general situation, the tool in question will generate a
>system of equations
[snipped]
>The general problem of finding a solution (merely FINDING a solution!,
>not finding an 'optimal' solution) is equivalent to that of solving
>systems of equations over a boolean algebra since the expressions
>above are in a finitary arithmetic which, itself, can always be
>embodied within a boolean algebra.


Yes, general solutions are often far more complex than specific
solutions.


>Equation solving over boolean algebra is the standard archetype of
>NP-complete problems.


And even specific solutions cannot always avoid the intractability
problem in the worst case. Fortunately, few real-world programs
actually cause the translator to exhibit pathological behavior
(indeed, for the example of optimizing displacements in machine
code, I've only seen synthetic programs exhibit bad behavior).
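
To make the cyclic case concrete, here's a tiny sketch (my own
construction, not from your post) of two jumps whose size equations
refer to each other. The 2-byte/5-byte sizes and the signed-byte
displacement range are modeled loosely on the x86 short/near jmp
forms; the padding value is contrived so that growing one jump
pushes the other out of range:

    # Layout (L1 is at address 0, the start of J1):
    #   J1: jmp L2        ; forward, over the padding and J2
    #       <125 pad bytes>
    #   J2: jmp L1        ; backward, over itself, the padding, and J1
    #   L2:
    PAD = 125

    def sizes(s1, s2):
        """Given assumed sizes for J1 and J2, return the size each
        displacement actually requires (2 if it fits in a signed byte,
        else 5).  Displacements are measured from the end of the jump."""
        disp_j1 = PAD + s2              # end of J1 forward to L2
        disp_j2 = -(s1 + PAD + s2)      # end of J2 back to L1
        need = lambda d: 2 if -128 <= d <= 127 else 5
        return need(disp_j1), need(disp_j2)

    s1, s2 = 2, 2                       # start optimistic: both short
    while True:
        new = sizes(s1, s2)
        if new == (s1, s2):             # fixed point: equations satisfied
            break
        s1, s2 = new
    print(s1, s2)                       # -> 5 5, after a two-step cascade

Since sizes only ever grow, the iteration must terminate; the
interesting question is how many steps that takes.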




>Different technologies will handle this situation in different ways
>ranging from:
>
>(A) Forbidding any cyclic references in the system of
> equations.
>
>This is too restrictive for most assembly languages, where instruction
>mapping (e.g. for multi-size instructions like "jmp") necessarily
>involves cyclic references in the corresponding equation list.
>
>(B) Restricting cyclic references to such a form that
> the corresponding equations can be written in the
> form X1 = Z1; X2 = Z2; ...; Xn = Zn; where the
> expressions Z1, Z2, ..., Zn are expressions over a
> more restricted finitary algebra.


Well, (B) always happens anyway, since implementation on a computer
system always implies certain limits. I haven't thought through
whether this implies certain limits on 'n' as well, but limiting 'n'
is another way to handle this.




>The usual restriction is to permit only sums of the
>form X + X + ... + X + Number for the Z's. The equations
>are then simple linear equations which can be readily solved
>(or readily proven to have no solution).


Multiphase assemblers bypass this restriction by using an iterative
solution to refine the equation set. The drawback is that the number
of iterations is not necessarily bounded. Of course, if you place an
artificial restriction on the number of iterations (to handle
intractable situations), then you've got exactly this restriction. In
practice, of course, a limit of nine or ten iterations usually works
just fine.
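
Scaled up to a whole program, that iterative scheme looks something
like the following sketch (again my own construction, with
hypothetical 2-byte/5-byte jump sizes): start with every jump short,
assign addresses, grow any jump whose displacement no longer fits,
and repeat until nothing changes. The test program is contrived so
that each pass grows one more jump:

    SHORT, LONG = 2, 5   # hypothetical sizes, x86-like EB rel8 / E9 rel32

    def assemble(items):
        """items: list of ('bytes', n), ('label', name), ('jmp', name).
        Returns the number of passes needed to reach a fixed point."""
        size = {i: SHORT for i, it in enumerate(items) if it[0] == 'jmp'}
        passes = 0
        while True:
            passes += 1
            # Assign addresses under the current size assumptions.
            addr, labels, pos = {}, {}, 0
            for i, (kind, arg) in enumerate(items):
                addr[i] = pos
                if kind == 'bytes':
                    pos += arg
                elif kind == 'label':
                    labels[arg] = pos
                else:                          # 'jmp'
                    pos += size[i]
            # Grow any short jump whose displacement no longer fits.
            changed = False
            for i in size:
                disp = labels[items[i][1]] - (addr[i] + size[i])
                if size[i] == SHORT and not -128 <= disp <= 127:
                    size[i] = LONG             # sizes never shrink
                    changed = True
            if not changed:
                return passes

    prog = [('label', 'top'), ('jmp', 'end'), ('bytes', 125),
            ('jmp', 'top'), ('label', 'end')]
    print(assemble(prog), "passes")            # -> 3 passes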


>The question of what restrictions apply is inessential. The
>general tool gives you an environment that looks like this
>(and this is the kind of tool that CAS was meant to evolve into):
[snipped]
>
>AS: S -> O -- assemble source files to object files.
[snipped]
>LK: O,O,...,O -> O -- resolve & fold multiple O files.
>LK is a universal linker. ...


Okay, I raised this issue in a separate thread, but it's worth
repeating here. The theory works out well if AS and LK don't have to
interoperate with other tools, or if you supply versions of all the
tools (e.g., HLL compilers, librarians, etc.) that need to
interoperate with AS. For simple problem domains (e.g., embedded
applications on the 8051) this isn't too much of a problem. But if
you're writing an assembler for more general-purpose use (e.g., x86
PC applications), LK cannot be universal; it needs to work within the
confines of existing object code formats. People need to be able to
link their assembly code with C/C++, Pascal, Ada, and God knows what
else. True, LK could emit its folded code in OMF, ELF, a.out, COFF,
or some other format, but a final linking step would still be
necessary using a traditional linker, and that traditional linker
wouldn't be capable of finishing the job that LK needs done. Clearly
the idea of LK would be somewhat beneficial because it could resolve
cross-assembly-module references (which traditional linkers would not
be able to do), and that is good; but if the user only writes a few
short assembly routines and does everything else in an HLL (the most
common case), you wouldn't gain much from this solution.


This dovetails with a later discussion in your post:




>The non-universal elements corresponding to the environment where this
>tool is being used convert O files to either OBJ files or EXE files.
>There are actually 2 kinds of environments, each requiring a different
>kind of tool.
>
>Hosted link environment:
>X: O -> OBJ
>
>The "X" tool is target-dependent. This is the place where any
>restrictions on the forms of allowable equations are enforced.


Ouch, this means that you've got to have a separate version of LK for
each host environment. Note that this doesn't mean merely one for
OMF, one for a.out, one for ELF, one for PE/COFF, etc.


Different compilers out there use different subsets of the object
format. For example, Borland's Delphi doesn't support the full OMF
format (e.g., you cannot link arbitrary programs written in MASM *that
should work* with Delphi; similarly, Kylix has all kinds of
restrictions on what it can deal with in an ELF file). In theory, you
could have a different version of LK for all these different tools.
In practice, that's an intractable problem because there are lots of
tools out there with minor, incompatible differences.


>In general, every native object file format is, itself, an implicit
>account of a system of symbolic equations interspersed with the
>reservation/mapping primitives. So, what X is doing is stepping down
>the generality of the allowable systems contained in O files to those
>allowed in OBJ files.


Worse, every subset of the native object file formats is such a
system. Unfortunately, there is no "least common subset" even within
one format. Therein lies a big problem with this approach.




>You can even have different X's for the same environment, depending on
>how much equation-solving power you want to put into X, itself. So,
>greater or smaller ranges of expressions Z1,Z2,...,Zn may be allowed
>in the O file.


Exactly, but the number of X's can get quite large. Therein lies the
practical problem.




>Generally, equations in an OBJ file are non-recursive. I don't think
>there's any native object file format nor any native linker which
>performs symbolic equation-solving algebra.


Not that I'm aware of. Therefore, there are many tools out there that
won't interoperate well with your system. That's the problem I
foresee. You might take a look at the Borland object formats for
Delphi and Kylix. Although I've not looked at what they're doing,
Borland brags about their "smart linker" technology that removes dead
code. This suggests that they've got *something* in their linkers
that is akin to what you're suggesting (though nowhere near as fancy).
(Note, BTW, that I understand their dead-code removal to be
intra-module, not simply "by module".)
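
For what it's worth, here's my reading of what such a "smart linker"
has to do, sketched in Python (my guess at the technique, not
Borland's actual implementation; all names are hypothetical): treat
each function-sized section as a node, each relocation as an edge,
and keep only what's reachable from the entry point:

    def smart_link(refs, entry):
        """refs: dict mapping each section name to the set of section
        names its relocations reference; entry: the root symbol.
        Returns the set of live (reachable) sections."""
        live, work = set(), [entry]
        while work:
            s = work.pop()
            if s in live:
                continue
            live.add(s)
            work.extend(refs.get(s, ()))
        return live

    refs = {'main': {'parse'}, 'parse': {'log'}, 'dump': {'log'}}
    print(sorted(smart_link(refs, 'main')))    # ['log', 'main', 'parse']
                                               # 'dump' is dead, dropped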


>No native linker:
>Y: O -> EXE
>
>The "Y" tool is also target-dependent. This functions almost exactly
>like the X tool, except for the fact that the general EXE file format
>is much more restrictive as to what kinds of unresolved references it
>allows for.


Y really has to work like this:


Y: O, OBJ -> EXE


where OBJ is the restricted form that doesn't contain the algebra
you're proposing. This, unfortunately, defeats much of the purpose
of all of this.
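
Just to pin down the shape of the pipeline we're discussing, here's a
type-level sketch in Python (the field names are mine, purely
illustrative): O carries the full symbolic equation system, OBJ only
the non-recursive fixups a native linker understands, and Y has to
accept both:

    from dataclasses import dataclass

    @dataclass
    class O:                    # universal object file
        equations: dict         # symbol -> unresolved expression

    @dataclass
    class OBJ:                  # native object file
        fixups: list            # non-recursive fixups only

    @dataclass
    class EXE:
        image: bytes

    def AS(source: str) -> O: ...           # assemble source to O
    def LK(*objects: O) -> O: ...           # resolve & fold O files
    def X(o: O) -> OBJ: ...                 # step down to a native format
    def Y(o: O, *native: OBJ) -> EXE: ...   # final link; must also accept
                                            # restricted OBJ inputs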


>--------------------------
[discussion of the universal assembler snipped]


> This is something that's
>actually incorporated in the C-AS assembler.


I assume you mean the version you are working on? (Or perhaps some
later version than the one I downloaded from
www.csd.uwm.edu/~whopkins/8051/index.html, as I couldn't find this
functionality in that version; looking forward to seeing the latest
and greatest, though.)




[Discussion of the 8051 meta-language snipped]
[Discussion of the 8086 meta-language snipped]




>The overall structure of the SRC format is quite simple
>and orthogonal. Other features may be added to round
out the list, e.g., modifying label definition syntax to
>
>A X:
>A X equ/T Ex
>
>where A can be a list of attributes (e.g. "global");
>expressing external linkage by
>
>extern equ/T X,X,...,X
>
>and requiring that all symbolic constants in a file be given either by
>a X: or X equ/T definition or be given by an "extern".


I'm not convinced by the explanation thus far that this scheme allows
an arbitrary syntax (outside the meta-language, of course). It makes
for a great low-level DSEL system, but I'm not sure how well the tool
will allow the creation of assemblers with strange syntax (e.g.,
IA-64). I'm not suggesting that C-AS won't be able to do this, just
that your description is (necessarily in the context of a post) too
short to draw any conclusions.


"Universal" assemblers have been tried in the past (TLA, Gas, etc.)
with less than stellar results (inevitably, they forced a non-standard
syntax on their users). Your system sounds like a big advance on that
work (since it's more of a "meta-assembler" than a "universal"
assembler). Given the proper language declaration system, I believe
you've convinced me that the assembler could also be called a "high
level assembler" (since you should be able to create high-level
language declarations and control statements in the "assembly
language" the meta-statements define); though this is just a guess,
not something I can verify yet.
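
To illustrate what I mean by "meta-assembler" (this is my own toy
illustration, not C-AS's actual meta-language, which I haven't seen
in full): the target instruction set becomes data, and a single
generic engine assembles any target so described. The 8051 opcodes
below are real; the table format is invented:

    ISA_8051 = {
        # (mnemonic, operand kinds) -> byte template
        ('mov', 'A', 'imm'): [0x74, 'imm8'],   # MOV A, #data
        ('mov', 'A', 'dir'): [0xE5, 'dir8'],   # MOV A, direct
        ('sjmp', 'rel'):     [0x80, 'rel8'],   # SJMP rel
    }

    def encode(isa, mnemonic, *operands):
        """Look up the instruction's shape and fill in operand bytes.
        operands are (kind, value) pairs; register operands carry None."""
        kinds = tuple(k for k, _ in operands)
        values = dict(operands)
        out = []
        for part in isa[(mnemonic,) + kinds]:
            if isinstance(part, int):
                out.append(part)                      # fixed opcode byte
            else:
                out.append(values[part[:-1]] & 0xFF)  # 'imm8' -> 'imm'
        return bytes(out)

    print(encode(ISA_8051, 'mov', ('A', None), ('imm', 0x42)).hex())  # 7442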


The remaining question is "how efficient will the tool be?" Modern,
very fast machines notwithstanding, how rapidly can the tool
interpret the meta-language, evaluate the system of equations, process
the DSEL source (i.e., the actual assembly), do the universal linking,
and then do the final linking? For small files (and most assembly
modules are small), translation time isn't too much of an issue; but
if you've got something like the HLA Standard Library (over 50,000
lines of assembly code), a slow tool could be a bit of a problem (HLA
is slow enough as it is).

Randy Hyde

