The Universal High Level Assembler/Code Generator

From: whopkins@csd.uwm.edu (Alfred Einstead)
Newsgroups: comp.compilers
Date: 6 Apr 2002 23:19:16 -0500
Organization: http://groups.google.com/
References: 02-03-120 02-03-127
Keywords: assembler
Posted-Date: 06 Apr 2002 23:19:16 EST

"Randall Hyde" <rhyde@cs.ucr.edu> wrote in message news:02-03-127...
> >There were two major problems which were unresolved which blocked
> >that line of development [of C-AS]: (1) a comprehensive scheme for
> >properly handling the weirdness associated with the way assemblers
> >allow for references to yet-to-be-defined addresses but yet allows them
> >to be used in assembler directives and expressions (a VERY nasty
> >recursion issue lurks beneath this),
>
> This problem doesn't get resolved via multi-phase assembly? Maybe I
> don't understand the exact nature of the problem you are describing.


In the following, I'll describe the most general situation and then
outline an actual definition for a universal code generator.


In the most general situation, the tool in question will generate a
system of equations


X1 = E1, X2 = E2, ..., Xn = En


involving *arbitrary* symbolic expressions E1, E2, ..., En
in X1, X2, ..., Xn.


The general problem of finding a solution (merely FINDING a solution,
not finding an 'optimal' one) is equivalent to solving systems of
equations over a boolean algebra, since the expressions above are in
a finitary arithmetic which, itself, can always be embodied within a
boolean algebra.


Equation solving over boolean algebra is the standard archetype of
NP-complete problems.


Different technologies will handle this situation in different ways
ranging from:


(A) Forbidding any cyclic references in the system of
        equations.


This is too restrictive for most assembly languages, where instruction
mapping (e.g. for multi-size instructions like "jmp") necessarily
involves cyclic references in the corresponding list of equations.
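

For a concrete illustration (the instruction sizes here are made up
for the example), consider a forward jump over a block whose size is
not yet known at the point the jump is emitted:


start:  jmp L        ; 2 bytes if L - start fits in a byte, else 3
        ...          ; some block of code
L:


The equations this generates are roughly


size = (L - start fits in a byte) ? 2 : 3
L    = start + size + sizeof(block)


so L depends on size and size depends on L: the references are cyclic.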


(B) Restricting cyclic references to such a form that
        the corresponding equations can be written in the
        form X1 = Z1; X2 = Z2; ...; Xn = Zn; where the
        expressions Z1, Z2, ..., Zn are expressions over a
        more restricted finitary algebra.


The usual restriction is to permit only sums of the
form X + X + ... + X + Number for the Z's. The equations
are then simple linear equations which can be readily solved
(or readily proven to have no solution).
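

To make option (B) concrete, here is a minimal sketch in Python (the
equation representation is invented for illustration) which solves
such a restricted system exactly by Gauss-Jordan elimination over the
rationals; a singular system has no unique solution, and a
non-integer result would likewise mean failure (the integrality check
is omitted here):


from fractions import Fraction

def solve_linear(equations, variables):
    """Solve  Xi = sum_j aij*Xj + bi,  i.e. (I - A)x = b, exactly by
    Gauss-Jordan elimination over the rationals.  'equations' maps each
    variable to a (coefficient-dict, constant) pair."""
    n = len(variables)
    col_of = {v: i for i, v in enumerate(variables)}
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for i, v in enumerate(variables):
        coeffs, const = equations[v]
        M[i][col_of[v]] += 1                 # the Xi on the left-hand side
        for u, a in coeffs.items():
            M[i][col_of[u]] -= a             # minus the Xj's on the right
        M[i][n] = Fraction(const)
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col] != 0), None)
        if piv is None:
            raise ValueError("no unique solution")
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return {v: M[col_of[v]][n] / M[col_of[v]][col_of[v]] for v in variables}

# A cyclic but solvable system:  X1 = X2 + X2 + 2,  X2 = X1 + 1
print(solve_linear({"X1": ({"X2": 2}, 2), "X2": ({"X1": 1}, 1)}, ["X1", "X2"]))
# -> {'X1': Fraction(-4, 1), 'X2': Fraction(-3, 1)}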


The question of exactly which restrictions apply is inessential. The
general tool gives you an environment that looks like this (and this
is the kind of tool that C-AS was meant to evolve into):


S = Source File
O = Object File
OBJ = Host Native Linker Format
EXE = Host Native Load Format


AS: S -> O -- assemble source files to object files.


This is universal and AS has absolutely no knowledge of any
application it might be used for (i.e., AS doesn't even know it's an
assembler).


LK: O,O,...,O -> O -- resolve & fold multiple O files.


LK is a universal linker. It too doesn't even know it's
a linker or what application it's being used for.


Mutual references between different O's can be used to
perform "equation folding".


In general, the O file contains a system of symbolic equations, as
well as a symbolic representation of the mapping & reservation
directives. When multiple O files are merged, the free variables of
one O file may become bound to expressions via definitions in another
O file.


Equation folding removes all non-cyclic references of expressions from
the right hand sides of the equations. The result is a system of
recursive equations.
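

Here is a rough sketch, in Python over a toy expression form (an
expression is an int, a variable name, or a tuple ('+', e1, e2) --
invented purely for illustration), of what merging two O files and
folding their equations might look like:


def free_vars(e):
    """Free variables of a toy expression: int, name, or ('+', e1, e2)."""
    if isinstance(e, str):
        return {e}
    if isinstance(e, tuple):
        return free_vars(e[1]) | free_vars(e[2])
    return set()

def on_cycle(v, eqs):
    """True if v's definition can reach v again through other definitions."""
    seen, todo = set(), list(free_vars(eqs[v]))
    while todo:
        u = todo.pop()
        if u == v:
            return True
        if u in seen or u not in eqs:
            continue
        seen.add(u)
        todo.extend(free_vars(eqs[u]))
    return False

def subst(e, name, repl):
    if e == name:
        return repl
    if isinstance(e, tuple):
        return (e[0], subst(e[1], name, repl), subst(e[2], name, repl))
    return e

def fold(eqs):
    """Inline every definition that is not part of a reference cycle, so
    that only the genuinely recursive references remain on the right."""
    eqs = dict(eqs)
    changed = True
    while changed:
        changed = False
        for v in eqs:
            if on_cycle(v, eqs):
                continue
            for u in eqs:
                if u != v and v in free_vars(eqs[u]):
                    eqs[u] = subst(eqs[u], v, eqs[v])
                    changed = True
    return eqs

# Two "O files": the first leaves B free, the second defines it.
o1 = {"A": ("+", "B", 1)}
o2 = {"B": 0x100}
print(fold({**o1, **o2}))    # {'A': ('+', 256, 1), 'B': 256}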


All assemblers contain this structure implicitly, though it may not
actually be fully manifest; rather, it is combined with the rest of
the structure described below.


The non-universal elements corresponding to the environment where this
tool is being used convert O files to either OBJ files or EXE files.
There are actually 2 kinds of environments, each requiring a different
kind of tool.


Hosted link environment:
X: O -> OBJ


The "X" tool is target-dependent. This is the place where any
restrictions on the forms of allowable equations are enforced.


In general, every native object file format is, itself, an implicit
account of a system of symbolic equations interspersed with the
reservation/mapping primitives. So, what X is doing is stepping down
the generality of the allowable systems contained in O files to those
allowed in OBJ files.
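

Here is a toy sketch of the stepping-down that X performs (Python;
the symbol-table/relocation shapes are invented and do not correspond
to any real OBJ format): definitions the format can express go into a
symbol table, residual references of the allowed shape become
relocation-style records, and anything more general is rejected:


def lower_to_obj(folded):
    """'folded' maps each name either to a constant or to a (symbol,
    addend) pair, i.e. the system has already been folded down to the
    shape  X = sym + const."""
    symtab, relocs = {}, []
    for name, rhs in folded.items():
        if isinstance(rhs, int):
            symtab[name] = rhs                  # absolute; fully resolved
        elif isinstance(rhs, tuple) and len(rhs) == 2:
            sym, addend = rhs
            relocs.append((name, sym, addend))  # left for the native linker
        else:
            raise ValueError(name + ": too general for this object format")
    return symtab, relocs

print(lower_to_obj({"start": 0x100, "entry": ("main", 4)}))
# ({'start': 256}, [('entry', 'main', 4)])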


You can even have different X's for the same environment, depending on
how much equation-solving power you want to put into X, itself. So,
greater or smaller ranges of expressions Z1,Z2,...,Zn may be allowed
in the O file.


Generally, equations in an OBJ file are non-recursive. I don't think
there's any native object file format or native linker that performs
symbolic equation-solving algebra.


No native linker:
Y: O -> EXE


The "Y" tool is also target-dependent. This functions almost exactly
like the X tool, except for the fact that the general EXE file format
is much more restrictive as to what kinds of unresolved references it
allows for.


This is also the place where you see the "unresolved reference"
errors. The O file which gets converted to an EXE generally has no
free variables (other than loader-defined symbols).
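

In sketch form (the loader-defined symbol names below are invented):


LOADER_DEFINED = {"__load_base", "__bss_end"}   # invented examples

def check_resolved(free_vars):
    """Everything still free at EXE time must be a loader-defined symbol."""
    unresolved = sorted(v for v in free_vars if v not in LOADER_DEFINED)
    if unresolved:
        raise SystemExit("unresolved reference(s): " + ", ".join(unresolved))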


--------------------------


So... what does the SRC language look like in general? Again, this is
the most general situation that you see in assemblers, and every
actual tool out there will embody some kind of restriction of this.


Generally, everything is a directive built out of the mapping
primitive db A,B,...,Z; the reservation primitive ds A; and the
object $, which represents the current location.


Statements are built out of directives in the following
ways with the following syntax:


St -> Op Ex,Ex,...,Ex == macro operation/mnemonic
St -> if (Ex) St [else St] == conditional
St -> { St ... St } == sequence
St -> Ex; == expression statement


St -> db Ex,...,Ex
St -> ds Ex


And you generally have statements which set location and
address space, as well as those which define labels:


St -> org Ex == set location
St -> seg T [at Ex] == set address space to type T
St -> X: == define label
St -> X equ Ex == explicit define of label
St -> X T Ex == explicit define of type T address


In actual fact, none of these are really primitive but
can be defined in terms of more primitive operations;
e.g.,


org A <--> $ = A
seg T at A <--> $ T A
X: <--> X equ $
ds A <--> $ += A


A further extension to "seg" may enable one to distinguish between
exclusive-access vs. reusable (roughly paralleling 'static'
vs. 'auto').


If one adds in mapping and reservation primitives:


A <- B == map value B to location A
A <<- B == reserve B units starting at location A


one could more accurately represent ds and db by:


ds A <--> $ <<- A; $ += A
db A1,...,An <--> $ <<- n; $ <- A1; $+1 <- A2; ... $+n-1 <- An; $ += n
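

A small Python model (names invented for illustration) of just these
primitives shows how org, labels, ds and db all reduce to
manipulating $ and emitting mapping/reservation records:


class Section:
    """Tiny model of the primitives: $ (the location counter), the
    mapping primitive  addr <- value  and the reservation primitive
    addr <<- n."""
    def __init__(self, origin=0):
        self.loc = origin            # $
        self.maps = []               # (address, value) pairs:  A <- B
        self.reserves = []           # (address, length) pairs: A <<- B
        self.labels = {}             # X equ ...

    def org(self, a):   self.loc = a                # org A  <-->  $ = A
    def label(self, x): self.labels[x] = self.loc   # X:     <-->  X equ $
    def ds(self, n):                                # ds A   <-->  $ <<- A; $ += A
        self.reserves.append((self.loc, n)); self.loc += n
    def db(self, *vals):                            # db A1,...,An
        self.reserves.append((self.loc, len(vals)))
        for i, v in enumerate(vals):
            self.maps.append((self.loc + i, v))
        self.loc += len(vals)

s = Section()
s.org(0x100)
s.label("start")
s.db(1, 2, 3)
s.ds(4)
# s.labels == {'start': 0x100}; the next free location is now 0x107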


A single covering syntax for expressions handles most or all of the
actual languages out there:


Ex -> u Ex | Ex b Ex | Ex p == prefix, infix, postfix operators
Ex -> C | X == constants and identifiers
Ex -> "(" Ex ")" | "{" Ex "}"
Ex -> "[" Ex "]" | Ex "[" Ex "]"
Ex -> T Ex == address


An example of the latter: if "code" denotes an address type,
then code 0x20 would be address 0x20 in the corresponding
address space.


The label definition "X T A" is just shorthand for "X equ T A".


Symbolic operator symbols can be defined with the same syntax
as in Prolog, and arities and precedences likewise; e.g.,


xf ":", 200


defines ":" as a (non-associative) postfix operator with
precedence 200,


yfx "+", 800


would define "+" as a left-associative infix operator with
precedence 800. A standard configuration would exist for
the full range of C-like prefix, infix and postfix operators.
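

A minimal sketch of what such an operator table might amount to
(Python; the representation is invented -- in a template like yfx,
'f' marks the operator and a 'y' slot is the side that admits an
operand of equal precedence, i.e. the associative side):


# Prolog-style operator declarations: template + precedence per symbol.
OPERATORS = {}

def op(template, symbol, prec):
    position = {"fy": "prefix", "fx": "prefix",
                "xfx": "infix", "xfy": "infix", "yfx": "infix",
                "xf": "postfix", "yf": "postfix"}[template]
    OPERATORS[(symbol, position)] = (template, prec)

op("xf",  ":", 200)    # ':' as a non-associative postfix operator
op("yfx", "+", 800)    # '+' as a left-associative infix operator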


A macro operation/mnemonic is defined by a macro, whose definitions
take on the general form:


define Op(Ex,Ex,...,Ex) St


where the free variables in the corresponding expressions are
matched upon macro call by pattern-matching.
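

A rough sketch of the pattern-matching step (Python; the toy term
form and the upper-case-means-pattern-variable convention are
invented for illustration):


def match(pattern, operand, bindings):
    """Match one macro parameter pattern against an actual operand.
    Terms are numbers, strings, or tuples (operator, operands...);
    upper-case strings in the pattern act as pattern variables."""
    if isinstance(pattern, str) and pattern.isupper():
        bindings[pattern] = operand
        return True
    if isinstance(pattern, tuple) and isinstance(operand, tuple):
        return (len(pattern) == len(operand) and pattern[0] == operand[0]
                and all(match(p, o, bindings)
                        for p, o in zip(pattern[1:], operand[1:])))
    return pattern == operand

# A definition like  define mov(DST, #IMM) St  would be selected when
# the second argument has the '#' immediate shape:
b = {}
print(match(("#", "IMM"), ("#", 0x42), b), b)   # True {'IMM': 66}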


A useful expedient, especially in this context, is the idea
of "anonymous labels", which could for example be represented
as


number:


with references made by "numberF" or "numberB", where the former
refers to the next occurrence of the number: label and the latter to
the previous occurrence. This is something that's actually
incorporated in the C-AS assembler.
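

A sketch of how such references could be resolved (Python; the
representation is invented):


def resolve_anon(refs, labels):
    """'labels' is [(number, address)] in source order; 'refs' is a list
    of (number, direction, site_address), direction 'F' or 'B'.  nF binds
    to the nearest later  n:  label, nB to the nearest earlier one."""
    resolved = {}
    for num, direction, site in refs:
        if direction == "F":
            candidates = [a for n, a in labels if n == num and a > site]
            resolved[(num, direction, site)] = min(candidates)
        else:
            candidates = [a for n, a in labels if n == num and a <= site]
            resolved[(num, direction, site)] = max(candidates)
    return resolved

# Two "1:" labels at 0x10 and 0x40; "1F" at 0x20 sees the later one,
# "1B" at 0x20 the earlier one.
print(resolve_anon([(1, "F", 0x20), (1, "B", 0x20)], [(1, 0x10), (1, 0x40)]))
# {(1, 'F', 32): 64, (1, 'B', 32): 16}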


The AS tool generally would have no knowledge that it's even
an assembler. That means the actual address types are themselves
defined, too.


I won't specify the syntax for address type definitions in
detail, but will try to illustrate what it might look like for
the two most immediate examples: the 8051 and the 8086.


An 8051 configuration might look like this:


An "i" space is one where the mapping primitive <- can be used.
The reservation primitive <<- can be used with all spaces.


// Base address spaces
type i code[0-0xffff], xdata[0-0xffff];
type bit[0-0xff]
type ddata[0-0x7f], sdata[0x80-0xff], sfr[0x80-0xff];


// Concatenated spaces
type data = ddata:sdata, idata = ddata:sfr


// Subspaces
type bdata = ddata[0x20-0x2f], rdata = data[0-0x1f]
type dbit = bit[0-0x7f]


// Register files (each is one-element)
type _accum1 = {C}, _accum8 = {A}, _accum16 = {AB}
type _ptr16 = {DPTR}, _pc = {PC}
type reg = {R0,R1,R2,R3,R4,R5,R6,R7}


The latter definition would be equivalent to:


type reg[0-7];
R0 reg 0; R1 reg 1; R2 reg 2; R3 reg 3
R4 reg 4; R5 reg 5; R6 reg 6; R7 reg 7


This exhausts the list of operand spaces which the standard
8051 language makes use of in its operations. Some variations
extend _ptr16 = {DPTR,DPTR2}.
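

A toy Python model (invented, purely illustrative) of what such
address-space types amount to -- a name plus a range that typed
addresses are checked against:


class Space:
    """Toy model of a declared address-space type: a name and an
    inclusive range, so that  code 0x20  can be modelled as code(0x20)."""
    def __init__(self, name, lo, hi):
        self.name, self.lo, self.hi = name, lo, hi
    def __call__(self, n):
        if not (self.lo <= n <= self.hi):
            raise ValueError(f"{n:#x} is outside "
                             f"{self.name}[{self.lo:#x}-{self.hi:#x}]")
        return (self.name, n)        # a typed address

ddata = Space("ddata", 0x00, 0x7f)
sdata = Space("sdata", 0x80, 0xff)
code  = Space("code",  0x0000, 0xffff)

code(0x20)       # ok: address 0x20 in the code space
# ddata(0x95)    # would be rejected: 0x95 only exists in sdata/sfr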


An 8086 configuration might look like this:


type i near[0-0xffff], i far[0-0xffffffff];
type _reg8 = {AL,CL,DL,BL,AH,CH,DH,BH};
type _reg16 = {AX,CX,DX,BX,SP,BP,SI,DI};
type _flags = {FLAGS};


(Not sure if I have the actual numbering for the
registers correct).


Extensions of the x86 add in more register files
(e.g. the debug register file).


The overall structure of the SRC format is quite simple
and orthogonal. Other features may be added to round
out the list, e.g., modifying the label definition syntax to


A X:
A X equ/T Ex


where A can be a list of attributes (e.g. "global");
expressing external linkage by


extern equ/T X,X,...,X


and requiring that every symbolic constant in a file be given either
by an X: or X equ/T definition, or else be declared by an "extern".

