Related articles |
---|
What Is Assembly Language? mark@omnifest.uwm.edu (1997-06-04) |
Re: What Is Assembly Language? khays@sequent.com (1997-06-10) |
Re: What Is Assembly Language? bhouse@dazsi.com (Bill House) (1997-06-10) |
From: | mark@omnifest.uwm.edu (Mark Hopkins) |
Newsgroups: | comp.compilers |
Date: | 4 Jun 1997 22:49:36 -0400 |
Organization: | Omnifest |
Keywords: | assembler, summary |
This is an issue relates which just as well to the question of what a high
level language is and what a compiler is and does, which I discuss. So I
decided to also post it here. In it, I also discuss a new kind of software
tool, a "code generator", describing it in contrast to a compiler.
----------------------
If I were to write a FAQ on the issue, I asked myself what would I say on the
following topics: What is Assembly Language? and What Does/Should An Assembler
Do? This is what I wrote:
(1) What is Assembly Language?
It is a mathematical notation used to represent a finite state machine
which operates in real time. Each program is a different finite state machine,
with the microprocessor, itself, being a kind of tabula rasa on which these
machines are "written".
(1.1) Assembly Language vs. High-Level Language
A high-level language provides a notation for an abstract machine, which
in a similar way is written on a kind of tabula rasa. In contrast, though,
the notation of the high-level language represents an abstract machine model,
which normally has no direct connection to real time, and which is meant to be
of equivalent power to the infinite state abstract machine model known as the
Turing Machine.
Since the real world consists only of finite state machines, a process
must be set up which attempts to cram a program written for an infinite
state machine into the confines of a finite state machine. Not since the
telling of Story of the Gospel has such an audacious idea of shoehorning
Infinity into a Finite physical form ever arisen. This incredible process
is done by none other than the Compiler.
The actual cramming takes place when the compiler attempts to map out the
variables the programmer uses into registers, and attempts to map out the
actual program to memory locations. Therefore, the efficiency of these
processes represents a critical feature in the design of a given compiler and
is the basis for comparing different compilers. Over the years, these
techniques have been refined to the point where they will generally produce
better results than what a normal programmer will produce, when writing in
assembly language.
There is thus a real distinction of levels between the two language
types. Since the very purpose of a compiler is to bridge this gap, it is
then fundamentally true that there will be a degree of indirection between
what the programmer writes in a high-level language and what the compiler
produces.
At times, this indirection is a major liability. Therefore, programmers
will sometimes resort to writing in assembly language, down at the level
of the finite state machine, itself. For instance, the very mapping
process the compiler uses may, itself, get in the way of what the programmer
is trying to do(!), such as if they wanted to set up a small and extremely
efficient process-swapping kernel (whose form changes with and depends
critically on the application at hand). This will almost certainly interfere
with any code they attempt to combine it with that comes from a compiler.
I do this, on a routine basis, for all assembly language programs I write,
for real-time applications, since the very nature of such an application
is that two or more processes are running concurrently and need to be
synchronized to each other and to external events. Because I'm spreading
the word, writing real-time applications in this style, with the concurrency
made explicit and embedded into each application, will eventually become the
norm.
The timing of the code, or its size may be another issue, but not as
important as it used to be. Also, a compiler may be completely missing the
Clue on something where nobody would normally expect Get It (the multi-
tasking example above being the most notable one), where the programmer has
to directly intervene.
At last there is the issue of the intellectual satisfaction of having
direct control over the machine. Underlying this sentiment is a very real,
never discussed, but major advantage to writing in assembly language:
verifiability. Because of the Halting Problem and other related limitative
results in the theory of Turing Machines, it is known to be theoretically
impossible to provide a general process which can verify a program written
in a high-level language. Even a process with more limited ambitions of
verifying certain kinds of programs, or of verifying only certain assertions
in a program is difficult to set up precisely because the general theoretical
problem is unsolveable.
In contrast, there is no such limitative theoretical result for Finite
State Machines, though testing for the equivalence of two Finite State
Machines, I believe, has complexity at least as high as NP-complete.
Nevertheless, it is much easier to come up with verification techniques, both
for use by machine verification programs, and for use by programmers, which,
in turn, are easier to apply in practice. To add to this, it is generally true
that a machine level language (such as the 8051 assembly language) will
have an extremely detailed and precise definition almost to the point
where you can write a simulator to reproduce the actions of the given
microprocessor down to the last machine cycle and down to the last I/O pin
on the chip, itself. This becomes somewhat less true of processors that
have behind-the-scenes work, like pipelining, but the detail and precision
of the operations' definitions still remains.
In contrast, high-level languages are not normally defined with anywhere
near this level of mathematical detail and precision in their corresponding
Standards. Therefore, you can never be entirely sure what a program is
doing when you write it in a high-level language. This imprecision, in many
ways, defeats the very purpose behind having a high-level language in the
first plaxce!
This is what people are actually (and unconsciously) referring to when
they say that "I have more direct control and more control over the machine
when I write in assembly language". Verifiability and the ability to
logically prove programs correct, and to provide methods and tools for
analysis both automated and by hand is an advantage held exclusively by
assembly language over high-level languages. This is an issue that cannot be
ignored, and it is the underlying reason that assembly language programs and
programmers will and must continue to thrive indefinitely into the future.
(1.2) Assembly Language vs. Machine Language
Even though assembly lanugage is strictly at a lower level than high-level
language, and resides at the same level as the machine, itself, there is
still a distinction to be drawn between it and machine language. I can
illustrate this most easily with the assembly language for the 8-bit 8051
processor.
Machine language consists of words written in binary code, and it is
customary to reprsent these codes in more compact manner either in base
8 (octal) or base 16 (hexadecimal). Some processors were designed with a
closer affinity to octal (the 8080, 8085, Z80, 80x86, Pentium), and others
to hexadecimal (the 8048/49, 8051/52, 80*96, 80960, 680*, 6800*).
In base 16, the extra 6 digits are represented by the first 6 letters
of the Roman alphabet. Therefore, when you count in base 16, starting at 0,
you would say: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, before you get
to 10. Thus, 10 in hexadecimal is the same number as 16 in decimal.
In 8051/2 machine language, the hexadecimal codes c3 and d3 have the
effect of setting a single binary register, known as the "carry bit" (denoted
as C) to 0 or 1 respectively. In the standard assembly language of the
8051, these operations are written respectively as "clr C" and "setb C".
But, now here's where the distinction between assembly language and
machine language lies. One could just as well design an assembler which
allows you to write down instructions like "mov C, #x", where x is some
algebraic expressions which evaluates to 0 or 1. This would say "move
the value x into the carry bit". The assembler would translate it as
c3 + 10x. An even more expressive assembler could allow C-like conditionals,
even involving registers, such as "mov C, #(A < 25h)", which would be
translated into the binary: b4 25 00 (which is expressed in the Standard
language as "cjne A, #25h, $+3" and makes use of the side-effect this
operations has on C).
Assemblers don't normally extrapolate this far from a language Standard
(yet), though there's nothing saying that one can't be designed to do so.
But it is common for assemblers to allow programmers to write down operations
which can be interpreted in different ways. The typical example is the
jump instruction, which can come with several different binary encodings,
typically to encode "near" jumps and "far" jumps. An assembler will decide
which one to use.
In addition to these conveniences, an assembler will normally provide
facilities that enable you to name areas of memory for use as variables,
to name routines, and to even customize the language to some degree by
defining your own operations (called Macros), such as those described
above.
Since the actual binary codes of a machine language are not directly linked
to the meanings of the operations they are meant to represent, since a given
binary can stand for two or more INTENDED operations (e.g., c3 meaning
"setb C", or "mov C, #1"), or one operation can be represented by two or more
binary codes depending on restrictions which are a pain to have to be
continually calculating by hand, one will normally write a program in assembly
language rather than in binary.
Since operations are normally coded in multiple numbers of bytes, it
is even possible that one could write program segments whose binaries overlap
with each other. For instance, one might have a sequence of codes,
like AA, BB, CC, DD, ... where the segment AA BB; CC DD; stands for one
set of operations and BB CC; DD ...; stands for another set of operations.
Or it is possible that one could write self-modifying code which manipulates
its own binary codes to alter its actions (that's essentially what an
operating system is doing when it loads a new program into memory to be
executed). Such a capability might come in handy, for instance, in
designing programs which are self-optimizing, or even self-customizeable to
a given application.
None of this is usually allowed or possible when writing with an
assembler. Therefore there is, in this sense, a very real distinction of
levels too between machine language and assembly language.
(2) What Would The Ideal Assembler Do?
What should go into the design of an assembler? I'll give you a head
start by listing what I think the ideal design should look like.
Much of this (items (2.2), (2.3) except for macros, and (2.4)) has been
incorporated in the CAS 8051 assembler which I wrote a couple years back and
which is archived at ftp.pppl.gov::/pub/8051/wisc/* along with a bunch of
8051-related demo code.
Since items (2.1) and (2.3) (macros), and especially (2.6) are not included
in its design, I consider the assembler to be incomplete, and even broken.
Items (2.1) and (2.5) are actually the critical requirements for retargeting
this assembler into a generic CAS assembler. The C- prefix is because its
syntax is borrowed heavily from C.
(2.1) Resolution of Operations
A subtle fact that took me a long while to learn is that there are
different levels of assembly language. At the bottommost level, one has
what you might call "literal" assembly. Here you have to explicitly
indicate the exact form of every mnemonic and take total responsibility for
making sure that the mnemonic has valid operands. The processors I'm most
familiar with are the 8051 and 8086. In both processors you have different
instructions, for instance, for calling a subroutine or jumping to an
address. With the 8051, for instance, you have these operations:
SJMP Relative
AJMP Paged ACALL Paged
LJMP Absolute LCALL Absolute
JMP @A+DPTR
A relative address lies in the range [$ - 0x80, $ + 0x7f], where $ stands
for the address following the end of the instruction. A paged address
lies on the same 0x800-byte "page" as $, i.e. (Paged & 0x7ff) == ($ & 0x7ff).
An absolute address can lie anywhere. An absolute address can also be
specified through the pointer formed as the sum of A and DPTR. In a "literal"
assembler you have to specify which instruction to use.
With the 8086 you have
JMP SHORT Relative
JMP NEAR Paged CALL NEAR Paged
JMP FAR Absolute CALL FAR Absolute
where FAR Absolute is a 32-bit address that can be specifed directly or
through a pointer, Paged is an address (specifiable directly or through a
pointer) that lies in the same 16-bit page as $, and Relative means the same
thing as it does on the 8051.
It's not really a major problem to indicate which form you want: you always
specify the shortest form (SJMP, ACALL) and let the assembler tell you which
ones are out of range and need to be extended. But it's a minor hassle.
At the next level, you'll have an assembler which does the resolution of
operations itself. One way to do this, with the 8086 example, is to require
the programmer to explicitly designate some addresses as near and some as
far. This runs roughly parallel to what you do in C by specifying some
functions as static (visible only from within the file) and some as global
(visible from all files: the default), with the C file and 8086 segment being
analogous. In the case of CALL, this is absolutely required since the
corresponding return instruction has to be explicitly indicated as a NEAR
or FAR return.
With the 8051, an assembler may allow the programmer to just say:
JMP Absolute CALL Absolute
and then resolve the instructions as best as it can.
At the highest level -- the ideal level -- an assembler will allow EVERY
operand combination for EVERY mnemonic. For example, consider the following
instruction:
XCH X,Y
used to exchange the values in variables X and Y. On the 8051, the only
allowed combinations are:
XCH A,Data
XCH A,@Ri
XCH A,Rn
where Data is an address in "directly addressible" (internal) RAM, where
@Ri (i = 0, 1) points to "indirectly addressible" RAM, and Rn (n = 0, ..., 7)
lies in the Register address space.
Directly addressible RAM consists of the internal RAM 0 to 0x7f, and
the Special Function Registers (range: 0x80 to 0xff). Indirectly addressible
RAM consists also of the internal RAM 0 to 0x7f, but a different segment
in the range 0x80 to 0xff, which can also be used as stack space. The
Register address space is a "register window" in internal RAM, with one
of the 4 ranges 0 to 7, 8 to 0xf, 0x10 to 0x17 or 0x18 to 0x1f.
An assembler at the first two levels will force you to make your code
conform to the restrictions placed on the XCH instruction. The assembler
at the 3rd level will insert its own code, as needed, to make your
operands fit the requirements of the instruction. For instance, you might
have the following translations:
XCH Data,A ---> XCH A,Data
XCH @Addr,A ---> MOV R0,#Addr
XCH A,@R0
XCH D1,D2 ---> XCH A,D1
XCH A,D2
XCH A,D1
with the middle instruction requiring some extra resources not explicitly
indicated in the instruction. There are a few choices that can be made in
the design about how to handle such cases, from allowing the programmer to
control which resources are used through directives, to flagging such codings
by warning messages, or various combinations of strategies of this nature.
Another kind of instruction which will be resolved in an orthogonal
manner in this kind of assembler is the conditional jump. On the 8051,
which has addressible bits, there are only a few different kind of conditional
jumps. For instance, one has
JB Bit,Relative
which is equivalent to the C statement:
if (Bit) goto Relative;
Typical of these instructions is that they only allow restricted ranges
of addressing. An assembler will normally force you to fit the address
into the indicated range or else flag an error. This is true even of those
assemblers (at level 2) which try to resolve JMP into SJMP/AJMP/LJMP -- which
is a serious inconsistency! If you're going to allow one kind of jump to
be resolved by the assembler, then you have to let them all be.
What the assembler might do, if the indicated address is not in range,
is convert it into the sequence:
JB Bit,Address ---> JNB Bit,XX
JMP Address
XX:
where the opposite condition is used. In the 8051, there are some versions
of the conditional jump which don't have opposites. They would have to be
translated, when out of range, as in the following instance:
JBC Bit,Address ---> JBC Bit,YY
SJMP XX
YY: JMP Address
XX:
So, an assembler at the highest level should allow every operator to
be used with every type-conforming combination of operands, possibly doing
some translation in the process to make the operation fit the requirements
of the hardware.
(2.2) Resolution of Addresses
In order to resolve operations, the assembler has to be able to map
your variables and routines to actual addresses. At the lowest level,
the assembler will force the programmer to explicitly indicate every
address by the following devices:
ORG Address
to specify that what follows is to begin at the indicated address, or
Var: DS Size
to reserve a space of indicated size for the indicated variable (in the
process incrementing $ by Size), or
Label:
to equate Label to $, the current address, or
Label EQU Address
to explicitly set the indicated label to the indicated address. The
assembler, itself will not do any mapping beyond this.
In an assembler at the next level, a programmer could specify a segment for
variables or code without explicitly indicating the starting address. It's
then up to the assembler to map this segment to an appropriate location.
Normally this kind of assembler will also allow separate assembly of
files (into an object file format), and will come with a linking phase.
It's during linking -- and is the main function of linking -- that the
mapping of the segments will take place.
Since addresses may be referred to before actually being defined, and
since they could even be used in expressions (such as in the EQU example
above), then there has to be some way to defer the actual creation of code
until all the information about the address is known.
In the traditional two-pass assembler, the first run through the program
is made to collect numeric values for addresses. The mnemonics, like JMP
and CALL are not resolved in the optimal way when referring to addresses
not yet defined, and other operations (like JB, JBC) could not be resolved
in this way at all. Also, there's no easy way to handle relatively addressed
segments this way since the *existence* of a valid map may depend on the
order the segments are mapped in.
So two-pass assembly is not really suitable for the ideal design we're
constructing here.
A one-pass assembler will likewise collect addresses (*relative to
segments*) but also generate some intermediate code during the first pass.
Ideally, this code will be none other than the OBJECT FILE, and it need not
be at all similar to the binary which is finally created. This is especially
true considering how instructions like JB or JMP (or XCH in our example
above) are going to be mapped in the object file.
In place of the second pass, to create the actual assembly code, there
will be a linker which will combine the object files, mapping addresses in
the process. As part of the mapping process, operations can also be resolved
using a kind of "shortest-first" strategy, as described above, to find the
optimal fit.
Since the yet-to-be-resolved addresses may even be involved in expressions,
the object file format has to be able to list not just partially resolved
addresses, but partially resolved expressions! Still, it is convenient to
restrict the types of address expressions (which are what almost all the
partially resolved expressions will be) to the following:
Address + Number
Address - Number
Number + Address
Address - Address
where a restriction is made that in expressions of the form (A1 - A2),
not only must A1 and A2 be addresses of the same type, but they must come
from the same segment. For purposes of making that distinction, an
absolute address can be considered to lie on the "absolute segment" of
the corresponding type (data, code, bit, etc.), so that A1 and A2 are
allowed to be absolute addresses.
These restrictions are the same as apply to pointer arithmetic in C,
where instead of "segment", you have static or dynamic array (or a
single variable, which can be considered as a one-unit segment).
Other expressions may use the "numeric value" of an address. This
should be explicitly disallowed, except for those addresses which lie
on the absolute segment. The behavior of a program which allowed typecasting
of relative addresses to numeric values would actually depend on the output of
the linker, whose operation is supposed to be invisible to the programmer.
I know of no assembler which does all of what I've described above.
(2.3) User-Defined Operations and Directives
A facility for defining your own operations (MACROS), somewhat analogous
to C's #define, should be included. In the ideal design, the syntax used for
mnemonics should be fully orthogonal, as described above. Not only should
it be fully orthogonal, but there should be enough power in the ability
to define macros that even mnemonics could be user-defineable in terms of the
basic operation:
DB Byte
(which maps the given byte expression explictly to the current location in
the current segment).
First, this requires the ability for the macro to distinguish between
alternatives -- which implies that one must also have the operation:
IF (Cond) S1 [ELSE S2]
where Cond is an expression indicating a condition, and where S1 and S2 are
assembler statements. It implies the ability to group 0, 1, 2 or more
statements together (even on the same line!):
{ S1; S2; ...; Sn }
and it implies that conditional expressions should be able to refer to
whether an address (and more generally: an expression) is relative or
absolute, and the type of an expression and the segment it lies on.
Further, the statment and expression syntax should be fully orthogonal, so
that conditionals can be used in any context any other expression can, and
conditional statements or statement groups can be used outside of macro
definitions, even to group assembly statements together on one line. Also,
since one has to have the ability to refer to the type of an expression,
then the very names of these types should be able to fully participate in
the expression syntax, and there should even be a type called "type".
So there should be operators of the form
TYPEOF Expression
to indicate an expression's type (thus: typeof (typeof E) == type, and
typeof (type) == type)
SEGMENT Address
to indicate the (starting location of) the indicated address's segment,
and
OFFSET Address
which should be equivalent to (Address - SEGMENT Address). For absolute
addresses, (SEGMENT A) will be equal to the starting location of the
address space -- which normally will be 0.
Among other things, all of this implies the need to name segments. The
easiest way to handle this is to just allow segments to be defined by the
syntax:
[Label:] SEG [TYPE] [AT/ORG Address]
and to allow for the following equivalences:
SEG Type AT A <---> SEG Type
AT/ORG A
X: SEG Type AT A <---> SEG Type AT A
X:
When no starting address is explicitly indicated, the segment is then
considered to be relative and then the value of X will not be known
until the linker maps out segments.
Since the ; is used to mark comments in many assemblers -- those for
the Intel processors such as the 8086 and 8051, some amends have to be
made here to avoid a conflict of syntax. An easy fix is to require all
comments to begin in ;;, and interpret two consecutive ";"'s as the
start of a line comment instead of two statement terminators. It's also
useful to allow the full range of C++ comments -- the line comments
which start in //, and the /* ... */ comments which may extend across
several lines.
(2.4) Modularity
Since you'll already have the ability to separately assemble files,
then it's also natural to add in some kind of encapsulating feature which
allows the programmer to define labels as local or global.
The natural default is to make all labels local, and thus to make names
reuseable in other files. To explicitly declare a label as local, one can
then put a GLOBAL in front of its definition. Example:
GLOBAL Label:
or
GLOBAL Label EQU Address
Then one must have the ability to declare externally defined labels in other
files, which means one will have declarations such as:
EXTERN Type Label
and since a module may be used in several places, one should be able to
centralize these declarations and include them in other files with a
statement such as:
INCLUDE "File"
(2.5) Argument Syntax
As much as possible, the syntax for arguments should be one and the
same as the syntax for expressions:
Exp: Exp infix-op Exp
prefix-op Exp
Exp postfix-op
Exp? Exp: Exp
@ Exp
( Exp )
$
NUMBER
STRING
CHARACTER
TYPE
LABEL
where the "Exp" in "@ Exp" should refer to an address. If necessary, this
will mean reinventing some of the reserved words in the standard version of
the language as defined labels. For instance, the accumulator, A, of the
8051 would be used equivalently to ACC and would stand for the special
function register address 0xe0, register R2 would be address 2 in the
register address space, PC would be the code address $ (possibly even
extended to allow its use outside code segments with the definition PC = $),
and the reserved words AB and DPTR would have to be creatively handled
in order to fit this expression syntax. On the 8086, things will be a
bit easier to handle and made to fit this syntax.
(2.6) Listing Binary Codes
A novel, but potentially fruitful approach here is to produce a listing
of the binary output of the assembler which is in a suitable form that even
it, itself, could be assembled to produce the same output! The way to do
this -- consistent with the design goal of reinventing all mnemonics as
internally-defined macros defineable ultimately in terms of DB -- is to
write down a listing line of the form exemplified below:
Label code 0x1258 ;; Label:
db 0x10,0xe1,0x03 ;; jbc ACC.1,Rel
where the explicit definition of Label defines it as the code address
1258 (hexadecimal) and the byte sequence 10, e1, 03 (hexadecimal) the binary
translation of the operation indicated in comments after it.
The listing file will have to do a second pass through the assembly
files AFTER the linker is done, since it will need to know the absolute
locations of all the segments first, and since it will have to mimick the
lines of the original assembly files. However, the listings for each
file may be separately produced. Multiple statements per line would have
to be chopped up, and the operands of the DB instruction in the listing
would have to be more general expressions than numeric constants for
listings of macros.
(2.7) Generalizing the Assembler
With these goals, it should be possible to retarget the assembler to
a large number of processors with very little needed in the way of changes.
It may even be possible to isolate these changes and make them defineable
in some kind of "configuration" file.
The main difficulty there would be in defining the binary mappings for
mnemonics. Other assemblers which have attempted this level of generality
have also tried to cope with the difficulty of defining the operation and
operand *syntax*. But we have completely by-passed that problem by
requiring all our mnemonics to be completely orthogonal with respect to
their operands -- which, in fact, is the main reason for imposing this
requirement. And since we're also requiring that the mnemonics, in
principle, be defineable in terms of DB through the built-in macro facility,
then it should even be possible to design the generic assembler so that it
has NO BUILT-IN MNEMONICS, and where everything is defined by macros.
It might even be desireable to extend our statement and expression syntax
by allowing statment loops, variable expressions, assignment statements,
thus turning our assembler completely into a generic memory-mapping tool,
which (for a lack of a better name) could be called a CODE GENERATOR.
The difference between a code generator and a compiler is that, for
instance, loops in a compiler are translated into loops in the assembly
language and variables into addresses, but in the code generator all of
this stands over and above the target language (which isn't even known
to the generator given our assumption about there being nothing but DB's).
There, a loop causes the repeated creation of new binary, and variables only
exist as a kind of re-defineable label, all of which is gone by the time the
linker sees the code.
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.