Re: writing an assembler!

"John Lindsay" <>
27 Jun 1998 00:43:40 -0400

          From comp.compilers


From: "John Lindsay" <>
Newsgroups: comp.compilers
Date: 27 Jun 1998 00:43:40 -0400
Organization: Royal Military College of Canada
References: 98-06-126 98-06-144
Keywords: assembler, design

Kirk Hays wrote:
> samuel <> wrote:
> > I am currently working on writing an assembler (Intel syntax
> > for the x86 microprocessor) for my operating system project.


> A scheme that works particularly well for the x86, with its mnemonic
> overloading, is to implement a macro assembler, where each binary
> instruction is described by its name and acceptable arguments, and
> results in a procedure for emitting code.
> IOW, you write the macro language interpreter, then write your
> assembler as a set of macro procedures describing all the
> instructions, their opcodes, and any assembly state that affects the
> emitted binary code.


Right on! I'll add a bit by suggesting some things that can
become lost too easily, and a structure that avoids a number
of serious bottlenecks and restrictions that can arise later on.

1. Pay very careful attention to the data types that exist in the
machine hardware and implement these in the assembler language. This
implies looking at the uses to which the hardware is put in a wide
selection of imperative languages. Do have a look at the T' attribute
in the F, (G,) H, (SPASM,) VS, and HLASM assemblers for the
I.B.M. mainframes. This T' attribute is badly botched, but
it will give you the idea: names in the assembler language have
attributes (type, length, scale, width, multiplicity, ....), and these
are accessible at macro expansion time for use in conditional code

2. The opcodes, including assembler pseudo-operations, are
essentially different from the names, and possess the same sort of
attributes as operators of HLLs. They should exist in stacks of
operator definitions, one stack for each opcode name initialized with
an unpoppable first-level definition provided with the assembler (with
a couple of macro-time enquiries about the state of each stack and a
notational escape that allows reference to the assembler-provided
definition, and another that allows down-stack calls to aid in
overloading). Barring such escapes, the macro expander uses what's on
top of the stack for each opcode. Macro definitions and a couple of
other things like a deferred-text facility push definitions on an
opcode stack of definitions, and pseudo-ops replace or pop them. If
you can get a copy of the SPASM documentation (Single Pass ASseMbler,
by John Ehrman &c., at one time of SLAC at Stanford U.) do read it.
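
As a rough illustration of point 2, here is a minimal Python sketch of
per-opcode definition stacks. The class name OpcodeStacks, its method
names, and the convention that a definition is a function taking
(stacks, level, operands) are all assumptions made for this example;
none of them is taken from SPASM or any other real assembler.

```python
class OpcodeStacks:
    """One stack of definitions per opcode name (hypothetical sketch)."""

    def __init__(self, builtins):
        # Level 0 of each built-in stack is the assembler-provided definition.
        self._builtin_names = set(builtins)
        self._stacks = {name: [defn] for name, defn in builtins.items()}

    def push(self, name, defn):
        """A macro definition pushes a new top-of-stack definition."""
        self._stacks.setdefault(name, []).append(defn)

    def pop(self, name):
        """Remove the top definition; the built-in level cannot be popped."""
        stack = self._stacks[name]
        if len(stack) == 1 and name in self._builtin_names:
            raise ValueError(f"cannot pop built-in definition of {name!r}")
        stack.pop()

    def depth(self, name):
        """Macro-time enquiry: how many definitions are stacked for `name`?"""
        return len(self._stacks.get(name, []))

    def call(self, name, operands, level=None):
        """Use the top definition by default; `level` allows a down-stack
        call, and level 0 reaches the assembler-provided definition."""
        stack = self._stacks[name]
        if level is None:
            level = len(stack) - 1
        return stack[level](self, level, operands)


def builtin_mov(stacks, level, operands):
    # Stand-in for real code emission: return a listing line.
    return ["mov " + ", ".join(operands)]


def traced_mov(stacks, level, operands):
    # Overload: add a comment line, then defer to the definition below.
    return ["; mov traced"] + stacks.call("mov", operands, level - 1)


stacks = OpcodeStacks({"mov": builtin_mov})
stacks.push("mov", traced_mov)
overloaded = stacks.call("mov", ["ax", "bx"])
original = stacks.call("mov", ["ax", "bx"], level=0)
```

Passing the current level to each definition is just one way to make
down-stack calls possible; a real assembler would more likely provide
a notational escape in the source language for this.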

3. Multiple named location counters within a segment or section make
assembly of out of line constants and routines much easier. See the H
or HLASM assemblers.
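
To make point 3 concrete, the following Python sketch shows one way
multiple named location counters might behave. The Section class, the
counter names "main" and "consts", and the rule that counters are laid
out in order of first use are assumptions for illustration only; they
are not a description of how the H or HLASM assemblers work internally.

```python
class Section:
    """A section with several named location counters (hypothetical)."""

    def __init__(self):
        # Counter name -> bytes emitted under it, in order of first use.
        self._chunks = {}
        self._current = None
        self.use("main")

    def use(self, name):
        """Switch to the named location counter, creating it if needed."""
        self._current = self._chunks.setdefault(name, bytearray())

    def emit(self, data):
        """Append bytes under the current counter; return their offset
        relative to the start of that counter."""
        offset = len(self._current)
        self._current.extend(data)
        return offset

    def layout(self):
        """Concatenate the counters and report each one's base offset."""
        bases, image = {}, bytearray()
        for name, chunk in self._chunks.items():
            bases[name] = len(image)
            image.extend(chunk)
        return bases, bytes(image)


sec = Section()
sec.emit(b"\x90")        # code under "main"
sec.use("consts")
sec.emit(b"\x2a")        # an out-of-line constant
sec.use("main")
sec.emit(b"\xc3")        # more code, contiguous with the first byte
bases, image = sec.layout()
```

The constant is written in the source right where it is needed, yet it
ends up after all of the code in the final image.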

4. Pay attention to the linker(s) to be used after your assembler,
and to all the object code structures it accepts to support high level
languages. This includes things like COMMON regions; storage mappings
where names within are resolved as offsets for use as constants at
execution time, but where no storage is allocated at link or load
time; code sections like load-time initialization routines to be
concatenated or chained together at link time; ....

5. Watch the interpretation of macro arguments for consistency. What
are the delimiters of an argument? How are list and sublist
arguments formed? Are they consistent with any list forms used in
the raw assembly language (so that the operators or pseudo-ops of such
lists may be overloaded by a macro def.)? Is it possible to extract
a subargument or sublist in a macro and pass it to another macro as an
argument or list?
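
One possible set of answers to the questions in point 5 can be
sketched in a few lines of Python. The syntax assumed here (commas
delimit arguments, parentheses form sublists) and the function names
split_args and sublist are choices made for this example, not the
rules of any particular assembler.

```python
def split_args(text):
    """Split on commas that are not nested inside parentheses."""
    args, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(text[start:i].strip())
            start = i + 1
    args.append(text[start:].strip())
    return args


def sublist(arg):
    """Strip one level of parentheses and split the result, so that a
    sublist can be handed on to another macro as a list of arguments."""
    if arg.startswith("(") and arg.endswith(")"):
        return split_args(arg[1:-1])
    return [arg]


outer = split_args("eax, (1,(2,3)), label")
inner = sublist(outer[1])
```

Because sublist uses the same splitting rule as the top level, a
sublist extracted in one macro is a well-formed argument list for the
next. The sketch does not handle an expression such as (a)+(b), which
begins and ends with parentheses without being a list.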

6. In your consideration of constants and literals (terminology will
vary here; this is my usage), distinguish between the two. The former
include arithmetic, character and bit strings, are devoid of any
constraints of computer storage and have very flexible types (is 1.0
packed decimal or float, single, double, or quad precision?). They
may be used to represent initial or assumed values of (typed) data
storage. The latter include such constraints, and although they are
implicitly typed, they may be used for the same purposes but with the
conversions (truncation, shifts, sign propagation or zero fill)
implied by the storage representations or without such (bitwise

7. The assembler needs assembly-time (macro-time) variables.
Consistent with 6 above, it needs to be able to handle logical,
arithmetic, bit and character string operations so as to be able to
store results and _compute_ what to do with various values.
Note that computation at macro time is every bit as important as
substitution. See any of the assemblers referenced above.
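
A small example of computing, rather than merely substituting, at
macro time: the hypothetical macro below picks different instructions
depending on an arithmetic test of its argument. It is written in
Python purely as a sketch; the name expand_mul_const and the choice of
x86 instructions are assumptions for illustration.

```python
def expand_mul_const(reg, factor):
    """Hypothetical macro body: multiply `reg` by an assembly-time constant."""
    if factor > 0 and (factor & (factor - 1)) == 0:
        # Power of two: compute the shift count at macro time.
        shift = factor.bit_length() - 1
        return [f"shl {reg}, {shift}"] if shift else []
    # General case: fall back to a multiply instruction.
    return [f"imul {reg}, {reg}, {factor}"]
```

Plain text substitution could only paste the factor into a fixed
template; choosing between a shift, a multiply, and no code at all
needs arithmetic and logical operations on assembly-time values.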

8. Finally, this. Consider the assembler divided into two parts: an
input routine which, when called, provides source lines to the output
main-line part, and that output main-line part, which does the code
generation and object and listing output. The latter operates in two
or more modes, (a) simple assembly and (b) macro _definition_ and
related activities like processing of some pseudo-ops or collection of
deferred text for later emission. The input part includes every
function which provides lines of input, the source and include (copy)
file reader, the deferred text emitter, AND the MACRO EXPANDER. The
input routine has a stack of instances of these -- each instance with
parameters. When called for a line of input, it uses whatever routine
is on the top of this stack to deliver a line of code for the main
line code. The stack is primed with (bottom level) a routine to end
the assembly, and (first level) a copy of the file reader with the main
source code file, so that initially, the assembler will read from the
main source file.

If the output main-line routine discovers a macro _call_, it pushes an
instance of the input routine's macro expander on the input routine's
stack with the name of the macro called and the macro call parameters,
and goes to the top of the main line where the main line calls the
input routine for a line of input; the macro expander then works until
it delivers a line of text for assembly. The input routine will
deliver further lines of text on each call until the macro expansion
ends or terminates; then it pops the stack of input mechanisms and
goes to the top of the input routine where it consults what's on top
of the stack for the next line for the output main line.
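
The input stack and the handling of a macro call can be sketched as
follows. This is a simplified Python model, not a definitive design:
the names InputRoutine, expand and assemble, the &1, &2 parameter
notation, and the use of None to stand for the bottom-level "end the
assembly" routine are all assumptions made for the example.

```python
class InputRoutine:
    """Delivers lines from whatever source is on top of the stack."""

    def __init__(self, main_source_lines):
        # First level: the reader for the main source file.
        self._stack = [iter(main_source_lines)]

    def push(self, source):
        self._stack.append(source)

    def next_line(self):
        while self._stack:
            line = next(self._stack[-1], None)
            if line is not None:
                return line
            self._stack.pop()      # this source is exhausted
        return None                # bottom level: end the assembly


def expand(body, args):
    """Macro expander: yields body lines with &1, &2, ... replaced."""
    for line in body:
        for i, arg in enumerate(args, start=1):
            line = line.replace(f"&{i}", arg)
        yield line


def assemble(source_lines, macros):
    """Main line: ask for a line, then either push an expander or 'assemble'."""
    inp = InputRoutine(source_lines)
    output = []
    while (line := inp.next_line()) is not None:
        words = line.split()
        if words and words[0] in macros:
            # Macro call: push an expander instance, go back to the top.
            inp.push(expand(macros[words[0]], words[1:]))
            continue
        if words:
            output.append(line)    # stand-in for real code generation
    return output


macros = {
    "PUSH2": ["push &1", "push &2"],
    "SAVE": ["PUSH2 eax ebx", "nop"],
}
listing = assemble(["SAVE", "ret"], macros)
```

Nested macro calls need no special handling here: a line produced by
one expander that turns out to be a macro call simply causes another
expander to be pushed on top of it.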

If the output main-line routine discovers the start of a macro
definition, it switches to macro definition mode, and then does one of
three things with each line: it stuffs 'unquoted' assembly text into
the macro definition, or stuffs individual 'quoted' lines of assembly
text or whole multi-line blocks of 'quoted' assembly text into the
macro definition, or interprets 'unquoted' macro-time pseudo-ops.
When it finds the end of the macro
def., it closes the macro, switches the output routine back to
assembly, and goes to the top of the output routine where it asks for
the next line of input. Thus a macro call within a macro definition
is expanded or not as the macro definition is being stored for later
use, according as the macro call is 'unquoted' or 'quoted' -- a
crucial decision in some circumstances.

Similar considerations apply to include (copy) files and deferred text
streams and to the 3*2 = 6 combinations of these.

If the input routine finds a block of quoted text, it strips off just
one level of quoting and delivers the line(s) to the calling main-line
output routine.

Why all this? It allows some crucial things, chief among which is the
ability of a macro expansion to define a macro -- a bit of a mind
bender at first, but essential for some advanced applications.
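
Here is a minimal Python sketch of a macro expansion defining a macro,
under one possible reading of the quoting scheme described above. The
MACRO and MEND pseudo-op names, the single-quote prefix as the quoting
mark, and the decision to strip the quote in the macro expander are
all assumptions made for the example; macro parameters are left out.

```python
QUOTE = "'"


def expander(body):
    """Deliver stored lines, stripping just one level of quoting."""
    for line in body:
        yield line[1:] if line.startswith(QUOTE) else line


def assemble_with_definitions(source_lines):
    sources = [iter(source_lines)]   # stack of input sources
    macros, output = {}, []
    defining = None                  # macro being defined, else None
    while sources:
        line = next(sources[-1], None)
        if line is None:
            sources.pop()
            continue
        words = line.split()
        if defining is not None:     # macro definition mode
            if words == ["MEND"]:
                defining = None
            elif words and words[0] in macros:
                # Unquoted macro call: expanded now, into the definition.
                sources.append(expander(macros[words[0]]))
            else:
                # Quoted or ordinary text: stored for later use.
                macros[defining].append(line)
        elif words[:1] == ["MACRO"]:  # assumes a name follows MACRO
            defining = words[1]
            macros[defining] = []
        elif words and words[0] in macros:
            sources.append(expander(macros[words[0]]))
        elif words:
            output.append(line)      # stand-in for real code generation
    return output, macros


source = [
    "MACRO make_inner",
    "'MACRO inner",
    "'nop",
    "'MEND",
    "MEND",
    "make_inner",
    "inner",
    "ret",
]
listing, macros = assemble_with_definitions(source)
```

The quoted MACRO and MEND lines are stored while make_inner is being
defined; when make_inner is later called, they arrive unquoted at the
main line, which then defines inner exactly as if the lines had come
from the source file.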

John H. Lindsay
Department of Mathematics and Computer Science

Phone: (613) 541-6000--1--6419
Fax: (613) 541-6584
[Well, that's the big blob approach. The alternative is the Unix "m4 | as"
which makes a small fast assembler, and keeps the macro processor out of it.
Yes, I know there are a few things that are easier with an integrated system,
but the speed of the PDP-11 Unix assembler was amazing. -John]

