Re: Intermediate code representations (detailed)


Newsgroups: comp.compilers
From: tarvydas@tsctrl.guild.org (tarvydas)
Organization: Compilers Central
Date: Thu, 23 Jul 1992 16:51:57 GMT
Keywords: Eiffel, design, bibliography

I've had great pleasure/success using the ideas of Ric Holt (University of
Toronto) and Jim Cordy (Queen's University, Kingston, Ontario). I've used
these same ideas in the past to build most of an experimental Eiffel
compiler (a relatively pure OOP language).


The nucleus for compiler data type representation can be found in "data
descriptors". Your semantic analyzer should contain at least three "tables"
(semantic domains) - the symbol table, the type table and the scope table.
The job of the semantic analyzer is to connect elements of these tables.
For example, a struct:


                struct s {
                                int a;
                                char b;
                };


would be represented by:


                symbol          scope           type
                ======          =====           ====
                s ----------------------------> struct -+
                                                        |
                      +------------[ . ] <--------------+
                    / +------------[ . ]
                  / /
                a ----------------------------> int
                  /
                b ----------------------------> char




You can see that, as this method scales up to handle larger examples,
specific types will be shared by many symbols. Complex structures, eg.
structs of arrays of structs, just fall into place (eg. imagine that the
"int" entry above happens to be another struct, with links to another
scope and its symbols).
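
To make the linkage concrete, here is a minimal C sketch of the three tables
for the struct example above. The type names and the exact way they link are
my own assumptions (not Holt's or Cordy's layout); the point is only that
symbols, scopes and types cross-reference each other by pointer.

        /* A minimal sketch - my own naming, not the Holt/Cordy layout. */

        typedef struct TypeEntry   TypeEntry;
        typedef struct SymbolEntry SymbolEntry;
        typedef struct ScopeEntry  ScopeEntry;

        struct SymbolEntry {
            const char *name;       /* "s", "a", "b", ...                */
            TypeEntry  *type;       /* symbol table -> type table link   */
        };

        struct ScopeEntry {
            SymbolEntry **members;  /* scope table -> symbol table links */
            int           count;
        };

        struct TypeEntry {
            enum { TY_INT, TY_CHAR, TY_STRUCT } kind;
            ScopeEntry *fields;     /* type -> scope link (structs only) */
        };

        /* struct s { int a; char b; }; built by hand: */
        TypeEntry    ty_int  = { TY_INT,    0 };
        TypeEntry    ty_char = { TY_CHAR,   0 };
        SymbolEntry  sym_a   = { "a", &ty_int  };
        SymbolEntry  sym_b   = { "b", &ty_char };
        SymbolEntry *s_members[] = { &sym_a, &sym_b };
        ScopeEntry   s_scope = { s_members, 2 };
        TypeEntry    ty_s    = { TY_STRUCT, &s_scope };
        SymbolEntry  sym_s   = { "s", &ty_s };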


Scope entries can be implemented as linked lists, as small hash tables, or
as blocks of integers/pointers which index/point to entries in the symbol
table. The set of scopes visible at any point during the compilation (eg.
the global scope, the routine scope and, say, another nested scope) can be
represented by a stack of pointers to little scope hash tables, or by a
framed stack of indices/pointers-to-symbols (ie. information copied
directly from the scope table).
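
Here is a sketch of that stack-of-scopes idea, assuming the SymbolEntry and
ScopeEntry types from the sketch above (again, the names and the fixed-size
stack are my own simplifications):

        #include <stddef.h>
        #include <string.h>

        #define MAX_NESTING 32

        typedef struct {
            ScopeEntry *frames[MAX_NESTING];  /* global scope sits at index 0 */
            int         depth;
        } ScopeStack;

        void push_scope(ScopeStack *st, ScopeEntry *sc)
        {
            st->frames[st->depth++] = sc;
        }

        void pop_scope(ScopeStack *st)
        {
            st->depth--;
        }

        /* Search the visible scopes, innermost first. */
        SymbolEntry *lookup(ScopeStack *st, const char *name)
        {
            for (int level = st->depth - 1; level >= 0; level--) {
                ScopeEntry *sc = st->frames[level];
                for (int i = 0; i < sc->count; i++)
                    if (strcmp(sc->members[i]->name, name) == 0)
                        return sc->members[i];
            }
            return NULL;    /* not declared in any visible scope */
        }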


As your semantic analysis proceeds, it filters out variables and converts
them to data descriptors, using the information in the tables. Later
"passes" tweak the data descriptors, eg. the allocator decides where each
piece of data should be allocated and modifies the corresponding data
descriptors to contain this machine specific information. The coder
"pass" then takes the filtered program and emits code. The coder can do a
good job, without great complexity, because data descriptors give a
uniform representation of any compile-time object (ie. variables,
parameters, constants, registers, etc).
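
As a rough sketch of that pass structure (every name here is a placeholder
of mine, not an interface from Holt's or Cordy's work):

        /* Placeholder pass skeleton - my own names, not Holt/Cordy's.      */

        typedef struct DataDesc DataDesc;    /* descriptor record, shown below */

        typedef struct Op {                  /* one op of the filtered program */
            int        opcode;
            DataDesc  *operands[3];          /* every operand is a descriptor  */
            struct Op *next;
        } Op;

        /* semantic pass: looks variables up in the tables and replaces each
         * reference with a data descriptor                                   */
        Op   *semantic_pass(void *parse_tree);

        /* allocator: decides where each datum lives and rewrites the base/
         * displacement of the corresponding descriptors                      */
        void  allocator_pass(Op *program);

        /* coder: walks the filtered program and emits code, guided by the
         * (now machine-specific) descriptors                                 */
        void  coder_pass(const Op *program);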


A data descriptor is, roughly:


                @^k b.d.[x.s]


(the k is written as a superscript on the @ in Holt's papers; I'll write it
linearly as @^k here), where:


@^k represents the number of levels of indirection, k (0 for constants and
registers, 1 for "normal" variables in memory, 2 for variables pointed to
by pointer variables, etc).


b represents the base (null for constants and absolute addresses, a
register, another data descriptor or an abstract lexic base, eg. the
Global Lexic Base, the Lexic Base for the local variables of a certain
routine, etc).


d represents the displacement - a numeric displacement from the base,
plus a set of labels (for representing labelled data, offsets from the
".data" beginning address (_edata), and branch labels within the program
itself).


[x.s] represents an index register x plus a scale s (the multiplier).




Here are some simple data descriptor examples:


                68k-like operand                data descriptor
                ----------------                ---------------
                #5 (constant)                   @^0 .null.5   (index is null, so I dropped it)
                1234 (absolute addr)            @^1 .null.1234
                a4                              @^0 .a4.0
                -8(sp)                          @^1 .sp.-8
                0(-8(sp))                       @^2 .sp.-8
                8(a4,d1:L)                      @^1 .a4.8.[d1.4]
                xyz (pc-rel label)              @^1 .pc.xyz
                5th local within routine #92    @^1 .L92.5


The translation of this notation into a C struct is straightforward. In
reality, you need to add a size field, and it's sometimes convenient to add
a machine-type field (b/w/l) to help the coder make some of its decisions
more efficiently.
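
For instance, a rough sketch of such a struct (the field names and the
base/label encodings are my own guesses, not Holt's exact record):

        /* A rough sketch of a data descriptor record - field names and the
         * base/label encodings are my own guesses, not Holt's layout.      */

        typedef enum { BASE_NULL, BASE_REG, BASE_LEXIC, BASE_PC } BaseKind;

        typedef struct DataDesc {
            int         k;          /* levels of indirection (the @ exponent) */
            BaseKind    base_kind;  /* null, machine register, lexic base, pc */
            int         base;       /* register number or lexic-level number  */
            long        disp;       /* numeric displacement d                 */
            const char *label;      /* optional label part of the displacement */
            int         index_reg;  /* index register x, or -1 if none        */
            int         scale;      /* scale factor s (1, 2, 4 or 8)          */
            int         size;       /* operand size in bytes                  */
            char        mtype;      /* machine-type hint: 'b', 'w' or 'l'     */
        } DataDesc;

        /* e.g. -8(sp), a 16-bit local, might be:
         *   { 1, BASE_REG, SP, -8, NULL, -1, 1, 2, 'w' }
         * where SP stands for whatever number encodes the stack pointer.    */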


You can develop a whole data descriptor algebra which then tells you how
to emit good code. For example, imagine a local array of words (16 bits),
starting at the -8th byte from the stack pointer. The array starts at
index 1. We wish to access element 3. The data descriptor algebra shows
that a manifest indexing operation can be done without emitting any code:


                array base (first byte)  =  @^1 .sp.-8

                index (constant 3)       =  @^0 .null.3

                indexing operation =

                                take the address of the array, scale the index to
                                origin 0, multiply the index by its scaling factor (2
                                in this case), add it to the address and promote the
                                result back to being a variable

                = makevar( addrof(@^1 .sp.-8) + (@^0 .null.3 - @^0 .null.1) * @^0 .null.2 )

                = makevar( @^-1 * @^1 .sp.-8 + ... )

                = makevar( @^0 .sp.-8 + ... )

                = makevar( ... + (@^0 .null.2) * @^0 .null.2 )

                = makevar( ... + (@^0 .null.4) )

                = makevar( @^0 .sp.-8 + @^0 .null.4 )

                = makevar( @^0 .sp.(-8+4) )

                = makevar( @^0 .sp.-4 )

                = @^1 * @^0 .sp.-4

                = @^1 .sp.-4

                = -4(sp).




We've just watched an automatable compile-time calculation of an indexing
operation which didn't emit any code. The resulting operand, -4(sp), goes
on to be used in further operations and will be emitted as necessary - the
whole array indexing operation was subsumed, at compile time, into a
single operand. It should be obvious that variable (non-manifest) indices
can't be folded exactly this way, but, for example, they could be folded
into the index field of the data descriptor: the code emitted might be
"move.l var,d2" and the resulting data descriptor might be
@^1 .sp.-8.[d2.2], which still emits the least amount of code possible for
this situation (I'm just trying to make a point - if you think that a
better code sequence is possible, you're welcome to twist the decision
trees :-)).
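
To show how mechanical the manifest case is, here is a hedged C sketch of
that constant-index fold, reusing the hypothetical DataDesc record from
above (fold_index is my own name, not an OCG routine):

        #include <stdbool.h>

        /* Fold array[idx] into the array's descriptor when the index is a
         * manifest constant (k == 0, no base, no index register).  This is
         * just the exponent/displacement arithmetic from the example above. */
        bool fold_index(DataDesc *array, const DataDesc *idx,
                        long origin, int elem_size)
        {
            bool manifest = (idx->k == 0 &&
                             idx->base_kind == BASE_NULL &&
                             idx->index_reg < 0);
            if (manifest) {
                /* addrof() then makevar() cancel (k goes 1 -> 0 -> 1),
                 * so only the displacement changes.                     */
                array->disp += (idx->disp - origin) * elem_size;
            }
            return manifest;    /* caller must emit indexing code if false */
        }

        /* Usage, for the word array at -8(sp) with origin 1, element 3:
         *   DataDesc arr = { 1, BASE_REG, SP, -8, NULL, -1, 1, 2, 'w' };
         *   DataDesc idx = { 0, BASE_NULL, 0, 3, NULL, -1, 1, 2, 'w' };
         *   fold_index(&arr, &idx, 1, 2);      // arr.disp becomes -4     */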


What I've just shown are, literally, some of the steps that the OCG
(Orthogonal Code Generator) technology takes when optimizing code -
addressing-mode arithmetic becomes a no-brainer and can be relegated to
just a few OCG decision trees when data descriptors are used.




Your intermediate code problem can be solved by looking at the operations
which the Orthogonal Code Generator can accept. The operations use data
descriptors as operands - you can interpret the result if you wish, and
you can use the OCG to compile the operations into locally good code.


Rosselet's paper on PT Pascal, although old and applied to compiling a
boring language, is an excellent learning tool in that it has pictures of
these data structures and has the S/SL source code for the whole compiler
(you have to get S/SL via anonymous ftp and you have to write the simple
S/SL mechanisms yourself in C or something). PT's coder predates OCG, but
still demonstrates the evolutionary path which spawned the OCG.


Paul Tarvydas
TS Controls
tarvydas@tsctrl.guild.org






R. C. Holt. "Data Descriptors: A Compile-Time Model of Data and
Addressing". ACM Transactions on Programming Languages and Systems,
Volume 9, pages 367-389, July 1987.

J. R. Cordy and R. C. Holt. "Code Generation Using an Orthogonal Model".
Software - Practice and Experience, Volume 20(3), pages 301-320, March 1990.

"An Orthogonal Model for Code Generation". Technical Report CSRI-177,
Computer Systems Research Institute, University of Toronto, January 1986.

Alan Rosselet. "PT: A Pascal Subset". Technical Report CSRG-119,
University of Toronto, September 1980.
