From: Rock Brentwood <markwh04@yahoo.com>
Newsgroups: comp.compilers
Date: Mon, 12 Jan 2009 13:13:02 -0800 (PST)
Organization: Compilers Central
Keywords: design, question, assembler, optimize
Posted-Date: 13 Jan 2009 07:31:49 EST
0. Background
Back in 2002 I posted a brief note here on a rewriting of a
disassembler (the DAS 8051 disassembler), which put it in a position
where it could (1) be easily ported to other microprocessors (which it
was) and (2) converted into a decompiler (which I started to do --
resulting in DAC).
The irony is that the only remaining extant versions of the 2002 DAS
(and DAC) are executables sitting on one of my old machines, which I
just managed to get onto my newer machines. The decompiler and
disassembler need to be decompiled and disassembled; hence the renewed
interest.
The revisions that had been done on them grew partly out of a
formalism I started developing in 2000. The following material
describes this formalism, which never saw the light of day beyond my
notes, until now.
First, some history...
Contents:
0. Background
1. Magmas and Algebras
2. Alpha versus Beta Levels; Control Flow by Infinitary Expressions
3. The Extended Landin Correspondence -- Statements as Contexts
4. Object-Lifetimes via the "Free Variable" Predicate
5. A "Free" and "Bound" Calculus for Control Flow Constructs
6. The Magic Algebra
7. Extension to Inter-Procedural Control Flow
8. Multiple Objects and the Extended Magic Algebra
9. Primitive Statements; the Handling of Object-Subobject Relations
10. Conclusion
At the time, I was involved in a project whose very conception raised
a significant amount of attention from the executive staff at the
company I was at. The idea was simply to take all the legacy firmware
that had been written (all in assembly), run a simple decompiler on it
(in fact, the "DAS" 8051 disassembler that is part of the CSD4 8051
archive, retrofitted to produce simple C-like code), and port it over
to a PC development platform.
The company makes LED video displays, and everything about them is
done in-house. Not only is the hardware designed and constructed
in-house, but it contains virtually no direct support for
raster-scanning. It's all done at the firmware level, which is also
in-house. That means: even
the production of different shades of colors, since the individual LED
units are binary and only have an ON or OFF state. Shading is done by
time-dithering. That's atop the CRT-like raster-scanning, which is
also at the firmware level.
At the time, an experimental hardware-programmed card had been
developed by the company's European subsidiary (which, however, also
has only minimal support for raster scanning and color dithering). I
had already ported MY assembly over in a very short time to the PC.
The simplicity of the task arose from the fact that it was written
using the CAS 8051 assembler (which I developed). As its name
suggests, CAS provides one with a source language with many of the
niceties seen in C. So, translating the actual text is trivial.
The non-trivial part was porting over the underlying kernel (the WSMK,
or "World's Smallest Multitasking Kernel", which I also developed). To
call the kernel a microkernel would be understating the point. The
listing of WSMK is about 1 page and it has only 3 1/2 system calls:
SPAWN (to spawn a new process), EXIT (which happens when a process
does a "return" at the top-level), PAUSE (to await an event) and
RESUME (to kick-start someone waiting on an event). Since events can
be triggered by interrupts at multiple priority levels in the
underlying hardware (the 8051, which also provides limited support for
"register windows", amply used in the kernel), this provides the basis
for a co-routining multi-threaded kernel that is nonetheless pre-
emptive.
This was ported to DOS and everything was done on top of DOS.
With this done, the general idea was now to take the legacy firmware,
disassemble it to C, massage it over and run it on the PC card. This
would be the first step in launching an evolutionary transformation of
the entire range of legacy code into C onto a PC platform.
However, given that the legacy is an accrual of code written by
multiple people at various times, and that the original assumptions
that went into its construction had long since broken, the whole
thing had long been in a morbid tangle. So, this made necessary a
detailed analysis down to the data flow level; as well as an analysis
of the stack usage (which, itself, is extremely important if one wants
to seek out and eliminate all the leakages from strict "call-return"
nesting that existed in the code).
The lynchpin of this analysis was a transformation of the program into
a series of algebraic equations. Each jump
X: jump Y
can be treated as an equation X = Y. Each statement
X: S; Q:
can be treated as an equation X = S[Q], where Q is the address label
following the statement S; and each branch
X: jump on condition C to Y; Q:
can be treated as the equation X = C? Y: Q, where Q is the address
label following the statement.
*IF* properly nested call-return format is restored and all the
leakages eliminated, then a "return" statement for a procedure P that
had been invoked as
X: call P; Q:
would be bound to Q, thus resulting in the equation
return = Q.
Thus, each procedure P is a control-flow expression with a "hole" in
it, marked by the "return" statement. If P is defined with the
statement body B[return], then its call
X: call P; Q:
is treated as the expression B[Q].
There is one equation posed for every single address -- even
overlapping addresses. (The 8051 supports 65536 addresses for code, so
that's 65536 equations). Overlapping addresses are treated as multiple
program segments superposed on top of one another. Out of the entire
set of equations, only those accessible from a designated subset of
"entry" points are then included. The check for overlapping is used by
the DAS disassembler to ensure that the entry points are
consistently chosen. DAS, itself, does not carry out the full analysis
of setting up 65536 equations, but its design is nonetheless based on
the description just given.
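To make the bookkeeping concrete, here's a rough Python sketch of the
equation-per-address idea (my illustration only -- this is not DAS
itself, and the instruction forms and field names are hypothetical):

    # One control-flow equation per reachable address, starting from a
    # designated set of entry points.
    def equations(program, entries):
        # program: dict address -> ("jump", Y)
        #                        | ("stmt", S, Q)        X: S; Q:
        #                        | ("branch", C, Y, Q)   X: jump on C to Y; Q:
        eqns, work = {}, list(entries)
        while work:
            x = work.pop()
            if x in eqns or x not in program:
                continue
            op = program[x]
            if op[0] == "jump":                  # X = Y
                eqns[x] = op[1]
                work.append(op[1])
            elif op[0] == "stmt":                # X = S[Q]
                _, s, q = op
                eqns[x] = ("apply", s, q)
                work.append(q)
            elif op[0] == "branch":              # X = C? Y: Q
                _, c, y, q = op
                eqns[x] = ("cond", c, y, q)
                work += [y, q]
        return eqns

Only the equations reachable from the entry points are generated,
matching the pruning described above.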
The key point in reconstructing the routines is determining how the
data flows -- particularly how the objects live and die (i.e. are read
from and written to). On top of the system of equations provided by
the control-flow structure, one can formulate a system of algebraic
equations that captures the variable usage.
Once this information is intact, it's then possible to uncover the
formal parameters to be attached to the subprograms (and to make
decisions on which variables ought to be treated as global, static,
auto, etc.)
1. Magmas and Algebras
I'm going to describe an approach here to control flow analysis
which eliminates the dynamic element and incorporates an algebra and
calculus into the analysis. The foundation of this approach is that
each thread of control is treated as a single expression -- one that's
infinite in size, as I'm about to explain.
In functional languages, or other notations or calculi founded on the
Lambda calculus, there is a distinction drawn between the "alpha"
level and "beta" level. The former lies in the realm of syntax (more
precisely: in "abstract" syntax). In it, the alpha conversion rule
applies. This carries out a syntactic transformation whereby bound
variables are renamed. The effect of the alpha conversion is such that
not just the MEANING of the term remains the same, but its FORM, at a
suitably abstract level -- hence the designation as a syntactic rule.
The bound variables are "not there". A term is regarded as an
equivalence class of all its concrete forms, each one convertible to
the other by alpha conversion; each one committing to specific names
for the bound variables.
In contrast, the beta level is the semantic level. The
operative rule, beta conversion, still preserves the MEANING of the
term, but not its form. The transformation of form defines the essence
of computation.
In algebraic theory, these differences can be lucidly and simply
explained in a way that removes the fog of philosophical haze. Terms
are elements of an algebra. For a language, like C, the actual
notation of the language is meant to embody a specific many-sorted
algebra that contains certain operators that act on certain types to
produce results of certain types. The language, itself, is then built
around this algebraic theory with the following modifications and
extensions: (a) the introduction of a dynamic element atop the static
algebraic notation -- control flow; and (b) the imposition of
limitations that restrict the algebra to a finite algebra or a finite
approximation of an infinite algebra.
The distinguishing features of an algebra are (1) its set of "sorts"
or types and its "signature", which comprises the listing of its
operations and what types they apply to and result in; (2) a set of
relations or axioms that define the equivalence of terms and serve as
the basis for the transformations that go to make up "computation".
If we suspend the axioms and take only the algebra with its sorts and
signature, but with no axioms, then the result is a generalization of
what's known as a MAGMA. A magma (in this general sense) is simply an
algebraic theory consisting of a signature but no axioms.
The designation "magma" probably arises from that fact that this gives
us the substructure on top of which all algebras are constructed.
Hence, the magma for a group G would be a signature containing the 3
group operations
Identities e in G
Inverses g in G |-> g' in G
Products (g1,g2) in GxG |-> g1 g2 in G.
An equivalent theory can be posed with just the one operation
Left Quotient (g1,g2) in GxG |-> g1\g2 in G
or
Right Quotient (g1,g2) in GxG |-> g1/g2 in G
(provided we also add a postulate that G be a non-empty set;
alternatively, we can get around this by explicitly adding the
identity to the signature). With suitable axioms this can be made to
emulate a group
(in fact, the axioms for the quotient are MUCH simpler than for the
product).
Every theory that can be expressed for any language possessing this
signature then comes down to an imposition of an equivalence relation.
This relation is known as a CONGRUENCE. A congruence is an equivalence
relation whose equivalence classes are preserved by all operations.
So, here, a congruence for this signature would be a relation (a==b)
satisfying the properties:
a==a
a==b -> b==a
a==b & b==c -> a==c
a==b -> a'==b'
a==b & a'==b' -> aa' == bb'.
The mapping of a magma M with this signature onto its equivalence
classes M/== = { [m]/==: m in M } then gives us the archetypical case
in point of what's termed a HOMOMORPHISM. The process reduces the
magma M to its quotient M/==, as a homomorphic image.
Since congruence relations are closed under arbitrary intersections,
then it follows that they form a complete lattice. Hence, one can also
talk about taking the "smallest" relation that contains a set of
equations. Out of this comes the notion of an algebra and a
PRESENTATION for an algebra. For instance, to define a group G, we can
take the quotient+identity signature and define the congruence
relation == as that containing the following relations
g/e==g
g/g==e
(g/k)/(h/k)==g/h
The result makes the reduced magma G/== into an algebra equivalent to
a group, in which the group operations are defined by
g' = (g/g)/g
g1 g2 = g1/((g2/g2)/g2)
It then follows that these operations define a group and that, under
these definitions, one also has the identity g/h = g h', as expected;
as well as g/g = e.
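As a quick sanity check (my illustration, not in the original notes),
one can emulate this with the integers mod 7, taking the right
quotient g/h to be subtraction mod 7; the derived operations then
recover the additive group:

    # Integers mod 7 under the right quotient g/h = (g - h) mod 7.
    n = 7
    quo = lambda g, h: (g - h) % n         # g/h
    inv = lambda g: quo(quo(g, g), g)      # g' = (g/g)/g
    mul = lambda g, h: quo(g, inv(h))      # g1 g2 = g1/((g2/g2)/g2)
    e = quo(0, 0)                          # identity, g/g = e

    pairs = [(g, h) for g in range(n) for h in range(n)]
    assert all(mul(g, h) == (g + h) % n for g, h in pairs)     # product = addition
    assert all(quo(g, h) == mul(g, inv(h)) for g, h in pairs)  # g/h = g h'
    assert all(quo(g, g) == e for g in range(n))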
Finally, we get to the point: the FREE magma <X> generated by a set X
is none other than the set of terms built up out of the operations of
the given signature, with "variables" taken from the set X. When we go
one step further and add in a functional calculus with a lambda
operator, then this is where the distinction between free and bound
variables is introduced. Then, we would also require that terms be
alpha convertible, even in the free magma.
2. Alpha versus Beta Levels; Control Flow by Infinitary Expressions
So, with this background, we can draw a clear distinction between the
alpha and beta levels as follows:
Alpha Level -- the equivalences at the level of the magma
Beta Level -- the equivalences under the algebraic theory that are
defined atop the magma.
It's at this point that we bump (almost accidentally) right into the
middle of control flow analysis; and it's also here that we can draw
a sharp distinction between recursive functions or subroutines and
another concept that will be introduced here -- the control-flow
expression.
If we define a functional notation at the beta level, with recursion,
we have the workings of the lambda calculus built atop the given
signature. It's out of this that one normally expects to see a
functional notation and language arise.
The question that naturally arises is this: if, on the other hand, we
try to define a fixed-point closure of the FREE magma, then what do
we get? Since the free magma <X> is nothing more than an algebraic
rendering of the concept of a "term language", whose elements are the
"trees" representing the structures of the individual terms, what
inhabits the fixed-point closure?
If we assume closure under finite systems of fixed point relations,
the resulting solutions are infinite extensions of terms whose tree
structures are known in the computer science literature as RATIONAL
TREES. The distinguishing feature of a rational tree is that, though
it may be infinite in size, it contains only a finite number of
distinct sub-trees.
For the expression calculus underlying a language like C, one has a
conditional operator A? B: C. Under a fixed-point closure, one may
conceive of a term of the form
A? (A? (A? (A? ...): C): C): C.
If the entire term is denoted X, then the sub-terms are A, C (and
their sub-terms), as well as A? X: C, which is just X, itself.
This is "control-flow" at its purest. Each expression or sub-
expression may be thought of as a "label". A reference to the sub-
expression is a "goto". A rational expression, if infinite, contains
cyclic references in it -- i.e., cyclic control flow structures.
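For instance (a sketch of my own, in Python), the rational tree
X = A? X: C can be represented as a term graph with a cycle; walking
it with a visited set confirms it has only finitely many distinct
sub-terms:

    # The infinite term A? (A? (A? ...): C): C as a cyclic term graph.
    class Cond:
        def __init__(self):
            self.test = self.then = self.else_ = None

    A, C = object(), object()            # opaque leaf sub-terms
    X = Cond()
    X.test, X.then, X.else_ = A, X, C    # X = A? X: C -- the cycle is the "goto"

    def subterms(t, seen=None):
        seen = set() if seen is None else seen
        if id(t) not in seen:
            seen.add(id(t))
            if isinstance(t, Cond):
                for u in (t.test, t.then, t.else_):
                    subterms(u, seen)
        return seen

    print(len(subterms(X)))              # 3 distinct sub-terms: X itself, A and C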
A larger class, still, of expressions is arrived at by allowing fixed-
point abbreviations with GNU-breviations (i.e. macros that have
infinite self-referential expansions that may unfold into an infinite
number of distinct sub-terms). This embodies the notion of
INTER-procedural control flow, just as the rational expression
embodies the notion of INTRA-procedural control flow.
3. The Extended Landin Correspondence -- Statements as Contexts
It is out of this that a natural SYNTACTIC (not semantic, but
syntactic) representation for control flow arises. When this is
grafted onto an algebra that, itself, is built atop the magma, the
result is an extension of the algebra to a control-flow language of
such a nature, that the syntactic representation leads to direct
realizations of all of the known formalisms for computational
semantics.
The basis of this correspondence is as follows: associated with each
statement S, in the control flow language is a "context" S[] (i.e. a
function defined on the magma) such that the result of substituting an
expression Q into the context is an expression S[Q] that encapsulates
the "value of Q after S is executed".
For assignment statements involving simple variables, the context
arose from the functional<->imperative correspondence first described
by Landin in the 1960's
x = e <-> (lambda x []) e.
One may conceive of a more complex extension of the lambda calculus
where not just simple variables but objects (and subobjects) are
permitted under the binding of the lambda. For instance, if a(x) is
the function underlying an array a[x], then one would have an
equivalence of the form
(lambda a[e].f) = lambda x. (lambda a.f)([e]<-x,a)
where ([e]<-y,a) is the array whose underlying function is given by
([e]<-y,a) = lambda x.(x==e? y: a(x))
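In Python terms (a sketch under the assumption that an array is just
its underlying function), ([e]<-y,a) is an ordinary functional update:

    # ([e]<-y,a): a new underlying function returning y at index e,
    # deferring to a everywhere else.
    def update(e, y, a):
        return lambda x: y if x == e else a(x)

    a = lambda x: 0          # an "array" that is 0 everywhere
    b = update(3, 42, a)     # b = ([3]<-42, a)
    assert b(3) == 42 and b(2) == 0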
A more elaborate set of equivalences can be posed for multiple access;
e.g.
(lambda a[e][f].g) = lambda x. (lambda a[e].g)([f]<-x,a(e)).
Further elaborations can be conceived of in which object-subobject
relations are enforced (e.g. with pointers, dynamic segments, etc.)
The point of all this is not to get into these intricacies, but to
exemplify the primordial case-in-point of the generalized "Landin
correspondence":
S[Q] is the result of applying statement S to an (infinite)
expression Q, where S[] is the context corresponding to S.
More elaborate formulae for contexts can be built up, e.g.
(S1;S2)[Q] = S1[S2[Q]]
(E?S)*[Q] = E? S[(E?S)*[Q]]: Q
S# = S[S#]
(;)[Q] = Q
(E?S)[Q] = E?S[Q]:Q
(E?S1:S2)[Q] = E?S1[Q]:S2[Q]
and so on.
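A minimal sketch of these formulae in Python (mine, for illustration),
treating a context as a function from expressions to expressions --
strings here, with the loop unfolded to a finite depth in place of a
true fixed point:

    seq  = lambda s1, s2: (lambda q: s1(s2(q)))   # (S1;S2)[Q] = S1[S2[Q]]
    skip = lambda q: q                            # (;)[Q] = Q
    cond = lambda e, s1, s2: (lambda q: e + "? " + s1(q) + ": " + s2(q))

    def star(e, s, depth):    # (E?S)*[Q] = E? S[(E?S)*[Q]]: Q, truncated
        def ctx(q):
            if depth == 0:
                return q
            return e + "? " + s(star(e, s, depth - 1)(q)) + ": " + q
        return ctx

    S = lambda q: "S[" + q + "]"
    print(star("E", S, 2)("Q"))   # E? S[E? S[Q]: Q]: Q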
Though we're only looking at single-threaded computation, it's also
natural to extend these deliberations to non-deterministic terms
(i.e. terms that designate entire sets of expressions), with the
following primitives
(exit)[Q] = {}
(spawn S)[Q] = S[Q] union Q
This will actually AID in the analysis to be formulated below, and (in
fact) proves to be a nearly-indispensable ingredient even for the
lesser task of writing an analysis out for the deterministic case.
4. Object-Lifetimes via the "Free Variable" Predicate
The rules we're seeking are to determine the activity of the
various objects in the language. We will assume that the underlying
lambda notation works with objects and subobjects, as well as simple
variables.
The analysis adopted here is based on the simple primitives: the
notions of FREE and BOUND variables. And it's here that the underlying
syntactic representation in terms of magmas makes its full power felt.
For, even though the usual definitions of "free" and "bound" are posed
only for the FREE magma defining a "term language", they apply intact,
just as well, to fixed-point closed magmas. That is, one can still
distinguish which variables in a given expression are free, and which
ones are bound even when the expression is infinite. That's because
freeness and boundness are determined by what expressions contain a
given sub-expression. All sub-expressions lie a finite number of
levels below the top-level expression. This is also true even if the
expression is not rational and has an infinity of distinct sub-
expressions.
Hence, the freeness and boundness predicates are well-defined for
infinitary expressions.
We start out by assuming that there is a family of "primitive"
statements; and that the context S[] of each such statement binds some
variables, while adding free variables.
Hence, if we denote F(Q) as the set of free variables of Q, we
postulate the existence of sets F(S) and B(S) such that
F(S[Q]) = F(S) union (F(Q) - B(S)).
For instance, an assignment to a simple variable x of the form x = 0
has F(x=0) = {}, B(x=0) = {x}.
If this, indeed, holds true then we can actually DERIVE the formulas
for the basic control constructs:
F((S1;S2)[Q])
= F(S1[S2[Q]])
= F(S1) u (F(S2[Q]) - B(S1))
= F(S1) u ((F(S2) u (F(Q) - B(S2))) - B(S1))
= F(S1) u (F(S2) - B(S1)) u (F(Q) - B(S2) - B(S1))
= (F(S1) u (F(S2) - B(S1))) u (F(Q) - (B(S2) u B(S1))),
(where I'm abbreviating "union" to "u" above, for brevity). Matching
to the formula
F((S1;S2)[Q]) = F(S1;S2) union (F(Q) - B(S1;S2)),
we find that
F(S1;S2) = F(S1) union (F(S2) - B(S1))
B(S1;S2) = B(S1) union B(S2).
Based on the analogy between the formula for F(S1;S2) and F(S[Q]),
we'll also assume that a "bound" operator B(Q) can be defined for
expressions which satisfies the identity
B(S[Q]) = B(S) union B(Q).
5. A "Free" and "Bound" Calculus for Control Flow Constructs
Second, we assume for the control flow extension that
F(Q1 union Q2) = F(Q1) union F(Q2)
F({}) = {}
Then we may derive the following
F(exit[Q]) = {} = F(exit) union (F(Q) - B(exit))
from which we obtain
F(exit) = {}
B(exit) = Omega = set of ALL objects
In turn, we have
F((spawn S)[Q]) = F(S[Q] union Q)
= F(S[Q]) union F(Q)
= (F(S) union (F(Q) - B(S))) union F(Q)
= F(S) union F(Q)
from which we derive
F(spawn S) = F(S)
B(spawn S) = {}.
If we treat the conditional as a derivative of the non-deterministic
branch, e.g. by the equivalence
x? S1: S2 = (spawn (x? S1: exit));(spawn (x? exit: S2))
whenever x is a simple variable, then we can write down a similar rule
F(x? S1: S2) = {x} union F(S1) union F(S2)
B(x? S1: S2) = {x} union (B(S1) INTERSECT B(S2))!
where the last formula is underscored for emphasis.
Out of this arises the following
F(E? S[Q]: Q) = F(E) union (F(S[Q] union Q) - B(E))
= F(E) union ((F(S) union F(Q)) - B(E))
= F(E) union (F(S) - B(E)) union (F(Q) - B(E)).
Hence
F(E?S) = F(E) union (F(S) - B(E)) = F(E;S)
B(E?S) = B(E)
Similarly, we get
F(E?S1:S2) = F(E) u ((F(S1) u F(S2)) - B(E)) = F(E;S1) u F(E;S2)
B(E?S1:S2) = B(E) u (B(S1) n B(S2)) = B(E;S1) n B(E;S2)
(abbreviating "union" and "intersection", resepctively, to "u" and
"n").
In order for this to be consistent with a B() operator on expressions,
we also have to adopt the following axioms:
B(Q1 union Q2) = B(Q1) intersect B(Q2)
B({}) = Omega.
Finally, to deal with fixed point relations, we adopt the postulate
that the infinitary expression corresponding to the expression is the
limiting case of all finite truncations of that expression where sub-
expressions are pruned off and replaced by {}.
Hence, for S#, we have the limiting case containing the set of
expressions
{}, S[{}], S[S[{}]], S[S[S[{}]]], ...
out of which we get for F():
{}, F(S), F(S), F(S), ...
and for B(), we get:
Omega, Omega, Omega, Omega, ...
The limiting relations are then taken to be the values for the
corresponding rational expression:
F(S#) = F(S),
B(S#) = Omega
For the loop construct, taking limits, we arrive at the following:
F((E?S)*) = F(E) union (F(S) - B(E)) = F(E?S)
B((E?S)*) = B(E) = B(E?S)
In summary, we have the following table
F({}) = {} B({}) = Omega
F(Q1 u Q2) = F(Q1) u F(Q2) B(Q1 u Q2) = B(Q1) n B(Q2)
F(S[Q]) = F(S) u (F(Q) - B(S)) B(S[Q]) = B(S) u B(Q)
F(exit) = {} B(exit) = Omega
F(spawn S) = F(S) B(spawn S) = {}
F(S1;S2) = F(S1) u (F(S2) - B(S1)) B(S1;S2) = B(S1) u B(S2)
F(S#) = F(S) B(S#) = Omega
F((E?S)*) = F(E?S) = F(E;S) B((E?S)*) = B(E?S) = B(E)
F(E?S1:S2) = F(E;S1) u F(E;S2) B(E?S1:S2) = B(E;S1) n B(E;S2)
(again, with the abbreviations "u" and "n").
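The whole table can be transcribed directly into code. Here's a
sketch in Python (my own, for checking), with a statement or
expression represented by its pair (F, B) of sets; Omega stands in
for the set of all objects, here just a toy universe:

    OMEGA = frozenset({"x", "y", "z"})            # toy universe of objects

    empty = (frozenset(), OMEGA)                  # {} and exit alike
    def apply_(s, q):                             # S[Q]
        return (s[0] | (q[0] - s[1]), s[1] | q[1])
    def union(q1, q2):                            # Q1 union Q2
        return (q1[0] | q2[0], q1[1] & q2[1])
    def spawn(s):                                 # spawn S
        return (s[0], frozenset())
    def seq(s1, s2):                              # S1;S2
        return (s1[0] | (s2[0] - s1[1]), s1[1] | s2[1])
    def sharp(s):                                 # S#
        return (s[0], OMEGA)
    def loop(e, s):                               # (E?S)*
        return (seq(e, s)[0], e[1])
    def cond(e, s1, s2):                          # E?S1:S2
        return union(seq(e, s1), seq(e, s2))

    assign_x = (frozenset(), frozenset({"x"}))    # x = 0:  F = {}, B = {x}
    read_x   = (frozenset({"x"}), frozenset())    # a bare read of x
    assert seq(assign_x, read_x) == (frozenset(), frozenset({"x"}))

The final assert shows x dead on entry to the sequence: it is written
before it is ever read.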
6. The Magic Algebra
Now, it's at this point that the MAGIC ALGEBRA enters into the
picture. First, pick out a single object x. Using this object, we may
assign any of the following values to an expression Q and statement S:
<Q> = 1 if x is in neither F(Q) nor B(Q)
<Q> = r if x is in F(Q) - B(Q)
<Q> = w if x is in B(Q) - F(Q)
<Q> = rw if x is in B(Q) intersect F(Q).
Then, the surprise that awaits us is that there is a simple algebra
for {1,r,w,rw} underlying the decomposition formulae just derived. It
is given by the following
<{}> = <exit> = w
<Q1 union Q2> = <Q1> + <Q2>
<S[Q]> = <S><Q>
<S1;S2> = <S1><S2>
<S#> = <S>w
<(E?S)*> = <E?S> = <E>(<S>+1)
<spawn S> = <S>+1
<E?S1:S2> = <E>(<S1>+<S2>),
with operations given by
a+b = least upper bound of {a,b}
1x = x = x1 for all x
rr = r, ww = wr = w
The least upper bound is taken under the partial ordering given by
w < 1 < r and w < rw < r.
Thus, for instance, 1 + rw = r, 1 + w = 1, 1 + r = r. Under addition,
r behaves as infinity, while w behaves as zero. Under multiplication 1
is the identity, while w behaves as a zero ONLY on the left, but not
on the right!
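Encoding each of {1,r,w,rw} as a pair of bits (is x in F, is x in B),
both tables fall straight out of the set formulas; a quick Python
sketch of mine, for checking:

    one, r, w, rw = (0, 0), (1, 0), (0, 1), (1, 1)

    def mul(a, b):   # sequencing: F = F1 | (F2 - B1),  B = B1 | B2
        return (a[0] | (b[0] & (1 - a[1])), a[1] | b[1])
    def add(a, b):   # union:      F = F1 | F2,         B = B1 & B2
        return (a[0] | b[0], a[1] & b[1])

    for a in (one, r, w, rw):
        assert mul(one, a) == a == mul(a, one)   # 1 is a two-sided identity
        assert mul(w, a) == w                    # w is a zero on the left...
    assert mul(r, w) == rw                       # ...but not on the right
    assert add(one, rw) == r and add(one, w) == one and add(one, r) == r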
It is also useful to carry the analysis to places where control flow
cannot be easily resolved by static analysis (i.e. indirect calls and
jumps). For this, we add in the extra elements
<S> = 1, R, r, W, RW, rW, w, Rw or rw
with the factor R occurring if x's membership in F(S) is unresolved,
and W occurring if x's membership in B(S) is unresolved.
A suitable extension to the algebra can then be defined. In
particular, the partial ordering relation is the one given by
w < W < 1 < R < r,
along with the ordering this implies for all the factorings
Rw < RW < R, rw < rW < r, w < Rw < rw, W < RW < rW. This results in
the following lattice
1 R r
W RW rW
w Rw rw
where increasing order takes place going up or to the right.
Again, we find that
1x = x = x1, wx = w
for all x. From this, it follows that Rwx = Rw and rwx = rw for all x.
The remaining identities are
RR = R, rR = Rr = rr = r, WR = Wr = RW, WW = W, Ww = w.
Together, this gives us the MAGIC ALGEBRA -- which serves as a basis
for the control flow analysis of F() and B().
7. Extension to Inter-Procedural Control Flow
To expand this to inter-procedural analysis, this is where we may make
use of the larger family of infinitary expressions -- those which solve
non-rational fixed-point relations.
The context for a subprogram given by the following definition
define P(x1,x2,...,xn) B[return](x1,x2,...,xn)
when applied to the procedure call
P(e1,e2,...,en)[Q]
is the expression
B[Q](e1,e2,...,en).
This representation treats the "return" statement, itself, as a place-
holder for the context Q in the expression S[Q]. That is, "return" is
treated as an expression.
The context for a value-returning subprogram given by the following
definition
define F(x1,x2,...,xn) B[return[]](x1,x2,...,xn)
when applied to the function-evaluation (itself, in context),
S[F(e1,e2,...,en)]
is the expression
B[S[]](e1,e2,...,en) = (B;S)[](e1,e2,...,en)
Here, the "return" statement is not a place-holder for a context, Q,
but for a statement S[]; "return" is treated as a statement.
A call-by-value semantics for parameter-binding is directly
incorporated by the following equivalences
define P(x1,x2,...,xn) B[return]
define F(x1,x2,...,xn) B[return[]]
becoming
P(e1,e2,...,en)[Q] = (x1=e1;x2=e2;...;xn=en;B)[Q]
S[F(e1,e2,...,en)] = (x1=e1;x2=e2;...;xn=en;B;S)
or, if ordering is to be made indeterminate,
P(e1,e2,...,en)[Q] = (spawn(x1=e1);spawn(x2=e2); ...; xn=en)[B[Q]]
S[F(e1,e2,...,en)] = (spawn(x1=e1);spawn(x2=e2); ...; xn=en);(B;S)
Both of these schemata are fixed-point equations in the underlying
magma whose solutions yield infinite expressions that contain an
infinite number of distinct sub-terms, in general -- irrational terms.
Here, F() and B() when applied to subprograms will produce a recursive
system of equations between the subprograms. The same prescription may
be used to arrive at a limit definition.
When the MAGIC ALGEBRA is inserted into this analysis, what this
amounts to is solving fixed point equations over the algebra. Though
it hasn't been developed in detail here, it seems fairly obvious and
intuitive that there is both a simple and efficient equation-solving
algorithm for parametric fixed-point systems over both the {1,r}x{1,w}
magic algebra and its extended {1,R,r}x{1,W,w} algebra.
More importantly: we're not talking about something as powerful as the
Boolean algebra. So equation-solving over this algebra should not
degenerate into full-blown NP-completeness.
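To illustrate what such an equation-solver might look like (a sketch
only, with an invented two-procedure system), Kleene iteration
starting every procedure at w -- the value of {}, as in the
truncation-based limit definition above -- converges quickly over
this finite lattice:

    one, r, w, rw = (0, 0), (1, 0), (0, 1), (1, 1)
    def mul(a, b): return (a[0] | (b[0] & (1 - a[1])), a[1] | b[1])
    def add(a, b): return (a[0] | b[0], a[1] & b[1])

    # P = (x = 0); Q(...)    and    Q = x? P(...): return
    eqns = {
        "P": lambda v: mul(w, v["Q"]),            # binds x, then calls Q
        "Q": lambda v: mul(r, add(v["P"], one)),  # reads x, then P or return
    }
    v = {p: w for p in eqns}                      # bottom of the iteration
    while True:
        nv = {p: f(v) for p, f in eqns.items()}
        if nv == v:
            break
        v = nv
    assert v == {"P": w, "Q": r}   # P writes x before reading; Q reads x first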
8. Multiple Objects and the Extended Magic Algebra
Multiple objects are handled together by generalizing the algebra to
include POWERS of r and w. Thus, corresponding to the pair of sets
(F(S), B(S)) will be the expression r^F(S) w^B(S). This places the focus
on the algebra inherited for the exponents, themselves.
The extended algebra is thus obtained by generalizing the sets {1,R,r}
and {1,W,w} to powers of r and w. The primitives adopted here reflect
the relations of the F() and B() operators, and are as follows:
r^a r^b = r^(a|b), w^a r^b = r^(b-a) w^a, w^a w^b = w^(a|b)
r^a w^b + r^c w^d = r^(a|c) w^(bd)
r^0 w^0 = 1
a < b -> r^a < r^b & w^a > w^b.
In order for this to work consistently with associativity and the
identity property for products, this requires that
a|0 = a = 0|a, a-0 = a, 0-a = 0
a|(b|c) = (a|b)|c, (a|b)-c = (a-c)|(b-c), a-(b|c) = (a-c)-b
Note the reversal of ordering in the last identity.
Consistency with the ordering relation requires
0 <= a
a <= b, c <= d -> a|c <= b|d
a <= b, c >= d -> a-c <= b-d
From the last of these, it follows that a-b <= a.
In order that it yield associativity for +, it also requires that
a(bc) = (ab)c.
For commutativity of +, we also need
a|b = b|a, ab = ba.
For distributivity over + on the left, we need
a|b|a = a|b, a|bc = (a|b)(a|c).
For distributivity over + on the right, we need
a|b|(a-c) = b|a|(a-c), ab|c = (a|c)(b|c)
Finally, to recover r and w, respectively, as special cases r^1 and
w^1 (corresponding to r^Omega and w^Omega), we assume that all
exponents lie in the interval [0,1]. Then we also have
a <= 1.
If all these properties are assumed, then it follows that a|b is the
least upper bound operator for the ordering relation <, and that 0 is
the minimal element and 1 the maximal element.
An example may be obtained by mapping the exponents onto the linearly
ordered real-number interval [0,1]. Here, a|b = max(a,b) and a-b can
be given any of a wide range of definitions consistent with the above
properties, including the following:
(a>b)?(a-b):0, (a>b)?a:0.
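A sketch of that example in Python (my own), with exponents as floats
in [0,1], a|b = max, a-b as the first (truncated) definition above,
and min chosen as one admissible meet for the w-exponents under +:

    sub = lambda a, b: a - b if a > b else 0.0     # (a>b)?(a-b):0

    def mul(p, q):   # r^a w^b . r^c w^d = r^(a | (c-b)) w^(b | d)
        (a, b), (c, d) = p, q
        return (max(a, sub(c, b)), max(b, d))
    def add(p, q):   # union: read exponents join, write exponents meet
        (a, b), (c, d) = p, q
        return (max(a, c), min(b, d))

    r, w = (1.0, 0.0), (0.0, 1.0)
    R, W = (0.5, 0.0), (0.0, 0.5)     # "unresolved" as intermediate powers

    assert mul(w, r) == w             # w r = r^(1-1) w^1 = w
    assert mul(R, R) == R             # RR = R
    assert mul(W, r) == (0.5, 0.5)    # Wr = RW
    assert add(r, w) == r             # r + w = r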
9. Primitive Statements; the Handling of Object-Subobject Relations
For the above analysis to take off, we first need a grounding for the
F() and B() sets. These need to be defined for the primitive
statements. These are the assignment statements and (more generally)
statements that contain expression evaluation threaded with embedded
assignments.
At the most general level, an assignment statement of the form X = E
binds the object X, placing it in B(). However, the story doesn't end
there. If
the object is part of an object hierarchy, then the SUB-objects are
also bound, and the SUPER-objects are partly bound. The object-
subobject hierarchy imposes a set of constraints on F() and B().
For an assignment statement S = (X = E), one has the following
F(S) = F(&X; E) = F(&X) union (F(E) - B(&X))
B(S) = B(&X; E) = B(&X) union B(E)
In the absence of an object-subobject hierarchy, we may simply write
F(&X) = B(&X) = { &X }.
However, in the presence of an object-subobject hierarchy, the sets
F(&X) and B(&X) need not be trivial.
Instead, we have to account for the hierarchy. Suppose &Y contains &X
as a proper subobject. Then, a read from &X also reads &Y, while a
write to &X equates to both a read AND write of &Y (in effect, it's as
if one "reads" the unchanged parts of &Y and writes them back). Hence,
if <S> = r^a w^b for &X then <S> = r^{a|b} w^b for &Y.
This is the constraint that needs to be enforced.
For non-overlapping objects, there are no constraints. For overlapping
objects, things get more complicated, but the occurrence of
overlapping (other than proper inclusion or identity of objects) is
not generally granted by programming languages.
Thus, at its most general level, the value of the statement (X.f = E)
for the object &X.g would be the conditional expression
<X.f = E> = w^{f==g},
while its value for the object &X would be
<X.f = E> = rw.
If &X.g, itself, had a non-trivial part-whole relation with &X or with
a subobject of &X.f, then the first expression could also acquire an
exponent of r or raise its exponent of w to a higher power.
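As a closing sketch (again my own), the constraint propagates
mechanically: given <S> = r^a w^b on a subobject, the containing
object's value is r^(a|b) w^b:

    # Lift a statement's value from a subobject to its containing object:
    # a partial write also counts as a read of the unchanged parts.
    def lift(v):
        a, b = v
        return (max(a, b), b)      # r^a w^b  |->  r^(a|b) w^b

    w = (0.0, 1.0)
    assert lift(w) == (1.0, 1.0)   # a write to X.f is a read-write of X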
10. Conclusion
For the task at hand, analysis of assembly code for the 8051, none of
these elements were of direct relevance. The processor is essentially
a register-based processor (with a relatively large 256+128-byte
register space), having little need for non-register variables in most
applications the 8052 is used for. This was (for the most part) the
case with the analysis done in 2000.
However, for more general tasks -- particularly the problem of
decompilation in its full-blown generality (and even the problem of
*COM*-piling and doing optimization and control-flow analysis), the
considerations just raised here come fully to bear.