Re: Re old post: newbie questions about AST

From: rkrayhawk@aol.com (RKRayhawk)
Newsgroups: comp.compilers
Date: 23 Mar 2000 22:39:21 -0500
Organization: AOL http://www.aol.com
References: 00-03-088
Keywords: AST

  "Floris 'Tamama' van Gog", floris@vangog.net
posts a generalized enquiry about


<< ... how to really make an AST (i'm having some probs
creating some nice structures that are both easy to use, and aren't too
bulky (each node very big))


Or am i seeing AST's completely wrong here and is the entire program one
big happy AST?
>>


and depicts two basic approaches to structuring an AST, using a function
invocation parse as an example.


first:


int function(int arg1,int arg2,int arg3);


                              function
                              /      \
                          push        0
                          /  \
                      push    arg3
                      /  \
                  push    arg2
                  /  \
                 0    arg1


and then second


            function
            / | \
        arg1 arg2 arg3


(Those sketches are direct quotes, but I have left out any dingbats
like << >> to keep it legible).


The basic situation you face is that the input and output of a parsing
process are both like large buckets with many marbles. These buckets
are basically mixed-up when compared to one another.


The theory of a parsing algorithm is very simply that if we take the
input piece by piece, examining the sequence of pieces, we will be
able to detect some structure that we can use to map the input to the
output.


In a few cases the whole input is a single discernible structure, and
the output too is a single structure. That is rare. The universe
seems to have been deployed with structures inside of structures. So
many economically practical applications can legitimately assume that
there is a sequence of mini structures inside the input, and that the
output also has a sequence of mini structures.


An AST is simply a recording technique. An AST is a means to record
your detection of structure. You can quite reasonably store the entire
structure that reflects the input, or just portions of it. To get away
with storing only portions in the AST, you must have a strategy for
correct 'understanding' of the structure piece by piece, before the
entire input is scanned.
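

To make the recording idea concrete, here is a minimal sketch in C of
the kind of node a binary AST might be built from (the names NodeKind,
Node and new_node are mine, purely by way of illustration):


#include <stdlib.h>

/* A minimal binary AST node: a kind tag, the recorded symbol, and two
   children.  The names are illustrative only. */
typedef enum { NK_CALL, NK_PUSH, NK_IDENT, NK_CONST } NodeKind;

typedef struct Node {
    NodeKind     kind;   /* which convention applies to this node          */
    const char  *text;   /* the recorded symbol, e.g. "function" or "arg1" */
    struct Node *left;   /* first child, or NULL                           */
    struct Node *right;  /* second child, or NULL                          */
} Node;

static Node *new_node(NodeKind kind, const char *text, Node *left, Node *right)
{
    Node *n = malloc(sizeof *n);
    n->kind = kind; n->text = text; n->left = left; n->right = right;
    return n;
}


The point is only that each node records a symbol and its structural
relationships; the meaning comes later, from the conventions your
walker applies.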


This 'understanding' is the part that makes everything clear to you
as a parser designer making decisions about the use of recording
techniques like an AST. The contents of the AST have no
meaning. Instead it is a list of symbols, arranged in a structural
manner. You apply conventions to interpret the elements (nodes) in the
structure (AST). That set of conventions is outside of the information
system.


When all goes well, the set of conventions you apply matches the set
of conventions that the originator of the source (code or parsed data)
had in mind.


The AST is merely a recording method. The essential feature is that it
can record hierarchical attributes (the dependency relationships, and
precedence). When you write code to respond to the AST, or any other
internal format of stored structure, you are applying your
conventions.


There is no right or wrong way to apply your conventions. But you will
probably apply them one by one. So a structure with more than two
nodes tied to a header node, such as your ....


            function
            / | \
        arg1 arg2 arg3


has simply delayed a decision to choose the sequence of applying the
conventions you have in mind. If this is a computer language and the
expressions in arg1 thru arg3 can have side effects, then you must
evaluate them in the order that the language conventions require
(right to left, or left to right).
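

As a sketch of what applying that ordering convention looks like,
assume the flat node keeps its children in an array (the layout and
names below, FlatCall, eval_expr, push_value, are mine, not taken from
any particular tool):


/* Hypothetical flat call node: the parent keeps its children in an
   array instead of a chain of push nodes. */
struct Expr;                              /* opaque expression node        */
extern long eval_expr(struct Expr *e);    /* assumed expression evaluator  */
extern void push_value(long v);           /* assumed stack/code emitter    */

typedef struct FlatCall {
    const char   *name;   /* "function"               */
    int           argc;   /* number of arguments      */
    struct Expr **argv;   /* argv[0] .. argv[argc-1]  */
} FlatCall;

/* The ordering convention (right to left here) lives in the walker,
   because the flat tree shape did not record it. */
static void walk_call(FlatCall *call)
{
    for (int i = call->argc - 1; i >= 0; i--)
        push_value(eval_expr(call->argv[i]));
}


With the pushed structure from the first sketch, that ordering
decision is already frozen into the shape of the tree, so the walker
never has to make it.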


For an expression, if your nodes get complex, as in


              expression
             /  /  |  \  \
            a   -  b   -  c


then again you have delayed the decision as to how to apply your
precedence rules (conventions). In this case the tree walker must look
at the nodes to figure out the proper sequence (that is more severe
than the order of evaluation of the function arguments, which could
be mechanical; here your walker must adapt to tree _contents_).
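

Here is a sketch of that extra work, assuming the flat children are
held as parallel arrays of operands and operator marks (a layout I
have invented purely for illustration); the walker must inspect the
contents and apply the left-to-right convention itself:


/* Hypothetical flat expression node for the  a - b - c  sketch:
   operands and operators are simply lined up side by side. */
typedef struct FlatExpr {
    int   noperands;   /* number of operands                      */
    long *operands;    /* operands[0] .. operands[noperands-1]    */
    char *ops;         /* ops[0] .. ops[noperands-2], '+' or '-'  */
} FlatExpr;

/* The walker itself applies the left-associativity convention,
   ((a - b) - c), by inspecting the node's contents in order. */
static long eval_flat(const FlatExpr *e)
{
    long acc = e->operands[0];
    for (int i = 0; i + 1 < e->noperands; i++)
        acc = (e->ops[i] == '-') ? acc - e->operands[i + 1]
                                 : acc + e->operands[i + 1];
    return acc;
}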


If the parser deployed the alternative pushed structure, then the
parser would, in effect, implement the precedence or order of
evaluation sequence, in contrast to the tree walker trying to figure
that out. Either the parser or the tree walker is going to gobble up
resources to do that. But, typically, there are devices in the parser
_tool_ that will largely support the need for precedence and
evaluation order. ((So your work will be much easier if you lean on
the parser generator tool. But that is not more 'right' than delaying
it and coding your tree walker to figure it out later.))
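

By way of contrast, here is a small hand-written sketch of a parser
that freezes left-associativity into a binary tree as it reduces, so a
later walker can proceed purely mechanically. (The helper names are
mine; with a generator such as yacc or bison you would get the same
effect from the grammar structure or from precedence declarations like
%left.)


/* A left-associative chain  term ('+'|'-') term ...  is reduced as it
   is parsed, so the tree shape, not the walker, carries the convention.
   parse_term, make_binary, peek_op and consume_op are assumed helpers. */
struct Ast;
extern struct Ast *parse_term(void);                 /* one operand           */
extern struct Ast *make_binary(char op, struct Ast *l, struct Ast *r);
extern int peek_op(void);                            /* next op, or 0 if none */
extern int consume_op(void);

static struct Ast *parse_additive(void)
{
    struct Ast *left = parse_term();
    while (peek_op() == '+' || peek_op() == '-') {
        char op = (char)consume_op();
        /* Hanging the new node above the old one is what makes
           a - b - c  come out as  ((a - b) - c). */
        left = make_binary(op, left, parse_term());
    }
    return left;
}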


So anyway, when you look at how vast the bucket of marbles in the
input and output can be, you inevitably proceed to break the requirements
into smaller parts that organize just a few marbles at a time. Each
organized handful of marbles can be processed as soon as all of its
members are detected, as a single small unit of work to which to apply
your conventions. Or alternatively, you can certainly wait until the end,
when everything has been gathered, and then walk the tree, handling a
handful of marbles at a time as a unit of work to which to apply your
conventions.


Either way, in your mind's eye, you can see that the entire set of
symbols is one single structure, but it is not usually necessary to
retain it all in memory at one time.


Basically you will apply your conventions in very small actions to
just a few items. If you have deployed an AST that has something
awesome, like


            function
            / | \ \ ..... \
        arg1 arg2 arg3 arg4 ... arg-n


then really the tree walker will have to become a parser itself, and
you have just delayed figuring it all out. (That is not inherently
wrong. It is just usually convenient to let the parser generator tool
assist you with this.)


But let me emphasize that there is no magic to having just two nodes
beneath each parent node. It is just that that is a very practical
approach. (It also helps you re-understand your code when you come
back to it for fixes and enhancements.) You can even have different
kinds of nodes in a single AST, perhaps some with 2, 3 or 4 children,
with a type code somewhere to record the complexity at each node. Most
approaches will be simpler and more monolithic. For example:


Some assembler languages might allow only two operands, such as
      ADD r1,r2
An AST designer might choose a strategy that records this as a
two-child node structure with the operation recorded in the parent
node. It would not be inconceivable to record it instead as a
three-child node structure, with the operation as one node and the
operands as the other two.


Some assembler languages might allow three operands, such as
    ADD destination,source1,source2
This might be recorded in a three-child node with the operation in the
parent, or again, a designer could put the operation in a fourth node.
(All just by way of example.)


Since some of the three-operand assembler languages allow a short-hand
that names only two operands, the second being both a source and a
destination, that situation _might_ lead a designer to an AST
that varies between two and three child nodes (or three and four).
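

Here is one way such a type code might be recorded, as a hedged sketch
in C; the layout and names are mine, not a prescribed design:


/* A node whose type code records its own complexity: ADD r1,r2 becomes
   a two-child node, ADD d,s1,s2 a three-child node, under one struct. */
typedef enum { OP_ADD2, OP_ADD3 } OpKind;

typedef struct InsnNode {
    OpKind           kind;        /* records the complexity of this node */
    int              nchildren;   /* 2 for OP_ADD2, 3 for OP_ADD3        */
    struct InsnNode *child[3];    /* unused slots left NULL              */
} InsnNode;

extern void visit_operand(const struct InsnNode *operand);   /* assumed */

/* The walker consults the type code instead of assuming one fixed shape. */
static void walk_insn(const InsnNode *n)
{
    for (int i = 0; i < n->nchildren; i++)
        visit_operand(n->child[i]);
}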


The AST is just a storage facility. You basically need many trips to
the storage facility to fill it up. You can choose to interleave
these with an occasional trip that pulls out just a little bit at a
time, or you can wait until the end and empty the whole storage facility
at once. Yet even then you will actually take out just a handful at a
time. When you wait until the end, you have really only waited to begin
draining the store; it is usually feasible to commence earlier
than that.


The AST makes more sense when you realize that in your mind, or in the
language specification, you have a set of conventions. Those
conventions are to be applied in a stepwise fashion to the symbols and
sequences your parser detects. Your conventions define your units of
work. Once you clearly see your requirements, you _usually_ see that
each action that accomplishes a unit of work _usually_ applies to
just a few items. The ordinary exceptions are lists, which _usually_
can be understood as a sequence of units of work.


As a small extension to this discussion, you may wish to consider that
the AST tree walker is a navigator. The more complex your nodes can
be, the harder it is to code and stabilize the tree walker navigation
logic. Generally you would want to manage the issues of precedence and
order of evaluation in the grammar processor itself (the parser
generator) rather than merge them into the tree navigator.


The notion of a context-free grammar is instructive in this
situation. You have not stated that you are using a CFG tool, but the
idea is relevant no matter what tools you use. A designer may wish to
deploy an AST in a manner that allows the tree navigator to process
nodes free of context: that is, left-first, right-first, depth-first,
breadth-first, or whatever, purely mechanically.
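

Such a purely mechanical navigation can be very small. A sketch,
reusing the binary node layout from the earlier sketch:


/* A context-free, depth-first walk: every node is visited the same way
   (left, right, then the node itself), and all 'understanding' lives in
   the action applied at each node. */
static void walk_depth_first(Node *n, void (*action)(Node *))
{
    if (n == NULL)
        return;
    walk_depth_first(n->left, action);   /* left subtree first    */
    walk_depth_first(n->right, action);  /* then right subtree    */
    action(n);                           /* post-order: node last */
}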


It helps to see that the tree walker is not 'understanding' anything;
it is setting up the application of conventions to small sets of
elements, a handful at a time. This works because the input rules we
adopt as conventions and enforce with grammar rules restrict the
input sequence, and therefore the internally stored sequence (perhaps
in an AST) which the walker navigates, to a valid sequence. The output
is 'right' and 'works' because we place restrictions on the input. The
tree walker is really mechanical. It can take only one step at a
time.


As a final note, one should also mention that back-end requirements
may lead you to avoid processing handfuls of work too quickly. There
can be a number of reasons. An easy one to perceive is the motivation
to optimize the output. Some data-flow and control-flow optimizations
will necessitate having numerous bundles to evaluate. It is possible
to implement some of these in algorithms applied against the AST. So
it is possible to have multiple scans of the AST, some to do
rearrangements for perhaps optimization, some to build in debugging or
cross-reference support, and then finally the 'interpretation' tree
walk. However, it is not rare that these requirements can be shifted
forward or backward of the basic AST navigation, involving other
intermediate representations than the AST store.
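

In code, those multiple scans are often nothing more exotic than
running several walkers over the same root before the final pass. A
sketch, again reusing the mechanical walker above, with pass names
that are purely hypothetical:


extern void fold_constants(Node *n);    /* hypothetical rearranging pass      */
extern void attach_line_info(Node *n);  /* hypothetical debugging support     */
extern void emit_code(Node *n);         /* hypothetical final 'interpretation' */

/* Several scans of the same AST, each a separate mechanical walk. */
static void compile_tree(Node *root)
{
    walk_depth_first(root, fold_constants);
    walk_depth_first(root, attach_line_info);
    walk_depth_first(root, emit_code);
}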


Hope that helps


Robert Rayhawk
RKRayhawk@aol.com

