Re: [Q] How Is AST structured for entire program??

George Neuner <gneuner2@comcast.net>
Wed, 17 Oct 2007 22:35:22 -0400

          From comp.compilers

Related articles
[Q] How Is AST structured for entire program?? heliboy2003@yahoo.com.tw (2007-10-16)
Re: [Q] How Is AST structured for entire program?? gneuner2@comcast.net (George Neuner) (2007-10-17)
Re: [Q] How Is AST structured for entire program?? DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-10-17)
Re: [Q] How Is AST structured for entire program?? cfc@shell01.TheWorld.com (Chris F Clark) (2007-10-18)
Re: [Q] How Is AST structured for entire program?? paul@paulbmann.com (Paul B Mann) (2007-10-24)
| List of all articles for this month |

From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Wed, 17 Oct 2007 22:35:22 -0400
Organization: Compilers Central
References: 07-10-051
Keywords: parse, practice
Posted-Date: 18 Oct 2007 00:44:45 EDT

On Tue, 16 Oct 2007 09:05:48 -0000, heliboy2003@yahoo.com.tw wrote:


>I am a very newbie and am learning compiler, after read (or tried to
>read) many compiler books, there is one thing I haven't been able to
>get a clear picture on. I understand that when we parse a program, a
>parse tree can be constructed, and I can construct a parse tree for
>ONE expression, but I can not see how this parse tree is constructed
>for the entire program.
>
>Let me take a simple example in C
>
>void main()
>{
>int a;
>int b;
>int c;
>int i;
>a = 3;
>b = 4;
>c = a+b;
>
>for (i = 0; i < 10; i++)
>{
>c = a-b;
>}
>
>}
>
>Do we construct one tree for each line of expressions (Tree1 for a =
>3, Tree2 for b = 4....... this I know how to do) or do we generate ONE
>big tree with all the expression built into the tree (this I have no
>idea how)?
>
>Also, how is the for-loop constructed into the tree or it just simply
>not built into tree and only expressions are built into tree?


First, "parse" trees and "ASTs" are related but quite different. An
AST tries to abstract, as much as possible, away from details of the
source language syntax whereas a parse tree typically includes source
language symbols.


As for your question: there may not be an AST at all ... ASTs are only
one form of intermediate representation (IR). It is possible, for
example, for the parser to directly construct a 3-address code
representation.


If a tree is used, whether there is a single tree or many depends on
the language. In C where all functions are top level, it makes as
much sense to have a separate tree for each function as it does to
produce a single tree for the whole source file. On the other hand,
Pascal has a hierarchical structure to the source which makes it more
practical to produce a single tree for the program.


Trees are recursive structures, each branch of a tree *is* a tree
itself. When you encounter a new expression, you remember where you
are in the current tree, construct a new tree representing the
expression, and attach it as a branch at the remembered location in
the first tree.


ASTs can take any convenient form. One dirt simple initial AST (given
in Lisp notation) for the code you presented above might be something
like:


(function "main"
(decl "a" int)
(decl "b" int)
      (decl "c" int)
      (decl "i" int)
      (seq
          (assign (var "a") (value 3))
          (assign (var "b") (value 4))
          (assign (var "c") (op+ (var "a") (var "b")))
          (assign (var "i") (value 0))
          (while (op< (var "i") (value 10))
              (assign (var "c") (op- (var "a") (var "b")))
              (assign (var "i") (op_postincr (var "i"))
              )
      ))


Notice that I've turned the "for" loop into a "while" loop. C doesn't
actually have a "for" loop - in C "for" is just a convenience syntax
for the equivalent "while" loop. Other languages may do things
differently - in Pascal, for example, "for" and "while" have quite
different semantics and are not interchangeable.


You might be wondering why each variable reference is a separate
subtree. For an assignment, the left and right hand sides must be
treated differently: a variable on the right side is evaluated for
value; on the left side it is evaluated for its address. An
expression like


    (assign (var "c") (op- (var "a") (var "b")))


might end up after processing as something like


    (store (addressof "c")
          (op-
              (fetch (deref (addressof "a")))
              (fetch (deref (addressof "b")))
          ))


It's easier to rewrite (or copy with modifications) the tree during
processing if all subexpressions are treated in a regular way.




>Now when we go one step deeper to optimize the code, there are two
>redundencies, first is c=a+b and second the for-loop, a well-optimized
>code will be
>
>void main()
>{
>int a;
>int b;
>int c;
>int i;
>a = 3;
>b = 4;
>c = a-b;
>}
>
>I am curious do we get any help from some magical arrangement of the
>tree structure so that these types of redudencies can be detected or
>we have to write a special routine to scan through the code for this
>types of redundencies? Or my understanding of the whole thing is
>incorrect?


Actually a well optimized version would be empty - the function
produces no return value and has no side effects like producing output
or modifying non-local variables ... it can be optimized entirely
away.


Trees are not very useful for spotting redundancies. DAGs (directed
acyclic graphs) and hash tables are typically used for this purpose.
But you still have to be careful: in the sequence


    a = b + c;
    b = d;
    e = b + c;


a and e may or may not represent the same value (it depends on d).
There are ways to make sure expressions really are the same, by using
methods like value numbering and single static assignment (SSA).


In your example, the expression "c = a - b" is not a redundant
expression but rather a loop invariant expression (it does not depend
on any of the loop variables). Invariant expressions can be lifted
out of the loop and executed just once either before or after the loop
body, a process called "hoisting". In your example, once the
invariant expression is hoisted, the loop body is empty and can be
eliminated.


And loops are frequently reorganized to better match characteristics
of the CPU. For example, on many CPUs, a conditional branch
instruction executes faster if the branch is taken and it is faster to
branch backward than forward. On such a machine, the compiler would
likely expand


      for ( expr1; expr2; expr3 ) expr4;


or the equivalent


      expr1; while ( expr2 ) { expr4; expr3; }


into assembler that implements


              expr1
              if ( !expr2 ) goto L2
      L1:
              expr4
              expr3
              if ( expr2 ) goto L1
      L2:




George


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.