Parsing expressions directly into an abstract syntax tree

Travis Nixon <travis_nixon@yahoo.com>
7 Jan 2006 22:12:51 -0500

          From comp.compilers

Related articles
Parsing expressions directly into an abstract syntax tree travis_nixon@yahoo.com (Travis Nixon) (2006-01-07)
Re: Parsing expressions directly into an abstract syntax tree rjshaw@netspace.net.au (Russell Shaw) (2006-01-08)
Re: Parsing expressions directly into an abstract syntax tree ncannasse@motion-twin.com (Nicolas Cannasse) (2006-01-09)
| List of all articles for this month |

From: Travis Nixon <travis_nixon@yahoo.com>
Newsgroups: comp.compilers
Date: 7 Jan 2006 22:12:51 -0500
Organization: Compilers Central
Keywords: parse, AST

I have been working out an algorithm to parse expressions directly
into an abstract syntax tree, in a way that allows you to give the
parser arbitrary prefix, postfix, and infix operators, specifying the
precedence and associativity for each operator. The parsing itself is
done from left-to-right, with one lookahead token, and very little
recursion. (recursion only happens when encountering a parenthesized
sub-expression, and even that could be done away with fairly easily)


As far as I can tell, what I've got so far is something very similar
to operator precedence bottom-up parsing, but either they're not quite
the same, or they really are the same, and I just don't really have a
full understanding of what it is I'm actually doing. (Which is not at
all outside the realm of possibility)


:)


I'm quite sure I haven't come up with anything new here, since it is
so similar to operator precedence parsing, but I've had trouble
finding any info about it, since I have no idea what "it" might be
called. much searching the comp.compilers archives has shown me lots
of different things, but not exactly what I'm looking for. So, after
I describe the algorithm as it stands so far, I would *greatly*
appreciate it if somebody could tell me what this thing is called, so
I can google more information about it! Of course, it may be
special-cased enough that there's not anything exactly like this out
there, in which case I'd greatly appreciate comments on the sorts of
problems I'm likely to run into that I haven't thought of yet.


I'm going to "describe" it by posting pseudo-code and a couple example
parses, but there are a couple comments to make first.


The algorithm is meant to parse directly to an AST. This AST has
three types of nodes: operators, simple values, and sub-expressions.
Operators are...er...operators. Simple values are constants (numeric,
string, whatever types of literals you need to be able to handle), and
sub-expressions are parenthesized expressions.


Simple values and sub-expressions are both considered "values". One
important note about sub-expressions is that there are going to be
references to "token" in the code, which is the lookahead token. In
the case of an operator or a simple value, the token really is a token
in the traditional sense, but in the case of a sub-expression, the
expression is parsed upon encountering the left paren. The result is
a "token" (a pre-constructed node really) that can be treated exactly
like a simple value. Since I don't really know what to call that
either, hopefully "token" will make sense.


:)


On to the (not-very-pseudo) pseudo-code:


// insert a new node as the parent of the "current" node,
// making the current node the left child of the new one,
// and setting the new node as the "current" one.
PushLeft()
{
      newnode = new node( token );
      newnode.left = currentnode;
      newnode.parent = currentnode.parent;
      currentnode.parent = newnode;
      currentnode = newnode;
}


// insert a new node as the right child of the "current" node.
// No change to "current" in this case.
SetRight()
{
      newnode = new node( token );
      currentnode.left = newnode;
      newnode.parent = currentnode;
}


// do the actual parsing
Expression()
{
      currentnode = null;
      token = GetNextToken();


      if( token is prefixop or value )
            currentnode = new node( token );
      else
            return null;


      token = GetNextToken();
      while( token )
      {
            if( currentnode is value )
            {
                  if( token is operator )
                        PushLeft();
                  else
                        break;
            }
            else if( currentnode is operator )
            {
                  if( token is value )
                  {
                        // If the current operator is prefix or infix, and doesn't
                        // already have it's full complement of children,
                        // make this value a child.
                        // Any other case indicates we've reached the end of
                        // our expression.
                        if( (currentnode is infixop and currentnode.right is null) or
                              (currentnode is prefixop and both right and left are null) )
                              SetRight();
                        else
                              break;
                  }
                  else if( token is operator )
                  {
                        // current node and next token are both operators.
                        // Compare precedence to figure out what to do
                        if( token.precedence == currentnode.precedence )
                              PushLeft();
                        else if( token.precedence > currentnode.precedence )
                        {
                              if( currentnode.right is null )
                              {
                                    // only a prefix op is allowed here. If the operator
                                    // can both prefix and postfix, ++ for example, assume
                                    // prefix, since a postfix would result in an invalid
                                    // expression
                                    if( token is prefixop )
                                    {
                                          // set right and step down
                                          SetRight();
                                          currentnode = currentnode.right;
                                    }
                                    else
                                    {
                                          // I believe the only way we can reach this spot is if
                                          // the current op is a postfix operator with a lower
                                          // precedence than the next operator, or a malformed
                                          // expression.
                                          ERROR;
                                    }
                              }
                              else
                              {
                                    // step down right and push
                                    currentnode = currentnode.right;
                                    PushLeft();
                              }
                        }
                        else // token.precedence < currentnode.precedence
                        {
                              if( currentnode.parent is null )
                                    PushLeft();
                              else
                              {
                                    // for simplicity's sake, just go one step back up
                                    // the tree and continue parsing, without consuming
                                    // the token.
                                    // What we're really doing here is backing up the tree
                                    // until we find an operator whose precedence is less
                                    // than or equal to ours, which could be done by
                                    // iteration here that probably would have taken up
                                    // less space than this comment.
                                    currentnode = currentnode.parent;
                                    continue;
                              }
                        }
                  }
            }
            token = GetNextToken();
      }


      // climb back to the root
      while( currentnode.parent )
            currentnode = currentnode.parent;


      return currentnode;
}


Another quick note before example parses. In the trees below, prefix
and postfix operators are going to look the same. The tree would
actually need to record whether the operator was used as prefix or
postfix, so that things can be done in the right order.


Here are a couple example parses, using C operators and precedence.
The "current" node is marked in brackets.


(I *hope* the formatting stays relatively intact here, or this is
going to get really messy.)


a + b * c ++


(start with value, set current)
[a]


(cur = value, next = op, pushleft)
  [+]
  /
a


(cur = op, next = value, setright)
  [+]
  / \
a b


(cur = op, next = higher prec op, step down and push)
    +
  / \
a [*]
      /
    b


(cur = op, next = value, setright)
    +
  / \
a [*]
      / \
    b c


(cur = op, next = higher prec op, step down and push)
    +
  / \
a *
      / \
    b [++]
            /
          c


a * ++ b * c


(start with value, set current)
[a]


(cur = val, next = op, pushleft)
    [*]
  /
a


(cur = op, next = higher prec prefixop, set right and step down
because curop doesn't have all its children)
    *
  / \
a [++]


(cur = op, next = value, setright)
    *
  / \
a [++]
          \
            b


(cur = op, next = lowerprec op, climb until we find an op less than or
equal, and pushleft)
        [*]
      /
    *
  / \
a ++
          \
            b


(cur = op, next = value, setright)
        [*]
      / \
    * c
  / \
a ++
          \
            b




One more, with a subexpression


a * (b + c)


(start with value, set current)
[a]


(cur = val, next = op, pushleft)
    [*]
  /
a


(cur = op, next = lparen, parse subexpression)


SE: b + c
(start with value, set current)
[b]


(cur = val, next = op, pushleft)
    [+]
  /
b


(cur = op, next = val, setright)
    [+]
  / \
b c


(cur = op, next = rparen, done with subexpression)


back to the original expression
    [*]
  /
a


(cur = op, next = value[subexp], setright)
    [*]
  / \
a SE
            |
            +
          / \
        b c




So...after all that...


What in the world is this method of expression parsing called?


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.