Re: Bison games


From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 9 Sep 2000 13:27:41 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 00-09-040
Keywords: parse, yacc

Here is a LaTeX file of a column from SIGPLAN Notices that addresses
that very question. (I write such things for them on a semi-regular
basis.) It's roughly readable as-is, although it looks better in
print, of course.


Note, I am under the impression from the ACM that I am still the
copyright holder of this article. However, if you wish to publish
it for a fee, you need to contact them.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[twocolumn]{sn}


\ColumnName{Practical Parsing Patterns}
{1} % page number filled in at Rowan


\ColumnEditor{Chris Clark}{Compiler Resources, Inc., 3 Proctor St.,
Hopkinton, MA 01748}{cfc@world.std.com}


\ColumnTitle{Build a Tree---Save a Parse}


\ColumnSubTitle{}


% \ColumnAuthor{Name \\Address \\More Address}
\ColumnAuthor{}
{ Although he writes about parsers, Chris Clark has spent most of his
career working on optimizers and code generators. He often became
keeper of the ``IL''--the one who decided how the intermediate
language should evolve over time. One of the lessons he learned
through those experiences is that most intermediate languages are more
similar than they are different. The most recent intermediate
representation he helped design was internal to a semantic checking
tool that looked for memory leaks and memory overwrites, Compaq's
Third Degree. Even though the purpose of this intermediate
representation was different (it was never parsed nor used to generate
code), it still shares deep similarities with the others. Chris is
currently a wandering technical consultant, co-author with Barbara
Zino of Yacc++\raisebox{.6ex}{\tiny\textregistered} and the Language
Objects Library, and frequent poster on the comp.compilers newsgroup.
}


\begin{document}
\maketitle


Two commonly asked questions are: ``How can I rewind my input to a
specific point and reparse from there?'' and ``How do I make my parser
faster?'' These are very good questions, but they are not always the
right ones. As in many cases, the more useful question is, ``How do I
look at my problem differently?'' This column will address exactly
that issue.


\section{Scripting Languages}


Many applications start as a customized version of the ``desktop
calculator''. They then evolve into a kind of \emph{scripting
language}. A scripting language is simply a way to make an
application programmable. By the way, if all you want is a
scripting language for your application, in many cases you will be
better served by adopting one of the available scripting languages,
such as Tcl or Python, than by growing your own. In contrast, if you want
to learn language design, building a scripting language is a good
place to start.


The evolution from a desktop calculator to a scripting language
is a fine progression. However, a simple desktop calculator does not
require the same level of semantic support that a scripting language
requires.


A simple desktop calculator can often get by with accumulating results
as it goes along. This can be modeled by the return values of recursive
descent parsing routines or using the \emph{dollar}
variables of the yacc stack as in:


\vspace{-6pt}
\begin{verbatim}
expr: expr "+" expr
            { $$ = $1 + $3; }
        | expr "-" expr
            { $$ = $1 - $3; }
        | identifier // variable
        | integer // constant
        ;
\end{verbatim}
\vspace{-6pt}
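

For comparison, here is a minimal recursive-descent sketch of the
same accumulate-as-you-go style. The token stream below is a
hypothetical stand-in for a real lexer, and only the constant case is
shown; it is a sketch, not the column's code:

\vspace{-6pt}
\begin{verbatim}
#include <iostream>

// Hypothetical token stream standing
// in for a real lexer.
enum kind { INT, PLUS, MINUS, END };
struct token { kind k; int value; };
token *tp; // cursor into the stream

int parse_primary() { // integer constant
    return (tp++)->value;
}

// expr: primary (("+"|"-") primary)*
int parse_expr() {
    int result = parse_primary();
    while (tp->k == PLUS
           || tp->k == MINUS) {
        kind op = (tp++)->k;
        int rhs = parse_primary();
        result = (op == PLUS)
            ? result + rhs
            : result - rhs;
    }
    return result; // accumulated result,
                   // like $$ on the stack
}

int main() {
    token in[] = { {INT,1}, {PLUS,0},
        {INT,2}, {MINUS,0}, {INT,3},
        {END,0} };
    tp = in;
    std::cout << parse_expr()
              << std::endl; // prints 0
}
\end{verbatim}
\vspace{-6pt}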


In a scripting language one will often want to execute the same thing
over and over again. For example, the scripting language may have
loops, go to's, or subroutines. In each of these cases the same
expression may get evaluated repeatedly.


Here is an example scripting language derived from the calculator with
a loop and a print statement that illustrates the problem:


\vspace{-6pt}
\begin{verbatim}
stmt: "print" expr
            { cout << $2 << endl; }
        | "for" identifier "=" expr
                        "to" expr "do" stmt
            { execute stmt many times }
        ;


expr: expr "+" expr
            { $$ = $1 + $3; }
        | expr "-" expr
            { $$ = $1 - $3; }
        | identifier
        | integer
        ;
\end{verbatim}
\vspace{-6pt}


This language allows us to execute the print statements in the loop.
Thus, the code for the print statement and the expression to be
printed must be executed over and over again, with the variables
potentially changing value each time. The action code in the grammar
is executed only once, when the statements are parsed.


The naive way of fixing this is to parse the print statements and
expressions each time they need to be evaluated. However, that
requires an extremely fast parser and uses it very inefficiently.


\subsection{Build an Intermediate Representation}


The recommended solution to this problem is to build a tree that
represents the parsed text. By building a tree, the input needs to be
parsed only once, not over and over again. This eliminates the
inefficiency.


Some readers will take that paragraph literally, as a recommendation
to build a ``parse tree''. That is good advice, and if one takes
nothing else from this column, it will still result in better
structured parsers.


However, I meant \emph{tree} in a generic sense. Any convenient
representation that captures the relevant information from the parse
will suffice. Parse trees are the most obvious forms to construct. A
tree that removes the extraneous details that are only artifacts of
parsing is called an \emph{abstract syntax tree}. Further removed from
the parsing are \emph{virtual assembly languages}, which emulate an
abstract machine designed to fit the problem domain.
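

To make the distinction concrete, here is a sketch of both trees for
the input \verb/1 + 2/; the node names anticipate the record types
defined later in this column:

\vspace{-6pt}
\begin{verbatim}
Parse tree for "1 + 2":

        expr
      /  |  \
  expr  "+"  expr
    |          |
 integer    integer
   (1)        (2)

Abstract syntax tree:

     plus_expr
      /     \
 int_expr  int_expr
   (1)       (2)
\end{verbatim}
\vspace{-6pt}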


The variety of trees that can be built to capture the parsing
information is known by a plethora of names. Even the general
category has more than one name. Because parse trees and their
brethren are often simply steps along the way to the final result,
they are often referred to as \emph{intermediate representations}
(IR's), \emph{forms} (IF's), or \emph{languages} (IL's). The three
terms are equivalent, implying a difference only in the level of
formality.


Sometimes the intermediate language takes on a life of its
own. Several systems that have multiple parsers, sophisticated
optimizers, and multiple code generators, have been developed and
marketed commercially. Each of these systems has its own common
virtual assembly language used by the various parsers and code
generators. These intermediate languages all began by connecting just
one parser to one code generator.


\emph{P-code} is an example IL that took on a life of its own. It
was invented by Niklaus Wirth as the IL for the ETH Pascal compiler.
Many variants of that compiler arose \cite{p-code}, including the UCSD
Pascal compiler that was used at Stanford to define an optimizer
\cite{Chow-83}. Chow's compiler evolved into the MIPS compiler
suite, which was the basis for one of the DEC C compilers--acc. That
compiler did not parse the same language nor use any code from the ETH
compiler, but the IL survived.


This article will start you on that path, showing how to build parse
trees and then abstract syntax trees.


\subsection{Tree Building Tools}


Tree building is so important that many parser generators have either
built-in support for tree building or provide an add-on tool to build
trees. Unfortunately, the tree building notation is quite diverse.


Therefore, I will describe the tree building process in terms of
semantic actions attached to the grammar. This will have the advantage
of showing how you can build trees even if your tool has no tree
building support. Under the hood, this is what a tree building tool
does anyway. It simply defines some data structures and annotates the
grammar with actions to create those data structures.


One thing worth noting is that most tree building tools have an
implicit ``style'' (or design philosophy) ingrained in them. Take
this as friendly guidance from the tool designer on how to define your
intermediate form. In the future, I'll point out some of the
trade-offs in the different tree building styles embodied in the
different tools, so you can make your own judgment calls. However,
for this first article on trees I will start with some
simple-to-build trees with nice properties. Many of the tools build
exactly these trees.


\vspace{-3pt}
\section{A Parse Tree of Typed Records}


The first type of intermediate form we will design could be called a
\emph{parse tree} of \emph{typed records}. As a parse tree
it will exactly capture all the steps taken to parse the input. The
tree will consist of typed records, where each different phrase parsed
will have a unique record type that corresponds to it. These distinct
record types will help us capture the parsing steps--perhaps in too
much detail, but we will prune later.


The useful connection between trees of typed records and parsing is
well-known. It has been developed from both the parsing side
\cite{yacc-meets-c++} and the object-oriented data structure side
\cite{demeter1}\cite{demeter2}\cite{demeter3}.


\vspace{-3pt}
\subsection{Defining the Types}


The first step in defining this type of intermediate form is defining
the record types. Define record types for each distinct token type and
for each distinct alternative.


First, create record types for the distinct token types.


\vspace{-3pt}


\begin{itemize}
\item{You will typically have record types for identifiers, string
literals, integers, real numbers, and perhaps some application
specific constants. Each of these token types needs a distinct record
type because each will have distinct semantic information fields.}


\item{You will also have a record type representing the punctuation
characters. All the punctuation marks can be represented by one
record type, or each punctuation mark can be given its own distinct
record type.}


\item{Keywords are an interesting special case. You may handle
keywords like punctuation marks or treat them as identifiers. I will
have more to say about keywords in next month's column.}


\item{If there are other types of tokens that do not fit in the
above categories, create record types for them also.}
\end{itemize}
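

Before moving on, here is a concrete C++ sketch of the first two
bullets. The base class anticipates the \emph{Classes and Variant
Records} subsection below, and the particular semantic fields are
illustrative assumptions, not requirements:

\vspace{-3pt}
\begin{verbatim}
#include <string>

class ast_base {
public:
    virtual ~ast_base() {}
};

class identifier : public ast_base {
public:
    std::string name; // the spelling,
        // kept for later symbol lookup
    identifier(const std::string &n)
        : name(n) {}
};

class integer : public ast_base {
public:
    int value; // semantic field:
        // the constant's value
    integer(int v) : value(v) {}
};
\end{verbatim}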


Second, create a record type for each alternative in your grammar.
Each alternative represents a distinct phrase that can be
parsed. Thus, each alternative needs a distinct record type to capture
its distinct parsing. For example, the expression non-terminal from
our scripting language has four distinct alternatives and would have
four distinct record types.


\vspace{-3pt}
\begin{verbatim}
expr: expr "+" expr // plus_expr
        | expr "-" expr // minus_expr
        | identifier // var_expr
        | integer // int_expr
        ;
\end{verbatim}


\subsection{Classes and Variant Records}


Depending on your implementation language you may have choices in how
to implement your typed records.


In an object-oriented language, it is generally best to make each
record type a distinct class joined by a common base class (or
hierarchy of base classes). Here is some fragmentary C++ to
illustrate:


\vspace{-3pt}
\begin{verbatim}
class ast_base { . . . };
class identifier : public ast_base
        { . . . };
class integer : public ast_base
        { . . . };
. . .
class plus_expr : public ast_base
        { . . . };
class minus_expr : public ast_base
        { . . . };
. . .
\end{verbatim}


Another good representation is to collect the types into a variant
record. Here is a Pascal fragment:


\vspace{-3pt}
\begin{verbatim}
ast_type =
        (identifier, integer, . . .
          plus_expr, minus_expr, . . .);


ast_record = record
          case ast_tag: ast_type of
          identifier: ( . . . );
          integer: ( . . . );
          . . .
          plus_expr: ( . . . );
          minus_expr: ( . . . );
          . . .
          end;
\end{verbatim}


If your language supports neither of those constructs but allows some
form of casting, put an explicit tag field at the beginning of each
record type and use that to do your selection. Here is a C fragment:


\vspace{-3pt}
\begin{verbatim}
typedef enum {
        identifier, integer, . . .
        plus_expr, minus_expr, . . .
        } ast_type;


struct identifier_s {
        ast_type tag;
        . . .
        };
struct integer_s {
        ast_type tag;
        . . .
        };
. . .


struct plus_expr_s {
        ast_type tag;
        . . .
        };
struct minus_expr_s {
        ast_type tag;
        . . .
        };
. . .
\end{verbatim}


\subsection{Member Fields}


A key part of defining the typed records is laying out their member
fields. Some semantic fields will be specific to your application.
However, all typed records representing non-terminals will need
pointers to their children. For example, the rule for plus
expressions has three parts: a left sub-expression, a plus token, and
a right sub-expression. Therefore, the \verb/plus_expr/ typed record
will have three pointers to point to the three children. You can
either create three explicit pointer fields or create an array of
pointers with three entries, as in:


\vspace{-3pt}
\begin{verbatim}
class plus_expr : public ast_base
        { ast_base *left, *plus, *right;
            . . . };
\end{verbatim}
\vspace{-9pt}
or:
\vspace{-6pt}
\begin{verbatim}
class plus_expr : public ast_base
        { ast_base *operands[3];
            . . . };
\end{verbatim}


\subsection{Omit Irrelevant Tokens}


At this point, it is worth mentioning a quick and simple optimization.
Omit tokens with no semantic relevance from the list of children.


\vspace{6pt}


\framebox[2.8in]{\parbox{2.5in}{\emph{
Any token which has no semantic information does not need a reference
field in the typed record representing the non-terminal that uses it.
}}}


\vspace{6pt}


In our example, the plus token carries no semantic information once we
know we are dealing with a \verb/plus_expr/. Therefore, we do not
need to create a pointer for it. We only need the left and
right fields or a two element array.
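

As a sketch, the slimmed-down record in the array style (assuming the
\verb/ast_base/ class from the earlier fragments) looks like this;
this two-element layout is what the \verb/execute/ functions later in
the column assume:

\vspace{-3pt}
\begin{verbatim}
class plus_expr : public ast_base {
public:
    // The "+" token is omitted; it
    // carries no semantic information.
    ast_base *operands[2]; // left, right
};
\end{verbatim}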




\subsection{Annotate Your Grammar}


You now have your record types defined. Add code to each alternative
that creates the matching type of record. Thus, our expression
example might look like this:


\vspace{-3pt}
\begin{verbatim}
expr: expr "+" expr
            { $$ = new plus_expr($1, $3); }
        | expr "-" expr
            { $$ = new minus_expr($1, $3); }
        | identifier
            { $$ = new var_expr($1); }
        | integer
            { $$ = new int_expr($1); }
        ;
\end{verbatim}


Note that by eliminating the fields for semantically insignificant
tokens, the constructors reference the same things that the original
desktop calculator did.
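

For instance, the constructor invoked by \verb/new plus_expr($1, $3)/
might look like the following sketch; the two-element layout matches
the earlier fragments, and the exact signature is an assumption:

\vspace{-3pt}
\begin{verbatim}
class plus_expr : public ast_base {
public:
    ast_base *operands[2];
    plus_expr(ast_base *left,
              ast_base *right) {
        operands[0] = left;  // $1
        operands[1] = right; // $3
    }
};
\end{verbatim}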


\section{Improve Your Intermediate Form}


Attention paid to designing your intermediate form can result in a
representation that is easier to use. If your application is nearly
trivial, simply building a parse tree can be the option that
requires the least mental effort. However, as your application
increases in complexity, effort spent on intermediate form design has
increasing benefit.


\subsection{Chain Productions and Copy Rules}


Good candidates for elimination from your intermediate form are
abstract syntax tree records that represent \emph{chain productions}.
Chain productions are special-case alternatives that merely convert one
non-terminal into another. They are called chain productions because
the conversion rule serves to ``chain'' the non-terminals together.
They are also sometimes called \emph{unit productions} because the rule is
one non-terminal or ``unit'' long.


Chain productions rarely have any semantic significance. Therefore,
you should not create abstract syntax tree nodes for them. Instead,
simply pass along the abstract syntax tree node of the child
non-terminal. The semantic actions that do this also have a special
name. They are called \emph{copy rules}. Copy rules merely copy the
semantic value from one non-terminal to another.


Implementing expression precedence by rules in the grammar is one way
to create a grammar with chain productions. For example, you may have
implemented expression precedence using terms and factors in your
grammar. These precedence distinctions are probably not important at
the semantic level. An abstract syntax tree that collapses the
distinctions between expression precedence levels will be easier to
manipulate in the semantic phase. For example, collapse:


\vspace{-3pt}
\begin{verbatim}
expr: term
            { $$ = new unary_term($1); }
        | term "+" expr
            { $$ = new term($1, $3); }
        ;
term: fact
            { $$ = new unary_fact($1); }
        | fact "*" term
            { $$ = new fact($1, $3); }
        ;
\end{verbatim}
\vspace{-5pt}
to:
\vspace{-5pt}
\begin{verbatim}
expr: term
            { $$ = $1; } // yacc default
        | term "+" expr
            { $$ = new term($1, $3); }
        ;
term: fact
            { $$ = $1; } // yacc default
        | fact "*" term
            { $$ = new fact($1, $3); }
        ;
\end{verbatim}
\vspace{-3pt}


It is worth noting that some parser generators even perform special
optimizations on chain productions and copy rules. For example, yacc
and its derivatives generally implicitly implement a copy rule that
copies the first non-terminal to the result at the beginning of
processing the rule. If no further semantics are required, the user
does not need to add any actions to the rule and the two actions marked
yacc default can be omitted.


\section{Adding Semantics}


Once you have an abstract syntax tree built for your grammar, you can
define semantic functions that process the tree that is created.
These functions typically start at the root of the tree and
recursively descend it. In a scripting language an important set of
functions are the ones that execute the script. Here are some C++
semantic functions for our scripting language:


\vspace{-3pt}
\begin{verbatim}
int plus_expr::execute() {
        return(operands[0]->execute() +
                      operands[1]->execute());
        }
int minus_expr::execute() {
        return(operands[0]->execute() -
                      operands[1]->execute());
        }
\end{verbatim}
\vspace{-3pt}


The functions that execute the scripting language mirror the original
desktop calculator semantic actions, although the indexes may change
since semantically irrelevant tokens have been omitted. This meets
the goal of allowing the actions to be executed repetitively, but
still having the same semantics for each action.


Of course, there is one key difference between these semantic
functions and the original parse actions. The execution of the semantic
functions is divorced from the parsing. One does not need to reparse
the input to have the semantic function reapplied. One simply calls
the semantic function again. Here is some C++ code that executes the
for-statement:


\vspace{-3pt}
\begin{verbatim}
int for_stmt::execute() {
        int start = start_expr->execute();
        int end = to_expr->execute();
        for (int i = start; i <= end; ++i) {
                var->assign(i);
                stmt->execute();
                }
        return(0); // statements yield no value
        }
\end{verbatim}
\vspace{-9pt}
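

For completeness, here is a hedged sketch of the leaf cases.
Representing a variable's current value as an \verb/int/ slot reached
through a \verb/var/ pointer is an assumption of this sketch, not
something fixed by the column:

\vspace{-3pt}
\begin{verbatim}
int int_expr::execute() {
        return(value); // the constant
                // captured at parse time
        }
int var_expr::execute() {
        return(var->value); // the slot
                // that for_stmt assigns
        }
\end{verbatim}
\vspace{-3pt}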


\subsection{Visitors}


One of the key advantages of building an abstract syntax tree is that
you can traverse it more than once and you can traverse it with a
variety of semantic functions. You can define as many distinct
semantic functions as you wish and they can traverse the tree in
different orders and perform different calculations at each node.


The next step in generalizing this is to separate the semantic
functions from the abstract syntax tree. This is the essence of the
\emph{visitor pattern}\cite{visitor}. In the visitor pattern the
abstract syntax tree defines an interface for the semantic functions
to use. However, the semantic functions are not part of the abstract
syntax tree. They are distinct functions that use the abstract syntax
tree's interface. This allows semantic functions to be added as
needed without any changes to the abstract syntax tree. There are
more variations on the visitor pattern than I can list--\cite{vis1},
\cite{vis2}, \cite{vis3}, \cite{vis4}, \cite{vis5}, \cite{vis6},
\cite{vis7}, \cite{vis8}, \ldots. Exploring the trade-offs of these
variations is outside the scope of this article, but is
something we will return to in the future.
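

Here is a minimal sketch of one common variation, reusing the node
names and two-element operand layout assumed in the earlier fragments;
a real design would choose among the cited variations:

\vspace{-3pt}
\begin{verbatim}
#include <iostream>

class plus_expr; // forward declarations
class int_expr;

class visitor {
public:
    virtual void visit(plus_expr &) = 0;
    virtual void visit(int_expr &) = 0;
};

class ast_base {
public:
    virtual void accept(visitor &) = 0;
    virtual ~ast_base() {}
};

class int_expr : public ast_base {
public:
    int value;
    int_expr(int v) : value(v) {}
    void accept(visitor &v)
        { v.visit(*this); }
};

class plus_expr : public ast_base {
public:
    ast_base *operands[2];
    plus_expr(ast_base *l, ast_base *r)
        { operands[0] = l;
          operands[1] = r; }
    void accept(visitor &v)
        { v.visit(*this); }
};

// One semantic function, kept outside
// the tree classes:
class print_visitor : public visitor {
public:
    void visit(plus_expr &n) {
        n.operands[0]->accept(*this);
        std::cout << " + ";
        n.operands[1]->accept(*this);
    }
    void visit(int_expr &n)
        { std::cout << n.value; }
};
\end{verbatim}

A new analysis is then just another subclass of \verb/visitor/; the
tree classes themselves never change.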


\section{Attribute Grammars}


There is a connection between attribute grammars and tree building
that is worth considering.


First, attribute grammars can be thought of as computations over an
abstract syntax tree. That is, an attribute grammar system implicitly
builds an abstract syntax tree and traverses it, performing the proper
computations at each point in the tree. Generally, this traversal
corresponds to the semantic functions we defined. However, many
attribute grammar systems only permit functions without side-effects
so that the attribute grammar system can safely implement them by
traversals that visit each branch of the tree only once. Some
attribute grammar systems will optimize the tree away if not needed.


Second, most intermediate forms can be created by an attribute
grammar. In particular, most intermediate forms can be computed as a
\emph{synthesized attribute}. A synthesized attribute is simply the
return value of a recursive descent parsing routine or a
\emph{dollar} variable of the yacc stack. A synthesized
attribute is the type of attribute that can be computed by either an
LL or an LR parser during the parsing pass. Therefore, if your
attribute grammar system does not preserve the tree for your access,
you can define an attribute that creates the desired tree.
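

As a sketch, building the tree as a synthesized attribute is just the
earlier accumulate-as-you-go pattern with node pointers in place of
values. This reuses the hypothetical token cursor from the first
sketch and the constructors sketched earlier:

\vspace{-3pt}
\begin{verbatim}
// The return value is the synthesized
// attribute: a pointer to a subtree.
ast_base *parse_primary() {
    return new int_expr((tp++)->value);
}

ast_base *parse_expr() {
    ast_base *left = parse_primary();
    while (tp->k == PLUS) {
        ++tp; // skip "+"
        left = new plus_expr(left,
                   parse_primary());
    }
    return left; // attribute for expr
}
\end{verbatim}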


\section{Questions}


Here are some questions to prompt further thought.


\begin{enumerate}


\item{What are the trade-offs of using an off-the-shelf scripting
language compared to a home-grown one?}


\item{What intermediate languages are you aware of? What are their
strong points? Their weaknesses? How were they similar? \ Different?}


\item{What are the advantages and disadvantages in using a distinct
record type for each punctuation mark in the language?}


\item{What kinds of tokens do not need to have record types
representing them? Is there a simple rule of thumb to apply?}


\item{Is it possible to use the visitor pattern in a
non-object-oriented language? How is the pattern affected?}


\end{enumerate}


\bibliography{c3}


\bibliographystyle{alpha}


\SpewBio


\end{document}

