Re: concrete vs. abstract syntax trees

"Ira Baxter" <idbaxter@semdesigns.com>
Sat, 14 Feb 2009 12:02:30 -0600

          From comp.compilers

Related articles
concrete vs. abstract syntax trees eliben@gmail.com (eliben) (2009-02-13)
Re: concrete vs. abstract syntax trees haberg_20080406@math.su.se (Hans Aberg) (2009-02-14)
Re: concrete vs. abstract syntax trees idbaxter@semdesigns.com (Ira Baxter) (2009-02-14)
Re: concrete vs. abstract syntax trees cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-14)
Re: concrete vs. abstract syntax trees cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-14)

From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Sat, 14 Feb 2009 12:02:30 -0600
Organization: Compilers Central
References: 09-02-050
Keywords: AST
Posted-Date: 14 Feb 2009 16:53:48 EST

"eliben" <eliben@gmail.com> wrote in message
> I'm trying to nail down the difference between concrete and abstract
> syntax trees.


A concrete syntax tree is one that matches the productions that drive
the parsing process.


> Is this correct to say that if I semi-mechanically translate the BNF
> of a language to a yacc grammar, what I'll get from parsing is a
> concrete syntax tree?


No. YACC style tree construction typically does not construct one
node per production (although the opportunities to construct nodes are
presented once per production).
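
For example, a hand-written yacc action usually builds one node for
the whole statement and drops the punctuation tokens entirely. A
minimal sketch (the token names and the mk_assign constructor are
hypothetical, not from any particular parser):

    /* One *opportunity* to build a node per production, but this
       action builds a single assignment node and never creates
       nodes for the ":=" and ";" tokens. */
    stmt : lhs ASSIGN exp SEMI   { $$ = mk_assign($1, $3); }
         ;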


> If so, this tree is almost useless for any kind
> of interesting interaction (especially for languages like C where some
> things like types don't fit hierarchically in a way that resembles the
> concrete syntax). Is this right?


I don't see it that way. (You should manage type information
in the symbol table.) The problem with CSTs is that they
technically contain nodes for tokens that control the syntax but
don't typically contribute to the semantics. In the grammar rule,
              STMT = LHS ":=" EXP ";"
a concrete syntax tree for STMT would have 4 children, one
for each grammar token on the right-hand side, notably
ones for ":=" and ";". These latter tokens help distinguish
the assignment statement from other statements, but don't
contribute to the semantic interpretation once a STMT has
been detected. So the concrete syntax tree tends to contain
many tree nodes that aren't interesting to visit after parsing.
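
To make that concrete, here is a minimal sketch in C of the two
shapes (the Node type and constructors are illustrative assumptions,
not DMS code):

    #include <stdlib.h>

    /* A bare-bones tree node: a kind tag and up to four children. */
    typedef struct Node {
        const char  *kind;              /* "STMT", "LHS", ":=", ";", ... */
        int          nkids;
        struct Node *kids[4];
    } Node;

    static Node *mknode(const char *kind, int nkids) {
        Node *n = calloc(1, sizeof *n);
        n->kind = kind;
        n->nkids = nkids;
        return n;
    }

    /* CST for STMT = LHS ":=" EXP ";" -- four children, two of which
       (":=" and ";") control syntax but carry no semantics. */
    Node *cst_stmt(Node *lhs, Node *exp) {
        Node *n = mknode("STMT", 4);
        n->kids[0] = lhs;
        n->kids[1] = mknode(":=", 0);
        n->kids[2] = exp;
        n->kids[3] = mknode(";", 0);
        return n;
    }

    /* AST for the same statement: only the two meaningful children. */
    Node *ast_stmt(Node *lhs, Node *exp) {
        Node *n = mknode("STMT", 2);
        n->kids[0] = lhs;
        n->kids[1] = exp;
        return n;
    }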


OTOH, CSTs are easy and brainless to construct from the grammar. If
you constructed one, and simply didn't visit the dull nodes, you'd
have pretty much all the advantages of an AST. People don't do this
with YACC because people are lazy, and they'd have to write a node
constructor for every grammar token and they (rightfully) don't see
the point of doing this.
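
Skipping the dull nodes is just a filtered traversal. A sketch,
reusing the hypothetical Node type above (is_dull is an assumed
predicate over token kinds):

    #include <string.h>

    /* Treat punctuation leaves as if they weren't there, giving an
       AST-like walk over an unmodified CST. */
    static int is_dull(const Node *n) {
        return n->nkids == 0 &&
               (strcmp(n->kind, ":=") == 0 || strcmp(n->kind, ";") == 0);
    }

    void visit(const Node *n, void (*f)(const Node *)) {
        if (is_dull(n)) return;        /* never surface the token node */
        f(n);
        for (int i = 0; i < n->nkids; i++)
            visit(n->kids[i], f);
    }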


However, if you intend to do general-purpose program transformation,
as we do with our DMS product, having representatives of the CST nodes
around allows you to retain position information to regenerate the
source text with high degrees of accuracy, including white space gaps.
And if you have automated tree matchers, they can automatically avoid
visiting such keyword nodes.
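
One way to keep that information is to hang source coordinates and
the preceding whitespace off every node, punctuation included, so a
prettyprinter can replay it. The field names below are illustrative
guesses, not DMS internals:

    /* Per-node source information retained for regeneration. */
    typedef struct SourceInfo {
        const char *file;
        int         line, column;
        const char *leading_ws;  /* exact whitespace/comments before token */
    } SourceInfo;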


A second downside to CSTs is their size; they're simply bigger than
ASTs in which such nodes aren't present. If you are going to
manipulate only a single "source file", this probably won't matter
with 32 or 64 bit address spaces. If you're going to process
thousands of files, then size matters. If nothing else, fewer nodes
mean fewer memory touches; we work at a scale where the caches easily
overflow with nodes, so memory touches materially affect performance.


> Also, do compilers more commonly go through the concrete-syntax
> stage or do they usually build the AST directly in the parser?


The YACC style tries to produce AST nodes directly while parsing. The
price is that the parser-writer has to invent the entire AST node
vocabulary, and write node-constructing code in all the relevant
grammar productions (as in the yacc sketch above). For many modern
languages with large grammars, this is a modest but real amount of
work. Worse, as the language grammar evolves (and they do!), somebody
has to continually go back and revise the AST-constructing part.
Granted, not very hard.


We've tried to get the best of both worlds. Our parsers *logically*
construct CSTs straight from the grammar, using a GLR parser. But we
automatically compress out "non-value-carrying" tokens (leaves such
as ":=" and ";") and unary productions that have no semantic import.
We detect these either by inspecting attribute evaluators over the
grammar (if an attribute grammar rule simply copies values, you don't
need the CST node for that rule) or via an explicit "keep this"
annotation on the grammar rule. Finally, we introduce list nodes
wherever we detect list-forming productions in the grammar (various
generalizations of L = e ; and L = L "," e ; for list elements e and
non-value-carrying comma tokens).
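
For illustration, flattening the usual left-recursive spine into one
list node might look like this, continuing the hypothetical Node
sketch above (the CST shape assumed is: 1 child for L = e, 3 children
for L = L "," e):

    /* Collect the elements of L(L(L(e1), ",", e2), ",", e3) into
       `out`, dropping the comma tokens; returns the element count. */
    static int collect(Node *n, Node **out, int max) {
        if (n->nkids == 1) {                      /* L = e       */
            out[0] = n->kids[0];
            return 1;
        }
        int k = collect(n->kids[0], out, max);    /* L = L "," e */
        if (k < max)
            out[k++] = n->kids[2];                /* kids[1] is the "," */
        return k;
    }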


The result looks amazingly like an AST you might construct by hand,
but at no effort to do so. A nice consequence is that all the
"syntax" hacking (tree matching, tree splicing) is done in terms
of virtual CSTs, with the DMS machinery covering up all the
differences. There's even a low-level API called "GetNthGrammarChild"
that allows procedural code to walk around the tree as if the CST
existed.
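
Roughly, the shape of such a call (the signature here is only
illustrative, not the actual DMS API):

    /* Illustrative only. Logically the CST for STMT = LHS ":=" EXP ";"
       has four children, so asking for grammar child 2 (the EXP
       position, 0-based) must succeed even if the stored tree
       physically dropped the ":=" node. */
    Node *GetNthGrammarChild(Node *node, int nth);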


In practice, we can turn the various tree compression accelerators
(eliminate terminals, eliminate unary, form lists) on and off. This
lets us debug the machinery when we occasionally break it.


--
Ira Baxter, CTO
www.semanticdesigns.com


