Re: Generic AST in XML for any language

"Ira Baxter" <>
Sat, 13 Mar 2010 11:31:56 -0600

          From comp.compilers


From: "Ira Baxter" <>
Newsgroups: comp.compilers,comp.lang.c++
Date: Sat, 13 Mar 2010 11:31:56 -0600
Organization: Compilers Central
References: 10-03-020
Keywords: XML, AST
Posted-Date: 13 Mar 2010 12:51:13 EST
X-RFC2646: Format=Flowed; Original

The OMG is pushing something called the Abstract Syntax Tree Metamodel
(ASTM). This attempts to define a universal AST having low-level
details such as expressions, operators, operands, the usual. It
doesn't quite succeed because you can't model every language nicely in
a single representation (the UNCOL problem again!). It compromises by
allowing "Specific" Abstract Syntax Tree models for specific languages,
which is a concession to the UNCOL problem but not a satisfactory one.

Here's a (rather dated) tutorial:
You can find the actual standards documents at the OMG site.
It has an interchange model in XML.

Because there are competing factions inside OMG for every standard
(sigh), there's *another* "standard" called the Knowledge Discovery
Metamodel (KDM). This models programs in rather larger chunks, as the
OP requested (functions, statements, but not details inside statements
AFAIK), but because nobody is satisfied with that, they attempt to model
flow information between the chunks. So it tries to be chunkier than
the AST (easier to produce results) but more detailed at the same time.

There aren't a lot of folks writing tools to produce this information.
We build lots of language front ends and have been asked a number of
times (by OMG people involved in these standards) why we don't produce
data in either form. The answer is pretty
simple: putting data in this form is only useful if you have strong
machinery that can consume it and reason from it.

I think the KDM is wrong-headed in that it is lossy; you don't
get the details of the code, and those are needed for deep reasoning.
We only want to do deep reasoning (shallow reasoning leads
to useless answers or bad false positives), so that's not a good route.
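
As a rough illustration of what "lossy" means here, the sketch below
builds a chunk-level model of a program: just function names and the
calls between them. It is written in Python with the standard `ast`
module purely for convenience; the `coarse_model` function and its
output shape are hypothetical and are not the KDM schema.

```python
import ast


def coarse_model(source: str) -> dict:
    """Map each top-level function name to the set of names it calls.

    Everything inside the statements (operators, constants, types)
    is deliberately thrown away.
    """
    model = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            model[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return model


# Two programs that differ only inside an expression.
plus_version = "def f(x):\n    return g(x + 1)\n"
minus_version = "def f(x):\n    return g(x - 1)\n"

# The coarse model cannot tell them apart.
print(coarse_model(plus_version) == coarse_model(minus_version))
```

Any analysis that depends on whether the code adds or subtracts has
nothing to work with once the program is reduced to this form.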

The ASTM at least tries not to be lossy, but it attempts to jam
everything into a single model.
My personal belief is that attempting to jam every language's
syntax into one model pretty much makes deep reasoning impossible
because you again lose detail.

So, I have a "+" operator in the AST. What does it mean?
"+" in C with 2's complement non-flows?
String "+" in Java? Python "+" with infinite precision?
You can fix this by marking "+" with the precise dialect from which
it came, but now I have "+~C", "+~Java.stringtype", "+.Python2_6",
and it's hard to argue I now have a universal tree.
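
To make the "+" problem concrete, here is a minimal sketch in Python.
The `Add` node, the dialect tags, and `evaluate` are all hypothetical
names made up for illustration; they do not come from the ASTM or from
any real tool. The C branch assumes 32-bit wraparound as seen on typical
2's complement hardware, and the Java branch is only a rough stand-in
for string conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Add:
    """A '+' node tagged with the (hypothetical) dialect it came from."""
    dialect: str
    left: object
    right: object


def wrap_int32(n: int) -> int:
    """Reduce n to a signed 32-bit 2's complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def evaluate(node: Add):
    """Evaluate '+' per dialect; each branch is really a different operator."""
    if node.dialect == "C.int32":
        # Assumption: wraparound on overflow (ISO C leaves this undefined).
        return wrap_int32(node.left + node.right)
    if node.dialect == "Java.String":
        # Assumption: str() approximates Java's string conversion.
        return str(node.left) + str(node.right)
    if node.dialect == "Python":
        # Arbitrary-precision integer addition.
        return node.left + node.right
    raise ValueError(f"unknown dialect: {node.dialect}")


big = 2**31 - 1
print(evaluate(Add("C.int32", big, 1)))
print(evaluate(Add("Python", big, 1)))
print(evaluate(Add("Java.String", "n=", 1)))
```

The node shape is shared, but a consumer still has to branch on the
dialect tag before it can reason about the result, so the "universal"
part buys very little.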

Finally, I'm more interested in the reasoning results than the
intermediate tree representation. So we've concentrated on building
specific trees for specific language dialects, and building
language-specific analyzers (using generic analysis support machinery
to the extent we can define it). So we concentrate on building
machinery to process the trees (and downstream analyses such as
control- and data-flow, points-to analysis, ....).

Our tools can export XML for the trees we generate. This is pretty
easy to do. But we do it for "checkbox" reasons: if somebody asks, we
can do it. None of our customers actually use this, because 1) the
trees are enormous as text files, and 2) after you've exported the XML
from our tools, you are in a tool vacuum. Now what do you do with it?
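
For a sense of how easy the export is and how large the result gets,
here is a small sketch that dumps a Python syntax tree as XML using only
the standard library (`ast` and `xml.etree.ElementTree`). The element
and attribute layout chosen in `to_xml` is an arbitrary assumption for
illustration, not any standard interchange format and not the format of
any particular tool.

```python
import ast
import xml.etree.ElementTree as ET


def to_xml(node: ast.AST) -> ET.Element:
    """Convert an ast node into an XML element, recursively."""
    elem = ET.Element(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # Single child node: wrap it in an element named after the field.
            ET.SubElement(elem, name).append(to_xml(value))
        elif isinstance(value, list):
            # List of children (statements, targets, ...).
            holder = ET.SubElement(elem, name)
            for item in value:
                if isinstance(item, ast.AST):
                    holder.append(to_xml(item))
                else:
                    ET.SubElement(holder, "value").text = repr(item)
        elif value is not None:
            # Plain leaf data (identifiers, constants) becomes an attribute.
            elem.set(name, repr(value))
    return elem


source = "total = price * qty + tax\n"
xml_text = ET.tostring(to_xml(ast.parse(source)), encoding="unicode")

# Compare the size of the source line with the size of its XML tree.
print(len(source), len(xml_text))
print(xml_text)
```

Even for one short assignment the XML is many times the size of the
source, and that is before adding source positions, types, or any
analysis results.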

Ira Baxter, CTO

"Kalahan" <> wrote in message
> Does anyone knows if there is such thing as an standard to represent
> the basic elements of a language (functions, variables, classes)? And
> generated in XML?
> I know that the title might be misleading about the meaning of an AST
> but I have a project in mind and I don't want to replycate work. Also
> that might be aiming too high if we start adding functional languages,
> aspect oriented programming, etc
> Also I would appreciate if you could point me to projects where I can
> get a good XML representation of a source file.
