Re: source-to-source translators, what language for unparsers

Stephen Horne <>
Wed, 06 May 2009 04:08:28 +0100

          From comp.compilers


Newsgroups: comp.compilers
References: 09-05-026
Keywords: translator
Posted-Date: 06 May 2009 05:58:51 EDT

For direct source-to-source, I've played with TXL.

My main thought was that while it's certainly Turing complete, I
didn't much like having to represent all intermediate data structures
and algorithms as text/AST translation.

I've been giving thought to an AST-walking DSL recently. Not a tested
idea, but maybe worth considering, because the underlying model *is*
tested and known useful.

I have already written a DSL which does some AST-handling code
generation. Basically, it generates a single-inheritance class
hierarchy in C++ for the nodes (trivial, obviously). The real point of
it is to generate some code that exploits knowledge of the class
hierarchy, particularly multiple-dispatch functions. The original idea
was stolen from treecc.

It can also generate tree-walking iterator classes according to a
specification of which pointer-fields to 'descend' through and which
to 'yield' in which order, using a single-dispatch-style class
resolution. These are often useful because they avoid the need to pass
context stuff as parameters through layers of recursion.

A typical iterator (in this case, a simple binary tree traversal)
might be specified as follows...

  iterator <name> : <nodename> yield <yieldname> ;

        <name> : generated iterator class name
        <nodename> : class node for the starting point node
        <yieldname> : class for yielded node pointers

    iterate <name> : <nodename> yield <yieldname>
        descend m_Left;
        descend m_Right;

        <name> : generated iterator class name
        <nodename> : specific node class iterated over by this case
        <yieldname> : class for yielded node pointers
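As a rough hand-written sketch of what such a generated iterator class
might look like in C++ (the names Node and PreorderIterator are mine,
and I've picked a pre-order yield with an explicit stack purely for
illustration - it's not what the DSL actually emits):

```cpp
#include <cassert>
#include <stack>
#include <vector>

// Illustrative node class, roughly what a generated single-inheritance
// hierarchy might provide for a binary tree.
struct Node {
    int m_Value;
    Node* m_Left = nullptr;
    Node* m_Right = nullptr;
    explicit Node(int v) : m_Value(v) {}
};

// Sketch of an iterator for "descend m_Left; descend m_Right": the
// recursion is flattened onto an explicit stack, so the caller pulls
// one yielded node at a time with no context parameters threaded
// through recursive calls.
class PreorderIterator {
public:
    explicit PreorderIterator(Node* root) { if (root) m_Stack.push(root); }

    Node* Next() {                        // nullptr when exhausted
        if (m_Stack.empty()) return nullptr;
        Node* n = m_Stack.top();
        m_Stack.pop();
        // Push the descend-fields in reverse so m_Left is visited first.
        if (n->m_Right) m_Stack.push(n->m_Right);
        if (n->m_Left)  m_Stack.push(n->m_Left);
        return n;                         // "yield" the node pointer
    }

private:
    std::stack<Node*> m_Stack;
};
```

The point of the shape is that iteration state lives in the object, not
in a call stack, so callers can interleave other work between yields.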

It occurred to me that the iterator classes are basically data-driven
generators (in the Icon/Python sense) implemented using a very simple
virtual machine. The underlying model is recursive method calls based
on the classes of the AST nodes iterated over. Each 'iterate' block is
a single-dispatch function.

This model can be extended to multiple dispatch by extending the
'call' system, and a lot of the work for that would be based on my
existing multiple dispatch logic - building the dispatch decision
table, converting to a decision tree, and minimising into an acyclic
digraph. The biggest complication after that is language design -
having multiple function names for the same "iterator", whether the
iterator start point should have multiple parameters, should the
iterate handler functions be shared between different iterators or
should the iterate handlers simply be seen as private functions with
the 'iterator' starting-point declaration being like a 'public'
declaration, etc.
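For illustration, here's a minimal sketch of two-argument dispatch
resolution - a flat table keyed on node kinds, where the real
generated code would be a minimised decision tree/DAG (the Kind enum
and handler names are mine, invented for the example):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Illustrative node kinds; a generated dispatcher would derive these
// from the class hierarchy rather than hard-coding an enum.
enum class Kind { Int, Str };

struct Expr {
    Kind kind;
    explicit Expr(Kind k) : kind(k) {}
};

// Two-argument dispatch table: resolution looks at the runtime kinds
// of *both* arguments, which single virtual dispatch can't do directly.
using Handler = std::function<std::string(const Expr&, const Expr&)>;

std::map<std::pair<Kind, Kind>, Handler> table = {
    {{Kind::Int, Kind::Int}, [](const Expr&, const Expr&) { return std::string("add-ints"); }},
    {{Kind::Str, Kind::Str}, [](const Expr&, const Expr&) { return std::string("concat"); }},
};

std::string Dispatch(const Expr& a, const Expr& b) {
    auto it = table.find({a.kind, b.kind});
    return it != table.end() ? it->second(a, b) : "no-match";
}
```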

Beyond that, the thought occurs that instead of yielding AST node
pointers, the model could be based on returning and concatenating
strings - basically bypassing the need for code that loops over the
iteration. Obviously, that's just a limited version of what my
existing multiple-dispatch functions can do, but one that can be
compiled to a data-driven virtual-machine-like model.
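The string-returning model is easy to show with plain virtual dispatch
- a minimal hand-written sketch (single dispatch only; the class names
are illustrative, not generated output):

```cpp
#include <cassert>
#include <string>

// Minimal string-returning unparse: each node class builds its text by
// concatenating its children's text, so no external loop over an
// iteration is needed.
struct Tree {
    virtual ~Tree() = default;
    virtual std::string Unparse() const = 0;
};

struct Num : Tree {
    int value;
    explicit Num(int v) : value(v) {}
    std::string Unparse() const override { return std::to_string(value); }
};

struct Add : Tree {
    const Tree* lhs;
    const Tree* rhs;
    Add(const Tree* l, const Tree* r) : lhs(l), rhs(r) {}
    std::string Unparse() const override {
        return "(" + lhs->Unparse() + " + " + rhs->Unparse() + ")";
    }
};
```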

For the moment, though, I'm working on a simpler event-driven
text-generating DSL, generating C++ classes in which the method calls
(and output strings) map to regular grammar terminal tokens. The
resulting automata have two legal classes of state - input states
(waiting for a method call, all outward transitions are labelled with
method call events) and transient states (only one outward transition
allowed, must be labelled with an output event). Transient states
won't really appear in output code - they just make modelling easier.
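A hand-written sketch of the kind of class such a DSL might emit (the
names and the toy grammar are mine): each method call is a
terminal-token event, transient output states are folded into the
emitting method, and calls that are illegal in the current state are
rejected:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Toy generated-style writer for a bracketed, comma-separated list.
// Input states wait for a method-call event; the output strings that
// transient states would emit are folded into the handling methods.
class ListWriter {
public:
    void Open() {
        if (m_State != State::Start) throw std::logic_error("Open: bad state");
        m_Out += "[";
        m_State = State::First;
    }

    void Item(const std::string& s) {
        if (m_State == State::First)     m_Out += s;
        else if (m_State == State::Rest) m_Out += ", " + s;
        else throw std::logic_error("Item: bad state");
        m_State = State::Rest;
    }

    void Close() {
        if (m_State != State::First && m_State != State::Rest)
            throw std::logic_error("Close: bad state");
        m_Out += "]";
        m_State = State::Done;
    }

    const std::string& Text() const { return m_Out; }

private:
    enum class State { Start, First, Rest, Done };
    State m_State = State::Start;
    std::string m_Out;
};
```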

I was going to switch to a PDA model, but changed my mind after
realising I could simplify my substitution variable handling without
switching models. Sometimes, you need to take a while off and think
about other things.

The reason for this event-driven approach is that I can get away with
it. No-one's paying me, or ever likely to. I'm more interested in
regular grammars (and digraph drawing for error reporting) than I am
in multiple dispatch, decision-tree handling etc., which I've already
done. Not exactly pragmatic, but then, if someone wants to pay me to
do this so I can get off incapacity...

Anyway, a multiple-dispatch function-call method of processing an AST,
whether to iterate, build text strings, create a new AST or whatever,
is a tried and tested model. Building direct AST-to-text virtual
machine stuff based on this model may be new, AFAIK, but it's a new
application of old ideas and should be pretty effective.

My own decision tree handling is overkill (based on an ID3 variant) -
it can generate simpler/faster decision code, but rarely does, and the
cost is much slower code generation. The basic idea of a decision tree
minimised to a DAG works well, though.

There are some tricks I've found useful for the multiple-dispatch
specification itself, though. One thing is the option to specify
priorities for specific function variants, especially shorthand
'strong' and 'weak' flags. This gives a simple ambiguity resolution
mechanism. Another is the option to treat null pointers as a special
case in the dispatch resolution. Dispatch based on boolean parameters
is useful, too. Simple predicates are also occasionally useful, based
both on parameter types and on boolean parameters. 'Kind only'
parameters are occasionally handy, though they mostly end up referring
to non-node enumerated values - the parameter for these is just an
integer specifying a node type, not a node instance, but it still
takes part in dispatch resolution decisions.
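A toy sketch of that resolution order - strong/weak priorities plus a
null-pointer special case (the encoding here is invented for
illustration, not the DSL's actual tables):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Each dispatch variant carries a priority (strong = 2, default = 1,
// weak = 0) and may be written specifically for a null argument.
struct Variant {
    bool matchesNull;   // true: this variant handles the null-pointer case
    int  priority;      // strong > default > weak
    std::string name;
};

// Pick the highest-priority variant whose null-ness matches the
// argument; priorities give a simple ambiguity-resolution mechanism.
std::optional<std::string> Resolve(bool argIsNull,
                                   const std::vector<Variant>& vs) {
    const Variant* best = nullptr;
    for (const auto& v : vs) {
        if (v.matchesNull != argIsNull) continue;
        if (!best || v.priority > best->priority) best = &v;
    }
    if (!best) return std::nullopt;
    return best->name;
}
```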

If you'd like a look at this DSL, you're welcome - I can e-mail it to
you. Unusually for me, it has a reasonable manual. Partly there's too
much for me to remember myself, partly dreams of selling the thing -
if you were living on incapacity benefit, you'd dream too. The source
won't help much. You'd have to know your way around too many of my
libraries, and there are VC++ dependency problems - I got caught by a
dependent-types-in-template-parameters issue in my container
libraries. GCC chokes badly, and I haven't figured out a portability
fix yet.

The binary is enough to play with, though, if you have access to
Windows. Generated code is standards-compliant and fairly readable for
the most part, and error checking is pretty thorough. There's a minor
standards issue with the iterator code using offsetof in a way that
usually works, but which isn't guaranteed by the standard. When I
wrote it, I thought it was safe. GCC should be fine, though I haven't
tested it.
