Re: Incorporating comments in syntax tree?

Dave Gillespie <synaptx!thymus!>
2 Feb 1996 09:48:21 -0500

          From comp.compilers

From: Dave Gillespie <synaptx!thymus!>
Newsgroups: comp.compilers
Date: 2 Feb 1996 09:48:21 -0500
Organization: Compilers Central
References: 96-01-121 96-01-140
Keywords: syntax, Pascal, C

(A. Grant) writes:
> Does anyone know of any techniques for reading comments in compiler
> input and associating them with the syntax tree [...]

[...] writes:
> The approach I've used in the past -- admitted in desperation -- was
> to include an additional type of comment which was part of the syntax.

I wrote a Pascal to C translator, p2c, some time ago which attempted
to deal intelligently with comments and whitespace. You can FTP a
copy from or and try it out.

P2c works by parsing the Pascal code into parse trees, fiddling with
the trees in various ways (including some code-moving
transformations), then generating output that is supposed to be nice
enough to use as maintainable source code. P2c devotes a fair amount
of effort to dealing with comments, and does a decent job of it (but
hardly a perfect one!). Comments are, both literally and in the more
general linguistic sense, right on the boundary between computer
language and "natural" language.

P2c uses an ad-hoc recursive descent parser and lexical analyzer (a
survival tactic for coping with the nightmarish mixture of Pascal
dialects p2c must understand). The lexer does not actually generate
tokens for comments, but rather appends them to a grand
list-of-comments as it goes. Each comment is tagged according to its
position, such as left-column, indented-as-code, or
trailing-a-statement. Programmers use blank lines in much the same way
they use comments, so p2c also records blank lines as a special sort
of delimiter-free comment.
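
As an illustration of the comment list described above, here is a
minimal sketch in Python. It is not p2c's code: the `Comment` record,
the `collect_comments` function, and the kind names are hypothetical,
and the scanner assumes brace comments that never span lines, never
nest, and never occur inside string literals.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str   # comment body without delimiters; "" for a blank line
    kind: str   # "blank", "left-column", "indented", or "trailing"
    line: int   # 1-based source line


def collect_comments(source):
    """Record every {...} comment and every blank line, tagged by position.

    Assumption (for illustration only): at most one comment per line,
    and comments do not span lines.
    """
    comments = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            # A blank line is kept as a delimiter-free comment.
            comments.append(Comment("", "blank", lineno))
            continue
        start = line.find("{")
        end = line.find("}", start + 1)
        if start < 0 or end < 0:
            continue  # no complete comment on this line
        text = line[start + 1:end].strip()
        if line[:start].strip():
            kind = "trailing"       # code precedes the comment
        elif start == 0:
            kind = "left-column"
        else:
            kind = "indented"
        comments.append(Comment(text, kind, lineno))
    return comments
```

The list stays separate from the token stream, so a parser built on
top of it never has to accept a comment token in its grammar.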

Then, it uses a combination of comments-as-tokens and serial numbers
to attach comments to code. As it parses statements, it assigns them
increasing serial numbers. Comments get tagged with the current
serial number as they go on the comment list. Full-line comments can
be either "pre-" the following statement or "post-" the preceding
statement. If there is a series of comments and blank lines between
two statements, p2c picks a midpoint in the series by finding the
largest number of contiguous blank lines, and assigns all comments
before that point as "post-" the preceding serial number, and all
comments after that point as "pre-" the following serial number. Then
it can do all the code-moving, code-destroying, and code-introducing
transformations it likes. When it comes time to emit the result in
the form of C statements, p2c emits each comment pre- or post- the C
statement with the nearest matching serial number. Comments that
originally trailed a statement on the same line reattach if a
statement with their serial number survives, otherwise they turn into
full-line comments and attach to the closest statement they can. The
idea is to do the best possible job with zero risk of accidentally
dropping a comment between the cracks.
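
The midpoint rule and the nearest-serial-number emission can be
sketched as follows. This is an illustrative Python sketch, not p2c's
implementation: `Note`, `split_run`, and `emit` are hypothetical names,
and two details are assumptions the description above leaves open: the
blank lines forming the split go with the preceding statement, and a
run with no blank lines at all is attached entirely to the following
statement.

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str         # "" stands for a blank line
    serial: int = 0   # serial number of the statement it is attached to
    where: str = ""   # "pre" or "post"


def split_run(run, prev_serial, next_serial):
    """Attach a run of comments and blank lines lying between two statements.

    The split point is the end of the longest stretch of contiguous
    blank lines (the first such stretch wins a tie).
    """
    best_len, best_end = 0, 0   # no blank lines: everything is "pre"
    i = 0
    while i < len(run):
        if run[i].text == "":
            j = i
            while j < len(run) and run[j].text == "":
                j += 1
            if j - i > best_len:
                best_len, best_end = j - i, j
            i = j
        else:
            i += 1
    for k, note in enumerate(run):
        if k < best_end:
            note.serial, note.where = prev_serial, "post"
        else:
            note.serial, note.where = next_serial, "pre"
    return run


def emit(statements, notes):
    """Interleave notes with the surviving (serial, text) statements.

    Each note goes to the surviving statement with the nearest serial
    number, so no note is dropped. `statements` must be non-empty.
    """
    serials = [s for s, _ in statements]
    pre = {s: [] for s in serials}
    post = {s: [] for s in serials}
    for note in notes:
        target = min(serials, key=lambda s: abs(s - note.serial))
        (pre if note.where == "pre" else post)[target].append(note.text)
    out = []
    for serial, text in statements:
        out += [f"/* {t} */" if t else "" for t in pre[serial]]
        out.append(text)
        out += [f"/* {t} */" if t else "" for t in post[serial]]
    return out
```

In this sketch a tie between two equally near serial numbers goes to
the earlier statement, which is an arbitrary choice. The point is that
every note lands somewhere, even when the statement it was originally
attached to no longer exists.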

The parser has a number of ad-hoc rules to deal with special cases.
For example, Pascal treats semicolons as statement separators, which
means in `writeln; {write a blank line}', the natural lex/parse order
would want to associate the comment with the *following* statement!
P2c deals with this with some special insider trading between the
parser and the lexer. Also, in Pascal the words "begin" and "end",
which are not themselves statements, are often written on their own
lines with trailing comments; these comments get a special comment
class and are associated with the enclosing compound statement. P2c
mostly uses a different mechanism to handle comments on declarations
(as opposed to statements). Etc., etc. These ad-hoc rules are much
like treating comments as tokens wherever convenient or necessary,
but with a good fallback mechanism for the many cases in the grammar
where you don't want to (or didn't remember to) deal with comments.
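
The semicolon case can be illustrated with a toy tokenizer. This is a
hedged Python sketch of the idea only, not the mechanism p2c uses: the
`TOKEN` pattern and `tag_comments` are hypothetical, statements are
simply runs of non-blank text separated by semicolons, and comments
are assumed to sit on a single line.

```python
import re

# Toy tokens: {comment}, semicolon, newline, or any other run of code text.
TOKEN = re.compile(
    r"\{(?P<comment>[^}]*)\}|(?P<semi>;)|(?P<nl>\n)|(?P<word>[^\s;{]+)"
)


def tag_comments(source):
    """Return (comment_text, serial) pairs for a toy statement stream.

    `serial` counts the statement currently being read, starting at 1.
    """
    serial = 1
    just_after_semi = False   # saw ';' with no newline or code since
    tagged = []
    for m in TOKEN.finditer(source):
        if m.lastgroup == "comment":
            # Special case: a comment right after ';' on the same line
            # belongs to the statement that the ';' just ended.
            owner = serial - 1 if just_after_semi else serial
            tagged.append((m.group("comment").strip(), owner))
        elif m.lastgroup == "semi":
            serial += 1
            just_after_semi = True
        else:
            # A newline or more code closes the "right after ';'" window.
            just_after_semi = False
    return tagged
```

Without the `just_after_semi` flag, the comment in
`writeln; {write a blank line}` would receive the serial number of the
statement that follows, since the counter has already advanced by the
time the comment is read.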

Another advantage of p2c's method is that it allows the code
transformations themselves, which are the trickiest and most dangerous
part of the translator, to do their work in blissful ignorance of
comment placement.

-- Dave
