RE: Java Comment-Preserving Grammar

Quinn Tyler Jackson <>
21 Jun 2004 23:39:44 -0400

          From comp.compilers

Related articles
RE: Java Comment-Preserving Grammar (Quinn Tyler Jackson) (2004-05-30)
RE: Java Comment-Preserving Grammar (Matthew Herrmann) (2004-05-30)
RE: Java Comment-Preserving Grammar (Matthew Herrmann) (2004-05-30)
Re: Java Comment-Preserving Grammar (Chris F Clark) (2004-06-15)
RE: Java Comment-Preserving Grammar (Quinn Tyler Jackson) (2004-06-21)

From: Quinn Tyler Jackson <>
Newsgroups: comp.compilers
Date: 21 Jun 2004 23:39:44 -0400
Organization: Compilers Central
References: 04-06-071
Keywords: Java, parse
Posted-Date: 21 Jun 2004 23:39:44 EDT

Chris F Clark said:

> One of the few cases you need whitespace at parsing time is in C
> preprocessing (if you implement it in the parsing grammar), where in a
> #define whitespace present or absent between name identifier being
> defined and a following parenthesis determine whether the identifier
> is a parameterized macro (and the parenthesis begins an argument list)
> or not (and the parenthesis is part of the expansion). However, even
> in this case, the problem can be solved lexically by returning two
> different token sequences for "id(" and "id (".
> Note, it was this specific whitespace problem, that prompted the
> "ignore" extension in Yacc++, which specifically allows one to omit
> whitespace from all parts of the grammar where it isn't important for
> the parsing (and not just the lexing) phase, but to include it where
> it was important. The same problem prompted Quinn Tyler Jackson to a
> different solution in meta-S.

Ah, whitespace.

Yes, Meta-S grammars ($-grammars) do indeed take a different approach to
whitespace.

Whitespace rules, like any other rule in a $-grammar, can change in
mid-parse. The parsing engine behind Meta-S doesn't really have a notion of
a static "terminal" and a "non-terminal" in the traditional sense, and
whitespace is just another production in the grammar, albeit a reserved name
is used for the production used for whitespace (namely: __ws). Whitespace
nodes can be dropped from the tree during a parse, the same way any
production's nodes can be, through the use of the #notree directive, but
#notree is not quite the same as ignore, in that ignore'd tokens typically
never get beyond a traditional parser's lexical analysis phase,
whereas #notree productions are still fully productions -- just productions
that don't adorn the parse tree with artifacts.
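
As a rough illustration of that distinction, here is a minimal Python sketch.
It is not Meta-S and not how its engine is implemented; the names RULES,
parse_sequence, and CALL, the "?" suffix for optional rules, and the notree
parameter are all invented for the example. The point is only that __ws is
matched like any other rule, and that leaving it out of the tree is a
separate decision from matching it.

```python
import re

# Hypothetical rule table: whitespace is just one more named production.
RULES = {
    "__ws": re.compile(r"[ \t\r\n]+"),
    "id": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
    "lparen": re.compile(r"\("),
    "rparen": re.compile(r"\)"),
}


def parse_sequence(names, text, notree=frozenset({"__ws"})):
    """Match the named rules in order and return a flat list of tree nodes.

    A trailing "?" marks a rule as optional (an assumption of this sketch).
    Rules listed in `notree` are still matched, but add no node to the tree.
    """
    pos, tree = 0, []
    for name in names:
        optional = name.endswith("?")
        rule = name[:-1] if optional else name
        m = RULES[rule].match(text, pos)
        if m is None:
            if optional:
                continue
            raise SyntaxError(f"expected {rule} at offset {pos}")
        pos = m.end()
        if rule not in notree:
            tree.append((rule, m.group()))
    if pos != len(text):
        raise SyntaxError(f"unexpected input at offset {pos}")
    return tree


# Roughly: id [__ws] "(" [__ws] id [__ws] ")"
CALL = ["id", "__ws?", "lparen", "__ws?", "id", "__ws?", "rparen"]

print(parse_sequence(CALL, "f ( x )"))
print(parse_sequence(CALL, "f (x)", notree=frozenset()))
```

The first call matches the whitespace but leaves no trace of it in the
result; the second keeps the whitespace node. Because RULES is a plain
dictionary, rebinding RULES["__ws"] between calls changes what counts as
whitespace in this sketch.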

Since the inclusion of too many __ws and [__ws] statements within a grammar
can be quite unsightly, I introduced the abbreviations ## and #? for those,
resulting in rules such as:

foo ::= id #? "(" #? expr #? ")"
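
To show what the abbreviations stand for, the following Python sketch
rewrites them back into the long form. It is a plain text substitution
written for this illustration only (the function name and table are made
up), and it naively assumes the abbreviations never appear inside a quoted
literal.

```python
import re

# Assumed mapping, taken from the description above:
# "##" stands for __ws and "#?" stands for [__ws].
ABBREVIATIONS = {"##": "__ws", "#?": "[__ws]"}


def expand_ws_abbreviations(rule):
    """Replace each ## or #? in a rule's text with its long form."""
    return re.sub(r"#[#?]", lambda m: ABBREVIATIONS[m.group()], rule)


print(expand_ws_abbreviations('foo ::= id #? "(" #? expr #? ")"'))
```

The long form it prints is the kind of rule the abbreviations were
introduced to avoid writing by hand.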

In a similar vein, there is another notion that turned out to be useful ...
that of the "keyword terminator":

__kw ::= '[a-zA-Z0-9_]';

return_statement ::= "return" #@ expr #? ";";

In the above, #@ expands to ^__kw ("not __kw"). This allows for:

return 10;
return a;

but not for:

returna;

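The effect of the keyword terminator can be approximated with a
regular-expression negative lookahead. The Python sketch below is only an
analogy, not Meta-S itself: it assumes __kw is meant to cover letters,
digits, and underscore, and the pattern used for expr, the optional
whitespace, and the helper name is_return_statement are invented for the
example.

```python
import re

# Assumed keyword-character class (letters, digits, underscore).
KW = r"[a-zA-Z0-9_]"

# "return" must NOT be followed directly by a keyword character.
# The expr part (an identifier or an integer) is a stand-in for this sketch.
RETURN_STMT = re.compile(
    rf"return(?!{KW})\s*(?P<expr>[A-Za-z_][A-Za-z0-9_]*|[0-9]+)\s*;"
)


def is_return_statement(text):
    """Return True if the whole text is a return statement in this sketch."""
    return RETURN_STMT.fullmatch(text) is not None


for sample in ["return 10;", "return a;", "returna;", "return10;"]:
    print(repr(sample), is_return_statement(sample))
```

Without the lookahead, "returna;" would be read as the keyword followed by
the expression a, which is the misparse the terminator is there to prevent.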

Several have suggested that I could probably have deduced the ##, #?, and #@
operators and inserted them quietly during generation, but this is not
always the case, so I left the requirement for whitespace operators in
place.

To see why the whitespace operators cannot always be deduced, it must be
remembered that productions can change in a $-grammar during a parse:

expr ::= /* some dynamic rule that might be either '[a-z]+' or '(' expr ')'
or arith_expr */;
ret_expr ::= return expr ";"

In the above example, at any given point in the parse, depending on the
definition of expr, ret_expr would need to be any of the following:

return #? expr; // if expr = '(' expr ')';
return ## expr; // if expr = '[a-z]+';
return #@ #? expr; // if expr = arith_expr;

Rather than focus on ways to deduce these kinds of things at run-time, I
decided it was safer to require explicit whitespace tokens in all cases.
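
The dependence on the current definition of expr can be imitated with
regular expressions. The Python sketch below is an analogy under stated
assumptions, not a model of the Meta-S engine: the SEPARATORS table, the
ret_expr helper, the three stand-in patterns for expr, and the whitespace
and keyword character classes are all made up for the illustration.

```python
import re

WS = r"[ \t\r\n]+"          # assumed whitespace class
KW = r"[a-zA-Z0-9_]"        # assumed keyword-character class

# Regex stand-ins for the three operator choices discussed above.
SEPARATORS = {
    "#?": rf"(?:{WS})?",              # optional whitespace
    "##": WS,                         # mandatory whitespace
    "#@ #?": rf"(?!{KW})(?:{WS})?",   # keyword terminator, then optional ws
}

# Stand-ins for three possible definitions of expr.
WORD = r"[a-z]+"
PAREN = r"\([a-z]+\)"
EITHER = rf"{WORD}|{PAREN}"


def ret_expr(separator, expr_pattern):
    """Build a matcher for: return <separator> expr ; (sketch only)."""
    return re.compile(
        rf"return{SEPARATORS[separator]}(?:{expr_pattern})(?:{WS})?;"
    )


def accepts(separator, expr_pattern, text):
    return ret_expr(separator, expr_pattern).fullmatch(text) is not None


# With expr = WORD, optional whitespace is too permissive:
print(accepts("#?", WORD, "returnx;"))
# With expr = PAREN, mandatory whitespace is too strict:
print(accepts("##", PAREN, "return(x);"))
# When expr may be either form, only the terminator variant fits both:
print(accepts("#@ #?", EITHER, "return(x);"),
      accepts("#@ #?", EITHER, "returnx;"))
```

No single separator is right for all three definitions, so a generator that
picked one when the grammar was built could be wrong once expr is redefined
during the parse.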

