RE: Java Comment-Preserving Grammar

Quinn Tyler Jackson <quinn-j@shaw.ca>
21 Jun 2004 23:39:44 -0400

          From comp.compilers

From: Quinn Tyler Jackson <quinn-j@shaw.ca>
Newsgroups: comp.compilers
Date: 21 Jun 2004 23:39:44 -0400
Organization: Compilers Central
References: 04-06-071
Keywords: Java, parse
Posted-Date: 21 Jun 2004 23:39:44 EDT

Chris F Clark said:


> One of the few cases where you need whitespace at parsing time is in C
> preprocessing (if you implement it in the parsing grammar), where in a
> #define whitespace present or absent between the identifier being
> defined and a following parenthesis determines whether the identifier
> is a parameterized macro (and the parenthesis begins an argument list)
> or not (and the parenthesis is part of the expansion). However, even
> in this case, the problem can be solved lexically by returning two
> different token sequences for "id(" and "id (".
>
> Note, it was this specific whitespace problem that prompted the
> "ignore" extension in Yacc++, which specifically allows one to omit
> whitespace from all parts of the grammar where it isn't important for
> the parsing (and not just the lexing) phase, but to include it where
> it is important. The same problem prompted Quinn Tyler Jackson to a
> different solution in Meta-S.


Ah, whitespace.


Yes, Meta-S grammars ($-grammars) do indeed take a different approach to
whitespace.


Whitespace rules, like any other rule in a $-grammar, can change in
mid-parse. The parsing engine behind Meta-S doesn't really have a notion of
static "terminals" and "non-terminals" in the traditional sense, so
whitespace is just another production in the grammar, although a reserved
name (__ws) is used for it. Whitespace nodes can be dropped from the tree
during a parse, the same way any production's nodes can be, through the use
of the #notree directive. However, #notree is not quite the same as ignore:
ignore'd tokens typically never get beyond a traditional parser's lexical
analysis phase, whereas #notree productions are still full productions that
simply don't adorn the parse tree with artifacts.
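
To illustrate the distinction, here is a toy Python sketch. It is not Meta-S
and says nothing about how its engine is implemented; the function name
parse_call, the notree_ws flag, and the fixed list of steps are all invented
for the example. Whitespace is matched as an ordinary step of the rule, and
the flag only decides whether a match leaves a node in the tree.

```python
import re

def parse_call(text, notree_ws=True):
    """Match  id [__ws] "(" [__ws] id [__ws] ")"  and return a flat list of nodes.

    Hypothetical helper: whitespace is a step in the rule like any other,
    so it takes part in matching even when it is kept out of the tree.
    """
    # Each step: (production name, regex, optional?)
    steps = [
        ("id", r"[a-z]+", False),
        ("__ws", r"\s+", True),
        ("lparen", r"\(", False),
        ("__ws", r"\s+", True),
        ("id", r"[a-z]+", False),
        ("__ws", r"\s+", True),
        ("rparen", r"\)", False),
    ]
    tree, pos = [], 0
    for name, pattern, optional in steps:
        m = re.compile(pattern).match(text, pos)
        if m is None:
            if optional:
                continue      # optional step simply did not apply here
            return None       # required step failed: no parse
        pos = m.end()
        if name == "__ws" and notree_ws:
            continue          # matched, but leaves no node behind
        tree.append((name, m.group(0)))
    return tree if pos == len(text) else None

print(parse_call("f ( x )"))                   # whitespace matched, not recorded
print(parse_call("f ( x )", notree_ws=False))  # whitespace nodes kept in the tree
```

With a lexer-level ignore, by contrast, the whitespace steps would not appear
in the rule at all, because the parser would never see those characters.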


Since too many __ws and [__ws] references within a grammar can be quite
unsightly, I introduced the abbreviations ## (for __ws) and #? (for [__ws]),
resulting in rules such as:


foo ::= id #? "(" #? expr #? ")"
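
As a rough illustration of what the abbreviations stand for, the following
Python sketch rewrites a rule back into its longhand form. The function name
expand_ws_abbreviations and the quoting convention it assumes (single- or
double-quoted literals are left untouched) are assumptions made for the
example, not part of Meta-S.

```python
import re

# Abbreviation table taken from the text: ## is __ws, #? is [__ws].
ABBREVIATIONS = {"##": "__ws", "#?": "[__ws]"}

# Match a quoted literal first so that "##" inside a literal is not rewritten.
TOKEN = re.compile(r'''"[^"]*"|'[^']*'|##|#\?''')

def expand_ws_abbreviations(rule):
    """Return the rule with ## and #? expanded outside quoted literals."""
    def replace(match):
        token = match.group(0)
        # Quoted literals are not in the table, so they come back unchanged.
        return ABBREVIATIONS.get(token, token)
    return TOKEN.sub(replace, rule)

print(expand_ws_abbreviations('foo ::= id #? "(" #? expr #? ")"'))
```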


In a similar vein, another notion turned out to be useful: that of the
"keyword terminator":


__kw ::= '[a-zA-Z0-9_]';


return_statement ::= "return" #@ expr #? ";";


In the above, #@ expands to ^__kw ("not __kw"). This allows for:


return(10);
return 10;
return a;


but not for:


return10;
returna;
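

Outside Meta-S, the effect of #@ can be approximated with a negative
lookahead in a regular expression. The Python sketch below is only an
analogy. The names RETURN_STMT and is_return_statement are invented, the
expression pattern is a crude stand-in (anything up to the semicolon), and
optional whitespace after the keyword is allowed explicitly, which in the
grammar notation would correspond to writing #@ #?.

```python
import re

# Characters that may continue an identifier; mirrors the __kw production.
KW = r"[a-zA-Z0-9_]"

# (?!KW) plays the role of #@: the next character must not be a keyword
# character, and nothing is consumed by the check.
# \s* plays the role of #? (optional whitespace).
RETURN_STMT = re.compile(rf"return(?!{KW})\s*(?P<expr>[^;]+?)\s*;")

def is_return_statement(text):
    """Return True if text is a complete return statement under this sketch."""
    return RETURN_STMT.fullmatch(text) is not None

for sample in ["return(10);", "return 10;", "return a;", "return10;", "returna;"]:
    print(sample, is_return_statement(sample))
```

The first three samples are accepted and the last two are rejected, because
in "return10;" and "returna;" the character after the keyword is itself a
keyword character.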


Several people have suggested that the ##, #?, and #@ operators could
probably be deduced and inserted quietly during generation, but that is not
always possible, so I left the requirement for explicit whitespace operators
in A-BNF.


To see why whitespace operators cannot always be deduced, remember that
productions in a $-grammar can change during a parse:


expr ::= /* some dynamic rule that might be either '[a-z]+' or '(' expr ')'
            or arith_expr */;
ret_expr ::= "return" expr ";";


In the above example, at any given point in the parse, depending on the
current definition of expr, ret_expr would need to be one of the following:


"return" #? expr ";"     // if expr = '(' expr ')'
"return" ## expr ";"     // if expr = '[a-z]+'
"return" #@ #? expr ";"  // if expr = arith_expr


Rather than focus on ways to deduce these kinds of things at run-time, I
decided it was safer to require explicit whitespace tokens in all cases.
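

To make the mismatch concrete, here is a small Python sketch using regular
expressions as stand-ins for the grammar. Everything in it (the names
SEPARATORS, ret_expr, accepts, IDENT, PAREN, and the simplified patterns) is
an assumption for illustration and covers only the ## and #? cases. It shows
a separator that is right for one definition of expr misbehaving once expr is
redefined.

```python
import re

# Regex stand-ins for the two whitespace operators discussed above.
SEPARATORS = {"#?": r"\s*",   # optional whitespace
              "##": r"\s+"}   # mandatory whitespace

# Two possible definitions of expr at different points in a parse.
IDENT = r"[a-z]+"              # expr = '[a-z]+'
PAREN = r"\(\s*[a-z]+\s*\)"    # expr = '(' expr ')', flattened to one level

def ret_expr(expr_pattern, separator):
    """Build  "return" <separator> expr ";"  for the current expr definition."""
    return re.compile("return" + SEPARATORS[separator] + "(?:" + expr_pattern + ");")

def accepts(expr_pattern, separator, text):
    return ret_expr(expr_pattern, separator).fullmatch(text) is not None

# ## is right for identifiers but wrongly rejects the parenthesized form.
print(accepts(IDENT, "##", "returnx;"))    # False
print(accepts(PAREN, "##", "return(x);"))  # False

# #? is right for the parenthesized form but wrongly accepts "returnx;".
print(accepts(PAREN, "#?", "return(x);"))  # True
print(accepts(IDENT, "#?", "returnx;"))    # True
```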


--
Quinn

