RE: What should be check in Lexical Analyzer along with

"Quinn Tyler Jackson" <qjackson@shaw.ca>
22 Sep 2002 12:22:58 -0400

          From comp.compilers

Related articles
Re: What should be check in Lexical Analyzer along with generating cfc@shell01.TheWorld.com (Chris F Clark) (2002-09-20)
RE: What should be check in Lexical Analyzer along with qjackson@shaw.ca (Quinn Tyler Jackson) (2002-09-22)
| List of all articles for this month |

From: "Quinn Tyler Jackson" <qjackson@shaw.ca>
Newsgroups: comp.compilers
Date: 22 Sep 2002 12:22:58 -0400
Organization: Compilers Central
References: 02-09-124
Keywords: lex, design
Posted-Date: 22 Sep 2002 12:22:58 EDT

Durchholz said:


> > A lossless lexer (i.e. a lexer that emits enough
> information to allow a
> > character-by-character reconstruction of the source)
> should be the most
> > flexible approach.


Clark replied:


> These are both very good principles! You should probably take these
> as your basic rules of thumb.
>
> I have a third rule of thumb you can add to that list.
>
> - Pick the division point between your lexer and parser
> where it makes the most semantic sense.


When it comes to adaptive parsing, all bets are off as regards a
discrete tokenization phase. This, of course, is a result of the fact
that tokens can be added, removed, and modified over the space of a
single production (through binding). In some cases, these
modifications are local, in other cases, they are global. Reorganizing
the lexical analyzer every time a new lexeme is added would be
extremely inefficient.


This also means that whitespace must be specified in Meta-S grammars,
but even though this seems a bit nightmarish, it isn't in practice,
since rather than require this:


S ::= function_name opt_whitespace "(" opt_whitespace args
opt_whitespace ")"


Meta-S has the symbols ## and #? aliases for required whitespace and
optional whitespace, and then reserved the production __ws to allow
the syntax for ## and #? to be defined for each grammar. (That is, if
## or #? occur in any production, __ws must be supplied. For instance)


S ::= function_name #? "(" #? args #? ")"


At first, it may seem tedious to have to do that, but in practice
there aren't as many ## and #?'s required as one may first believe --
it all depends on how one designs the grammar. Moreover, since every
grammar must have these -- pretty printer (or other lossless uses)
grammars come free with every Meta-S grammar. (That may not seem like
much of a benefit


Meta-S supplies two directives that can be used to clean up the tree
s.t. it actually represents the semantic structure of the language:


#notree
#refine(node_name)


#notree tells the tree builder that the current production (or any
underneath it on the tree) should be discarded during the parse. By
putting #notree into the __ws production, the tree is free of
unnecessary node-age. This also means that productions like
cpp_comment and cstar_comment, if they do have contexts in which they
should be retained on the tree (such as when they contain directives,
or whatever), can be, simply by referring to them in some other
production that doesn't use #notree.


#refine(node_name) came about as the result of the fact that, since
the core engine of Meta-S is LL(k), expression grammars tend to
produce trees that involve a lot of intermediate productions that
don't add useful information to the tree. If a production has a
#refine(node_name) in it, intermediate nodes are removed, and the node
itself is renamed to node_name:


expr ::= e_addsub
e_addsub ::= e_multdiv #refine(expr) [ "+"|"-" expr
#node(e_addsub) ]
e_multdiv ::= e_power #refine(expr) [ "*"|"/" expr
#node(e_mult_div) ]
// and so on


The effect of the above constructs is to produce a tree that looks
just like an LR(k) tree with precedent disambiguation.


Finally:


> One of the key issues is finding a way to preserve the tokens that
> would normally be thrown away but not force the grammar writer to
> clutter the grammar indicating each place the token is allowed.


In practice, I have found that ## and #? prevent extreme clutter if
the grammar is well organized. Since ## refers to the __ws production,
rather than to any particular lexeme, and since __ws can include
whitespace, comments, preprocessor defines that evaluate to 0 (which
can be handled directly in an adaptive grammar, rather than in a
preprocessor, but that's neither here nor there) -- the number of ##
and #?'s in a grammar is far less than one might imagine it would have
to be.


--
Quinn Tyler Jackson
http://members.shaw.ca/qjackson/


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.