Re: Q2. Why do you split a monolitic grammar into the lexing and parsing rules?

Roy Haddad <rh26@humboldt.edu>
4 Mar 2005 14:25:14 -0500

          From comp.compilers

Related articles
Q2. Why do you split a monolitic grammar into the lexing and parsing r spam@abelectron.com (valentin tihomirov) (2005-02-20)
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi vidar@hokstad.name (Vidar Hokstad) (2005-02-28)
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi Ron@xharbour.com (Ron Pinkas) (2005-02-28)
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi mefrill@yandex.ru (2005-02-28)
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi ndrez@att.net (Norm Dresner) (2005-02-28)
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi rh26@humboldt.edu (Roy Haddad) (2005-03-04)
Re: Q2. Why do you split a monolithic grammar into the lexing and pars spam@abelectron.com (valentin tihomirov) (2005-03-08)
Re: Q2. Why do you split a monolithic grammar into the lexing and pars ndrez@att.net (Norm Dresner) (2005-03-08)
| List of all articles for this month |

From: Roy Haddad <rh26@humboldt.edu>
Newsgroups: comp.compilers
Date: 4 Mar 2005 14:25:14 -0500
Organization: Cox Communications
References: 05-02-087
Keywords: parse, design
Posted-Date: 04 Mar 2005 14:25:14 EST

valentin tihomirov wrote:
> Looks like this is a group discussing parsing-translating issues as a matter
> of compiler front-ends. So, the problem is to enable using reserwed keywords
> as identifiers. The problem is purely artificial; the input "(name name)" is
> can be generated by
>
> name -> '(' 'name' ID ')';
> ID -> ('a'..'z')+;
>
> CF grammar; therefore, it must be recognizable. However, the lexer will turn
> the 2nd "name" into a literal token which is different from ID token. As a
> result, the token stream "LP NAME NAME RP" will not match the input. The
> issue merely does not exist for unified grammar recognizes.


I'm assuming this would be like being able to do this in C:
int else = 4;
printf("%i",else);


A character is a different class of object than a word, and the two
operate on different levels. Tokens that are considered to have direct
meaning, like ELSE, IF, WHILE, etc., should be differentiated from
collections of characters: 'e','l','s','e', and etc. This is how we
think of it anyway (you don't think of words as collections of
characters or sounds - otherwise it would be just as easy to memorize a
passage in an unknown language as a passage in your native language),
and because of this, with a unified grammar I would want to have all
"words" constructed out of "letters" and nothing else derived from
"letters", so there would be no practical difference between a separate
lexer and parser and a unified parser...


So I don't think the problem is artificial, or at least the problem is
not artificial in quite that way.


I the most natural solution I can think of that works purely by
grammar is to make the lexer and parser non-deterministic, so 'else'
above would come out of the lexer as the set { ID, ELSE
}. Practically, you could probably factor out the non-determinism from
there with the usual method of replacing every set with multiple
elements by a single token, and then creating rules from the rules for
each single token. This may result in a nasty grammar, but I don't
think this method will fail where any other will succeed.


So, for example, from this:


IFBLOCK -> IF STATEMENT BLOCK [ELSE BLOCK]
VARDECL -> TYPE ID


we would get this:


IFBLOCK -> IF STATEMENT BLOCK [ELSISH BLOCK]
VARDECL -> TYPE ELSISH
VARDECL -> TYPE ID


where ELSISH is the single token version of { ID, ELSE } (since there is
no way for a lexer to read an ELSE without also recognizing ID, we don't
need to keep the original IFBLOCK rule.)



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.