Related articles |
---|
Q2. Why do you split a monolitic grammar into the lexing and parsing r spam@abelectron.com (valentin tihomirov) (2005-02-20) |
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi vidar@hokstad.name (Vidar Hokstad) (2005-02-28) |
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi Ron@xharbour.com (Ron Pinkas) (2005-02-28) |
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi mefrill@yandex.ru (2005-02-28) |
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi ndrez@att.net (Norm Dresner) (2005-02-28) |
Re: Q2. Why do you split a monolitic grammar into the lexing and parsi rh26@humboldt.edu (Roy Haddad) (2005-03-04) |
Re: Q2. Why do you split a monolithic grammar into the lexing and pars spam@abelectron.com (valentin tihomirov) (2005-03-08) |
Re: Q2. Why do you split a monolithic grammar into the lexing and pars ndrez@att.net (Norm Dresner) (2005-03-08) |
From: | Roy Haddad <rh26@humboldt.edu> |
Newsgroups: | comp.compilers |
Date: | 4 Mar 2005 14:25:14 -0500 |
Organization: | Cox Communications |
References: | 05-02-087 |
Keywords: | parse, design |
Posted-Date: | 04 Mar 2005 14:25:14 EST |
valentin tihomirov wrote:
> Looks like this is a group discussing parsing-translating issues as a matter
> of compiler front-ends. So, the problem is to enable using reserwed keywords
> as identifiers. The problem is purely artificial; the input "(name name)" is
> can be generated by
>
> name -> '(' 'name' ID ')';
> ID -> ('a'..'z')+;
>
> CF grammar; therefore, it must be recognizable. However, the lexer will turn
> the 2nd "name" into a literal token which is different from ID token. As a
> result, the token stream "LP NAME NAME RP" will not match the input. The
> issue merely does not exist for unified grammar recognizes.
I'm assuming this would be like being able to do this in C:
int else = 4;
printf("%i",else);
A character is a different class of object than a word, and the two
operate on different levels. Tokens that are considered to have direct
meaning, like ELSE, IF, WHILE, etc., should be differentiated from
collections of characters: 'e','l','s','e', and etc. This is how we
think of it anyway (you don't think of words as collections of
characters or sounds - otherwise it would be just as easy to memorize a
passage in an unknown language as a passage in your native language),
and because of this, with a unified grammar I would want to have all
"words" constructed out of "letters" and nothing else derived from
"letters", so there would be no practical difference between a separate
lexer and parser and a unified parser...
So I don't think the problem is artificial, or at least the problem is
not artificial in quite that way.
I the most natural solution I can think of that works purely by
grammar is to make the lexer and parser non-deterministic, so 'else'
above would come out of the lexer as the set { ID, ELSE
}. Practically, you could probably factor out the non-determinism from
there with the usual method of replacing every set with multiple
elements by a single token, and then creating rules from the rules for
each single token. This may result in a nasty grammar, but I don't
think this method will fail where any other will succeed.
So, for example, from this:
IFBLOCK -> IF STATEMENT BLOCK [ELSE BLOCK]
VARDECL -> TYPE ID
we would get this:
IFBLOCK -> IF STATEMENT BLOCK [ELSISH BLOCK]
VARDECL -> TYPE ELSISH
VARDECL -> TYPE ID
where ELSISH is the single token version of { ID, ELSE } (since there is
no way for a lexer to read an ELSE without also recognizing ID, we don't
need to keep the original IFBLOCK rule.)
Return to the
comp.compilers page.
Search the
comp.compilers archives again.