Re: Q2. Why do you split a monolithic grammar into the lexing and parsing rules?



From: "valentin tihomirov" <>
Newsgroups: comp.compilers
Date: 8 Mar 2005 01:03:51 -0500
Organization: Compilers Central
References: 05-02-087 05-03-019
Keywords: lex, design
Posted-Date: 08 Mar 2005 01:03:51 EST

> I'm assuming this would be like being able to do this in C:
> int else = 4;
> printf("%i",else);

Nice example; I was thinking of something Pascal-ish:
        procedure do();
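A minimal sketch (hypothetical code, not from any real compiler) of why the example above breaks: a conventional lexer applies its keyword test with no knowledge of the surrounding grammar rule, so `do` is classified as a keyword even in a position where only an identifier is legal.

```python
# Hypothetical toy lexer: keyword classification happens out of context.
KEYWORDS = {"procedure", "begin", "end", "if", "then", "else", "do"}

def tokenize(source):
    tokens = []
    text = source.replace("(", " ( ").replace(")", " ) ").replace(";", " ; ")
    for word in text.split():
        if word in KEYWORDS:
            tokens.append(("KEYWORD", word))   # decided with no context
        elif word.isidentifier():
            tokens.append(("ID", word))
        else:
            tokens.append(("PUNCT", word))
    return tokens

# In "procedure do();" the word "do" is labeled KEYWORD, although the
# grammar position (right after "procedure") only allows an identifier.
print(tokenize("procedure do();"))
```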

> A character is a different class of object than a word, and the two
> operate on different levels.

One bit, one sound, or one character is not enough to encode all the
notions in the world. Fortunately for mankind, the ancient Latins, as
opposed to the Chinese, were wise enough to construct words from
letters. You're right that most words encode notions in natural
languages, where the letters conform to sounds in oral speech (English
has lost this natural relationship). Such a separation looks natural
in the human world. Formal languages, however, do not have this
two-level restriction; they may have as many levels as necessary. When
I speak about a context-free language, I see a syntax tree.

> Tokens that are considered to have direct
> meaning, like ELSE, IF, WHILE, etc.

... and the wrong meaning will be assigned by the lexer to ELSE in your
example above. ELSE may take on any sense depending on the context (as
may most words in natural languages, which are considered
phrase-structure or context-sensitive). Recognizers do not worry about
meaning at all; both tokens and letters are just symbols in formal
languages.

> and because of this, with a unified grammar I would want to have all
> "words" constructed out of "letters" and nothing else derived from
> "letters",

Which class do literal tokens belong to? IF, THEN, and ELSE are separators:
        IF expr THEN stat ELSE stat;

Some words (known as phrases) consist of two or more words.

> so there would be no practical difference between a separate
> lexer and parser and a unified parser...

The advantage is a known context for symbols. Even though we do not
use context-sensitive languages, words still have context-dependent
meanings. In addition, the lexer cannot calculate the LL(k) lookahead,
because it cannot predict what comes after the end of a token; it must
assume that any token can follow, resulting in huge ambiguities (ANTLR
just suppresses them so as not to confuse the user). Meanwhile, a
top-down parser knows the start symbol and owns the information in the
FOLLOW(token) set.
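The point about the parser owning the context can be sketched as parser-driven lexing (a hedged illustration with invented names, not ANTLR's actual API): the parser passes the set of token types legal at the current position, so a word like "else" is classified ELSE only where the grammar allows it.

```python
# Hypothetical parser-driven classification: the parser supplies the
# set of token types that are legal at the current grammar position.
KEYWORDS = {"if", "then", "else"}

def next_token(word, expected):
    """Classify `word` using the parser-supplied `expected` token set."""
    kw = word.upper()
    if word in KEYWORDS and kw in expected:
        return kw              # the keyword sense is legal here
    if "ID" in expected and word.isidentifier():
        return "ID"            # fall back to the identifier sense
    raise SyntaxError(f"unexpected {word!r}; expected one of {expected}")

# After "if c then x" the parser expects ELSE or a statement separator:
print(next_token("else", {"ELSE", "SEMI"}))   # ELSE
# In a declaration, only an identifier is legal, so "else" becomes ID:
print(next_token("else", {"ID"}))             # ID
```

The design choice here is simply to invert the usual data flow: instead of the lexer pushing fully classified tokens at the parser, the parser pulls tokens and constrains their classification with its FIRST/FOLLOW knowledge.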

> So I don't think the problem is artificial, or at least the problem is
> not artificial in quite that way.

Humans bring their (natural) way of thinking into formal language
theory, which already has a general device for language description.

> The most natural solution I can think of that works purely by
> grammar is to make the lexer and parser non-deterministic, so 'else'
> above would come out of the lexer as the set { ID, ELSE
> }.

As you see, the lexer prematurely assigns a meaning to the "word"
regardless of the context. Context is the place where a token (a
syntactically structured sequence of symbols) can appear in a
production rule.
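The non-deterministic alternative quoted above can be sketched like this (hypothetical code, not how ANTLR actually works): the lexer emits the *set* of candidate token types, and a parser-side step discards the candidates that the current context rules out.

```python
# Hypothetical non-deterministic lexer: emit all candidate token types.
KEYWORDS = {"if", "then", "else"}

def candidates(word):
    """Return every token type the word could be, context-free."""
    kinds = set()
    if word in KEYWORDS:
        kinds.add(word.upper())
    if word.isidentifier():
        kinds.add("ID")
    return kinds

def resolve(word, allowed):
    """Parser-side step: intersect the candidates with the legal set."""
    kinds = candidates(word) & allowed
    if len(kinds) != 1:
        raise SyntaxError(f"{word!r} is ambiguous or illegal here: {kinds}")
    return kinds.pop()

print(sorted(candidates("else")))          # ['ELSE', 'ID']
print(resolve("else", {"ID", "SEMI"}))     # ID
```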
[The most plausible argument I've seen for syntax design without
reserved words was for PL/I, which had so many of them that it was
impractical for programmers to remember them all. Assuming you
have a more parsimonious language than PL/I or Cobol, the ability
to use IF or FOR as a variable is more likely to be confusing than
useful. -John]
