Re: Why separate Lexical & Parser Generators? (Chris Clark USG)
Wed, 12 Oct 1994 18:08:54 GMT

          From comp.compilers

Related articles
Re: Why separate Lexical & Parser Generators? (1994-10-10)
Re: Why separate Lexical & Parser Generators? (1994-10-12)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Chris Clark USG)
Keywords: lex, yacc, design
Organization: Digital Equipment Corporation
References: 94-10-075
Date: Wed, 12 Oct 1994 18:08:54 GMT

Another good topic in parser generator design, why are lexers and parsers
separate. As mentioned by Anders Andersson and Jerry Leichter, there are
SIGPLAN papers showing that don't have to be. Furthermore, this work
seems about to become "mainstream". In particular, as Walter Smith notes,
Terrence Parr is thinking of rewriting PCCTS to fold the lexer and the
parser together.

Having talked about this with Terrence, I'd like to think that part of his
inspiration comes from Yacc++(R). We did not build a unified lexer and
parser. However, the grammar has two sections lexer and parser and the
two sections take the same syntax, yacc-style productions augmented with
regular expressions. Moreover, the lexer and parser engines are based
upon the same technology, direct translation ELR, which merges LR parsing
and regular expression matching.

There are some advantages to having the lexer and the parser seperate,
even if a unified syntax is used. The key advantage is that the point of
interface gives a well defined spot for handling irregularities (some
might call it cheating or hacking). [One point Terrence brings up is that
semantic predicates can solve some of these problems.]

For example, at the time the lexer reduces a string to a token, there is a
feature of Yacc++ which allows the user action code to inspect the token
(and anything else it desires) and to change the type of the token or to
cause the token to be discarded. This is used to implement keyword and
typedef processing. A symbol table search can be done with the token and
if the symbol is found, it determines the type of the token. This can
also be used for macro and comment processing. The token found can be
discarded and another sequence of tokens can be fed to the parser in its
place (or none at all).

One reason why this would be difficult in a unified lexer/parser is that
the parser generator knows which tokens it's looking for. This is obvious
in LL, but is equally true in LR--the only productions being looked at are
ones which are expansions of core items (and that incorporates all of the
left context). In a unified lexer/parser, one would have to be able to
specify (or detect) which tokens can transmute into other tokens, so that
they could be added to the list of tokens expected (when one of the tokens
which they could transmute into was expected). In the separate
lexer/parser model this is not a problem, as the lexer is always expecting
any token.

Prof. Venter mentions another concept with is useful in this regard. This
is the concept of ignorable (non-)terminals. [Interestingly we used the
same name for them in Yacc++, just using the verb "ignore" form rather
than the adjective "ignorable" form.] When an item is ignored, the parser
just "throws it away when it sees it". Here again, we have found the
distinction between lexer and parser useful.

For example, when the lexer throws tokens away, we refer to it as
discarding. In this case the parser never sees the token. This is useful
for comments and whitespace in most languages. However, it is sometimes
useful for the parser to be able to see the token in some contexts. In
that case, we don't have the lexer discard the token, but instead have the
parser see the token and ignore it. Then, when the parser is in a state
which is expecting the token, instead of ignoring it, it can process it.
A good example of this use is whitespace in a C #define statement.

"#define x(a)" means something different than "#define x (a)"

One of the advantages of separating the lexer from the parser for this
processing is that it simplifies the semantics. Because the parser is
only ignoring these sequences at token boundaries. One does not have to
worry that an ignored sequence will be embedded in the middle of a token.
The lexer gets to establish the token boundaries before the sequence is
recognized. To allow ignored sequences to be embedded in a rule, one
makes it a parser rule, to disallow embedding one makes it a lexer rule.

One place where a unified lexer/parser might have some advantages is in
the lexer will have context for knowing what can follow a token. A very
common problem is the situation where the correct action is swallow
everything upto some specific sequence of characters. The Ada or C++
commment which continues to the end of line is an example. The
traditional solution is to use lexer classes (or lexer states) to handle
this. When the situation occurs the lexer is put into a special state
where everything upto the desired sequence is packaged into one token and
processed as a whole. In a unified lexer/parser the special state would
"disappear" into the rule which recognized the context, the stuff to be
batched together, and the final sequence.

These are, of course, just some of the trade-offs.

Chris Clark

Send Yacc++ related email to:
Compiler Resources, Inc.
[Purveyors of Yacc++ and the Language Objects Library]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.