Re: Why separate Lexical & Parser Generators?

bhv@herman.upe.ac.za (Prof Herman Venter)
Mon, 10 Oct 1994 16:15:00 GMT

          From comp.compilers

Related articles
Re: Why separate Lexical & Parser Generators? bhv@herman.upe.ac.za (1994-10-10)
Re: Why separate Lexical & Parser Generators? clark@quarry.zk3.dec.com (1994-10-12)
| List of all articles for this month |
Newsgroups: comp.compilers
From: bhv@herman.upe.ac.za (Prof Herman Venter)
Keywords: lex, yacc, design
Organization: Compilers Central
Refernces: 94-10-028
Date: Mon, 10 Oct 1994 16:15:00 GMT

John wrote:
: ... Why have a separate parser generator
: and lexical analyzer generator? It seems to me that the generator could
: recognize the regular portions of the grammar by starting at the terminal
: symbols and working its way up until it sees a non-regular production.
: After separating the grammar into 2 parts, one regular and one
: context-free, you could proceed to build 2 separate FSM's.


This is indeed possible -- I have written a scanner+parser generator that
does something pretty close to what you describe. I might just add that
there is no particular need to specify tokens by means of a regular grammar,
provided that one can denote which productions are meant for the scanner and
which are meant for the parser. For example:


expression = `(' expression `)' | number [`+' number]
number = Digit{Digit} (`.'Digit{Digit} [(`e' | `E') [`-'] Digit{Digit}] |
                      (`e' | `E') [`-'] Digit{Digit})
Tokens = number




Given the above grammar, my scanner+parser generator will pick up that
the first production is meant for the parser and the second for the scanner,
simply because number is listed in the reserved production, Tokens. It is
then able to pick up that the strings `(', `)' and `+' used in the first
production must be returned as tokens by the scanner, whereas the strings
`.', `e', `E' and `-' never reach up to the parser, but are characters that
are scanned as part of a number. The production Digit is predefined, so as
to keep grammars short.


The astute reader will no doubt pick up a number of potential pitfalls and
notice some limitations. Nevertheless, the scheme can be made to work for
most reasonable languages, and even some unreasonable ones.


There has been a fair amount of discussion about handling comments. One way
to do this is to indicate explicitly where comments may appear, but this soon
clutters up the grammar. I prefer to handle it by convention. For example,
to add Ada-like comments to the above grammar, I merely add:


comment = `--' {Char} Eol
Ignorable = comment


and change:


Tokens = number | comment




Again, a number of conventions govern the interpretation of the grammar.
I find the unified grammars very readable and intuitive, but I'm obviously
biased ;-)


Our esteemed moderator writes:


: ... Don't forget to make provisions for tracking line numbers, too.


This is indeed a tricky problem, but it does not become any simpler when
one uses separate tools for generating the scanner and parser. I do not
think it is necessary to burden the grammar with the details.
--
Prof Herman Venter (csabhv@cs.upe.ac.za)
Dept of Computer Science
University of Port Elizabeth
P O Box 1600, Port Elizabeth
6000 South Africa
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.