Re: Why separate Lexical & Parser Generators

leichter@zodiac.rutgers.edu
Thu, 6 Oct 1994 18:06:22 GMT

Related articles
Why separate Lexical & Parser Generators heronj@smtplink.NGC.COM (John Heron) (1994-10-05)
Re: Why separate Lexical & Parser Generators andand@csd.uu.se (Anders Andersson) (1994-10-06)
Re: Why separate Lexical & Parser Generators leichter@zodiac.rutgers.edu (1994-10-06)
Re: Why separate Lexical & Parser Generators morrison@hal.cs.uiuc.edu (1994-10-07)
Re: Why separate Lexical & Parser Generators johnl@cs.indiana.edu (John Lacey) (1994-10-10)
Re: Why separate Lexical & Parser Generators hagerman@ece.cmu.edu (1994-10-10)
Re: Why separate Lexical & Parser Generators wrs@apple.com (Walter Smith) (1994-10-10)
Re: Why separate Lexical & Parser Generators cef@geodesic.com (Charles Fiterman) (1994-10-11)
Re: Why separate Lexical & Parser Generators pardo@cs.washington.edu (1994-10-11)
[4 later articles]

Newsgroups: comp.compilers
From: leichter@zodiac.rutgers.edu
Keywords: lex, yacc, design
Organization: Rutgers University Department of Computer Science
References: 94-10-028
Date: Thu, 6 Oct 1994 18:06:22 GMT

John Heron <heronj@smtplink.NGC.COM> writes:
> Pardon me if this question is naive. Why have a separate parser generator
> and lexical analyzer generator? It seems to me that the generator could
> recognize the regular portions of the grammar by starting at the terminal
> symbols and working its way up until it sees a non-regular production.
> After separating the grammar into 2 parts, one regular and one
> context-free, you could proceed to build 2 separate FSM's. Are we still
> using separate tools like yacc and lex just because they're widely
> available, and widely understood? Or is there some other technical,
> engineering, or business reason for doing things the traditional way?
>
> ***** John Heron, Network General Corp. - email - johnhe@ngc.com
> [The short answer is that they do different things. There are lots of places
> where lex is useful without yacc, and some where yacc is useful without lex.
> Also, if you think it'd be easier to write one big grammar, try writing the
> BNF for a language, including indicating all the places where you can have
> comments. Don't forget to make provisions for tracking line numbers, too.
> -John]


Actually, it's perhaps not quite that simple. Much of this has become
"received wisdom", going back to the days when compilers were built using
very different tools, and machine constraints were very different from
what they are today. For example, one traditional argument for a separate
lexer is that then only the lexer has to actually examine each byte of
input text; the rest of the compiler only deals with a token stream, which
is typically much smaller. With today's machines, however, it's not clear
that this really matters much.
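
To make the byte-versus-token point concrete: in the traditional
scheme, a lex-style specification like the (invented) fragment below is
the only place raw characters are ever examined. The parser sees small
integer token codes, and per-character chores like line counting and
comment stripping happen exactly once:

    %{
    /* Sketch only - the token codes and the "--" comment syntax are
       made up for the example. */
    int lineno = 1;
    enum { IDENT = 258, NUMBER };
    %}
    %%
    [ \t]+                  ;           /* whitespace dies here      */
    \n                      lineno++;   /* line tracking lives here  */
    "--".*                  ;           /* comments never reach the
                                           parser                    */
    [A-Za-z_][A-Za-z0-9_]*  return IDENT;
    [0-9]+                  return NUMBER;
    .                       return yytext[0];
    %%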


Traditionally, it was argued that the statement level of most languages
was LL(1), while the expression level was easily handled by a precedence
grammar. This led to a time-honored design in which you use recursive
descent until you get to the expression level, then a precedence parser.
Simple, quick to code, easy to understand, takes very little space. One
of the good arguments for LR(1) parser generators, on the other hand, is
that for the "LL(1) parts" of the language, you get something that
performs like an LL(1) parser, while for the non-LL(1) parts you get
something at least as good as a precedence parser - and you, the compiler
writer, don't need to choose: you write one unified grammar and leave the
rest to the generator.
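
To make the traditional hybrid concrete, here is a toy version of it in
C. Everything here is invented for illustration - error handling is
omitted, and a precedence-climbing loop stands in for the classical
operator-precedence tables:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    static const char *src;               /* input cursor */

    static void skipws(void)
    {
        while (isspace((unsigned char)*src)) src++;
    }

    static int prec(char op)              /* binding power; 0 = not an op */
    {
        switch (op) {
        case '+': case '-': return 1;
        case '*': case '/': return 2;
        default:            return 0;
        }
    }

    static long parse_expr(int min_prec); /* forward declaration */

    static long parse_primary(void)       /* number or parenthesized expr */
    {
        skipws();
        if (*src == '(') {
            long v;
            src++;                        /* consume '(' */
            v = parse_expr(1);
            skipws();
            if (*src == ')') src++;       /* consume ')' */
            return v;
        }
        return strtol(src, (char **)&src, 10);
    }

    /* The "precedence parser": one loop, driven by prec(), instead of
       one recursive procedure per precedence level. */
    static long parse_expr(int min_prec)
    {
        long lhs = parse_primary();
        for (;;) {
            char op;
            long rhs;
            skipws();
            op = *src;
            if (prec(op) < min_prec) break;
            src++;                        /* consume the operator */
            rhs = parse_expr(prec(op) + 1);   /* left-associative */
            switch (op) {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            }
        }
        return lhs;
    }

    /* The "LL(1) part": statements dispatch on a little lookahead. */
    static void parse_statement(void)
    {
        skipws();
        if (strncmp(src, "print", 5) == 0) {
            src += 5;
            printf("%ld\n", parse_expr(1));
        }
    }

    int main(void)
    {
        src = "print 1 + 2 * (3 + 4)";
        parse_statement();                /* prints 15 */
        return 0;
    }

Note how the statement level needs only one token of lookahead - the
"LL(1) part" - while all of the precedence decisions are concentrated
in the little prec() table.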


Why not extend this argument to the "regular" part of a language grammar?
Sure, if you choose to write your grammar so as to mix the (logical)
lexical and syntactic levels - so that, for example, comments show up all
through the grammar, rather than in a few low-level productions - you'll
have problems. But language designs no longer do that, for the most part.
(On the other hand, *some* fairly recent languages don't really have a
regular lexical level. A classic example is the tick (') in Ada, which
serves both as the attribute mark, as in T'FIRST, and as the
character-literal delimiter, as in 'A'; a regular lexer has to
special-case it, typically by peeking at the previous token. In a
unified grammar, no special case would be needed.)
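
To see why a sensibly factored unified grammar doesn't drown in
comments, consider a hypothetical yacc-style fragment in which tokens
are themselves productions over single characters. Comments and blanks
live in one low-level "pad" rule, not all through the statement
productions (letter, digit and text are elided):

    pad        : /* empty */
               | pad ' '
               | pad '\n'
               | pad comment
               ;
    comment    : '-' '-' text '\n' ;    /* Ada-style comment */
    identifier : pad letter
               | identifier letter
               | identifier digit
               ;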


There is a feedback loop here: current parser generators don't try to
pull out the "regular" substratum of their input grammars, mainly
because, given the assumption of a separate lexer, the grammars they
are fed have no regular substratum left to find.


I know of at least one experiment to see what a unified parser would do.
I can't put my hands on the paper quickly - it appeared in SIGPLAN
Notices, perhaps in a Proceedings, about 7-8 years ago. The authors took
some common language grammar, added the lexical definitions, and fed the
whole thing to unmodified YACC. With reasonable effort, they were able to
get YACC to accept it. The resulting compiler, as I recall the paper, was
somewhat larger than the traditional lex+YACC version, but actually ran
faster. (Whether this was an artifact of running the time-honored, very
slow, standard lex, I don't know.)
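
The flavor of the exercise, at least, is easy to reconstruct: the
lexical definitions become ordinary character-level productions, and
yylex() degenerates into "return the next input character". A
hypothetical fragment like this goes through unmodified YACC:

    number : digit
           | number digit
           ;
    digit  : '0' | '1' | '2' | '3' | '4'
           | '5' | '6' | '7' | '8' | '9'
           ;

The obvious cost is one shift - one parser-stack operation - per input
character rather than per token, which makes the reported speed result
all the more surprising.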


As far as I know, no one has followed up on this work; so we really have
no idea how a parser generator *intended* to be used this way would
compare to the current partitioned scheme. On general principles of
optimization, however, I would argue that the more of the decisions an
optimizer *can* make you actually leave to it, the better it will do.
Insisting that the programmer choose the partition between lexer and
parser may be akin to insisting that the programmer decide which
variables to put in registers.


-- Jerry