Re: is lex useful?

Jerry Leichter <leichter@smarts.com>
27 Jun 1996 11:23:08 -0400

          From comp.compilers

Related articles
[6 earlier articles]
Re: is lex useful? kanze@lts.sel.alcatel.de (1996-06-24)
Re: is lex useful? bos@serpentine.com (1996-06-26)
Re: is lex useful? dhami@mdd.comm.mot.com (1996-06-26)
Re: is lex useful? stefan.monnier@lia.di.epfl.ch (Stefan Monnier) (1996-06-26)
Re: is lex useful? raph@kiwi.cs.berkeley.edu (1996-06-26)
Re: is lex useful? rgreen@barach.bbn.com (1996-06-26)
Re: is lex useful? leichter@smarts.com (Jerry Leichter) (1996-06-27)
Re: is lex useful? scooter@mccabe.com (Scott Stanchfield) (1996-06-27)
Re: is lex useful? Scott.Nicol@infoadvan.com (1996-06-27)
Re: is lex useful? Scott.Nicol@infoadvan.com (1996-06-27)
Re: is lex useful? 72510.2757@CompuServe.COM (Stephen Lindholm) (1996-06-27)
Re: is lex useful? kanze@lts.sel.alcatel.de (1996-06-27)
Re: is lex useful? bart@time.cirl.uoregon.edu (1996-06-30)
[10 later articles]
| List of all articles for this month |

From: Jerry Leichter <leichter@smarts.com>
Newsgroups: comp.compilers
Date: 27 Jun 1996 11:23:08 -0400
Organization: System Management ARTS
References: 96-06-101 96-06-118
Keywords: lex

The biggest problem I have with lexical analysis generators is that they
do a lot of work to solve problems that aren't really all that
important.


Much is made of the ability to recognize RE's quickly. However, if you
look at the typical programming language, most of its lexical level is
absolutely trivial:


- Spacing is trivial to handle - just skip whitespace at
the beginning of the "get next token" routine.
- Operators usually require at most one peak-ahead. The
code involved (number of characters you have to type,
whatever) ends up about the same whether you write a
regexp or write C code. The same goes for various
kinds of punctuation.
- Identifiers are trivial - if you can't get the code to pull
off an identifier right on the first try, you shouldn't
be writing compilers. The few cases where identifiers
are *not* trivial, it's their regexp's that are not
trivial. I recall one language in which an identifier
was letter followed by letters, digits, or underscores,
except that you could not have two consecutive under-
scores. Again, trivial to write in C. I challenge
anyone to write a correct regexp for that on the first
try.
- Comments range from trivial (from -- to end of line, for
example) to slightly tricky (C - a *very* common
beginner's error will fail to terminate properly if
the input is /***/) to not regular (PL/I - just like
C but comments nest).
- Strings tend to be trivial; any complexity is in embedded
escape sequences - e.g., knowing where the number
stops in "\078" - but these things tend to be done by
hand even in compilers that use lexers for other things
since to really make things work you'd need multiple
RE machines - the "in string" syntax differs quite a
bit from the "outside of string" syntax.
- Probably the single most error-prone part of lexers is in
recognizing numbers. I've seen lexers that think a
lone "." is a legal floating point constant; that
07.5 is *not* a legal C floating point constant (because
the leading 0 switched them irrevocably into octal
mode); that 079 is a legal "octal" constant; that 079.5
is *not* a legal floating point value, even though
07.5 is (they switch modes, but too late); and so on.


In this one case, I grant you that a true lexer helps.
On the other hand, it's *very* easy to get the much
more important step of proper conversion wrong, even in
trivial and obvious ways. How long should the number
buffer be? Will the code handle 00000000000000000000001
- "obviously" too many digits to fit in 32 bits! -
correctly?


A more serious underlying problem, if you want *real* speed, is the lack
of flexibility in the interface to the I/O package. Picking things up a
character at a time with stdio is fine, but you can beat it by reading
files in large chunks and processing stuff directly in the input buffer.
Take a look at lcc for an example. As has been frequently pointed out,
the lexer is the only part of the compiler that must look at every byte
of input. It can contribute a surprising amount of cost.


A complete tokenizer I wrote for Modula-2 comes to 915 lines of C - and
that includes extensive comments and debugging code (this was written
for a compiler class I taught - students were expected to read and
modify it) and initializers for tables to back-translate token names to
external format in producing error messages. The actual guts -
including such things as conversion of integer constants with proper
overflow detection (I cheated and used strtod() for reals) - is under
half of the total. I see little reason to use a tool to save me the
effort of writing 400-500 lines of code, most of it very straightforward
- especially since, when all is said and done, I'll have to most of the
hard parts myself anyway!
-- Jerry


PS The corresponding recursive-descent parser and AST builder, with
decent error reporting and a little error correction, comes to 1760
lines of quite readable code. Of course, since Niklaus Wirth seems to
prefer to do his parsers using recursive-descent, too, the design of the
language avoids things that make such parsers hard. Not that you'd
notice until you sat down to write a compiler!
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.