Re: is lex useful? (Chris Clark USG)
1 Jul 1996 22:38:36 -0400

          From comp.compilers


From: (Chris Clark USG)
Newsgroups: comp.compilers
Date: 1 Jul 1996 22:38:36 -0400
Organization: Digital Equipment Corporation - Marlboro, MA
References: 96-06-101 96-06-123
Keywords: lex

Jerry Leichter wrote:
> The biggest problem I have with lexical analysis generators is that they
> do a lot of work to solve problems that aren't really all that
> important.
> Much is made of the ability to recognize RE's quickly. However, if you
> look at the typical programming language, most of its lexical level is
> absolutely trivial:
. . .

While I agree that most parts of lexing are trivial, the few parts
that aren't are exactly why I come down on the other side of the
fence. I would never write a lexer by hand, nor would I trust the
code of anyone who did. I have personally made too many mistakes on
trivial problems.

In the discussion of generated lexers versus hand coded lexers, one
argument seems to have been missed. A generated lexer uses regular
expressions as a specification of the tokens, which is separate from
the implementation. That allows the separation of what is being lexed
from how it is being lexed.
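
As a minimal sketch of that separation (not taken from any particular
lexer generator), the token table below is the "what" and the driver
function is the "how". The token names and the tokenize() helper are
hypothetical, and Python's re module is a backtracking matcher that
takes the first alternative that matches rather than the longest, so
this only illustrates the split between specification and
implementation, not how a generated DFA works.

```python
import re

# Specification: WHAT is being lexed. Token names are illustrative only.
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT",   r"[0-9]+"),
    ("OP",    r"[-+*/=()]"),
    ("SKIP",  r"[ \t\n]+"),
]

# Implementation: HOW it is lexed. Nothing here mentions a specific token
# except the convention that SKIP matches are discarded.
def tokenize(text, spec=TOKEN_SPEC):
    master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in spec))
    pos = 0
    tokens = []
    while pos < len(text):
        m = master.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Adding or changing a token means editing TOKEN_SPEC only; the driver
stays untouched.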

In addition, the specification has nice formal properties. Being
forced to write a specification which describes exactly what a person
means can often keep that person from designing something too
complicated. I can think of numerous cases where a language designer
has specified something which seems trivial but was actually subtly
complicated, once all of the ramifications were considered.

Your floating-point case illustrates exactly this point.

> - Probably the single most error-prone part of lexers is in
> recognizing numbers. I've seen lexers that think a
> lone "." is a legal floating point constant; that
> 07.5 is *not* a legal C floating point constant (because
> the leading 0 switched them irrevocably into octal
> mode); that 079 is a legal "octal" constant; that 079.5
> is *not* a legal floating point value, even though
> 07.5 is (they switch modes, but too late); and so on.
> In this one case, I grant you that a true lexer helps.

Even if each token has a simple regular expression, there are often
cases where a combination of tokens makes the language complicated.
(The RE for floating point is complicated enough and, worse, it
interacts with the various integer RE's.) A good tool will point out
the ambiguities and force you to resolve them in a principled rather
than ad hoc way. With hand-written code, everyone who modifies it has
to re-verify that the change hasn't altered the meaning of the code.
Can you guarantee that all future modifiers of your code will make
that effort?
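
To make the number cases from the quoted text concrete, here is a
hedged sketch of the three interacting rules written as regular
expressions. It is a simplification of C's rules (no hex, no suffixes,
no hex floats), the rule names and the classify() helper are
hypothetical, and it classifies a whole candidate string rather than
scanning a stream. It returns every rule that matches, so an overlap
between rules would show up as a list with more than one name.

```python
import re

# Simplified C-like numeric rules; assumptions, not a full C grammar.
NUMBER_RULES = {
    # digits with a '.', or '.' followed by digits, optional exponent;
    # or digits with a mandatory exponent.
    "FLOAT": r"(?:(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?"
             r"|[0-9]+[eE][-+]?[0-9]+)",
    # a leading 0 followed only by octal digits.
    "OCTAL": r"0[0-7]*",
    # a nonzero leading digit.
    "DECIMAL": r"[1-9][0-9]*",
}

def classify(text):
    """Return the names of all rules that match the entire string."""
    return [name for name, pat in NUMBER_RULES.items()
            if re.fullmatch(pat, text)]
```

With these rules classify("07.5") and classify("079.5") both give
["FLOAT"], while classify(".") and classify("079") give [], which are
the four cases the quoted lexers got wrong. The leading 0 never has to
"switch modes", because the decision is made over the whole lexeme.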

> A more serious underlying problem, if you want *real* speed, is the lack
> of flexibility in the interface to the I/O package. Picking things up a
> character at a time with stdio is fine, but you can beat it by reading
> files in large chunks and processing stuff directly in the input buffer.

I don't see that as part of the generated-versus-hand-written
argument. The lexer generator I use has a default input object which
does exactly that, and another input object which does the same thing
for strings, and one which uses C++ streams, and one which uses the
character-at-a-time stdio interface, . . . .
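
A rough sketch of what interchangeable input objects can look like
follows. The class names, the chunks() method, and the count_words()
stand-in for a scanner are all hypothetical; they are not the
interface of the generator mentioned above, only an illustration that
the scanning loop need not know where its characters come from.

```python
import io

class StringInput:
    """Whole input is already in memory; hand it over as one chunk."""
    def __init__(self, text):
        self._text = text
    def chunks(self):
        yield self._text

class ChunkedFileInput:
    """Read a stream in large blocks and scan directly in each block."""
    def __init__(self, stream, chunk_size=65536):
        self._stream = stream
        self._chunk_size = chunk_size
    def chunks(self):
        while True:
            block = self._stream.read(self._chunk_size)
            if not block:
                return
            yield block

class CharAtATimeInput:
    """Deliberately naive: one character per read, like getc()."""
    def __init__(self, stream):
        self._stream = stream
    def chunks(self):
        while True:
            ch = self._stream.read(1)
            if not ch:
                return
            yield ch

def count_words(source):
    """Stand-in for a scanner: counts whitespace-separated words.
    State is carried across chunk boundaries, so chunk size is invisible."""
    count = 0
    in_word = False
    for block in source.chunks():
        for ch in block:
            if ch.isspace():
                in_word = False
            elif not in_word:
                in_word = True
                count += 1
    return count
```

The same count_words() call works unchanged with any of the three
sources; only the cost of fetching characters differs.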

