Re: is lex useful?

bart@time.cirl.uoregon.edu (Barton C. Massey)
30 Jun 1996 16:36:53 -0400

          From comp.compilers

Related articles
[12 earlier articles]
Re: is lex useful? leichter@smarts.com (Jerry Leichter) (1996-06-27)
Re: is lex useful? scooter@mccabe.com (Scott Stanchfield) (1996-06-27)
Re: is lex useful? Scott.Nicol@infoadvan.com (1996-06-27)
Re: is lex useful? Scott.Nicol@infoadvan.com (1996-06-27)
Re: is lex useful? 72510.2757@CompuServe.COM (Stephen Lindholm) (1996-06-27)
Re: is lex useful? kanze@lts.sel.alcatel.de (1996-06-27)
Re: is lex useful? bart@time.cirl.uoregon.edu (1996-06-30)
Re: is lex useful? Robert.Corbett@Eng.Sun.COM (1996-06-30)
Re: is lex useful? leichter@smarts.com (1996-06-30)
Re: is lex useful? trd@murlibobo.cs.mu.OZ.AU (1996-06-30)
Re: is lex useful? WStreett@shell.monmouth.com (1996-06-30)
Re: is lex useful? dmr@bell-labs.com (1996-06-30)
Re: is lex useful? clark@quarry.zk3.dec.com (1996-07-01)
[4 later articles]
| List of all articles for this month |

From: bart@time.cirl.uoregon.edu (Barton C. Massey)
Newsgroups: comp.compilers
Date: 30 Jun 1996 16:36:53 -0400
Organization: University of Oregon
References: 96-06-101 96-06-118 96-06-123
Keywords: lex

Jerry Leichter <leichter@smarts.com> wrote:
> The biggest problem I have with lexical analysis generators is that they
> do a lot of work to solve problems that aren't really all that
> important.
[...]
> - Identifiers are trivial - if you can't get the code to pull
> off an identifier right on the first try, you shouldn't
> be writing compilers. The few cases where identifiers
> are *not* trivial, it's their regexp's that are not
> trivial. I recall one language in which an identifier
> was letter followed by letters, digits, or underscores,
> except that you could not have two consecutive under-
> scores. Again, trivial to write in C. I challenge
> anyone to write a correct regexp for that on the first
> try.


Offhand,
[A-Za-z]_?([A-Za-z0-9]_?)*
would seem to do the trick. About 1 minute. Proof of correctness:


(a) Soundness (won't accept __ ids)
If there's an initial underbar, it must
be followed by nothing or by a letter or digit.
Thus, it is not part of a 2 char pair. If the
other underbar is in the pattern, it must follow
a letter or digit, and must be followed either
by nothing or by a letter or digit.


(b) Completeness (accepts all valid ids without __)
The pattern matches
letter
letter _
letter letter-digit
letter _ letter-digit
letter letter-digit letter-digit
letter letter-digit _
and by induction (not shown) on the length of
the string, all valid ids longer than length 3.


I challenge you to prove your C code correct. :-)
(Although, as these things tend to go, the pattern and the
not-shown part of the proof are probably wrong. Was it Dijkstra
or Knuth who said "I've only proven it correct, I haven't yet tried
it."? :-)


Another big win of using lex or flex to parse identifiers is
that no keyword symbol table is necessary -- in trivial
languages, it's a big win to be able to just get a keyword back
directly from the lexer without executing any symbol table
code. Even in real languages, it's nice not to have to clutter
the symbol table code with keyword handling.


> - Strings tend to be trivial; any complexity is in embedded
> escape sequences - e.g., knowing where the number
> stops in "\078" - but these things tend to be done by
> hand even in compilers that use lexers for other things
> since to really make things work you'd need multiple
> RE machines - the "in string" syntax differs quite a
> bit from the "outside of string" syntax.


Everybody "in the know" uses start states with lex or flex (flex
is nicer because of exclusive start states) to get "multiple RE
machines" to parse strings and comments with.


> A more serious underlying problem, if you want *real* speed, is the lack
> of flexibility in the interface to the I/O package. Picking things up a
> character at a time with stdio is fine, but you can beat it by reading
> files in large chunks and processing stuff directly in the input buffer.
> Take a look at lcc for an example. As has been frequently pointed out,
> the lexer is the only part of the compiler that must look at every byte
> of input. It can contribute a surprising amount of cost.


flex is very good at this part of the problem. It automatically "reads
things in large chunks and processes stuff directly in the input buffer",
thus producing *very* fast scanners per unit programmer effort.


> A complete tokenizer I wrote for Modula-2 comes to 915 lines of C - and
> that includes extensive comments and debugging code (this was written
> for a compiler class I taught - students were expected to read and
> modify it) and initializers for tables to back-translate token names to
> external format in producing error messages. The actual guts -
> including such things as conversion of integer constants with proper
> overflow detection (I cheated and used strtod() for reals) - is under
> half of the total.


IMHO it's probably a similar amount of C + flex code, but the
flex code will be a lot more readable and maintainable, and
somewhat more likely to be correct. Probably, for this amount
of engineering effort each way, the flex scanner will also be
faster.


I have written many hand-tokenizers. I have also written many
flex scanners. I prefer the latter.


Bart Massey
bart@cs.uoregon.edu
[For the double underscore thing, whether in C or lex, I'd recognize
identifiers as a string of letters, digits, and underscores, then look for
a double underscore and if I found one, produce a helpful "this language
doesn't permit double underscores in your identifiers" message rather than
a generic and mysterious "syntax error". -John]


--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.