Scanner generator design hints desired

James Kanze <>
Thu, 17 Feb 1994 16:26:16 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: James Kanze <>
Keywords: lex
Organization: Compilers Central
References: 94-02-107
Date: Thu, 17 Feb 1994 16:26:16 GMT

|> Hello!! I'm working on a project to develop a scanner generator and was
|> wondering if anyone could suggest a good data structure to hold the state
|> table generated. A simple 2 dim array is not what I want. Since I hear a
|> lot of people discussing the DRAGON BOOK -- Compilers, principles,
|> Techniques etc. by Alfred V. Aho etc, I am curious to see if anyone could
|> decipher the table compression technique on page 145, fig 3.47 -- I don't
|> seem to understand it!!

If I'm not mistaken, the table compression technique described for
scanners in the Dragon Book is actually the one used in yacc (and not
lex). If you take the yacc output, and run through it by hand, I think
you'll find that it is not too difficult to understand.
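For what it's worth, the compression scheme in question is the familiar base/next/check arrangement with default chaining: each state gets an offset into one shared transition array, and a parallel check array tells you whether the entry you landed on really belongs to that state. A minimal sketch in C, with invented table contents (not taken from any real grammar):

```c
#include <assert.h>

/* Compressed transition tables in the yacc style: next_[base_[s] + c] is
 * the transition for state s on symbol c, but only if check_[base_[s] + c]
 * == s; otherwise fall back to the default state and retry.  The values
 * below are illustrative only. */

enum { NSTATES = 3, TABSIZE = 8, NOSTATE = -1 };

static const int base_[NSTATES]  = { 0, 2, 4 };
static const int def_[NSTATES]   = { NOSTATE, 0, 0 };  /* fallback states */
static const int next_[TABSIZE]  = { 1, 2, 2, 0, 1, 0, 0, 0 };
static const int check_[TABSIZE] = { 0, 0, 1, 1, 2, 2, NOSTATE, NOSTATE };

/* Look up the transition for state s on input symbol c (a small integer). */
int transition(int s, int c)
{
    while (s != NOSTATE) {
        int i = base_[s] + c;
        if (i >= 0 && i < TABSIZE && check_[i] == s)
            return next_[i];
        s = def_[s];            /* entry belongs to some other state:
                                   chase the default chain */
    }
    return NOSTATE;
}
```

The point of the check array is that rows for different states can overlap in the shared array; the default chain lets nearly-identical rows share their common entries.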

For a scanner, however, I would stand back and ask myself just how general
I wanted to make it. I think that for most languages, it should be
possible to create a dispatch table based on the first character in the
token. There are then a number of special cases which can be optimized;
e.g., if the regular expression describing the token has the form CC0 CC1*
(where CC0 and CC1 are character classes), it is best handled by a simple
loop.
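As a sketch of the dispatch idea (the token classes and character sets here are invented, not from any particular language):

```c
#include <assert.h>
#include <ctype.h>

/* First-character dispatch: one table lookup picks the token class, and
 * the CC0 CC1* classes are then scanned with a simple loop.  A real
 * generator would emit this table from the token definitions. */

enum TokClass { T_EOF, T_SYMBOL, T_NUMBER, T_SPECIAL, T_SPACE };

static enum TokClass dispatch[256];

static void init_dispatch(void)
{
    for (int c = 0; c < 256; c++)
        dispatch[c] = T_SPECIAL;            /* default: punctuation etc. */
    for (int c = 'a'; c <= 'z'; c++) dispatch[c] = T_SYMBOL;
    for (int c = 'A'; c <= 'Z'; c++) dispatch[c] = T_SYMBOL;
    dispatch['_'] = T_SYMBOL;
    for (int c = '0'; c <= '9'; c++) dispatch[c] = T_NUMBER;
    dispatch[' '] = dispatch['\t'] = dispatch['\n'] = T_SPACE;
    dispatch['\0'] = T_EOF;
}

/* Classify the token starting at *p and advance past it; the CC0 CC1*
 * cases (symbols, white space) are just loops over the second class. */
enum TokClass classify(const char **p)
{
    enum TokClass t = dispatch[(unsigned char)**p];
    if (t == T_SYMBOL)
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
    else if (t == T_SPACE)
        while (**p == ' ' || **p == '\t' || **p == '\n') (*p)++;
    else if (**p)
        (*p)++;                 /* numbers/specials scanned elsewhere */
    return t;
}
```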

In working with conventional procedural languages (C, Pascal and their
derivatives), I have found that only a very few cases actually occur:

Symbols: These have the format CC0 CC1* (as above). In addition, in all
of the cases I've seen, CC0 is a subset of CC1. This is handled by a
single character translation table: the entry is either '\0' (not in the
sequence, so the symbol is finished), '\377' (ignore the character, but do
not terminate the sequence) or the character desired in the symbol (which
may be different from the actual character read, for instance if case is
not significant).
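A sketch of that single translation table in C, assuming purely for illustration a case-insensitive language that also ignores '_' inside symbols:

```c
#include <assert.h>
#include <string.h>

/* Per-character translation for symbols: 0 terminates the symbol, SKIP
 * (0377) means "ignore but keep scanning", anything else is the character
 * to store -- here lower-cased, since we assume case is not significant. */

#define SKIP 0377

static unsigned char symtrans[256];

static void init_symtrans(void)
{
    memset(symtrans, 0, sizeof symtrans);   /* default 0: ends the symbol */
    for (int c = 'a'; c <= 'z'; c++) symtrans[c] = (unsigned char)c;
    for (int c = 'A'; c <= 'Z'; c++) symtrans[c] = (unsigned char)(c - 'A' + 'a');
    for (int c = '0'; c <= '9'; c++) symtrans[c] = (unsigned char)c;
    symtrans['_'] = SKIP;                   /* ignored, not terminating */
}

/* Copy the translated symbol starting at *src into buf, advance *src past
 * it, and return the stored length. */
size_t scan_symbol(const char **src, char *buf)
{
    size_t n = 0;
    unsigned char t;
    while ((t = symtrans[(unsigned char)**src]) != 0) {
        if (t != SKIP)
            buf[n++] = (char)t;
        (*src)++;
    }
    buf[n] = '\0';
    return n;
}
```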

Numbers: These are the only things where I've found enough complexity to
justify a full regular expression. The tables are greatly reduced in
size, however, by using a character translation table on the input before
evaluating the regular expression. All characters which can never occur
in a numeric constant (which is most of them) map to the same character.
Typically, the remaining character set will have less than 30 characters.
Note too that when there are two different characters with the same
significance ('a' and 'A', for example), these can map to the same
character. I've found it useful to map all of the characters which can
serve as digits to their numeric value.
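A sketch of such a translation function for C-like numeric constants; the code values are invented for this example. Note that 'e' simply maps to its hex digit value 14, and the DFA for decimal constants can treat that same code as the exponent marker:

```c
#include <assert.h>

/* Translation of raw characters into a small alphabet for the numeric
 * DFA: digit-like characters map to their numeric value, the handful of
 * other characters that can appear in a number get codes of their own,
 * and everything else collapses into one "other" code. */

enum { C_DOT = 16, C_SIGN = 17, C_HEXMARK = 18, C_OTHER = 19 };

int numtrans(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;  /* hex digits; 'e' is
                                                       also the exponent
                                                       marker in decimals */
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c == '.') return C_DOT;
    if (c == '+' || c == '-') return C_SIGN;
    if (c == 'x' || c == 'X') return C_HEXMARK;
    return C_OTHER;
}
```

With this in front of the DFA, the transition tables only need a column per code (about 20 here) rather than per character.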

Special characters: The easiest way to do this is just a small table of
legal token strings starting with this character, and a linear search.
Just put the longest strings at the beginning.
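For instance, a table for the tokens starting with '<' in a C-like language (the token values are invented); because the longest strings come first, the first match in the linear search is the longest match:

```c
#include <assert.h>
#include <string.h>

struct Special { const char *text; int token; };

/* Tokens beginning with '<', longest first, null-terminated. */
static const struct Special lt_tokens[] = {
    { "<<=", 1 }, { "<<", 2 }, { "<=", 3 }, { "<", 4 }, { 0, 0 }
};

/* Linear search for the longest special token at *src; advances *src past
 * the match and returns its token value, or -1 if nothing matches. */
int scan_special(const char **src, const struct Special *tab)
{
    for (; tab->text; tab++) {
        size_t n = strlen(tab->text);
        if (strncmp(*src, tab->text, n) == 0) {
            *src += n;
            return tab->token;
        }
    }
    return -1;
}
```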

White space: This is pretty easy, just a loop to skip the characters.

There are, of course, a number of special cases. Perhaps the most
interesting is the '.' in C/C++, which is a number if the following
character is a digit, but a special character otherwise.
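In code, that case is just one character of lookahead before dispatching:

```c
#include <assert.h>
#include <ctype.h>

/* The '.' case in C/C++: ".5" starts a number, "." alone (or ".x") is a
 * special-character token.  p points at the '.'. */
int dot_starts_number(const char *p)
{
    return p[0] == '.' && isdigit((unsigned char)p[1]);
}
```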

And of course, comments. When compiling well-written, maintainable
programs, a significant amount of time will be spent just skipping
comments. The most general solution I have found is to define the start
of comment as a special character token (with a negative token value), and
then generate a DFA (either with tables, or with actual code using switch
statements) to scan to the end.
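As a hand-written example of the kind of switch-based DFA such a generator might emit, here is one that skips a C comment, assuming the opening delimiter has already been consumed:

```c
#include <assert.h>

/* Two-state DFA for finding the end of a C comment: IN is the body, STAR
 * means the previous character was '*'.  Returns a pointer just past the
 * closing delimiter, or NULL if the comment is unterminated. */
const char *skip_comment(const char *p)
{
    enum { IN, STAR } state = IN;
    for (; *p; p++) {
        switch (state) {
        case IN:
            if (*p == '*') state = STAR;
            break;
        case STAR:
            if (*p == '/') return p + 1;
            if (*p != '*') state = IN;      /* another '*' stays in STAR */
            break;
        }
    }
    return 0;                               /* hit end of input first */
}
```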

I also add meta-characters ('#' in C/C++) and recognize line ends
separately from other white space (due to the way my file reader works,
but also useful for maintaining line numbers, etc.)

Being lazy, and not up to date in AI, I simply defined an input format
which required the user to give the above information, although I imagine
that you could also derive it directly from lex input. I also added input
information like a list of keywords (the scanner generated a table used to
initialize the symbol table), base and type information for the numeric
constants, whether case was significant or not, etc. Since my generator
was designed to generate scanners for a specific context, with a known
symbol table class, adding this information allowed me to do away with the
concept of user-supplied actions completely.
James Kanze email:
GABI Software, Sarl., 8 rue du Faisan, F-67000 Strasbourg, France
