Re: Pondering the future of lexical analysis

"John McEnerney" <jmcenerney@austin.rr.com>
20 Oct 2002 22:46:33 -0400

          From comp.compilers

Related articles
Pondering the future of lexical analysis clint@0lsen.net (Clint Olsen) (2002-10-18)
Re: Pondering the future of lexical analysis jmcenerney@austin.rr.com (John McEnerney) (2002-10-20)
Re: Pondering the future of lexical analysis snicol@apk.net (Scott Nicol) (2002-10-20)
Re: Pondering the future of lexical analysis whopkins@alpha2.csd.uwm.edu (Mark) (2002-10-20)
Re: Pondering the future of lexical analysis arnold@skeeve.com (Aharon Robbins) (2002-10-20)
| List of all articles for this month |

From: "John McEnerney" <jmcenerney@austin.rr.com>
Newsgroups: comp.compilers
Date: 20 Oct 2002 22:46:33 -0400
Organization: Compilers Central
References: 02-10-068
Keywords: lex, i18n
Posted-Date: 20 Oct 2002 22:46:32 EDT

in article 02-10-068@comp.compilers, Clint Olsen at clint@0lsen.net wrote on
10/18/02 10:41 PM:


> I've been reading the Dragon Book lately about lexing, and after some
> discussion with folks on the Flex team, the big hurdle in the future
> will be the support of unicode - primarily due to the size of the
> transition tables.
>
> The Dragon book mentions that transitions should be defined for the
> entire alphabet for each state, but this doesn't jive with what I've
> seen in the diagrams. It seems like you only need to store _valid_
> transitions in your tables, and even then you could store those as the
> ranges as they were specified in your lexer specification.


You can do a lot better than this. I once coded a lexer for Unicode
input that handled all transitions for e.g. all valid identifier
characters.


The (obvious?) trick is to use a 2-level dispatch.


The first one maps the entire set of Unicode characters to a set of
keys that group all Unicode characters which have the same behavior
vis a vis the lexer into a single "character"


For example, most alphabetical characters (and in Unicode there are a
-lot- of them) would map to a single "character" except for those that
may appear in non-identifier situations, e.g. 'A'-'F' and 'a'-'f', 'l'
and 'L', ...


This reduces to a fairly small number of "characters" (I recall on the
order of 40) that the lexer actually needs to represent transitions
for; this is small enough to represent it as a square matrix or labels
with branch tables.


I found it convenient to generate this first-level table
programmatically, e.g. for a Java lexer you'd write a Java program
that calls isJavaLetter() isJavaDigit() etc. It takes up 64K bytes but
we're way past the day where we need to worry about that. It's also
probably rare that a single input will access entries all over the
range since they are more-or-less organized by language and you'd
expect typically to find a single language + ASCII in a typical input.


Something else they don't tell you in the Dragon Book is that lexers is one
of the areas where it can still make sense to code in assembly language. You
are, after all, executing a path through the lexer on every character in the
input--if you can cut that in half it makes a measurable difference.
(Assuming of course you don't populate the rest of your compiler with
algorithms that are N-squared or worse in the size of the input)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.