Re: 16 bits lexers...

Jerry Leichter <>
8 May 1997 21:34:15 -0400

          From comp.compilers

Related articles
16 bits lexers... (Darius Blasband) (1997-05-04)
Re: 16 bits lexers... (Scott Stanchfield) (1997-05-08)
Re: 16 bits lexers... (Jerry Leichter) (1997-05-08)
| List of all articles for this month |

From: Jerry Leichter <>
Newsgroups: comp.compilers
Date: 8 May 1997 21:34:15 -0400
Organization: System Management ARTS
References: 97-05-041
Keywords: lex, i18n

| There have been quite some arguments about how to implement scanners
| dealing with 16 bits (unicode) transitions. One of the major issues
| was the growth of the transition tables, and the need for complex
| (hence, time consuming) compression and decompression techniques....

A Unicode lexer is only problematical if you insist that the
implementation match, directly, the logical model. While there are
256 distinct characters in ANSI, no language anyone uses really treats
them all in different ways. In most languages, all letters (upper and
lower class) are treated pretty much identically; all digits,
likewise. If you initially map all letters into a pseudo-character L,
all digits into a pseudo-character D, all other characters that are
legal outside of comments into themselves, and all the rest into a
pseudo-character Q, you can write a lexer with transitions on maybe
20-30 distinct characters. The mapping takes a single 256-byte table.

Yes, there are all sorts of complexities. For example, if you're
lexing C, the digits 0-7 fall into one class (constituents of octal
values), 8 and 9 into another. The letters a-f and A-F are "special"
in that they can be part of hex values. If you want to recognize
keywords in your lexer, rather than doing an after-the-fact lookup in
a table, all letters that occur in any keyword are "special". But
even in the worst case, it's unlikely you're going to end up with even
100 distinct classes.

If you go to Unicode, not much is going to change. Sure, you may have
many more things to toss into the "letter" class, if your language
allows for accented letters in identifiers and such. And perhaps you
have a bunch of new operator symbols. But just about everything will
usually fall into that catch-all Q class - legal in strings, illegal
anywhere else. The total number of classes is unlikely to change very
much. Almost certainly, a one-byte representation for character
classes will be more than enough, giving you a 65K-byte input
translation map - no big deal.

Sure, you *can* trivially construct a first-order language with
distinct transitions on 1000 different, or all 65K, Unicode
characters. Why would you want to? (Think about how long the lex
specification must be! If you can actually come up with a reasonable
example that needs so many different transitions - is this really the
right way to express it?)
-- Jerry

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.