Re: How to implement a double byte Lex and Yacc (Chris Clark USG)
22 Apr 1997 21:21:29 -0400

          From comp.compilers

Related articles
Double-byte lex and yacc? (Michael O'Leary) (1997-04-02)
Re: Double-byte lex and yacc? (1997-04-03)
How to implement a double byte Lex and Yacc (JUKKA) (1997-04-16)
Re: How to implement a double byte Lex and Yacc (John Lilley) (1997-04-20)
Re: How to implement a double byte Lex and Yacc (1997-04-22)
Re: How to implement a double byte Lex and Yacc (Stephen Arons) (1997-05-05)
Re: How to implement a double byte Lex and Yacc (1997-05-06)
| List of all articles for this month |

From: (Chris Clark USG)
Newsgroups: comp.compilers
Date: 22 Apr 1997 21:21:29 -0400
Organization: Digital Equipment Corporation - Marlboro, MA
References: 97-04-013 97-04-023 97-04-075 97-04-140
Keywords: lex, i18n

John Lilley asked:
> I welcome other optinions, but I am dubious as to the prospects of a
> table-based DFA scanner for Unicode character sets. A C++ scanner
> that I generated (for ASCII) produces a state-transition table with
> something like 50,000 entries. The question I have is: how does the
> size of these tables, and the running-time of the DFA creation, scale
> with the size of the character set? Is it linear with the character
> set size? Quadratic?
> If a "normal" algorithm is applied, then all of the character-set
> tests internal to lex will be dealing with 64k members instead of 256
> members, which says to me that lex will probably run about 256 times
> slower. But that's just a guess... Has anyone really tried this?

It all depends on your internal and extenal representations of the
character sets. Most of it can actually be constant time if done
right, i.e. size of character set doesn't really matter. For example,
we represent ranges as lower_bound..upper_bound both internally and in
the output tables (if the -table small option is requested), and by
doing so the size of the lower and upper bounds are irrelevant. Now,
that wouldn't be a useful optimization if the character sets did not
have large ranges (imagine if even/odd characters were the
discriminant), but as you and Terrence have speculated it is
worthwhile for the common cases.

By the way, there are probably other reasons why your C++ scanner was
large also. You probably represented every keyword as a token in the
lexer and had a state for each character. That is far from space
optimal (although it wouldn't be too bad if your scanner used the
tunnel automata approach of Cocktail). A much more space efficient
solution is to put the keywords in the symbol table and have that
change the token type when an identifier matches one of those special
symbols. Frank Deremer (of LALR language fame) called the two parts
of tokenizing with that approach scanning and screening (and for those
of you who hadn't guessed ;-) that's the model of lexing supported by
Yacc++). By the way, if space is at a real premium, you can even
implement the screening with a compressed table.

-Chris Clark

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.