Re: How to implement a double byte Lex and Yacc

John Lilley <jlilley@empathy.com>
20 Apr 1997 12:16:28 -0400

          From comp.compilers

Related articles
Double-byte lex and yacc? moleary@primus.com (Michael O'Leary) (1997-04-02)
Re: Double-byte lex and yacc? Julian.Orbach@unisys.com (1997-04-03)
How to implement a double byte Lex and Yacc jukkaj@ping.at (JUKKA) (1997-04-16)
Re: How to implement a double byte Lex and Yacc jlilley@empathy.com (John Lilley) (1997-04-20)
Re: How to implement a double byte Lex and Yacc clark@quarry.zk3.dec.com (1997-04-22)
Re: How to implement a double byte Lex and Yacc arons@panix.com (Stephen Arons) (1997-05-05)
Re: How to implement a double byte Lex and Yacc vern@daffy.ee.lbl.gov (1997-05-06)
| List of all articles for this month |

From: John Lilley <jlilley@empathy.com>
Newsgroups: comp.compilers
Date: 20 Apr 1997 12:16:28 -0400
Organization: Nerds for Hire, Inc.
References: 97-04-013 97-04-023 97-04-075
Keywords: lex, i18n

JUKKA wrote:
> Here is what I could cook. I think one should use Unicode to
> implement a truly universal scanner. Ok maybe Lex is more
> dedicated to the old Ascii 256 character set. But also computers
> have now mostly 32Mb and not 32kb of memory - as compared to
> the times when Lex was done. So basicly the tables can be
> bigger now.


I welcome other optinions, but I am dubious as to the prospects of a
table-based DFA scanner for Unicode character sets. A C++ scanner
that I generated (for ASCII) produces a state-transition table with
something like 50,000 entries. The question I have is: how does the
size of these tables, and the running-time of the DFA creation, scale
with the size of the character set? Is it linear with the character
set size? Quadratic?


If a "normal" algorithm is applied, then all of the character-set
tests internal to lex will be dealing with 64k members instead of 256
members, which says to me that lex will probably run about 256 times
slower. But that's just a guess... Has anyone really tried this?


I am currently working on ANTLR 2.0, which produces LL(k)
recursive-descent lexers. ANTLR 2.0 is written in Java and generates
Java, so it gets Unicode support "for free". The times that I have
built lexers with the entire Unicode character set enabled, the
analysis ran maybe one minute instead of 2 seconds, and generated
these gigantic bit sets for the lookahead tests. However, even ANTLR
2.0 doesn't really produce Unicode lexers yet, because the resulting
size of the generated bit-set and switch() code chokes the compilers.
So we are looking at ways to optimize the bit-set tests by identifying
linear sub-ranges for the testing -- an optimization which, IMHO, is
difficult to integrate into table-driven lexers such as those
generated by lex.


john lilley
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.