Re: How to implement a double byte Lex and Yacc


Related articles
Double-byte lex and yacc? moleary@primus.com (Michael O'Leary) (1997-04-02)
Re: Double-byte lex and yacc? Julian.Orbach@unisys.com (1997-04-03)
How to implement a double byte Lex and Yacc jukkaj@ping.at (JUKKA) (1997-04-16)
Re: How to implement a double byte Lex and Yacc jlilley@empathy.com (John Lilley) (1997-04-20)
Re: How to implement a double byte Lex and Yacc clark@quarry.zk3.dec.com (1997-04-22)
Re: How to implement a double byte Lex and Yacc arons@panix.com (Stephen Arons) (1997-05-05)
Re: How to implement a double byte Lex and Yacc vern@daffy.ee.lbl.gov (1997-05-06)

From: vern@daffy.ee.lbl.gov (Vern Paxson)
Newsgroups: comp.compilers
Date: 6 May 1997 22:29:21 -0400
Organization: Lawrence Berkeley National Laboratory, Berkeley CA
References: 97-04-013 97-04-023 97-04-075 97-04-140
Keywords: lex, i18n

John Lilley <jlilley@empathy.com> wrote:
> something like 50,000 entries. The question I have is: how does the
> size of these tables, and the running-time of the DFA creation, scale
> with the size of the character set? Is it linear with the character
> set size? Quadratic?


For flex, it works as follows. As it scans the input rules, it
partitions the input alphabet's characters into equivalence classes,
based on which characters are always used together. While doing this
with complete accuracy is a bit of work, the following heuristic works
well: characters appearing in a character class are (in that instance)
used together, and any other occurrence of a character makes it
unique. So, for example, if the rules are:


%%
foobar
[a-z]+
[A-Za-z0-9]+


Then f, o, b, a, and r are each put into their own equivalence class; the
other lowercase letters go into one class; the uppercase letters and
digits go into one class; and everything else goes into the final class.
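
To make the heuristic concrete, here is a small stand-alone C sketch (my
own illustration, not flex's actual source) that builds the partition for
the example above: every literally-used character is split off into its
own class, and each character class then splits whatever existing classes
it straddles. Run as-is it reports 8 classes.

#include <stdio.h>
#include <string.h>

#define CSIZE 256

static int class_of[CSIZE];   /* class_of[c] = equivalence class of char c */
static int num_classes = 1;   /* initially every character is in class 0   */
                              /* classes stay nonempty, so num_classes <= CSIZE */

/* A character used literally (outside [...]) gets a class of its own. */
static void make_unique(unsigned char c)
{
    class_of[c] = num_classes++;
}

/* Refine the partition by a set of characters that appear together in a
 * character class: split every existing class that straddles the set. */
static void refine_by_set(const int in_set[CSIZE])
{
    int size[CSIZE] = { 0 }, inside[CSIZE] = { 0 };
    int new_id[CSIZE];
    int c;

    for (c = 0; c < CSIZE; c++) {
        size[class_of[c]]++;
        if (in_set[c])
            inside[class_of[c]]++;
    }
    memset(new_id, -1, sizeof new_id);

    for (c = 0; c < CSIZE; c++) {
        int old = class_of[c];
        if (in_set[c] && inside[old] < size[old]) {  /* class straddles set */
            if (new_id[old] < 0)
                new_id[old] = num_classes++;
            class_of[c] = new_id[old];
        }
    }
}

int main(void)
{
    int in_set[CSIZE], seen[CSIZE] = { 0 };
    int c, distinct = 0;
    const char *lits = "fobar";                 /* literal chars in "foobar" */

    for (c = 0; lits[c] != '\0'; c++)
        make_unique((unsigned char)lits[c]);

    memset(in_set, 0, sizeof in_set);           /* [a-z] */
    for (c = 'a'; c <= 'z'; c++) in_set[c] = 1;
    refine_by_set(in_set);

    for (c = 'A'; c <= 'Z'; c++) in_set[c] = 1; /* [A-Za-z0-9] */
    for (c = '0'; c <= '9'; c++) in_set[c] = 1;
    refine_by_set(in_set);

    for (c = 0; c < CSIZE; c++)
        if (!seen[class_of[c]]++)
            distinct++;

    printf("%d distinct equivalence classes\n", distinct);   /* prints 8 */
    return 0;
}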


Once the equivalence classes are generated, everything from that point
on is done only in terms of equivalence classes. For the example
above, this reduces the 256-character input alphabet to an equivalent
alphabet with only 8 distinct symbols.


When generating transition tables, the first table maps from input
character to equivalence class, and the remainder are all again in
terms of equivalence classes. (You can turn this off if you wish, to
avoid the cost of the mapping - if so, flex re-expands the equivalence
classes as it writes the output file.)
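
In generated-scanner terms, the inner loop becomes a double lookup.
This is only a sketch with made-up sizes, and the array names merely
echo the flavor of what flex emits rather than reproducing it:

#define CSIZE       256
#define NUM_CLASSES 8          /* e.g. the example above            */
#define NUM_STATES  16         /* hypothetical number of DFA states */

extern const unsigned char yy_ec[CSIZE];             /* char -> equivalence class    */
extern const short yy_nxt[NUM_STATES][NUM_CLASSES];  /* (state, class) -> next state */

static int next_state(int state, unsigned char c)
{
    return yy_nxt[state][yy_ec[c]];   /* map the byte once, then transition */
}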


This means that the running-time of DFA creation, table size, etc.,
scale not with the size of the input alphabet (except when writing out
the character -> equivalence class mapping), but with the number of
distinct symbols used from the alphabet.
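
To put rough numbers on that (purely illustrative figures of my own):
suppose a double-byte scanner with a 65536-symbol alphabet, 500 DFA
states, and rules that actually use only 200 distinct symbols.

#include <stdio.h>

int main(void)
{
    long alphabet = 65536;   /* double-byte input alphabet        */
    long states   = 500;     /* hypothetical number of DFA states */
    long classes  = 200;     /* distinct symbols actually used    */

    long flat    = states * alphabet;            /* (state, char) table              */
    long two_lvl = alphabet + states * classes;  /* char->class map + (state, class) */

    printf("flat table:          %ld entries\n", flat);      /* 32768000 */
    printf("with equiv. classes: %ld entries\n", two_lvl);   /*   165536 */
    return 0;
}

Only the 65536-entry mapping scales with the alphabet; the transition
tables themselves scale with the symbols the rules actually use.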


Vern
--

