16 bits lexers...

Darius Blasband <darius@phidani.be>
4 May 1997 22:10:30 -0400

          From comp.compilers

Related articles
16 bits lexers... darius@phidani.be (Darius Blasband) (1997-05-04)
Re: 16 bits lexers... scotts@metaware.com (Scott Stanchfield) (1997-05-08)
Re: 16 bits lexers... leichter@smarts.com (Jerry Leichter) (1997-05-08)
| List of all articles for this month |

From: Darius Blasband <darius@phidani.be>
Newsgroups: comp.compilers
Date: 4 May 1997 22:10:30 -0400
Organization: Phidani Software, Brussels
Keywords: lex, i18n

Hello,


There have been quite some arguments about how to implement scanners
dealing with 16 bits (unicode) transitions. One of the major issues
was the growth of the transition tables, and the need for complex
(hence, time consuming) compression and decompression techniques.


Thinking about this problem, I thought that a very simple solution
would be to consider a 16 bits transition as a sequence of two 8 bits
transitions. Of course, this would induce a performance penalty, since
two transitions would be required instead of one, but it might free us
from the need of compression, or at least, it might allow us to
consider simpler and much efficient compression schemes.


Another problem might be the size of the automaton. Recognizing a
shorter input (8 bits instead of 16 bits) yields less information.
More states are required. At first sight, the increase in terms of
states might be dramatic. By dividing the transitions of the input in
smaller transitions, we add states to the NFA. Each state of the
corresponding DFA is a member of the power set of the NFA states.
Based on this assumption, the number of DFA states might grow
exponentially when the number of NFA states grows linearly.


However, in practice, things are far from being as dramatic as they
sound. When one deals with 16 bits transitions, split into 8 bits
transitions, the NFA state set is divided into two disjoint subsets. A
DFA state must be included in one of these disjoint subsets. This can
be generalized easily to cases where the "large" transitions are
divided in more than two "smaller" transitions.


In other words, the power set is an unnecessarily pessimistic
assumption. Exponential growth does not happen.


In order to measure this on actual scanners, I added an option to
lexyt (which is the lex-like tool we use internally and that generates
YAFL code instead of C) so that the number of bits of the transitions
can be changed. Acceptable values are 8 (the default), 4, 2, and
1. The following table summarizes the result obtained when producing
lexers for the same specification (something like the C lexical
definition) using different transition sizes. The table includes the
number of NFA states, the number of DFA states, the number of entries
in the corresponding uncompressed transition table, and (even if not
very relevant) the lexer *generation* time in seconds (not to be
confused with the lexer's own performance).


Bits NFA DFA Table Time


      8 1034 150 38400 5
      4 1632 250 4000 8
      2 2828 459 1836 14
      1 5220 877 1754 41


One can see that the table size shrinks as the transition size gets
shorter. My guess is that using this technique to handle 16 bits
transitions would result in tables that would be small enough to allow
simpler (if any) compression techniques and that would outbalance the
cost due to the increase of the number of transitions required to
analyze a given input.


Well, perhaps this is not original at all. Please feel free to flame
me (in a gentle way, if possible) if this has been written before, or
if the issues raised here are not even relevant to the true concerns
of those people dealing with Unicode scanners...


Cheers,


darius.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.