4 May 1997 22:10:30 -0400

Related articles |
---|

16 bits lexers... darius@phidani.be (Darius Blasband) (1997-05-04) |

Re: 16 bits lexers... scotts@metaware.com (Scott Stanchfield) (1997-05-08) |

Re: 16 bits lexers... leichter@smarts.com (Jerry Leichter) (1997-05-08) |

From: | Darius Blasband <darius@phidani.be> |

Newsgroups: | comp.compilers |

Date: | 4 May 1997 22:10:30 -0400 |

Organization: | Phidani Software, Brussels |

Keywords: | lex, i18n |

Hello,

There have been quite some arguments about how to implement scanners

dealing with 16 bits (unicode) transitions. One of the major issues

was the growth of the transition tables, and the need for complex

(hence, time consuming) compression and decompression techniques.

Thinking about this problem, I thought that a very simple solution

would be to consider a 16 bits transition as a sequence of two 8 bits

transitions. Of course, this would induce a performance penalty, since

two transitions would be required instead of one, but it might free us

from the need of compression, or at least, it might allow us to

consider simpler and much efficient compression schemes.

Another problem might be the size of the automaton. Recognizing a

shorter input (8 bits instead of 16 bits) yields less information.

More states are required. At first sight, the increase in terms of

states might be dramatic. By dividing the transitions of the input in

smaller transitions, we add states to the NFA. Each state of the

corresponding DFA is a member of the power set of the NFA states.

Based on this assumption, the number of DFA states might grow

exponentially when the number of NFA states grows linearly.

However, in practice, things are far from being as dramatic as they

sound. When one deals with 16 bits transitions, split into 8 bits

transitions, the NFA state set is divided into two disjoint subsets. A

DFA state must be included in one of these disjoint subsets. This can

be generalized easily to cases where the "large" transitions are

divided in more than two "smaller" transitions.

In other words, the power set is an unnecessarily pessimistic

assumption. Exponential growth does not happen.

In order to measure this on actual scanners, I added an option to

lexyt (which is the lex-like tool we use internally and that generates

YAFL code instead of C) so that the number of bits of the transitions

can be changed. Acceptable values are 8 (the default), 4, 2, and

1. The following table summarizes the result obtained when producing

lexers for the same specification (something like the C lexical

definition) using different transition sizes. The table includes the

number of NFA states, the number of DFA states, the number of entries

in the corresponding uncompressed transition table, and (even if not

very relevant) the lexer *generation* time in seconds (not to be

confused with the lexer's own performance).

Bits NFA DFA Table Time

8 1034 150 38400 5

4 1632 250 4000 8

2 2828 459 1836 14

1 5220 877 1754 41

One can see that the table size shrinks as the transition size gets

shorter. My guess is that using this technique to handle 16 bits

transitions would result in tables that would be small enough to allow

simpler (if any) compression techniques and that would outbalance the

cost due to the increase of the number of transitions required to

analyze a given input.

Well, perhaps this is not original at all. Please feel free to flame

me (in a gentle way, if possible) if this has been written before, or

if the issues raised here are not even relevant to the true concerns

of those people dealing with Unicode scanners...

Cheers,

darius.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.