References for scanner table compression using "comp-vector" email@example.com (1994-10-07)
Re: References for scanner table compression using "comp-vector" firstname.lastname@example.org (1994-10-11)
From: email@example.com (Vern Paxson)
Organization: Lawrence Berkeley Laboratory, Berkeley CA
Date: Tue, 11 Oct 1994 19:09:24 GMT
Michael Sperber [Mr. Preprocessor] <firstname.lastname@example.org> wrote:
> It seems to be a fairly widespread technique to compress scanner
> tables using the comb-vector method described in the Dragon Book ...
> However, I haven't been able to find an algorithm to actually compute
> these data structures.
I ran into this problem when writing flex's table compression. Under
Van Jacobson's guidance, I tried to determine where most of the redundancy
in the DFA tables lay, and then tuned the base/default/next/check approach
to compress out these redundancies. We learned several things:
- It's a big win to group characters into "equivalence classes":
sets of characters that are always used in exactly the same
contexts. Flex does this early in the game and then uses only
equivalence classes for the rest of the DFA subset construction
and table compression, a significant time & space savings.
- Very often a DFA state D1 will have a large number of transitions
on different equivalence classes to the same next state, D2, and
only one exceptional transition on a single equivalence class to D3.
Another state D4 will exhibit the same pattern with transitions to
D2 and D5. So it pays to identify states like D2, build a pseudo-DFA
state that *only* makes transitions to it, and then chain the tables
for states like D1 and D4 to the table for D2 via the "default"
array. Internally (if you have the stomach to poke around), flex
refers to the pseudo-DFA states as "templates".
- It also turns out that fairly often two states D6 and D7 will
have quite similar though not identical transitions to many
different next states. To deal with this, flex keeps an MRU
queue of 50 previous states, and if a new state is heterogeneous
enough (not a candidate for using a "template", as discussed
above), it spins through the list looking for close matches.
If it finds a match, it chains the new state to the old one
and moves the old state to the front of the queue.
These are the basic mechanisms. They work quite effectively, though in
recent years it seems to me that more and more users are willing to trade
off larger tables for higher speeds, because a 50 KB table is no longer a
big deal. So table compression is no longer a key concern.
I've never gotten around to writing up the techniques. The code is
actually well commented, though (IMHO). See "tblcmp.c" in the flex
distribution, which you can get from ftp.ee.lbl.gov.
Vern Paxson email@example.com
Information and Computing Sciences ucbvax!ee.lbl.gov!vern
Lawrence Berkeley Laboratory (510) 486-7504