Reg. expr. character class compression

cohenw@ecn.purdue.edu (William Cohen)
Thu, 18 Mar 1993 16:39:01 GMT

          From comp.compilers

Related articles
regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14)
Reg. expr. character class compression cohenw@ecn.purdue.edu (1993-03-18)
Reg. expr. character class compression dmr@research.att.com (1993-03-21)
| List of all articles for this month |

Newsgroups: comp.compilers
From: cohenw@ecn.purdue.edu (William Cohen)
Keywords: lex, DFA
Organization: Compilers Central
References: 93-03-046
Date: Thu, 18 Mar 1993 16:39:01 GMT

I have seen quite a bit of discussion on how to implement character
classes in NFAs and DFAs, and I thought that it would be productive to
explain how the Purdue Compiler Construction Tool Set (PCCTS) lexical
analyzer generator does character class compression. The character class
compression in DLG makes the conversion from NFA to DFA go much faster in
addition to making the resulting DFA tables smaller because it does the
character compression on the NFA. The price paid for this compression is
that there is an extra look up operation in the automaton to convert the
character to the class.


The algorithm works on the NFA initially built by DLG. When DLG is
building the NFA it records all characters used and expands out each range
operation in a set. Therefore, there is no difference between [a-d] and
[abcd]. These sets are used to label the transition arcs between the NFA
nodes.


Now for the part that you all have been waiting for, how does it convert
the expanded sets back into compact character classes. This is where the
set of characters used is needed by the relabeling algorithm. DLG picks a
character from that set of used characters that has to be in the partition
and assumes that the partition is the set of unpartitioned characters.
DLG then goes through every transition arc and if the label for the arc
contains that character, DLG does a set intersection between the label and
the current partition. When all of the arcs have been examined a
character class is completed. The character class is then removed from
the unpartitioned set. Another character is selected from the
unpartitioned set and process is repeated with the unpartitioned set until
the unpartitioned set is empty. Another pass is made over the transition
arc labels to replace the old labels with new labels composed of the new
character classes.


This method of selecting the character classes has three very good
qualities about it. First, it is indifferent to how the character classes
were specified in the lexical specification and may even find character
classes for things that were not specified as such. Second, the time to
convert a NFA to a DFA is proportional to the number of character classes,
so reducing the number of character classes by relabeling with a smaller
number of sets improves performance. Finally, it also reduces the size of
the resulting DFA, since each DFA array has one element for each character
class.


An example of the difference that the character class compression can make
is given below. This example is the PCCTS C grammar lexical analyzer
because each keyword is searched for explicitly. For the uncompressed
lexical analyzer each DFA node has 257 entries (256 characters +EOF ),
while the compressed one had 64 entries for the largest DFA nodes, roughly
a factor of four savings in space and time using the compression.


garage.ecn.purdue.edu% /bin/time dlg -i parser.dlg scanbig.c
dlg Version 1.06 1989-1992
              21.4 real 20.7 user 0.3 sys
garage.ecn.purdue.edu% /bin/time dlg -i -C2 parser.dlg scansmall.c
dlg Version 1.06 1989-1992
                6.7 real 6.3 user 0.1 sys


garage.ecn.purdue.edu% ls -l scansmall.c scanbig.c
-rw-r--r-- 1 pccts 367140 Mar 18 10:54 scanbig.c
-rw-r--r-- 1 pccts 98910 Mar 18 10:23 scansmall.c


garage.ecn.purdue.edu% size scansmall.o scanbig.o
text data bss dec hex
9600 33288 88 42976 a7e0 scansmall.o
9464 137160 88 146712 23d18 scanbig.o


If you are interested in getting a copy of PCCTS which includes DLG send
mail to the PCCTS mailserver, pccts@ecn.purdue.edu, with a blank subject
line. There is no charge or copyright for this software.


-Will Cohen
cohenw@ecn.purdue.edu
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.