regular expressions (bug-report)

Jos Horsmeier <jos@thinx.convex.com>
Tue, 16 Mar 1993 23:08:26 GMT

          From comp.compilers

Related articles
regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14)
Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15)
Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15)
regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16)
Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19)
Re: regular expressions (bug-report) henry@zoo.toronto.edu (1993-03-25)
Re: regular expressions (bug-report) kanze@us-es.sel.de (1993-03-26)
| List of all articles for this month |

Newsgroups: comp.compilers
From: Jos Horsmeier <jos@thinx.convex.com>
Originator: daemon@toonder.rivm.nl
Keywords: lex
Organization: AND Software Inc, Texas
References: 93-03-046
Date: Tue, 16 Mar 1993 23:08:26 GMT

megatest!plethorax!djones@uu4.psi.com (Dave Jones) writes:
[ Transforming an NFA into a DFA: ]
|However it is not immediately clear to me how to handle character range
|matches. For example, most reg-exp packages allow you to indicate a range
|of matches with a construct such as this:


| [a-zA-Z]


|The above means, match any letter, either upper or lower case. There is
|potential ecconomy for the implementation as well as for the user, because
|the processor need not check for a match against every letter in the
|range. It can take advantage of the fact that in ASCII, the letters a-z
|are contiguous, and merely check to see if the current character is in
|range.


|It is not immediately obvious to me how to handle these character-ranges
|when converting an NFA to a DFA.
|The simplest way would be to split the construct [a-zA-Z] into 52
|single-character transitions. But the character-range tests should be
|preserved in the implementation when possible. When it is not possible,
|perhaps a subrange or a couple of subranges could be used in the DFA.


It's just a matter of implementation. One implementation of the DFA simply
pops those character ranges into existence. After transforming the NFA
into a DFA in the normal way, i.e. build a separate transition for all
separate characters, simply ignore those character classes, one can
represent the DFA as a matrix, where all the rows are labeled with the
state labels of the DFA and all columns are labeled with all the
characters in the input alphabet. As known, these matrices are usually
very sparse. Now, have a look at any row: if this row contains two or more
cells, all adjacent, and their contents is the same, we've found a range
of characters that all cause the DFA to do a transition from the state
labeled at that particular row, to the state stored in all those cells.


An example:


        a b c d e f g h i j k l m n
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+
S | T | T | T | T | | | U | U | U | | | V | | |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+


This row represents all transitions starting from state S. This row could
easily be represented as:


S: R#(a,d)T, R#(g,i)U, S#(l)V


Here R#(x,y) represents a range [x-y] and S#(x) represents a single
character x. The last characters (T, U and V) represent the resulting
state after the transition.


If long ranges exist in those row, this `range representation' is quite a
compact method of storing the entire DFA. In practice, as I have noticed,
if one represents those ranges as strings, lots of rows in the matrix are
(proper) prefixes of other rows. So, taking advantage of that, one could
easily reduce the amount of memory needed for the storage of a DFA.


So, concluding: let go of the idea of preserving character ranges, while
doing the transformation of the NFA into a DFA, because those character
classes (and maybe others) will show up again when the DFA construction is
ready.


kind regards,


Jos aka jos@and.nl aka thinx!jos@convex.com
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.