regular expressions (bug-report)

megatest!plethorax!djones@uu4.psi.com (Dave Jones)
Sun, 14 Mar 1993 23:07:13 GMT

          From comp.compilers

Related articles
regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14)
Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15)
Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15)
Re: regular expressions (bug report) wright@hicomb.hi.com (David Wright) (1993-03-16)
regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16)
Reg. expr. character class compression cohenw@ecn.purdue.edu (1993-03-18)
Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19)
[2 later articles]
| List of all articles for this month |

Newsgroups: comp.compilers
From: megatest!plethorax!djones@uu4.psi.com (Dave Jones)
Keywords: lex, errors
Organization: Compilers Central
Date: Sun, 14 Mar 1993 23:07:13 GMT

Yesterday I was reading through "Algrorithms", Robert Sedgewick,
Addison-Wesley Pub. Co, 1983. I was surprised to find that the algorithm
on p. 265 for matching a string by scanning an NFA generated from a
regular expression contains a couple of bugs. One is a whopper.


1. The algorithm stops at the first match, rather than looking for the
longest one. The fix: Keep a "longest-match" variable, and do not exit
loop in the first final state encountered. Go until one of the other two
end-conditions is met.


2. (The whopper) The algorithm uses a deque with a marker in it to store
sets of states. The states that the NFA could be in are placed above the
mark, and the states it can shift to on the current token are put below
the mark. The deque is of bounded size, just large enough to hold all
states. But, there is no provision in the code for keeping duplicate
states out of either end of the deque. Pathological regular expressions,
the ones whose DFA's are exponentially larger than their NFA's, may
overflow the deque. In fact I'm not even sure the algorithm as written
proveably stops. A state should not be added at the top (or bottom) of the
deque if it is already at that end. In fact, there is no reason to use a
deque at all. A modern Set-object is just what's needed.


This suggests that scanning an NFA to match strings may not be very
economical process. Perhaps better is to translate the NFA into a DFA
first, thereby factoring the check-for-duplicates into the translation
step rather than doing it as the automaton is scanned to match strings.


The process is straight-forward if only single character transitions are
used in the NFA. In fact it is quite similar to the process for scanning
the NFA for pattern-match.


However it is not immediately clear to me how to handle character range
matches. For example, most reg-exp packages allow you to indicate a range
of matches with a construct such as this:


[a-zA-Z]


The above means, match any letter, either upper or lower case. There is
potential ecconomy for the implementation as well as for the user, because
the processor need not check for a match against every letter in the
range. It can take advantage of the fact that in ASCII, the letters a-z
are contiguous, and merely check to see if the current character is in
range.


It is not immediately obvious to me how to handle these character-ranges
when converting an NFA to a DFA.


The simplest way would be to split the construct [a-zA-Z] into 52
single-character transitions. But the character-range tests should be
preserved in the implementation when possible. When it is not possible,
perhaps a subrange or a couple of subranges could be used in the DFA.


Is there anything published on this? Does anyone know how the various
reg-exp packages in existence do it? Do they work correctly all the time?
(I've found that most reg-exp packages reject some strings that should
match.)
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.