DFA-based regexp matching with substring addressing

vlaurika@cs.hut.fi (Ville Laurikari)
23 Jan 2000 10:12:44 -0500

          From comp.compilers

Related articles
DFA-based regexp matching with substring addressing vlaurika@cs.hut.fi (2000-01-23)
| List of all articles for this month |

From: vlaurika@cs.hut.fi (Ville Laurikari)
Newsgroups: comp.compilers
Date: 23 Jan 2000 10:12:44 -0500
Organization: Helsinki University of Technology, CS lab
Keywords: DFA

Hello all.

Many regular expression matching/searching libraries have a feature
often called "substring addressing". For example, when the
expression "foo([0-9]*)bar" is matched against the string "foo123bar",
the positions corresponding to the parentheses are calculated. Here,
the opening parenthesis would have position 3 (assuming the first
character is at position 0), and the closing parenthesis would have
position 6.

For example, Python's re-library supports substring addressing. The
implementation method (a backtracking matcher not based on
automatons), however, has some shortcomings. In particular, the
longest overall match is not always produced:

>>> import re
>>> re.compile("(a|ab)(blip)?").match("ablip").span()
(0, 5)
>>> re.compile("(a|ab)(blip)?").match("ab").span()
(0, 1)
>>> re.compile("(ab|a)(blip)?").match("ablip").span()
(0, 2)
>>> re.compile("(ab|a)(blip)?").match("ab").span()
(0, 2)

What I want is a matcher which has a commutative union operator and
always produces the longest possible match, *and* supports substring
addressing. However, I have not been able to find one. I did find
this paper, though (in fact, it was recommended to someone asking about
the same problem in comp.compilers back at 1991, IIRC):

Nakata, I. and Sassa, M.:
Regular Expressions with Semantic Rules and Their Application
to Data Structure Directed Programs,
Advances in Software Science and Technology, Vol. 3, pp. 93-108
(Dec. 1991).

The algorithms presented there do not handle all cases correctly,
e.g. "(a*)aaa". I believe I have heard of another paper/article
addressing this problem, but it slips my mind.

Since I really wanted a working solution, I went ahead and designed an
algorithm which compiles regular expressions to DFA-based automatons.
Extra deque-handling operations are attached to the transitions of the
DFA, to keep track of the possible indices in the input string
corresponding the parentheses of the regular expression. The matcher
does exactly what I want, for example, "(ab|a)(blip)?" matched against
"ablip" will produce substring addressing like this: "(a)(blip)".
I believe it is quite fast too, and enables building lexers which
support substring addressing rather easily, among other things.

Now, after this lengthy introduction, I can get to the question:
Is this algorithm of mine anything new? Is it maybe worth a paper?

I was quite dumbfounded by finding almost no references addressing
this problem, but maybe I was just looking in all the wrong places?

Thanks in advace,

    Ville Laurikari


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.