Re: Leftmost longest match with DFA search

"Russ Cox" <rsc@swtch.com>
Mon, 12 May 2008 08:09:04 -0400

          From comp.compilers

Related articles
Leftmost longest match with DFA search monnier@iro.umontreal.ca (Stefan Monnier) (2008-05-10)
Re: Leftmost longest match with DFA search daniel2villeneuve@videotron.ca (Daniel Villeneuve) (2008-05-11)
Re: Leftmost longest match with DFA search rsc@swtch.com (Russ Cox) (2008-05-12)
Re: Leftmost longest match with DFA search monnier@iro.umontreal.ca (Stefan Monnier) (2008-05-13)
Re: Leftmost longest match with DFA search monnier@iro.umontreal.ca (Stefan Monnier) (2008-05-13)
Re: Leftmost longest match with DFA search Danny.Dube@ift.ulaval.ca (2008-05-13)
Re: Leftmost longest match with DFA search rsc@swtch.com (Russ Cox) (2008-05-14)
Re: Leftmost longest match with DFA search Danny.Dube@ift.remove.ulaval.remove.ca (2008-05-15)
| List of all articles for this month |

From: "Russ Cox" <rsc@swtch.com>
Newsgroups: comp.compilers
Date: Mon, 12 May 2008 08:09:04 -0400
Organization: Compilers Central
References: 08-05-026
Keywords: lex, DFA
Cc: compilers@iecc.com
Posted-Date: 12 May 2008 21:01:49 EDT

Stefan Monnier wrote:
> Can someone point me to articles that discuss various ways to get the
> leftmost longest match when implementing regexp search using a DFA?
>
> The "obvious" solution ...


Actually, the "obvious" solution is to implement unanchored search
using repeated anchored searches. Running an anchored DFA search and
remembering the last position that produces a matching state will give
you the longest match at that point in the string. You can create an
unanchored search by trying an anchored search at position 0, then
position 1, and so on. If you stop the DFA when it gets to a dead
state, then if you are lucky most of the searches will process very
little of the string, and the whole thing will run in something like
linear time in many cases. In the worst case, it is quadratic in the
length of the text just to find one match.


This method--implementing unanchored search via repeated anchored
search--is what Bell Labs awk, GNU libc (and gawk), and PCRE all do when
using DFAs. (In fact the backtracking implementations all do this too,
and the DFAs picked up the bad habit from them.)


> ... of turning the problem "search for RE" into the
> problem "match .*RE" (where I use "match" here to mean "anchored
> search") only gives you the leftmost shortest match.


Turning RE into .*RE gives you an unanchored search, but the particular
match it finds depends on other implementation details, specifically
whether you stop and return the position of the first matching state or
continue on and return the last matching state (as above).


If you are searching for RE and you stop at the first matching state,
you do get the shortest match, but what you've really found is the
earliest rightmost endpoint among all (anchored) matches. Since this
is an anchored search, that's the shortest match.


If you are searching for .*RE and you stop at the first matching state,
you get the earliest rightmost endpoint among all (unanchored) matches.
You don't know where the (RE part of the) match begins, and it is not
guaranteed to be a leftmost match. For example, if you are searching
for a*bb|b in abb, you will stop at the position between the two b's,
marking the right endpoint of the "b" match, but the leftmost match is
the entire string "abb".


What you actually find this way is the right endpoint of the leftmost
shortest non-overlapping match. To find the left endpoint, you can,
adapting Daniel Villeneuve's suggestion, match the reverse expression
ER starting at that right endpoint and stop at the earliest match.
You can repeat this operation to find a set containing all the shortest
non-overlapping matches.


Tradtionally, when you ask for all matches for a RE in a text, you get
a set of leftmost-longest non-overlapping matches. Clarke and Cormack
make a case for using the shortest non-overlapping matches instead in
their paper ``On the use of Regular Expressions for Searching Text''
(TOPLAS 1995). http://www.cs.uwaterloo.ca/research/tr/1995/07/regexp.pdf


Russ


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.