
Related articles

Deterministic Finite Automata: Sub-element captures alexander.morou@gmail.com (Alexander Morou) (2009-09-02)
Re: Deterministic Finite Automata: Sub-element captures rsc@swtch.com (Russ Cox) (2009-09-02)
Re: Deterministic Finite Automata: Sub-element captures Danny.Dube@ift.ulaval.ca (2009-09-12)

From: Russ Cox <rsc@swtch.com>
Newsgroups: comp.compilers
Date: Wed, 2 Sep 2009 22:48:00 -0700
Organization: Compilers Central
References: 09-09-013
Keywords: lex, DFA
Posted-Date: 03 Sep 2009 13:10:27 EDT

The closest you can get with a DFA is to take the union of the NFAs of

the various regexps you care to match and tag each NFA with a distinct

matching state. Then when you construct the DFA, you can use the

distinct NFA matching states to tell which regexps are matched by

a particular DFA state. So you could have one regexp for the

hex numbers, another for floating point, and so on. Capturing the

submatch boundaries within each regular expression requires more work:

in general a DFA doesn't track enough information to do explicit

capture tracking. You really need to back off to an NFA (in the

formal sense, not the popular sense) so that you can manipulate the

states individually during the subset simulation.
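The tagged-union idea can be sketched in a few lines of Python. This is an illustration only: the NFAs here are hand-built transition tables standing in for compiled regexps, and the state names and token tags ("int", "hex") are invented for the example.

```python
# Two toy token NFAs, unioned by starting the simulation in both start
# states at once. Each accepting NFA state carries a distinct tag, so the
# subset simulation can report *which* regexp matched.
#
# NFA for decimal integers: [0-9]+      states d0 -> d1 (accepts "int")
# NFA for hex integers:     0x[0-9a-f]+ states h0 -> h1 -> h2 -> h3 (accepts "hex")
edges = {
    "d0": [(str.isdigit, "d1")],
    "d1": [(str.isdigit, "d1")],
    "h0": [(lambda c: c == "0", "h1")],
    "h1": [(lambda c: c == "x", "h2")],
    "h2": [(lambda c: c in "0123456789abcdef", "h3")],
    "h3": [(lambda c: c in "0123456789abcdef", "h3")],
}
accept_tag = {"d1": "int", "h3": "hex"}

def match_tags(text):
    # A DFA state is just a set of NFA states; start in the union of both
    # NFAs' start states and step the whole set on each input character.
    states = {"d0", "h0"}
    for c in text:
        states = {nxt for s in states
                  for pred, nxt in edges.get(s, []) if pred(c)}
    # The tags of the surviving accepting states say which regexps matched.
    return sorted(accept_tag[s] for s in states if s in accept_tag)

print(match_tags("123"))   # ['int']
print(match_tags("0x1f"))  # ['hex']
```

The same sets of NFA states could of course be precomputed into DFA states; the tags survive the subset construction unchanged.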

I like to think of a linearized version of the NFA as a program

for a peculiar virtual machine that allows non-deterministic choice,

by letting multiple lightweight threads run over the NFA at once.

An individual NFA state can be viewed as a program counter (PC) in

this program, and the set of NFA states that corresponds to a

single DFA state during the subset simulation can be thought of

as a thread run queue: there's one run queue being executed

and a second run queue being prepared for the next position

in the input.

In the code at http://swtch.com/~rsc/regexp/regexp1.html,

the State* is the program counter, clist is the current run queue,

and nlist is the next run queue.

The non-determinism in the NFA is accommodated by sometimes

adding multiple PCs--that is, multiple thread states--to nlist.

Reasonable performance is ensured by never adding redundant

entries (same State*) to the list. Thus nlist is at most

O(# states in NFA), producing the O(# states in NFA * length of

text) run-time bound.

So far I am sure I've said nothing you didn't already know,

except maybe the weird thread analogy. But given this basic

framework, it is easy to add submatch tracking.
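The basic framework might look like the following Python sketch. The bytecode program here is hand-compiled for the single regexp "a*b" (the C code at swtch.com compiles from a parse tree instead), and the instruction names are my own invention.

```python
# Bytecode for the regexp "a*b", hand-compiled for illustration.
# Instructions: ("char", c), ("split", x, y), ("jmp", x), ("match",)
prog = [
    ("split", 1, 3),  # pc 0: enter the repetition, or skip ahead to 'b'
    ("char", "a"),    # pc 1
    ("jmp", 0),       # pc 2
    ("char", "b"),    # pc 3
    ("match",),       # pc 4
]

def add_thread(queue, seen, pc):
    # Follow epsilon transitions (split/jmp) eagerly, and never add the
    # same PC twice: the queue length stays O(# NFA states).
    if pc in seen:
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "split":
        add_thread(queue, seen, op[1])
        add_thread(queue, seen, op[2])
    elif op[0] == "jmp":
        add_thread(queue, seen, op[1])
    else:
        queue.append(pc)

def run(text):
    clist, seen = [], set()       # clist: current run queue
    add_thread(clist, seen, 0)
    for c in text:
        nlist, seen = [], set()   # nlist: run queue for the next position
        for pc in clist:
            op = prog[pc]
            if op[0] == "char" and op[1] == c:
                add_thread(nlist, seen, pc + 1)
        clist = nlist
    # Anchored match: did any thread reach "match" at end of input?
    return any(prog[pc][0] == "match" for pc in clist)

print(run("aaab"))  # True
print(run("aa"))    # False
```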

To add submatch tracking, you expand the representation of a

thread on the run queue. Before it was just an NFA state (a PC)

but now we add the submatch boundaries crossed during the

thread's execution so far (capture registers, if you will).

For example, if the regexp is (a(a*)ab), there are four capture

registers, one for each parenthesis. Each parenthesis becomes

an empty transition in the NFA tagged with a register index.

As a thread crosses that transition, it stores the current text

position in the indexed register. In this implementation, it is

still important to bound the length of the run queue, and again

the solution is to toss out multiple threads with the same PC,

even if they have different capture register sets. The capture

register sets cannot influence the future computation of the

threads, so there's no need for both threads: keep the ``better''

one and discard the other. This pruning gives the same run-time

bound as in the non-capturing NFA.
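Extending the previous sketch with capture registers is a matter of widening the thread state. Again this is hand-compiled Python for illustration, here for the regexp "(a*)b": each parenthesis becomes a "save" instruction, and pruning keeps the first (and, as discussed below, higher-priority) thread at each PC.

```python
# Bytecode for "(a*)b": registers 0 and 1 record where the thread
# crossed "(" and ")". Hand-compiled for illustration.
prog = [
    ("save", 0),      # pc 0: record where the group opens
    ("split", 2, 4),  # pc 1
    ("char", "a"),    # pc 2
    ("jmp", 1),       # pc 3
    ("save", 1),      # pc 4: record where the group closes
    ("char", "b"),    # pc 5
    ("match",),       # pc 6
]

def add_thread(queue, seen, pc, regs, pos):
    if pc in seen:                 # same PC already queued: the earlier
        return                     # thread wins; discard this one
    seen.add(pc)
    op = prog[pc]
    if op[0] == "split":
        add_thread(queue, seen, op[1], regs, pos)
        add_thread(queue, seen, op[2], regs, pos)
    elif op[0] == "jmp":
        add_thread(queue, seen, op[1], regs, pos)
    elif op[0] == "save":
        regs = dict(regs)          # copy the register set on write
        regs[op[1]] = pos
        add_thread(queue, seen, pc + 1, regs, pos)
    else:
        queue.append((pc, regs))   # a thread is now (PC, registers)

def run(text):
    clist, seen = [], set()
    add_thread(clist, seen, 0, {}, 0)
    for i, c in enumerate(text):
        nlist, seen = [], set()
        for pc, regs in clist:
            op = prog[pc]
            if op[0] == "char" and op[1] == c:
                add_thread(nlist, seen, pc + 1, regs, i + 1)
        clist = nlist
    for pc, regs in clist:         # anchored match at end of input
        if prog[pc][0] == "match":
            return regs
    return None

print(run("aab"))  # {0: 0, 1: 2} -- group (a*) spans text[0:2] == "aa"
```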

The only thing left is to define what a ``better'' thread is.

The rule in backtracking implementations is to prioritize each

non-deterministic choice: in "a|b", the higher-priority choice

is the "a"; in "a*", it is to continue the repetition; in "a*?"

it is to stop the repetition; and so on. The choice corresponds

precisely to which path the backtracking explores first.

The simulation I just described explores them all simultaneously,

but it is not hard to maintain the run queue so that the threads

appear in their backtracking priority order.

To see how the priority works, suppose a particular thread's

execution path were summarized by a bit string recording the

choices it made. For simplicity, let's call 0 the high-priority

choice and 1 the low-priority choice. With this representation,

comparing the strings lexicographically suffices to determine

which thread is ``better:'' 0000 beats 0100 beats 0111 beats 1000.

It's not necessary to record this history explicitly. Instead,

if the run queue is maintained and traversed in priority order

at each step, each new run queue will end up in priority order.

Suppose there are four threads with histories (a) 0000, (b) 0100,

(c) 0111, and (d) 1000. Then the run queue order is a, b, c, d.

Executing thread (a) can only lead to new threads with histories

beginning with 0000: 00000 and 00001. Executing thread (b) can

only lead to 01000 and 01001. And so on. Because each thread

has a distinct summary, the relative priorities of the successor

threads can't cross: any new thread created during the exploration

from state (a) will necessarily have a higher priority than a thread

created during the exploration from state (b). So as long as the

current run queue is kept in priority order and the choices are

explored in priority order, the next run queue will also be in

priority order. If a thread is already on the next run queue

with a given PC, then any other thread with the same PC is lower

priority and can be discarded.

Since the threads are executed in priority order, once a match

is found, all the remaining threads on the run queue can be

killed early--they can't possibly lead to a higher priority

match. Then once the higher priority threads have all died off

naturally, the most recent match found is necessarily the best one.
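The capture sketch above already implements this discipline, because add_thread visits the first branch of each split first and dedup keeps the earliest arrival at a PC. To see it choose the same submatches a backtracking engine would, consider the ambiguous regexp "(a(*))(a*)" run against "aa" -- the greedy first group should take both characters. A self-contained hand-compiled sketch (same invented instruction set as before):

```python
# Bytecode for "(a*)(a*)": registers 0/1 bracket the first group,
# 2/3 the second. Hand-compiled for illustration.
prog = [
    ("save", 0),        # pc 0
    ("split", 2, 4),    # pc 1: high-priority branch first => greedy
    ("char", "a"),      # pc 2
    ("jmp", 1),         # pc 3
    ("save", 1),        # pc 4
    ("save", 2),        # pc 5
    ("split", 7, 9),    # pc 6
    ("char", "a"),      # pc 7
    ("jmp", 6),         # pc 8
    ("save", 3),        # pc 9
    ("match",),         # pc 10
]

def add_thread(queue, seen, pc, regs, pos):
    # Visiting split branches in order + dedup by PC keeps the queue in
    # backtracking priority order: the first thread at a PC is the best.
    if pc in seen:
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "split":
        add_thread(queue, seen, op[1], regs, pos)
        add_thread(queue, seen, op[2], regs, pos)
    elif op[0] == "jmp":
        add_thread(queue, seen, op[1], regs, pos)
    elif op[0] == "save":
        regs = dict(regs)
        regs[op[1]] = pos
        add_thread(queue, seen, pc + 1, regs, pos)
    else:
        queue.append((pc, regs))

def run(text):
    clist, seen = [], set()
    add_thread(clist, seen, 0, {}, 0)
    for i, c in enumerate(text):
        nlist, seen = [], set()
        for pc, regs in clist:     # traversed in priority order
            op = prog[pc]
            if op[0] == "char" and op[1] == c:
                add_thread(nlist, seen, pc + 1, regs, i + 1)
        clist = nlist
    for pc, regs in clist:         # first surviving match = best match
        if prog[pc][0] == "match":
            return regs
    return None

print(run("aa"))  # {0: 0, 1: 2, 2: 2, 3: 2} -- group 1 = "aa", group 2 = ""
```

Swapping the branch order in the two splits would give the lazy ("a*?") behavior instead, with group 1 empty.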

An alternate approach is to put capturing parentheses around the

entire expression, so that you can tell where each thread's match

begins, and kill off only the threads that start later in the

text. Then keep the match that ends latest in the text. This

produces the so-called ``leftmost longest'' match in a single

linear-time pass through the text, something even most DFA

implementations don't do (it's possible, but that's another story).

Of course, you still need a rule for pruning threads, and you

might as well use the backtracking priorities; you'll end up with

the leftmost longest match and the submatches that a backtracking

engine would have chosen during that match.

The C program at http://swtch.com/~rsc/regexp/nfa-perl.y.txt

implements the algorithm I've just described. I believe that the

regular expression library Rob Pike wrote in the early 1980s was

the first to use this algorithm. It didn't bother trying to mimic

the backtracking priorities but otherwise matches my description.

That library got adapted to become Eighth Edition Unix's regexp(3)

library; Henry Spencer wrote a public domain workalike that used

backtracking instead, and the rest is history. Neither looked

closely enough at the innards of the other library to realize the

difference in implementations. It didn't help that both approaches

call themselves NFA algorithms, nor did it help that Pike's code

wasn't made freely available until 1992 (as part of the text editor

sam). In the interim it became conventional wisdom that to track

submatches one must resort to backtracking. Ville Laurikari

discovered an algorithm with the same time and space complexity

as the above in 2000 and published it (http://laurikari.net/tre).

I suspect that deep down they are variants of a single algorithm,

but I haven't tried to work out the correspondence.

As you might imagine, this NFA simulation is more expensive than

the simple one-memory-lookup-per-input-character DFA, which is

why most lexers don't tell you anything beyond the particular regexp

that matched.

Russ

P.S. Tracking a capture register set is a simple generalization of an

idea that was known to at least one of Aho, Hopcroft, or Ullman in 1974

but then disappeared from the literature, including their later books.

In _The Design and Analysis of Computer Algorithms_, Chapter 9

(Pattern Matching Algorithms) describes the usual NFA simulation

and then gives these two exercises:

9.6 Let x = a_1 a_2 ... a_n be a given string and alpha a regular

expression. Modify Algorithm 9.1 [the NFA simulation] to find

the least k and, having found k, the (a) least j and (b) the

greatest j such that a_j a_j+1 ... a_k is in the set denoted

by alpha. [Hint: Associate an integer j with each state in S_i.]

*9.7 Let x and alpha be as in Exercise 9.6. Modify Algorithm 9.1

to find the least j and, having found j, the greatest k such

that a_j a_j+1 ... a_k is in the set denoted by alpha.

The star on 9.7 is theirs, presumably an indication of difficulty

(I filed away a photocopy of that section but don't have the

entire book handy).

P.P.S. I didn't mention the POSIX rules for choosing submatches;

they were designed to be easy to specify, but I doubt anyone on

the committee knew how to implement them efficiently. The usual

suggestion is to run a backtracking engine until it has enumerated

all possible matches and then pick the one that satisfies the

POSIX rules; this turns a worst-case exponential but common case

faster algorithm into a common-case exponential algorithm, which

is why basically no one implements it. I believe that the POSIX

submatch can be found in a single linear-time pass through the text

in the spirit of the above algorithm, but you have to process the

text backward to hit the matches in an order that enables the right

``better'' comparison. http://swtch.com/~rsc/regexp/nfa-posix.y.txt

has an implementation, though I have not tested it extensively.
