Re: Henry Spencer's "Tcl" Regex Library

"Russ Cox" <rsc@swtch.com>
Tue, 2 Oct 2007 12:12:18 -0400

          From comp.compilers

From: "Russ Cox" <rsc@swtch.com>
Newsgroups: comp.compilers
Date: Tue, 2 Oct 2007 12:12:18 -0400
Organization: Compilers Central
References: 07-10-016
Keywords: lex, DFA
Posted-Date: 03 Oct 2007 13:18:36 EDT

There are (at least) three different issues you've brought up --
the use of the term Hybrid NFA/DFA, whether Tcl does what
Michela does, and on-the-fly DFA constructions.




* Hybrid NFA/DFA


The first problem is your use of the term "hybrid NFA/DFA", which
unfortunately has been defined in Friedl's _Mastering Regular
Expressions_ to mean something different from what you mean.


Friedl doesn't ever define what a DFA or NFA is. He only says
that NFAs provide submatch information and backreferences, and
DFAs don't but run faster. The closest he comes to admitting that
these aren't really definitions is in the sidebar on p. 180:


        As a programmer, if you have a true (mathematically speaking)
        NFA regex engine, it is a relatively small task to add support
        for backreferences. A DFA's engine's design precludes the
        adding of this support, but an NFA's common implementation
        makes it trivial. In doing so, you create a more powerful
        tool, but you also make it decidedly nonregular (mathematically
        speaking). What does this mean? At most, that you should
        probably stop calling it an NFA, and start using the phrase
        ``nonregular expressions,'' since that describes (mathematically
        speaking) the new situation. No one has actually done this,
        so the name ``NFA'' has lingered, even though the implementation
        is no longer (mathematically speaking) an NFA.


        What does all this mean to you, as a user? Absolutely
        nothing. As a user, you don't care if it's regular, nonregular,
        unregular, irregular, or incontinent. So long as you know
        what you can expect from it (something this chapter will
        show you), you know all you need to care about.


        http://regex.info/blog/2006-09-15/248


This "quacks like a duck" approach to calling things NFAs or
DFAs cannot accommodate implementations that run a DFA pass first
and then an NFA pass only to find submatches, or that manage to
use nothing but DFAs to find submatches. So Friedl calls them
"hybrid NFA/DFAs", because they kind of quack like both.
Table 4-1 (p. 145), which classifies programs by "engine type",
lists GNU awk, GNU grep/egrep, and Tcl as "Hybrid NFA/DFA".


The sidebar on p. 182 expands on this classification:


        I've said several times that a DFA can't provide capturing
        parentheses or backreferences. This is quite true, but it
        certainly doesn't preclude hybrid approaches that mix
        technologies in an attempt to reach regex nirvana. The
        sidebar on page 180 told how NFAs have diverged from the
        theoretical straight and narrow in search of more power, and
        it's only natural that the same happens with DFAs. A DFA's
        construction makes it more difficult, but that doesn't mean
        impossible.


        GNU grep takes a simple but effective approach. It uses a
        DFA when possible, reverting to an NFA when backreferences
        are used. GNU awk does something similar--it uses GNU grep's
        fast shortest-leftmost DFA engine for simple "does it match"
        checks, and reverts to a different engine for checks where
        the actual extent of the match must be known. Since that
        other engine is an NFA, GNU awk can conveniently offer
        capturing parentheses, and it does via its special gensub
        function.


        Tcl's regex engine is a true hybrid, custom built by Henry
        Spencer (whom you may remember having played an important
        part in the early development and popularization of regular
expressions [p. 88]). The Tcl engine sometimes appears to
        be an NFA--it has lookaround, capturing parentheses,
        back-references, and lazy quantifiers. Yet, it has true
        POSIX longest-leftmost match ([p. 177]), and doesn't suffer
        from some of the NFA problems that we'll see in Chapter 6.
        It really seems quite wonderful.


Thus using the term "hybrid NFA/DFA" to describe Michela's work
unfortunately makes it sound, to those who know these terms only
through Friedl's book, as if it's no different from GNU grep or Tcl.


Your one-sentence description makes it sound, to me, like it
*is* different from both the awk/grep hybrid and from Tcl; if
it's not too late, you might try to change the name to something
other than "hybrid".




* What does Tcl's regex do?


Let's assume a regexp with capturing parentheses but not
backreferences. (That is, /the (cat|dog) ran/ is okay, but
/(cat|dog) eat \1/ is not.) Let's also assume POSIX-style longest
match semantics are the goal.
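

As a concrete illustration of what POSIX longest-match semantics
mean, here's a little C program using the standard regcomp/regexec
interface (not Tcl's engine, so take it as a sketch of the
semantics rather than of Spencer's code): /a|ab/ run against "ab"
must report the two-character match, whereas a Perl-style
backtracker would stop as soon as "a" succeeds.


        /* Sketch: POSIX leftmost-longest semantics via regcomp/regexec.
         * This is not Tcl's engine; it only illustrates the semantics
         * assumed in the rest of this discussion. */
        #include <regex.h>
        #include <stdio.h>

        int main(void)
        {
                regex_t re;
                regmatch_t m[1];

                if(regcomp(&re, "a|ab", REG_EXTENDED) != 0)
                        return 1;
                if(regexec(&re, "ab", 1, m, 0) == 0)
                        printf("matched [%d,%d)\n",
                                (int)m[0].rm_so, (int)m[0].rm_eo);
                regfree(&re);
                return 0;
        }


A conforming POSIX implementation prints "matched [0,2)"; some
implementations get trickier cases wrong, but this simple case
shows the longest-match rule at work.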


Tcl's regex first runs an unanchored DFA search to find p, the
ending position of the shortest match (meaning the match with
leftmost end). If there is no match, stop, returning failure.


If the caller doesn't care about any match position info, stop,
returning success. Otherwise, it's time to figure out the exact
position of the leftmost longest match.


The leftmost longest match must start at or before p. Tcl's
regex loops over every character position from the beginning of
the string to p, trying an anchored DFA search to find the length
of the longest match starting at the current position. When it
finds a match, it stops the loop: that's the leftmost longest
match.


If the caller only cares about the position of the overall match
($0) but not any of the capturing parentheses ($1, $2, ...),
then stop, returning success. Otherwise, it's time to figure
out the positions of the capturing parentheses.


The algorithm up to this point is:


        match(re, text, match, nmatch)
        {
                etext = text + strlen(text);
                p = shortest(/.*re/, text, etext);
                if(p == NULL)
                        return false;
                if(nmatch == 0)
                        return true;
                for(q = text; q <= p; q++){
                        eq = longest(/re/, q, etext);
                        if(eq != NULL){
                                match[0].start = q;
                                match[0].end = eq;
                                break;
                        }
                }
                if(nmatch == 1)
                        return true;


                dissect(re, match[0].start, match[0].end, match);
                return true;
        }


The function dissect determines the position of every subexpression
in re, saving the ones corresponding to capturing parentheses.
It does this by recursing over the structure of the regexp and
searching using DFAs corresponding to the subpieces.


To dissect a concatenation, Tcl figures out the longest possible
match for the first half and checks whether the remainder matches
the second half. If so, done. If not, find the next longest
possible match for the first half, and so on:


        dissect(re, text, etext, match)
        {
                if(re is concat: left right){
                        for(middle = etext; middle >= text; middle = p-1){
                                p = longest(/left/, text, middle);
                                if(p == NULL)
                                        break;
                                if(longest(/right/, p, etext) == etext){
                                        dissect(left, text, p, match);
                                        dissect(right, p, etext, match);
                                        return;
                                }
                        }
                        panic("cannot dissect concat");
                }


To dissect an alternation, Tcl tries each choice looking for the
one that uses all the text:


                if(re is alt: left|right){
                        if(longest(/left/, text, etext) == etext)
                                dissect(left, text, etext, match);
                        else
                                dissect(right, text, etext, match);
                }


If dissect comes across a capturing parenthesis, then it can
record the match.


                if(re is capture n: (subexpr)){
                        match[n].start = text;
                        match[n].end = etext;
                        dissect(subexpr, text, etext, match);
                }
        }


(The regexp is actually a directed graph that might have cycles,
not a tree; regexps like a* are rewritten into (aa*|).)


Note that there is no backtracking here: each step determines
the subdissection and then moves on. There is no "and if that
doesn't work, ...".


An obvious optimization would be to make dissect do nothing if
the re being considered contains no capturing parens. The Tcl
code I'm looking at appears not to do that.
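

Here is a minimal sketch of that check, written against a
hypothetical tree-shaped Node type (the layout and the name
hascapture are mine, not Spencer's -- his representation is a
graph of NFA states, not this little tree): walk the subexpression
once and have dissect return immediately when there is nothing
to record.


        /* Sketch of the "skip dissection when there are no capturing
         * parens" optimization.  The Node type and hascapture() are
         * hypothetical stand-ins for whatever the real representation
         * provides. */
        #include <stdio.h>

        enum Type { Lit, Concat, Alt, Capture };

        typedef struct Node Node;
        struct Node {
                enum Type type;
                Node *left, *right;     /* children; NULL for Lit */
        };

        /* Does this subexpression contain any capturing parentheses? */
        static int hascapture(Node *n)
        {
                if(n == NULL)
                        return 0;
                if(n->type == Capture)
                        return 1;
                return hascapture(n->left) || hascapture(n->right);
        }

        /* In dissect(), the first thing to do would be:
         *
         *      if(!hascapture(re))
         *              return;   // nothing to record below this point
         */

        int main(void)
        {
                Node lit = { Lit, NULL, NULL };
                Node cap = { Capture, &lit, NULL };
                Node cat = { Concat, &lit, &cap };

                printf("hascapture(lit) = %d\n", hascapture(&lit));  /* 0 */
                printf("hascapture(cat) = %d\n", hascapture(&cat));  /* 1 */
                return 0;
        }


Since dissect exists only to fill in match[] entries, skipping
capture-free subexpressions would save the repeated DFA scans
without changing the result.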


The run-time of this algorithm is a little tricky to analyze.
For a regexp of length m and a text of length n, and ignoring the
time to construct DFAs/NFAs from regexps...


If no submatch info is required, it takes O(n) time.


If only match[0] is required, it takes O(n^2) time.


The top-level concatenation dissection requires O(n^2) time: it
tries up to n split points, each checked with an O(n) DFA scan.
In the worst case you cut off just one character, making the
next levels O(1^2) and O((n-1)^2); the recursion is roughly
T(n) = O(n^2) + T(n-1), which gives O(n^3) time. Alternations
might force you to do this multiple times, but no more than m
times, which would be O(m * n^3) time.


I'm not sure how things like repetition or empty strings affect
the run-time. The fact that a* is rewritten into an effectively
infinite-length regular expression can probably bump the m up
to n, so that you could force O(n^4) behavior if you really tried.


Anyway, at least the run time is polynomial!




* On-the-fly DFA constructions


Thompson's 1968 CACM paper is hard to classify, since it doesn't
mention the terms NFA or DFA. If you're willing to read between
the lines, then the compiled code is really an encoding of an NFA
that runs a subset simulation as it reads the input. It's kind
of building the DFA on the fly, if you view each successive CLIST
as a DFA state, but it's not caching any of the DFA states that
it builds, so it's more like an NFA subset simulation. Since
it doesn't cache at all, it's slower, but it can't possibly build
an exponentially-large machine.
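

To make the subset simulation concrete, here's a toy sketch (my
own representation, not Thompson's compiled code): a hand-built
NFA for (a|b)*abb, simulated by carrying the set of live states
(the CLIST) and computing the set reachable on each input
character (the NLIST).


        /* Sketch of an NFA subset simulation with no caching.  The NFA
         * for (a|b)*abb is built by hand below; state sets are bitmasks,
         * and each input character maps the current set (CLIST) to the
         * next set (NLIST). */
        #include <stdio.h>

        #define NSTATE 4        /* states 0..3; state 3 accepts */

        /* trans[s][c]: bitmask of states reachable from s on character c */
        static unsigned trans[NSTATE][256];

        static void addtrans(int s, int c, int t)
        {
                trans[s][c] |= 1u << t;
        }

        /* Does the NFA accept s?  No DFA states are ever stored. */
        static int run(const char *s)
        {
                unsigned clist, nlist;
                int i, c;

                clist = 1u << 0;                /* start in state 0 */
                for(; *s; s++){
                        c = (unsigned char)*s;
                        nlist = 0;
                        for(i = 0; i < NSTATE; i++)
                                if(clist & (1u << i))
                                        nlist |= trans[i][c];
                        clist = nlist;          /* CLIST <- NLIST */
                }
                return (clist & (1u << 3)) != 0;
        }

        int main(void)
        {
                /* NFA for (a|b)*abb */
                addtrans(0, 'a', 0);
                addtrans(0, 'b', 0);
                addtrans(0, 'a', 1);
                addtrans(1, 'b', 2);
                addtrans(2, 'b', 3);

                printf("abb  -> %d\n", run("abb"));     /* 1 */
                printf("babb -> %d\n", run("babb"));    /* 1 */
                printf("abab -> %d\n", run("abab"));    /* 0 */
                return 0;
        }


Each step costs time proportional to the number of NFA states, so
the whole scan is O(m * n): slower than stepping a real DFA, but
only one state set is ever in memory, so nothing can blow up
exponentially.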


Spencer's Tcl regex builds the DFA on the fly from the NFA
representation, using a cache to speed the DFA simulation. The
cache has a fixed maximum size, though, so it can't end up
building an exponentially-large machine either.
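

Here's the same toy NFA again with the DFA built on the fly and
cached (the names, the layout, and the flush-everything-when-full
policy are mine -- just the simplest thing that respects a fixed
bound, not necessarily what Spencer's code does): each distinct
set of NFA states becomes a cached DFA state whose outgoing edges
are filled in lazily, and the whole cache is discarded when it
hits its limit.


        /* Sketch of on-the-fly DFA construction with a bounded cache.
         * Same hand-built NFA for (a|b)*abb as before.  MAXDFA is tiny
         * on purpose so the flush actually happens on the sample input. */
        #include <stdio.h>
        #include <string.h>

        #define NSTATE 4                /* NFA states 0..3; 3 accepts */
        #define MAXDFA 2                /* fixed cache size */

        static unsigned trans[NSTATE][256];     /* NFA transitions (bitmasks) */

        typedef struct {
                unsigned set;           /* NFA-state set this DFA state stands for */
                unsigned next[256];     /* memoized next set for each byte */
                unsigned char have[256];/* has next[c] been computed yet? */
        } DState;

        static DState cache[MAXDFA];
        static int ncache;

        /* Find or insert the cached DFA state for a given NFA-state set. */
        static int dstate(unsigned set)
        {
                int i;

                for(i = 0; i < ncache; i++)
                        if(cache[i].set == set)
                                return i;
                if(ncache == MAXDFA)    /* cache full: throw it all away */
                        ncache = 0;
                cache[ncache].set = set;
                memset(cache[ncache].have, 0, sizeof cache[ncache].have);
                return ncache++;
        }

        /* One subset step: NFA states reachable from set on character c. */
        static unsigned step(unsigned set, int c)
        {
                unsigned out = 0;
                int i;

                for(i = 0; i < NSTATE; i++)
                        if(set & (1u << i))
                                out |= trans[i][c];
                return out;
        }

        static int run(const char *s)
        {
                unsigned cur = 1u << 0;         /* start set {0} */
                int c, d;

                for(; *s; s++){
                        c = (unsigned char)*s;
                        d = dstate(cur);        /* find/insert cached state */
                        if(!cache[d].have[c]){  /* miss: build this edge now */
                                cache[d].next[c] = step(cur, c);
                                cache[d].have[c] = 1;
                        }
                        cur = cache[d].next[c];
                }
                return (cur & (1u << 3)) != 0;
        }

        int main(void)
        {
                /* NFA for (a|b)*abb */
                trans[0]['a'] = (1u<<0) | (1u<<1);
                trans[0]['b'] = 1u<<0;
                trans[1]['b'] = 1u<<2;
                trans[2]['b'] = 1u<<3;

                printf("abb  -> %d\n", run("abb"));     /* 1 */
                printf("abab -> %d\n", run("abab"));    /* 0 */
                return 0;
        }


When the cache is flushed, edges simply get recomputed, so the
worst case degrades toward the uncached simulation above rather
than toward an exponentially large machine.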




Russ

