Re: Henry Spencer's "Tcl" Regex Library

Chris F Clark <cfc@TheWorld.com>
Tue, 2 Oct 2007 15:02:46 -0400 (EDT)

          From comp.compilers

Related articles
Henry Spencer's "Tcl" Regex Library cfc@shell01.TheWorld.com (Chris F Clark) (2007-10-01)
Re: Henry Spencer's "Tcl" Regex Library rsc@swtch.com (Russ Cox) (2007-10-02)
Re: Henry Spencer's "Tcl" Regex Library cfc@TheWorld.com (Chris F Clark) (2007-10-02)
| List of all articles for this month |
From: Chris F Clark <cfc@TheWorld.com>
Newsgroups: comp.compilers
Date: Tue, 2 Oct 2007 15:02:46 -0400 (EDT)
Organization: Compilers Central
References: 07-10-016 07-10-026
Keywords: lex, DFA
Posted-Date: 03 Oct 2007 13:19:21 EDT

Thank you for your very informative reply.


I don't like the term Hybrid NFA/DFA either. However, it is her and
her professor's choice of terms and they have several papers that use
the term, so I think they will probably continue using it. For what
they are doing, I would use a different term, "mostly deterministic
finite-state automata", because their machine is mostly like a DFA,
but with a few NFA portions (and I mean true NFA not nonregular/regexp
extensions).


For what Friedl calls "nonregular expressions", I am of the "regexp"
naming camp. I use regexp when I mean something that looks like a
regular expression but doesn't describe a regular language. I reserve
regular expression when I mean something that describes a regular
language. I hadn't thought about the issue of what to call an
automaton that can process a regexp. It isn't a true NFA, since it
has the auxillary data structure that handles the captures and
backreferences.


They do mean something quite different from Friedl means. They are
squarely within the standard theoretical framework (e.g. no capturing
subexpressions or backreferences). Their goal is to combat the state
explosion that normally occurs in NFA->DFA translation. However, they
only handle things which "could" be translated into a DFA (if it
weren't for the state explosion problem). For the problems they are
attacking the regexp extension problem isn't an issue because the
prior approaches were all restricted to true regular expressions (or
even simpler set of strings problems).


They probably use the hybrid term because the machine executes a DFA
until certain conditions are met and then when they are met, the
machine can fork and send an NFA thread off to investigate the pattern
which if compiled to a DFA would have caused a state explosion, but
continue running the DFA for the rest of the "main" pattern. Thus,
her machine is a hybrid of a DFA and an NFA. In fact, when you look
at the results of her compilation process, there are sections of her
machine that are DFAs and sections that are NFAs, with "border" states
in between them. The border states are where the forks occur.


Thank you for the detailed description of what the regexp package in
Tcl does also. It is very informative. In particular, the
description of how boundaries are set by using an NFA was relevant.
One of the problems Michela and I worked on was finding the boundaries
of where a match was found. (While the problem set of interest does
not admit subpattern capture, the start and end of the actual match
are relevant for the downstream uses.) That turns out to be a more
difficult problem for DFAs than NFAs, because any final state in a DFA
may represent a set of different patterns with different start
positions. She worked out several interesting solutions to that
problem--several of which I'm certain she will publish, so I won't
give spoilers for.


Thanks again,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.