Fast NFA engine anyone? firstname.lastname@example.org (Remo D.) (2006-04-21)
Re: Fast NFA engine anyone? email@example.com (Russ Cox) (2006-04-22)
Re: Fast NFA engine anyone? firstname.lastname@example.org (Remo D.) (2006-04-22)
Re: Fast NFA engine anyone? cfc@shell01.TheWorld.com (Chris F Clark) (2006-04-23)
Re: Fast NFA engine anyone? email@example.com (2006-04-23)
Re: Fast NFA engine anyone? Danny.Dube@ift.ulaval.ca (2006-05-11)
Re: Fast NFA engine anyone? firstname.lastname@example.org (Remo D.) (2006-05-12)

From: "Remo D." <email@example.com>
Date: 22 Apr 2006 09:59:21 -0400
Remo D. wrote:
> I have the impression that going with a DFA representation is not a
> good idea. I feel that the time I will have to spend extracting the
> substring from the matching text will wipe out any benefit I will get (I
> might be wrong, of course).
> [I think you're wrong about DFAs. If your patterns are really all anchored,
> the match is the string from the beginning of the line to the point where
> the match succeeded. -John]
Well, consider that I'm interested in subpatterns and back references as
well (sorry, I didn't mention that in my post).
For example, things like ([_^]*)[A-G]([',]*), where I want to get the
possibly empty sequences matched at the beginning and at the end. Or
things like (["'])[A-Z]+\1, where I want uppercase strings enclosed in
single or double quotes.
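Just to make the two examples concrete, here is how a backtracking engine with submatch support handles them. (Python's `re` module is my own stand-in here, not the engine I'm writing; the test strings are invented.)

```python
import re

# Subpatterns: capture the possibly-empty prefix and suffix
# around the [A-G] letter in one pass.
m = re.match(r"([_^]*)[A-G]([',]*)", "__G''")
# m.group(1) == "__"   (the [_^]* prefix)
# m.group(2) == "''"   (the [',]* suffix)

# Back reference: \1 must repeat whatever quote group 1 captured,
# so mixed quotes like '"ABC'' are rejected.
m2 = re.match(r"([\"'])[A-Z]+\1", "'ABC'")
# m2.group(0) == "'ABC'"
```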
My understanding (again, I could be wrong) is that DFAs, which map sets
of NFA states into a single state, won't allow me to do that: I would
have to get the matching portion and scan it again to extract the parts
I want. Also, back references are out of the question with a DFA.
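A toy sketch of that limitation (my own illustration, not taken from any of the tools below): a DFA for ([_^]*)[A-G]([',]*) collapses the NFA positions into two states, so simulating it yields only the extent of the match, never the group boundaries.

```python
# Hand-built DFA for ([_^]*)[A-G]([',]*): two states suffice,
# because the group markers of the NFA are collapsed away.

def dfa_match(s):
    """Length of the longest anchored match, or -1 if none."""
    state = 0           # 0: reading the [_^]* prefix, 1: past [A-G]
    last_accept = -1
    for i, c in enumerate(s):
        if state == 0:
            if c in "_^":
                continue                 # stay in the prefix state
            elif c in "ABCDEFG":
                state = 1
                last_accept = i + 1      # shortest accepting point
            else:
                break
        else:
            if c in "',":
                last_accept = i + 1      # extend through the suffix
            else:
                break
    return last_accept

# dfa_match("__G'',x") == 6: the match is s[:6] == "__G''," -- but
# the DFA cannot say where group 1 ends and group 2 begins, so the
# matched prefix must be scanned a second time to recover that.
```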
I've done a bit of research on CiteSeer, and from the reports I've been
able to read I got the impression that NFAs were the way to go.
I think Laurikari's master's thesis, "Efficient submatch addressing for
regular expressions" (http://citeseer.ist.psu.edu/480392.html), gives a
good summary of this point of view. He's also the author of the TRE
matching library.
The problem I have with many of the reports I found is that they focus
on matching a single regular expression, assuming that, given the
regexps a and b, I can always combine them to match a|b. Actually, I
have to behave differently depending on whether I matched a or b, and I
don't want to re-scan the match to see which one it was.
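One way to keep the alternatives distinguishable in a combined pattern is to tag each branch. A sketch with Python's `re` (again only a stand-in; the branch names 'note' and 'quoted' are my own invention):

```python
import re

# Combine the two earlier example patterns into one alternation,
# naming each branch so one match() call reports which alternative
# fired -- no second scan of the input.
combined = re.compile(
    r"(?P<note>([_^]*)[A-G]([',]*))"   # groups 2, 3: prefix, suffix
    r"|(?P<quoted>([\"'])[A-Z]+\5)"    # group 5: the opening quote
)

def classify(s):
    """Return the name of the branch that matched, or None."""
    m = combined.match(s)
    if m is None:
        return None
    return "note" if m.group("note") is not None else "quoted"
```

This mirrors what a lexer generator does internally when it tags each rule's accepting states, except that here the submatches come along for free.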
Other reports spend a great deal of effort on optimizations similar to
the Boyer-Moore algorithm, which are not relevant for me since my
patterns are anchored!
All the lexer tools I've examined (flex, re2c, pccts, ...) follow the
DFA approach, trying to match all the regexps in parallel (which seems a
good thing to me!), but how can I get subpatterns and backrefs?
That's why, in my confusion, I was looking for some insight from people
on the comp.compilers newsgroup: maybe I missed how DFAs could give me
submatches, or maybe there are other approaches I could follow (I
briefly looked at SNOBOL patterns but I'm not quite convinced they are
better than regexps).
[Oh, you want backrefs. Yes, NFAs are the way to go. -John]