Re: How to do this odd kind of regex match?

"Michael Parker" <michaelparker@earthlink.net>
21 Jul 2002 01:51:34 -0400

          From comp.compilers

Related articles
How to do this odd kind of regex match? dot@dotat.at (Tony Finch) (2002-07-15)
Re: How to do this odd kind of regex match? michaelparker@earthlink.net (Michael Parker) (2002-07-21)
Re: How to do this odd kind of regex match? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Re: How to do this odd kind of regex match? Martin.Ward@durham.ac.uk (Martin Ward) (2002-07-21)
Re: How to do this odd kind of regex match? simon.cozens@computing-services.oxford.ac.uk (Simon Cozens) (2002-07-24)
| List of all articles for this month |
From: "Michael Parker" <michaelparker@earthlink.net>
Newsgroups: comp.compilers
Date: 21 Jul 2002 01:51:34 -0400
Organization: EarthLink Inc. -- http://www.EarthLink.net
References: 02-07-041
Keywords: lex
Posted-Date: 21 Jul 2002 01:51:34 EDT

Tony Finch wrote:
> Can anyone point me in the direction of literature on matching regexes
> with capturing brackets (as in sed) efficiently, i.e. without using a
> backtracking NFA. Is it possible to do with a DFA?


The usual way to do this is to scan with a DFA until you find a match,
then run the NFA over that to snag the submatches. There are also
some very quick NFA's out there that substantially minimise
backtracking and other forms of book-keeping. I wrote one in Common
Lisp that is an order of magnitude or more faster than the GNU regex
engine, and presumably a C implentation would be several times faster
still.


> I'd also like to be able to match several regexes against the same
> text in parallel, preferably without running N regex engines in
> parallel. This is different from the standard tokenization problem
> because the several matches may overlap rather than occurring one
> after the other. Would it be useful to think of this as
> (re1)|(re2)|(re3) with capturing brackets, all of which capture if
> possible?


yes. A DFA isn't really sensitive to regex complexity the way a NFA
is -- it's one state transition per input character, no matter what.
The table sizes can get pretty big, though.


> (The aim is to speed up heuristic spam detection such as SpamAssassin.)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.