Re: How to do this odd kind of regex match?

"Joachim Durchholz" <joachim_d@gmx.de>
21 Jul 2002 02:05:19 -0400

          From comp.compilers

Related articles
How to do this odd kind of regex match? dot@dotat.at (Tony Finch) (2002-07-15)
Re: How to do this odd kind of regex match? michaelparker@earthlink.net (Michael Parker) (2002-07-21)
Re: How to do this odd kind of regex match? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Re: How to do this odd kind of regex match? Martin.Ward@durham.ac.uk (Martin Ward) (2002-07-21)
Re: How to do this odd kind of regex match? simon.cozens@computing-services.oxford.ac.uk (Simon Cozens) (2002-07-24)
| List of all articles for this month |

From: "Joachim Durchholz" <joachim_d@gmx.de>
Newsgroups: comp.compilers
Date: 21 Jul 2002 02:05:19 -0400
Organization: Compilers Central
References: 02-07-041
Keywords: lex
Posted-Date: 21 Jul 2002 02:05:19 EDT

Tony Finch wrote:
> Can anyone point me in the direction of literature on matching regexes
> with capturing brackets (as in sed) efficiently, i.e. without using a
> backtracking NFA. Is it possible to do with a DFA?


I'm not sure what you mean - if what you want to know which parts of the
regex matched what part of the input, Henry Spencer's regex engine will
do fine. AFAIK it's doing all the pattern matching with a DFA, so you
should be just fine.


> I'd also like to be able to match several regexes against the same
> text in parallel, preferably without running N regex engines in
> parallel. This is different from the standard tokenization problem
> because the several matches may overlap rather than occurring one
> after the other.


I think this isn't too difficult to do, but you'll probably have to roll
your own engine - I don't think the Spencer engine can to this.


I'd start constructing an NFA, with start and accept states for the
subexpressions that interest you.
The DFA might have to encode many substart and subaccept states in its
states, and you might end up assigning the same input text to many
different parts of the RE. For pathological cases, this might be worse
than exponential: think '(A|AA)*', applied to a string of a million As...
But these caveats don't touch the NFA->DFA transformation, so this
should work.


HTH
Joachim


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.