Re: Inverse grep

torbenm@diku.dk (Torben Ęgidius Mogensen)
Tue, 14 Jun 2011 11:16:52 +0200

          From comp.compilers

Related articles
Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-08)
Re: Inverse grep cfc@shell01.TheWorld.com (Chris F Clark) (2011-06-12)
Re: Inverse grep dot@dotat.at (Tony Finch) (2011-06-13)
Re: Inverse grep torbenm@diku.dk (2011-06-14)
Re: Inverse grep anton@mips.complang.tuwien.ac.at (2011-06-15)
Matching very large patterns, was Re: Inverse grep cfc@shell01.TheWorld.com (Chris F Clark) (2011-06-19)
Re: Matching very large patterns, was Re: Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-20)
Re: Inverse grep junyer@gmail.com (Paul Wankadia) (2012-05-28)
| List of all articles for this month |

From: torbenm@diku.dk (Torben Ęgidius Mogensen)
Newsgroups: comp.compilers
Date: Tue, 14 Jun 2011 11:16:52 +0200
Organization: SunSITE.dk - Supporting Open source
References: 11-06-015
Keywords: lex, performance, comment
Posted-Date: 14 Jun 2011 20:15:10 EDT

glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:


> I suppose this is a strange question, but I was wondering if
> there was ever something like an inverse grep. That is,
> match a string against a file full of regular expressions.
>
> Now, one could just read the file, compile the regex one at
> a time, and do the match, but maybe there is another way.
>
> [If you want to know which pattern it was, there's flex which turns all
> the patterns into one DFA with tags to know which one it was, or else
> there's the perl "study" operator which pre-scans a string to make its
> NFA matcher faster on subsequent runs against the same string. -John]


The best method depends on how many times you match the same set of
regexps against different strings. If you use the same set with many
strings, it can pay to preprocess the regexps to, e.g., produce
automata (similar to what flex does), but if you match only a few
times, it is probably better to keep the regexps unchanged and use an
algorithm that matches a string to a regexp without first converting
the regexp to a DFA. Converting each regexp to an NFA is fairly fast,
so this is a possibility, but if only a small part of each NFA is ever
used, it might be better to use the regexps directly using regular
expression derivatives. Matching using regular expression derivatives
is generally a bit slower than matching with NFAs, but you don't have
to convert the regexps first. So if you don't expect to match the set
of regexps to more than a handful of strings, this might be a better
approach.


As Chris said, the DFA for a union of a large number of regexps can
explode, so it is a risky business to use this approach, even if you
are going to match the regexps against many strings. You can convert
the combined regexps to an NFA and do local reductions to keep the
size down, but the matching time is still nearly the same as matching
the NFAs for the individual regexps in sequence. A compromise could
be to combine the NFAs from one end and stop when the resulting DFA
becomes too big and then start combining the remaining NFAs from
there. In the worst case, you don't gain anything over keeping the
NFAs separate, though.


Torben
[How serious is the explosion problem in practice, now that compilers
no longer have to fit into 64k? We use lex scanners with a thousand
tokens, and spamassassin feeds about 2000 patterns to re2c if you use
the compiled patterns feature. None of them seem to strain our storage
systems. -John]


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.