Re: Matching very large patterns, was Re: Inverse grep

glen herrmannsfeldt <gah@ugcs.caltech.edu>
Mon, 20 Jun 2011 04:52:28 +0000 (UTC)

          From comp.compilers

Related articles
Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-08)
Re: Inverse grep torbenm@diku.dk (2011-06-14)
Matching very large patterns, was Re: Inverse grep cfc@shell01.TheWorld.com (Chris F Clark) (2011-06-19)
Re: Matching very large patterns, was Re: Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-20)
| List of all articles for this month |

From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Newsgroups: comp.compilers
Date: Mon, 20 Jun 2011 04:52:28 +0000 (UTC)
Organization: A noiseless patient Spider
References: 11-06-015 11-06-022 11-06-033
Keywords: lex, storage
Posted-Date: 22 Jun 2011 00:39:05 EDT

Chris F Clark <cfc@shell01.theworld.com> wrote:
> Our moderator asked:
>> How serious is the explosion problem in practice, now that compilers
>> no longer have to fit into 64k? We use lex scanners with a thousand
>> tokens, and spamassassin feeds about 2000 patterns to re2c if you use
>> the compiled patterns feature. None of them seem to strain our storage
>> systems. -John


> For compilers on mainline chips this is generally not a problem.
> However, this is such a useful tool, it gets used in areas where the
> problems are much-much larger (or the device is smaller). The
> Aho-Corasick algorithm is one of the key algorithms in matching a set
> of strings to a large set of incoming data. The two places where I've
> seen it used and where size is a problem is networking and
> computational biology.


The favorite for computational biology for many years was dynamic
programming, though often it is too slow. The dynamic programming
algorithm used for diff originated in the one Needleman-Wunsch used
for protein sequences. (I believe that it is referenced in the diff
documentation.)


More recently, the Burrows-Wheeler transform, commonly used in
compression, has become popular for computational biology.


BLAST is based on a DFA similar to the first part of Aho-Corasick, but
dynamic programming is used instead of the usual second part.


-- glen



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.