Re: DotStar regex package

Chris F Clark <cfc@shell01.TheWorld.com>
Wed, 12 May 2010 13:58:22 -0400

          From comp.compilers

Related articles
DotStar regex package johnl@iecc.com (John R. Levine) (2010-05-11)
Re: DotStar regex package dot@dotat.at (Tony Finch) (2010-05-11)
Re: DotStar regex package cfc@shell01.TheWorld.com (Chris F Clark) (2010-05-11)
Re: DotStar regex package cfc@shell01.TheWorld.com (Chris F Clark) (2010-05-11)
Re: DotStar regex package kym@sdf.lonestar.org (russell kym horsell) (2010-05-12)
Re: DotStar regex package cfc@shell01.TheWorld.com (Chris F Clark) (2010-05-12)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Wed, 12 May 2010 13:58:22 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 10-05-054 10-05-065
Keywords: lex, performance
Posted-Date: 12 May 2010 17:14:01 EDT

Not to belabor this point, but I'd like to respond to some of the
speculation (some accurate, some not so accurate) in what


russell kym horsell <kym@sdf.lonestar.org> writes:


> This has been a moderately warm topic over the past 5 years as that
> "ubiquitous parallelism" has broken out from 4 cores on an x86 to
> 256 cores on a GPU. :)


This topic is driven by IDS/IPS/AV intrusion detection
systems/intrusion preventions systems/anti-virus) needs not the
presence of more cores, i.e.
        > [I got the impression that the main application was scanning
        > stuff for malware signatures. -John]


> A few people have published work that compiles "a" RE or a set of RE
> into machine code (e.g. MMX) rather than some HLL, but this seems to be
> among the early s/w for relatively large-scale SIMD and more specifically
> for large sets of RE.


Yes, in such systems, one has many REs one wants to process against
the input as a time, each RE representinag a unique [set of]
intrusion[s] or virus[es].


> I *think* any claim about limiting state explosion relates more to
> combinging tables for multiple RE rather than something related to build
> of a single DFA.


Sometimes the state explosion issue comes from multiple REs, but often
the explosion comes from characteristics of a single RE, which is
where the name DOTSTAR presumably comes from. A RE of the form
"abc.*defghi" suffers a state explosion when being translated from an
NFA to a DFA which is proportional in length to the part of the RE
after the .*, that is how many characters long "defghi" matches. Such
problems are common in the above systems, because they are using REs
to recognize complex text consisting of multiple tokens. The problem
is even more complex when one (as one often does in such schemes)
distance relations between the tokens, as in "abc.{10,50}defghi",
where one needs to take into account not only the length of the
"defghi" part but over what range of characters, the .{10,50} can
match, (i.e. you essentially multiply 6 by 50, and as is well known,
it is not difficult to generate cases where the number of states grows
exponentially with RE length).


Now, one can solve such problems with PCRE (perl compatible regular
expressions), but implementations like libpcre are more than an order
of magnitude too slow (usually more like a factor of 1k times as fast
is needed) to achieve reasonable processing rates.


The automata theoretic constructions of NFA->DFA conversion could
yield implementations that had the requisite performance if 1) they
didn't suffer state explosions that rendered many of the typical REs
thousands of times larger (such that the resulting compiled DB of REs
don't fit in a 32 bit address space (nor even close)) and 2) could
yield a cache friendly footprint (i.e. just fitting into available
memory isn't good enough, you need a much smaller working set also).


This is where the innovations come into play. They allow one to build
something that is mostly a DFA (or at least is a DFA when you care
about it), but which can do non-DFA things (and sometimes even non-NFA
things) to deal with the explosive cases. A relevant Google term is
hybrid FA, although even that term is not standardized and used
consistently.


Hope this helps,
-Chris


******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.