Matching very large patterns, was Re: Inverse grep

Chris F Clark <cfc@shell01.TheWorld.com>
Sun, 19 Jun 2011 21:45:49 -0400

          From comp.compilers

Related articles
Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-08)
Re: Inverse grep torbenm@diku.dk (2011-06-14)
Matching very large patterns, was Re: Inverse grep cfc@shell01.TheWorld.com (Chris F Clark) (2011-06-19)
Re: Matching very large patterns, was Re: Inverse grep gah@ugcs.caltech.edu (glen herrmannsfeldt) (2011-06-20)

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Sun, 19 Jun 2011 21:45:49 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 11-06-015 11-06-022
Keywords: lex
Posted-Date: 19 Jun 2011 22:20:17 EDT

Our moderator asked:
> How serious is the explosion problem in practice, now that compilers
> no longer have to fit into 64k? We use lex scanners with a thousand
> tokens, and spamassassin feeds about 2000 patterns to re2c if you use
> the compiled patterns feature. None of them seem to strain our storage
> systems. -John


For compilers on mainstream chips this is generally not a problem.
However, this is such a useful tool that it gets used in areas where
the problems are much, much larger (or the device is much smaller).
The Aho-Corasick algorithm is one of the key algorithms for matching a
set of strings against a large stream of incoming data. The two places
where I've seen it used and where size is a problem are networking and
computational biology.
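
For anyone who hasn't seen it, here is a minimal sketch of the
Aho-Corasick idea in Python (the pattern set and input below are made
up for illustration): build a trie over the patterns, fill in failure
links with a breadth-first pass, then scan the input once, following
failure links on mismatches.

    # Minimal Aho-Corasick sketch: trie + failure links + one-pass scan.
    from collections import deque

    def build_automaton(patterns):
        goto = [{}]          # goto[s] maps a character to the next state
        out  = [set()]       # out[s] = the patterns that end at state s
        fail = [0]           # fail[s] = state for the longest proper suffix
        for p in patterns:
            s = 0
            for c in p:
                if c not in goto[s]:
                    goto.append({}); out.append(set()); fail.append(0)
                    goto[s][c] = len(goto) - 1
                s = goto[s][c]
            out[s].add(p)
        # Breadth-first pass to fill in the failure links.
        q = deque(goto[0].values())
        while q:
            s = q.popleft()
            for c, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and c not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(c, 0)
                out[t] |= out[fail[t]]
        return goto, fail, out

    def search(text, goto, fail, out):
        s = 0
        for i, c in enumerate(text):
            while s and c not in goto[s]:
                s = fail[s]
            s = goto[s].get(c, 0)
            for p in out[s]:
                print("matched %r ending at index %d" % (p, i))

    # Toy pattern set and input, made up for illustration:
    tables = build_automaton(["he", "she", "his", "hers"])
    search("ushers", *tables)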


I know less about the computational biology use cases, so I'll get
that one out of the way first. As I understand it, the basic idea with
DNA and protein databases is that you have long lists of strings that
represent known genomes or protein foldings, and you match your
unknown DNA or protein against that list and see what matches you get.
The number of known genomes and protein foldings is very large, on the
order of millions. Generating the DFA on the fly might well work in
that application, as Anton suggests, since one can generally apply a
mainstream processor to the problem and one isn't resource
constrained.
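
To make "generating the DFA on the fly" concrete, here is a rough
Python sketch of lazy subset construction for a set of literal
patterns (just an illustration with made-up inputs, not how any
particular bioinformatics tool actually works):

    # DFA states are frozensets of (pattern, position) pairs.  Each
    # transition is computed and cached only the first time the input
    # needs it, so the full table is never materialized up front.
    def lazy_scan(patterns, text):
        start = frozenset((p, 0) for p in patterns)
        cache = {}                       # (dfa_state, char) -> dfa_state
        state = start
        for i, c in enumerate(text):
            key = (state, c)
            if key not in cache:         # build this transition lazily
                nxt = set(start)         # the implicit ".*" restart
                for p, j in state:
                    if j < len(p) and p[j] == c:
                        nxt.add((p, j + 1))
                cache[key] = frozenset(nxt)
            state = cache[key]
            for p, j in state:
                if j == len(p):          # position past the end = a match
                    print("matched %r ending at index %d" % (p, i))

    # Made-up sequences, purely for illustration:
    lazy_scan(["ACGT", "GATTACA"], "TTGATTACAACGTT")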


On the networking side, there are numerous applications of this kind
of matching. It's a case of: when one has a hammer, every problem
looks like a nail. The Unix world loves writing tools that match
regexes, sets of strings, or some nearly equivalent notation
(e.g. ACLs, globs, URL fragments) against the traffic.


The first case is anti-virus (AV). The number of virus signatures is
again in the millions. That is not enough to make the database "too
big" in the sense that it won't fit in memory, but it does make it too
big to fit in cache. So, although we can construct a DFA that fits in
memory to represent the problem, the resulting DFA doesn't fit in
cache and ends up thrashing it. This adversely affects processor
performance, as modern processors run many times faster than main
memory and depend on caching to keep that balance. These AV patterns
are effective cache-walkers: whatever size of cache the processor is
provisioned with gets exceeded.
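
A back-of-the-envelope calculation shows why (the numbers below are
illustrative assumptions, not measurements of any particular product):

    # Size of an uncompressed DFA transition table, roughly.
    states      = 2 * 10**6        # assumed number of DFA states
    alphabet    = 256              # one entry per input byte value
    entry_bytes = 4                # 32-bit next-state index
    table_bytes = states * alphabet * entry_bytes
    print(table_bytes / 2.0**30, "GiB")   # ~1.9 GiB vs. caches of a few MiB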


In addition, the database is growing at a phenomenal rate. New
viruses are added at the rate of several per hour, and that rate is
increasing. (I believe I've seen that a new virus signature is added
every 9 minutes on average.) In fact, the rate of new virus discovery
is so high that it isn't done by humans any more; machine learning
over viruses caught in honeypots is becoming the only way to keep up
with the rate of growth. The fact that attackers now use generators
to create a new, unique polymorphic virus for each attack makes this
unsurprising: automation is countered by automation.


To lessen the dependence on AV, black-listing and white-listing of web
sites have become popular. However, the number of URLs is still large
enough to present a similar performance issue. The problem in this
case is exacerbated by the fact that the list of URLs is not a static
database kept on the client but is instead fetched from a web server.
In this case, building the DFA on the fly might be practical, because
the cost is dominated by downloading the URLs in the first place.


Intrusion detection and prevention systems (IDS/IPS) are a smaller but
similar problem, with a database on the order of 100k patterns, many
of which require back-references and other non-DFA functionality.
Although cache thrashing is still somewhat of a problem if we solve
the problem on the CPU, the bigger issue is that one would like to
move this work out of the central processor to the network edge
(i.e. the NIC), so that the traffic is never brought into the main
memory system at all, as even the cost of moving the traffic into
memory is a significant load.
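
As a toy example of that non-DFA functionality, consider a
back-reference like the one below (made up, not an actual IDS
signature). It matches any four-byte chunk that immediately repeats;
back-references in general take you outside regular languages, and
even this bounded case would need an astronomically large automaton,
so a plain DFA engine cannot handle it directly.

    import re

    # Any four bytes followed immediately by the same four bytes.
    repeated = re.compile(rb"(....)\1")
    print(bool(repeated.search(b"xxABCDABCDyy")))   # True
    print(bool(repeated.search(b"xxABCDEFGHyy")))   # False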


At the edge we face much harder constraints: 100 kB is still a lot of
memory to put into a peripheral device, yet typical DFA sizes are on
the order of megabytes. For example, when I first investigated Snort
a few years ago, the DFA compiled to 160 MB, but the target device we
wanted to build would have had a memory size of 8-32 kB. Even a pure
NFA representation of the Snort DB wouldn't fit in that target.


The current favorite problem that every network device is trying to
solve is application identification (often called Deep Packet
Inspection, or DPI). The easiest way to envision it is to presume you
have written a suite of compilers (C, C++, C#, Java, Pascal, PL/I,
Haskell, ML, F#, Lisp, Scheme, etc.) and, for any given program, you
run all the compilers in parallel; if one of them compiles the program
successfully, you say that is the language the program is written in.


Now, that method is untenable, as the sum of the grammars is simply
too large. Instead one has regular expressions that look for
characteristic distinctions between the protocols, just as virus
signatures look for specific patterns that are part of a virus, or IDS
rules look for typical attack vectors. The resulting set of regular
expressions is similar in size to the IDS problem, and again the
target is the network edge.
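
As a toy illustration of the DPI idea (the signatures below are
simplified examples I made up for this post, not a real DPI ruleset):

    import re

    # One "signature" regex per protocol, applied to the start of a flow.
    SIGNATURES = {
        "http": re.compile(rb"^(GET|POST|HEAD|PUT) [^ ]+ HTTP/1\.[01]"),
        "ssh":  re.compile(rb"^SSH-2\.0-"),
        "smtp": re.compile(rb"^220 [^\r\n]*SMTP", re.IGNORECASE),
    }

    def identify(payload):
        return [name for name, sig in SIGNATURES.items() if sig.search(payload)]

    print(identify(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"))
    print(identify(b"SSH-2.0-OpenSSH_5.8\r\n"))

A real engine compiles all the signatures into a single automaton
rather than trying them one at a time, and combining them is what
produces the large automata discussed above.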


As a result of these memory limitations, there is a fairly active
community trying to find new and novel ways of representing regular
expressions that have the best characteristics of NFAs (linear size
growth) and DFAs (linear processing time). Some clever ways of
avoiding DFA explosion have been found for important common cases.
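
The classic example of the explosion is a pattern like ".*a.{k}": the
NFA needs only about k+2 states, but the equivalent DFA has to
remember which of the last k+1 symbols were 'a', which takes roughly
2^(k+1) states. Here is a small subset-construction counter (over a
two-letter alphabet, purely for illustration) that shows the growth:

    def dfa_states(k):
        # NFA: state 0 loops on everything (the ".*"), 'a' also goes to
        # state 1, states 1..k advance on any symbol, state k+1 accepts.
        def step(subset, c):
            nxt = {0}
            if c == "a":
                nxt.add(1)
            for s in subset:
                if 1 <= s <= k:
                    nxt.add(s + 1)
            return frozenset(nxt)

        start = frozenset({0})
        seen, todo = {start}, [start]
        while todo:
            d = todo.pop()
            for c in "ab":
                n = step(d, c)
                if n not in seen:
                    seen.add(n)
                    todo.append(n)
        return len(seen)

    for k in range(11):
        print(k, dfa_states(k))     # 2, 4, 8, ... roughly 2**(k+1)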


If you find this area potentially interesting, I would recommend
looking up the papers by Michela Becchi and the work at IBM on
dot-star as two good places to start further investigation.


It's an interesting area to work in, because the push to get this into
smaller, cheaper, less power-hungry devices has only just begun.
Today we want these algorithms to run in routers (including SOHO
routers bought off the shelf). The desire to run them in smartphones
and tablets is appearing. You want your car's network to be secure,
so they will be needed in the SoCs for that. Beyond that, there are
micro-devices called "motes", basically a radio plus some sensors,
that also want to be secure. Eventually, it is possible that a
processor for this area could be embedded in your "smart" credit card.


Hope this helps,
-Chris


******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris

