Techrepts & C tookit: Performance of (single,multi) keyword pattern matching algorithms

watson@win.tue.nl (Bruce W. Watson)Fri, 10 Jun 1994 23:51:55 GMT

From comp.compilers

Related articles
Techrepts & C tookit: Performance of (single,multi) keyword pattern ma watson@win.tue.nl (1994-06-10)
| List of all articles for this month |

 Newsgroups: comp.compilers From: watson@win.tue.nl (Bruce W. Watson) Keywords: report Organization: Eindhoven University of Technology, the Netherlands Date: Fri, 10 Jun 1994 23:51:55 GMT

Hi All,

I would like to announce a technical report describing a keyword pattern
matching toolkit (written in Standard C), and the performance of keyword
pattern matching algorithms in practice.
The full citations of the reports are:
The performance of single-keyword and multiple-keyword pattern
matching algorithms''
Computing Science Note 94/19, Eindhoven University of Technology, The
Netherlands.

The algorithms in the toolkit (for the benchmarking) are directly
implemented from the report:
A taxonomy of keyword pattern matching algorithms'' by B.W. Watson and
G. Zwaan, Computing Science Note 92/27, Eindhoven University of
Technology, The Netherlands, 1992.

The abstract of the performance report'' is:
\begin{abstract}
\noindent
This paper presents a toolkit of pattern matching algorithms, performance
data on the algorithms, and recommendations for the selection of an
algorithm (given a particular application). The pattern matching problem
is: given a finite non-empty set of keywords and an input string, find all
occurrences of any of the keywords in the input string.

The pattern matching toolkit (written in the C programming language, and
freely available) contains implementations of the Knuth-Morris-Pratt,
Boyer-Moore, Aho-Corasick, and Commentz-Walter algorithms. The algorithms
are implemented directly from the abstract algorithms derived and
presented in the taxonomy of Watson and Zwaan \cite{WatsonZwaan}. The
toolkit provides one of the few known correct implementations of the
Commentz-Walter precomputation algorithm.

The performance of all of the algorithms (running on a variety of
workstation hardware) was measured on two types of input: English text and
genetic sequences. The input data, which is the same as that used in the
benchmarks of Hume and Sunday \cite{HumeSunday}, were chosen to be
representative of two of the typical uses of pattern matching algorithms.
The differences between natural language text and genetic sequences serve
to highlight the strengths and weaknesses of each of the algorithms. Until
now, the performance of the multiple-keyword algorithms (Aho-Corasick and
Commentz-Walter) had not been extensively measured.

The Knuth-Morris-Pratt and Aho-Corasick algorithms performed linearly and
consistently (on widely varying keyword sets), as their theoretical
running time predicts. The Commentz-Walter algorithm (and its variants)
displayed more interesting behaviour, greatly out-performing even the best
Aho-Corasick variant on a large portion of the input data. The
recommendations section of this paper details the conditions under which a
particular algorithm should be chosen.
\end{abstract}

The report can be obtained by anonymous ftp from:
ftp.win.tue.nl (also known as 131.155.70.100)
in directory:
pub/techreports/pi/pattm/bench
Files WARRANTY (says that there is NO warranty)
pattperf...
pattmkit.tar.gz (contains the C code)

(Files with 600dpi as a substring in their name are for use with
high-resolution printers.)

Paper copies of the reports can also be obtained by emailing me:
watson@win.tue.nl
or paper mailing me:
Bruce W. Watson
HG 7.39
Faculty of Computing Science
Eindhoven University of Technology
P.O. Box 513
5600 MB Eindhoven
The Netherlands
--
Bruce Watson
watson@win.tue.nl
watson@stack.urc.tue.nl
--

Post a followup to this message