# hash specifics: GOOD NEWS

## Thilo Ernst <te@prosun.first.gmd.de>Mon, 17 Dec 90 15:43:56 +0100

From comp.compilers

Related articles
hash specifics: GOOD NEWS te@prosun.first.gmd.de (Thilo Ernst) (1990-12-17)
Re: hash specifics: GOOD NEWS jeffj@cs.umr.edu (1990-12-19)
| List of all articles for this month |

 Newsgroups: comp.compilers From: Thilo Ernst Keywords: design Organization: Compilers Central Date: Mon, 17 Dec 90 15:43:56 +0100

Hello hash-fellows,

In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri
presented the so-called mincycle algorithm for computing perfect (collision-
free) hash functions.
Although the problem is nasty (NP-complete) in general, the algorithm
was shown to work in polynomial time for virtually all cases.
Sager's hash functions have the general form

H(w) = ( h0(w) + g (h1 (w)) + g (h2 (w)) ) mod M

which doesn't seem very efficient on first glance, but have a closer look:

h0, h1, h2:
three functions which map the string into a triple of (bounded)
integers - usually ord (w[j]) mod k, with k some power of 2.
The only restriction on their simplicity is that there be no
fatal conflicts, i.e. two keywords with the same (h0, h1, h2)
triple.
M: equals N, the number of keywords to hash, or slightly greater
(may be forced to be a power of 2, too)
g: a stored mapping, i.e. an integer array.
This is the main output of the mincycle algorithm. Its size will
be around N/3.

As you see, all operations occuring in the definition can be implemented very
efficiently on machine level. h0, h1, h2 may be simple bitwise AND's on
single bytes of the string to hash; for small keyword sets, one of them
can be set to zero, that is, omitted in practice.

M is an output of Sager's algorithm, with the property M>=N. However (good
news continued), in all his (and our) tests, M was equal to N. That means,
there are no gaps in the hashing table; Sager's hash functions are *minimal*
(although he had no proof). BTW, Sager computed MPHF's for keyword sets up to
a size of 512.

I don't know if this is the best result about hash functions, but in my
opinion it is a significant breakthrough compared to the work listed in the
posting of J. R. Levine, which falls in two categories: bad or unknown
algorithm complexity (restricting the keyword set to small sizes) and bad or
inefficient resulting hash functions (non-perfect/non-minimal HF's,
large-number divisions required, etc.).

And now for the best: we have done it, Adriaan. We included an implementation
of the mincycle algorithm (with some improvements and optimizations) in our
compiler compiler to speed up the lexical analyzers generated and were very
satisfied with the results. Our largest example had 350 keywords and took
about ten minutes to compute a MPHF on a PC/AT.

The implementation was done in Modula-2, and I don't know if it would be of
use for you. I'd rather suggest you to e-mail me your keyword set(s) and to
look forward to my reply, including MPHF, in January.

(Sorry for errors/inexact information - unfortunately, I don't have Sager's
paper here; neither my implementation and the tech. rep. about it.
Also, sorry for abuse of the English language, if any.)

Regards, Thilo.

Thilo Ernst ( te@gmdtub.uucp, te@prosun.first.gmd.de )
Institute of Informatics and Computing Techniques Berlin
Rudower Chaussee 5, D-O-1199 Berlin, Germany
Phone: (+37-2-) 6 74 51 52, (+49-30-) 25 49 91 16
Fax: (+37-2-) 6 76 22 00
[Thanks for pointing this one out, I'd completely missed it. The reference is
Sager, Thomas J., "A Polynomial Time Generator for Minimal Perfect Hash
Functions," CACM 28:5, May 1985, pp. 523-532. The article includes a listing
of a Pascal version of his algorithm. -John]
--

Post a followup to this message