Re: Generating hash tables from known input (I think)

leichter@zodiac.rutgers.edu
Tue, 11 May 1993 21:08:14 GMT

          From comp.compilers

Related articles
Generating hash tables from known input (I think) rv@erix.ericsson.se (1993-05-10)
Re: Generating hash tables from known input (I think) leichter@zodiac.rutgers.edu (1993-05-11)
Re: Generating hash tables from known input (I think) rv@erix.ericsson.se (1993-05-24)

Newsgroups: comp.compilers
From: leichter@zodiac.rutgers.edu
Keywords: theory, performance
Organization: Rutgers University Department of Computer Science
References: 93-05-049
Date: Tue, 11 May 1993 21:08:14 GMT

Robert Virding asked about constructing hash tables for pattern matching.
The moderator notes:
> Usually one can get adequate performance with a sufficiently large hash
> table (several times more slots than entries) and any old hash function that
> avoids degenerate cases. What's "adequate" depends on the rest of the
> application; adequate usually means that it doesn't contribute enough to
> the total runtime to be worth reworking.


Actually, "several times more slots than entries" is way too generous.
The theory of hash tables is well understood. The expected number of
probes to find an element that is in the table is a function only of the
fill factor, that is, the fraction of slots in use. I don't recall the
exact values, but I think the expected number of probes is a little over 2
for a fill factor of 80%, around 3 for a fill factor of 90%, then takes
off rapidly. (For an element that is NOT in the table, you have to search
until you find an empty slot, which takes a bit longer - but not much.)
However, for any fixed fill factor, the expected number of probes is a
constant.
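
These numbers are easy to check empirically. Here is a minimal C sketch
(my own illustration - the table size, mixing constants, and key set
are arbitrary choices, not anything from the discussion) that fills a
double-hashed table to a given load and reports the average probes per
successful search:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 65536              /* power of two */
    #define EMPTY 0                 /* keys start at 1, so 0 is free */

    static unsigned mix(unsigned x) /* simple integer hash (illustrative) */
    {
        x ^= x >> 16; x *= 0x45d9f3bu;
        x ^= x >> 16; x *= 0x45d9f3bu;
        x ^= x >> 16;
        return x;
    }

    static unsigned table[SIZE];

    /* Insert with double hashing; the step is forced odd so it is
       coprime to the power-of-two table size. */
    static void insert(unsigned key)
    {
        unsigned i = mix(key) & (SIZE - 1);
        unsigned step = (mix(key ^ 0x9e3779b9u) | 1u) & (SIZE - 1);
        while (table[i] != EMPTY)
            i = (i + step) & (SIZE - 1);
        table[i] = key;
    }

    /* Count probes to find a key that is known to be present. */
    static int probes(unsigned key)
    {
        unsigned i = mix(key) & (SIZE - 1);
        unsigned step = (mix(key ^ 0x9e3779b9u) | 1u) & (SIZE - 1);
        int n = 1;
        while (table[i] != key) {
            i = (i + step) & (SIZE - 1);
            n++;
        }
        return n;
    }

    int main(void)
    {
        double fills[] = { 0.5, 0.8, 0.9 };
        for (int f = 0; f < 3; f++) {
            int n = (int)(SIZE * fills[f]);
            memset(table, 0, sizeof table);
            for (int k = 1; k <= n; k++)
                insert((unsigned)k);
            long total = 0;
            for (int k = 1; k <= n; k++)
                total += probes((unsigned)k);
            printf("fill %.0f%%: %.2f probes/search\n",
                   fills[f] * 100, (double)total / n);
        }
        return 0;
    }

The averages it prints should come out close to the 1.4, 2.0, and 2.6
the formula predicts for 50%, 80%, and 90% fill.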


Tables from which you never delete anything are particularly easy to deal
with. Tables that you fill ahead of time are even better. For example,
even if the fill factor is fairly high, you can remember the maximum
number of probes needed to INSERT any element, then never do more than
that many probes on a search; a sketch follows below. (However, the cost
of checking against the limit is often comparable to the cost of another
probe, so it's often better to just use a bigger table.) Knuth and his
student Ole Amble developed an interesting variation, the ordered hash
table (Amble & Knuth, "Ordered Hash Tables", The Computer Journal, 1974),
which lets you determine more quickly that an element is NOT in the
table, even when the fill factor is high. Again, this can be very good
for the kind of off-line constructed table being described here.
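
The max-probe trick looks like this in C (linear probing for brevity;
all the names here are illustrative). The key point is that any element
in the table was inserted in at most max_probes probes along the same
probe sequence the search follows, so the search can safely give up
after that many:

    #include <string.h>

    #define SIZE 1024
    static const char *slot_key[SIZE];
    static int slot_val[SIZE];
    static int max_probes = 0;  /* longest insertion probe sequence */

    static unsigned hash(const char *s)
    {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h & (SIZE - 1);
    }

    static void insert(const char *key, int val)
    {
        unsigned i = hash(key);
        int n = 1;
        while (slot_key[i] != NULL) {   /* table assumed not full */
            i = (i + 1) & (SIZE - 1);
            n++;
        }
        slot_key[i] = key;
        slot_val[i] = val;
        if (n > max_probes)
            max_probes = n;
    }

    /* Returns the value, or -1 if absent: never probes more than the
       recorded insertion maximum, even with no empty slots nearby. */
    static int lookup(const char *key)
    {
        unsigned i = hash(key);
        for (int n = 0; n < max_probes; n++) {
            if (slot_key[i] == NULL)
                return -1;              /* hit a hole: definitely absent */
            if (strcmp(slot_key[i], key) == 0)
                return slot_val[i];
            i = (i + 1) & (SIZE - 1);
        }
        return -1;                      /* exceeded the insertion maximum */
    }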


Because the number of probes depends only on the fill factor, but there is
always some overhead per hash table, it's usually best to build one big
hash table rather than a number of small ones. Simply build the
identification of the "subtable" being searched into the key. For the
particular case at hand, it's likely that the same pattern occurs in
multiple places, and the compiler could probably arrange to map every
occurrence to the same entry. This could save space at little or no
real cost.
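
Folding the subtable identifier into the hash takes only a line or two.
A sketch (the mixing function is an arbitrary choice of mine, not
anything from the post):

    #include <stdint.h>

    static uint32_t mix32(uint32_t x)
    {
        x ^= x >> 16; x *= 0x7feb352du;
        x ^= x >> 15; x *= 0x846ca68bu;
        x ^= x >> 16;
        return x;
    }

    /* One hash over (subtable, key): identical pairs always land in
       the same slot, so duplicate patterns can share one entry. */
    static uint32_t slot(uint32_t subtable, uint32_t key,
                         uint32_t table_size)
    {
        return mix32(subtable * 0x9e3779b9u ^ key) % table_size;
    }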


A word of warning: It's easy to spend more time computing the hash
function than scanning the hash table! Look closely at what you are
hashing and try to come up with a hash that works well for your particular
input distribution WITHOUT spending a lot of time. Usually, you do NOT
have to make the entire key contribute to the hash to get good results.
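
For instance, a deliberately cheap string hash might mix only the
length and a few sampled bytes (purely illustrative - whether this
distributes well depends entirely on your inputs):

    #include <string.h>

    static unsigned cheap_hash(const char *s, unsigned table_mask)
    {
        size_t len = strlen(s);
        unsigned h = (unsigned)len;
        if (len > 0) {
            h = h * 31 + (unsigned char)s[0];        /* first byte  */
            h = h * 31 + (unsigned char)s[len - 1];  /* last byte   */
            h = h * 31 + (unsigned char)s[len / 2];  /* middle byte */
        }
        return h & table_mask;   /* table size is a power of two */
    }
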
-- Jerry