Extendible hashing
Date: Fri, 13 Feb 87 19:53:35 EST
From: harvard!cmcl2!bullwinkle!cornell!dill (Jens M. Dill)
Gus,
I have a small collection of papers on extendible hashing, but I agree
with John that their focus is too much on disk-resident tables. Instead
of pointing you to the papers, let me give you the general outline of the
in-core extendible hashing method I plan to use. It is not very much
more complicated than hashing with a fixed table size.
(1) Choose your hash function. It can be any function you want,
subject to the following constraints:
a. It must return a non-negative integer that is chosen from a
range that is large enough to index a maximal-size table.
In most languages, a long integer is sufficient, since it
usually has at least as many bits as a memory address.
b. The results of the hash function applied to a reasonable sample
of key values should be evenly distributed over the entire range.
In this algorithm, unlike others, the distribution of the high-
order bits is more critical than that of the low-order bits.
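
For concreteness, here is one possible hash function in C, assuming
string keys and an unsigned long of at least 32 bits; the final
multiplier is the familiar multiplicative-hashing constant, an
illustrative choice rather than anything this outline prescribes:

    /* Hash a NUL-terminated string key into an unsigned long.
     * The final multiply is the classic multiplicative-hashing step:
     * it mixes the key bits toward the HIGH end of the word, which is
     * the end this scheme reads (constraint 1b above). */
    static unsigned long hash_key(const char *key)
    {
        unsigned long h = 0;

        while (*key != '\0')
            h = h * 31UL + (unsigned char)*key++;
        return h * 2654435761UL;
    }
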
(2) Choose your collision-handling mechanism. Any reasonable method will
work. The important thing is that you understand it well enough to
know how to detect the point at which performance degrades so far that
you'd be better off expanding the table. You should also be able to
find, in reasonable time, all the entries that hash to a given cell.
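
A sketch of what step (2) might look like with chained cells, in C.
The entry layout and the expand-when-average-chain-exceeds-one rule
are illustrative choices, not part of the outline:

    /* Each cell heads a singly linked chain of entries.  Finding all
     * the entries that hash to a given cell (needed when a page is
     * split or merged) is then just a walk down that chain. */
    struct entry {
        struct entry *next;     /* next entry in this cell's chain */
        unsigned long hash;     /* cached full hash of the key */
        char *key;
        void *value;
    };

    /* One simple "full enough to expand" test: grow once the table
     * holds at least as many entries as it has cells.  An empty table
     * (zero cells) always wants to expand, as step (5) requires. */
    static int should_expand(unsigned long nentries, unsigned long tsize)
    {
        return tsize == 0 || nentries >= tsize;
    }
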
(3) Choose your page size. This is the unit by which the table is
extended or shrunk. It should probably be a nice round (binary)
number of cells, like 256. Call this value PSIZE.
(4) Choose your page indexing method. You probably don't want to require
that successive pages be stored contiguously, but you do want them to
be retrievable in constant time. Something like an array of pointers
to pages seems to be the best compromise.
(5) Initialize your table to empty by setting the current table size
(TSIZE) to zero, and setting things up so that no pages are currently
allocated. Your insertion algorithm should recognize this
condition as equivalent to "table full enough for expansion." You
also need to keep track of the number of significant bits in the
table size. Call this value BITS. It is easy to keep track of
incrementally.
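
Steps (3) through (5) might come together like this in C.  PSIZE,
MAXPAGES, and the fixed array of page pointers are illustrative
choices (the maximal-size table is MAXPAGES * PSIZE cells):

    #define PSIZE    256            /* cells per page, a power of two    */
    #define MAXPAGES 4096           /* cap on pages for an in-core table */

    struct entry;                    /* the chained entry sketched above */

    struct table {
        struct entry **page[MAXPAGES];  /* page[i] -> PSIZE chain heads    */
        unsigned long  tsize;           /* TSIZE: current number of cells  */
        int            bits;            /* BITS: significant bits in TSIZE */
        unsigned long  nentries;        /* occupancy, drives expansion     */
    };

    /* An empty table: no pages allocated, TSIZE = 0, BITS = 0.  The
     * insertion code treats tsize == 0 as "full enough for expansion". */
    static void table_init(struct table *t)
    {
        unsigned long i;

        t->tsize = 0;
        t->bits = 0;
        t->nentries = 0;
        for (i = 0; i < MAXPAGES; i++)
            t->page[i] = NULL;
    }
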
(6) To look up an entry, compute the hash function of your key and call
the result HASH. Extract the BITS most significant bits of HASH.
If it is greater than or equal to TSIZE, divide it by two to drop the
least significant bit. This is your hash table index. Split it into
a page number (index divided by PSIZE) and an offset within the page
(index modulo PSIZE), and look for your entry.
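
The index computation of step (6), transcribed into C on top of the
fragments above.  ADDRBITS is the width of the hash value; the caller
is assumed to have checked that the table is not empty:

    #include <limits.h>

    #define ADDRBITS (sizeof(unsigned long) * CHAR_BIT)

    /* Top BITS bits of the hash; if that names a cell beyond the end
     * of the table, drop the least significant bit.  Callers must not
     * pass an empty table (tsize == 0 means bits == 0). */
    static unsigned long cell_index(const struct table *t, unsigned long hash)
    {
        unsigned long idx = hash >> (ADDRBITS - t->bits);

        if (idx >= t->tsize)
            idx >>= 1;
        return idx;
    }

    /* Split the index into a page number and an offset, giving the
     * chain head to search for the key. */
    static struct entry **cell_of(struct table *t, unsigned long hash)
    {
        unsigned long idx = cell_index(t, hash);

        return &t->page[idx / PSIZE][idx % PSIZE];
    }
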
(7) To extend the table, allocate the next sequential page. Add PSIZE
to TSIZE and adjust BITS if necessary. Let NEWPAGE be the number
of your new page (we assume the first page is numbered 0), and let
OLDPAGE be NEWPAGE divided by 2. Re-examine every entry that hashes
to a cell on OLDPAGE: recompute its table index as in step (6), using
the new TSIZE and BITS, and move it to whatever cell that index names.
The entries that end up on NEWPAGE are exactly those whose index the
old lookup had to halve but the new lookup does not; the rest stay
where they were.
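
A sketch of the expansion pass of step (7), continuing the fragments
above.  Each entry on OLDPAGE is re-homed to whatever cell the new
index computation names, which in the usual case is either its old
cell or a cell on NEWPAGE:

    #include <stdlib.h>

    static int table_extend(struct table *t)
    {
        unsigned long newpage = t->tsize / PSIZE;   /* first page is 0 */
        unsigned long oldpage = newpage / 2;
        unsigned long c;

        if (newpage >= MAXPAGES)
            return -1;                              /* table is maximal */
        t->page[newpage] = calloc(PSIZE, sizeof(struct entry *));
        if (t->page[newpage] == NULL)
            return -1;

        t->tsize += PSIZE;
        while (t->tsize >> t->bits)     /* keep BITS = significant bits */
            t->bits++;                  /* of TSIZE, incrementally      */

        /* Re-home the entries of OLDPAGE under the new TSIZE and BITS. */
        for (c = 0; c < PSIZE; c++) {
            struct entry *e = t->page[oldpage][c];

            t->page[oldpage][c] = NULL;             /* detach the chain */
            while (e != NULL) {
                struct entry *next = e->next;
                unsigned long idx = cell_index(t, e->hash);

                e->next = t->page[idx / PSIZE][idx % PSIZE];
                t->page[idx / PSIZE][idx % PSIZE] = e;
                e = next;
            }
        }
        return 0;
    }
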
(8) To shrink the table, reverse the process. Let OLDPAGE be the current
highest-numbered page, and NEWPAGE be OLDPAGE divided by 2. Merge all
entries from OLDPAGE into NEWPAGE by simply dividing their table
indices by 2, resolving any collisions that result. Then reduce
TSIZE by PSIZE, adjust BITS if necessary, and de-allocate the
newly-emptied page.
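
And the matching contraction pass of step (8), again continuing the
fragments above: the highest-numbered page is merged down into the
page with half its number by halving each entry's index, then freed.
As a simplification, this sketch never releases the last page:

    static void table_shrink(struct table *t)
    {
        unsigned long highpage, c;

        if (t->tsize <= PSIZE)
            return;                     /* keep at least one page here */
        highpage = t->tsize / PSIZE - 1;

        t->tsize -= PSIZE;
        while (t->bits > 0 && (t->tsize >> (t->bits - 1)) == 0)
            t->bits--;                  /* shrink BITS along with TSIZE */

        /* Merge: halving an index on HIGHPAGE lands it on HIGHPAGE / 2;
         * prepending to the destination chain resolves collisions. */
        for (c = 0; c < PSIZE; c++) {
            struct entry *e = t->page[highpage][c];

            while (e != NULL) {
                struct entry *next = e->next;
                unsigned long idx = (highpage * PSIZE + c) >> 1;

                e->next = t->page[idx / PSIZE][idx % PSIZE];
                t->page[idx / PSIZE][idx % PSIZE] = e;
                e = next;
            }
        }
        free(t->page[highpage]);
        t->page[highpage] = NULL;
    }
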
Best of luck with your implementation.
-- Jens Dill
dill@svax.cs.cornell.edu
[This is similar to the scheme that the DBM routines use to manage disk files,
except that DBM splits pages as they fill up and keeps a bit map of them
rather than splitting the page in the middle as this scheme does. - john]