Subject: Extendible hashing
Date: Fri, 13 Feb 87 19:53:35 EST
From: harvard!cmcl2!bullwinkle!cornell!dill (Jens M. Dill)

Gus,

I have a small collection of papers on extendible hashing, but I agree
with John that their focus is too much on disk-resident tables.  Instead
of pointing you to the papers, let me give you the general outline of the
in-core extendible hashing method I plan to use.  It is not very much
more complicated than hashing with a fixed table size.

(1) Choose your hash function.  It can be any function you want,
    subject to the following constraints:

    a. It must return a non-negative integer chosen from a range large
       enough to index a maximal-size table.  In most languages, a long
       integer is sufficient, since it usually has at least as many bits
       as a memory address.

    b. The results of the hash function applied to a reasonable sample
       of key values should be evenly distributed over the entire range.
       In this algorithm, unlike others, the distribution of the high-
       order bits is more critical than that of the low-order bits.
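To make constraint (b) concrete, here is one possibility sketched in C.
The string-folding loop and the multiplier are my choices, not part of
the method; the point is that a multiplicative ("Fibonacci") hash mixes
the key into the high-order bits, which is exactly what this algorithm
needs.

```c
#include <stdint.h>

/* A hash function meeting constraints (a) and (b): fold the key into an
 * unsigned 32-bit integer, then scramble it with a multiplicative hash.
 * The multiplier is Knuth's suggestion (roughly 2^32 / phi); because it
 * is odd, the final map is one-to-one, and the high-order bits of the
 * result are well distributed even for clustered keys. */
uint32_t hash_key(const char *key)
{
    uint32_t h = 0;
    while (*key)                    /* fold the string into 32 bits */
        h = h * 31u + (uint32_t)(unsigned char)*key++;
    return h * 2654435769u;         /* scramble so the HIGH bits vary */
}
```

Any other function with well-distributed high-order bits would serve
just as well.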

(2) Choose your collision-handling mechanism.  Any reasonable method will
    work.  The important thing is that you understand it well enough to
    know how to detect the point at which performance degrades so far that
    you'd be better off expanding the table.  You also should be able to
    find in a reasonable time, for a given hash cell, all the entries that
    hash to that cell.

(3) Choose your page size.  This is the unit by which the table is
    extended or shrunk.  It should probably be a nice round (binary)
    number of cells, like 256.  Call this value PSIZE.

(4) Choose your page indexing method.  You probably don't want to require
    that successive pages be stored contiguously, but you do want them to
    be retrievable in constant time.  Something like an array of pointers
    to pages seems to be the best compromise.
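In C, the pages and directory of steps (3) and (4) might look like the
following.  The MAXPAGES cap, the chained-entry cells, and the names are
assumptions of this sketch, not part of the method itself.

```c
#include <stdlib.h>

#define PSIZE    256   /* cells per page, per step (3) */
#define MAXPAGES 1024  /* directory capacity: table tops out at PSIZE*MAXPAGES cells */

struct entry;          /* whatever step (2)'s collision scheme stores */
struct page { struct entry *cell[PSIZE]; };

struct table {
    struct page  *dir[MAXPAGES]; /* array of page pointers: constant-time access */
    unsigned long tsize;         /* TSIZE: current cell count, 0 when empty */
    int           bits;          /* BITS: significant bits in TSIZE */
};

/* Fetch the page holding table index ix, allocating it on first use.
 * (Allocation-failure handling is omitted in this sketch.) */
struct page *get_page(struct table *t, unsigned long ix)
{
    unsigned long pno = ix / PSIZE;
    if (t->dir[pno] == NULL)
        t->dir[pno] = calloc(1, sizeof(struct page));
    return t->dir[pno];
}
```

Because pages are reached through the directory, they need not be
contiguous, yet any cell is still two array lookups away.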

(5) Initialize your table to empty by setting the current table size
    (TSIZE) to zero, and setting things up so that no pages are currently
    allocated.  Your insertion algorithm should recognize this condition
    as equivalent to "table full enough for expansion."  You also need to
    keep track of the number of significant bits in the table size.  Call
    this value BITS.  It is easy to keep track of incrementally.

(6) To look up an entry, compute the hash function of your key and call
    the result HASH.  Extract the BITS most significant bits of HASH.
    If the result is greater than or equal to TSIZE, divide it by two
    to drop the least significant bit.  This is your hash table index.
    Split it into a page number and an offset, and look for your entry.
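The lookup arithmetic of step (6) might look like this in C, assuming a
32-bit hash value (HASHBITS and the function names are inventions of
this sketch).  It assumes TSIZE > 0; step (5) treats the empty table as
a separate case.

```c
#include <stdint.h>

#define HASHBITS 32    /* width of the hash value; an assumption of this sketch */
#define PSIZE    256   /* cells per page, per step (3) */

/* Step (6): map a hash value to a table index.  Requires 0 < bits <= HASHBITS. */
unsigned long table_index(uint32_t hash, unsigned long tsize, int bits)
{
    unsigned long ix = hash >> (HASHBITS - bits);  /* BITS high-order bits */
    if (ix >= tsize)   /* would fall past the last cell... */
        ix /= 2;       /* ...so drop the least significant bit */
    return ix;
}

/* Split the table index into a page number and an offset within it. */
unsigned long page_of(unsigned long ix)   { return ix / PSIZE; }
unsigned long offset_of(unsigned long ix) { return ix % PSIZE; }
```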

(7) To extend the table, allocate the next sequential page.  Add PSIZE
    to TSIZE and adjust BITS if necessary.  Let NEWPAGE be the number of
    your new page (we assume the first page is numbered 0).  Let OLDPAGE
    be NEWPAGE divided by 2.  Examine every entry that hashes to a cell
    on OLDPAGE.  For each such entry, look at the bit we would have
    discarded in the lookup algorithm before the table was extended.
    This is the BITS'th high-order bit of the hash value.  If it is
    zero, leave the entry on OLDPAGE.  Otherwise, move it to NEWPAGE.
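The deciding bit in step (7) is cheap to extract.  A C sketch, again
assuming a 32-bit hash value (HASHBITS and the function name are mine):

```c
#include <stdint.h>

#define HASHBITS 32  /* width of the hash value; an assumption of this sketch */

/* Step (7): after the table has grown and BITS has been adjusted, test
 * the bit the pre-extension lookup would have discarded: the BITS'th
 * high-order bit of the hash, i.e. the lowest of its top BITS bits.
 * Returns 1 if the entry moves to NEWPAGE, 0 if it stays on OLDPAGE. */
int moves_to_new_page(uint32_t hash, int bits)
{
    return (int)((hash >> (HASHBITS - bits)) & 1u);
}
```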

(8) To shrink the table, reverse the process.  Let OLDPAGE be the current
    highest-numbered page, and NEWPAGE be OLDPAGE divided by 2.  Merge all
    entries from OLDPAGE into NEWPAGE by simply dividing their table
    indices by 2, and resolving any collisions that result.  Then reduce
    TSIZE by PSIZE and de-allocate the newly-emptied page.

Best of luck with your implementation.

-- Jens Dill
   dill@svax.cs.cornell.edu

[This is similar to the scheme that the DBM routines use to manage disk
files, except that DBM splits pages as they fill up and keeps a bit map
of them rather than splitting the page in the middle as this scheme
does. -john]
