Extendible hashing

Date: Fri, 13 Feb 87 19:53:35 EST
From: harvard!cmcl2!bullwinkle!cornell!dill (Jens M. Dill)

Gus,


I have a small collection of papers on extendible hashing, but I agree
with John that their focus is too much on disk-resident tables. Instead
of pointing you to the papers, let me give you the general outline of the
in-core extendible hashing method I plan to use. It is not very much
more complicated than hashing with a fixed table size.


(1) Choose your hash function. It can be any function you want,
          subject to the following constraints:


          a. It must return a non-negative integer that is chosen from a
                  range that is large enough to index a maximal-size table.
                  In most languages, a long integer is sufficient, since it
                  usually has at least as many bits as a memory address.


          b. The results of the hash function applied to a reasonable sample
                  of key values should be evenly distributed over the entire range.
                  In this algorithm, unlike others, the distribution of the high-
                  order bits is more critical than that of the low-order bits.
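
For concreteness, a hash function satisfying (a) and (b) might look like the
following sketch in C.  The string keys, the 32-bit unsigned long, and the
multiplier are assumptions of mine, not requirements of the method:

    /* One possible hash function: multiplicative hashing.  The final
       multiplication by a constant derived from the golden ratio spreads
       entropy into the high-order bits, which is what this algorithm
       cares about.  (Assumes a 32-bit unsigned long and string keys.) */
    unsigned long hash_key(const char *key)
    {
        unsigned long h = 0;
        while (*key != '\0')
            h = h * 31 + (unsigned char) *key++;
        return h * 2654435769UL;        /* about 2^32 / golden ratio, odd */
    }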


(2) Choose your collision-handling mechanism. Any reasonable method will
          work. The important thing is that you understand it well enough to
          know how to detect the point at which performance degrades so far that
          you'd be better off expanding the table. You also should be able to
          find in a reasonable time, for a given hash cell, all the entries that
          hash to that cell.
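
As an illustration, chaining with a simple load-factor trigger meets both
requirements; the entry record and the threshold of two entries per cell
below are placeholders of mine:

    /* Separate chaining: each table cell heads a linked list of entries.
       Finding every entry that hashes to a cell is a walk down its chain,
       and expansion is triggered once the average chain length passes a
       threshold. */
    struct entry {
        char         *key;
        void         *value;
        struct entry *next;     /* next entry hashing to the same cell */
    };

    #define MAXLOAD 2           /* average entries per cell before expanding */

    int table_too_full(unsigned long nentries, unsigned long tsize)
    {
        /* An empty table (TSIZE == 0) also counts as "full enough". */
        return tsize == 0 || nentries >= MAXLOAD * tsize;
    }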


(3) Choose your page size. This is the unit by which the table is
          extended or shrunk. It should probably be a nice round (binary)
          number of cells, like 256. Call this value PSIZE.


(4) Choose your page indexing method. You probably don't want to require
          that successive pages be stored contiguously, but you do want them to
          be retrievable in constant time. Something like an array of pointers
          to pages seems to be the best compromise.
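
Concretely, choices (3) and (4) might come out as the declarations below,
using the entry record sketched above; PSIZE and the directory capacity are
placeholder values:

    #define PSIZE    256        /* cells per page (step 3) */
    #define MAXPAGES 4096       /* capacity of the page directory */

    /* Page directory (step 4): pages[i] points to an array of PSIZE chain
       heads.  Pages need not be contiguous in memory, but any page is
       reachable in constant time through this array. */
    static struct entry **pages[MAXPAGES];
    static unsigned long  tsize;        /* current number of cells (TSIZE) */
    static int            bits;         /* significant bits in TSIZE (BITS) */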


(5) Initialize your table to empty by setting the current table size
          (TSIZE) to zero, and setting things up so that no pages are currently
          allocated. Your insertion algorithm should recognize this
          condition as equivalent to "table full enough for expansion." You
          also need to keep track of the number of significant bits in the
          table size. Call this value BITS. It is easy to keep track of
          incrementally.
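
In terms of the declarations sketched above, initialization is just:

    /* Step 5: an empty table has no pages; TSIZE == 0 doubles as the
       "full enough for expansion" signal on the first insertion. */
    void init_table(void)
    {
        int i;

        tsize = 0;
        bits  = 0;
        for (i = 0; i < MAXPAGES; i++)
            pages[i] = NULL;
    }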


(6) To look up an entry, compute the hash function of your key and call
          the result HASH. Extract the BITS most significant bits of HASH.
          If it is greater than or equal to TSIZE, divide it by two to drop the least
          significant bit. This is your hash table index. Split it into
          a page number and offset, and look for your entry.
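
A sketch of the index computation and lookup, again assuming a 32-bit hash
and building on the declarations above (cell_of and lookup are my names):

    #include <string.h>         /* strcmp */

    /* Step 6: map a hash value to a table index. */
    unsigned long cell_of(unsigned long hash)
    {
        unsigned long ix = hash >> (32 - bits);   /* BITS most significant bits */
        if (ix >= tsize)                          /* out of range: drop low bit */
            ix >>= 1;
        return ix;
    }

    struct entry *lookup(const char *key)
    {
        unsigned long ix;
        struct entry *e;

        if (tsize == 0)                           /* empty table */
            return NULL;
        ix = cell_of(hash_key(key));
        for (e = pages[ix / PSIZE][ix % PSIZE]; e != NULL; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e;
        return NULL;
    }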


(7) To extend the table, allocate the next sequential page. Add PSIZE
          to TSIZE and adjust BITS if necessary. Let NEWPAGE be the number
          of your new page (we assume the first page is numbered 0). Let
          OLDPAGE be NEWPAGE divided by 2. Examine every entry that hashes
          to a cell on OLDPAGE. For those entries, look at the bit we would
          have discarded in the lookup algorithm before the table was extended.
          This is the BITS-th most significant bit of the hash value. If it is
          zero, leave the entry on OLDPAGE. Otherwise, move it to NEWPAGE.
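
One way this might look in C, building on the sketches above.  Rather than
testing the discarded bit directly, this version simply recomputes each
OLDPAGE entry's index under the enlarged table and moves any entry whose
cell has changed (bounds checks and allocation failures are omitted):

    #include <stdlib.h>         /* calloc, free */

    /* Step 7: grow the table by one page and redistribute OLDPAGE. */
    void extend_table(void)
    {
        unsigned long newpage = tsize / PSIZE;    /* number of the added page */
        unsigned long oldpage = newpage / 2;
        unsigned long off;

        pages[newpage] = calloc(PSIZE, sizeof(struct entry *));
        tsize += PSIZE;
        while ((1UL << bits) <= tsize)            /* adjust BITS if necessary */
            bits++;

        if (newpage == 0)                         /* very first page: nothing to move */
            return;

        for (off = 0; off < PSIZE; off++) {
            unsigned long cell = oldpage * PSIZE + off;
            struct entry **prev = &pages[oldpage][off];
            struct entry *e;

            while ((e = *prev) != NULL) {
                unsigned long ix = cell_of(hash_key(e->key));
                if (ix != cell) {                 /* belongs elsewhere now,  */
                    *prev = e->next;              /* normally on NEWPAGE:    */
                    e->next = pages[ix / PSIZE][ix % PSIZE];
                    pages[ix / PSIZE][ix % PSIZE] = e;
                } else {
                    prev = &e->next;
                }
            }
        }
    }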


(8) To shrink the table, reverse the process. Let OLDPAGE be the current
          highest-numbered page, and NEWPAGE be OLDPAGE divided by 2. Merge all
          entries from OLDPAGE into NEWPAGE by simply dividing their table
          indices by 2, and resolving any collisions that result. Then reduce
          TSIZE by PSIZE, adjust BITS if necessary, and de-allocate the
          newly-emptied page.
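
And the corresponding shrink, as a sketch that reuses the names above and
simply refuses to shrink away the last remaining page:

    /* Step 8: drop the highest-numbered page, merging its entries back
       into the page numbered OLDPAGE/2 by halving their indices. */
    void shrink_table(void)
    {
        unsigned long oldpage, newpage, off;

        if (tsize <= PSIZE)                       /* keep at least one page */
            return;
        oldpage = tsize / PSIZE - 1;              /* highest-numbered page  */
        newpage = oldpage / 2;

        for (off = 0; off < PSIZE; off++) {
            struct entry *e = pages[oldpage][off];

            while (e != NULL) {
                struct entry *next = e->next;
                unsigned long ix = (oldpage * PSIZE + off) / 2;   /* index / 2 */

                e->next = pages[newpage][ix % PSIZE];   /* collisions simply  */
                pages[newpage][ix % PSIZE] = e;         /* lengthen the chain */
                e = next;
            }
        }

        free(pages[oldpage]);
        pages[oldpage] = NULL;
        tsize -= PSIZE;

        bits = 0;                                 /* recompute BITS for the  */
        while ((1UL << bits) <= tsize)            /* smaller TSIZE           */
            bits++;
    }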


Best of luck with your implementation.
                                                                                -- Jens Dill
                                                                                      dill@svax.cs.cornell.edu
[This is similar to the scheme that the DBM routines use to manage disk files,
except that DBM splits pages as they fill up and keeps a bit map of them
rather than splitting the page in the middle as this scheme does. - john]
--

