Re: C Hashmap implementation

George Neuner <gneuner2@comcast.net>
26 Apr 2007 09:39:24 -0400

          From comp.compilers

Related articles
C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-23)
Re: C Hashmap implementation vmakarov@redhat.com (Vladimir Makarov) (2007-04-25)
Re: C Hashmap implementation cr88192@hotmail.com (cr88192) (2007-04-26)
Re: C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-26)
Re: C Hashmap implementation cdiggins@gmail.com (Christopher Diggins) (2007-04-26)
Re: C Hashmap implementation gneuner2@comcast.net (George Neuner) (2007-04-26)
Re: C Hashmap implementation gene.ressler@gmail.com (Gene) (2007-04-26)
Re: C Hashmap implementation roessmann@gmx.net (Bernhard Roessmann) (2007-04-26)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
| List of all articles for this month |

From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: 26 Apr 2007 09:39:24 -0400
Organization: Compilers Central
References: 07-04-089
Keywords: symbols
Posted-Date: 26 Apr 2007 09:39:24 EDT

On 23 Apr 2007 07:52:27 -0400, bison <Sean.Gillespie@bisonofborg.com>
wrote:


>I'm looking into hashmap implementations for a VM based
>dynamically typed language. Since the implementation must allow me to
>resize on demand, I am a bit confused about the various approaches.


The best approach depends on the expected key distribution and number
of collisions. If relatively few collisions are expected, it really
doesn't matter which method is used. If lots of collisions are
expected, then which to use depends on the cost of rehashing or
probing vs. manipulating the extensible bucket data structure.




>Almost everyone I've talked to has said that Chained Hashmaps are much
>easier to implement than Open Addressed maps. Wikipedia suggests that
>an approach to resizing hashmaps is to allocate space for a newer
>hashmap and copy elements from to the new table, and in some cases do
>it incrementally.


Resizing the key space means changing the hash function - or at least
the function's parameters - and rehashing all the entries.


Coming up with good hash functions is a black art. Generally an open
map needs at least two different hash functions - one to start and a
fall back if the first probe fails to find a slot. You also need a
strategy if the second (or Nth) hash fails. Each differently sized
map requires a different set of hash functions to maintain the desired
key distribution.


Chained maps are simpler because the key space doesn't have to be
resized as often (or at all) as the map grows. That means you may
only need one decent hash function. If lots of collisions are
expected and searches dominate insertions/deletions, you can get
better results by making each bucket a suitable tree structure rather
than a list.




>How much space should a hashmap allocate initially, and when
>it's full...by how much would I typically increase it?


In general you should not use powers of 2 for the size of the key
space. Some authors have suggested using prime numbers - at the very
least the size should be odd. When expanding the map some suggest the
new size should be relatively prime to the old size.


For open maps I've seen some studies that indicate performance can
start to degrade at 30% of capacity (depends on key distribution) and
that resizing before 70% of capacity is a must. Since you want the
new fill ratio back below 30%, you need to triple or quadruple the
size, keeping in mind oddness/relative primeness and changing hash
functions.


I haven't seen similar studies on chained implementations because the
key space is so rarely resized. Effort is generally put into making
the single hash function as good as possible and then making the
bucket search efficient.


Hope this helps.
George



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.