Re: C Hashmap implementation

Hans-Peter Diettrich <DrDiettrich1@aol.com>
27 Apr 2007 11:29:14 -0400

          From comp.compilers

Related articles
[3 earlier articles]
Re: C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-26)
Re: C Hashmap implementation cdiggins@gmail.com (Christopher Diggins) (2007-04-26)
Re: C Hashmap implementation gneuner2@comcast.net (George Neuner) (2007-04-26)
Re: C Hashmap implementation gene.ressler@gmail.com (Gene) (2007-04-26)
Re: C Hashmap implementation roessmann@gmx.net (Bernhard Roessmann) (2007-04-26)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
| List of all articles for this month |

From: Hans-Peter Diettrich <DrDiettrich1@aol.com>
Newsgroups: comp.compilers
Date: 27 Apr 2007 11:29:14 -0400
Organization: Compilers Central
References: 07-04-089 07-04-115
Keywords: C, symbols
Posted-Date: 27 Apr 2007 11:29:14 EDT

George Neuner wrote:


>>I'm looking into hashmap implementations for a VM based
>>dynamically typed language. Since the implementation must allow me to
>>resize on demand, I am a bit confused about the various approaches.
>
>
> The best approach depends on the expected key distribution and number
> of collisions. If relatively few collisions are expected, it really
> doesn't matter which method is used. If lots of collisions are
> expected, then which to use depends on the cost of rehashing or
> probing vs. manipulating the extensible bucket data structure.


When nested scopes come into play, collisions cannot be avoided at
all, in a single hashtable. OTOH the use of a single hash table may be
more efficient than using a stack of tables. That's why in one of my
experiments I combined a symbol stack with a hash table, where the
symbol entries contain the links to other symbols, of the same hash
code, deeper down in the stack. The whole symbol table was implemented
as a single array, whose size will stabilize very soon. The array then
serves as the hash table, by taking the hash code modulo the array
size, to retrieve the index of the first matching symbol. When a scope
is added or removed from the stack, the entries at the high end of the
array are reused (freed for or occupied by new symbols).




> Chained maps are simpler because the key space doesn't have to be
> resized as often (or at all) as the map grows. That means you may
> only need one decent hash function.


sic!




>>How much space should a hashmap allocate initially, and when
>>it's full...by how much would I typically increase it?
>
>
> In general you should not use powers of 2 for the size of the key
> space. Some authors have suggested using prime numbers - at the very
> least the size should be odd. When expanding the map some suggest the
> new size should be relatively prime to the old size.


When the hash table grows together with the number of entries, as in my
approach, every entry has an immutable hash code. The lookup requires a
modulo operation, which will be a little bit faster for array sizes in
powers of 2, but I don't think that this possible optimization has an
big overall impact. More important is the number of grows of the
stack/array, which require to rebuild the whole hash table (the links to
the entry points of the chained lists, which are not further affected by
the reorganization).




> I haven't seen similar studies on chained implementations because the
> key space is so rarely resized. Effort is generally put into making
> the single hash function as good as possible and then making the
> bucket search efficient.


I also didn't thoroughly research my implementation, it immediately
worked quite fine for my purpose, a C preprocessor and parser.




After following the link to VList, I think that I had just the same
ideas with my approach :-)


DoDi


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.