Re: C Hashmap implementation

Gene <gene.ressler@gmail.com>
26 Apr 2007 09:39:44 -0400

          From comp.compilers

Related articles
C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-23)
Re: C Hashmap implementation vmakarov@redhat.com (Vladimir Makarov) (2007-04-25)
Re: C Hashmap implementation cr88192@hotmail.com (cr88192) (2007-04-26)
Re: C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-26)
Re: C Hashmap implementation cdiggins@gmail.com (Christopher Diggins) (2007-04-26)
Re: C Hashmap implementation gneuner2@comcast.net (George Neuner) (2007-04-26)
Re: C Hashmap implementation gene.ressler@gmail.com (Gene) (2007-04-26)
Re: C Hashmap implementation roessmann@gmx.net (Bernhard Roessmann) (2007-04-26)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27)
| List of all articles for this month |

From: Gene <gene.ressler@gmail.com>
Newsgroups: comp.compilers
Date: 26 Apr 2007 09:39:44 -0400
Organization: Compilers Central
References: 07-04-089
Keywords: symbols
Posted-Date: 26 Apr 2007 09:39:44 EDT

On Apr 23, 7:52 am, bison <Sean.Gilles...@bisonofborg.com> wrote:
> Hello. I'm looking into hashmap implementations for a VM based
> dynamically typed language. Since the implementation must allow me to
> resize on demand, I am a bit confused about the various approaches.
> Here is what I have found out so far:
>
> Almost everyone I've talked to has said that Chained Hashmaps are much
> easier to implement than Open Addressed maps.
>
> Wikipedia suggests that an approach to resizing hashmaps is to
> allocate space for a newer hashmap and copy elements from to the new
> table, and in some cases do it incrementally.
>
> Quick question about the last point: I'm curious about a starting
> point. How much space should a hashmap allocate initially, and when
> it's full...by how much would I typically increase it? I realize
> there are lots of different answers here, so a good starting point
> would really help out.


You'll want to grow the table by a _factor_ greater than one so that
the amortized time spent on copying does not dominate asymptotic
performance. I have seen factors between 1.3 and 2 used in
practice.


Most hash functions work best with prime table sizes. The following
table supports growth factors of roughly 1.5 for 32-bit architectures.


unsigned primes[] = {
    3, 7, 11, 19, 29, 43, 67, 97, 149, 211, 317, 479, 719, 1069, 1601,
2399,
    3607, 5399, 8093, 12143, 18211, 27329, 40973, 61463, 92173,
138283,
    207401, 311099, 466619, 699931, 1049891, 1574827, 2362229, 3543349,
    5314961, 7972451, 11958671, 17937989, 26906981, 40360471, 60540749,
    90811057,136216589, 204324851, 306487277, 459730919, 689596379,
    1034394589, 1551591841, 2327387723u, 3491081599u, 4294967291u,
};


Initial size is totally application dependent. E.g. if you are likely
to have have a million tables with just a few items in them, then you
probably want to choose a very small initial value. If every table is
likely to be loaded with N >> 1 elements, starting with small tables
wastes O(N) time for copying. And, because you are allocating fresh
storage for each growth cycle, a non-compacted heap may quickly become
fragmented.


A rule of thumb is to pick a total memory size you're willing to waste
as empty table space. Call this W. Then decide the number of tables
your application is likely to have concurrently allocated. Call this
N. Then set starting size to max(1,floor(W/N)), or the nearest
prime. One possible choice for W is some modest fraction of the L2
cache, the rationale being that if you have a bunch of nearly empty
tables of size comparable to a cache line, and you hit them in random
order, you will render the cache useless. Another idea: pick W to be
a modest fraction of physical memory size. Ditto the rationale above
but for pages and swapping rather than cache lines and misses.


If map size is very unpredictable, consider self-balancing trees - red-
black or AVL. If you are using the maps to look up identifiers, you
should only look up each one once and store a Boehm number for future
refs. This renders the performance penalty of trees unimportant,
while the space saved just might recoup that penalty and more.


If you need to delete elements, then yes, chained hashes are a far
easier to implement. If you don't need deletion, open addressing with
a quadratic probe is quite simple, usually effective, and considerably
more economical. Linear probing, which makes deletion simpler, can
magnify any quirks in your hash function.


Cheers!



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.