26 Apr 2007 09:39:44 -0400

Related articles |
---|

C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-23) |

Re: C Hashmap implementation vmakarov@redhat.com (Vladimir Makarov) (2007-04-25) |

Re: C Hashmap implementation cr88192@hotmail.com (cr88192) (2007-04-26) |

Re: C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-26) |

Re: C Hashmap implementation cdiggins@gmail.com (Christopher Diggins) (2007-04-26) |

Re: C Hashmap implementation gneuner2@comcast.net (George Neuner) (2007-04-26) |

Re: C Hashmap implementation gene.ressler@gmail.com (Gene) (2007-04-26) |

Re: C Hashmap implementation roessmann@gmx.net (Bernhard Roessmann) (2007-04-26) |

Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27) |

Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27) |

From: | Gene <gene.ressler@gmail.com> |

Newsgroups: | comp.compilers |

Date: | 26 Apr 2007 09:39:44 -0400 |

Organization: | Compilers Central |

References: | 07-04-089 |

Keywords: | symbols |

Posted-Date: | 26 Apr 2007 09:39:44 EDT |

On Apr 23, 7:52 am, bison <Sean.Gilles...@bisonofborg.com> wrote:

*> Hello. I'm looking into hashmap implementations for a VM based*

*> dynamically typed language. Since the implementation must allow me to*

*> resize on demand, I am a bit confused about the various approaches.*

*> Here is what I have found out so far:*

*>*

*> Almost everyone I've talked to has said that Chained Hashmaps are much*

*> easier to implement than Open Addressed maps.*

*>*

*> Wikipedia suggests that an approach to resizing hashmaps is to*

*> allocate space for a newer hashmap and copy elements from to the new*

*> table, and in some cases do it incrementally.*

*>*

*> Quick question about the last point: I'm curious about a starting*

*> point. How much space should a hashmap allocate initially, and when*

*> it's full...by how much would I typically increase it? I realize*

*> there are lots of different answers here, so a good starting point*

*> would really help out.*

You'll want to grow the table by a _factor_ greater than one so that

the amortized time spent on copying does not dominate asymptotic

performance. I have seen factors between 1.3 and 2 used in

practice.

Most hash functions work best with prime table sizes. The following

table supports growth factors of roughly 1.5 for 32-bit architectures.

unsigned primes[] = {

3, 7, 11, 19, 29, 43, 67, 97, 149, 211, 317, 479, 719, 1069, 1601,

2399,

3607, 5399, 8093, 12143, 18211, 27329, 40973, 61463, 92173,

138283,

207401, 311099, 466619, 699931, 1049891, 1574827, 2362229, 3543349,

5314961, 7972451, 11958671, 17937989, 26906981, 40360471, 60540749,

90811057,136216589, 204324851, 306487277, 459730919, 689596379,

1034394589, 1551591841, 2327387723u, 3491081599u, 4294967291u,

*};*

Initial size is totally application dependent. E.g. if you are likely

to have have a million tables with just a few items in them, then you

probably want to choose a very small initial value. If every table is

likely to be loaded with N >> 1 elements, starting with small tables

wastes O(N) time for copying. And, because you are allocating fresh

storage for each growth cycle, a non-compacted heap may quickly become

fragmented.

A rule of thumb is to pick a total memory size you're willing to waste

as empty table space. Call this W. Then decide the number of tables

your application is likely to have concurrently allocated. Call this

N. Then set starting size to max(1,floor(W/N)), or the nearest

prime. One possible choice for W is some modest fraction of the L2

cache, the rationale being that if you have a bunch of nearly empty

tables of size comparable to a cache line, and you hit them in random

order, you will render the cache useless. Another idea: pick W to be

a modest fraction of physical memory size. Ditto the rationale above

but for pages and swapping rather than cache lines and misses.

If map size is very unpredictable, consider self-balancing trees - red-

black or AVL. If you are using the maps to look up identifiers, you

should only look up each one once and store a Boehm number for future

refs. This renders the performance penalty of trees unimportant,

while the space saved just might recoup that penalty and more.

If you need to delete elements, then yes, chained hashes are a far

easier to implement. If you don't need deletion, open addressing with

a quadratic probe is quite simple, usually effective, and considerably

more economical. Linear probing, which makes deletion simpler, can

magnify any quirks in your hash function.

Cheers!

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.