27 Apr 2007 11:29:14 -0400

Related articles |
---|

[3 earlier articles] |

Re: C Hashmap implementation Sean.Gillespie@bisonofborg.com (bison) (2007-04-26) |

Re: C Hashmap implementation cdiggins@gmail.com (Christopher Diggins) (2007-04-26) |

Re: C Hashmap implementation gneuner2@comcast.net (George Neuner) (2007-04-26) |

Re: C Hashmap implementation gene.ressler@gmail.com (Gene) (2007-04-26) |

Re: C Hashmap implementation roessmann@gmx.net (Bernhard Roessmann) (2007-04-26) |

Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27) |

Re: C Hashmap implementation DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-27) |

Re: C Hashmap implementation pww19710430@gmail.com (2015-04-01) |

From: | Hans-Peter Diettrich <DrDiettrich1@aol.com> |

Newsgroups: | comp.compilers |

Date: | 27 Apr 2007 11:29:14 -0400 |

Organization: | Compilers Central |

References: | 07-04-089 07-04-115 |

Keywords: | C, symbols |

Posted-Date: | 27 Apr 2007 11:29:14 EDT |

George Neuner wrote:

*>>I'm looking into hashmap implementations for a VM based*

*>>dynamically typed language. Since the implementation must allow me to*

*>>resize on demand, I am a bit confused about the various approaches.*

*>*

*>*

*> The best approach depends on the expected key distribution and number*

*> of collisions. If relatively few collisions are expected, it really*

*> doesn't matter which method is used. If lots of collisions are*

*> expected, then which to use depends on the cost of rehashing or*

*> probing vs. manipulating the extensible bucket data structure.*

When nested scopes come into play, collisions cannot be avoided at

all, in a single hashtable. OTOH the use of a single hash table may be

more efficient than using a stack of tables. That's why in one of my

experiments I combined a symbol stack with a hash table, where the

symbol entries contain the links to other symbols, of the same hash

code, deeper down in the stack. The whole symbol table was implemented

as a single array, whose size will stabilize very soon. The array then

serves as the hash table, by taking the hash code modulo the array

size, to retrieve the index of the first matching symbol. When a scope

is added or removed from the stack, the entries at the high end of the

array are reused (freed for or occupied by new symbols).

*> Chained maps are simpler because the key space doesn't have to be*

*> resized as often (or at all) as the map grows. That means you may*

*> only need one decent hash function.*

sic!

*>>How much space should a hashmap allocate initially, and when*

*>>it's full...by how much would I typically increase it?*

*>*

*>*

*> In general you should not use powers of 2 for the size of the key*

*> space. Some authors have suggested using prime numbers - at the very*

*> least the size should be odd. When expanding the map some suggest the*

*> new size should be relatively prime to the old size.*

When the hash table grows together with the number of entries, as in my

approach, every entry has an immutable hash code. The lookup requires a

modulo operation, which will be a little bit faster for array sizes in

powers of 2, but I don't think that this possible optimization has an

big overall impact. More important is the number of grows of the

stack/array, which require to rebuild the whole hash table (the links to

the entry points of the chained lists, which are not further affected by

the reorganization).

*> I haven't seen similar studies on chained implementations because the*

*> key space is so rarely resized. Effort is generally put into making*

*> the single hash function as good as possible and then making the*

*> bucket search efficient.*

I also didn't thoroughly research my implementation, it immediately

worked quite fine for my purpose, a C preprocessor and parser.

After following the link to VList, I think that I had just the same

ideas with my approach :-)

DoDi

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.