|[10 earlier articles]|
|Re: Looking for symbol table hashing algorithm email@example.com (Dave Hanson) (1998-07-08)|
|Re: Looking for symbol table hashing algorithm firstname.lastname@example.org (1998-07-10)|
|Re: Looking for symbol table hashing algorithm email@example.com (1998-07-10)|
|Re: Looking for symbol table hashing algorithm firstname.lastname@example.org (1998-07-10)|
|Re: Looking for symbol table hashing algorithm email@example.com (David Chase) (1998-07-10)|
|Re: Looking for symbol table hashing algorithm firstname.lastname@example.org (Paul Dietz) (1998-07-11)|
|Re: Looking for symbol table hashing algorithm email@example.com (Scott David Daniels) (1998-07-13)|
|Re: Looking for symbol table hashing algorithm firstname.lastname@example.org (1998-07-13)|
|Re: Looking for symbol table hashing algorithm email@example.com (Peter Wilson) (1998-07-20)|
|From:||Scott David Daniels <firstname.lastname@example.org>|
|Date:||13 Jul 1998 23:46:30 -0400|
|Organization:||EarthLink Network, Inc.|
|References:||98-07-030 98-07-056 98-07-076|
Similarly, my personal experience gave me the following for symbols on
human-created source (an interpretter/debugger that was both taking
user input and re-scanning while executing). I went to a name table
and name-as-atom implementation, where the name table turned strings
to "atoms" when then were the things in symbol tables instead of
strings. It simplifies symbol table layout questions without imposing
either a large storage cost per symbol or an artificially low limit on
user symbol lengths.
1) There is tremendous locality. I used Tarjan's self-organizing
trees to get very good results; it beat an optimal binary tree
substantially. The technique I used to test was: capture tokens
scanned for a day, and run my algorithms against that. My final
choice was hash to buckets with the self-organizing trees inside. The
hash divided the work "evenly", and the trees did the rest.
2) Short strings dominate. Any hash that doesn't spread bits
quickly is in trouble. I was hashing into a 32-bit word (31 to make
the division to buckets stay positive), and any hash that bunched up
short strings was dead.
3) Even in case-sensitive languages case differences are not very
important. The best I got was rotate circular 5 bits in a 31-bit
rotator, XOR in the next next character, and add in some constant
across the word. The reason for choosing 5 bits is that there are
roughly 5 low-order bits of freely-varied info in an ASCII character
which is a part of an identifier. After that you bits that don't vary
much (case, group, and the "mysterious" bit 7).
4) Fast adequate hash is what you want, not great hashing for too
5) If you are going to spend more than an hour thinking about the
answer, test your results on _your_ problem. You could be surprised.
The random number table sounds like fun, but I wanted to avoid the
cost of an extra memory reference and its consequent cost in cache
space. I used the hash to divide my problem size, not to solve the
email@example.com (Eric O'Dell) wrote:
> >The algorithm Holub proposes in _Compiler Design in C_ is pretty
> >workable, and I can't offhand see any reason why you couldn't make a
> >C++ class out of it. It uses a hash table with individual table
> >entries linked via linked lists to keep track of scope.
Scott Amspoker wrote:
> I was going to suggest that one too. It's simple to implement and
> seems to work pretty well even for moderately large symbol tables
> (adjust the hash table size as necessary). I first used it in a
> compiler thinking that I could always go back and replace it with
> something better if I wasn't happy with it. I never felt the need to
> replace it.
Return to the
Search the comp.compilers archives again.