Re: Looking for symbol table hashing algorithm

Scott David Daniels <scottdaniels@earthlink.net>
13 Jul 1998 23:46:30 -0400

          From comp.compilers


From: Scott David Daniels <scottdaniels@earthlink.net>
Newsgroups: comp.compilers
Date: 13 Jul 1998 23:46:30 -0400
Organization: EarthLink Network, Inc.
References: 98-07-030 98-07-056 98-07-076
Keywords: symbols

Similarly, my personal experience gave me the following for symbols in
human-created source (an interpreter/debugger that was both taking
user input and re-scanning while executing). I went to a name table
and name-as-atom implementation, where the name table turned strings
into "atoms", and the atoms, rather than the strings, were what the
symbol tables held. It simplifies symbol table layout questions
without imposing either a large storage cost per symbol or an
artificially low limit on user symbol lengths.
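The name-as-atom scheme can be sketched as below. This is my
illustration, not the original code; the table uses a linear scan
where a real implementation would hash:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a name table that turns strings into integer
 * "atoms".  Symbol tables then store atoms: comparing two symbols is
 * an integer compare, and symbol length costs nothing per
 * symbol-table entry. */
enum { MAX_ATOMS = 1024 };
static const char *atom_names[MAX_ATOMS];  /* atom -> interned string */
static int n_atoms = 0;

/* Return the atom for `name`, interning it on first sight.
 * A linear scan keeps the sketch short; a real table would hash. */
int intern(const char *name) {
    for (int i = 0; i < n_atoms; i++)
        if (strcmp(atom_names[i], name) == 0)
            return i;
    assert(n_atoms < MAX_ATOMS);
    char *copy = malloc(strlen(name) + 1);
    strcpy(copy, name);
    atom_names[n_atoms] = copy;
    return n_atoms++;
}
```

Equal spellings always get the same atom, so the rest of the
compiler never touches string contents again.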


        1) There is tremendous locality. I used Tarjan's self-organizing
trees (splay trees) and got very good results; they beat an optimal
binary tree substantially. The technique I used to test was: capture
the tokens scanned for a day, and run my algorithms against that. My
final choice was hashing to buckets with the self-organizing trees
inside. The hash divided the work "evenly", and the trees did the
rest.
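The hash-to-buckets-with-trees arrangement might look like the sketch
below. This is my own hedged illustration: I use single rotations that
move the accessed node all the way to the root, which is a simpler
self-organizing discipline than true splaying (splay trees use double
rotations), and a throwaway multiplicative hash for the bucket split.
Names are assumed to be stable strings (for example, interned atoms),
since the trees store pointers without copying:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { NBUCKETS = 64 };

struct node { const char *name; struct node *left, *right; };
static struct node *bucket[NBUCKETS];

/* Throwaway hash just to split names across buckets. */
static unsigned bucket_of(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find `name` in the tree, inserting it if absent.  On the way back
 * up the recursion, single rotations carry the found node to the
 * root, so recently used names stay near the top. */
static struct node *find_insert(struct node *t, const char *name,
                                struct node **found) {
    if (!t) {
        t = malloc(sizeof *t);
        t->name = name;
        t->left = t->right = NULL;
        *found = t;
        return t;
    }
    int c = strcmp(name, t->name);
    if (c == 0) { *found = t; return t; }
    if (c < 0) {
        t->left = find_insert(t->left, name, found);
        if (t->left == *found) {            /* rotate child up */
            struct node *x = t->left;
            t->left = x->right;
            x->right = t;
            return x;
        }
    } else {
        t->right = find_insert(t->right, name, found);
        if (t->right == *found) {           /* rotate child up */
            struct node *x = t->right;
            t->right = x->left;
            x->left = t;
            return x;
        }
    }
    return t;
}

/* Hash to a bucket, then search/insert in that bucket's tree. */
struct node *lookup(const char *name) {
    unsigned b = bucket_of(name);
    struct node *found;
    bucket[b] = find_insert(bucket[b], name, &found);
    return found;
}
```

Because every lookup promotes the found node to the root of its
bucket's tree, the hot names of a locality-heavy token stream stay
near the top.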


        2) Short strings dominate. Any hash that doesn't spread bits
quickly is in trouble. I was hashing into a 32-bit word (31 bits, to
keep the division into buckets positive), and any hash that bunched up
short strings was dead.
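As a hedged illustration of the bunching problem (my example, not
from the post): a hash that merely sums character codes cannot even
separate anagrams, so short identifiers pile into shared buckets:

```c
#include <assert.h>

/* A deliberately poor hash: summing character codes ignores
 * position, so any two permutations of the same letters collide. */
unsigned sum_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}
```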


        3) Even in case-sensitive languages, case differences are not very
important. The best I got was: rotate circularly by 5 bits in a 31-bit
rotator, XOR in the next character, and add in some constant across
the word. The reason for choosing 5 bits is that there are roughly 5
low-order bits of freely-varying information in an ASCII character
that is part of an identifier. After that, you get bits that don't
vary much (case, group, and the "mysterious" bit 7).
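The recipe above can be sketched directly; the particular added
constant is my own placeholder, since the post doesn't give one:

```c
#include <assert.h>

/* Sketch of the described hash: a 31-bit rotator, rotated left 5 bits
 * per character, XORed with the next character, with a constant added
 * in.  The constant 0x1003F is an assumption for illustration. */
unsigned hash31(const char *s) {
    unsigned h = 0;
    while (*s) {
        h = ((h << 5) | (h >> 26)) & 0x7FFFFFFF; /* rotate 5 in 31 bits */
        h ^= (unsigned char)*s++;
        h = (h + 0x1003F) & 0x7FFFFFFF;          /* assumed constant */
    }
    return h;
}
```

Unlike a character sum, this separates anagrams: the rotation makes
each character's contribution depend on its position in the string.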


        4) Fast adequate hash is what you want, not great hashing for too
much money.


        5) If you are going to spend more than an hour thinking about the
answer, test your results on _your_ problem. You could be surprised.


The random number table sounds like fun, but I wanted to avoid the
cost of an extra memory reference and its consequent cost in cache
space. I used the hash to divide my problem size, not to solve the
whole name-lookup.


eodell@pobox.com (Eric O'Dell) wrote:
> >The algorithm Holub proposes in _Compiler Design in C_ is pretty
> >workable, and I can't offhand see any reason why you couldn't make a
> >C++ class out of it. It uses a hash table with individual table
> >entries linked via linked lists to keep track of scope.


Scott Amspoker wrote:
> I was going to suggest that one too. It's simple to implement and
> seems to work pretty well even for moderately large symbol tables
> (adjust the hash table size as necessary). I first used it in a
> compiler thinking that I could always go back and replace it with
> something better if I wasn't happy with it. I never felt the need to
> replace it.
--

