From: David Chase <chase@naturalbridge.com>
Newsgroups: comp.compilers
Date: 10 Jul 1998 20:58:56 -0400
Organization: NaturalBridge LLC
References: 98-07-030 98-07-045
Keywords: symbols
In off-line conversation with Kirk Hays, I "discovered" (surely I
cannot be the first) a minor frob that might be of some use to the
balanced-tree crowd.
There are two preliminaries. First of all, I am assuming the use of a
"grade A hash function" (this term comes from informal conversation
and griping about the "grade B" hash functions defined into Java).
Carter and Wegman apparently wrote on this some years back, but I lack
the reference. Apparently Hugo Krawczyk has done interesting work
along these lines. I know that people at DEC-SRC were playing with
these for purposes of "signing" types and for rapid text searches.
The hash function I described (order-dependent combining of full-word
random numbers from a table) is "grade A" or very close to it, I
think. Actually, I think you also need to chew up the old hash a bit
before combining it with the new. One step of an LFSR would probably
be helpful. But, assume you have a good hash function.
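To make that concrete, here is a rough sketch of the sort of hash I
have in mind; the table name and the particular LFSR taps are
placeholders, not a recommendation:

#include <stddef.h>
#include <stdint.h>

/* Filled once, at startup, with full-word random numbers. */
static uint32_t random_words[256];

/* Sketch: order-dependent combining of table-driven random words,
   with one LFSR step to chew up the old hash before each combine. */
uint32_t hash_bytes(const unsigned char *s, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++) {
        /* One Galois-LFSR step (taps here are just for illustration). */
        h = (h >> 1) ^ (-(h & 1u) & 0xEDB88320u);
        /* Combine with the random word selected by the next byte. */
        h ^= random_words[s[i]];
    }
    return h;
}
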
Second of all, years ago, I wrote a "generic" bucket hash table of
sorts that I've found useful over the years (*). One big help, in a
generic table that might be handling complicated data structures as
elements, is to store the full hash value in each bucket. Given a
grade A hash function, this gives you something that is (with very
high probability) Murphy-proof. First you hash the data to select a
bucket head. Then you zip down that bucket's chain of entries, quickly checking
hash values for (in)equality. When you get a hit, you call the
expensive compare function. In practice, when you get a hit, odds are
high (2**32 to 1, approximately) that compare will return equal. What
this means is that the bucket chains must get really long before
chaining is costly, and with a grade A hash function, the odds of that
happening are very low indeed. I lack the skill-time product to
generally compute odds, but if you've got 16 bucket heads, the odds
against adding 16 entries and landing them all in the same bucket are
(I think) 2**60 to 1. And, in practice, if you are just checking
integers for equality, running down a list of 16 elements isn't that
big a deal anyway.
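The lookup loop looks roughly like the following; the names are
invented here, and this is only a sketch, not the code behind the
footnote URL:

#include <stddef.h>
#include <stdint.h>

struct entry {
    struct entry *next;
    uint32_t      hash;    /* full hash value, stored with the entry */
    void         *key;
    void         *value;
};

struct table {
    struct entry **heads;
    size_t         nheads;                  /* a power of two */
    int          (*equal)(void *, void *);  /* the expensive compare */
};

void *lookup(struct table *t, void *key, uint32_t h)
{
    struct entry *e = t->heads[h & (t->nheads - 1)];
    for (; e != NULL; e = e->next) {
        /* Cheap word comparison first; the expensive compare runs
           only when the full stored hashes agree. */
        if (e->hash == h && t->equal(e->key, key))
            return e->value;
    }
    return NULL;
}
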
Contrast this with balanced trees. If you add, say, 1000 elements to
a balanced binary tree, you'll perform about 10 comparisons per probe.
Those comparisons aren't cheap, especially with machine-generated
identifiers (naturalbridge_aaaaa_01, naturalbridge_aaaaa_02, etc).
Unless, of course, you hash the value before insertion, and use the
hash value as the primary key. Mr. Stingy Storage notes that you
could use 29 bits of hash, leaving three bits for balance factor or
color, or whatever you need. With high probability, all your
comparisons go fast (with high probability, you avoid an indirect
reference to the keys if they are complicated enough to be represented
as a pointer).
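Sketched with invented field names, Mr. Stingy Storage's node might
look like this:

#include <string.h>

struct node {
    struct node *left, *right;
    unsigned     hash29  : 29;  /* primary key: 29 bits of the hash */
    unsigned     balance :  3;  /* balance factor, color, whatever  */
    const char  *key;           /* the real (possibly long) identifier */
};

/* Search comparison: strcmp is reached only when the 29-bit hash
   fields collide, which a grade A hash makes very unlikely. */
int node_cmp(unsigned h29, const char *key, const struct node *n)
{
    if (h29 < n->hash29) return -1;
    if (h29 > n->hash29) return  1;
    return strcmp(key, n->key);
}
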
Of course, given a grade A hash function, you might not need to
balance that tree anyway (my skill-time product is definitely
deficient for this analysis, though I think it was an exercise in
either Knuth or the AHU algorithms book). If you trust skip-lists,
you should probably trust this. Note that the hash functions could
also be used in skip lists; simply use some trivial function of the
hash value (xor all the bits down into a smaller number of bits, then
table lookup) to determine the skip level.
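For instance (the 4-bit fold and the table contents are just one way
to do it):

#include <stdint.h>

/* Maps four folded hash bits to a skip level, so level k turns up
   about half as often as level k-1. */
static const unsigned char level_table[16] = {
    0,0,0,0,0,0,0,0,   /* half the entries  -> level 0 */
    1,1,1,1,           /* a quarter         -> level 1 */
    2,2,               /* an eighth         -> level 2 */
    3,4                /* and so on         */
};

unsigned skip_level(uint32_t h)
{
    /* Xor all the bits down into four... */
    h ^= h >> 16;
    h ^= h >> 8;
    h ^= h >> 4;
    /* ...then do the table lookup. */
    return level_table[h & 0xF];
}
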
Given a good hash function, you can also make use of radix-addressing
schemes, and dispense with a few more comparisons.
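For example, one such radix scheme, sketched with 4-bit digits of the
hash as array indices (the widths and names are mine, purely for
illustration):

#include <stddef.h>
#include <stdint.h>

struct radix_node {
    struct radix_node *child[16];
    void              *leaf;     /* non-NULL at the bottom level */
};

/* Walk down eight 4-bit digits of the hash; no key comparisons are
   needed on the way. */
void *radix_lookup(struct radix_node *root, uint32_t h)
{
    struct radix_node *n = root;
    for (int shift = 28; n != NULL && shift >= 0; shift -= 4)
        n = n->child[(h >> shift) & 0xF];
    return n != NULL ? n->leaf : NULL;
}
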
However, do note that the prerequisite for all of this is a really
good hash function. Not necessarily cryptographically secure, but
otherwise very good.
For scoped symbol tables, there's a stunt, I think due to Paul Dietz,
that is useful in some applications. In many programming languages,
the variable declaration scopes of a program can be arranged into a
tree. For example:
a_function(int a, int b) {   // 1
    int a;                   // 2
    int c;                   // 3
    {
        int a;               // 4
        int b;               // 5
    }
    {
        int b;               // 6
        int c;               // 7
    }
}
generates the (fixed-width-font) tree:
root        leaves
1--2--3--4--5
      |
      6--7
Rather than modify the symbol table to reflect the changing scopes as
compilation moves through the program, you simply note which scope you
are in; a reference to a symbol X takes the form "get scopes from
symbol table for X, find current scope in set of scopes". Numbering
the scopes with in and out numbers in a standard tree walk generates
bracketing ranges that allow you to find the nearest enclosing scope
in log N time (binary search). Rodney Farrow of Declarative Systems
figured out a nice way to make use of this in attribute grammars
(where traditional symbol tables are a bit of a headache). I don't
know if this was ever published. Chase (me), Wegman, and Zadeck also
used this stunt in a pointer-analysis algorithm to attempt to
introduce SSA-like sparseness to pointer analysis (but the paper
didn't include a description of the algorithm).
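For anyone who wants the bracketing spelled out, here is a rough
sketch; the field names and the trailing scan are my own for this
post, not the tightest possible log N formulation:

#include <stddef.h>

/* In and out numbers from a standard tree walk.  Scope A encloses
   scope B exactly when A.in <= B.in and B.out <= A.out. */
struct scope   { unsigned in, out; };

/* One binding per scope that declares the symbol, kept sorted by
   ascending 'in'.  Enclosing scopes nest, so the binding with the
   largest 'in' that still brackets the current scope is the nearest
   enclosing declaration. */
struct binding { struct scope sc; void *decl; };

void *resolve(struct binding *b, size_t n, struct scope cur)
{
    size_t lo = 0, hi = n;
    /* Binary search for the first binding with in > cur.in... */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (b[mid].sc.in <= cur.in) lo = mid + 1; else hi = mid;
    }
    /* ...then scan back for the first one that actually encloses us
       (usually the very first candidate examined). */
    for (size_t i = lo; i-- > 0; )
        if (b[i].sc.out >= cur.out)
            return b[i].decl;
    return NULL;   /* no enclosing declaration */
}
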
(* http://world.std.com/~chase/src/hash/ -- note that it
contains a "grade B" hash function at best, but I wrote
it over ten years ago when I was younger and somewhat
more ignorant. It may also contain a bug; one bug was
discovered over time, but I cannot recall if this copy
of the code is pre- or post-bug.)
David Chase
NaturalBridge LLC