From: | preston@tera.com (Preston Briggs) |
Newsgroups: | comp.compilers |
Date: | 8 Jul 1998 01:37:30 -0400 |
Organization: | Compilers Central |
References: | 98-07-030 98-07-042 98-07-051 |
Keywords: | symbols |
>>>[My impression is that so long as you avoid obviously horrible hashes
>>>that do things like making all single letter symbols collide, it
>>>doesn't make much difference. -John]
>>That would be true, except that the set of "obviously horrible" hashes
>>is quite large. Real data sets exhibits several kinds of similarity
>>between strings to be hashed, and designing an algorithm that
>>successfully deals with *all* these similarities is a non-trivial
>>exercise.
>I'll put in a plug for using AVL trees (or other balanced trees) at
>this point - they don't show degenerate performance on any case, and
>performance is good on real-world (and machine generated) codes, in my
>experience.
>In particular, having had to write compilers that had to deal with a
>lot of machine-generated code, I will forevermore avoid hashing as a
>technique where the input data is unknown.
Another approach would be to try multiset discrimination
title="``Look Ma, No Hashing, And No Arrays Neither''",
author="Jiazhen Cai and Robert Paige",
booktitle=popl18,
address="Orlando, Florida",
year=1991,
month=jan,
pages="143--154"
I find the paper somewhat hard to read, but the basic idea is to
recognize that while hash tables provide O(n) probabilistic time for n
accesses in an on-line fashion, but that the on-line behaviour is
unnecessary for many applications (e.g., symbol tables in multipass
compilers). So they substitute a simple alternative, multiset
discrimination, to achieve O(n) time for n access in an off-line
fashion.
Note that the word "probabilistic" has disappeared! This means you'll
get your nice time bound regardless of the distribution of the keys.
Of course, the off-line part has some engineering implications and
I've never heard of anyone actually using this approach in a compiler
(but I'd be interested in hearing about it). You'll need to collect
all the identifiers (and perhaps the keywords), then apply multiset
discrimination to distingish between them.
Preston Briggs
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.