Re: perfect hashing

fjh@cs.mu.OZ.AU (Fergus Henderson)
10 Aug 2000 00:12:46 -0400

          From comp.compilers

Related articles
Hashtable alternatives gwynfa@paradise.net.nz (Gwynfa) (2000-07-27)
Re: Hashtable alternatives rkrayhawk@aol.com (2000-08-04)
perfect hashing preston@tera.com (Preston Briggs) (2000-08-04)
Re: perfect hashing ceco@jupiter.com (Tzvetan Mikov) (2000-08-05)
Re: perfect hashing jsgray@acm.org (Jan Gray) (2000-08-09)
Re: perfect hashing jmochel@foliage.com (2000-08-10)
Re: perfect hashing fjh@cs.mu.OZ.AU (2000-08-10)
Re: perfect hashing parz@home.com (Parzival) (2000-08-10)
Re: perfect hashing rick@home.com (2000-08-13)
Re: perfect hashing lars@bearnip.com (Lars Duening) (2000-08-13)
Re: perfect hashing pmk@cray.com (2000-08-13)
Re: perfect hashing tej@melbpc.org.au (Tim Josling) (2000-08-13)
Re: perfect hashing bob_jenkins@burtleburtle.net (2000-08-13)
[1 later articles]
| List of all articles for this month |

From: fjh@cs.mu.OZ.AU (Fergus Henderson)
Newsgroups: comp.compilers
Date: 10 Aug 2000 00:12:46 -0400
Organization: Computer Science, University of Melbourne
References: 00-07-064 00-08-022 00-08-026 00-08-031
Keywords: symbols

"Tzvetan Mikov" <ceco@jupiter.com> writes:


>Preston Briggs <preston@tera.com> wrote in message
>> The moderator wrote:
>> [Perfect hashing is sometimes useful for a fixed table of keywords, but
>> you can't use it for a normal symbol table. -John]
>>
>> Indeed, I wouldn't use it for keywords.
>> Consider the typical scanner.
>>
>> a) Recognize that we have some sort of keyword or identifier
>> b) See if it's a keyword
>> c) If not, look it up in the symbol table
>>
>> If we use perfect hashing for step b, then we'll _always_ find
>> some entry and have to finish with a string comparison.
>
>We could avoid the string comparison in many cases. We calculate an
>additional hash value and store it in the perfect hash table as well
>as the string itself. Then we need to compare the strings only if the
>hash value matches.


Another approach for detecting keywords which can be even more
efficient is to just write a big multi-level switch statement
where you test one character at a time. For example, if you have
three keywords "the", "this", and "foo", you can use the following
code:


typedef enum {
KEYWORD_THE,
KEYWORD_THIS,
KEYWORD_FOO,
IDENTIFIER
/* ... */
} TokenType;


...
switch(s[0]) {
case 't':
if (s[1] == 'h') {
switch(s[2]) {
case 'e':
if (s[3] == '\0') {
return KEYWORD_THE;
}
break;
case 'i':
if (s[4] == 's' && s[5] == '\0') {
return KEYWORD_THIS;
}
break;
}
}
break;
case 'f':
if (s[1] == 'o' && s[2] == 'o' && s[3] == '\0') {
return KEYWORD_FOO;
}
break;
}
return IDENTIFIER;


For a lengthier example of this technique, see the source code for
GNAT (the GNU Ada compiler).


The main drawback of this approach is that adding a new keyword,
while still very easy, is not quite as trivial as for table-based
approaches. But if your keyword set doesn't change often, and
performance is important, this approach is well worth considering.


Of course, if there was a tool to generate code in that style, given a
list of keywords, then that would solve the maintainability issue.
I imagine it would not be difficult to write such a tool.
--
Fergus Henderson <fjh@cs.mu.oz.au>
WWW: <http://www.cs.mu.oz.au/~fjh>
PGP: finger fjh@128.250.37.3
[Uh, that's basically what lex does, a big DFA that it steps through to
recognize tokens. If you're in a tearing hurry, I think somone once
posted a hack to open-code the flex state transition lookups. -John]





Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.