Re: perfect hashing

rick@home.com (Rick Ohnemus)
13 Aug 2000 18:52:29 -0400

From comp.compilers

Related articles
Hashtable alternatives gwynfa@paradise.net.nz (Gwynfa) (2000-07-27)
perfect hashing preston@tera.com (Preston Briggs) (2000-08-04)
Re: perfect hashing ceco@jupiter.com (Tzvetan Mikov) (2000-08-05)
Re: perfect hashing jsgray@acm.org (Jan Gray) (2000-08-09)
Re: perfect hashing jmochel@foliage.com (2000-08-10)
Re: perfect hashing fjh@cs.mu.OZ.AU (2000-08-10)
Re: perfect hashing parz@home.com (Parzival) (2000-08-10)
*Re: perfect hashing rick@home.com* (2000-08-13)**
Re: perfect hashing lars@bearnip.com (Lars Duening) (2000-08-13)
Re: perfect hashing pmk@cray.com (2000-08-13)
Re: perfect hashing tej@melbpc.org.au (Tim Josling) (2000-08-13)
Re: perfect hashing bob_jenkins@burtleburtle.net (2000-08-13)
Re: perfect hashing intmktg@Gloria.CAM.ORG (Marc Tardif) (2000-08-13)

| List of all articles for this month |

From:	rick@home.com (Rick Ohnemus)
Newsgroups:	comp.compilers
Date:	13 Aug 2000 18:52:29 -0400
Organization:	none
References:	00-07-064 00-08-026 00-08-031 00-08-059
Keywords:	symbols

>"Tzvetan Mikov" <ceco@jupiter.com> writes:
>Another approach for detecting keywords which can be even more
>efficient is to just write a big multi-level switch statement
>where you test one character at a time. For example, if you have
>three keywords "the", "this", and "foo", you can use the following
>code:
>
> << example source deleted and other comments deleted >>
>
>Of course, if there was a tool to generate code in that style, given a
>list of keywords, then that would solve the maintainability issue.
>I imagine it would not be difficult to write such a tool.

I have enclosed a piece of a perl script that I use to generate a
series of C/C++ switch statements to recognize a fixed set of
keywords. The real script reads a template file and inserts the code
in the appropriate place. That is very specific to my application so
I just cut out the relevant parts.

The included script just generates the switch statements. You need to add any
preprocessing (converting to lower/upper case, etc.) and an error return at
the end.

For example:

enum { NOT_KW, KW1, KW2, KW3 } kw_code;
#define MAX_KW_LEN 8

kw_code
match_keyword(const char *str)
{
        size_t i;
        const char * kw;
        char us[MAX_KW_LEN + 1]; /* uppercased copy of str */

        /* make sure str is uppercase */

        for (i = 0; *str != '\0' && i < MAX_KW_LEN; ++i, ++str) {
                us[i] = islower(*str) ? toupper(*str) : *str;
        }
        if (i == 0 || *str != '\0') {
                return NOT_KW;
        }
        us[i] = '\0';

        kw = us; /* point to the uppercased string */

        << generated switch code goes here >>

        return NOT_KW;
}

Here is the script:
#!/usr/local/bin/perl
#
# Generate nested C switch statements to do quick keyword recognition.
# The generated code uses kw as the variable pointing to the keyword to
# match. kw should be declared as 'const char *kw' (or the equivalent).
#
# Input is read from stdin and must be in the following format:
# keyword return_value
# Leading/trailing white space is stripped. White space between the
# keyword and return value is also removed.
#
# Output is written to stdout.
#
# Switches:
# -i n specifies the number of spaces per indent level. The
# default is 4.
# -l convert all input keywords to lowercase
# -u convert all input keywords to uppercase

use strict;
use Getopt::Std;

my $prog = $0;

sub usage() {
        print STDERR <<'USAGE';
usage: $prog [-i indent_spaces] [-l | -u]
        -i specifies the number of spaces per indentation level (default is 4)
        -l convert all keywords to lowercase
        -u convert all keywords to uppercase
USAGE
        exit 1;
}

my $switches = 'i:lu';
my %opts;

if (!getopts($switches, \%opts)) {
        usage();
}

my $nspi; # number spaces per indent

if (exists $opts{'i'}) {
        usage() if ($opts{'i'} <= 0);
        $nspi = $opts{'i'};
}
else {
        $nspi = 4;
}

my $lower = 0;
my $upper = 0;
if (exists $opts{'l'}) {
        if (exists $opts{'u'}) {
                print STDERR "error: -l and -u can not be used together\n";
                exit 1;
        }

        $lower = 1;
}
elsif (exists $opts{'u'}) {
        $upper = 1;
}

sub write_switch($\@);

my $indent = 1;

my @keywords;

my $error = 0;

my $line;
while (defined($line = <>)) {
        my ($keyword, $rv) = $line =~ /^\s*([^\s]+)\s+([^\s]+)\s*$/;
        next if (length($keyword) == 0);
        if (length($rv) == 0) {
                print STDERR "error: '$keyword' does not have a return value\n";
                $error = 1;
        }

        if ($lower) {
                $keyword = lc $keyword;
        }
        elsif ($upper) {
                $keyword = uc $keyword;
        }

        push @keywords, [ $keyword, $rv, $keyword ];
}

exit 1 if $error;

if (@keywords == 0) {
        print STDERR "no keywords\n";
        exit 1;
}

@keywords = sort { $a->[0] cmp $b->[0] } @keywords;

write_switch(0, @keywords);
exit 0;

sub write_switch($\@) {
        my $sub_switch = $_[0];
        my $spaces = ' ' x (($sub_switch + 1) * $nspi);

        my @keywords = @{$_[1]};

        my @cache = ();
        my $prev = '';

        if (@keywords == 1) {
                if ($keywords[0][0] eq '') {
                        print "${spaces}if (*kw == '\\0') {\n";
                }
                else {
                        print "${spaces}if (strcmp(kw, \"$keywords[0][0]\") == 0) {\n";
                }

                print "${spaces} return $keywords[0][1]; /* $keywords[0][2] */\n";
                print "${spaces}}\n";
                print "${spaces}break;\n";
        }
        else {
                print "${spaces}switch (*kw++) {\n";

                foreach my $w (@keywords) {
                        if ($w->[0] eq '') {
                                print "${spaces}case '\\0':\n";
                                print "${spaces} return $w->[1]; /* $w->[2] */\n";
                                next;
                        }

                        my $fc = substr($w->[0], 0, 1);
                        substr($w->[0], 0, 1) = '';

                        if ($fc ne $prev) {
                                if ($prev ne '') {
                                        print "${spaces}case '$prev':\n";
                                        write_switch($sub_switch + 1, @cache);
                                }
                                @cache = ();
                                $prev = $fc;
                        }

                        push @cache, [ $w->[0], $w->[1], $w->[2] ];
                }

                if (@cache) {
                        print "${spaces}case '$prev':\n";
                        write_switch($sub_switch + 1, @cache);
                }

                print "${spaces}}\n";
                print "${spaces}break;\n" if $sub_switch;
        }

        return;
}

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: perfect hashing

rick@home.com (Rick Ohnemus)13 Aug 2000 18:52:29 -0400

rick@home.com (Rick Ohnemus)
13 Aug 2000 18:52:29 -0400