Re: Looking for high-quality hashing function for dynamic linker

chase@centerline.com (David Chase)
Thu, 3 Aug 1995 15:10:09 GMT

          From comp.compilers


Newsgroups: comp.compilers,comp.theory,comp.lang.misc,comp.lang.dylan
From: chase@centerline.com (David Chase)
Keywords: theory, linker
Organization: CenterLine Software
References: 95-08-021
Date: Thu, 3 Aug 1995 15:10:09 GMT

eb@kaleida.com (Eric Benson) writes:


> Is it feasible to compute a signature consisting of a few (say, less than
> ten) 32 bit words and use this for comparison instead of the entire set of
> names? A false positive comparison would probably result in a severe
> malfunction so it should be extremely unlikely, say one-in-a-billion or
> less. I can spend a big chunk of CPU time in the compilation environment
> to compute this signature if necessary. I have a 29-bit CRC hash function
> already available. Is this, or some multi-word variant of it, good enough
> for my purposes? If so, does anyone have a larger-than-29-bit CRC
> implementation? Does anyone have any references to relevant research?


I'm relatively fond of computing residues modulo an irreducible
polynomial over GF(2) -- that is, working in GF(2**N) -- for decently
large N. After you strip away all the mathematics, this reduces to
bitwise shifts and XORs with a constant, or bytewise shifts and XORs
from a lookup table. That is, it's fast.
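
For what it's worth, here is a minimal bit-at-a-time sketch of my own
(not the DEC-SRC code) of that shift-and-XOR loop. POLY is a stand-in
for the low 32 coefficients of whichever degree-32 polynomial you pick
(the one I give further down will do); it should compute the same thing
as the table-driven code below, assuming I have the bit order right.


    #define POLY 0x18000003  /* sketch: the x^32 term is implicit */

    /* Sketch: fold a byte string into a 32-bit residue modulo the
       degree-32 polynomial, one message bit at a time. */
    unsigned int residue32(const unsigned char *s, int len)
    {
        unsigned int r = 0;
        int i, bit;
        for (i = 0; i < len; i++) {
            for (bit = 7; bit >= 0; bit--) {
                unsigned int carry = r & 0x80000000u; /* coeff of x^31 */
                r = (r << 1) | ((s[i] >> bit) & 1u);  /* shift in next bit */
                if (carry)       /* that coefficient just became x^32, */
                    r ^= POLY;   /* so reduce it back into 32 bits */
            }
        }
        return r;
    }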


I "stole" this trick from Steve Glassman, who "stole" it from Lucille
Glassman, who worked at DEC-SRC, where people were playing with
applications of this sort of fingerprinting. The Modula-3 implementations
that I know of use exactly the trick that you describe (and the hashing
function that I'm talking about) to "fingerprint" types. I believe they
use a 63- or 64-bit hash, which makes the probability of a collision
really small, even if you are simultaneously comparing the signatures
of thousands of modules.
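
(A back-of-the-envelope check, assuming the fingerprints behave like
independent uniform 64-bit values: ten thousand symbols make about
5*10^7 pairs, so a birthday-style estimate puts the chance of any
collision near 5*10^7 / 2^64, or roughly 3*10^-12.)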


You can read about the Modula-3 interface for doing this in DEC-SRC Research
Report #113, "Some Useful Modula-3 Interfaces". That, in turn,
references "Fingerprinting by random polynomials" by M.O. Rabin, Harvard
U. Dept. Comp. Sci. TR-15-81, and also "Some applications of Rabin's
fingerprinting method", by Andrei Broder, in _Sequences II: Methods in
Communications, Security, and Computer Science_, eds. Capocelli, De
Santis, and Vaccaro, Springer-Verlag, 1993. A nice casual reference
is _Number Theory in Science and Communication_ by M. R. Schroeder,
Springer-Verlag, 1984.


If you are impatient, then try the following:


"Some Useful M-3 Interfaces" can be had through anonymous ftp to
gatekeeper.dec.com, pub/DEC/SRC/research-reports/SRC-113.ps.Z


If you are interested in the source to the Modula-3 fingerprinting
code, then fire up some net browser and head to:


http://www.research.digital.com/SRC/m3sources/html/INDEX.html,
and follow your nose. Fingerprint can be found under Interfaces-F.
Note also that there are interfaces for both FingerPrint (old Olivetti
code) and Fingerprint (newer DEC code).


Furthermore, if you are too lazy (like me) to go and compute such an
irreducible polynomial yourself, you can play with these, taken from
"Primitive Binary Polynomials" by Wayne Stahnke, from the journal
_Mathematics of Computation_, 27:124, October 1973, published by
the American Mathematical Society. For a 32-bit hash (a polynomial of
degree 32) I use 32, 28, 27, 1, 0. You can also use these as
a random bit generator -- the code to do that (using the polynomial
above) looks like this:


    #define P 0x18000003 /* Bit 32 is "hidden". */

    int random_sign(int x) {
        if (x < 0)              /* If bit 31, soon to be 32, is set, */
            x = (x << 1) ^ P;   /* flip the hidden bit 32 back to zero. */
        else
            x = x << 1;
        return x;
    }
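
As a usage sketch of my own (not from the Modula-3 sources): seed it
with anything nonzero, keep feeding the result back in, and read off
the sign bit each time.


    #include <stdio.h>

    /* Sketch: emit n pseudo-random bits by iterating random_sign and
       taking the sign each time.  Zero is a fixed point, so the seed
       must be nonzero. */
    void emit_bits(int seed, int n)
    {
        int i, x = seed ? seed : 1;
        for (i = 0; i < n; i++) {
            x = random_sign(x);
            putchar(x < 0 ? '1' : '0');
        }
        putchar('\n');
    }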


The sign of the result varies "randomly" as you keep feeding the output
back in. You can initialize a 256-entry table (one int per byte value)
like this:


    #define step random_sign /* Too lazy to do the edit */

    /* t[i] holds the reduction for byte value i shifted out the top:
       the result of eight LFSR steps starting from i << 24. */
    void init_table(int t[256]) {
        int i;
        for (i = 0; i < 256; i++) {
            t[i] = step(step(step(step(step(step(step(step(i << 24))))))));
        }
    }


From this you can derive the byte-at-a-time hasher:


    int stepbyte(int x, char b, int t[256]) {
        /* Shift in one more byte; the byte that falls off the top is
           folded back in via the precomputed table. */
        int xor = t[(unsigned int) x >> 24];
        return ((x << 8) | (unsigned char) b) ^ xor;
    }


This stuff doesn't look "too random" for very short strings, since it
takes five characters of input (for a 32-bit hash) before any reduction
by the polynomial occurs. If you want to see more randomness, you can
append four zero bytes to each string (for hashing purposes only).
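
Putting the pieces together, here is a sketch of my own (assuming the
init_table and stepbyte above, and using the zero-suffix trick):


    /* Sketch: hash a NUL-terminated string byte-at-a-time, then push
       four zero bytes through so even very short strings get reduced. */
    int hash_string(const char *s, int t[256])
    {
        int x = 0;
        int i;
        while (*s)
            x = stepbyte(x, *s++, t);
        for (i = 0; i < 4; i++)
            x = stepbyte(x, 0, t);
        return x;
    }


Call init_table once and pass the same table in every time.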


I've checked the randomness every way that I know (correlation of x(n)
with x(n+1,2,3,4); relative frequencies of same-sign runs), including
listening to the bit stream coming out, and I believe that it is in
fact as pseudo-random as the theory advertises (and that I coded it
correctly). I also checked this with some other polynomials to see if
they were less "random-seeming" (got to test the test, after all), and
they usually were.

One thing to watch out for (in Stahnke's list) is that his polynomials
are "minimal", meaning that they have as few 1's in their bit patterns
as possible. This means that some of these sequences start out looking,
and sounding, rather pattern-ful and decidedly non-random. His entry
for degree 64 is one example of this. It would be better to go back to
the literature and figure out how to find one of your own, with more
ones in it. Or, you could just settle for degree 62 -- it has terms
62, 57, 56, 1, and 0.

You can also exploit this predictability for random bit streams -- for
instance, if it benefits you for easy cases to have a certain known
output, then start the seed in a way that yields that output. An
example: if you use the lengths of same-sign runs as your RNG for skip
lists, then for short lists (the common case) you can save space by
always starting with (say) 8 alternating signs. Short lists are short
and compact; longer lists are less compact, but have good access costs.
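
In case that last remark is too terse, here is how I would code the
skip-list level choice from same-sign run lengths (a sketch of my own;
the seed is arbitrary, though per the trick above you could pick one
whose first several outputs alternate in sign):


    static int sk_state = 0xBADCAFE;    /* sketch: any nonzero seed */

    /* Sketch: pick a node level from the length of the current run of
       same-sign outputs of random_sign, capped at max_level. */
    int skip_level(int max_level)
    {
        int level = 1;
        int prev = sk_state = random_sign(sk_state);
        while (level < max_level) {
            sk_state = random_sign(sk_state);
            if ((prev ^ sk_state) < 0)  /* sign changed: run is over */
                break;
            prev = sk_state;
            level++;
        }
        return level;
    }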


Another thing to watch out for is that you should not re-fingerprint
fingerprints. This is mentioned explicitly in the Modula-3 interfaces and
source code. They provide an explicit "combine" operation to do this.
I've looked at the source code for combine and I have no idea what on earth
it is doing. I coded up a slightly slower (and different) combine, for a
fingerprint consisting of residue+input length. For two strings S1 and S2,
with residues R(S1) and R(S2) and lengths L(S1) and L(S2), it calculates
R(S1 concat S2) by brute force (equal to R(S1 << L(S2)) ^ R(S2)).


The code to do this is left as an exercise to the reader, since this
posting is long enough already.


speaking for myself,


David Chase



