Thu, 3 Aug 1995 15:10:09 GMT

Related articles |
---|

Looking for high-quality hashing function for dynamic linker eb@kaleida.com (1995-07-28) |

Re: Looking for high-quality hashing function for dynamic linker mw@ipx2.rz.uni-mannheim.de (1995-08-02) |

Re: Looking for high-quality hashing function for dynamic linker stefan.monnier@epfl.ch (Stefan Monnier) (1995-08-03) |

Re: Looking for high-quality hashing function for dynamic linker haahr@netcom.com (1995-08-03) |

Re: Looking for high-quality hashing function for dynamic linker chase@centerline.com (1995-08-03) |

Newsgroups: | comp.compilers,comp.theory,comp.lang.misc,comp.lang.dylan |

From: | chase@centerline.com (David Chase) |

Keywords: | theory, linker |

Organization: | CenterLine Software |

References: | 95-08-021 |

Date: | Thu, 3 Aug 1995 15:10:09 GMT |

eb@kaleida.com (Eric Benson) writes:

*> Is it feasible to compute a signature consisting of a few (say, less than*

*> ten) 32 bit words and use this for comparison instead of the entire set of*

*> names? A false positive comparison would probably result in a severe*

*> malfunction so it should be extremely unlikely, say one-in-a-billion or*

*> less. I can spend a big chunk of CPU time in the compilation environment*

*> to compute this signature if necessary. I have a 29-bit CRC hash function*

*> already available. Is this, or some multi-word variant of it, good enough*

*> for my purposes? If so, does anyone have a larger-than-29-bit CRC*

*> implementation? Does anyone have any references to relevant research?*

I'm relatively fond of computing the residues of irreducible polynomials

in GF(2**N), for decently large N. After you strip away all the

mathematics, this reduces to bitwise shifts and constant XORs, or bytewise

shifts and XORs from a lookup table. That is, it's fast.

I "stole" this trick from Steve Glassman, who "stole" it from Lucille

Glassman, who worked at DEC-SRC, where people were playing with

applications of this sort of fingerprinting. The Modula-3 implementations

that I know of use exactly the trick that you describe (and the hashing

function that I'm talking about) to "fingerprint" types. I believe they

use a 63 or 64-bit hash, which makes the probability of a collision really

small, even if you are simultaneuously comparing the signatures of

thousands of modules.

You can read about the Modula-3 interace to do this in DEC-SRC Research

Report #113, "Some Useful Modula-3 Interfaces". That, in turn,

references "Fingerprinting by random polynomials" by M.O. Rabin, Harvard

U. Dept. Comp. Sci. TR-15-81, and also "Some applications of Rabin's

fingerprint method", by Andrei Broder, in _Sequences II: Methods in

Communications, Security, and Computer Science_, eds. Capocelli, De

Santis, and Vaccaro, Springer-Verlag, 1993. A nice casual reference

is _Number Theory in Science and Communication_ by M. R. Schroeder,

Springer-Verlag, 1984.

If you are impatient, then try the following:

"Some Useful M-3 Interfaces" can be had through anonymous ftp to

gatekeeper.dec.com, pub/DEC/SRC/research-reports/SRC-113.ps.Z

If you are interested in the source to the Modula-3 fingerprinting

code, then fire up some net browser and head to:

http://www.research.digital.com/SRC/m3sources/html/INDEX.html,

and follow your nose. Fingerprint can be found under Interfaces-F.

Note also that there's interfaces for both FingerPrint (old Olivetti

code) and Fingerprint (newer DEC code).

Furthermore, if you are too lazy (like me) to go and compute such a

irreducible polynomial yourself, you can play with these, taken from

"Primitive Binary Polynomials" by Wayne Stahnke, from the journal

_Mathematics of Computation_, 27:124, October 1973, published by

the American Mathematical Society. For a 32-bit hash (polynomial

of degree 33) I use 32, 28, 27, 1, 0. You can also use these as

a random bit generator -- the code to do that (using the polynomial

above) looks like this:

#define P 0x18000003 /* Bit 32 is "hidden". */

int random_sign(int x) {

if (x < 0) /* If bit 31, soon to be 32, is set */

x = (x<<1) ^ P; /* flip the hidden bit 32 back to zero */

else

x = x<<1;

return x;

}

The sign of the result varies "randomly" if you run this on its result.

You can initialize a 256-byte table like this:

#define step random_sign /* Too lazy to do the edit */

void init_table(int t[256]) {

int i;

for (i = 0; i < 256; i++) {

t[i] = step(step(step(step(step(step(step(step(i << 24))))))));

}

}

*>From this you can derive the byte-at-a-time-hasher:*

int stepbyte (int x, char b, int t[256]) {

int xor = t[(unsigned int) x >> 24];

return ((x << 8) | (unsigned char) b) ^ xor;

*}*

This stuff doesn't look "too random" for very short strings, since it

takes 5 characters of input (for a 32-bit hash) before any modulus

operations occur. It's also possible to suffix each string with 4

zeros (for hashing purposes) if you want to see more randomness.

I've checked the randomness every way that I know (correlation of x(n) with

x(n+1,2,3,4); relative frequencies of same-sign runs), including listening

to the bit stream coming out, and I believe that it is in fact as

pseudo-random as the theory advertises (and that I coded it correctly). I

also checked this with some other polynomials to see if they were less

"random-seeming" (got to test the test, after all) and they usually were.

One thing to watch out for (in Stahnke's list) is that his polynomials are

"minimal", meaning that they have as fewer 1's in their bit patterns as

possible. This means that some of these sequences start out looking, and

sounding, rather pattern-ful and decidedly non-random. His entry for

degree 64 is one example of this. It would be better to go back to the

literature and figure out how to find one of your own, with more ones in

it. Or, you could just settle for degree 62 -- it has terms 62, 57, 56, 1,

and 0. You can also exploit this trick for random bit streams -- for

instance, if it benefits you for easy cases to have a certain known output,

then start the seed in a way that yields that output. An example of this

is if you use the lengths of same-sign runs as your RNG for skip-lists --

for short lists (common case) you can save space by always starting with

(say) 8 alternating signs. Short lists are short and compact, longer lists

are less compact, but have good access costs.

Another thing to watch out for is that you should not re-fingerprint

fingerprints. This is mentioned explicitly in the Modula-3 interfaces and

source code. They provide an explicit "combine" operation to do this.

I've looked at the source code for combine and I have no idea what on earth

it is doing. I coded up a slightly slower (and different) combine, for a

fingerprint consisting of residue+input length. For two strings S1 and S2,

with residues R(S1) and R(S2) and lengths L(S1) and L(S2), it calculates

R(S1 concat S2) by brute force (equal to R(S1 << L(S2)) ^ R(S2)).

The code to do this is left as an exercise to the reader, since this

posting is long enough already.

speaking for myself,

David Chase

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.