Hash Functions (continued)

Evaluation of CRC32 for Hash Tables

We evaluate CRC32 in the same manner we've evaluated other 32-bit hash functions because it has made its way into general purpose use as a hash function for hash table lookup. This isn't its intended use. It was designed to detect certain classes of bit transmission errors in a channel with noise, and was never designed to have uniform distribution characteristics.

Figure 1 lists one of many varieties of 32-bit Cyclic Redundancy Check functions. Such functions vary by the value of the polynomial used, by the initial value of the hash in the hash computation, by the post-processing of the hash value, and by the "direction" in which the key bits are fed into the hash value. I've selected some of these parameters arbitrarily, but used a commonly used polynomial value.

Uniform Distribution Test

The details of this test are described fully on a previous page. We examine the distribution of numbers derived from lower and upper bits of the hash output in sizes of 1 through 16 bits.

Figure 2 shows the results of this test for the CRC32 function. This test indicates that the upper bits of this hash produce uniformly distributed values for hash tables that are a power of two, up to at least 216, when the key octets are uniformly distributed, distributed similar to alphabetic text, or sparsely distributed. It produces very non-uniform values in the lower bits for hash table sizes above 23.

The poor results are for the end of the hash value that the key octets are "fed into" so if you choose to use CRC32 for hash tables, be sure to use the bits from the opposite end.

Avalanche Test

The details of this test are described fully on a previous page. We examine the diffusion of input bits into different bit positions in the output.

Figure 3 shows the results of this test for the CRC32 function. This test indicates that this hash has absolutely no avalanche behavior. Remember that red means there was no diffusion at all. Input bits directly determine the state of certain output bits and have no effect on the state of the other output bits. This simply reflects the fact that CRC32 wasn't designed for uniform distribution of output bits. Its hash value is manipulated in a very specific way designed to detect channel transmission errors.

While the upper bits of the hash are uniformly distributed for the classes of keys we examined, this is solely because of the function's confusion characteristics, and has nothing to do with its nonexistant diffusion characteristics. Confusion is a hashing technique that obscures the statistical relationship between the input bits and the output bits. Hashes that use substitution boxes (our tab table in the code) typically have good confusion characteristics.

Conclusions

CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose, and I recommend that you avoid doing so.

If you decide to use CRC32, it's critical that you use the hash bits from the end opposite to that in which the key octets are fed in. Which end this is depends on the specific CRC32 implementation. Do not treat CRC32 as a "black box" hash function, and do not use it as a general purpose hash. Be sure to test each application of it for suitability.

First page, Next page